Mohamed Lazzouni 0:23
Thank you so much, Philip. It's an absolute thrill to be here again, and thank you for the opportunity to share what is today a really critical topic. There's a lot of buzz about it and, quite frankly, also a lot of confusion about it, so to be given the opportunity to weigh in on this is absolutely delightful. Without any further ado, I wanted to begin with a little context, with your permission, which is to just share a picture or two. That will give us a chance to talk about this, and then we can go from there with a little bit more detail. In the world of what's going on now with how to deal with this problem of deep fakes, obviously people can manipulate voices, manipulate faces, manipulate full videos, and do a number of other things. The one solution that fits all is still elusive, but one can break the problem into pieces, so to speak, and bring to bear capabilities that can really help thwart this problem of fraud and deep fakes. My focus today is going to be exclusively on the use of face liveness as a key tool to detect deep fakes and synthetic identities, in the case of payments and in other cases as well. I've seen a number of ideas emerge on how to describe this. The framing I'm gravitating towards, which I think is more intuitive and better at both highlighting the problem statement and positioning it, comes from the messages people already experience, like: what are we doing to prove that you are not a robot? Proof of humanness, proof of personhood. So this is essentially what this game is all about: to prove, with a high degree of confidence, that one is not a robot, or to prove humanness or personhood.
The particular approach that I'm going to be showcasing is this idea of breaking the problem down along the life cycle of how the face, this unique representation of a human being, is processed by the machine, all the way from its inception until the machine renders a verdict as to whether or not the face presented in front of it is that of a human being. In other words, either a human being is really there, or it is a manipulation via a number of mechanisms that we'll discuss. The approach that we have taken here is to leverage the pieces that are the most potent and put them to use such that we can tell the difference between a deep fake and a real human being. It all begins with detection. How do you detect the face? How do you do a pose analysis? How do you check the quality of the image? And when it is available, that is, when the capture is not on the web but on a device, how do you also leverage other types of data that come from the sensors in the device in order to establish this proof of humanness or proof of personhood on the other end? Then, in terms of the processing, we leverage the best of both worlds: whatever is good at something, we use that horsepower to solve the problem. Some things get done on the device itself and some things get sent to the server, and the two work in tandem to analyze a sequence of signals, in this case images, to make a determination as to what we have on the other side: a real human being or a deep fake. Under the hood, when those images get passed to a computer for analysis to make that determination, a number of deep learning models are activated to perform various types of analysis on the frames that arrive, either on their own or relying on other data that might arrive from the device that hosts the camera and captured that particular face.
All of these pieces are then fed into a mechanism of analysis and scoring, where we fuse the outputs and finally issue an opinion: is it live or is it a spoof? And we don't limit ourselves there. We can also attempt to tell you what spoof type it is, whether it is real or injected, and there too we can tell you what type of injection we are dealing with. So as the threat vectors abound (face swapping, image manipulation, deep fake videos, synthetic images, and things of that nature), this pipeline is put in place to make the determination and classify where these various forms fall. Without any further ado, I thought I'd share one or two cases here and then do a live demo on one. One of the attacks that is gaining in popularity is face swapping. Say that the individual in the middle of this picture is known to pass live sessions. The individual on the left, interested in piggybacking on this live face, would take elements from it, overlay something on them, and create a mix of the two faces in order to attack a system. Here is how the attack works and how we would stop it. The image swapped in this manner is injected through some mechanism, and that's how the session will go: the attacker is presented with a session to capture a face and injects the swapped image in the hope that it goes through. It doesn't, because the system we have in place recognizes that this face was manipulated, and as such it is a spoof and cannot be trusted as a true image. Let's now talk about other ways of manipulation in deep fakes, where somebody doesn't do a complete face swap.
What they might do is apply tweaks to the image: maybe translate it a little bit; or, if the image they brought in was at an angle, rotate it a little bit; or scale it, zooming in and out; or take some specific pixels in the image and manipulate their grayscale; or adjust the overall contrast up and down for more or less brightness. The reason these little tweaks can become a bigger problem is that simple detection mechanisms for finding them can be significantly hindered when things like compression are applied to the images. So you need a lot more, and the pipeline that I showed before is one that would solve this particular problem. Here is an example again. This image was manipulated via image translation. If we now run it through the system, and the system is properly calibrated, the image gets presented to a device in a manner similar to what we have seen before. As the image gets fed into the system via injection, the system is trusted to run the analysis on it and then render an opinion. Here the image is being presented, and it is being translated to mimic some form of live movement. But the system recognizes that this is a manipulation: the translation is not real human movement, in which case it will tell you that it is a spoof. So these are examples of what can be done. Now, to conclude with something that we can do live, I'm going to attempt this in real time, so to speak. What I've done with my own browser is make it default to a virtual camera. It's not the camera that is feeding the real stream coming from me, but as far as the session is concerned, it cannot tell the difference between one and the other. On the left side here, I used prompt engineering to begin with a static picture, from which I generated the deep fake.
For the first attempt, I'm going to take this image, which has been manipulated just a little bit, and see whether the system can tell whether I am live or not. Now I'm trying to detect the injection. When I tell the system to go and detect this, it does recognize that it is not looking at a live feed: I am not personally feeding into this live. Now that I know this is the case, I'm going to try to animate this picture, and via prompt engineering I'm causing it to start moving. None of this is real movement. It's all generated and animated. Now I'm going to put the system to the test again: I'm going to take this particular video, which is being looped, and send it via a virtual camera to attack the system. Now the system is looking at it and says there is human movement here, so it is going to keep tracking until it finds a time to do a focal capture. As soon as it captures that, it's going to perform its analysis and then tell me whether this is a fake or not. At this moment, it performs the capture and says, 'Oops, this is not a real human being. This is a fake video that has been manipulated, also known as a deep fake, in order to deceive the system.' So that's the collection of thoughts that I wanted to share today and put in front of you, so it can inform and educate and be useful for people to see how this works in the real world.
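
The scoring and fusion step described in the talk, where the outputs of several models are fused into a single live-or-spoof opinion together with a spoof type, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the speaker's actual system: the weighted-average fusion rule, the threshold, and every name here (`Verdict`, `fuse_scores`, the check names) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Verdict:
    label: str                 # "live" or "spoof"
    score: float               # fused liveness score in [0, 1]
    spoof_type: Optional[str]  # most suspicious check when label is "spoof"


def fuse_scores(
    model_scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
    threshold: float = 0.5,
) -> Verdict:
    """Fuse per-check liveness scores (1.0 = certainly live) into one verdict.

    Assumption: a plain weighted average stands in for whatever fusion
    the real pipeline uses. Checks without an explicit weight get 1.0.
    """
    if not model_scores:
        raise ValueError("at least one model score is required")
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in model_scores)
    fused = sum(
        score * weights.get(name, 1.0) for name, score in model_scores.items()
    ) / total_weight
    if fused >= threshold:
        return Verdict("live", fused, None)
    # Report the check with the lowest liveness score as the likely spoof type.
    most_suspicious = min(model_scores, key=lambda name: model_scores[name])
    return Verdict("spoof", fused, most_suspicious)


# Hypothetical session: the injection check is very suspicious and is
# weighted more heavily than the other checks.
example = fuse_scores(
    {"face_swap_check": 0.9, "injection_check": 0.1, "texture_check": 0.3},
    weights={"injection_check": 2.0},
)
print(example.label, example.spoof_type)
```

A weighted average is the simplest possible choice here; a production system would more plausibly learn the fusion from data and calibrate the threshold against measured false-accept and false-reject rates.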