Yann LeCun

Chief AI Scientist, Meta/Independent

IJCAI-ECAI 2018, Yann LeCun Learning World Models – The Next Step Towards AI

🎥 Jul 10, 2018 📺 IJCAI conf ⏱ 59m

Yann LeCun, Facebook AI Research lab and New York University, held a keynote, “Learning World Models: the Next Step Towards AI”, at the IJCAI-ECAI 2018, the 27th International Joint Conference on Artificial Intelligence and the 23rd European Conference on Artificial Intelligence, the premier international gathering of researchers in AI.


About Yann LeCun

Yann LeCun, the Turing Award winner and former chief AI scientist at Meta, has been publicly advocating for an alternative approach to artificial intelligence that moves beyond large language models (LLMs). In talks and interviews from 2025 and 2026, LeCun described LLMs as useful for tasks like code generation and information access but argued they are not a path to human-level intelligence, stating that they lack the ability to predict the consequences of their actions and cannot handle the "messy" real world. He has promoted his Joint Embedding Predictive Architecture (JEPA) and "world models" as a more promising direction, emphasizing that AI systems should learn abstract representations rather than generating pixel-level predictions. LeCun has also been critical of vision-language-action (VLA) models used in robotics, calling them "doomed" and asserting they do not work well without vast amounts of training data. LeCun left Meta in early 2026 and became executive chairman of a new company, Advanced Machine Intelligence (AMI) Labs, which focuses on "physical AI" for robotics and industrial control. He also serves as chief scientific advisor to the Tapestry project, an open-source AI initiative under the AI Alliance that aims to collaboratively train foundation models without pooling private data. LeCun has argued that a diverse ecosystem of AI assistants is necessary to protect cultural and linguistic diversity, and that current models produced by a handful of companies pose risks to information diversity. He has described his mission as "protecting democracy" by ensuring people have access to a wide variety of information sources.

Source: AI-verified profile updated from Yann LeCun's recent appearances.

Transcript (52 segments)

✨ AI-enhanced transcript with speaker attribution

Host 0:00

Yann is the director of AI research at Facebook. He's also a founding director of the NYU Center for Data Science and the Silver Professor of Computer Science, Neuroscience, and Electrical and Computer Engineering at the Courant Institute of Mathematical Sciences at NYU. He works primarily in the fields of machine learning, computer vision, mobile robotics, and computational neuroscience. Yann is a member of the US National Academy of Engineering, the recipient of the 2014 IEEE Computer Society Pioneer Award and of the 2015 PAMI Distinguished Researcher Award. Please join me in welcoming Yann LeCun.

Yann LeCun 0:38

Thank you. Thank you for coming, thank you to the organizers for inviting me. It's a real pleasure to be here. So I'm in a conference on AI, like all of you, and perhaps coming through the angle of machine learning. I've been interested in how AI can come about through learning, which is probably why I've spent more time in conferences like NIPS than in AAAI in the past. But I did give a tutorial at AAAI many, many years ago on neural nets. What I like about AAAI though is the very wide diversity of topics, so it's a real pleasure to be here.

Almost all the practical applications of machine learning today, what people now call AI (a bit of an abuse of language, but the new AI if you want), are due to supervised learning. We all know what supervised learning is: you want to train a machine to distinguish images of cars from airplanes; you show it thousands of examples of each of those categories, you tell the machine the correct answer, and it adjusts its internal parameters so that the answer the machine produces is very close to the answer you want. This has been incredibly successful in a host of very, very useful applications that are very widely deployed and used literally hundreds of trillions of times per day for things like speech recognition, image recognition, face recognition, generating captions for photos, classifying text into topics, translation, and things like that.

What deep learning has allowed us to do is to replace the old models that date back to the perceptron and statistical pattern recognition. The way you had to build your system was by taking the raw data, building a hand-crafted feature extractor that constructs an appropriate representation of the raw input so that it can be digested by a relatively simple learning algorithm like a linear classifier, nearest neighbor, a tree, or something like that. What deep learning has allowed us to do is replace this by a cascade of modules, all of which are trainable and are trained end-to-end. By using simple gradient descent — it's really nothing complicated — we can feed the system with raw inputs, and by appropriately designing the cascade of modules, the modules can not just learn to classify everything but also learn to produce appropriate internal representations and features to achieve the task. That's really what deep learning was always about.

What's quite surprising is that the ideas for this go back a long time, back to early attempts in the 1950s by people who became very prominent in the AI community, trying to build neural nets. What's a little surprising is that you can build fairly complex systems out of essentially two basic operations: one is just a linear operation. So imagine your system is built out of vectors; you compute weighted sums of the components of those vectors with various coefficients, you pass them through a nonlinear function, and each of those elements is sort of an elementary classifier, if you want. You stack many of those classifiers, and that constitutes a neural net. So there are two kinds of operations: linear operations (take a vector, multiply by a matrix) and pointwise nonlinearities (take a vector and pass every component through a very simple nonlinearity). In modern neural nets, this nonlinearity is essentially just a half-wave rectifier, so it's very, very simple. The amazing thing is that it works at all.
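The two operations described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk: the layer widths and the random weights are made-up assumptions, and the last layer is left linear so its outputs can be read as scores.

```python
import numpy as np

def relu(x):
    # Half-wave rectifier: keep positive components, zero out the rest.
    return np.maximum(x, 0.0)

def forward(x, layers):
    # Each layer is a (W, b) pair: a linear operation (matrix times vector)
    # followed by a pointwise nonlinearity.
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]
    return W @ x + b  # final layer kept linear (an assumption of this sketch)

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 3]  # hypothetical layer widths
layers = [
    (rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
scores = forward(rng.normal(size=8), layers)
print(scores.shape)
```

Stacking more `(W, b)` pairs in `layers` is all it takes to make the network deeper; nothing else in the forward pass changes.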

There is an interesting paper by Minsky and Selfridge from the 1950s where they say, 'This idea of hill climbing (they call it 'gradient climbing', which is what we now call gradient descent) is never going to work because you're going to get trapped in local maxima (at the time; now local minima).' They have this beautiful hand-drawn drawing where you have sort of a peak in a mountain surrounded by a ridge in the shape of a ring, and they say you can never get to the maximum because of the ring. That was an interesting intuition that a lot of people have had, but it turns out to be wrong. In very high-dimensional spaces, it's actually quite hard to have bad local minima. That's one of the theoretical mysteries of neural nets: that when you make them considerably bigger than necessary, they work really well, they're very easy to optimize through gradient descent, and you never have a local minima problem. This is something that a lot of mathematicians and physicists who are interested in statistical physics are thinking about, and there's some intuition about it, but it's a bit of a counterintuitive notion.

If you don't put any structure in your system, you just stack linear and nonlinear operations. The way we train this, as many of you are aware, is with gradient descent. You have some sort of cost function that measures the discrepancy between the answer we want and the answer we get. You compute the gradient of that cost with respect to all the coefficients of all the matrices in the entire system, and then take one gradient step. We do this either sample by sample or block by block, which is the idea of stochastic gradient descent.
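The training loop described here can be sketched as follows. This is a toy example under stated assumptions, not the setup from the talk: the model is a single linear layer, the cost is the mean squared error, and the data, learning rate, and block size are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: targets produced by a known linear map (an assumption of this sketch).
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = np.zeros(3)   # coefficients to learn
lr = 0.05         # step size (hypothetical value)
batch = 10        # "block by block": ten samples per gradient step

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]
        err = X[idx] @ w - y[idx]                # discrepancy between output and target
        grad = 2.0 * X[idx].T @ err / len(idx)   # gradient of the mean squared error
        w -= lr * grad                           # one gradient step

print(w)
```

Setting `batch = 1` gives the sample-by-sample variant; setting it to `len(X)` gives ordinary full-batch gradient descent.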

One thing we quickly notice is that if we want to apply this to things like images or text, we have to put some structure in the matrices here so that they're not full matrices, because the vectors are very, very large. One such idea goes back to the late 80s: the idea of convolutional networks. This is a sort of vintage early 1990s convolutional net, trying to recognize handwritten digits. What you see here are the various layers. The input is on the left, and the first layer is the first column. The operation performed by each of those layers is a convolution, which is a special case of a linear operation, and then a pointwise nonlinearity — in this case, a hyperbolic tangent. Each of those six maps that you see in the first layer has a different set of coefficients to perform the convolution, and all of those are learned using gradient descent. You stack layers of convolutions and nonlinearities, and there are special types of convolutions in between that we now call pooling, where all the weights are constrained to be equal. That has the property of basically aggregating the response of some filters in an area, and then those are subsampled to reduce the spatial resolution of the representation. As you go up the layers, you get representations that are more global and more abstract, and you force the system to build hierarchical representations.
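One stage of the architecture described above (convolution, pointwise hyperbolic tangent, then pooling with subsampling) can be sketched in plain NumPy. The 28x28 input, the 5x5 filters, and the random filter values are assumptions for illustration; as in most deep learning code, the "convolution" here is implemented as a sliding dot product without flipping the kernel.

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image ("valid" mode, no padding).
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def avg_pool(fmap, size=2):
    # Pooling as a convolution with equal weights, followed by subsampling:
    # average each non-overlapping size x size block.
    H, W = fmap.shape
    H, W = H - H % size, W - W % size
    return fmap[:H, :W].reshape(H // size, size, W // size, size).mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(28, 28))          # stand-in for a digit image
kernels = rng.normal(size=(6, 5, 5))       # six feature maps, each with its own coefficients
maps = [avg_pool(np.tanh(conv2d(image, k))) for k in kernels]
print(len(maps), maps[0].shape)
```

In a trained network the entries of `kernels` would be learned by gradient descent rather than drawn at random.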

We obtained really good results with this in the 90s for handwriting recognition. Why handwriting recognition? That's basically because that was the only thing we could get our hands on: the only dataset for which there were more than a few thousand samples (or a few hundred samples, I should say). Fairly quickly, we realized we could use this not just to recognize single objects, but also multiple objects, multiple characters. That was important because it meant we didn't need a prior pre-processing step where objects would be segmented. We could show the system multiple objects and train it using a sliding window approach to detect individual objects that were centered in the window, while ignoring ones that were on the side. That system can spontaneously recognize multiple characters in this case (and, as we'll see later, multiple objects) and separate figure from ground.

That was working really well. Pretty quickly, we built practical systems to recognize handwritten documents, in particular checks. By the mid-1990s, we fielded a commercial system at AT&T — this was done when I was working at Bell Labs. There's a company called NCR, which at the time was a subsidiary of AT&T, that was commercializing large systems for banks to read checks, and they fielded that system. By the end of the 1990s, the system was reading between 10 and 20 percent of all checks in the U.S., so it was a big success story of the first wave of neural nets in the early 90s.

But by the mid-90s, the machine learning community completely lost interest in neural nets for reasons that are not entirely clear — at least not entirely clear to me. Historians of technology will have to explain this one. But partly it was due to the fact that training the system at the time was very onerous in many ways. It required a bit of black art and software that had to be written from scratch. Nowadays we can use all kinds of open-source libraries; back in those days, open-source was not particularly prevalent, and companies were kind of possessive about software. Also, there were very few domain areas where there was enough data to train those very large learning models. They all required thousands and thousands of examples, if not more, so that was only practical for handwriting recognition and perhaps face detection and speech recognition, but not much else. In fact, for speech it didn't quite measure up with other methods for various reasons.

I actually stopped working on machine learning between 1996 and 2001; I worked on other things (image compression in particular) and came back to it in the early 2000s, in particular with a project on self-driving robots. This was a small robot around 2003, just when I was joining NYU. I worked together with a small company in New Jersey called Net-Scale Technologies, and we built this little robot that we trained by imitation learning. We basically trained a convolutional net to emulate a human driver by being shown images coming from two cameras on a little track and recording the steering angle from the human driver. Then the machine would learn to associate one to the other. It was a very short project, very quick, but it worked really well. We showed the result to DARPA program managers, who said that sounded really interesting and proposed starting a large program on machine learning for robots. That ended up being what was called the LAGR project (Learning Applied to Ground Robots), which started around 2005 and ended about 2008.

The LAGR robot was built by NREC at Carnegie Mellon — it's a robot like the one you see on the top left. The idea we used was to use convolutional nets to do semantic segmentation: basically, apply a convolutional net to every pixel in an image. The convolutional net would look at a fairly large patch around each pixel but be applied to the entire image with a sliding window. It's very efficient to do this with a convolutional net; you can do it very cheaply. Then label every pixel as to whether it's traversable or not. The problem, of course, is how do you collect training data? You could collect a bunch of images and manually label every pixel, but what happens is that we can use classical computer vision — stereo vision — to figure out if a particular pixel is on the ground or above the ground using 3D reconstruction essentially. That works except it only works up to about 10 meters. Past 10 meters, there is not enough resolution in stereo or baseline resolution to be able to tell if a pixel is above the ground or on the ground. But that's enough to label all the pixels within 10 meters, which is enough to train a neural net to figure out what is above the ground versus what is on the ground. So you use those labels automatically collected to train the convolutional net, and then you can apply the convolutional net to monocular data and apply it to the entire image. It tells you whether there's a path beyond 10 meters. That system worked quite well.
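The automatic labeling step described above can be sketched as follows. This is only an illustration of the idea, not the LAGR code: it assumes stereo has already produced a per-pixel range and height above the ground plane, and the height threshold is a hypothetical value.

```python
import numpy as np

MAX_STEREO_RANGE_M = 10.0  # from the talk: stereo is only usable up to about 10 meters
HEIGHT_THRESHOLD_M = 0.2   # hypothetical cutoff between "on the ground" and "above the ground"

def stereo_labels(range_m, height_m):
    # Returns 1 (traversable), 0 (obstacle), or -1 (too far for stereo, left unlabeled).
    labels = np.full(range_m.shape, -1, dtype=int)
    near = range_m <= MAX_STEREO_RANGE_M
    labels[near & (height_m <= HEIGHT_THRESHOLD_M)] = 1
    labels[near & (height_m > HEIGHT_THRESHOLD_M)] = 0
    return labels

# Tiny 2x2 "image" of stereo measurements (made-up numbers).
range_m = np.array([[2.0, 5.0], [12.0, 9.0]])
height_m = np.array([[0.0, 1.0], [0.0, 0.1]])
labels = stereo_labels(range_m, height_m)
print(labels)
```

Only pixels with a label of 0 or 1 would be used as training targets for the convolutional net; the trained net is then applied to every pixel, including the ones stereo could not label.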

Of course, the robot was completely autonomous, so it had Pentium laptop-class computers in its belly — three of them. We could only run the neural net at about one frame per second, so we used it for long-range vision. For short-range vision, we had a low-resolution, about eight frames per second, stereo vision system particularly for handling unexpected obstacles. The system labels the entire image, then we can map the traversability indices into a map centered on the robot, and use that map to plan a path to a particular goal defined by GPS coordinates.

That's the robot in action — the video is accelerated twice. It's being annoyed by pesky grad students (who are entitled to be called pesky because they actually wrote the code for this, so they're pretty sure the robot is not going to run them over). That was quite interesting and successful. That project ended in 2008, way before deep learning was kind of fashionable.

What happened next was we realized that this idea of semantic segmentation could be used not just for a traversability index but could be used for labeling every pixel in an image with the category of the object it belongs to. By 2009-2010, some datasets appeared with a few thousand images (3000 images) that were completely manually labeled at the pixel level with on the order of 30 categories. So we trained our convolutional net on this, and in fact implemented the overall system on an FPGA so we could run it fast — it could run at about 20 frames per second. We beat the accuracy record on three different datasets. We submitted a paper to the Computer Vision and Pattern Recognition conference in 2011 (the submission was in 2010), pretty sure the paper was going to be accepted as a normal presentation. It was actually soundly rejected by all three reviewers, whose main comment was, 'We don't know what a convolutional net is, and we can't believe that a method we've never heard of could work so well.' So I advised my students to not submit further papers that are heavily slanted towards machine learning to computer vision conferences, because it would be a waste of time. But things changed very quickly thereafter.

This work inspired a few people, particularly a company called Mobileye, an Israeli company that now belongs to Intel. They were one of the first to build vision systems for cars for driving assistance and obstacle detection. They licensed the technology to Tesla in 2015. The 2015 models of the Tesla Model S actually include a convolutional net system built by Mobileye. They were extremely fast in switching from whatever technique they were using before to using convolutional nets once they realized it was working really well. The problem they had is that they had a special chip that wasn't built to run convolutional nets, so they had to shoehorn convolutional nets onto that chip. Nowadays they have chips that are more dedicated. Meanwhile, other companies jumped on the bandwagon, including Nvidia and a host of startup companies and larger companies like Uber and Baidu, and many others. Pretty much every single vision-based driving system for cars that you see nowadays uses convolutional nets.

That's all because in 2012, our friends from the University of Toronto managed to get very efficient implementations of convolutional nets on GPUs (graphics processing units). They were getting speedups of about 100 over running them on CPUs, and that allowed them to train a very large convolutional net for the time, something with on the order of almost a billion connections. They trained it on the ImageNet dataset, which had 1.3 million training samples, each of which is an image with a dominant object belonging to one of 1000 categories. They stunned the computer vision world a little bit by bringing down the error rate on ImageNet by a large amount. This was thanks to Alex Krizhevsky (an intern from Toronto) from Geoff Hinton's group, along with Yoshua Bengio's group and mine, and Andrew Ng's group. We had been exchanging a lot and working together towards developing new techniques for deep learning. The Toronto group was the first to have good GPU implementations.

The error rate on ImageNet was on the order of, by some measure (does the correct class rank among the top five?), about 26 percent error until 2011 using classical computer vision methods. In 2012, the Toronto group brought that down to a little over 16 percent. Since then, performance has improved steadily. The error rates now are below 3 percent, and that can be done very easily and routinely. It can be done by training in parallel in something like an hour on a network of GPUs, whereas the first version took about a month. So there's a continuous improvement in performance and a simultaneous increase in the number of layers of those networks — a kind of inflation in depth.
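The measure mentioned above (does the correct class rank among the top five?) can be computed as in this short sketch. The scores and labels are made-up toy values, not ImageNet results.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    # scores: (n_samples, n_classes) array of class scores.
    # labels: (n_samples,) array of correct class indices.
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k highest scores per row
    hit = (top_k == labels[:, None]).any(axis=1)      # is the correct class among them?
    return 1.0 - hit.mean()

# Four samples, ten classes; every row scores class 9 highest and class 0 lowest.
scores = np.tile(np.arange(10.0), (4, 1))
labels = np.array([9, 0, 5, 4])
error = top_k_error(scores, labels)
print(error)
```

With `k=1` the same function gives the ordinary classification error.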

In 2013, one of the best performances was from the VGG network from Oxford. The next year, it was the GoogLeNet from Google — a play on the word 'LeNet,' which was the name my boss gave to the neural net we used in the 90s. Then there was ResNet proposed by Kaiming He (at the time at Microsoft, now at Facebook), DenseNet, and other things. They went from a few layers back in the 90s (seven or eight) to 12 or 13 layers, 20 layers in 2013, all the way to 100 or 150 layers nowadays. Companies like Facebook and Google routinely use networks that are anywhere between 50 and 100 layers for image recognition.

These things are used for image recognition very, very widely. To give you an idea, Facebook users upload on the order of two billion photos per day. Each of those two billion photos goes through four convolutional nets within two seconds of being uploaded. One of them tags all kinds of stuff in the image — is it a wedding, a party, a birthday party, a landscape, an indoor scene? Is there a dog of a particular breed, or a sailboat? There are a few thousand tags like this that are recognized. The second one does face recognition for automatic tagging of your friends. The third one generates captions for the visually impaired — short descriptions of the image. The fourth one basically detects objectionable content like violence and pornography. So it's very widely used, and a lot of computer cycles are spent running convolutional nets these days, along with a lot of work on hardware accelerations for them.

You might ask why we need so many layers. It's probably because the world is essentially compositional — the perceptual world is compositional in the sense that images are made of little edges or oriented contours, contours assembled into motifs, parts of objects made from combinations of those motifs, and objects made of parts. This compositional nature of the perceptual world applies to many things in the natural world, so it's naturally represented by multiple layers. It's always been a puzzle to me why it took so long to convince the pattern recognition community that having multiple layers was a good idea.

Here's a recent experiment from Facebook that shows a very interesting avenue where people use supervised learning less and less. This particular experiment was done where my colleagues took 3.5 billion images from Instagram — these were public photos posted by Instagram users. Whenever people post photos on Instagram, they very often type hashtags to index or describe the content. So what they did was take the 5,000 or so most frequent hashtags that appear in Instagram photos and trained a convolutional net to predict which hashtags would appear for a photo. It's a very weak signal, because hashtags sometimes don't represent anything about the image, but the trend works anyway. Then you chop off the last layer and retrain the last layer or retrain the entire network starting from the pretrained network on something like hashtags. You retrain it on another task, for example, ImageNet or the COCO dataset or some other dataset. The interesting thing is that they were able to beat the records on a number of different datasets this way. So there is a big advantage in learning good generic representations with convolutional nets: it helps to solve any particular task, and you either get better results or you get the same result with fewer labeled samples.
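The "chop off the last layer and retrain it" recipe can be sketched on a toy problem. Everything here is an assumption for illustration: the "pretrained" trunk is just a fixed random ReLU layer standing in for a network trained on hashtags, the new task is a made-up two-class problem, and only the new last layer is trained, with plain gradient descent on a logistic loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trunk pretrained on a large weakly labeled task (hypothetical, random here).
W_trunk = rng.normal(size=(32, 2))

def features(x):
    # Frozen trunk: everything below the chopped-off last layer. Never updated below.
    return np.maximum(x @ W_trunk.T, 0.0)

def log_loss(head, phi, y):
    p = 1.0 / (1.0 + np.exp(-(phi @ head)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Small labeled dataset for the new task (toy: which side of a line a point falls on).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

phi = features(X)
head = np.zeros(32)                       # new last layer, trained from scratch
initial_loss = log_loss(head, phi, y)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(phi @ head)))
    head -= 0.05 * phi.T @ (p - y) / len(y)   # gradient step on the head only
final_loss = log_loss(head, phi, y)
accuracy = np.mean(((phi @ head) > 0) == (y > 0.5))
print(initial_loss, final_loss, accuracy)
```

Retraining the entire network instead, as the talk also mentions, would mean updating `W_trunk` as well, starting from its pretrained values.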

This is a theme I will come back to in just a minute. But to complete the state of the art and a bit of history, here's a more recent work from Facebook — a good demonstration of what current computer vision systems can do. This system is called Mask R-CNN and does instance segmentation. It's basically a convolutional net applied with a sliding window at multiple scales over the image. It's not just trying to produce a category for what it sees in the window, but also to produce a mask of the object — the dominant object in its viewing window. It can produce results like this: it not only recognizes that there are people, but it can tell that there are seven people, draws the outline of every person, and you can easily put a bounding box around them. It detects the baseball, the dog, the wine bottle, the glasses, the computers in the back, the people even if you can only see them partially, backpacks, umbrellas, almost completely occluded cars. It can even count sheep apparently. It works amazingly well. People in computer vision, if you had asked them five years ago, would have been hesitant to say we could do this now.

This system is trying to not just outline objects but also pinpoint key points on human bodies from which you can reconstruct a kind of skeleton of a person. Not only does that work really well, it actually works in real time on a mobile device. What you see here is not a fake video — it runs at that speed on a smartphone and can track a human body in real time, basically reconstructing the skeleton. This uses a mobile neural net infrastructure from Facebook called Caffe2Go. All of this is open source. One good thing about Facebook AI research is that most of our research is open — we publish everything and we distribute most of our code as open source. This whole system is called Detectron, and you can download it and play with it. There are two versions: one written in Caffe2 (pure C++) and another one written in Python which has a backend.

Another project took place at Facebook AI Research in Paris. It's called DensePose and consists of, in real time, running on a single GPU at 20 frames per second, mapping a 3D mesh onto any number of humans in a video. Again, this is real time on a single GPU, so you can change the clothes of someone or remove them, like in this case. This work was led by Natalia Neverova and ... at our research lab in Paris.

I've talked about image recognition, but we can apply these things to other domains like translation. One thing that has been stunning over the years is the merging of the underlying techniques used by several different communities. It used to be that speech recognition, computer vision, and natural language processing used completely different methods, but now they basically all use neural nets, and many of them use convolutional nets. This is a system developed at Facebook for translation based on convolutional nets — a particular type called gated convolutional nets, which I don't have time to go into. It works really well. It's also open source, called Fairseq, and it's a sequence-to-sequence convolutional net. It can be used for other things like text generation and various other applications.

There are tons of applications of convolutional nets and deep learning in general: medical image analysis, driving assistance or autonomous driving, accessibility, translation, content understanding for search filtering and ranking, games, and a growing number of applications to science, particularly in high-energy physics, astronomy, genomics, and biology. Of course, everybody would want to build virtual assistants, but they don't quite work yet. What's missing there is reasoning. We're at an AI conference, and a lot of AI is about reasoning. What's missing from everything I talked about here is reasoning: it's basically just perception. Sophisticated perception, but still just perception. So how do we get neural nets to do reasoning? How do we marry deep learning with reasoning? There's a bunch of really interesting work in that direction.

Augmenting neural nets with memories has been done in part by some of my colleagues at Facebook, like Jason Weston and colleagues. Something called Memory Networks, where you basically have a piece of the network that is dedicated to being a memory, a little bit like a RAM, except that it's differentiable. The system can use this as a scratch pad memory to solve particular problems, like hold a dialogue and answer questions about movies. These systems are basically asked a question, then they access a memory multiple times and produce the answer. They are trained end-to-end in supervised mode by being fed answers. If you use these models to predict what the subsequent answer of the human in a dialogue will be, you can actually accelerate the learning of the system to produce the right answer. So if the system can predict what the next answers are going to be, it can train itself faster to answer the correct questions.
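The idea of a memory that is "a little bit like a RAM, except that it's differentiable" can be sketched with a soft lookup: instead of reading one address, the network reads a softmax-weighted mixture of all slots. This is a simplified illustration, not the Memory Networks implementation; the key and value matrices, the number of hops, and the query update rule are assumptions of the sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def memory_read(query, keys, values):
    # Soft addressing: every slot gets a weight, so the read is differentiable
    # with respect to the query, the keys, and the values.
    weights = softmax(keys @ query)
    return weights @ values, weights

def answer(query, keys, values, hops=2):
    # Access the memory several times, folding each read back into the query.
    q = query
    for _ in range(hops):
        read, _ = memory_read(q, keys, values)
        q = q + read
    return q

keys = 10.0 * np.eye(3)                      # three memory slots with distinct keys
values = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 3.0]])
query = np.array([1.0, 0.0, 0.0])            # matches the first slot
read, weights = memory_read(query, keys, values)
print(weights, read)
```

Because every step is a smooth function, gradients can flow from the final answer back into what is stored and how it is addressed, which is what allows end-to-end training.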

Another interesting work in that domain is something more recent, also at Facebook, led by Drew ... and ... It works by dynamically building a neural net by assembling trainable, differentiable modules in a kind of input-dependent way. For example, you want to ask the system a question. You show it an image and ask a question like, 'Is the shiny object to the right of the gray metallic cylinder the same size as the large rubber sphere?' Even as humans, to answer this kind of question, we sort of have to configure our frontal cortex or perceptual system to pinpoint the right features to look at, count the number of objects, figure out if they are near each other. That's exactly what this does: it actually synthesizes a network depending on the question — an arrangement of differentiable modules meant to compute the answer. The beauty of this is that you can train this end-to-end; you can train the system to produce the right set of modules on the fly. That is some sort of dynamic network determined by a program. Traditionally, a neural net was something you would define the architecture of in a static way, then run it and train it. But this thing — you write a program, and the output of the program runs differently every time depending on the input. In the background, there is a software infrastructure that figures out how to compute the gradient of the output with respect to all the free parameters.

This has led some people to call this 'Software 2.0' or 'differentiable programming.' Basically, it's a new way of programming where the instructions are not really completely defined, but they are parameterized modules. You write your program, and the actual function of the program is finalized by training it on data however you want.

There are a lot of open-source projects here. PyTorch is the framework we use at Facebook for deep learning, and it's open source. It's actually a community project; Facebook is just a main contributor. There are a lot of other systems: FAISS for fast similarity search, distributed reinforcement learning, and there is actually a professional-level Go player called ELF OpenGo that has been built with this, and it's completely open source. Others I have already mentioned.

Speaking of reinforcement learning, it's been making a lot of progress, as we saw with the prize that was accepted today. It's really impressive what's happened, but reinforcement learning falls short of learning things at the same speed as humans or as efficiently as humans. A lot of the successes of reinforcement learning have been in the domain of games or virtual environments, or situations where you have so much data that the number of trials is immaterial, like ranking and things like that. To give an example, this figure from a recent paper from a group at DeepMind shows the best combination of algorithms to train a system to play Atari games and reach human-level performance. The best currently known reinforcement learning algorithms take about 100 hours of real-time training — because you can run the games faster, but if you figured out how long it would take if you played the game at normal speed, it would be about 100 hours. A human can reach that performance in a few minutes of playing. So that's quite different.

This was supposed to be a video of an agent playing Doom. RL works really well for games, there's no question about it, because you can run games really fast, faster than real time, in parallel. A lot of people are working on StarCraft these days, which is much more challenging because it's on several timescales. But RL (classical reinforcement learning, or what people call model-free RL) is hard to use in the real world because it requires many, many trials to learn anything. If you were to use current versions of reinforcement learning to train a car to drive itself, it would have to run off a cliff a few thousand times before it figures out that's a bad idea, and probably run off the cliff a few more thousand times to figure out how not to do that. This would also apply to running over pedestrians, driving on the left, and other things. Humans seem to be able to learn to drive with about 20 hours of training for most of us, without ever questioning it. So we're missing something very fundamental.

What are we missing to get to real AI? That's the question: how do we get machines to learn as efficiently as humans and animals? Supervised learning requires too many samples, reinforcement learning requires too many trials, and machines don't have common sense. Common sense is the red herring of AI, if you want. Why is the sample efficiency of our learning so terribly bad? Actually, most of the systems we use today have no task-independent background knowledge about the world, like most humans do. They have no common sense (which is a bit of the same thing), no ability to predict the consequences of their actions — they have to try to see if an action results in a good outcome. They have no ability for long-term planning or reasoning without actually interacting with the world. In short, what they lack is world models.

If you have in your head a model of the world, you can plan in advance; you can think about the consequences of your action without actually acting and finding out if the world is happy with it or not. You can kind of run this in your head. The absence of a world model prevents us from building things we want to build. We'd like to have machines with common sense because that would be the basis for dialogue systems and virtual assistants that would really help us in daily life. It would also probably be the basis for dexterous robots. We can't actually build robots today that can fill a dishwasher, and we certainly don't have many ideas how to get to general intelligence. So this program of modeling the world is a big issue, and that was the title of my talk.

So how do humans and animals learn? How do you learn so quickly? If you talk to developmental cognitive scientists like Emmanuel Dupoux, who is now in Paris, they do experiments like the one you see on the top left: you take a little cart, put it on a platform, and push the little cart, and the cart doesn't fall (it's a trick: it's held from the back). The baby is shown this but can't see the trick. Babies younger than six months look at this and shrug: sure, that's the way the world works, no problem. Babies after the age of ten months or so react like the little girl on the bottom left. They are extremely surprised that the object isn't falling, because between six months and nine months, they've learned a concept of gravity. They know that an object that is not supported is supposed to fall. How does that happen?

I put together this chart that indicates at which age babies learn various concepts. Face tracking happens very early, then there is biological motion (the difference between animate and inanimate objects), and things like gravity and inertia occur around nine months or so. Most of this is learned purely by observation. There is very little interaction of babies with the world because before six months, they basically can't do anything, right? So they mostly observe, and they amass an enormous amount of background knowledge about how the world works just by watching. How do we do this with machines? And it's not just humans; it's also the case for other animals.

Here is a baby orangutan being shown a magic trick. The animation is supposed to play. When it sees the trick, it rolls on the floor laughing because the trick breaks its model of the world. It actually loved it.

Here is a concept that might help on the way to getting machines to accumulate a lot of background knowledge about the world: the concept of self-supervised learning. Basically, the idea is to predict any part of an input from any other part. Some people used to call this unsupervised learning, but I think that can be a confusing word. Imagine you have a snippet of video. You've observed the past — the purple stuff is the past — and the blue stuff is what the machine is supposed to predict. Predicting the future from the past is an example of self-supervised learning. Of course, you can wait to observe the future so that you can use it as a training feedback signal to train the predictor. You can also try to predict the future from the recent past, pretending you don't remember the distant past. You can try to predict the past from the present, pretending you don't remember the past and trying to predict what happened. Or you can try to predict the left half of the image from the right part, or the top from the bottom — any part of the percept that is occluded or not currently visible, as long as you can observe it at some point.
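To make the "predict any part of the input from any other part" idea concrete, here is a minimal sketch of how training pairs could be carved out of unlabeled data. The function names, the window sizes, and the use of plain Python lists as stand-ins for frames and images are all assumptions made for illustration; nothing here comes from the talk's slides.

```python
def past_future_pairs(frames, n_past, n_future):
    """Slide a window over a sequence: the past is the input, the future is the target."""
    pairs = []
    for t in range(n_past, len(frames) - n_future + 1):
        pairs.append((frames[t - n_past:t], frames[t:t + n_future]))
    return pairs


def hide_left_half(image_rows):
    """Treat the right half of each row as visible and the left half as the target."""
    half = len(image_rows[0]) // 2
    visible = [row[half:] for row in image_rows]
    target = [row[:half] for row in image_rows]
    return visible, target


frames = list(range(10))  # stand-in for ten video frames
pairs = past_future_pairs(frames, n_past=4, n_future=2)
print(pairs[0])  # ([0, 1, 2, 3], [4, 5])
```

No labels are needed in either case: the target is just a part of the data that was held back, which is why the supervision signal is available as soon as the hidden part is eventually observed.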

The basic idea of self-supervised learning: pretend that there is a part of the input you don't know and predict it. The amount of information you ask the machine to predict is very large. In reinforcement learning, you give the machine one scalar value once in a while — the reward. In supervised learning, you tell it the correct answer, so it's a few bits. In self-supervised learning, you ask the machine to predict everything in the world from everything it observes, so it's a lot more information. Therefore, you can presumably constrain a much larger system to learn much more complex, task-independent stuff about the world.

This led me to a completely obnoxious slide that has become a bit of a meme in the machine learning community: if you liken intelligence to a cake, the bulk of the cake (the génoise) is self-supervised learning. That's where we learn everything; almost everything we learn is learned just by observation. We learn a little bit with supervised learning (maybe when we go to school), and a tiny amount with reinforcement learning. So that is the cake analogy: reinforcement learning is the cherry on the cake. But note that this is a Black Forest cake, and the cherry is not optional.

This idea of unsupervised or self-supervised learning being central is not my idea. Geoff Hinton has been trumpeting this for 40 years or so; I was skeptical at first but sort of rallied to his opinion in the last couple of decades. Self-supervised learning allows us to fill in the blanks, which is a very important aspect of common sense. Whatever the next revolution in AI will be, it will certainly be based on machine learning and will, of course, have some component of deep learning.

Could self-supervised learning lead to common sense? There's a huge amount of things that we infer about the world, maybe based on all the background knowledge we've accumulated. For example, if I say 'John picks up his briefcase and leaves the conference room,' there's a lot you can infer about John and the scene — things due to our knowledge of human society (John is a man, he probably works), as well as things that seem completely obvious to us but are not obvious to a machine: the fact that John is not going to dematerialize from the room, but will actually walk to the door, not go through the wall, not fly — basic things about physics that we've learned. There's a huge amount of information we can derive from a few words. Common sense is what allows us to interpret pictures like the one at the bottom right: a current event that happened just yesterday evening, President Macron after the French victory in the World Cup.

We need to learn predictive models of the world. This is something that people in optimal control have been doing for many years: you have a system you want to control, you build a model of it (system identification), identify the parameters, then use it to roll out a sequence of commands, and optimize that sequence to minimize a particular objective function. Any AI would work the same way: you have a world model in your head of how the world works, and you can play a sequence of actions in your head, see what the result is, and optimize the sequence of actions in advance. In reinforcement learning, this is called model-based RL. The problem with model-based RL is that it's very hard to make it work because it's hard to build models of the world that are accurate.
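The loop of "roll out a sequence of commands through the model, then optimize that sequence" can be sketched in a few lines. This is a toy under stated assumptions: the one-dimensional `model`, the cost function, and the random-shooting search (sample many action sequences and keep the cheapest) are illustrative choices of mine, not the method used in the talk, where the sequence would more typically be optimized by gradient descent.

```python
import random


def model(state, action):
    """Hypothetical world model: a 1-D position nudged by the action."""
    return state + 0.5 * action


def rollout_cost(state, actions, target):
    """Play the actions through the model only, never the real world, and score them."""
    cost = 0.0
    for a in actions:
        state = model(state, a)
        cost += (state - target) ** 2 + 0.01 * a ** 2
    return cost


def plan(state, target, horizon=5, n_candidates=2000, seed=0):
    """Random-shooting planner: sample action sequences, keep the lowest-cost one."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = rollout_cost(state, actions, target)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


actions, cost = plan(state=0.0, target=1.0)
print(cost < rollout_cost(0.0, [0.0] * 5, 1.0))  # planning beats doing nothing
```

All the trial and error happens inside `rollout_cost`, so a bad action sequence costs only computation. The quality of the plan is bounded by the accuracy of `model`, which is the difficulty with model-based RL noted above.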

In the interest of time, I'll skip a couple of things. The main issue is that the world is not entirely predictable. If I put a pen on a table and show this little video to a machine, and ask it to predict what's going to happen next — predict the next few frames — the machine can predict that the pen is going to fall. But you can't predict in which direction. So a model of the world has to predict not just one point, but a whole set of potential plausible futures. That means this model has to have access to a latent variable that essentially parameterizes this set of potential plausible futures. When you vary this variable, the predictor spans the entire set. That's a latent variable model.
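As a toy version of the pen example, the sketch below uses a single latent number to index the set of plausible futures. The function name, the unit pen length, and the choice of mapping the latent to a fall angle are assumptions for illustration only; a real predictor would be a trained network that also conditions on the observed frames.

```python
import math


def predict_pen_tip(z, length=1.0):
    """Map a latent z in [0, 1) to one plausible resting position of the pen tip."""
    angle = 2.0 * math.pi * z
    return (length * math.cos(angle), length * math.sin(angle))


# Sweeping the latent variable traces out the whole set of plausible outcomes.
futures = [predict_pen_tip(k / 8) for k in range(8)]
for x, y in futures:
    print(round(x, 3), round(y, 3))
```

Every value of `z` gives a sharp, individually plausible outcome (the tip lies somewhere on a circle around the base), and no single output has to stand in for all of them.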

Now we need to train this machine with an objective function that tells us if the output of the network is on the red ribbon (the set of plausible futures) or outside. It should give us an indication of which direction to change the output so that it gets closer to the red ribbon. The problem is that we don't know what the red ribbon is, so we have to train another neural net to tell us if we're on the red ribbon or not. This neural net learns a contrast function that looks a bit like this: it gives low energy to things you observe (plausible predictions) and high energy to everything else. Training these two networks simultaneously is the idea of adversarial training. You have a discriminator that basically tells you if you are on the manifold of plausible predictions or outside. You train it by showing it examples of real data (and making their energy low) and then showing it predictions from a predictor, the generator (initially bad, so making their energy high). The generator trains itself to bring its outputs closer to the manifold. This idea of adversarial training has had enormous success. These are completely non-existent, fake bedrooms generated by an adversarial net. This comes from a paper called 'DCGAN' from a few years ago. This is more recent work: a generative adversarial network trained on photos of celebrities, from Nvidia, published earlier this year. These are non-existent celebrities: a deconvolutional net (a convolutional net backwards, if you want) is fed random numbers and generates an image of an alleged celebrity. These don't look like any samples in the training set.

We can use these for video prediction. If you train with just L2 (mean squared error) — you take a large convolutional net, show it a few frames, ask it to predict the next few frames — you get blurry predictions. The machine cannot decide among all the possible futures which one to produce, so it produces the average of all possible images. If you train with adversarial training, you get predictions like the bottom image: the first four frames are observed, the last two are predicted. This is another example trained on short snippets of videos from New York apartments. As the camera rotates, it has to invent what the rest of the apartment looks like — it invents the bouquet, the couch, and everything.
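The blurriness under L2 training can be reproduced in a toy setting. A minimal sketch, assuming a single scalar "pixel" whose future value is -1 or +1 with equal probability (an illustrative setup of mine, not data from the talk): the prediction that minimizes mean squared error is the average of the two futures, which matches neither.

```python
def mse(pred, targets):
    """Mean squared error of one constant prediction against all observed futures."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


# Two equally likely futures for the same observed past.
targets = [-1.0] * 50 + [1.0] * 50

# Search a grid of candidate predictions for the one with the lowest MSE.
candidates = [i / 10 for i in range(-10, 11)]
best = min(candidates, key=lambda p: mse(p, targets))

print(best)                # 0.0, the average of the two futures
print(mse(best, targets))  # 1.0
print(mse(1.0, targets))   # 2.0, a sharp and plausible guess scores worse
```

Committing to one of the real outcomes is penalized more heavily than hedging between them, so a deterministic predictor trained this way drifts toward the mean image.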

If you are interested in self-driving cars, it might be more interesting to work on videos from dashboard cameras in cars. Some work also done in Paris by Pauline ... and ... : we run the Mask R-CNN engine on the images, and then run a predictor that tries to predict those semantic segmentation maps. The initial frames are observed, and the last three are predicted. We can predict that a pedestrian crossing the street will keep crossing, and that a car in front turning left will keep turning left. It is very useful for a self-driving car to be able to predict what's going to happen.

Lastly, I want to show you latent variable models — not adversarial training, but another way of solving the problem of multiple predictions. The architecture here is an encoder-decoder: you show an input to the system, it produces a code, then you run through a decoder that tries to predict the future. You can't predict the future entirely because of uncertainty, so you introduce a latent variable Z which is additively combined with the code. Z is sometimes set to zero, sometimes predicted from a deterministic function that considers either the observed future or the prediction error. It's cheating a little, but it limits the information content in Z.
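The data flow of this architecture can be sketched with scalar stand-ins for the networks. Everything below is hypothetical: the `encoder` and `decoder` are fixed toy functions rather than trained nets, and clipping `z` to a small range is my own placeholder for "limiting the information content in Z", not necessarily the mechanism used in the actual system.

```python
def encoder(x):
    """Hypothetical encoder: maps the observed input to a code."""
    return 2.0 * x


def decoder(h):
    """Hypothetical decoder: maps code plus latent to a predicted future."""
    return h + 1.0


def predict(x, z=0.0):
    """The latent z is combined additively with the code before decoding."""
    return decoder(encoder(x) + z)


def infer_latent(x, y_observed, limit=0.5):
    """Deterministic function of the prediction error made with z = 0.

    Clipping to [-limit, limit] is an assumed way to restrict what z can carry.
    """
    error = y_observed - predict(x, z=0.0)
    return max(-limit, min(limit, error))


x = 1.0
print(predict(x))                           # 3.0, the z = 0 prediction
print(predict(x, infer_latent(x, 10.0)))    # 3.5, z saturates at the limit
```

Because `z` is computed by peeking at the observed future, it can only be inferred this way during training. At test time you would set it to zero or sample it, and if `z` were unrestricted the decoder could ignore the input and copy the future through `z`.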

One application is for self-driving cars. You have one of these prediction systems that sees the environment of the car: a little image shows our car in the middle and all the cars around it, along with the speed and position of our car. We train a world model that, given an action (steering angle and acceleration), predicts the next state: what the environment around our car will look like a fraction of a second from now. We can use the real world for training. If we train a deterministic predictor on this, we get blurry predictions: on the left is what happens in the world, and on the right is the prediction from the deterministic model. So, bad idea. Now replace this with a latent variable model: we can sample different latent variables to get different predictions. These are completely made-up, dreamed future scenarios from different draws of the latent variable, and they all look reasonable. We can use these models for prediction and planning.

I'm out of time, so I'll skip to the conclusion slide. The quest for understanding intelligence, whether it's human or machine, is probably one of the biggest scientific challenges of our times, and possibly one of the biggest technological challenges. The next few decades are going to be very exciting. It may take 30 years or more to reach human-level AI. One interesting thing is that in the history of science, there have been many domains where the science followed the technology: the telescope was invented before optics; optics was invented to explain how telescopes worked. The steam engine was invented before thermodynamics; thermodynamics was invented to explain the limits of heat engines and became a fundamental branch of physics. Aerodynamics was mostly invented after the airplane. Computer science resulted from the invention of the calculator. So what is the equivalent of thermodynamics for intelligence? Is there a science of intelligence that we haven't figured out? That's my scientific project for the next decade and a half. You can do useful work in research, hopefully a little more. I am really looking forward to this community making significant contributions to this. Thank you very much for your attention. My apologies to the organizers for going way over time.