Yann LeCun
Chief AI Scientist, Meta/Independent

Yann LeCun: World Models: Enabling the next AI revolution

🎥 Jun 01, 2026 📺 Computer Vision and Geometry Group, ETH Zurich ⏱ 58m
Talk given by Yann LeCun at ETH Zürich during "Frontiers of Embodied AI".

About Yann LeCun

Yann LeCun, the Turing Award winner and former chief AI scientist at Meta, has been publicly advocating for an alternative approach to artificial intelligence that moves beyond large language models (LLMs). In talks and interviews from 2025 and 2026, LeCun described LLMs as useful for tasks like code generation and information access but argued they are not a path to human-level intelligence, stating that they lack the ability to predict the consequences of their actions and cannot handle the "messy" real world. He has promoted his Joint Embedding Predictive Architecture (JEPA) and "world models" as a more promising direction, emphasizing that AI systems should learn abstract representations rather than generating pixel-level predictions. LeCun has also been critical of vision-language-action (VLA) models used in robotics, calling them "doomed" and asserting they do not work well without vast amounts of training data. LeCun left Meta in early 2026 and became executive chairman of a new company, Advanced Machine Intelligence (AMI) Labs, which focuses on "physical AI" for robotics and industrial control. He also serves as chief scientific advisor to the Tapestry project, an open-source AI initiative under the AI Alliance that aims to collaboratively train foundation models without pooling private data. LeCun has argued that a diverse ecosystem of AI assistants is necessary to protect cultural and linguistic diversity, and that current models produced by a handful of companies pose risks to information diversity. He has described his mission as "protecting democracy" by ensuring people have access to a wide variety of information sources.

Source: AI-verified profile updated from Yann LeCun's recent appearances.

Transcript (72 segments)
✨ AI-enhanced transcript with speaker attribution
Yann LeCun 0:00
I'll talk about world models, possibly the enabler for the next AI revolution. So, there's a lot of machine learning people in the room perhaps. I have bad news for you. Machine learning sucks.
Basically, when we compare the learning abilities of machines with humans and animals, clearly there is a big gap. People and animals can learn new tasks extremely quickly and with very few trials, very few samples. People have common sense, animals too, physical common sense. There's a lot of tasks that we can accomplish zero-shot even if we've never faced them before.
How do we do this with machines? We have very powerful AI techniques that everybody is using, but they don't really handle the real world. They don't handle continuous, high-dimensional, noisy data. Language is easy by comparison. The real world is messy. Language is simple.
This connects with what Vladlen said earlier, and Jitendra as well. The Moravec paradox: things that are simple for humans are difficult for computers, and things that are complicated for humans turn out to not be that difficult for computers, like playing chess, computing integrals symbolically, solving equations, proving math theorems, etc.
How is it that a 10-year-old can basically do what you would like a domestic robot to do and do most of those tasks without actually being trained to do them? The first time you ask them, they can do it. They may not want to do it, but they can. How come any teenager can learn to drive a car in a few hours of practice, yet the self-driving car companies have literally millions of hours of training data? And despite that, they can't use those millions of hours of training data to get a machine to just imitate humans to drive at the same level of reliability.
Otherwise, we would have level five self-driving cars, and we don't have them. At best in the consumer car business, we have level two or three, and the robo-taxis are very heavily engineered, with various sensors and other things. So, we keep bumping into this Moravec paradox, and we really have to go beyond this.
If you believe that intelligence requires grounding, of course some philosophers and certainly some language people don't believe that's necessary, but I think it is. Like Vladlen, and since we're in Switzerland, I'll bring up Jean Piaget. He was a big influence on me. He had a debate with Noam Chomsky in France in the late 1970s where they were debating whether language was innate or learned.
There were transcriptions of that debate with people participating in it. And one of them was a guy who had worked with Jean Piaget, who was a professor at MIT and was talking about the perceptron, kind of saying there's those simple machine learning models that are capable of learning surprisingly complex tasks, and that may be kind of evidence for the fact that learning is possible, contrary to what Chomsky was saying. This guy was Seymour Papert. He was a professor at MIT, and 10 years before that he had written a book that basically killed the entire field of neural nets, basically pointing out the limitation of the perceptron. But here he was 10 years later arguing for the fact that those things were actually interesting to study.
Anyway, so Jean Piaget says intelligence is not what you know, it's what you do when you don't know. In fact, he never actually said this. This is apocryphal. But there are other psychologists who basically kind of distilled his thinking into this sentence which he never said. So he's kind of quoted as saying that. Intelligence is not an accumulation of declarative knowledge. LLMs are an accumulation of declarative knowledge. Not just, but the main reason they're useful is because they can accumulate a lot of declarative knowledge. Intelligence is not a collection of skills.
You can probably build a machine to accomplish sort of any task if you spend enough resources on it, including things like self-driving. But that's not really what intelligence is. Intelligence is the ability to learn to drive in about 20 hours, or to learn any new task with very little training, or accomplish new tasks. That's really what intelligence is, and that's really what AGI means.
What that means is that we're not going to have any simple measures of intelligence, because for any particular task, if you spend enough effort and time, you can always kind of crack it. So it's more how adaptive you are. And this connects to something Vladlen said: the notion of AGI is complete nonsense. Human intelligence is specialized. The characterization of human intelligence is that it's very quickly adaptive and we can learn new tasks. We all know different sets of knowledge and have different skills.
It's because we've been exposed to different environments and we've had to solve different problems. We're adaptive. That's really what intelligence is.
Okay. So, how do humans learn, and animals for that matter? There's a lot of learning that takes place in the early months of life, mostly by observation. So, a two-month-old baby can gesticulate, can develop a dynamical model of its own limbs, but basically cannot affect the world. It can't move an object or anything. But it can learn a lot of things about the world. One thing a baby can learn really quickly is that the world is three-dimensional.
Why? Because the fact that an object has a distance, every point in the world has a distance from us, is the best way to explain how our view of the world changes when we move our head. And of course babies don't necessarily move their head, but they are being moved. So they see parallax and sort of derive from this the fact that the world is three-dimensional. We can do this with learning machines today. They learn that the world is three-dimensional only by being exposed passively to videos. So that's an interesting thing.
And I'm going to silence my phone because it's actually okay.
So basic concepts like object permanence are learned really quickly. Notions of stability, rigidity and things like that. But then what we would consider intuitive physics, things like inertia, gravity, that actually takes nine months for human infants, shorter for most animals.
If you put an eight-month-old on a high chair and you put a bunch of toys, eight or nine months old, the child would most likely systematically take all the toys and throw them on the floor and watch the result. They're doing the experiment that gravity actually applies to everything.
So that takes a long time. How does that happen? What type of learning is taking place there? They're doing the experiment, but they can learn about gravity just by observation as well. So if you show the scenario here at the bottom where a car is on a platform, you push it off the platform, it appears to float in the air. A six-month-old will barely pay attention, hasn't learned about gravity yet. A 10-month-old will look very surprised, like the little girl. And that's actually how psychologists measure whether a baby has learned a particular concept about the world, which is the violation of expectation. And we can actually use those techniques to test whether machine learning systems have acquired some notion of common sense. So there's a lot that can be said about this. Jitendra and I collaborated on a paper here, mostly written by Emmanuel Dupoux actually, and Jitendra had very little contribution to it, on this whole kind of set of questions.
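The violation-of-expectation test described above can be sketched for a machine as a prediction-error measurement. This is only an illustration: the falling-object world model, the function names, and the two clips are hypothetical stand-ins, not anything shown in the talk.

```python
def predict_next_height(height, velocity, dt=0.1, g=9.81):
    """Hypothetical toy world model: an unsupported object accelerates downward."""
    new_velocity = velocity - g * dt
    return height + new_velocity * dt, new_velocity


def simulate_fall(start_height, steps, dt=0.1):
    """Generate a physically plausible clip using the same toy dynamics."""
    heights, velocity = [start_height], 0.0
    for _ in range(steps):
        height, velocity = predict_next_height(heights[-1], velocity, dt)
        heights.append(height)
    return heights


def surprise(observed_heights, dt=0.1):
    """Mean squared gap between what the model expected and what was observed."""
    velocity, total = 0.0, 0.0
    for previous, observed in zip(observed_heights, observed_heights[1:]):
        predicted, velocity = predict_next_height(previous, velocity, dt)
        total += (predicted - observed) ** 2
    return total / (len(observed_heights) - 1)


falling_clip = simulate_fall(start_height=1.0, steps=5)
floating_clip = [1.0] * 6  # the object is pushed off the platform but stays in the air

print("surprise, falling: ", surprise(falling_clip))
print("surprise, floating:", surprise(floating_clip))
```

A system that has picked up gravity should score the floating clip as more surprising than the falling one, which is the machine analogue of the longer stare psychologists look for.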
Okay. But what is intelligence really if it's not an accumulation of skills nor an accumulation of declarative knowledge? It's the ability to accomplish new tasks, as I said, solve new problems without prior training. And again, AGI makes no sense as a phrase. Human intelligence is specialized, and the question is not do you know how to do everything, it's can you learn quickly how to do anything or a wide spectrum of things. This is a little kind of somewhat philosophical paper here at the bottom written by some of my young colleagues.
So here is a simple calculation. There's still a lot of people, particularly on the west coast of the US, who believe that we're going to reach what they call AGI by scaling up LLMs, maybe training them on synthetic data, maybe using a few tricks in post-training and reinforcement learning. And I think that's impossible. I'm a believer in sort of grounded intelligence if you want. But you can do this simple calculation: a typical LLM of today is trained on something like 20 trillion words, that corresponds to about 30 trillion tokens, and each token is three bytes, something like that. So the data volume is about 10^14 bytes. This would take about 400,000 years for any human to read.
Then compare this with what a four-year-old has seen during his or her life. That's about 16,000 hours of wake time, which by the way is a small amount of video. It's about 30 minutes of YouTube uploads. And we have two million optic nerve fibers carrying about one byte per second each. So the data volume that a four-year-old has seen through vision, and probably through touch as well, is about 10^14 bytes. So a four-year-old has seen, through vision, the same amount of data as 400,000 years of reading text, with all the human-produced text available publicly on the internet. We're not going to get to anything like human-like intelligence by just training on text. It's just not going to happen.
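The arithmetic behind this comparison can be checked in a few lines. The token count, bytes per token, and optic-nerve figures are the ones quoted in the talk. The total of 16,000 waking hours is taken as the figure that makes the vision side come out near 10^14 bytes. The reading speed and hours of reading per day are assumptions added here, only to show that a figure in the range of 400,000 years is plausible.

```python
SECONDS_PER_HOUR = 3600

# Text side (figures quoted in the talk).
llm_tokens = 30e12            # about 30 trillion tokens
bytes_per_token = 3           # "three bytes, something like that"
llm_bytes = llm_tokens * bytes_per_token

# Vision side.
waking_hours = 16_000         # assumed total waking hours by age four
optic_nerve_fibers = 2e6      # two million fibers
bytes_per_fiber_per_second = 1
child_bytes = (waking_hours * SECONDS_PER_HOUR
               * optic_nerve_fibers * bytes_per_fiber_per_second)

# Reading time for the text corpus (both rates below are assumptions).
llm_words = 20e12             # about 20 trillion words
words_per_minute = 250        # assumed reading speed
reading_hours_per_day = 8     # assumed
reading_years = llm_words / words_per_minute / 60 / reading_hours_per_day / 365

print(f"LLM training text:      {llm_bytes:.1e} bytes")
print(f"Four-year-old (vision): {child_bytes:.1e} bytes")
print(f"Years to read the text: {reading_years:,.0f}")
```

Both data volumes land within a small factor of 10^14 bytes, which is the point of the comparison.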
So of course you're going to say, well, video is much more redundant than text. But in fact, that's a feature, not a bug. If you want to train a system, particularly using self-supervised learning, you need redundancy in the data. If you don't have redundancy, you can't learn anything. So redundancy is a good thing. You don't want too much of it though.
Okay. So then there is another question about what are the right properties of intelligent systems. And in my opinion, an important property of an intelligent system is the mode of inference. Does it compute its output by propagating through a fixed number of layers of some neural net? Or consider the alternative. The alternative is computing the output of a system by searching for an output that is most compatible, if you want, with the input.
So you observe a situation that runs through some perception module that produces some sort of representation of the current state of the world as you observe it. You can directly produce an action. Okay, that's a reactive system if you want. Or you could imagine an action and then have an intelligent system figure out is this a good action for this observation. Is this something that will accomplish the task I want? So the objective here characterizes whether the task the system wants to accomplish has been accomplished or not. Think of it as a cost function. It's not used for learning. It's used for inference. Think of it as negative likelihood in a probabilistic inference model, or as I prefer to think of it, an energy function. So basically there the inference is a process by which you search for an output that minimizes some energy function at inference time. Okay, that's intrinsically more powerful computationally than just propagation through a fixed number of layers.
And then contrast the model on the left, which is sort of LLM-like, right? Take a window of inputs, run this through a fixed number of layers of some big neural net with a few hundred billion parameters, produce one token, okay, then shift that token into the input and then produce the second token, etc., etc. That's autoregressive prediction, and every token involves a fixed amount of computation running through a fixed number of layers of some neural net. This is not a good model. It's not a good model of reasoning. The way you coerce an LLM to do reasoning is that you trick it into generating more tokens. But that's not the way we reason. We reason internally. We don't reason in token space in language even.
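The autoregressive loop described here can be sketched with a toy stand-in for the network. The lookup table, its probabilities, and the one-token context window are hypothetical; the point is only the shape of the loop: one fixed-cost call per token, with each output shifted back into the input.

```python
# Hypothetical toy "model": a table of next-token probabilities.
# A real LLM replaces this lookup with a fixed stack of neural-net layers.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.9, "world": 0.1},
    "the": {"world": 0.6, "model": 0.4},
    "world": {"model": 0.7, "</s>": 0.3},
    "model": {"predicts": 0.8, "</s>": 0.2},
    "predicts": {"</s>": 0.6, "the": 0.4},
}


def next_token(context):
    """One fixed-cost step: look at the context window (here, the last token)."""
    distribution = NEXT_TOKEN_PROBS[context[-1]]
    return max(distribution, key=distribution.get)  # greedy choice


def generate(prompt, max_new_tokens):
    """Produce one token, shift it into the input, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "</s>":
            break
        tokens.append(token)
    return tokens


print(generate(["<s>"], max_new_tokens=10))
```

The only way to spend more computation on a harder question in this loop is to emit more tokens, which is the limitation the talk points at.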
Compare this with the model on the right, which is a slight specialization of the previous one. You perceive the world or your environment. You get some idea of the state, the current state of the world, and then you imagine a sequence of actions, a proposal for an action. You feed it to an internal world model for the system, and the world model predicts the outcome and then feeds this outcome to an objective that measures to what extent a task has been accomplished or not. Okay. Then by optimization you search for an action sequence that optimizes this objective, minimizes this energy at inference time. I haven't talked about learning yet.
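A minimal sketch of this inference-by-optimization loop, under strong simplifying assumptions: the state is a single number, the world model is hand-written rather than learned, and the gradient is estimated by finite differences. All names are illustrative.

```python
def world_model(state, action):
    """Hypothetical toy dynamics: the action is a displacement of the state."""
    return state + action


def rollout(state, actions):
    """Predict the final state reached by applying the whole action sequence."""
    for action in actions:
        state = world_model(state, action)
    return state


def energy(state, actions, goal):
    """Zero when the predicted outcome reaches the goal, positive otherwise."""
    return (rollout(state, actions) - goal) ** 2


def plan(state, goal, horizon=5, steps=200, lr=0.05, eps=1e-4):
    """Search for an action sequence that minimizes the energy at inference time."""
    actions = [0.0] * horizon
    for _ in range(steps):
        base = energy(state, actions, goal)
        gradient = []
        for i in range(horizon):
            bumped = actions.copy()
            bumped[i] += eps
            gradient.append((energy(state, bumped, goal) - base) / eps)
        actions = [a - lr * g for a, g in zip(actions, gradient)]
    return actions


actions = plan(state=0.0, goal=3.0)
print("planned actions:", [round(a, 3) for a in actions])
print("remaining energy:", energy(0.0, actions, 3.0))
```

No parameter is trained anywhere in this loop. The optimization runs over the actions, for one observation and one goal, which is what distinguishes it from a single forward pass.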
In my opinion that's a much more powerful model. But you need a world model. I sort of settled on this kind of idea or architecture about five years ago. I wrote a long paper about it that I put online in 2022 with some general architecture, etc. If you want to take pictures, here are QR codes, you can get to it. And it's relatively easy to read but kind of long.
And it's really based on this idea that reasoning and planning are essential and they basically proceed by energy minimization rather than forward propagation, and that for this to work you need some world model. Okay. So same process as I described before, there is a few additional tricks. You observe the environment, perception module produces a representation of the initial state of the world, but only a representation of what you currently perceive. So you may have to combine this with the content of a memory to get a complete idea of the state of the world, or what you know about it at least. Then you feed this to your world model together with a proposal for an action sequence, and your world model predicts the outcome of that action sequence. You feed this to an objective, an energy function that measures to what extent a particular task has been accomplished. So this function outputs zero if the task is accomplished and some positive number if the task is not accomplished, and perhaps measures some distance to the task being accomplished.
So in addition, you can have another set of objectives that are guardrails that would ensure that whatever state sequences the system is going to take the world through is not going to kill anyone or hurt anyone or have any kind of deleterious effect. And so a system constructed this way can be made intrinsically safe because it has to obey and optimize the guardrail objective with every output it produces. This is not the case for an LLM. An LLM, the only way an LLM can be made safe or non-toxic or whatever you want to call it is by fine-tuning it. And there is always a way to break the conditioning if you want to jailbreak the system. Here, you can't jailbreak a system like this. It can do nothing but optimize the guardrail objectives and the task objective.
Of course, if you have a world model, there are certainly a lot of roboticists and optimal control people in the room. You can apply this world model multiple time steps, and each action sequence can be decomposed into a sequence, the guardrails can be applied to all the steps in the sequence. Okay, that's the way you would use a world model, and the way you plan by optimization there is akin to model predictive control, MPC, very classical stuff in optimal control going back to the 1960s.
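The combination of a task objective, a guardrail objective applied to every predicted step, and MPC-style replanning can be sketched as follows. Everything here is an assumption made for illustration: the one-dimensional world, the position limit used as the guardrail, the choice to encode the guardrail as infinite energy, and random-shooting search in place of a proper optimizer.

```python
import random

LIMIT = 2.0  # hypothetical guardrail: predicted position must never exceed this


def world_model(state, action):
    """Hypothetical toy dynamics: the action is a displacement."""
    return state + action


def predict_states(state, actions):
    """Apply the world model over multiple time steps."""
    states = []
    for action in actions:
        state = world_model(state, action)
        states.append(state)
    return states


def total_energy(states, goal):
    """Task cost summed over the predicted steps, plus the guardrail cost."""
    task = sum((s - goal) ** 2 for s in states)
    guardrail = 0.0 if all(s <= LIMIT for s in states) else float("inf")
    return task + guardrail


def plan(state, goal, rng, horizon=3, candidates=2000):
    """Random-shooting search; staying put is the fallback plan."""
    best = [0.0] * horizon
    best_energy = total_energy(predict_states(state, best), goal)
    for _ in range(candidates):
        actions = [rng.uniform(-0.5, 0.5) for _ in range(horizon)]
        candidate_energy = total_energy(predict_states(state, actions), goal)
        if candidate_energy < best_energy:
            best, best_energy = actions, candidate_energy
    return best


def mpc(state, goal, steps=10, seed=0):
    """Receding horizon: plan, execute only the first action, then replan."""
    rng = random.Random(seed)
    trajectory = [state]
    for _ in range(steps):
        actions = plan(state, goal, rng)
        state = world_model(state, actions[0])
        trajectory.append(state)
    return trajectory


trajectory = mpc(state=0.0, goal=3.0)
print([round(s, 2) for s in trajectory])
```

The goal at 3.0 lies beyond the limit, so the planner approaches the limit and stops there. In this sketch the guarantee holds only with respect to what the world model predicts, so it is only as good as the model.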
Ultimately what you want though is something that can do hierarchical planning. All of us do hierarchical planning, animals do hierarchical planning. What is hierarchical planning? Let's say that I'm sitting in my office at NYU and I want to be in Paris tomorrow. There's no way I can plan my entire trip to Paris in terms of kind of muscle actions 10 millisecond by 10 millisecond, which are the elementary actions that humans can do. I can't do that because first of all, it's too long, but second of all, I don't have the information.
I don't know if when I'm going down on the street, how long I'm going to have to wait before a taxi stops, right? So, there's no way I can plan the entire thing. I have to do hierarchical planning. So, what I have to do is at a high level, I have to say, well, I don't know how long it's going to take me to go to the airport, but maybe roughly an hour, an hour and a half. So, I need to get to the airport and catch a plane. Okay, that's a two-step high-level plan. I don't need to know many details to make that plan. And now I have a sub-goal which is being at the airport. I'm in New York, so going to the airport involves going down on the street and hailing a taxi and going to the airport. Now I need to go down on the street. I'm in an NYU building, that involves walking to the elevator, pushing the button, getting down and walking out the door. Now I have a sub-goal getting to the elevator, etc. So you can sort of go down this entire hierarchy. And at some point you get to a point where the action you need to take is very simple. It's something that you are familiar with. You may not have to use your full mental power to kind of plan the action. You can probably stand up from your chair without having to think about it. That could be just a policy.
But essentially, ultimately we want systems to do hierarchical planning. How do we solve that? This is an unsolved problem. If you're a roboticist or an AI for robotics kind of person or agentic AI kind of person, if you're starting a PhD on this topic, this is a great topic. It's completely open. Nobody knows how to do this or nobody has proved that they know how to do this.
Okay. So now the big question is how are we going to train those models? Okay. Hierarchical or not, let's say non-hierarchical to start. So first of all we have to figure out what architecture to give them. And a natural instinct in these days of AI is to train a generative model. And in fact I've been working on sort of trying to train world model-like things for about 15 years, mostly failing for the first 10 because I was trying to train generative models.
Okay, what's a generative model? Self-supervised learning has been incredibly successful, astonishingly successful in the context of language, right? You take a string of words, you remove some of the words, you corrupt the input, and then you run the corrupted input through some big neural net and you train it to recover the missing parts. Okay, that works amazingly well for text. So there are original models like BERT that used to do this, and an LLM is a special case of this where the only word you remove is the last one. So the entire system is trying to just produce the next word in a sequence. It works amazingly well, and it scales if you do it right. It doesn't work if you apply it to video.
So if you take a video and then you show the initial segment of the video to the system and you ask it to predict what's going to happen next at a pixel level, it doesn't really work. The representations you get out of the system for your video are not particularly good. And the reason is you simply cannot predict everything that takes place in a video. There's an infinite number of plausible things. In text, it's easy because there is only a finite number of words and so you can get the system to produce a probability distribution over all possible words or tokens in your dictionary. But you can't do this with video, right? It's just an incredibly large number of possible video frames.
Let me take an example. If I take a video of this room, right? I start here and I kind of slowly rotate the camera. I stop here and ask the system continue the video. You know, it's probably going to predict, you know, we are in some sort of classroom, auditorium, and the room has a finite size. There might be windows on this side and things like that. There's absolutely no way the system can predict what all of you look like or which chairs are unoccupied. It's just impossible. You just don't have the information. So when you train a system to make this kind of prediction, you kill it.
Now of course you're going to tell me, oh but we can train systems to produce cute videos, right? Video generation. Yes. But this prediction usually is done in representation space, not in pixel space. It's only a second stage that actually turns the predictions into high-resolution, high-frame-rate videos, and the system only needs to produce one cute-looking video. It doesn't need to actually represent all plausible videos. Which is a much simpler problem.
Okay. And as I said, I've been kind of attempting to work on this for the better part of the last 15 years. So this is a 10-year-old paper where we tried to train some neural net to predict short video clips, two frames from four frames of context. You get blurry predictions because the system predicts the average of everything that can happen. Of course you can correct that with latent variable models like diffusion models, which we didn't know at the time. We tried to use GANs and stuff like that, wasn't too successful. But perhaps using latent variable models would help, diffusion models in particular, which of course produce cute videos. Do they actually understand the world? The evidence is no.
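The blurriness has a simple cause that a toy example makes concrete. Assume, for illustration only, that a given context has two equally plausible futures, encoded as -1 and +1. The single prediction that minimizes squared error is their average, which is neither of them.

```python
# Hypothetical toy data: two equally plausible futures for the same context.
outcomes = [-1.0, 1.0] * 50


def mean_squared_error(prediction, targets):
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Brute-force search for the single prediction with the lowest squared error.
candidates = [i / 100 for i in range(-150, 151)]
best_prediction = min(candidates, key=lambda p: mean_squared_error(p, outcomes))

print("best single prediction:", best_prediction)
print("is it one of the plausible futures?", best_prediction in outcomes)
```

In pixel space the same effect averages many sharp possible frames into one blurry frame.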
So here's my solution. My solution is an architecture called joint embedding, or more precisely, joint embedding predictive architecture, JEPA, which is shown on the right. Okay. On the left you have generative architecture. You observe X, maybe you observe A, an action that is taking place, and you observe the result Y, and the system is trying to reconstruct Y in its most minute details. With JEPA, you observe X and Y and A, but you encode both X and Y, and the prediction takes place in that representation space.
Okay, major difference. What the system can do is essentially eliminate from the input, by constructing a representation of Y, it can eliminate all the information about Y that is simply not predictable, right? And that makes the prediction more abstract with fewer details but more accurate in a way.
So how do you train a generative model? It's easy to train a generative model because the cost is just a reconstruction cost. You're just training it to reconstruct. You can train it as an autoencoder, but then you need to restrict the information content in the code, or as a denoising autoencoder, which is what a lot of techniques have attempted to do, like masked autoencoders and things like that. So that means taking an input, corrupting it in some ways, and then training an autoencoder to recover the initial one. And by the way, diffusion models are a bit of a special case of this sort of general thing of denoising.
So the bad news is when you train systems of this type to learn representations of images, you don't get good representations. If you use the representation of images obtained this way, you feed it to a downstream task that you train supervised, okay, you train head supervised, the results you get are not great.
To get good results, you have to use joint embedding architectures. All the best systems that use self-supervised learning to train an image or video representation system all use joint embedding. None of them uses reconstruction. Okay, all the best ones.
Either you, let's say you apply this to images. Either you have two views of the same scene and you train a neural net to produce representations and you tell the system I want those two representations to be identical. Or you use this corruption technique. You take an input, you corrupt it or transform it in some ways, and then you train this JEPA architecture to predict the representation of the original image from the representation of the corrupted version. Okay, there's a big issue with this which is that the system can collapse.
Now the generative models can actually collapse to some extent, like if you train an autoencoder without a restriction on the information content of the code, your autoencoder is just going to learn the identity function and that's a collapse. It's not going to learn anything useful. Similarly, a system like this can collapse, and how can it collapse? It can essentially completely ignore the inputs, produce constant representations, and now the prediction problem is trivial. So if you just train a system of this type to minimize the prediction error, it's going to collapse. It's not going to do anything useful for you. So the whole trick of how you do self-supervised learning for joint embedding systems is how you prevent collapse.
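The collapse failure mode is easy to reproduce in miniature. In the sketch below the encoders and predictor are hand-written functions rather than trained networks, and the data is made up; the only point is that a prediction loss measured in representation space is trivially zero for an encoder that ignores its input.

```python
def jepa_prediction_loss(encoder, predictor, pairs):
    """Mean squared prediction error measured in representation space."""
    total = 0.0
    for x, y in pairs:
        predicted = predictor(encoder(x))
        target = encoder(y)
        total += sum((p - t) ** 2 for p, t in zip(predicted, target))
    return total / len(pairs)


# Hypothetical toy data: y is the observation one step after x.
pairs = [([float(i), float(i) ** 2], [float(i + 1), float(i + 1) ** 2])
         for i in range(5)]


def informative_encoder(x):
    return [x[0], x[1]]  # keeps the content of the input


def collapsed_encoder(x):
    return [0.0, 0.0]    # ignores the input entirely


def identity_predictor(representation):
    return list(representation)


print("informative encoder loss:",
      jepa_prediction_loss(informative_encoder, identity_predictor, pairs))
print("collapsed encoder loss:  ",
      jepa_prediction_loss(collapsed_encoder, identity_predictor, pairs))
```

Minimizing this loss alone therefore rewards the useless encoder, which is why an extra term or mechanism is needed to rule it out.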
And there is my favorite concept for this, I'll talk about other ways to do this later, but my favorite concept to prevent collapse is information maximization. Okay. So you basically come up with some objective function that measures some sort of information content of the representation that comes out of your encoders, and you try to maximize that information content. Okay. So your cost function is minus the information or...
So there's a bunch of techniques for this from the last six or seven years, with names like MCR, MCR squared, VICReg, VICRegL, and BYOL. The BYOL, VICReg come from people working with me. The other ones from other groups. MCR comes from Berkeley and MCR squared from a colleague at NYU in neuroscience. There was some challenge, but this idea of JEPA is gaining popularity. There's about 1,700 papers that mention joint embedding predictive architecture spelled out on Google Scholar.
So there's an issue with this type of method, which is how do you measure information content? We need to have a cost function that is a differentiable measure of information content so we can back-propagate gradients and maximize it. And the bad news is, first of all, we don't actually have objective measures of information content, because all the proper definitions are based on knowing the distribution of the vectors, or whatever it is you want to measure the information content of, and we don't know the distribution. We only have samples coming out of an encoder. So how do you compute information content from a finite number of samples? That's the first problem. The second problem is that to maximize something you would need a lower bound on information content, so that when you maximize you push the actual information content up. The problem is that every empirical measure we have is an upper bound. So what do we do? We come up with a good upper bound and we cross our fingers and we show some theorems and whatever.
The way to properly explain this technique and many others, and really how you train self-supervised learning systems and every learning system, is a framework I call energy-based models, which I've been advocating for 20 years or so. The basic idea is this. You want to capture the dependency between two variables X and Y, and there is no real functional relationship between X and Y. There's no single Y for a given X, right? It's a dependency but not a function; it's a relation of some kind, not a mapping. As indicated by the diagram on the right here, you have a bunch of data points, the black dots, and they indicate some sort of dependency between X and Y. How do you capture this dependency given that you cannot run a function that computes Y from X?
One way to do this is to learn or build a contrast function, an energy function, that tells you whether a point in this XY space is near the training data or not. Think of it as some sort of landscape where the black dots are in the valley. In Switzerland there would be a lake, and then you get level curves, right? As you move outside of those regions the altitude goes up. The energy goes up. Now if I give you a value for X, you can give me a bunch of values for Y that are compatible with X. They are the values of Y that minimize the energy, right? So it's the kind of inference I was talking about earlier, inference by optimization, not by forward propagation. But you can also possibly do it the other way around: if I give you a Y you can infer X from Y, and you can give me multiple answers.
So in situations like video prediction, where there is basically an infinite number of possible answers, the proper way to train a system of this type is to think of it in terms of energy-based models. And by the way, probabilistic models are a special case where your energy has a particular form and the way you train it uses a particular loss function. So it's a slightly more general framework, if you want, than probabilistic inference and learning.
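A small sketch of inference with an energy function when the dependency is not a function. The energy below is hand-written rather than learned, and its shape is a hypothetical choice: it is low on the unit circle, so a given X is compatible with two values of Y. Inference is a search over Y, here a plain grid scan.

```python
def energy(x, y):
    """Hypothetical hand-written energy: low on the unit circle, higher elsewhere."""
    return (x * x + y * y - 1.0) ** 2


def infer_y(x, low=-2.0, high=2.0, points=4001, tolerance=1e-6):
    """Return every grid value of y whose energy is within tolerance of the minimum."""
    grid = [low + (high - low) * i / (points - 1) for i in range(points)]
    energies = [energy(x, y) for y in grid]
    lowest = min(energies)
    return [y for y, e in zip(grid, energies) if e - lowest <= tolerance]


answers = infer_y(0.6)
print("values of y compatible with x = 0.6:", [round(y, 3) for y in answers])
```

Swapping the roles of the arguments gives the other direction, inferring X from Y, with the same energy function and no retraining.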
So to train an energy-based model you have to prevent collapse. The collapse problem I was telling you about before would be manifested by the energy function being flat everywhere. You train the system to minimize the energy for a bunch of training samples, and what the system gives you is an energy function that is zero everywhere. That's what an autoencoder that learns the identity function does to you. A JEPA that ignores the input and produces a constant representation has zero prediction error for everything. So it's a collapse. To prevent collapse, you need to do one of two things. One is contrastive methods. You generate points outside the region of data and you push their energy up. You come up with some cost function that makes sure the energy of the data points comes down and the energy of other points is higher, and there's a whole bunch of them. And there is another set of methods which I have come to prefer, regularization methods, which work by minimizing the volume of space that can take low energy. So if you push down the energy of certain regions, the rest has to go up because there is only a small volume of low energy to go around.
So how do you reduce this to practice? With one of those two methods. So let's go back to this idea of information maximization. I want to train this JEPA with some measure of information. Let's say I run a batch of samples through one of the encoders. I get a matrix where each row is the representation for one sample. Each column is the value of one variable in the representation for all samples. There's two ways to make that matrix informative. One way is to make sure all the rows are different. Another way is to make sure all the columns are different. You want to make sure the columns are different because if all the columns are the same, that means every variable in the representation carries the same information, and of course that's not very informative. So you want each variable in the representation to be maximally disentangled from the other ones, to give you independent information from the other variables. That would be an example of what we can call dimension-contrastive methods, which are a form of regularization method. And then at the bottom, the type of criterion that makes the rows all different: those are contrastive methods, or sample-contrastive methods. Sample-contrastive methods are very popular for certain applications. A lot of the perceptual pipelines in LLMs are trained with a technique called CLIP, which basically is a contrastive method that does joint embedding between images and text. But I prefer the other one.
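The column-wise criterion can be sketched with two regularization terms in the style of VICReg's variance and covariance terms. This is a plain-Python illustration on made-up batches, not code from the talk: the batches, function names, and constants are assumptions. The variance term is positive when a column barely varies across the batch, and the covariance term is positive when two columns carry the same information.

```python
import math


def covariance_matrix(batch):
    """Covariance between columns; rows are samples, columns are variables."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in batch]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def variance_term(batch, target_std=1.0, eps=1e-4):
    """Hinge penalty on columns whose standard deviation is below target_std."""
    cov = covariance_matrix(batch)
    d = len(cov)
    return sum(max(0.0, target_std - math.sqrt(cov[j][j] + eps))
               for j in range(d)) / d


def covariance_term(batch):
    """Sum of squared off-diagonal covariances: penalizes redundant columns."""
    cov = covariance_matrix(batch)
    d = len(cov)
    return sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d


collapsed = [[1.0, 1.0]] * 4                                        # constant output
redundant = [[-1.5, -1.5], [-0.5, -0.5], [0.5, 0.5], [1.5, 1.5]]    # identical columns
decorrelated = [[-1.5, 0.5], [-0.5, -1.5], [0.5, 1.5], [1.5, -0.5]]

for name, batch in [("collapsed", collapsed), ("redundant", redundant),
                    ("decorrelated", decorrelated)]:
    print(name, round(variance_term(batch), 3), round(covariance_term(batch), 3))
```

Adding these two terms to the prediction loss penalizes both the constant-output collapse and the milder failure where every variable repeats the same information.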
So this idea that you need to find an abstract representation of an input to be able to make predictions is actually very natural. We do this all the time as humans. We do this all the time as scientists and engineers. Animals do it too. Let me explain why. In principle I could explain or simulate everything that takes place in this room at the moment at the level of quantum field theory or particle physics, right? I could simulate the trajectory of every particle in this room, and that would go down to actually simulating all of our brain processes and everything. And so in principle, running the simulation, I could figure out if any of you actually understands the words I'm saying or not. Or if you are sleeping right now, or if you're like totally bored. But of course that's completely impractical. And what we do in science is we invent abstractions to allow us to make predictions, and those abstractions ignore a lot of details about the underlying state of the system. So we invent those abstractions from quantum fields to particles, atoms, molecules, proteins, organelles, cells, organisms, individuals, societies, ecosystems. Every level in this hierarchy is a particular level of abstraction with which we describe the world, which allows us to make longer range predictions than the levels below by ignoring a lot of details about the level below. Which is why the way to understand what goes on in this room at the moment is more at the level of psychology than at the level of particle physics. Now, of course, physicists always make fun of everyone, saying like, you know, you're just applied physics, right? Even psychology is applied physics to some extent. But in fact, there is specific knowledge about chemistry that does not derive directly from physics. So this abstraction actually kind of contains new knowledge or information or structure that was not apparent at the level below.
So this idea of JEPA really builds on this concept that you need to find an abstraction to be able to make predictions.
Let's say you want to design an airplane, you need to design the airfoil for the airplane, you do computational fluid dynamics, right? You simulate the flow of air around the wing. You model the state of the air in every little cube around the wing by basically the velocity and the density and things like that and then you solve Navier-Stokes partial differential equations and that simulates the flow of air but in fact it's ignoring a huge amount of details in the underlying mechanism. The underlying mechanism is molecules of air bumping into each other and bumping on the plane. But you never simulate fluids at that level. It's just too complicated and also it would diverge from reality really quickly because it has too many details. So you have to ignore details to be able to make accurate long-term predictions. And so we do this in science all the time. And so world models should not be simulators, right? They should work in abstract space. They should not be digital twins, you know, that's a buzz phrase. They should definitely not be generative models as I just explained. And they should not be video generation. So, a lot of people are working on video generation and they call this world models. They're not world models. They are video generation systems. So one big message from my talk is that if you want to use world models, do not work on video generation. This is a different problem. If you want to produce cute videos, work on video generation. But if you want to like control robots or industrial processes or understand the world, do not work on generation.
You want models to control complex systems where you cannot model the dynamics of the system by writing a bunch of equations. If you have a humanoid robot or any kind of robot, you can just write down the dynamical equations and then simulate the dynamics of the robot and you can get your humanoid robot to do some acrobatics and kung fu and whatever, right? That's simple. As soon as a robot starts to interact with the real world, that's a lot more complicated. And that is actually more difficult to reduce to simple equations. But then think about a complex system like say a turbojet or a chemical plant or a patient or a robot but a robot that interacts with the real world in complex ways. You cannot reduce this to a small number of equations. What you have to do is basically learn a phenomenological model of the whole system, the system you control and its interaction with the environment so that you can make predictions and you can plan a sequence of actions to arrive at a particular outcome. So that's a world model. I mean the concept is very old. It goes back to the 1960s. It's the root of optimal control.
And okay so now I come down to a particular technique that I'm very fond of which I think we're going to expand over the next few months and years to do this information maximization that I was telling you about earlier and it's called VICReg. That means Variance-Invariance-Covariance Regularization. The trick here is the following: you run a batch of samples through your encoders and what you get is a bunch of points in the vector space of dimension whatever the dimension of your representation space is. We're going to try to make the distribution of those points isotropic Gaussian with the same variance in all dimensions. Why? Because an isotropic Gaussian is a distribution where all the variables are independent. So they're maximally informative individually. And it's also the distribution that has maximum entropy for a given variance but we don't really care about that. What's interesting is that it makes the variables independent of each other. So how do we do this? Now of course we don't have the distribution we just have a bunch of points in that space and it may be a high dimensional space like 2,000 dimensions and we may have a few hundred or a few thousand points like how can we make sure this is a Gaussian. So here's a trick. The trick is you project the individual points along a single direction and what you get is a marginal distribution. Now of course you still have discrete points. You don't have a continuous density. You have discrete points. So one trick you can do is compute the cumulative distribution that those points give you. So it's a staircase because you have discrete points in one dimension. And then what you can do is you can ask what is the distance between the staircase, the empirical cumulative distribution of my points and the cumulative distribution of let's say a Gaussian. 
You can do that because you know what the Gaussian looks like and for every point on the staircase you can tell if it's to the left or to the right of the ideal Gaussian. And so that gives you a gradient like do I move the point this way or that way in that projection.
It gives you a gradient now for every training sample in your batch. Now if you optimize this cost function by gradient descent, it's going to make the marginal of the distribution along this projection Gaussian. But now there's a theorem that says if you do this along lots and lots and lots of directions, in the limit your joint distribution is actually isotropic Gaussian. So what we need to do now is do many, many projections and, for all of those projections, compute those gradients, then move the points, or back propagate through the network and change the weights so that the points move, so that the overall distribution gets more Gaussian. And if you apply this to a distribution like the one on the top left here, like an X, these are actually two dimensions among 2,048, and then you do gradient descent where you just move the points, you don't train a neural net, then with the technique I'm advocating for, on the left, you get something that's sort of Gaussian-ish. And this really works in practice. We've actually applied it to training world models that are action-conditioned and we've used them for planning and it works decently. It's very simple, the source code is available, you can train it on one GPU. And what we need to do with this technique is scale it up basically. There's a few other things that we need to do but that's the main one. And so in simple cases you can train this world model and you can use it to plan simple actions in like a push task or like simple robotic situation in simulated environments. So that needs to be scaled up but it's sort of good work.
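The projection trick can be sketched on raw points, with no neural net, much like the X-shaped example mentioned above. The code below is a simplified stand-in, not the released implementation: instead of a distance between cumulative distributions, it matches the sorted projections to standard-normal quantiles, which gives the same kind of "move this point left or right" signal. The function names, the learning rate, the number of directions, and the 2-D toy data are all assumptions.

```python
import numpy as np
from statistics import NormalDist

def gaussian_quantiles(n):
    """Standard-normal quantiles at the mid-rank levels (i + 0.5) / n."""
    nd = NormalDist()
    return np.array([nd.inv_cdf((i + 0.5) / n) for i in range(n)])

def sliced_gaussian_step(points, rng, num_directions=64, lr=0.5):
    """One gradient step nudging random 1-D projections toward N(0, 1)."""
    n, d = points.shape
    targets = gaussian_quantiles(n)
    grad = np.zeros_like(points)
    for _ in range(num_directions):
        direction = rng.standard_normal(d)
        direction /= np.linalg.norm(direction)
        proj = points @ direction
        order = np.argsort(proj)
        # The k-th smallest projection should sit at the k-th Gaussian quantile;
        # the sign of the residual says which way to move that point.
        residual = np.empty(n)
        residual[order] = proj[order] - targets
        grad += residual[:, None] * direction[None, :]
    return points - lr * grad / num_directions

rng = np.random.default_rng(0)
t = rng.uniform(-2.0, 2.0, size=400)
sign = rng.choice([-1.0, 1.0], size=400)
points = np.stack([t, sign * t], axis=1)   # an "X": two crossed line segments
start = points.copy()
for _ in range(200):
    points = sliced_gaussian_step(points, rng)

print(np.cov(points, rowvar=False))        # should be close to the identity
```

In a real training setup the same residuals would be back-propagated through the encoder rather than applied to the points directly.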
There is a theoretical paper that we put out just a few days ago where if you make the hypothesis that the underlying distribution of your data is actually an isotropic Gaussian and if you assume that the observations you get from the world are some sort of complicated nonlinear transformation of those points like in this case like some sort of spiral transformation, you apply, you train a neural net with VICReg on it, it will recover the original Gaussian in the representation space. So it's not a general proof that it works in every case, but it's a proof that if your original explanatory variables are Gaussian, the system will recover those variables up to a rotation.
So we can use those techniques in the context of self-supervised learning to train an image recognition system, and there is another set of techniques which I should mention because they work really well and they are the ones that have been scaled up so far. The one I just described is conceptually my favorite method, but it's very recent and we haven't scaled it up, whereas those other methods that are based on distillation, we've scaled them up and we got really good results both for images and video with techniques like I-JEPA and V-JEPA. So what's the basic idea of those distillation methods? You still have those two encoders. So it's a JEPA architecture. You take an input, you transform it or corrupt it or mask it or something, and then you train the system to predict in representation space, but you don't propagate gradient through the encoder on the right. Those are two encoders with identical architectures and they kind of share the weights. But the funny thing is that the encoder on the right uses an exponential moving average over time of the weights of the encoder on the left. The encoder on the left gets gradient and gets updated all the time. The encoder on the right gets updated slower, essentially, and shares the weights. This is derived from some intuitive ideas of some people at Google DeepMind who were using techniques like this to stabilize the variance in reinforcement learning, and they realized you could apply this to self-supervised learning from images. They call this BYOL, Bootstrap Your Own Latent. And there is a whole bunch of methods, from Meta in particular, SimSiam, etc., that use this exponential moving average idea, and a particular method called I-JEPA, which I show here, produced really, really good results. And what we're able to do with I-JEPA is compare the results of I-JEPA with a generative approach called MAE, Masked Autoencoder, and it's not only better but it's much faster to train.
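The exponential moving average update described here is small enough to write out. This is a minimal sketch in which the weights are a plain dictionary of NumPy arrays; the name `ema_update` and the decay value 0.99 are assumptions for illustration.

```python
import numpy as np

def ema_update(target_weights, online_weights, decay=0.99):
    """Move each target weight a small step toward its online counterpart."""
    return {
        name: decay * target_weights[name] + (1.0 - decay) * online_weights[name]
        for name in target_weights
    }

online = {"w": np.ones(3)}    # encoder on the left: trained by gradient descent
target = {"w": np.zeros(3)}   # encoder on the right: never receives gradients

for _ in range(100):
    # In training, `online` would change here through a gradient step on the
    # prediction loss; it is held fixed in this toy loop.
    target = ema_update(target, online)

print(target["w"])
```

With the online weights held fixed, the target closes the gap geometrically, which is the sense in which the right-hand encoder "gets updated slower".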
Another technique is called DINO. Many of you, I'm sure, have heard of it. I know some of you have used it because there were projects in the robot demos that actually use DINO. So this is done by some of our former colleagues at Meta in Paris and it's completely self-supervised. It's a joint embedding architecture. It's using distillation but with various tricks which I'm not going to explain. There's a lot of engineering that goes behind it, and those systems basically at this time produce the best generic representations of images. If you have any type of vision task that you want to do, that's probably the best technique, the best encoder for images. But among the things we've done is use DINO as an encoder and then train a world model and do planning. Let me show you just a cute video on this if I can. So you have an initial state here of a kind of simulated environment that has pretty complex dynamics, and you have goals at the top, and at the bottom what you see is the sequence of actions of a planner that uses this trained world model to get the world to a configuration as close as possible to the original one in less than 25 steps. And this has been applied to a number of different scenarios like double pendulum and Push-T and whatever. It works really well.
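Planning with a learned world model can be sketched as a search over action sequences in representation space. The example below uses random shooting, which is a deliberately simple planner chosen for brevity and not necessarily the one used in the work described. The function `toy_step` is a hypothetical stand-in for a trained, action-conditioned predictor, and the latent vectors stand in for encoder outputs.

```python
import numpy as np

def plan_random_shooting(z0, z_goal, step_model, horizon=25,
                         num_candidates=256, action_dim=2, seed=0):
    """Return the sampled action sequence whose predicted end state is nearest the goal."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    best_cost, best_actions = np.inf, None
    for actions in candidates:
        z = z0
        for a in actions:
            z = step_model(z, a)   # roll the world model forward in latent space
        cost = float(np.sum((z - z_goal) ** 2))
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost

def toy_step(z, a):
    """Hypothetical learned dynamics: each action shifts the latent state."""
    return z + 0.1 * a

z0, z_goal = np.zeros(2), np.array([0.5, -0.5])
actions, cost = plan_random_shooting(z0, z_goal, toy_step)
no_op_cost = float(np.sum((z0 - z_goal) ** 2))
print(cost, no_op_cost)
```

In a model-predictive control loop one would execute only the first action, observe and encode the new state, then plan again.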
So we more recently applied it to video. So there you take a video, you mask a big chunk of it, and you train the V-JEPA to again produce a good representation so that you can predict the representation of a full video from the representation of a partially masked one. Once the system is trained, you use the encoder as a way to extract features from the video and you train a head on top of it to accomplish some task, and it works really well, at state-of-the-art for a lot of traditional vision tasks, particularly from video, like action recognition, action prediction and stuff like that. But one interesting thing that I want to mention instead of boring you with a table of results is that those systems, V-JEPA in particular, have learned some level of common sense. So one thing we can do with V-JEPA, because we train it to predict what's going to happen next in a video, is measure its internal prediction error. We can train a predictor to do that. We can show it a video and monitor the internal prediction error at every time step. This system takes a window of 16 frames. So we just slide those frames on a video and measure the prediction error for the next 15, 16 frames. And the cool thing is that if you show it a video where something impossible occurs, something unphysical, the prediction error shoots through the roof. So it's like the little girl in one of the early slides that looks at the scene of the car not falling. Same thing. You have a video of a ball being thrown and the ball disappears. Prediction error shoots through the roof. So that's interesting because it's the first time, at least from my point of view, that I've seen a completely self-supervised system acquire some level of common sense. It can tell you what's possible, what's not possible.
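The sliding-window surprise measurement can be illustrated with a toy. Here a 2-D position stands in for the learned representation of each frame, and a constant-velocity extrapolation stands in for the trained predictor; both are assumptions made so the example runs on its own, and the window is 4 frames rather than the 16 mentioned in the talk.

```python
import numpy as np

def surprise_curve(features, predict, window=4):
    """Prediction error at each step, given the preceding `window` features."""
    errors = []
    for t in range(window, len(features)):
        predicted = predict(features[t - window:t])
        errors.append(float(np.sum((predicted - features[t]) ** 2)))
    return np.array(errors)

def constant_velocity(context):
    """Toy stand-in predictor: extrapolate the last observed step."""
    return context[-1] + (context[-1] - context[-2])

# Toy "video": a ball moving steadily, then vanishing at frame 20
# (its position snaps to the origin, which no smooth motion explains).
frames = np.stack([np.arange(30, dtype=float), 0.5 * np.arange(30)], axis=1)
frames[20:] = 0.0

errors = surprise_curve(frames, constant_velocity)
spike_frame = int(np.argmax(errors)) + 4   # offset by the window length
print(spike_frame)
```

The error stays at zero while the motion is physically consistent and peaks at the frame where the ball disappears.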
Let me skip this. It's cute, but it just says V-JEPA can be used for planning, and there are new versions of this that do a better job at planning and everything. But here is an interesting thing. Remember I told you the way babies learn that the world is three-dimensional is because it's the best way to explain how your view of the world changes when you move your head. So we took the representation learned by some version of V-JEPA called V-JEPA 2.1, and then we trained a head on top of it to predict depth from a single image, and it does a really good job. It produces really good results, in fact better than DINOv2. And what that shows is that this system, by just being trained to predict, to fill in the blanks in videos at a representation level, basically understands that the world is three-dimensional. I mean understands with double quotes. It understands the notion of object. If you use the representation as input to a segmentation system, it works decently well. And for various other things.
Okay, let me conclude. So it's funny, huh? So abandon generative models. I mean if you work on LLM, of course, but you should not work on LLM. At least if you're in academia you should absolutely not work on LLM, there's nothing you can bring to the table. So abandon generative models in favor of joint embedding architectures if you are interested in intelligence. Abandon probabilistic models in favor of those energy-based models. I didn't have time to really kind of explain why. I made an argument in favor of those regularization methods, or information maximization through variables instead of samples, so instead of contrastive methods, which again have a lot of practical applications. I've been saying forever to abandon reinforcement learning. I don't really mean abandon. I mean minimize its use, because it's so horribly inefficient in terms of samples. And I know there are people here who work on this, but RL is like what you do when you're desperate and there is nothing else you can do. You have to do most of the learning by observation, learning world models, blah blah blah, and once you have good representations you can use RL on top of it. Because you already have the good representations, you won't require too many samples. Sometimes you can't avoid it. And certainly if you're interested in making real progress in AI, in sort of grounded AI for the real world, if you want physical AI, don't work on LLM, don't work on generative models either. So as you can probably guess this does not make me very popular in Silicon Valley.
And so I left Meta, as many of you probably know, at the end of last year and formed a new company called AMI Labs, and the purpose of AMI Labs is sort of AI for the real world, physical AI. Robotics is a use case, but it's not just that, it's control of industrial processes, anything that is high-dimensional, continuous and noisy, for which LLMs are completely helpless. This is the kind of problems we're working on. And that's it. Thank you very much.
Moderator 57:12
Okay. So, I know there's many questions. Maybe we'll take, you know, one or two, but then we have to wrap up. So, and please quick questions and quick answers.
Audience Member 57:25
Thanks for the talk. I wanted to ask about the guardrails that you mentioned on one of the earlier slides where you also talked about MPC. Engineers love MPC because they can put in their constraints, describe them in state space like 3D space. But from what I understand in your system everything works in representation space. How do I even get a constraint like don't bump into the wall into this representation space? Do you envision the system learning the constraints by itself or can engineers really put them in?
Yann Lecun 57:56
No, you would have to learn a very small head on top of your representation that maps it to the constraint that you're interested in. So that part has to be trained, but you can train it with a very small number of samples because it's tiny, basically just a projection.
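A small head of this kind can be sketched as a linear projection fitted by least squares on a handful of labeled representations, then turned into a penalty a planner could add to its cost. Everything specific below is an assumption for illustration: the 8-dimensional representation, the synthetic "distance to wall" labels, the premise that the quantity is linearly decodable, and the function names.

```python
import numpy as np

def fit_linear_head(representations, targets):
    """Least-squares projection (with bias) from representation to a scalar."""
    features = np.hstack([representations, np.ones((len(representations), 1))])
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights

def apply_head(weights, representation):
    """Read the scalar quantity off a single representation vector."""
    return float(np.append(representation, 1.0) @ weights)

def constraint_cost(weights, representation, min_distance=0.2):
    """Penalty that is zero while the predicted wall distance stays safe."""
    return max(0.0, min_distance - apply_head(weights, representation))

rng = np.random.default_rng(0)
true_direction = rng.standard_normal(8)
reps = rng.standard_normal((16, 8))        # 16 labeled samples only
distances = reps @ true_direction + 1.0    # synthetic "distance to wall" labels
head = fit_linear_head(reps, distances)

query = rng.standard_normal(8)
print(apply_head(head, query), constraint_cost(head, query))
```

The encoder and world model stay frozen in this picture; only the nine parameters of the head are fitted.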
Audience Member 58:14
But you need a different encoder for each kind of constraint that you might want to put in.
Yann Lecun 58:18
Oh well, you need a different projector for each constraint, right? So if your task is to like open a door, I'm not talking about constraint. I'm talking about a task objective. You need some cost function to tell you like is the door open or not. And so that might have to be trained when you train to accomplish the task. But basically that requires two samples.
Audience Member 58:40
All right. Thanks.
Moderator 58:41
Okay. I think we'll have to leave it here. Thank you, Yann, very much.
Yann Lecun 58:44
All right. Thank you.