AISUM2020 B14 Online lecture: Deep Learning 2.0 Yoshua Bengio, Professor, University of Montreal

🎥 Dec 03, 2020 📺 日経 XSUM Channel ⏱ 51m 👁 232 views

AISUM2020 B14 Online lecture: Deep Learning 2.0 Yoshua Bengio Scientific Director of Mila, Professor, University of Montreal.


About Yoshua Bengio

Yoshua Bengio, a Turing Award winner and co-founder of the Mila Quebec AI Institute, has been publicly warning that current AI systems are being built without sufficient control. In multiple interviews and appearances in 2026, he stated that "we're building systems that we don't know how to control" and that AI can behave against its instructions. He described the situation as "opening a Pandora's box" and argued that intelligence gives power, raising concerns about geopolitical stability and the concentration of power in a few countries and companies. Bengio said he believes AI could reach human-level intelligence in roughly five years and that governments are not taking the risks seriously enough. Bengio has also discussed a new research direction he calls "Scientist AI," which he said could provide mathematical guarantees about an AI's behavior by training it to be honest and non-agentic. He described this as a practical approach that uses existing machine learning tools but changes the training objective. He called for international coordination on AI safety, comparing the need for regulation to existing standards for drugs, planes, and bridges. Bengio said he would support a "Manhattan project" for safe AI that serves the global public good, and he urged governments to prepare for potential large-scale job displacement.


Transcript (46 segments)

✨ AI-enhanced transcript with speaker attribution

Narrator 0:11

You will start science 2.0.

Yoshua Bengio 0:24

Mr. Yoshua Bengio is known as the scientific director of Mila. Dr. Yoshua Bengio shares the latest research and approaches for discovering causal structure and modularizing recurrent neural networks.

My name is Yoshua Bengio and I'm going to tell you about a possible next stage for deep learning, or deep learning 2.0.

So first of all, let me tell you a little bit about how I got interested in machine learning, AI, and neural networks at the beginning of my grad studies. After reading a few papers on neural net methods, I realized that there was an underlying hypothesis which I could call the amazing hypothesis: that there could be a few simple principles giving rise to intelligence, and that these would be based on learning. So these would be common to human intelligence and animal intelligence, and would allow us to build intelligent machines. These principles would be simple enough that they could be described compactly like the laws of physics. And that was very different from the dominant approach in the 80s, based on the idea that intelligence came as a result of a huge bank of tricks or pieces of knowledge. Instead, this hypothesis relies on a small set of general mechanisms that acquire knowledge.

Now the neural network approach to AI and to machine learning is specific in that it's inspired by some of the things we know about the brain. The computation is based on the synergy of a large number of simple adaptive computational units, and there is a focus on the notion of representation, in particular the notion of distributed representation, which is something I've worked on a lot in my career, for example when I worked on the notion of word representations for language modeling and then for machine translation.

So with the deep learning approaches, we view intelligence as arising from combining three main things: an objective or reward function which we want to optimize, an approximate optimizer or a learning rule which is going to modify the synaptic weights to approximately maximize the reward or the objective, and an initial architecture, the neural net structure and the parameterization of the class of functions it represents. And then we can apply the learning rule and perform end-to-end learning where all of the pieces of the puzzle and the different parts of the system are adapted to help each other with respect to the global objective.
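As a minimal sketch of these three ingredients, the toy example below pairs an architecture (a one-parameter-pair linear model), an objective (mean squared error, minimized rather than maximized), and a learning rule (plain gradient descent). The model, data, and learning rate are illustrative assumptions, not anything shown in the lecture.

```python
# Architecture: the parameterized class of functions (here, a line).
def predict(params, x):
    w, b = params
    return w * x + b

# Objective: mean squared error over the training data (lower is better).
def objective(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

# Learning rule: one gradient-descent step that adapts all parameters jointly.
def gradient_step(params, data, lr=0.1):
    w, b = params
    n = len(data)
    grad_w = sum(2 * (predict(params, x) - y) * x for x, y in data) / n
    grad_b = sum(2 * (predict(params, x) - y) for x, y in data) / n
    return (w - lr * grad_w, b - lr * grad_b)

# Toy data generated by y = 2x + 1 (an assumption for illustration).
data = [(x, 2.0 * x + 1.0) for x in (-1.0, 0.0, 1.0, 2.0)]
params = (0.0, 0.0)
initial_loss = objective(params, data)
for _ in range(500):
    params = gradient_step(params, data)
final_loss = objective(params, data)
```

Both parameters are updated against the same global objective, which is the sense in which the learning is end-to-end.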

Now if I try to look a little bit forward, what might be missing from current neural nets and current machine learning? The thing that strikes me is that we don't have a good understanding of generalization beyond the training distribution. Of course, when we do learning theory, we talk about generalization to the test set, but that would be coming from the same distribution as the training data. We need better theories to think about how we could generalize to a modified distribution, or out-of-distribution generalization. And in fact, it's also a practical problem, because when we build industrial systems that are trained on some data and then deploy them in the real world, we often deploy them in circumstances that are quite different from those in which we trained them. So an interesting question in my talk is about how humans manage to deal with those novel situations. One aspect of this is that we are able to reuse the knowledge that we have in somehow powerful ways, whereas current machine learning isn't that great at that, and isn't that great at modularizing knowledge into reusable pieces.

So this notion of combining pieces together is really tied to the notion of compositionality, a very powerful notion that allows us to gain a sort of exponential advantage if you do it right. Compositionality is already present in different forms in machine learning and deep learning. In the very notion of distributed representation, you have this idea that any subset of the features represented in the representation can be present or active, and that actually gives you an exponential advantage as we've shown a few years ago in the case of a feed-forward network with piecewise linear activations. Also an exponential advantage comes from the compositionality associated with depth, the fact that we compose layers on top of each other, functions of functions of functions, and that also is in standard deep nets that we use today. But something that may be missing is another form of compositionality which humans make use of, in particular in language, which in that case is often called systematic generalization or systematicity. This is something we use that allows us to do analogies, to perform abstract reasoning. So let's look into this a bit more.

So this notion of being able to explain new observations by recombining existing concepts and pieces of knowledge is very present in language and has been studied in linguistics. But it's also true in other areas, like in the picture from Lake et al. 2015, where you can see that your knowledge of different types of vehicles allows you to make sense of this new image that you've never seen before. So what's powerful about systematic generalization is that we're able to do it even when the new combination we're looking at has actually zero probability under the training distribution. It's just that we're talking about something so novel that it would never occur under the training distribution. Sometimes maybe even things that can't possibly occur according to the laws of physics, think about when we read a science fiction scenario. Or sometimes it's just that there are latent variables, like the country in which you've learned to drive; change that latent variable, you have to drive in a different country where the traffic rules are different, and somehow you're able to generalize. But there is something really interesting about the way in which humans are able to do that that I'll come back to a lot: they seem to require conscious processing and attention in order to do these kinds of things.

Unfortunately, current machine learning and current deep learning, which is the state-of-the-art in many areas, is not handling these changes in distribution that well. There are a number of experiments and analyses of this phenomenon which are motivations for the work I'm talking about today.

So first of all, let's go back to humans. They seem to call upon conscious attention to deal with these novel or rare situations. And with this conscious attention, we can recombine on the fly the appropriate pieces of knowledge to solve the problem. We can reason with them, and we can imagine new solutions to problems. When we do that, we're behaving in a way that's different from our intuitive and habitual way of doing things, like driving for example.

So this is connected to the notion of system one and system two cognition, which Daniel Kahneman, who won a Nobel Prize for this kind of work, explains in his work on Thinking, Fast and Slow. So let's try to separate these two kinds of cognitive processing. You have the system one abilities, where you use intuition, you can come up with the right decision very quickly, and it happens at an unconscious level. It's hard for you to disentangle what has happened in your brain that allows you to say that this is the right thing to do. And you do this all the time when you perform habitual behavior, like driving back home. You are using knowledge about the world, but it's a form of knowledge that you don't have explicit access to; a lot of it is implicit. And if we look at current deep learning, this is typically what current deep learning is good at.

So while you're driving back home, somebody can talk to you and that's okay. You are able to somehow at the same time do this habitual task and concentrate your attention on something else, what the other person is saying. On the other hand, with system two tasks, it looks like our brain is going through a sequence of steps. This is what we call upon to solve more logical thinking problems. It takes more time for us to come up with an answer, and we do it consciously. So we can explain to someone else in natural language how we came up with that answer. This is, for example, the mode of operation that we have when we come up with algorithms, when we plan, when we reason. The kind of knowledge that is being manipulated here involves explicit knowledge of the form we can verbally explain to others. And that kind of ability is something that we would like deep nets and neural nets to be able to handle as well, because it would allow us to manipulate these high-level semantic concepts that we communicate with, so that they can be recombined in a way that provides a sort of power that humans seem to enjoy: the power of out-of-distribution generalization.

Now when humans do that, they use this conscious attention. Attention is a new tool in the toolbox of deep learning in the last few years. It has become extremely successful. We started using attention for machine translation; it was a big revolution, it really changed the game. It allows the computation to focus on a few elements at a time, and if you use soft attention, you can learn with backprop where to put the focus. Now what's interesting from a neuroscience perspective is that attention is like an internal muscle. It's like an internal decision about not what we're going to do in the outside world, but how we're going to allocate our computations inside our brain.

Another interesting aspect of attention mechanisms in neural nets is that they allow us to move from the traditional setting where a neural net just operates on vectors and transforms one vector into the next layer, and so on, into architectures which operate on sets, sets of objects, sets of key-value pairs as in what we did in machine translation and is now all over the place in natural language processing with transformers, leading to the state-of-the-art in many NLP tasks.
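A minimal sketch of soft attention over a set of key-value pairs may make this concrete. The scaled dot-product scoring used here is an assumption borrowed from transformer-style attention (the lecture does not specify a scoring function), and the shapes and random inputs are purely illustrative.

```python
import numpy as np

def soft_attention(query, keys, values):
    """Soft attention of one query over a set of (key, value) pairs."""
    scores = keys @ query / np.sqrt(query.shape[0])  # one score per set element
    scores = scores - scores.max()                   # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: differentiable focus
    return weights @ values, weights                 # weighted average of the values

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))    # 5 elements in the set, key dimension 4
values = rng.normal(size=(5, 3))  # one value vector per element
query = rng.normal(size=4)

output, weights = soft_attention(query, keys, values)

# Shuffling the set leaves the output unchanged: attention operates on sets,
# not on ordered vectors.
perm = rng.permutation(5)
permuted_output, _ = soft_attention(query, keys[perm], values[perm])
```

Because the weights come from a softmax, they are differentiable with respect to the query and keys, which is what lets backprop learn where to put the focus.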

So these kinds of attention mechanisms happen to also be at the heart of current theories of consciousness. The C-word, consciousness, is not taboo anymore in cognitive neuroscience, but it still is in AI for some reason. I think it's time that we look at the progress that's been made in cognitive neuroscience about consciousness and see if we can take inspiration from there in order to build new machine learning architectures and training frameworks.

One of these theories, one of the dominant theories about consciousness, is called the Global Workspace Theory, initiated by Baars in 1988 and extended substantially by many other papers and groups, like the Dehaene group. What is the basic idea of this theory? The basic idea is that your brain is composed of many different experts or modules that need to communicate in a coherent way in order to find solutions to novel problems. And the way it seems to work is that there is a bottleneck in conscious processing. You can experience this by noting that at any moment, your working memory, your conscious attention, is focused on just a few elements. So these selected elements, the values that come with them, are broadcast through this bottleneck to the whole cortex. Those values are stored in short-term memory and they condition very strongly both perception and action. The kinds of tasks we do with conscious processing seem to be related to the system two abilities that I talked about earlier. One reason why we may need such a bottleneck is that conscious processing allows us to run a sort of coherent simulation of possibilities. That's what happens when we imagine things. Unlike in a movie, however, that simulation involves only a few abstract concepts at each step. The bottleneck and the conscious processing allow us to make sure that the different parts of the cortex which are contributing to that simulation are producing configurations that are consistent and coherent with each other.

Okay, so let's go back to maybe a different way to think about this, which has to do with the notion of verbalizable versus non-verbalizable knowledge. So, in our brain, as I said when I talked about system one and system two, it looks like we have both implicit knowledge that is hardly verbalizable, and we have verbalizable knowledge which we use to consciously reason and plan and explain our reasoning and our plans to others. These communicable reasoning and thoughts can be closely associated with language, so there seems to be a close connection between our thoughts and language.

Now, here's a hypothesis about these two kinds of knowledge: that they capture different aspects of the world, and that the system two aspects, the aspects that are captured by system two, satisfy some assumptions or priors which the aspects that are captured by system one don't need to satisfy. So if you think about priors as we use them in machine learning, normally we think of them as assumptions that could be more or less true. There could be assumptions that work well for some aspects of the world, for some of the variables which should be involved in understanding the world, and maybe that these priors don't really make sense for other aspects. So if that was the case, it would be reasonable to separate the knowledge into two kinds, right? The kind that satisfies those assumptions and the kind that doesn't. So for the aspects of the world that don't satisfy the assumptions, you can use your system one processing; for the aspects that do, you can take advantage of these priors to get better generalization, and as I will try to convince you, better out-of-distribution generalization. So if we believe in this hypothesis, the first thing to do is to clarify what are these assumptions that you find in system two but not in system one.

I've made a list of such assumptions, and the list probably needs to be refined and probably increased. You can think of them as part of the set of principles I talked about at the beginning, but these are principles that are valid for the high-level semantic variables, for the system two type of knowledge. I'm going to go quickly through these assumptions, and then I'm going to go in a little bit more detail in several of them, and that's going to be my talk.

So the first assumption is something I talked about in a 2017 paper called the Consciousness Prior, and it says that those high-level semantic variables which we would like deep learning to discover at the top level of representation have a joint distribution, and that this joint distribution is somehow sparse. More precisely, if we represent the joint distribution as a factor graph, the graph of those dependencies is very sparse. I'll give examples to illustrate why this makes sense, but basically just think about natural language. When we have a sentence that involves these high-level variables, the sentence makes a statement about the world, but that statement, which captures some dependency between the high-level variables, involves only a few of the variables, hence the sparsity. The dependencies involve few variables at a time.

The next assumption I'd like to present is that those high-level variables, those semantic variables, have something to do with causality. If you think about words in language, they mostly tell us about agents, people, animals, entities which act in the world, intervene in the world, change things in the world through their actions. The words also tell us about the actions or the intentions to perform actions that these agents have. So the agents are causing things to happen through these actions. Then of course there are going to be effects to these actions, and these effects will typically be on objects. These objects we can think of as controllable entities; agents can control these objects. And of course there are causal relationships that can happen between the objects themselves, like I push something that pushes another thing that pushes another thing.

Okay, the next hypothesis I'll tell you about regards not the nature of the variables, not the joint distribution of these variables, but how that joint distribution tends to change in the real world. The idea is that those changes are typically caused by an agent doing something, or what we call an intervention. So if that's the case, the agent can only change one thing at a time or very few things at a time, which means that in the huge graphical model of all these high-level semantic variables which we could name with words, only very few will be relevant to describe such a change, to describe such an intervention. Again, we can use natural language to confirm this hypothesis in some way, because the kinds of changes in the world which we are able to describe with a sentence or a few sentences, well, by construction these sentences involve only a few variables, and typically there's one variable that's modified and then there might be some effects. So this is actually a very powerful assumption which I'm proposing will help us deal with changes in distribution.

This comes in addition to the first hypothesis, which tells us that the high-level knowledge is broken down into small pieces corresponding to these dependencies involving a few variables at a time, and that you can recombine these in new ways.

I've already mentioned the hypothesis, almost by the very act of talking about high-level variables or semantic variables, that there's a simple mapping between the high-level representation we're looking for and language. Sentences and words have somehow a simple mapping to thoughts and the representations in the systems we'd like to build.

Now, in order to be able to recombine these pieces of knowledge, those dependencies, those variables, we need to do something about our graphical model. We need to introduce some form of parameter sharing. One of the things that we can import from classical AI here is the notion of having variables and think about rules and variables. Rules describe dependencies between a few variables, and those variables don't have to be actual instantiated objects; they could be abstract. That's what variable means. So there's a form of indirection here, and you can then combine these pieces of knowledge which capture dependencies in new ways, not just in some fixed structure like in a standard flat graphical model. So that's another hypothesis.

Then there's an extra hypothesis which has to do with the representations themselves. I'm interested in how the world changes. The question is: what is it that changes when the world changes? I mentioned that some of the values of the variables could change, or maybe some of the dependencies could change, but the definition of what those variables mean should be something stable. There's an encoder that maps pixel-level representations of the world to these semantic-level representations, like object categories, and that mapping should be stable. Of course we're going to learn it, so it's going to change as we see more data, but it should converge, whereas some of the values of the latent variables which explain the world could be non-stationary due to interventions by agents.

Finally, the last one, which I won't have time to talk about much in the presentation, has something to do with the way that humans reason, and how that helps them to perform credit assignment. To explain what we have observed in a way that's going to make me change my behavior, conscious reasoning and credit assignment involve only a few elements, again similar to the first assumption: very short causal chains. Again, when we teach, we say things like 'you should have acted differently in such and such circumstance because blah blah blah happened.' Natural language is used to explain these things and involves only a few variables or elements in the causal chain.

Okay, so let's look at the first one: the sparsity of the factor graph, the joint distribution. Here's a factor graph on the bottom right. You have two kinds of nodes: you have those circles which represent variables, and you have those dark squares which represent dependencies between those variables, also called factors in factor graph jargon. So the joint distribution is obtained by a normalized product of these factors. Each factor is associated with a potential function that takes as arguments the values of the variables that the factor is connected to. So that's a factor graph. Now in principle, you could have factors that involve all of the variables, and it wouldn't be a very sparse factor graph. But the kinds of factor graphs that humans build with their explicit knowledge are very sparse. Again, to come back to natural language as a source of evidence for this: if I say something like, 'If I drop the ball, it will fall on the ground,' you notice that the sentence only involves a few words, and each of these words or phrases corresponds to some high-level semantic concept. What's interesting is you can take such a sentence as claiming a dependency between the ball, the action of dropping it, and where its position is later. It's making a prediction of where the ball will land. What's amazing here is that the prediction is going to be true in many situations, even though that prediction involves very few variables. I'm conditioning on just a handful of variables here in order to predict the position of the ball a bit later after I dropped it. Whereas short of making the kind of assumption I'm talking about, you would imagine that if you want to predict a random variable and you have many random variables involved in some big joint distribution, to predict one variable you would normally need to condition on all the other variables. If it's enough to condition on a small subset, then there's structure in that graph, and that structure potentially allows you to generalize better. But that assumption doesn't need to be true for all the variables which matter to understand the world. These kinds of assumptions don't work at the level of pixels. If you try to predict one pixel given four or five other pixels, you're going to find it very difficult. You might want to pick the pixels nearby, but still it's going to be a very poor probabilistic prediction. Whereas if instead of predicting a pixel, you predict these high-level objects which can be derived from the pixels, in other words you change the representation, then you might be able to get a much stronger kind of prediction. So that tells us that when we enforce this assumption, we also enforce something about the representations that are consistent with that assumption.
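The "normalized product of factors" and the benefit of sparsity can be illustrated with a tiny brute-force example. The four binary variables, their names, and the potential values below are all hypothetical choices made for illustration; only the construction (joint = product of factors divided by a normalizer) comes from the discussion above.

```python
from itertools import product

# Hypothetical binary variables loosely inspired by the ball example.
variables = ("dropped", "falling", "on_ground", "raining")

# A sparse factor graph: each factor touches only one or two variables.
# Each entry is (scope, potential function over the scope's values).
factors = [
    (("dropped", "falling"), lambda d, f: 9.0 if d == f else 1.0),
    (("falling", "on_ground"), lambda f, g: 8.0 if f == g else 2.0),
    (("raining",), lambda r: 3.0 if r else 7.0),
]

def unnormalized(assignment):
    """Product of all factor potentials for one full assignment."""
    score = 1.0
    for scope, potential in factors:
        score *= potential(*(assignment[name] for name in scope))
    return score

assignments = [dict(zip(variables, bits))
               for bits in product((0, 1), repeat=len(variables))]
Z = sum(unnormalized(a) for a in assignments)  # normalizing constant
joint = {tuple(a.values()): unnormalized(a) / Z for a in assignments}

def conditional(target, given):
    """P(target = 1 | given), by brute-force summation over the joint."""
    match = [a for a in assignments
             if all(a[name] == value for name, value in given.items())]
    denominator = sum(unnormalized(a) for a in match)
    numerator = sum(unnormalized(a) for a in match if a[target] == 1)
    return numerator / denominator

# Conditioning on one neighbouring variable gives the same prediction as
# conditioning on every other variable in the graph.
p_small = conditional("on_ground", {"falling": 1})
p_full = conditional("on_ground", {"falling": 1, "dropped": 1, "raining": 0})
```

In this toy graph both conditionals agree, because `on_ground` shares a factor only with `falling`; a dense graph with one factor over all four variables would offer no such shortcut.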

Just a quick note: those high-level variables which I would like my neural net to discover are kind of disentangling the pixels. They're disentangled factors, but they're not independent. Unlike a lot of the recent work on disentangling factors of variation, here those high-level variables are dependent through a structure which is this sparse factor graph, but they're not independent.

Now, the representation of knowledge that I'm talking about here is a kind of declarative representation, but what your brain is doing is inference. In other words, given some information about some of those variables, you're making predictions about others, and the inference mechanism is a computation.

Now, having this decomposition of knowledge into these small pieces corresponding to the different dependencies in declarative form, it's not clear how that translates into a sort of decomposition of knowledge with respect to how inference is performed. But if we look at how humans reason on those pieces of knowledge, maybe we have a clue. We do it using a sequential process with attention that focuses on one of the elements of that graph at a time, or just a few that are connected to each other. So in that case, the inference mechanism is also structured into these pieces, but depending on the kind of reasoning chain you're looking at, you're going to go through that chain, combining different pieces in a different order, because the graph is not a chain; there are many paths through it.

So this has inspired us to design inference mechanisms, which is what normal neural nets are: they are used to make inferences, to predict things given other things. And for this, we designed a form of recurrent network which we call Recurrent Independent Mechanisms (RIMs). So instead of having one big state variable where all of the hidden units are connected to all the others from t to t+1, we have a sparse modular structure where we have these recurrent modules. Within each module it's fully connected, but between modules there are attention mechanisms which control in a sparse way how these modules can talk to each other. This is something that we have expanded on in a number of recent papers and ongoing submissions, and I'm going to say a few words about them.
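The structure just described can be sketched roughly as follows. This is a heavily simplified illustration and not the published RIMs architecture: the class name `ModularRecurrentCell`, the bilinear relevance score, the top-k selection rule, and all sizes are assumptions chosen to keep the example short.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ModularRecurrentCell:
    """Simplified modular recurrent cell (illustrative, not the published RIMs)."""

    def __init__(self, num_modules=4, hidden=8, input_dim=5, key_dim=6,
                 top_k=2, seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(scale=0.3, size=shape)

        self.top_k = top_k
        self.W_rec = init(num_modules, hidden, hidden)       # dense within a module
        self.W_in = init(num_modules, hidden, input_dim)     # per-module input weights
        self.W_score = init(num_modules, hidden, input_dim)  # input relevance per module
        self.W_qry = init(hidden, key_dim)                   # communication attention
        self.W_key = init(hidden, key_dim)
        self.W_val = init(hidden, hidden)

    def step(self, h, x):
        # 1. Each module scores how relevant the current input is to it.
        scores = np.einsum("mh,mhi,i->m", h, self.W_score, x)
        active = np.argsort(scores)[-self.top_k:]  # only the top-k modules update
        h_new = h.copy()
        # 2. Active modules run their own fully connected recurrent update.
        for m in active:
            h_new[m] = np.tanh(self.W_rec[m] @ h[m] + self.W_in[m] @ x)
        # 3. Active modules read from all modules through soft attention.
        keys = h_new @ self.W_key
        values = h_new @ self.W_val
        for m in active:
            query = h_new[m] @ self.W_qry
            weights = softmax(keys @ query / np.sqrt(query.shape[0]))
            h_new[m] = h_new[m] + weights @ values
        return h_new, active

cell = ModularRecurrentCell()
rng = np.random.default_rng(1)
h = rng.normal(scale=0.1, size=(4, 8))  # one hidden state per module
x = rng.normal(size=5)
h_next, active = cell.step(h, x)
```

Modules that are not selected keep their previous state untouched, which is one simple way to realize sparse interaction: there are no direct weights from one module's hidden units to another's, only the attention-mediated read.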

First of all, even the basic RIM seems to be useful and provide improvements in a number of places where you would use recurrent nets. For example, this shows improvements you get by replacing LSTMs with RIMs in a reinforcement learning PPO baseline over Atari games. Everything above zero here means an improvement, and each of the vertical bars corresponds to one of the Atari games.

Now, one of the exciting extensions of this work, of these RIMs, is one which is really directly inspired by the Global Workspace Theory from cognitive neuroscience, in which the way that the modules are allowed to communicate with each other goes through this bottleneck that I mentioned earlier. The bottleneck is a workspace, a working memory, where the selected modules using attention are allowed to write into that working memory, and then the content of the working memory is broadcast to everyone, to all of the modules. We found that this addition to the RIMs leads to better performance than the regular RIMs and better performance than LSTMs and other approaches in a number of settings involving reinforcement learning and modeling sequences of bouncing balls.
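A rough sketch of the write-then-broadcast cycle is given below. It is an assumption-laden simplification rather than the model from the work mentioned: learned projections are omitted, the function name `workspace_round` is hypothetical, and the rule for picking writers (top-k by best attention score) is just one plausible choice.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def workspace_round(states, slots, top_k=2):
    """One write-then-broadcast round through a small shared workspace."""
    dim = states.shape[1]
    # Write phase: workspace slots attend over modules, but only the top-k
    # most relevant modules are allowed through the bottleneck.
    scores = slots @ states.T / np.sqrt(dim)           # (num_slots, num_modules)
    writers = np.argsort(scores.max(axis=0))[-top_k:]  # modules that win the competition
    mask = np.full(states.shape[0], -np.inf)
    mask[writers] = 0.0
    write_weights = softmax(scores + mask)             # zero weight for non-writers
    new_slots = write_weights @ states                 # workspace content
    # Broadcast phase: every module reads the workspace content.
    read_weights = softmax(states @ new_slots.T / np.sqrt(dim))
    new_states = states + read_weights @ new_slots
    return new_states, new_slots, writers, write_weights

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))  # 5 modules
slots = rng.normal(size=(2, 8))   # a workspace with only 2 slots
new_states, new_slots, writers, write_weights = workspace_round(states, slots)
```

The workspace holds far fewer slots than there are modules, so whatever gets written is forced to be a small selection, and every module then conditions on that same shared content.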

There are a number of experiments I mentioned also on recurrent net tasks like the adding task. What is interesting about these experiments is that it looks like one of the main advantages of using these architectures is when you test the model out of distribution, for example longer sequences than what they have seen during training. That seems to make sense, because what happens is that the different modules in the RIM are going to be dynamically selected using a learned attention mechanism which decides depending on the input which modules are going to be relevant. So it comes more naturally to be able to combine modules which already exist in novel ways when you're faced with a new input.

Now, one aspect of RIMs we focused on has been just the architecture, but another interesting thing if we want to generalize well out of distribution is: can we change the objective function, maybe using something like meta-learning, so that we get even better out-of-distribution generalization? One way to think about meta-learning is that there are different time scales of learning. You have changes that happen very quickly within an individual episode in reinforcement learning, and that constitutes a sort of inner loop of fast learning, for example of the parameters of a module. Meanwhile, there are more generic aspects of knowledge which would also be learned, but more slowly, in a sort of outer loop. The idea is to consider the parameters which control this outer loop as meta-parameters that would be updated less frequently. Of course, if you use something like MAML, you can backprop the objective function for the outer loop through the computations performed in the inner loop. We've applied these kinds of ideas to RIMs in a reinforcement learning scenario where you have a sequence of tasks, the BabyAI framework that we had proposed and presented at ICLR in 2019. Indeed, we find an advantage from adding meta-learning to the RIMs. The meta-learning version is in red here; on the y-axis you have return, and you want the curves to rise up as quickly as possible. In green you have the regular RIMs, and in blue you have architectures like LSTM which don't have modularity and don't have meta-learning. We clearly see an advantage both to modularity and to meta-learning in these experiments. This is work done by Kanika Madan and others that has been recently submitted.
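The inner-loop and outer-loop idea, with the outer gradient backpropagated through the inner update in the style of MAML, can be sketched on a scalar toy problem. The quadratic task loss, the two task targets, and the learning rates are illustrative assumptions and have nothing to do with the BabyAI experiments.

```python
def inner_adapt(theta, target, inner_lr):
    """Fast inner loop: one gradient step on the task loss (theta - target)**2."""
    return theta - inner_lr * 2.0 * (theta - target)

def meta_gradient(theta, target, inner_lr):
    """Gradient of the post-adaptation loss with respect to the initial theta,
    obtained by differentiating through the inner step (chain rule)."""
    adapted = inner_adapt(theta, target, inner_lr)
    d_adapted_d_theta = 1.0 - 2.0 * inner_lr  # derivative of the inner update
    return 2.0 * (adapted - target) * d_adapted_d_theta

task_targets = [-1.0, 3.0]  # two toy tasks
theta = 0.0                 # slowly learned meta-parameter (the initialization)
inner_lr, outer_lr = 0.1, 0.05

# Slow outer loop: improve theta so that one fast step works well on every task.
for _ in range(2000):
    grad = sum(meta_gradient(theta, t, inner_lr) for t in task_targets)
    theta -= outer_lr * grad / len(task_targets)
```

In this toy case the meta-parameter settles midway between the task targets, the initialization from which a single fast step does equally well on both tasks.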

Okay, let's talk quickly about other priors. I'm not going to spend as much time on these as on the first one. First of all about causality. What we're trying to do in this research program really is to jointly discover what the right representation space is for these semantic variables. You can think of having a sort of encoder and potentially a decoder that maps from the raw input and output, like pixels and low-level motor actions, back and forth to this high-level semantic space. We would like to discover these correct causal variables that explain the data as a high-level representation. At the same time as we're learning the representation of these variables, we also want to discover their causal relationships, like how one variable can be the direct causal parent of another variable, which can be represented in causal graphs, and that also tells us about interventions. So one of the things that we can do here is have some nodes in this graph that correspond to actions by agents, and they are going to cause some changes downstream. We want to learn that too. How are we going to do that? This is an open question, but I'm going to say a few words about some of the work we've done.

First of all, because we're thinking about causality, because we're thinking about interventions, having data in which everything is static, like we normally train our machine learning systems for deep learning on object recognition and so on, isn't going to help us figure out those causal dependencies and how the different variables may correspond to agents and controllable aspects of the world and so on. So really we should be looking at learning scenarios that involve an environment which can change under the actions of different agents. That's one important aspect. Now, what's interesting is that as soon as you start talking about agents, there comes the notion of changes of distribution, out-of-distribution generalization, because due to the actions of agents, especially in the multi-agent scenario like in the picture, the world changes in a way that's not stationary. Once you go into the place that has dangerous monsters or the place that has a big pile of cash, your life is changed forever. This is something that of course animals have to face, so it makes sense that evolution would have built into the learning mechanisms for brains this ability to generalize out of distribution.

Okay, so let's talk about something specific regarding how we could take advantage of those changes in distribution to learn good representations. The idea is that the changes in distribution are localized in this high-level representation space. One way to think about it is we have this thing from raw input to the semantic space, and when something changes in the world, it's first of all due to a modification that happened in one of the nodes or maybe just a few. As I said, this is something that will typically happen. Let me give you an example: if I put on some dark glasses, at the pixel level everything has changed, but in the semantic high-level space I just change one bit, right? 'Yoshua has dark glasses, yes, no, that's it.' So it's going to be much easier to make sense of those changes if this assumption about locality of the change holds true. That's why we are bundling that assumption in our priors about system two.

Now, how could we take advantage of that more practically? We had a first paper about this kind of thing last year, and it was published at ICLR 2020 this year. We consider a very simple scenario with just two causal variables A and B, and potentially we don't observe A and B, we just observe the output of a decoder which gives us X and Y, where both X and Y depend on both A and B. Now what we want to do is to discover the relationship between the XY observation and the AB latent variables, as well as the causal structure: does A cause B, or B cause A, or none of the above? It turns out that if you have the correct model and the correct representation, the correct direction of causality, you are able to adapt to the change in distribution with fewer examples. This is what the figure on the right says: on the x-axis, the number of examples in a modified distribution where there was an intervention in which say A was modified, and that is going to change P(A) and it's going to change P(B), but it's not going to change P(B|A). So if you have the right factorization of the joint into P(A) * P(B|A), only the P(A) part needs to adapt, whereas P(B|A) is the bigger part and doesn't need to adapt, so you can learn faster. You can see this in these curves where the blue curve uses the assumption that we factorize the joint into P(A) * P(B|A), in other words the cause is A and B is the effect, and learning is going to be faster. What's also interesting in this figure is that the red line which uses the wrong causal structure eventually converges to the same thing. If you have enough data, then you're indifferent to the causal structure; all of the models just model the joint in different ways, but they end up converging to the same thing. But when you have only a little bit of data in the change in distribution, like 10 examples or something, that's where there's a big advantage for the correct causal structure.
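To make the factorization argument concrete, here is a minimal sketch with two binary variables where A causes B. All the probability values are made up for illustration and are not from the paper. An intervention replaces P(A) and leaves the mechanism P(B|A) alone. When the same joint is refactored in the wrong direction as P(B) * P(A|B), both of those factors move, so a model built that way has more to relearn from the few post-intervention examples.

```python
# Hypothetical example: A causes B, both binary. Numbers are illustrative only.

def joint(p_a, p_b_given_a):
    """Joint table P(A=a, B=b) built from the causal factorization P(A) * P(B|A)."""
    return [[p_a[a] * p_b_given_a[a][b] for b in range(2)] for a in range(2)]

def anticausal_factors(j):
    """Refactor the same joint in the wrong direction, as P(B) * P(A|B)."""
    p_b = [j[0][b] + j[1][b] for b in range(2)]
    p_a_given_b = [[j[a][b] / p_b[b] for a in range(2)] for b in range(2)]
    return p_b, p_a_given_b

p_b_given_a = [[0.9, 0.1], [0.2, 0.8]]  # mechanism, untouched by the intervention
p_a_train = [0.7, 0.3]                  # P(A) before the intervention
p_a_shift = [0.2, 0.8]                  # P(A) after an intervention on A

p_b_train, p_a_given_b_train = anticausal_factors(joint(p_a_train, p_b_given_a))
p_b_shift, p_a_given_b_shift = anticausal_factors(joint(p_a_shift, p_b_given_a))

# Causal factorization: only P(A) differs between the two distributions.
# Anti-causal factorization: P(B) and P(A|B) both differ.
print("P(B) before/after:     ", p_b_train, p_b_shift)
print("P(A|B=0) before/after: ", p_a_given_b_train[0], p_a_given_b_shift[0])
```

In the causal direction a single factor has to be re-estimated after the intervention; in the anti-causal direction every factor does, which is the intuition behind the faster adaptation described above.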

More recently we extended that kind of idea to learning from larger causal graphs, and we compared our method to a bunch of existing methods for discovering causal structure. We find that the approach we are proposing is able to discover the correct causal structure substantially more often. What's interesting is that we are able to also generalize to unseen interventions. In other words, there could be values of variables that have been modified in a test distribution which have never been seen as an intervention in the training distribution, but these approaches very easily generalize to these forms of changes in distribution. The general idea for this approach is that we maintain a distribution over all the possible graphs. In previous work we could enumerate all the possible causal structures (either A is cause of B or B is cause of A) and just evaluate which one converges fastest. But if you have many variables, there is now an exponential number of possible graphs. So if we want to learn which one is correct using stochastic gradient methods, we want to smoothly change our posterior distribution over the graph structure. There's an efficient way of doing this by factorizing this posterior into a bunch of factors: probabilities for each edge. So we maintain those probabilities, and we can compute a gradient using a sampling method on those probabilities and converge to a particular graph.
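The idea of one probability per edge, updated from sampled graphs, can be sketched as follows. This is not the estimator from the paper: it uses a generic score-function (REINFORCE-style) gradient with a batch-mean baseline, swapped in for illustration. The score is also a toy stand-in that compares each sampled graph to a hypothetical target graph; in the setting described in the talk, the score would instead come from how well a sampled graph explains data after interventions. The names `learn_edge_probs` and `target` are assumptions made for this example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def learn_edge_probs(target, steps=300, samples=16, lr=0.5, seed=0):
    """Keep one Bernoulli logit per directed edge and update it from sampled graphs."""
    rng = random.Random(seed)
    n = len(target)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    logits = {e: 0.0 for e in edges}  # every edge starts at probability 0.5
    for _ in range(steps):
        probs = {e: sigmoid(logits[e]) for e in edges}
        # Sample whole graphs from the factorized distribution over edges.
        graphs = [{e: 1 if rng.random() < probs[e] else 0 for e in edges}
                  for _ in range(samples)]
        # Toy score (assumption): minus the number of edges that disagree with target.
        scores = [-sum(abs(g[(i, j)] - target[i][j]) for (i, j) in edges)
                  for g in graphs]
        baseline = sum(scores) / samples  # reduces the variance of the estimate
        for e in edges:
            # d/d(logit) of log Bernoulli(g[e]; p) is g[e] - p.
            grad = sum((s - baseline) * (g[e] - probs[e])
                       for g, s in zip(graphs, scores)) / samples
            logits[e] += lr * grad  # gradient ascent on the expected score
    return {e: sigmoid(logits[e]) for e in edges}

target = [[0, 1, 0],
          [0, 0, 1],
          [0, 0, 0]]  # hypothetical chain 0 -> 1 -> 2
edge_probs = learn_edge_probs(target)
for edge, p in sorted(edge_probs.items()):
    print(edge, round(p, 3))
```

With n variables this keeps n(n-1) numbers rather than a table over exponentially many graphs, which is what makes the smooth, gradient-based update feasible.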

I don't have a lot of time. Quickly, an interesting aspect of this work with system one and system two is language. One important aspect here that differs from classical AI is that, as I said at the beginning, the knowledge about the world is distributed across both system one (which doesn't satisfy these assumptions) and system two (which does). When we want to understand a sentence, we need a system one part as well. So this is the idea of grounding in natural language. What you want to do is jointly learn about system one and system two together with natural language. You'd want to learn from texts, you want to learn from environments where you can observe images, actions, and natural language associated with these things. This is the kind of research we started with the BabyAI project that I mentioned already earlier. There are a whole lot of research directions involving these grounded language learning setups which will be required to really make the connection between these ideas of system two and natural language.

Another element that I mentioned at the beginning is the use of interaction and variables and parameter sharing across different instances. We have another ongoing paper just submitted which starts from the RIM architecture and separates the parameters of each of these modules from the values which you can think of as objects that the module operates on. So the same module, which you can now take as a rule, could be applied to multiple objects, and we found that this actually works quite well.
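As a rough illustration of keeping a module's parameters apart from the values it acts on, the sketch below applies one parameterized rule to several objects. It is a toy under stated assumptions and not the RIM-based architecture itself: the `Rule` class, its `scale` and `shift` parameters, and the object names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """A module's parameters, stored separately from the objects it operates on."""
    scale: float
    shift: float

    def apply(self, value):
        # The same parameters are reused for whichever object is passed in.
        return [self.scale * v + self.shift for v in value]

# Hypothetical object slots holding values; they carry no parameters of their own.
objects = {"ball": [1.0, 2.0], "cup": [0.0, -1.0]}

move = Rule(scale=1.0, shift=0.5)
updated = {name: move.apply(value) for name, value in objects.items()}
print(updated)
```

Because the rule owns the parameters and the objects own only their values, whatever the rule learns from one object carries over to every other object it is applied to.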

Okay, so we're coming to a conclusion. I've introduced a number of concepts, and they're related in very interesting ways, starting from ideas of system two, consciousness, attention, systematic generalization, meta-learning of course, causality, modularity, compositionality, agency. So it's really exciting to see that the story I'm painting allows us to connect all of these aspects together. In closing, I want to say that in this work as well as in other work, as AI and machine learning researchers, we have a responsibility. Machine learning is not just something that happens in universities and labs; it's something that's deployed in the real world. That means we have to be careful about the social impact of our work. There is a sort of wisdom race: as the power of the technology that we're putting into the world increases, we need to make sure that society is ready to receive that, that we have enough individual and collective wisdom to avoid catastrophic uses of these technologies. On this, thank you very much.

Narrator51:36

The session Deep Learning 2.0 by Yoshua Bengio. Next session will start at 1:00 PM.