Geoffrey Hinton

Professor Emeritus, University of Toronto

Khipu 2021 Event Series in AI: A fireside chat with Geoffrey Hinton and Oriol Vinyals.

🎥 Jul 27, 2021 📺 KHIPU AI ⏱ 61m 👁 2305 views

About Geoffrey Hinton

Geoffrey Hinton, the Nobel Prize-winning computer scientist often called a "godfather of AI," has stated in multiple recent interviews that he believes current AI systems are already conscious. He said he rarely discusses this view publicly because it "puts people off from the other safety messages." Hinton described the common model of consciousness as "as wrong as the belief that people were designed by God" and argued that anyone who uses a chatbot regularly knows the systems understand language, calling the opposing "stochastic parrot" argument "complete nonsense." Hinton has also discussed his regret about the technology's trajectory, saying he is "quite unhappy" and that society is not doing enough work to contain risks. He cited potential massive unemployment and the longer-term risk of AI becoming much smarter than humans, noting there are few examples of a much smarter thing being controlled by a much less smart thing. He reflected on his 2016 prediction that radiologists would stop reading scans within five years, acknowledging it was wrong due to the elasticity of healthcare and his incomplete understanding of radiologists' roles. Hinton said he has become slightly more optimistic in the past year or two about the possibility of designing AI systems that care about humans or that act only as oracles, but he cautioned that predicting the future beyond a few years is like "looking into fog."

Source: AI-verified profile updated from Geoffrey Hinton's recent appearances.

Transcript (38 segments)

Mary Fertinatu 0:05

Are we live? Yes, it seems so. Hi everyone, thank you for joining the Khipu 2021 event series in AI today. My name is Mary Fertinatu and I'm part of the Khipu team, and I'm also a research scientist at DeepMind. I'll very briefly tell you what we have prepared for today. For those of you who are joining for the first time, welcome. The Khipu 2021 event series in AI is a series of monthly online meetings geared towards supporting the advancement of AI talent in Latin America. In general, each session consists of two parts. The first part is Conversations on AI: fireside chats with AI researchers exploring the most critical topics and problems in AI today. In the second part, we usually have an Applications of AI session, where we cover in detail examples of AI applications in the real world, their challenges and opportunities, focusing on a Latin American perspective. But today, instead of the Applications of AI session, we'll host our first Khipu 2021 social, and I'll share more details about that because I'm very excited about the social. So please don't leave the event after the fireside is over. In the past months, we had many amazing events with the speakers you see on these slides, on the topics of reinforcement learning, self-supervised learning, and fairness and biases in AI. If you missed any of these meetings or you would enjoy a recap, you can watch the recordings on our Khipu Crowdcast profile. The link is in the chat as well. I take this opportunity to thank our sponsors for supporting Khipu 2021, allowing us to host this event free of charge for everyone. Today we'll start our first fireside chat with Geoff Hinton and Oriol Vinyals, no less. So please use the Ask a Question feature to ask live questions so others will have a chance to upvote the questions as well. But let's also use the chat, as you already are, to express emotions, leave live comments, basically to bring some warmth to these online events. Thank you for the claps.
After the fireside chat, we will head over together to Gather Town for our social event. The link will be shared after the conversation session is over. For those of you who are unfamiliar, Gather Town is basically a 2D grid world where you get an avatar when you sign in, and you can walk around and have video calls with the avatars that are close to you. So it's way more interactive. You'll be able to be on camera and chat with people from the Khipu community. There is no pre-established format. We invite you to join, to introduce yourself, talk about AI, or visit our sponsor booths to learn about career opportunities. We also have pre-assigned spaces that you see on the top, where you can speak Portuguese or Spanish, and there are some topics that we thought would be interesting, such as NLP, computer vision, random PhD chat. But really, feel free to chat about anything, as you would in a coffee break at a conference. Without further ado, let me introduce our host and our guest for today's event. If we can have them on camera. Yes, there they are. Let's start with Oriol. Oriol Vinyals is a principal scientist at DeepMind and a team lead of the deep learning group. Prior to joining DeepMind, Oriol was part of the Google Brain team. Oriol is an active member and great supporter of our Khipu community. He was one of our speakers in 2019 at our event in Montevideo, and he's also a member of our Slack channel, which I invite you to join. So feel free to ping him over there. Oriol is an early adopter of deep learning. Some of his contributions, such as sequence-to-sequence, knowledge distillation (which was in collaboration with Geoffrey Hinton), or TensorFlow, are used in Google Translate, text-to-speech, and speech recognition, serving billions of queries every day. Oriol was the lead researcher of the AlphaStar project, creating a grandmaster AI agent in the game of StarCraft, and the paper for this work was featured on the cover of Nature.
He was also involved in other well-known projects such as WaveNet and AlphaFold. Oriol is also the recipient of the 2016 MIT Tech Review Innovator Under 35 award, and his articles have been cited over 100,000 times. That's so impressive, I cannot read the number. But as a personal disclaimer, Oriol is also my husband, so these achievements all look particularly shiny through my eyes. Finally, I would like to introduce Geoff Hinton, whom I'm sure most, if not all of us, are familiar with. But nonetheless, it's my honor, real honor, to refresh a bit of his background and some of his achievements. Geoff Hinton is a fellow of CIFAR, the Canadian Institute for Advanced Research, an emeritus professor at the University of Toronto, a VP and Engineering Fellow at Google, and Chief Scientific Advisor at the Vector Institute. Geoff is one of the pioneers of deep learning and shared the 2018 Turing Award with colleagues Yoshua Bengio and Yann LeCun for their breakthroughs in artificial intelligence. He was one of the researchers who introduced the backpropagation algorithm and the first to use backpropagation for learning word embeddings. His research group in Toronto made major breakthroughs in deep learning that revolutionized speech recognition and object classification. His research has been cited almost half a million times, and as a matter of fact, he is the ninth most cited person in Google Scholar in the world. Geoff has been an inspiration and reference for many AI researchers, myself included. But on a personal note, I would like to add how I met Geoff, because it was a rather unusual way, and maybe he himself doesn't remember. In 2014, I was a math graduate student at UC Berkeley and I was starting to grow interested in AI. I went to the BayLearn event, which was free and also had free food.
I started talking to this nice gentleman who explained a lot of basic concepts and discussed object representation in the brain, how he thought it worked, how we had to represent that using AI systems. And only much later I got to know that he was Geoff Hinton, basically a legend in AI, in deep learning. But nonetheless, he took the time to chat with a clueless but curious graduate student. I'm so glad we got to know each other through that event, and that he agreed to speak today with us at Khipu. He replied within basically a few minutes saying yes, and I was so glad, because when I thought about who our star guest should be, it was Geoff. So thank you so much, Geoff and Oriol, for joining us today. I ask everyone to make some noise on the chat and welcome our guests so we can start this conversation. Please use the Ask a Question button for questions. You can also post them in Spanish or Portuguese. Oriol is also from Spain, Lincoln says. And yeah, thank you. Now I'll pass it to Oriol.

Oriol Vinyals 9:01

Yeah, thank you so much for the introduction. I guess this is very exciting for me. I don't get to chat to Geoff so much now; we used to share an office back in Brain. But maybe let me also tell you how I really met Geoff, because it's going to be relevant for some of the conversations and discussions. It will tell you a little bit about who Geoff really is, and I think that's to me the most amazing bit that maybe not many people know. How I really met Geoff is that I had just joined Google Brain and I was trying to figure things out, and all of a sudden Geoff comes to my desk and says, "Oh, you should come see something that I have on my screen." So I go to where he sits, or doesn't sit rather, he stands, and he shows me on the screen some filters of a neural network, the weights of the neural network that he had just trained, and showed me a bit the shape of the filters. That's something that hopefully many of you practitioners take a look at sometimes. He was just super excited about what he had just learned, or what the model had just learned. I think this is really the most amazing part about Geoff: that he still codes, implements experiments, computes gradients (hopefully with automatic differentiation), and brainstorms a lot about ideas. No, he manually computes gradients, it's quite amazing. You should look at the MATLAB code he posted on his website. So I think this is really quite unique, very aspirational, to still be so hands-on at this stage of your career. So welcome, Geoff. And maybe to kick off, after saying hi to our audience, I'll ask you: what's the last experiment you've run? What was it on? What was it trying to do?

Geoffrey Hinton 11:07

Well, I'm currently running an experiment downstairs that's training a big Boltzmann machine. Not one of these restricted Boltzmann machines that makes the training algorithm easy, but I think I figured out how to train a big one, and I'm training it to see if it works.

Oriol Vinyals 11:26

Great. So actually, many in our audience might not even know what a Boltzmann machine is. So can you tell us a bit more intuitively what you are trying to do with a Boltzmann machine that is not restricted?

Geoffrey Hinton 11:39

Okay, long ago in a galaxy far, far away, there were two learning algorithms. This was 1984, about then. There was backprop, which we all know about, and there was a different learning algorithm called Boltzmann machines. They were the two learning algorithms that could learn hidden representations. That is, they could take neurons that weren't part of the input or output and they could learn what to do with them, and they would learn to use them for representing interesting things that were going to be useful. Backpropagation was a boring, straightforward algorithm that worked rather well. Boltzmann machines was a much better algorithm, it was much more interesting intellectually, much more likely to be what the brain's doing, and it never worked very well. So you all know how backpropagation works. I'll tell you how Boltzmann machines work. It was inspired by work by Hopfield and Francis Crick in particular. Francis Crick. So the idea is, okay, let's take a neural net. It's got some units where you represent data, and it's got some hidden units, and it doesn't have any other units. There's no output units. So this is designed for unsupervised learning. The aim is that you show lots of data, and after a while, if you look at the units inside the network, they're representing all sorts of interesting things about the data. Like maybe whenever you show it a cat, one of the units goes ping. That's what you'd like. We want a really simple way of learning that biology could do. Here's a really simple way of learning it. You have two phases of learning: the positive phase and the negative phase. In the positive phase, you show it data and you let the neurons settle down to what's called thermal equilibrium, or statisticians call it the stationary distribution. 
What's happening is you fix the states of the units where you represent data, those would be pixel intensities for example, and the other units keep updating their states based on the input they're getting from the data and from the other hidden units. It doesn't have to be layered. The connections in this network are all symmetrical. That is, if there's two units A and B, the connection from A to B has the same weight as the connection from B to A. Because the connections are symmetrical, this is an energy function. That's what Hopfield realized. Newton's third law is that action and reaction are equal and opposite. If you implement that for neurons, if you say the amount by which neuron A, when it goes ping, affects neuron B is the same as the amount by which neuron B affects neuron A, then there'll be an energy function. When the network settles down, if you run it non-stochastically, it just settles to the minimum of that energy function. That's what Hopfield realized. Terry Sejnowski and I realized that if you actually add noise, so make it make a stochastic decision, then it will reach an equilibrium distribution of an energy function called the Boltzmann distribution. We initially used that for doing searches, but then later on we realized there's a simple learning algorithm. The learning rule is really neat. You put data on the data units, and then you let all the other units rattle around, influencing each other for long enough to reach thermal equilibrium. There's a big question about how long that is, but with data there, that might be fine. If it's a neural net that when you give it data can only see one thing, or typically only sees one thing, it'll reach thermal equilibrium quite fast because it's got a basically unimodal distribution. Once it's reached thermal equilibrium, you measure how often two units are on together. 
There might be two hidden units, there might be a data unit and a hidden unit, and you only bother to do that for units that have a connection between them. So it can be sparsely connected if you like. That's called the positive phase, and you're just measuring this correlation. Then in the negative phase, you don't show it any data, so it's free to do whatever it likes. The units all rattle around and go ping and make other units go ping, and you wait for a while, which might be about the age of the universe, but hopefully shorter. Then you measure the correlations. That's the negative phase. Now, the correct maximum likelihood learning rule, this is what's amazing, is that you should change each weight in proportion to the difference between the correlation of the two units in the positive phase and the correlation of the two units in the negative phase. That's really neat. It's not like backprop where you have activities that are propagated in one phase and derivatives that are propagated in another phase, which is very un-neuron-like. It's just neurons going ping, sometimes with the data clamped and sometimes without the data clamped. Crick didn't have any math for it, but he had the idea with Graeme Mitchison that if you didn't clamp the data and did unlearning, that would somehow make it work better. What Terry Sejnowski and I established is that if you use the right kind of neurons, which are logistic neurons, stochastic logistic neurons, which have binary states but turn on with a probability of one over one plus e to the minus their total input, then the learning rule that just takes the difference of the correlations when the data is clamped and when it's not clamped, that is the maximum likelihood learning rule. It's ridiculously easy to implement in a neural net. There's only one thing wrong with it: it takes the age of the universe for the neural net to settle down when you're not showing it data. So there you go.
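
The two-phase rule described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the experiment mentioned in the conversation: the function names, the toy data, the learning rate, and the fixed number of sweeps are all made up for the example, biases are left out, and a handful of Gibbs sweeps stands in for "thermal equilibrium", which a real network would generally not reach that quickly.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gibbs_sweep(state, weights, free_units, rng):
    """Stochastically update every unit in free_units once; others stay clamped."""
    for i in free_units:
        total_input = sum(weights[i][j] * state[j]
                          for j in range(len(state)) if j != i)
        # Stochastic logistic neuron: on with probability 1 / (1 + e^-input).
        state[i] = 1 if rng.random() < sigmoid(total_input) else 0


def pair_correlations(samples, n):
    """Fraction of samples in which units i and j are on together."""
    corr = [[0.0] * n for _ in range(n)]
    for s in samples:
        for i in range(n):
            for j in range(n):
                corr[i][j] += s[i] * s[j]
    return [[c / len(samples) for c in row] for row in corr]


def learning_step(weights, data, n_hidden, rng, sweeps=20, lr=0.1):
    """One weight update from a positive (clamped) and a negative (free) phase."""
    n_visible = len(data[0])
    n = n_visible + n_hidden
    hidden = list(range(n_visible, n))
    all_units = list(range(n))

    positive, negative = [], []
    for vector in data:
        # Positive phase: data units clamped, hidden units rattle around.
        state = list(vector) + [rng.randint(0, 1) for _ in hidden]
        for _ in range(sweeps):
            gibbs_sweep(state, weights, hidden, rng)
        positive.append(state)

        # Negative phase: nothing is clamped, every unit is free.
        state = [rng.randint(0, 1) for _ in all_units]
        for _ in range(sweeps):
            gibbs_sweep(state, weights, all_units, rng)
        negative.append(state)

    pos = pair_correlations(positive, n)
    neg = pair_correlations(negative, n)
    for i in range(n):
        for j in range(i + 1, n):
            # Change each weight by the difference of the two correlations.
            delta = lr * (pos[i][j] - neg[i][j])
            weights[i][j] += delta  # same change in both directions,
            weights[j][i] += delta  # so the connection stays symmetric
    return weights


# Toy run: 4 data units, 2 hidden units, two made-up binary patterns.
rng = random.Random(0)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
n_units = len(data[0]) + 2
weights = [[0.0] * n_units for _ in range(n_units)]
for _ in range(50):
    learning_step(weights, data, n_hidden=2, rng=rng)
```

Note that the update uses only locally available quantities (how often two connected units are on together in each phase), which is the property contrasted with backprop in the explanation above.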

Oriol Vinyals 17:40

Boltzmann machines explained by Geoff Hinton. So I think there are a few thoughts that come to mind as you were explaining this. Maybe let's start with: does it bother you? Obviously one of the most important papers and works was indeed the backpropagation paper. So I'd like to know whether it bothers you to not see the connection between how the brain might work and how in practice we're training all these models that do all these amazing things that Mary was talking about. And it looks like it still bothers you, because you're running an experiment downstairs. But to what extent should we be worried as a community about this?

Geoffrey Hinton 18:27

Okay, so the real question is: does the brain do backpropagation? I worked with Tim Lillicrap for several years. We used to meet every few months and have long arguments, and he had people at DeepMind doing simulations. In the end, we produced a big Nature Reviews Neuroscience paper about the relationship between the brain and backpropagation. We did our very best to make a plausible case for how the brain might do backpropagation, and I don't believe a word of it. If you read that paper, it's just smart people trying really hard to see how this thing that is clearly not doing backpropagation could be doing backpropagation. We make a very convincing case, I think, that it doesn't, because we're smart and we tried very hard. I think the brain is a Boltzmann machine. The only issue is how you get over this problem that it takes the age of the universe to get the negative phase, to get the negative data. One thing to notice about Boltzmann machines is that nowadays for unsupervised learning, it's very fashionable to use contrastive methods. Oriol was involved in introducing a contrastive method. It was actually a better version of something we introduced many years earlier, but we never convinced anybody of that thing because we didn't make it work very well. The best kind of contrastive method is one that has the hardest negative examples. The way to get hard negative examples is to use the model itself, and that's what Boltzmann machines are doing: they generate their own negative examples. Now, what they're not trying to do is get the same representation with positive and negative examples like you are with current contrastive learning. They're trying to get the same statistics of the correlations of neural activities with positive and negative examples. They get very good negative examples, and if you could only overcome this problem that it takes forever for them to settle down, you'd really be in business.
I think backpropagation is probably more efficient than what the brain uses, and it's much better at squeezing a lot of information into not many synapses. So you can squeeze a huge amount of knowledge into only a trillion synapses, whereas the brain has 100 trillion and it doesn't need to squeeze knowledge into them. So I think the brain is very good at dealing with not much data with a huge number of connections, which it doesn't use very economically in the sense of saving storage on how many connections you have. What is economical about it is how much data you need. It's profligate with connections and profligate with the amount of computation too.

Oriol Vinyals 21:08

Yeah, so okay, we'll get back to this, I'm sure, because I know how much you like this topic, Geoff. But you mentioned that you're currently running an experiment downstairs. So I'm sure many people might be asking themselves: on what data? Is it still MNIST or have we switched datasets? And also, just out of curiosity, on what platform, what software are you using to run that experiment? I'm quite curious to know.

Geoffrey Hinton 21:36

Okay, so you've heard about Moore's Law, right? Moore's Law says that computers get a lot faster with time. But I managed to make my algorithms get slower with time, even faster than Moore's Law. I have a bigger exponent than Moore's Law. So I'm testing this new idea not on the whole of MNIST, but on a fraction of MNIST, because if I take the whole of MNIST, it runs too slowly. I'm doing it on my Mac downstairs, locally, in MATLAB on a Mac PowerBook. I have this belief, which isn't widely shared, that if you look at where the really original ideas come from, they don't come from using huge datasets. Huge datasets have very interesting properties and they're very worth exploring. I'm not against exploring huge datasets. GPT-3 tells you a lot, and Google's big models and things tell you a lot. But in terms of understanding new learning algorithms, if you remember distillation, I did the first examples on MNIST on my laptop in MATLAB. Boltzmann machines were developed on a Lisp machine, and the Lisp machine took 12.5 microseconds to do a floating point multiply, so that's about a tenth of a megaflop. Machines nowadays go faster than that. So I think to get radically new learning algorithms and to understand properties of learning algorithms, there are some things you'll only understand with big datasets, like what's the true nature of language. But some algorithms you can understand with small datasets running on cheap machines. The crucial thing is: is it quick to do experiments? Can you take an idea and prove that it's wrong in half an hour?
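
As a quick check of the arithmetic in that answer, assuming one floating point multiply counts as one floating point operation, 12.5 microseconds per multiply works out to slightly under the quoted tenth of a megaflop:

```latex
\frac{1\ \text{multiply}}{12.5 \times 10^{-6}\ \text{s}}
  = 8 \times 10^{4}\ \text{multiplies/s}
  = 0.08\ \text{Mflop/s}
  \approx \tfrac{1}{10}\ \text{Mflop/s}
```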

Oriol Vinyals 23:31

Yeah, that's very interesting because I think accessibility to learning methods through open source and just making it easy and more available... Oh, we lost Geoff, I think. Oh, there you go, you're back. You're back. Yeah, I was just saying that given the accessibility to software has improved so much, maybe we will miss out on some of these fundamentals because people might not know how to compute a gradient. I remember very well the code that initially took me ages to understand. I'm sure you use the same variable names even. But yeah, I think maybe those who are starting, and especially those who are more research-minded, might actually benefit from this, right? You might want to start very small and gain a lot of intuition. I mean, that's really how you operate. I've seen this so many times. Indeed, with the distillation paper, people might assume that maybe Oriol ran all the experiments and so on, and I think maybe you even ran more experiments than I did, even though I was starting and obviously as an individual contributor I had a lot of time to run experiments and to investigate that particular idea that we were developing. So yeah, that's a very good tip, maybe for new starters.

Geoffrey Hinton

Let me tell you one story about Alan Turing, which I just realized might be relevant. He would program the early computers by flipping switches to put binary numbers in. When they first came along with kind of programs and computer languages and stuff, he wouldn't have anything to do with it. He just kept flipping switches because that's what he knew how to do. I think maybe I'm a bit like that.

Oriol Vinyals

Yeah, cool. So I guess it's also good to answer a few questions from the audience. I like the first question because I think you've already partly answered it: what advice would you give to an ML enthusiast who's just beginning, to accelerate not only their experimentation, which is maybe what you were talking about (experiment fast and quick), but also their learning?
So from a learning perspective, if you have to learn and you're starting today, what do you think are good resources that you might advise many people about?

Geoffrey Hinton 26:06

So I think you have to decide whether what you're trying to explore is what happens when you do gradient descent learning on big datasets, or what you're trying to explore is how to make the learning algorithms better. It's to do with the properties of a lot of data. Big datasets have very special properties, and you can't really study those on small datasets. You couldn't do GPT-3 on one book's worth of language; it just wouldn't work. You need a lot. Similarly for machine translation, you wouldn't know whether it worked or not if you did it on a small dataset. So there's the applications of the technology, which typically work best when you have a lot of data, and then there's the basic technology itself. If you want to explore the basic technology itself, you want to do it on small things that you really understand, and you can look at everything that's going on. You have lots of displays up on the screen. Whenever you put a new display of something up, you discover it wasn't working the way you thought. So you want to display everything in sight: what the activities are, how the weights are evolving, what the histogram of weights is, what the ratio of the gradients to the values of the weights is, all those sorts of things. You get insight into what's going on, as opposed to you take some high-level language, you start with somebody else's code that's already implemented a ResNet, and then you fiddle with a few things. I don't think you'll make new discoveries like that, radically new discoveries. But I'm just an old-fashioned guy. I also believe that automatic differentiation is bad for your moral fiber. The best regularizer we have for forcing other people's models to be easy for me to understand is that they had to compute the gradients. If you look at a lot of models we have now, you'd never have these models if those people actually computed the gradients.
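
A small sketch of the habits described above, computing a gradient by hand and displaying diagnostics such as the weight histogram and the ratio of the gradient to the weights. Everything here is an assumption made for illustration: the single logistic unit, the example numbers, and the helper names are hypothetical, and a finite-difference check is used only to confirm the hand-derived gradient.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def loss_and_grad(w, x, y):
    """Cross-entropy loss of one logistic unit and its hand-derived gradient."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # d(loss)/d(w_i) = (p - y) * x_i, worked out on paper.
    grad = [(p - y) * xi for xi in x]
    return loss, grad


def numerical_grad(w, x, y, eps=1e-6):
    """Central finite differences, used only to check the hand-derived gradient."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((loss_and_grad(up, x, y)[0] - loss_and_grad(down, x, y)[0])
                    / (2 * eps))
    return grad


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def histogram(values, lo, hi, bins):
    """Counts per equal-width bin; out-of-range values go to the end bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        k = min(bins - 1, max(0, int((v - lo) / width)))
        counts[k] += 1
    return counts


w = [0.5, -0.3, 0.1]
x = [1.0, 2.0, -1.0]
y = 1

loss, grad = loss_and_grad(w, x, y)
check = numerical_grad(w, x, y)
max_error = max(abs(a - b) for a, b in zip(grad, check))

print("loss:", loss)
print("max |analytic - numerical|:", max_error)
print("gradient norm / weight norm:", norm(grad) / norm(w))
print("weight histogram:", histogram(w, -1.0, 1.0, 4))
```

On a real model the same quantities would be tracked per layer and over training time; the point of the sketch is only that each of them is cheap to compute and display.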

Oriol Vinyals 28:19

That's true. Yeah, gradients for free. That's definitely the new movement. I've enjoyed both sides of it, but I must say I will not compute gradients anytime soon, I suspect. So yeah, I think maybe let's go through another question that goes a little bit more to actually Boltzmann machines, just so that we connect and maybe wrap up with how the brain works and we finally discover how it does work. So Fabricio is asking: maybe something you like, how to reconcile the idea that the brain is a Boltzmann machine on the one hand with the notion that the connections in the brain are reinforced or dimmed over time? In other words, the connections change over time.

Geoffrey Hinton 29:10

So the Boltzmann machine learning algorithm is the way of changing the connection strengths. You change them based on the difference of the correlations when you're clamping data and when you're letting it run free, which we thought of as sleep. The act of learning from a Boltzmann machine is precisely something that happens all the time in your brain. It's happening all the time. When not much is happening, it's because the learning is equilibrated, not because it's been turned off. So if you take your low-level perceptual system, it really isn't changing much anymore because you've been seeing data with the same statistics for a long time. In your case, video games. As soon as you show it radically different data, you show that for a day or two, and your low-level perceptual system will change. But the reason it was sort of static is just because the data was sort of static. It's learning all the time.

Oriol Vinyals 30:06

So I have a question about another scale of learning, which bothers me: basically, evolution. That is the scale at which we came to have brains that happen to be the way they are. How do you see learning at that scale? Do we need to replicate evolution, or maybe there's a way to do gradient descent to actually accelerate that process? How do you see the learning outside of the human brain, basically?

Geoffrey Hinton 30:43

Okay, I guess I got a lot to say about that. One thing: if you look at reinforcement learning papers now, it may change with model-based reinforcement learning, but if you look at ones that aren't using model-based reinforcement learning, if you look at the horizontal axis very carefully, you'll see it says kind of one, ten, a hundred, a thousand. You think, okay, that's how many sweeps through the training set or updates or something. Then if you look very carefully in rather small print on the right-hand side, you'll see times ten to the seven. Occasionally it's times ten to the six, but it's usually times ten to the seven. That's because with reinforcement learning, if you ask how much information you're getting to set the weights, the reinforcing signal, suppose it's a binary reinforcing signal and you get it after every trial, that's one bit. It's really not very much information. Evolution is like that. Also, evolution has a big problem: you can't backpropagate through it, because development has a big problem. When an organism develops from the embryo into an adult, it's interacting with an environment that isn't internal to the organism. Whereas if you put an input into a neural net and turn that into an output, everything that happens is happening inside the neural net, so you can get derivatives of everything. Evolution couldn't use backprop because there are all these extraneous inputs coming in. So what evolution did, being a very slow, clumsy algorithm, is it evolved a brain. It evolved a brain because a brain can do backprop. The brain can get real gradients. Whether it does it by Boltzmann machine methods, differences of correlations, or by backprop doesn't matter for this argument. It can get gradients in a high-dimensional space. If you take a million-dimensional space and you can get a gradient, you can learn a million times faster than a method that doesn't get a gradient. 
So it's completely insane to think we learned anything complicated by evolution. The role of evolution is to create a brain and create a rough architecture for the brain, and the brain is then going to do the learning. Now it's even more complicated than that because if you're a Darwinian, you think that properties of the brain that are encoded in your DNA couldn't have been the result of learning. But actually, that's completely wrong. There's something called the Baldwin effect. That's completely Darwinian; it doesn't violate Darwin's assumption that acquired characteristics can't be inherited. Nevertheless, it shows that learning organisms can evolve much faster than organisms that don't learn, and they can get what's learned into their DNA without being Lamarckian. I could go on for about 10 minutes to explain that, but I suggest you just read the article about the Baldwin effect by Steve Nowlan and me.
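
The argument above about gradients in high-dimensional spaces can be illustrated with a toy comparison. This sketch is an assumption-laden stand-in, not a model of evolution or of reinforcement learning: it minimizes a simple quadratic in 100 dimensions, once with the exact gradient and once with random perturbations that are kept only when a single better-or-worse signal says they helped. The dimension, step sizes, and trial counts are arbitrary choices for the example.

```python
import random


def loss(w):
    return 0.5 * sum(x * x for x in w)


def gradient_descent(w, steps, lr=0.1):
    """Follow the exact gradient; for this loss the gradient is w itself."""
    w = list(w)
    for _ in range(steps):
        w = [x - lr * x for x in w]
    return loss(w)


def random_perturbation(w, steps, rng, sigma=0.1):
    """Try a random change and keep it only if the loss went down."""
    w = list(w)
    best = loss(w)
    for _ in range(steps):
        candidate = [x + rng.gauss(0.0, sigma) for x in w]
        candidate_loss = loss(candidate)
        # One bit of feedback per trial: was the candidate better or not?
        if candidate_loss < best:
            w, best = candidate, candidate_loss
    return best


rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in range(100)]

gd_loss = gradient_descent(start, steps=50)
rp_loss = random_perturbation(start, steps=50, rng=rng)

print("starting loss:          ", loss(start))
print("after gradient descent: ", gd_loss)
print("after random trials:    ", rp_loss)
```

With the same number of trials, the gradient tells every coordinate which way to move, while the perturbation method learns only whether one random direction was an improvement, so the gap widens as the dimension grows.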

Oriol Vinyals 33:58

Great, yeah, I've seen that. But maybe people like that. You're using the one-bit analogy that maybe it's been popularized by a cherry on top of a cake. So I guess you believe in a way that reinforcement learning is like the cherry on the cake, and perhaps Boltzmann machines, if you make them work, might be a way to get the cake and the icing and so on.

Geoffrey Hinton 34:24

Yeah, absolutely.

Oriol Vinyals 34:28

Great. Great minds think alike, so that's good to know. So although there is a Boltzmann version of reinforcement learning, oh okay, maybe we'll get to that. But maybe let's leave Boltzmann alone for a little bit. One of the papers many people in the audience might have read, because we mentioned in the Slack chat that this is one of the ones you would like to talk about, is the GLOM paper. This is an interesting paper, a single-author paper from Geoff. I'll just read the first sentence of the abstract, because it's a very interesting way to write a paper that maybe only Geoff can do. It starts: "This paper does not describe a working system. Instead, it presents a single idea about representation which allows advances made by several different groups to be combined into an imaginary system called GLOM." So I have many questions, but maybe let me ask the one that Leo is asking from the audience, and it's very relevant to this sentence of the abstract actually: have you made progress on the GLOM model implementation since you published the GLOM paper? Maybe that's what you're doing downstairs?

Geoffrey Hinton35:49

Yes, some. So I've been working with Sara Sabour and a software engineer called Laura Culp. Laura has implemented a very toy version to test out one particular idea in GLOM, which is one of the central ideas. The idea is that suppose I have two parts of an object, let's suppose they're a nose and a mouth, but I'm not quite sure they're a nose and a mouth because it's low-resolution data or there's noise or whatever. So I have a possible nose and a possible mouth. What I would like is to get them to disambiguate each other. I'd like the possible nose to provide information that says to the possible mouth, "Hey, that's quite a good bet that you're a mouth." Now, if you think about doing that with transformers, it's quite tricky. Suppose that you have layers of representation, and think of the nose and the mouth as like word fragments. They're going to get revised embeddings at each layer, and the initial embeddings will be rather ambiguous, like they would be for the word "may." But as you go through the layers, the context will disambiguate them. That's fine for language. But now try the transformer approach on parts of objects, and you have to deal with coordinate transforms. If I've got a possible mouth, it really needs to send out a message that looks like this: a query, in transformer talk. It needs to take the pose of the mouth in the image, that is the coordinate transform between the intrinsic frame of reference of the mouth and the image frame, so that's what I call the pose of the mouth. It needs to multiply that by the relationship between a mouth and a nose, that is the coordinate transform between the intrinsic frame of reference of the mouth and the intrinsic frame of reference of the nose. You need to send out a query saying, "Is there something out there that might be a nose and has this pose?" But it's not the mouth pose; it's the transform of the mouth pose in order to be a nose. 
So it needs to send out queries like that all over the place to look for noses. But of course, it might also look for an eye, and it needs to apply a different coordinate transform, one for the left eye and one for the right eye, and say, "Is there anything that might be a left eye out there that had this pose?" That's quite heavy work for a transformer. What's more, if the nose gets this query from the mouth that's been suitably transformed to say, "Hey, I got a match," what it needs to tell the mouth is it needs to send back a value. The value isn't the pose of the nose; it's the pose of the nose transformed by the inverse coordinate transform to get it back to the pose of the mouth, and that's the value it should contribute. So you've got these coordinate transforms having to happen if you do parts interacting with each other, and only supporting each other if they have the right geometric relationship. That's how you do it with transformers. We did indeed make something like that work with set transformers. We don't actually know how it worked, but that's the only way I can see that set transformers could actually get it to work, so I assume it's doing that.
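The query and value described in this answer can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the talk or from the GLOM paper: poses are restricted to 2D similarity transforms written as complex-number pairs, and the names `MOUTH_IN_IMAGE`, `NOSE_IN_MOUTH`, `nose_query`, and `value_for_mouth`, along with all the numbers, are hypothetical.

```python
import cmath
import math

# A 2D pose is modelled here as a similarity transform z -> a*z + b on the
# complex plane, stored as (a, b): a carries rotation and scale, b translation.

def make_pose(angle_deg, scale, tx, ty):
    return (cmath.rect(scale, math.radians(angle_deg)), complex(tx, ty))

def compose(outer, inner):
    """Transform that applies `inner` first and `outer` second."""
    return (outer[0] * inner[0], outer[0] * inner[1] + outer[1])

def invert(pose):
    a, b = pose
    return (1 / a, -b / a)

def gap(p, q):
    """Crude distance between two poses (0 means identical)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Hypothetical numbers, chosen only for illustration.
MOUTH_IN_IMAGE = make_pose(30, 2.0, 5.0, 1.0)  # mouth frame -> image frame
NOSE_IN_MOUTH = make_pose(0, 0.8, 0.0, 1.5)    # nose frame -> mouth frame

def nose_query(mouth_in_image):
    """Query sent by the mouth: the pose a matching nose should have."""
    return compose(mouth_in_image, NOSE_IN_MOUTH)

def value_for_mouth(nose_in_image):
    """Value sent back by a nose: the mouth pose that this nose implies."""
    return compose(nose_in_image, invert(NOSE_IN_MOUTH))

query = nose_query(MOUTH_IN_IMAGE)
matching_nose = query                        # a nose exactly where expected
stray_nose = make_pose(-60, 1.0, -4.0, 7.0)  # a nose somewhere unrelated

# How far each returned value is from the mouth's own pose.
match_gap = gap(value_for_mouth(matching_nose), MOUTH_IN_IMAGE)
stray_gap = gap(value_for_mouth(stray_nose), MOUTH_IN_IMAGE)
print(match_gap, stray_gap)
```

Only the nose in the right geometric relationship sends back a value that agrees with the mouth's pose. Every part-to-part pairing needs its own relation like `NOSE_IN_MOUTH`, which is the per-pair cost the discussion turns to next.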

Oriol Vinyals39:24

By the way, just to interject one second, I guess you would agree with this, but just for clarifying for those who know transformers quite well in the audience, I suppose the fact that transformers come with multiple heads could help because these schedules...

Geoffrey Hinton39:42

Oh yes, they'd have to come with multiple heads. You'd have to have one head for mouths asking about noses and another head for mouths asking about eyes. Yeah, right. But that doesn't scale very much.

Oriol Vinyals39:56

Okay, so please tell us, what's the solution?

Geoffrey Hinton40:01

Here's an alternative. What the mouth does is it takes its pose and it transforms it by the coordinate transform between a mouth and a face, and it says, at the next level up, I can have levels of embedding corresponding to the different levels of the part-whole hierarchy. So I'm a major part, I'm at that level. At the level of the object itself, I can make a prediction about the identity. I mean, I can make a prediction about the pose: it should have the pose of a face that would contain a mouth like me. Now the nose can do the same thing. The nose takes the coordinate transformation of the nose in the face and predicts the face in its pose. Notice, if the mouth and the nose are correctly related geometrically, they will predict the same pose for the face. That's called a Hough transform. So now you've got a much simpler operation. All you have to do is see if the predictions coming from below agree. What's more, when things are interacting, when the mouth and nose are interacting, they're not going to interact directly. They're going to interact by these potential faces interacting. The face predicted from one location should be the same as the face predicted from another location. So you've got these interactions between possible faces. They should be the same, but there's no coordinate transforms going on there; the coordinate transforms will be done already. So that's much simpler, but it has a horrible piece of baggage that comes with it. The horrible piece of baggage is that you weren't sure it was a mouth. Let me forget about that. Suppose it was an eye. We think we found an eye. Well, it could be a left eye, it could be a right eye, or it could be a front wheel of a car, it could be the back wheel of a car. We don't know which it is until we disambiguate it using context, because actually it's just a circle. So now what we're proposing is that without knowing what it is, you make multiple predictions at the next level up. 
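The agreement test described here, where each part predicts the pose of the whole and the predictions are compared, might look like the following. It is a sketch under assumptions rather than the GLOM implementation: poses are 2D similarity transforms stored as complex-number pairs, and `MOUTH_IN_FACE`, `NOSE_IN_FACE`, `predict_face`, and every number are invented for illustration.

```python
import cmath
import math

# Pose = similarity transform z -> a*z + b, stored as the pair (a, b).

def make_pose(angle_deg, scale, tx, ty):
    return (cmath.rect(scale, math.radians(angle_deg)), complex(tx, ty))

def compose(outer, inner):
    """Transform that applies `inner` first and `outer` second."""
    return (outer[0] * inner[0], outer[0] * inner[1] + outer[1])

def invert(pose):
    a, b = pose
    return (1 / a, -b / a)

def gap(p, q):
    """Crude distance between two poses (0 means identical)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Hypothetical part-to-whole relations (part frame -> face frame).
MOUTH_IN_FACE = make_pose(0, 0.4, 0.0, -1.0)
NOSE_IN_FACE = make_pose(0, 0.3, 0.0, 0.2)

def predict_face(part_in_image, part_in_face):
    """Each part votes for the pose of the face that would contain it."""
    return compose(part_in_image, invert(part_in_face))

# Build a scene: one real face, plus a nose belonging to some other face.
true_face = make_pose(20, 3.0, 10.0, 4.0)
mouth = compose(true_face, MOUTH_IN_FACE)
nose = compose(true_face, NOSE_IN_FACE)
stray_nose = compose(make_pose(-70, 1.0, 2.0, 2.0), NOSE_IN_FACE)

mouth_vote = predict_face(mouth, MOUTH_IN_FACE)
agree = gap(mouth_vote, predict_face(nose, NOSE_IN_FACE))
disagree = gap(mouth_vote, predict_face(stray_nose, NOSE_IN_FACE))
print(agree, disagree)
```

The coordinate transforms are applied once per part, on the way up. After that, comparing votes is a plain equality check between candidate face poses, with no part-to-part transform involved.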
So the next level up, you have to predict: well, it might be a face with this pose because it's a left eye, might be a face with that pose as a right eye, it might be a car with this pose as the front wheel, and so on. So you have to make a multimodal prediction. The question is: can you get neural nets to do that, and can you get neural nets to resolve these predictions? What Laura showed is yes, you can, in a simple case. The idea is you have an embedding vector for this thing that's a possible eye. Well, it's a circle, right? It has an embedding vector. That embedding vector will get revised by top-down input later on as the net settles down, so it'll get turned into an eye rather than just a circle. But to begin with, it's just a circle. So it has to predict that it might be a face with this pose, or a face with that pose, or a car with this pose, or a car with that pose. Somehow the embedding at the next level up, the embedding of the object level, has to be able to represent all those alternatives. What Laura showed is yes, you can get that to work. Now what happens is neighboring embeddings at the object level, coming from neighboring locations in the image, when you average them together, you're representing these multimodal distributions across all these possible alternatives in the unnormalized log probability space. So think of them as like a landscape with bumps, where bumps are modes, but they're unnormalized log probs. If you now just add those together, that's equivalent to multiplying probability distributions together. If there's a common mode to a whole bunch of different multimodal predictions, that common mode will stand out. Because it's log probability distributions, when that common mode stands out, it'll totally suppress all the other modes in the probability domain. Laura made that work. So that's one thing we did, on very toy data, basically faces of sheep made of ellipses. 
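The claim that adding unnormalized log probabilities makes a shared mode stand out can be checked on a toy example. The four hypotheses and the scores below are made up for illustration, and a real system would represent the landscape with an embedding vector rather than an explicit list. The sketch assumes only that each location assigns an unnormalized log probability to each candidate.

```python
import math

HYPOTHESES = ["face, pose A", "face, pose B", "car, pose A", "car, pose B"]

def softmax(logits):
    """Turn unnormalized log probabilities into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Unnormalized log probs from three neighbouring locations. Each location is
# ambiguous on its own, but "face, pose A" is a mode in all of them.
location_logits = [
    [2.0, 2.0, 2.0, 0.0],
    [2.0, 0.0, 0.0, 2.0],
    [2.0, 0.0, 2.0, 0.0],
]

# Adding in the log domain...
combined_logits = [sum(column) for column in zip(*location_logits)]
combined = softmax(combined_logits)

# ...is the same as multiplying the distributions and renormalizing.
per_location = [softmax(logits) for logits in location_logits]
product = [math.prod(ps) for ps in zip(*per_location)]
renormalized = [p / sum(product) for p in product]

winner = HYPOTHESES[combined.index(max(combined))]
print(winner, [round(p, 3) for p in combined])
```

No single location gives any hypothesis more than half the probability mass, yet the summed landscape puts most of it on the one mode they share. Averaging instead of summing only rescales the log probabilities, so the same mode still wins, just less sharply.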
You look at an ellipse and you don't know what it is; it's the arrangement of ellipses that conveys all the information. So that's one direction, but she was using supervised learning to make that work. She was also using backpropagation through time to make it work. What you would do is you'd put in some ellipses, it would settle down, you'd leave out a few ellipses, and it would fill them in just like BERT does. Now you backpropagate through time to make sure it fills things in right, and that's how you can train it. But that was never neurally plausible. So those simulations showed that yes, you can make multimodal predictions for the higher-level embedding, then you can get interactions between neighboring embeddings to resolve those. But the question is: what the hell's the learning algorithm? The whole point about GLOM is it's meant to be neurally plausible, so there's no point having backprop. What I'm running downstairs is the neurally plausible learning algorithm that just learned the whole of GLOM unsupervised.

Oriol Vinyals45:18

Okay, okay. We'll be looking for the next paper then to see how it works. But yeah, it's fascinating. Obviously for those who might not have heard or interacted with Geoff, it's amazing how clearly you explain it. Obviously some knowledge about transformers and neural nets is required, but just watch the recording and this is a great way to learn. This is what Sara's and others' work is about. You might go ahead and learn about it from the paper itself, and you might find that the paper, complemented with Geoff's explanation, makes more sense. But really, I think you're an amazing teacher of the intuition for how things might work, and I really like that bit. I miss it a lot because of course we used to interact earlier when we were both in the same office. One thing I should add before we carry on is that the paper that uses lateral interactions between these ambiguous parts and set transformers is called Stacked Capsule Autoencoders, and the first author of that is Adam Kosiorek, who was a student of Yee Whye Teh in Oxford, and now has a Google group in either Berlin or Zurich, I always forget which.

Geoffrey Hinton46:47

Great, yeah. I think he's a DeepMind guy.

Oriol Vinyals46:54

Yeah, he's a DeepMind guy. Great, so people can pause the papers. So let's maybe use the last 10 minutes for quite a change of gears. You've obviously seen the field of machine learning evolve a lot. Maybe one question I have, more on the recent past, is: you mentioned a few works, but in your opinion, what's the most impressive? Let's use two time frames, they're fairly short given the field. So what's the most impressive result that surprised you maybe the most, not just impressive but really a surprise, in the last say five years, and then ten years? Quick answer.

Geoffrey Hinton47:40

Okay. I'm not sure if machine translation fits into the last five years, but it's about five years ago. It really surprised me that backprop starting with random weights could learn to do machine translation that was better than what the phrase-based translation people had put a lot of work into. Initially it wasn't better; it was kind of comparable with phrase-based translation, but with a tiny fraction of the amount of human labor involved. The fact that that worked once you put attention in was just amazing to me. That happened much sooner than I thought. The reason it was amazing is because machine translation is a symbol string to symbol string problem. If there's anything those symbolic AI people and those linguists ought to be able to do, it's symbol strings in and symbol strings out, because they think it's all symbol strings in between. It turned out the first thing you do with a symbol string is you throw the symbols away and turn them into vectors. The last thing you do with all these vectors is turn them back into a symbol string. Symbol strings are only out there in the world, or in the auditory input and output. Inside, it's all vectors. So what amazed me was that the deep learning approach would beat them on that problem; you couldn't think of a problem that was more ideal for the symbolic approach. Once you had won on that, the game was over, I think.

Oriol Vinyals49:18

Great, I like that. Any other answers maybe from different time frames? It's fine.

Geoffrey Hinton49:26

More recently, the protein folding, because that's been a well-known problem for a long time. I think that's down to deep learning being very good, and there's a lot of good research at DeepMind. Demis is an amazing organizer of major projects. I always think of Demis as the Oppenheimer of deep learning, although he hasn't yet made killer robots. That's the message. And yeah, that's an example of something where you needed a lot of data, although as you said to me recently, it turned out it wasn't that much data; it was a lot of bootstrapping on unlabeled data.

Oriol Vinyals50:12

Yeah, indeed. It's surprising that a few, it's MNIST-plus-plus size, but it's not even ImageNet. That problem is interesting because it's all about pose estimation, so maybe there's even ways to improve it. We should talk offline maybe about that. And then there's things like GPT-3, the big language models done at Google and OpenAI and elsewhere. That's been amazing, I agree. For me personally, just how the same idea scaled up more just creates such a different qualitative sample, it's quite unbelievable. And that's the kind of thing you'd never discover working on MNIST.

Geoffrey Hinton50:54

Yes, yes. So that's the downside. But you have many good students around the world exploring all different things, which must feel quite good.

Oriol Vinyals51:06

So maybe the last topic on that note. Khipu is a community that tries to instigate more AI research in South America. You've seen AI be not only prime time but also a bit neglected, with approaches coming in and out of fashion. What kind of advice would you give? Our audience is mostly from South America. What kind of advice could you give at two levels? One is an individual tuning in who wants to make a career and lives in South America, where maybe there are not so many well-known centers for research yet. And maybe more interestingly, what could you recommend to someone from a government? Obviously CIFAR was very important, and probably you had a few things to say back then. But now, knowing what you know, what would you recommend governments invest in for the AI community to flourish in places like South America where it's just starting to happen more and more?

Geoffrey Hinton52:15

So I think there's one thing governments could do that would be a huge win for AI and for everything else: make people wear masks, particularly in Brazil. I must say I meant Latin America, of course, not South America. But they could do that in Mexico too, yes. So now I've got that out of my system. My advice to people, and I haven't really thought hard how this applies to people in Latin America and South America, is: get yourself to a good place where you're going to be surrounded by graduate students who are intelligent and thoughtful and know what they're doing, and will tell you when you're working on something that's already been done, and will tell you when you're working on something that's actually interesting. Get yourself an advisor who is competent but not necessarily famous. I think you get the most out of an advisor who has some evidence that they're good, but they still have plenty of time for you because they're not really well known yet. Those are the advisors who are probably the very best advisors. When I think back to when I was the best advisor for people, it was when I was like that, when I'd done some good stuff but before it was really fashionable. But I haven't really thought through how that applies to Latin America. The other advice is: trust your own intuitions. Now, intuitions come from a lot of experience. You shouldn't trust intuitions based on no experience, but get yourself a lot of experience. The experience you get is from tinkering around with things. For me, tinkering around with different ways of learning MNIST is most of my life's experience. I actually had a funny story from a student of mine called Roland Memisevic. He complained I spent too much time on MNIST. I agreed. I did spend quite a lot of time on MNIST, and I could actually prove it. 
Because if you told me a particular pixel was on in one of the test images, namely pixel 28 across and 13 down, I think that's the one, if that pixel's on, I can draw the rest of the image. I thought Roland would be very impressed by the fact that I knew this fact, and Roland said, "Get a life." So you literally overfitted to MNIST yourself. But I do have a piece of advice which probably isn't welcome. I believe people can be really successful at anything they choose, but just one thing, and that's the problem. I think to be a really good researcher, you have to be totally obsessive. To be a really good human being, you need other qualities. I don't think there's much point being a researcher unless you're going to be really good. It's like being a musician. A really good musician is worth thousands of not very good musicians, just because the music is disseminated. It's the same with research. So if you're going to be a researcher and you want to be a good researcher, you have to be really, really dedicated. You have to think about research most of the time. It's just unfortunate, but that's the way it is. It could be worse: in the Middle Ages, if you wanted to be a researcher, you had to be a monk.

Oriol Vinyals56:19

Great. So I think that's that. I might distill it, no pun intended since we both worked on distillation: obviously being extremely passionate, and I like the linking of researcher with musician and the amount of dedication that Geoffrey has had in his own life. That's very good advice to pass on. And maybe the other bit of advice that I really like, distilling that again, is to tinker with models. Geoffrey can do it, and does it all the time, even today. So if imitation learning is to be pursued, I would say never stop doing that. That's great advice. Even at the personal level, sometimes one forgets that the research is down there with the gradients and the activations and the histograms, and that you should look at all these things all the time. This was great, Geoffrey. Sadly, we only have time for this. So thanks a lot for your time, your passion, your backstories, your explanations of the new and the old algorithms. I've had a great time. Hopefully folks in the chat also had a lot of interesting discoveries from this chat. And of course, as Mary said, we will move to the social gathering where we might just hang out and keep discussions open. But thanks a lot, Geoffrey, for your insights. It's unbelievable. It's been a while since I talked to you, and I regret not trying to reach out more often. But yeah, this has been also great for me personally. So thanks a lot for your time, Geoff, again.

Geoffrey Hinton58:04

Well, thank you very much for inviting me. It was a lot of fun. And maybe in a year's time, we should do it again.

Oriol Vinyals58:09

Yes, yeah. Maybe in person if people wear masks.

Geoffrey Hinton58:12

Okay, great. Bye for now. Thanks.

Mary Fertinatu59:26

Hi, can you hear me now? Yeah, it's working now. Okay, so let's get to Gather Town. You can join through this link: kipu.ai/gather. Remember, some of our sponsors have booths there. I'm excited to also announce the next Khipu event. It's going to be on Monday, the 30th of August at 5 PM. The topic is going to be very related to the pandemic, so it's going to be on AI and biomedicine: lessons learned from the pandemic and future perspectives. We're going to have very prominent speakers: Gonzalo Moratorio from Uruguay and Pablo Velazquez from Colombia University. Sorry, I lost my notes with the speakers' bios. But Gonzalo has been recognized by Nature magazine for his very impressive work on how to fight the pandemic in Uruguay. Pablo Velazquez also has a lot of interesting work in biomedicine and its applications. We also have more speakers to be announced. So thank you all for participating. You can follow us and learn more about our events through our Twitter account or our webpage. You're also welcome to join our Slack through kipu.ai/slack. Also, we're going to share a link to a feedback form, as we are always looking for ways to improve. If you can take the time, please give us feedback about our social and also the chat today with Oriol and Geoff. I'll see you in the social. The link is in the chat. See you there shortly. Bye bye.