Geoffrey Hinton 0:00
Long and circuitous route. I'll only arrive at that again at the end. So there's been a battle going on for a long time about the nature of artificial intelligence and what that nature should be. There are two very different approaches, and they're very different ideologies. One is inspired by logic, and the idea is that the essence of intelligence is using symbolic rules to manipulate symbolic expressions. And the people who believed this believed it completely. They didn't think this was an empirical question. They thought that's the only way you can have intelligence. It was a sort of necessary truth about intelligence. There's a different set of people, equally ideological, who believed that if you want to make an intelligent system, you should look to biology and in particular try to understand how the brain does it. And it doesn't look as if the brains of animals are doing it by logic. Of course there are all sorts of debates about whether animals are intelligent, but the biologically inspired people thought that the essence of intelligence is learning the connection strengths in a neural net. And you can see these are two completely different worldviews, because they kind of talk past each other.
So this battle went on for a long time. In the 1960s, we had very simple neural nets and we had a simple learning algorithm. They only had one layer of features, and they were completely wiped out by symbolic AI. So when I started graduate school in the 1970s, the received view was that neural networks were the past, they would never do anything worthwhile, they were useless. That's what I was told, and that's what almost everybody believed. Then in the 1980s, various groups came up with the backpropagation procedure, which allowed neural networks to learn multiple layers of features. So we got over the problem that until then neural networks had hand-designed features. They weighted those features, and then they made a decision.
They didn't learn their own features. With backpropagation, we could learn layers of features. And there was a lot of hype about how this was gonna solve everything, and I believed in the eighties that what's happening today was going to happen then. And it didn't. In the 1990s, people showed there were other machine learning algorithms that worked better than backpropagation on modest-sized problems, things like support vector machines. They had fancier math and they worked a little bit better on what we now think of as very small problems, things with a few thousand training examples. So neural nets were wiped out again. Then in about 2006, we came up with some technical improvements to neural networks that made them a bit easier to train. But that wasn't the main point. Around that time, we got more computation and more data, and suddenly neural networks with lots of layers of features worked amazingly well. And now Google and Facebook and Microsoft and Apple and Amazon and all the other high-tech companies you can think of are betting the farm on neural networks. They believe that's the future, and there's a huge demand for students who know anything about neural networks.
The only way to make a computer do what you wanted was to write a program where you figured out how you would do it yourself, and then you explained to the computer in exquisite detail how to do what you would have done yourself, but a whole lot faster. So if I wanted it to sort a list, I would figure out how I would do it. I would figure out, well, I could go down the list and find the biggest thing and put that at the top, and then I could go down the list again and find the next biggest thing and put that underneath the biggest thing. It's not a good algorithm, but I might just like to do it like that. And then I could tell the computer to do that. And then the computer would sort the list. The new way to do it is: you first tell the computer to pretend to be a neural network, and then you just show it examples. You show it an unsorted list and the corresponding sorted list. You show it lots of examples, and you say, "I want you to take these things and produce those things." And after a while, it gets the idea. You don't have to write the program. Sorting lists is something that's difficult to do this way. Recognizing objects in images is something that's much easier to do this way.
So here's an example of something we'd like to do. People in classical AI tried to do this for 50 years and they didn't make much progress. The idea is you're given the RGB values of a whole bunch of pixels and you have to convert those to a caption that describes what's in the image. So you get all these numbers coming in and you have to have a string of words coming out. You can imagine how you might go about writing a program to do that, and you can imagine it's gonna be difficult. You would maybe first of all write a program to find little pieces of edge in the image, or maybe color contrast, and just doing the vision would be a lot of work. And then relating the vision to language is really complicated.
Now what we can do is we can solve that problem just by using machine learning with hardly any hand-programming. And by the end of the lecture, you'll understand how we do it.
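As an illustration of the "old way" described earlier in this passage, here is a minimal sketch of the hand-written sorting procedure: go down the list, find the biggest thing, put it at the top, and repeat for what remains. The function name `sort_biggest_first` is hypothetical and chosen only for this example; as the speaker notes, this is not a good algorithm.

```python
def sort_biggest_first(items):
    """Sort by repeatedly scanning for the biggest remaining item.

    This mirrors the spoken description: find the biggest thing, put it
    at the top, then find the next biggest and put it underneath.
    """
    remaining = list(items)  # copy so the caller's list is not modified
    result = []
    while remaining:
        biggest_index = 0
        # Go down the list to find the biggest remaining item.
        for i in range(1, len(remaining)):
            if remaining[i] > remaining[biggest_index]:
                biggest_index = i
        # Put it underneath everything already placed.
        result.append(remaining.pop(biggest_index))
    return result


example = sort_biggest_first([3, 1, 4, 1, 5])
```

Every step here is spelled out by the programmer, which is the contrast with the learning approach, where only input and output examples are supplied.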
Okay, is that louder? Okay. I think the mic is still not on. So I'm going to explain the basics. Many of you know this already, but I'll cover it anyway for those who don't. A neuron is a little computational device. It's a little bit like a real neuron, but much simplified. It takes some inputs, typically from other neurons. It has weights on the connections. It takes the inputs times the weights, adds them up to get the total input, and then it applies a nonlinear function. The one shown here is the one that's mostly used now, and that's the basic rectified linear unit. That nonlinear function simply says: if the total input is above threshold, give an output that grows with the amount by which it exceeds the threshold; if it's below threshold, don't give any output. So it's called a rectified linear unit. The question is, if you hook together a bunch of those guys and you figure out a way to set the weights on the connections, what can you do with it? So we typically hook them together by making multiple layers. If you look at a network like that, for the layers in the middle it's hard to figure out what the weights should be. When you adapt those weights, in effect you're deciding what features the neurons in the middle are to detect. So the problem of adapting the weights is basically the problem of deciding what intermediate features you want. Now I will show you a way of learning those features that's really stupid.
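The neuron just described can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the names `relu` and `neuron_output` are hypothetical, and the `bias` parameter is an assumption standing in for the (negated) threshold mentioned in the talk.

```python
def relu(total_input):
    """Rectified linear unit: no output below zero, linear above it."""
    return max(0.0, total_input)


def neuron_output(inputs, weights, bias=0.0):
    """Multiply inputs by weights, add them up, then apply the nonlinearity.

    `bias` is an assumed parameter: a bias of -t behaves like a threshold t.
    """
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(total)


# Total input is 1.0*0.5 + 2.0*0.25 = 1.0, which is above zero.
active = neuron_output([1.0, 2.0], [0.5, 0.25])
# Total input is 1.0*0.5 + 2.0*(-1.0) = -1.5, which is below zero.
silent = neuron_output([1.0, 2.0], [0.5, -1.0])
```

A layer is just many such neurons sharing the same inputs, and a multi-layer network feeds one layer's outputs in as the next layer's inputs.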
With supervised learning, we are given inputs and we're told what the right answer is, and then we get to adapt the weights one at a time. Suppose you want something to tell the difference between pictures of cats and pictures of dogs. We give it two outputs: one for cat, one for dog. If you show it a cat and the actual output says dog, that's bad. So we show it a bunch of pictures and measure how well it's doing. Then we can pick one of the weights in the network, change the value of that weight, and see if the network is doing better or worse than it was before. To see that, you need to show it a bunch of examples. If the change is good, we keep it and maybe go a little bit further. Now you can see that that procedure is going to have to evaluate a bunch of examples for each weight. If you've got a billion weights, it's going to be extremely slow. But that's what you want to do: change the weights only in the direction that helps. We're just trying to figure out how to do it efficiently.
Backpropagation, which many of you know and which I'm not going to explain, is just an efficient way of computing how to change the weights in the direction that will help reduce the discrepancy between what the network outputs and what you'd like it to output. The big difference is that if you perturb the weights, you measure the effect of a change on the network's output. With backpropagation, you compute the effect. So if you know all the weights in the network, you don't have to measure; you can compute. And when you compute, you can compute for all the weights at the same time what direction you should change them to help. You can't do that in evolution; you have to change things one at a time. So evolution can only measure. Evolution can't do that computation. Evolution changes a gene, and then there's a whole developmental process in the environment, and you can't compute the effect.
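The difference between measuring and computing can be sketched on the smallest possible case. The example below is an illustration under stated assumptions, not the lecture's code: it uses a single linear unit with squared error rather than a multi-layer network, so the "computed" version is just the chain rule, which is the one-layer special case of what backpropagation does. The function names and the tiny dataset are hypothetical.

```python
def loss(weights, examples):
    """Mean squared error of one linear unit over (inputs, target) pairs."""
    total = 0.0
    for inputs, target in examples:
        output = sum(w * x for w, x in zip(weights, inputs))
        total += (output - target) ** 2
    return total / len(examples)


def measured_slopes(weights, examples, step=1e-6):
    """Perturb one weight at a time and measure how the loss changes.

    Needs one extra pass over all the examples for every single weight.
    """
    base = loss(weights, examples)
    slopes = []
    for i in range(len(weights)):
        nudged = list(weights)
        nudged[i] += step
        slopes.append((loss(nudged, examples) - base) / step)
    return slopes


def computed_slopes(weights, examples):
    """Compute the slope for every weight in a single pass (chain rule)."""
    slopes = [0.0] * len(weights)
    for inputs, target in examples:
        output = sum(w * x for w, x in zip(weights, inputs))
        error = output - target
        for i, x in enumerate(inputs):
            slopes[i] += 2.0 * error * x / len(examples)
    return slopes


# Made-up data: three examples, two weights.
examples = [([1.0, 2.0], 1.0), ([0.5, -1.0], 0.0), ([-1.0, 0.5], -1.0)]
weights = [0.1, -0.2]
measured = measured_slopes(weights, examples)
computed = computed_slopes(weights, examples)
```

Both functions agree on the direction to move each weight, but the measured version costs one evaluation of the whole dataset per weight, which is what makes it hopeless for a billion weights.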
So what evolution did was come up with a brain that could use that information. So with backpropagation, you go forward through these layers, you compare with the correct answer, and you backpropagate signals, which allows you to compute for each weight what direction to change it in to make the network work better. And we thought in the eighties that the problem was solved. But it didn't work very well, and we didn't know why it didn't work. Most people thought this was a stupid idea. That is, the idea that you would take a network with random weights and just show it a lot of data, and that the resulting signals would be enough for it to design all its own feature detectors and do very well on things like machine translation, or recognizing objects in images, or speech recognition. Almost everybody thought it was completely crazy. A few of us held the opposite view and just thought, "No, this is obviously the right thing to do." And we discovered later that the reason it wasn't working was that computers were too slow and the datasets were too small. In the eighties, people would say, "Oh yes, sure, if you had a bigger computer and more data it would work, but look, it doesn't work." But that was right. When we got bigger computers and more data, it worked. Now there were some technical things too, but it was mainly the increase in computer speed and data size that mattered.
So I'm gonna show you a couple of examples before I get back to the issue of authorities. The thing that made the most impact was in 2012. There was a big public competition where you have a thousand different types of object and a million training images, and you have to learn to recognize the different objects. There's a secret test set, so you can't cheat. It's counted as right if you get the correct class in your top five bets, because it's not clear what the right answer is for some images. We have an image which has a Dalmatian and some cherries: should the answer be Dalmatian, or cherries?
One of those is right, and if you say both you get it right. A lot of the best existing computer vision methods were tried, and the results look like this: the results at the top are from the existing methods. The results below, from my graduate students at the University of Toronto, were 16 percent error, and the best computer vision groups were getting about 26 percent. So we almost halved the error rate, and that had a huge impact on computer vision. Within a year or so, all the leading people changed from doing handcrafted things to doing neural nets. So in 2011, if you submitted a paper using neural nets to a computer vision conference, it would be rejected because it was using neural nets. By 2015, if you submitted a paper to a computer vision conference, it would be rejected if it wasn't using neural nets.
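The "top five bets" scoring rule described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the class names and scores are made up for the example rather than taken from the competition.

```python
def top_k_correct(scores, true_label, k=5):
    """Return True if the true label is among the k highest-scoring classes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_label in ranked[:k]


def top_k_error(predictions, k=5):
    """Fraction of (scores, true_label) pairs whose label misses the top k."""
    wrong = sum(
        1 for scores, label in predictions
        if not top_k_correct(scores, label, k)
    )
    return wrong / len(predictions)


# Made-up scores for one image containing both a Dalmatian and cherries.
scores = {
    "dalmatian": 0.50, "cherry": 0.30, "grape": 0.10,
    "elderberry": 0.05, "bulldog": 0.03, "currant": 0.02,
}
cherry_in_top5 = top_k_correct(scores, "cherry")
cherry_in_top1 = top_k_correct(scores, "cherry", k=1)
error_rate = top_k_error([(scores, "cherry"), (scores, "currant")])
```

With this rule, a network whose first bet is "dalmatian" is still counted as right when the labeled class is "cherry", which is the point of allowing five bets on ambiguous images.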
It was obvious that if two grad students could beat the best computer vision researchers, who'd been doing this stuff for thirty years, then if you developed it a little further things would get better. By 2015, the error rate on ImageNet was about 5 percent, which is about human level. And it's not a simple dataset; the images are all quite different. I'll just show you some examples. There's an image on the right: a cheetah. If you look at what the neural net says, it's very confident that cheetah is the right answer, and it has a few other bets that are close, like leopard. That's a very sensible answer. This is not a typical image of a cheetah; the head is missing and it's a close-up. It's not a diagram of a cheetah. Here's a bullet train. You'll notice the bullet train probably occupies less than 10% of the pixels. The building in the background is much bigger, there's a platform, there's a person on the platform. But the neural net knows what the right answer is likely to be. It understands about the focus of images: you put the thing you want to attend to in the middle. And its other answers are sensible.
Here's one where it gets it wrong. The correct answer is a handbag. It thinks the best answer is scissors, and you can sort of see why it might look like scissors. I don't know what that thing is; I think it must be a chain on the handle. Basically, it's not scissors. Another of its bets is a stethoscope. The important thing here is that all the wrong answers are visually plausible wrong answers to give. It shows you that the neural net understands a lot about what things look like.
So that had a huge effect on computer vision. Then a few years later, some very good people, including Ilya Sutskever, who was one of the people who created that ImageNet system, decided to try neural nets on machine translation. Machine translation ought to be the perfect problem for symbolic AI, because people in symbolic AI would agree that symbols come in and symbols come out. The question is, what's in between? Is it just more symbols? So the classical idea is that symbols come in, and what's inside is sort of like symbols, sort of cleaned-up symbols. It's symbolic expressions and symbolic manipulations, and then another symbol string comes out. So it's symbols all the way through. The biological view would be: symbols come in, but symbols are just the way we code things to send them over a communication channel. Once the symbols come in, they get converted to things that are nothing like symbols.
The point about a symbol is that it basically has only one property: a symbol either is or is not identical to another symbol. So if you take two different symbols, the only thing that matters is: are they the same or not? All the other properties of the symbol are not relevant to the process. In neural nets, what happens is that a symbol comes in and gets converted into a big distributed vector that has all sorts of features, which capture lots of knowledge about the meaning of that word and what is similar to it. For example, Friday and Wednesday mean pretty similar things, but they are distinct symbols. There's no more internal structure to them. They're just distinct symbols, and that's all the symbolic processing system knows. A neural net learns that Wednesday and Friday are very, very similar things. So the symbolic approach to machine translation says: a symbol stream comes in in English, and we do some manipulations using rules, typically rules of grammar, and we also use semantics somehow, we do compositional semantics.
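The contrast drawn above between symbol identity and distributed vectors can be sketched in a few lines. This is an illustration under loud assumptions: the three-number vectors in `toy_vectors` are hand-picked for the example, not learned by any network, and the function names are hypothetical. Real systems learn vectors with many more features.

```python
import math


def symbols_match(a, b):
    """The only question a purely symbolic system can ask: same or not?"""
    return a == b


def cosine_similarity(u, v):
    """Similarity of two feature vectors: near 1.0 means very alike."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors (an assumption, not learned values).
toy_vectors = {
    "wednesday": [0.9, 0.8, 0.1],
    "friday": [0.8, 0.9, 0.2],
    "scissors": [0.1, 0.0, 0.9],
}

same_symbol = symbols_match("wednesday", "friday")
day_similarity = cosine_similarity(
    toy_vectors["wednesday"], toy_vectors["friday"])
unrelated_similarity = cosine_similarity(
    toy_vectors["wednesday"], toy_vectors["scissors"])
```

As symbols, "wednesday" and "friday" are simply different; as vectors, they sit close together while an unrelated word sits far away.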
That is, given what the individual pieces mean, we figure out what a group of pieces must mean. And then we're gonna produce symbols in the target language. And we're gonna use all the knowledge that linguists have about how language works. The neural network approach, and even I thought this was a bit crazy and overambitious, is to say: let's just take a whole bunch of pairs of original sentences and their translations. We'll stick in the English sentences one word at a time, and we're going to use a recurrent neural network, which has many, many loops inside. After we've finished feeding in the sentence, the recurrent neural network will have some particular state of activity, and we'll interpret that as what the neural net thinks the meaning is. That state of activity of the neural net is what represents the English sentence. And you'll notice, if you think about how the word "thought" works in English, I said...