Craig Smith 10:16
It was a serendipitous conversation I had with a linguist who was very kind to offer to interview me, or recruit me, you know, in that process of offering me a job at Princeton. She heard I was very interested in cracking a Holy Grail problem of computer vision: object recognition, the process of naming objects like cats, dogs, chairs, microwaves, cars, trees, and all that. And then she said, 'Have you heard of a linguistic project called WordNet?' I had never heard of WordNet before. But with a little bit of research, you realize WordNet was one of the most profound, important linguistic and natural language projects that emerged out of Princeton in the 1990s, which reorganized the entire English lexicon in a taxonomy that is different from typical dictionaries. The words are organized by their relationships, such as what we call the 'is-a' relationship: a German Shepherd is a dog, which is an animal. That was a large-scale project that really had a profound influence in linguistics and computational linguistics. And this researcher said to me, wouldn't it be nice if every node, or every set of words like German Shepherd or tree, in this WordNet dictionary or taxonomy of tens of thousands of entries, just had a picture attached to it, so that people who go to WordNet would know what a German Shepherd looks like, or what a panda bear looks like, or what a microwave looks like. And then she said she had tried this project with a bunch of undergrads at Princeton and it didn't go very far, for a couple of reasons. One is that it's not clear it's useful for linguistic research to have a picture attached to a word. The second is that it's really hard: the undergrads had to go through tens of thousands of entries and find the pictures, so it just didn't work. But that conversation really was like a spark of light in the darkness.
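As a rough illustration of the 'is-a' organization described above, here is a minimal Python sketch of a toy taxonomy with a picture slot on each node. It is not WordNet's actual data format or API; the names `IS_A`, `IMAGES`, `hypernym_chain`, `is_a`, and the image file name are all hypothetical, and only the German Shepherd, dog, animal chain comes from the conversation.

```python
# Toy "is-a" taxonomy: each concept maps to its direct parent (None = root).
# Hypothetical structure for illustration; not WordNet's real format or API.
IS_A = {
    "german shepherd": "dog",
    "dog": "animal",
    "animal": None,
}

# Hypothetical picture slot per node, as in the "picture attached to every
# node" idea. The file name is a placeholder, not a real dataset entry.
IMAGES = {
    "german shepherd": ["german_shepherd_example.jpg"],
    "dog": [],
    "animal": [],
}


def hypernym_chain(concept):
    """Return the concept followed by its ancestors, walking up the is-a links."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = IS_A[concept]
    return chain


def is_a(concept, ancestor):
    """True if `ancestor` appears strictly above `concept` in the taxonomy."""
    return ancestor in hypernym_chain(concept)[1:]


if __name__ == "__main__":
    print(hypernym_chain("german shepherd"))
    print(is_a("german shepherd", "animal"))
    print(IMAGES["german shepherd"])
```

Storing only the direct parent keeps each entry small; broader questions such as "is a German Shepherd an animal?" are answered by walking up the chain.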
I had been struggling to try to make object recognition work, and suddenly I was thinking there's a role that data can play in a way that we never paid attention to. We had paid so much attention to tuning the parameters of our models, but mathematically we were always running into the problem of overfitting our models without enough data, and a lack of generalization. These are kind of jargon words in machine learning, but they point to the mathematical fact that models are hard to fit and one needs lots of good data to drive the model. And that was an insight that wasn't prominent in the field yet. I realized maybe we should try something radically new: instead of spending time tweaking the parameters, we should create a large database of pictures of many, many tens of thousands of different kinds of objects, drive the capacity of the models to a whole different state, and see how that goes for this important problem of object recognition. So I asked the linguistic researcher, 'Is anybody doing this project?' She said no. She said, 'In fact, we had a name for it and it's called ImageNet, but it's a terminated project.' So I said, 'Would you mind if I started it, but in a completely different way, for my computer vision research? And I really like the name; can I inherit the name ImageNet?' She said, 'By all means, just take it.' So that's really the beginning of the ImageNet story. It was during my transition to Princeton. I moved my small, tiny lab to Princeton in 2007 and we began the ImageNet project. The idea was that we would take 22,000 nouns that are countable and concrete in the WordNet taxonomy, nouns that are conceptually visual: not nouns like 'love,' which is harder to visualize, but nouns like 'chair.' We wanted to provide hundreds or thousands of pictures per concept, from all kinds of sources, to capture the diversity and variability of each concept.
So if you multiply those numbers together, we're looking at tens of millions of curated pictures. And to do that, we had to download nearly a billion pictures from the internet, and the downloading process itself was very interesting in 2007. Then we needed to find a way to curate them, and we struggled a lot. First, we also went to the undergrads and tried to entice them to label, and that was just impossible. At an hourly rate, with undergrads labeling a billion images, my PhD student at the time did a back-of-the-envelope computation and said, 'I won't graduate for another 19 years if we do this.' We also tried ways to get computers to label, but that was philosophically the wrong way to do it, because we were trying to curate a ground-truth training data set to improve the computer's ability. If we used anything that was based on the existing ability of computers, we would introduce very low-quality, erroneous data. So we had to go to humans. And in early 2007, or maybe in the middle of 2007, another serendipitous hallway conversation changed everything. It was with a master's student who happened to come from Stanford and was at Princeton, and he said to me, 'Have you heard of Amazon Mechanical Turk?' And I said I never had. He said, 'I heard there's a Silicon Valley startup that didn't have enough people to label some data (I forgot what kind of data, some color or wine bottle tags, one of those things), and they used this very new online worker service that Amazon had beta tested. It's a global online worker market; people just post jobs and people worldwide do the jobs.' I remember I was very busy teaching that day, and I went home at night and logged into Amazon Mechanical Turk. That night I knew ImageNet would happen, because I had never seen a platform of that kind: what we now call crowd workers, who can contribute on a global scale.
Fast forward to 2009, we rolled out ImageNet as a research paper in our community. By that time, we had almost 60,000 online workers from more than 150 countries working and contributing to the curation of ImageNet. In the meantime, we open-sourced it to the research and education community, and we felt strongly that this was a path to our North Star, or one of our North Stars. And in order to create this path and invite more researchers worldwide to participate with us, we had to make an international challenge, so that we could roll out a benchmark that would test our research community. So in 2010, we started what we call the ImageNet Challenge benchmark, which invited research teams worldwide to come and work on the problem of object recognition, and we would release the challenge results annually and have an international workshop to talk about the results. In 2010 and 2011, there was some progress, but it was slow and not significant. And in 2012, I remember the deadline for the challenge was in late summer because we wanted to announce the results at the international workshop of our annual computer vision conference, which rotates around the world. That year it was in Florence, Italy, in September or October, so we needed to process the results in late summer. I got a call at night from my graduate student, who said, 'We've got a remarkable result, and we need to check if this is wrong,' because the error rate had been cut in half from the previous year. And they also made a comment that it used an algorithm that we had known for 30 years, and we hadn't realized this algorithm could be this powerful. It turned out to be the entry to the 2012 ImageNet Challenge by Geoff Hinton and his students, and it was the winning entry of the ImageNet Challenge that year. A small story there: I think I was lucky to realize that was a historical moment.
So I wasn't going to go to Florence, Italy that year to attend the conference because my child was very small; I was a nursing mom. But I realized it was so significant that I bought a last-minute ticket to Italy. I squeezed into the middle seat, cursing my middle seat the entire trip. I was in the air back and forth for probably 40 hours; I was on the ground for less than 18 hours, just to go there, announce the result, and lead the workshop, because it was just so significant. And I remember at that workshop, Geoff Hinton didn't come, but his student Alex came, and a young Ilya Sutskever also came, and a number of very prominent AI and computer vision researchers came. And there was palpable energy in the room. You know, researchers' energy is not everybody clapping; it's really thinking deeply about this result and debating, with some people even expressing skepticism and pointing out the potential pitfalls. So there was a lot of discussion. But within a couple of months, Geoff Hinton published this ImageNet paper and it went viral, and that was the paper that, eight years later, got them the Turing Award. And that algorithm was the backpropagation algorithm.
Is that right?