Satya Nadella 13:33
I really want to talk with you all today about just a couple of simple things. What's driving all of this progress? Why is all of this happening right now? Part of it is we're riding an extraordinary platform wave—something is fundamentally changing in the universe of technology, much in the same way that it changed when we were going through the PC revolution, where Moore's Law was driving an incredible increase in the power and lowering of the cost of personal computing, which led to it becoming ubiquitous. A similar thing happened with the internet revolution, where networking technology connected all of this compute together and allowed us to do things that previously were unimaginable. And we're going through one of those major technological changes right now, partly driven by the incredible scaling of the capability of AI systems as you apply more compute and more data to training them. But before we get to that expansion of the frontier, a super important part of the emergence of a new powerful platform is completing the stack. It's actually hard work—even when you have a piece of technology that is improving at an exponential rate—to figure out how to do all of the things that have to be done in order to deploy it in real applications so that you can go out and deliver value to real customers. We've done a huge amount of work over the past year on the Copilot stack—both optimizing a bunch of systems so things are getting cheaper and more capable, and building that whole cloud of capabilities, systems, services, and tools around the core AI platforms so that you all can build the things that matter to you.
One of the reasons that we have been able to do this is no other company has deployed more generative AI applications over the past year than Microsoft has. You have probably heard us talking about all of these different Copilots—this new software pattern that we originated with GitHub Copilot, where you pair powerful generative AI with this user interface paradigm where you're using the AI to help assist users with tasks. You can apply this to everything, and many of you in the audience are building your own Copilots. Microsoft itself is building Copilots for service, for sales, a Copilot in Bing, Copilot in Edge, Copilot in Windows. The reason we've been able to do all of this work is because we have the Copilot stack that we built for ourselves—to have real agility in getting these products built quickly, built efficiently, price and cost optimized, and built in a way where they're safe and secure. One of the things you'll be hearing a lot more of at Build is that part of what the Copilot stack is allowing us to do is to unify the experience across all of these Copilots into one logical Microsoft Copilot, where you don't have to really pay attention to which Microsoft product or service you're in—the Copilot just understands all of your context and delivers all of the capability of the model in the context of your data and your task to you when you need it.
The other thing that is really driving progress is not just this completion of the Copilot stack: we are also riding a fundamental wave in the development of this AI platform. If you just look at compute over time, meaning how many GPU cycles or accelerator cycles we're using to train the very biggest models in the world, the amount of compute applied to training has been increasing exponentially since about 2012. And we are nowhere near the point of diminishing marginal returns on how powerful we can make AI models as we increase the scale of compute. So we're sort of doing two things at once at Microsoft: optimizing the current frontier and building that toolkit to help you all leverage it, while at the same time investing at a pretty incredible rate in pushing the frontier forward.
One of the super interesting things that has just happened as we're pushing the frontier forward is what our partner OpenAI launched last week in GPT-4o. As mentioned earlier, GPT-4o is a stunning achievement: a multimodal model that understands a bunch of different input types from video to text to speech, that can respond in a bunch of rich ways from text to speech and eventually video, and that can respond to users in their applications in real time. In the case of the ChatGPT demos that folks have seen, you can even interrupt the model so that you can have really fluid interactions with these systems. An enormous amount of work has gone into GPT-4o, both the model itself and the supporting infrastructure around it, to ensure that it's safe by design.
I wanted to also just remind folks that this efficiency point is real. While we're building bigger supercomputers to get the next big models out and to deliver more and more capability, we're also grinding away on making the current generation of models much more efficient. Compared with roughly a year and a half ago, it's 12 times cheaper to make a call to GPT-4o than to the original GPT-4 model, and it's also six times faster in terms of time to first token response. It's just really extraordinary how much progress we're making because of the full set of optimizations: the silicon we're building, networks, data center optimization, as well as an incredible amount of software work on top of all of this hardware and infrastructure to really tune the performance of these systems. The great thing is, again, there's no point of diminishing marginal return here. One of the messages that I want to land with you all today is that you can count on things getting more robust and cheaper at a pretty aggressive clip over time. It's a really important thing to internalize, and it's something we challenge ourselves on at Microsoft all the time: aim for things that are really truly ambitious, because all of this optimization work is going to accrete to make things really ubiquitous in terms of how you can go deploy them.
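To make the cost and latency figures above concrete, here is a minimal Python sketch of the arithmetic. The 12x and 6x factors are the ones quoted in the talk; the baseline cost and time-to-first-token values are hypothetical placeholders chosen only for illustration, not published prices or measurements.

```python
# Factors quoted in the talk (GPT-4o relative to the original GPT-4).
COST_REDUCTION = 12   # a call is 12 times cheaper
TTFT_SPEEDUP = 6      # time to first token is 6 times faster


def project(baseline_cost: float, baseline_ttft_s: float) -> tuple[float, float]:
    """Scale a baseline cost and time-to-first-token by the quoted factors."""
    return baseline_cost / COST_REDUCTION, baseline_ttft_s / TTFT_SPEEDUP


# Hypothetical baseline: a workload that cost 1200.0 (any currency unit)
# and had a 3.0 second time to first token on the older model.
new_cost, new_ttft_s = project(1200.0, 3.0)
print(new_cost, new_ttft_s)
```

Under these assumed inputs, the same workload would cost 100.0 units and start responding in 0.5 seconds.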
I have a quick little demo video here, so let's roll the video. [Demo plays: A woman asks for help debugging Python code using GPT-4o's vision and voice capabilities. She shows her code via phone camera, the model identifies a bug (using extend instead of append) and suggests a fix. After she corrects the code, it runs successfully.] I mean, it really is extraordinary. I should say, by the way, that Jennifer would never make that actual mistake in writing a Python application, but Kevin might. I do want to make sure that we're paying attention to just how much has changed over the past year. A year ago, it would have been absolutely inconceivable that what you just saw would actually work. This was not a tortured demo: Jennifer showed me this last night and then she just recorded it. It's just crazy that it works this well.
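The code from the demo is not reproduced in this transcript, so the following is only a sketch of the general class of bug it describes: calling `extend` where `append` was intended. The function names and sample data are hypothetical.

```python
def collect_with_extend(rows):
    """Buggy version: extend() adds each element of a row individually."""
    collected = []
    for row in rows:
        collected.extend(row)  # flattens the row into the outer list
    return collected


def collect_with_append(rows):
    """Fixed version: append() adds each row as a single item."""
    collected = []
    for row in rows:
        collected.append(row)
    return collected


rows = [[1, 2], [3, 4]]
buggy = collect_with_extend(rows)
fixed = collect_with_append(rows)
print(buggy)  # [1, 2, 3, 4]
print(fixed)  # [[1, 2], [3, 4]]
```

With `extend`, the row structure is silently lost, which is why this kind of mistake tends to surface later as a confusing error rather than at the line that caused it.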
Another set of things that have been making a huge amount of progress is what's possible with smaller models. We have been working for a while on this series of models called Phi that are small language models. Satya mentioned this in his keynote earlier—imagine an efficient frontier. Usually when you're building these models, you're trading size off—which is related to performance and cost and a whole bunch of other things—versus quality. The smaller the model is, the cheaper it is to do inference and the less compute you need to run the model, so small models are more amenable to running on devices, but it usually means you take a hit on quality. What we're discovering, particularly over the past year, is that there's this notion of an efficient frontier. We don't even show the GPT-4 point on this slide—it would be way, way off to the right in terms of size. If you want extreme levels of quality and performance, a frontier model is your friend, but in some cases you may want to choose one of these other models somewhere on this efficient frontier where the trade-off between cost to serve, latency, or locality is acceptable given the quality you can get. The very interesting thing that's been happening is the quality you're able to achieve in these small models is getting pretty high.
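One way to picture the efficient frontier described above is as a Pareto frontier over model size and quality: a model is on the frontier if no other model is both smaller (or equal in size) and better. The sketch below assumes that reading. The model names, sizes, and quality scores are invented placeholders, not figures from the slide.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    size_b: float   # parameter count in billions (hypothetical)
    quality: float  # benchmark score, higher is better (hypothetical)


def efficient_frontier(models: list[Model]) -> list[Model]:
    """Return models not dominated by a smaller-or-equal, better model."""
    frontier = []
    best_quality = float("-inf")
    # Walk from smallest to largest; keep a model only if it beats
    # the best quality seen among all smaller models.
    for m in sorted(models, key=lambda m: (m.size_b, -m.quality)):
        if m.quality > best_quality:
            frontier.append(m)
            best_quality = m.quality
    return frontier


def smallest_meeting(frontier: list[Model], min_quality: float) -> Optional[Model]:
    """Pick the smallest frontier model that reaches the quality bar."""
    for m in frontier:  # frontier is already ordered by size
        if m.quality >= min_quality:
            return m
    return None


candidates = [
    Model("small-a", 4, 69.0),
    Model("small-b", 7, 62.0),   # larger and worse than small-a: dominated
    Model("mid-a", 14, 78.0),
    Model("mid-b", 13, 75.0),
    Model("large-a", 70, 79.0),
]
frontier = efficient_frontier(candidates)
choice = smallest_meeting(frontier, min_quality=76.0)
print([m.name for m in frontier])
print(choice.name if choice else None)
```

In practice the trade-off has more axes than size and quality, such as cost to serve, latency, and whether the model can run locally, but the selection logic is the same: find the cheapest option that clears the quality bar for the task.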
Think back, in what now feels like ancient history, to the launch of ChatGPT in November of 2022. ChatGPT launched on top of GPT-3.5, and everybody was absolutely gobsmacked at what was possible. Fast forward a few months to March 2023, and ChatGPT gets an upgrade to GPT-4, which is even more extraordinary: you're able to ask extremely complicated questions and get very rich, interesting, compelling completions. Now fast forward to today: a version of Phi-3, optimized and running on a mobile phone, can respond to a prompt just like ChatGPT could just a year or so ago, with responses that are roughly equivalent. This is not arguing that Phi-3 running on this device is just as powerful as GPT-4. It is not. But the way you all should be thinking about it is: in many cases, these models can be appropriate to use for building your applications when you have a particular set of constraints that you're trying to optimize towards.
I wanted to really motivate why this matters with the following example. Satya mentioned earlier the partnership that Microsoft has formed with Khan Academy. Khan Academy's mission is really interesting and important: they are trying to ensure that every learner on the planet, no matter where they are, has access to high-quality, individualized instruction. One of the things we are exploring together with Khan Academy is the possibility of achieving that goal of ubiquitous personalized learning agents by using something like Phi-3, where you can imagine training a Phi-3 model that's very good at something like math instruction. This is an actual interaction with Phi-3 medium that has been fine-tuned to work particularly well for math tutoring. The challenge with doing something like this is that you don't just want the model to give the student an answer; you want it to lead them towards discovering the answer themselves. A tutor is very different from an answer agent. It's exciting to think about how many tools organizations like Khan Academy now have to pursue these really important missions. So with that, I'd love to bring Sal Khan from Khan Academy onto the stage.