William Dally 6:48
So I think everybody will agree that there's probably no technology shaping the future more today than that of deep learning, of AI. And I'm going to tell you a story about the technology behind that, which is the computer hardware, and the evolution of that hardware that has allowed AI to shape the future. So let's start with a little bit of history. You see this, I can walk around here and use this to advance. So the modern revolution in deep learning really was created by three ingredients. First, algorithms, as sort of indicated by this picture from the AlexNet paper in 2012. So if you look at AlexNet, it's a CNN, and it was trained with gradient descent and backpropagation, and all of those algorithms were around in the 1980s. So the algorithms part of deep learning was largely solved, you know, 40 years ago. The next issue is you need enough data to train these networks, and large-scale datasets like the ImageNet dataset, which this is kind of a pictograph of, were around in the late 2000s, by 2008, 2009, and Pascal in 2005. So the data and the algorithms were there. You think of that as the fuel and the air. And they were waiting for a spark to ignite them and really light off the AI revolution. And that spark was enough computing performance to train a large enough neural network on a large enough dataset in a reasonable amount of time. In the case of AlexNet, it was two Fermi-generation NVIDIA GPUs that were used to train AlexNet on the ImageNet dataset in about two weeks. And so it was the GPU, the computing, that was the missing ingredient that really was responsible for the AI revolution. Now since GPUs enabled deep learning, GPUs have been pacing the progress of deep learning. This chart shows the demand in compute power, in operations, to train a state-of-the-art model from AlexNet in 2012 to one of the GPT models in about 2024, and actually the latest frontier models are about an order of magnitude more than this.
And this is an increase of 10 million in 10 years in the capability required to train a deep learning model. And at NVIDIA, we feel very responsible for not holding up progress by, you know, making sure that we can keep our GPUs growing fast enough. And so I'm going to tell you a little bit about how we met this tremendous increase in demand during an era when Moore's Law is largely dead, where we're really not getting very much from process technology.
So this chart sort of shows our progress in GPU performance over the last 12 years. It's a 5,000x increase in performance over those years. To put some pictures with this, this starts with Kepler. It's actually one GPU after the one that was used to train AlexNet. That was the Fermi generation. Kepler came out in 2012 and it had a performance of about four teraflops. We made progress along the way and I've kind of shown the key points of it. In the Maxwell generation in about 2015, we added support for 16-bit floating point, which was what everybody was using to train neural networks, either FP16 or BF16, and we added a dot product instruction. I'll explain in a minute why those were important, and we wound up getting up to about six teraflops. The Pascal generation took it to 20. A big jump happened with the Volta generation in 2017 where we added a matrix multiply instruction, the HMMA instruction, a half-precision matrix multiply accumulate. It takes two 4x4 FP16 matrices, multiplies them together and sums them into an FP32 matrix. So it's 128 arithmetic operations, 64 multiplies, 64 adds, done with one instruction. And what this does is it amortizes the overhead of the instruction. Complex instructions are a good idea when it comes to overhead. And by the way, the marketing people didn't like the idea of an MMA instruction, so they called these tensor cores, which sounds way cooler. In the Turing generation, we added the IMMA instruction. In the Ampere generation, we added support for sparsity. I'll explain that in a little bit more detail later. In Hopper, FP8, and then in Blackwell, NVFP4, and I'll talk a little bit about some of the scaling technologies behind NVFP4. So if you look at where this 5,000x comes from, very little of it comes from process technology. The processes that these different chips are in are actually color-coded on this chart. The black ones, Kepler and Maxwell, being 28 nanometer.
The green ones here, Pascal and Volta, were 16 nanometer, and actually Turing as well. Ampere was seven, and then Hopper in five and Blackwell in four. And if you look at the gains in performance due to that process technology, and I used performance per watt as the way to attribute it to process technology, from 28 nanometer down to 4 nanometer, you would have expected the gain to be roughly the ratio of those two numbers, or something on the order of seven. But no, it's 3x. We got 3x out of process technology over this period of time, and it's because these numbers are a little bit made up. So the difference in the metal pitch, which to me is the thing that actually matters, between 28 nanometers and 4 nanometers is actually a little bit less than 3x. And so you're not really getting a huge gain from process. Now you're getting 3x out of the 5,000x. The single biggest thing that made a difference during this period of time was number representation. And I'll go into this in a little bit more detail later, but in Kepler, we had 32-bit floating-point numbers for the bulk of the math. That's because the chip wasn't designed to do deep learning. It was designed to do graphics and high-performance computing. And for graphics, we needed FP32. And for high-performance computing, we needed FP64. And so all the deep learning done in Kepler was done in FP32. And you got four teraflops. If you instead of doing a 32-bit operation, you do a 16-bit operation, that's actually four times less energy. And it's because the dominant operation is the multiply, and the multiply scales quadratically. Think about doing long multiplication. Every bit of one operand has to be ANDed with every bit of the other operand to create partial products, and you have to sum those partial products together. So it's a quadratic scaling. And so by going from FP32 to FP16, we gained 4x, and then we gained another 4x going to eight, and another almost 4x going down to four. 
And so that winds up giving us a 32x gain over this period of time. The other thing that gave us a big gain here was complex instructions. Let me flip to this chart to explain those. So GPUs actually have a very efficient execution pipeline because they don't do any speculation. So they're not guessing and then having to throw away work. They don't do any branch prediction. So they don't have to access several big tables and do lots of computation just to figure out where the next instruction is. It's a very simple pipeline. Even so, the cost, and I've normalized everything to 45 nanometers here, the cost of fetching the instruction, decoding it, and fetching the operands is about 30 picojoules. In contrast, doing a single FMA instruction is about 1.5 picojoules. So the overhead here is 20x. It's a 2,000% overhead. And by the way, if this were a CPU, you could add two zeros to that. It wouldn't be 20x, it would be 2,000x because the cost of doing the instruction fetch and decode is 100x more in a CPU. In Maxwell, we basically added a dot product instruction. Instead of doing a single multiply and a single add, it did four multiplies and four adds. So we're now up to six picojoules. The overhead's only 5x. Still, we're doing five times as much work on unproductive stuff, fetching and decoding instructions, as productive stuff, math. Once we added the HMMA instruction, it flipped things around. Now most of the work is actually going to the productive stuff, to the math. Because we're doing 128 operations, 110 picojoules in the math and only 30 picojoules in the overhead. We're at 22%. With the IMMA instruction, we got to 16%. The GPUs we're shipping today, the overhead is about 11%. And this comes from defining complex operations. Let me flip back to the overall chart. I'll talk about it a little bit later, but the matrix multiplies we're doing with these neural networks are naturally sparse, and all GPUs since Ampere have supported 2:1 sparsity.
We basically can carry out twice the effective number of operations for the same energy cost because we're not having to pay to multiply by those zeros in the sparse matrices. And then in Blackwell, we kind of cheated a little bit. In the Blackwell GPU, what we call one GPU is actually two reticle-sized, reticle-limited die. And so the die size here is giving us a factor of two. And if you multiply all those factors together, you get the 5,000. So where did we wind up? Well, this is a Blackwell GPU. It's pretty impressive. It's two reticle-limited die. The communication technology that connects them I'm very proud of because it was developed in our circuit research group within NVIDIA Research. And around those two reticle-limited die are eight stacks of HBM4 memory. That's 10 terabytes per second of memory bandwidth, and this winds up giving a tremendous amount of performance. It has NVLink networking coming out of it at 1.8 terabytes per second, so we can connect these up. And that's very important today because even for inference on the large language models, the model doesn't fit on one GPU, and so you have to spread it across many GPUs. You need that very high bandwidth, very low overhead interconnect to make this work in practice.
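The instruction-overhead argument from this part of the talk can be checked with a few lines of arithmetic. This is only a sketch: the picojoule figures are the rounded ones quoted above, and the function and constant names are made up for illustration. With a flat 30 pJ of overhead the HMMA case comes out nearer 27% than the 22% quoted, so the number on the slide presumably rests on a slightly different overhead figure than the rounded "about 30".

```python
# Energy figures as quoted in the talk (normalized to 45 nm), in picojoules.
OVERHEAD_PJ = 30.0  # instruction fetch, decode, and operand fetch (rounded)

# Math energy per instruction, as quoted in the talk.
MATH_PJ = {
    "FMA (1 multiply, 1 add)": 1.5,
    "dot product (4 multiplies, 4 adds)": 6.0,
    "HMMA (64 multiplies, 64 adds)": 110.0,
}


def overhead_ratio(math_pj, overhead_pj=OVERHEAD_PJ):
    """Overhead energy as a multiple of the energy spent on useful math."""
    return overhead_pj / math_pj


def mma_op_count(n=4):
    """Arithmetic operations in an n x n matrix multiply-accumulate:
    n**3 multiplies plus n**3 adds."""
    return 2 * n ** 3


for name, math_pj in MATH_PJ.items():
    print(f"{name}: overhead is {overhead_ratio(math_pj):.0%} of the math energy")
print("operations in one 4x4 HMMA:", mma_op_count(4))
```

The point of the exercise is that the overhead is roughly fixed per instruction, so the only way to shrink the ratio is to do more math per instruction.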
Now, I showed you that the demand went up by 10 million and we got 5,000 out of a single GPU. So we're clearly short, on the order of 2,000x. And where do we get that 2,000x? Well, it comes from running many GPUs in parallel. And so there are a lot of different axes under which you can exploit parallelism. The easy way to exploit parallelism is called data parallelism. It's a dimension into the slide here. And what you do is you take your training set and you take different parts of the training set and you run them on different GPUs. And then after every what's called a batch of the training set, those GPUs exchange the weight updates so that every GPU learns not just from the part of the training set they did, but from all of the parts of the training set that each plane of these GPUs did. And that lets you do some amount of scaling, but to really get a lot of GPUs working on the problem, you then have to also take a single instance of the model and break it across GPUs. And you can do that in a couple ways. We do what's called tensor parallelism, which is really taking a matrix and slicing it, usually in one dimension. So maybe I take this matrix and I slice it into four bands and I put each of them onto four different GPUs. And now when I do the matrix multiply, each of those four GPUs does a quarter of the computations, the ones that use the part of the matrix they have. And then you have to do a reduction at the end to combine the results of those to get the final matrix multiply. The other dimension that you can do is what's called pipeline parallelism. If you look at a modern large language model, it has a bunch of layers, and in each layer you have a bunch of matrix multiplies for what are called the feed-forward parts of the transformer, and then you have the attention parts. And you can take that pipeline either by layer or by individual matrix multiplies and spread them across GPUs. 
And so by doing this, we can now get that remaining 2,000x we need by running across thousands of GPUs in one solution, one training run, or typically tens of GPUs for one inference run. And so the technology we have to do that today is in the NVL72 cabinet. This has 72 Blackwell GPUs, 36 Grace CPUs, and they're connected electrically within this cabinet. And we've packed them very tight together so we can keep those electrical interconnections short and basically be able to signal at 200 gigabits per second on an electrical cable without the attenuation of that becoming so excessive we can't detect the bits at the far end. And then to run very large jobs, we connect a bunch of those together. This is mixing generations. This is one generation back, using NVLink and NVSwitch to scale up to about 256 GPUs, and then scaling out using, typically this says InfiniBand, but we typically use Ethernet these days, to tens, in some cases hundreds of thousands of GPUs in a single network. And one technology that's been very effective in making this work is that in our switches, both the NVLink switches, the Quantum switches, and the Ethernet switches, our Spectrum switches, we actually build in reduction operations. So we can do, for example, an all-reduce, and that cuts the communication requirements in half because now we only have to sort of communicate everything into the switch. The switch does the reduction and distributes the sums back out, or otherwise we'd have to do one set of communications, do the summing on the GPUs, then do a second set of communications to distribute those results. So that's the hardware side of the equation.
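Tensor parallelism as described above (slice the matrix into bands, let each GPU work on its band, then reduce) can be sketched in plain Python. This is a toy illustration, not how any real framework implements it: the "devices" are just loop iterations, and the function names are hypothetical.

```python
def matvec(x, w):
    """y[j] = sum over k of x[k] * w[k][j], for a K x N weight matrix w."""
    n_cols = len(w[0])
    return [sum(x[k] * w[k][j] for k in range(len(w))) for j in range(n_cols)]


def tensor_parallel_matvec(x, w, n_devices=4):
    """Slice w into n_devices row bands; each 'device' multiplies its band by
    the matching slice of x, and a final reduction sums the partial results."""
    k = len(w)
    if k % n_devices != 0:
        raise ValueError("this sketch assumes K is divisible by n_devices")
    band = k // n_devices
    partials = []
    for d in range(n_devices):  # each iteration stands in for one GPU
        rows = slice(d * band, (d + 1) * band)
        partials.append(matvec(x[rows], w[rows]))
    # The reduction step: on real hardware this is the all-reduce, which the
    # talk says can be carried out inside the switch.
    return [sum(p[j] for p in partials) for j in range(len(w[0]))]


x = list(range(8))
w = [[(3 * k + j) % 5 for j in range(3)] for k in range(8)]
print(tensor_parallel_matvec(x, w, n_devices=4) == matvec(x, w))
```

Each device does a quarter of the multiplies, and the only communication is the final sum over partial results.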
The other thing which makes the NVIDIA GPUs really good at deep learning is software. And there's two ways to think about this. The first one is providing a complete solution. And we at NVIDIA, we kind of got into the deep learning game early. I started a project in NVIDIA Research in late 2010 that turned into the cuDNN software. And as a result, we then built on top of that, and we now have vertical solutions ranging from Modulus, which I think has been renamed to Physics NeMo these days, which does physics simulations, Clara in healthcare, DRIVE is our autonomous vehicle program, Metropolis is smart city, and so on. Those are all layered on top of a set of libraries we have that basically extend CUDA to do different numerics, to do AI, and then basically run on all different types of hardware. So whatever somebody's end application is, whether it's in physics or healthcare or autonomous vehicles, we have a whole vertical stack they can start with, and in many cases can finish with. They can just use that software out of the box and solve their problem. The other half of the software issue is performance. And the best way to judge performance is to look at somebody's independent benchmark, because everybody will cherry-pick some result and say, 'Oh, my performance is great on this one result.' There's an organization called MLPerf. It's a not-for-profit organization that benchmarks AI solutions, and universally, NVIDIA tends to come in at the top of the pack. This is a result from, I guess, November last year, where basically NVIDIA wins every MLPerf training benchmark. And I think there's actually been one since that we also won all of those, and actually many of the people who make great claims don't even bother showing up at these benchmark competitions. And it turns out that over time things get better as well. So when the H100 first came out, the 3.1 MLPerf benchmark results were due like a week later. 
And so we didn't have very much time to tune the software. We basically took it out of the box, ran it, we got a certain result. The 4.0s were six months after that. That gave us six months to tune the code. And in six months, running on exactly the same hardware, we got between two and a half and 3x performance improvements. And that shows that it's a non-trivial amount of software effort to actually get the potential of these GPUs to be realized.
So what am I worried about these days? Well, back then if you just ran the large language model, you were doing great. Today that's kind of the table stakes. And the typical AI system looks more like this, where you have a number of agents. You'll actually have some problem you're trying to solve, whether it's trying to do some decision management where you're accessing some big table of data and trying to answer some question or optimize some process, and you'll create a team of agents, each of whom has a certain role, and they'll communicate among each other. They'll also have their own state, memory, and some policy they're executing. They may be able to call tools. It turns out if you want to get a large language model to do calculations, you're better off having a dumb large language model that has a calculator than having a smart large language model. So having the appropriate tools to solve the problem is very important. And then the core technology is some large language model, which you may even be running off-premises. If you want to run one of the state-of-the-art foundation models, you're just calling it in the cloud via an API. And so this changes the computing required a lot, because now you're not just feeding one string of characters in, a string of tokens in, and getting a string of tokens out. But very often you'll give a command to this set of agents and you'll go away, and a day later come back after it's been iterating some design or some solution to a problem through many iterations and get the answer. And so it's created a lot more work from one request than you typically had. The other thing that creates a lot more work from one request is a trend toward reasoning models. It used to be you put the input string into the large language model, you get an output string out. Now, the large language model has to think about it a little while, and you very often have a linear chain of thought where it'll produce one intermediate string. 
It will then consume that, thinking a little bit, producing another one, and go through many steps before you finally get the output. Or in some cases it's a tree of thought where you'll get different alternatives and it will prune some of these off, and you'll get one path through the tree before you finally get your output. This again increases the amount of work and increases the latency to get a given response. This is on top of the fact that large language models themselves have this very latency-sensitive characteristic. You can take what's happening in a large language model and think of it as having a prefill phase, which is very parallel. You basically have the whole input string that you can process all at once. It gives you a lot of work to do. It tends to be then compute-limited because you can share the weights you're reading across all those tokens that are being processed simultaneously. So if the query is 'apple or fruit' and say that's four tokens, you can process all four of those tokens simultaneously. More typically, the input string can be anywhere from a few thousand tokens to millions of tokens. That's called the context. And so this prefill phase is very compute-limited, completely latency-insensitive. It's how much math do you have. It's also memory-bandwidth-insensitive. But then you get the first token out. And then you have to run that through the whole LLM. And going through the whole LLM, you have to read every weight, and you get to use that weight in exactly one multiply operation because you're multiplying it by that one token. And then you get the next token, and then you've got to go through the whole thing again. And so this decode process, which is autoregressive, you produce one token at a time, you have to read the whole model to produce each of those tokens, is very latency-sensitive and it's very memory-bandwidth-limited because you have to read all of the weights to produce each token.
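The prefill versus decode contrast can be made concrete with a very rough timing model. Everything numeric here is an assumption for illustration, except the 10 terabytes per second of memory bandwidth, which is the figure quoted earlier in the talk. The model size, weight width, and math rate are hypothetical, and attention and the KV cache are ignored entirely.

```python
def step_time(n_weights, tokens, bytes_per_weight, flops_per_s, bytes_per_s):
    """Time for one pass over the weights with `tokens` tokens in flight.

    Each weight is read once and used in one multiply and one add per token,
    so compute grows with the token count while memory traffic does not.
    Returns (seconds, which resource is the limit)."""
    compute_s = 2 * n_weights * tokens / flops_per_s
    memory_s = n_weights * bytes_per_weight / bytes_per_s
    limit = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), limit


N_WEIGHTS = 70e9         # hypothetical model size
BYTES_PER_WEIGHT = 1     # hypothetical 8-bit weights
FLOPS_PER_S = 1e15       # hypothetical math rate
BYTES_PER_S = 10e12      # memory bandwidth figure quoted in the talk

for label, tokens in (("prefill, 4096-token context", 4096), ("decode, one token", 1)):
    seconds, limit = step_time(N_WEIGHTS, tokens, BYTES_PER_WEIGHT, FLOPS_PER_S, BYTES_PER_S)
    print(f"{label}: {seconds * 1e3:.2f} ms, {limit}-limited")
```

Under these assumptions prefill is limited by math and decode by memory bandwidth, which is the asymmetry the talk describes.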
So this drives a lot of concerns, and rather than go through the text, let me show it to you graphically. This is another independent benchmarking organization, run by SemiAnalysis, their InferenceX benchmark. And this shows two different NVIDIA GPUs here, the GB300 NVL72 and the B300 here, and then brand A GPUs in the little blue down there. And we tend to judge the inference performance on two axes. If you're not worried about interactivity or latency, what matters is the y-axis here. And this is basically tokens per second per GPU. And very roughly that translates into the metric that most people really worry about, which is tokens per dollar. And then the x-axis here is interactivity. It's tokens per second per user. Because to get a lot of tokens per second per GPU, you need to take a lot of users' queries and batch them all together. And that then allows you to get reuse on that decode part because you can actually reuse the weights across each user. Although when accessing the KV cache, each user has their own KV cache. And so if you look at what it takes to make these curves look good, if you have a large enough batch size, all it takes to move up is more math bandwidth. You just need to do more computations per second. But to move to the right is a lot harder. It takes two things. First, you need a lot of memory bandwidth because you have to read every weight for every token. And when you start to get down to this part of the curve here, you wind up basically being down to a batch of one. You're running one user and having to read the entire model for that user. And then you also need really low communication latency because between each of these stages you're communicating typically from one GPU to another. If you have a lot of overhead on that communication, that's driving your latency up. And in our Blackwell generation, we're seeing numbers like 250 here.
Our aspirations are to see numbers like 2,500 or 5,000 here, and that requires us to really push very hard on the design of these GPUs to get more effective memory bandwidth and reduce latencies.
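The two things needed to move to the right on that chart, memory bandwidth and communication latency, can be put into a toy model of a batch of one. This is a sketch under loud assumptions: the model size, the number of GPU-to-GPU hops per token, and the per-hop latencies are all hypothetical, and the function name is made up. Only the 10 terabytes per second per GPU comes from the talk.

```python
def tokens_per_s_per_user(model_bytes, n_gpus, bytes_per_s_per_gpu,
                          n_hops, hop_latency_s):
    """Batch-of-one decode rate when the model is spread over n_gpus.

    Every token requires reading all the weights once (split across the GPUs'
    combined memory bandwidth) plus n_hops GPU-to-GPU communications."""
    read_s = model_bytes / (n_gpus * bytes_per_s_per_gpu)
    comm_s = n_hops * hop_latency_s
    return 1.0 / (read_s + comm_s)


MODEL_BYTES = 70e9   # hypothetical
BANDWIDTH = 10e12    # per-GPU memory bandwidth quoted in the talk
N_HOPS = 160         # hypothetical: two communications per layer, 80 layers

for latency_us in (5.0, 1.0):
    rate = tokens_per_s_per_user(MODEL_BYTES, 8, BANDWIDTH, N_HOPS, latency_us * 1e-6)
    print(f"8 GPUs, {latency_us} us per hop: {rate:.0f} tokens/s per user")
```

Once the weights are spread over enough GPUs, the read time shrinks and the fixed per-hop latency becomes a large share of the time per token, which is why low-overhead interconnect matters for interactivity.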
And while we do this, we have to be very careful not to specialize too much because we constantly get thrown curveballs by very clever software people who come up with new models. This is the DeepSeek V3 model that was introduced a little over a year ago. And when this came out, I think people had known before about mixture-of-experts models, so that wasn't something new. Google had been doing mixture of experts in various ways for easily eight or 10 years before this. But that was something that had not been in a lot of these very large language models before. And then it had a different approach to attention where they sort of projected down to a smaller place, what was called MLA attention. And because of that, if you had your solution too specialized, it was too hardwired, you wouldn't be able to track advances like this that come along in the software. So where are we going to go with future GPUs? We're going to go in a couple different directions. And one way of prioritizing where we want to spend our energy is to look at where the energy gets spent in the GPU. And so this is a pie chart, not actually of a GPU, but of a special accelerator we built in NVIDIA Research to do deep learning. And what you see is the bulk of the energy, this big blue part here, is going into doing math. That's data flow path plus MAC. That's basically doing all the math. So if you want to improve things a lot, you better make your math more efficient. And to make the math more efficient, there are a bunch of things we can do. Part of them is better number representation. I already talked about that a little bit, but I'll talk about three aspects of that here: going to logarithmic numbers, vector scaling, and optimal clipping. The other thing you can do concerns the rest of this pie, which by the way is largely moving data around, much of it under the guise of memory. But if you think about memories, it doesn't actually cost any energy to actually operate a bit cell.
It's a negligible amount of energy. All the energy in memories is reading and writing the memory. It's moving data. And of these arcs here, the green one is actually data movement itself. The other four, the blue, orange, gray, and yellow, are different types of reading and writing memory, which are just moving data around within those memories. And so there are things we can do better there as well. If we can organize the computation, what's called tiling, to order the things to do less data movement and less reading and writing, we can wind up reducing those. We can also do a lot of circuit tricks to make memories and communication better.
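Tiling, as mentioned above, can be illustrated by counting how many matrix elements have to be fetched from the large memory. This is a toy sketch for square matrices with hypothetical names. It models a local buffer that holds one tile of each input, and it is not a description of how any particular GPU or accelerator schedules its work.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices a and b one tile at a time.

    Returns (product, number of elements loaded from the large memory).
    A tile of a and a tile of b are loaded once and then reused for every
    output element they contribute to. tile=1 models no reuse at all."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Fetch one tile of each operand into local storage.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += sum(len(r) for r in a_tile) + sum(len(r) for r in b_tile)
                # All math below touches only the local tiles.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[k] * b_tile[k][j] for k in range(len(b_tile))
                        )
    return c, loads


n = 8
a = [[(i + 2 * j) % 7 for j in range(n)] for i in range(n)]
b = [[(3 * i + j) % 5 for j in range(n)] for i in range(n)]
for tile in (1, 2, 4):
    _, loads = tiled_matmul(a, b, tile)
    print(f"tile={tile}: {loads} element loads")
```

The arithmetic is identical in every case; only the number of reads from the large memory changes, dropping in proportion to the tile size.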
Let me start with number representation. And so here are a bunch of different ways of representing numbers. With integers, we weight each bit by a power of two. And so we get a particular set of scaling. We can take something weighted by a power of two and add an exponent to it, sort of the scientific notation in binary form, and that gives us a floating-point number. With log numbers, we get rid of the mantissa part and we simply have an exponent. Although to make that be effective, it can't be an exponent of two because then the jumps between the numbers are too big. So it's typically a fractional power of two. Maybe 2 to the 1/8 or 2 to the 1/16th. We can use a symbol table. So for example, if I have an 8-bit number, I could basically have a table of 256 things and I can put my actual values that I can represent anywhere I want on the number line. And so whatever I come up with to represent my numbers, I really want to ask two questions about it. The first is what is the cost, and the second is what is the accuracy. And the cost has two aspects to it. One is the operation energy. What does it cost to do adds and multiplies on these numbers? And the other is the movement energy. What does it cost to move them? And the movement energy is an easy question to answer. It's just the number of bits. Your communication channels, your wires don't care what those bits represent. Moving a bit costs the same no matter what it is. For the accuracy, there are two aspects. The first is dynamic range. How big a range of numbers can I represent from the largest to the smallest? And the other is what is the error? What is the biggest error I can have in representing a value I want to the nearest representable number? And so here is how a bunch of different number systems compare to that. I'm going to ignore the spiking for now. If somebody wants to bring that up in the Q&A, they can. What you'll see is that integers have very, very bad accuracy.
What you really care about here is the blue line, which is the maximum error. So bad max error and poor dynamic range. Log and floating-point are very comparable in what they have. These are all in the 8-bit range. Log is slightly better, and I'll show a graph that explains why that is. And they have both way better dynamic range, the gray bar, than integer, and way better error. The blue bar moving to the right means the error is smaller. And then symbol table has this great property. Here's a plot from a paper that Song Han and I published a number of years ago where if you actually prune a network to exploit sparsity, you wind up getting a distribution of values that's bimodal like this because you've taken all the values near zero and you've pruned those out. And so if you were to represent, say, a four-bit number system and use integers, you'd get evenly spaced values like the x's here. And what you see is you're wasting a lot of your symbols out here in the outlier areas where nothing interesting is happening. And then you only have a few symbols under these lobes where all the interesting stuff is happening. But if on the other hand you build a symbol table where you can basically take, I get 16 values, I can put them wherever I want. I can put them under the lobes where the interesting things are happening and wind up getting a much smaller mean squared error in representing this distribution of data. And in fact, you can use backpropagation to train those symbol tables, which is one of the things that was in this paper that was at ICLR about a decade ago. Of course, if you don't do pruning, your distributions look more like this. But still, what you can take away from this is most of the interesting values are near zero and you have very few values out in these tails, and that's something we'll come back and exploit later.
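The symbol-table idea can be sketched by comparing 16 evenly spaced values against a 16-entry table fitted to a bimodal distribution. Note the substitution: the talk mentions training the table with backpropagation, whereas this sketch uses a simple one-dimensional k-means (Lloyd) iteration for illustration. The data is synthetic and the function names are hypothetical.

```python
import random


def quantize(values, levels):
    """Map each value to the nearest representable level."""
    return [min(levels, key=lambda q: abs(q - v)) for v in values]


def mse(values, levels):
    """Mean squared error of representing values with the given levels."""
    quantized = quantize(values, levels)
    return sum((v - q) ** 2 for v, q in zip(values, quantized)) / len(values)


def train_codebook(values, levels, iterations=20):
    """Lloyd iteration: assign each value to its nearest level, then move
    each level to the mean of the values assigned to it."""
    levels = list(levels)
    for _ in range(iterations):
        buckets = [[] for _ in levels]
        for v in values:
            idx = min(range(len(levels)), key=lambda i: abs(levels[i] - v))
            buckets[idx].append(v)
        # A level that attracted no values stays where it was.
        levels = [sum(b) / len(b) if b else q for b, q in zip(buckets, levels)]
    return levels


# Synthetic bimodal "pruned weights": two lobes, nothing near zero.
rng = random.Random(0)
values = [rng.gauss(-0.5, 0.1) for _ in range(1000)]
values += [rng.gauss(0.5, 0.1) for _ in range(1000)]

lo, hi = min(values), max(values)
uniform = [lo + i * (hi - lo) / 15 for i in range(16)]  # evenly spaced, like int4
table = train_codebook(values, uniform)

print(f"evenly spaced MSE: {mse(values, uniform):.2e}")
print(f"symbol table MSE:  {mse(values, table):.2e}")
```

Starting from the evenly spaced levels, the fitted table moves its entries under the two lobes, so its error can only go down relative to the even spacing.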
Now, the optimum way of representing numbers if you don't use a symbol table is to do a logarithmic representation. For those of us who started school at a certain period in time, we're very familiar with devices like this. This is a slide rule, which is a device that would let you do a multiply by basically doing a logarithmic lookup where the log table was basically engraved on the scale of this device. You would put one on one of your numbers you're trying to multiply, find the other one, and read the answer off. And log numbers have these wonderful properties of having a high dynamic range and very, very good accuracy. So the way to think of this log number is you need a sign bit because you need to represent both positive and negative numbers, and then you have some exponent bits that we'll call the integer exponent bits that are to the left of the binary point, and it's basically 2 to the ei. But you also need some exponent bits to the right of the binary point so that you can represent jumps between numbers that are smaller than factors of two. So in this case we'll have three fractional and four integer exponent bits. And this gives us a good dynamic range and pretty good worst-case accuracy. So the way to think about the accuracy here is that if you have an integer representation, and this is int4, so you get 16 values, and these are the numbers you're trying to represent, and then the closest representable number sort of on the y-axis, you get big steps. So you're at one and then you still call it one until you get to one and a half, then boom, you have to jump to two. And the problem with that is at small numbers your error is large. It's 33% there. In the worst case, your error is a third of the value you're representing. And then as you get to the big numbers your error is very small. But remember most of the numbers we're trying to represent are small.
So averaged over a typical distribution you have a very large worst-case error and a very large MSE. With log numbers, you get this great property that the proportional error is the same everywhere. You start out small and the errors are small as you step up because you're jumping up by a certain fraction. I think this was L2.2, so you have two bits to the left of the binary point, two bits to the right. So you're jumping up by a factor of 2 to the 1/4 each time. And so you multiply by 2 to the 1/4, and then do that again, and do that again, and do that again, and each time you do it, the error is proportional to the value you have. It's a constant 9% error across the whole range rather than being really big at the low numbers and really small at the high numbers. Now the reason why log numbers haven't caught on is that floating-point numbers are almost as good. They do a compromise between these two things where the log numbers are jumping up by a proportional amount each time. The floating-point numbers start with that same small bump. They do that, say, four times here if you have two mantissa bits, and then you bump the exponent and the error jumps a bunch, and you do that four times and the error jumps a bunch. You do that four times. So it's not as good, right? In this case it's a 13% error versus 9%. But it gives you that same error which is roughly proportional to your value across the entire range. The other really nice thing about log numbers is that multiplies are really cheap. Remember the slide rule, to multiply two log numbers together what you do is you add them, and that basically creates the multiplication. The problem is that adds are hard. If you have two numbers in the logarithmic representation to do an add, you actually have to convert them back to integer, do the add, and then convert them back to log. The good news is that typically when you're doing a deep network, you're not adding two numbers, right?
Because if you did that, you'd be having to do this conversion every time. You're adding tens or hundreds of thousands of numbers together. And therefore, if you can basically do that conversion outside of the loop, basically refactor the loop, so you do all the adds and then you do the conversion, and then you do the conversion back, you can save that. And I'll refer you to this US patent application that sort of...
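The error comparison between integer, floating-point, and log representations described above can be checked numerically. The following is a minimal sketch, not the speaker's code: the function names are hypothetical, it assumes round-to-nearest, and it measures error as |q(x) - x| / x. The exact percentages depend on those conventions, so they need not match the slide's figures; what the sketch shows is the ordering.

```python
import math

def log_quantize(x, frac_bits):
    """Round positive x to the nearest 2**(k / 2**frac_bits), rounding in the log domain."""
    steps = 2 ** frac_bits
    return 2.0 ** (round(math.log2(x) * steps) / steps)

def float_quantize(x, mantissa_bits):
    """Round positive x to the nearest m * 2**e, where m has mantissa_bits fraction bits."""
    e = math.floor(math.log2(x))
    steps = 2 ** mantissa_bits
    m = round(x / 2.0 ** e * steps) / steps  # mantissa in [1, 2]; 2 rolls into the next exponent
    return m * 2.0 ** e

def int_quantize(x):
    """Round x to the nearest integer."""
    return float(round(x))

def worst_relative_error(quantize, lo=1.0, hi=8.0, samples=20000):
    """Largest |q(x) - x| / x over an evenly spaced grid of inputs in [lo, hi]."""
    worst = 0.0
    for i in range(samples):
        x = lo + (hi - lo) * i / (samples - 1)
        worst = max(worst, abs(quantize(x) - x) / x)
    return worst

# Two fractional exponent bits versus two mantissa bits versus plain integers.
log_err = worst_relative_error(lambda x: log_quantize(x, frac_bits=2))
float_err = worst_relative_error(lambda x: float_quantize(x, mantissa_bits=2))
int_err = worst_relative_error(int_quantize)
print(f"log: {log_err:.3f}  float: {float_err:.3f}  int: {int_err:.3f}")
```

Under these assumptions the log format comes out near 9%, the floating-point format a few points worse, and the integer format near 33%, with the integer error concentrated at the small values.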
The integer part of the exponent is easy to handle. That's just a shift. You simply take one, shift it by the integer part, and then bin it into one of these partial accumulators depending on what the fractional part is. All the ones that are going to be multiplied by, let's say we have eight of these, all the ones that are going to be two to the 1/8 get summed together, two to the 2/8 get summed together, and so on. And then after you've summed 100,000 numbers, all you have to do is at one point do the conversion, which is basically looking up this constant which is two to the 0, which is one, two to the 1/8, two to the 2/8, and so on, multiply by that, and then sum the final things in. But because you've amortized that to the very end, it's a negligible cost. So log numbers are great. They've not been hugely commercially successful because floating-point numbers are almost as good and people are very familiar with those. But whatever number system you pick, you want to pick the range optimally. And let me explain what I mean by that.
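The binned accumulation described above can be sketched as follows. This is an illustration under stated assumptions, not the hardware design or the method in the patent application: the names `to_log`, `log_dot`, `FRAC_STEPS`, and `BIAS` are hypothetical, eight fractional steps are used as in the example, and a fixed bias keeps every shift non-negative.

```python
import math

FRAC_STEPS = 8   # three fractional exponent bits: steps of 2**(1/8)
BIAS = 32        # assumed fixed-point offset so small products still shift left

def to_log(x):
    """Encode nonzero x as (sign, exponent in units of 1/FRAC_STEPS)."""
    sign = 1 if x > 0 else -1
    return sign, round(math.log2(abs(x)) * FRAC_STEPS)

def log_dot(a_codes, b_codes):
    """Dot product of two log-encoded vectors using one partial accumulator per fraction."""
    bins = [0] * FRAC_STEPS
    for (sa, ea), (sb, eb) in zip(a_codes, b_codes):
        e = ea + eb                        # multiplying log numbers is adding exponents
        q, r = divmod(e, FRAC_STEPS)       # integer part q, fractional part r in 0..7
        if q + BIAS < 0:
            raise ValueError("product too small for the assumed BIAS")
        bins[r] += (sa * sb) << (q + BIAS) # integer part is just a shift of +1 or -1
    # One conversion at the very end: scale each bin by its constant 2**(r/8).
    total = sum(b * 2.0 ** (r / FRAC_STEPS) for r, b in enumerate(bins))
    return total / 2.0 ** BIAS

a = [0.5, -1.5, 2.0, 0.25]
b = [1.0, 3.0, -0.75, 4.0]
a_codes = [to_log(x) for x in a]
b_codes = [to_log(x) for x in b]
result = log_dot(a_codes, b_codes)
```

The inner loop contains only an integer add and a shift; the eight multiplications by constants happen once per dot product, which is why their cost is amortized away for long sums.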
So suppose I want to represent this distribution. The way we really started out doing things is we would say, okay, I've got to represent numbers from minus 0.8 to 0.8. So I'm going to scale whatever number system I have, in this case, I think it's again a four-bit integer, to fit minus 0.8 to 0.8. So I'll compute a scale factor that scales all the numbers to fit in that range. But by doing that, I've unconsciously made a trade-off in a bad direction because there's two sources of noise here. There's quantization noise, which is really the difference between the number you're trying to represent and what you quantize it to. It's the size of the gap between these points that I can represent. Then there's clipping noise, of which there's zero on this side because I'm not clipping anything. But on the other hand, if I simply say, okay, you guys out here are outliers, I'm not going to worry about you. I'm going to basically take everything from minus 0.2 to minus 0.8 and treat it like it was minus 0.2, and then I can basically put all 16 of my values that I can represent under the part of the curve which is dense. So I'm creating a bunch of noise by not representing the outliers as well. But in exchange for that, I'm making my quantization noise much smaller.
And you can plot what happens on typical weights from typical layers of interconnection of neural networks. And I'm going to use one of the low ones here so I can point to it. What's happening out here is at this far right side, I'm doing no clipping at all. Right? So this is all quantization noise. But as I move to the left, I'm increasing my clipping noise, but it's not hurting very much. But I'm decreasing my quantization noise, which is gaining a lot. Look at the four-bit line. I come down here, and when I get to this point here at about seven or eight, I hit a minimum. And what's happening at that minimum is that my quantization noise and my clipping noise are the same. I'm trading them off. If I go any further, the clipping noise is going to start getting bigger and I'm going to start losing and losing very rapidly. But right at this point, if you walk across, I'm getting almost what I would get with six bits with four by clipping it optimally. And this optimal clipping is the subject of a paper that we published a couple years ago. It winds up basically making NVFP4 work by letting us get out of four bits what you typically require six or so to get.
So then the question is how do you do this clipping? So if you're doing this for weights, it's easy. You can basically solve this by summing over all the weights, figuring out what the optimum value is, and then scaling by that. If you're doing it with activations, which you also want to do, you have to do it quickly. And the closed-form solution is up on top here. You have to solve that integral, but it turns out there's actually an iterative way of estimating this which is shown on the bottom that you can do fairly quickly. So with a couple iterations you get close enough to get optimal clipping, and this winds up making four bits almost as good as six.
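The trade-off between quantization noise and clipping noise can be seen with a brute-force sweep over clipping thresholds. This is only a sketch: it is not the closed-form or iterative method from the slide, the weights are assumed to be Gaussian, and the quantizer is an assumed symmetric 4-bit integer grid. All function names are hypothetical.

```python
import random

def quantize(x, clip, bits):
    """Symmetric uniform quantizer: values beyond +/-clip saturate at the end levels."""
    levels = 2 ** (bits - 1) - 1           # e.g. integer levels -7..7 for 4 bits
    step = clip / levels
    q = max(-levels, min(levels, round(x / step)))
    return q * step

def mse(values, clip, bits):
    """Mean squared error, which includes both quantization and clipping noise."""
    return sum((quantize(v, clip, bits) - v) ** 2 for v in values) / len(values)

def best_clip(values, bits, candidates=50):
    """Try evenly spaced thresholds up to the full range and keep the lowest-MSE one."""
    full = max(abs(v) for v in values)
    clips = [full * (i + 1) / candidates for i in range(candidates)]
    return min(clips, key=lambda c: mse(values, c, bits))

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.2) for _ in range(5000)]   # assumed bell-shaped weights
full_range = max(abs(w) for w in weights)
clip = best_clip(weights, bits=4)
print(f"full range {full_range:.3f}, best clip {clip:.3f}")
print(f"MSE unclipped {mse(weights, full_range, 4):.2e}, clipped {mse(weights, clip, 4):.2e}")
```

For weights this sweep is affordable because it runs once offline; for activations it is too slow, which is the motivation for the faster iterative estimate mentioned above.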
The next trick we played to make NVFP4 actually work is something called vector scaling, which is basically a paper we published in MLSys back in 2021. The reference is shown down here. And the way to think about it is when you're doing this scaling, you have to decide what granularity you're going to scale. And traditionally what we did is we would scale a whole layer of a neural network together. We take all the weights of the layer and decide how to scale them. But if you have a big distribution you have to scale, you wind up not being able to precisely represent things. And as you pick smaller granularities, you can basically get tighter distributions and represent things much more tightly, whether or not you did the optimum clipping. And so what we wind up settling on is scaling order of 16 numbers. Every 16 numbers we'd have a scale factor. And the trade-off here is as you get a finer granularity, you get a tighter distribution and better scaling, but you have more overhead for carrying the scale factor along. With 16, an NVFP4 number is really four and a half bits because it's carrying an 8-bit FP8 scale factor along with every 16 numbers, and so it's carrying half a bit per number of that overhead. But this winds up allowing us to get on the order of another bit out of a number system.
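The granularity trade-off can be sketched by comparing one scale factor for a whole tensor against one scale factor per block of 16. This is a simplified illustration with hypothetical names: it assumes a symmetric 4-bit integer grid and an unquantized scale factor, whereas NVFP4 uses a 4-bit floating-point element and an FP8 scale, and the data are assumed Gaussian.

```python
import random

def quantize_block(block, bits=4):
    """Scale a block so its largest magnitude hits the top level, then round."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in block) / levels or 1.0   # fall back to 1.0 for an all-zero block
    return [round(v / scale) * scale for v in block]

def quantize_blocked(values, block_size, bits=4):
    """Quantize with one scale factor per block_size consecutive values."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size], bits))
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def bits_per_value(bits, scale_bits, block_size):
    """Storage cost per number once the shared scale factor is amortized."""
    return bits + scale_bits / block_size

rng = random.Random(0)
values = [rng.gauss(0.0, 1.0) for _ in range(4096)]
mse_tensor = mse(values, quantize_blocked(values, len(values)))  # one scale for everything
mse_block = mse(values, quantize_blocked(values, 16))            # one scale per 16 values
print(f"per-tensor MSE {mse_tensor:.4f}, per-16 MSE {mse_block:.4f}")
print(bits_per_value(4, 8, 16))  # 4 bits plus an 8-bit scale shared by 16 values
```

The per-block version wins because the largest value in a group of 16 is usually much smaller than the largest value in the whole tensor, so the same 4-bit grid covers a tighter range.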
The next trick to play is exploiting sparsity. This is a figure from another paper that Song and I wrote about a decade ago where it turns out you can lop most of the connections out of a neural network and not lose any accuracy. And so we wrote this paper back in 2015 and for the next four or five years I tried to figure out how to actually make this work well in practice because the problem is the computation on the left is very regular and hardware likes things that are regular. You just make the numbers kind of march by each other and they multiply and everything is good. And when you start introducing sparsity, now things are irregular. There's sort of control to be done and things don't hop by each other in such a regular way. And I could tell you about the numerous papers I published on ideas that are actually bad for how to try to exploit sparsity. What actually wound up working was what we call structured sparsity where you force the sparsity to fit a pattern so it becomes regular again.
So you start out and you say okay here are my weights, dense trained weights, and then what I do is I'm going to knock the ones that are near zero all the way to zero. That's the pruning, but I'm going to do it in a way that I restrict it so that I can only have two non-zeros out of every four. And then I'll retrain the weights. I'll basically fine-tune by training the algorithm again to get the weights that remain to pick up some of the slack from the ones that went away. And then what I do is I basically compress out all the zeros and I add this metadata that says where the non-zeros are. And now I basically can take only the non-zero values and I take the input activations and I use the same metadata that says okay the non-zeros are in positions zero, three, five, and six and select only those in. So now I have the red ones aligned with the green ones and I just do a dense multiply and everything becomes regular again. And this winds up basically giving us a factor of two performance boost on everything since Ampere.
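The prune, compress, and select steps above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the fine-tuning step is omitted, and positions are stored as absolute indices for readability, whereas real hardware would keep a small index within each group of four.

```python
def prune_2_of_4(weights):
    """Keep the two largest-magnitude weights in each group of four.

    Returns the compressed non-zero values and the metadata giving their positions.
    Assumes len(weights) is a multiple of four.
    """
    values, positions = [], []
    for base in range(0, len(weights), 4):
        group = weights[base:base + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        for i in sorted(by_magnitude[:2]):      # keep survivors in their original order
            values.append(group[i])
            positions.append(base + i)
    return values, positions

def sparse_dot(values, positions, activations):
    """Use the metadata to select matching activations, then do a dense multiply."""
    selected = [activations[p] for p in positions]
    return sum(w * a for w, a in zip(values, selected))

weights = [0.9, 0.01, -0.02, -0.7, 0.03, 0.5, -0.6, 0.0]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
values, positions = prune_2_of_4(weights)
print(positions)                                   # [0, 3, 5, 6]
print(sparse_dot(values, positions, activations))
```

The example weights are chosen so that the surviving positions are zero, three, five, and six, as in the talk. After selection the multiply loop is half as long and completely regular, which is where the factor of two comes from.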
Let me see how I'm doing on time here. So I'll briefly talk about accelerators. So one way we prototype a lot of ideas for what we want to do in future GPUs at NVIDIA is building accelerators and we've done a series of accelerators over the years, some in collaboration with academics at Stanford and MIT. SCNN was one of these failed attempts to exploit sparsity which basically operated by taking all the non-zeros of the activations and all the non-zeros on the weights and multiplying them all by each other and then sorting it out on the output, which is exactly the wrong thing to do because on the output first of all you have greater precision. You typically have lower precision on the input side. You start with four-bit weights and activations. Then you'll multiply them and you wind up with eight-bit results. You have to sum before you re-quantize down to four bits. And then the other is you wind up having to do what we called a scatter-add to basically take these results and add them into a memory array and that was an expensive operation. So that wound up being almost exactly the wrong way of doing it.
But if you look at these accelerators, you have to ask why are they better than doing it on a programmable engine and is there a lesson for us in designing GPUs that we can exploit. So the first way accelerators really get performance is by eliminating instruction overhead. Even for GPUs which have very efficient pipelines, you saw that trying to do a single multiply-add you had 20x as much energy fetching and decoding instructions. This is a result, the numbers for a very simple CPU, an ARM A15. A state-of-the-art x86, the numbers are even bigger, but these are from this particular paper which I think was in 2008. And basically the overhead here wasn't 200% overhead, it was 250 picojoules versus 32 femtojoules, right? So it was tens of thousands. And what we can do with an accelerator is get rid of all that overhead. Get rid of that fetch and decode and speculation and use just the energy on the math.
The other thing with accelerators is you have to sort of understand what everything costs. And this is a table from a paper my colleague Mark Horowitz had at ISSCC where he basically, for 45 nanometers, sort of tallied what different things cost. And the real takeaway here is that low-precision math is way cheaper than high-precision math. I said that earlier. It's a quadratic thing. But just reading even a small SRAM, reading 32 bits out of a small SRAM is more expensive than doing a 32-bit floating-point operation and moving things even a few millimeters is more expensive than that. So data movement is way more expensive than doing the math. And so you need to keep things very local. One of my rules is sort of the order of magnitude rule of locality which is if I'm reading something out of a small SRAM, say an 8 kilobyte SRAM, that's about five picojoules per word. If I have a big SRAM that's spread out over the whole chip, that's about 50 picojoules per word. Now it turns out that it's the same SRAM. You build a big SRAM out of small SRAM banks. And so of that 50 picojoules, five picojoules is reading the word out of the bank. And the other 45 picojoules is getting the address to that bank and getting the data back. It's all data movement. In fact, even in the local SRAM read, it's all data movement. Reading the cell is a negligible amount of energy. Then if I have to go off chip, it's another order of magnitude.
So one of our more successful accelerator projects was something called MAGNet where we basically tried to look at what is the most efficient way of organizing a floating-point accelerator and a lot of this was iterating over different ways of doing the tiling, different ways of what a lot of people call the data flow. Do we want to keep the weights stationary and move the activations or keep the activations stationary and move the weights or keep the outputs stationary? And so we came up with this sort of weight-stationary pseudo-output-stationary approach to doing this which wound up being the right compromise and we wound up building a prototype of this accelerator that wound up getting about 100 TOPS per watt with a 50% dense four-bit input.
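The difference between these data flows is a difference in loop order. The sketch below shows generic weight-stationary and output-stationary loop nests for a matrix multiply; the function names are hypothetical and this is not the hybrid scheme the accelerator described above actually used, only the two textbook extremes it compromises between.

```python
def matmul_weight_stationary(acts, weights):
    """Hold one weight in place and stream every activation row past it."""
    m, k, n = len(acts), len(weights), len(weights[0])
    out = [[0.0] * n for _ in range(m)]
    for kk in range(k):
        for nn in range(n):
            w = weights[kk][nn]                     # fetched once, reused m times
            for mm in range(m):
                out[mm][nn] += acts[mm][kk] * w     # partial sums are revisited each pass
    return out

def matmul_output_stationary(acts, weights):
    """Hold one output accumulator in place and stream weights and activations to it."""
    m, k, n = len(acts), len(weights), len(weights[0])
    out = [[0.0] * n for _ in range(m)]
    for mm in range(m):
        for nn in range(n):
            acc = 0.0                               # stays local until the output is complete
            for kk in range(k):
                acc += acts[mm][kk] * weights[kk][nn]
            out[mm][nn] = acc                       # written back exactly once
    return out

acts = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(matmul_weight_stationary(acts, weights))
print(matmul_output_stationary(acts, weights))
```

Both orderings compute the same result; they differ in which operand is fetched once and which is moved repeatedly, and since data movement dominates the energy, that choice is what the design exploration was trading off.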
Now, one of the visions in the long term we have for accelerators is that if you think about what is hard in building a chip, it's actually not doing the thing that the chip is doing. Whether it's doing a deep learning matrix multiply computation or whether it's, I built a custom chip with one of my PhD students to do bioinformatics calculations and the core of it was doing a Smith-Waterman dynamic programming algorithm. You could write the Verilog for that in an afternoon and get it working. The hard part of building these chips is everything that goes around that. It's building the memory system, building the on-chip network, the general-purpose programming that you need to sort of do the scaffolding around it. So the vision we have in the future is that you'll have a GPU which has a great on-chip network and off-chip network. A great on-chip memory system and off-chip memory system. It'll have general-purpose programmable SMs, the streaming multiprocessors are the programmable units of the GPUs. And then it will have these yellow boxes which is the configurable part. Whatever thing you want to accelerate whether it's doing matrix multiplies for deep learning or dynamic programming for bioinformatics. You'll have your program and your compiler will decide what parts of that will run on the general-purpose processor and what part will be synthesized into custom compute blocks that will build your accelerator. So that's a long-term vision. We have a long way to go to get there.
So I think I've talked probably long enough. So let me wrap up. I think deep learning is really affecting every aspect of the human experience and it's been enabled by hardware and its progress is gated by hardware. Remember the algorithms and the data were around. It wasn't until the GPUs came around to sort of be the spark that ignited that fuel-air mixture that we lit off our current revolution. And since then, progress has been gated by how fast a GPU is and how many of them you can wire together to get that 10 million x that we've needed to train the foundation models today. On a single GPU, we've gotten a 1000x in the last 10 years. It's 5,000x over the 12 that I showed in the graph. And then we scale up and scale out to get the additional 2,000 we've needed to meet the demand. The hardware is one aspect of that. We need software both to cover the applications and to provide the performance. And what we're looking at today are challenges that come from agentic systems and disaggregated inference where especially the autoregressive part of inference puts huge pressure on latency within the GPU and on memory bandwidth. And then even though we're trying to get very low latency to support chain of thought and agentic systems, at the same time we want to keep it programmable because we don't know when somebody's going to come up with the next clever algorithm and we have to be able to run it.
I showed you some details on the number representation side of optimum clipping where we can trade off quantization error against clipping error and minimize the mean squared error. It's worth almost two bits worth of precision and I'm applying that clipping to the appropriate granularity and it's much more effective to apply it at a small granularity at vectors of like 8, 16, or 32 than it is to an entire layer because the way to think about it is if I take the entire layer and I apply one scale factor to it, if I cut that in half and I apply two scale factors to it, I've incurred almost no overhead and I've improved my clipping and I just need to keep repeating that process until it becomes expensive to add that next scale factor. So with that, I think I should wrap up and we should move on to the Q&A. I should say this is a picture of where I live and you see it looks a lot cooler than it is here.