William Dally
Chief Scientist & Senior Vice President of Research, NVIDIA

NUS120 Distinguished Speaker Series | Dr William Dally

🎥 Jun 09, 2026 📺 NUScast ⏱ 75m
ABOUT NUS120* The National University of Singapore celebrates its 120th anniversary in 2025, commemorating a legacy, forged ...

About William Dally

William Dally, Chief Scientist and Senior Vice President of Research at Nvidia, gave a keynote at GTC Taipei in May 2026 where he discussed the economics of AI factories, stating that tokens "are now profitable units of revenues" and that compute demand in Taiwan has "skyrocketed" as a result. He estimated the cost of a single gigawatt-level AI factory has risen from $30 billion to between $60 and $100 billion, and argued that "compute is revenues" and "performance per watt is your revenues," cautioning against choosing architecture solely on chip cost. Dally also said the number of software engineers is increasing, describing claims that AI reduces jobs as "complete nonsense," citing the productivity gain of $3 trillion worth of software engineer salary generating $9 trillion in output. In a June 2026 lecture at the National University of Singapore, Dally attributed the deep learning revolution to GPU hardware enabling algorithms and data that had existed for decades, and stated that progress remains "gated by how fast a GPU is." He contrasted Nvidia's product development, where "it has to work or we're going out of business," with Nvidia Research, where the ability to fail allows for innovations that can achieve "2x or 4x performance per unit energy on the next generation." He also referenced an earlier 2020 talk where he described a 317x increase in single-chip inference performance over eight years, a trend he termed "Huang's Law," and credited specialized tensor core instructions for allowing GPUs to achieve efficiency near that of dedicated hardware.

Source: AI-verified profile updated from William Dally's recent appearances.

Transcript (69 segments)
✨ AI-enhanced transcript with speaker attribution
Esther Lim 0:07
Dr. William Dally, Chief Scientist and Senior Vice President of Research at NVIDIA Corporation. Mr. Sier Fua, NUS Chairman, NUS Trustees, Professor Tan Eng Chye, NUS President, distinguished guests, ladies and gentlemen. Good afternoon and welcome to the NUS 120 Distinguished Speaker Series: Shaping the Future Through Computing Innovation. I am Esther Lim, a third-year undergraduate student from the College of Design and Engineering and the Master of Ceremonies for this event. The NUS 120 Distinguished Speaker Series brings together distinguished speakers to share their unique perspectives on pressing issues of the future that affect our country, our region, and the world. The series explores the overarching theme of shaping the future and encourages thought-provoking ideas and conversations that inspire debate and discussion. We are privileged to have with us today Dr. William Dally. Today, Dr. Dally will share his thoughts on how computing innovation is shaping the future. Moderating the Q&A session with Dr. Dally will be Professor Tulika Mitra, Vice Provost for Special Projects and Dean of NUS Computing. It is now my pleasure to invite NUS President Professor Tan Eng Chye to begin by saying a few words. Professor, please.
Tan Eng Chye 1:34
Good afternoon and warm welcome to the NUS 120 Distinguished Speakers. Today we are discussing a subject that sits at the heart of our shared future. To shape the future of computing innovation is in no small measure to shape our economies, our societies, and how we come to understand intelligence itself. We are deeply honored to have Dr. William Dally with us today. His foundational contributions to parallel computing and high-performance interconnects have helped shape the infrastructure upon which modern AI runs. The principles behind his work, which center on systems design, scale, and efficiency, continue to define the direction of the field. To understand where computing is heading, we need to confront a constraint that is sometimes overshadowed: what determines what we are able to build? The demand for computing power is growing rapidly. Advances in AI, data analytics, and scientific simulation are redefining old fields and industries from healthcare to finance, climate modeling to material science. As models grow more ambitious, the infrastructure behind them has become as consequential as the ideas within them. This tension is reshaping how systems are designed. This is where Dr. Dally's work is particularly significant. Behind every frontier AI model lies a coordination issue: getting thousands of processes to work in concert, communicating at speed and scale. It is often this connective issue, not raw compute alone, that determines what's possible. Dr. Dally's work on high-performance interconnects has been instrumental in making that coordination feasible for the large-scale systems that define the AI we see today. As capability scales, energy emerges as a concern. Dr. Dally's work on efficiency addresses this: how do we sustain performance without creating unsustainable demands on energy? There is also a broader question: how do we ensure that these capabilities are deployed in ways that are equitable and responsible? 
These are questions central to the work NUS is doing. The NUS AI Institute brings together researchers across disciplines to advance scalable, energy-efficient AI with a focus on responsible deployment. We also actively collaborate with industry to amplify the impact of our innovations through our partnerships with IBM on AI and sustainable computing, as well as Microsoft Research Asia and Google on AI-driven research and talent development. We are working to deepen the region's role in shaping technological innovation and application. The future of computing innovation calls for collaboration across disciplines, sectors, and borders. It will demand technical ingenuity and a clear sense of purpose. As a university, our responsibility is not only to advance knowledge but also to cultivate the judgment that guides its use. Dr. Dally's work embodies what this moment requires: powerful, efficient, and scalable systems and the design choices that make this both possible and more sustainable. So, we look forward to hearing from him. Please join me in welcoming Dr. William Dally. Thank you.
William Dally 6:48
So I think everybody will agree that there's probably no technology shaping the future more today than that of deep learning, of AI. And I'm going to tell you a story about the technology behind that, which is the computer hardware and the evolution of that that has allowed AI to shape the future. So let's start with a little bit of history. You see this, I can walk around here and use this to advance. So the modern revolution in deep learning really was created by three ingredients. First, algorithms, as indicated by this picture from the AlexNet paper in 2012. So if you look at AlexNet, it's a CNN, and it was trained with gradient descent and backpropagation, and all of those algorithms were around in the 1980s. So the algorithms part of deep learning was largely solved, you know, 40 years ago. The next issue is you need enough data to train these networks, and large-scale datasets like the ImageNet dataset, which this is kind of a pictograph of, were around in the late 2000s, by 2008, 2009, and Pascal in 2005. So the data and the algorithms were there. You think of that as the fuel and the air. And they were waiting for a spark to ignite them and really light off the AI revolution. And that spark was enough computing performance to train a large enough neural network on a large enough dataset in a reasonable amount of time. In the case of AlexNet, it was two Fermi-generation NVIDIA GPUs that were used to train AlexNet on the ImageNet dataset in about two weeks. And so it was the GPU, the computing, that was the missing ingredient that really was responsible for the AI revolution. Now since GPUs enabled deep learning, GPUs have been pacing the progress of deep learning. This chart shows the demand in compute power, in operations, to train a state-of-the-art model from AlexNet in 2012 to one of the GPT models in about 2024, and actually the latest frontier models are about an order of magnitude more than this.
And this is an increase of 10 million in 10 years in the capability required to train a deep learning model. And at NVIDIA, we feel very responsible for not holding up progress by, you know, making sure that we can keep our GPUs growing fast enough. And so I'm going to tell you a little bit about how we met this tremendous increase in demand during an era when Moore's Law is largely dead, where we're really not getting very much from process technology.
So this chart sort of shows our progress in GPU performance over the last 12 years. It's a 5,000x increase in performance over those years. To put some pictures with this, this starts with Kepler. It's actually one GPU after the one that was used to train AlexNet. That was the Fermi generation. Kepler came out in 2012 and it had a performance of about four teraflops. We made progress along the way and I've kind of shown the key points of it. In the Maxwell generation in about 2015, we added support for 16-bit floating point, which was what everybody was using to train neural networks, either FP16 or BF16, and we added a dot product instruction. I'll explain in a minute why those were important, and we wound up getting up to about six teraflops. The Pascal generation took it to 20. A big jump happened with the Volta generation in 2017 where we added a matrix multiply instruction, the HMMA instruction, a half-precision matrix multiply accumulate. It takes two FP16 matrices, two 4x4 FP16 matrices, multiplies them together and sums them into an FP32 matrix. So it's 128 arithmetic operations, 64 multiplies, 64 adds, done with one instruction. And what this does is it amortizes the overhead of the instruction. Complex instructions are a good idea when it comes to overhead. We added the IMMA instruction. And by the way, the marketing people didn't like the idea of MMA instruction. So they called these tensor cores, which sounds way cooler. And so in the Turing generation, we added the IMMA instruction. In the Ampere generation, we added support for sparsity. I'll explain that in a little bit more detail later. In Hopper, FP8, and then in Blackwell, NVFP4, and I'll talk a little bit about some of the scaling technologies behind NVFP4. So if you look at where this 5,000x comes from, very little of it comes from process technology. The processes that these different chips are in are actually color-coded on this chart. The black ones, Kepler and Maxwell, being 28 nanometer. 
The green ones here, Pascal and Volta, were 16 nanometer, and actually Turing as well. Ampere was seven, and then Hopper in five and Blackwell in four. And if you look at the gains in performance due to that process technology, and I used performance per watt as the way to attribute it to process technology, from 28 nanometer down to 4 nanometer, you would have expected the gain to be roughly the ratio of those two numbers, or something on the order of seven. But no, it's 3x. We got 3x out of process technology over this period of time, and it's because these numbers are a little bit made up. So the difference in the metal pitch, which to me is the thing that actually matters, between 28 nanometers and 4 nanometers is actually a little bit less than 3x. And so you're not really getting a huge gain from process. Now you're getting 3x out of the 5,000x. The single biggest thing that made a difference during this period of time was number representation. And I'll go into this in a little bit more detail later, but in Kepler, we had 32-bit floating-point numbers for the bulk of the math. That's because the chip wasn't designed to do deep learning. It was designed to do graphics and high-performance computing. And for graphics, we needed FP32. And for high-performance computing, we needed FP64. And so all the deep learning done in Kepler was done in FP32. And you got four teraflops. If you instead of doing a 32-bit operation, you do a 16-bit operation, that's actually four times less energy. And it's because the dominant operation is the multiply, and the multiply scales quadratically. Think about doing long multiplication. Every bit of one operand has to be ANDed with every bit of the other operand to create partial products, and you have to sum those partial products together. So it's a quadratic scaling. And so by going from FP32 to FP16, we gained 4x, and then we gained another 4x going to eight, and another almost 4x going down to four. 
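The long-multiplication argument can be checked with a small counting sketch. This is an illustration only: the function name `long_multiply` is hypothetical, it models plain unsigned integers rather than floating-point significands, and it counts single-bit AND operations as a rough stand-in for multiplier energy (real multipliers also spend energy summing the partial products).

```python
def long_multiply(a: int, b: int, bits: int) -> tuple[int, int]:
    """Multiply two unsigned `bits`-wide integers by shift-and-add.

    Returns (product, and_ops), where and_ops counts the single-bit ANDs
    used to build the partial products.
    """
    and_ops = 0
    product = 0
    for i in range(bits):            # every bit of b ...
        b_i = (b >> i) & 1
        partial = 0
        for j in range(bits):        # ... is ANDed with every bit of a
            a_j = (a >> j) & 1
            partial |= (a_j & b_i) << j
            and_ops += 1
        product += partial << i      # sum the shifted partial products
    return product, and_ops


# AND count for each operand width mentioned in the talk.
and_ops_by_width = {bits: long_multiply(3, 5, bits)[1] for bits in (32, 16, 8, 4)}
print(and_ops_by_width)  # {32: 1024, 16: 256, 8: 64, 4: 16}
```

Each halving of the operand width cuts the AND count by 4x in this model, which is the quadratic scaling described above.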
And so that winds up giving us a 32x gain over this period of time. The other thing that gave us a big gain here was complex instructions. Let me flip to this chart to explain those. So GPUs actually have a very efficient execution pipeline because they don't do any speculation. So they're not guessing and then having to throw away work. They don't do any branch prediction. So they don't have to access several big tables and do lots of computation just to figure out where the next instruction is. It's a very simple pipeline. Even so, the cost, and I've normalized everything to 45 nanometers here, the cost of fetching the instruction, decoding it, and fetching the operands is about 30 picojoules. In contrast, doing a single FMA instruction is about 1.5 picojoules. So the overhead here is 20x. It's a 2,000% overhead. And by the way, if this were a CPU, you could add two zeros to that. It wouldn't be 20x, it would be 2,000x because the cost of doing the instruction fetch and decode is 100x more in a CPU. In Maxwell, we basically added a dot product instruction. Instead of doing a single multiply and a single add, it did four multiplies and four adds. So we're now up to six picojoules. The overhead's only 5x. Still, we're doing five times as much work on unproductive stuff, fetching and decoding instructions, as productive stuff, math. Once we added the HMMA instruction, it flipped things around. Now most of the work is actually going to the productive stuff, to the math. Because we're doing 128 operations, 110 picojoules in the math and only 30 picojoules in the overhead. We're at 22%. With the IMMA instruction, we got to 16%. The GPUs we're shipping today, the overhead is about 11%. And this comes from defining complex operations. Let me flip back to the overall chart. I'll talk about it a little bit later, but the matrix multiplies we're doing with these neural networks are naturally sparse, and all GPUs since Ampere have supported 2:1 sparsity. 
We basically can carry out twice the effective number of operations for the same energy cost because we're not having to pay to multiply by those zeros in the sparse matrices. And then in Blackwell, we kind of cheated a little bit. In the Blackwell GPU, what we call one GPU is actually two reticle-sized, reticle-limited die. And so the die size here is giving us a factor of two. And if you multiply all those factors together, you get the 5,000. So where did we wind up? Well, this is a Blackwell GPU. It's pretty impressive. It's two reticle-limited die. The communication technology that connects them I'm very proud of because it was developed in our circuit research group within NVIDIA Research. And around those two reticle-limited die are eight stacks of HBM4 memory. That's 10 terabytes per second of memory bandwidth, and this winds up giving a tremendous amount of performance. It has NVLink networking coming out of it at 1.8 terabytes per second, so we can connect these up. And that's very important today because even for inference on the large language models, the model doesn't fit on one GPU, and so you have to spread it across many GPUs. You need that very high bandwidth, very low overhead interconnect to make this work in practice.
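As a rough check on how the factors combine, the sketch below multiplies the gains that are given as numbers in the talk (process, number representation, sparsity, die size) and divides the 5,000x total by that product. The remainder is labeled here as the share attributable to complex instructions; that figure is an inference from the arithmetic, not a number quoted in the talk.

```python
# Gains stated as numbers in the talk.
stated_factors = {
    "process technology (28 nm to 4 nm)": 3,
    "number representation (FP32 down to FP4)": 32,
    "sparsity": 2,
    "die size (two reticle-limited die)": 2,
}
total_gain = 5000  # overall single-GPU gain quoted in the talk

stated_product = 1
for factor in stated_factors.values():
    stated_product *= factor

# Inferred residual: what complex instructions would have to contribute
# for the product to reach the quoted total (an assumption, not a quote).
implied_complex_instruction_gain = total_gain / stated_product

print(stated_product)                              # 384
print(round(implied_complex_instruction_gain, 1))  # 13.0
```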
Now, I showed you that the demand went up by 10 million and we got 5,000 out of a single GPU. So we're clearly short, on the order of 2,000x. And where do we get that 2,000x? Well, it comes from running many GPUs in parallel. And so there are a lot of different axes under which you can exploit parallelism. The easy way to exploit parallelism is called data parallelism. It's a dimension into the slide here. And what you do is you take your training set and you take different parts of the training set and you run them on different GPUs. And then after every what's called a batch of the training set, those GPUs exchange the weight updates so that every GPU learns not just from the part of the training set they did, but from all of the parts of the training set that each plane of these GPUs did. And that lets you do some amount of scaling, but to really get a lot of GPUs working on the problem, you then have to also take a single instance of the model and break it across GPUs. And you can do that in a couple ways. We do what's called tensor parallelism, which is really taking a matrix and slicing it, usually in one dimension. So maybe I take this matrix and I slice it into four bands and I put each of them onto four different GPUs. And now when I do the matrix multiply, each of those four GPUs does a quarter of the computations, the ones that use the part of the matrix they have. And then you have to do a reduction at the end to combine the results of those to get the final matrix multiply. The other dimension that you can do is what's called pipeline parallelism. If you look at a modern large language model, it has a bunch of layers, and in each layer you have a bunch of matrix multiplies for what are called the feed-forward parts of the transformer, and then you have the attention parts. And you can take that pipeline either by layer or by individual matrix multiplies and spread them across GPUs. 
And so by doing this, we can now get that remaining 2,000x we need by running across thousands of GPUs in one solution, one training run, or typically tens of GPUs for one inference run. And so the technology we have to do that today is in the NVL72 cabinet. This has 72 Blackwell GPUs, 36 Grace CPUs, and they're connected electrically within this cabinet. And we've packed them very tight together so we can keep those electrical interconnections short and basically be able to signal at 200 gigabits per second on an electrical cable without the attenuation of that becoming so excessive we can't detect the bits at the far end. And then to run very large jobs, we connect a bunch of those together. This is mixing generations. This is one generation back, using NVLink and NVSwitch to scale up to about 256 GPUs, and then scaling out using, typically this says InfiniBand, but we typically use Ethernet these days, to tens, in some cases hundreds of thousands of GPUs in a single network. And one technology that's been very effective in making this work is that in our switches, both the NVLink switches, the Quantum switches, and the Ethernet switches, our Spectrum switches, we actually build in reduction operations. So we can do, for example, an all-reduce, and that cuts the communication requirements in half because now we only have to sort of communicate everything into the switch. The switch does the reduction and distributes the sums back out, or otherwise we'd have to do one set of communications, do the summing on the GPUs, then do a second set of communications to distribute those results. So that's the hardware side of the equation.
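The tensor-parallel scheme described above, slicing a matrix into four bands and combining the partial results with a reduction, can be sketched in plain Python. This is a single-process simulation under stated assumptions: the "GPUs" are just loop iterations, the function names are hypothetical, the weight matrix is sliced along its inner dimension (the case that needs a sum reduction), and the inner dimension is assumed to divide evenly by the number of GPUs.

```python
def matmul(a, b):
    """Reference matrix multiply on lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def tensor_parallel_matmul(x, w, num_gpus):
    """Compute x @ w with w sliced into `num_gpus` bands along the inner dimension."""
    band = len(w) // num_gpus  # assumes len(w) is divisible by num_gpus
    partials = []
    for gpu in range(num_gpus):
        lo, hi = gpu * band, (gpu + 1) * band
        x_slice = [row[lo:hi] for row in x]  # columns of x matching this band
        w_band = w[lo:hi]                    # the band of w this "GPU" holds
        partials.append(matmul(x_slice, w_band))
    # Reduction step: sum the partial products element by element.
    rows, cols = len(x), len(w[0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]


x = [[i + j for j in range(8)] for i in range(2)]            # 2 x 8 activations
w = [[(i * 3 + j) % 5 for j in range(3)] for i in range(8)]  # 8 x 3 weights
assert tensor_parallel_matmul(x, w, num_gpus=4) == matmul(x, w)
```

The final summation is the step that an in-switch all-reduce would perform in the network instead of on the GPUs.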
The other thing which makes the NVIDIA GPUs really good at deep learning is software. And there's two ways to think about this. The first one is providing a complete solution. And we at NVIDIA, we kind of got into the deep learning game early. I started a project in NVIDIA Research in late 2010 that turned into the cuDNN software. And as a result, we then built on top of that, and we now have vertical solutions ranging from Modulus, which I think has been renamed to Physics NeMo these days, which does physics simulations, Clara in healthcare, DRIVE is our autonomous vehicle program, Metropolis is smart city, and so on. Those are all layered on top of a set of libraries we have that basically extend CUDA to do different numerics, to do AI, and then basically run on all different types of hardware. So whatever somebody's end application is, whether it's in physics or healthcare or autonomous vehicles, we have a whole vertical stack they can start with, and in many cases can finish with. They can just use that software out of the box and solve their problem. The other half of the software issue is performance. And the best way to judge performance is to look at somebody's independent benchmark, because everybody will cherry-pick some result and say, 'Oh, my performance is great on this one result.' There's an organization called MLPerf. It's a not-for-profit organization that benchmarks AI solutions, and universally, NVIDIA tends to come in at the top of the pack. This is a result from, I guess, November last year, where basically NVIDIA wins every MLPerf training benchmark. And I think there's actually been one since that we also won all of those, and actually many of the people who make great claims don't even bother showing up at these benchmark competitions. And it turns out that over time things get better as well. So when the H100 first came out, the 3.1 MLPerf benchmark results were due like a week later. 
And so we didn't have very much time to tune the software. We basically took it out of the box, ran it, we got a certain result. The 4.0s were six months after that. That gave us six months to tune the code. And in six months, running on exactly the same hardware, we got between two and a half and 3x performance improvements. And that shows that it's a non-trivial amount of software effort to actually get the potential of these GPUs to be realized.
So what am I worried about these days? Well, back then if you just ran the large language model, you were doing great. Today that's kind of the table stakes. And the typical AI system looks more like this, where you have a number of agents. You'll actually have some problem you're trying to solve, whether it's trying to do some decision management where you're accessing some big table of data and trying to answer some question or optimize some process, and you'll create a team of agents, each of whom has a certain role, and they'll communicate among each other. They'll also have their own state, memory, and some policy they're executing. They may be able to call tools. It turns out if you want to get a large language model to do calculations, you're better off having a dumb large language model that has a calculator than having a smart large language model. So having the appropriate tools to solve the problem is very important. And then the core technology is some large language model, which you may even be running off-premises. If you want to run one of the state-of-the-art foundation models, you're just calling it in the cloud via an API. And so this changes the computing required a lot, because now you're not just feeding one string of characters in, a string of tokens in, and getting a string of tokens out. But very often you'll give a command to this set of agents and you'll go away, and a day later come back after it's been iterating some design or some solution to a problem through many iterations and get the answer. And so it's created a lot more work from one request than you typically had. The other thing that creates a lot more work from one request is a trend toward reasoning models. It used to be you put the input string into the large language model, you get an output string out. Now, the large language model has to think about it a little while, and you very often have a linear chain of thought where it'll produce one intermediate string. 
It will then consume that, thinking a little bit, producing another one, and go through many steps before you finally get the output. Or in some cases it's a tree of thought where you'll get different alternatives and it will prune some of these off, and you'll get one step through the tree before you finally get your output. This again increases the amount of work and increases the latency to get a given response. This is on top of the fact that large language models themselves have this very latency-sensitive characteristic. You can take what's happening in a large language model and think of it as having a prefill phase, which is very parallel. You basically have the whole input string that you can process all at once. It gives you a lot of work to do. It tends to be then compute-limited because you can share the weights you're reading across all those tokens that are being processed simultaneously. So if you have the query is 'apple or fruit' and say that's four tokens, you can process all four of those tokens simultaneously. More typically, the input string can be anywhere from a few thousand tokens to millions of tokens. That's called the context. And so this prefill phase is very compute-limited, completely latency-insensitive. It's how much math do you have. It's also memory-bandwidth-insensitive. But then you get the first token out. And then you have to run that through the whole LLM. And going through the whole, you have to read every weight in the LLM, and you get to use that weight in exactly one multiply operation because you're multiplying it by that one token. And then you get it right, and then you got to go through the whole thing again. And so this decode process, which is autoregressive, you produce one token at a time, you have to read the whole model to produce each of those tokens, is very latency-sensitive and it's very memory-bandwidth-limited because you have to read all of the weights to produce each token.
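The difference between prefill and decode can be made concrete by counting how often each weight is read and how many multiplies it takes part in per read. The sketch below is a deliberately simplified model: the function name and the weight count are hypothetical, every weight is assumed to be used in exactly one multiply per token (as stated above), and attention and KV-cache traffic are ignored.

```python
def weight_traffic(num_weights, prompt_tokens, decode_steps):
    """Count weight reads and multiplies for one request in a toy model."""
    # Prefill: one pass over the weights, shared by all prompt tokens.
    prefill_reads = num_weights
    prefill_macs = num_weights * prompt_tokens
    # Decode: one full pass over the weights per generated token.
    decode_reads = num_weights * decode_steps
    decode_macs = num_weights * decode_steps
    return {
        "prefill_macs_per_read": prefill_macs / prefill_reads,
        "decode_macs_per_read": decode_macs / decode_reads,
        "prefill_reads": prefill_reads,
        "decode_reads": decode_reads,
    }


# Hypothetical model size; the four-token prompt mirrors the example above.
stats = weight_traffic(num_weights=1_000_000, prompt_tokens=4, decode_steps=100)
print(stats["prefill_macs_per_read"])  # 4.0
print(stats["decode_macs_per_read"])   # 1.0
```

With a prompt of thousands of tokens the prefill reuse grows accordingly, while decode stays at one multiply per weight read for a single user, which is why decode is limited by memory bandwidth.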
So this drives a lot of concerns, and rather than go through the text, let me show it to you graphically. This is another independent benchmarking organization, run by SemiAnalysis, their InferenceX benchmark. And this shows two different NVIDIA GPUs here, the GB300 NVL72 and the B300 here, and then brand A GPUs in the little blue down there. And we tend to judge the inference performance on two axes. If you're not worried about interactivity or latency, what matters is the y-axis here. And this is basically tokens per second per GPU. And very roughly that translates into the metric that most people really worry about, which is tokens per dollar. And then the x-axis here is interactivity. It's tokens per second per user. Because to get a lot of tokens per second per GPU, you need to take a lot of users' queries and batch them all together. And that then allows you to get reuse on that decode part because you can actually reuse the weights across each user. Although when accessing the KV cache, each user has their own KV cache. And so if you look at what it takes to make these curves look good, if you have a large enough batch size, all it takes to move up is more math bandwidth. You just need to do more computations per second. But to move to the right is a lot harder. It takes two things. First, you need a lot of memory bandwidth because you have to read every weight for every token. And when you start to get down to this part of the curve here, you wind up basically being down to a batch of one. You're running one user and having to read the entire model for that user. And then you also need really low communication latency because between each of these stages you're communicating typically from one GPU to another. If you have a lot of overhead on that communication, that's driving your latency up. And this is in our Blackwell generation. We're seeing numbers like 250 here. 
Our aspirations are to see numbers like 2,500 or 5,000 here, and that requires us to really push very hard on the design of these GPUs to get more effective memory bandwidth and reduce latencies.
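The trade-off between the two axes can be illustrated with a very small performance model. Everything here is an assumption made for illustration: the function name, the model size, and the compute rate are hypothetical, only the 10 TB/s bandwidth figure comes from the talk, and the model ignores KV-cache reads, communication latency, and the fact that large models are spread across many GPUs.

```python
def decode_rates(batch, weight_bytes, mem_bandwidth, flops_per_token, compute_rate):
    """Return (tokens/s per user, tokens/s per GPU) for one decode step."""
    time_memory = weight_bytes / mem_bandwidth          # read every weight once per step
    time_math = batch * flops_per_token / compute_rate  # math grows with batch size
    step_time = max(time_memory, time_math)
    return 1.0 / step_time, batch / step_time


# Hypothetical workload: 100 GB of weights, 4e11 FLOPs per token, 2e16 FLOP/s.
params = dict(weight_bytes=100e9, mem_bandwidth=10e12,
              flops_per_token=4e11, compute_rate=2e16)

for batch in (1, 500, 1000):
    per_user, per_gpu = decode_rates(batch, **params)
    print(batch, round(per_user), round(per_gpu))

# Doubling memory bandwidth doubles interactivity at batch 1 in this model.
fast = dict(params, mem_bandwidth=20e12)
per_user_fast, _ = decode_rates(1, **fast)
```

In this toy model, larger batches raise tokens per second per GPU until the math time dominates, while the batch-of-one interactivity is set entirely by how fast the weights can be read.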
And while we do this, we have to be very careful not to specialize too much because we constantly get thrown curveballs by very clever software people who come up with new models. This is the DeepSeek V3 model that was introduced a little over a year ago. And when this came out, I think people have known before there was a mixture of experts model, and that wasn't something new. Google had been doing mixture of experts in various ways for easily eight or 10 years before this. But that was something that had not been in a lot of these very large language models before. And then it had a different approach to attention where they sort of projected down to a smaller place, what was called MLA attention. And because of that, if you had your solution too specialized, it was too hardwired, you wouldn't be able to track advances like this that come along in the software. So where are we going to go with future GPUs? We're going to go in a couple different directions. And one way is sort of prioritizing where we want to spend our energy is to look at where the energy gets spent in the GPU. And so this is a pie chart, not actually of a GPU, but of a special accelerator we built in NVIDIA Research to do deep learning. And what you see is the bulk of the energy, this big blue part here, is going into doing math. That's data flow path plus MAC. That's basically doing all the math. So if you want to improve things a lot, you better make your math more efficient. And to make the math more efficient, there are a bunch of things we can do. Part of them is better number representation. I already talked about that a little bit, but I'll talk about three aspects of that here: going to logarithmic base numbers, vector scaling, and optimal clipping. The other thing you can do, which by the way the rest of this is largely moving data around, much of which is under the guise of memory. But if you think about memories, it doesn't actually cost any energy to actually operate a bit cell. 
It's a negligible amount of energy. All the energy in memories is reading and writing the memory. It's moving data. And of these arcs here, the green one is actually data movement itself. The other four, the blue, orange, gray, and yellow, are different types of reading and writing memory, which are just moving data around within those memories. And so there are things we can do better there as well. If we can organize the computation, what's called tiling, to order the things to do less data movement and less reading and writing, we can wind up reducing those. We can also do a lot of circuit tricks to make memories and communication better.
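A minimal sketch of how tiling reduces data movement: the two functions below compute the same matrix product but count how many operand elements are fetched from "main memory". The counting model is an assumption for illustration (function names are hypothetical, tiles are assumed to fit in a local buffer where reuse is free, traffic for the output matrix is ignored, and the matrix size is assumed to be a multiple of the tile size).

```python
def naive_matmul(a, b):
    """Square matmul where every multiply fetches both operands from memory."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    loads = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                loads += 2  # one element of a, one element of b
    return c, loads


def tiled_matmul(a, b, tile):
    """Same product, but operands are fetched one tile at a time and reused."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Fetch one tile of a and one tile of b into a local buffer.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += 2 * tile * tile
                for i in range(tile):
                    for j in range(tile):
                        for k in range(tile):
                            c[i0 + i][j0 + j] += a_tile[i][k] * b_tile[k][j]
    return c, loads


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j % 7 for j in range(n)] for i in range(n)]
c_naive, loads_naive = naive_matmul(a, b)
c_tiled, loads_tiled = tiled_matmul(a, b, tile=4)
print(loads_naive, loads_tiled)  # 1024 256
```

In this model the fetch count drops by a factor equal to the tile size, since each fetched element is reused that many times from the local buffer.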
Let me start with number representation. And so here are a bunch of different ways of representing numbers. With integers, we weight each bit by a power of two. And so we get a particular set of scaling. We can take something weighted by a power of two and add an exponent to it, sort of the scientific notation in binary form, and that gives us a floating-point number. With log numbers, we get rid of the mantissa part and we simply have an exponent. Although to make that be effective, it can't be an exponent of two because then the jumps between the numbers are too big. So it's typically a fractional power of two. Maybe 2 to the 1/8 or 2 to the 1/16th. We can use a symbol table. So for example, if I have an 8-bit number, I could basically have a table of 256 things and I can put my actual values that I can represent anywhere I want on the number line. And so whatever I come up with to represent my numbers, I really want to ask two questions about it. The first is what is the cost, and the second is what is the accuracy. And the cost has two aspects to it. One is the operation energy. What does it cost to do adds and multiplies these numbers? And the other is the movement energy. What does it cost to move them? And the movement energy is an easy question to answer. It's just the number of bits. Your communication channels, your wires don't care what those bits represent. Moving a bit costs the same no matter what it is. The accuracy, there's two aspects. The first is dynamic range. How big a range of numbers can I represent from the largest to the smallest? And the other is what is the error? What is the biggest error I can have in representing a value I want to the nearest representable number? And so here is how a bunch of different number systems compare to that. I'm going to ignore the spiking for now. If somebody wants to bring that up in the Q&A, they can. What you'll see is that integers have a very, very bad accuracy. 
What you really care about here is the blue line, which is the maximum error. So bad max error and poor dynamic range. Log and floating-point are very comparable in what they have. These are all in the 8-bit range. Log is slightly better, and I'll show a graph that explains why that is. And they have both way better dynamic range, the gray bar, than integer, and way better error. The blue bar moving to the right means the error is smaller. And then symbol table has this great property. Here's a plot from a paper that Song Han and I published a number of years ago where if you actually prune a network to exploit sparsity, you wind up getting a distribution of values that's bimodal like this because you've taken all the values near zero and you've pruned those out. And so if you were to represent, say, a four-bit number system and use integers, you'd get evenly spaced values like the x's here. And what you see is you're wasting a lot of your symbols out here in the outlier areas where nothing interesting is happening. And then you only have a few symbols under these lobes where all the interesting stuff is happening. But if on the other hand you build a symbol table where you can basically take, I get 16 values, I can put them wherever I want. I can put them under the lobes where the interesting things are happening and wind up getting a much smaller mean squared error in representing this distribution of data. And in fact, you can use backpropagation to train those symbol tables, which is one of the things that was in this paper that was at ICLR about a decade ago. Of course, if you don't do pruning, your distributions look more like this. But still, what you can take away from this is most of the interesting values are near zero and you have very few values out in these tails, and that's something we'll come back and exploit later.
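The symbol-table argument above can be checked numerically. The sketch below is illustrative only: the bimodal data is synthetic, and the table is fitted with plain 1-D k-means (Lloyd iterations) as a stand-in for the backpropagation-based training mentioned in the talk. The names `train_codebook`, `uniform_levels`, and `table_levels` are hypothetical.

```python
import random

def nearest(levels, x):
    """Return the representable level closest to x."""
    return min(levels, key=lambda v: abs(v - x))

def mse(levels, data):
    """Mean squared error of rounding each value to its nearest level."""
    return sum((x - nearest(levels, x)) ** 2 for x in data) / len(data)

def train_codebook(data, k, iters=20):
    """Fit k levels with 1-D Lloyd (k-means) iterations, starting from a uniform grid."""
    lo, hi = min(data), max(data)
    levels = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in levels]
        for x in data:
            idx = min(range(k), key=lambda i: abs(levels[i] - x))
            buckets[idx].append(x)
        # Move each level to the mean of the values assigned to it;
        # a level with no assigned values stays where it is.
        levels = [sum(b) / len(b) if b else v for b, v in zip(buckets, levels)]
    return levels

rng = random.Random(0)
# Synthetic stand-in for pruned weights: two lobes plus a few outliers.
weights = [rng.gauss(m, 0.03) for m in (-0.1, 0.1) for _ in range(1000)]
weights += [rng.uniform(-0.8, 0.8) for _ in range(20)]

lo, hi = min(weights), max(weights)
uniform_levels = [lo + (hi - lo) * i / 15 for i in range(16)]  # evenly spaced 4-bit grid
table_levels = train_codebook(weights, 16)                     # 16 freely placed values

uniform_mse = mse(uniform_levels, weights)
table_mse = mse(table_levels, weights)
print(f"uniform grid MSE: {uniform_mse:.6f}, symbol table MSE: {table_mse:.6f}")
```

Because the fitted table starts from the uniform grid and each Lloyd step cannot increase the error, the table's MSE ends up at or below the uniform grid's; on lobed data like this it is much lower, since most levels migrate under the lobes.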
Now, the optimum way of representing numbers if you don't use a symbol table is to do a logarithmic representation. For those of us who started school at a certain period in time, we're very familiar with devices like this. This is a slide rule, which is a device that would let you do a multiply by basically doing a logarithmic lookup where the log table was basically engraved on the scale of this device. You would put one on one of your numbers you're trying to multiply, find the other one, and read the answer off. And log numbers have these wonderful properties of having a high dynamic range and very, very good accuracy. So the way to think of this log number is you need a sign bit because you need to represent both positive and negative numbers, and then you have some exponent bits that we'll call the integer exponent bits that are to the left of the binary point, and it's basically 2 to the ei. But you also need some exponent bits to the right of the binary point so that you can represent jumps between numbers that are smaller than factors of two. So in this case we'll have three fractional and four integer exponent bits. And this gives us a good dynamic range and pretty good worst-case accuracy. So the way to think about the accuracy here is that if you have an integer representation, and this is int4, so you get 16 values, and these are the numbers you're trying to represent, and then the closest representable number sort of on the y-axis, you get big steps. So you're at one and then you still call it one until you get to one and a half, then boom, you have to jump to two. And the problem with that is at small numbers your error is large. It's 33% there. In the worst case, your error is a third of the value you're representing. And then as you get to the big numbers your error is very small. But remember most of the numbers we're trying to represent are small.
So averaged over a typical distribution you have a very large worst-case error and a very large MSE. With log numbers, you get this great property that the proportional error is the same everywhere. You start out small and the errors are small as you step up because you're jumping up by a certain fraction. If you had, I think this was L22, so you have two bits to the left of the binary point, two bits to the right. So you're jumping up by 2 to the 1/4 each time. And so you jump up by 1 plus 2 to the 1/4, and then do that again, and do that again, and do that again, and each time you do it, the error is proportional to the value you have. It's a constant 9% error across the whole range rather than being really big at the low numbers and really small at the high numbers. Now the reason why log numbers haven't caught on is that floating-point numbers are almost as good. They do a compromise between these two things where the log numbers are jumping up by a proportional amount each time. The floating-point numbers start with that same small bump. They do that say four times here if you have four mantissa bits, and then you bump the exponent and the error jumps a bunch, and you do that four times and the error jumps a bunch. You do that four times. So it's not as good, right? In this case it's a 13% error versus 9%. But it gives you that same error which is roughly proportional to your value across the entire range. The other really nice thing about log numbers is that multiplies are really cheap. Remember the slide rule, to multiply two log numbers together what you do is you add them, and that basically creates the multiplication. The problem is that adds are hard. If you have two numbers in the logarithmic representation to do an add, you actually have to convert them back to integer, do the add, and then convert them back to log. The good news is that typically when you're doing a deep network, you're not adding two numbers, right? 
Because if you did that, you'd be having to do this conversion every time. You're adding tens or hundreds of thousands of numbers together. And therefore, if you can basically do that conversion outside of the loop, basically refactor the loop, so you do all the adds and then you do the conversion, and then you do the conversion back, you can save that. And I'll refer you to this US patent application that sort of...
The integer part of the exponent is easy to handle. That's just a shift. You simply take one, shift it by the integer part, and then bin it into one of these partial accumulators depending on what the fractional part is. All the ones that are going to be multiplied by, let's say we have eight of these, all the ones that are going to be two to the 1/8 get summed together, two to the 2/8 get summed together, and so on. And then after you've summed 100,000 numbers, all you have to do is at one point do the conversion, which is basically looking up this constant which is two to the 0, which is one, two to the 1/8, two to the 2/8, and so on, multiply by that, and then sum the final things in. But because you've amortized that to the very end, it's a negligible cost. So log numbers are great. They've not been hugely commercially successful because floating-point numbers are almost as good and people are very familiar with those. But whatever number system you pick, you want to pick the range optimally. And let me explain what I mean by that.
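The binning scheme just described can be sketched in a few lines. This is a minimal illustration, not the hardware design or the patent's method: it assumes eight fractional steps per octave (`F = 8`), uses floating point where hardware would use shifts into fixed-point accumulators, and cannot encode zero. The helper names `to_log`, `from_log`, and `log_dot` are hypothetical.

```python
import math

F = 8  # fractional exponent steps per octave: values are powers of 2**(1/8)

def to_log(x):
    """Encode a nonzero x as (sign, code) with |x| approximately 2**(code / F)."""
    sign = 1 if x > 0 else -1
    return sign, round(math.log2(abs(x)) * F)

def from_log(sign, code):
    """Decode a log number back to a float."""
    return sign * 2.0 ** (code / F)

def log_dot(a_log, b_log):
    """Dot product of two log-encoded vectors using F partial accumulators."""
    bins = [0.0] * F
    for (sa, ea), (sb, eb) in zip(a_log, b_log):
        code = ea + eb                  # a multiply is just an add of exponents
        q, r = divmod(code, F)          # q: integer part (a shift), r: fractional part
        bins[r] += sa * sb * math.ldexp(1.0, q)  # add +/- 2**q into bin r
    # One conversion at the very end: scale each bin by its constant 2**(r/F).
    return sum(bins[r] * 2.0 ** (r / F) for r in range(F))

a = [0.5, -1.25, 3.0, 0.1]
b = [2.0, 0.75, -0.2, 4.0]
a_log = [to_log(x) for x in a]
b_log = [to_log(x) for x in b]

binned = log_dot(a_log, b_log)
# Reference: decode every product individually (conversion inside the loop).
reference = sum(from_log(sa, ea) * from_log(sb, eb)
                for (sa, ea), (sb, eb) in zip(a_log, b_log))
print(binned, reference)
```

Both paths compute the same sum of log-quantized products; the binned version performs only `F` multiplications by the lookup constants regardless of vector length, which is the amortization the talk refers to.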
So suppose I want to represent this distribution. The way we really started out doing things is we would say, okay, I've got to represent numbers from minus 0.8 to 0.8. So I'm going to scale whatever number system I have, in this case, I think it's again a four-bit integer, to fit minus 0.8 to 0.8. So I'll compute a scale factor that scales all the numbers to fit in that range. But by doing that, I've unconsciously made a trade-off in a bad direction because there's two sources of noise here. There's quantization noise, which is really the difference between the number you're trying to represent and what you quantize it to. It's the size of the gap between these points that I can represent. Then there's clipping noise, of which there's zero on this side because I'm not clipping anything. But on the other hand, if I simply say, okay, you guys out here are outliers, I'm not going to worry about you. I'm going to basically take everything from minus 0.2 to minus 0.8 and treat it like it was minus 0.2, and then I can basically put all 16 of my values that I can represent under the part of the curve which is dense. So I'm creating a bunch of noise by not representing the outliers as well. But in exchange for that, I'm making my quantization noise much smaller.
And you can plot what happens on typical weights from typical layers of interconnection of neural networks. And I'm going to use one of the low ones here so I can point to it. What's happening out here is at this far right side, I'm doing no clipping at all. Right? So this is all quantization noise. But as I move to the left, I'm increasing my clipping noise, but it's not hurting very much. But I'm decreasing my quantization noise, which is gaining a lot. Look at the four-bit line. I come down here, and when I get to this point here at about seven or eight, I hit a minimum. And what's happening at that minimum is that my quantization noise and my clipping noise are the same. I'm trading them off. If I go any further, the clipping noise is going to start getting bigger and I'm going to start losing and losing very rapidly. But right at this point, if you walk across, I'm getting almost what I would get with six bits with four by clipping it optimally. And this optimal clipping is the subject of a paper that we published a couple years ago. It winds up basically making NVFP4 work by letting us get out of four bits what you typically require six or so to get.
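The trade-off in that plot can be reproduced with a brute-force sweep. The sketch below is an assumption-laden illustration: it uses synthetic Gaussian data, a symmetric 4-bit integer grid (levels -7 to 7) rather than any specific NVIDIA format, and a coarse grid of 40 candidate clip points. The names `quantize`, `mse`, and `best_clip` are hypothetical.

```python
import random

def quantize(x, clip, bits=4):
    """Symmetric uniform quantizer on [-clip, clip]; values beyond clip saturate."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4 bits
    step = clip / qmax
    q = max(-qmax, min(qmax, round(x / step)))    # saturation is the clipping
    return q * step

def mse(data, clip, bits=4):
    """Total error: quantization noise inside the range plus clipping noise outside."""
    return sum((x - quantize(x, clip, bits)) ** 2 for x in data) / len(data)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(10000)]
max_abs = max(abs(x) for x in data)

# Sweep the clip point from a small value up to max_abs (no clipping at all).
candidates = [max_abs * i / 40 for i in range(1, 41)]
errors = [mse(data, c) for c in candidates]

best_mse = min(errors)
best_clip = candidates[errors.index(best_mse)]
no_clip_mse = errors[-1]
print(f"no clipping: {no_clip_mse:.5f}  best clip {best_clip:.2f}: {best_mse:.5f}")
```

The minimum falls well inside the data range: giving up accuracy on a few outliers buys a finer step for the bulk of the distribution.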
So then the question is how do you do this clipping? So if you're doing this for weights, it's easy. You can basically solve this by summing over all the weights, figuring out what the optimum value is, and then scaling by that. If you're doing it with activations, which you also want to do, you have to do it quickly. And the closed-form solution is up on top here. You have to solve that integral, but it turns out there's actually an iterative way of estimating this which is shown on the bottom that you can do fairly quickly. So with a couple iterations you get close enough to get optimal clipping, and this winds up making four bits almost as good as six.
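One way to get such an iteration, offered here as a sketch and not necessarily the formula on the slide: model the quantization noise for B bits and clip point s as uniform with step 2s/2^B (variance s^2 * 4^-B / 3), add the clipping error, and set the derivative of the total with respect to s to zero. That yields the fixed-point update s = E[|x| for |x| > s] / (4^-B / 3 * P(|x| <= s) + P(|x| > s)), where the expectation is taken over all samples with the inside samples contributing zero. The starting guess of three times the mean magnitude and the function names are assumptions.

```python
import random

def refine_clip(data, clip, bits=4):
    """One fixed-point update of the clip threshold under a uniform-noise model."""
    inside = sum(1 for x in data if abs(x) <= clip)
    outside = len(data) - inside
    tail_sum = sum(abs(x) for x in data if abs(x) > clip)
    return tail_sum / ((4.0 ** -bits) / 3.0 * inside + outside)

def quant_mse(data, clip, bits=4):
    """MSE of a symmetric uniform quantizer that saturates at +/- clip."""
    qmax = 2 ** (bits - 1) - 1
    step = clip / qmax
    total = 0.0
    for x in data:
        q = max(-qmax, min(qmax, round(x / step)))
        total += (x - q * step) ** 2
    return total / len(data)

rng = random.Random(0)
acts = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # synthetic activations
max_abs = max(abs(x) for x in acts)

# Starting guess (an arbitrary choice): three times the mean magnitude.
clip = 3.0 * sum(abs(x) for x in acts) / len(acts)
history = [clip]
for _ in range(10):
    clip = refine_clip(acts, clip)
    history.append(clip)

print("clip estimates:", [round(c, 3) for c in history[:4]], "...", round(clip, 3))
print("MSE at max-abs:", quant_mse(acts, max_abs), " MSE at clip:", quant_mse(acts, clip))
```

Each update needs only one pass of counts and sums over the data, which is why this style of estimate is cheap enough to consider for activations. Starting exactly at the maximum magnitude is a poor choice here, because the tail is then empty and the update collapses to zero.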
The next trick we played to make NVFP4 actually work is something called vector scaling, which is basically a paper we published in MLSys back in 2021. The reference is shown down here. And the way to think about it is when you're doing this scaling, you have to decide what granularity you're going to scale. And traditionally what we did is we would scale a whole layer of a neural network together. We take all the weights of the layer and decide how to scale them. But if you have a big distribution you have to scale, you wind up not being able to precisely represent things. And as you pick smaller granularities, you can basically get tighter distributions and represent things much more tightly, whether or not you did the optimum clipping. And so what we wind up settling on is scaling order of 16 numbers. Every 16 numbers we'd have a scale factor. And the trade-off here is as you get a finer granularity, you get a tighter distribution and better scaling, but you have more overhead for carrying the scale factor along. With 16, an NVFP4 number is really four and a half bits because it's carrying an 8-bit FP8 scale factor along with every 16 numbers, and so it's carrying half a bit per number of that overhead. But this winds up allowing us to get on the order of another bit out of a number system.
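The granularity trade-off can be illustrated with a small experiment. This sketch makes several simplifying assumptions: it uses a symmetric 4-bit integer grid as a stand-in for the FP4 value grid, keeps scale factors in full precision instead of FP8, uses plain max-abs scaling without optimal clipping, and runs on synthetic weights whose spread varies from group to group. All function names are hypothetical.

```python
import random

QMAX = 7  # symmetric 4-bit integer grid, -7..7 (stand-in for the FP4 grid)

def quant_block(block):
    """Quantize one block with its own max-abs scale; return dequantized values."""
    scale = max(abs(x) for x in block) / QMAX
    if scale == 0.0:
        return list(block)
    return [max(-QMAX, min(QMAX, round(x / scale))) * scale for x in block]

def mse_with_block_size(data, block):
    """Quantize data in consecutive blocks of the given size and return the MSE."""
    out = []
    for i in range(0, len(data), block):
        out.extend(quant_block(data[i:i + block]))
    return sum((a - b) ** 2 for a, b in zip(data, out)) / len(data)

def bits_per_value(value_bits, scale_bits, block):
    """Storage cost per number once the shared scale factor is amortized."""
    return value_bits + scale_bits / block

rng = random.Random(0)
# Synthetic layer: 64 groups of 16 weights, each group with its own spread.
weights = []
for _ in range(64):
    sigma = rng.uniform(0.01, 1.0)
    weights.extend(rng.gauss(0.0, sigma) for _ in range(16))

per_layer_mse = mse_with_block_size(weights, len(weights))  # one scale for everything
per_vector_mse = mse_with_block_size(weights, 16)           # one scale per 16 values
nvfp4_bits = bits_per_value(4, 8, 16)

print(f"per-layer MSE {per_layer_mse:.6f}, per-vector MSE {per_vector_mse:.6f}")
print(f"effective bits per value: {nvfp4_bits}")
```

The last line reproduces the four-and-a-half-bit figure from the talk: 4 value bits plus an 8-bit scale shared across 16 numbers.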
The next trick to play is exploiting sparsity. This is a figure from another paper that Song and I wrote about a decade ago where it turns out you can lop most of the connections out of a neural network and not lose any accuracy. And so we wrote this paper back in 2015 and for the next four or five years I tried to figure out how to actually make this work well in practice because the problem is the computation on the left is very regular and hardware likes things that are regular. You just make the numbers kind of march by each other and they multiply and everything is good. And when you start introducing sparsity, now things are irregular. There's sort of control to be done and things don't hop by each other in such a regular way. And I could tell you about the numerous papers I published on ideas that are actually bad for how to try to exploit sparsity. What actually wound up working was what we call structured sparsity where you force the sparsity to fit a pattern so it becomes regular again.
So you start out and you say okay here are my weights, dense trained weights, and then what I do is I'm going to lock the ones that are near zero all the way to zero. That's the pruning step, but I'm going to do it in a way that I restrict that I can only have two non-zeros out of every four. And then I'll retrain the weights. I'll basically fine-tune by training the algorithm again to get the weights that remain to pick up some of the slack from the ones that went away. And then what I do is I basically compress out all the zeros and I add this metadata that says where the non-zeros are. And now I basically can take only the non-zero values and I take the input activations and I use the same metadata that says okay the non-zeros are in positions zero, three, five, and six and select only those in. So now I have the red ones aligned with the green ones and I just do a dense multiply and everything becomes regular again. And this winds up basically giving us a factor of two performance boost on everything since Ampere.
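The prune, compress, and gather steps can be sketched as follows. This is a minimal illustration under stated assumptions: the weights are made up, the retraining step is omitted, and the metadata stores absolute positions for readability where hardware would store small per-group indices. The names `prune_2_of_4` and `sparse_dot` are hypothetical.

```python
def prune_2_of_4(weights):
    """Keep the two largest-magnitude weights in each group of four.

    Returns the compressed non-zero values and metadata giving their positions.
    The length of weights must be a multiple of four.
    """
    values, meta = [], []
    for g in range(0, len(weights), 4):
        group = weights[g:g + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(by_magnitude[:2])           # restore original order
        values.extend(group[i] for i in keep)
        meta.extend(g + i for i in keep)
    return values, meta

def sparse_dot(values, meta, activations):
    """Gather activations with the metadata, then do a dense dot product."""
    gathered = [activations[p] for p in meta]
    return sum(w * a for w, a in zip(values, gathered))

weights = [0.9, 0.01, -0.02, 0.4, 0.03, -0.7, 0.6, 0.05]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

values, meta = prune_2_of_4(weights)
result = sparse_dot(values, meta, activations)

# Reference: zero out the pruned weights and do the full-length dot product.
pruned_dense = [w if i in meta else 0.0 for i, w in enumerate(weights)]
dense_result = sum(w * a for w, a in zip(pruned_dense, activations))
print(meta, result, dense_result)
```

For these example weights the metadata comes out as positions 0, 3, 5, and 6, and the compressed dot product does half the multiplies of the dense one while giving the same answer.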
Let me see how I'm doing on time here. So I'll briefly talk about accelerators. So one way we prototype a lot of ideas what we want to do in future GPUs at NVIDIA is building accelerators and we've done a series of accelerators over the years, some in collaboration with academics at Stanford and MIT. SCNN was one of these failed attempts to exploit sparsity which basically operated by taking all the non-zeros of the activations and all the non-zeros on the weights and multiplying them all by each other and then sorting it out on the output, which is exactly the wrong thing to do because on the output first of all you have greater precision. You typically have lower precision on the input side. You start with four-bit weights and activations. Then you'll multiply them and you wind up with eight-bit results. You have to sum before you re-quantize down to four bits. And then the other is you wind up having to do what we called a scatter-add to basically take these results and add them into a memory array and that was an expensive operation. So that wound up being almost exactly the wrong way of doing it.
But if you look at these accelerators, you have to ask why are they better than doing it on a programmable engine and is there a lesson for us in designing GPUs that we can exploit. So the first way accelerators really get performance is by eliminating instruction overhead. Even for GPUs which have very efficient pipelines, you saw that trying to do a single multiply-add you had 20x as much energy fetching and decoding instructions. This is a result, the numbers for a very simple CPU, an ARM A15. A state-of-the-art x86, the numbers are even bigger, but these are from this particular paper which I think was in 2008. And basically the overhead here wasn't 200% overhead, it was 250 picojoules versus 32 femtojoules, right? So it was tens of thousands. And what we can do with an accelerator is get rid of all that overhead. Get rid of that fetch and decode and speculation and use just the energy on the math.
The other thing with accelerators is you have to sort of understand what everything costs. And this is a table from a paper my colleague Mark Horowitz had at ISSCC where we're basically for 45 nanometers sort of tallied what different things cost. And the real takeaways here is that low-precision math is way cheaper than high-precision math. I said that earlier. It's a quadratic thing. But just reading even a small SRAM, reading 32 bits out of a small SRAM is more expensive than doing a 32-bit floating-point operation and moving things even a few millimeters is more expensive than that. So data movement is way more expensive than doing the math. And so you need to keep things very local. One of my rules is sort of the order of magnitude rule of locality which is if I'm reading something out of a small SRAM, say an 8 kilobyte SRAM, that's about five picojoules per word. If I have a big SRAM that's spread out over the whole chip, that's about 50 picojoules per word. Now it turns out that it's the same SRAM. You build a big SRAM out of small SRAM banks. And so of that 50 picojoules, five picojoules is reading the word out of the bank. And the other 45 picojoules is getting the address to that bank and getting the data back. It's all data movement. In fact, even in the local SRAM read, it's all data movement. Reading the cell is a negligible amount of energy. Then if I have to go off chip, it's another order of magnitude.
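A back-of-the-envelope model shows why locality dominates. The per-word energies below are the round figures quoted in the talk (with off-chip taken as one more order of magnitude); the matmul traffic model is a simplifying assumption, in which a tile of size T lets each word fetched from the big SRAM be reused T times out of a small local SRAM. Function names are hypothetical.

```python
# Per-word access energies in picojoules, following the order-of-magnitude rule.
PJ_LOCAL_SRAM = 5.0     # small (about 8 KB) SRAM
PJ_GLOBAL_SRAM = 50.0   # large SRAM spread across the chip
PJ_OFF_CHIP = 500.0     # "another order of magnitude" (approximate)

def untiled_energy_pj(n):
    """No reuse: every multiply-add reads both operands from the big SRAM."""
    return 2 * n ** 3 * PJ_GLOBAL_SRAM

def tiled_energy_pj(n, tile):
    """Each word is fetched from the big SRAM once per tile, then reused locally."""
    macs = n ** 3
    local_reads = 2 * macs            # every multiply-add reads two local operands
    global_reads = 2 * macs / tile    # each fetched word is reused `tile` times
    return local_reads * PJ_LOCAL_SRAM + global_reads * PJ_GLOBAL_SRAM

n = 1024
saving = untiled_energy_pj(n) / tiled_energy_pj(n, tile=32)

# Instruction overhead quoted earlier in the talk: 250 pJ versus 32 fJ of math.
overhead_ratio = 250e-12 / 32e-15

print(f"operand-energy saving from 32x32 tiling: {saving:.2f}x")
print(f"instruction overhead relative to the math: {overhead_ratio:.0f}x")
```

Under these assumptions tiling cuts operand-fetch energy by roughly a factor of seven to eight, and the remaining cost is dominated by the local reads.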
So one of our more successful accelerator projects was something called MAGNet where we basically tried to look at what is the most efficient way of organizing a floating-point accelerator and a lot of this was iterating over different ways of doing the tiling, different ways of what a lot of people call the data flow. Do we want to keep the weights stationary and move the activations or keep the activations stationary and move the weights or keep the outputs stationary? And so we came up with this sort of weight-stationary pseudo-output-stationary approach to doing this which wound up being the right compromise and we wound up building a prototype of this accelerator that wound up getting about 100 TOPS per watt with a 50% dense four-bit input.
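The difference between dataflows is only a matter of loop order, which a toy example can make concrete. The sketch below shows plain weight-stationary and output-stationary loop nests for Y = X * W^T and counts weight fetches and partial-sum writes. It does not model the hybrid dataflow described in the talk, and the function names and counters are hypothetical.

```python
def weight_stationary(W, X):
    """Hold each weight in place and stream every activation past it."""
    B, O, I = len(X), len(W), len(W[0])
    Y = [[0.0] * O for _ in range(B)]
    weight_fetches = 0
    partial_sum_updates = 0
    for o in range(O):
        for i in range(I):
            w = W[o][i]                      # fetched once, reused for all B inputs
            weight_fetches += 1
            for b in range(B):
                Y[b][o] += w * X[b][i]       # partial sum is updated in outer storage
                partial_sum_updates += 1
    return Y, weight_fetches, partial_sum_updates

def output_stationary(W, X):
    """Hold each output accumulator in place and stream weights and inputs past it."""
    B, O, I = len(X), len(W), len(W[0])
    Y = [[0.0] * O for _ in range(B)]
    weight_fetches = 0
    partial_sum_updates = 0
    for b in range(B):
        for o in range(O):
            acc = 0.0                        # accumulator stays local
            for i in range(I):
                acc += W[o][i] * X[b][i]
                weight_fetches += 1          # weight is re-fetched for every input
            Y[b][o] = acc                    # written out once
            partial_sum_updates += 1
    return Y, weight_fetches, partial_sum_updates

W = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 1.0, 2.0], [5.0, 0.0, -2.0, 1.0]]  # 3 outputs x 4 inputs
X = [[1.0, 0.0, 2.0, 1.0], [3.0, 1.0, 0.0, -1.0]]                          # batch of 2

ws_result, ws_fetches, ws_updates = weight_stationary(W, X)
os_result, os_fetches, os_updates = output_stationary(W, X)
print("weight-stationary:", ws_fetches, "weight fetches,", ws_updates, "partial-sum updates")
print("output-stationary:", os_fetches, "weight fetches,", os_updates, "partial-sum updates")
```

Both orders compute the same result; what changes is which operand moves. Weight-stationary minimizes weight traffic at the cost of partial-sum traffic, and output-stationary does the reverse, which is the tension a hybrid scheme tries to balance.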
Now, one of the visions in the long term we have for accelerators is that if you think about what is hard in building a chip, it's actually not doing the thing that the chip is doing. Whether it's doing a deep learning matrix multiply computation or whether it's, I built a custom chip with one of my PhD students to do bioinformatics calculations and the core of it was doing a Smith-Waterman dynamic programming algorithm. You could write the Verilog for that in an afternoon and get it working. The hard part of building these chips is everything that goes around that. It's building the memory system, building the on-chip network, the general-purpose programming that you need to sort of do the scaffolding around it. So the vision we have in the future is that you'll have a GPU which has a great on-chip network and off-chip network. A great on-chip memory system and off-chip memory system. It'll have general-purpose programmable SMs, the streaming multiprocessors are the programmable units of the GPUs. And then it will have these yellow boxes which is the configurable part. Whatever thing you want to accelerate whether it's doing matrix multiplies for deep learning or dynamic programming for bioinformatics. You'll have your program and your compiler will decide what parts of that will run on the general-purpose processor and what part will be synthesized into custom compute blocks that will build your accelerator. So that's a long-term vision. We have a long way to go to get there.
So I think I've talked probably long enough. So let me wrap up. I think deep learning is really affecting every aspect of the human experience and it's been enabled by hardware and its progress is gated by hardware. Remember the algorithms and the data were around. It wasn't until the GPUs came around to sort of be the spark that ignited that fuel-air mixture that lit off our current revolution. And since then, progress has been gated by how fast a GPU is and how many of them you can wire together to get that 10 million x that we've needed to train the foundation models today. On a single GPU, we've gotten a 1000x in the last 10 years. It's 5,000x over the 12 that I showed in the graph. And then we scale up and scale out to get the additional 2,000 we've needed to meet the demand. The hardware is one aspect of that. We need software both to cover the applications and to provide the performance. And what we're looking at today are challenges that come from agentic systems and disaggregated inference where especially the autoregressive part of inference puts huge pressure on latency within the GPU and on memory bandwidth. And then even though we're trying to get very low latency to support chain of thought and agentic systems, at the same time we want to keep it programmable because we don't know when somebody's going to come up with the next clever algorithm and we have to be able to run it.
I showed you some details on the number representation side of optimum clipping where we can trade off quantization error against clipping error and minimize the mean squared error. It's worth almost two bits worth of precision and I'm applying that clipping to the appropriate granularity and it's much more effective to apply it at a small granularity at vectors of like 8, 16, or 32 than it is to an entire layer because the way to think about it is if I take the entire layer and I apply one scale factor to it, if I cut that in half and I apply two scale factors to it, I've incurred almost no overhead and I've improved my clipping and I just need to keep repeating that process until it becomes expensive to add that next scale factor. So with that, I think I should wrap up and we should move on to the Q&A. I should say this is a picture of where I live and you see it looks a lot cooler than it is here.
Esther Lim 54:35
Thank you, Dr. Dally. May we invite you as well as Prof. Tan to take your seats on stage for the question and answer session. The floor is now yours, Prof. Tan.
Tan Eng Chye 54:49
Thank you, Dr. Dally, for a very fascinating and insightful talk. So you went through the history of deep learning accelerators and what was surprising is that a lot of the benefit actually came from the number representation. I saw the analog which you did not discuss. But also at the same time you touched upon the specialization versus the generalization and the models might evolve but you need to keep things general. But once you try to keep things general, the control overhead comes in and when you are at FP4 or even smaller, then that becomes like a huge problem. So when the dust settles, where do you see this tension between generalization versus specialization and where do you see we are going after FP4?
William Dally 55:52
Yeah, so let me, those are two distinct questions, let me address them separately. So we want to be general but we can't be completely general, but we have to necessarily specialize. For example, by deciding on certain ratios when we design a GPU, we have to decide how much external memory bandwidth do we want to have? How much internal memory bandwidth do we want to have? How much arithmetic bandwidth do we want to have? And so we're setting a bunch of ratios. And if the problem we're solving hits those ratios right, all parts of the chip are going to be busy. On the other hand, if I run the decode part of inference and it's very heavy on memory bandwidth and light on math, all of a sudden my memories are busy all the time and my math units are all idle. But you have to make that provisioning decision for any chip about how you allocate your chip resources to different things. There's no compromise there. What you don't want to do is tie your hands by making a restriction that doesn't buy you anything. You basically doing that allocation buys you something. You've made a decision on allocating resources. But if I were to say, I can only multiply matrices that are of dimension 8, then that's an artificial restriction. It's not something fundamental about math versus memory bandwidth. And so I've made something which makes models that didn't fit that not run well, but didn't gain anything by doing it. Or trying to hardwire in a particular attention algorithm, something like that doesn't buy you anything. And so you want to keep the programmability as much as you can but not artificially constrain it. And one thing we've done to eliminate the overhead but still retain programmability is to have these complex instructions, to have the matrix multiply instructions. But those are fairly fundamental. You need them for almost anything you do. 
And so I think the art of good architecture is figuring out how to come up with a complex enough instruction that amortizes out all of the overhead but is general enough that it's what you need for almost any application and then you're down again to provisioning, basically deciding what the ratio of bandwidth to math is going to be.
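The provisioning argument can be expressed with the standard roofline model, in which attainable throughput is the lower of the math peak and memory bandwidth times arithmetic intensity. The chip numbers and the intensities assigned to decode and prefill below are hypothetical, chosen only to illustrate the two regimes.

```python
def attainable_tflops(peak_tflops, mem_bw_tbytes_per_s, flops_per_byte):
    """Roofline model: throughput is capped by math or by memory, whichever is lower."""
    return min(peak_tflops, mem_bw_tbytes_per_s * flops_per_byte)

def math_utilization(peak_tflops, mem_bw_tbytes_per_s, flops_per_byte):
    """Fraction of the math units kept busy at a given arithmetic intensity."""
    return attainable_tflops(peak_tflops, mem_bw_tbytes_per_s, flops_per_byte) / peak_tflops

# Hypothetical chip: 1000 TFLOP/s of math and 5 TB/s of memory bandwidth.
PEAK = 1000.0
BANDWIDTH = 5.0
balance_point = PEAK / BANDWIDTH   # FLOPs per byte at which both resources are busy

# Hypothetical workloads: decode moves many bytes per FLOP, prefill few.
decode_util = math_utilization(PEAK, BANDWIDTH, flops_per_byte=2.0)
prefill_util = math_utilization(PEAK, BANDWIDTH, flops_per_byte=400.0)

print(f"balance point: {balance_point:.0f} FLOPs/byte")
print(f"decode math utilization: {decode_util:.0%}, prefill: {prefill_util:.0%}")
```

A workload whose intensity sits at the balance point keeps every part of the chip busy; anything far below it leaves the math units idle, which is the decode case described in the answer.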
Tan Eng Chye 58:03
Thank you. So I take this opportunity for the moderation to ask...
William Dally 58:07
I didn't answer your other question. And I should answer that which is what is after NVFP4. And so I think there are a couple directions we can still go to continue mining data precision. And there's certain things I can't talk about that because they're going to be in upcoming GPUs, but you can imagine you can go quite a bit less than four bits. And lest you think that one bit is the limit, there are these things like arithmetic coding where you can actually code symbols with less than one bit per symbol. And so I think cleverness will continue pushing us forward.
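The remark about going below one bit per symbol follows from Shannon entropy, which is the bound that an ideal entropy coder such as arithmetic coding approaches. The sketch below computes only that bound for a hypothetical skewed source; it is not an arithmetic coder, and the 95/5 split is an arbitrary example.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol for a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical binary source in which 95% of the symbols are zero.
skewed = entropy_bits([0.95, 0.05])
# A fair binary source needs a full bit per symbol.
fair = entropy_bits([0.5, 0.5])

print(f"skewed source: {skewed:.3f} bits/symbol, fair source: {fair:.3f} bits/symbol")
```

When one symbol is much more likely than the others, as with mostly-zero weights, the average information per symbol drops well below one bit.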
Tan Eng Chye 58:41
Okay. And you don't see emerging technology playing a role in the future?
William Dally 58:46
But we're always trying to use the best technology that our various fab partners have. And we fabricate some of our chips with TSMC and some of them with Samsung and we're constantly evaluating other potential foundries. But we see diminishing returns from future technologies, from future semiconductor technology. We're getting less out of that and more out of the architecture as time goes on.
Tan Eng Chye 59:10
Thank you. Maybe questions from the audience.
Audience Member 59:14
You can, yeah. I think somebody has the mic.
So Bill, I have a question. You know, you showed that we advanced 10 million times in 10 years to this point, but based on what you propose in terms of number representation and precision, it doesn't seem like we're going to get another 10 million in the next 10 years.
William Dally 59:35
Well, remember the single GPU has advanced 5,000 times and it's really 2,500 because it's two dies stuck together. But you know, if we can keep doubling our performance every GPU generation, we'll still get there. I do think that some of the easy things have been mined, going from FP32 to FP16 was really easy. And going from FP4 to what comes next is going to be harder. But I think there's still room for a bunch of doublings there.
Audience Member 1:00:06
So my understanding is that the DeepSeek people showed the benefit one could get from optimization technology, right? That it's not the process technology and so on but really how you do the software parts.
William Dally 1:00:22
Well, yeah, I think they certainly showed a lot of creativity in, you know...
Tan Eng Chye 1:00:26
And also they showed that if you're willing to do all the hard compilation stuff, maybe partly manually, that you could get a lot of benefit.
William Dally 1:00:37
I don't think there's a lot of hard compilation of that. I mean, the two big things with DeepSeek V3 were the MLA attention mechanism, which reduced the amount of math in attention quite a bit, and then the mixture of experts. And that algorithmic creativity I think allowed them to get the same level of performance out of a smaller number of math operations. But we've seen that over the entire history of deep learning. So if you go back to the AlexNet days in the 2012 ImageNet, it was AlexNet that won and then the next year it was this network called VGG and VGG was kind of brute forcing it. It did a lot of operations, but it was still an AlexNet-like organization with a bunch of convolutional layers followed by some fully connected layers. But the year after that, this network called GoogLeNet won. And GoogLeNet took a completely different approach and wound up beating the performance of VGGNet with fewer, like an order of magnitude fewer operations by being clever about how it did that. And that's the history of deep learning. We're constantly seeing people coming up with more efficient models. They get the same accuracy with fewer operations. And I think we're going to see that over and over again. A lot of the efficiency gain is coming from better models.
Audience Member 1:01:50
So if I could explore that a little bit more. You talked about the algorithmic creativity. But there is also this concept of hardware lottery. That algorithm creativity is generally bound by what the GPU is offering.
William Dally 1:02:04
We think this is a very good thing. You see that as a very good thing.
Audience Member 1:02:10
So basically what you're saying is that because everybody's running their algorithms on our GPUs, the quality of an algorithm is judged by how well it runs on GPUs. So people then develop algorithms that run well on our GPUs and that's good because then they buy more GPUs.
But if you put your academic hat on, is there a way to break that?
William Dally 1:02:32
Why would we want to?
Audience Member 1:02:37
I am Ming Tan on the board of NUS. Could the future be non-silicon based?
William Dally 1:02:42
Possibly, but I haven't seen any alternatives that look competitive yet.
Audience Member 1:02:50
Is NVIDIA doing any work in non-silicon based computing?
William Dally 1:02:54
Yeah, we have a couple exploratory projects, but most of them fall under the heading of like photonic communications. But we're not trying to do compute in a non-silicon-based way. We don't have anything that we're looking at there.
Audience Member 1:03:09
So following up maybe on the communication, clearly your work has shaped the communication, the interconnect, and one of my colleagues, your student, Peh Li-Shiuan, was working on that. Some of these interconnect problems have come back in the context of modern AI and the agentic AI workload. I think you didn't have much time to talk about that. But do you see that as a different problem or do you think some of the things that have been done before can be rediscovered in this context?
William Dally 1:03:42
I don't think you even have to rediscover them. You can just reapply them. So if you look at running a modern deep learning model, they're big, the true foundation models are 10 trillion parameter models. They don't fit on one GPU. So even doing an inference is something where that model is split over tens of GPUs and it's really critical to have a very good scale-up fabric to do that. Especially if you're interested in the part of that curve where I showed the throughput and interactivity, if you're interested in getting good interactivity, that's limited by latency and so you need to have good communication latency between those things. But I think all, if you get my textbook which was written in 2005, most of what you need to do to design a good interconnection network, given a traffic pattern which is induced from the model and how you've done the mapping, how much tensor parallelism, how much pipeline parallelism you've unrolled it to and how you've placed those on the GPUs, that induces a traffic pattern and given that traffic pattern there's a good way of designing an optimal interconnection network for it and if you have a good interconnection network for it you will get way more interactivity than you will if you try to just hook up a bunch of Ethernet or something.
Audience Member 1:04:56
Yeah. And do you see with agentic workload that traffic is becoming unpredictable or you...
William Dally 1:05:03
Well, no, I think it's very predictable, because again the agentic workload does more work, but each piece of work is an API call that's basically putting an input string into an LLM. Then you have a prefill computation and a decode computation, and you get an output string. The communication pattern from that is very, very predictable. It's also very diverse, because you have one communication pattern for the prefill part and another communication pattern for the decode part.
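One way to see why the two phases differ is to compare message sizes. The sketch below assumes a tensor-parallel layer whose all-reduce payload is tokens × hidden size × bytes per value, with prefill processing the whole prompt at once and decode processing one token per step. The function names and the sizes in the usage example are hypothetical.

```python
def allreduce_bytes_per_layer(tokens, hidden_size, bytes_per_value=2):
    """Assumed payload of one tensor-parallel all-reduce for a batch of tokens."""
    return tokens * hidden_size * bytes_per_value


def phase_messages(prompt_tokens, output_tokens, hidden_size):
    """Return (prefill, decode) lists of per-layer all-reduce sizes in bytes.

    Prefill handles the whole prompt in one pass: one large message.
    Decode emits one token per step: many small messages.
    """
    prefill = [allreduce_bytes_per_layer(prompt_tokens, hidden_size)]
    decode = [allreduce_bytes_per_layer(1, hidden_size)] * output_tokens
    return prefill, decode


if __name__ == "__main__":
    prefill, decode = phase_messages(prompt_tokens=1000, output_tokens=3, hidden_size=8192)
    print("prefill:", prefill)
    print("decode: ", decode)
```

Under these assumptions prefill is dominated by a few large transfers, where bandwidth matters, while decode is a long sequence of small transfers, where latency matters. Both are known in advance from the request shape.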
Tan Eng Chye 1:05:33
Yep. Okay, other questions.
Audience Member 1:05:37
Thanks for the talk. My name is Zan. I'm a current NUS EMBA student and also the CEO of a company doing mineral research by applying machine learning. My question is: for a very long period of time, NVIDIA GPUs were mostly used for gaming, and now, it seems, attention has shifted to AI. So I'm curious about NVIDIA's plan, your plan, regarding the gaming community, and how to make advanced gaming chips more affordable as well. Thank you.
William Dally 1:06:08
Yeah. So it's interesting. You know, when I started consulting for NVIDIA in 2003, I thought it was great. At the time, I'd actually been working on parallel computing for over 20 years, and I'd seen many parallel computing companies fail. The reason was that there really wasn't enough demand for scientific computing, which was the main driver of parallel computing at that time. So when I first started talking to Jensen about this, he said, 'Well, what's good about NVIDIA is our GPUs have a day job doing graphics, and then it's like they're playing in the rock band at night doing scientific computing, but at least they can pay the bills because of the graphics.'
And so for a long time, we took the results of the Stanford stream processing project and put them into GPUs and made them programmable so we could do scientific computing. Jensen was a real believer in that, because it didn't make money for 10 years. It lost money for the first 10 years that we put in programmability and developed CUDA. But he stuck with it. Graphics was the day job; scientific computing was playing in the rock band at night. And then when it started making money, it wasn't because of scientific computing. It was because of AI, right? The AI people decided, okay, we have these programmable GPUs, they're great for doing AI. And that started paying the bills.
Now that's the day job, and it supports the graphics, because every generation we do the B100 chips and the B200, B300, but we also do the RTX versions of those. They're different chips. The RTX versions still have the tensor cores, because we want to do AI for graphics; a lot of the denoising that's used to make ray tracing practical relies on that. But they also have the RT cores, the tree traversal units that are not in the data center chips. So we're still producing graphics chips every generation, and a lot of effort goes into making them really good at graphics.
And I think if you look at computer graphics today, the AAA games that are path traced, it's truly stunning, the video quality you get compared to what it was even 10 years ago.
Audience Member 1:08:24
Hi, I'm Joel. I'm a graduate student here at NUS. I'd like to know, just in computing in general, are there areas that you would like to see worked on more that are currently not being worked on?
William Dally 1:08:41
By "worked on," you mean by academic researchers? Yeah.
Audience Member 1:08:44
Just more brain power being put into it.
William Dally 1:08:46
Oh, that's an interesting one. To me, one of the big bottlenecks we have is that it takes too many people to turn out a new GPU. Especially with modern AI, it ought to be possible to think about what the specification is, give that to my little team of agents, and have it figure out the optimum parameters: how much memory, how much math, what the interconnect between all these things should look like, what the control structure should look like, how I synchronize it. I can go out and play on the lake all day, come back, and it will have the answer for me, and then it will be able to produce a mask set from that. So I would love to see people look at applying AI to reducing the amount of energy, people energy, required to turn out a new GPU, and the amount of elapsed time it takes to do that. It's something we easily spend 3,000 person-years on. I actually did a study once on what it would take to do the null GPU. The null GPU means new process technology and no changes in the design from the old one, and it was on the order of thousands of person-years, just because of the way the design process works.
Audience Member 1:10:03
And how far do you think we are from AI designing the new...
William Dally 1:10:08
We're a long way. But I'm hopeful that it can be done, right? That we'll all be able to give AI the problem and go off and do more fun things.
Tan Eng Chye 1:10:19
Okay. I think there is a...
Audience Member 1:10:21
I think it's fine. Thank you.
Thank you very much, Dr. Dally. I have a question about energy efficiency. As you know, today many people are criticizing AI for its large energy consumption, and I was wondering: do you think there is a kind of glass ceiling on the energy efficiency we can achieve in computing?
William Dally 1:10:43
Well, yeah, I think we're constantly striving to make more efficient GPUs, but there's a tension there. I can always make my GPU more efficient by operating at a lower voltage and running it slower. But that uses more die area, and ultimately what we wind up optimizing for is performance per cost. One aspect of cost is energy, but another aspect of cost is die area. So if energy becomes more expensive, then we'll switch that optimization to burn more die area to burn less energy. But every generation we also get a win-win. For example, going from 8-bit floating-point to NVFP4, we got essentially a factor of four improvement in energy efficiency without giving anything up. So I'm hoping there are ways that we'll continue discovering, and actually there are some features we're discussing for our next-generation GPU, which I unfortunately can't talk about, that should have comparable energy efficiency gains where we don't give anything up. So I think there's still a lot of room to do better on energy efficiency, and it's one of our major goals to keep pushing that frontier.
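The voltage, speed, and area tension can be sketched with a first-order model. The sketch assumes energy per operation scales with the square of the supply voltage, clock rate scales with voltage above a threshold, and die area grows as needed to hold throughput constant. The constants, the prices, and the function names are hypothetical and not NVIDIA figures.

```python
def scale_design(v, v_nominal=1.0, v_threshold=0.3):
    """Relative (energy per op, clock, area) at supply voltage v, fixed throughput.

    First-order assumptions: energy ~ V^2, clock ~ (V - Vt),
    and area ~ 1 / clock to keep the same total throughput.
    """
    if v <= v_threshold:
        raise ValueError("supply voltage must exceed the threshold voltage")
    energy_per_op = (v / v_nominal) ** 2
    clock = (v - v_threshold) / (v_nominal - v_threshold)
    area = 1.0 / clock
    return energy_per_op, clock, area


def best_voltage(energy_price, area_price, candidates=None):
    """Pick the candidate voltage minimising energy cost plus area cost."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(11)]  # roughly 0.50 to 1.00

    def cost(v):
        energy, _, area = scale_design(v)
        return energy_price * energy + area_price * area

    return min(candidates, key=cost)


if __name__ == "__main__":
    print("cheap energy :", best_voltage(energy_price=1.0, area_price=1.0))
    print("costly energy:", best_voltage(energy_price=4.0, area_price=1.0))
```

In this toy model, raising the energy price moves the optimum to a lower voltage and a larger die, which matches the direction of the trade-off described above.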
Tan Eng Chye 1:11:52
Okay, I think we are almost at the end of this session, so maybe one final question for you. We have quite a number of young students in this audience, and I have two questions on their behalf. The first is that, in general, here in the School of Computing, people are more interested in doing AI software than hardware, so computer architecture is not a very popular course. What is in it for them to take these architecture courses and join NVIDIA? That's my first question. The second question is that you have been in academia and are now in industry, and increasingly people feel that it's probably not so easy to compete with NVIDIA. At lunch today I actually met one of the co-founders of Groq, which was interesting, and he has spent something like 25 years to make this possible. So what do you think are the distinctive things that academia could do to have maximum impact?
William Dally 1:13:13
Oh, those are two very good questions. Let me deal with the first one. I actually see an interesting trend going on at a bunch of US universities like Stanford, which is that enrollments in computer science are way down and enrollments in electrical engineering are way up. I think it reflects an observation the students have made: entry-level programming jobs are going away because the AI agents are doing all of that, but the AI agents haven't taken over hardware design yet. I'm a hardware designer, and I think there's a lot of inherent human creativity required to get hardware designs right, which makes it less automatable. So I think there are going to be more opportunities for students on that side going forward. There are still going to be lots of opportunities for students on the software side, but it's going to be less coding and more managing the agents that code.
And then for academia, I think academics have a huge advantage. One thing I say about NVIDIA Research is that we have a huge advantage over our product development people, which is that we can afford to fail. Consider product development: Intel used to start five CPU designs and one of them would actually ship, so everybody going to work says, 'Oh, my design doesn't really have to work; one of the other four will work.' But everybody at NVIDIA realizes we're doing one, and it has to work or we're going out of business, we're all going to starve. So there's tremendous pressure not to do anything that might not work, and you don't have the opportunity to fail. In NVIDIA Research, we can actually do things that wind up being the lead feature that gives 2x or 4x performance per unit energy on the next generation, because we can try a lot of things that do fail. That opportunity of not having to work is a huge advantage.
And I think the academic world has an even bigger opportunity that way, in that you can take a longer-term perspective. You can try a lot of ideas that may not work, and I think that gives you an advantage over industry. Even though in NVIDIA Research we can fail, I have to deliver a certain number of things that work every year, or Jensen will lose his patience with me and cut off the funding.
Tan Eng Chye 1:15:34
So with that, thank you very much again, Dr. Dally, for this very succinct...