Dylan Patel 14:33
Yeah, there are sort of two metrics that I like to think about a lot. In training, everyone just talks about MFU, right? But then there's inference, and one, LLM inference, or multimodal, whatever, inference will be bigger than training, probably next year in fact, at least in terms of GPUs deployed. The other thing is, what's the bottleneck when you're running these models? The simple, stupid way to look at it is that in training there are six FLOPs, floating point operations, you have to do for every parameter you read in. If it's FP8, then a parameter is a byte; if it's FP16, it's two bytes, whatever. But on the inference side, the ratio is completely different. It's two to one, right? There are two FLOPs per parameter that you read in, and a parameter is maybe one byte, because that's INT8, eight bits. But then when you look at the GPUs, the GPUs have a very, very different ratio. The H100 has 3.35 terabytes a second of memory bandwidth and it has 1,000 teraFLOPs of FP16, BF16. So that ratio is like, well, I'm going to butcher the math here and people are going to think I'm dumb, but 256 to 1, right? Call it 256 to 1 if you're doing FP16. The same applies to FP8 in terms of parameters read to floating point operations, because if you quantize further, you also get double the performance at that lower precision. That does not fit the hardware at all. So if you're just doing LLM inference at batch one, then you're always going to be underutilizing the FLOPs. You're only paying for memory bandwidth. And the way hardware is developing, that ratio is actually only going to get worse. The H200 will come out soon enough, which will help the ratio a little bit, because it improves memory bandwidth more than it improves FLOPs, just like the A100 80 gig did versus the A100 40 gig. But then when the B100 comes out, the FLOPs are going to increase more than memory bandwidth.
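A minimal sketch of the ratio being described, in Python. It assumes the H100 figures quoted above (3.35 terabytes a second, 1,000 teraFLOPs at FP16) and the two-FLOPs-per-parameter rule of thumb; the function and constant names are hypothetical, chosen for illustration. With those figures the hardware ratio works out closer to 300 to 1 than 256 to 1, which is the same order of magnitude and does not change the argument.

```python
# Rough roofline-style arithmetic. Ignores attention, KV cache, and activations.

H100_PEAK_FLOPS = 1000e12  # FP16/BF16 figure as quoted in the conversation
H100_MEM_BW = 3.35e12      # bytes per second, as quoted in the conversation


def hardware_flops_per_byte(peak_flops: float, mem_bw: float) -> float:
    """FLOPs the chip can perform in the time it takes to read one byte."""
    return peak_flops / mem_bw


def inference_flops_per_byte(batch_size: int, bytes_per_param: float,
                             flops_per_param: float = 2.0) -> float:
    """FLOPs the model performs per byte of weights read in one forward pass.

    Each weight is read once and reused for every sequence in the batch,
    so the work per byte grows linearly with batch size.
    """
    return flops_per_param * batch_size / bytes_per_param


def crossover_batch_size(peak_flops: float, mem_bw: float,
                         bytes_per_param: float,
                         flops_per_param: float = 2.0) -> float:
    """Batch size at which the model's ratio matches the hardware's ratio."""
    hw_ratio = hardware_flops_per_byte(peak_flops, mem_bw)
    return hw_ratio * bytes_per_param / flops_per_param


if __name__ == "__main__":
    hw = hardware_flops_per_byte(H100_PEAK_FLOPS, H100_MEM_BW)
    print(f"H100 FLOPs per byte (FP16): {hw:.0f}")
    print(f"Batch-1 INT8 inference FLOPs per byte: "
          f"{inference_flops_per_byte(1, 1.0):.0f}")
    print(f"Approximate FP16 crossover batch size: "
          f"{crossover_batch_size(H100_PEAK_FLOPS, H100_MEM_BW, 2.0):.0f}")
```

At batch one the model asks for about 2 FLOPs per byte while the chip offers a few hundred, which is why the FLOPs sit idle and memory bandwidth sets the speed.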
And when future generations come out, and the same on the AMD side, MI300 versus 400, as you move through generations, just due to fundamental semiconductor scaling, memory is not scaling as fast as logic has been. You can do a lot of interesting things on the architecture, but you're going to have this problem get worse and worse and worse. On training, it's, you know, who cares, right? Because my FLOPs are still my bottleneck. I mean, memory bandwidth is obviously a bottleneck, but batch sizes are freaking crazy. A 2 million batch size is trivial, right? That's what Llama, I think, did. Llama 70B was 2 million batch size. And you talk to someone at one of the frontier labs and they're like, yeah, just 2 million, right? 2 million token batch size, that's crazy. Or sequence, sorry. But when you go to the inference side, it's impossible to do a 2 million batch size. Also your latency would be horrendous if you tried to do something that crazy. So you kind of have this differing problem where on training everyone just kept talking MFU, model FLOP utilization, right? How many FLOPs am I getting, six times the number of parameters, basically, more or less, versus what's the quoted number. So if I have 312 teraFLOPs out of my A100 and I was able to achieve 200, that's really good. Some people are achieving higher, some people are achieving lower. That's a very important metric to think about. Now you have people thinking MFU is like a security risk. But on inference, MFU is not nearly as important. It's memory bandwidth utilization. At batch one, the question is, what memory bandwidth can I achieve? Because as I increase batch size from one to four to eight to even 256, which is sort of where the crossover happens on an H100 inference-wise, it's FLOPs limiting you more and more. But you should have very high memory bandwidth utilization.
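The MFU arithmetic mentioned here can be sketched in a few lines of Python. This assumes the common approximation of six FLOPs per parameter per training token and the 312 teraFLOPs A100 figure quoted above; the function names and the throughput number in the usage example are hypothetical, for illustration only.

```python
def achieved_training_flops(n_params: float, tokens_per_second: float) -> float:
    """Approximate useful FLOPs per second: 6 FLOPs per parameter per token."""
    return 6.0 * n_params * tokens_per_second


def mfu(achieved_flops: float, peak_flops: float) -> float:
    """Model FLOP utilization: achieved FLOPs over the quoted peak."""
    return achieved_flops / peak_flops


if __name__ == "__main__":
    a100_peak = 312e12  # quoted A100 figure from the conversation

    # The example from the conversation: 200 teraFLOPs achieved out of 312.
    print(f"MFU at 200 of 312 teraFLOPs: {mfu(200e12, a100_peak):.0%}")

    # Hypothetical per-GPU throughput, only to show how the 6x rule is applied.
    flops = achieved_training_flops(n_params=70e9, tokens_per_second=450)
    print(f"MFU for a 70B model at 450 tokens/s per GPU: {mfu(flops, a100_peak):.0%}")
```

Achieving 200 out of 312 teraFLOPs corresponds to roughly 64% MFU under this definition.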
So when people talk about A100s, like 60% MFU is decent. On H100s, it's more like 40-45% because the FLOPs increased more than the memory bandwidth. But people over time will probably get above 50% on H100 on MFU on training. But on inference, it's not being talked about much, but MBU, model bandwidth utilization, is the important factor. So of the 3.35 terabytes a second of memory bandwidth I have on my H100, can I get two, can I get three? That's the important thing. And right now, if you look at everyone's inference stuff, so I dogged on this in the GPU Poor thing, but Hugging Face's libraries are actually very inefficient, like incredibly inefficient for inference. You get like 15% MBU on some configurations, like eight A100s and Llama 70B, you get like 15%, which is just horrendous. Because at the end of the day, your latency is derived from what memory bandwidth you can effectively get. So if you're doing Llama 70 billion, 70 billion parameters, if you're doing it in INT8, okay, that's 70 gigabytes you need to read for every single inference, every single forward pass, plus the attention, but again, we're simplifying it. 70 gigabytes you need to read for every forward pass. What is an acceptable latency for a user to have? I would argue, you know, 30 milliseconds per token. Some people would argue lower, but at the very least you need to achieve human reading level speeds and probably a little bit faster because we like to skim. To have a usable model for chatbot-style applications, now there's other applications of course, but for chatbot-style applications, you want it to be human reading speed. So 30 milliseconds per token is 33 tokens per second. Times 70 is, let's say 3 times 7 is 21 and then add two zeros, so 2,100 gigabytes a second to achieve human reading speed on Llama 70B.
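The back-of-the-envelope number above can be reproduced with a short Python sketch. It assumes, as the speaker does, that every weight is read once per forward pass and that attention and KV cache traffic are ignored; the function names are hypothetical. Using 33.3 tokens per second rather than the rounded 30 gives about 2,333 gigabytes a second instead of 2,100.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency target into a token rate."""
    return 1000.0 / ms_per_token


def required_bandwidth_gb_s(n_params_billion: float, bytes_per_param: float,
                            ms_per_token: float) -> float:
    """Memory bandwidth (GB/s) needed to stream all weights once per token.

    Simplification from the conversation: weights only, no attention or
    KV cache reads, batch size one.
    """
    gb_per_forward_pass = n_params_billion * bytes_per_param
    return gb_per_forward_pass * tokens_per_second(ms_per_token)


if __name__ == "__main__":
    # Llama 70B in INT8 (1 byte per parameter) at 30 ms per token.
    exact = required_bandwidth_gb_s(70, 1.0, 30.0)
    rounded = 70 * 1.0 * 30  # the speaker's rounded 30 tokens/s shortcut
    print(f"Exact:   {exact:.0f} GB/s")
    print(f"Rounded: {rounded:.0f} GB/s")
```

Either way the requirement is a bit over 2 terabytes a second of effective bandwidth just for the weights.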
So one, you can never achieve Llama 70B human reading speed on an A100, even if you had enough memory capacity, right? Even on an H100, to achieve human reading speed, of course you couldn't fit it, because it's 80 gigabytes versus 70 billion parameters. So you're kind of butting up against the limits already. With 70 billion parameters being 70 gigabytes in INT8 or FP8, you end up with, one, how do I achieve human reading level speeds? So if I go with two H100s, then now I have, call it 6 terabytes a second of memory bandwidth. If I achieve just 30 milliseconds per token, which is 33 tokens per second, which is 2.1 terabytes a second of memory bandwidth, then I'm only at like 30% bandwidth utilization. So I'm not using all my FLOPs at batch one anyway, because the FLOPs you're using there are tremendously low, and I'm not actually using a ton of the memory bandwidth on inference either. So if with two H100s I only get 30 milliseconds per token, that's a really bad result. You should be striving to get, you know, upwards of 60%. And 60% is kind of low too, right? I've heard of people getting 70-80% model bandwidth utilization. Obviously you can increase your batch size from there, and your model bandwidth utilization will start to fall as your FLOPs utilization increases. But there you have to pick the sweet spot for where you want to hit on the latency curve for your user. Obviously as you increase batch size, you get more throughput per GPU, so that's more cost-effective. There's a lot of things to think about there, but I think those are sort of the two main things that people want to think about. And there's obviously a ton with regards to networking and inter-GPU connection, because most of the useful models don't run on a single GPU. They can't run on a single GPU.
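To make the two-H100 example concrete, here is a hedged Python sketch of the bandwidth utilization calculation. It assumes the weights are split evenly across the GPUs, that aggregate bandwidth is simply the per-GPU figure times the GPU count, and that attention, KV cache, and inter-GPU communication are ignored; the function names are hypothetical. It uses 3.35 terabytes a second per H100 rather than the rounded "call it 6" total.

```python
H100_MEM_BW = 3.35e12  # bytes per second per GPU, as quoted in the conversation


def bandwidth_utilization(weight_bytes: float, tokens_per_second: float,
                          n_gpus: int, per_gpu_bw: float) -> float:
    """Fraction of aggregate memory bandwidth used to stream weights at batch one."""
    achieved = weight_bytes * tokens_per_second
    return achieved / (n_gpus * per_gpu_bw)


def ms_per_token_at_utilization(weight_bytes: float, utilization: float,
                                n_gpus: int, per_gpu_bw: float) -> float:
    """Per-token latency (ms) if a given fraction of peak bandwidth is achieved."""
    effective_bw = utilization * n_gpus * per_gpu_bw
    return 1000.0 * weight_bytes / effective_bw


if __name__ == "__main__":
    weights = 70e9  # Llama 70B at 1 byte per parameter (INT8 or FP8)

    # The speaker's rounded case: about 30 tokens/s, i.e. 2.1 TB/s achieved.
    util = bandwidth_utilization(weights, 30.0, 2, H100_MEM_BW)
    print(f"Utilization at ~30 tokens/s on two H100s: {util:.0%}")

    # What the 60% and 80% targets mentioned above would mean for latency.
    for target in (0.6, 0.8):
        ms = ms_per_token_at_utilization(weights, target, 2, H100_MEM_BW)
        print(f"{target:.0%} utilization -> {ms:.1f} ms per token")
```

Under these assumptions, moving from roughly 30% to 60% utilization cuts per-token latency from about 30 milliseconds to about 17, without adding hardware.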