Dylan Patel

Founder, CEO, and Chief Analyst, SemiAnalysis

The State of Silicon and the GPU Poors - with Dylan Patel of SemiAnalysis

🎥 Aug 28, 2023 📺 Latent Space ⏱ 67m 👁 9740 views

If Charles Dickens was alive in 2024, A Tale of Two Cities might be the divide between the “GPU poor” and the “GPU rich”. We mentioned these terms in some of our previous episodes; they were originally coined by Dylan Patel of SemiAnalysis in his “Gemini Eats the World” post, put on blast by Sam Altman. SemiAnalysis are one of the most in depth research and consulting firms in the semis world, and have a unique insight into the design, production, and supply chain of GPUs based on their ground presence in Asia. In this episode we break down the State of Silicon: when are more GPUs coming? Ar...


About Dylan Patel

Dylan Patel, founder and CEO of SemiAnalysis, has been speaking at several industry events in early 2026 about AI infrastructure, benchmarking, and market dynamics. At an Aria Networks launch event in April, Patel stated that AI inference demand has grown so rapidly that the rental price of three-year-old H100 GPUs has risen from around $160-170 per hour to over $240 per hour in six months, with no spare capacity available. He also discussed the InferenceX project, which he described as a free and open-source benchmarking effort with over a thousand GPUs donated by companies including OpenAI, Microsoft, and Nvidia. In a March interview at the Daytona Compute Conference, Patel said that hyperscalers like Google, Amazon, and Microsoft were slow to move into AI, creating an opportunity for "NeoClouds" that could skip complex legacy software. He also noted that the entire cloud market had run out of CPUs, with Amazon's CPU server installations tripling year-over-year. In an April interview with Patrick O'Shaughnessy, Patel said his firm's AI token spend had skyrocketed from tens of thousands of dollars annually to $7 million, driven by non-technical staff using AI for coding. He stated that "ideas are cheap and plentiful but execution is very easy," and warned that people who do not use more tokens, generate value from them, and capture that value will "never escape the permanent underclass." Patel also predicted a "large scale protest against Anthropic and AI," citing a Pew survey that he said showed AI is less popular than politicians. In a panel at the Beyond Summit, Patel asserted that vendor benchmark claims are "lies, impossible to achieve," and that "if you're not pissing off people with your benchmark, then you're not testing something useful."

Source: AI-verified profile updated from Dylan Patel's recent appearances.

Transcript (56 segments)

✨ AI-enhanced transcript with speaker attribution

Alesio Partner 0:06

Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-residence at Decibel Partners. I'm joined by my co-host swyx, founder of Smol AI, and today we have Dylan Patel and the P-Min Studios. Welcome.

Dylan Patel 0:17

Well, thank you for having me. And it was very short notice, right?

Alesio Partner 0:22

Yes, yes, just hours. I was thinking you were in Taiwan somewhere and I was like, it's going to be hard to schedule this guy. But I'm sure you visit San Francisco and obviously you just DM'd me on the day of and you go like, let's set something up.

Dylan Patel 0:34

Yeah, yeah. Well, you know, the folks at To gave me this hat and then they mentioned you and I was like, oh yeah, we talked about something. And then you know, you mentioned from Taiwan you didn't see this. I was talking to swyx about this, but this is a mooncake from Taiwan that I brought back. So you know, hopefully you'll enjoy that.

Alesio Partner 0:49

Nice, thank you. Amazing. So you're the author of the extremely popular SemiAnalysis blog. We have both had a little bit of credentials or claim to fame in breaking details of GPT-4. George Hotz came on our pod and talked about the mixture of experts thing, and then you had a lot more detail.

Dylan Patel 1:08

To be clear, I talked about mixture of experts in January. It's just people didn't really notice it, I guess. I don't know.

Alesio Partner 1:13

You went into a lot more detail and I'd love to dig into some of that. But anyway, so welcome. Congrats on all your success so far.

Dylan Patel 1:20

Yeah, thank you so much. You know, it's really interesting. I've been doing consulting in the semiconductor industry since 2017. And like, you know, in 2021 I got bored and in November I started writing a blog. And then like 2022 I was good and I started hiring folks for my firm. And then all of a sudden 2023 happens and it's like the perfect intersection because I used to do data science but not like AI, not really. Like multivariable regression is not full AI, right? But also I've been involved in the semi industry for a long, long time, posting about it online since I was 12. And so it's like the perfect time and place because semiconductors became important. All of a sudden it wasn't like this boring thing. And then also the shortage in 2021 also mattered. But like, all of a sudden this all kind of came to fruition. So it's cool to have the blog sort of blow up. I used to cover semi at Basni as well. And for a long time it was just a mobile cycle and then a little bit of PCs, but not that much. And then maybe some cloud stuff, you know, public cloud semiconductor stuff. But it really wasn't anything until this wave. And I was actually listening to you on one of the previous podcasts that you've done and it was surprising that high performance computing also kind of didn't really take off. Like, AI is just the first form of high performance computing that worked.

Alesio Partner 2:46

One of the theses I've had for a long time that I think people haven't really caught on, but it's really, really coming to fruition now, is that the largest tech companies in the world, their software is important, but actually having an operating, a very efficient infrastructure is incredibly important. And so, you know, people talk about, hey, Amazon is great for AWS is great because yes, it is easy to use and they built all these things. But behind the scenes, and no one really talks about it that much, but it's like behind the scenes they've done a lot on the infrastructure that is super custom that Microsoft Azure and Google Cloud just don't even match in terms of efficiency. Like if you think about the cost to rent out SSD space, so the cost to rent, you know, offer database service on top of that, obviously a cost to rent out a certain level of CPU performance, Amazon has a massive advantage there. And likewise, Google spent all this time doing that in AI with their TPUs and infrastructure there and optical switches and all this sort of stuff. And so like in the past it wasn't immediately obvious, but I think with AI especially, like the scaling laws are going, it's incredibly important for, you know, infrastructure is so much more important. And then like when you just think about software cost, right, like the cost structure of it, there was always a bigger component of R&D and like SaaS businesses, you know, all over SF, right, like all these SaaS businesses did crazy good because you know, they just start as they grow and then all of a sudden they're so freaking profitable for each incremental new customer. And AI software looks like it's going to be very different in my opinion. The R&D cost is much lower in terms of people, but the cost of goods sold in terms of actually operating the service I think will be much higher. And so in that same sense, infrastructure matters a ton for that.

Dylan Patel 4:30

I think you wrote on that training costs effectively don't matter. Yeah, in my opinion, I think that's a little bit spicy, but yeah, it's like training costs are irrelevant. Like GPT-4, right, like 20,000 A100s, that's like, I know it sounds like a lot of money, 500 million all in.

Alesio Partner 4:46

Is that a reasonable estimate?

Dylan Patel 4:48

Yeah, I think for the supercomputer it's slightly more, but yeah, I think the 500 million is a fair enough number. I mean, if you think about just the pre-training, right, 3 months, 20,000 A100s at a dollar an hour is like, that is way less than 500 million. Of course, there's data and all this sort of stuff.
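The back-of-envelope pre-training figure above can be checked with a few lines of arithmetic. This is only a sketch: the GPU count, duration, and hourly rate are the round numbers quoted in the conversation, "3 months" is assumed to mean 90 days, and the function name is illustrative.

```python
def pretraining_rental_cost(num_gpus: int, days: float, usd_per_gpu_hour: float) -> float:
    """Rental-style cost of a run: GPUs x hours x hourly rate (no data, staff, or failed runs)."""
    hours = days * 24
    return num_gpus * hours * usd_per_gpu_hour


# Round numbers from the conversation: 20,000 A100s, about 3 months, $1 per GPU-hour.
cost = pretraining_rental_cost(num_gpus=20_000, days=90, usd_per_gpu_hour=1.0)
print(f"${cost / 1e6:.1f}M")  # roughly $43M, far below the $500M all-in figure
```

Under these assumptions the pre-training run itself is less than a tenth of the $500 million all-in number, which is the gap the speaker is pointing at.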

Alesio Partner 5:05

Yeah. So people that are watching this on YouTube, they can see a GPU Poor and a GPU Rich hat on the table, which is inspired by your Google Gemini blog post. Did you know that this thing was going to blow up so much? Sam Altman even tweeted about it. He said, incredible, Google got the SemiAnalysis guy to publish their internal marketing/recruiting chart. And yeah, tell people who are the GPU Poor, who are the GPU Rich, like what's this framework that they should think about.

Dylan Patel 5:33

It's, you know, some of this work we've been doing for a while is just on infrastructure and like, hey, like when something happens, I think it's like a sort of competitive advantage of our firm, right, me myself and my colleagues, is like we go from software all the way through to like low-level manufacturing. And it's like, who, you know, oh, Google's actually ramping up TPU production massively. And like, I think people in AI would be like, well, duh, but like, okay, like who has the capability of figuring out the number? Well, one, you can just get Google to tell you, but they won't tell you, right? That's like a very closely guarded secret and most people that work at Google DeepMind don't even know that number. Two, you go through the supply chain and see what they've placed in orders. But then three is sort of like, well, who's actually winning from this? Like, hey, oh, Celestica is building these boxes. Wow, oh, interesting. This company's involved in testing for them. Oh, okay. This company's providing design IP to them. Okay, okay. Like that's very valuable on a monetary sense. But you know, you have to understand the whole technology stack. But on the flip side, right, is like, well, why is Google building all these? What could they do with it? And what does that mean for the world and the state of the world? Is like, especially in SF, right, like I'm sure you folks have been to parties. People just brag about how many GPUs they have. Like it's happened to me multiple times where someone's just like, I'm just witnessing a conversation where somebody from Meta is bragging about how many GPUs they have versus someone from another firm. And then it's like, or like a startup person's like, dude, can you believe we just acquired, we have 512 H100s coming online in August. And it's like, oh, cool. 
But then you're like, you know, going through the supply chain, it's like, dude, you realize there's 400 to 500,000 being manufactured last quarter and like 530,000 this quarter being sold, right, of H100s. It's like, oh crap, that's a lot. You know, so sort of like, that's a lot of GPUs. But then like, oh, how does that compare to Google? And like, there's one way to look at the world which is just like, hey, scale is all you need. Like, obviously data matters, obviously all this stuff matters, but given any data set, a larger model will just do better. I think it's going to be more expensive, but it's going to do better. There's the view of like, okay, there's all these GPUs going to production. Nvidia is going to sell well over 3 million total GPUs next year, over a million H100s this year alone. There's a lot of GPU capacity coming online. It's an incredible amount. And like, well, what are people doing? What are people working on? I think it's very important to just think about what are people working on. What actually are you building that's going to advance, you know, what is monetizable, but what also makes sense. And so like, a lot of people were doing things that I thought felt counterproductive in a world where in less than a year there's going to be more than 4 million high-end GPUs out there. We can talk about the concentration of those GPUs, but if you're doing really valuable work as a good person, right, like you're contributing in some way, should you be focused on like, well, I don't have access to any of those 4 million GPUs, right? I actually only have access to gaming GPUs. Should I focus on like being able to fine-tune a model on that? Like, no, it's not really that important. Or like, should I be focused on batch one inference on a cloud GPU? Like, no, that's pointless. Like, why would you do batch size one inference on an H100? That's just ridiculously dumb. There's a lot of counterproductive work. 
And at the same time, there's a lot of things that people should be doing. And so like, you know, kind of you can tier the world into like, hey, like, I mean, obviously most people don't have resources. And I love the open source and I want the open source to win. And I hate the people who want to like, you know, just like, no, we're xAI and we think this is the only way you should do it and if people don't do it this way they should be regulated against it and all this kind of stuff. But I hate that attitude. So I want the open source to win. Companies like Mistral and like what Meta are doing and Mosaic and all these folks together, all these people doing huge stuff for the open source, want them to succeed. But it's like, there's certain things that are like hyper-focusing on leaderboards at Hugging Face. Like, that's just like, no, TruthfulQA is a garbage benchmark. Some of the models that are very high on there, if you use it for 5 seconds, you're like, this is garbage. And it's just like, you're gaming a benchmark. There was things I wanted to say also, you know, we're in a world where compute matters a lot. Google is going to have more compute than any other company in the world, period, by a large, large factor. And so it's just like framing it into that mindset of like, hey, what are the counterproductive things? What do I think personally or what have people told me that are involved in this should they focus on? And what is the world where, hey, the pace of acceleration from 2020 to 2022 is less than 2022 to 2024. We are growing, you know, GPT-2 to 4, 2 to 4 is like 2020 to 2022, is less than I think from GPT-4 in 2022, which is when it was trained, to what OpenAI and Google and Anthropic would do in 2025. I think the pace of acceleration is increasing. And it's just good to think about that sort of stuff. I don't know where I'm rambling with this, but yeah.

Alesio Partner 10:54

Yeah, that makes sense. And the chart that swyx mentioned is about Google TPU v5s completely overtaking by orders of magnitude. Let's talk about the TPU a bit. We had Chris Lattner on the show, which I know you know. He used to work on TensorFlow at Google and he did mention that the goal of Google is like make TPUs go fast with TensorFlow. But then he also had a post about PyTorch kind of stealing the thunder, so to speak. How do you see that changing now that a lot of the compute will be TPU-based and Google wants to offer some of that to the public too?

Dylan Patel 11:32

I mean, Google internally, and I think, you know, is obviously on JAX and XLA and all that kind of stuff. But externally, like they've done a really good job. Like, I wouldn't say like TPUs through PyTorch XLA is amazing, but it's not bad. Some of the numbers they've shown, some of the code they've shown for TPU v5e, which is not the TPU v5 that I was referring to in the GPU Poor post, TPU v5e is like the new one, but it's mostly an inference chip. It's a small chip, it's about half the size of a TPU v5. That chip, you can get very good performance on like Llama 70B inference. So like when you're using PyTorch and XLA, now of course you're going to get better if you go JAX XLA. But I think Google is doing a really good job after the restructuring of focusing on external customers too. Like, hey, TPU v5, we probably won't focus too much on TPU v5 for everyone externally, but v5e, we're also building a million of those. A lot of companies are using them or will be using them because it's going to be an incredibly cheap form of compute. The world of frameworks and all that, that's obviously something a researcher should talk about, not myself. But the stats are clear that PyTorch is way, way dominating everything. JAX is doing well, there's external users of JAX. But in the end, like there's the front end, right? The front end is like what we're referring back to, maybe it's something we do later and you guys are going to edit after, the layers of abstraction. Forever shouldn't be that the person doing PyTorch level code should also be writing custom CUDA kernels. There should be different layers of abstraction where people hyper-optimize and make it much easier for everyone to innovate on separate stacks. And then every once in a while someone comes through and pierces through the layers of abstraction and innovates across multiple, or a group of people. But I think frameworks are important, but compilers are important. Chris Lattner's what he's doing is really cool. 
I don't know if it'll work, but it's super cool and it certainly works on CPUs. We'll see about accelerators. Likewise, there's OpenAI's Triton, like what they're trying to do there. And like, everyone's really coalescing around Triton. Third-party hardware vendors, there's Pallas. I don't know if you've heard about that, but I don't want to mischaracterize it, but you can write in Pallas and it'll go through lower-level code and it'll work to TPUs and GPUs, kind of like Triton. But like there's a backend for Triton, I don't know exactly everything about it. But I think there's a lot of innovation happening on making things go faster. How do you go faster? Because every single person working in ML, it would be a travesty if they had to write like custom CUDA kernels always. That would just slow down productivity. But at the same time, you kind of have to.

Alesio Partner 14:25

Yeah. Good. By the way, I like to quantify things. When you say make things go faster, is there a target range of like MFU that you typically talk about?

Dylan Patel 14:33

Yeah, there's sort of two metrics that I like to think about a lot. So in training, everyone just talks about MFU, right? But then on inference, which I think is, you know, one, LLM inference will be bigger than training or multimodal, whatever, inference will be bigger than training, you know, probably next year in fact, at least in terms of GPUs deployed. The other thing is like, what's the bottleneck when you're running these models? So like the simple stupid way to look at it is training is you there's six FLOPs, floating point operations, you have to do for every byte you read in, every parameter you read in. So if it's FP8, then it's a byte, if it's FP16, it's two bytes, whatever, on training. But on inference side, the ratio is completely different. It's two to one, right? There's two FLOPs per parameter that you read in and parameters maybe one byte, right, because that's INT8, right, eight bits. But then when you look at the GPUs, the GPUs are very, very different ratio. The H100 has 3.35 terabytes a second of memory bandwidth and it has 1,000 teraFLOPs of FP16, BF16. So that ratio is like, well, I'm going to butcher the math here and people are going to think I'm dumb, but 256 to 1, right? Call it 256 to 1 if you're doing FP16. Same applies to FP8 because, anyways, per parameter read to number of floating point operations. If you quantize further, then you also get double the performance on that lower quantization. That does not fit the hardware at all. So if you're just doing LLM inference at batch one, then you're always going to be underutilizing the FLOPs. You're only paying for memory bandwidth. And the way hardware is developing, that ratio is actually only going to get worse. H200 will come out soon enough, which will help the ratio a little bit, improve memory bandwidth more than improves FLOPs, just like the A100 80 gig did versus the A100 40 gig. But then when the B100 comes out, the FLOPs are going to increase more than memory bandwidth. 
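The mismatch between what batch-one inference needs and what the hardware offers can be sketched numerically. The H100 figures are the ones quoted in the conversation, the 2-FLOPs-per-parameter rule is the speaker's simplification (attention and KV-cache traffic are ignored), and all function names are illustrative. With these figures the hardware ratio lands near 300 rather than 256, in line with the speaker's caveat that he was approximating.

```python
# Figures quoted in the conversation for an H100 (treated here as round numbers).
H100_BANDWIDTH_BYTES_PER_S = 3.35e12   # 3.35 TB/s of memory bandwidth
H100_FP16_FLOPS_PER_S = 1000e12        # roughly 1,000 TFLOPs of FP16/BF16


def hardware_flops_per_byte(flops_per_s: float, bytes_per_s: float) -> float:
    """FLOPs the chip can execute in the time it takes to read one byte."""
    return flops_per_s / bytes_per_s


def inference_flops_per_byte(batch_size: int, bytes_per_param: float) -> float:
    """FLOPs the workload needs per weight byte read.

    Simplification: 2 FLOPs per parameter per token, with each weight
    read once and reused across every sequence in the batch.
    """
    return 2 * batch_size / bytes_per_param


def crossover_batch_size(flops_per_s: float, bytes_per_s: float, bytes_per_param: float) -> float:
    """Batch size at which compute, rather than bandwidth, becomes the limit."""
    return hardware_flops_per_byte(flops_per_s, bytes_per_s) * bytes_per_param / 2


ratio = hardware_flops_per_byte(H100_FP16_FLOPS_PER_S, H100_BANDWIDTH_BYTES_PER_S)
need_at_batch_one = inference_flops_per_byte(batch_size=1, bytes_per_param=1)  # INT8 weights
crossover_fp16 = crossover_batch_size(H100_FP16_FLOPS_PER_S, H100_BANDWIDTH_BYTES_PER_S, 2)

print(f"hardware offers {ratio:.0f} FLOPs per byte read")
print(f"batch-1 INT8 inference needs {need_at_batch_one:.0f} FLOPs per byte read")
print(f"compute-bound above roughly batch {crossover_fp16:.0f} with FP16 weights")
```

At batch one the workload asks for about two FLOPs per byte while the chip can supply a few hundred, which is why the FLOPs sit idle and the job is paying for memory bandwidth.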
And when future generations come out, and the same with AMD side, MI300 versus 400, as you move on generations, just due to fundamental semiconductor scaling, memory is not scaling as fast as logic has been. And so you're going to continue to, and you can do a lot of interesting things on the architecture. So you're going to have this problem get worse and worse and worse. And so on training, it's very, you know, who cares, right? Because my FLOPs are still my bottleneck. I mean, memory bandwidth is obviously a bottleneck, but like, well, batch sizes are freaking crazy. People train like 2 million batch size is trivial, right? Like that's what Llama, I think, did. Llama 70B was 2 million batch size. And like you talk to someone at one of the frontier labs and they're like, yeah, just 2 million, right? 2 million token batch size, that's crazy. Or sequence, sorry. But when you go to inference side, it's like, well, it's impossible to do 2 million batch size. Also your latency would be horrendous if you tried to do something that crazy. So you kind of have this differing problem where on training everyone just kept talking MFU, model FLOP utilization, right? How many FLOPs, six times the number of parameters basically more or less, and then what's the quoted number. So if I have 312 teraFLOPs out my A100 and I was able to achieve 200, that's really good. Some people are achieving higher, some people are achieving lower. That's a very important metric to think about. Now you have like people thinking MFU is like a security risk. But on inference, MFU is not nearly as important. It's memory bandwidth utilization. Batch one is, you know, what memory bandwidth can I achieve? Because as I increase batch from batch size one to four to eight to even 256, right, is sort of where the crossover happens on an H100 inference-wise, where it's FLOPs limiting you more and more. But like, you should have very high memory bandwidth utilization. 
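MFU as described above is achieved FLOPs divided by the quoted peak, with achieved FLOPs estimated as roughly six times the parameter count per trained token. A minimal sketch follows; the 312 and 200 teraFLOPs figures come from the conversation, while the tokens-per-second throughput in the second example is a made-up number used only to show the formula.

```python
def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOP utilization: fraction of the quoted peak actually achieved."""
    return achieved_flops_per_s / peak_flops_per_s


def mfu_from_throughput(num_params: float, tokens_per_s: float, peak_flops_per_s: float) -> float:
    """Estimate MFU from training throughput, using about 6 FLOPs per parameter per token."""
    achieved = 6 * num_params * tokens_per_s
    return achieved / peak_flops_per_s


A100_PEAK = 312e12  # quoted A100 peak, in FLOPs per second

# Example from the conversation: 200 teraFLOPs achieved out of 312.
example_mfu = mfu(200e12, A100_PEAK)
print(f"{example_mfu:.0%}")

# Hypothetical throughput: a 70B-parameter model at 450 tokens/s per GPU.
hypothetical_mfu = mfu_from_throughput(70e9, 450, A100_PEAK)
print(f"{hypothetical_mfu:.0%}")
```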
So when people talk about A100s, like 60% MFU is decent. On H100s, it's more like 40-45% because the FLOPs increased more than the memory bandwidth. But people over time will probably get above 50% on H100 on MFU on training. But on inference, it's not being talked about much, but MBU, model bandwidth utilization, is the important factor. So of my 3.35 terabytes a second memory bandwidth I have on my H100, can I get two, can I get three? That's the important thing. And right now, if you look at everyone's inference stuff, so I dogged on this in the GPU Poor thing, but it's like Hugging Face's libraries are actually very inefficient, like incredibly inefficient for inference. You get like 15% MBU on some configurations, like eight A100s and Llama 70B, you get like 15%, which is just horrendous. Because at the end of the day, your latency is derived from what memory bandwidth you can effectively get. So if you're doing Llama 70 billion, 70 billion parameters, if you're doing it in INT8, okay, that's 70 gigabytes a second, gigabytes you need to read for every single inference, every single forward pass, plus the attention, but again, we're simplifying it, 70 gigabytes you need to read for every forward pass. What is an acceptable latency for a user to have? I would argue, you know, 30 milliseconds per token. Some people would argue lower, but at the very least you need to achieve human reading level speeds and probably a little bit faster because we like to skim. To have a usable model for chatbot-style applications, now there's other applications of course, but chatbot-style applications, you want it to be human reading speed. So 30 tokens per second, 30 tokens per second is 33 or 30 milliseconds per token is 33 tokens per second. Times 70 is, let's say 3 times 7 is 21 and then add two zeros, so 2,100 gigabytes a second to achieve human reading speed on Llama 70B. 
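The bandwidth requirement worked out verbally above can be written as a one-line formula: bytes read per forward pass times tokens per second. This sketch keeps the speaker's simplification that one forward pass reads every weight once and ignores attention and KV-cache reads; the function name is illustrative. The speaker rounds 33 tokens per second down to 30 when multiplying, which gives about 2,100 GB/s; the unrounded figure is a little higher.

```python
def required_bandwidth_gb_per_s(model_gb: float, ms_per_token: float) -> float:
    """Effective memory bandwidth needed to read the whole model once per token."""
    tokens_per_s = 1000 / ms_per_token
    return model_gb * tokens_per_s


# Llama 70B in INT8: about 70 GB of weights, 30 ms per token target.
exact = required_bandwidth_gb_per_s(model_gb=70, ms_per_token=30)
rounded = 70 * 30  # the speaker's mental arithmetic: 70 GB x 30 tokens/s

print(f"unrounded: {exact:.0f} GB/s")
print(f"speaker's rounding: {rounded} GB/s")
```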
So one, you can never achieve Llama 70B human reading speed on, even if you had enough memory capacity on a model, on an A100, right? Even on an H100 to achieve human reading speed, of course you couldn't fit it because it's 80 gigabytes versus 70 billion parameters. So you're kind of butting up against the limits already. 70 billion parameters being 70 gigabytes in INT8 or FP8, you end up with one, how do I achieve human reading level speeds? So if I go with two H100s, then now I have, call it 6 terabytes a second of memory bandwidth. If I achieve just 30 milliseconds per token, then I'm, which is 33 tokens per second, which is 2.1 terabytes a second of memory bandwidth, that I'm only at like 30% bandwidth utilization. So I'm not using all my FLOPs on batch one anyways, because the FLOPs that you're using there is tremendously low relative to inference. And I'm not actually using a ton of the tokens on inference. So with two H100s, I only get 30 milliseconds per token. That's a really bad result. You should be striving to get, you know, so upwards of 60%. And 60% is kind of low too, right? Like I've heard people getting 70-80% model bandwidth utilization. Obviously you can increase your batch size from there and your model bandwidth utilization will start to fall as your FLOPs utilization increases. But you know, there you have to pick the sweet spot for where you want to hit on the latency curve for your user. Obviously as you increase batch size, you get more throughput per GPU, so that's more cost-effective. There's a lot of things to think about there, but I think those are sort of the two main things that people want to think about. And there's obviously a ton with regards to like networking and inter-GPU connection because most the useful models don't run on a single GPU, they can't run on a single GPU.
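The model bandwidth utilization (MBU) figures in this answer can be reproduced with a small calculation. It is a sketch under the same simplifications as the conversation: weights are read once per token, the model is split evenly across GPUs, and inter-GPU communication is free. The function names are illustrative.

```python
H100_BANDWIDTH_BYTES_PER_S = 3.35e12  # per-GPU figure quoted in the conversation


def mbu(model_bytes: float, ms_per_token: float, num_gpus: int, bytes_per_s_per_gpu: float) -> float:
    """Achieved bandwidth (model bytes per token time) over aggregate peak bandwidth."""
    achieved = model_bytes / (ms_per_token / 1000)
    return achieved / (num_gpus * bytes_per_s_per_gpu)


def ms_per_token_at(target_mbu: float, model_bytes: float, num_gpus: int, bytes_per_s_per_gpu: float) -> float:
    """Per-token latency if a given MBU were reached at batch size one."""
    effective = target_mbu * num_gpus * bytes_per_s_per_gpu
    return model_bytes / effective * 1000


LLAMA_70B_INT8_BYTES = 70e9

two_gpu_mbu = mbu(LLAMA_70B_INT8_BYTES, 30, 2, H100_BANDWIDTH_BYTES_PER_S)
latency_60 = ms_per_token_at(0.60, LLAMA_70B_INT8_BYTES, 2, H100_BANDWIDTH_BYTES_PER_S)
latency_80 = ms_per_token_at(0.80, LLAMA_70B_INT8_BYTES, 2, H100_BANDWIDTH_BYTES_PER_S)

print(f"MBU at 30 ms/token on two H100s: {two_gpu_mbu:.0%}")
print(f"latency at 60% MBU: {latency_60:.1f} ms/token")
print(f"latency at 80% MBU: {latency_80:.1f} ms/token")
```

With unrounded inputs the 30 ms case comes out in the low-to-mid 30s of percent, close to the "like 30%" quoted, and reaching 60 to 80 percent would roughly halve the per-token latency on the same two GPUs.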

Alesio Partner 21:39

Is your TPU Mellanox?

Dylan Patel 21:41

So the TPUs, so the Google TPU is like super interesting because Google's been working with Broadcom, who's the number one networking company in the world. So Mellanox was nowhere close to number one. They had a niche that they were very good at, which was the network card, the card that you actually put in the server. But they didn't do much, they weren't doing successfully in the switches, which is, you know, you connect all the network cards to switches and then the switches to all the servers. So Mellanox was not that great. I mean, it was good, they were doing good, and Nvidia bought them, you know, 2019 I believe or 2018. But Broadcom has been number one in networking for a decade plus. And Google partnered with them on making the TPU. And they, you know, TPU, all the way through to TPU v5, which is the one they're in production of now, and v6, and all these, these are all going to be co-designed with Broadcom. So Google does a lot of the design, especially on the ML hardware side, on how you pass stuff around internally on the chip. But Broadcom does a lot on the network side. They specifically, you know, how to get really high connection speed between two chips. They've done a ton there and obviously Google works a ton there too. But this is sort of like Google's less discussed partnership that's truly critical for them. And why Google's tried to get away from them many times. Their latest target to get away from Broadcom is 2027. But like, you know, that's four years from now, chip design cycle, four years. So they already tried to get away in 2025 and that failed. But yeah, they had this equivalent of very high-speed networking. It works very differently than the way GPU networking does. And that's important for people who code on a lower level.

Alesio Partner 23:19

I've seen this described as like the ultimate limit on how big models they build. It's not FLOPs, it's not memory, it's networking. Like it has the lowest scaling law, it's like the lowest Moore's law. So the all of them, and I don't know what to do about that because no one else has any solutions.

Dylan Patel 23:36

Yeah, yeah. So I think what you're referring to is that network speed has increased slower, much slower than the other, than FLOPs and bandwidth. And yeah, that's a tremendous problem in the industry. But like, that's why Nvidia bought a networking company. So Broadcom is working with Google on their chip right now. But of course on Meta, Meta's internal AI chip, which they're on the second generation of working on that, and what's the main thing that Meta is doing interesting is networking stuff. Multiplying tensors is kind of, you know, anyone can, there's a lot of people who made good matrix multiply units. But getting good utilization out of those and interfacing with the memory and interfacing with other chips really efficiently makes designing these chips very hard. And most of the startups obviously have not done that really well.

Alesio Partner 24:21

Yeah, I mean, I think the startups point is the most interesting. You mentioned companies that are GPU Poor, they raised a lot of money, and there's a lot of startups out there that are GPU Poor and did not raise a lot of money. What should they do? How do you see the space dividing? Are we just supposed to wait for the big labs to do a lot of this work with a lot of the GPUs? Like what's the GPU Poor's beautiful version of the article?

Dylan Patel 24:47

Like the whole point was that Google, OpenAI, who everyone would be like, oh yeah, they have more GPUs than anyone else, but they have a lot less FLOPs than Google. That was the point of the thing. But not just them, it's like, okay, it's like a relative totem pole right now. Of course, Google doesn't use GPUs as much for training, in inference they do use some, but mostly TPUs. So kind of like the whole point is that everyone is TPU Poor because we're going to continue to scale faster and faster and faster and faster. And compute will always be a bottleneck, just like data will always be a bottleneck. You can have the best data set in the world and you can always have a better one. And same with you have the biggest compute system in the world and you can, but you'll always want a better one. Like Mistral, right? They trained a freaking awesome model on relatively fewer GPUs. And now they're scaling up higher and higher and higher. There's a lot that the GPU Poor can do though. Like, hey, we all have phones, we all have laptops. There is a world for running models on device. The Replit folks are trying to do stuff like that. Their models can't be that, they can't follow scaling laws. Why? Because there's a fundamental limit to how much memory bandwidth and capacity that you can get on a laptop or a phone. You mentioned the ratio of FLOPs to bandwidth on a GPU is actually really, really good compared to like a MacBook or like a phone. Hey, to run Llama 70 billion requires 2 terabytes a second of memory bandwidth, 2.1 at human reading speed. Yeah, but my phone has like 50 gigabytes a second. Your laptop, even if you have an M1 Ultra, has what, like, I don't remember, like a couple hundred gigabytes a second of memory bandwidth. You can't run Llama 70B just by doing the classical thing. So there's stuff like speculative decoding and then Together did something really cool and they put it in the open source, of course, Medusa. 
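For readers unfamiliar with speculative decoding, here is a toy sketch of the greedy variant: a cheap draft model proposes several tokens, and the expensive target model checks them, keeping the matching prefix. Everything here is an assumption for illustration. The two "models" are hypothetical arithmetic stand-ins rather than real networks, the verification loop calls the target once per position where a real system would score all proposed positions in a single batched forward pass, and sampling-based acceptance is not shown. Medusa, as generally described, uses extra decoding heads rather than a separate draft model and is also not shown.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy "model": context -> next token id


def target_model(context: List[int]) -> int:
    """Hypothetical stand-in for the large, slow model."""
    return (context[-1] * 3 + len(context)) % 50


def draft_model(context: List[int]) -> int:
    """Hypothetical small model: agrees with the target except at every fifth position."""
    guess = target_model(context)
    return (guess + 1) % 50 if len(context) % 5 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], num_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(num_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       num_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns the sequence and the number of verification passes."""
    seq = list(prompt)
    goal = len(prompt) + num_new
    passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target verifies the proposal (counted as one pass).
        passes += 1
        accepted: List[int] = []
        for token in proposal:
            expected = target(seq + accepted)
            if token != expected:
                accepted.append(expected)  # first mismatch: keep the target's token and stop
                break
            accepted.append(token)
        else:
            # Every draft token matched, so the same pass also yields one extra token.
            if len(seq) + len(accepted) < goal:
                accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq, passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, num_new=20)
spec_out, passes = speculative_decode(target_model, draft_model, prompt, num_new=20)
print(spec_out == baseline, passes)
```

In this toy setup the output is identical to plain greedy decoding from the target, while the number of verification passes is smaller than the number of tokens generated. That is the source of the batch-one speedup: each expensive read of the weights produces more than one token.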
Things like that work on batch size one, they don't work at high batch sizes. And so there's the world of cloud inference. And so in the cloud, it's all about what memory bandwidth and MFU I can achieve. Whereas on the edge, I don't think Google's going to deploy a model that I can run on my laptop to help me with code or help me with XYZ. They're always going to want to run it on the cloud for control, or maybe they let it run on the device, but it's like only their Pixel phone, it's kind of like a walled garden thing. There's obviously a lot of reasons to do other things for security, for openness, to not be at the whims of a trillion-dollar-plus company who wants my data. There's a lot of stuff to be done there. And I think folks like Replit are, I love it. That's exactly the stuff. They open-sourced their model. Things like what Together did, as I just mentioned, developing Medusa, that didn't take much GPU at all. They're very GPU Poor. While they do have quite a few GPUs, they made a big announcement about having 4,000 H100s, that's still relatively poor when we're talking about hundreds of thousands of like the big labs, like OpenAI and so on and so forth, or millions of TPUs like Google. But you know, still they were able to develop Medusa with probably just one server, one server with H100s in it. And the usefulness of something like Medusa, something like speculative decoding, is on device. And that's what a lot of people can focus on. People can focus on all sorts of things like that. I don't know, right? New model architecture, right? Are we only going to use Transformers? I'm pretty pilled to think like Transformers are it, right? Just because like my hardware brain can only know something that loves hardware. But like, so like, you know, people should continue to try and innovate on that. Asynchronous training, that kind of stuff is super, super interesting. Like Tim Dettmers, yeah, distributed, like not in one data center. 
I think it's Tim Dettmers, he had like the Swarm. Yeah, there you go, sorry. Swarm. Yes, they had the Swarm paper and Pedal. And well, I think Pedal is whatever. That research is super cool. It's like SETI at home, right? It's not been, yeah, I mean, yeah, but like I like research, that kind of stuff. Like, hey, like the universities will never have much compute, but like, hey, you know, to prepare to do things, to all these sorts of stuff, like they should try to build super large models. Like you look at what Tsinghua University is doing in China, like actually they open-sourced their model too. I think the largest parameter count at least open-source models. I don't remember the name. Yeah, this from Tsinghua University though, right? Yeah, I think it was like a 1.7 trillion. Yeah, I mean, of course they didn't train it on much data, but it's like, you know, it's still like you can do some cool stuff like that. I don't know, I think there's a lot that people can focus on. Because you know, one, scaling out a service to many, many users, distribution is very important. Figuring out distribution, figuring out useful fine-tunes. Like, you know, doing LLMs that OpenAI will never make, you know, sorry for the crassness, a porn Dolly 3. But open source is doing crazy stuff with Stable Diffusion. And there is a legitimate market. I think there's a couple companies who make tens of millions of dollars of revenue from LLMs or diffusion models for porn, or that kind of stuff. There's a lot of stuff that people can work on that will be successful businesses or doesn't even have to be a business but could advance humanity tremendously that doesn't require crazy scale.
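
The memory-bandwidth argument earlier in this answer (roughly 2 TB/s to run Llama 70B at reading speed, versus about 50 GB/s on a phone) can be sketched with a back-of-the-envelope calculation. This is a minimal sketch, not a benchmark: it assumes batch size one, 16-bit weights (2 bytes per parameter), that every weight is read once per generated token, and a reading speed of 15 tokens per second. The reading speed and the function names are illustrative assumptions, not figures from the conversation, and KV-cache traffic is ignored.

```python
def model_size_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def required_bandwidth_gb_s(params_billion: float, bytes_per_param: float,
                            tokens_per_second: float) -> float:
    """Bandwidth needed if all weights are streamed once per generated token."""
    return model_size_gb(params_billion, bytes_per_param) * tokens_per_second


def max_tokens_per_second(params_billion: float, bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on batch-size-one decode speed for a given memory bandwidth."""
    return bandwidth_gb_s / model_size_gb(params_billion, bytes_per_param)


if __name__ == "__main__":
    # Assumed: 70B parameters, 2 bytes each, 15 tokens/s reading speed.
    need = required_bandwidth_gb_s(70, 2, 15)
    print(f"Needed for reading speed: {need / 1000:.1f} TB/s")

    # Assumed device bandwidth of 50 GB/s, the phone figure quoted above.
    phone = max_tokens_per_second(70, 2, 50)
    print(f"Phone at 50 GB/s: {phone:.2f} tokens/s at best")
```

Under these assumptions the phone is bandwidth-bound to well under one token per second, which is why techniques that produce more than one token per weight pass, such as speculative decoding, matter on device.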

Alesio Partner30:07

How do you think about the depreciation of like the hardware versus the models? Like we covered open models for a while. If I think about the episodes we had like in March with like MPT-7B, nobody talks about that exactly. It's like the depreciation is like three months. No one should be talking about Llama 13 billion anyway, right? Because Mistral just showed them up. So I'm really curious, it's like, you know, if you buy an H100, sure the next series is going to be better, but like at least the hardware is good. If you're spending a lot of money on like training a smaller model, like it might be super obsolete in like three months and you got now all this compute coming online. I'm just curious if like companies should actually spend the time to like fine-tune them and like work on them when the next generation is going to be out of the box so much better.

Dylan Patel30:55

Unless you're fine-tuning for on-device use, I think fine-tuning current existing models, especially the smaller ones, is a useless waste of time. Because, one, the cost of inference is actually much cheaper than you think once you achieve good MBU and you batch at a decent size, which any successful business in the cloud is going to achieve. And then two, fine-tuning, like people are like, oh, you know, this 7 billion parameter model if you fine-tune on a data set is almost as good as 3.5. It's like, yeah, but why don't you just fine-tune 3.5? Why don't you fine-tune 3.5 and look at your performance? And like, there's nothing open source that is anywhere close to 3.5 yet. There will be, there will be. And people also don't quite, Falcon was supposed to be Falcon 140B. It's less parameters than 3.5 and also, I don't know about the exact token count, I believe it's less. And 3.5, it's not 175 billion. I'm saying that because we know GPT-4, but we don't know 3.5. It's definitely smaller. No, it's bigger than 175, but I think it's sparse. I'm pretty sure. You can do some like gauging around the size of it by looking at their inference latency. Which is also, you look at upper bounds. Yeah, you can look at like, well, what's the theoretical bandwidth if they're running it on this hardware and doing tensor parallel in this way, so they have this much memory bandwidth and maybe they get, maybe they're awesome and they get 90% memory bandwidth utilization, I don't know, that's an upper bound. And you can see the latency that 3.5 gives you, like especially at off-peak hours, or if you do fine-tuning and you have a private enclave, Azure will quote you latency. So you can figure out how many parameters per forward pass, which I think is somewhere in the like 40 to 50 billion range, but I could be very wrong. That's just like my guess based on that sort of stuff. You know, 50ish. But then the 16 experts are, I have no clue. I have no clue.
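
The latency-based sizing idea described above can be sketched as follows. This is only an illustration under stated assumptions: decoding is taken to be memory-bandwidth bound, each active parameter is read once per token, and the hardware configuration, latency, and precision below are hypothetical inputs, not measurements of any real service. The function name is made up for this sketch.

```python
def active_params_upper_bound(num_gpus: int,
                              bandwidth_per_gpu_gb_s: float,
                              bandwidth_utilization: float,
                              seconds_per_token: float,
                              bytes_per_param: float) -> float:
    """Upper bound on parameters read per forward pass.

    Bytes that can be streamed during one token's latency, divided by the
    bytes per parameter. Assumes tensor parallelism lets all GPUs stream
    their weight shards concurrently.
    """
    bytes_per_second = num_gpus * bandwidth_per_gpu_gb_s * 1e9 * bandwidth_utilization
    bytes_per_token = bytes_per_second * seconds_per_token
    return bytes_per_token / bytes_per_param


if __name__ == "__main__":
    # Hypothetical inputs: 8 GPUs at 2,000 GB/s each, 90% bandwidth
    # utilization (the optimistic bound mentioned above), 7 ms per token
    # observed off-peak, 16-bit weights.
    bound = active_params_upper_bound(8, 2000, 0.9, 0.007, 2)
    print(f"At most about {bound / 1e9:.1f}B parameters per forward pass")
```

Because real systems rarely hit the assumed utilization, the result is a ceiling, not an estimate. For a sparse model it bounds only the parameters touched per token, which is why the number of experts stays unknown.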
There's no way to figure that out just yet. Yeah, yeah. There's actually, there's someone I've talked to at one of the labs who thinks they can figure out how many experts are in a model by querying it a crap load. But that's only if you have access to the logits, like the percentage chance, yeah, before you did the softmax. I don't know. But yeah, there's like a ton of competitive analysis you could try to do. But anyways, I think open source will have models of that quality. I think like, you know, I mean, I assume Mosaic or like MLOps will open source and Mistral will be able to open source models of that quality. Now furthermore, right, like if you just look at the amount of compute, obviously data is very important, and the ability, all these tricks and dials that you turn to be able to get good MFU and good MBU, right, depending on inference or training, there's a ton of tricks. But at the end of the day, there's like 10 companies that have enough compute in one single data center to be able to beat GPT-4. Straight up, if not today, within the next 6 months. 4,000 H100s is, I think you need about 7,000 maybe, and with some algorithmic improvements that have happened since GPT-4 and some data quality improvements, you could probably get to even less than 7,000 H100s running for three months to beat GPT-4. Of course, that's going to take a really awesome team, but there's quite a few companies that are going to have that many. Open source will match GPT-4, but then it's like, what about GPT-4 Vision, or what about five and six and all these kind of stuff, and like tool use and DALL-E. That's the other thing, there's a lot of stuff on tool use that open source could also do that GPT-4 can do. I think there are some folks that are doing that kind of stuff, agents and all that kind of stuff. I don't know, that's way over my head.
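
The "about 7,000 H100s running for three months" figure can be turned into a rough compute budget. This is a sketch under assumptions that are not from the conversation: a peak of about 989 TFLOP/s per H100 for dense 16-bit math, a model FLOPs utilization (MFU) of 40%, and a 90-day run. All three are inputs you should replace with your own numbers.

```python
SECONDS_PER_DAY = 86_400


def training_flops_budget(num_gpus: int,
                          peak_flops_per_gpu: float,
                          mfu: float,
                          days: float) -> float:
    """Total useful FLOPs delivered by a cluster over a training run."""
    return num_gpus * peak_flops_per_gpu * mfu * days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed values: 989e12 FLOP/s peak per GPU, 40% MFU, 90 days.
    budget = training_flops_budget(7_000, 989e12, 0.40, 90)
    print(f"7,000 GPUs for 90 days: {budget:.2e} FLOPs")

    # The same run on the 4,000-GPU cluster mentioned earlier.
    smaller = training_flops_budget(4_000, 989e12, 0.40, 90)
    print(f"4,000 GPUs for 90 days: {smaller:.2e} FLOPs")
```

The budget scales linearly with GPU count and MFU, so eking out a few points of MFU is equivalent to buying that fraction more hardware.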

Alesio Partner34:51

The agent stuff, yeah, it's over everyone's head. One more question on just like the sort of Gemini GPU rich. We've had a very wide-ranging conversation already so it's hard to categorize, but I tried to look for the 'Meena Eats the World' document.

Dylan Patel35:05

It's, we find your article? No, so Noam Shazeer wrote it. Yeah, I read it. So Noam Shazeer is like, I don't know, I think he's like the GOAT. The GOAT, yeah, I think he's the GOAT. Obviously, in one year he published... exactly, it's like all this stuff that we were talking about today, he knew. And obviously there's other people that are awesome that were helping and all that sort of stuff, just to be clear. But there was a couple other papers. So like, 'Meena Eats the World' was basically an internal document he wrote around the time when Google had Meena, right? And Meena was one of their LLMs that is a footnote in the history, most people will not think about Meena's relevance. But he wrote it and he was basically predicting everything that's happening now, which is that large language models are going to eat the world in terms of compute. He's like, the total amount of deployed FLOPs within Google data centers will be dominated by large language models. And back then a lot of people thought he was silly for that internally at Google. But now if you look at it, it's like, oh wait, millions of TPUs, you're right, you're right, you're right. We're totally getting dominated by both Gemini training and inference, right? Like whatever 2, 3, 4 plus 1, 2, 3 for Gemini and all these other things. Total FLOPs being dominated by LLMs was completely right.

Alesio Partner36:31

So my question was, he had a bunch of predictions in there. Do you think there are any underrated predictions that may not have yet come true by your kind of...

Dylan Patel36:40

I think, obviously, I read the document but I read it on someone else's device, they didn't send it to me so I can't really send it, sorry. And they were okay with me talking about the document and calling Noam a GOAT because they also think Noam is a GOAT. But I think now most everybody is like scaling law pilled and LLM pilled and all this sort of stuff and it's a very clear line of sight. Was he wrong with anything? I mean, Meena sucked, right? I mean it was great for the T5 parameter bot, I remember off top of my head, but if you look at the total FLOPs, parameters times tokens times 6, it's like a tiny tiny fraction of GPT-2 which came out just a few months later. So he was right about everything, but maybe he knew about GPT, I have no clue. OpenAI clearly was way ahead of Google on LLM scaling even then. It's just people didn't really recognize it back in GPT-2 days maybe, or the number of people that recognized it was maybe hundreds, tens, I don't know.
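
The "parameters times tokens" comparison above is usually written with a factor of 6, giving training FLOPs of roughly 6 × parameters × tokens for a dense Transformer. The sketch below applies that approximation to two hypothetical models. The sizes and token counts are placeholders chosen for illustration, not the actual figures for Meena or any GPT model.

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate dense-Transformer training compute: 6 * N * D."""
    return 6.0 * num_params * num_tokens


def compute_ratio(small: tuple[float, float], large: tuple[float, float]) -> float:
    """Fraction of the larger run's compute used by the smaller run.

    Each argument is a (parameters, tokens) pair.
    """
    return training_flops(*small) / training_flops(*large)


if __name__ == "__main__":
    # Hypothetical models: a 1B-parameter model on 20B tokens versus
    # a 100B-parameter model on 300B tokens.
    small_model = (1e9, 20e9)
    large_model = (100e9, 300e9)
    print(f"Small run: {training_flops(*small_model):.2e} FLOPs")
    print(f"Large run: {training_flops(*large_model):.2e} FLOPs")
    print(f"Ratio: {compute_ratio(small_model, large_model):.5f}")
```

Since the factor of 6 cancels in the ratio, comparing two models only needs the product of parameters and tokens for each.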

Alesio Partner37:46

You mentioned Transformer alternatives. The other thing is GPU alternatives. So the TPU is obviously one, but there's Cerebras, there's Graphcore, there's MatX, Lemurian Labs, there's a lot of them. Thoughts on what's real, who's alive, who's kind of like a zombie company walking?

Dylan Patel38:03

So if you go back and like, I mentioned Transformers were the architecture that won out, but I think the number of people who recognized that in 2020 was, as you mentioned, probably hundreds. For natural language processing maybe in 2019 at least. You think about a chip design cycle, it's years, so it's kind of hard to bet your architecture on the type of model that developed. But what's interesting about all the first wave AI hardware startups is you kind of have this ratio of memory capacity, compute, and memory bandwidth. And so everyone kind of made the same bet, which is I have a lot of memory on my chip, which is really dumb because the models grew way past that. Even Cerebras, right? I mean, I'm talking about like Graphcore, it's called SRAM, which is the memory on chip, much lower density but much higher speeds, versus DRAM, which is the memory off chip. So everyone was betting on pretty much more memory on chip and less memory off chip. And to be clear, for image networks and models that are small enough to just fit on your chip, that works, that is the superior architecture. But scale, scale, scale, scale. So Nvidia was the only company that bet on the other side, on more memory bandwidth and more external memory capacity, and also the right ratio of memory bandwidth versus capacity. Because there was a lot of people, like Graphcore specifically, they had a ton of memory on chip and then they had a lot more memory off chip, but that memory off chip was a much lower bandwidth. Same applies to SambaNova, same applies to Cerebras. They had no memory off chip but they thought, hey, I'm going to make a chip the size of a wafer. Those guys, they're silly. Hundreds of megabytes, we have 40 gigabytes. And then oh crap, models are way bigger than 40 gigabytes. Everyone bet on sort of the left side of this curve. The interesting thing is that there's new age startups like Lemurian, like MatX, I won't get into what they're doing but they're making much more rational bets.
I don't know, it's hard to say with a startup that it's going to work out, right? Obviously there's tons of risk embedded. But those folks, like Jay Dawani, and Mike and Reiner, they understand models, they understand how they work. And if Transformers continue to reign supreme, whatever innovations those folks are doing on hardware are going to need to be fitted for that, or you have to predict what the model architecture is going to look like in a few years and hit that spot correctly. So that's kind of a background on those.

But now you look today, it's like, hey, Intel bought Nervana, which was Naveen Rao's company before MosaicML. He started MosaicML and sold it to Databricks recently, obviously leading LLMs and stuff there. But Intel bought that company from him and then shut it down and bought this other AI company. And now that company has got new chips, they're going to release a better chip than the H100 within the next quarter or so. AMD, they have a GPU, MI300, that will be better than the H100 in a quarter or so. Now it says nothing about how hard it is to program it, but at least hardware-wise on paper it's better. Why? Because it's a year and a half later than the H100, or a year later than H100 of course, and a little bit more time and all that sort of stuff. But they're at least making similar bets on memory bandwidth versus FLOPs versus capacity, kind of following Nvidia's lead. The questions are like, what is the correct bet for three years from now? How do you engineer that? And will those alternatives make sense? The other thing is if you look at total manufacturing capacity for this sort of bet, you need high bandwidth memory, you need HBM, and you need large 5nm dies, soon 3nm, whatever. You need both of those components and you need the whole supply chain to go through that. We've written a lot about it, but to simplify it, Nvidia has a little bit more than half and Google has like 30% through Broadcom. So the total capacity for everyone else is much lower and they're all sharing it. Amazon's Trainium and Inferentia, Microsoft's in-house chip, and you go down the list, Meta's in-house chip and also AMD, all of these companies are sharing a much smaller slice. Their chips are not as good, or if they are, even though I mentioned Intel and AMD's chips are better, that's only because they're throwing more money at the problem kind of. Nvidia charges crazy prices, I think everyone knows that their gross margins are insane.
AMD and Intel and others will charge more reasonable margins and so they're able to give you more HBM and etc for a similar price. And so that ends up letting them beat Nvidia, if you will. But their manufacturing costs are twice that in some cases. In the case of AMD, their manufacturing cost for MI300 are more than twice that of H100 and it only beats H100 by a little bit from performance stuff I've seen. So it's tough for anyone to bet the farm on an alternative hardware supplier. In my opinion, you should either just be like, a lot of like ex-Google startups are just using TPUs, and hey, that's Google Cloud. After moving the TPU team into the cloud team, infrastructure team, they're much more aggressive on external selling. You see companies like even Apple using TPUs for training LLMs as well as GPUs. But either bet heavily on TPUs because that's where the capacity is, bet heavily on GPUs of course and stop worrying about it and leverage all this amazing open source code that is optimized for Nvidia. Or okay, if you do bet on AMD or Intel or any of these startups, then you better make damn sure you're really good at low-level programming and damn sure you also have a compelling business case and that the hardware supplier is giving you such a good deal that it's worth it. And also, by the way, Nvidia is releasing a new chip, they're going to announce it in March and they're going to release it and ship it Q2, Q3 next year anyways. And that chip will probably be three or four times as good. And maybe it'll cost twice as much or 50% more. I hear it's 3x the performance on an LLM and 50% more expensive is what I hear. So it's like, okay, yeah, nothing is going to compete with that even if it is 50% more expensive. And then you're like, okay, well that kicks the can down further. And then Nvidia's moving to a yearly release cycle so it's very hard for anyone to catch up to Nvidia really. Are you investing all this in other hardware? 
Like if you're Microsoft, obviously who cares if I spend $500 million a year on my internal chip, who cares if I spend $500 million a year on AMD chips. If it lets me knock the price of Nvidia GPUs down a little bit, puts the fear of God within Jensen Huang, then it is what it is. And likewise with Amazon and so on and so forth. Of course the hope is that their chips succeed or that they can actually have an alternative that is much cheaper than Nvidia. But to throw a couple hundred million dollars at a company's product is completely reasonable. And in the case of AMD, I think it'll be more than a couple hundred million dollars. But yeah, I think alternative hardware really does hit like a peak hype cycle kind of end of this year, early next year, because all Nvidia has is H100 and then H200, which is just a better H100 with more bandwidth and higher memory capacity. But that doesn't beat what AMD are doing, doesn't beat what even Intel's Gaudi 3 does. But then very quickly after, Nvidia will crush them, and then those other companies are going to take two years to get to their next generation. It's just a really tough place. And no one besides, the main thing about hardware is like, hey, that bet I talked about earlier is very oversimplified, just memory bandwidth, FLOPs, and memory capacity. There's a whole lot more bets. There's a hundred different bets that you have to make and guess correctly to get good hardware, not even to have better hardware than Nvidia, just to get close to them. And that takes understanding models really, really well. That takes understanding so many different aspects, whether it's power delivery or cooling or design layout, all this sort of stuff. And it's like, how many companies can do everything here? I'd argue Google probably understands models better than Nvidia, I don't think people would disagree. Nvidia understands hardware better than Google. And so you end up with Google's hardware being competitive.
But does Amazon understand models better than Nvidia? I don't think so. And does Amazon understand hardware better than Nvidia? No.
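
The price-performance argument in this answer (a next-generation chip rumored at 3x the LLM performance for 50% more money) reduces to a one-line ratio. The sketch below just makes that arithmetic explicit. The multiples are the hearsay figures quoted above, and the function name is hypothetical.

```python
def relative_perf_per_dollar(perf_multiple: float, price_multiple: float) -> float:
    """Performance per dollar of a new chip relative to a baseline chip.

    Both arguments are multiples of the baseline (baseline = 1.0).
    """
    if price_multiple <= 0:
        raise ValueError("price_multiple must be positive")
    return perf_multiple / price_multiple


if __name__ == "__main__":
    # Quoted rumor: 3x performance at 1.5x the price.
    print(relative_perf_per_dollar(3.0, 1.5))

    # The other scenario mentioned: 4x performance at 2x the price.
    print(relative_perf_per_dollar(4.0, 2.0))
```

A competitor that is only slightly faster than the current baseline would need to be priced at roughly half the baseline to match either scenario on performance per dollar, before counting software porting costs.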

Alesio Partner46:36

Like Anthropic's investment or the investment in Anthropic?

Dylan Patel46:41

I'm also of the opinion that the labs are useful partners, they're convenient partners, but they are not going to buddy up as close as people think. I expect in the next few years that the OpenAI-Microsoft partnership probably falls apart too. That'll be huge. I mean, they'll still continue to use GPUs and stuff there, but I think the level of closeness you see today is probably the closest they get. At some point they become competitive if OpenAI becomes its own cloud. I think OpenAI wants to not just become a trillion dollar company, 10 trillion dollar, I mean not a company, but the level of value that they deliver to the world. If you talk to anyone there, they truly believe it'll be tens of trillions if not hundreds of trillions of dollars. In which case, obviously, weird corporate structure aside, this is the same playing field as companies like Microsoft and Google. Google wants to also deliver hundreds of trillions of dollars of value and it's like obviously you're competing. And Microsoft wants to do the same and you're going to compete. And like, yeah, I think in general, these lab partnerships are going to be nice but they're probably incentivized to, hey Nvidia, can you design the hardware in this way? Nvidia's like, no, it doesn't work like that, it works like this. And they're like, oh, so this is the best compromise. I think OpenAI would be stupid not to do that with Nvidia but also with AMD. But also, how much time do I actually have? Should I do that? Should I spend all my super smart people's time, and there's a limited number of people of this caliber, doing that? Or should they focus on like, hey, can I get asynchronous training to work or figure out this next multimodal thing? It's probably better, hey, can I eke out 5% more MFU and work on designing the next supercomputer? These kind of things, how much more valuable is that? So it's tough to see even OpenAI helping Microsoft enough to get their knowledge of models so good.
Microsoft's going to announce their chip soon. It's worse performance than the H100 but the cost effectiveness of it is better for Microsoft internally just because they don't have to pay the Nvidia tax. But again, by the time they ramp it and all these sorts of things, and oh hey, that only works on a certain size of models, once you exceed that then it's actually better for Nvidia. So it's really tough for OpenAI to be like, yeah, we want to bet on Microsoft. And hey, we have, I don't know what's their number of people they have now, like 700 people, of which how many do low-level code? Do I want separate code bases for this and this and this and this? It's just a big headache. I think it'd be very difficult to see anyone truly pivoting to anything besides a GPU and a TPU, especially if you need that scale. And that scale that the labs require is absurd. Google says millions of TPUs, OpenAI will say millions of GPUs. I truly do believe they think that, that number of next generation GPUs. The numbers that we're going to get to are like, I bet you, I mean I don't know, but I bet Sam Altman would say, yeah, we're going to build a $100 billion supercomputer in three years or two years. And after GPT-5 releases, if he goes to the market and says, hey, I want to raise $100 billion at a $500 billion valuation, I'm sure the market would give it to them. And then they build that supercomputer. I think that's truly the path we're on. And so it's hard to imagine.

Alesio Partner50:23

Yeah, I don't know. One point that you didn't touch on, and Taiwan companies are famously very chatty about the fruit company. Should we take Apple seriously on all this game or are they just in a different world altogether?

Dylan Patel50:36

I think, I know just from my view of Apple, I don't personally use Apple products, but every, I mean, my mom, I buy her a new iPhone every year, just to be clear. Yeah, no, Mom, you know, new Apple Watch every couple years, of course. So I respect their products but I don't think Apple will ever release a model that you can get to say really bad things or racist things or whatever. I don't think they can ever do that. But frankly, I'm sure with OpenAI releasing 3.5 and 4, people have done jailbreaks, kind of old terminology from the iPhone, to jailbreak the model and get it to do bad things. Teach me how to make anthrax or say these hateful things, rank the races of the world. I mean, I've seen it on Twitter, I've seen all these things. It's like, my grandma's dying, please help, she needs the cure, she needs to know how to make anthrax to live. But there's all these jailbreaks, but also as soon as they happen, it gets fed back into OpenAI's platform and it gets fixed. Being public and open is accelerating their ability to make a better and better model, the RLHF and all this kind of stuff. I don't see how Apple can do that structurally as a company. The fruit company ships perfect products or else. That is their mentality. They kill the car before you even see it. And that's why everyone loves iPhones. I have a Samsung, I can't tell you how many, I buy a new Samsung every other year. Maybe I'll buy a Pixel this year, the new one looks nice. But it's like, how many bugs are on these things? How many times do I just have to restart my phone? It's not often but it's like, hey, if once a week an app just crashes, it's like, no, what the heck. It's like Bing was only ever a few percent behind Google truly for the last decade, a few percent. But that few percent is enough to make people be like, Bing sucks. So I think that sort of applies to Apple. Are you willing to deploy a model of 3.5 capabilities that can say and do really bad things potentially? What about 4?
And the possibility of it doing worse things is even higher. Well, what about 5? You can't get on that iteration cycle. To build 4, you need to be able to build a 3.5; to build 3.1, you need to be able to build 3, of quality. And Meta is clearly doing that. And all these open source firms and all these folks are doing exactly that, building a bigger and better model every few months. And I don't know how Apple gets on that train. But at the same time, there's no company that has more powerful distribution, maybe. Maybe Google does, maybe Microsoft does, you can argue that. But obviously Apple will be deploying things and Siri will always suck but it'll be embarrassing. Hey, if I have a Siri which is GPT-3.5 level in two years, I think a lot of people still use Siri. People still use Siri to this day. So the same thing's going to happen. So I don't know, Tim Cook is not in the AI safety discussions, he doesn't want to be, he's just on the product side.

Alesio Partner53:55

And I know you had some safety hot takes and I think it's an interesting dynamic because Anthropic came out of OpenAI and then you can kind of make the case that by having more labs, if you're really worried about safety, you're accelerating the unsafe because you have more FLOPs and more compute. What's your thought on this whole space?

Dylan Patel54:19

So obviously I think safety is probably important, but like I think it is important. I mean, I've read sci-fi novels, right? It's clearly right. I can easily see how an LLM could, I wrote about this the other day, it was like, hey, if you just look at the demographics across the world, there's like 30 to 50 million more men than there are women and they will never get married. Obviously on population level dynamics, you know, there's LGBTQ, all that stuff happens and it's great. But like, there's 30 to 50 million more men across the world, they'll always be single. Why can't an LLM radicalize them by being their AI girlfriend and then all of a sudden inciting, and also, I don't know, there's all sorts of stuff like that can happen of course. Or like, teach some person to create or manufacture what they thought was a good thing and it ends up wiping out humanity. All these sorts of stuff can happen. But at the end of the day, I think security through obscurity doesn't work. So that's the approach that the labs take. I truly do believe it. They're very open internally at least, Anthropic and OpenAI are. I know Google's a lot more gated with Gemini information. But of these three, it's like security through obscurity and it's like this doesn't ever work. And two, innovating in the open is going to have more people figuring out what doesn't work, also figuring out how to maybe try and align things better.

Alesio Partner55:44

Maybe the SemiAnalysis analyst point of view is, is it feasible to build this capacity up in the US?

Dylan Patel55:51

No, no. People don't understand how fragmented the semiconductor supply chain really is and how many monopolies there are. The US could absolutely shut down the Chinese semiconductor supply chain, they won't. And China could absolutely shut down the US one actually, by the way. But more relevantly, Austria has two companies, the country of Austria, and Europe has two companies that have super high market share and very specific technologies that are required for every single chip, period. There is no chip that is less than 7nm that doesn't get touched by this one Austrian company's tool. And there is no alternative. And there's another Austrian company, likewise, everything 2nm and beyond will be touched by their tool. But both of these companies are doing well less than a billion dollars in revenue. So it's like, you think it's so inconsequential. There's like three or four Japanese chemical companies, same idea. The supply chain is so fragmented. People only ever talk about where the fabs, where they actually get produced. But it's like, TSMC in Arizona, TSMC is building a fab in Arizona. It's quite a bit smaller than the fabs in Taiwan. But even ignoring that, those fabs still have to ship everything to Taiwan back anyways. And also they have to get what's called a mask from Taiwan and get sent to Arizona. And by the way, there's these Japanese companies that make these chemicals that need to ship to Shin-Etsu. And it's like, and hey, it needs this tool from Austria no matter what. It's like, oh wow, wait, actually the entire supply chain is just way too fragmented. You can't re-engineer and rebuild it on a snap. It's just complex to do that. Semiconductors are more complex than any other thing that humans do, without a doubt. There's more people working in that supply chain with XYZ backgrounds and more money invested every year in R&D plus CapEx. It's just by far the most complex supply chain that humanity has. 
And to think that we could rebuild it in a few years is absurd.

Alesio Partner57:41

Yeah, in an alternative universe the US kept Morris Chang and people, right? Like it was just one guy that...

Dylan Patel57:46

Yeah, an alternative universe Texas Instruments communicated to Morris Chang that he would become CEO and so he never goes to Taiwan, you know, blah blah blah. But I think the world would probably be further behind in terms of technology development if that didn't happen. Technology proliferation is how you accelerate the pace of innovation. So the dissemination to, oh well, hey, it's not just a bunch of people in Oregon at Intel that are leading everything, or a bunch of people in Samsung Korea, or Hsinchu, Taiwan. It's actually all three of those plus all these tool companies across the country and the Netherlands and Japan and the US. It's millions of people innovating on a disseminated technology that's led us to get here. I don't even think, if Morris Chang didn't go to Taiwan, would we even be at 5nm? Would we be at 7nm? Probably not. There's innovations that happened because of that.

Alesio Partner58:42

Let's get a quick lightning round done, SemiAnalysis branded one. So the first one is, what are foundational readings that people that are listening today should read to get up to speed? Our audience is a lot of software engineers.

Dylan Patel58:56

Yeah, so I think the easiest one is the PyTorch 2.0 and Triton one that I did. There's the advanced packaging series. There's the Google Infrastructure Supremacy piece, I think that one's really critical because it explains Google's infrastructure quite a bit, from networking through chips through all that sort of history of the TPU a little bit and all this sort of stuff. There's the AMD MI300 piece, the one that we did on that. Chip War, right? Chris Miller, who doesn't recommend that book, right? It's a really good book. I would say Gordon Moore's book is freaking awesome because you got to think about, LLM scaling laws are like Moore's Law on crack, kind of in a different sense. If you think about it, all of human productivity gains since the 70s are probably just off of the base of semiconductors and technology. Of course, people across the world are getting access to oil and gas and all this sort of stuff. But at least in the western world since the 70s, everything has just been mostly innovated because of technology. We're able to build better cars because semiconductors enabled us to do that. We're able to build better software because we're able to connect everyone because semiconductors enabled that. So it's like, that is why it's the most important industry in the world. But seeing the frame of mind of what Gordon Moore has written, he's got a couple papers, books, etc. Only the Paranoid Survive. I think that philosophy and thought process really translates to the now modern times, except maybe humanity has been an exponential S-curve and this is another exponential S-curve on top of that. So I think that's probably good readings to do.

Alesio Partner1:00:39

Has there been an equivalent pivot? So Gordon, like that classic tale was more of like the pivot from memory to logic. And then has there been an equivalent pivot in the history of that magnitude?

Dylan Patel1:00:54

I mean, some people would argue that Jensen only cared about gaming and 3D professional visualization and rendering and things like that until he started to learn about AI. And then all of a sudden he's going to universities, like, you want some GPUs? Here you go. I think there are even stories of, not so long ago, NeurIPS, when it used to have the more unfortunate name, where he would go there and just give away GPUs to people. There's stuff like that, very grassroots, pivoting the company. Now you look on gaming forums and everybody's like, oh, Nvidia doesn't even care about us, they only care about AI. And it's like, yes, you're right, they mostly only care about AI. And the gaming innovations are only because they're putting more AI into it. But also, they're doing a lot of chip design stuff with AI. I think that's a big one. I don't know if it's an equivalent pivot quite yet, because digital logic was a pretty big innovation. But likewise, it's like, what did OpenAI do? How did they pivot? They left the culture of Google Brain and DeepMind and decided to build this company that's crazy cool and does things in a very different way and is innovating in a very different way. So can you consider that a pivot even though it's not inside Google? I don't know.

Alesio Partner1:02:12

A very different path, with the DOTA games and all that, before they eventually found GPTs as the thing. So it was a full pivot: they started in 2015 and then really pivoted in 2019 to be like, all right, we're the GPT company.

Dylan Patel1:02:27

Yeah, yeah. I don't know, I'm sure there are OpenAI people yelling at me right now.

Alesio Partner1:02:35

Okay, so maybe just a general question, but I'm a fellow writer on Substack. You are obviously managing your consulting business while you're also publishing these amazing posts. What's your writing process? How do you source info? Like when you sit down and go, here's the theme for the week, do you have a pipeline going out? Just anything you can describe.

Dylan Patel1:02:55

I'm thankful for my teammates because they are actually awesome. And they're much more directed, focused on working on one thing, or not one thing, but a number of things. Each is someone who's an expert on X and Y and Z in the semiconductor supply chain, so that really helps with that side of the business. Most of the time I only write when I'm very excited, or it's like, hey, we should work on this and we should write about this. So one of the most recent posts we did explained the manufacturing process for 3D NAND flash storage, gate-all-around transistors, 3D DRAM, and all this sort of stuff, because there's a company in Japan that's going public, Kokusai Electric. It was like, okay, well, we should do a post about this and we should explain this. And so Myron did all that work, Myron Xie, most of the work, and it was awesome. Usually there's a few very long, in-depth, back-burner type things like that, which took a long time, over a month of research. And Myron knows this stuff already really well. Furthermore, stuff like that builds up a body of work for our consulting and some of the reports that we sell that aren't newsletter posts. But a lot of times the process is also just like, well, 'Gemini Eats the World' is the combination of reading that, having done a lot of work on the supply chain around the TPU ramp and CoWoS and HBM capacities and all this sort of stuff to be able to figure out how many units Google's ordering, all that sort of stuff. And then also looking at open source. All that culminated in it: I wrote that in four hours, sent it to a couple people, and they're like, no, change this, this, this. Oh, you know, add this because that's really going to piss off the open source community. I'm like, okay, sure. And then posted it. So there's no specific process, unfortunately.
The most viral posts, especially in the AI community, are just those kinds of pieces rather than the really deep, deep ones, like what was in the 'Gemini Eats the World' post. Obviously, hey, we do deep work; there's a lot more that's factual, not leaks, just factual research. Across the team, we go to 40-plus conferences a year, all the way from a photoresist conference to a photomask conference to a lithography conference, all the way up to AI conferences and everything in between, networking conferences, piecing everything together across the supply chain. So that's the true work. And yeah, I don't know, it is sometimes bad to have the infamy of people only caring about this and the GPT-4 leak or the 'Google Has No Moat' leak. But that's just stuff that comes along. It's really focused on understanding the supply chain and how it's pivoting: who's the winners, who's the losers, what technologies are inflecting, things like that, where is the best place to invest resources, stuff like that, and accelerating or capturing value, etc.

Alesio Partner1:05:42

Awesome. And to wrap, we're trying a new question. If you had a magic genie that could answer any question that would change your worldview, what question would you ask?

Dylan Patel1:05:55

That's a tough one. You operate based on a set of facts about the world right now, and then there's maybe some unknowns where you're like, man, if I really knew the answer to this one, I would do so many things differently, or I would think about things differently. Everything that we've seen so far is that large scale training has to happen in an individual data center with very high speed networking. Now, not everything needs to be all-to-all connected, but you need very high speed networking between all of your chips. I would love to know, hey, magic genie, how can we build artificial intelligence in a way that it can use multiple data centers of resources where there is significantly lower bandwidth between pools of resources? Because one of the big bottlenecks is how much power and how many chips you can get into a single data center. So like, Google and OpenAI and Anthropic are working on this. I don't know if they've solved it yet. But if they haven't solved it yet, then what is the solution? Because that will accelerate the scaling that can be done by not just a factor of 10 but orders of magnitude, because there's so many different data centers across the world. If I could effectively use 256 GPUs in this little data center here together with this big cluster there, how can you make an algorithm that can do that? I think that would be the number one thing I'd be curious to know, if, how, what, because that changes the world significantly in terms of how we continue to scale this amazing technology that people have invented over the last five years.

Alesio Partner1:07:28

Awesome. Oh, thank you so much for coming on, Dylan.

Dylan Patel1:07:30

Thank you so much for having me. Hopefully my rambling, especially on AI safety, was not poorly taken because I think it will be poorly taken.