Jonathan Ross
Chief Software Architect, Nvidia (formerly Groq CEO)

"An endless demand for compute" | Jonathan Ross, founder of Groq

🎥 Jun 11, 2026 📺 Julia Turc ⏱ 30m 👁 3914 views
Jonathan Ross is the founder of Groq, the company behind the LPU, a chip designed specifically for LLM inference. Groq entered a $20 billion strategic agreement with NVIDIA. Topics covered: • The "success disaster" at Google that led to the TPU and eventually Groq • LPU vs. GPU: Pareto curves, cost-per-token, and when each wins • Static scheduling • Mixture-of-experts models • Auto-regressive vs. diffusion models • How Groq and NVIDIA's Vera Rubin work together at inference time • Jevons Paradox: why cheaper AI will increase total compute demand • Will AI replace CUDA kernel engineers? • What skills kids shoul...

About Jonathan Ross

Jonathan Ross, Chief Software Architect at Nvidia and founder of Groq, has discussed the growing demand for compute and the integration of Groq’s LPU architecture with Nvidia’s GPUs. In a June 2026 interview, he described the combination as analogous to “18-wheelers and delivery vans,” with LPUs handling projection layers and GPUs managing attention mechanisms for LLM inference. He stated that “as long as there are unsolved problems in civilization, we will have a need for more compute” and predicted that cheaper AI would increase total demand, citing Jevons Paradox. Ross also recommended revamping education to focus on asking better questions, saying “if you can come up with the right question, AI can go answer it for you.” At the Sohn Investment Conference in May 2026, Ross argued that “there’s no way to satiate the appetite for intelligence” and distinguished between intelligence and sentience, defining the latter as “your rate of improvement in your intelligence.” At Nvidia GTC in March 2026, he presented on CUDA features, including upstreaming the Blackwell compiler into LLVM 21, green contexts for dynamic scheduling, and the use of checkpointing for elasticity. He noted that “portability across architectures is diminishing” and that porting has become “more of an economic handicap.”

Source: AI-verified profile updated from Jonathan Ross's recent appearances.

Transcript (39 segments)
✨ AI-enhanced transcript with speaker attribution
Jonathan Ross 0:00
Agentic is a little bit like the Nvidia of AI. It's about being able to break things up into parallel tasks. Why that's leading to an absolute explosion in the usage is AI using AI using AI. One of the interesting things about the Groq architecture, the LPUs, is that we actually have a kernel-free architecture. I think you're going to see many, many fewer players making the chips because the stakes are so high. And so if you don't have a chip to launch, that's a very expensive mistake. As long as there are unsolved problems in civilization, we will have a need for more compute. Right now, cancer isn't cured.
People still get old. With AI becoming as inexpensive as it is, it's going to increase the demand for AI to the point where people are going to spend more and more on it. They're going to need more compute. And so I just think we're going to have this endless demand for compute.
Interviewer 1:00
Jonathan, we were actually both Google alumni. When I was at Google, we had this running joke in my team that if we ran out of quota for the day to train our models on TPUs, we might as well just take the day off. I know you pioneered TPUs and then you left to build your own company that makes a chip. What did you see at Google that made you want to build something different?
Jonathan Ross 1:26
We didn't have enough compute. What had happened was the speech recognition team had trained a model and that model was better than human beings at transcribing. This was the first time they had ever achieved that. The problem was they couldn't put it into production. They had actually limited the deployment to the Nexus phone, the old Android phone.
Interviewer 1:47
I had one.
Jonathan Ross 1:48
Oh yeah, okay. So they limited it to Nexus, not so much as a feature, but because they had so little compute, they could only support the Nexus user base. I happened to be having lunch with the speech recognition team in New York City and they mentioned this problem. I started as a 20% project porting their model to an FPGA, created a general architecture, and then it turns out inference was needed quite badly and it became a chip. Actually, Jeff Dean did an analysis and was like, given what we're going to spend on this in capacity, let's just do an ASIC instead. My response was, how hard could it be? Turns out it was very hard, but we didn't know, so we jumped in.
Interviewer 2:32
I heard you use the term 'success disaster' in the past. I think that captures it very well, and I've experienced that at Google multiple times.
Jonathan Ross 2:41
Yeah.
Interviewer
So, Nvidia GPUs are great for training, but at inference time they're memory bottlenecked. How does Groq change the memory architecture to address that?
Jonathan Ross 2:53
Well, first, it's important to think about the trade-offs. There's no such thing as a free lunch. What you're trying to do is get the lowest cost per token because the cost actually determines your capacity, right? Everyone's in a race. People would pay more for more capacity, but if I'm spending twice as much, I'm getting about half as much capacity per dollar. That's really what I care about: can I get this many tokens per dollar? You also need speed. The trade-off is if you want the absolute best cost per token, you're just going to use a GPU as it is, with a very large batch size, and it's not going to be as fast as it otherwise could be. What we did with the LPU was we were able to scale to multiple chips without using any external memory and spread the model out across those chips. So we could use much faster SRAM, which allowed us to generate tokens faster but at a lower cost. If you know what a Pareto curve is, if you look at the Pareto curve of GPUs versus LPUs, they're quite different. There are portions of the curve where the GPU is better economically, and there are portions where the LPU is better economically, usually at the faster end. When you put the two together, it fills out that middle zone. So between the GPU, the GPU plus LPU, and the LPU, you now have the best cost per token. You have the most capacity at any speed you want to run at.
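A rough sketch of the Pareto comparison described above, in Python. The operating points are invented numbers chosen only to show the shape of the argument; they are not Groq or Nvidia measurements, and `OperatingPoint` and `pareto_frontier` are hypothetical names.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    system: str
    tokens_per_sec: float  # per-user generation speed (higher is better)
    usd_per_mtok: float    # cost per million tokens (lower is better)


def pareto_frontier(points):
    """Return the points that no other point dominates, sorted by speed.

    A point is dominated if some other point is at least as fast AND at
    least as cheap, and strictly better on one of the two axes.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.tokens_per_sec >= p.tokens_per_sec
            and q.usd_per_mtok <= p.usd_per_mtok
            and (q.tokens_per_sec > p.tokens_per_sec
                 or q.usd_per_mtok < p.usd_per_mtok)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.tokens_per_sec)


# Hypothetical operating points (illustration only).
POINTS = [
    OperatingPoint("GPU, large batch", 30, 0.20),
    OperatingPoint("GPU, small batch", 120, 1.50),
    OperatingPoint("GPU+LPU", 120, 0.60),
    OperatingPoint("GPU+LPU", 300, 1.20),
    OperatingPoint("LPU", 300, 2.00),
    OperatingPoint("LPU", 800, 2.50),
]

for point in pareto_frontier(POINTS):
    print(f"{point.tokens_per_sec:>5} tok/s  ${point.usd_per_mtok:.2f}/Mtok  {point.system}")
```

With these made-up inputs the frontier runs from the large-batch GPU at the slow, cheap end, through the combined configurations in the middle, to the LPU at the fast end, which is the "fills out that middle zone" picture.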
Interviewer
Another differentiator for Groq is that it uses static scheduling. The order of operations is predefined at compile time. Why is that an advantage for LLM inference?
Jonathan Ross 4:57
Let's use an example in calendar scheduling. If I want to have a bunch of short meetings, 15 minutes long, I have to schedule them because the person I'm meeting with has to show up at the right time. If I'm going to have a five-hour-long meeting, I don't really need to schedule it; you just show up. If you're 30 minutes late, it's 30 minutes out of five hours. With inference, you're doing super low latency, small batch size. You need to schedule all those operations so that each portion of the computation gets done quickly and frees up the hardware for the next part. You're not stalling all the work that has to come behind it. In training, this is less important. For inference, it's absolutely crucial.
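The calendar analogy can be sketched as a toy compile-time scheduler in Python. This is a generic list-scheduling illustration, not Groq's compiler: the unit names (`mem`, `mxu`, `vec`), the cycle counts, and the `Op` and `compile_schedule` names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str         # hardware unit the op runs on (hypothetical names)
    cycles: int       # fixed duration, known before anything runs
    deps: tuple = ()  # names of earlier ops that must finish first


def compile_schedule(ops):
    """Assign every op a fixed start cycle ahead of time.

    Ops are visited in program order, so each dependency must appear
    earlier in the list. An op starts as soon as its unit is free and
    all of its dependencies have finished.
    """
    unit_free = {}  # unit -> first cycle at which it is idle
    finish = {}     # op name -> cycle at which it completes
    schedule = {}   # op name -> start cycle
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0)
        start = max(ready, unit_free.get(op.unit, 0))
        schedule[op.name] = start
        finish[op.name] = start + op.cycles
        unit_free[op.unit] = finish[op.name]
    return schedule


PROGRAM = [
    Op("load_x", "mem", 4),
    Op("load_w", "mem", 4),
    Op("matmul", "mxu", 10, deps=("load_x", "load_w")),
    Op("load_w2", "mem", 4),  # overlaps with matmul on a different unit
    Op("softmax", "vec", 3, deps=("matmul",)),
]

SCHEDULE = compile_schedule(PROGRAM)
print(SCHEDULE)
```

Because every duration is fixed, the start cycles are known before execution, so nothing has to wait on a runtime arbiter; that is the "15-minute meeting" side of the analogy.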
Interviewer 5:47
A lot of state-of-the-art LLMs today use a mixture of experts architecture. At inference time, for every query, a different set of experts potentially gets activated. How does that work on a chip that does static scheduling?
Jonathan Ross 6:03
The question is what's being statically scheduled. I have this 15-minute slot on my calendar, but who I meet with can change. In the LPU, we have the ability to do scatters and gathers. That means depending on which expert is needed, we will fetch a different expert. It still runs for the same amount of time; it's just a different expert. If we have different-sized experts, we could even route to another chip, but then there's a little bit of a bubble in the pipeline. The determinism gives you a lot more ability to predict the timing; it doesn't restrict you to what you can run. The LPU architecture is particularly good at experts because the smaller the batch size, the better. Experts are particularly bad on batch size because if you're using external memory, you're pulling in one expert from outside and then you have to amortize that across a bunch of computation. If I'm reading from DRAM and then running that expert, I might need a batch size of hundreds to make economic sense. If I'm using an LPU, I might need a batch size of 10 for it to make sense, which means I don't have to wait for as many queries to batch up in order to run it. That brings the time down and the efficiency up. LPUs are almost perfect for expert models.
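The batch-size argument can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a simple model that is not from the interview: fetching an expert costs its size divided by memory bandwidth, and running it on one token costs about two FLOPs per weight. The peak-compute and bandwidth figures are invented to land near the "hundreds" and "10" mentioned above.

```python
import math


def break_even_batch(peak_flops, mem_bandwidth, bytes_per_param=1):
    """Smallest batch at which compute time covers the weight-fetch time.

    Fetching an expert with P parameters takes
        P * bytes_per_param / mem_bandwidth   seconds,
    and running it on one token takes about
        2 * P / peak_flops                    seconds
    (one multiply and one add per weight). P cancels out, so the answer
    depends only on the ratio of compute to memory bandwidth.
    """
    return math.ceil(peak_flops * bytes_per_param / (2 * mem_bandwidth))


# Hypothetical hardware figures (illustration only).
PEAK_FLOPS = 10**15              # 1 PFLOP/s of compute
EXTERNAL_DRAM_BW = 2 * 10**12    # 2 TB/s from external memory
ON_CHIP_SRAM_BW = 5 * 10**13     # 50 TB/s from on-chip SRAM

print(break_even_batch(PEAK_FLOPS, EXTERNAL_DRAM_BW))  # batch needed with external memory
print(break_even_batch(PEAK_FLOPS, ON_CHIP_SRAM_BW))   # batch needed with SRAM
```

Under these assumptions the external-memory case needs a batch of 250 tokens per expert to keep the compute units busy, while the SRAM case needs 10, so far fewer queries have to queue up before an expert can run efficiently.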
Interviewer 7:42
Speaking of architectures, when the transformer is replaced by the next shiny architecture, do the LPUs have to be completely redesigned, or are they orthogonal to the current shape of LLMs?
Jonathan Ross 7:56
The age-old question. When we designed the LPU, the 'Attention Is All You Need' paper had not yet been published. There are a lot of things that rhyme between attention and some of the other things that existed at the time, like convolutions. They are quite different, but it's all linear algebra. If you've built an optimal chip for linear algebra, then you've built an optimal chip for most of these architectures. You may decide to optimize for the size of the matrix multiplication, which could be different from one architecture to another. I've seen some people try to go very specialized, but what ends up winning the most is flexibility almost every time. If I was to tell you I was going to limit your ability to change the model, you could never change it again, but I could run it 10 times as fast, would that be interesting? The answer is probably not, because there's probably going to be a 10x improvement in algorithms. There's just been a recent change in the way attention works to shrink things 10x. Algorithmic improvements happen so quickly that flexibility is often what matters more. The LPU architecture is specifically designed to be super easy to program so that when new architectures come out, they can be adopted and the latest algorithm can be running very quickly.
Interviewer 9:26
The 'L' in LPU stands for Language. Does that mean that currently vision and audio models don't benefit from the same speedups? Or, as you mentioned, it's a general enough architecture to support other modalities as well?
Jonathan Ross 9:42
Some of the biggest users of the Groq cloud are speech-to-text users, and we also had some text-to-speech for a while. The reason is it's super sensitive to being real-time. Many of those models have convolutional layers embedded. This is where having a general architecture matters, because otherwise you wouldn't be able to run all this voice stuff on it. Where it really comes into play is it actually improves the quality by being faster. It's a little counterintuitive. You can split up audio segments into very small chunks and just run that chunk. But if you only hear a small clip, you don't get the full context, so it's much harder to predict what those words are. When people are using slower chips to do audio processing, they chunk it up more to make it real-time, which increases the error rate. It's like having two different people transcribe a speech at the same time, but each only heard five seconds of the speech at a time. They would have a lot more errors because they wouldn't have the full context. Because the LPUs were able to do speech transcription at hundreds of times faster than real-time, they could work on much larger chunks and actually improve the error rates on these models.
Interviewer 11:03
The use cases we talked about so far (language inference, audio) are mostly autoregressive, but vision models today are diffusion-based, and some LLMs are diffusion-based as well. Diffusion LLMs are a lot faster on a GPU than autoregressive LLMs. Does that ranking still hold on a Groq chip?
Jonathan Ross 11:27
Diffusion models benefit from the total amount of compute you have. Let's get down to what autoregressive means for the audience. Autoregressive basically means I'm going to figure out what this word is, then figure out what the next word is. Sort of like playing chess: I figure out one move before making the next, as opposed to figuring out all the moves in parallel. In language, it's hard to figure out what the 100th word is until you've figured out the 99th word. However, you can start breaking that down, saying some words matter more than others. I'll predict the important word and fill in other words around it. I'm seeing a lot of people trying to use diffusion models to generate language, and they're not getting great results. The reason is it's really hard until you've made a decision on what you're going to say here to figure out what you're going to say here. It's like a hundred people writing a speech, each not being able to see what the others do. In diffusion, information diffuses in time and space; the further away things are, the less impact they have. From a quality point of view, if you want to generate music using autoregressive versus diffusion, the autoregressive version will be soulful, deeper; you'll like it, but it might have a crackle or odd noise. Purely diffusion will be crystal-clear elevator music with no soul. Put the two together, and the important moments can be done autoregressively with context, and you fill in the rest with diffusion. Just like we paired the LPU and GPU together for decode in LLMs, I think the successful versions of diffusion in LLMs will likely combine autoregressive and diffusion together.
Interviewer 13:40
You've already alluded to putting together GPUs with LPUs. Nvidia announced earlier in March at GTC the Vera Rubin supercomputer, which is dedicated to inference, especially for agents. Can you tell us in what ways GPUs and Groq work together at inference time?
Jonathan Ross 14:03
Let me start with an analogy. Suppose I ask you to build a logistics network for the entire United States, starting from scratch. You can use either 18-wheelers or delivery vans. Delivery vans can go into any driveway but can't carry a lot, so they're more expensive per unit. The best answer is both. In this analogy, the GPU is the 18-wheeler: it can handle a whole bunch of tokens all at once but takes a little while to load up. The delivery van is more like the LPU: not as efficient for bulk, but for the last mile, it's more efficient. Putting the two together is like putting 18-wheelers and delivery vans together; you get a better network. With LLMs, there are two parts: the projection layers and the attention. We put the projection on the LPUs and the attention on the GPUs, getting the best of both worlds.
Interviewer 15:28
After the Nvidia agreement, should we expect Groq chips to be sold independently as they've been so far, either as a standalone chip or via Groq Cloud, or should we expect more hybrids, LPUs plus GPUs being sold together?
Jonathan Ross 15:45
I think you're going to see hybrids. We still recommend doing the prefill, which is reading the text, on GPUs only because GPUs are really good at that and it's not as latency-sensitive per token. It's latency-sensitive to complete but not per token; it's a very parallelizable problem. Stick that on the 18-wheeler GPU. When it comes to the decode, for some cost-sensitive applications, like free users, you'll probably do decode entirely on GPUs. If you have a professional user base paying for more speed, you'll see a GPU and LPU combo. For extreme users doing very performance-critical tasks, you might even see LPU-only on decode. In any data center, you'll see prefill fully on the GPU and decode partially on LPUs and partially on GPUs.
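The placement policy described above can be written down as a small lookup. This is only a sketch of the idea as stated in the answer; the tier names, the `place_request` function, and the string labels for hardware are hypothetical and do not correspond to any real Groq or Nvidia API.

```python
from enum import Enum


class Tier(Enum):
    FREE = "free"        # cost-sensitive users
    PRO = "pro"          # paying for more speed
    EXTREME = "extreme"  # very performance-critical workloads


def place_request(tier):
    """Map a user tier to the hardware used for each inference phase.

    Prefill (reading the prompt) is highly parallel and always goes to
    GPUs; only the decode placement varies with the tier.
    """
    decode = {
        Tier.FREE: "gpu",
        Tier.PRO: "gpu+lpu",
        Tier.EXTREME: "lpu",
    }[tier]
    return {"prefill": "gpu", "decode": decode}


for tier in Tier:
    print(tier.value, place_request(tier))
```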
Interviewer 16:50
This Vera Rubin supercomputer was advertised mostly for agentic inference. Over the past year, we've seen agents taking over. How does that change the unit economics and the cost of inference at scale?
Jonathan Ross 17:07
First, most people don't know what agentic is. It's a buzzword, but let's define it because it's really important. Agentic is a little bit like the Nvidia of AI: it's about breaking things up into parallel tasks. CPUs are sequential, GPUs are parallel. If you have a task you do by yourself, you get blocked waiting for things. But if you can split it up, multiple people can work on it. With AI, you can't produce the 100th token until you produce the 99th. But if you can break the problem into things that don't have those dependencies, you can have multiple agents, multiple context windows, work on it at a time. For some problems that doesn't work, but for most it does. Now, AI is using AI: it hands off tasks to other AI. Why that's leading to an absolute explosion in usage is AI using AI using AI. It becomes exponential growth. The quality of answers gets better the more independent subtasks get run, because it's like having a bigger team that runs more checks and ensures the answer is better informed.
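The fan-out pattern described above, where independent subtasks run in separate context windows at the same time, can be sketched with Python's `asyncio`. The subtask names are made up, and `run_subtask` is a stand-in that sleeps instead of calling a model.

```python
import asyncio


async def run_subtask(name: str) -> str:
    """Stand-in for one agent working in its own context window."""
    await asyncio.sleep(0.01)  # placeholder for a model call
    return f"{name}: done"


async def orchestrate(subtasks):
    """Fan independent subtasks out concurrently and collect the results.

    asyncio.gather returns results in the same order as its inputs,
    regardless of which subtask finishes first.
    """
    return await asyncio.gather(*(run_subtask(t) for t in subtasks))


RESULTS = asyncio.run(orchestrate(["search docs", "write tests", "review diff"]))
print(RESULTS)
```

The three subtasks take roughly as long as one because none depends on another's output. A chain of dependent steps, like producing the 100th token after the 99th, could not be split this way.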
Interviewer 19:21
Speaking about AI, I want to talk about its impact on engineers. CUDA kernels are notoriously hard to write manually. Do you think AI is going to get good enough to write them itself?
Jonathan Ross 19:36
I think it may already have gotten good enough, but it's not as binary as it sounds. You don't just write a kernel or not write a kernel; it's about how good the kernel is: how efficient, how performant, how easily it fuses with other kernels, how general and reusable it is. As AI gets better, kernels will get better, but the more time you spend on a particular kernel, the better it gets. One interesting thing about the Groq architecture, the LPUs, is that we have a kernel-free architecture. The inspiration was that when Groq was started, we didn't have LLMs to write the software. We had a small team, so we built a chip that didn't have a lot of complexity around compiling to it. It's an easier problem. If the hardware you're compiling to is easier to reason about, AI will produce even better kernels. We've been using AI to program the LPUs and getting very good results because it's so easy to wrap your head, or in this case your LLM, around the problem.
Interviewer 21:08
That's very interesting. Zooming out, AI has lowered the barrier to writing software. From what you're saying, it sounds like that's starting to happen for hardware as well. Will we see more people building hardware because it's easier?
Jonathan Ross 21:23
Absolutely, you'll see more people trying to design hardware. One thing that's going to be a problem is that hardware is a physical thing; it requires experiments. In software, you see results instantaneously and can iterate. In hardware, there are supply chains and big bets. You'll see a lot of people trying to do chips because it's easy to design one, but it's very hard to take one to production. It becomes the baby turtle problem. There's only a finite amount of supply. Customers want to bet on something they know will work. You'll see more companies doing it, but an even smaller number going to production because it'll be so hard to choose; you'll want to go with ones you can depend on.
Interviewer 22:32
Sounds very similar to software. It's easy to build a prototype in your bedroom, but a lot harder to put it on the market and make it reliable.
Jonathan Ross 22:41
With one difference. If you release software with a bug, you can patch it. If you have an error in your chip, it takes four to six months to respin it because chips are physical. There are 60 to 70 layers of chemical deposition, taking a day or more per layer. The mask costs tens of millions of dollars. If you get it wrong, that's a cost, but it's nothing compared to telling customers it'll be another six months. On top of that, because of how the supply chain works, you have to commit to building something. If you don't have a chip to launch, that's a very expensive mistake. So I don't think you'll see a bunch of cowboys throwing chips at the wall. You'll see many fewer players making chips because the stakes are so high, and you'll want to go with people you can depend on, especially as costs skyrocket.
Interviewer 24:01
Are there any ways in which AI makes hardware design easier that aren't obvious to an outsider? Perhaps something related to efficiencies in the supply chain or something I can't think of because I've never built hardware before.
Jonathan Ross 24:17
One unusual thing about my background is that I was a software engineer who became a hardware engineer—that rarely happens. Building a chip is painful because you go from quick results to waiting four to six months. In our team, hardware engineers who never wrote software before would always ask a software engineer to write software. Now they go and implement a little software test to see if something works, getting immediate feedback. Hardware and software development are distinct but have overlapping parallels—different languages and ways of reasoning. With LLMs, a hardware engineer can ask the LLM to write software to run on their hardware, and if it doesn't work, they realize they need to change something. It's empowered self-service, blurring the lines between disciplines. People are reaching into neighboring disciplines. We're seeing the same with software engineers and designers: a software engineer doesn't need to wait for a design to implement something, and designers use coding tools to put something out there. If a software engineer debates a hardware engineer, they can just implement it and show it works.
Interviewer 26:13
We started our conversation talking about the success disasters at Google. What are some good success disasters you hope will happen in the future with Groq, and by extension Nvidia?
Jonathan Ross 26:27
This goes to Jevons' paradox: the need for compute is limitless. As long as there are unsolved problems in civilization—cancer isn't cured, people get old, there isn't enough compute—we'll need more compute. We'll need smarter AI and more compute to run that AI in parallel to solve more problems. As we improve, the cost per unit of intelligence goes down. Then you get Jevons' paradox: the lower the cost, the more people spend. It comes from a treatise on coal: every time steam engines got more efficient, the total coal consumed increased. As an activity becomes less expensive, it becomes possible to do activities that weren't profitable before. With AI becoming inexpensive, demand increases. People will need more compute. We're going to have endless demand for compute. Another analogy: if you pull twice as much oil out of the ground, it doesn't enable twice as many people to get a transportation benefit—you need a car. But once you train a model, if you provide twice as much compute, twice as many people can use it, and you can solve twice as many problems. Every AI factory built immediately enables more things, pushing costs down and fueling Jevons' paradox. Success disasters are inevitable.
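The Jevons argument above comes down to a statement about price elasticity, which a few lines of Python can show. The constant-elasticity demand curve used here is a textbook simplification assumed for illustration, and the elasticity values are invented; nothing in the interview specifies either.

```python
def total_spend(price, elasticity, scale=1.0):
    """Total spend = price * quantity under constant-elasticity demand.

    Quantity demanded is modeled as scale * price ** -elasticity, so
    spend works out to scale * price ** (1 - elasticity).
    """
    quantity = scale * price ** -elasticity
    return price * quantity


# Halve the price of a unit of compute (hypothetical units).
for elasticity in (0.5, 2.0):
    before = total_spend(1.0, elasticity)
    after = total_spend(0.5, elasticity)
    print(f"elasticity={elasticity}: spend {before:.2f} -> {after:.2f}")
```

When elasticity is above 1, halving the price more than doubles the quantity demanded, so total spend rises. That is the regime the coal example and the "endless demand for compute" claim both assume. Below 1, cheaper compute would shrink total spend instead.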
Interviewer 28:58
Is there anything else you want to communicate to a hyper-technical and curious audience?
Jonathan Ross 29:04
A lot of people ask me what their kids should do. My answer is simple: education today is about information age thinking—teaching them to come up with an answer to a question. With AI, it flips to coming up with the right question. If you can ask the right question, AI can answer it for you. My big recommendation is to learn how to ask better questions. Teach your kids how to ask better questions. Revamp the education system to be question-oriented. If kids are solving problems too easily by feeding questions into AI, you're not teaching them to be successful. But if you give them a problem where they have to come up with the questions themselves, you're preparing them for the future.
Interviewer 30:08
That rings true. I took a career hiatus from research and startup life because I found so much pleasure in just talking to the AI, asking questions, and learning new things. I'm learning about hardware right now for this video, asking questions that would never be in a paper. Thank you so much for joining us today. I really appreciate your time. It's been a real pleasure.
Jonathan Ross 30:36
Well, thanks for having me.