Ion Stoica
Cofounder, Databricks

Distributed systems to AI platforms with Mark Russinovich & Ion Stoica | BRK227

🎥 Jun 09, 2026 📺 Microsoft Developer ⏱ 45m 👁 456 views
What will it take to build AI platforms for the agent era? Join Azure CTO Mark Russinovich and UC Berkeley professor Ion Stoica (co-creator of Apache Spark and co-founder of Databricks and Anyscale) to explore how AI infrastructure must evolve as systems become agentic, multimodal, and globally distributed. Get practical insights on next generation architectures, from training to real time serving, and why open source, security, and governance are now core platform concerns. Seating for this session is first-come, first-served. Add it to your schedule to plan your day and arrive early to secu...

About Ion Stoica

Ion Stoica, a UC Berkeley professor and co-founder of Databricks and Anyscale, has recently spoken at several events about the challenges of building reliable AI systems and the evolution of AI infrastructure.

In a June 2026 fireside chat at Microsoft Build with Azure CTO Mark Russinovich, Stoica discussed the increasing complexity of AI platforms, noting that traditional modular system design is being broken by the need for cross-layer optimizations. He described a typical AI stack as including Kubernetes or Slurm at the lower level, SkyPilot for cross-region movement, and frameworks like Ray or Pathways for distributing AI workloads. Stoica also argued that systems are becoming more stateful due to the high cost of saving and initializing state on GPUs, which complicates fault tolerance.

In a separate keynote at a conference in June 2026, Stoica argued that reliability is "arguably the biggest challenge, biggest barrier to AI adoption, in particular in enterprises." He stated that moving an AI system from prototype to production requires 10 to 50 times more resources and time, and that LLMs, as key components of agentic systems, rarely have clear specifications and are difficult to debug because they are black boxes. Stoica noted that reliability is not unique to AI, but is a problem for any system people depend on.

In a May 2026 conversation with UC Berkeley professor Shankar Sastry, Stoica said that universities "need to smell the coffee" and that taxpayer money should serve as the most patient capital for research, rather than forcing the capitalist system to be more patient. He also commented on the state of proprietary AI models, saying the use of human capital is "extremely inefficient" because many engineers are doing similar work without incentive to share knowledge.

Source: AI-verified profile updated from Ion Stoica's recent appearances.

Transcript (70 segments)
✨ AI-enhanced transcript with speaker attribution
M
Mark Russinovich0:00
Good afternoon, everybody.
I
Ion Stoica0:02
Good afternoon.
M
Mark Russinovich0:08
So it's good to see you all here at Build. I'm Mark Russinovich.
I
Ion Stoica0:11
And I am Ion Stoica.
M
Mark Russinovich0:13
And we've got kind of a fireside chat today, but it's a fireside chat where we don't have a moderator. We're moderating ourselves. So, we're going to be talking through a bunch of questions related to AI, AI infrastructure, models, architectures for AI, and it should be a pretty engaging conversation.
I
Ion Stoica0:32
I hope so.
M
Mark Russinovich0:33
Yeah. So, I think we can start by talking a little bit about our backgrounds and why we're up here talking about these subjects. So, Ion, you've got a background where you've had some amazing impact on the industry in data analytics and in AI, in your role as a professor and in your role as a startup founder, with Ray, vLLM, Databricks, and Spark. Why don't you tell us a little bit about your background?
I
Ion Stoica0:59
Yeah, so actually, starting around 15 years ago at Berkeley, we started to develop quite a few systems, initially in the domain of big data. We started with Mesos and Spark, then after that Ray, and more recently vLLM and SGLang. At a high level, we're always trying to understand and figure out what problem we are trying to solve, and we're also a little bit careful about the trends: a problem you have today, is that problem going to be bigger tomorrow than today? Spark was about scaling up data analytics, making it more interactive, and much better supporting iterative patterns like machine learning. Then Ray was about scaling complex AI workloads, like reinforcement learning, on heterogeneous hardware back then. And of course, vLLM is trying to exploit the power inherent in the GPUs; there the memory became a bottleneck, so it is trying to alleviate that bottleneck. The one thing I want to say, because we are at the Microsoft conference here, is that I'm very happy that some of the things we've done have had impact also through our collaborations and partnerships. As you know, with Spark and Databricks, there is a great partnership in Azure Databricks. And actually, today, we also announced a partnership between Azure and Anyscale, and Anyscale is built around Ray. So, yes, very happy about that and very much looking forward to this partnership growing.
M
Mark Russinovich3:16
Likewise, yeah.
I
Ion Stoica3:18
Yeah. So, you know, on my side, Mark, you have so many contributions. You wrote Windows Internals, then you architected the Azure infrastructure, which I did use. And now you are building and architecting AI infrastructure for Azure, and leading security. So, one question I have: you've seen all these phases, and you've been in distributed systems and security for a long time. So, what is new and what is old? What still applies and what no longer applies?
M
Mark Russinovich4:09
Yeah, well, that's a good question. I think one of the things that still holds, and this holds, I think you'd even say, for infrastructure and systems, is that the fundamentals are still in place. I think if you take a look at fundamentally what's changed, it's these non-deterministic systems that are now carrying out workflows that they create on the fly. And that introduces a bunch of new risk areas. In some sense, humans are non-deterministic agents creating workflows on the fly. But humans you can hold accountable. Also, humans, I think in general, are more reliable than the AI systems we have today, and certainly, when it comes to humans, we designed our systems around human speed and scale. And when you have intelligent agents now running around, they can do lots of damage in a very short period of time. So I think if we get into it a little bit later, we'll talk about some of the risks around that: risks around prompt injection, risks around jailbreaking, risks around hallucination. All of those things can cause workflows to go awry and the system to not behave as you intend.
I
Ion Stoica5:13
Yeah, so the fundamentals still remain. But again, to be more precise about concepts like consensus, consistency, and distributed scheduling: which of these concepts remain more or less the same, and which are changing and morphing to support new requirements?
M
Mark Russinovich5:37
Yeah. Well, I think when we talk about AI and running these agent workloads, and you talk about the LLMs and hosting the LLMs, things like consensus around the infrastructure, knowing what is available and online, and things like load balancing still matter. Locality still matters. So you want to route requests carefully, especially when you talk about KV caches, which hold the context. You need to make sure that the requests are going back to the same systems. Otherwise, you're going to be wasting compute and bandwidth as you're moving things around. So all of those things still apply. And this is an interesting question, I think, for you. Are these systems stateless? Do you consider them stateless or stateful?
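The locality point above can be illustrated with a small routing sketch. This is a minimal example under stated assumptions, not how Azure or any particular serving stack routes requests: it uses rendezvous hashing so that the same conversation keeps landing on the replica that already holds its KV cache. The replica names and the `pick_replica` function are hypothetical.

```python
import hashlib


def pick_replica(session_id: str, replicas: list[str]) -> str:
    """Rendezvous (highest-random-weight) hashing.

    The same session_id always maps to the same replica as long as that
    replica stays in the list, so its cached context can be reused.
    """
    def score(replica: str) -> int:
        digest = hashlib.sha256(f"{session_id}|{replica}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(replicas, key=score)


# Hypothetical replica names, for illustration only.
replicas = ["gpu-node-a", "gpu-node-b", "gpu-node-c"]
first = pick_replica("conversation-42", replicas)
second = pick_replica("conversation-42", replicas)

# Removing a replica that was not chosen does not move the session.
others = [r for r in replicas if r != first]
after_removal = pick_replica("conversation-42", [first, others[0]])
```

Rendezvous hashing is one of several ways to get this kind of affinity; a production router would also have to weigh current load and cache occupancy.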
I
Ion Stoica6:23
I think it very much depends. Certainly, I also see a serverless trend. But I think that quite a few systems are becoming more stateful. That is because the cost of bringing the state in, and saving the state if needed for the next operation, is in some cases too high. And it's not only loading the state and saving the state, it's initializing. If you are talking about initializing the GPUs and so forth, it can take a long time. So this forces you to be more stateful. The problem with that is that, as we all know, it complicates fault tolerance.
M
Mark Russinovich7:19
Yeah. So when it comes to these distributed infrastructures for AI, which include the GPUs, high-bandwidth interconnects, and the KV cache, when you look at distributed systems through their evolution, what assumptions change with an architecture to support this kind of computation?
I
Ion Stoica7:39
Great question. It very much depends on where you are coming from. If you're coming from Spark and big data, it's about MapReduce plus plus. It's bulk synchronous processing, which is a pretty simple model. You partition the data, you have stages, at every stage you apply the same task on different partitions, and you do a shuffle between them. With KV cache and inference, for instance, things are becoming so much more tightly coupled. If you think about mixture of experts, and then you have the KV cache and you need to distribute it across different GPUs on the same node and even across nodes, everything becomes much more coupled and more difficult because of that. Now, if you come from HPC, maybe you can say, "I told you so." But even there, with HPC, you don't have such a tight coupling between the data and the compute, because you don't have a lot of data to start with, since you do simulations. So I think that's what it is. And this creates a lot of challenges. Even when you talk about load balancing, things are changing so much more quickly if you think about expert parallel load balancing, because the load can change from layer to layer as you go through the forward pass to do inference and serving. And the other thing is that things are evolving so quickly. By the time you build something, you have new requirements. It's never ending.
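The bulk synchronous model described above (partition the data, apply the same task to every partition in a stage, shuffle between stages) can be sketched in a few lines. This is a toy, single-process illustration, not Spark's implementation; `run_stage`, `shuffle`, and the word-count tasks are names made up for this example.

```python
from collections import defaultdict


def run_stage(partitions, task):
    """One stage: apply the same task independently to every partition."""
    return [task(partition) for partition in partitions]


def shuffle(partitions, num_partitions):
    """Regroup (key, value) pairs so each key lands in exactly one partition."""
    grouped = [defaultdict(list) for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in partition:
            grouped[hash(key) % num_partitions][key].append(value)
    return [list(group.items()) for group in grouped]


def count_words(lines):
    # Map task: emit (word, 1) for every word in this partition.
    return [(word, 1) for line in lines for word in line.split()]


def sum_counts(grouped_pairs):
    # Reduce task: sum the values collected for each key.
    return [(key, sum(values)) for key, values in grouped_pairs]


partitions = [["spark ray spark"], ["ray vllm"]]
mapped = run_stage(partitions, count_words)   # stage 1
shuffled = shuffle(mapped, num_partitions=2)  # barrier between stages
reduced = run_stage(shuffled, sum_counts)     # stage 2
totals = dict(pair for partition in reduced for pair in partition)
```

Every partition can be processed without knowing about the others until the shuffle, which is the loose coupling that tightly coupled inference workloads give up.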
M
Mark Russinovich9:59
Yeah.
I
Ion Stoica10:04
So, I remember 10 or 15 years ago, when people started with cloud computing and so forth, people were saying the data center is the new computer. I'm just curious, is that abstraction still accurate? Because now you can have training spanning different availability zones and things like that. I'm curious about that.
M
Mark Russinovich10:37
I mean, I think if you take a look when that statement was made, it was when a data center would hold all the compute that you could imagine in the data centers at that time. Even if you take a look at the evolution of Azure as a hyperscaler, the availability zone became a unit of compute, but even in an availability zone, we have multiple data centers typically. It's not just one data center per availability zone. Then you have a region, you have multiple availability zones. So even for traditional IT workloads or serverless computing, your application can be spread across many data centers. And now, you mentioned distributed training and AI training, and that's where you run into these AI super factories as we call them, where you've got massive datacenters just filled with GPUs specifically to host that training workload. And for that, we passed the line several years ago of a single datacenter being enough to hold the scale of the AI supercomputers that we're building now to train these frontier models. And then over the last couple of years, we've crossed a line of even one region of datacenters is not big enough to train these models. And now we have these Fairwater systems that we've launched, these AI super factories where there are multiple Fairwaters in Wisconsin, multiple in Atlanta, and they're connected with a dedicated network so that you can have a job that spans the GPUs across both regions with a very direct cable between them doing RDMA across the two regions.
I
Ion Stoica12:10
Yeah, it's fascinating.
I
Ion Stoica12:11
Now, going back again, we discussed a little bit about serverless.
M
Mark Russinovich12:14
My favorite services.
I
Ion Stoica12:17
Yeah. So, what does it mean when you have today's long-lived agents, which can go hours, sometimes days?
M
Mark Russinovich12:32
Well, if you talk about serverless, people kind of speak about it synonymously with containers. Serverless equals containers. And when people think containers, they think of the original serverless containers, like Lambda or Azure Functions, which are short-lived, and so people conflate serverless with short-lived applications or pieces of code, when in fact serverless doesn't have an opinion about what code runs in these containers. You could put SQL Server inside of a container, and it can run for years. And it also can be monstrous. It can have terabytes of memory. So for me, I separate the two. I say serverless means you don't worry about the infrastructure. It is containerized for the most part, although we talk about lightweight containers now and Wasm sandboxes, which are also a form of serverless. And that's where I think the whole trend is going. If you take a look at Azure's architectural evolution, we're moving everything to serverless. We're moving the whole infrastructure to containers that are on a serverless infrastructure that is doing the scheduling and placement for the job, which is expressing constraints.
I
Ion Stoica
So now continuing on that note. We built these infrastructures mostly to serve requests initiated by people directly or via BI tools and things like that. But now we are moving rapidly to an age of agents, to having agent loops instead of just requests as a unit of work. How do you think this impacts the architecture stack?
M
Mark Russinovich14:25
Well, I think we're still at the early days of this, actually, because just in the last few years, we went from chat to now agentic systems, which are this loop where the agents are talking to other systems as part of their computation. And they might even be talking to other agents. And so when it comes to how you define an agentic application that consists of these things, I think we're still in the early days of specifying the application model for these. And a lot of it is still ad hoc. But we're getting towards what is an agentic application? What are the components? How do you place them for efficiency? How do you automatically deploy them instead of taking piecemeal and deploying and connecting them together after the fact? So in my opinion, I think that we're still early. But I'd be curious to hear about what you think.
I
Ion Stoica15:19
So, yeah, I think we are early. The one thing I see is a constant increase in complexity. Even the definition of agents: what is an agent? Well, an agent was an LLM invoking a tool. But now it's much more than that. You have a tool, you have an entire harness, which can be very complex. And then you are talking about some kind of continual learning. People talk more and more about whether this happens in context, continuously evolving the prompt, and in some cases even updating the weights, kind of like RL. So, you need something to support very diverse and complex applications. You need to do inference for sure, then basically run arbitrary applications, because that's part of the harness. And then maybe you are also going to do some kind of training if you want to update the policy. And as for the models, a model has to be co-located if you want to update it, or it can just be something running outside, like OpenAI or a new model released by Microsoft. So it's so complex. You need to have something which is very flexible to support these workloads. And yeah, I think that we'll see. I assume that in time things will stabilize one way or another, but we are very early.
M
Mark Russinovich17:16
Yeah. So one of the things that I'd be really curious to hear your thoughts on, in the systems that we've designed, they've largely been deterministic. You design the system, you've got a workload, you optimize for it. These systems are inherently non-deterministic, as we talked about. And so how does that change the way you design a system?
I
Ion Stoica17:38
That's a great question. And just to be clear, one of the reasons they are non-deterministic is because, obviously, the LLMs themselves are non-deterministic. If you send the same prompt over and over again, you shouldn't expect the same results. Actually, in some cases, you rely on that to get diversity in order to get better solutions. So I think there are a few things to consider. First of all, what is your output? In some cases, like with coding and so forth, the artifact can still be deterministic. You use the model to create, to optimize, even to synthesize a system, and that system is deterministic. You can have a test harness and so forth to check it, and then you can have metrics to hill-climb. If you want to optimize for throughput or for latency, the output is still deterministic. Now, this doesn't answer the question of how you debug your evolving systems or agentic systems. And unfortunately, there, you presumably have to record the inputs and outputs and the traces, and then try to replay the traces to find what the problem is, if there is a problem.
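The record-and-replay idea mentioned above can be sketched as a thin wrapper around the model call. This is a minimal illustration, assuming the only source of non-determinism is the model itself; `Recorder`, `Replayer`, `flaky_model`, and `agent_loop` are hypothetical names, and `flaky_model` is a random stand-in for a real LLM.

```python
import json
import random


class Recorder:
    """Wraps a model call and records every input/output pair in order."""

    def __init__(self, model):
        self.model = model
        self.trace = []

    def __call__(self, prompt):
        output = self.model(prompt)
        self.trace.append({"input": prompt, "output": output})
        return output


class Replayer:
    """Serves recorded outputs in order instead of calling the model."""

    def __init__(self, trace):
        self._steps = iter(trace)

    def __call__(self, prompt):
        step = next(self._steps)
        if step["input"] != prompt:
            raise ValueError(f"trace diverged at prompt: {prompt!r}")
        return step["output"]


def flaky_model(prompt):
    # Stand-in for a non-deterministic LLM call.
    return f"{prompt} -> {random.randint(0, 10**6)}"


def agent_loop(model):
    plan = model("plan the task")
    return model(f"act on: {plan}")


recorded = Recorder(flaky_model)
live_result = agent_loop(recorded)

# Persist the trace (here just a JSON round trip), then replay it.
saved = json.loads(json.dumps(recorded.trace))
replayed_result = agent_loop(Replayer(saved))
```

Replaying from the trace makes a single run reproducible for debugging, even though a fresh live run would produce different outputs.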
M
Mark Russinovich19:07
So when you talk about an AI stack, you talk about GPUs, network, you talk about different types of accelerators. Where do you see the optimizations happening, the most important ones? Or have we reached some limit someplace where there's the bottleneck that we need to focus on?
I
Ion Stoica19:31
I think clearly it will happen. I think there are three levels at which optimizations will happen. At the top level, there will still be many algorithmic optimizations. So I assume that a few orders of magnitude of improvement over the next few years will come from the algorithms. What I mean by algorithms: if you look at just inference or training, right now, of course, the workload changes, you have agents, you have much larger prompts or outputs, the context is much larger. You see a lot of optimizations, including optimizations to the model architecture, like sparse attention and different kinds of attention. You see different layers with different attentions. And now you even see different attentions within the same layer. Maybe you'll even have per-token learned attentions and things like that. So that can give you a big improvement. Of course, this comes with a lot of complexity. Then, obviously, you are going to have the GPUs, the accelerators, which are still going to make progress, with 1.5x or 1.6x improvements every year. And that's coming, again, with complexity, because it's not like Moore's law used to be, where the instructions and architecture stayed more or less the same. A lot of these improvements come from adding new instructions which target particular new segments of the workload. I expect you are going to have more and more instructions in the GPUs to target these kinds of sophisticated attentions. And then, at the inherent hardware level, this will go a little bit slower. You are still going to have smaller transistors, you are going to be more power-efficient, power per FLOP will improve, and things like that. But these are multiplicative. So when you multiply them, you can get massive gains in the next few years.
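To make the "these are multiplicative" point concrete, here is a small worked calculation. Only the 1.5x yearly accelerator figure comes from the discussion above; the 100x algorithmic gain and the three-year horizon are illustrative assumptions chosen for this example, not numbers from the talk.

```python
def combined_gain(algorithmic_gain: float, hardware_gain_per_year: float, years: int) -> float:
    """Gains at independent levels multiply rather than add."""
    return algorithmic_gain * hardware_gain_per_year ** years


# Assumed: 100x from algorithms over the period, 1.5x per year from accelerators.
hardware_only = combined_gain(1.0, 1.5, 3)  # 1.5 ** 3 = 3.375
total = combined_gain(100.0, 1.5, 3)        # 100 * 3.375 = 337.5
```

Under these assumptions the hardware trend alone gives a bit over 3x in three years, while stacking it on the assumed algorithmic gain gives more than 300x.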
M
Mark Russinovich21:45
Okay, I have another question for you. Because you've been involved so much with open source in AI infrastructure and analytics, do you see a canonical open source-based stack? And do you think it will converge or there will be diversity?
I
Ion Stoica22:03
I think that there is some stack which is used by companies. At the lower level, Kubernetes or Slurm. Under the hood, if you want to move, because everyone wants to have the freedom to move across regions or from region to region, maybe you have something like SkyPilot. On top of that, you have to have some framework like Ray or Pathways or something like that, which helps you to distribute these AI workloads; then obviously you have PyTorch and so forth. So I think you can put together this stack, and then you have vLLM, or SGLang, or TensorRT-LLM as an inference layer. And then for post-training, you have verl or SkyRL, and things like that. So you can create a stack, and a pretty good stack. The question is what will happen and how this stack will evolve, because the one thing I see in terms of increasing complexity is this. One of the things you teach students, and we learn ourselves, is how to build a complex system: you build it in a modular way, with components with nice, clean abstractions which abstract away the implementation, so you can evolve each of them independently. Now, the problem with that, when we are talking about these layers and all these components, is that, as we know, it also comes with some overhead. And it used to be perfectly fine to spend a few percent, 10% overhead, to have this nice, evolvable system, because we could be even faster by evolving these systems quickly. But now there is so much breaking of the layering, what are called cross-layer optimizations. This is what I see. In some sense, all these layers are going to get crossed and broken. And you can maybe do it in an orderly way, but I don't know where you stop. There is this collaboration between Ray and Kubernetes, with you guys. What is happening there is that something like Kubernetes wants to know more information about the applications from Ray to make more intelligent decisions, and something like Ray, which is up there, now wants to have more control to tell Kubernetes what to do. So you can allow that. But I don't know when you stop because, of course, you are no longer going to have these clean, narrow abstractions. I think that's very interesting.
M
Mark Russinovich25:11
So, we've got about five minutes left, I think.
I
Ion Stoica25:18
So, one question I do have for you: we are talking about workflows, agents, and so forth. Do you think there is a boundary, two different kinds of workload patterns, or do you think they are...
M
Mark Russinovich25:41
For agents and workflows?
I
Ion Stoica25:43
And workflows. Traditionally, what we call them.
M
Mark Russinovich25:44
You know, the beauty of agents is that they can make workflows dynamically. And I think that instead of you having to go and prescribe a workflow and write it out, you can give AI a problem, and it'll go figure out how to execute that task and come up with a workflow dynamically. When I look at those systems, if you've got anything where the workflow is going to be repeated many times, you want to make that deterministic and not rely on an AI to make it up every time you've got the problem. So I think where things are not well-defined, where they're very ad hoc, then AI and agents are great for making up the workflows and solving the problem. But for anything where you've got a repetitive process, you should go try to make it deterministic, because it's going to be much more efficient, much more reliable, and much cheaper. So all of those things are going to be true. And I think that this is where I see people over-rotate on, "Oh, I can have an agent go do this thing." When I look at it and go, actually you could just write a simple workflow.
I
Ion Stoica26:54
You can have the agent synthesize that workflow.
M
Mark Russinovich26:57
That's right. That's the other way to do it is have the agents make the workflow, and then the workflow is deterministic.
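The pattern discussed above, where an agent synthesizes a workflow once and the workflow is then run deterministically, can be sketched as a simple cache. This is an illustrative sketch only: `WorkflowCache` is a hypothetical name, and `fake_agent_synthesize` is a stand-in for an expensive, non-deterministic agent call.

```python
from typing import Callable

Step = Callable[[dict], dict]


class WorkflowCache:
    """Ask the agent for a workflow once per task kind, then reuse it."""

    def __init__(self, synthesize: Callable[[str], list[Step]]):
        self._synthesize = synthesize
        self._workflows: dict[str, list[Step]] = {}
        self.synth_calls = 0

    def run(self, task_kind: str, payload: dict) -> dict:
        if task_kind not in self._workflows:
            # Slow, non-deterministic path: taken only the first time.
            self.synth_calls += 1
            self._workflows[task_kind] = self._synthesize(task_kind)
        # Fast, deterministic path: plain function calls in a fixed order.
        for step in self._workflows[task_kind]:
            payload = step(payload)
        return payload


def fake_agent_synthesize(task_kind: str) -> list[Step]:
    # Stand-in for an agent that writes the workflow steps.
    return [
        lambda p: {**p, "validated": True},
        lambda p: {**p, "total": p["price"] * p["quantity"]},
    ]


cache = WorkflowCache(fake_agent_synthesize)
first = cache.run("invoice", {"price": 3, "quantity": 4})
second = cache.run("invoice", {"price": 5, "quantity": 2})
```

The second call never touches the agent, which is where the efficiency, reliability, and cost benefits of a repeated deterministic workflow come from.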
I
Ion Stoica27:02
Yep, that's what I mentioned. So the other thing is that we used to have training and serving as separate deployments and things like that. But now with this post-training, starting even earlier from reinforcement learning from human feedback, and now continual learning and things like that, it seems that the line is blurring. So what should the platform look like...
M
Mark Russinovich27:37
This is the multi-hundred billion dollar question, isn't it? I think because we run into two things. I mean, you heard about Maia.
I
Ion Stoica27:46
Yes, of course.
M
Mark Russinovich27:47
And Maia is optimized for inference. Then you hear about NVIDIA's chips, which are optimized, for the most part, for training. And then you mentioned GRPO or DRPO or any of these post-training things that are happening in the loop with it, where you're doing inference at the same time you're doing the training. And so whenever you've got a situation where you can optimize for efficiency, obviously, that's going to be what you want to do, especially at the scales you're at, where you can save literally tens of millions, hundreds of millions of dollars by optimizing for some specific path. But when you optimize for that specific path and then things change, and they're not a good fit, that can be a big problem. And so there is this tension now because, like you said, the evolution of these architectures hasn't finished. If you over-optimize for one workload, you could run into trouble. So the question is, how general should we be, leaving money on the table in the short term so that in the long term we're protected against changes in the architecture or in the application requirements, versus taking advantage of optimization now and potentially being bitten by it later? I think that's the tension that you see.
I
Ion Stoica29:06
Absolutely.
M
Mark Russinovich29:10
All right. So, we're down to a couple of minutes left, according to the clock up here. So I want to thank you for coming and spending time with us at Build. Your impact on the industry is monumental, and it's gracious of you to come and take time out of your day to share your thoughts on the evolution of AI infrastructure.
I
Ion Stoica29:28
Yeah, thank you. And we haven't touched on —
M
Mark Russinovich29:31
On confidential computing.
I
Ion Stoica29:33
On confidential computing. So before going, I know this is a topic dear to your heart, Mark. So can you say something about security governance and so forth?
M
Mark Russinovich29:42
Well, absolutely. On security and governance, we didn't really get time to deep dive on this. But when it comes to jailbreaks and prompt injections, I've been doing a lot of research on this. This is, like I mentioned earlier, a big topic of evolution where we haven't bottomed out yet. I view it as kind of like the early days when people were downloading stuff from the internet and just crossing their fingers. That is what happens now when you give agents access to your company's private data, or your private data, or your access tokens: you're kind of crossing your fingers and hoping that things don't go off the rails and you end up with a problem. And we've seen some high-profile examples just in the last few months of people getting burned by prompt injections. So I don't think we've come to the reckoning yet on that and on people understanding the risk. And then with confidential computing, that is another layer of... the clock just reset to five minutes. So we've got another five minutes.
I
Ion Stoica30:34
So now you can, please.
M
Mark Russinovich30:37
I think backstage they liked the answer so much they wanted me to continue. So I can talk a little bit about confidential computing now, which is independent of AI, of course, but related in some ways. Because if you think about the sensitivity of the data that people are processing with agentic systems, if it's in a personal context, you're saying, "Hey, I really want you to help me with my medical advice. I want you to help me with my financials." You're giving these agents so much very personal and very sensitive data about yourself. And if you're in an enterprise context, the way you get the value out of these models is giving them access to the sensitive data of your enterprise. And especially when you're hosting this in a cloud platform or some hoster that's out of your control, or even in your own environment, you want to minimize the trusted computing base, the code that has access to that data, to the smallest amount possible, where you can have the highest assurance that it's going to behave properly and where bad actors can't get to it. And confidential computing... now we've got another... It reset again. I was wondering. I was like, wow, we burned through that time really fast. So now we have 13 minutes left.
M
Mark Russinovich31:53
Let me just wrap this up. And I think there were some more topics that we can discuss. And maybe we can take some questions from the audience. But confidential computing, for those of you that aren't familiar with it, is hardware-based protection of data while it's in use. So the hardware itself is encrypting memory and creating a privileged boundary around the data such that nothing else on the system can get access to what's inside of that enclave, as they're called. And it comes with a cool property the hardware provides, which is attestation: it can measure what's inside the enclave, and then the application in that enclave can offer that attestation to another system, which can look at it and say, "Oh, you're computing inside of an enclave, I trust that enclave, I trust your configuration, I'm going to give you a secret," typically a key, which allows you to go and process that data in the clear. And it's protected against so many of the threats that you would be exposed to without the enclave protecting that boundary. And when it comes to agents, there is having an attestation of what the agent is and what its provenance is, and knowing that the data I'm giving to the agent is protected with the highest assurance possible inside that enclave. You're going to see a trend with AI and confidential computing that I think is going to really drive confidential computing forward: confidential GPUs from NVIDIA, confidential GPUs from AMD, and even Microsoft's Maia processor.
I
Ion Stoica
Yeah, that's fascinating.
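The attestation flow described above (measure what is in the enclave, present that measurement to another system, and receive a key only if the measurement is trusted) can be sketched as follows. This is a toy model under loud assumptions: real confidential computing uses hardware-rooted asymmetric signatures and vendor attestation services, whereas this sketch fakes the hardware root of trust with a shared HMAC key. All names here are hypothetical.

```python
import hashlib
import hmac
import secrets

# ASSUMPTION: stand-in for the hardware root of trust. Real hardware signs
# reports with a device key; the verifier checks a certificate chain instead.
HARDWARE_KEY = secrets.token_bytes(32)


def measure(code: bytes) -> str:
    """Measurement: a hash of what is loaded inside the enclave."""
    return hashlib.sha256(code).hexdigest()


def attest(code: bytes) -> dict:
    """The 'hardware' produces a signed report over the measurement."""
    measurement = measure(code)
    signature = hmac.new(HARDWARE_KEY, measurement.encode(), hashlib.sha256).hexdigest()
    return {"measurement": measurement, "signature": signature}


def release_key(report: dict, trusted_measurements: set, data_key: bytes):
    """Verifier: hand over the data key only to a genuine, trusted enclave."""
    expected = hmac.new(HARDWARE_KEY, report["measurement"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, report["signature"]):
        return None  # report was not produced by the trusted hardware
    if report["measurement"] not in trusted_measurements:
        return None  # genuine hardware, but not the code we trust
    return data_key


trusted = {measure(b"agent-v1")}
data_key = b"secret-data-key"

good = release_key(attest(b"agent-v1"), trusted, data_key)
tampered = release_key(attest(b"agent-v1-modified"), trusted, data_key)
forged = release_key({"measurement": measure(b"agent-v1"), "signature": "00"}, trusted, data_key)
```

Only the unmodified agent running on genuine hardware receives the key; a modified agent or a forged report gets nothing.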
M
Mark Russinovich33:25
Since we have more time on the clock now, which is kind of nice, let me go back to some questions for you that we skipped earlier. And I guess something we really haven't talked about is developer experience. We've been talking a lot about infrastructure, but you've got all this complexity. When it comes to developer experience, there are so many options out there. There's GitHub Copilot CLI, there's the SDK, there's OpenClaw, and there's doing your own loop, calling an LLM directly. We saw Simon Willison doing this with his own application. He didn't use any harness. What's the best approach? When do you decide to use a harness versus a custom orchestrator? How do you code this, and how do you measure what's the best answer? And in fact, I wrestle with this too when I'm approaching code.
I
Ion Stoica34:28
And me as well. But let me just give... So, I think that... My TLDR is that now we have very powerful coding agents and so forth. And many of you, I'm sure — how many of you use coding agents? Everyone, right? I do think that — actually, the biggest challenge is that it's about how do you know that the code which is generated is going to work correctly for some definition of correctness. It is doing what you hope it to do. I think that's the key. And the more code it generates, the harder it is to do that. And I think this was also true before. Everyone here probably developed a production system or has been involved in developing a production system. Where do you spend most of the time? You don't spend most of the time in prototyping or writing the code; you spend the time in debugging, then maintaining the code. It's easily 10 to 1 effort. So, in some sense, what these coding agents are doing is going to commoditize that writing the code. But the bottleneck will be more and more in verifying. And, yes, you can use these agents to say, "Hey, check that, is that correct?" But it's very hard. Some agents are lazy. So, without you understanding the code, at least at this point today, and having continuously testing it and enhancing the test, I don't think you can trust that code is doing what it's supposed to do. Because that kind of reward hacking is very different from what you can think about. Let me give you just one example. I have so many examples about that, but I'm giving you the simplest one. You mentioned earlier about load balancing. So, say you want to develop a load balancer algorithm using the agents. And we did that a few months ago using EPLB, Expert Parallelism Load Balancer. And what you want to do is the metrics, you want to maximize the load you are going to handle. Now, one solution that we've seen is, because that's a metric the "hill climbs", the load, is just dropping requests. 
You wouldn't think, when you tell someone, even when you write the PRD for the load balancer, to say, "Hey, you shouldn't drop requests." It's understood. But these models are going to do that, because they are going to maximize the things you told them to maximize, in the simplest and easiest way. So that's what I'm saying. And there are many things which can pass all the unit tests and so forth, and then when you look deeper, there are things which are, in a way, overfit to your tests in ways you couldn't imagine. That's what you still need to do, I think, at this stage. I don't know what will happen in a few months, in one year, but right now, you have to be able to specify. If you didn't specify something, you can almost guarantee that the systems will find a way around it. And there are only two places a specification comes from. One is the training data set. For instance, one year ago, when you generated code, you had a lot of vulnerabilities. Now, you have fewer vulnerabilities when you generate code with agents. The other one is the prompt. You have to specify what it should or should not do. If it's not in the prompt and it's not in the training data set, it's going to find a way around.
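To make the load balancer story concrete, here is a minimal Python sketch. It is not the EPLB code or the experiment described in the talk. The policies, the `peak_load` metric, and the `conserves_requests` check are all hypothetical names invented for illustration, and the metric (lowest peak load per server) is an assumed stand-in for whatever was actually optimized. The sketch shows how a policy that silently drops requests wins on a naive metric, and how writing the unstated "don't drop requests" rule into the score removes that shortcut.

```python
from typing import Dict, List

# Hypothetical representation: server id -> costs of the requests placed on it.
Assignment = Dict[int, List[int]]


def round_robin(requests: List[int], n_servers: int) -> Assignment:
    """An honest baseline policy: place request i on server i mod n."""
    assignment: Assignment = {s: [] for s in range(n_servers)}
    for i, cost in enumerate(requests):
        assignment[i % n_servers].append(cost)
    return assignment


def drop_everything(requests: List[int], n_servers: int) -> Assignment:
    """A degenerate policy of the kind an optimizer may find: serve nothing."""
    return {s: [] for s in range(n_servers)}


def peak_load(assignment: Assignment) -> int:
    """Naive metric (lower is better): total cost on the busiest server."""
    return max(sum(costs) for costs in assignment.values())


def conserves_requests(requests: List[int], assignment: Assignment) -> bool:
    """The unstated rule: every request must be placed exactly once."""
    placed = sorted(c for costs in assignment.values() for c in costs)
    return placed == sorted(requests)


def score(requests: List[int], assignment: Assignment) -> float:
    """Metric with the conservation rule made explicit."""
    if not conserves_requests(requests, assignment):
        return float("inf")  # any dropped or duplicated request disqualifies
    return float(peak_load(assignment))


requests = [5, 1, 1, 5, 1, 1]
honest = round_robin(requests, 2)
cheating = drop_everything(requests, 2)

print("naive metric:", peak_load(honest), "vs", peak_load(cheating))
print("with conservation:", score(requests, honest), "vs", score(requests, cheating))
```

On the naive metric the dropping policy looks strictly better; once conservation is part of the score it is rejected. The sketch only covers the one omission named in the talk, and any other property left out of the score stays open to the same kind of shortcut.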
M
Mark Russinovich39:02
It'll fill in the blanks.
I
Ion Stoica39:04
Yeah, so that's the main bottleneck.
M
Mark Russinovich39:07
All right. So, now we really have six minutes, and I'll give you a question that I think is the question that I get a lot, and I've got my own opinion about this, but I'm curious about yours.
I
Ion Stoica39:16
So you'll answer it after I answer it.
M
Mark Russinovich39:17
Yeah, I'll answer it. And this is: I talk to people, and I explain that I've got so many war stories of AI going off the rails. I mean, me and Scott, we have got this podcast. I talk about how I basically sculpt, which I always have. But even with AI, I'm watching what it's doing. I'm controlling it. I'm telling it, "That's not what I want you to do." And I see it creating this test or going off in some direction, filling in the blanks with something that is ridiculous or that I don't want. And I tell people this and I say, you just can't give it a prompt or a PRD and expect that you're going to get a system that works reliably and efficiently and does all the things that you imagine you would want it to do. And even if you go and try to specify it, you cannot specify it deeply enough to guarantee that it's not going to do things that you don't want. A lot of what we do is based on environmental context and just the taste that we've developed. And so I just don't see a path to this world where you don't need engineers who understand what's going on and know how to guide the system when it goes off the rails. And then I get this pushback of, "No, Mark, you're just... you know? That's a bet that I would take with you every day, because it's going to be solved. Maybe two years from now, maybe five years from now, maybe 10 years from now, the AI will be so good that you won't need that." Take your example where you said the load balancer was dropping requests. What if you have an adversarial code reviewer go look at the system and say, you know what, you shouldn't be dropping requests? And you ask it, is this system well-behaved? It'll say, no, it's not. And so you just address that. So where do you lie on this? Would you be bold enough to say it's never going to be solved, or do you think we'll get to the point where we'll have the ability to give a specification, it'll come out great, and you won't need a human?
I
Ion Stoica41:21
I think that from the theoretical perspective, it's very hard for me to see how it's going to be ever solved. And fundamentally, because we know from Gödel that no axiomatic system can be complete.
M
Mark Russinovich41:44
Yeah.
I
Ion Stoica41:44
So that tells you something. I think that when we look at other techniques to do it, for instance, assume that you have a formal specification. You can write it in Lean, or Coq, and so forth. So if I have the formal specification, then you can generate the code, and then you can generate the proof that the code satisfies the specification. So that is good, and you can see that. The problem is that obviously you then need to write that specification, which is not easy. And even when you write the specification, it can be incomplete: you don't specify a property because, in the case of the load balancer, I don't write, "Oh, you shouldn't drop the packets." So maybe if there is a conservation law there, you can argue that you should put it in. But there can be assumptions, because any specification has to have some assumptions to make the proof. And you are not going to imagine all of the assumptions.
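The point about incomplete specifications can be sketched in Lean 4. This is an illustrative toy, not anything shown in the session: `withinCapacity` and `dropAll` are hypothetical names, and the specification is deliberately missing the conservation property. The proof goes through, so the degenerate balancer is "verified" against the spec as written even though it serves no requests.

```lean
-- Hypothetical, deliberately incomplete spec: no server load exceeds capacity.
-- Nothing here says that every request must actually be served.
def withinCapacity (cap : Nat) (loads : List Nat) : Prop :=
  ∀ l ∈ loads, l ≤ cap

-- A degenerate "balancer": ignore the requests and report zero load everywhere.
def dropAll (nServers : Nat) (_requests : List Nat) : List Nat :=
  List.replicate nServers 0

-- The degenerate balancer provably satisfies the incomplete spec.
theorem dropAll_meets_spec (cap n : Nat) (reqs : List Nat) :
    withinCapacity cap (dropAll n reqs) := by
  unfold withinCapacity dropAll
  intro l hl
  -- Every element of `List.replicate n 0` is `0`.
  have h : l = 0 := List.eq_of_mem_replicate hl
  omega
```

Ruling this out would mean adding a conservation clause to the spec, for example that the reported loads sum to the total request cost, and that clause has to be thought of by whoever writes the specification.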
M
Mark Russinovich43:00
I got to — you know what I think? You know what is a full specification? Actually, programming languages.
I
Ion Stoica43:07
Almost, almost. Even that is not. Let me give you an example. I remember we had this peer-to-peer Chord system and so forth. I was a young faculty member then, still implementing all of these things. I thought it was bulletproof. And we were testing it on PlanetLab. Twenty years ago, PlanetLab was an internet infrastructure where each organization has servers, and you deploy applications on the servers and then you communicate. And it breaks. And I don't know how it breaks. It has to work. And when we looked, it was an assumption we missed. The assumption we made implicitly is a kind of transitive connectivity. If A communicates with B and B communicates with C, then A should communicate with C. And in the Internet as a whole, because it's not behind NATs and so forth, this has to hold. And it should have held, except that one university misconfigured their servers, which blocked some communication. So that's what I'm trying to say. It's extremely hard.
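The implicit assumption in this story, transitive connectivity, can be written down as an explicit check. The Python sketch below is illustrative only: the `can_reach` map and the three-node topology are made up, and it is not code from Chord or PlanetLab. It lists every triple (a, b, c) where a reaches b and b reaches c, but a cannot reach c.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple


def transitivity_violations(
    can_reach: Dict[str, Set[str]],
) -> List[Tuple[str, str, str]]:
    """Return triples (a, b, c) where a->b and b->c hold but a->c does not."""
    nodes = sorted(can_reach)
    return [
        (a, b, c)
        for a, b, c in permutations(nodes, 3)
        if b in can_reach[a] and c in can_reach[b] and c not in can_reach[a]
    ]


# Hypothetical topology: a misconfiguration blocks traffic between A and C,
# while both can still talk to B.
links = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
}

violations = transitivity_violations(links)
for a, b, c in violations:
    print(f"{a} reaches {b} and {b} reaches {c}, but {a} cannot reach {c}")
```

A check like this only helps once someone has thought to state the assumption, which is the difficulty being described.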
M
Mark Russinovich44:36
Yeah. So, now we are really, I think, at time. I haven't seen the clock jump up again. But it's been a fun conversation, a really fun conversation. And this is an experiment. It's the first time we've done something like this. If you liked it, let us know, and we'll try to do it again. If you found it boring, also let us know. And I think the people who thought it was boring probably already left, so I think...
I
Ion Stoica45:01
We have a biased audience here.
M
Mark Russinovich45:03
But thanks very much for coming to Build, and I hope to see you tomorrow morning in my Azure Innovation Talk, where I'll talk a little bit about some of these things.
I
Ion Stoica45:10
Thank you very much.