Ethereum Founder Vitalik Buterin On Data Availability Proofs

🎥 Jan 26, 2018 📺 Founder School ⏱ 41m 👁 19 views

Join And Support Founder School Pro For New Interviews Everyday https://tinyurl.com/JoinFounderSchoolPro.

About Vitalik Buterin

Vitalik Buterin, co-founder of Ethereum, has been speaking at multiple events in 2026 about the future of the Ethereum protocol and its intersection with artificial intelligence. In a panel on the Ethereum Economic Zone (EEZ), Buterin described the concept as an effort to rethink layer-2 solutions in a way that integrates them more deeply with Ethereum, rather than treating them as separate chains. He argued that without strong pressures toward interoperability and censorship resistance at the user layer, the result can be "walled garden monopolies" that use the base layer primarily for virtue signaling. He also identified oracles as a "skeleton in the closet" of Ethereum, noting that their security has not received the same level of rigor as layer-2 scaling solutions. In separate talks, Buterin has emphasized Ethereum's role as a "public billboard" and a "shared computation layer" for high-value guaranteed execution, rather than a platform meant to compete with high-frequency trading or chase maximum speed. He stated that Ethereum needs to pass a "walk away test," meaning it should remain reliable even if no core developers remain. On AI, Buterin argued that while local and open-weight AI models have improved significantly, the mainstream open-source ecosystem does not by default prioritize privacy, security, or censorship resistance. He expressed hope that the Ethereum community can create tools that optimize for these properties, including ZK-based payment channels that make API requests private and unlinkable. Buterin also contrasted crypto's approach to safety with centralized visions that he described as "trust the uncle in the sky," saying crypto aims to create systems that preserve user agency and privacy.

Source: AI-verified profile updated from Vitalik Buterin's recent appearances. Browse all interviews →

Transcript (27 segments)

✨ AI-enhanced transcript with speaker attribution

Vitalik Buterin0:05

I don't know, I'm not sure if 'closing remarks' is the best title for this presentation. It feels a bit grand. All I wanted to do is talk about fraud proofs and data availability proofs. So here you go: fraud proofs and data availability proofs.

Let me set the stage. The idea is that if you look at blockchain protocols like Bitcoin or Ethereum—specifically full nodes in those protocols as they exist today—there are some protocol guarantees that get broken if more than 50% or 33% of nodes stop following the protocol. You might have 51% attacks, finality failures, and so forth. But there are a lot of properties that you still keep even if arbitrary 51% attacks or dishonesty is possible.

One example of this property is that all the blocks in whatever chain you accept as the main chain are going to be valid. This implies that people can't steal your money, can't give themselves money, the coin supply can't get inflated, and applications can't get randomly screwed around with. Another guarantee is that the data in the chain you accept is available. This means that whatever chain you accept as the leading chain, if you want to download any particular piece of it, you will be able to. This seems trivial in non-scalable blockchain land, but it's actually very non-trivial for reasons I'll go into.

This guarantee gives you the ability to ensure that the block data is correct and to generate new blocks or transactions. To get directly into things, let's talk about fraud proofs. The setting is: suppose a malicious miner or validator publishes a block, and we assume all the block data is present (some magic Oracle verifies the full body has been published to the open internet and at least one honest node has it). But some of the block data may be invalid. What we do is design the state transition function such that any specific invalidity in the block can be proven to a light client by providing a small subset of the Merkle tree.

For those familiar with how Merkle branches work, if a transaction is processed incorrectly, you just need to provide a few leaves that touch the affected accounts to prove something invalid happened. Once someone makes a fraud proof, they can broadcast it, and any client can immediately verify there's something wrong with the block.

With this technique, you can create a system where clients download all the data to verify availability, but after that, they don't need to process all the data themselves. Instead, each client might randomly choose to verify some random 0.1% of the data. If any client discovers something wrong, they broadcast a fraud proof, which gets traversed through the network. Other clients verify the fraud proof and can see something is wrong. If you've downloaded the data and no one has sent you a fraud proof for a while, that's evidence the chain is correct.

Under what assumptions does this break? Number one: if the number of honest nodes is extremely low (under a thousand), so data goes unchecked. Number two: if network latency is extremely high. But notice this does not break even if 90% of nodes are malicious. We've reduced the security assumptions for verifying validity of data.

We've reduced the computation needed to verify data is valid, but we haven't reduced the data you need to download. Data unavailability is the biggest problem in blockchain scaling. Suppose a malicious miner publishes a block where parts are simply not published—they publish most of the Merkle tree but not all. The missing parts could be valid or invalid, but either way it's an attack.

To think about proving data availability, it helps to separate out the two cases. Consider a chain that only holds data; a valid block is one that has data published and available. More formally, for every chunk of data, at least one honest node has that chunk, so you can ask and get it. The problem is: how does a client verify availability without downloading all the data?

You cannot use fraud proofs for this. If you publish a claim that chunk zero is unavailable, a clever attacker can just publish that chunk when they see your claim, making you look like an idiot. The attacker can keep doing this, eventually people stop trusting the claims or the participants have to download the entire data. So you can't solve data availability the same way you detect faults when all data is available.

Easy solution: check everything. This obviously takes bandwidth, okay for existing blockchains, but not okay in many contexts. Context one: higher capacity chains—Ethereum today, after the Istanbul hard fork reducing call data gas cost from 68 to 16, and sharded Ethereum with 1024 chains will be very high. A few megabytes per second isn't very high in the grand scheme, but it's too high for most people to reliably download.

Context two: quick verification of history. Suppose you are syncing a chain for the first time or have been offline for a long time. You want to verify all the data is available and correct in sublinear time. Currently, light clients can find the correct head in sublinear time, but only with the trust assumption that the majority of miners or validators are good. Can we remove that assumption and still keep the guarantee? It turns out yes, if you accept assumptions about network latency and at least some minimal number of honest nodes existing.

Light client protocol for verifying data availability: suppose you have a block. You Merkle hash it and think of the data at the bottom. A simple probabilistic check: as a light client, privately select 80 random positions and ask the network for each. Only accept the block as available when all 80 queries are answered. If only 70% are answered, you wait. This is similar to what happens already, but here you focus on 80 positions instead of downloading the whole thing.

This doesn't guarantee the entire data is there—one missing chunk could go undetected. But it proves with very high probability that more than 50% of the data is available. An attacker could trick specific clients by publishing nothing and then responding to queries for certain chunks, but there are bounds on how many clients they can corrupt. Using onion routing makes it even harder to target specific clients.

So the light client protocol is: check 80 random positions, probabilistically verifying at least half the data is there. Problem: what if the one missing position is an invalid transaction that gives the attacker a bajillion coins? Enter erasure coding. We low-degree extend the data to a larger set, so any 50% of the extended data suffices to reconstruct all of it.

Now, if the extended data is constructed incorrectly (not a true low-degree extension), we can bring back fraud proofs. A naive fraud proof: if you as a full node download more than 50% and reconstruct all data, you can create a huge fraud proof showing the Merkle root doesn't match. This works but results in a gigabyte-long fraud proof for Ethereum 2.0 cross-links, which is undesirable.

The solution from a paper last year by myself and Mustafa Al-Bassam is to use two-dimensional erasure coding. Instead of extending along one axis, we make the data a small square, then low-degree extend it in both dimensions. A light client makes 80 random queries into this structure and also downloads the Merkle roots of every row and column—a few hundred kilobytes. A fraud proof just needs to prove fraud in one row or column, or an inconsistency between a position in a row and a column. This brings fraud proof size down to under a megabyte.

There are possible improvements using STARKs. One approach uses the ingredients in STARKs to give the same properties as the two-dimensional scheme on top of simple one-dimensional codes. If the extended data has a bug, instead of providing the entire data, you use a bit of clever math (like F(x)-F(y) over x-y arguments) to prove an inconsistency. Another approach is to directly use a STARK to verify correctness of the Merkle root—essentially placing all hashes along a polynomial and proving it's a correct low-degree extension. This would eliminate the need for fraud proofs and network latency assumptions, requiring only some minimal number of honest clients.

What do we use this for? First use case: verifying existing chains. This is a layer 2 technique that could improve security properties of existing light clients. You connect your node, receive a piece of data claimed to be a Merkle root of the chain, attach it to proof of work or a deposit, perform data availability challenges, wait for fraud proofs, and if they pass, accept the chain as valid. This makes 51% attacks considerably less powerful.

Second use case: scalable chains. In Ethereum 2.0, a sharded chain has way more data than most can download individually. Fraud proofs and data availability proofs serve as additional checks to guarantee security even if a subcommittee is corrupted. Third use case: layer 2, like a Plasma chain. Users need to verify data availability so they can make challenges; this scheme lets them check availability without downloading all data or relying on trusted watchtowers.

Conclusion: data availability verification and fraud proofs are two nice ingredients. Combined, you can quickly verify large amounts of data and computation indirectly. With lighter security assumptions than no 51% attack, you can guarantee the data is available with cryptographically high probability and that every piece of data is valid—because if it weren't valid, someone would broadcast a fraud proof. Succinct zero-knowledge proofs like STARKs can make things better and reduce the need for the network latency assumption. Thank you.

Audience Member24:20

Should I go ahead? You said you can use this type of data availability proof for Plasma and for watchtowers. I think I disagree because in Plasma, if data becomes unavailable and the state transition happens, the damage has already been done. There's no advantage to detecting it.

Vitalik Buterin24:42

The idea is that whenever you receive a new Plasma block, you try to download the data. If you fail, you exit. The scheme here would be: when you see a new Plasma block, you run a data availability check instead of having to download the full block. If the check fails, you exit.

Audience Member25:10

Okay.

This isn't a question, but I kept thinking through your presentation that when you say 'data availability,' I assume you mean the block data with the transactions. It makes me think about the assumption that somebody can watch the chain and basically in sublinear time figure out fraud proofs. But I'd argue it's not even possible in systems like Ethereum or Bitcoin because of the state. Even if you know the composition of blocks and transactions, you can't verify it without the state that precedes the transaction. To compute the state, you have to run everything from the beginning or do other tricks. That's why when people designed fast sync in Ethereum, they thought they could just fast sync to a snapshot and probabilistically verify everything before, but that didn't work because the state became so large that even probabilistic verification wasn't practical. So I'm saying fraud proofs will be very expensive to produce, and I'm challenging the assumption that there will be many people doing fraud proofs at all.

Vitalik Buterin26:59

In an Ethereum context, this scheme working on top of Ethereum as is does rely on at least one honest node having run an archive node with extra software. If the chain raises gas limits and then two years happen where there are zero people running archive nodes, then people won't be able to create fraud proofs. There is an implicit assumption that people keep storing data, and if that's broken, the scheme fails. However, if people do have the data, producing fraud proofs becomes easier. Another approach is a stateless client system where for every block, you generate the block plus transactions plus witness, and every node agrees to store 0.1% of witnesses. There are two directions: one is the current Ethereum 2.0 approach establishing strict state size control so you can do fraud proofs with just the block and a tiny extra data; the other is to ensure some nodes have these witnesses, potentially by sticking witnesses inside the data.

Thank you.