Vitalik Buterin0:05
I don't know, I'm not sure if 'closing remarks' is the best title for this presentation. It feels a bit grand. All I wanted to do is talk about fraud proofs and data availability proofs. So here you go: fraud proofs and data availability proofs.
Let me set the stage. The idea is that if you look at blockchain protocols like Bitcoin or Ethereum—specifically full nodes in those protocols as they exist today—there are some protocol guarantees that get broken if more than 50% or 33% of nodes stop following the protocol. You might have 51% attacks, finality failures, and so forth. But there are a lot of properties that you still keep even if arbitrary 51% attacks or dishonesty is possible.
One example of this property is that all the blocks in whatever chain you accept as the main chain are going to be valid. This implies that people can't steal your money, can't give themselves money, the coin supply can't get inflated, and applications can't get randomly screwed around with. Another guarantee is that the data in the chain you accept is available. This means that whatever chain you accept as the leading chain, if you want to download any particular piece of it, you will be able to. This seems trivial in non-scalable blockchain land, but it's actually very non-trivial for reasons I'll go into.
This guarantee gives you the ability to ensure that the block data is correct and to generate new blocks or transactions. To get directly into things, let's talk about fraud proofs. The setting is: suppose a malicious miner or validator publishes a block, and we assume all the block data is present (some magic Oracle verifies the full body has been published to the open internet and at least one honest node has it). But some of the block data may be invalid. What we do is design the state transition function such that any specific invalidity in the block can be proven to a light client by providing a small subset of the Merkle tree.
For those familiar with how Merkle branches work, if a transaction is processed incorrectly, you just need to provide a few leaves that touch the affected accounts to prove something invalid happened. Once someone makes a fraud proof, they can broadcast it, and any client can immediately verify there's something wrong with the block.
With this technique, you can create a system where clients download all the data to verify availability, but after that, they don't need to process all the data themselves. Instead, each client might randomly choose to verify some random 0.1% of the data. If any client discovers something wrong, they broadcast a fraud proof, which gets traversed through the network. Other clients verify the fraud proof and can see something is wrong. If you've downloaded the data and no one has sent you a fraud proof for a while, that's evidence the chain is correct.
Under what assumptions does this break? Number one: if the number of honest nodes is extremely low (under a thousand), so data goes unchecked. Number two: if network latency is extremely high. But notice this does not break even if 90% of nodes are malicious. We've reduced the security assumptions for verifying validity of data.
We've reduced the computation needed to verify data is valid, but we haven't reduced the data you need to download. Data unavailability is the biggest problem in blockchain scaling. Suppose a malicious miner publishes a block where parts are simply not published—they publish most of the Merkle tree but not all. The missing parts could be valid or invalid, but either way it's an attack.
To think about proving data availability, it helps to separate out the two cases. Consider a chain that only holds data; a valid block is one that has data published and available. More formally, for every chunk of data, at least one honest node has that chunk, so you can ask and get it. The problem is: how does a client verify availability without downloading all the data?
You cannot use fraud proofs for this. If you publish a claim that chunk zero is unavailable, a clever attacker can just publish that chunk when they see your claim, making you look like an idiot. The attacker can keep doing this, eventually people stop trusting the claims or the participants have to download the entire data. So you can't solve data availability the same way you detect faults when all data is available.
Easy solution: check everything. This obviously takes bandwidth, okay for existing blockchains, but not okay in many contexts. Context one: higher capacity chains—Ethereum today, after the Istanbul hard fork reducing call data gas cost from 68 to 16, and sharded Ethereum with 1024 chains will be very high. A few megabytes per second isn't very high in the grand scheme, but it's too high for most people to reliably download.
Context two: quick verification of history. Suppose you are syncing a chain for the first time or have been offline for a long time. You want to verify all the data is available and correct in sublinear time. Currently, light clients can find the correct head in sublinear time, but only with the trust assumption that the majority of miners or validators are good. Can we remove that assumption and still keep the guarantee? It turns out yes, if you accept assumptions about network latency and at least some minimal number of honest nodes existing.
Light client protocol for verifying data availability: suppose you have a block. You Merkle hash it and think of the data at the bottom. A simple probabilistic check: as a light client, privately select 80 random positions and ask the network for each. Only accept the block as available when all 80 queries are answered. If only 70% are answered, you wait. This is similar to what happens already, but here you focus on 80 positions instead of downloading the whole thing.
This doesn't guarantee the entire data is there—one missing chunk could go undetected. But it proves with very high probability that more than 50% of the data is available. An attacker could trick specific clients by publishing nothing and then responding to queries for certain chunks, but there are bounds on how many clients they can corrupt. Using onion routing makes it even harder to target specific clients.
So the light client protocol is: check 80 random positions, probabilistically verifying at least half the data is there. Problem: what if the one missing position is an invalid transaction that gives the attacker a bajillion coins? Enter erasure coding. We low-degree extend the data to a larger set, so any 50% of the extended data suffices to reconstruct all of it.
Now, if the extended data is constructed incorrectly (not a true low-degree extension), we can bring back fraud proofs. A naive fraud proof: if you as a full node download more than 50% and reconstruct all data, you can create a huge fraud proof showing the Merkle root doesn't match. This works but results in a gigabyte-long fraud proof for Ethereum 2.0 cross-links, which is undesirable.
The solution from a paper last year by myself and Mustafa Al-Bassam is to use two-dimensional erasure coding. Instead of extending along one axis, we make the data a small square, then low-degree extend it in both dimensions. A light client makes 80 random queries into this structure and also downloads the Merkle roots of every row and column—a few hundred kilobytes. A fraud proof just needs to prove fraud in one row or column, or an inconsistency between a position in a row and a column. This brings fraud proof size down to under a megabyte.
There are possible improvements using STARKs. One approach uses the ingredients in STARKs to give the same properties as the two-dimensional scheme on top of simple one-dimensional codes. If the extended data has a bug, instead of providing the entire data, you use a bit of clever math (like F(x)-F(y) over x-y arguments) to prove an inconsistency. Another approach is to directly use a STARK to verify correctness of the Merkle root—essentially placing all hashes along a polynomial and proving it's a correct low-degree extension. This would eliminate the need for fraud proofs and network latency assumptions, requiring only some minimal number of honest clients.
What do we use this for? First use case: verifying existing chains. This is a layer 2 technique that could improve security properties of existing light clients. You connect your node, receive a piece of data claimed to be a Merkle root of the chain, attach it to proof of work or a deposit, perform data availability challenges, wait for fraud proofs, and if they pass, accept the chain as valid. This makes 51% attacks considerably less powerful.
Second use case: scalable chains. In Ethereum 2.0, a sharded chain has way more data than most can download individually. Fraud proofs and data availability proofs serve as additional checks to guarantee security even if a subcommittee is corrupted. Third use case: layer 2, like a Plasma chain. Users need to verify data availability so they can make challenges; this scheme lets them check availability without downloading all data or relying on trusted watchtowers.
Conclusion: data availability verification and fraud proofs are two nice ingredients. Combined, you can quickly verify large amounts of data and computation indirectly. With lighter security assumptions than no 51% attack, you can guarantee the data is available with cryptographically high probability and that every piece of data is valid—because if it weren't valid, someone would broadcast a fraud proof. Succinct zero-knowledge proofs like STARKs can make things better and reduce the need for the network latency assumption. Thank you.