
Scaling committee-based consensus

By The Blockhouse Technology Ltd · Published April 14, 2026 · 17 min read · Source: Ethereum Tag

How to safely reach agreement with small committees

Original post: https://hackmd.io/@tbtl/rkvscqz2-g

Byzantine Fault-Tolerant (BFT) consensus is at the heart of blockchain systems, and scaling these protocols is at the forefront of distributed computing research. In this article, we explore a new technique proposed by our team at TBTL to significantly scale consensus protocols based on randomly elected committees. Our technique can dramatically reduce the size of these committees: the fewer the participants in the protocol, the quicker they can interact to reach a consensus. For some network settings, our technique can employ a committee of 543 participants to safely reach a consensus, whereas traditional algorithms would require 94,366 participants. Our research has been published at DISC 2025.

Why Committee Size Matters in Practice

In a blockchain network, consensus is the process by which nodes agree on a single, canonical history of transactions. This agreement is what produces finality: the guarantee that a transaction, once confirmed, is immutable and cannot be reversed. Finality is not just a theoretical property — it is what makes a blockchain useful. Without it, you cannot be certain that a payment you received will not later disappear. The size of the committee participating in consensus directly influences how quickly the network can reach that agreement, and consequently how fast finality is achieved. The more participants must coordinate, the longer it takes. To understand why this matters in practice, Ethereum offers a concrete and instructive case study.

Ethereum currently has over one million active validators participating in its consensus protocol — a figure that reflects its commitment to decentralisation, but that comes at a real cost. It takes about 15 minutes for an Ethereum block to finalize, because the protocol requires votes from a supermajority of those validators before a block is considered irreversible. From the perspective of a user, this means that after sending a transaction, the true cryptographic guarantee of immutability does not arrive for roughly a quarter of an hour. To put that in everyday terms: imagine paying at a shop with a card and being asked to wait fifteen minutes before the merchant would accept that the payment had gone through. For many use cases — exchanges, real-time payments, interactive applications — this latency is simply impractical, and it stems in large part from the sheer number of participants the protocol must coordinate.

Ethereum’s answer to this has been to push activity onto Layer 2 networks, which can offer much faster confirmations. But this speed comes with an important caveat. The fast pre-confirmations that L2s offer are not backed by the same finality guarantee as Ethereum itself; they are essentially promises made by the sequencer, not cryptographic proofs of immutability. If you want the same guarantee that Ethereum’s finality provides, you still have to wait the same amount of time — or longer. On optimistic rollups, withdrawing funds back to Ethereum requires waiting out a fraud-proof challenge window that can stretch to seven days, a consequence of the security model rather than a bug. Crucially, the speed advantage that L2s do offer today is largely a product of their centralised architecture: most rely on a single sequencer or a very small committee of sequencers, trading the decentralisation and security guarantees of a large validator set for responsiveness.

These examples illustrate that committee size sits at the heart of a fundamental tension in consensus protocol design: larger committees offer stronger security guarantees but slower finality, while smaller ones can be faster but may sacrifice either security or decentralisation. Our technique offers a way to navigate this trade-off more carefully, achieving a better balance between the two than either of these approaches. The goal is a point on the design curve that neither Ethereum’s large validator set nor today’s centralised L2 sequencers currently occupy: small enough to be fast, yet backed by rigorous safety guarantees. It is worth noting that while we use Ethereum as a concrete illustration, our technique is not specific to Ethereum — it applies to any consensus protocol based on randomly elected committees, regardless of the underlying blockchain.

The big picture: a two-committee architecture

Before diving into the technical details, we give an overview of what our approach can deliver, using a two-committee architecture as an example. It combines two consensus committees — a primary and a secondary — where each has a different role and a different tolerance for faulty nodes.

The primary committee is small. It can tolerate more than n/3 of its members being Byzantine. This higher fault tolerance is what lets it stay small when randomly sampled from the population. The trade-off is that it is not always guaranteed to reach a decision: if too many of its members happen to be faulty, it may be unable to agree on a value.

The secondary committee is larger. It follows the classical BFT requirement of tolerating fewer than one-third Byzantine members (t < n/3), and is therefore always guaranteed to eventually reach a decision. Its larger size is the price of that stronger guarantee.

In broad terms, the protocol works as follows. The primary runs first and attempts to reach a decision. There are two cases:

- If the primary reaches a decision, that decision is propagated and the secondary has nothing to do.
- If the primary fails to decide, it safely hands over to the secondary, which is always guaranteed to reach a decision.

The key word above is safely. For the handover to work correctly, the secondary must be able to pick up from whatever partial progress the primary made — without risking two different values being decided. This is the subtle and non-trivial part of the design, and it is where the new concept of Justifiability enters the picture. We will build up to it step by step.

The practical payoff is significant. As a concrete example, for a population where 68% of nodes are honest and an error probability of ε = 1/10¹⁸, a primary (Justifiable) committee needs only 543 nodes, whereas a standard secondary consensus committee requires 94,366 agents — a reduction of more than 99%. Smaller committees mean fewer messages, which may even indirectly reduce latency thanks to the lighter network load.

The quadratic barrier and random committees

The goal of a consensus protocol is to have the participants agree on a common decision, despite unreliable communication and Byzantine (i.e. malicious) behaviour from some of the participants. Many Byzantine fault-tolerant protocols require all nodes to exchange messages with each other, leading to a communication cost that grows quadratically with the number of nodes. As explained above, this cost is a serious bottleneck in large-scale networks. Unfortunately, this quadratic cost is to some extent fundamental to consensus: Dolev and Reischuk proved that it is required for deterministic protocols. Thus, randomisation is the privileged option for breaking the O(n²) barrier¹.

Scalable agreement protocols typically leverage randomness using committees: a small number of nodes (the committee members) are picked randomly and independently from the overall population, and only they run the expensive O(n²) protocol. The result is then propagated to the rest of the population.

The limit of this approach is due to the inherent bound on the number of Byzantine nodes tolerated by consensus algorithms. In our setting of interest where the network is partially synchronous, only up to t < n/3 nodes can be Byzantine, where n is the total number of nodes².

This constraint applies to the random committee as well. It will fail to be satisfied if too many Byzantine nodes are picked during the random selection; let's call ε the probability that this happens. To minimise ε, there are two options: increase the committee size, or widen the gap between the fraction of Byzantine nodes in the overall population and the 1/3 threshold (i.e., assume fewer Byzantine nodes overall).

Asymptotically, both options make ε decrease exponentially³. In practice however, the concrete value for ε needs to be extremely small: we must trust that no committee will be Byzantine in the entire lifetime of the system. As a result, the committee size is usually rather large, e.g., in the thousands.
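The Chernoff-bound argument of footnote 3 can be made concrete with a short sketch. The function names below are ours, and this is the classical Chernoff–Hoeffding bound rather than the paper's exact computation; it shows ε shrinking exponentially in the committee size, while also illustrating how large the committee must be before ε reaches the tiny values required in practice.

```python
import math

def kl(a, p):
    """KL divergence between Bernoulli(a) and Bernoulli(p)."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def eps_bound(n, p_byz, threshold_frac=1/3):
    """Chernoff-Hoeffding upper bound on the probability that at least
    threshold_frac * n members of a randomly sampled committee of size n
    are Byzantine, when each node is Byzantine with probability p_byz
    (valid when threshold_frac > p_byz)."""
    return math.exp(-n * kl(threshold_frac, p_byz))

# 68% honest population: the bound decays exponentially in n, but n must
# grow into the tens of thousands before it reaches targets like 1e-18.
for n in (500, 5000, 50000, 100000):
    print(n, eps_bound(n, 0.32))
```

Note that the honest fraction (0.68) sits only just above the 2/3 requirement here, which is precisely why the exponent is small and the committee must be so large.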

This is a somewhat frustrating state of affairs for random committees: it lets us discharge consensus to a constant number of nodes, but the gain in practice isn’t so great unless we compromise on either the number of Byzantine nodes tolerated or the probability of total failure.

Smaller committees with the multi-threshold model

The first ingredient that will help us is an observation from Momose and Ren, showing that instead of having a single fault threshold t that must always be satisfied, there can be multiple thresholds, and different properties will be guaranteed depending on which thresholds are satisfied. In our case, we have two:

- t_safe: the number of Byzantine nodes up to which safety is guaranteed;
- t_live: the (smaller) number of Byzantine nodes up to which liveness is also guaranteed.

In this new model, the n > 3t constraint can be generalised to n > t_safe + 2t_live. Now let's assume that we only want to guarantee safety for the committee. Then we can freely choose a committee size, increase t_safe until the probability of error ε is as low as required, and we will still have liveness whenever there are fewer than (n - t_safe)/2 Byzantine nodes, which happens with fixed (non-zero) probability.
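This recipe can be sketched in a few lines. The code below is our own illustration, not the paper's calculation: it fixes a committee size, raises t_safe until an exact binomial tail drops below ε, and leaves (n - t_safe)/2 as the liveness slack.

```python
import math

def log_binom_pmf(n, k, p):
    """log P(Binomial(n, p) = k), computed in log-space to avoid overflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def tail(n, t, p):
    """Probability that strictly more than t of n sampled nodes are Byzantine."""
    return sum(math.exp(log_binom_pmf(n, k, p)) for k in range(t + 1, n + 1))

def pick_t_safe(n, p_byz, eps):
    """Smallest t_safe such that safety fails with probability below eps.
    Liveness then holds whenever fewer than (n - t_safe)/2 nodes are faulty."""
    for t_safe in range(n + 1):
        if tail(n, t_safe, p_byz) < eps:
            return t_safe

# 68% honest population, eps = 1e-18, committee of 543 nodes: the resulting
# t_safe lands well above n/3, which is what the multi-threshold model permits.
t_safe = pick_t_safe(543, 0.32, 1e-18)
print(t_safe, (543 - t_safe) // 2)
```

The point of the exercise is the shape of the trade-off: safety is protected up to a threshold far beyond n/3, while liveness is only probabilistic.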

The potential gains here may be up to several orders of magnitude, as can be seen in Figure 1 where we plotted the committee size as a function of the overall number of Byzantine nodes in the population.

Figure 1: gain in committee size, logarithmic scale. (ε = 1/10¹⁸)

Of course, we must now deal with the case where Liveness does not hold in the committee. By design, the committee cannot deal with this situation by itself, so it is natural to try and remedy the situation with a backup consensus algorithm.

A naive attempt

Having our first ingredient, the solution seems simple: the small committee, which we will call the primary P, attempts to reach consensus. If it does not succeed, it sends a signal to a backup B. As sketched in the big picture section, we are satisfied with the backup B being a regular (expensive but always safe and live) consensus protocol. Whenever B receives such a signal, it executes normally; if it instead receives a consensus decision from P, then it has nothing to do and we save on communication cost.

Trying to implement such a solution may seem simple at first glance. With a bit of experience in designing distributed protocols, one may try the following: have the members of P vote on the outcome, so that a quorum of votes for a value yields a proof of decision, while a quorum of votes for failure yields a proof of indecision; P then forwards whichever proof it obtains to B.

To start off, let us try to fix specific voting thresholds and see where the constraints lead. The voting threshold for a proof of decision can be set at T_d = (t_live + n) / 2, because this is what is required to guarantee quorum intersection⁴ when there are t_live faults, i.e., when the protocol is live. We could then set the voting threshold for a proof of indecision to T_a = n - t_safe, because it guarantees that honest nodes alone will always be able to output a proof of indecision. Because we want our algorithm to be optimal, we will also require n = t_safe + 2t_live + 1. We must also guarantee that there is never both a proof of decision and a proof of indecision! This adds the constraint T_a + T_d - n > t_safe, which becomes n > 3t_safe when applying the relations above. Except that we are now back to square one, because the entire premise of this algorithm was to set t_safe equal to or higher than n/3.
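The dead end can be checked numerically. The sketch below (our own illustration) scans small integer thresholds, imposes T_d = (n + t_live)/2, T_a = n - t_safe and n = t_safe + 2t_live + 1, and searches for any pair that satisfies both the no-conflicting-proofs condition and the premise t_safe ≥ n/3.

```python
def naive_thresholds_work(t_safe, t_live):
    """True when the naive design's no-conflict condition T_a + T_d - n > t_safe
    holds for n = t_safe + 2*t_live + 1, T_d = (n + t_live)/2, T_a = n - t_safe."""
    n = t_safe + 2 * t_live + 1
    T_d = (n + t_live) / 2
    T_a = n - t_safe
    return T_a + T_d - n > t_safe

# Look for any (t_safe, t_live) where the naive design is conflict-free
# AND t_safe >= n/3, the premise that motivated the design in the first place.
violations = [
    (t_safe, t_live)
    for t_safe in range(1, 100)
    for t_live in range(1, t_safe + 1)
    if naive_thresholds_work(t_safe, t_live)
    and 3 * t_safe >= t_safe + 2 * t_live + 1
]
print(violations)
```

The search comes back empty: no threshold choice satisfies both requirements at once, which is exactly the contradiction described above.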

After many attempts, each invariably leading to a contradiction, it becomes clear that something deeper is wrong with this approach.

The solution: Justifiable protocols

The key to solving this issue is to realise that P needs to transmit more information to B. The problem with the naive approach is that P can only signal success or failure, yet deciding that failure has occurred is as hard as successfully deciding on a value: the primary is essentially trying to solve consensus, something it cannot do by design. We solve this by giving P a third possible output: a pre-decision.

A pre-decision is like a weaker version of a decision: once a value has been pre-decided, only that value can be subsequently decided, but an indecision is also allowed to happen after a pre-decision, or even no output at all.

With this modification, a backup node b can simply wait to receive an output o from P. If o is a decision, then b has nothing to do except propagate the value. If o is a pre-decision then b will execute the backup protocol with o as input. Finally, if o is an indecision, then b will execute the backup protocol with a new block as input.
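The backup node's dispatch on the primary's output is simple enough to state as code. The types and names below are ours, not the paper's; this is only the case analysis, with the backup consensus itself abstracted away.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PrimaryOutput:
    kind: str                    # "decision", "pre-decision" or "indecision"
    value: Optional[str] = None  # the block, absent for an indecision

def backup_input(o: PrimaryOutput, fresh_block: str) -> Optional[str]:
    """Input a backup node feeds to the (always safe and live) backup
    consensus; None means no backup run is needed at all."""
    if o.kind == "decision":
        return None          # nothing to do but propagate o.value
    if o.kind == "pre-decision":
        return o.value       # only the pre-decided value may still be decided
    return fresh_block       # indecision: start afresh with a new block
```

The middle case is the crucial one: by feeding the pre-decided value into the backup, agreement between the primary's partial progress and the backup's final decision is preserved.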

With that architecture, we only need one additional property from P: that it eventually sends some output to B — a decision, pre-decision, or indecision. Let’s call that property weak liveness.

This weak liveness property is at the heart of the trade-off being made here: while it is technically a liveness property, it must always hold, i.e., when there are up to t_safe faults, not merely t_live.

We call protocols that can decide, pre-decide, or abort, Justifiable⁵.

Looking back at the naive attempt

One might ask whether Justifiability is really what was required for our Primary/Backup architecture. Perhaps something simpler and/or weaker could have worked? We can answer that question in the negative by showing that any protocol where a primary communicates one-way to a backup must incorporate some kind of pre-decision.

We must also revisit the optimality of our thresholds. It turns out that n > 2t_safe + t_live must hold for Justifiable protocols. This inequality is stricter than the previous one, since t_safe > t_live.

Taken together, these two results⁶ explain why the naive approach was doomed from the start: we attempted to build something that required Justifiability, but our “optimal” threshold values could not satisfy the stricter bound that Justifiability imposes.

An example of a Justifiable protocol

To finally fulfill our initial goal, we need a concrete instantiation of a Justifiable protocol for the primary. We focus on a simple variant of consensus called Reliable Broadcast, where a single leader node sends a value to all other nodes and all nodes are guaranteed to obtain the same value⁷.

The design follows a similar approach as the one above, except there are now three types of proofs that P may send to B: decision proof, pre-decision proof and indecision proof, with thresholds T_d, T_p and T_i, respectively.

The quorum intersection arguments give us constraints relating T_d, T_p and T_i to n and t_safe; we refer to the full paper for the exact inequalities.

Adding the new optimality requirement n = 2t_safe + t_live + 1, it is possible to derive a single correct value for T_d, T_p and T_i.

However, if you've been following closely, you may remember that for those quorum arguments to work, honest nodes must not be allowed to vote twice. In particular, nodes cannot vote for both a decision and an indecision, causing a potential issue: honest nodes may start voting for a decision and later be prevented from voting for an indecision, producing no output at all and breaking the weak liveness property. This is solved by requiring decision votes to be cast only for values that already have a pre-decision proof. Thus, in the situation above, there will be a pre-decision proof that can be output to satisfy weak liveness.

With weak liveness sorted out, filling in the gaps of the algorithm is straightforward: the leader proposes a value, and nodes send pre-decision votes for the first proposal they see. They send a decision vote for the first pre-decision proof they see, unless they have already voted for an indecision. They also start a timer; when it expires, they vote for an indecision (unless they already voted for a decision). For the full correctness proof, you may be interested in reading the full paper.
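Those voting rules can be sketched as a small state machine. This is a simplification of the protocol in the paper: message transport, proof aggregation and the threshold checks are abstracted away, and the names are ours.

```python
class JustifiableRBNode:
    """One node's voting discipline in the Justifiable Reliable Broadcast
    sketched above: each handler returns the vote to broadcast, or None."""

    def __init__(self):
        self.pre_voted = False
        self.decided = False
        self.aborted = False

    def on_proposal(self, value):
        # Pre-decision vote for the first proposal seen from the leader.
        if not self.pre_voted:
            self.pre_voted = True
            return ("pre-decision-vote", value)
        return None

    def on_pre_decision_proof(self, value):
        # Decision votes only back values that carry a pre-decision proof,
        # and never after an indecision vote (nodes never vote for both).
        if not self.decided and not self.aborted:
            self.decided = True
            return ("decision-vote", value)
        return None

    def on_timeout(self):
        # Timer expiry: vote for an indecision, unless already deciding.
        if not self.decided and not self.aborted:
            self.aborted = True
            return ("indecision-vote", None)
        return None
```

The mutual-exclusion flags encode the single-vote rule that the quorum intersection arguments rely on.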

Conclusion

Committee-based consensus is a foundational building block of modern blockchain infrastructure, and its efficiency directly determines how fast, cheap, and decentralised a network can be. Our work moves the needle meaningfully: by introducing Justifiability and the multi-threshold primary/backup architecture, we show that committees can be reduced by over 99% in realistic settings without compromising safety. This kind of improvement is the difference between a protocol that is theoretically sound but practically sluggish, and one that could underpin genuinely fast, decentralised finality. The applications are broad: from L2 sequencer decentralisation to any permissionless system where the cost of coordination is a bottleneck. Beyond the immediate result, Justifiability is a new theoretical concept with connections to the wider distributed systems literature — one we expect to find further applications for as the field continues to develop.

At TBTL, this kind of research is what we do. We are a deep tech company specialising in cryptography, distributed systems, and formal verification, with a particular focus on the Web3 space. Whether you are building a new blockchain protocol, designing a rollup, integrating ZK proofs into your stack, or simply trying to understand whether a system you depend on is as secure as it claims to be, we can help. If you are a company moving into Web3 or already operating in it and want rigorous, research-grade expertise applied to your hardest problems, we would love to hear from you.

Footnotes

¹ Other options may take the form of optimistic protocols that have low communication cost when the conditions are good, or through the use of more advanced cryptographic schemes.

² In synchronous networks, the bound becomes t < n/2, or even t < n for some variants of consensus.

³ This can be seen with a simple application of the Chernoff Bound.

⁴ The quorum intersection argument states that there is a number Q_S such that any two sets of nodes of size Q_S have at least one honest node in common. Thus, if nodes are instructed to vote only once, only a single value can ever receive Q_S votes.

⁵ Justifiability only makes sense if the protocol is multi-threshold and has the weak liveness property. Otherwise, it collapses to the well-known family of commit-abort protocols.

⁶ The proofs can be found in the full paper; Theorems 5 and 9.

⁷ Of course, liveness for Reliable Broadcast can only be guaranteed if the leader is honest, in which case the output is also required to equal the leader’s input value.
