Why is an LSM-based embedded key-value store a good fit for Ethereum execution-state storage?

By Suryansh Shrivastava · Published May 4, 2026 · 11 min read · Source: Blockchain Tag

Big picture

The Ethereum protocol requires that execution clients compute and verify Ethereum’s state commitments, historically the root of a modified Merkle Patricia trie. (ref: https://medium.com/@suryanshshrivastava_75738/why-ethereum-uses-a-merkle-patricia-trie-for-state-15b700738064)

The database underneath is an implementation detail. Geth, the Go execution client, has historically stored recent blocks and state in LevelDB, with older block/receipt data moved into a separate “freezer” database. (go-ethereum)

So the real question is not “why does Ethereum use LevelDB?” but:

Why is an ordered embedded key-value store, especially an LSM-tree one, a good fit for Ethereum execution-state storage?

The short answer: Ethereum state updates are mostly opaque byte-key lookups and batched writes of many small trie nodes, not relational queries. The Merkle Patricia trie turns one logical account/storage update into several new trie-node writes with very poor key locality, and an LSM store absorbs that write pattern much better than a B+ tree relational engine does.

LevelDB vs Postgres/SQLite: the interface mismatch

Ethereum clients mostly need operations like:

Get(key) -> value
Put(key, value)
Delete(key)
BatchWrite([...])
Iterate(prefix or key range)
Snapshot/read view

LevelDB is exactly that: an ordered mapping from byte-array keys to byte-array values, with basic Put, Get, Delete, atomic batches, snapshots, and ordered iteration. It is explicitly not a SQL database and has no relational model or secondary indexes. (GitHub)
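
To make that concrete, here is roughly what that whole surface area looks like in Go. The sketch assumes the github.com/syndtr/goleveldb package, a pure-Go LevelDB implementation of the sort Geth has used; it is illustrative, not lifted from any client.

package main

import (
    "fmt"

    "github.com/syndtr/goleveldb/leveldb"
    "github.com/syndtr/goleveldb/leveldb/util"
)

func main() {
    // Open (or create) an embedded database directory.
    db, err := leveldb.OpenFile("chaindata-sketch", nil)
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // Put / Get / Delete on opaque byte keys and values.
    _ = db.Put([]byte("key-1"), []byte("value-1"), nil)
    val, _ := db.Get([]byte("key-1"), nil)
    fmt.Printf("got %q\n", val)
    _ = db.Delete([]byte("key-1"), nil)

    // Atomic batch write: many small puts committed together.
    batch := new(leveldb.Batch)
    batch.Put([]byte("node-a"), []byte("blob-a"))
    batch.Put([]byte("node-b"), []byte("blob-b"))
    _ = db.Write(batch, nil)

    // Ordered iteration over a key prefix.
    iter := db.NewIterator(util.BytesPrefix([]byte("node-")), nil)
    for iter.Next() {
        fmt.Printf("%s => %s\n", iter.Key(), iter.Value())
    }
    iter.Release()

    // Consistent read view.
    snap, _ := db.GetSnapshot()
    defer snap.Release()
    _, _ = snap.Get([]byte("node-a"), nil)
}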

Postgres and SQLite solve a broader problem. Postgres provides SQL, MVCC, query planning, indexes, concurrency semantics, constraints, joins, statistics, vacuum, and so on. PostgreSQL B-tree indexes are excellent for sorted equality/range access over indexed columns, but they are part of a relational engine rather than a bare storage primitive. (PostgreSQL) SQLite is embedded, yes, but it stores tables and indexes as B-trees in a database-file format designed around SQL tables, row records, pages, schema, and indexes. (sqlite.org)

You could model Ethereum trie nodes in SQLite like:

CREATE TABLE trie_nodes (
    key   BLOB PRIMARY KEY,
    value BLOB NOT NULL
);

But now every Ethereum trie-node write is going through a SQL table abstraction. You get many features Ethereum does not need, and you pay for a B-tree/page-update storage model when the workload is closer to “millions of tiny random-looking key-value updates in large batches.”

Why LSM trees fit the workload

A B+ tree keeps the on-disk structure sorted and updates pages in place, so random updates tend to touch random pages. A relational engine may also generate row versions, index updates, WAL records, and later vacuum/cleanup. PostgreSQL, for example, uses MVCC, so updates and deletes leave old row versions that must later be reclaimed by VACUUM, which itself can create substantial I/O. (PostgreSQL)

An LSM tree flips the tradeoff. Writes first go to a log and an in-memory memtable. When the log/memtable reaches a threshold, data is flushed to immutable sorted files; later, background compaction merges files, discards overwritten values and deletion markers, and reorganizes data into lower levels. LevelDB’s implementation docs describe exactly this log → memtable → sorted table → levels → compaction pipeline. (GitHub)
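
A deliberately tiny toy model of the front half of that pipeline (not LevelDB's actual code, just a sketch of the idea) shows why batched small writes are cheap: every Put is an append plus an in-memory map insert, and the sorting work is deferred to flush time.

package main

import (
    "fmt"
    "sort"
)

// toyLSM models only the front of the pipeline: WAL append + memtable insert,
// with a flush that emits an immutable sorted run. Real LSMs (LevelDB, Pebble)
// add levels, compaction, bloom filters, and crash recovery on top of this.
type toyLSM struct {
    wal       [][2]string       // append-only log of (key, value)
    memtable  map[string]string // latest value per key, in memory
    threshold int
    runs      [][][2]string // flushed, sorted, immutable runs
}

func (l *toyLSM) Put(key, value string) {
    l.wal = append(l.wal, [2]string{key, value}) // sequential append
    l.memtable[key] = value                      // cheap in-memory update
    if len(l.memtable) >= l.threshold {
        l.flush()
    }
}

func (l *toyLSM) flush() {
    keys := make([]string, 0, len(l.memtable))
    for k := range l.memtable {
        keys = append(keys, k)
    }
    sort.Strings(keys) // the only sorting work, done once per flush
    run := make([][2]string, 0, len(keys))
    for _, k := range keys {
        run = append(run, [2]string{k, l.memtable[k]})
    }
    l.runs = append(l.runs, run)
    l.memtable = map[string]string{}
    l.wal = nil
}

func main() {
    l := &toyLSM{memtable: map[string]string{}, threshold: 3}
    for _, kv := range [][2]string{{"c", "3"}, {"a", "1"}, {"b", "2"}, {"d", "4"}} {
        l.Put(kv[0], kv[1])
    }
    fmt.Println("flushed runs:", l.runs)
    fmt.Println("still in memtable:", l.memtable)
}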

For Ethereum, that is attractive because a block import looks roughly like this:

for each block:
    execute many transactions
    read many accounts/storage slots
    update balances, nonces, contract storage
    create/modify many trie nodes
    batch-write all dirty state/trie data
    compute new stateRoot

The application-level write pattern is already “log-like”: after a trie node changes, its hash changes, so the client often writes a new node blob rather than mutating one stable record in place. LevelDB/Pebble absorb this as sequential-ish batch writes, then pay cleanup cost later through compaction.

That does not mean LSM is “free.” LSMs trade faster writes for read amplification and background compaction. But Ethereum’s state update pattern makes that tradeoff reasonable.

What the Merkle Patricia trie gives Ethereum

Ethereum’s world state is not stored in a trie because a trie is the fastest possible database index. It is stored in a trie because Ethereum needs cryptographic commitment to a huge mutable key-value state.

Ethereum.org describes the Ethereum state as encoded in a modified Merkle-Patricia trie, where the trie root creates a verifiable relationship between individual state items and a single root value. Two identical states produce the same root; changing the underlying state changes the root. (ethereum.org)

That root goes into block validation. When a node executes a block, it computes the resulting state root. If the computed root does not match the block’s claimed root, the block is invalid.

The MPT gives Ethereum several things:

  1. Deterministic global state commitment
    All honest nodes executing the same transactions from the same parent state should get the same root.
  2. Efficient incremental updates
    Changing one account or storage slot changes only the leaf and the nodes along the path to the root, not the entire state.
  3. Merkle proofs
    A node can prove that an account/storage slot is included, or absent, under a given state root. Ethereum.org notes that a proof for a nonexistent (path, value) cannot be forged because the root hash depends on all hashes below it. (ethereum.org)
  4. Sparse keyspace support
    Ethereum accounts and storage slots live in huge 160-bit/256-bit spaces. A Patricia trie compresses long paths and avoids allocating every empty level.
  5. Adversarial-key robustness
    Geth indexes account information by the hash of the address rather than the address directly, so traversal is over sha3(address); this prevents user-chosen addresses from creating pathological prefix clustering. (go-ethereum)

What an Ethereum state lookup actually looks like

At the conceptual level, the world state is:

stateTrie[keccak256(address)] = RLP(account)

where an account contains roughly:

nonce
balance
storageRoot
codeHash

Geth’s FAQ describes the state trie as key-value pairs where account addresses map to RLP-encoded account objects containing nonce, balance, StorageRoot, and codeHash, with the trie indexed by the hash of the key. (go-ethereum)
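
In code, deriving the state-trie path for an account is just a Keccak-256 of the 20-byte address. The sketch below uses golang.org/x/crypto/sha3 for the legacy Keccak variant; the account struct is illustrative and mirrors the four fields above rather than reproducing Geth's actual type.

package main

import (
    "fmt"
    "math/big"

    "golang.org/x/crypto/sha3"
)

// account mirrors the four fields committed per account in the state trie
// (illustrative shape only, not Geth's type).
type account struct {
    Nonce       uint64
    Balance     *big.Int
    StorageRoot [32]byte
    CodeHash    [32]byte
}

// keccak256 is Ethereum's hash function (pre-standard Keccak, not SHA3-256).
func keccak256(data []byte) []byte {
    h := sha3.NewLegacyKeccak256()
    h.Write(data)
    return h.Sum(nil)
}

func main() {
    address := make([]byte, 20) // placeholder 20-byte address
    path := keccak256(address)  // state-trie path = keccak256(address)
    fmt.Printf("trie path: %x\n", path)

    // The value stored at that path is the RLP encoding of these four fields.
    acct := account{Nonce: 1, Balance: big.NewInt(0)}
    fmt.Printf("account to encode: %+v\n", acct)
}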

For a contract storage slot, there is a second trie:

storageTrie[keccak256(storageSlotKey)] = RLP(storageValue)

So a contract storage read is two-level:

1. Look up account in global state trie.
2. Read account.storageRoot.
3. Use that root to look up the storage slot in the contract’s storage trie.

For an ERC-20 balance lookup, Solidity might compute a mapping slot like:

slotKey = keccak256(pad(userAddress) ++ pad(mappingSlotNumber))

Then Ethereum uses the hashed slot key as the path into the contract’s storage trie.
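
Spelled out as a sketch (using golang.org/x/crypto/sha3, with a zero address and mapping slot 0 as placeholders): the mapping-slot rule is standard Solidity storage layout, and the final keccak256 over the slot key is the hashed path used inside the storage trie.

package main

import (
    "fmt"

    "golang.org/x/crypto/sha3"
)

// keccak256 is Ethereum's hash function (pre-standard Keccak, not SHA3-256).
func keccak256(data ...[]byte) []byte {
    h := sha3.NewLegacyKeccak256()
    for _, d := range data {
        h.Write(d)
    }
    return h.Sum(nil)
}

// leftPad32 left-pads a byte slice to 32 bytes, as Solidity does for mapping
// keys and slot numbers before hashing.
func leftPad32(b []byte) []byte {
    out := make([]byte, 32)
    copy(out[32-len(b):], b)
    return out
}

func main() {
    user := make([]byte, 20) // placeholder 20-byte address
    slotNumber := []byte{0}  // placeholder: balances mapping declared at slot 0

    // Solidity: storage slot of balances[user] = keccak256(pad(user) ++ pad(slot)).
    slotKey := keccak256(leftPad32(user), leftPad32(slotNumber))

    // The client then walks the contract's storage trie along keccak256(slotKey).
    triePath := keccak256(slotKey)

    fmt.Printf("storage slot: %x\n", slotKey)
    fmt.Printf("trie path:    %x\n", triePath)
}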

Example: ETH transfer from Alice to Bob

Suppose Alice sends 1 ETH to Bob.

Conceptually only two account records change:

Alice.balance -= 1 ETH
Alice.nonce += 1
Bob.balance   += 1 ETH

But in the Merkle Patricia trie, this does not mean “update two rows.”

For Alice:

path = keccak256(AliceAddress)
oldLeaf = RLP(old Alice account)
newLeaf = RLP(new Alice account)

Because the leaf value changed, the leaf hash changes. Because the leaf hash changed, the parent branch/extension node changes. Because the parent hash changed, its parent changes. This propagates all the way to the root.

Same for Bob.

So one simple transfer causes writes like:

new Alice leaf node
new ancestors on Alice path
new Bob leaf node
new ancestors on Bob path
new state root

If Alice and Bob share some trie prefix, some ancestors overlap; if not, the changed paths are mostly distinct. Either way, the database sees a batch of new small node blobs.

In a legacy hash-addressed trie database, those node blobs are naturally keyed by hash:

db[hash(rlp(node))] = rlp(node)

Those keys are effectively random. A B+ tree storage engine would see scattered logical updates. An LSM engine sees many Put(random-looking-key, small-value) operations, appends them to a log/memtable, and later compacts.
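
Here is a toy version of that write pattern, with a plain Go map standing in for the key-value store and flat byte strings standing in for RLP-encoded nodes. The point is only that every rewritten node on the path lands under a fresh keccak256-of-contents key, so the store sees a batch of writes at effectively random keys.

package main

import (
    "fmt"

    "golang.org/x/crypto/sha3"
)

// keccak256 is Ethereum's hash function (pre-standard Keccak, not SHA3-256).
func keccak256(data ...[]byte) []byte {
    h := sha3.NewLegacyKeccak256()
    for _, d := range data {
        h.Write(d)
    }
    return h.Sum(nil)
}

func main() {
    // Stand-in for the node database: key = keccak256(node bytes), value = node
    // bytes. A real client stores RLP-encoded branch/extension/leaf nodes.
    db := map[string][]byte{}

    // Pretend Alice's leaf sits three nodes below the root.
    newLeaf := []byte("RLP(new Alice account)")
    pathNodes := [][]byte{newLeaf}

    // Each ancestor embeds the hash of the node below it, so when the leaf
    // changes, every ancestor's bytes (and therefore its own key) change too.
    child := newLeaf
    for level := 0; level < 3; level++ {
        parent := append([]byte("node-embedding-child:"), keccak256(child)...)
        pathNodes = append(pathNodes, parent)
        child = parent
    }

    // "Batch write": every rewritten node lands under a fresh random-looking key.
    for _, node := range pathNodes {
        db[string(keccak256(node))] = node
    }
    fmt.Printf("one leaf change produced %d new node writes\n", len(db))
    fmt.Printf("new state root: %x\n", keccak256(child))
}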

The important consequence: the trie defines random point-lookups, not relational scans

A trie lookup follows a nibble path. Ethereum.org’s MPT docs describe branch nodes as 17-item nodes, leaf nodes, and extension nodes that compress long paths; the lookup walks node-by-node, often fetching the next node by hash from the underlying database. (ethereum.org)

The database-level access pattern is therefore:

Get(rootNode)
inspect path nibble
Get(childNode)
inspect next nibble(s)
Get(childNode)
...
return leaf value

That is mostly point lookup by opaque key, not:

SELECT ...
JOIN ...
GROUP BY ...
ORDER BY ...

And because addresses/storage slots are hashed, neighboring application objects are not necessarily neighboring database keys. Alice and Bob might be close in the real world, or in application logic, but their trie paths are pseudorandom.

This is one of the strongest reasons a general relational datastore is not a natural fit. Ethereum’s execution layer does not need “find all accounts with balance > X” or “join contract storage with transactions.” It needs “given this 32-byte key, fetch this blob,” millions and millions of times.
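
Stripped to its essentials, that read path is a loop of point Gets keyed by child hashes. In the sketch below, nextChildHash is a made-up placeholder for real node decoding (branch/extension/leaf handling); the backend is assumed to be goleveldb, and the only thing the example is meant to show is the access pattern.

package main

import "github.com/syndtr/goleveldb/leveldb"

// nextChildHash is a made-up placeholder for real trie-node decoding: given a
// node's RLP bytes and the remaining path nibbles, a real client would return
// either the hash of the child to descend into or the leaf value.
func nextChildHash(nodeRLP []byte, path []byte) (childHash []byte, leafValue []byte) {
    // ... branch/extension/leaf decoding elided ...
    return nil, nodeRLP
}

// lookup walks from the state root hash down to a leaf: every hop is one
// point Get against an opaque 32-byte key; there is no range scan or join.
func lookup(db *leveldb.DB, rootHash, path []byte) ([]byte, error) {
    current := rootHash
    for {
        nodeRLP, err := db.Get(current, nil) // point lookup by node hash
        if err != nil {
            return nil, err
        }
        childHash, leafValue := nextChildHash(nodeRLP, path)
        if childHash == nil {
            return leafValue, nil // reached the value at the leaf
        }
        current = childHash // descend one level; the next hop is another Get
    }
}

func main() {}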

But reads are painful — so clients add caches and snapshots

The MPT is great for verification, but not great for raw read speed. The Ethereum Foundation’s Geth snapshot article explains the tradeoff clearly: a flat key-value representation would be efficient, but verifying correctness would become expensive; the Merkle Patricia trie avoids hashing the whole dataset after every modification, but makes reads and writes traverse internal trie nodes. (Ethereum Foundation Blog)

That is why real clients do not naively hit disk for every trie node every time. They use:

state object caches
trie node caches
dirty-write caches
snapshot / flat-state acceleration
batch writes
pruning
freezer / ancient storage separation

The Geth snapshot design keeps a flat key-value view of accounts and storage slots so reads can often avoid full trie traversal, while the trie still exists for producing and verifying state roots. The EF article describes the snapshot as a dump of accounts and storage slots represented by a flat key-value store, reducing account/storage access to one LevelDB lookup instead of walking several trie nodes, with in-memory diff layers to handle recent blocks and reorgs. (Ethereum Foundation Blog)
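
The acceleration idea reduces to "try one flat lookup first, fall back to the trie walk". The key scheme below (a fixed prefix plus the hashed address) is an assumption for the sketch, not Geth's actual snapshot layout.

package main

import "github.com/syndtr/goleveldb/leveldb"

// snapshotKey builds a flat-store key for an account. The "s" prefix and the
// layout are assumptions for this sketch, not Geth's real snapshot schema.
func snapshotKey(addrHash []byte) []byte {
    return append([]byte("s"), addrHash...)
}

// readAccount prefers the flat snapshot (one point lookup) and only falls back
// to a full trie walk (several point lookups) when the snapshot has no entry.
func readAccount(db *leveldb.DB, addrHash []byte, trieWalk func([]byte) ([]byte, error)) ([]byte, error) {
    blob, err := db.Get(snapshotKey(addrHash), nil)
    if err == nil {
        return blob, nil // fast path: a single LevelDB lookup
    }
    if err != leveldb.ErrNotFound {
        return nil, err
    }
    return trieWalk(addrHash) // slow path: walk the trie nodes by hash
}

func main() {}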

So the canonical data structure is the MPT, but the physical read path in a production client is often accelerated by additional structures.

Why compaction matters

Ethereum clients lean heavily on compaction when using LevelDB/Pebble-style backends.

Geth’s database docs say its LevelDB database supports batch writes and ordered iteration, and that the database is periodically compacted to reduce the cost of accessing individual items. The same docs list compaction-related metrics such as compaction time, compaction reads/writes, and write delays due to compaction. (go-ethereum)

Compaction is doing several useful things for Ethereum’s workload:

1. Removes overwritten versions.
2. Removes tombstoned/deleted records.
3. Merges scattered sorted files into lower levels.
4. Reduces the number of files/levels a point lookup must check.
5. Reclaims space from pruned stale trie nodes.

But compaction is also one of the big costs. Geth’s archive-mode docs explicitly mention that legacy hash-based archive nodes face significant database compaction overhead and can exceed 20 TB on mainnet. (go-ethereum)

This is why Geth separates old block/receipt data into the freezer: keeping less data in LevelDB makes compactions faster and leaves more cache for active state trie nodes. (go-ethereum)
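
For a feel of what compaction costs look like operationally, goleveldb exposes a manual compaction call and internal statistics. The snippet below forces a full-range compaction and prints the engine's own stats; treat the property name and calls as goleveldb specifics (an assumption of this sketch), not something every backend offers.

package main

import (
    "fmt"

    "github.com/syndtr/goleveldb/leveldb"
    "github.com/syndtr/goleveldb/leveldb/util"
)

func main() {
    db, err := leveldb.OpenFile("chaindata-sketch", nil)
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // Force a compaction over the whole key space. In normal operation this
    // runs in the background; doing it by hand makes the cost visible.
    if err := db.CompactRange(util.Range{}); err != nil {
        panic(err)
    }

    // Internal statistics: per-level table counts, sizes, and the volume of
    // data read and written by compaction.
    if stats, err := db.GetProperty("leveldb.stats"); err == nil {
        fmt.Println(stats)
    }
}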

Hash-based state vs path-based state

Historically, a common model is:

hashdb:
    key   = keccak256(rlp(trieNode))
    value = rlp(trieNode)

This gives beautiful content addressing. If two trie nodes are identical, they naturally share the same hash. It also makes Merkle proofs straightforward.

But it makes pruning hard. If the DB is full of nodes addressed only by hash, deciding which old nodes are no longer reachable from any retained root is nontrivial. A node may be referenced by some historical state root, by a storage trie, or by recent reorg protection.

Newer Geth supports path-based archive/state storage. Geth’s archive docs distinguish legacy hash-based archive nodes from path-based archive nodes; the path-based design is described as more disk-efficient and configurable for historical-state retention. (go-ethereum) The Geth pathdb package docs describe a layered structure with one persistent base key-value layer and in-memory diff layers above it, supporting rollback through reverse diffs depending on retained state history. (Go Packages)
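
As a purely illustrative toy (not the pathdb types), the layered arrangement behaves like this: a read checks the newest in-memory diff layer first and falls through layer by layer until it reaches the persistent base.

package main

import "fmt"

// diffLayer holds the state entries written on top of its parent layer. In the
// real design each diff layer corresponds to a recent block and can be rolled
// back via reverse diffs; this toy keeps only the lookup behaviour.
type diffLayer struct {
    parent *diffLayer
    data   map[string][]byte
}

// disk stands in for the single persistent base key-value layer.
var disk = map[string][]byte{"acct-1": []byte("old value on disk")}

// get walks from the newest diff layer down to the persistent layer.
func (l *diffLayer) get(key string) ([]byte, bool) {
    for layer := l; layer != nil; layer = layer.parent {
        if v, ok := layer.data[key]; ok {
            return v, true
        }
    }
    v, ok := disk[key]
    return v, ok
}

func main() {
    base := &diffLayer{data: map[string][]byte{}}
    tip := &diffLayer{parent: base, data: map[string][]byte{"acct-1": []byte("new value in diff layer")}}
    if v, ok := tip.get("acct-1"); ok {
        fmt.Printf("%s\n", v)
    }
}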

The takeaway:

hash-addressed trie nodes:
    excellent for content-addressed Merkle structure
    hard to prune
    poor locality

path/state-history-aware storage:
    better pruning and history management
    still backed by ordered key-value storage
    still benefits from batched writes and compaction

Why not just use a B+ tree KV store?

You can. Ethereum does not mathematically require an LSM. A client could use a B+ tree or LMDB/MDBX-style engine if it designs around the tradeoffs.

But the workload has several LSM-friendly properties: writes arrive as large batches of small blobs under random-looking keys, old node versions quickly become garbage, and reads are point lookups that clients already soften with caches and snapshots.

A B+ tree can give excellent point reads and range scans, but random writes update pages throughout the tree. With Ethereum, the trie already makes updates scattered; an LSM moves that scattering out of the foreground write path and into background compaction.

The mental model

Think of Ethereum state storage as three layers:

Protocol layer:
    "The state must commit to a Merkle Patricia root."

Trie layer:
    "Accounts and storage slots are organized by hashed paths.
    Updating one value rewrites the path to the root."

DB layer:
    "Persist lots of small opaque blobs by key.
    Reads are point lookups; writes arrive in batches.
    Old versions eventually become garbage."

LevelDB/LSM is chosen at the DB layer because it is a simple embedded ordered KV store that handles the last part well.

The Merkle Patricia trie is chosen at the protocol layer because Ethereum needs deterministic, cryptographically verifiable state roots and proofs.

Ethereum execution clients ask the database to persist and retrieve a huge number of small byte blobs under random-looking keys, while the client itself owns the consensus logic, state transition logic, trie hashing, caching, pruning, and block atomicity.
