Your Blockchain Indexer Is Fine — Until It Isn’t
Jatin Jain Saraf
What nobody tells you about hitting 20–30 TB of indexed blockchain data, and why “just optimize it later” is the most expensive sentence in this industry.
The first year of building a blockchain indexer is deceptively smooth.
Blocks arrive. Transactions get parsed. Wallets accumulate state. Your PostgreSQL instance hums along. Queries return in milliseconds. Storage is cheap. The team ships features.
Then year two happens. Then year three.
Somewhere between 5 TB and 10 TB, your infrastructure bill starts a quiet conversation with your CTO. By 20 TB, that conversation gets loud. By 30 TB — which is entirely reachable for a high-throughput chain after three years of production indexing — you are no longer talking about features. You are talking about survival: how do you change anything in a live system of this size without taking it down, corrupting data, or spending six months on a migration that should have taken three weeks?
This is that article. Not the “here’s how to build an indexer” tutorial. The one about what happens after — when the data is already there, it’s already the wrong shape, and the chain doesn’t stop producing blocks while you figure it out.
What Makes Move Different at the Data Layer
Before talking about scale, it’s worth understanding why Move-based blockchains (Aptos, Sui, Supra, and others) create a fundamentally different indexing problem than EVM chains.
EVM state is shallow. An Ethereum indexer primarily deals with: blocks, transactions, logs, and account balances. The schema is wide but not deep. Relationships are relatively flat.
Move state is deep and typed. Move organises on-chain state as typed resources attached to addresses. A single wallet doesn’t just have a balance — it can own dozens of distinct resource types simultaneously: coin stores, object collections, automation registrations, governance positions, staking records. Each resource type has its own structure. Each needs to be indexed, tracked, and queryable.
Move has a first-class object model. In chains like Sui and Supra, everything is an addressable object with an owner, a version, a type path, and a parent-child relationship to other objects. Your indexer isn’t just recording what happened — it’s maintaining a live, queryable snapshot of an entire object graph that mutates with every block.
Events are strongly typed. Move’s event system emits structured, typed payloads — not raw byte arrays like EVM logs. This is powerful for developers querying the indexer, but it means your event storage needs to handle variable schemas across dozens of event types, each with different fields, all arriving in the same stream.
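To make that concrete, here is a minimal sketch of two of the tables such an indexer might carry: one for the live object snapshot and one for the typed event stream. This assumes a PostgreSQL backend; the table and column names are illustrative, not any particular indexer's actual schema.

```sql
-- Illustrative only: a live object snapshot plus a typed-event stream.
CREATE TABLE object_snapshot (
    object_id    BYTEA  NOT NULL PRIMARY KEY,   -- the object's on-chain address
    owner        BYTEA  NOT NULL,               -- current owner address
    version      BIGINT NOT NULL,               -- bumped on every mutation
    type_path    TEXT   NOT NULL,               -- e.g. '0x2::coin::Coin<...>'
    parent_id    BYTEA,                         -- parent object, if any
    content      JSONB  NOT NULL                -- decoded fields of the resource
);

CREATE TABLE move_event (
    event_id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    transaction_id  BIGINT NOT NULL,            -- reference to the transactions table
    type_tag        TEXT   NOT NULL,            -- e.g. '0x1::coin::DepositEvent'
    payload         JSONB  NOT NULL             -- typed fields, which differ per type_tag
);

-- Most queries filter by event type first, then inspect the payload.
CREATE INDEX move_event_type_idx ON move_event (type_tag, transaction_id);
```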
The consequence of all this: a production Move indexer has far more tables, far more relationships, and far more cross-references than an EVM equivalent. A realistic production schema has 35–50 tables. Most of them are interconnected. Most of them grow with every block.
That scale of interconnection is exactly what makes optimisation so painful later.
The Compounding Problem: Why Early Decisions Are Permanent
Here is the core tension of blockchain indexer design: you make your schema decisions when you have zero data, but you live with those decisions when you have 30 TB of it.
Every structural inefficiency you introduce on day one doesn’t just cost you N bytes per row. It costs you N bytes multiplied by every row ever written, across every table that inherited the pattern, indexed twice over because PostgreSQL also stores your mistakes in its B-tree indexes.
Let’s make this concrete.
The Repeated String Problem
Imagine you store a short identifier — say, the name of the network environment (“mainnet”, “testnet”) — as a VARCHAR column in every table. Seems harmless. It's just a word.
Now imagine that string is present in 40 tables. In every row of every one of those tables. At 10 billion total rows distributed across those tables — entirely realistic after three years on a fast chain — that one string, repeated, costs you tens to hundreds of gigabytes depending on its length.
But here’s where it gets worse: you can’t just ALTER TABLE your way out of this at 30 TB. An ALTER TABLE that changes a column type on a 2-billion-row table doesn't complete in minutes. It locks the table. It rewrites the heap. It takes hours or days. And you have 40 tables to fix. And the chain is still producing blocks.
The “right” solution — replacing the repeated string with a 2-byte integer foreign key — would have cost 30 minutes on day one. At year three, it costs you a multi-month migration project with significant operational risk.
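A hedged sketch of what that day-one version looks like in PostgreSQL. The names are invented for illustration; the point is that the string lives in exactly one small table and every other table carries a 2-byte key.

```sql
-- The string lives once, in a tiny lookup table.
CREATE TABLE environment (
    environment_id SMALLINT PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE        -- 'mainnet', 'testnet', ...
);

INSERT INTO environment VALUES (1, 'mainnet'), (2, 'testnet');

-- Every wide table stores the 2-byte key instead of repeating the word.
CREATE TABLE block (
    block_height   BIGINT   NOT NULL,
    environment_id SMALLINT NOT NULL REFERENCES environment (environment_id),
    block_hash     BYTEA    NOT NULL,
    PRIMARY KEY (environment_id, block_height)
);
```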
The Fat Primary Key Problem
A transaction hash is typically 66 characters. A timestamp is 8 bytes. If your transaction table’s primary key is a composite of these plus an environment name, you’ve created a ~90-byte primary key.
That primary key gets copied — physically, on disk — into every foreign key column of every child table. A transaction-heavy Move indexer has 10 or more child tables per transaction: events, state changes, coin transfers, fungible asset movements, sender records, fee payer records, and payload details. Each one stores that full 90-byte reference on every row.
At 10 billion transactions, with an average of 8 rows in each of 10 child tables per transaction:
90 bytes × 10 child tables × 8 rows per table × 10 billion transactions
≈ 72 TB of foreign key storage alone.
And PostgreSQL builds a B-tree index on every foreign key. That doubles it.
A 4-byte or 8-byte surrogate integer primary key — decided on day one — reduces this by a factor of 10 to 20. But retrofitting surrogate keys onto live tables with hundreds of billions of rows, active foreign key constraints, and applications reading those columns in real time is one of the hardest classes of database migrations that exists.
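For comparison, here is a hedged sketch of both shapes, again assuming PostgreSQL; the table and column names are invented for illustration.

```sql
-- The fat composite key: roughly 90 bytes copied into every child row.
CREATE TABLE txn_fat (
    transaction_hash VARCHAR(66)  NOT NULL,
    block_time       TIMESTAMPTZ  NOT NULL,
    environment      VARCHAR(32)  NOT NULL,
    PRIMARY KEY (transaction_hash, block_time, environment)
);

-- The surrogate key: 8 bytes, with the hash stored once as raw bytes.
CREATE TABLE txn_slim (
    transaction_id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    transaction_hash BYTEA       NOT NULL,     -- 32 raw bytes instead of 66 hex characters
    block_time       TIMESTAMPTZ NOT NULL,
    environment_id   SMALLINT    NOT NULL
);

-- Child tables reference the 8-byte key, not the 90-byte composite.
CREATE TABLE txn_event (
    event_id       BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    transaction_id BIGINT NOT NULL REFERENCES txn_slim (transaction_id),
    payload        JSONB  NOT NULL
);
```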
The Wrong Type Problem
Move’s RPC layer returns numeric values as strings because they can exceed JavaScript’s safe integer range. blockHeight: "18000000". sequenceNumber: "4294967298". version: "999999999".
Indexer developers, working fast, copy this directly to the database: blockheight VARCHAR(256). version VARCHAR(10). sequencenumber VARCHAR(255).
It works. For a while.
Then someone writes WHERE blockheight BETWEEN '15000000' AND '16000000'. Postgres compares VARCHAR values lexicographically, character by character, so a block at height 155 satisfies the predicate and the query returns wrong results. A bug is filed. A cast is added. The cast prevents index use. The query slows down. An index is added on CAST(blockheight AS BIGINT). Now you have a functional index instead of a simple column index, which is harder to maintain and slower to build.
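A hedged sketch of that patch chain and the fix it postpones, assuming PostgreSQL; blocks and blockheight are placeholder names.

```sql
-- The patch that usually ships first: a functional index so the cast can use an index.
CREATE INDEX blocks_height_cast_idx ON blocks ((blockheight::BIGINT));

SELECT *
FROM   blocks
WHERE  blockheight::BIGINT BETWEEN 15000000 AND 16000000;

-- The fix the patch postpones, and the one that rewrites the whole table:
-- ALTER TABLE blocks ALTER COLUMN blockheight TYPE BIGINT USING blockheight::BIGINT;
```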
At 2 billion blocks, changing VARCHAR(256) to BIGINT requires a full table rewrite. At 30 TB, that's a weekend project, not a 10-minute fix.
JSONB: The Tool That Becomes the Problem
JSONB is genuinely useful for a Move indexer. Move resources have variable schemas. Event payloads differ by type. Storing semi-structured data in a relational database without JSONB means either a hundred narrow tables or an unmaintainable EAV design.
But JSONB has a cost that surprises people who haven’t run it at scale.
Storage overhead is fixed regardless of content. A JSONB column storing {} still has ~20–30 bytes of binary representation overhead. When you have millions of rows with small or empty JSON blobs, that overhead dominates.
Updates are expensive. PostgreSQL stores JSONB as an immutable value. When you update a JSONB column — even to change a single key — the database writes a completely new version of the entire document and marks the old one dead. For tables that update JSONB columns frequently (like wallet state, which changes on every relevant transaction), this creates significant table bloat over time. The dead tuples accumulate. VACUUM can’t always keep up. The table grows even if the actual data size doesn’t.
GIN indexes on JSONB are large and slow to build. A GIN index on a full JSONB column is many times larger than a B-tree index on a scalar column. On a table with billions of rows, building or rebuilding a GIN index is a multi-hour operation that consumes significant I/O.
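One common mitigation, sketched here under the assumption of PostgreSQL 12 or newer (for stored generated columns); the table and field names are hypothetical. Fields that queries actually filter on move out of the blob into plain indexed columns, and any remaining GIN index uses the smaller jsonb_path_ops operator class.

```sql
-- Hot field pulled out of the JSONB blob into a generated column with a B-tree index.
-- (Adding a stored generated column rewrites the table, so this too is a day-one decision.)
ALTER TABLE wallet_resource
    ADD COLUMN coin_type TEXT
    GENERATED ALWAYS AS (payload ->> 'coin_type') STORED;

CREATE INDEX wallet_resource_coin_type_idx ON wallet_resource (coin_type);

-- If containment queries on the full document are still needed, the
-- jsonb_path_ops opclass builds a considerably smaller GIN index than the default.
CREATE INDEX wallet_resource_payload_idx
    ON wallet_resource USING GIN (payload jsonb_path_ops);
```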
The failure mode at scale is not that JSONB stops working. It’s that it gradually degrades — updates get slower, VACUUM runs longer, storage grows faster than the data warrants — and diagnosing the root cause in a 30 TB database with dozens of tables requires significant forensic work.
The Migration Problem Is the Real Problem
Everything described above has a known correct solution. Surrogate keys. Proper numeric types. Selective JSONB use. Hash storage as binary instead of hex strings. These are not novel ideas.
The reason they matter — the reason this article exists — is not the storage cost in isolation. It’s what that storage cost means when you try to fix it three years into production.
You cannot take a blockchain indexer offline to migrate it. The chain does not pause. Blocks arrive every second. If your indexer falls behind, you create a gap in your data that is either expensive to backfill or permanently lost, depending on your node’s pruning settings.
Schema migrations at this scale require orchestration, not just SQL. You cannot ALTER TABLE indexer_transaction ADD COLUMN transaction_id BIGSERIAL on a 3-billion-row table and move on with your day. You need to: add the column with no constraint, backfill in batches during low-traffic windows, add the constraint as NOT VALID, validate it in a separate step, update all application code to write to both old and new columns during the transition, cut over reads, then drop the old column. Each step has rollback implications. Each step must be tested against a full-size copy of production data.
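A compressed sketch of that sequence, assuming PostgreSQL. The indexer_transaction table comes from the paragraph above; the sequence name and batch size are invented, and each numbered step would ship as its own change with monitoring and a rollback plan in between.

```sql
-- 1. Add the column with no default and no constraint (metadata-only, near-instant).
ALTER TABLE indexer_transaction ADD COLUMN transaction_id BIGINT;
CREATE SEQUENCE indexer_transaction_id_seq;

-- 2. Backfill in batches during low-traffic windows (one batch shown; a script loops this).
UPDATE indexer_transaction
SET    transaction_id = nextval('indexer_transaction_id_seq')
WHERE  ctid = ANY (ARRAY(SELECT ctid
                         FROM   indexer_transaction
                         WHERE  transaction_id IS NULL
                         LIMIT  10000));

-- 3. Add the constraint as NOT VALID so existing rows are not scanned yet...
ALTER TABLE indexer_transaction
    ADD CONSTRAINT transaction_id_not_null
    CHECK (transaction_id IS NOT NULL) NOT VALID;

-- 4. ...then validate it in a separate step, which scans without blocking writes.
ALTER TABLE indexer_transaction VALIDATE CONSTRAINT transaction_id_not_null;

-- 5–7. Dual-write old and new columns from the application, cut reads over,
--      then drop the old columns once nothing references them.
```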
Foreign key changes require cascade planning. If you want to replace a composite FK with a surrogate integer FK, you need to: add the surrogate column to the parent, backfill it, add it to all child tables, backfill it there, drop the old FK constraints, add new ones, drop the old columns. For 10 child tables, each with billions of rows, this is months of careful work — not because any individual step is hard, but because the sequence must be correct and live traffic must be handled at every stage.
Indexes cannot be rebuilt without significant I/O impact. Replacing a large text-based index with a smaller integer-based index on a table that’s being actively written requires CREATE INDEX CONCURRENTLY, which builds the index without blocking writes but takes proportionally longer and puts sustained read pressure on the disk. On a busy 30 TB database, a single large index rebuild can take 12–24 hours.
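As a sketch of that replacement, again assuming PostgreSQL and illustrative names:

```sql
-- Build the new, smaller integer-based index without blocking writes.
CREATE INDEX CONCURRENTLY txn_event_transaction_id_idx
    ON txn_event (transaction_id);

-- Once readers have switched over, drop the old text-based index the same way.
DROP INDEX CONCURRENTLY txn_event_transaction_hash_idx;
```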
This is the real complexity of running a production Move indexer at scale. Not any single technical problem, but the compounding interaction of many small early decisions that each seemed reasonable at the time — and the enormous operational cost of unwinding them under load, without downtime, while the chain keeps running.
What Good Design Buys You
The flip side of all this is worth stating clearly.
A team that makes the right schema decisions at the start — surrogate integer keys, correct numeric types, binary hash storage, selective JSONB use, properly normalised repeated strings — doesn’t just save storage. They buy themselves optionality.
At 30 TB with a well-designed schema, adding a new index takes hours, not days. Migrating a table to a new structure is a weekend project, not a quarter-long initiative. Backfilling new columns is a batch job, not a rewrite. Query optimisation is about access patterns, not fighting the type system.
The teams that build Move indexers correctly from the start can spend year three shipping new features and supporting developers building on their data. The teams that don’t will spend year three fighting their own database.
That’s the real cost of getting it wrong. Not the terabytes. The lost time.
Closing Thought
If you are designing a Move blockchain indexer today — or evaluating one — the questions worth asking are not just “does it index all transaction types?” or “what’s the query latency?” They are:
- What does the schema look like at 100× current data volume?
- How long does it take to add a column to your largest table?
- Can you change a primary key structure without taking the system down?
- How do you handle schema migrations under continuous write load?
The answers reveal a lot about how much technical debt is already priced into the system — and how much of your future engineering capacity it will consume.
Blockchain data is permanent and ever-growing. Your schema decisions are not quite permanent — but at 30 TB, they’re close enough that the difference barely matters.
The Move ecosystem is still early, and the tooling for large-scale indexer operations is still maturing. If you are building in this space and want to compare notes, the comments are open.