NoSQL Internals: LSM Trees, WAL & Bloom Filters

By Ishwar Ambare · Published March 16, 2026 · 19 min read · Source: Level Up Coding

Part 6

Sources: Bloom Filters by Example · Bloom Filter Calculator

Understand exactly how NoSQL databases store, retrieve, and manage data on disk — using the exact examples from the lecture.

📋 Table of Contents

  1. The Problem: Why SQL’s Approach Breaks for NoSQL
  2. Brute Force: Flat File Storage
  3. Write-Ahead Log (WAL)
  4. In-Memory Hashmap: Key → Offset
  5. MemTable: Latest Chunk in Memory
  6. SSTables: Sorted String Tables
  7. Full Dry Run: Keys 1–5
  8. LSM Tree: The Full Architecture
  9. Sparse Index: Memory-Efficient Lookup
  10. Bloom Filters: Skip Unnecessary Reads
  11. Tombstones: How Deletion Works
  12. Full Read & Write Flow
  13. Summary Cheat Sheet

1. The Problem: Why SQL’s Approach Breaks for NoSQL

SQL: Fixed Schema → Safe In-Place Updates

In SQL, every column has a known data type and every table has a fixed schema. Because the size of every row is exactly known, you can safely update values in-place.

SQL Table: users
Column: id → BIGINT (8 bytes)
Column: name → VARCHAR(50)
Row size = 58 bytes, always.

Row 1: id=20, name="nikl" → 58 bytes exactly
Row 2: id=5, name="nl" → 58 bytes exactly

Update: "nikl" → "nikl abra"
Still fits in VARCHAR(50) → overwrite safely. ✅
Size does not change.

SQL engines use B+ Trees internally to index these fixed-size rows.

NoSQL: Variable-Size Data → In-Place Updates Are Dangerous

In a NoSQL document or key-value store, sizes are completely arbitrary:

Key   → any string (variable length)
Value → any string / JSON (variable length)

From the lecture — the exact example:

Disk layout (contiguous bytes):
┌───────────┬───────────────────────┬───────────┬────────────────┐
│ Doc ID=10 │ { "name": "someone" } │ Doc ID=30 │ { "name":"N" } │
│  4 bytes  │       100 bytes       │  4 bytes  │    80 bytes    │
└───────────┴───────────────────────┴───────────┴────────────────┘
offset=0                            offset=104

Update Doc ID=10:
New value: { "name": "someone", "favorite_color": "red" }
New size = 140 bytes > original 100 bytes

Blindly overwriting:
→ Extra 40 bytes bleed into Doc ID=30's space ❌
→ Doc ID=30 data is CORRUPTED

Two bad options when data grows:

  1. Overwrite in-place → bleed into the next document and corrupt it.
  2. Shift all subsequent data to make room → rewriting huge portions of the file on every update.
“The only safe solution for variable-size data: never update in-place. Just append at the end.” — Instructor

Why disk seeks are so expensive:

Hard Disk Architecture:
┌────────────────────────────────────┐
│ Read/Write Head (physical arm)     │
│   ┌──────────────────┐             │
│   │  Spinning Disk   │ ← 6800 RPM  │
│   │  ●─────────────► │             │
│   └──────────────────┘             │
│ Track = one ring = sequential data │
└────────────────────────────────────┘

Sequential read: disk spins, head stays → FAST ✅
Changing track:  head physically moves → ~10 ms ❌
(This physical movement is called a "disk seek")

Memory Speed Hierarchy (approx):
RAM        →  100 ns   (baseline)
SSD random →  100 µs   (1,000× slower than RAM)
HDD seq.   →    1 ms   (10,000× slower)
HDD seek   →   10 ms   (100,000× slower than RAM)

Even on SSDs, sequential I/O is roughly 100× faster than random I/O; the pattern holds for every storage medium.

NoSQL Design Goals:

  → Never update variable-size data in-place.
  → Prefer sequential (append-only) disk I/O over random I/O.
  → Minimize disk seeks on both the read and write paths.

2. Brute Force: Flat File Storage

Approach: Write key-value pairs sequentially to a flat file — the simplest thing possible.

file.db (flat file on disk, variable-size entries):
──────────────────────────────────────────────────
key=001  value="V Prasad"   ← 100 bytes, offset=0
key=002  value="N"          ←  50 bytes, offset=100
key=100  value="Bit"        ←  60 bytes, offset=150
key=060  value="Bishujit"   ←  40 bytes, offset=210
key=030  value="Shashank"   ←  50 bytes, offset=250
──────────────────────────────────────────────────

Adding Append-Only Writes:

Instead of overwriting, just append a new entry at the end. If key=060 → “Bishujit” needs to become “Shashank”:

file.db (append-only):
key=001 value="V Prasad"
key=002 value="N"
key=100 value="Bit"
key=060 value="Bishujit" ← old entry, still here (stale)
key=030 value="Shashank"
key=060 value="Shashank" ← NEW entry appended at the end ✅

→ To read: scan from BOTTOM TO TOP, first hit = most recent value
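The append-only idea above can be sketched in a few lines of Python. This is a minimal illustration (not the lecture's code): a tab-separated flat file where writes always append and reads scan from the bottom up, so the first match is the most recent value. The `append`/`read` names and the tab delimiter are assumptions for the sketch.

```python
import os
import tempfile

def append(path, key, value):
    with open(path, "a") as f:
        f.write(f"{key}\t{value}\n")      # O(1): sequential append at the end

def read(path, key):
    with open(path) as f:
        lines = f.readlines()
    for line in reversed(lines):          # bottom-to-top scan: O(N)
        k, v = line.rstrip("\n").split("\t")
        if k == key:
            return v                      # first hit = most recent value
    return None                           # key never written

path = os.path.join(tempfile.mkdtemp(), "file.db")
append(path, "001", "V Prasad")
append(path, "060", "Bishujit")
append(path, "060", "Shashank")           # "update" = new entry at the end
print(read(path, "060"))                  # Shashank — the stale entry is ignored
```

Writes are now always fast and safe; the price is the O(N) scan on reads, which the next sections eliminate.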

3. Write-Ahead Log (WAL)

Every database — especially NoSQL — maintains a Write-Ahead Log (WAL): a purely append-only file on disk that records every change ever made to the database.

WAL File (on disk, append-only):
────────────────────────────────────────────────────
SET key=001 value="V Prasad"
SET key=002 value="N"
SET key=060 value="Bishujit"
SET key=060 value="Shashank" ← update appended, NOT overwritten
DEL key=030
────────────────────────────────────────────────────
← only appends; existing entries NEVER modified

Why WAL Exists

  → Crash recovery: replay the log from the beginning to rebuild the exact state.
  → Replication: followers replay the same log to stay in sync.
  → Point-in-time recovery & audit: the full history of every change is preserved.

WAL Properties

  → Append-only: existing entries are NEVER modified.
  → Sequential disk writes only: no seeks → fast.
WAL ≠ Database Snapshot. The database is the current state. The WAL is the full event history of every change that built that state.
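That "event history" framing is easy to demonstrate: replaying SET/DEL entries in order reproduces the current state. The sketch below is illustrative (the tuple format and `replay` name are assumptions, not the lecture's code):

```python
# Rebuild current database state by replaying a WAL from the start.
# Later entries for the same key naturally overwrite earlier ones.
def replay(wal_entries):
    state = {}
    for entry in wal_entries:
        op, key, *rest = entry
        if op == "SET":
            state[key] = rest[0]     # newer SET wins over older SET
        elif op == "DEL":
            state.pop(key, None)     # delete if present
    return state

wal = [
    ("SET", "001", "V Prasad"),
    ("SET", "030", "X"),
    ("SET", "060", "Bishujit"),
    ("SET", "060", "Shashank"),      # update appended, NOT overwritten
    ("DEL", "030"),
]
print(replay(wal))                   # {'001': 'V Prasad', '060': 'Shashank'}
```

This is exactly what a database does on restart after a crash: replay the WAL to recover the state that was in memory.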

4. In-Memory Hashmap: Key → Offset

Problem: Reads are still O(N) — scanning the whole WAL to find a key.

Solution: Maintain an in-memory hashmap mapping every key to its byte offset in the WAL file.

WAL File (on disk)                   Hashmap (in RAM)
──────────────────────────────       ──────────────────────
offset=0:   001 → "V Prasad"         001 → offset 0
offset=100: 002 → "N"                002 → offset 100
offset=150: 100 → "Bit"              100 → offset 150
offset=210: 060 → "Bishujit"         060 → offset 310  ← updated
offset=250: 030 → "Shashank"         030 → offset 250
offset=310: 060 → "Shashank"   (latest entry for key=060)
──────────────────────────────

def read(key):
    offset = hashmap[key]            # O(1) — pure RAM lookup
    buffer = file.read(offset, n)    # one disk seek to the exact location
    key, value = parse(buffer)
    return value

def write(key, value):
    offset = wal.current_size        # end of the WAL file
    wal.append(key, value)           # O(1) — sequential append, no seek
    hashmap[key] = offset            # O(1) — update the RAM hashmap

Why can’t we just overwrite in-place?

WAL has:  [060 | "Bishujit" | 40 bytes allocated] [030 | "Shashank"]

Update to: "Shashank" (8 chars, but maybe needs 50 bytes total with metadata)

"Bishujit" slot = 40 bytes. "Shashank" entry needs 50 bytes.
→ Overflow: 10 bytes bleed into "030 | Shashank" ❌
→ Data corruption.

Rule: NEVER overwrite. ALWAYS append.

Performance:

  Writes → O(1): one sequential append + one RAM hashmap update.
  Reads  → O(1): one RAM lookup + exactly one disk seek.

🚨 Big Problem: Hashmap Gets Too Large

1 billion keys × (avg key 50 bytes + offset 8 bytes) ≈ 58 GB hashmap

Typical server RAM: 32–256 GB (shared with everything else)
→ Trillions of keys? Impossible to fit in RAM.
→ Moving hashmap to disk? Makes reads slow again. ❌
“If you have trillions of keys and you put the hashmap on disk — life becomes slow again.” — Instructor

5. MemTable: Latest Chunk in Memory

Big Idea: Split the WAL into chunks. Store the latest (current) chunk in memory as a structured buffer — the MemTable.

Why Balanced BST (BBST) — Not a Hashmap?

MemTable = Balanced Binary Search Tree (Red-Black Tree / AVL Tree / Skip List) in RAM.
MemTable (RAM, BBST):

          key=3 (val=Y)
         /             \
  key=1 (val=W)   key=5 (val=C)
         \
    key=2 (val=X)

In-order traversal → [key=1, key=2, key=3, key=5] ← sorted ✅
Flush to disk → SSTable is automatically sorted ✅
No duplicates → each key appears once ✅

MemTable Write Flow

Write request: SET key=3, value=X

Step 1: Append to Backup WAL on disk (crash recovery insurance)
backup.wal: | ... | SET key=3 value=X |

Step 2: Insert/Update key=3 in the BBST (all in memory)
If key=3 already exists → just update the node. No duplicates.

Step 3: ACK client ✅ (ultra fast — purely in RAM)

MemTable as Read Cache

The MemTable is the hottest read cache in the system.

Read key=3:
① Check MemTable → FOUND (key=3 → X) → return X ✅ (microseconds, zero disk)

Read key=060 (old data, not in MemTable):
① Check MemTable → NOT FOUND
② Must look in SSTables on disk → (covered in Sections 9 & 10)

When MemTable is Full → Flush to SSTable

MemTable FULL (~100 MB reached):
→ Flush in-order traversal → new SSTable file on disk (sorted, immutable)
→ Delete Backup WAL (data safely persisted to disk)
→ Reset (clear) MemTable → begin fresh BBST
→ Start new Backup WAL

Because the MemTable was a BBST, its flush is automatically sorted — no extra sort step needed.
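A minimal sketch of that behavior, using Python's stdlib `bisect` to keep keys ordered (standing in for the Red-Black/AVL tree a real engine would use; the class and method names are assumptions):

```python
import bisect

class MemTable:
    def __init__(self, capacity=3):
        self.data = {}           # key -> latest value (no duplicates)
        self.keys = []           # kept sorted, like a BBST's key order
        self.capacity = capacity

    def set(self, key, value):
        if key not in self.data:
            # bisect.insort keeps the list sorted (a real BBST
            # would make this insert O(log n) instead of O(n)).
            bisect.insort(self.keys, key)
        self.data[key] = value   # existing key -> update in place

    def is_full(self):
        return len(self.keys) >= self.capacity

    def flush(self):
        # One pass in key order = an already-sorted SSTable. No sort step.
        sstable = [(k, self.data[k]) for k in self.keys]
        self.data, self.keys = {}, []   # reset for a fresh MemTable
        return sstable

mt = MemTable()
mt.set(3, "Y"); mt.set(1, "W"); mt.set(5, "C")
mt.set(3, "X")                   # update, not a duplicate entry
print(mt.flush())                # [(1, 'W'), (3, 'X'), (5, 'C')]
```

Note how the update to key=3 left exactly one entry, and the flush came out sorted for free.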

6. SSTables: Sorted String Tables

When the MemTable flushes, it creates an SSTable (Sorted String Table) — an immutable, sorted file on disk.

SSTable File (on disk, immutable):
──────────────────────────────────────────────
key=1 → value=A
key=2 → value=B
key=3 → value=C
──────────────────────────────────────────────

✅ Sorted by key
✅ No duplicates within this one file
✅ Immutable — never modified after creation
✅ Binary search is possible (it's sorted!)
❌ Duplicates CAN exist across different SSTable files

Multiple SSTables Over Time

Disk (after 3 MemTable flushes + current MemTable in RAM):

1.wl: [key=1→A, key=2→B, key=3→C] ← oldest flush
2.wl: [key=1→X, key=2→W, key=4→D, key=5→C] ← second flush
3.wl: [key=1→Y, key=2→W, key=3→X] ← newest flush

MemTable (RAM): { key=2→X, key=3→Y, key=1→W } ← latest in-memory

Key=1 appears in ALL four places → most recent wins.

Background Compaction (Merge Sort)

A background process periodically merges older SSTables:

Merge logic (keep latest value for duplicates):
key=1: in both 1.wl(→A) and 2.wl(→X) → keep 2.wl (newer) → 1→X
key=2: in both 1.wl(→B) and 2.wl(→W) → keep 2.wl (newer) → 2→W
key=3: only in 1.wl → 3→C
key=4: only in 2.wl → 4→D
key=5: only in 2.wl → 5→C

XL1.sst: [key=1→X, key=2→W, key=3→C, key=4→D, key=5→C]
Delete 1.wl and 2.wl. ✅
⚠️ Tuning Compaction is Critical in Production. When deploying Cassandra or any LSM-based store, poorly tuned compaction directly hurts read and write performance. Tune: chunk size, compaction frequency, run during lowest traffic (e.g., 3 AM).
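The merge logic above is a plain two-pointer merge (as in merge sort) with one twist: on a duplicate key, the newer table wins. A sketch, using the section's own data (the `compact` name is an assumption):

```python
# Merge two sorted SSTables (lists of (key, value) pairs).
# `newer` wins whenever both tables contain the same key.
def compact(older, newer):
    merged, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        ok, nk = older[i][0], newer[j][0]
        if ok < nk:
            merged.append(older[i]); i += 1
        elif nk < ok:
            merged.append(newer[j]); j += 1
        else:                                # duplicate key: newer value wins
            merged.append(newer[j]); i += 1; j += 1
    merged.extend(older[i:])                 # leftovers are already sorted
    merged.extend(newer[j:])
    return merged

sst1 = [(1, "A"), (2, "B"), (3, "C")]             # 1.wl (older)
sst2 = [(1, "X"), (2, "W"), (4, "D"), (5, "C")]   # 2.wl (newer)
print(compact(sst1, sst2))
# [(1, 'X'), (2, 'W'), (3, 'C'), (4, 'D'), (5, 'C')]  — matches XL1.sst
```

Because both inputs are sorted, the merge is a single O(n) sequential pass: ideal for disk.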

7. Full Dry Run: Keys 1–5

This is the exact step-by-step dry run from the lecture. Chunk size = 3.

Initial State (disk + RAM):

Disk (SSTables):
1.wl: [1→A, 2→B, 3→C] ← oldest
2.wl: [1→X, 4→D, 5→C] ← second
3.wl: [1→Y, 2→W, 3→X] ← newest

MemTable (RAM, BBST): empty
Backup WAL (disk): empty

Hashmap (RAM):
key=1 → "3.wl" (latest for key=1)
key=2 → "3.wl"
key=3 → "3.wl"
key=4 → "2.wl"
key=5 → "2.wl"
Writes arrive: SET key=1→W, SET key=2→X, SET key=3→Y
→ Each is appended to the Backup WAL, then inserted into the MemTable BBST.
→ MemTable now holds 3 keys = chunk size → full.

MemTable Full → Flush:

In-order traversal of BBST: [1→W, 2→X, 3→Y]  ← sorted automatically ✅

Flush to disk as 4.wl: [key=1→W, key=2→X, key=3→Y]

Update Hashmap:
key=1 → "4.wl" (was "3.wl")
key=2 → "4.wl" (was "3.wl")
key=3 → "4.wl" (was "3.wl")
key=4 → "2.wl" (unchanged)
key=5 → "2.wl" (unchanged)

Clear MemTable → empty BBST. Delete Backup WAL.

Background Compaction: 1.wl + 2.wl + 3.wl → XL1.sst:

Merge-sort all three SSTables, keep latest per key:

key=1: in 1.wl(→A), 2.wl(→X), 3.wl(→Y) → latest is 3.wl → 1→Y
key=2: in 1.wl(→B), 3.wl(→W) → latest is 3.wl → 2→W
key=3: in 1.wl(→C), 3.wl(→X) → latest is 3.wl → 3→X
key=4: only in 2.wl → 4→D
key=5: only in 2.wl → 5→C

XL1.sst: [1→Y, 2→W, 3→X, 4→D, 5→C]

Delete 1.wl, 2.wl, 3.wl.

Final disk state:
XL1.sst: [1→Y, 2→W, 3→X, 4→D, 5→C] ← compacted (older)
4.wl: [1→W, 2→X, 3→Y] ← newest flush

8. LSM Tree: The Full Architecture

Multiple levels of SSTables form a Log-Structured Merge Tree (LSM Tree):

                  ┌──────────────────────────────────┐
                  │ MemTable (RAM, BBST)             │ ← ALL writes go here
                  │ ~100 MB, no duplicates           │
                  └─────────────────┬────────────────┘
                                    │ flush when full
                  ┌─────────────────▼────────────────┐
Level 0 (disk)    │ sst1  sst2  sst3  sst4           │ ← small, ~100 MB each
                  │ sorted, may have overlapping keys│
                  └─────────────────┬────────────────┘
                                    │ compaction (merge sort)
                  ┌─────────────────▼────────────────┐
Level 1 (disk)    │ sst_A  sst_B                     │ ← larger, ~200–500 MB
                  │ sorted, non-overlapping keys     │
                  └─────────────────┬────────────────┘
                                    │ compaction
                  ┌─────────────────▼────────────────┐
Level 2 (disk)    │ sst_XL                           │ ← 1 GB+, fully merged
                  │ no duplicates at all             │
                  └──────────────────────────────────┘

WAL (disk, separate): crash recovery only; deleted after each MemTable flush.

Complexity:

  Write → O(log n) MemTable insert (RAM) + O(1) sequential WAL append.
  Read  → check the MemTable, then up to O(log N) SSTables, each costing
          an in-RAM binary search plus at most one disk seek.
“The number of SS tables is kept small by the compaction process — it’s order log(N).” — Instructor

9. Sparse Index: Memory-Efficient Lookup

Problem: The hashmap of key → SSTable name has one entry per key. Billions of keys = hashmap too big for RAM.

Solution: For each SSTable, maintain a Sparse Index in memory — store only the first key of every 64 KB block.

Building the Sparse Index

SSTable on disk (1 GB, sorted):

Block 0 (offset=0, 64 KB): key=0 ... key=100
Block 1 (offset=65536, 64 KB): key=101 ... key=2010
Block 2 (offset=131072, 64 KB): key=2011 ... key=5000
...
Block 16383 (last, 64 KB): key=... ... key=N

Sparse Index (stored in RAM — one entry per block):
┌──────────────────────────┐
│ key=0    → offset 0      │
│ key=101  → offset 65,536 │
│ key=2011 → offset 131,072│
│ ...                      │
└──────────────────────────┘

Total blocks = 1 GB / 64 KB = 2^14 = 16,384 ≈ 16,000
Average key size = 64 bytes
Sparse Index size = 16,000 × 64 bytes ≈ 1 MB

vs. full key index for 10M keys → hundreds of MB

→ 1000× reduction in index memory! 🎉
“If this SS table is 1 GB, this sparse index is only 1 MB — a 1000x reduction!” — Instructor

Reading with Sparse Index (2-Step Binary Search)

Example: Read key=300

Step 1: Binary search the Sparse Index (purely in RAM):
Entries: key=0(offset=0), key=101(offset=65536), key=2011(offset=131072)

key=101 ≤ 300 < key=2011
→ key=300 must be in Block 1, starting at offset=65536

Step 2: One disk seek to offset=65536 → read 64 KB block into RAM

Step 3: Binary search within the 64 KB block (in RAM):
Block 1: [key=101, key=150, key=200, key=300, key=400...]
→ Found: key=300 → return value ✅

Cost: 1 disk seek per SSTable checked.
“64 KB is exactly the size of one track on a spinning disk — so reading a 64 KB block = exactly one disk seek. As efficient as it gets.”
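The two-step lookup can be sketched with stdlib `bisect`, using the section's block layout. Plain Python lists stand in for 64 KB disk reads, and the `lookup` name and sample values in the blocks are assumptions for the sketch:

```python
import bisect

sparse_index = [(0, 0), (101, 1), (2011, 2)]    # (first_key_of_block, block_no)
blocks = {                                       # each dict entry ≈ one 64 KB block
    0: [(0, "a"), (50, "b"), (100, "c")],
    1: [(101, "d"), (150, "e"), (300, "f"), (2010, "g")],
    2: [(2011, "h"), (5000, "i")],
}

def lookup(key):
    # Step 1: binary search the sparse index (pure RAM) — find the
    # rightmost entry whose first_key <= key.
    firsts = [k for k, _ in sparse_index]
    pos = bisect.bisect_right(firsts, key) - 1
    if pos < 0:
        return None                      # key smaller than every block
    block = blocks[sparse_index[pos][1]] # Step 2: ONE "disk seek" to the block
    # Step 3: binary search within the block (now in RAM).
    keys = [k for k, _ in block]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return block[i][1]
    return None                          # in range, but absent

print(lookup(300))    # 'f'  — found in Block 1 via one seek
print(lookup(999))    # None — Block 1's range, but not present
```

Only Step 2 touches disk; everything else is in-memory binary search.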

10. Bloom Filters: Skip Unnecessary Reads

Problem: When a key doesn’t exist anywhere, we still check every SSTable’s sparse index and do a disk seek per SSTable — only to find nothing.

Solution: A Bloom Filter — a probabilistic, fixed-size bit array that tells you whether a key was ever inserted.

Bloom Filter — Core Idea

“A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.” — Bloom Filters by Example

Two operations only:

  1. insert(key) — mark this key as present
  2. contains(key) → TRUE (maybe present) or FALSE (definitely absent)

Guarantees:

If key IS present     → ALWAYS returns TRUE   (zero false negatives) ✅
If key IS NOT present → usually returns FALSE ✅
                        tiny chance of TRUE (false positive) ⚠️
                        → tunable rate (e.g., 1%)

Size NEVER grows regardless of how many keys are inserted. ← The Magic!
“In case of a set or hashmap, as you insert more data, the size will grow. In case of a bloom filter, the size does not change at all. That is magic.” — Instructor

How a Bloom Filter Works — The Exact Lecture Example

Setup:

Bloom Filter = bit array of size M=12, all zeros:

Index:   0   1   2   3   4   5   6   7   8   9  10  11
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
Bits:  │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │ 0 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

K = 3 hash functions: h1, h2, h3

INSERT “Nikl” — pass through all 3 hash functions, set those bits:

h1("Nikl") = 2 → bit[2] = 1
h2("Nikl") = 3 → bit[3] = 1
h3("Nikl") = 9 → bit[9] = 1

Index:   0   1   2   3   4   5   6   7   8   9  10  11
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
Bits:  │ 0 │ 0 │ 1 │ 1 │ 0 │ 0 │ 0 │ 0 │ 0 │ 1 │ 0 │ 0 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
                 ↑   ↑                       ↑
                   "Nikl" bits

Did the size of the filter change? ❌ No. Still M=12 bits.

INSERT “Vishal”:

h1("Vishal") = 2  → bit[2] already 1 (no change)
h2("Vishal") = 5  → bit[5] = 1
h3("Vishal") = 11 → bit[11] = 1

Index:   0   1   2   3   4   5   6   7   8   9  10  11
       ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
Bits:  │ 0 │ 0 │ 1 │ 1 │ 0 │ 1 │ 0 │ 0 │ 0 │ 1 │ 0 │ 1 │
       └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

CHECK “V Prasad” (never inserted) — hash → {1, 3, 7}:

bit[1] = 0  ← STOP! Any zero bit = DEFINITE NO ✅

→ "V Prasad" is NOT in this SSTable.
→ Skip this SSTable — no disk seek needed! 🎉

CHECK “Abhishek” (never inserted) — hash → {3, 9, 11}:

bit[3]  = 1 ✓
bit[9]  = 1 ✓
bit[11] = 1 ✓ ← all bits are 1!

→ Bloom Filter says: "MAYBE EXISTS" ⚠️
→ Must search the SSTable on disk
→ Not found → FALSE POSITIVE (rare, tunable)
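The whole mechanism fits in a tiny class. This is a hedged sketch, not a production filter: the lecture's h1/h2/h3 are abstract, so here the K hash functions are derived by salting a single SHA-256 hash (a common trick); class and method names are assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=12, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m               # fixed size: NEVER grows

    def _positions(self, key):
        # Derive k bit positions from k salted hashes of the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1            # set bits; nothing is ever unset

    def contains(self, key):
        # Any zero bit → DEFINITELY absent. All ones → MAYBE present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.insert("Nikl")
bf.insert("Vishal")
print(bf.contains("Nikl"))     # True — inserted keys always return True
```

For a never-inserted key, `contains` usually returns False, but with M=12 bits and only three hashes per key the false-positive chance is visible; real deployments use millions of bits.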

False Positive Formula

From llimllib.github.io/bloomfilter-tutorial:

P(false positive) ≈ (1 - e^(-k·n/m))^k

Variables:
k = number of hash functions
n = number of keys inserted into the filter
m = number of bits in the filter (= M)

Optimal number of hash functions:
k_optimal = (m/n) × ln(2) ← choose this k for minimum false positives

From the lecture’s calculator demo (hur.st/bloomfilter):

n = 1,000,000,000   (1 billion keys)
p = 0.01            (1% false positive rate — target)
k = 7               (optimal number of hash functions)
m ≈ 9.6 billion bits ≈ 1.2 GB (the lecture rounds this to 1 GB)

→ False positive rate ≈ 1% ✅

vs. storing 1 billion keys in a hashmap:
1 billion × 100 bytes/key = 100 GB ← won't fit in RAM ❌
Bloom filter: just 1 GB ← trivially fits ✅

Space savings: ~100× 🎉

Choosing the Right Filter Size (from the Tutorial)

To configure a Bloom filter for your application:

  1. Choose n — estimate how many keys you'll insert
  2. Choose m — pick a bit-array size you can afford in memory
  3. Calculate optimal k = (m/n) × ln(2)
  4. Calculate false positive rate = (1 - e^(-k·n/m))^k
  5. If false positive rate is too high → increase m and repeat

Key insight: A larger filter = fewer false positives. Time complexity: O(k) for both insert and lookup — always constant regardless of how many elements are stored.
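The sizing steps above reduce to two formulas, so they are easy to script. A small calculator (the function name and the 1 GB example budget are assumptions; the formulas are the tutorial's):

```python
import math

# k_opt = (m/n) * ln 2        (optimal number of hash functions)
# p     = (1 - e^(-k*n/m))^k  (resulting false positive rate)
def bloom_params(n, m):
    k = max(1, round((m / n) * math.log(2)))
    p = (1 - math.exp(-k * n / m)) ** k
    return k, p

n = 1_000_000_000        # 1 billion keys
m = 8 * 1024**3          # a 1 GB bit budget ≈ 8.59 billion bits
k, p = bloom_params(n, m)
print(k, p)              # k = 6, p ≈ 0.016 (≈ 1.6% false positives)
```

To hit the lecture's 1% target exactly you would nudge m up toward ~9.6 billion bits (≈ 1.2 GB), which pushes k to 7, matching the calculator demo.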

Real-World Use Cases

1. CDN (Akamai / Cloudflare) — from the lecture:

Fact: 70% of URLs on the internet are visited only ONCE.
CDNs cache a URL only after it has been visited TWICE.

Problem:
Store all seen URLs to detect second-visit:
Trillion URLs × 500 bytes/URL = 500 TB hashmap ❌

Solution with Bloom Filter:
First visit → insert URL into Bloom Filter (don't cache yet)
Second visit → Bloom Filter says "seen before" → cache it ✅

Filter size: a few GB (vs 500 TB) — 100,000× smaller!

2. Real databases and systems using Bloom Filters (from the tutorial): RocksDB, Cassandra, ScyllaDB, SQLite, Chromium, and Spark all use Bloom filters to skip unnecessary lookups.
11. Tombstones: How Deletion Works

Problem: SSTables are immutable, so an entry can never be removed in place. And the Bloom filter has no delete operation: once a key's bits are set, it will say "maybe exists" forever.

Solution: Tombstones — a special sentinel value meaning “this key has been deleted.”

Deletion = Write a Tombstone

DELETE key=060

Step 1: Write tombstone to MemTable:
MemTable: key=060 → TOMBSTONE (a special marker value)

Step 2: Tombstone flushes to SSTable just like any normal value:
sst_X: [..., key=060 → TOMBSTONE, ...]

Step 3: On any future read for key=060:
Value = TOMBSTONE → return "key not found" to client

Why Not Just Delete From the Bloom Filter?

Bloom Filter: NO delete operation exists.

"Bloom filter never forgets." — Instructor

When key=060 was first written → some bits (e.g., 3, 9, 11) were set.
Even after the key is deleted:
→ Those bits stay 1.
→ Bloom Filter will ALWAYS say "key=060 maybe exists".

Without a tombstone:
Every read for deleted key=060:
① Bloom filter: "maybe"
② Scan ALL SSTables (waste of disk I/O!) ← bad ❌

With a tombstone:
Every read for deleted key=060:
① Bloom filter: "maybe"
② Seek into newest SSTable
③ Find TOMBSTONE immediately
④ Return "not found" — no need to check older SSTables ✅

Tombstone During Compaction

When compaction merges SSTables, the tombstone (being newer) supersedes every older value for that key. Once no older SSTable can still contain the key, the tombstone itself is dropped from the merged output: that is the moment the data is finally, physically deleted.

12. Full Read & Write Flow

Write Flow

① Append the operation to the Backup WAL on disk (crash insurance).
② Insert/update the key in the MemTable (BBST in RAM).
③ ACK the client (no disk seek on the write path).
④ When the MemTable fills (~100 MB): flush it as a new sorted SSTable, delete the Backup WAL, start a fresh MemTable.

Read Flow

① Check the MemTable (RAM) → if found, return immediately (a TOMBSTONE means "not found").
② For each SSTable, newest to oldest:
   a. Ask its Bloom filter → "definitely absent" → skip this SSTable entirely.
   b. "Maybe present" → binary search the sparse index (RAM) → one disk seek → binary search inside the 64 KB block.
③ First hit wins; stop searching older SSTables.

Performance Summary

Writes: O(1) sequential append plus an in-RAM insert, with no seeks.
Reads: at most one disk seek per SSTable actually checked; Bloom filters skip almost every SSTable that doesn't hold the key, and compaction keeps the SSTable count at O(log N).

13. Summary Cheat Sheet

NoSQL Internals — LSM Tree Architecture at a Glance

WAL (Write-Ahead Log)
  Append-only log on disk. Crash recovery, replication, point-in-time
  recovery. Immutable. Deleted after each MemTable flush.

MemTable
  BBST in RAM. Absorbs ALL writes. Acts as read cache. No duplicates.
  In-order dump = sorted. Flushed to SSTable when full (~100 MB).

SSTable (Sorted String Table)
  Immutable sorted file on disk. No duplicates within one file. Created
  from a MemTable flush. Never appended to again.

LSM Tree (Log-Structured Merge Tree)
  Multi-level hierarchy of SSTables: small at the top, large merged files
  at the bottom. Compaction keeps the SSTable count at O(log N).

Compaction
  Background merge of SSTables via merge sort. Keeps the latest value per
  key, removes duplicates. Critical to tune in production (timing, size).

Sparse Index
  Per-SSTable in-memory index, one entry per 64 KB block. 1 GB SSTable →
  1 MB index (1000× smaller). Enables 2-step binary search into SSTables.

Bloom Filter
  Fixed-size bit array (M bits, K hash functions). Zero false negatives,
  tunable false positive rate. 1B keys → ~1 GB filter at 1% FP. Size never
  grows; O(k) insert/lookup. Real DBs: RocksDB, Cassandra, ScyllaDB,
  SQLite, Chromium, Spark.

Tombstone
  Special delete marker: DELETE key → write(key, TOMBSTONE). Needed because
  SSTables are immutable and the Bloom filter never forgets. Physically
  removed during later compaction.

Databases That Use LSM Trees

Cassandra, RocksDB, and ScyllaDB (the LSM-based stores named in this article) all follow this MemTable → SSTable → compaction design.
📚 Related Topics:
* SQL vs NoSQL Ultimate Guide
* External Resources:
-> Bloom Filters by Example — interactive tutorial
-> Bloom Filter Calculator — tune n, m, k, p
* Wikipedia — Bloom Filter — comprehensive reference
* Assignment from this class: Bloom Filter implementation
* Next class: Case Study — Designing a Messaging App

Notes based on llimllib.github.io/bloomfilter-tutorial and hur.st/bloomfilter.

