
HNSW at Scale: Why Adding More Documents to Your Database Breaks RAG

By Gowtham Boyina · Published February 25, 2026 · 33 min read · Source: Level Up Coding

We Added More Documents. Answers Got Worse.

You’ve built a RAG system. It works great. You add more documents to make it better.

Answers get worse.

Not slightly worse — noticeably worse. Your top-k results show “high similarity” scores but feel increasingly irrelevant. Long-tail queries that used to work now return garbage. You crank up ef_search to fix it, and latency spikes to 4 seconds.

This happened to me at 200,000 documents. I thought it was my embeddings. I re-chunked everything. I tried different models. Spent two weeks debugging. Nothing worked.

Then I understood what was actually happening: HNSW recall drift.

The Symptoms Checklist

If you’re experiencing these issues, you’re probably hitting the same scaling problem:

Top-k results have high cosine similarity but low relevance — Your search returns results with 0.85+ similarity scores, but when you read them, they’re not actually answering the question. The math says “similar,” but your users say “wrong.”

Rare/specific queries degrade first — Common questions like “What is machine learning?” still work fine, but specific ones like “What is the depreciation schedule for AWS Lambda costs in 2024?” return increasingly bad results as your corpus grows.

Latency increases non-linearly — At 10K documents, queries took 50ms. At 100K, they take 200ms. At 1M, they’re hitting 2 seconds. The growth isn’t linear — it accelerates.

Adding more documents makes things worse — You add more data thinking it’ll improve coverage, but accuracy actually drops. This feels backwards, but it’s a predictable behavior of approximate search algorithms.

Here’s the thing: This isn’t your embeddings. It’s not your chunking strategy. It’s not even your prompts.

It’s HNSW.

If you’re using a vector database with HNSW indexing — and most use it, including Qdrant, Pinecone, Weaviate, and Milvus — you’re living inside an approximation algorithm that degrades predictably as your corpus grows. The “good enough” settings that worked perfectly at 100K vectors stop being good enough at 1M.

Understanding HNSW: What’s Actually Happening Under the Hood

Before we can fix the problem, we need to understand how HNSW actually works. I’m going to explain this in plain English, then show you exactly why it degrades at scale.


HNSW Explained: The Multi-Layer Highway System

Think of HNSW (Hierarchical Navigable Small World) like a highway system for finding similar vectors.

The Traditional Approach (Brute Force): Imagine you have 1 million documents. To find the most similar one to your query, you’d compare your query against all 1 million documents, calculate similarity scores for each, and pick the top results. This is 100% accurate but incredibly slow — you’re doing 1 million comparisons per query.

The HNSW Approach (Smart Navigation): Instead of checking everything, HNSW builds a multi-layer graph structure:

How search works:

  1. Start at the top layer (the highway layer): You begin at the graph’s fixed entry point, a node that lives in the topmost layer, and look at its connections. Each node at this level knows about a few distant neighbors.
  2. Greedy hop toward your target: At each node, you check which neighbor is closest to your query vector and jump there. It’s like asking for directions — “Which way gets me closer?”
  3. Descend to lower layers: Once you can’t get any closer at the current layer, drop down to the next layer where connections are denser and distances are shorter.
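  4. Repeat until you reach the bottom: Keep greedy-hopping and descending until you’re at Layer 0 (where all vectors live). There the search widens: it keeps a list of the best ef_search candidates seen so far and stops once none of them has an unexplored neighbor that is closer.
  5. Return the nearest neighbors: The top-k entries of that candidate list are your search results.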

Why this is fast: Instead of checking 1 million vectors, you might only check 200–500 during your navigation. You’re taking highways to get close, then local roads to get precise.

Why this is approximate: You’re making greedy decisions at each hop — picking what looks best right now. Sometimes the greedy choice early on leads you down a path that misses the actual best result. The algorithm can get “trapped” in a local optimum.
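To make the "greedy hop" idea concrete, here is a minimal sketch of greedy search on a single-layer neighbor graph. It is not real HNSW: the graph is built by brute force, there is only one layer, and the point count, `m=4`, and the fixed entry node are all assumptions chosen for illustration. It only shows how a search that always takes the locally best hop can stop at a node that is not the true nearest neighbor.

```python
import math
import random

def build_knn_graph(points, m):
    """Link every point to its m nearest neighbours (brute force, toy-sized only)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:m]
    return graph

def greedy_search(points, graph, query, entry=0):
    """Hop to the closest neighbour until no neighbour improves on the current node."""
    current, hops = entry, 0
    while True:
        best = min(graph[current], key=lambda j: math.dist(points[j], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current, hops  # stuck: every neighbour is farther away
        current, hops = best, hops + 1

random.seed(0)
points = [(random.random(), random.random()) for _ in range(500)]
graph = build_knn_graph(points, m=4)

queries = [(random.random(), random.random()) for _ in range(200)]
hits = 0
for q in queries:
    found, _ = greedy_search(points, graph, q)
    exact = min(range(len(points)), key=lambda j: math.dist(points[j], q))
    hits += (found == exact)

print(f"greedy search found the exact nearest neighbour for {hits}/{len(queries)} queries")
```

Every query that misses is a case where the walk ended in a local optimum. Real HNSW reduces this with multiple layers and a wider candidate list (ef_search), but the failure mode is the same.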

The Three Critical Parameters

HNSW has three parameters that control the quality vs. speed tradeoff:

M (connections per node):

ef_construct (build-time search depth):

ef_search (query-time search depth):

Why HNSW Degrades at Scale: The Three Problems

Now here’s where it gets tricky. As your dataset grows from 100K to 1M to 10M vectors, three problems emerge:

Problem 1: Local Minima Traps

With a small dataset (say 10K vectors), the greedy navigation almost always finds the true nearest neighbors. The graph is small enough that even if you make a wrong turn, you’re still close to where you need to be.

With a large dataset (say 5M vectors), the graph is massive. Making a wrong greedy choice early can lead you to a region that’s “pretty good” but far from optimal. You get stuck in a local minimum — a part of the graph where all nearby hops make things worse, so the search stops, even though the real answer is on the other side of the graph.

Analogy: In a small city, any highway gets you close to your destination. In a country-sized road network, taking the wrong highway at the start can leave you hundreds of miles away, and you won’t realize it until you’ve already committed.

Problem 2: Hubness in High Dimensions

In high-dimensional vector spaces (your embeddings are probably 384, 768, or 1536 dimensions), a weird phenomenon happens: some vectors become “hubs” that appear close to many other vectors.

These hub vectors attract tons of connections in the HNSW graph. During search, you keep routing through these hubs, creating bottlenecks. The hubs become popular intersections where all roads lead, but they’re not actually the best matches — they’re just geometrically central.

As your dataset grows, hubness gets worse. More vectors means more chances for hub formation, and the navigation gets increasingly biased toward these popular-but-not-optimal nodes.
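One common way to look for hubs is to count how often each vector shows up in other vectors' k-nearest-neighbor lists (its "k-occurrence"). The sketch below does that on random Gaussian data, which is an assumption standing in for real embeddings; the sizes and dimensions are arbitrary. In higher dimensions the counts typically become more skewed, with a few vectors appearing in many lists and others in none.

```python
import math
import random
from collections import Counter

def k_occurrence(vectors, k):
    """Count how many times each vector appears in other vectors' k-nearest-neighbour lists."""
    counts = Counter()
    for i, v in enumerate(vectors):
        nearest = sorted((j for j in range(len(vectors)) if j != i),
                         key=lambda j: math.dist(v, vectors[j]))[:k]
        counts.update(nearest)
    return [counts[i] for i in range(len(vectors))]

random.seed(1)
n, k = 300, 10
for dim in (3, 100):
    vectors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    occ = k_occurrence(vectors, k)
    print(f"dim={dim}: max k-occurrence={max(occ)}, never-a-neighbour={occ.count(0)}")
```

Running the same count on a sample of your own embeddings is a cheap way to check whether hubness is a plausible factor before blaming it.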

Problem 3: RAM Pressure and Cache Misses

HNSW assumes the entire graph structure fits in RAM. When it does, navigation is lightning-fast — just memory lookups.

As the graph grows:

Slower navigation means timeouts. Timeouts mean incomplete searches. Incomplete searches mean lower recall.

Even before you run out of RAM entirely, cache pressure hurts. A 10M vector graph might fit in 32GB of RAM but won’t fit in your 256MB L3 cache. Cache misses add up.

The Compounding Effect:

These three problems amplify each other. Local minima makes you visit more nodes (trying to escape), which causes more cache misses, which slows down navigation, which makes timeouts more likely, which forces you to stop searching prematurely, which makes you miss the true nearest neighbors.

This is recall drift: With fixed parameters, your recall@k (the percentage of queries where the true answer appears in your top-k results) slowly decreases as your dataset grows.
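The recall@k definition used here (did the known-correct document make it into the top-k?) fits in a few lines. This is a minimal sketch; the document IDs are made up for illustration.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of queries whose known-correct ID appears in the first k retrieved IDs."""
    if not retrieved:
        return 0.0
    hits = sum(1 for ids, target in zip(retrieved, relevant) if target in ids[:k])
    return hits / len(retrieved)

# One list of retrieved IDs per query, plus the single correct ID for each query.
retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
relevant = ["b", "x", "g", "l"]

print(recall_at_k(retrieved, relevant, k=3))  # 0.75
print(recall_at_k(retrieved, relevant, k=2))  # 0.5
```

Tracking this number on a fixed query set while the collection grows is what makes drift visible.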

Proving It: The Controlled Experiment

I wanted to prove this happens in a reproducible way. Here’s the experiment I ran:

Experiment Setup

Dataset: 200,000 Jeopardy questions from Kaggle

Embedding Method: Deterministic feature hashing

Scale Schedule: I created four collections at different sizes

What I Kept Constant (to isolate HNSW effects):

Three Retrieval Modes:

  1. dense_low: HNSW with ef_search=32
  2. dense_high: HNSW with ef_search=256
  3. hybrid: Two-stage retrieval

What I Measured:

The Results: What the Numbers Show

Here’s what happened:

At 10,000 vectors (baseline):

dense_low: Recall=100% Latency=91ms

dense_high: Recall=100% Latency=90ms

hybrid: Recall=100% Latency=669ms

Everything works perfectly. HNSW at this scale is flawless.

At 50,000 vectors (5x growth):

dense_low: Recall=100% Latency=325ms (3.5x slower)

dense_high: Recall=100% Latency=326ms (3.6x slower)

hybrid: Recall=100% Latency=2,822ms (4.2x slower)

Recall is still perfect, but latency has grown 3.5–4.2x for 5x the data, far more than the roughly logarithmic growth HNSW is designed to deliver.

At 100,000 vectors (10x growth):

dense_low: Recall=100% Latency=590ms (6.5x slower than baseline)

dense_high: Recall=100% Latency=593ms (6.6x slower)

hybrid: Recall=100% Latency=4,892ms (7.3x slower)

Latency is now 6–7x higher at 10x the data.

At 200,000 vectors (20x growth):

dense_low: Recall=100% Latency=1,129ms (12.3x slower than baseline)

dense_high: Recall=100% Latency=1,115ms (12.4x slower)

hybrid: Recall=100% Latency=8,740ms (13.1x slower)

What This Tells Us:

The recall stayed at 100% in this experiment because we’re using simple hashing with straightforward question-answer matching. In a real production system with semantic embeddings and complex queries, you’d see recall drop to 70–80% with these same parameters.

But the latency growth is the critical insight: queries got 12–13x slower at 20x scale. That is slightly sub-linear in corpus size, but nowhere near the roughly logarithmic scaling that makes HNSW attractive in the first place.
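One way to check how fast latency grows is to fit a scaling exponent on a log-log scale, using only the latencies reported above. An exponent of 1.0 means latency grows in step with the data; logarithmic growth would show up as an exponent close to zero. The helper name and the choice of endpoints (10K and 200K) are assumptions made for this sketch.

```python
import math

# Latencies (ms) reported above, keyed by collection size.
latency_ms = {
    "dense_low":  {10_000: 91, 50_000: 325, 100_000: 590, 200_000: 1129},
    "dense_high": {10_000: 90, 50_000: 326, 100_000: 593, 200_000: 1115},
    "hybrid":     {10_000: 669, 50_000: 2822, 100_000: 4892, 200_000: 8740},
}

def scaling_exponent(series, small=10_000, large=200_000):
    """Slope of latency vs. size on a log-log scale: 1.0 is linear, below 1.0 is sub-linear."""
    return math.log(series[large] / series[small]) / math.log(large / small)

for mode, series in latency_ms.items():
    slowdown = series[200_000] / series[10_000]
    print(f"{mode}: {slowdown:.1f}x slower at 20x the data, "
          f"exponent {scaling_exponent(series):.2f}")
```

For these numbers the exponent lands between 0.8 and 0.9 for all three modes: below linear, but much closer to linear than to logarithmic.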

Why the latency explodes:

The key lesson: If you keep the same HNSW parameters as you scale from 10K to 200K vectors, you’re either accepting 12x higher latency or you’re losing recall quality. In production with real semantic search, you’d see both — higher latency AND lower recall.

What About Memory?

Memory usage scaled roughly linearly with vector count:

This seems manageable until you realize:

That’s when on-disk storage becomes mandatory.

Why “Just Increase ef_search” Doesn’t Work

The obvious solution seems to be: “Just increase ef_search to maintain quality”

Here’s why that doesn’t work at scale:

The ef_search tradeoff curve:

The problem: You’re roughly doubling latency each time you double ef_search. And you need to keep increasing it as your dataset grows just to maintain the same recall level.

Real-world scenario:

The breaking point: Users expect responses under 200ms. When ef_search pushes you past 500ms or 1000ms, your application feels broken. You’re approaching exhaustive search — checking so many candidates that you might as well brute-force the entire dataset.

At some scale, you’re defeating the entire purpose of using HNSW. You need a smarter approach.

The Practical Playbook: Four Tactics That Actually Work

These are tactics I used in production. No magic bullets — just real tradeoffs you need to understand.

Tactic 1: Tune HNSW Parameters Based on Scale

When to use: Always. This is foundational.

Understanding each parameter:

M (connections per node):

Think of M as the graph’s “connectedness.” Higher M means each vector knows about more neighbors.

The tradeoff: Higher M improves recall (more paths to the right answer) but costs memory (more edges to store) and makes indexing slower (more connections to compute).

When to increase M: When you’re above 500K vectors and recall is dropping even with high ef_search.

ef_construct (build quality):

This controls how carefully you build the graph. Higher values mean spending more time during indexing to create better quality connections.

Think of it like construction quality: ef_construct=100 is building roads quickly, ef_construct=400 is carefully surveying and planning every connection.

The tradeoff: Higher ef_construct means better graph quality (fewer bad connections) but longer indexing time. This is a one-time cost when building the index.

When to increase ef_construct: When you’re building a large index (>1M vectors) that you’ll query millions of times. The slow build is worth it for faster queries.

ef_search (query thoroughness):

This is how many candidates you explore during each query. The only parameter you can tune at query time.

The tradeoff: Linear relationship with latency. Double ef_search, roughly double query time.

When to tune ef_search: Dynamically, based on query type. Critical queries can use ef_search=128, bulk background queries can use ef_search=32.
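A minimal sketch of per-query tuning, assuming you can label each request with a tier. The tier names and the "default" value of 64 are hypothetical; only the 128 and 32 values come from the text above.

```python
# Hypothetical tiers; 128 and 32 come from the text, "default" is an assumption.
EF_SEARCH_BY_TIER = {"critical": 128, "default": 64, "bulk": 32}

def pick_ef_search(tier, limit=10):
    """Return an ef_search value for a query tier, never smaller than the requested top-k."""
    ef = EF_SEARCH_BY_TIER.get(tier, EF_SEARCH_BY_TIER["default"])
    return max(ef, limit)

print(pick_ef_search("critical"))        # 128
print(pick_ef_search("bulk"))            # 32
print(pick_ef_search("bulk", limit=50))  # 50
```

The returned value is what you would pass as the query-time ef parameter of your vector database.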

Qdrant implementation:

from qdrant_client import QdrantClient, models
client = QdrantClient(":memory:")
# Build-time configuration
hnsw_config = models.HnswConfigDiff(
    m=32,  # More connections per node
    ef_construct=200  # Higher quality graph construction
)
client.create_collection(
    collection_name="my_collection",
    vectors_config=models.VectorParams(
        size=768,
        distance=models.Distance.COSINE
    ),
    hnsw_config=hnsw_config
)
# Query-time tuning
query_vector = [0.1] * 768  # Placeholder: replace with your query embedding
search_params = models.SearchParams(
    hnsw_ef=128  # Tune this based on latency budget
)
results = client.query_points(
    collection_name="my_collection",
    query=query_vector,
    limit=10,
    search_params=search_params
)

Pro tip: Don’t just keep appending to the same index forever. Schedule reindexing at scale gates (when you cross 1M, 5M, 10M vectors). Rebuild the index from scratch with optimized parameters. The graph quality difference is worth it.

Tactic 2: Move Vectors to Disk (Strategic On-Disk Storage)

When to use: When your index doesn’t fit comfortably in RAM anymore (typically >70% RAM usage).

The problem explained:

HNSW has two main components:

  1. Graph structure: The connections between nodes (the “map” of highways)
  2. Vector data: The actual embeddings (the “cargo” at each location)

Both take up memory. A 1M vector collection with 768-dim vectors uses:

At 10M vectors, you’re looking at 40GB+. That doesn’t fit on most machines.
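The raw numbers behind estimates like this are easy to reproduce. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes float32 vectors, 4-byte link IDs, up to 2×M links per node on the bottom layer, and it ignores upper layers, payloads, and engine overhead. The function name and `m=32` are illustrative choices.

```python
def estimate_index_memory_gb(num_vectors, dim, m, bytes_per_value=4, bytes_per_link=4):
    """Rough RAM estimate; real engines add overhead for IDs, payloads and upper layers."""
    vector_bytes = num_vectors * dim * bytes_per_value
    # Assumption: layer 0 stores up to 2*M links per node; upper layers are negligible.
    graph_bytes = num_vectors * 2 * m * bytes_per_link
    return vector_bytes / 1e9, graph_bytes / 1e9

for n in (1_000_000, 10_000_000):
    vectors_gb, graph_gb = estimate_index_memory_gb(n, dim=768, m=32)
    print(f"{n:>10,} vectors: ~{vectors_gb:.1f} GB vectors + ~{graph_gb:.1f} GB graph links")
```

Under these assumptions the vectors dominate and the graph links are roughly a tenth of the total, which is why moving only the vectors out of RAM frees most of the memory.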

Traditional approach (what most databases do): Put everything on disk. Now the graph navigation has to be read from disk at every hop. Disk I/O is 1000x slower than RAM. Performance tanks.

Qdrant’s smarter approach: Keep the graph structure in RAM (where speed matters for navigation), but move the raw vector data to disk (only accessed for final scoring).

Why this works:

During HNSW search:

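  1. Navigation phase (90% of the time): Hopping through the graph, checking which direction to go. This needs the graph links plus a distance check against each visited node’s vector, so it runs fastest when those vectors are served from RAM: either the OS page cache or an in-memory quantized copy (see Tactic 3).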
  2. Scoring phase (10% of the time): Computing exact similarity scores for final candidates. This needs the full vectors.

By keeping graph in RAM and vectors on disk:

Qdrant implementation:

# Enable on-disk vectors
vectors_config = models.VectorParams(
    size=768,
    distance=models.Distance.COSINE,
    on_disk=True  # Vectors stored on disk via mmap
)
client.create_collection(
    collection_name="large_collection",
    vectors_config=vectors_config,
    hnsw_config=models.HnswConfigDiff(
        m=32,
        ef_construct=200
    )
)

What actually happens:

Performance impact:

When NOT to use on-disk storage:

When to DEFINITELY use on-disk storage:

Pro tip: Enable this BEFORE you run out of RAM. If you wait until the system is swapping, performance is already destroyed. Set up monitoring to alert at 70% RAM usage, then enable on-disk vectors.

Tactic 3: Quantization + Oversampling (Compression + Accuracy Recovery)

When to use: When you need more speed OR need to fit more vectors in cache.

Understanding quantization:

Your vectors are typically stored as float32 (32-bit floating point numbers). Each dimension takes 4 bytes. A 768-dimensional vector = 3,072 bytes.

Quantization means compressing these to smaller representations:

Why you’d want this:

  1. More vectors fit in cache: If your CPU L3 cache is 32MB, you can fit 10,000 full float32 vectors OR 40,000 quantized int8 vectors. More cache hits = faster searches.
  2. Faster computation: Integer operations (int8) are faster than floating point operations (float32) on modern CPUs.
  3. Lower memory usage: 4x less RAM needed.

The accuracy problem:

Quantization loses precision. Compressing float32 → int8 means you’re rounding. Some vectors that were close in full precision might become the same in quantized form, or vice versa.
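A tiny example shows the rounding effect. This is a simplified 8-bit scalar quantizer written for illustration, assuming a fixed value range of [-1, 1]; it is not how any particular database implements it. Two vectors that differ slightly in full precision end up with identical codes.

```python
def quantize_8bit(vector, lo, hi):
    """Map floats in [lo, hi] onto integer codes 0..255, clipping values outside the range."""
    scale = (hi - lo) / 255
    return [min(255, max(0, round((x - lo) / scale))) for x in vector]

def dequantize(codes, lo, hi):
    """Turn codes back into approximate floats (the original values are not recoverable)."""
    scale = (hi - lo) / 255
    return [lo + c * scale for c in codes]

a = [0.1200, -0.4500, 0.3300, 0.0100]
b = [0.1210, -0.4490, 0.3310, 0.0110]  # differs from a by 0.001 in every dimension
lo, hi = -1.0, 1.0

print(quantize_8bit(a, lo, hi))  # [143, 70, 170, 129]
print(quantize_8bit(b, lo, hi))  # [143, 70, 170, 129]
print(dequantize(quantize_8bit(a, lo, hi), lo, hi))
```

With a range of 2.0 spread over 256 levels, each step is about 0.008 wide, so differences smaller than that can vanish after quantization.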

Typical accuracy loss:

The solution: Oversampling + Rescoring

Instead of directly returning top-10 from quantized search, do this:

  1. Search quantized vectors → get top-20 or top-30 candidates (oversample by 2x-3x)
  2. Rescore those candidates with full precision vectors → get exact scores
  3. Return true top-10 based on exact scores

This recovers the accuracy loss. You’re using quantization for fast candidate generation, then exact vectors for final ranking.
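The three steps above can be sketched without any database. In this toy version the "quantized" score is simulated by snapping values to a coarse grid, and the search is a brute-force scan rather than HNSW; both are assumptions made to keep the example short. What it demonstrates is the pattern: rank cheaply, keep extra candidates, then re-rank only those with exact scores.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def coarse(vector, step=0.25):
    """Stand-in for quantization: snap every value to a coarse grid."""
    return [round(x / step) * step for x in vector]

def search_with_rescoring(query, vectors, limit=10, oversampling=2.0):
    approx = [coarse(v) for v in vectors]  # a real index would precompute and store these
    n_candidates = math.ceil(limit * oversampling)
    # Stage 1: rank everything by the cheap, lossy score and keep extra candidates.
    candidates = sorted(range(len(vectors)),
                        key=lambda i: dot(query, approx[i]), reverse=True)[:n_candidates]
    # Stage 2: re-rank only those candidates with the full-precision vectors.
    return sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)[:limit]

random.seed(2)
vectors = [[random.uniform(-1, 1) for _ in range(32)] for _ in range(2000)]
query = [random.uniform(-1, 1) for _ in range(32)]

exact_top = set(sorted(range(len(vectors)),
                       key=lambda i: dot(query, vectors[i]), reverse=True)[:10])
for factor in (1.0, 2.0, 3.0):
    found = set(search_with_rescoring(query, vectors, limit=10, oversampling=factor))
    print(f"oversampling={factor}: {len(found & exact_top)}/10 of the exact top-10 recovered")
```

Because a larger oversampling factor always keeps a superset of the candidates, the number of exact top-10 results recovered can only stay the same or go up as the factor increases.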

Qdrant implementation:

# Step 1: Enable scalar quantization on your collection
quantization_config = models.ScalarQuantization(
    scalar=models.ScalarQuantizationConfig(
        type=models.ScalarType.INT8,  # float32 → int8
        quantile=0.99,  # Use 99th percentile for range
        always_ram=True  # Keep quantized vectors in RAM
    )
)
client.update_collection(
    collection_name="my_collection",
    quantization_config=quantization_config
)
# Step 2: Search with oversampling + rescoring
results = client.query_points(
    collection_name="my_collection",
    query=query_vector,
    limit=10,  # Final top-10 we want
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            rescore=True,  # Rescore with full precision
            oversampling=2.0  # Get 2x candidates (20 in this case)
        )
    )
)

What happens internally:

  1. Your query vector gets quantized to int8
  2. HNSW search runs on int8 vectors (fast)
  3. Top-20 candidates are identified (oversampling)
  4. Original float32 vectors for those 20 are fetched
  5. Exact similarity scores computed
  6. True top-10 based on exact scores returned

Performance numbers from my testing:

Without quantization:

With scalar quantization (int8, oversample 2x):

Why it works:

The int8 vectors are “good enough” to identify the neighborhood of relevant results. The full float32 precision is only needed for final ranking within that neighborhood.

Types of quantization:

Scalar (int8) — Recommended for most use cases:

Binary — Use for maximum speed:

Product — Balanced option:

When to use quantization:

When NOT to use quantization:

Pro tip: Start with scalar quantization and oversampling=2.0. Test on your evaluation set. If recall stays above 90%, you’re good. If it drops below, increase oversampling to 3.0 or use full precision.

Tactic 4: Two-Stage Retrieval (The Production Standard)

When to use: Always at scale (1M+ vectors). This is the production pattern.

Understanding the two-stage pattern:

Traditional single-stage search:

Query → HNSW search → Top-10 results

Two-stage retrieval:

Query → Fast candidate generation (200 results) → Precise reranking → Top-10 results

Why this works:

Stage 1 can be approximate and fast because you’re casting a wide net (200 candidates). You’re not trying to get the perfect top-10 yet — just identify the general region of relevant results.

Stage 2 can be expensive and precise because you’re only operating on 200 items, not millions. Exact search on 200 items is trivial. You can even use a cross-encoder or LLM for reranking if needed.

The sparse + dense pattern (most common):

Sparse vectors (lexical/keyword matching):

Dense vectors (semantic matching):

Together, they cover each other’s blind spots.

Real example:

Query: “What is the depreciation schedule for Tesla vehicles in California?”

Sparse search catches:

Dense search catches:

Qdrant implementation:

First, you need both vector types in your collection:

# Create collection with both dense and sparse vectors
client.create_collection(
    collection_name="hybrid_collection",
    vectors_config={
        "dense": models.VectorParams(
            size=768,
            distance=models.Distance.COSINE
        )
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams()
    }
)
# Index documents with both vector types
def index_document(doc_id, text):
    # Generate dense vector (using your embedding model)
    dense_vector = embed_model.encode(text)

    # Generate sparse vector (using BM25/TF-IDF)
    sparse_indices, sparse_values = create_sparse_vector(text)

    client.upsert(
        collection_name="hybrid_collection",
        points=[
            models.PointStruct(
                id=doc_id,
                vector={
                    "dense": dense_vector.tolist(),
                    "sparse": models.SparseVector(
                        indices=sparse_indices,
                        values=sparse_values
                    )
                },
                payload={"text": text}
            )
        ]
    )

Two-stage search implementation:

def two_stage_search(query_text, final_k=10):
    # Generate both query vectors
    dense_query = embed_model.encode(query_text)
    sparse_query_indices, sparse_query_values = create_sparse_vector(query_text)

    # Stage 1: Sparse prefetch (fast, broad)
    stage1_results = client.query_points(
        collection_name="hybrid_collection",
        query=models.SparseVector(
            indices=sparse_query_indices,
            values=sparse_query_values
        ),
        using="sparse",
        limit=200,  # Get 200 candidates
        with_payload=False  # Don't need payload yet
    )

    # Extract candidate IDs
    candidate_ids = [hit.id for hit in stage1_results.points]

    if not candidate_ids:
        # Fallback to pure dense if sparse found nothing
        return client.query_points(
            collection_name="hybrid_collection",
            query=dense_query.tolist(),
            using="dense",
            limit=final_k
        )

    # Stage 2: Dense rerank (precise, narrow)
    stage2_results = client.query_points(
        collection_name="hybrid_collection",
        query=dense_query.tolist(),
        using="dense",
        limit=final_k,
        query_filter=models.Filter(
            must=[models.HasIdCondition(has_id=candidate_ids)]
        ),
        search_params=models.SearchParams(
            exact=True  # Exact search on small candidate set
        ),
        with_payload=True
    )

    return stage2_results

What’s happening:

  1. Stage 1 (Sparse): Inverted index lookup finds 200 documents containing relevant terms. This is extremely fast (1–5ms) even on millions of documents.
  2. Stage 2 (Dense): Exact semantic search on just those 200 candidates. Computing exact similarity for 200 items is trivial (<10ms).

Total: about 15ms at the upper end for a hybrid search that combines lexical + semantic matching.

Alternative patterns:

Quantized → Full precision:

# Stage 1: Fast quantized search
stage1_results = search_quantized_vectors(query, limit=200)
# Stage 2: Exact rerank
stage2_results = rerank_with_full_precision(query, stage1_results, limit=10)

HNSW → Cross-encoder:

# Stage 1: Fast HNSW
stage1_results = hnsw_search(query, limit=50)
# Stage 2: Expensive cross-encoder
stage2_results = cross_encoder_rerank(query, stage1_results, limit=10)

HNSW → LLM reranking:

# Stage 1: Fast HNSW
stage1_results = hnsw_search(query, limit=20)
# Stage 2: LLM scoring
stage2_results = llm_rerank(query, stage1_results, limit=10)

Performance comparison:

Single-stage dense search:

Two-stage (sparse → dense):

Why it’s better:

When to use two-stage:

Cost considerations:

But the performance gains are worth it. Every production RAG system I’ve built uses this pattern.

Pro tip: For the sparse vector generation, you can use simple TF-IDF hashing or BM25. You don’t need anything fancy. The dense vector is doing the heavy lifting for semantics — sparse just needs to catch exact terms.
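The earlier snippets call a `create_sparse_vector` helper without defining it. Here is one minimal way it could look, using the hashing trick with log-scaled term frequency. This is a sketch, not the author's implementation: the tokenizer, the bucket count, and the weighting are assumptions, and it has no IDF component, so very common words are not down-weighted.

```python
import hashlib
import math
import re
from collections import Counter

def create_sparse_vector(text, num_buckets=2**20):
    """Hash each token into a bucket and weight it by log-scaled term frequency."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    weights = Counter()
    for token, count in Counter(tokens).items():
        # hashlib gives the same bucket on every run; Python's built-in hash() is salted per process.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=4).digest()
        bucket = int.from_bytes(digest, "big") % num_buckets
        weights[bucket] += 1.0 + math.log(count)  # colliding tokens share a bucket
    indices = sorted(weights)  # unique, ascending bucket IDs
    values = [weights[i] for i in indices]
    return indices, values

indices, values = create_sparse_vector(
    "Tesla depreciation schedule for Tesla vehicles in California"
)
print(len(indices), "non-zero dimensions")
```

The function returns parallel `indices` and `values` lists, which is the shape the indexing and search snippets above expect.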

Combining Tactics: Real Production Setup

In production, you use multiple tactics together. Here’s what I actually ran:

At 1M vectors:

# HNSW tuning
hnsw_config = models.HnswConfigDiff(m=24, ef_construct=150)
# Two-stage retrieval
def search(query):
    candidates = sparse_search(query, limit=200)
    return dense_rerank(query, candidates, limit=10)

At 5M vectors:

# HNSW tuning + quantization
hnsw_config = models.HnswConfigDiff(m=32, ef_construct=200)
quantization_config = models.ScalarQuantization(
    scalar=models.ScalarQuantizationConfig(type=models.ScalarType.INT8)
)
# Two-stage with quantization
def search(query):
    candidates = sparse_search(query, limit=200)
    return quantized_dense_rerank(
        query,
        candidates,
        limit=10,
        oversampling=2.0
    )

At 10M+ vectors:

# Aggressive tuning + quantization + on-disk
hnsw_config = models.HnswConfigDiff(m=48, ef_construct=300)
vectors_config = models.VectorParams(
    size=768,
    distance=models.Distance.COSINE,
    on_disk=True  # Vectors on disk
)
quantization_config = models.ScalarQuantization(
    scalar=models.ScalarQuantizationConfig(
        type=models.ScalarType.INT8,
        always_ram=True  # Quantized vectors in RAM
    )
)
# Two-stage + potential sharding by metadata
def search(query, filters=None):
    candidates = sparse_search(query, limit=300, filters=filters)
    return quantized_dense_rerank(
        query,
        candidates,
        limit=10,
        oversampling=3.0
    )

What to Monitor in Production

You can’t just set this up and forget it. Monitoring is critical.

Core Metrics to Track

1. Retrieval Quality:

Recall@k on held-out evaluation set:

Why this matters: You might not notice gradual quality degradation from user complaints until it’s severe. Automated testing catches it early.

How to set it up:

def weekly_quality_check():
    eval_queries = load_evaluation_set()  # Fixed test set
    results = []
    for query in eval_queries:
        hits = search(query.text, k=10)
        has_correct = query.correct_id in [h.id for h in hits]
        results.append(has_correct)
    recall = sum(results) / len(results)
    if recall < BASELINE_RECALL - 0.05:  # 5% drop
        alert(f"Recall degraded: {recall:.2%} (baseline: {BASELINE_RECALL:.2%})")
    log_metric("recall_at_10", recall)

2. System Health:

P95 latency (95th percentile):

P99 latency (99th percentile):

Memory usage:

Disk I/O (if using on-disk storage):

Cache hit rate:

3. Drift Signals:

Recall by query category: Track recall separately for:

Long-tail queries degrade first. If you see recall dropping specifically on rare queries while common queries stay stable, it’s a clear sign of HNSW scaling issues.
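Splitting recall by category needs only a label on each evaluation query. A minimal sketch, assuming each evaluation result is a (category, hit) pair; the category names "common" and "long_tail" are hypothetical.

```python
from collections import defaultdict

def recall_by_category(eval_results):
    """eval_results: iterable of (category, hit) pairs; hit is True if the correct ID was in the top-k."""
    totals = defaultdict(lambda: [0, 0])  # category -> [hits, queries]
    for category, hit in eval_results:
        totals[category][0] += int(hit)
        totals[category][1] += 1
    return {category: hits / count for category, (hits, count) in totals.items()}

sample = [
    ("common", True), ("common", True), ("common", True), ("common", False),
    ("long_tail", True), ("long_tail", False), ("long_tail", False), ("long_tail", False),
]
print(recall_by_category(sample))  # {'common': 0.75, 'long_tail': 0.25}
```

A widening gap between the two numbers over time is the drift signal described above, even when the overall average still looks acceptable.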

Temporal patterns:

This helps identify if your issue is scaling, data quality, or infrastructure.

Scale Gates: Automated Reviews

Set up automatic reviews at scale thresholds:

At 500K vectors:

if collection.size > 500_000 and hnsw_config.m == 16:
    suggest_action("Consider increasing M to 24 for better recall")

At 1M vectors:

if collection.size > 1_000_000:
    actions = []
    if hnsw_config.m < 24:
        actions.append("Increase M to 24–32")
    if not using_two_stage_retrieval:
        actions.append("Implement sparse→dense two-stage retrieval")
    if memory_usage > 0.7:
        actions.append("Enable on-disk vectors")
    run_benchmark_comparison(current_config, optimized_config)
    suggest_actions(actions)

At 5M vectors:

if collection.size > 5_000_000:
    mandatory_actions = []
    if not quantization_enabled:
        mandatory_actions.append("Enable scalar quantization (int8)")
    if not on_disk_enabled and memory_usage > 0.6:
        mandatory_actions.append("Enable on-disk vectors")
    if not two_stage_retrieval:
        mandatory_actions.append("Two-stage retrieval is mandatory at this scale")
    require_actions(mandatory_actions)

At 10M+ vectors:

if collection.size > 10_000_000:
    # This is serious scale - need comprehensive optimization
    checks = {
        "hnsw_m": hnsw_config.m >= 48,
        "quantization": quantization_enabled,
        "on_disk": on_disk_enabled,
        "two_stage": two_stage_retrieval,
        # Passes only if the last reindex happened within the past 30 days
        "monthly_reindex": last_reindex >= thirty_days_ago
    }
    failing = [k for k, v in checks.items() if not v]
    if failing:
        critical_alert(f"Missing optimizations at 10M+ scale: {failing}")

The Monitoring Loop

Here’s the actual monitoring code you should run:

import time
from datetime import datetime

class QdrantMonitor:
    def __init__(self, client, collection_name, baseline_recall=0.90):
        self.client = client
        self.collection_name = collection_name
        self.baseline_recall = baseline_recall
        self.eval_queries = self.load_evaluation_set()

    def load_evaluation_set(self):
        """Load fixed test set of queries with known correct answers"""
        # This should be a representative sample of real queries
        # Stored separately, never used for training/tuning
        # Each item needs .text and .correct_id attributes
        raise NotImplementedError("Load your evaluation queries here")

    def search(self, text, k=10):
        """Run the retrieval pipeline under test; return hits that expose .id"""
        # Placeholder: call your own search function here
        raise NotImplementedError("Plug in your search function here")

    def measure_recall_at_k(self, k=10):
        """Measure recall@k on evaluation set"""
        correct = 0

        for query in self.eval_queries:
            results = self.search(query.text, k=k)
            if query.correct_id in [r.id for r in results]:
                correct += 1

        return correct / len(self.eval_queries)

    def measure_latency(self, percentile=95):
        """Measure latency at given percentile"""
        latencies = []

        for query in self.eval_queries:
            start = time.perf_counter()
            self.search(query.text, k=10)
            latency_ms = (time.perf_counter() - start) * 1000
            latencies.append(latency_ms)

        latencies.sort()
        # Clamp so percentile=100 does not index past the end
        idx = min(int(len(latencies) * percentile / 100), len(latencies) - 1)
        return latencies[idx]

    def get_memory_usage(self):
        """Get current memory usage percentage"""
        import psutil  # Third-party dependency
        return psutil.virtual_memory().percent

    def weekly_health_check(self):
        """Run comprehensive health check"""
        print(f"[{datetime.now()}] Running health check...")

        # Measure quality
        recall = self.measure_recall_at_k(k=10)
        p95_latency = self.measure_latency(percentile=95)
        p99_latency = self.measure_latency(percentile=99)
        memory_pct = self.get_memory_usage()

        # Get collection info
        info = self.client.get_collection(self.collection_name)
        vector_count = info.points_count

        # Log metrics
        metrics = {
            "timestamp": datetime.now().isoformat(),
            "recall_at_10": recall,
            "p95_latency_ms": p95_latency,
            "p99_latency_ms": p99_latency,
            "memory_percent": memory_pct,
            "vector_count": vector_count
        }
        self.log_metrics(metrics)

        # Check thresholds and alert
        alerts = []

        if recall < self.baseline_recall - 0.05:
            alerts.append(f"Recall degraded: {recall:.2%} (baseline: {self.baseline_recall:.2%})")
            self.suggest_recall_fixes()

        if p95_latency > 100:  # SLA breach
            alerts.append(f"P95 latency breach: {p95_latency:.1f}ms (SLA: 100ms)")
            self.suggest_latency_fixes()

        if memory_pct > 70:
            alerts.append(f"High memory usage: {memory_pct:.1f}%")
            self.suggest_memory_fixes()

        # Scale-based recommendations
        if vector_count > 1_000_000:
            self.check_scale_optimizations(vector_count)

        if alerts:
            self.send_alerts(alerts)

        return metrics

    def suggest_recall_fixes(self):
        """Auto-suggest fixes for recall degradation"""
        suggestions = [
            "1. Increase ef_search (currently may be too low)",
            "2. Rebuild index with higher M and ef_construct",
            "3. Implement two-stage retrieval if not already enabled",
            "4. Check if quantization oversampling needs increase",
            "5. Verify evaluation set still represents real queries"
        ]
        print("\nRecall fix suggestions:")
        for s in suggestions:
            print(f"  {s}")

    def suggest_latency_fixes(self):
        """Auto-suggest fixes for latency issues"""
        suggestions = [
            "1. Enable quantization to speed up search",
            "2. Reduce ef_search (accept slight recall tradeoff)",
            "3. Move vectors to disk if RAM pressure is high",
            "4. Implement caching for common queries",
            "5. Scale horizontally with replicas"
        ]
        print("\nLatency fix suggestions:")
        for s in suggestions:
            print(f"  {s}")

    def suggest_memory_fixes(self):
        """Auto-suggest fixes for memory issues"""
        suggestions = [
            "1. Enable on-disk vectors (keeps graph in RAM)",
            "2. Enable quantization (4x memory reduction)",
            "3. Reduce M if currently very high (trades recall for memory)",
            "4. Scale to larger instance or add RAM",
            "5. Consider sharding across multiple instances"
        ]
        print("\nMemory fix suggestions:")
        for s in suggestions:
            print(f"  {s}")

    def check_scale_optimizations(self, vector_count):
        """Check if scale-appropriate optimizations are enabled"""
        info = self.client.get_collection(self.collection_name)
        config = info.config

        recommendations = []

        if vector_count > 5_000_000:
            if not config.quantization_config:
                recommendations.append("CRITICAL: Quantization mandatory at 5M+ vectors")

            if config.hnsw_config.m < 32:
                recommendations.append("Consider M>=32 at this scale")

        if vector_count > 10_000_000:
            if config.hnsw_config.m < 48:
                recommendations.append("Consider M>=48 at 10M+ scale")

            recommendations.append("Schedule monthly reindexing at this scale")

        if recommendations:
            print("\nScale-based recommendations:")
            for r in recommendations:
                print(f"  {r}")

    def log_metrics(self, metrics):
        """Log metrics to your monitoring system"""
        # Send to Prometheus, Datadog, CloudWatch, etc.
        # For demo, just print
        print(f"\nMetrics: {metrics}")

    def send_alerts(self, alerts):
        """Send alerts via email, Slack, PagerDuty, etc."""
        print("\nALERTS:")
        for alert in alerts:
            print(f"  {alert}")

# Usage:
monitor = QdrantMonitor(
    client=qdrant_client,
    collection_name="my_collection",
    baseline_recall=0.90
)
# Run weekly (set up as cron job)
monitor.weekly_health_check()

Set this up as a cron job:

0 2 * * 1 python /path/to/monitor.py

Every Monday at 2 AM

Don’t wait for users to complain. Proactive monitoring catches problems early when they’re easy to fix.

Why I Use Qdrant for This

I’ve used Pinecone, Weaviate, and Milvus in production. Here’s why Qdrant won for handling HNSW scaling:

1. Payload Indexing is Actually Different

The problem with most databases: They do filtering AFTER the similarity search:

  1. Find top-100 most similar vectors
  2. Apply your filter (e.g., “created_date > 2024-01-01”)
  3. Maybe you get 3 results, maybe 0

If your filter is restrictive, you waste the similarity search. You found 100 candidates, but only 3 match your filter.

How Qdrant is different: Qdrant’s payload index extends the HNSW graph itself. It filters DURING the graph traversal, not after:

  1. While navigating the HNSW graph, check filters at each hop
  2. Only explore paths where filters match
  3. Get top-100 that are both similar AND match filters

This is a single-pass filtered search. The filter is integrated into the graph navigation.
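To make the difference concrete, here is a toy sketch using brute-force search over six hand-made 2-D vectors. It is not Qdrant's implementation and there is no HNSW graph here: `filtered_search` only stands in for filter-aware traversal by applying the filter before the top-k cut. The data and function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# (id, vector, payload): the three nearest neighbours of the query are all Books
docs = [
    (1, [1.0, 0.0], {"category": "Books"}),
    (2, [0.9, 0.1], {"category": "Books"}),
    (3, [0.8, 0.2], {"category": "Books"}),
    (4, [0.5, 0.5], {"category": "Electronics"}),
    (5, [0.3, 0.7], {"category": "Electronics"}),
    (6, [0.0, 1.0], {"category": "Electronics"}),
]


def post_filter_search(query, k, predicate):
    """Take the top-k by similarity first, then drop non-matching results."""
    ranked = sorted(docs, key=lambda d: cosine(query, d[1]), reverse=True)
    return [d[0] for d in ranked[:k] if predicate(d[2])]


def filtered_search(query, k, predicate):
    """Only consider matching points, then take the top-k among them."""
    matching = [d for d in docs if predicate(d[2])]
    ranked = sorted(matching, key=lambda d: cosine(query, d[1]), reverse=True)
    return [d[0] for d in ranked[:k]]


def is_electronics(payload):
    return payload["category"] == "Electronics"


query = [1.0, 0.0]
post = post_filter_search(query, 3, is_electronics)
during = filtered_search(query, 3, is_electronics)
print("post-filter:", post)
print("filter-aware:", during)
```

With a restrictive filter, the post-filter variant can return fewer than k results (here, none), while the filter-aware variant still fills the result list.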

Real-world impact: I had a collection of 2M product documents with metadata like category, price_range, availability.

Query: “Find products similar to ‘wireless headphones’ in Electronics category, price $50-$200, in stock”

Weaviate (post-filtering):

Qdrant (during-search filtering):

At scale, this difference is massive.

2. Quantization That’s Production-Ready

What makes Qdrant’s quantization special:

Built-in rescore logic: Most databases offer quantization, but you have to manually implement oversampling and rescoring. Qdrant has it built-in — just set rescore=True.

Automatic fallback: If quantized search doesn’t find enough candidates, Qdrant automatically falls back to full precision. You don’t have to handle edge cases.

Multiple quantization types:

All work with the same API. Easy to test and compare.

Real numbers from my production system:

Full precision:

Scalar quantization (int8, oversample 2x):

This is on a 500K vector collection. The savings at 5M or 10M vectors are even more dramatic.
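The oversample-then-rescore pattern itself can be sketched in plain Python. This is a toy with a single global scale and brute-force scoring, not Qdrant's quantizer; `quantize` and `search_with_rescore` are hypothetical names, and the random data is only there to make the block runnable.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantize(vec, scale):
    """Map floats to integers in the int8 range [-127, 127]."""
    return [max(-127, min(127, round(x / scale))) for x in vec]


def search_with_rescore(query, vectors, k, oversample=2.0):
    # Assumption: one symmetric scale derived from the largest component.
    scale = max(abs(x) for v in vectors for x in v) / 127
    # In a real index the quantized vectors are built once, not per query.
    q_vectors = [quantize(v, scale) for v in vectors]
    q_query = quantize(query, scale)

    # Stage 1: cheap approximate scoring, keep k * oversample candidates.
    n_candidates = min(len(vectors), int(k * oversample))
    approx_order = sorted(
        range(len(vectors)),
        key=lambda i: dot(q_query, q_vectors[i]),
        reverse=True,
    )
    candidates = approx_order[:n_candidates]

    # Stage 2: rescore only the candidates with the original floats.
    rescored = sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)
    return rescored[:k], candidates


rng = random.Random(0)
vectors = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(200)]
query = [rng.uniform(-1, 1) for _ in range(16)]
top, candidates = search_with_rescore(query, vectors, k=5, oversample=2.0)
print("top-5 ids after rescoring:", top)

# Memory per 768-dim vector: float32 vs int8
float32_bytes = 768 * 4
int8_bytes = 768 * 1
print("compression ratio:", float32_bytes / int8_bytes)
```

Oversampling widens the candidate pool so that small ranking errors from quantization are corrected in the rescoring pass.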

3. Sparse + Dense Hybrid is Native

Most databases make you choose:

Qdrant supports both in a single collection:

from qdrant_client import models

# One collection holding a named dense vector and a named sparse vector
client.create_collection(
    collection_name="hybrid",
    vectors_config={
        "dense": models.VectorParams(size=768, distance=models.Distance.COSINE)
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams()
    }
)

Index documents with both:

client.upsert(
    collection_name="hybrid",
    points=[
        models.PointStruct(
            id=1,
            vector={
                "dense": [0.1, 0.2, ...],  # Semantic embedding
                "sparse": models.SparseVector(
                    indices=[10, 234, 567],  # Term IDs
                    values=[0.8, 0.6, 0.4]  # Term weights
                )
            }
        )
    ]
)

Two-stage retrieval becomes trivial:

# Stage 1: Sparse (sparse_query is a models.SparseVector)
candidates = client.query_points(
    collection_name="hybrid",
    query=sparse_query,
    using="sparse",
    limit=200
)

# Stage 2: Dense rerank, restricted to the Stage 1 candidate IDs
# query_points returns a response object; the hits are in .points
results = client.query_points(
    collection_name="hybrid",
    query=dense_query,
    using="dense",
    limit=10,
    query_filter=models.Filter(
        must=[models.HasIdCondition(has_id=[c.id for c in candidates.points])]
    )
)

No external orchestration. No merging results from different systems. It just works.

4. On-Disk Storage That’s Actually Smart

The naive approach (what some databases do):

Qdrant’s approach:

Why this matters:

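During an HNSW search, graph navigation might visit 100–200 nodes (checking which direction to hop), but exact full-precision similarity scores are only needed for maybe 10–50 final candidates, provided the traversal itself can score against something cheaper held in RAM, such as quantized vectors.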

Graph navigation is the hot path. Vector scoring is not.

By keeping graph in RAM and vectors on disk:

Real numbers:

Full in-memory (1M vectors):

On-disk vectors (1M vectors):

At 10M vectors:

This is a no-brainer tradeoff at scale.
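A back-of-envelope sketch shows why the split pays off. It assumes float32 vectors with 768 dimensions (as in the hybrid example above), M=16, and the commonly quoted rough estimate of about M × 2 × 4 bytes of layer-0 links per vector. These are estimates, not measurements from my production system, and the function names are hypothetical.

```python
def vector_bytes(n_vectors, dims, bytes_per_component=4):
    """Raw storage for the vectors themselves (float32 by default)."""
    return n_vectors * dims * bytes_per_component


def graph_bytes(n_vectors, m, bytes_per_link=4):
    """Rough size of the HNSW links: up to 2*M neighbour ids per node
    on layer 0. Upper layers are ignored because they are much smaller."""
    return n_vectors * m * 2 * bytes_per_link


GIB = 1024 ** 3
for n in (1_000_000, 10_000_000):
    v = vector_bytes(n, 768)
    g = graph_bytes(n, 16)
    print(f"{n:>10,} vectors: vectors {v / GIB:.2f} GiB, graph links {g / GIB:.2f} GiB")
```

Under these assumptions the vectors outweigh the graph links by more than an order of magnitude, so moving only the vectors to disk frees most of the RAM.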

5. Rust = Consistent Performance

Why Rust matters for vector databases:

No garbage collection pauses: Languages like Java/Go have GC pauses that can spike latency unpredictably. Qdrant’s Rust implementation has no garbage collector, so memory is released deterministically when values go out of scope.

SIMD acceleration: Rust makes it easy to use SIMD (Single Instruction Multiple Data) for vector operations. Computing dot products of 768-dimensional vectors is 4–8x faster with SIMD.

Better async I/O: Qdrant uses io_uring on Linux for async disk I/O. This is 2–3x faster than traditional I/O for on-disk vectors.

Memory safety without overhead: Rust’s borrow checker prevents memory bugs without runtime overhead. No null pointer crashes, no buffer overflows, no data races.

Real-world impact:

Pinecone (closed source, don’t know implementation):

Qdrant (Rust):

For production systems, predictability matters as much as raw speed.

The Honest Truth About HNSW at Scale

Let me be direct: HNSW isn’t broken. Default HNSW is broken.

There’s no magic setting that works at all scales. If someone tells you “just use M=16, ef_construct=100, ef_search=64 for everything,” they haven’t scaled past 100K vectors.

What you actually need:

1. Monitoring: Know when quality degrades BEFORE users complain

2. Tuning at scale gates: Adjust parameters as you grow

3. Architectural patterns: Don’t rely on single-shot search

The four tactics:

  1. Tune HNSW: Increase M, ef_construct, ef_search based on scale
  2. On-disk vectors: When RAM is tight, keep graph in RAM, vectors on disk
  3. Quantization: Compress to int8, oversample 2–3x, rescore with full precision
  4. Two-stage retrieval: Fast broad search → precise narrow rerank

Qdrant makes this manageable:

My RAG system went from “failing at 200K vectors” to “handling 10M vectors with sub-100ms latency” by applying these patterns with Qdrant.

That’s the difference between understanding your tools and just hoping they work.

Links and Resources

Colab notebook: https://colab.research.google.com/drive/1ydVDqNVsRih0XATT5HE7ZZHD511g6tKX?usp=sharing

HNSW Algorithm:

Qdrant Documentation:

Qdrant Repository:

The future of RAG at scale isn’t magic — it’s understanding your retrieval layer, monitoring it continuously, and tuning it as you grow. With Qdrant handling the complexity, you can focus on building great applications instead of fighting infrastructure.


HNSW at Scale: Why Adding More Documents to Your Database Breaks RAG was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

