
LLM inference performance is often discussed in terms of model size, batching, quantization, and GPU utilization. But one of the most important parts of the serving stack is less visible: the KV cache.
When a transformer generates tokens, it stores the key and value tensors for previous tokens so it does not have to recompute them at every decoding step. This cache is critical for efficient decoding, but it also consumes memory. As prompts get longer, context windows grow, and workloads become more repetitive, KV cache management becomes a major part of LLM serving performance. This experiment started with a practical question:
If vLLM already supports prefix caching, when does adding LMCache actually help?
I wanted to understand not just whether LMCache works, but under what workload conditions it provides measurable value.
The short answer from my experiments:
LMCache does not improve performance by default. It starts to help only when there is enough repeated-prefix structure and enough KV-cache pressure that preserving KV outside GPU memory becomes cheaper than recomputing it.
Even then, the benefit showed up mostly in TTFT, not in overall throughput.
Why prefix caching matters
Many real LLM applications repeatedly send large shared prefixes.
Examples include:
- A long system prompt
- A retrieved document or knowledge base context
- A tool schema
- A conversation history
- A shared instruction template
- A coding agent prompt with the same repository context
Without caching, the model has to repeatedly prefill the same prefix. That means recomputing attention over tokens it has already processed in previous requests. vLLM’s prefix caching helps with this by reusing the KV cache for repeated prompt prefixes when possible.
Conceptually:
Request 1:
[shared prefix][user question A]
Request 2:
[shared prefix][user question B]
If the shared prefix is already cached, Request 2 can skip a significant amount of prefill work. This is especially important because prefill is often what dominates time to first token, or TTFT, for long prompts.
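For illustration, this is what that pattern looks like against a vLLM OpenAI-compatible endpoint. This is a minimal sketch, assuming the default port and the model used later in this post; the placeholder system prompt stands in for the long shared prefix:
# Request 1 and Request 2 share the same long system prompt (the reusable prefix).
# With prefix caching, the KV computed for that prefix during Request 1
# can be reused by Request 2, so only the new user question needs prefill.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "<long shared prefix: instructions, documents, schema>"},
      {"role": "user", "content": "user question A"}
    ]
  }'
# Same system prompt, different question: the prefix KV should already be cached.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "<long shared prefix: instructions, documents, schema>"},
      {"role": "user", "content": "user question B"}
    ]
  }'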
What LMCache adds
vLLM prefix caching is useful, but GPU memory is limited. If there are too many active or reusable prefixes, some KV blocks may be evicted from GPU memory. That is where LMCache comes in.
LMCache acts as an external KV storage layer. In my local setup, it used CPU memory as the backing store. The goal is not to make decoding inherently faster. Instead, the goal is to preserve reusable KV outside GPU memory so that if the GPU-resident cache evicts something useful, it can be retrieved rather than fully recomputed. A simplified view:
Without LMCache:
Repeated prefix is cached on GPU → reuse is fast
Repeated prefix is evicted from GPU → recompute from scratch
With LMCache:
Repeated prefix is cached on GPU → reuse is fast
Repeated prefix is evicted from GPU → retrieve from LMCache instead of recomputing
That sounds obviously helpful, but there is a tradeoff. Retrieving KV from another memory layer is not free. It adds transfer and coordination overhead. So LMCache only helps when the recomputation cost it avoids is greater than the overhead it introduces. That was the core hypothesis I wanted to test.
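To make that tradeoff concrete, here is a rough back-of-envelope sketch. The numbers are illustrative assumptions, not measurements from this experiment:
KV bytes per token ≈ 2 (K and V) × num_layers × num_kv_heads × head_dim × bytes per value
For a small GQA model that works out to tens of kilobytes per token, so a 4,096-token shared prefix is on the order of 100 MB of KV. Moving that over a CPU-GPU link at tens of GB/s costs a few milliseconds, while re-running prefill over those 4,096 tokens on a busy GPU typically costs tens of milliseconds or more. That gap is the budget LMCache has to pay for its lookup, transfer, and coordination overhead. When prefixes are short, or the reusable KV is already resident on the GPU, the budget shrinks to nothing and the overhead dominates.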
Experimental setup
I used a single-node setup on Google Colab with an A100 HighRAM instance. The serving stack was:
- Model: Qwen/Qwen2.5-1.5B-Instruct
- Serving engine: vLLM
- KV offload backend: LMCache
- Hardware: Google Colab A100 HighRAM
I compared two serving conditions:
Condition 1: vLLM prefix caching enabled, LMCache disabled
Condition 2: vLLM prefix caching enabled, LMCache enabled
The goal was to isolate the effect of LMCache while keeping the rest of the serving setup constant. For LMCache-enabled runs, I configured vLLM with LMCache as the KV offloading backend and used local CPU-backed LMCache storage. At a high level, the configuration looked like this:
--kv-offloading-backend lmcache
--disable-hybrid-kv-cache-manager
And the LMCache environment variables enabled local CPU storage:
LMCACHE_LOCAL_CPU=True
LMCACHE_CHUNK_SIZE=256
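Putting those pieces together, the two serving conditions were launched roughly as follows. This is a sketch rather than the exact commands; flag names and defaults can differ across vLLM and LMCache versions:
# Condition 1: vLLM with prefix caching, no LMCache
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
  --enable-prefix-caching

# Condition 2: vLLM with prefix caching plus LMCache as the KV offloading
# backend, backed by local CPU memory
LMCACHE_LOCAL_CPU=True \
LMCACHE_CHUNK_SIZE=256 \
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
  --enable-prefix-caching \
  --kv-offloading-backend lmcache \
  --disable-hybrid-kv-cache-manager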
Workloads tested
I tested two different workloads.
1. prefix_repetition
This is a synthetic benchmark designed to create repeated shared prefixes. It is useful because it stresses the exact thing LMCache is supposed to help with: repeated prefix reuse under KV pressure. This is not necessarily representative of every production workload, but it is a good mechanism test.
2. ShareGPT
ShareGPT is a more realistic conversational workload. The prompts are shorter and more diverse, so there is less exact repeated-prefix structure. That makes it less favorable to LMCache, but useful as a realism check. In other words:
prefix_repetition = favorable benchmark for prefix reuse
ShareGPT = more realistic, less favorable benchmark
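Both dataset names correspond to workloads available in vLLM's serving benchmark, which is how runs like these are commonly driven. The invocations below are a sketch under that assumption; the subcommand and flag names vary between vLLM versions, and the values shown are not the exact settings used:
# Synthetic repeated-prefix workload
vllm bench serve \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --dataset-name prefix_repetition \
  --num-prompts 1000

# Realistic conversational workload
vllm bench serve \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000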
Creating enough memory pressure
One challenge was that the model was relatively small for an A100. Qwen2.5-1.5B-Instruct on an A100 HighRAM setup leaves a lot of memory headroom. In early runs, LMCache mostly added overhead because vLLM could keep enough KV resident on GPU without needing much external offload. To make the experiment meaningful, I progressively increased pressure by changing:
- Prefix length
- Number of prefixes
- Output length
- Concurrency
- GPU memory utilization
This was important because LMCache is not expected to help much when GPU memory is already sufficient. The interesting regime is when GPU memory becomes constrained enough that useful KV would otherwise be evicted and recomputed.
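A practical way to confirm that the pressure is real, rather than inferring it from throughput alone, is to watch the server's KV cache metrics while a benchmark runs. A minimal sketch, assuming vLLM's Prometheus metrics endpoint (metric names can differ slightly across versions):
# High KV cache usage plus non-zero preemptions means useful KV is being
# evicted and recomputed, which is the regime where external KV storage can pay off.
curl -s http://localhost:8000/metrics | grep -E "gpu_cache_usage|preemptions"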
Results
The prefix_repetition workload was where LMCache finally started to show its intended value:
1. Prefix caching ON, LMCache OFF [prefix_repetition benchmark]
Requests/sec: 2.05
Output tokens/sec: 383.56
Total tokens/sec: 34,416.64
Mean TTFT: 41.35s
Median TTFT: 43.52s
P99 TTFT: 52.70s
Mean ITL: 93.45ms
Median ITL: 100.47ms
P99 ITL: 152.02ms
2. Prefix caching ON, LMCache ON [prefix_repetition benchmark]
Requests/sec: 2.03
Output tokens/sec: 381.94
Total tokens/sec: 34,175.45
Mean TTFT: 38.47s
Median TTFT: 39.10s
P99 TTFT: 60.10s
Mean ITL: 116.58ms
Median ITL: 126.93ms
P99 ITL: 220.88ms
The main improvement was in TTFT:
Mean TTFT:
41.35s → 38.47s
Median TTFT:
43.52s → 39.10s
This suggests that the setup had finally entered a regime where preserving reusable prefix KV outside GPU memory was sometimes better than recomputing it. But the improvement was narrow. LMCache did not improve overall serving performance:
- Request throughput was slightly lower
- Output token throughput was slightly lower
- Mean ITL became worse
- Median ITL became worse
- P99 ITL became worse
- P99 TTFT became worse
3. Prefix caching ON, LMCache OFF [ShareGPT benchmark]
Requests/sec: 38.47
Output tokens/sec: 4,923.93
Total tokens/sec: 44,315.41
Mean TTFT: 1.07s
Median TTFT: 924.29ms
P99 TTFT: 2.63s
Mean ITL: 20.03ms
Median ITL: 11.47ms
P99 ITL: 45.01ms
4. Prefix caching ON, LMCache ON [ShareGPT benchmark]
Requests/sec: 29.05
Output tokens/sec: 3,718.89
Total tokens/sec: 33,470.04
Mean TTFT: 1.13s
Median TTFT: 901.48ms
P99 TTFT: 3.44s
Mean ITL: 28.76ms
Median ITL: 12.20ms
P99 ITL: 62.04ms
ShareGPT is more realistic than the synthetic prefix_repetition benchmark, and for exactly that reason it does not strongly stress exact prefix reuse. There is less reusable KV to preserve, so LMCache has fewer opportunities to offset its own overhead. With LMCache enabled, every headline metric regressed:
- Lower request throughput
- Lower output token throughput
- Lower total token throughput
- Worse mean TTFT
- Worse p99 TTFT
- Worse mean ITL
- Worse p99 ITL
Key Findings
1. LMCache does not help by default
The most important takeaway is that LMCache is not a free performance win.
In low-to-moderate memory pressure regimes, it mostly adds overhead. That was visible in early experiments and clearly visible on ShareGPT.
This makes sense. If vLLM can already keep the useful KV cache resident on GPU, then adding another memory layer introduces complexity without much benefit.
In that case, local GPU prefix caching is already the fastest path.
2. LMCache starts to help only under repeated-prefix memory pressure
The first clear TTFT improvement appeared only after I created a heavier prefix_repetition workload. That is the regime where LMCache’s purpose becomes relevant:
- Large repeated prefixes
- Enough concurrency or prefix diversity to pressure GPU KV memory
- Eviction of useful GPU-resident KV
- High recomputation cost if evicted prefixes are needed again
In that situation, external KV preservation can be better than recomputation. But this is a specific workload shape. It should not be generalized to all LLM traffic.
3. TTFT is where LMCache is most likely to help
LMCache is primarily a prefill-side optimization. The expected win is not lower inter-token latency. The expected win is avoiding repeated prefill work for long shared prefixes. That means the metric most likely to improve is TTFT. This is exactly what showed up in the prefix_repetition benchmark. Mean and median TTFT improved, but ITL got worse. That tradeoff is important for production systems. If your application is highly sensitive to first-token latency for long-context prompts, LMCache may be useful under the right workload. But if your bottleneck is decode throughput or inter-token latency, LMCache may not help and can even hurt.
Limitations
My experiment used LMCache in its simplest form: CPU-backed KV preservation on a single A100 instance. That is only part of the story. In a larger distributed inference system, LMCache can be more valuable because KV reuse is not limited to one GPU worker. In that setting:
- Requests with the same prefix may hit different replicas
- GPU-local KV is fragmented across workers
- Prefill and decode are disaggregated
- Workers are scaled up and down
- Useful KV should survive beyond one process lifetime
- Large shared contexts recur across the fleet
Code
The full experiment and notebook are available here —