Introduction
The KV cache (key-value cache) stores the K and V matrices computed during autoregressive generation so they are not recomputed for previously generated tokens. This dramatically speeds up inference: each new token only needs its own Q, K, V projections and one attention pass over the cached K, V, rather than a full recomputation of the sequence.
The Problem
In autoregressive generation, we generate tokens one at a time:
Token 0: compute K_0, V_0, generate token 1
Token 1: recompute K_0, V_0, K_1, V_1, generate token 2
Token 2: recompute K_0, V_0, K_1, V_1, K_2, V_2, generate token 3
...
The K, V recomputation alone sums to O(n²) token projections over n steps, which is very slow for long generations.
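To make the waste concrete, here is a minimal single-head NumPy sketch of one naive decode step; the dimension and random weights are illustrative placeholders, not any real model's parameters:

```python
import numpy as np

d = 64                        # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))  # toy weights

def naive_step(X):
    """One decode step WITHOUT a cache: K and V are reprojected for every token."""
    K = X @ Wk                        # (t, d) recomputed from scratch each step
    V = X @ Wv                        # (t, d) recomputed from scratch each step
    q = X[-1:] @ Wq                   # (1, d) query for the newest token only
    scores = q @ K.T / np.sqrt(d)     # (1, t)
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # softmax over all t positions
    return w @ V                      # (1, d) attention output for the new token
```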
KV Cache Solution
At generation step t:
Q_t, K_t, V_t = projections for the new token only
K_cache = [K_0, K_1, ..., K_{t-1}], then append K_t
V_cache = [V_0, V_1, ..., V_{t-1}], then append V_t
Attention = softmax(Q_t K_cacheᵀ / √d_head) V_cache
All earlier K, V are read straight from the cache; only K_t, V_t are computed fresh.
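The same single-head NumPy sketch with a cache; it reuses the toy weights Wq, Wk, Wv and dimension d defined above, and is an illustrative sketch rather than a production implementation:

```python
def cached_step(x_new, K_cache, V_cache):
    """One decode step WITH a cache: only the newest token is projected."""
    k_new = x_new @ Wk                     # (1, d) computed once, then cached
    v_new = x_new @ Wv
    K_cache = np.vstack([K_cache, k_new])  # K_cache now covers positions 0..t
    V_cache = np.vstack([V_cache, v_new])
    q = x_new @ Wq                         # (1, d) query for position t
    scores = q @ K_cache.T / np.sqrt(d)    # (1, t+1)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache, K_cache, V_cache   # output plus the grown caches

# Usage: seed the caches with shape (0, d) and feed one token embedding per step.
K_cache = np.zeros((0, d)); V_cache = np.zeros((0, d))
x = rng.standard_normal((1, d))            # stand-in for a token embedding
out, K_cache, V_cache = cached_step(x, K_cache, V_cache)
```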
Speedup
| Method | Computations for step t |
|---|---|
| Without cache | O(t·d²) to reproject K, V for all t tokens + attention |
| With KV cache | O(d²) to project K, V for the new token + O(t·d) attention |

Here d is the model width: each token's K or V projection is a d×d matmul, hence O(d²) per token, while attention is t dot products of length d.
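A back-of-the-envelope FLOP count under those simplifications (a rough sanity check, not a benchmark; the numbers are illustrative):

```python
d, t = 12288, 2048                   # GPT-3-scale width, current position (illustrative)
proj_kv = 2 * (2 * d * d)            # K and V projections for one token, ~2d² FLOPs each
attn = 2 * t * d                     # one query attending over t keys and values
without_cache = t * proj_kv + attn   # reproject K, V for all t tokens
with_cache = proj_kv + attn          # project only the new token
print(f"{without_cache / with_cache:.0f}x")  # ~1.9e3x at this step
```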
Memory Usage
KV cache size: 2 · n_layers · n_heads · seq_len · d_head · bytes_per_element (the leading 2 counts K and V)
For GPT-3 (96 layers, 96 heads, 2048 context, 64 d_head):
2 · 96 · 96 · 2048 · 64 ≈ 2.4 × 10⁹ elements, or ≈ 4.8GB per sequence in fp16 (2 bytes per element)
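A small helper makes the arithmetic explicit; kv_cache_bytes is a hypothetical name, and fp16 (2 bytes) is an assumed default:

```python
def kv_cache_bytes(n_layers, n_heads, seq_len, d_head, dtype_bytes=2, batch=1):
    """KV cache footprint in bytes: K and V (factor 2) per layer, head, position."""
    return 2 * n_layers * n_heads * seq_len * d_head * dtype_bytes * batch

# GPT-3-scale numbers from above: ~4.8 GB per sequence in fp16
print(kv_cache_bytes(96, 96, 2048, 64) / 1e9)  # -> ~4.83
```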
Practical Considerations
- Batch size matters: KV cache memory scales linearly with batch size (see the sketch below)
- Memory management: PagedAttention (vLLM) allocates the cache in fixed-size blocks, analogous to virtual-memory pages
- Prefix caching: sequences that share a common prefix (e.g., the same system prompt) can share that prefix's KV cache
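For example, with the hypothetical kv_cache_bytes helper above, the batch-size effect is a simple multiplier:

```python
# 16 concurrent GPT-3-scale sequences at full context -> ~77 GB of KV cache in fp16,
# which is why block-wise allocation and prefix sharing matter in serving systems.
print(kv_cache_bytes(96, 96, 2048, 64, batch=16) / 1e9)  # -> ~77.3
```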