Introduction
The KV cache (key-value cache) stores the K and V matrices computed during autoregressive generation so they are not recomputed for previously generated tokens. This dramatically speeds up inference: each new token only needs its own Q, K, V projections and one attention pass over the cached K, V, rather than a full recomputation of the sequence.
The Problem
In autoregressive generation, we generate tokens one at a time:
Token 0: compute K_0, V_0, generate token 1
Token 1: recompute K_0, V_0, K_1, V_1, generate token 2
Token 2: recompute K_0, V_0, K_1, V_1, K_2, V_2, generate token 3
...
The K, V recomputation alone sums to O(n²) token projections over n steps, which is very slow for long generations.
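To make the waste concrete, here is a minimal single-head NumPy sketch of one naive decode step; the dimension and random weights are illustrative placeholders, not any real model's parameters:

```python
import numpy as np

d = 64                        # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))  # toy weights

def naive_step(X):
    """One decode step WITHOUT a cache: K and V are reprojected for every token."""
    K = X @ Wk                        # (t, d) recomputed from scratch each step
    V = X @ Wv                        # (t, d) recomputed from scratch each step
    q = X[-1:] @ Wq                   # (1, d) query for the newest token only
    scores = q @ K.T / np.sqrt(d)     # (1, t)
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # softmax over all t positions
    return w @ V                      # (1, d) attention output for the new token
```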
KV Cache Solution
At generation step t:
Q_t, K_t, V_t = projections for the new token only
K_cache = [K_0, K_1, ..., K_{t-1}], then append K_t
V_cache = [V_0, V_1, ..., V_{t-1}], then append V_t
Attention = softmax(Q_t K_cacheᵀ / √d_head) V_cache
All earlier K, V are read straight from the cache; only K_t, V_t are computed fresh.
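The same single-head NumPy sketch with a cache; it reuses the toy weights Wq, Wk, Wv and dimension d defined above, and is an illustrative sketch rather than a production implementation:

```python
def cached_step(x_new, K_cache, V_cache):
    """One decode step WITH a cache: only the newest token is projected."""
    k_new = x_new @ Wk                     # (1, d) computed once, then cached
    v_new = x_new @ Wv
    K_cache = np.vstack([K_cache, k_new])  # K_cache now covers positions 0..t
    V_cache = np.vstack([V_cache, v_new])
    q = x_new @ Wq                         # (1, d) query for position t
    scores = q @ K_cache.T / np.sqrt(d)    # (1, t+1)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache, K_cache, V_cache   # output plus the grown caches

# Usage: seed the caches with shape (0, d) and feed one token embedding per step.
K_cache = np.zeros((0, d)); V_cache = np.zeros((0, d))
x = rng.standard_normal((1, d))            # stand-in for a token embedding
out, K_cache, V_cache = cached_step(x, K_cache, V_cache)
```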
Speedup
| Method | Computations for step t |
|---|---|
| Without cache | O(t·d²) to reproject K, V for all t tokens + attention |
| With KV cache | O(d²) to project K, V for the new token + O(t·d) attention |

Here d is the model width: each token's K or V projection is a d×d matmul, hence O(d²) per token, while attention is t dot products of length d.
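A back-of-the-envelope FLOP count under those simplifications (a rough sanity check, not a benchmark; the numbers are illustrative):

```python
d, t = 12288, 2048                   # GPT-3-scale width, current position (illustrative)
proj_kv = 2 * (2 * d * d)            # K and V projections for one token, ~2d² FLOPs each
attn = 2 * t * d                     # one query attending over t keys and values
without_cache = t * proj_kv + attn   # reproject K, V for all t tokens
with_cache = proj_kv + attn          # project only the new token
print(f"{without_cache / with_cache:.0f}x")  # ~1.9e3x at this step
```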
Memory Usage
KV cache size: 2 · n_layers · n_heads · seq_len · d_head · bytes_per_element (the leading 2 counts K and V)
For GPT-3 (96 layers, 96 heads, 2048 context, 64 d_head):
2 · 96 · 96 · 2048 · 64 ≈ 2.4 × 10⁹ elements, or ≈ 4.8GB per sequence in fp16 (2 bytes per element)
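A small helper makes the arithmetic explicit; kv_cache_bytes is a hypothetical name, and fp16 (2 bytes) is an assumed default:

```python
def kv_cache_bytes(n_layers, n_heads, seq_len, d_head, dtype_bytes=2, batch=1):
    """KV cache footprint in bytes: K and V (factor 2) per layer, head, position."""
    return 2 * n_layers * n_heads * seq_len * d_head * dtype_bytes * batch

# GPT-3-scale numbers from above: ~4.8 GB per sequence in fp16
print(kv_cache_bytes(96, 96, 2048, 64) / 1e9)  # -> ~4.83
```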
Practical Considerations
- Batch size matters: KV cache memory scales linearly with batch size (see the sketch below)
- Memory management: PagedAttention (vLLM) allocates the cache in fixed-size blocks, analogous to virtual-memory pages
- Prefix caching: sequences that share a common prefix (e.g., the same system prompt) can share that prefix's KV cache
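For example, with the hypothetical kv_cache_bytes helper above, the batch-size effect is a simple multiplier:

```python
# 16 concurrent GPT-3-scale sequences at full context -> ~77 GB of KV cache in fp16,
# which is why block-wise allocation and prefix sharing matter in serving systems.
print(kv_cache_bytes(96, 96, 2048, 64, batch=16) / 1e9)  # -> ~77.3
```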