## Introduction
Multi-Query Attention (MQA) is the extreme case of Grouped Query Attention (GQA) in which all query heads share a single key-value head. This minimizes both the KV cache size and the memory bandwidth needed to read it during decoding, at the cost of a small reduction in model quality.
## Comparison: MHA vs GQA vs MQA
| Type | Query Heads | KV Heads | Sharing |
|---|---|---|---|
| MHA | h | h | 1:1 (no sharing) |
| GQA | h | g (e.g., 8) | 1:(h/g) (each KV head serves h/g query heads) |
| MQA | h | 1 | 1:h (all h query heads share 1 KV head) |
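The difference is easiest to see in tensor shapes. Here is a minimal sketch, assuming PyTorch; the batch size, sequence length, and head counts are illustrative:

```python
import torch

batch, seq_len, h, g, d = 1, 128, 32, 8, 64  # illustrative sizes

# K (and likewise V) projections for each variant:
k_mha = torch.randn(batch, seq_len, h, d)  # MHA: one KV head per query head
k_gqa = torch.randn(batch, seq_len, g, d)  # GQA: g KV heads, each shared by h/g query heads
k_mqa = torch.randn(batch, seq_len, 1, d)  # MQA: a single KV head shared by all h query heads
```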
## MQA Formula
- Q ∈ ℝ^{seq_len × h·d} (h query heads)
- K ∈ ℝ^{seq_len × d} (single shared key head)
- V ∈ ℝ^{seq_len × d} (single shared value head)

All h query heads attend to the same K and V.
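In code, sharing amounts to broadcasting the size-1 KV head dimension across all query heads. Here is a minimal sketch, assuming PyTorch; the function name and tensor sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def mqa_attention(q, k, v):
    """Multi-Query Attention: all query heads share a single K/V head.

    q: (batch, h, seq_len, d)  -- h separate query heads
    k: (batch, 1, seq_len, d)  -- single shared key head
    v: (batch, 1, seq_len, d)  -- single shared value head
    """
    d = q.size(-1)
    # The size-1 head dimension of k and v broadcasts against the h
    # query heads, so every query head attends to the same K and V.
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, h, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    return weights @ v                           # (batch, h, seq_len, d)

# Example: h = 32 query heads, one shared KV head, d = 64
q = torch.randn(1, 32, 128, 64)
k = torch.randn(1, 1, 128, 64)
v = torch.randn(1, 1, 128, 64)
out = mqa_attention(q, k, v)  # (1, 32, 128, 64)
```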
## Memory Reduction
For h = 32 heads, d = 64:

- MHA: 32 KV heads
- MQA: 1 KV head (a 32× smaller KV cache)
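A back-of-the-envelope calculation makes this concrete. The layer count, sequence length, batch size, and fp16 precision below are illustrative assumptions, not from the text above:

```python
def kv_cache_bytes(num_kv_heads, head_dim, seq_len, num_layers, batch, dtype_bytes=2):
    # Two cached tensors (K and V) per layer, each of shape
    # (batch, num_kv_heads, seq_len, head_dim), at dtype_bytes per element.
    return 2 * num_layers * batch * num_kv_heads * seq_len * head_dim * dtype_bytes

# Hypothetical model: 32 layers, seq_len 4096, batch 1, fp16
mha = kv_cache_bytes(num_kv_heads=32, head_dim=64, seq_len=4096, num_layers=32, batch=1)
mqa = kv_cache_bytes(num_kv_heads=1,  head_dim=64, seq_len=4096, num_layers=32, batch=1)
print(f"MHA: {mha / 2**20:.0f} MiB, MQA: {mqa / 2**20:.0f} MiB")  # 1024 MiB vs 32 MiB
```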
## Speed Benefits
- KV cache: 32× smaller than MHA
- Memory bandwidth: Load only 1 set of K,V instead of 32
- Popular for inference: used in production models such as PaLM and StarCoder
## Trade-offs
MQA may slightly reduce model quality because:
- All query heads attend to identical K,V information
- Loses the diversity of having separate KV heads
- Some tasks that benefit from diverse KV attention may degrade