## Introduction
Multi-Query Attention (MQA) is the extreme case of Grouped Query Attention (GQA) in which all query heads share a single key-value head. This minimizes both the KV cache size and the memory bandwidth needed to read it during decoding, at the cost of a small reduction in model quality.
## Comparison: MHA vs GQA vs MQA
| Type | Query Heads | KV Heads | Sharing |
|---|---|---|---|
| MHA | h | h | 1:1 (no sharing) |
| GQA | h | g (e.g., 8) | 1:(h/g) (each KV head serves h/g query heads) |
| MQA | h | 1 | 1:h (all h query heads share 1 KV head) |
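The difference is easiest to see in tensor shapes. Here is a minimal sketch, assuming PyTorch; the batch size, sequence length, and head counts are illustrative:

```python
import torch

batch, seq_len, h, g, d = 1, 128, 32, 8, 64  # illustrative sizes

# K (and likewise V) projections for each variant:
k_mha = torch.randn(batch, seq_len, h, d)  # MHA: one KV head per query head
k_gqa = torch.randn(batch, seq_len, g, d)  # GQA: g KV heads, each shared by h/g query heads
k_mqa = torch.randn(batch, seq_len, 1, d)  # MQA: a single KV head shared by all h query heads
```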
## MQA Formula
- Q ∈ ℝ^{seq_len × h·d} (h query heads)
- K ∈ ℝ^{seq_len × d} (single shared key head)
- V ∈ ℝ^{seq_len × d} (single shared value head)

All h query heads attend to the same K and V.
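In code, sharing amounts to broadcasting the size-1 KV head dimension across all query heads. Here is a minimal sketch, assuming PyTorch; the function name and tensor sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def mqa_attention(q, k, v):
    """Multi-Query Attention: all query heads share a single K/V head.

    q: (batch, h, seq_len, d)  -- h separate query heads
    k: (batch, 1, seq_len, d)  -- single shared key head
    v: (batch, 1, seq_len, d)  -- single shared value head
    """
    d = q.size(-1)
    # The size-1 head dimension of k and v broadcasts against the h
    # query heads, so every query head attends to the same K and V.
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, h, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    return weights @ v                           # (batch, h, seq_len, d)

# Example: h = 32 query heads, one shared KV head, d = 64
q = torch.randn(1, 32, 128, 64)
k = torch.randn(1, 1, 128, 64)
v = torch.randn(1, 1, 128, 64)
out = mqa_attention(q, k, v)  # (1, 32, 128, 64)
```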
## Memory Reduction
For h = 32 heads, d = 64:

- MHA: 32 KV heads
- MQA: 1 KV head (a 32× smaller KV cache)
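A back-of-the-envelope calculation makes this concrete. The layer count, sequence length, batch size, and fp16 precision below are illustrative assumptions, not from the text above:

```python
def kv_cache_bytes(num_kv_heads, head_dim, seq_len, num_layers, batch, dtype_bytes=2):
    # Two cached tensors (K and V) per layer, each of shape
    # (batch, num_kv_heads, seq_len, head_dim), at dtype_bytes per element.
    return 2 * num_layers * batch * num_kv_heads * seq_len * head_dim * dtype_bytes

# Hypothetical model: 32 layers, seq_len 4096, batch 1, fp16
mha = kv_cache_bytes(num_kv_heads=32, head_dim=64, seq_len=4096, num_layers=32, batch=1)
mqa = kv_cache_bytes(num_kv_heads=1,  head_dim=64, seq_len=4096, num_layers=32, batch=1)
print(f"MHA: {mha / 2**20:.0f} MiB, MQA: {mqa / 2**20:.0f} MiB")  # 1024 MiB vs 32 MiB
```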
## Speed Benefits
- KV cache: 32× smaller than MHA
- Memory bandwidth: Load only 1 set of K,V instead of 32
- Popular for inference: used in production models such as PaLM and StarCoder
## Trade-offs
MQA may slightly reduce model quality because:
- All query heads attend to identical K,V information
- Loses the diversity of having separate KV heads
- Some tasks that benefit from diverse KV attention may degrade