56. Multi-Query Attention (MQA)

Introduction

Multi-Query Attention (MQA) is an extreme case of Grouped Query Attention where all query heads share a single key-value head. This maximally reduces KV cache size and memory bandwidth at the cost of some quality reduction.

Comparison: MHA vs GQA vs MQA

Type | Query Heads | KV Heads    | Sharing
---- | ----------- | ----------- | -------
MHA  | h           | h           | 1:1 (no sharing)
GQA  | h           | g (e.g., 8) | 1:g (each KV head serves h/g query heads)
MQA  | h           | 1           | 1:all (all query heads share 1 KV head)
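The three schemes in the table differ only in how many KV heads exist and how they are mapped to query heads. A minimal NumPy sketch of that mapping (the helper name `expand_kv` and the shapes are illustrative, not from any particular library):

```python
import numpy as np

def expand_kv(K, n_query_heads):
    """Repeat KV heads so each query head lines up with its assigned KV head.

    K: (n_kv_heads, seq_len, d).
    n_kv_heads == n_query_heads -> MHA (no sharing)
    1 < n_kv_heads < n_query_heads -> GQA (each KV head serves a group)
    n_kv_heads == 1 -> MQA (one KV head shared by all query heads)
    """
    n_kv_heads = K.shape[0]
    group = n_query_heads // n_kv_heads   # query heads per KV head
    return np.repeat(K, group, axis=0)    # (n_query_heads, seq_len, d)

h, seq_len, d = 8, 5, 4
K_mha = np.zeros((8, seq_len, d))  # h KV heads
K_gqa = np.zeros((4, seq_len, d))  # g = 4 KV heads
K_mqa = np.zeros((1, seq_len, d))  # single KV head (MQA)
for K in (K_mha, K_gqa, K_mqa):
    print(expand_kv(K, h).shape)   # all expand to (8, 5, 4)
```

After expansion, all three reduce to the same per-head attention computation; only the amount of distinct K/V data stored (and loaded from memory) differs.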

MQA Formula

Q ∈ ℝ^{seq_len × h·d}  (h query heads)
K ∈ ℝ^{seq_len × d}    (single KV head)
V ∈ ℝ^{seq_len × d}    (single KV head)

All h query heads attend to the same K, V.
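A minimal NumPy sketch of this formula, with the single K, V shared across all h query heads via broadcasting (function names and shapes are our own, for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mqa_attention(Q, K, V):
    """Multi-Query Attention: h query heads, one shared K/V head.

    Q: (h, seq_len, d)  -- one slice per query head
    K: (seq_len, d)     -- the single shared key head
    V: (seq_len, d)     -- the single shared value head
    """
    d = Q.shape[-1]
    # Broadcasting applies the same K to every query head:
    scores = Q @ K.T / np.sqrt(d)       # (h, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    return weights @ V                  # (h, seq_len, d)

h, seq_len, d = 4, 6, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((h, seq_len, d))
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))
out = mqa_attention(Q, K, V)
print(out.shape)  # (4, 6, 8)
```

Note that only Q carries a head dimension; K and V are stored (and cached) once, which is exactly where the memory savings below come from.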

Memory Reduction

For h=32 heads, d=64:
MHA: 32 KV heads
MQA: 1 KV head (32× smaller KV cache)
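The 32× figure is simple arithmetic over the KV cache. A quick sketch, assuming fp16 values (2 bytes) and illustrative seq_len/layer counts chosen here, not taken from the text:

```python
def kv_cache_bytes(n_kv_heads, d, seq_len, n_layers, bytes_per_val=2):
    """KV cache size: 2 tensors (K and V) x heads x head_dim x tokens x layers."""
    return 2 * n_kv_heads * d * seq_len * n_layers * bytes_per_val

# h=32 query heads, d=64, with assumed seq_len=4096 and n_layers=32:
mha = kv_cache_bytes(n_kv_heads=32, d=64, seq_len=4096, n_layers=32)
mqa = kv_cache_bytes(n_kv_heads=1,  d=64, seq_len=4096, n_layers=32)
print(mha // mqa)  # 32
```

The ratio is independent of seq_len, layer count, and precision: the cache shrinks by exactly h/1 = 32×.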

Speed Benefits

Autoregressive decoding is typically memory-bandwidth bound: at every step, each attention layer must reload its cached K, V from memory. With MQA, a layer loads one K, V pair instead of h pairs, cutting memory traffic per token and enabling faster decoding and larger inference batch sizes.

Trade-offs

MQA may slightly reduce model quality because:

  • All query heads attend to the same K, V, so heads can no longer learn diverse key/value projections
  • Sharing a single KV head is the most aggressive setting; intermediate sharing (GQA) typically recovers more quality

Test Your Understanding

Question 1: In MQA, all query heads share:

  • A) Different KV heads
  • B) A single key-value head
  • C) No KV heads
  • D) Random sharing

Question 2: MQA has how many KV heads?

  • A) h (same as query)
  • B) g (intermediate)
  • C) 1
  • D) 0

Question 3: Compared to MHA, MQA KV cache is:

  • A) Same size
  • B) 32× smaller (for h=32)
  • C) 32× larger
  • D) 2× smaller

Question 4: MQA is an extreme case of:

  • A) Single query attention
  • B) Grouped Query Attention
  • C) No relation
  • D) Standard attention

Question 5: MQA speed benefit comes from:

  • A) More computation
  • B) Loading only 1 set of K,V instead of many
  • C) Larger KV cache
  • D) No benefit

Question 6: A potential downside of MQA is:

  • A) Too much memory
  • B) Possible quality reduction due to shared KV
  • C) Slower than MHA
  • D) No inference benefit

Question 7: If h=32, MQA has how many KV heads?

  • A) 32
  • B) 1
  • C) 8
  • D) 64

Question 8: MQA is particularly useful for:

  • A) Training
  • B) Inference (memory/speed)
  • C) Not used
  • D) Large batch training