66. Mixture-of-Experts Attention

Introduction

Mixture-of-Experts (MoE) attention is an approach in which different attention heads or experts specialize in different types of attention patterns. A gating mechanism selects which experts to use for each input, allowing the model to hold more parameters without a proportional increase in computation.

MoE Architecture

Input x
↓ Gating network: G(x) = softmax(W·x)
↓ Experts: E₁(x), E₂(x), ..., Eₖ(x)
↓ Output = Σᵢ G(x)ᵢ · Eᵢ(x)
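
A minimal sketch of this computation in PyTorch (the module and parameter names below are illustrative, not taken from any particular library):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)   # W in G(x) = softmax(W·x)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        gates = F.softmax(self.gate(x), dim=-1)                          # G(x)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (..., d_model, E)
        # Output = Σᵢ G(x)ᵢ · Eᵢ(x): weighted sum of expert outputs
        return torch.einsum("...de,...e->...d", expert_outs, gates)

Here every expert is evaluated (dense gating); the sparse variant discussed below skips experts whose gate weight is zero.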

Attention MoE

Instead of all heads processing all tokens:

Expert 1: May specialize in local patterns
Expert 2: May specialize in long-range dependencies
Expert 3: May specialize in syntactic patterns
...

The gating network selects the relevant experts for each position, as in the sketch below.
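
One hypothetical way to realize this for attention (the local-window/full-attention split below is an illustrative assumption, not a prescribed design) is to compute each expert's attention output and mix them with a per-token gate:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMoE(nn.Module):
    # Illustrative: two attention "experts" (local-window and full) mixed per token.
    def __init__(self, d_model, n_heads, window=32):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.full_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, 2)
        self.window = window

    def forward(self, x):
        seq_len = x.size(1)
        # Local expert: mask out keys farther than +/- window from each query.
        idx = torch.arange(seq_len, device=x.device)
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window  # True = masked
        local_out, _ = self.local_attn(x, x, x, attn_mask=local_mask)
        full_out, _ = self.full_attn(x, x, x)
        # Per-token gate decides how much each expert contributes.
        g = F.softmax(self.gate(x), dim=-1)                             # (batch, seq, 2)
        return g[..., 0:1] * local_out + g[..., 1:2] * full_out

In practice the experts could differ in window size, dilation, or head configuration; the gate learns which pattern each position needs.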

Sparse Gating

Top-k gating activates only k experts:

G(x) = TopK(softmax(W·x), k)

Only the selected k experts process each token
The rest receive zero weight → computational savings
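
A sketch of top-k gating, assuming the common convention of renormalizing the surviving weights (details vary across implementations):

import torch
import torch.nn.functional as F

def top_k_gating(logits, k):
    # logits: (..., num_experts) from the gating network W·x
    probs = F.softmax(logits, dim=-1)
    topk_vals, topk_idx = probs.topk(k, dim=-1)
    gates = torch.zeros_like(probs).scatter_(-1, topk_idx, topk_vals)
    # Renormalize so the k active experts' weights sum to 1
    return gates / gates.sum(dim=-1, keepdim=True)

The zeros here are only for clarity; a real implementation routes each token to its k selected experts and never runs the others, which is where the compute savings come from.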

Benefits

Aspect          | Standard                           | MoE
Parameters      | N total                            | Up to N×E with E experts, but only k active
Compute         | Full computation per token         | Only k experts per token
Specialization  | All heads learn the same patterns  | Different heads learn different patterns

Examples
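
MoE routing appears in production models such as the Switch Transformer and Mixtral, both of which apply expert routing to the feed-forward layers of the Transformer rather than replacing attention itself. As a small usage example, reusing the illustrative sketches defined above:

import torch

torch.manual_seed(0)
x = torch.randn(2, 16, 64)                       # (batch, seq_len, d_model)

moe = MoELayer(d_model=64, num_experts=4)
print(moe(x).shape)                              # torch.Size([2, 16, 64])

attn_moe = AttentionMoE(d_model=64, n_heads=4, window=4)
print(attn_moe(x).shape)                         # torch.Size([2, 16, 64])

gates = top_k_gating(torch.randn(2, 16, 4), k=2)
print((gates > 0).sum(dim=-1).unique())          # tensor([2]): exactly 2 experts active per token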

Test Your Understanding

Question 1: In MoE, the gating network:

  • A) Does attention computation
  • B) Selects which experts to use
  • C) No gating
  • D) Is the output

Question 2: MoE allows more parameters with:

  • A) Proportional computation increase
  • B) Sub-linear compute increase via sparse gating
  • C) No savings
  • D) More computation

Question 3: Top-k gating means:

  • A) All experts active
  • B) Only top k experts are selected per token
  • C) No gating
  • D) k = 1 only

Question 4: Different MoE experts can specialize in:

  • A) Same patterns
  • B) Local, long-range, syntactic patterns
  • C) No specialization
  • D) Random patterns

Question 5: Output = Σᵢ G(x)ᵢ · Eᵢ(x) means:

  • A) Weighted sum of expert outputs
  • B) Only first expert
  • C) No sum
  • D) Random output

Question 6: Mixtral uses:

  • A) Standard attention only
  • B) Mixture of experts
  • C) No MoE
  • D) Dense activation

Question 7: Sparse gating vs dense (all experts):

  • A) Sparse has less compute, same as dense
  • B) Sparse has less compute (only k of E experts)
  • C) Dense is more sparse
  • D) Same compute

Question 8: MoE is used in:

  • A) Switch Transformer
  • B) Mixtral
  • C) Both
  • D) Neither