66. Mixture-of-Experts Attention

Introduction

Mixture-of-Experts (MoE) attention is an approach in which different attention heads or experts specialize in different types of attention patterns. A gating mechanism selects which experts to use for each input, allowing the model to hold more parameters without a proportional increase in computation.

MoE Architecture

Input x
↓ Gating network: G(x) = softmax(W·x)
↓ Experts: E₁(x), E₂(x), ..., Eₖ(x)
↓ Output = Σᵢ G(x)ᵢ · Eᵢ(x)
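
A minimal sketch of this computation in PyTorch (the module and parameter names below are illustrative, not taken from any particular library):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)   # W in G(x) = softmax(W·x)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_experts)]
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        gates = F.softmax(self.gate(x), dim=-1)                          # G(x)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (..., d_model, E)
        # Output = Σᵢ G(x)ᵢ · Eᵢ(x): weighted sum of expert outputs
        return torch.einsum("...de,...e->...d", expert_outs, gates)

Here every expert is evaluated (dense gating); the sparse variant discussed below skips experts whose gate weight is zero.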

Attention MoE

Instead of all heads processing all tokens:

Expert 1: May specialize in local patterns
Expert 2: May specialize in long-range dependencies
Expert 3: May specialize in syntactic patterns
...

The gating network selects the relevant experts for each position, as in the sketch below.
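
One hypothetical way to realize this for attention (the local-window/full-attention split below is an illustrative assumption, not a prescribed design) is to compute each expert's attention output and mix them with a per-token gate:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMoE(nn.Module):
    # Illustrative: two attention "experts" (local-window and full) mixed per token.
    def __init__(self, d_model, n_heads, window=32):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.full_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, 2)
        self.window = window

    def forward(self, x):
        seq_len = x.size(1)
        # Local expert: mask out keys farther than +/- window from each query.
        idx = torch.arange(seq_len, device=x.device)
        local_mask = (idx[None, :] - idx[:, None]).abs() > self.window  # True = masked
        local_out, _ = self.local_attn(x, x, x, attn_mask=local_mask)
        full_out, _ = self.full_attn(x, x, x)
        # Per-token gate decides how much each expert contributes.
        g = F.softmax(self.gate(x), dim=-1)                             # (batch, seq, 2)
        return g[..., 0:1] * local_out + g[..., 1:2] * full_out

In practice the experts could differ in window size, dilation, or head configuration; the gate learns which pattern each position needs.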

Sparse Gating

Top-k gating activates only k experts:

G(x) = TopK(softmax(W·x), k)

Only the selected k experts process each token
The rest receive zero weight → computational savings
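
A sketch of top-k gating, assuming the common convention of renormalizing the surviving weights (details vary across implementations):

import torch
import torch.nn.functional as F

def top_k_gating(logits, k):
    # logits: (..., num_experts) from the gating network W·x
    probs = F.softmax(logits, dim=-1)
    topk_vals, topk_idx = probs.topk(k, dim=-1)
    gates = torch.zeros_like(probs).scatter_(-1, topk_idx, topk_vals)
    # Renormalize so the k active experts' weights sum to 1
    return gates / gates.sum(dim=-1, keepdim=True)

The zeros here are only for clarity; a real implementation routes each token to its k selected experts and never runs the others, which is where the compute savings come from.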

Benefits

Aspect          | Standard                           | MoE
Parameters      | N total                            | Up to N×E with E experts, but only k active
Compute         | Full computation per token         | Only k experts per token
Specialization  | All heads learn the same patterns  | Different heads learn different patterns

Examples
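
MoE routing appears in production models such as the Switch Transformer and Mixtral, both of which apply expert routing to the feed-forward layers of the Transformer rather than replacing attention itself. As a small usage example, reusing the illustrative sketches defined above:

import torch

torch.manual_seed(0)
x = torch.randn(2, 16, 64)                       # (batch, seq_len, d_model)

moe = MoELayer(d_model=64, num_experts=4)
print(moe(x).shape)                              # torch.Size([2, 16, 64])

attn_moe = AttentionMoE(d_model=64, n_heads=4, window=4)
print(attn_moe(x).shape)                         # torch.Size([2, 16, 64])

gates = top_k_gating(torch.randn(2, 16, 4), k=2)
print((gates > 0).sum(dim=-1).unique())          # tensor([2]): exactly 2 experts active per token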

Test Your Understanding

Question 1: In MoE, the gating network:

  • A) Does attention computation
  • B) Selects which experts to use
  • C) No gating
  • D) Is the output

Question 2: MoE allows more parameters with:

  • A) Proportional computation increase
  • B) Sub-linear compute increase via sparse gating
  • C) No savings
  • D) More computation

Question 3: Top-k gating means:

  • A) All experts active
  • B) Only top k experts are selected per token
  • C) No gating
  • D) k = 1 only

Question 4: Different MoE experts can specialize in:

  • A) Same patterns
  • B) Local, long-range, syntactic patterns
  • C) No specialization
  • D) Random patterns

Question 5: Output = Σᵢ G(x)ᵢ · Eᵢ(x) means:

  • A) Weighted sum of expert outputs
  • B) Only first expert
  • C) No sum
  • D) Random output

Question 6: Mixtral uses:

  • A) Standard attention only
  • B) Mixture of experts
  • C) No MoE
  • D) Dense activation

Question 7: Sparse gating vs dense (all experts):

  • A) Sparse has less compute, same as dense
  • B) Sparse has less compute (only k of E experts)
  • C) Dense is more sparse
  • D) Same compute

Question 8: MoE is used in:

  • A) Switch Transformer
  • B) Mixtral
  • C) Both
  • D) Neither