Introduction
Mixture-of-Experts (MoE) attention is an approach in which different attention heads or experts specialize in different kinds of attention patterns. A gating network selects which experts process each input, so the model can carry far more parameters without a proportional increase in computation.
MoE Architecture
Input x
↓ Gating network: G(x) = softmax(W·x)
↓ Experts: E₁(x), E₂(x), ..., Eₖ(x)
↓ Output = Σᵢ G(x)ᵢ · Eᵢ(x)
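A minimal sketch of this dense formulation, assuming PyTorch; each expert here is an arbitrary small MLP, and all class and variable names are illustrative rather than taken from any particular library:

```python
import torch
import torch.nn as nn

class DenseMoE(nn.Module):
    """Dense mixture: every expert runs and outputs are combined with gate weights."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)   # W in G(x) = softmax(W·x)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        gates = torch.softmax(self.gate(x), dim=-1)                       # (batch, seq, E)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, seq, d_model, E)
        # Output = Σᵢ G(x)ᵢ · Eᵢ(x)
        return (expert_outs * gates.unsqueeze(-2)).sum(dim=-1)

moe = DenseMoE(d_model=64, num_experts=4)
out = moe(torch.randn(2, 10, 64))   # -> (2, 10, 64)
```

Note that in this dense form every expert still runs; the savings only appear once gating is made sparse, as described below.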
Attention MoE
Instead of all heads processing all tokens:
Expert 1: May specialize in local patterns
Expert 2: May specialize in long-range dependencies
Expert 3: May specialize in syntactic patterns
...
Gating selects relevant experts per position
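One way this could look in code is sketched below, under the assumption that each expert is a full single-head self-attention module and the gate mixes their outputs per token (a dense mix for clarity; sparse selection is covered in the next section). All names are illustrative:

```python
import torch
import torch.nn as nn

class AttentionExpert(nn.Module):
    """One self-attention 'expert'; in practice experts might differ in window size, etc."""
    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out(attn @ v)

class MoEAttention(nn.Module):
    """Per-position gate mixes the outputs of several attention experts."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([AttentionExpert(d_model) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.gate(x), dim=-1)                 # (batch, seq, E), per position
        outs = torch.stack([e(x) for e in self.experts], dim=-1)    # (batch, seq, d_model, E)
        return (outs * gates.unsqueeze(-2)).sum(dim=-1)
```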
Sparse Gating
Top-k gating activates only k experts:
G(x) = TopK(softmax(W·x), k)
Only k experts process each token
Rest are zero → computational savings
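A minimal sketch of top-k routing, again assuming PyTorch. For brevity the experts are plain linear layers; a real router would also add load-balancing losses and capacity limits, which are omitted here. Names are illustrative:

```python
import torch
import torch.nn as nn

class TopKGate(nn.Module):
    """G(x) = TopK(softmax(W·x), k): keep the k largest gate values per token, zero the rest."""
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        probs = torch.softmax(self.w(x), dim=-1)                  # (tokens, E)
        topk_vals, topk_idx = probs.topk(self.k, dim=-1)          # (tokens, k)
        gates = torch.zeros_like(probs).scatter(-1, topk_idx, topk_vals)
        return gates / gates.sum(dim=-1, keepdim=True)            # renormalize over the k kept

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = TopKGate(d_model, num_experts, k)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        gates = self.gate(x)                                      # (tokens, E), mostly zeros
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = gates[:, i] > 0                                 # tokens routed to expert i
            if sel.any():                                         # unselected experts do no work
                out[sel] += gates[sel, i].unsqueeze(-1) * expert(x[sel])
        return out
```

Because each expert only processes the tokens routed to it, the per-token cost scales with k rather than with the total number of experts.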
Benefits
| Aspect | Standard attention | MoE attention |
|---|---|---|
| Parameters | N total | Roughly N×E with E experts, but only k active per token |
| Compute | Every token uses the full layer | Only the k selected experts per token |
| Specialization | Heads specialize only implicitly during training | Routing explicitly pushes experts toward distinct patterns |
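As a rough illustration: with E = 8 experts of roughly N parameters each, the layer stores about 8N parameters, yet under top-2 gating each token passes through only about 2N of them, so capacity grows much faster than per-token compute.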
Examples
- Switch Transformer: sparse MoE with top-1 ("switch") routing, sending each token to a single expert
- ST-MoE: adds router regularization to make sparse MoE training stable and transferable
- Mixtral: sparse MoE LLM that routes each token to 2 of 8 experts per layer