Introduction
Feed-forward networks (FFNs) are the position-wise fully connected sublayers in Transformer blocks. Each position in the sequence passes through the same two-layer network independently, so every token receives an identical transformation. The FFN typically accounts for the majority of a Transformer layer's parameters.
Architecture
FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂
Equivalent to: Linear → ReLU → Linear
Dimensions: d_model → d_ff → d_model
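A minimal sketch of this block in PyTorch (class and variable names are illustrative, not from any particular codebase):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: Linear -> ReLU -> Linear."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # W1, b1: expand to d_ff
        self.fc2 = nn.Linear(d_ff, d_model)   # W2, b2: project back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))
```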
Standard Dimensions
| Model | d_model | d_ff | Expansion |
|---|---|---|---|
| Original Transformer | 512 | 2048 | 4× |
| BERT Base | 768 | 3072 | 4× |
| BERT Large | 1024 | 4096 | 4× |
| GPT-2 Small | 768 | 3072 | 4× |
| T5 1.1 Base | 768 | 2048 | 2.67× |
Why FFN is Important
Despite appearing simple, FFN serves critical functions:
- Parameter count: FFN typically holds ~2/3 of a Transformer layer's parameters (a quick check after this list shows where that figure comes from)
- Expressivity: Provides capacity for complex token-level transformations
- Separation of concerns: Attention captures relationships, FFN captures token-specific transformations
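A back-of-the-envelope check of the ~2/3 figure, assuming the standard d_ff = 4·d_model expansion and ignoring biases:

```python
d_model, d_ff = 512, 2048          # original Transformer sizes (d_ff = 4 * d_model)

attn_params = 4 * d_model**2       # W_Q, W_K, W_V, W_O projections
ffn_params = 2 * d_model * d_ff    # W1 (d_model x d_ff) + W2 (d_ff x d_model)

print(ffn_params / (attn_params + ffn_params))  # 0.666... -> ~2/3 of layer params
```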
Position-wise Processing
The "position-wise" designation means each token position is transformed independently:
Input X: [batch, seq_len, d_model]
After first linear: [batch, seq_len, d_ff]
After ReLU: [batch, seq_len, d_ff] (same shape, max(0,x))
After second linear: [batch, seq_len, d_model]
Note: The same weights are applied to ALL positions (shared across sequence)
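A quick demonstration of this shape flow (a PyTorch sketch; nn.Linear acts on the last dimension, so the same weights hit every position):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, d_ff = 2, 10, 512, 2048
x = torch.randn(batch, seq_len, d_model)

fc1 = nn.Linear(d_model, d_ff)
fc2 = nn.Linear(d_ff, d_model)

h = fc1(x)             # [2, 10, 2048] -- same fc1 weights at all 10 positions
h = torch.relu(h)      # [2, 10, 2048]
y = fc2(h)             # [2, 10, 512]

# Position-wise means permuting positions before the FFN is the same as
# permuting them after it:
perm = torch.randperm(seq_len)
assert torch.allclose(fc2(torch.relu(fc1(x[:, perm]))), y[:, perm], atol=1e-6)
```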
Variants
1. Original FFN
max(0, x·W₁ + b₁)·W₂ + b₂, with ReLU as the activation
2. GELU FFN (BERT, GPT)
FFN(x) = σ(x·W₁ + b₁)·W₂ + b₂
where σ is the GELU activation; otherwise identical to the original FFN
3. GLU Variants (LLaMA, etc.)
FFN(x) = (σ(x·W₁) ⊙ (x·W₃)) · W₂
where σ in LLaMA is the Swish/SiLU activation, x · sigmoid(x); this combination is known as SwiGLU
Gated Linear Units tend to perform better than plain FFNs; d_ff is usually reduced (e.g., to 2/3 of the usual size) to keep the parameter count matched despite the third matrix
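A sketch of a SwiGLU feed-forward block, assuming the bias-free LLaMA-style layout (names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: (SiLU(x W1) * (x W3)) W2, as used in LLaMA-style models."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```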
Parameter Count
FFN parameters = W₁ + b₁ + W₂ + b₂
W₁: d_model × d_ff, b₁: d_ff
W₂: d_ff × d_model, b₂: d_model
For d_model=512, d_ff=2048:
W₁: 512 × 2048 = 1,048,576
W₂: 2048 × 512 = 1,048,576
b₁ + b₂: 2048 + 512 = 2,560
Total FFN: 2,099,712 ≈ 2.1M parameters per layer
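To verify, count the parameters of the FeedForward module sketched in the Architecture section:

```python
ffn = FeedForward(d_model=512, d_ff=2048)
total = sum(p.numel() for p in ffn.parameters())
print(total)  # 2099712 = 2 * 512 * 2048 (weights) + 2048 + 512 (biases)
```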