Introduction
Feed-forward networks (FFNs) are the position-wise fully connected sublayers in Transformer blocks. Each position in the sequence passes through the same two-layer network independently, so every token receives an identical transformation. The FFN typically accounts for the majority of a Transformer layer's parameters.
Architecture
FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂
Equivalent to: Linear → ReLU → Linear
Dimensions: d_model → d_ff → d_model
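A minimal sketch of this block in PyTorch (class and variable names are illustrative, not from any particular codebase):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: Linear -> ReLU -> Linear."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # W1, b1: expand to d_ff
        self.fc2 = nn.Linear(d_ff, d_model)   # W2, b2: project back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(x)))
```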
Standard Dimensions
| Model | d_model | d_ff | Expansion |
|---|---|---|---|
| Original Transformer | 512 | 2048 | 4× |
| BERT Base | 768 | 3072 | 4× |
| BERT Large | 1024 | 4096 | 4× |
| GPT-2 Small | 768 | 3072 | 4× |
| T5 1.1 Base | 768 | 2048 | 2.67× |
Why FFN is Important
Despite appearing simple, FFN serves critical functions:
- Parameter count: FFN typically holds ~2/3 of a Transformer layer's parameters (a quick check after this list shows where that figure comes from)
- Expressivity: Provides capacity for complex token-level transformations
- Separation of concerns: Attention captures relationships, FFN captures token-specific transformations
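A back-of-the-envelope check of the ~2/3 figure, assuming the standard d_ff = 4·d_model expansion and ignoring biases:

```python
d_model, d_ff = 512, 2048          # original Transformer sizes (d_ff = 4 * d_model)

attn_params = 4 * d_model**2       # W_Q, W_K, W_V, W_O projections
ffn_params = 2 * d_model * d_ff    # W1 (d_model x d_ff) + W2 (d_ff x d_model)

print(ffn_params / (attn_params + ffn_params))  # 0.666... -> ~2/3 of layer params
```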
Position-wise Processing
The "position-wise" designation means each token position is transformed independently:
Input X: [batch, seq_len, d_model]
After first linear: [batch, seq_len, d_ff]
After ReLU: [batch, seq_len, d_ff] (same shape, max(0,x))
After second linear: [batch, seq_len, d_model]
Note: The same weights are applied to ALL positions (shared across sequence)
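A quick demonstration of this shape flow (a PyTorch sketch; nn.Linear acts on the last dimension, so the same weights hit every position):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, d_ff = 2, 10, 512, 2048
x = torch.randn(batch, seq_len, d_model)

fc1 = nn.Linear(d_model, d_ff)
fc2 = nn.Linear(d_ff, d_model)

h = fc1(x)             # [2, 10, 2048] -- same fc1 weights at all 10 positions
h = torch.relu(h)      # [2, 10, 2048]
y = fc2(h)             # [2, 10, 512]

# Position-wise means permuting positions before the FFN is the same as
# permuting them after it:
perm = torch.randperm(seq_len)
assert torch.allclose(fc2(torch.relu(fc1(x[:, perm]))), y[:, perm], atol=1e-6)
```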
Variants
1. Original FFN
max(0, x·W₁ + b₁)·W₂ + b₂, with ReLU as the activation
2. GELU FFN (BERT, GPT)
FFN(x) = σ(x·W₁ + b₁)·W₂ + b₂
where σ is the GELU activation; otherwise identical to the original FFN
3. GLU Variants (LLaMA, etc.)
FFN(x) = (σ(x·W₁) ⊙ (x·W₃)) · W₂
where σ in LLaMA is the Swish/SiLU activation, x · sigmoid(x); this combination is known as SwiGLU
Gated Linear Units tend to perform better than plain FFNs; d_ff is usually reduced (e.g., to 2/3 of the usual size) to keep the parameter count matched despite the third matrix
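A sketch of a SwiGLU feed-forward block, assuming the bias-free LLaMA-style layout (names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: (SiLU(x W1) * (x W3)) W2, as used in LLaMA-style models."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```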
Parameter Count
FFN parameters = W₁ + b₁ + W₂ + b₂
W₁: d_model × d_ff, b₁: d_ff
W₂: d_ff × d_model, b₂: d_model
For d_model=512, d_ff=2048:
W₁: 512 × 2048 = 1,048,576
W₂: 2048 × 512 = 1,048,576
b₁ + b₂: 2048 + 512 = 2,560
Total FFN: 2,099,712 ≈ 2.1M parameters per layer
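To verify, count the parameters of the FeedForward module sketched in the Architecture section:

```python
ffn = FeedForward(d_model=512, d_ff=2048)
total = sum(p.numel() for p in ffn.parameters())
print(total)  # 2099712 = 2 * 512 * 2048 (weights) + 2048 + 512 (biases)
```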