22. Feed-Forward Layers

Introduction

Feed-forward layers (FFNs) are the position-wise fully connected networks found in every Transformer layer. Each position in the sequence passes through the same feed-forward transformation independently, so every token is processed by an identical two-layer network with no interaction between positions. The FFN typically accounts for the majority of a Transformer's parameters (roughly two-thirds of each layer when d_ff = 4·d_model).

Architecture

FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂

Equivalent to: Linear → ReLU → Linear

Dimensions: d_model → d_ff → d_model
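The block is small enough to write out directly. Below is a minimal PyTorch sketch of this architecture; the class name FeedForward and the attribute names w1/w2 are illustrative choices, not taken from any particular codebase:

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    # Position-wise FFN: Linear -> ReLU -> Linear,
    # applied with the same weights at every position.
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.w2 = nn.Linear(d_ff, d_model)   # project back: d_ff -> d_model

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        return self.w2(torch.relu(self.w1(x)))

nn.Linear includes the bias terms b₁ and b₂ by default, matching the formula above.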

Standard Dimensions

Model                  d_model   d_ff    Expansion
Original Transformer   512       2048    4×
BERT Base              768       3072    4×
BERT Large             1024      4096    4×
GPT-2 Small            768       3072    4×
T5 Base                768       2048    2.67×

Why FFN is Important

Despite appearing simple, FFN serves critical functions:

Position-wise Processing

The "position-wise" designation means each token position is transformed independently:

Input X: [batch, seq_len, d_model]

After first linear: [batch, seq_len, d_ff]

After ReLU: [batch, seq_len, d_ff] (same shape, max(0,x))

After second linear: [batch, seq_len, d_model]

Note: The same weights are applied to ALL positions (shared across sequence)
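This shape flow, and the position independence it implies, can be checked in a few lines. The sketch below assumes the same w1/w2 layers as before; because no information flows between positions, permuting the sequence before or after the FFN gives the same result:

import torch
import torch.nn as nn

batch, seq_len, d_model, d_ff = 2, 10, 512, 2048
w1, w2 = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)

x = torch.randn(batch, seq_len, d_model)
h = torch.relu(w1(x))          # [2, 10, 2048]
y = w2(h)                      # [2, 10, 512]
print(h.shape, y.shape)

# Position-wise means permuting positions commutes with the FFN:
perm = torch.randperm(seq_len)
assert torch.allclose(w2(torch.relu(w1(x[:, perm]))), y[:, perm], atol=1e-6)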

Variants

1. Original FFN

FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂, i.e., the formulation given above with ReLU as the activation.

2. GELU FFN (BERT, GPT)

FFN(x) = GELU(x·W₁ + b₁)·W₂ + b₂

Identical in structure to the original, but with the smoother GELU (Gaussian Error Linear Unit) in place of ReLU.

3. GLU Variants (LLaMA, etc.)

FFN(x) = (σ(x·W₁) ⊙ (x·W₃)) · W₂

Gated Linear Units add a third projection W₃ that acts as a multiplicative gate (⊙ denotes element-wise multiplication). In LLaMA's SwiGLU, σ is the Swish/SiLU activation x · sigmoid(x). GLU variants empirically outperform plain FFNs at matched parameter counts; d_ff is often reduced (e.g., to 8/3 · d_model) to compensate for the extra weight matrix.
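All three variants are shown side by side in the sketch below. The function names are illustrative; biases are included here for simplicity, although LLaMA-style models typically omit them. F.silu computes x · sigmoid(x):

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_ff = 512, 2048
w1 = nn.Linear(d_model, d_ff)
w2 = nn.Linear(d_ff, d_model)
w3 = nn.Linear(d_model, d_ff)   # gate projection, used only by the GLU variant

def ffn_relu(x):     # 1. original Transformer
    return w2(F.relu(w1(x)))

def ffn_gelu(x):     # 2. BERT / GPT
    return w2(F.gelu(w1(x)))

def ffn_swiglu(x):   # 3. LLaMA-style SwiGLU
    return w2(F.silu(w1(x)) * w3(x))

x = torch.randn(2, 10, d_model)
print(ffn_relu(x).shape, ffn_gelu(x).shape, ffn_swiglu(x).shape)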

Parameter Count

FFN parameters = W₁ + b₁ + W₂ + b₂

W₁: d_model × d_ff
b₁: d_ff
W₂: d_ff × d_model
b₂: d_model

For d_model=512, d_ff=2048:
W₁: 512 × 2048 = 1,048,576
W₂: 2048 × 512 = 1,048,576
Biases: 2048 + 512 = 2,560
Total FFN: 2,099,712 ≈ 2.1M parameters per layer
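This count is easy to verify programmatically. A quick check with PyTorch (nn.Sequential used here purely for brevity):

import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # W1: 512*2048 weights + 2048 biases
    nn.ReLU(),
    nn.Linear(d_ff, d_model),   # W2: 2048*512 weights + 512 biases
)
total = sum(p.numel() for p in ffn.parameters())
print(total)   # 2,099,712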

Test Your Understanding

Question 1: What is the typical expansion ratio of FFN in Transformers?

  • A) 1× (no expansion)
  • B) 2×
  • C) 4×
  • D) 10×

Question 2: What activation function does the original Transformer FFN use?

  • A) Sigmoid
  • B) Tanh
  • C) ReLU
  • D) Softmax

Question 3: What portion of Transformer parameters does FFN typically comprise?

  • A) About 1/3
  • B) About 2/3
  • C) About 1/2
  • D) About 1/4

Question 4: What does "position-wise" mean in FFN?

  • A) Only first position is processed
  • B) Each token position is processed independently with shared weights
  • C) Positions are ordered
  • D) Position embedding is used

Question 5: For d_model=768 and d_ff=3072, what is the size of W₁?

  • A) 768 × 768
  • B) 768 × 3072
  • C) 3072 × 768
  • D) 3072 × 3072

Question 6: Which activation is used in BERT and GPT models?

  • A) ReLU
  • B) GELU
  • C) Leaky ReLU
  • D) ELU

Question 7: In GLU variants, what is added to the FFN?

  • A) Additional linear layer
  • B) Gating mechanism with sigmoid
  • C) Dropout
  • D) Layer normalization

Question 8: What happens to FFN input [batch, seq_len, d_model] after the first linear layer?

  • A) [batch, seq_len, d_model]
  • B) [batch, seq_len, d_ff]
  • C) [batch, d_ff, seq_len]
  • D) [d_ff, batch, seq_len]