24. Layer Normalization

Introduction

Layer normalization normalizes activations across the feature dimension (instead of across the batch). It stabilizes training by giving activations a consistent mean and variance, and it is a key component of Transformer architectures alongside residual connections.

Formula

μ = (1/d) Σᵢ xᵢ

σ² = (1/d) Σᵢ (xᵢ - μ)²

LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β

where:
γ = learnable scale parameter
β = learnable shift parameter
ε = small constant for numerical stability
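
As a sanity check on the formula, here is a minimal NumPy sketch (the function name layer_norm and the identity initialization γ = 1, β = 0 are illustrative choices, not taken from any particular library):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed over the last (feature) axis, per example.
    mu = x.mean(axis=-1, keepdims=True)        # μ = (1/d) Σᵢ xᵢ
    var = x.var(axis=-1, keepdims=True)        # σ² = (1/d) Σᵢ (xᵢ - μ)²
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalize
    return gamma * x_hat + beta                # scale by γ, shift by β

d = 8
x = np.random.randn(4, d)                      # 4 examples, d features each
out = layer_norm(x, np.ones(d), np.zeros(d))   # γ = 1, β = 0 (identity init)
print(out.mean(axis=-1))                       # ≈ 0 for every example
print(out.var(axis=-1))                        # ≈ 1 for every example
```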

What Gets Normalized

LayerNorm normalizes across the feature dimension for each individual example:

Input shape: [batch, seq_len, d_model]

For each (batch, position) pair:
1. Compute mean and variance over the d_model features
2. Normalize to have mean 0, variance 1
3. Apply learnable γ, β

Output shape: [batch, seq_len, d_model]
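
To make the shapes concrete, here is a small PyTorch check using torch.nn.LayerNorm (the tensor sizes are arbitrary example values):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 5, 16
x = torch.randn(batch, seq_len, d_model)

ln = nn.LayerNorm(d_model)      # normalizes over the last (d_model) dimension
y = ln(x)

print(y.shape)                                       # torch.Size([2, 5, 16]) — unchanged
print(y.mean(dim=-1).abs().max().item())             # ≈ 0 for every (batch, position)
print(y.var(dim=-1, unbiased=False).mean().item())   # ≈ 1
```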

LayerNorm vs BatchNorm

Aspect              | LayerNorm                  | BatchNorm
--------------------|----------------------------|--------------------------
Normalizes over     | Feature dimension          | Batch dimension
Training behavior   | Uses current sample stats  | Uses batch statistics
Inference           | Same as training           | Uses running statistics
RNN compatibility   | Yes                        | No (sequential)
Transformer usage   | Yes (standard)             | No (not suitable)
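
The axis difference in the first row can be seen directly (a NumPy sketch with made-up sizes):

```python
import numpy as np

x = np.random.randn(32, 64)                 # [batch, features]

ln_mean = x.mean(axis=1, keepdims=True)     # LayerNorm: one mean per example -> [32, 1]
bn_mean = x.mean(axis=0, keepdims=True)     # BatchNorm: one mean per feature -> [1, 64]
```

Because LayerNorm's statistics never involve other samples, its behavior is identical at training and inference time, which is what the "Inference" row refers to.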

In Transformers

LayerNorm is applied in two places within each Transformer layer (shown here in the original, Post-LN arrangement):

1. After attention: LayerNorm(x + Attention(x))

2. After FFN: LayerNorm(x + FFN(x))
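
A compact PyTorch sketch of such a layer (the class name PostLNBlock and the layer sizes are hypothetical; the residual-then-normalize pattern is the point):

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention
        x = self.ln1(x + attn_out)         # 1. LayerNorm(x + Attention(x))
        x = self.ln2(x + self.ffn(x))      # 2. LayerNorm(x + FFN(x))
        return x

x = torch.randn(2, 5, 64)                  # [batch, seq_len, d_model]
print(PostLNBlock()(x).shape)              # torch.Size([2, 5, 64])
```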

Pre-LN vs Post-LN

Post-LN (Original Transformer)

output = LayerNorm(x + SubLayer(x))

LayerNorm comes AFTER the residual addition

Pre-LN (More Common Now)

output = x + SubLayer(LayerNorm(x))

LayerNorm comes BEFORE the sub-layer; the residual path is left un-normalized

Keeping the residual path free of normalization tends to make training more stable, especially in deep models
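
Side by side, with a plain nn.Linear standing in for the attention or FFN sub-layer (an illustrative stand-in, not a real Transformer sub-layer):

```python
import torch
import torch.nn as nn

d_model = 64
x = torch.randn(2, 5, d_model)
ln = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or FFN

post_ln_out = ln(x + sublayer(x))        # Post-LN: normalize AFTER the residual add
pre_ln_out = x + sublayer(ln(x))         # Pre-LN: normalize BEFORE the sub-layer
```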

Why LayerNorm Works

By giving every example's activations a fixed mean and variance, LayerNorm keeps the input distribution of each sub-layer consistent across training steps and positions. This stabilizes gradient magnitudes through deep stacks, while the learnable γ and β let the network recover any scale and shift that normalization would otherwise remove.

Parameters

LayerNorm has 2 parameters per feature:

γ ∈ ℝ^{d_model} (scale)
β ∈ ℝ^{d_model} (shift)

For d_model=768: 2 × 768 = 1536 parameters per LayerNorm

A Transformer layer has 2 LayerNorms → 2 × 1536 = 3072 parameters
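
This is easy to verify in PyTorch, where γ and β are stored as the module's weight and bias:

```python
import torch.nn as nn

ln = nn.LayerNorm(768)
print(sum(p.numel() for p in ln.parameters()))   # 1536 = 768 (γ, weight) + 768 (β, bias)
```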

Test Your Understanding

Question 1: What does LayerNorm normalize over?

  • A) Batch dimension
  • B) Feature dimension
  • C) Sequence dimension
  • D) Time dimension

Question 2: What are the learnable parameters in LayerNorm?

  • A) Only γ (scale)
  • B) Only β (shift)
  • C) γ and β
  • D) No learnable parameters

Question 3: In Post-LN, where is LayerNorm applied?

  • A) Before residual connection
  • B) After residual connection (on x + SubLayer)
  • C) Only on SubLayer output
  • D) Before attention only

Question 4: What is the shape of γ in LayerNorm for Transformer?

  • A) [batch, seq_len]
  • B) [d_model]
  • C) [batch]
  • D) [seq_len, d_model]

Question 5: How many LayerNorm instances are in one Transformer layer?

  • A) 1
  • B) 2
  • C) 3
  • D) 6

Question 6: What is ε in the LayerNorm formula for?

  • A) Learning rate
  • B) Numerical stability (prevent division by zero)
  • C) Exponential decay
  • D) Regularization

Question 7: Why is LayerNorm used in Transformers instead of BatchNorm?

  • A) BatchNorm has more parameters
  • B) LayerNorm works per-sample, suitable for variable-length sequences
  • C) BatchNorm is too slow
  • D) LayerNorm requires less memory

Question 8: LayerNorm normalizes to have mean and variance:

  • A) Mean=1, Var=0
  • B) Mean=0, Var=1
  • C) Mean=0, Var=0
  • D) Mean=1, Var=1