24. Layer Normalization

Introduction

Layer normalization normalizes activations across the feature dimension (instead of across the batch). It stabilizes training by giving activations a consistent mean and variance, and it is a key component of Transformer architectures alongside residual connections.

Formula

μ = (1/d) Σᵢ xᵢ

σ² = (1/d) Σᵢ (xᵢ - μ)²

LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β

where:
γ = learnable scale parameter
β = learnable shift parameter
ε = small constant for numerical stability
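
As a sanity check on the formula, here is a minimal NumPy sketch (the function name layer_norm and the identity initialization γ = 1, β = 0 are illustrative choices, not taken from any particular library):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed over the last (feature) axis, per example.
    mu = x.mean(axis=-1, keepdims=True)        # μ = (1/d) Σᵢ xᵢ
    var = x.var(axis=-1, keepdims=True)        # σ² = (1/d) Σᵢ (xᵢ - μ)²
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalize
    return gamma * x_hat + beta                # scale by γ, shift by β

d = 8
x = np.random.randn(4, d)                      # 4 examples, d features each
out = layer_norm(x, np.ones(d), np.zeros(d))   # γ = 1, β = 0 (identity init)
print(out.mean(axis=-1))                       # ≈ 0 for every example
print(out.var(axis=-1))                        # ≈ 1 for every example
```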

What Gets Normalized

LayerNorm normalizes across the feature dimension for each individual example:

Input shape: [batch, seq_len, d_model]

For each (batch, position) pair:
1. Compute mean and variance over the d_model features
2. Normalize to have mean 0, variance 1
3. Apply learnable γ, β

Output shape: [batch, seq_len, d_model]
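
To make the shapes concrete, here is a small PyTorch check using torch.nn.LayerNorm (the tensor sizes are arbitrary example values):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 5, 16
x = torch.randn(batch, seq_len, d_model)

ln = nn.LayerNorm(d_model)      # normalizes over the last (d_model) dimension
y = ln(x)

print(y.shape)                                       # torch.Size([2, 5, 16]) — unchanged
print(y.mean(dim=-1).abs().max().item())             # ≈ 0 for every (batch, position)
print(y.var(dim=-1, unbiased=False).mean().item())   # ≈ 1
```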

LayerNorm vs BatchNorm

Aspect              | LayerNorm                  | BatchNorm
--------------------|----------------------------|--------------------------
Normalizes over     | Feature dimension          | Batch dimension
Training behavior   | Uses current sample stats  | Uses batch statistics
Inference           | Same as training           | Uses running statistics
RNN compatibility   | Yes                        | No (sequential)
Transformer usage   | Yes (standard)             | No (not suitable)
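
The axis difference in the first row can be seen directly (a NumPy sketch with made-up sizes):

```python
import numpy as np

x = np.random.randn(32, 64)                 # [batch, features]

ln_mean = x.mean(axis=1, keepdims=True)     # LayerNorm: one mean per example -> [32, 1]
bn_mean = x.mean(axis=0, keepdims=True)     # BatchNorm: one mean per feature -> [1, 64]
```

Because LayerNorm's statistics never involve other samples, its behavior is identical at training and inference time, which is what the "Inference" row refers to.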

In Transformers

LayerNorm is applied in two places within each Transformer layer (shown here in the original, Post-LN arrangement):

1. After attention: LayerNorm(x + Attention(x))

2. After FFN: LayerNorm(x + FFN(x))
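
A compact PyTorch sketch of such a layer (the class name PostLNBlock and the layer sizes are hypothetical; the residual-then-normalize pattern is the point):

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention
        x = self.ln1(x + attn_out)         # 1. LayerNorm(x + Attention(x))
        x = self.ln2(x + self.ffn(x))      # 2. LayerNorm(x + FFN(x))
        return x

x = torch.randn(2, 5, 64)                  # [batch, seq_len, d_model]
print(PostLNBlock()(x).shape)              # torch.Size([2, 5, 64])
```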

Pre-LN vs Post-LN

Post-LN (Original Transformer)

output = LayerNorm(x + SubLayer(x))

LayerNorm comes AFTER the residual addition

Pre-LN (More Common Now)

output = x + SubLayer(LayerNorm(x))

LayerNorm comes BEFORE the sub-layer; the residual path is left un-normalized

Keeping the residual path free of normalization tends to make training more stable, especially in deep models
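
Side by side, with a plain nn.Linear standing in for the attention or FFN sub-layer (an illustrative stand-in, not a real Transformer sub-layer):

```python
import torch
import torch.nn as nn

d_model = 64
x = torch.randn(2, 5, d_model)
ln = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or FFN

post_ln_out = ln(x + sublayer(x))        # Post-LN: normalize AFTER the residual add
pre_ln_out = x + sublayer(ln(x))         # Pre-LN: normalize BEFORE the sub-layer
```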

Why LayerNorm Works

By giving every example's activations a fixed mean and variance, LayerNorm keeps the input distribution of each sub-layer consistent across training steps and positions. This stabilizes gradient magnitudes through deep stacks, while the learnable γ and β let the network recover any scale and shift that normalization would otherwise remove.

Parameters

LayerNorm has 2 parameters per feature:

γ ∈ ℝ^{d_model} (scale)
β ∈ ℝ^{d_model} (shift)

For d_model=768: 2 × 768 = 1536 parameters per LayerNorm

A Transformer layer has 2 LayerNorms → 2 × 1536 = 3072 parameters
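
This is easy to verify in PyTorch, where γ and β are stored as the module's weight and bias:

```python
import torch.nn as nn

ln = nn.LayerNorm(768)
print(sum(p.numel() for p in ln.parameters()))   # 1536 = 768 (γ, weight) + 768 (β, bias)
```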

Test Your Understanding

Question 1: What does LayerNorm normalize over?

  • A) Batch dimension
  • B) Feature dimension
  • C) Sequence dimension
  • D) Time dimension

Question 2: What are the learnable parameters in LayerNorm?

  • A) Only γ (scale)
  • B) Only β (shift)
  • C) γ and β
  • D) No learnable parameters

Question 3: In Post-LN, where is LayerNorm applied?

  • A) Before residual connection
  • B) After residual connection (on x + SubLayer)
  • C) Only on SubLayer output
  • D) Before attention only

Question 4: What is the shape of γ in LayerNorm for Transformer?

  • A) [batch, seq_len]
  • B) [d_model]
  • C) [batch]
  • D) [seq_len, d_model]

Question 5: How many LayerNorm instances are in one Transformer layer?

  • A) 1
  • B) 2
  • C) 3
  • D) 6

Question 6: What is ε in the LayerNorm formula for?

  • A) Learning rate
  • B) Numerical stability (prevent division by zero)
  • C) Exponential decay
  • D) Regularization

Question 7: Why is LayerNorm used in Transformers instead of BatchNorm?

  • A) BatchNorm has more parameters
  • B) LayerNorm works per-sample, suitable for variable-length sequences
  • C) BatchNorm is too slow
  • D) LayerNorm requires less memory

Question 8: LayerNorm normalizes to have mean and variance:

  • A) Mean=1, Var=0
  • B) Mean=0, Var=1
  • C) Mean=0, Var=0
  • D) Mean=1, Var=1