Introduction
Layer normalization normalizes activations across the feature dimension of each example (rather than across the batch). It stabilizes training by giving activations a consistent mean and variance, and it is a key component of Transformer architectures alongside residual connections.
Formula
μ = (1/d) Σᵢ xᵢ
σ² = (1/d) Σᵢ (xᵢ - μ)²
LayerNorm(x) = γ · (x - μ) / √(σ² + ε) + β
where:
γ = learnable scale parameter
β = learnable shift parameter
ε = small constant for numerical stability
d = number of features normalized over (e.g. d_model)
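As a sanity check, the formula maps directly to a few lines of PyTorch. A minimal sketch (the function name manual_layer_norm is mine; unbiased=False matches the 1/d variance above):

```python
import torch

def manual_layer_norm(x, gamma, beta, eps=1e-5):
    # Mean and biased (1/d) variance over the last (feature) dimension
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    # Normalize, then apply the learnable affine transform
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

d = 8
x = torch.randn(4, d)
gamma, beta = torch.ones(d), torch.zeros(d)
out = manual_layer_norm(x, gamma, beta)

# Agrees with PyTorch's built-in implementation
ref = torch.nn.functional.layer_norm(x, (d,), gamma, beta)
print(torch.allclose(out, ref, atol=1e-6))  # True
```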
What Gets Normalized
LayerNorm normalizes across the feature dimension for each individual example:
- Input shape: [batch, seq_len, d_model]
- For each (batch, position) pair:
  - Compute mean and std over the d_model features
  - Normalize to have mean 0, variance 1
  - Apply learnable γ, β
- Output shape: [batch, seq_len, d_model]
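A short demonstration with torch.nn.LayerNorm (the shapes here are illustrative): statistics are computed independently for each (batch, position) slice.

```python
import torch

batch, seq_len, d_model = 2, 5, 768
x = torch.randn(batch, seq_len, d_model)

ln = torch.nn.LayerNorm(d_model)  # normalizes over the last dimension
y = ln(x)

print(y.shape)  # torch.Size([2, 5, 768]) -- shape is unchanged
# With default gamma=1, beta=0, each slice has mean ~0 and variance ~1
print(y.mean(dim=-1).abs().max().item())            # ~0
print(y.var(dim=-1, unbiased=False).mean().item())  # ~1
```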
LayerNorm vs BatchNorm
| Aspect | LayerNorm | BatchNorm |
|---|---|---|
| Normalizes over | Feature dimension (per example) | Batch dimension (per feature) |
| Training behavior | Uses current sample's stats | Uses mini-batch statistics |
| Inference | Same as training | Uses running statistics |
| RNN compatibility | Yes | Poor (stats vary across time steps) |
| Transformer usage | Yes (standard) | No (variable sequence lengths make batch stats unreliable) |
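To make the axis difference concrete, here is a rough sketch for a [batch, features] input (variable names are mine):

```python
import torch

x = torch.randn(32, 768)  # [batch, features]

# LayerNorm: statistics over the feature axis, one pair per example
ln_mean = x.mean(dim=1, keepdim=True)  # shape [32, 1]

# BatchNorm: statistics over the batch axis, one pair per feature
bn_mean = x.mean(dim=0, keepdim=True)  # shape [1, 768]
```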
In Transformers
In the original (Post-LN) formulation, LayerNorm is applied in two places within each Transformer layer:
1. After attention: LayerNorm(x + Attention(x))
2. After FFN: LayerNorm(x + FFN(x))
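Putting the two together, here is a minimal sketch of one Post-LN Transformer layer (module names and sizes are illustrative; masking and dropout are omitted):

```python
import torch
import torch.nn as nn

class PostLNLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # 1. After attention: LayerNorm(x + Attention(x))
        x = self.norm1(x + self.attn(x, x, x)[0])
        # 2. After FFN: LayerNorm(x + FFN(x))
        x = self.norm2(x + self.ffn(x))
        return x
```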
Pre-LN vs Post-LN
Post-LN (Original Transformer)
output = LayerNorm(x + SubLayer(x))
LayerNorm comes AFTER the residual addition
Pre-LN (More Common Now)
output = x + SubLayer(LayerNorm(x))
LayerNorm comes BEFORE the sub-layer
This can be more stable during training, since gradients flow through the residual path without passing through a normalization
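The contrast in one sketch (sublayer stands in for attention or the FFN; norm is a LayerNorm instance):

```python
def post_ln(x, sublayer, norm):
    # Post-LN: normalize AFTER the residual addition
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm):
    # Pre-LN: normalize BEFORE the sub-layer; the residual path
    # itself is never normalized, which aids gradient flow
    return x + sublayer(norm(x))
```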
Why LayerNorm Works
- Stabilizes activations: Keeps values in reasonable range
- Enables higher learning rates: consistently scaled activations keep gradient magnitudes well-behaved, so larger steps are safe
- Reduces internal covariate shift: Inputs to each layer are more consistent
- Learned affine transformation: γ, β allow the network to undo normalization if needed
Parameters
LayerNorm has 2 parameters per feature:
γ ∈ ℝ^{d_model} (scale)
β ∈ ℝ^{d_model} (shift)
For d_model = 768: 2 × 768 = 1536 parameters per LayerNorm
A Transformer layer has 2 LayerNorms → 2 × 1536 = 3072 parameters
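A quick check with PyTorch's nn.LayerNorm (which has elementwise affine parameters by default):

```python
import torch.nn as nn

ln = nn.LayerNorm(768)
n_params = sum(p.numel() for p in ln.parameters())
print(n_params)      # 1536  (gamma: 768 + beta: 768)
print(2 * n_params)  # 3072  for the 2 LayerNorms in one Transformer layer
```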