Introduction
Residual connections (also called skip connections) were introduced in ResNet (2015) and later adopted in Transformers. They create a shortcut path that lets gradients flow directly through the network, making it possible to train very deep architectures without degradation.
The Residual Formula
The core residual connection is:
output = x + SubLayer(x)
In the original Transformer, this is combined with layer normalization ("Add & Norm"):
output = LayerNorm(x + SubLayer(x))
where:
x = input to the sub-layer
SubLayer(x) = the attention or feed-forward (FFN) sub-layer
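As a concrete sketch (PyTorch assumed; d_model and the linear stand-in for the sub-layer are illustrative, not part of the original formula):

import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for an attention or FFN sub-layer

x = torch.randn(2, 10, d_model)          # (batch, sequence, features)
residual = x + sublayer(x)               # the core residual connection
output = layer_norm(residual)            # Add & Norm (PostLN)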
Standard vs Residual:

Standard:
  x → [Layer] → out

Residual:
  x → [Layer] → add → out
   ↘____________↑
Why Residuals Help
1. Gradient Flow
During backpropagation, gradients can flow through the skip connection directly:
∂L/∂x = ∂L/∂out · (1 + ∂SubLayer(x)/∂x)
The "1" comes from the skip connection, so a useful gradient reaches x even when the sub-layer's own gradient is very small.
2. Easier Learning
If the optimal behavior of a layer is close to the identity mapping, the sub-layer can simply push its weights toward zero; the skip connection already provides a default identity path.
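For example, a residual block whose sub-layer weights are all zero reduces exactly to the identity (a minimal sketch, assuming a simple linear sub-layer in PyTorch):

import torch
import torch.nn as nn

sublayer = nn.Linear(8, 8)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

x = torch.randn(3, 8)
out = x + sublayer(x)                    # sub-layer contributes nothing
print(torch.allclose(out, x))            # True: the block is the identity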
3. Deep Networks
Without residuals, very deep networks suffer from degradation: training error gets worse as depth increases, even though the deeper network could in principle represent the shallower one. Residuals make it practical to train networks with 100+ layers.
In Transformers
Transformers use two residual connections per layer:
Transformer Layer:
Input x
│
├──────────────────────────┐
▼ │
[Multi-Head Attention] │
│ │
▼ SubLayer(x) │
[Add & Norm] ←──────────────┘
│
├──────────────────────────┐
▼ │
[Feed-Forward Network] │
│ │
▼ SubLayer(x) │
[Add & Norm] ←──────────────┘
│
▼
Output
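In code, one PostLN encoder layer could be sketched as follows (PyTorch assumed; the hyperparameters and class name are illustrative):

import torch
import torch.nn as nn

class PostLNEncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # First residual connection: around multi-head attention
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # Add & Norm
        # Second residual connection: around the feed-forward network
        x = self.norm2(x + self.ffn(x))   # Add & Norm
        return x

layer = PostLNEncoderLayer()
x = torch.randn(2, 10, 512)              # (batch, sequence, d_model)
print(layer(x).shape)                    # torch.Size([2, 10, 512])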
Key Insight
Residual connections don't restrict what the network can compute; if the sub-layer outputs zero, the block simply passes its input through unchanged. What they add is an alternative, direct path for gradients. This means:
- The network can learn to pass information unchanged when helpful
- Training is more stable with better gradient flow
- Initialization can be closer to identity, which helps convergence (see the sketch below)
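One common way to act on the last point is to zero-initialize the final projection of each residual branch, so every block starts out as the identity (a sketch of the idea, not a required recipe):

import torch.nn as nn

# Zero-initializing the sub-layer's output projection means SubLayer(x) = 0 at
# the start of training, so x + SubLayer(x) begins as the identity mapping.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
nn.init.zeros_(ffn[-1].weight)
nn.init.zeros_(ffn[-1].bias)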
Connection to Pre-LN
Modern Transformers often use "Pre-Layer Normalization" (PreLN), which applies the layer norm to the sub-layer's input rather than after the residual add:
PreLN: output = x + SubLayer(LayerNorm(x))
Original (PostLN): output = LayerNorm(x + SubLayer(x))
PreLN is often more stable during training, because the residual path carries x through without any normalization in between.
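Side by side, the difference is a single line (a minimal sketch, assuming PyTorch and a linear stand-in for the sub-layer):

import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or FFN
x = torch.randn(2, 10, d_model)

post_ln = norm(x + sublayer(x))          # PostLN: normalize after the residual add
pre_ln = x + sublayer(norm(x))           # PreLN: normalize only the sub-layer input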
Test Your Understanding
Question 1: What is the formula for residual connection?
- A) output = Layer(SubLayer(x))
- B) output = x + SubLayer(x)
- C) output = SubLayer(x) - x
- D) output = LayerNorm(x) + LayerNorm(SubLayer(x))
Question 2: How do residual connections help gradient flow?
- A) They add gradients from skip path
- B) They multiply gradients
- C) They have no effect on gradients
- D) They make gradients smaller
Question 3: If the sub-layer outputs zero, what does the residual connection x + SubLayer(x) compute?
- A) Zero
- B) LayerNorm(x)
- C) x (the input, unchanged)
- D) LayerNorm(0)
Question 4: Where are residual connections used in Transformers?
- A) Only after attention
- B) Only after FFN
- C) After both attention and FFN
- D) No residual connections
Question 5: What problem did residual connections solve?
- A) Overfitting
- B) Vanishing gradients and degradation in deep networks
- C) Computational cost
- D) Memory usage
Question 6: How many residual connections are in each Transformer layer?
- A) 1
- B) 2 (one after attention, one after FFN)
- C) 3
- D) 4
Question 7: What paper introduced residual connections?
- A) "Attention is All You Need"
- B) "Deep Residual Learning for Image Recognition" (ResNet)
- C) "BERT"
- D) "ImageNet Classification"
Question 8: In PreLN vs PostLN, which applies layer norm before the sub-layer?
- A) PostLN
- B) PreLN
- C) Neither
- D) Both