Introduction
Residual connections (also called skip connections) were introduced in ResNet (2015) and later adopted in Transformers. They create a shortcut path that lets gradients flow directly through the network, making it possible to train very deep architectures without degradation.
The Residual Formula
The core residual connection is:
output = x + SubLayer(x)
In the original Transformer, this is combined with layer normalization ("Add & Norm"):
output = LayerNorm(x + SubLayer(x))
where:
x = input to the sub-layer
SubLayer(x) = the attention or feed-forward (FFN) sub-layer
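As a concrete sketch (PyTorch assumed; d_model and the linear stand-in for the sub-layer are illustrative, not part of the original formula):

import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for an attention or FFN sub-layer

x = torch.randn(2, 10, d_model)          # (batch, sequence, features)
residual = x + sublayer(x)               # the core residual connection
output = layer_norm(residual)            # Add & Norm (PostLN)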
Standard vs Residual:

Standard:
  x → [Layer] → out

Residual:
  x → [Layer] → add → out
   ↘____________↑
Why Residuals Help
1. Gradient Flow
During backpropagation, gradients can flow through the skip connection directly:
∂L/∂x = ∂L/∂out · (1 + ∂SubLayer(x)/∂x)
The "1" comes from the skip connection, so a useful gradient reaches x even when the sub-layer's own gradient is very small.
2. Easier Learning
If the optimal behavior of a layer is close to the identity mapping, the sub-layer can simply push its weights toward zero; the skip connection already provides a default identity path.
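For example, a residual block whose sub-layer weights are all zero reduces exactly to the identity (a minimal sketch, assuming a simple linear sub-layer in PyTorch):

import torch
import torch.nn as nn

sublayer = nn.Linear(8, 8)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

x = torch.randn(3, 8)
out = x + sublayer(x)                    # sub-layer contributes nothing
print(torch.allclose(out, x))            # True: the block is the identity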
3. Deep Networks
Without residuals, very deep networks suffer from degradation: training error gets worse as depth increases, even though the deeper network could in principle represent the shallower one. Residuals make it practical to train networks with 100+ layers.
In Transformers
Transformers use two residual connections per layer:
Transformer Layer:
Input x
│
├──────────────────────────┐
▼ │
[Multi-Head Attention] │
│ │
▼ SubLayer(x) │
[Add & Norm] ←──────────────┘
│
├──────────────────────────┐
▼ │
[Feed-Forward Network] │
│ │
▼ SubLayer(x) │
[Add & Norm] ←──────────────┘
│
▼
Output
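In code, one PostLN encoder layer could be sketched as follows (PyTorch assumed; the hyperparameters and class name are illustrative):

import torch
import torch.nn as nn

class PostLNEncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # First residual connection: around multi-head attention
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # Add & Norm
        # Second residual connection: around the feed-forward network
        x = self.norm2(x + self.ffn(x))   # Add & Norm
        return x

layer = PostLNEncoderLayer()
x = torch.randn(2, 10, 512)              # (batch, sequence, d_model)
print(layer(x).shape)                    # torch.Size([2, 10, 512])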
Key Insight
Residual connections don't restrict what the network can compute; if the sub-layer outputs zero, the block simply passes its input through unchanged. What they add is an alternative, direct path for gradients. This means:
- The network can learn to pass information unchanged when helpful
- Training is more stable with better gradient flow
- Initialization can be closer to identity, which helps convergence (see the sketch below)
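One common way to act on the last point is to zero-initialize the final projection of each residual branch, so every block starts out as the identity (a sketch of the idea, not a required recipe):

import torch.nn as nn

# Zero-initializing the sub-layer's output projection means SubLayer(x) = 0 at
# the start of training, so x + SubLayer(x) begins as the identity mapping.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
nn.init.zeros_(ffn[-1].weight)
nn.init.zeros_(ffn[-1].bias)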
Connection to Pre-LN
Modern Transformers often use "Pre-Layer Normalization" (PreLN), which applies the layer norm to the sub-layer's input rather than after the residual add:
PreLN: output = x + SubLayer(LayerNorm(x))
Original (PostLN): output = LayerNorm(x + SubLayer(x))
PreLN is often more stable during training, because the residual path carries x through without any normalization in between.
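Side by side, the difference is a single line (a minimal sketch, assuming PyTorch and a linear stand-in for the sub-layer):

import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or FFN
x = torch.randn(2, 10, d_model)

post_ln = norm(x + sublayer(x))          # PostLN: normalize after the residual add
pre_ln = x + sublayer(norm(x))           # PreLN: normalize only the sub-layer input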
Test Your Understanding
Question 1: What is the formula for residual connection?
- A) output = Layer(SubLayer(x))
- B) output = x + SubLayer(x)
- C) output = SubLayer(x) - x
- D) output = LayerNorm(x) + LayerNorm(SubLayer(x))
Question 2: How do residual connections help gradient flow?
- A) They add gradients from skip path
- B) They multiply gradients
- C) They have no effect on gradients
- D) They make gradients smaller
Question 3: If the sub-layer outputs zero, what does the residual connection x + SubLayer(x) compute?
- A) Zero
- B) LayerNorm(x)
- C) x (the input, unchanged)
- D) LayerNorm(0)
Question 4: Where are residual connections used in Transformers?
- A) Only after attention
- B) Only after FFN
- C) After both attention and FFN
- D) No residual connections
Question 5: What problem did residual connections solve?
- A) Overfitting
- B) Vanishing gradients and degradation in deep networks
- C) Computational cost
- D) Memory usage
Question 6: How many residual connections are in each Transformer layer?
- A) 1
- B) 2 (one after attention, one after FFN)
- C) 3
- D) 4
Question 7: What paper introduced residual connections?
- A) "Attention is All You Need"
- B) "Deep Residual Learning for Image Recognition" (ResNet)
- C) "BERT"
- D) "ImageNet Classification"
Question 8: In PreLN vs PostLN, which applies layer norm before the sub-layer?
- A) PostLN
- B) PreLN
- C) Neither
- D) Both