23. Residual Connections

Introduction

Residual connections (also called skip connections) were introduced in ResNet (2015) and later adopted in Transformers. They create a shortcut path that lets gradients flow directly through the network, enabling very deep architectures to be trained without degradation.

The Residual Formula

The core residual connection is:

output = x + SubLayer(x)

In the Transformer, each residual sum is followed by layer normalization (the "Add & Norm" step):

output = LayerNorm(x + SubLayer(x))

where:
x = input to the sub-layer
SubLayer(x) = the sub-layer's output (multi-head attention or the FFN)
Standard vs Residual:

  Standard:   x ──→ [Layer] ──→ out

  Residual:   x ──→ [Layer] ──→ add ──→ out
              └────────skip───────↑
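
A minimal PyTorch sketch of this Add & Norm pattern; ResidualBlock and its sub_layer argument are illustrative names, not from any particular library:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Post-LN residual wrapper: output = LayerNorm(x + SubLayer(x))."""
    def __init__(self, sub_layer: nn.Module, d_model: int):
        super().__init__()
        self.sub_layer = sub_layer           # e.g. attention or FFN
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sub_layer(x))

# Example: wrap a small feed-forward sub-layer
d_model = 8
ffn = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, d_model))
block = ResidualBlock(ffn, d_model)
x = torch.randn(2, 5, d_model)               # (batch, seq_len, d_model)
print(block(x).shape)                        # torch.Size([2, 5, 8])
```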

Why Residuals Help

1. Gradient Flow

During backpropagation, gradients can flow through the skip connection directly:

∂L/∂x = ∂L/∂out · (1 + ∂SubLayer(x)/∂x)

The "1" comes from the skip connection, ensuring gradient flow even if layer gradient is small

2. Easier Learning

If the mapping a block needs to learn is close to the identity, the sub-layer only has to produce (near-)zero output, which it can do by pushing its weights toward zero; the skip connection supplies the identity path by default.
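
A small sketch of that default path: zeroing a linear sub-layer's weights by hand (purely for illustration) makes the residual block reproduce its input exactly:

```python
import torch
import torch.nn as nn

d_model = 4
sub_layer = nn.Linear(d_model, d_model)
nn.init.zeros_(sub_layer.weight)       # push weights toward zero...
nn.init.zeros_(sub_layer.bias)         # ...so SubLayer(x) == 0

x = torch.randn(3, d_model)
out = x + sub_layer(x)                 # residual connection
print(torch.allclose(out, x))          # True: output falls back to the input
```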

3. Deep Networks

Without residuals, very deep networks suffer from degradation: adding layers raises even the training error, so the problem is not overfitting. Residual connections made it practical to train networks with 100+ layers.

In Transformers

Transformers use two residual connections per layer:

Transformer Layer:

  Input x
    │
    ├──────────────────────────┐
    ▼                          │
  [Multi-Head Attention]       │
    │  SubLayer(x)             │
    ▼                          │
  [Add & Norm] ←───────────────┘
    │
    ├──────────────────────────┐
    ▼                          │
  [Feed-Forward Network]       │
    │  SubLayer(x)             │
    ▼                          │
  [Add & Norm] ←───────────────┘
    │
    ▼
  Output
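
A sketch of one such encoder layer in the Post-LN arrangement, built on PyTorch's nn.MultiheadAttention; the hyperparameters and class name are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One Post-LN encoder layer: two sub-layers, each wrapped in Add & Norm."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)      # self-attention sub-layer
        x = self.norm1(x + attn_out)          # Add & Norm #1
        x = self.norm2(x + self.ffn(x))       # Add & Norm #2
        return x

layer = TransformerLayer(d_model=16, n_heads=4, d_ff=64)
x = torch.randn(2, 10, 16)                    # (batch, seq_len, d_model)
print(layer(x).shape)                         # torch.Size([2, 10, 16])
```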

Key Insight

Residual connections don't reduce what the network can express; they make the identity mapping the easy default (if the sub-layer outputs zero, the output equals the input) and provide an extra path for gradients. That combination is what lets very deep stacks of layers train reliably.

Connection to Pre-LN

Modern Transformers often use "Pre-Layer Normalization" (PreLN):

PreLN: output = x + SubLayer(LayerNorm(x))

Original (PostLN): output = LayerNorm(x + SubLayer(x))

PreLN tends to be more stable during training, because the skip path carries x through untouched (LayerNorm is applied only on the sub-layer branch), so gradients reach early layers more directly.
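
The difference is only where LayerNorm sits relative to the residual sum. A side-by-side sketch, with a plain linear layer standing in for attention or the FFN:

```python
import torch
import torch.nn as nn

d_model = 8
sub_layer = nn.Linear(d_model, d_model)   # stand-in for attention / FFN
norm = nn.LayerNorm(d_model)
x = torch.randn(2, d_model)

# Post-LN (original Transformer): normalize after the residual sum
post_ln = norm(x + sub_layer(x))

# Pre-LN (common in modern Transformers): normalize only the sub-layer input,
# leaving the skip path itself un-normalized
pre_ln = x + sub_layer(norm(x))
```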

Test Your Understanding

Question 1: What is the formula for residual connection?

  • A) output = Layer(SubLayer(x))
  • B) output = x + SubLayer(x)
  • C) output = SubLayer(x) - x
  • D) output = LayerNorm(x) + LayerNorm(SubLayer(x))

Question 2: How do residual connections help gradient flow?

  • A) They add gradients from skip path
  • B) They multiply gradients
  • C) They have no effect on gradients
  • D) They make gradients smaller

Question 3: If the sub-layer outputs zero (its weights have been pushed toward zero), what does x + SubLayer(x) produce?

  • A) Zero
  • B) LayerNorm(x)
  • C) Approximately x (unchanged)
  • D) LayerNorm(0)

Question 4: Where are residual connections used in Transformers?

  • A) Only after attention
  • B) Only after FFN
  • C) After both attention and FFN
  • D) No residual connections

Question 5: What problem did residual connections solve?

  • A) Overfitting
  • B) Vanishing gradients and degradation in deep networks
  • C) Computational cost
  • D) Memory usage

Question 6: How many residual connections are in each Transformer layer?

  • A) 1
  • B) 2 (one after attention, one after FFN)
  • C) 3
  • D) 4

Question 7: What paper introduced residual connections?

  • A) "Attention is All You Need"
  • B) "Deep Residual Learning for Image Recognition" (ResNet)
  • C) "BERT"
  • D) "ImageNet Classification"

Question 8: In PreLN vs PostLN, which applies layer norm before the sub-layer?

  • A) PostLN
  • B) PreLN
  • C) Neither
  • D) Both