59. Attention Rollout

Introduction

Attention rollout is a technique that accumulates attention weights across layers of a transformer to produce a single attention matrix representing the overall influence between positions. This provides a more complete picture of information flow through the network.

Problem with Single-Layer Attention

Attention from a single layer only shows direct, one-hop connections. Information flows through multiple layers, so no single layer's map tells the whole story, and later layers' attention may better reflect the model's final decisions.

Rollout Algorithm

For L layers, the rollout is defined recursively:

Rollout_1 = A_1 (the attention matrix from layer 1)

Rollout_l = A_l @ Rollout_{l-1}

Or, with the residual correction: Rollout_l = (A_l + I) @ Rollout_{l-1}. The result is a single matrix showing the accumulated influence between all positions.
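As a concrete sketch, the plain recurrence takes only a few lines of NumPy. The function name is illustrative, and it assumes `attentions` holds head-averaged, row-stochastic matrices A_1 ... A_L (one per layer); the residual variant appears in the next section.

```python
import numpy as np

def attention_rollout_plain(attentions):
    """attentions: list of L arrays, each (seq, seq), heads already averaged."""
    rollout = attentions[0]            # Rollout_1 = A_1
    for A in attentions[1:]:           # layers 2 .. L
        rollout = A @ rollout          # Rollout_l = A_l @ Rollout_{l-1}
    return rollout
```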

With Residual Connections

Since each layer computes output = x + Attention(x), the identity matrix is included to account for the direct residual path:

Augmented attention: Â = A + I
Rollout = Â_L @ Â_{L-1} @ ... @ Â_1
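A minimal sketch of the residual-augmented rollout, assuming the same list-of-matrices input as above. Re-normalizing each row of Â so it sums to 1 keeps the product row-stochastic; when the rows of A already sum to 1, this matches the common 0.5 * (A + I) averaging used in rollout formulations.

```python
import numpy as np

def attention_rollout(attentions):
    """Rollout with the residual correction Â = A + I, rows re-normalized."""
    seq_len = attentions[0].shape[-1]
    identity = np.eye(seq_len)
    rollout = identity
    for A in attentions:                                   # layers 1 .. L
        A_hat = A + identity                               # skip connection
        A_hat = A_hat / A_hat.sum(axis=-1, keepdims=True)  # rows sum to 1
        rollout = A_hat @ rollout                          # Â_l @ ... @ Â_1
    return rollout
```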

What Rollout Captures

Rollout captures indirect, multi-hop connections: if information moves from B to C in one layer and from C to A in a later layer, the matrix product links A to B (A → C → B) even though no single layer connects them directly. Because it aggregates every layer, the rollout map often correlates better with the model's final output than any single layer's attention. The toy example below makes the multi-hop case concrete.
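A toy numeric check with made-up attention matrices: layer 1 sends B's information to C, and layer 2 sends C's information to A. Neither layer links A to B on its own, but their product does.

```python
import numpy as np

# Rows are queries (order: A, B, C), columns are keys; each row sums to 1.
A1 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0]])   # layer 1: C attends to B
A2 = np.array([[0.0, 0.0, 1.0],    # layer 2: A attends to C
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])

rollout = A2 @ A1
print(rollout[0])   # [0. 1. 0.] -- A's accumulated influence traces back to B
```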

Applications

Typical applications are interpretability-oriented: highlighting which input tokens most influence a prediction, tracing information flow through BERT-style encoders, and producing saliency-like maps over image patches in Vision Transformers. A sketch of wiring rollout to a pretrained model follows.
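As a usage sketch, per-layer attention matrices can be pulled from a Hugging Face Transformers model by passing output_attentions=True and then fed to the attention_rollout function defined above. The model name here is just an illustrative choice.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"                      # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("Attention rollout traces influence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: tuple of L tensors, each (batch, heads, seq, seq).
# Average over heads and drop the batch dimension before rolling out.
attentions = [a[0].mean(dim=0).numpy() for a in outputs.attentions]
rollout = attention_rollout(attentions)         # function from the sketch above
print(rollout.shape)                            # (seq, seq)
```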

Test Your Understanding

Question 1: Attention rollout accumulates attention:

  • A) Within single layer
  • B) Across multiple layers
  • C) No accumulation
  • D) Only at final layer

Question 2: Rollout helps understand:

  • A) Only layer 1
  • B) Information flow through all layers
  • C) No multi-hop
  • D) Single token only

Question 3: With residual connections, we add identity matrix to attention because:

  • A) Attention is wrong
  • B) Residual connections provide direct path (skip connections)
  • C) No reason
  • D) To make it faster

Question 4: Rollout can capture multi-hop connections like:

  • A) A → B directly
  • B) A → C → B (through intermediate C)
  • C) No hops
  • D) Random hops

Question 5: Rollout formula Rollout_l = A_l @ Rollout_{l-1} computes:

  • A) Sum of attentions
  • B) Matrix product to accumulate influence
  • C) Average
  • D) No product

Question 6: A_l in rollout represents:

  • A) Layer 1 only
  • B) Attention matrix from layer l
  • C) Loss
  • D) Input

Question 7: Rollout often better correlates with model output than single-layer because:

  • A) It includes all layer information
  • B) It's random
  • C) It uses less computation
  • D) No reason

Question 8: For 12-layer model, rollout computes:

  • A) 1 matrix
  • B) Product of 12 attention matrices
  • C) 12 separate attentions
  • D) Sum of 12