59. Attention Rollout

Introduction

Attention rollout is a technique that accumulates attention weights across layers of a transformer to produce a single attention matrix representing the overall influence between positions. This provides a more complete picture of information flow through the network.

Problem with Single-Layer Attention

Attention from a single layer only shows direct, one-hop connections. Information flows through multiple layers, so no single layer's map tells the whole story, and later layers' attention may better reflect the model's final decisions.

Rollout Algorithm

For L layers, the rollout is defined recursively:

Rollout_1 = A_1 (the attention matrix from layer 1)

Rollout_l = A_l @ Rollout_{l-1}

Or, with the residual correction: Rollout_l = (A_l + I) @ Rollout_{l-1}. The result is a single matrix showing the accumulated influence between all positions.
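As a concrete sketch, the plain recurrence takes only a few lines of NumPy. The function name is illustrative, and it assumes `attentions` holds head-averaged, row-stochastic matrices A_1 ... A_L (one per layer); the residual variant appears in the next section.

```python
import numpy as np

def attention_rollout_plain(attentions):
    """attentions: list of L arrays, each (seq, seq), heads already averaged."""
    rollout = attentions[0]            # Rollout_1 = A_1
    for A in attentions[1:]:           # layers 2 .. L
        rollout = A @ rollout          # Rollout_l = A_l @ Rollout_{l-1}
    return rollout
```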

With Residual Connections

Since each layer computes output = x + Attention(x), the identity matrix is included to account for the direct residual path:

Augmented attention: Â = A + I
Rollout = Â_L @ Â_{L-1} @ ... @ Â_1
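A minimal sketch of the residual-augmented rollout, assuming the same list-of-matrices input as above. Re-normalizing each row of Â so it sums to 1 keeps the product row-stochastic; when the rows of A already sum to 1, this matches the common 0.5 * (A + I) averaging used in rollout formulations.

```python
import numpy as np

def attention_rollout(attentions):
    """Rollout with the residual correction Â = A + I, rows re-normalized."""
    seq_len = attentions[0].shape[-1]
    identity = np.eye(seq_len)
    rollout = identity
    for A in attentions:                                   # layers 1 .. L
        A_hat = A + identity                               # skip connection
        A_hat = A_hat / A_hat.sum(axis=-1, keepdims=True)  # rows sum to 1
        rollout = A_hat @ rollout                          # Â_l @ ... @ Â_1
    return rollout
```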

What Rollout Captures

Rollout captures indirect, multi-hop connections: if information moves from B to C in one layer and from C to A in a later layer, the matrix product links A to B (A → C → B) even though no single layer connects them directly. Because it aggregates every layer, the rollout map often correlates better with the model's final output than any single layer's attention. The toy example below makes the multi-hop case concrete.
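A toy numeric check with made-up attention matrices: layer 1 sends B's information to C, and layer 2 sends C's information to A. Neither layer links A to B on its own, but their product does.

```python
import numpy as np

# Rows are queries (order: A, B, C), columns are keys; each row sums to 1.
A1 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0]])   # layer 1: C attends to B
A2 = np.array([[0.0, 0.0, 1.0],    # layer 2: A attends to C
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])

rollout = A2 @ A1
print(rollout[0])   # [0. 1. 0.] -- A's accumulated influence traces back to B
```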

Applications

Typical applications are interpretability-oriented: highlighting which input tokens most influence a prediction, tracing information flow through BERT-style encoders, and producing saliency-like maps over image patches in Vision Transformers. A sketch of wiring rollout to a pretrained model follows.
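As a usage sketch, per-layer attention matrices can be pulled from a Hugging Face Transformers model by passing output_attentions=True and then fed to the attention_rollout function defined above. The model name here is just an illustrative choice.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"                      # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("Attention rollout traces influence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: tuple of L tensors, each (batch, heads, seq, seq).
# Average over heads and drop the batch dimension before rolling out.
attentions = [a[0].mean(dim=0).numpy() for a in outputs.attentions]
rollout = attention_rollout(attentions)         # function from the sketch above
print(rollout.shape)                            # (seq, seq)
```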

Test Your Understanding

Question 1: Attention rollout accumulates attention:

  • A) Within single layer
  • B) Across multiple layers
  • C) No accumulation
  • D) Only at final layer

Question 2: Rollout helps understand:

  • A) Only layer 1
  • B) Information flow through all layers
  • C) No multi-hop
  • D) Single token only

Question 3: With residual connections, we add identity matrix to attention because:

  • A) Attention is wrong
  • B) Residual connections provide direct path (skip connections)
  • C) No reason
  • D) To make it faster

Question 4: Rollout can capture multi-hop connections like:

  • A) A → B directly
  • B) A → C → B (through intermediate C)
  • C) No hops
  • D) Random hops

Question 5: Rollout formula Rollout_l = A_l @ Rollout_{l-1} computes:

  • A) Sum of attentions
  • B) Matrix product to accumulate influence
  • C) Average
  • D) No product

Question 6: A_l in rollout represents:

  • A) Layer 1 only
  • B) Attention matrix from layer l
  • C) Loss
  • D) Input

Question 7: Rollout often better correlates with model output than single-layer because:

  • A) It includes all layer information
  • B) It's random
  • C) It uses less computation
  • D) No reason

Question 8: For 12-layer model, rollout computes:

  • A) 1 matrix
  • B) Product of 12 attention matrices
  • C) 12 separate attentions
  • D) Sum of 12