Introduction
Attention rollout (Abnar & Zuidema, 2020) is a technique that accumulates attention weights across the layers of a transformer, via repeated matrix multiplication, to produce a single attention matrix representing the overall influence of each position on every other. This gives a more complete picture of information flow through the network than any single layer's attention map.
Problem with Single-Layer Attention
Single-layer attention only shows direct, one-hop connections. Information actually flows through multiple layers: by the later layers, each position's representation already mixes information from many tokens, so no single layer's attention map on its own reliably reflects which inputs drive the final model decisions.
Rollout Algorithm
For L layers:
Rollout_1 = A_1 (attention from layer 1)
Rollout_l = A_l @ Rollout_{l-1}
Or, accounting for the residual connection: Rollout_l = (A_l + I) @ Rollout_{l-1} (with rows re-normalized; see below)
Result: Single matrix showing accumulated influence
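The recurrence is just a chain of matrix products. Below is a minimal sketch in NumPy; the function name compute_rollout and the add_residual flag are illustrative, and the per-layer matrices are assumed to be already averaged over attention heads, with rows summing to 1.

```python
import numpy as np

def compute_rollout(attentions, add_residual=True):
    """Accumulate per-layer attention into a single rollout matrix.

    attentions: list of (seq_len, seq_len) row-stochastic arrays,
                one per layer, ordered from first to last layer.
    """
    rollout = None
    for a in attentions:
        if add_residual:
            # Blend in the identity to model the residual path,
            # keeping rows normalized: 0.5 * A + 0.5 * I.
            a = 0.5 * (a + np.eye(a.shape[0]))
        # Left-multiply so earlier layers end up on the right:
        # Rollout_l = A_l @ Rollout_{l-1}.
        rollout = a if rollout is None else a @ rollout
    return rollout
```

Each row of the result is the accumulated influence distribution over input positions for one output position.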
With Residual Connections
Since each layer has: output = x + Attention(x)
Include the identity to account for residual flow, re-normalizing rows so they still sum to 1:
Augmented attention: Â = 0.5 (A + I)
Rollout = Â_L @ Â_{L-1} @ ... @ Â_1
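In practice the per-layer matrices come from the model's attention weights, averaged over heads. A sketch of how one might collect them, assuming the Hugging Face transformers library (the checkpoint and sentence are placeholders):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("Attention rollout traces influence.", return_tensors="pt")
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple of L tensors, each (batch, heads, seq, seq).
# Average over heads to get one row-stochastic matrix per layer.
attentions = [a[0].mean(dim=0).numpy() for a in out.attentions]

rollout = compute_rollout(attentions)  # from the sketch above
```

Averaging over heads is the simplest choice; fusing heads with a per-head maximum or minimum instead is a common variant.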
What Rollout Captures
- Multi-hop connections: Token A influences B through an intermediate token C (two hops); see the worked example after this list
- Deep representation: Combines all layer information
- Better correlation: Often correlates with model outputs more strongly than single-layer attention does
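A tiny worked version of the multi-hop case, with the residual term left out for clarity (the three-token setup is invented for illustration):

```python
import numpy as np

# Layer 1: token C (index 2) attends entirely to token B (index 1).
A1 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0]])

# Layer 2: token A (index 0) attends entirely to token C (index 2).
A2 = np.array([[0.0, 0.0, 1.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])

print(A2[0])          # [0. 0. 1.]  -> direct attention sees only C
print((A2 @ A1)[0])   # [0. 1. 0.]  -> rollout reveals the A->C->B path
```

The single-layer map says A looks at C; only the product exposes that the information A receives originated at B.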
Applications
- Better visualization: More complete picture of attention
- Interpretability: Understand how information flows
- Feature importance: Aggregate importance across layers
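For the visualization use case, the rollout matrix can be rendered as a token-by-token heatmap; this sketch assumes the rollout and tokens variables from the snippets above and standard matplotlib:

```python
import matplotlib.pyplot as plt

# `rollout` is (seq_len, seq_len); `tokens` labels both axes.
fig, ax = plt.subplots()
im = ax.imshow(rollout, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=90)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("attended-to position")
ax.set_ylabel("output position")
fig.colorbar(im, ax=ax, label="accumulated influence")
fig.tight_layout()
plt.show()
```

A single output position's row can also be plotted on its own as a per-token importance bar chart.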