Introduction
Attention interpretability studies how attention weights can be used to understand what neural networks have learned. While attention weights are often used as explanations for model behavior, research shows they may not always provide faithful explanations.
Faithfulness of Attention
Key question: Does attention reflect what the model actually uses?
Evidence For
- Attention often correlates with gradient-based importance (see the sketch after this list)
- Attention patterns match human intuitions about syntax/semantics
- Ablating attended positions hurts performance
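A minimal sketch of the first point, using a toy single-head self-attention layer (all shapes and tensors here are made up for illustration): compute the attention weights for one query position, compute gradient-based importance for the same inputs, and compare the two rankings.

```python
# Sketch: compare attention weights with gradient-based importance for one
# query position, using a toy single-head self-attention layer.
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

torch.manual_seed(0)
seq_len, d_model = 8, 16
x = torch.randn(seq_len, d_model, requires_grad=True)

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

q, k, v = x @ W_q, x @ W_k, x @ W_v
attn = F.softmax(q @ k.T / d_model**0.5, dim=-1)   # (seq_len, seq_len)
out = attn @ v

# Scalar "prediction" derived from the first query position's output.
score = out[0].sum()
score.backward()

# Gradient-based importance of each input token (L2 norm of its gradient).
grad_importance = x.grad.norm(dim=-1)

# Attention weights from query position 0 to every key position.
attn_importance = attn[0].detach()

rho, _ = spearmanr(attn_importance.numpy(), grad_importance.numpy())
print(f"Spearman correlation (attention vs. gradient importance): {rho:.3f}")
```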
Evidence Against
- Attention can be noisy and not focused on causal features
- Different attention patterns can yield the same output (see the permutation sketch after this list)
- Models can learn to ignore attention in some cases
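A minimal sketch of the "different patterns, same output" point, in the spirit of counterfactual-attention experiments (e.g. Jain & Wallace, 2019): permute the attention distribution and measure how much the attention-weighted output changes. The tensors are random placeholders, not a trained model.

```python
# Minimal counterfactual-attention test: permute the attention distribution
# and measure how much the weighted output actually changes.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 8, 16
values = torch.randn(seq_len, d_model)           # value vectors at each position
attn = F.softmax(torch.randn(seq_len), dim=-1)   # attention from one query position

original_out = attn @ values

# Counterfactual: randomly permute the attention weights over positions.
perm = torch.randperm(seq_len)
permuted_out = attn[perm] @ values

delta = (original_out - permuted_out).norm() / original_out.norm()
print(f"Relative output change under permuted attention: {delta:.3f}")
# If such permutations barely change downstream predictions in a real model,
# the attention pattern is not a faithful explanation of the output.
```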
Attention as Explanation
The attention weight αᵢⱼ is often read as an explanation:
"Token i attends to token j, so j is important for i."
But attention ≠ causality: a high weight shows where the model looked, not that the attended token actually drove the prediction.
"Token i attends to token j, so j is important for i"
But: Attention ≠ Causality
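To make the αᵢⱼ notation concrete, here is one way to read off attention weights from a pretrained model via Hugging Face Transformers. The model name, example sentence, and token indices are just illustrative.

```python
# Sketch: reading off the attention weight alpha_ij that is commonly quoted
# as an "explanation" (here from BERT via Hugging Face Transformers).
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of per-layer tensors, each (batch, heads, seq, seq)
last_layer = outputs.attentions[-1][0]   # (heads, seq, seq)
alpha = last_layer.mean(dim=0)           # average over heads

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
i, j = 2, 6                              # roughly "cat" attending to "mat"
print(f"alpha[{tokens[i]} -> {tokens[j]}] = {alpha[i, j]:.3f}")
# The weight says where the model looked, not what it causally used.
```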
Better Interpretability Methods
| Method | Description |
|---|---|
| Attention rollout | Accumulate attention across layers |
| Grad-CAM | Gradient-weighted activation maps for visualizing importance |
| Layer-wise Relevance | Relevance propagation |
| Probing classifiers | Train classifiers on attention weights |
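A short sketch of attention rollout from the table above (following Abnar & Zuidema, 2020): fold the residual connection into each layer's attention matrix, renormalize, and accumulate across layers by matrix multiplication. The 0.5/0.5 residual mixing follows the original paper's convention.

```python
# Attention rollout: accumulate attention across layers, accounting for
# the residual connection at each layer.
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer tensors, each (heads, seq, seq)."""
    rollout = None
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=0)                        # average over heads
        attn = 0.5 * attn + 0.5 * torch.eye(attn.size(-1))   # add residual path
        attn = attn / attn.sum(dim=-1, keepdim=True)         # re-normalize rows
        rollout = attn if rollout is None else attn @ rollout
    return rollout  # (seq, seq): accumulated attention from outputs to input tokens

# Toy usage with random "attention" matrices (4 layers, 8 heads, 10 tokens).
layers = [torch.softmax(torch.randn(8, 10, 10), dim=-1) for _ in range(4)]
print(attention_rollout(layers).shape)  # torch.Size([10, 10])
```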
Probing Classifiers
Train a probe (a simple classifier) on attention weights or attention-derived representations to see what information they encode, for example to predict:
- POS tags
- Syntactic dependencies
- Coreference
If the probe performs well → the information is present in attention (though this alone does not show the model uses it)
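A minimal probing-classifier sketch. The features and labels below are synthetic placeholders; in practice the features would come from a real model's attention and the labels from a POS-tagged corpus.

```python
# Linear probe on attention-derived features to predict POS tags
# (synthetic data stands in for real attention features and tags).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_tokens, seq_len, n_tags = 2000, 32, 5

# Hypothetical features: each token's attention distribution over the sequence.
X = rng.dirichlet(np.ones(seq_len), size=n_tokens)
y = rng.integers(0, n_tags, size=n_tokens)   # placeholder POS tag ids

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

acc = accuracy_score(y_test, probe.predict(X_test))
print(f"Probe accuracy: {acc:.3f}  (compare against a majority-class baseline)")
# A probe that beats the baseline suggests the information is present in the
# attention features; it does not show the model actually relies on it.
```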