Introduction
Attention interpretability studies how attention weights can be used to understand what neural networks have learned. While attention weights are often used as explanations for model behavior, research shows they may not always provide faithful explanations.
Faithfulness of Attention
Key question: Does attention reflect what the model actually uses?
Evidence For
- Attention often correlates with gradient-based importance (see the sketch after this list)
- Attention patterns match human intuitions about syntax/semantics
- Ablating attended positions hurts performance
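A minimal sketch of the first point, using a toy single-head self-attention layer (all shapes and tensors here are made up for illustration): compute the attention weights for one query position, compute gradient-based importance for the same inputs, and compare the two rankings.

```python
# Sketch: compare attention weights with gradient-based importance for one
# query position, using a toy single-head self-attention layer.
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

torch.manual_seed(0)
seq_len, d_model = 8, 16
x = torch.randn(seq_len, d_model, requires_grad=True)

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

q, k, v = x @ W_q, x @ W_k, x @ W_v
attn = F.softmax(q @ k.T / d_model**0.5, dim=-1)   # (seq_len, seq_len)
out = attn @ v

# Scalar "prediction" derived from the first query position's output.
score = out[0].sum()
score.backward()

# Gradient-based importance of each input token (L2 norm of its gradient).
grad_importance = x.grad.norm(dim=-1)

# Attention weights from query position 0 to every key position.
attn_importance = attn[0].detach()

rho, _ = spearmanr(attn_importance.numpy(), grad_importance.numpy())
print(f"Spearman correlation (attention vs. gradient importance): {rho:.3f}")
```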
Evidence Against
- Attention can be noisy and not focused on causal features
- Different attention patterns can yield the same output (see the permutation sketch after this list)
- Models can learn to ignore attention in some cases
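A minimal sketch of the "different patterns, same output" point, in the spirit of counterfactual-attention experiments (e.g. Jain & Wallace, 2019): permute the attention distribution and measure how much the attention-weighted output changes. The tensors are random placeholders, not a trained model.

```python
# Minimal counterfactual-attention test: permute the attention distribution
# and measure how much the weighted output actually changes.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 8, 16
values = torch.randn(seq_len, d_model)           # value vectors at each position
attn = F.softmax(torch.randn(seq_len), dim=-1)   # attention from one query position

original_out = attn @ values

# Counterfactual: randomly permute the attention weights over positions.
perm = torch.randperm(seq_len)
permuted_out = attn[perm] @ values

delta = (original_out - permuted_out).norm() / original_out.norm()
print(f"Relative output change under permuted attention: {delta:.3f}")
# If such permutations barely change downstream predictions in a real model,
# the attention pattern is not a faithful explanation of the output.
```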
Attention as Explanation
The attention weight αᵢⱼ is often read as an explanation:
"Token i attends to token j, so j is important for i."
But attention ≠ causality: a high weight shows where the model looked, not that the attended token actually drove the prediction.
"Token i attends to token j, so j is important for i"
But: Attention ≠ Causality
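To make the αᵢⱼ notation concrete, here is one way to read off attention weights from a pretrained model via Hugging Face Transformers. The model name, example sentence, and token indices are just illustrative.

```python
# Sketch: reading off the attention weight alpha_ij that is commonly quoted
# as an "explanation" (here from BERT via Hugging Face Transformers).
from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of per-layer tensors, each (batch, heads, seq, seq)
last_layer = outputs.attentions[-1][0]   # (heads, seq, seq)
alpha = last_layer.mean(dim=0)           # average over heads

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
i, j = 2, 6                              # roughly "cat" attending to "mat"
print(f"alpha[{tokens[i]} -> {tokens[j]}] = {alpha[i, j]:.3f}")
# The weight says where the model looked, not what it causally used.
```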
Better Interpretability Methods
| Method | Description |
|---|---|
| Attention rollout | Accumulate attention across layers |
| Grad-CAM | Gradient-weighted activation maps for visualizing importance |
| Layer-wise Relevance | Relevance propagation |
| Probing classifiers | Train classifiers on attention weights |
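A short sketch of attention rollout from the table above (following Abnar & Zuidema, 2020): fold the residual connection into each layer's attention matrix, renormalize, and accumulate across layers by matrix multiplication. The 0.5/0.5 residual mixing follows the original paper's convention.

```python
# Attention rollout: accumulate attention across layers, accounting for
# the residual connection at each layer.
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer tensors, each (heads, seq, seq)."""
    rollout = None
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=0)                        # average over heads
        attn = 0.5 * attn + 0.5 * torch.eye(attn.size(-1))   # add residual path
        attn = attn / attn.sum(dim=-1, keepdim=True)         # re-normalize rows
        rollout = attn if rollout is None else attn @ rollout
    return rollout  # (seq, seq): accumulated attention from outputs to input tokens

# Toy usage with random "attention" matrices (4 layers, 8 heads, 10 tokens).
layers = [torch.softmax(torch.randn(8, 10, 10), dim=-1) for _ in range(4)]
print(attention_rollout(layers).shape)  # torch.Size([10, 10])
```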
Probing Classifiers
Train a probe (a simple classifier) on attention weights or attention-derived representations to see what information they encode, for example to predict:
- POS tags
- Syntactic dependencies
- Coreference
If the probe performs well → the information is present in attention (though this alone does not show the model uses it)
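A minimal probing-classifier sketch. The features and labels below are synthetic placeholders; in practice the features would come from a real model's attention and the labels from a POS-tagged corpus.

```python
# Linear probe on attention-derived features to predict POS tags
# (synthetic data stands in for real attention features and tags).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_tokens, seq_len, n_tags = 2000, 32, 5

# Hypothetical features: each token's attention distribution over the sequence.
X = rng.dirichlet(np.ones(seq_len), size=n_tokens)
y = rng.integers(0, n_tags, size=n_tokens)   # placeholder POS tag ids

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

acc = accuracy_score(y_test, probe.predict(X_test))
print(f"Probe accuracy: {acc:.3f}  (compare against a majority-class baseline)")
# A probe that beats the baseline suggests the information is present in the
# attention features; it does not show the model actually relies on it.
```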