58. Attention Interpretability

Introduction

Attention interpretability studies how attention weights can be used to understand what neural networks have learned. While attention weights are often used as explanations for model behavior, research shows they may not always provide faithful explanations.

Faithfulness of Attention

Key question: Does attention reflect what the model actually uses?

Evidence For

- Attention maps often align with human intuitions about which tokens matter (e.g., sentiment-bearing words in sentiment classification).
- Ablating the positions a model attends to most tends to hurt performance, suggesting those positions are genuinely used (see the sketch after this list).
- Wiegreffe & Pinter (2019) argue that, with proper controls, attention can still carry useful explanatory signal.

Evidence Against

- Jain & Wallace (2019) found that very different ("counterfactual") attention distributions can often be substituted without changing the prediction.
- Attention weights frequently correlate poorly with gradient-based importance measures.
- Models can route information around attention (e.g., through residual connections), so a token with low attention is not necessarily unimportant.
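
A minimal sketch of the ablation test, using a toy single-head attention layer with random weights (everything here is illustrative; with a real transformer you would mask actual input tokens): ablate the most-attended position and measure how much the output changes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n = 16, 6                       # embedding dim, sequence length
x = torch.randn(n, d)              # toy token embeddings
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)  # (n, n) row-stochastic weights
    return attn, attn @ v                         # weights, per-token outputs

attn, out = attend(x)
query = 0                                  # inspect the first token
top = attn[query].argmax().item()          # its most-attended position

x_ablated = x.clone()
x_ablated[top] = 0.0                       # crude ablation: zero that token out
_, out_ablated = attend(x_ablated)

# A large change supports faithfulness: the attended position mattered.
delta = (out[query] - out_ablated[query]).norm().item()
print(f"most-attended position: {top}, output change after ablation: {delta:.4f}")
```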

Attention as Explanation

The attention weight αᵢⱼ is often read as an "explanation":
"Token i attends to token j, so token j is important for token i."

But: Attention ≠ Causality. A high attention weight does not guarantee that the attended token causally drives the prediction; the same output can sometimes be produced under very different attention distributions, as the sketch below illustrates.
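
A hedged sketch of the counterfactual-attention check (in the spirit of Jain & Wallace, 2019), in the same toy setting as above: swap the attention distribution for a uniform one and compare outputs. If the outputs barely differ, the weights were not causally necessary for this prediction.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n = 16, 6
x = torch.randn(n, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)   # "learned" (here: random) weights

out_learned = attn @ v                         # output with original attention
uniform = torch.full_like(attn, 1.0 / n)       # counterfactual: uniform weights
out_uniform = uniform @ v                      # output with uniform attention

# Per-token output change; small values would mean the original weights
# were not causally necessary for these outputs.
print((out_learned - out_uniform).norm(dim=-1))
```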

Better Interpretability Methods

Method                              Description
Attention rollout                   Accumulate attention weights across layers (sketch below)
Grad-CAM                            Gradient-based attention visualization
Layer-wise Relevance Propagation    Propagate relevance scores backward through layers
Probing classifiers                 Train classifiers on attention representations
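
A minimal sketch of attention rollout in the sense of Abnar & Zuidema (2020): average over heads, mix in the identity to account for residual connections, renormalize the rows, and multiply the per-layer matrices from bottom to top. The random attention maps below are placeholders for maps extracted from a real model.

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer attention arrays, each (heads, n, n)."""
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)              # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)            # model the residual connection
        a = a / a.sum(axis=-1, keepdims=True)    # renormalize rows to sum to 1
        rollout = a @ rollout                    # accumulate across layers
    return rollout                               # (n, n) token-to-token relevance

# Toy usage: 4 layers, 2 heads, 5 tokens of random row-normalized attention.
rng = np.random.default_rng(0)
raw = [rng.random((2, 5, 5)) for _ in range(4)]
attns = [a / a.sum(axis=-1, keepdims=True) for a in raw]
print(attention_rollout(attns).round(3))
```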

Probing Classifiers

Train a probe (a simple classifier) on attention representations to test what information they encode. Common probe targets:

- POS tags
- Syntactic dependencies
- Coreference

If the probe works well → the information is present in attention. Caveat: a successful probe shows the information is decodable, not that the model actually uses it. A minimal example follows.
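
A hypothetical probing setup with scikit-learn. The "attention features" (each token's attention row) and POS-tag labels below are synthetic stand-ins; with a real model you would extract attention maps and align them with gold annotations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_tokens, seq_len, n_tags = 2000, 12, 5

X = rng.random((n_tokens, seq_len))          # stand-in attention rows
X = X / X.sum(axis=1, keepdims=True)         # rows sum to 1, like softmax output
y = rng.integers(0, n_tags, size=n_tokens)   # stand-in POS tag labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
# With random features and labels this should hover near chance (1 / n_tags);
# accuracy well above chance on real data would suggest the information is
# linearly decodable from attention.
```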

Test Your Understanding

Question 1: Attention weights are always faithful explanations:

  • A) True
  • B) False - they may not reflect actual model behavior
  • C) Sometimes true
  • D) Always true for NLP

Question 2: Attention ≠ Causality means:

  • A) Attention always causes output
  • B) Attention weight doesn't necessarily mean causal importance
  • C) Attention causes gradient
  • D) No meaning

Question 3: Probing classifiers are used to:

  • A) Compress models
  • B) See what information is encoded in attention representations
  • C) Speed up inference
  • D) Visualize attention

Question 4: Attention rollout accumulates:

  • A) Gradients only
  • B) Attention weights across layers
  • C) Loss values
  • D) Random values

Question 5: A limitation of attention as explanation:

  • A) Too accurate
  • B) Models can learn to ignore attention
  • C) No limitations
  • D) Always perfect

Question 6: Grad-CAM uses:

  • A) Only attention weights
  • B) Gradients combined with attention
  • C) Random projections
  • D) No gradients

Question 7: If ablating attended positions hurts performance, it suggests:

  • A) Attention is random
  • B) Those positions are actually important
  • C) No relation
  • D) Model is perfect

Question 8: Attention interpretability studies whether:

  • A) Attention is fast
  • B) Attention reflects what the model actually uses
  • C) Attention uses less memory
  • D) No study needed