Introduction
Hard attention is an attention mechanism that selects a single specific position (or a few positions) from the source, rather than taking a weighted average of all positions. This is more akin to how humans focus attention: picking one spot to look at rather than blending everything together. However, this discrete selection is non-differentiable, so training requires special techniques.
Key Characteristics
- Discrete selection: Selects one (or few) positions to attend to
- Non-differentiable: Cannot use standard backpropagation
- Stochastic: Often uses sampling to select positions
- Memory efficient: Only processes selected positions
Mathematical Formulation
Probability of selecting position j: p(z=j) = softmax(e)ⱼ
Context vector: c = v_z (value at selected position)
Equivalently: c = Σᵢ αᵢ · vᵢ, where α is one-hot (αᵢ ∈ {0, 1})
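To make the formulation concrete, here is a minimal NumPy sketch (the names `scores`, `values`, `soft_context`, and `hard_context` are illustrative, not from any particular library): it computes p(z = j) = softmax(e)ⱼ, then contrasts the soft context (a weighted average) with the hard context c = v_z taken at a sampled position.

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax over attention scores e."""
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

rng = np.random.default_rng(0)

n, d = 6, 4                        # number of source positions, value dimension
scores = rng.normal(size=n)        # attention scores e (one per position)
values = rng.normal(size=(n, d))   # value vectors v_1 .. v_n

p = softmax(scores)                # p(z = j) = softmax(e)_j

# Soft attention: context is the weighted average of all values.
soft_context = p @ values          # shape (d,)

# Hard attention: sample one position z ~ p(z); context is that single value.
z = rng.choice(n, p=p)
hard_context = values[z]           # c = v_z, shape (d,)

print("selected position:", z)
print("soft context:", soft_context)
print("hard context:", hard_context)
```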
Selection Process
Hard attention typically uses one of these selection methods (a short code sketch follows the list):
1. Argmax Selection
Select the single highest-scoring position deterministically.
2. Stochastic Sampling
Sample a position according to the attention distribution p(z).
3. Top-k Selection
Select the top-k positions instead of just one, allowing partial soft attention within a hard-selection framework.
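Each selection rule takes only a couple of lines. The sketch below assumes a precomputed score vector and uses NumPy; the helper names (`argmax_select`, `stochastic_select`, `topk_select`) are illustrative, not standard APIs.

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

def argmax_select(scores):
    """1. Argmax: deterministically pick the highest-scoring position."""
    return int(np.argmax(scores))

def stochastic_select(scores, rng):
    """2. Stochastic sampling: draw z ~ p(z) = softmax(scores)."""
    return int(rng.choice(len(scores), p=softmax(scores)))

def topk_select(scores, k):
    """3. Top-k: keep the k highest-scoring positions (soft attention can
    then be applied within this subset)."""
    return np.argsort(scores)[-k:][::-1]

rng = np.random.default_rng(0)
scores = rng.normal(size=8)
print(argmax_select(scores), stochastic_select(scores, rng), topk_select(scores, k=3))
```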
Training with Hard Attention
Since hard attention is non-differentiable, we cannot use standard gradient descent. Instead, we use techniques from reinforcement learning:
REINFORCE Algorithm
Gradient estimate: ∇L = r · ∇ log p(z), where r is the reward (e.g., the negative loss)
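A minimal sketch of this estimator for hard attention, assuming the selection distribution is a softmax over the attention scores (function and variable names are illustrative, not a reference implementation):

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

def reinforce_grad(scores, values, reward_fn, rng):
    """REINFORCE (score-function) estimate of the gradient of the expected
    reward with respect to the attention scores.

    Estimate: r * d log p(z) / d(scores), with z ~ p(z) = softmax(scores).
    For a softmax distribution, d log p(z) / d(scores) = one_hot(z) - p.
    """
    p = softmax(scores)
    z = rng.choice(len(scores), p=p)   # sample a hard selection
    r = reward_fn(values[z])           # reward, e.g. the negative task loss
    grad_log_p = -p
    grad_log_p[z] += 1.0               # one_hot(z) - p
    return r * grad_log_p, z, r

rng = np.random.default_rng(0)
scores = rng.normal(size=5)
values = rng.normal(size=(5, 3))
target = np.array([1.0, 0.0, -1.0])
reward = lambda v: -np.sum((v - target) ** 2)  # negative squared error as reward

g, z, r = reinforce_grad(scores, values, reward, rng)
print("sampled position:", z, "reward:", round(r, 3))
print("gradient estimate w.r.t. scores:", g)
```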
Variance Reduction Techniques
- Baseline: Subtract a baseline value from the reward (a sketch follows this list)
- REINFORCE with baseline: ∇L = (r - b) · ∇ log p(z)
- Attention as policy: Treat the selection of attention positions as a policy trained with RL
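Building on the estimator sketched above, here is REINFORCE with a baseline. The running-average baseline used here is a common choice but an assumption of this example, not something prescribed by the text.

```python
import numpy as np

def reinforce_with_baseline_grad(scores, values, reward_fn, baseline, rng):
    """(r - b) * d log p(z) / d(scores), with z ~ p(z) = softmax(scores)."""
    e = scores - scores.max()
    p = np.exp(e) / np.exp(e).sum()
    z = rng.choice(len(scores), p=p)
    r = reward_fn(values[z])
    grad_log_p = -p
    grad_log_p[z] += 1.0                   # one_hot(z) - p
    return (r - baseline) * grad_log_p, r

rng = np.random.default_rng(1)
scores = rng.normal(size=5)
values = rng.normal(size=(5, 3))
reward = lambda v: -np.sum(v ** 2)         # higher reward for smaller-norm values

baseline = 0.0
for step in range(200):
    g, r = reinforce_with_baseline_grad(scores, values, reward, baseline, rng)
    scores += 0.1 * g                      # gradient ascent on expected reward
    baseline = 0.9 * baseline + 0.1 * r    # running-average baseline (assumed choice)
```

Subtracting the baseline leaves the estimator unbiased (since E[∇ log p(z)] = 0) but reduces the variance of the gradient estimates.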
Comparison: Soft vs Hard Attention
| Property | Soft Attention | Hard Attention |
|---|---|---|
| Selection | Dense (all positions) | Sparse (single position) |
| Output | Weighted average | Single value |
| Differentiable | Yes (via softmax) | No (requires RL) |
| Training | Standard backprop | Reinforcement learning |
| Context computation | O(n·d) over all positions | O(d) for one position |
| Memory | All positions | One position |
| Interpretability | Shows relative importance | Shows exact focus |
Advantages of Hard Attention
- Memory efficiency: Only stores and processes selected positions
- Theoretical appeal: More similar to human visual attention
- Potential for long sequences: Can handle very long sequences without O(n²) cost
Disadvantages
- Non-differentiable: Cannot use standard backpropagation
- High variance: RL training has high variance in gradients
- Unstable training: Difficult to make converge reliably
- Less precise gradients: Gradient estimates are noisy
Historical Context and Usage
Hard attention was originally used in the Show, Attend and Tell paper (Xu et al., 2015) for image captioning. It was one of the first attention mechanisms applied to computer vision.
However, due to training difficulties, soft attention became more popular and is now the dominant approach in modern architectures.
Modern Relevance
While soft attention dominates modern deep learning, hard attention concepts appear in:
- Sparse transformers: Selecting a subset of positions to attend to (sketched after this list)
- Routing mechanisms: Mixture of experts selecting specific experts
- Memory addressing: Hard addressing in memory networks
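As an illustration of the first item, here is a hedged NumPy sketch of sparse attention via hard selection of a position subset: the top-k scores are kept and the rest are masked to -inf before the softmax. Real sparse transformers use structured sparsity patterns; the function name and shapes here are illustrative only.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Soft attention restricted to the k highest-scoring positions.

    A minimal sketch of 'select a subset of positions, then attend softly
    within it'; the hard step is the top-k selection.
    """
    scores = K @ q / np.sqrt(q.shape[0])   # scaled dot-product scores, shape (n,)
    keep = np.argsort(scores)[-k:]         # hard selection of k positions
    masked = np.full_like(scores, -np.inf)
    masked[keep] = scores[keep]            # mask out everything else
    w = np.exp(masked - scores[keep].max())
    w = w / w.sum()                        # softmax over the kept positions only
    return w @ V                           # context built from the selected subset

rng = np.random.default_rng(0)
n, d = 16, 8
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
print(topk_sparse_attention(q, K, V, k=4))
```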