Introduction
Hard attention is an attention mechanism that selects a single specific position (or a few positions) from the source, rather than taking a weighted average of all positions. This is more akin to how humans focus attention: picking one spot to look at rather than blending everything together. However, this discrete selection is non-differentiable, so training requires special techniques.
Key Characteristics
- Discrete selection: Selects one (or few) positions to attend to
- Non-differentiable: Cannot use standard backpropagation
- Stochastic: Often uses sampling to select positions
- Memory efficient: Only processes selected positions
Mathematical Formulation
Probability of selecting position j: p(z=j) = softmax(e)ⱼ
Context vector: c = v_z (value at selected position)
Equivalently: c = Σᵢ αᵢ · vᵢ, where α is one-hot (αᵢ ∈ {0, 1})
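To make the formulation concrete, here is a minimal NumPy sketch (the names `scores`, `values`, `soft_context`, and `hard_context` are illustrative, not from any particular library): it computes p(z = j) = softmax(e)ⱼ, then contrasts the soft context (a weighted average) with the hard context c = v_z taken at a sampled position.

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax over attention scores e."""
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

rng = np.random.default_rng(0)

n, d = 6, 4                        # number of source positions, value dimension
scores = rng.normal(size=n)        # attention scores e (one per position)
values = rng.normal(size=(n, d))   # value vectors v_1 .. v_n

p = softmax(scores)                # p(z = j) = softmax(e)_j

# Soft attention: context is the weighted average of all values.
soft_context = p @ values          # shape (d,)

# Hard attention: sample one position z ~ p(z); context is that single value.
z = rng.choice(n, p=p)
hard_context = values[z]           # c = v_z, shape (d,)

print("selected position:", z)
print("soft context:", soft_context)
print("hard context:", hard_context)
```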
Selection Process
Hard attention typically uses one of these selection methods (a short code sketch follows the list):
1. Argmax Selection
Select the single highest-scoring position deterministically.
2. Stochastic Sampling
Sample a position according to the attention distribution p(z).
3. Top-k Selection
Select the top-k positions instead of just one, allowing partial soft attention within a hard-selection framework.
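Each selection rule takes only a couple of lines. The sketch below assumes a precomputed score vector and uses NumPy; the helper names (`argmax_select`, `stochastic_select`, `topk_select`) are illustrative, not standard APIs.

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

def argmax_select(scores):
    """1. Argmax: deterministically pick the highest-scoring position."""
    return int(np.argmax(scores))

def stochastic_select(scores, rng):
    """2. Stochastic sampling: draw z ~ p(z) = softmax(scores)."""
    return int(rng.choice(len(scores), p=softmax(scores)))

def topk_select(scores, k):
    """3. Top-k: keep the k highest-scoring positions (soft attention can
    then be applied within this subset)."""
    return np.argsort(scores)[-k:][::-1]

rng = np.random.default_rng(0)
scores = rng.normal(size=8)
print(argmax_select(scores), stochastic_select(scores, rng), topk_select(scores, k=3))
```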
Training with Hard Attention
Since hard attention is non-differentiable, we cannot use standard gradient descent. Instead, we use techniques from reinforcement learning:
REINFORCE Algorithm
Gradient estimate: ∇L = r · ∇ log p(z), where r is the reward (e.g., the negative loss)
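A minimal sketch of this estimator for hard attention, assuming the selection distribution is a softmax over the attention scores (function and variable names are illustrative, not a reference implementation):

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

def reinforce_grad(scores, values, reward_fn, rng):
    """REINFORCE (score-function) estimate of the gradient of the expected
    reward with respect to the attention scores.

    Estimate: r * d log p(z) / d(scores), with z ~ p(z) = softmax(scores).
    For a softmax distribution, d log p(z) / d(scores) = one_hot(z) - p.
    """
    p = softmax(scores)
    z = rng.choice(len(scores), p=p)   # sample a hard selection
    r = reward_fn(values[z])           # reward, e.g. the negative task loss
    grad_log_p = -p
    grad_log_p[z] += 1.0               # one_hot(z) - p
    return r * grad_log_p, z, r

rng = np.random.default_rng(0)
scores = rng.normal(size=5)
values = rng.normal(size=(5, 3))
target = np.array([1.0, 0.0, -1.0])
reward = lambda v: -np.sum((v - target) ** 2)  # negative squared error as reward

g, z, r = reinforce_grad(scores, values, reward, rng)
print("sampled position:", z, "reward:", round(r, 3))
print("gradient estimate w.r.t. scores:", g)
```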
Variance Reduction Techniques
- Baseline: Subtract a baseline value from the reward (a sketch follows this list)
- REINFORCE with baseline: ∇L = (r - b) · ∇ log p(z)
- Attention as policy: Treat the selection of attention positions as a policy trained with RL
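Building on the estimator sketched above, here is REINFORCE with a baseline. The running-average baseline used here is a common choice but an assumption of this example, not something prescribed by the text.

```python
import numpy as np

def reinforce_with_baseline_grad(scores, values, reward_fn, baseline, rng):
    """(r - b) * d log p(z) / d(scores), with z ~ p(z) = softmax(scores)."""
    e = scores - scores.max()
    p = np.exp(e) / np.exp(e).sum()
    z = rng.choice(len(scores), p=p)
    r = reward_fn(values[z])
    grad_log_p = -p
    grad_log_p[z] += 1.0                   # one_hot(z) - p
    return (r - baseline) * grad_log_p, r

rng = np.random.default_rng(1)
scores = rng.normal(size=5)
values = rng.normal(size=(5, 3))
reward = lambda v: -np.sum(v ** 2)         # higher reward for smaller-norm values

baseline = 0.0
for step in range(200):
    g, r = reinforce_with_baseline_grad(scores, values, reward, baseline, rng)
    scores += 0.1 * g                      # gradient ascent on expected reward
    baseline = 0.9 * baseline + 0.1 * r    # running-average baseline (assumed choice)
```

Subtracting the baseline leaves the estimator unbiased (since E[∇ log p(z)] = 0) but reduces the variance of the gradient estimates.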
Comparison: Soft vs Hard Attention
| Property | Soft Attention | Hard Attention |
|---|---|---|
| Selection | Dense (all positions) | Sparse (single position) |
| Output | Weighted average | Single value |
| Differentiable | Yes (via softmax) | No (requires RL) |
| Training | Standard backprop | Reinforcement learning |
| Context computation | O(n·d) over all positions | O(d) for one position |
| Memory | All positions | One position |
| Interpretability | Shows relative importance | Shows exact focus |
Advantages of Hard Attention
- Memory efficiency: Only stores and processes selected positions
- Theoretical appeal: More similar to human visual attention
- Potential for long sequences: Can handle very long sequences without O(n²) cost
Disadvantages
- Non-differentiable: Cannot use standard backpropagation
- High variance: RL training has high variance in gradients
- Unstable training: Difficult to make converge reliably
- Less precise gradients: Gradient estimates are noisy
Historical Context and Usage
Hard attention was originally used in the Show, Attend and Tell paper (Xu et al., 2015) for image captioning. It was one of the first attention mechanisms applied to computer vision.
However, due to training difficulties, soft attention became more popular and is now the dominant approach in modern architectures.
Modern Relevance
While soft attention dominates modern deep learning, hard attention concepts appear in:
- Sparse transformers: Selecting a subset of positions to attend to (sketched after this list)
- Routing mechanisms: Mixture of experts selecting specific experts
- Memory addressing: Hard addressing in memory networks
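As an illustration of the first item, here is a hedged NumPy sketch of sparse attention via hard selection of a position subset: the top-k scores are kept and the rest are masked to -inf before the softmax. Real sparse transformers use structured sparsity patterns; the function name and shapes here are illustrative only.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Soft attention restricted to the k highest-scoring positions.

    A minimal sketch of 'select a subset of positions, then attend softly
    within it'; the hard step is the top-k selection.
    """
    scores = K @ q / np.sqrt(q.shape[0])   # scaled dot-product scores, shape (n,)
    keep = np.argsort(scores)[-k:]         # hard selection of k positions
    masked = np.full_like(scores, -np.inf)
    masked[keep] = scores[keep]            # mask out everything else
    w = np.exp(masked - scores[keep].max())
    w = w / w.sum()                        # softmax over the kept positions only
    return w @ V                           # context built from the selected subset

rng = np.random.default_rng(0)
n, d = 16, 8
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
print(topk_sparse_attention(q, K, V, k=4))
```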