09. Hard Attention

Introduction

Hard attention is an attention mechanism that selects a single specific position (or a few positions) from the source rather than taking a weighted average of all positions. This is closer to how humans focus attention: picking one spot to look at rather than blending everything together. The discrete selection, however, makes the mechanism non-differentiable, so it requires special training techniques.

Key Characteristics

Mathematical Formulation

Select a position: z ∈ {1, 2, ..., n}

Probability of selecting position j: p(z = j) = softmax(e)ⱼ, where eᵢ are the attention scores

Context vector: c = v_z (the value at the selected position)

Equivalently: c = Σᵢ αᵢ · vᵢ, where α is one-hot (αᵢ ∈ {0, 1} and Σᵢ αᵢ = 1)
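
A minimal PyTorch sketch of this formulation; the tensor names `e` and `v` and the toy dimensions are illustrative, not from the original text:

```python
import torch

torch.manual_seed(0)
n, d = 5, 4
e = torch.randn(n)       # attention scores e_i
v = torch.randn(n, d)    # value vectors v_i

p = torch.softmax(e, dim=0)   # p(z = j) = softmax(e)_j
z = torch.argmax(p)           # deterministic hard selection
c = v[z]                      # context vector c = v_z

# Equivalent one-hot form: c = sum_i alpha_i * v_i with alpha one-hot at z
alpha = torch.nn.functional.one_hot(z, n).float()
assert torch.allclose(c, alpha @ v)
```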

Selection Process

Hard attention typically uses one of these selection methods (a short code sketch follows the list):

1. Argmax Selection

z = argmaxᵢ eᵢ (select the position with the highest score)

2. Stochastic Sampling

z ~ Categorical(softmax(e))

Sample the position according to the attention distribution

3. Top-k Selection

Select the top-k positions instead of just one, allowing partial soft attention within the hard framework.
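
A sketch of the three selection rules, assuming a vector of precomputed scores `e` (the function names are illustrative):

```python
import torch

def argmax_select(e):
    """1. Deterministic: pick the highest-scoring position."""
    return torch.argmax(e)

def sample_select(e):
    """2. Stochastic: z ~ Categorical(softmax(e))."""
    return torch.distributions.Categorical(logits=e).sample()

def topk_select(e, k):
    """3. Top-k: keep the k best positions (soft within the hard subset)."""
    return torch.topk(e, k).indices

e = torch.tensor([0.1, 2.0, -1.0, 0.5])
print(argmax_select(e))     # tensor(1)
print(sample_select(e))     # random position, biased toward index 1
print(topk_select(e, k=2))  # tensor([1, 3])
```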

Training with Hard Attention

Since the discrete selection is non-differentiable, gradients cannot flow through it via standard backpropagation. Instead, we estimate gradients with techniques from reinforcement learning:

REINFORCE Algorithm

∇L = E[ r · ∇ log p(z) ],  with z ∼ Categorical(softmax(e))

where r is the reward (e.g., the negative task loss). In practice the expectation is estimated by sampling one or a few z and using r · ∇ log p(z) as a stochastic gradient.

Variance Reduction Techniques

The REINFORCE estimator is notoriously high-variance. The standard remedy is to subtract a baseline b (e.g., a running average of recent rewards) from the reward:

∇L = E[ (r − b) · ∇ log p(z) ]

Subtracting a baseline leaves the expected gradient unchanged but can reduce its variance substantially.
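
A minimal sketch of one REINFORCE update with a running-average baseline; `score_net`, `reward_fn`, and the toy reward are illustrative placeholders, not part of the original text:

```python
import torch

n, d = 5, 4
query = torch.randn(d)
keys = torch.randn(n, d)
score_net = torch.nn.Linear(d, d, bias=False)  # produces attention scores
opt = torch.optim.SGD(score_net.parameters(), lr=0.1)
baseline = 0.0                                 # running mean of rewards

def reward_fn(z):
    # Stand-in reward, e.g. negative task loss; here: prefer position 0.
    return 1.0 if z.item() == 0 else 0.0

for step in range(100):
    e = keys @ score_net(query)                # scores e_i
    dist = torch.distributions.Categorical(logits=e)
    z = dist.sample()                          # z ~ Categorical(softmax(e))
    r = reward_fn(z)
    # Score-function estimator: grad L = E[(r - b) * grad log p(z)]
    loss = -(r - baseline) * dist.log_prob(z)
    opt.zero_grad()
    loss.backward()
    opt.step()
    baseline = 0.9 * baseline + 0.1 * r        # update running baseline
```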

Comparison: Soft vs Hard Attention

| Property         | Soft Attention             | Hard Attention            |
|------------------|----------------------------|---------------------------|
| Selection        | Dense (all positions)      | Sparse (single position)  |
| Output           | Weighted average           | Single value              |
| Differentiable   | Yes (via softmax)          | No (requires RL)          |
| Training         | Standard backprop          | Reinforcement learning    |
| Computation      | O(n·d), all positions      | O(d), one position        |
| Memory           | All positions              | One position              |
| Interpretability | Shows relative importance  | Shows exact focus         |

Advantages of Hard Attention

  • Efficiency: once a position is selected, only one value is read (O(d) rather than O(n·d))
  • Memory: only the selected position is kept, not all n positions
  • Interpretability: the model commits to an exact focus instead of a diffuse weighting

Disadvantages

  • Non-differentiable: discrete selection breaks standard backpropagation
  • Harder to train: requires reinforcement-learning techniques such as REINFORCE, whose gradient estimates are high-variance
  • Less stable: training tends to converge more slowly and is more sensitive to hyperparameters than soft attention

Historical Context and Usage

Hard attention was originally used in the Show, Attend and Tell paper (Xu et al., 2015) for image captioning. It was one of the first attention mechanisms applied to computer vision.

However, due to training difficulties, soft attention became more popular and is now the dominant approach in modern architectures.

Modern Relevance

While soft attention dominates modern deep learning, hard attention concepts appear in:

  • Sparse attention variants that restrict each query to a subset of positions
  • Discrete routing decisions, such as expert selection in mixture-of-experts models
  • Differentiable relaxations of discrete choice, such as Gumbel-Softmax and straight-through estimators (sketched below)
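
For illustration, a sketch of the straight-through trick mentioned above: the forward pass uses the hard one-hot selection, while gradients flow through the soft distribution. This is one common modern alternative to REINFORCE, not part of the original text:

```python
import torch

def straight_through_attention(e, v):
    p = torch.softmax(e, dim=-1)                  # soft weights
    z = torch.argmax(p, dim=-1)                   # hard selection
    hard = torch.nn.functional.one_hot(z, p.shape[-1]).float()
    # Hard values in the forward pass, soft gradients in the backward pass:
    alpha = hard + p - p.detach()
    return alpha @ v                              # selected value

e = torch.randn(5, requires_grad=True)
v = torch.randn(5, 3)
c = straight_through_attention(e, v)
c.sum().backward()
print(e.grad)  # gradients flow despite the discrete argmax
```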

Test Your Understanding

Question 1: Why is hard attention non-differentiable?

  • A) Uses softmax
  • B) Uses discrete selection (argmax or sampling)
  • C) Uses tanh
  • D) Uses matrix multiplication

Question 2: What training method is needed for hard attention?

  • A) Standard backpropagation
  • B) Reinforcement learning (REINFORCE)
  • C) No training needed
  • D) Supervised learning only

Question 3: In hard attention, what is the output context vector?

  • A) Weighted sum of all values
  • B) Value at selected position (one-hot weighted)
  • C) Sum of all values
  • D) Average of all values

Question 4: What is the computational complexity of hard attention per query?

  • A) O(n·d) where n is sequence length
  • B) O(d) where d is dimension
  • C) O(n²)
  • D) O(1)

Question 5: Which paper first used hard attention?

  • A) "Attention is All You Need"
  • B) "Neural Machine Translation"
  • C) "Show, Attend and Tell"
  • D) "BERT"

Question 6: What does the REINFORCE algorithm estimate?

  • A) Exact gradients
  • B) Stochastic gradients for non-differentiable functions
  • C) Loss function
  • D) Attention weights

Question 7: Why did soft attention become more popular than hard attention?

  • A) Hard attention is too fast
  • B) Soft attention is differentiable and easier to train with backprop
  • C) Hard attention uses too much memory
  • D) Soft attention cannot handle long sequences

Question 8: What is "top-k" hard attention?

  • A) Selecting only position 1 and k
  • B) Selecting top k positions instead of just one
  • C) Selecting positions 1 to k
  • D) Selecting k random positions