08. Soft Attention

Introduction

Soft attention (also called "global attention" or "dense attention") computes a weighted average over all source positions, where weights are determined by a learned attention mechanism. This produces a soft, probabilistic focus across all positions, allowing fully differentiable end-to-end training.

Key Characteristics

Soft attention is dense (every source position receives a nonzero weight), fully differentiable (softmax is smooth), and therefore trainable end to end with standard backpropagation.

Mathematical Formulation

eᵢⱼ = score(qᵢ, kⱼ) (alignment score)

αᵢⱼ = softmax(eᵢ)ⱼ = exp(eᵢⱼ) / Σₖ exp(eᵢₖ) (attention weight)

cᵢ = Σⱼ αᵢⱼ · vⱼ (context vector)

Step-by-Step Process

Step 1: Compute Alignment Scores

For each query position i, compute scores against all key positions j:

eᵢ = [eᵢ₁, eᵢ₂, ..., eᵢₙ] = score(qᵢ, K)
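To make this concrete, here is a minimal NumPy sketch of the step, assuming a plain dot-product score, score(qᵢ, kⱼ) = qᵢ · kⱼ; the sizes and random vectors are toy values for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 6, 4                       # toy sizes: 6 source positions, dimension 4
    q = rng.standard_normal(d)        # query vector q_i for one position i
    K = rng.standard_normal((n, d))   # keys, one row k_j per source position

    # e_i = [e_i1, ..., e_in], with score(q_i, k_j) taken to be the dot product
    e = K @ q                         # shape (n,)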

Step 2: Apply Softmax

Convert scores to probabilities using softmax:

αᵢ = softmax(eᵢ) = exp(eᵢ) / Σₖ exp(eᵢₖ)

Σⱼ αᵢⱼ = 1 (weights sum to 1)
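In code, the softmax is usually computed with the max-subtraction trick: subtracting a constant from every score leaves the result unchanged but prevents overflow in the exponential. A standalone sketch with made-up scores:

    import numpy as np

    e = np.array([1.2, -0.3, 2.5, 0.1, 0.8, -1.0])  # example alignment scores e_i

    # softmax(e): subtracting e.max() does not change the output but avoids overflow
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()

    assert np.isclose(alpha.sum(), 1.0)  # Σ_j α_ij = 1, a probability distribution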

Step 3: Weighted Sum

Compute context as weighted average of values:

cᵢ = Σⱼ αᵢⱼ · vⱼ
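Putting the three steps together, a minimal end-to-end sketch for a single query, again assuming dot-product scoring and toy shapes:

    import numpy as np

    def soft_attention(q, K, V):
        """Soft attention for one query: scores -> softmax -> weighted average."""
        e = K @ q                         # Step 1: alignment scores e_ij
        alpha = np.exp(e - e.max())       # Step 2: softmax ...
        alpha /= alpha.sum()              # ... attention weights alpha_ij
        return alpha @ V, alpha           # Step 3: context c_i = sum_j alpha_ij v_j

    rng = np.random.default_rng(0)
    n, d = 6, 4
    q = rng.standard_normal(d)
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))

    c, alpha = soft_attention(q, K, V)
    print(c.shape, np.isclose(alpha.sum(), 1.0))  # (4,) True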

Soft vs Hard Attention

Property            Soft Attention            Hard Attention
------------------  ------------------------  -------------------------
Focus               Dense (all positions)     Sparse (single position)
Differentiable      Yes (via softmax)         No (requires sampling)
Training            Standard backprop         Reinforcement learning
Computational cost  O(n) per position         Variable (can be O(1))
Memory              All positions stored      Only selected position
Gradient flow       All positions             Only selected position
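To make the contrast concrete, a small illustrative sketch (the weights are toy values): soft attention returns a differentiable weighted average over all values, while hard attention samples a single index, and that sampling step is what blocks ordinary backpropagation.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([0.05, 0.20, 0.50, 0.05, 0.10, 0.10])  # attention weights
    V = rng.standard_normal((6, 4))                          # one value vector per position

    c_soft = alpha @ V                    # soft: weighted average, gradients reach all rows
    j = rng.choice(len(alpha), p=alpha)   # hard: sample one position ...
    c_hard = V[j]                         # ... non-differentiable, needs RL-style training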

Advantages of Soft Attention

  • Fully differentiable via softmax, so standard backpropagation applies end to end
  • No sampling step, and therefore no need for reinforcement-learning gradient estimators
  • Gradients flow to every source position during training

Disadvantages

  • O(n) computation and memory per query: all n positions must be scored and stored
  • Focus is spread across every position, even clearly irrelevant ones, rather than committing to a single one

Use Cases

Soft attention is the standard attention mechanism used in:

  • Sequence-to-sequence models for neural machine translation (the Bahdanau and Luong variants below)
  • Transformers, via scaled dot-product attention

Variants of Soft Attention

1. Additive Soft Attention (Bahdanau)

eᵢⱼ = vᵀ tanh(W·qᵢ + U·kⱼ)
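A sketch of this score in NumPy; W, U, and v would be learned parameters (random values stand in for them here), and the hidden size h is an arbitrary choice:

    import numpy as np

    rng = np.random.default_rng(0)
    d, h = 4, 8                               # input dim and attention hidden dim (toy)
    W, U = rng.standard_normal((h, d)), rng.standard_normal((h, d))
    v = rng.standard_normal(h)
    q, k = rng.standard_normal(d), rng.standard_normal(d)

    e_ij = v @ np.tanh(W @ q + U @ k)         # additive (Bahdanau) alignment score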

2. Multiplicative Soft Attention (Luong)

eᵢⱼ = qᵢᵀ · kⱼ

3. Scaled Dot-Product Attention (Transformer)

Attention(Q,K,V) = softmax(QKᵀ / √d) · V
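A minimal NumPy sketch of this formula, where d is the key dimension and the shapes are toy values:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k) alignment scores
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                              # one context vector per query

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((5, 4)) for _ in range(3))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 4)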

Visualization Example

For the sentence "The cat sat on the mat":

When attending to the position of "sat" (the verb), soft attention might produce:

α = [0.05, 0.20, 0.50, 0.05, 0.10, 0.10]

[The, cat, sat, on, the, mat]

Highest weight (0.50) falls on "sat" itself; second highest (0.20) on "cat", the agent of the verb
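As a quick sanity check of the example in code (tokens and weights taken from the text above):

    tokens = ["The", "cat", "sat", "on", "the", "mat"]
    alpha = [0.05, 0.20, 0.50, 0.05, 0.10, 0.10]

    assert abs(sum(alpha) - 1.0) < 1e-9              # soft attention weights sum to 1
    for tok, a in sorted(zip(tokens, alpha), key=lambda p: -p[1]):
        print(f"{tok:>4}: {a:.2f}")                  # sat first at 0.50, then cat at 0.20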

Test Your Understanding

Question 1: What makes soft attention differentiable?

  • A) Uses max function
  • B) Uses softmax, which is differentiable
  • C) Uses sampling
  • D) Uses step function

Question 2: What is the sum of all attention weights in soft attention?

  • A) 0
  • B) 1
  • C) Equal to number of positions
  • D) Infinity

Question 3: How does hard attention differ from soft attention?

  • A) Hard attention is non-differentiable
  • B) Hard attention uses all positions
  • C) Hard attention requires less computation
  • D) Hard attention is used in Transformers

Question 4: What training method is required for hard attention?

  • A) Standard backpropagation
  • B) Reinforcement learning
  • C) No training needed
  • D) Supervised learning only

Question 5: Which type of attention is used in modern Transformers?

  • A) Hard attention
  • B) Soft attention (scaled dot-product)
  • C) Random attention
  • D) No attention

Question 6: In the example with "sat", why might "cat" have high attention weight?

  • A) "cat" is the subject performing "sat"
  • B) "cat" is the closest word
  • C) "cat" is the last word
  • D) Random chance

Question 7: What is a disadvantage of soft attention?

  • A) Not differentiable
  • B) O(n) memory per query (must store all positions)
  • C) Cannot use backpropagation
  • D) Only focuses on one position

Question 8: Soft attention is also known as:

  • A) Sparse attention
  • B) Global attention or dense attention
  • C) Local attention
  • D) Hard attention