Introduction
Soft attention (also called "global attention" or "dense attention") computes a weighted average over all source positions, where weights are determined by a learned attention mechanism. This produces a soft, probabilistic focus across all positions, allowing fully differentiable end-to-end training.
Key Characteristics
- Differentiable: Uses softmax to produce smooth attention weights
- Probabilistic: Attention weights form a probability distribution (sum to 1)
- Global: Every position can potentially contribute to every other position
- Deterministic: Same input always produces same attention weights
Mathematical Formulation
eᵢⱼ = score(qᵢ, kⱼ) (alignment score)
αᵢⱼ = softmax(eᵢ)ⱼ = exp(eᵢⱼ) / Σₖ exp(eᵢₖ) (attention weight)
cᵢ = Σⱼ αᵢⱼ · vⱼ (context vector)
Step-by-Step Process
Step 1: Compute Alignment Scores
For each query position i, compute scores against all key positions j:
eᵢ = [eᵢ₁, eᵢ₂, ..., eᵢₙ] = score(qᵢ, K)
Step 2: Apply Softmax
Convert scores to probabilities using softmax:
αᵢ = softmax(eᵢ) = exp(eᵢ) / Σₖ exp(eᵢₖ)
Σⱼ αᵢⱼ = 1 (weights sum to 1)
Step 3: Weighted Sum
Compute context as weighted average of values:
cᵢ = Σⱼ αᵢⱼ · vⱼ
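Putting the three steps together, here is a minimal NumPy sketch of a dot-product soft-attention pass (the function names and toy shapes are illustrative choices, not a specific library API):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention(Q, K, V):
    """Dot-product soft attention.
    Q: (m, d) queries, K: (n, d) keys, V: (n, d_v) values.
    Returns context vectors (m, d_v) and attention weights (m, n)."""
    scores = Q @ K.T                  # Step 1: alignment scores e_ij = q_i . k_j
    alpha = softmax(scores, axis=-1)  # Step 2: each row is a probability distribution
    context = alpha @ V               # Step 3: weighted average of the values
    return context, alpha

# Toy example: 2 queries, 4 key/value positions, dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
context, alpha = soft_attention(Q, K, V)
print(alpha.sum(axis=-1))  # ~[1. 1.]: the weights sum to 1 per query
```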
Soft vs Hard Attention
| Property | Soft Attention | Hard Attention |
|---|---|---|
| Focus | Dense (all positions) | Sparse (single position) |
| Differentiable | Yes (via softmax) | No (requires sampling) |
| Training | Standard backprop | Reinforcement learning |
| Computational cost | O(n) per position | Variable (can be O(1)) |
| Memory | All positions stored | Only selected position |
| Gradient flow | All positions | Only selected position |
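To make the differentiability row concrete, a small sketch contrasting the two selection rules; argmax is used here as a deterministic stand-in for the stochastic sampling step of hard attention:

```python
import numpy as np

scores = np.array([1.2, 0.3, 2.5, -0.8])   # alignment scores for one query
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0],
                   [0.5, 0.5]])

# Soft attention: smooth weights, every value contributes, gradients flow everywhere
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
soft_context = alpha @ values

# Hard attention: a single position is selected (argmax as a stand-in for sampling);
# the selection itself is non-differentiable
idx = scores.argmax()
hard_context = values[idx]

print(alpha)         # roughly [0.19, 0.08, 0.70, 0.03]
print(soft_context)  # blend of all rows of `values`
print(hard_context)  # exactly values[2]
```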
Advantages of Soft Attention
- Fully differentiable: Can be trained with standard backpropagation
- Stable gradients: Softmax provides stable gradient flow
- Efficient computation: Parallelizable across all positions
- Full context access: Each position considers all other positions
- Deterministic: Reproducible results across runs
Disadvantages
- Computational complexity: O(n·d) per query, so O(n²·d) for full self-attention over a length-n sequence
- Memory intensive: must attend to all positions (O(n) memory per query, O(n²) for the full attention matrix)
- Can become diffuse: attention may spread too thin across many positions
Use Cases
Soft attention is the standard attention mechanism used in:
- Transformers: All modern transformer architectures use soft attention
- Seq2seq models: Neural machine translation
- Vision transformers: Image classification with ViT
- Multimodal models: Vision-language models
Variants of Soft Attention
1. Additive Soft Attention (Bahdanau)
eᵢⱼ = vᵀ tanh(W·qᵢ + U·kⱼ)
2. Multiplicative Soft Attention (Luong)
eᵢⱼ = qᵢᵀ · kⱼ
3. Scaled Dot-Product Attention (Transformer)
Attention(Q,K,V) = softmax(QKᵀ / √d) · V
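The variants differ only in how the score eᵢⱼ is computed. Below is a rough NumPy sketch of the three score functions, with W, U, and v as randomly initialized stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                       # model dimension, number of key positions
q = rng.normal(size=(d,))         # a single query
K = rng.normal(size=(n, d))       # keys

# 1. Additive (Bahdanau): e_ij = v^T tanh(W q_i + U k_j)
W = rng.normal(size=(d, d))
U = rng.normal(size=(d, d))
v = rng.normal(size=(d,))
e_additive = np.tanh(q @ W.T + K @ U.T) @ v   # (n,)

# 2. Multiplicative (Luong): e_ij = q_i^T k_j
e_dot = K @ q                                  # (n,)

# 3. Scaled dot-product (Transformer): e_ij = q_i^T k_j / sqrt(d)
e_scaled = K @ q / np.sqrt(d)                  # (n,)

# Each score vector is then passed through softmax to obtain the weights alpha_i
```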
Visualization Example
For a sentence "The cat sat on the mat":
When attending from the position of "sat" (the verb), soft attention might produce:
α = [0.05, 0.20, 0.50, 0.05, 0.10, 0.10]
[The, cat, sat, on, the, mat]
Highest weight (0.50) on "sat" itself, second highest (0.20) on "cat" (the agent)
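One simple way to inspect such a distribution is a bar chart over the tokens; a minimal matplotlib sketch using the illustrative weights above:

```python
import matplotlib.pyplot as plt

tokens = ["The", "cat", "sat", "on", "the", "mat"]
alpha = [0.05, 0.20, 0.50, 0.05, 0.10, 0.10]   # attention from "sat"

plt.bar(tokens, alpha)
plt.ylabel("attention weight")
plt.title('Soft attention from "sat"')
plt.show()
```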