Introduction
Soft attention (also called "global attention" or "dense attention") computes a weighted average over all source positions, where weights are determined by a learned attention mechanism. This produces a soft, probabilistic focus across all positions, allowing fully differentiable end-to-end training.
Key Characteristics
- Differentiable: Uses softmax to produce smooth attention weights
- Probabilistic: Attention weights form a probability distribution (sum to 1)
- Global: Every position can potentially contribute to every other position
- Deterministic: Same input always produces same attention weights
Mathematical Formulation
eᵢⱼ = score(qᵢ, kⱼ) (alignment score)
αᵢⱼ = softmax(eᵢ)ⱼ = exp(eᵢⱼ) / Σₖ exp(eᵢₖ) (attention weight)
cᵢ = Σⱼ αᵢⱼ · vⱼ (context vector)
Step-by-Step Process
Step 1: Compute Alignment Scores
For each query position i, compute scores against all key positions j:
eᵢ = [eᵢ₁, eᵢ₂, ..., eᵢₙ] = score(qᵢ, K)
Step 2: Apply Softmax
Convert scores to probabilities using softmax:
αᵢ = softmax(eᵢ) = exp(eᵢ) / Σₖ exp(eᵢₖ)
Σⱼ αᵢⱼ = 1 (weights sum to 1)
Step 3: Weighted Sum
Compute context as weighted average of values:
cᵢ = Σⱼ αᵢⱼ · vⱼ
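Putting the three steps together, here is a minimal NumPy sketch of a dot-product soft-attention pass (the function names and toy shapes are illustrative choices, not a specific library API):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention(Q, K, V):
    """Dot-product soft attention.
    Q: (m, d) queries, K: (n, d) keys, V: (n, d_v) values.
    Returns context vectors (m, d_v) and attention weights (m, n)."""
    scores = Q @ K.T                  # Step 1: alignment scores e_ij = q_i . k_j
    alpha = softmax(scores, axis=-1)  # Step 2: each row is a probability distribution
    context = alpha @ V               # Step 3: weighted average of the values
    return context, alpha

# Toy example: 2 queries, 4 key/value positions, dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
context, alpha = soft_attention(Q, K, V)
print(alpha.sum(axis=-1))  # ~[1. 1.]: the weights sum to 1 per query
```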
Soft vs Hard Attention
| Property | Soft Attention | Hard Attention |
|---|---|---|
| Focus | Dense (all positions) | Sparse (single position) |
| Differentiable | Yes (via softmax) | No (requires sampling) |
| Training | Standard backprop | Reinforcement learning |
| Computational cost | O(n) per position | Variable (can be O(1)) |
| Memory | All positions stored | Only selected position |
| Gradient flow | All positions | Only selected position |
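To make the differentiability row concrete, a small sketch contrasting the two selection rules; argmax is used here as a deterministic stand-in for the stochastic sampling step of hard attention:

```python
import numpy as np

scores = np.array([1.2, 0.3, 2.5, -0.8])   # alignment scores for one query
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0],
                   [0.5, 0.5]])

# Soft attention: smooth weights, every value contributes, gradients flow everywhere
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
soft_context = alpha @ values

# Hard attention: a single position is selected (argmax as a stand-in for sampling);
# the selection itself is non-differentiable
idx = scores.argmax()
hard_context = values[idx]

print(alpha)         # roughly [0.19, 0.08, 0.70, 0.03]
print(soft_context)  # blend of all rows of `values`
print(hard_context)  # exactly values[2]
```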
Advantages of Soft Attention
- Fully differentiable: Can be trained with standard backpropagation
- Stable gradients: Softmax provides stable gradient flow
- Efficient computation: Parallelizable across all positions
- Full context access: Each position considers all other positions
- Deterministic: Reproducible results across runs
Disadvantages
- Computational complexity: O(n·d) per query, so O(n²·d) for full self-attention over a length-n sequence
- Memory intensive: must attend to all positions (O(n) memory per query, O(n²) for the full attention matrix)
- Can become diffuse: attention may spread too thin across many positions
Use Cases
Soft attention is the standard attention mechanism used in:
- Transformers: All modern transformer architectures use soft attention
- Seq2seq models: Neural machine translation
- Vision transformers: Image classification with ViT
- Multimodal models: Vision-language models
Variants of Soft Attention
1. Additive Soft Attention (Bahdanau)
eᵢⱼ = vᵀ tanh(W·qᵢ + U·kⱼ)
2. Multiplicative Soft Attention (Luong)
eᵢⱼ = qᵢᵀ · kⱼ
3. Scaled Dot-Product Attention (Transformer)
Attention(Q,K,V) = softmax(QKᵀ / √d) · V
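The variants differ only in how the score eᵢⱼ is computed. Below is a rough NumPy sketch of the three score functions, with W, U, and v as randomly initialized stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                       # model dimension, number of key positions
q = rng.normal(size=(d,))         # a single query
K = rng.normal(size=(n, d))       # keys

# 1. Additive (Bahdanau): e_ij = v^T tanh(W q_i + U k_j)
W = rng.normal(size=(d, d))
U = rng.normal(size=(d, d))
v = rng.normal(size=(d,))
e_additive = np.tanh(q @ W.T + K @ U.T) @ v   # (n,)

# 2. Multiplicative (Luong): e_ij = q_i^T k_j
e_dot = K @ q                                  # (n,)

# 3. Scaled dot-product (Transformer): e_ij = q_i^T k_j / sqrt(d)
e_scaled = K @ q / np.sqrt(d)                  # (n,)

# Each score vector is then passed through softmax to obtain the weights alpha_i
```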
Visualization Example
For a sentence "The cat sat on the mat":
When attending from the position of "sat" (the verb), soft attention might produce:
α = [0.05, 0.20, 0.50, 0.05, 0.10, 0.10]
[The, cat, sat, on, the, mat]
Highest weight (0.50) on "sat" itself, second highest (0.20) on "cat" (the agent)
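One simple way to inspect such a distribution is a bar chart over the tokens; a minimal matplotlib sketch using the illustrative weights above:

```python
import matplotlib.pyplot as plt

tokens = ["The", "cat", "sat", "on", "the", "mat"]
alpha = [0.05, 0.20, 0.50, 0.05, 0.10, 0.10]   # attention from "sat"

plt.bar(tokens, alpha)
plt.ylabel("attention weight")
plt.title('Soft attention from "sat"')
plt.show()
```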