Introduction
Scaled dot-product attention is the attention mechanism introduced in the original Transformer paper, "Attention Is All You Need" (Vaswani et al., 2017). It computes attention weights from dot products between queries and keys, scaled by √dₖ so that large key dimensions do not push the softmax into regions where gradients vanish.
The Attention Formula
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Step 1: Compute QKᵀ (similarity scores)
Q ∈ ℝ^{seq_len × dₖ}
K ∈ ℝ^{seq_len × dₖ}
QKᵀ ∈ ℝ^{seq_len × seq_len}
Step 2: Scale by √dₖ
S = QKᵀ / √dₖ
Step 3: Apply softmax
A = softmax(S, axis=-1)
Step 4: Multiply by V
Output = A · V ∈ ℝ^{seq_len × dᵥ}
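Putting the four steps together, here is a minimal NumPy sketch of the computation; the function name and the toy sizes are illustrative, not from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # Steps 1-2: similarity scores, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # Step 3: softmax over keys
    return weights @ V                                         # Step 4: weighted sum of values

# Toy usage with arbitrary sizes
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 64))
K = rng.standard_normal((5, 64))
V = rng.standard_normal((5, 32))
out = scaled_dot_product_attention(Q, K, V)                    # shape (5, 32)
```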
Why Scale?
The scaling factor prevents vanishing gradients when dₖ is large:
- If the components of q and k are independent with mean 0 and variance 1, each entry of QKᵀ has variance dₖ
- Without scaling, these large dot products push the softmax into saturated regions
- A saturated softmax produces extremely small gradients during backpropagation
- Dividing by √dₖ brings the variance of the scores back to 1
Var(q · k) = dₖ · Var(q) · Var(k) = dₖ
After scaling: Var((q · k) / √dₖ) = dₖ / dₖ = 1
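A quick numerical check of this claim, assuming standard-normal components and a hypothetical dₖ = 512:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10_000, d_k))     # many unit-variance query vectors
k = rng.standard_normal((10_000, d_k))     # many unit-variance key vectors

dots = (q * k).sum(axis=-1)                # raw dot products q · k
print(dots.var())                          # ≈ d_k = 512
print((dots / np.sqrt(d_k)).var())         # ≈ 1 after scaling
```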
Matrix Dimensions
Q: [batch, seq_len, dₖ]
K: [batch, seq_len, dₖ]
V: [batch, seq_len, dᵥ]
QKᵀ: [batch, seq_len, seq_len]
softmax(QKᵀ / √dₖ): [batch, seq_len, seq_len]
Output: [batch, seq_len, dᵥ]
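These shapes can be verified with a small batched variant of the earlier sketch (sizes are arbitrary):

```python
import numpy as np

batch, seq_len, d_k, d_v = 2, 10, 64, 32
rng = np.random.default_rng(0)
Q = rng.standard_normal((batch, seq_len, d_k))
K = rng.standard_normal((batch, seq_len, d_k))
V = rng.standard_normal((batch, seq_len, d_v))

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)            # [batch, seq_len, seq_len]
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)     # [batch, seq_len, seq_len]
out = weights @ V                                           # [batch, seq_len, d_v]
print(scores.shape, out.shape)                              # (2, 10, 10) (2, 10, 32)
```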
Implementation Details
Key Operations
For each head in multi-head attention (a code sketch follows this list):
- Project input X to Q, K, V using learned weight matrices
- Compute dot product attention: QKᵀ / √dₖ
- Apply mask (optional, for causal attention)
- Apply softmax along key dimension
- Weighted sum with V: softmax(QKᵀ / √dₖ) · V
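A minimal NumPy sketch of this per-head pipeline, assuming one projection matrix per role that is reshaped into heads (a common convention, not the only one); all names and sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads, mask=None):
    """X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_k = d_model // n_heads

    def project(W):
        # Project X, then split into heads: (n_heads, seq_len, d_k)
        return (X @ W).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = project(W_q), project(W_k), project(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (n_heads, seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)             # optional causal/padding mask
    weights = softmax(scores, axis=-1)                    # softmax along key dimension
    heads = weights @ V                                   # (n_heads, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                   # final output projection

# Toy usage
rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 64, 10, 8
X = rng.standard_normal((seq_len, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads=n_heads)       # (10, 64)
```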
Computational Cost
For sequence length n and dimension d:
QKᵀ computation: O(n² · d)
Softmax: O(n²)
Weighted sum: O(n² · d)
Total: O(n² · d)
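To make the quadratic term concrete, here is a rough multiply-add count; the constants are an assumption and vary across implementations:

```python
def attention_flops(n, d):
    """Approximate multiply-add count for one attention call (illustrative constants)."""
    qk = 2 * n * n * d        # QKᵀ
    sm = 5 * n * n            # exp, sum, divide in softmax (rough)
    av = 2 * n * n * d        # weighted sum with V
    return qk + sm + av

for n in (512, 1024, 2048):
    print(n, attention_flops(n, d=64))   # roughly 4x more work each time n doubles
```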
Masking (Optional)
Causal Mask
For decoder self-attention, mask future positions:
scores[i,j] = -∞ if j > i (future position)
This ensures position i can only attend to positions ≤ i
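One minimal way to build such a mask in NumPy; using a large negative constant instead of a literal -∞ is an implementation choice that avoids NaNs in the softmax:

```python
import numpy as np

seq_len = 5
# True where attention is allowed: position i may attend to positions j <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.random.default_rng(0).standard_normal((seq_len, seq_len))
masked = np.where(causal, scores, -1e9)   # -1e9 stands in for -∞
# After softmax, row i puts (near-)zero weight on columns j > i
```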
Padding Mask
Mask padding tokens to prevent attention to padding:
scores[i,padding] = -∞ if position is padding
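Padding masks are usually derived from the token ids. A sketch assuming a hypothetical pad id of 0; shapes are illustrative:

```python
import numpy as np

token_ids = np.array([[7, 4, 9, 0, 0]])       # batch of 1; last two positions are padding
key_is_real = (token_ids != 0)                # (batch, seq_len), False at padding
pad_mask = key_is_real[:, None, :]            # broadcast over queries: (batch, 1, seq_len)

scores = np.random.default_rng(0).standard_normal((1, 5, 5))
masked = np.where(pad_mask, scores, -1e9)     # no query may attend to padding keys
```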
Numerical Stability
For very large values in QKᵀ, the exponentials in softmax can overflow, and saturated scores produce near-zero gradients. Common practices (a stable-softmax sketch follows this list):
- Scale: Divide by √dₖ as described
- Subtract max: Compute softmax of (scores - max(scores)) to prevent overflow
- Temperature: Divide scores by temperature T before softmax
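The max-subtraction trick leaves the output unchanged because softmax is invariant to adding a constant to every score in a row. A sketch combining it with an optional temperature:

```python
import numpy as np

def stable_softmax(scores, temperature=1.0, axis=-1):
    scores = scores / temperature                            # optional temperature T
    scores = scores - scores.max(axis=axis, keepdims=True)   # prevents exp overflow
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

big = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(big))   # finite probabilities; a naive exp(1000.0) would overflow
```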
Advantages of Scaled Dot-Product
- Computational efficiency: Matrix multiplication is highly optimized on GPUs
- Memory efficient: Can be computed in chunks for long sequences
- Differentiable: Fully continuous operations
- Parallelizable: All positions computed simultaneously
Comparison with Additive Attention
| Aspect | Scaled Dot-Product | Additive (Bahdanau) |
|---|---|---|
| Computation | Matmul + scale | FFN + tanh |
| Parameters | W^Q, W^K, W^V only | W, U, v (more) |
| Speed | Faster (optimized matmul) | Slower |
| Memory | Less | More |
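For contrast, a sketch of additive (Bahdanau-style) scoring: each query-key pair passes through a small feed-forward layer with tanh, which is where the extra parameters W, U, v come from and why it is slower than a single matmul. Names and sizes here are illustrative:

```python
import numpy as np

def additive_scores(Q, K, W, U, v):
    """Q: (n, d); K: (m, d); W, U: (d, d_att); v: (d_att,). Returns (n, m)."""
    q_proj = Q @ W                                               # (n, d_att)
    k_proj = K @ U                                               # (m, d_att)
    hidden = np.tanh(q_proj[:, None, :] + k_proj[None, :, :])    # (n, m, d_att): every query-key pair
    return hidden @ v                                            # one score per query-key pair

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((4, 8)), rng.standard_normal((6, 8))
W, U = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
v = rng.standard_normal(16)
print(additive_scores(Q, K, W, U, v).shape)   # (4, 6)
```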