Introduction
Scaled dot-product attention is the attention mechanism introduced in the original Transformer paper, "Attention Is All You Need" (Vaswani et al., 2017). It computes attention weights from dot products between queries and keys, scaled by √dₖ so that large key dimensions do not push the softmax into regions where gradients vanish.
The Attention Formula
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Step 1: Compute QKᵀ (similarity scores)
Q ∈ ℝ^{seq_len × dₖ}
K ∈ ℝ^{seq_len × dₖ}
QKᵀ ∈ ℝ^{seq_len × seq_len}
Step 2: Scale by √dₖ
S = QKᵀ / √dₖ
Step 3: Apply softmax
A = softmax(S, axis=-1)
Step 4: Multiply by V
Output = A · V ∈ ℝ^{seq_len × dᵥ}
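Putting the four steps together, here is a minimal NumPy sketch of the computation; the function name and the toy sizes are illustrative, not from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # Steps 1-2: similarity scores, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # Step 3: softmax over keys
    return weights @ V                                         # Step 4: weighted sum of values

# Toy usage with arbitrary sizes
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 64))
K = rng.standard_normal((5, 64))
V = rng.standard_normal((5, 32))
out = scaled_dot_product_attention(Q, K, V)                    # shape (5, 32)
```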
Why Scale?
The scaling factor prevents vanishing gradients when dₖ is large:
- If the components of q and k are independent with mean 0 and variance 1, each entry of QKᵀ has variance dₖ
- Without scaling, these large dot products push the softmax into saturated regions
- A saturated softmax produces extremely small gradients during backpropagation
- Dividing by √dₖ brings the variance of the scores back to 1
Var(q · k) = dₖ · Var(q) · Var(k) = dₖ
After scaling: Var((q · k) / √dₖ) = dₖ / dₖ = 1
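A quick numerical check of this claim, assuming standard-normal components and a hypothetical dₖ = 512:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10_000, d_k))     # many unit-variance query vectors
k = rng.standard_normal((10_000, d_k))     # many unit-variance key vectors

dots = (q * k).sum(axis=-1)                # raw dot products q · k
print(dots.var())                          # ≈ d_k = 512
print((dots / np.sqrt(d_k)).var())         # ≈ 1 after scaling
```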
Matrix Dimensions
Q: [batch, seq_len, dₖ]
K: [batch, seq_len, dₖ]
V: [batch, seq_len, dᵥ]
QKᵀ: [batch, seq_len, seq_len]
softmax(QKᵀ / √dₖ): [batch, seq_len, seq_len]
Output: [batch, seq_len, dᵥ]
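These shapes can be verified with a small batched variant of the earlier sketch (sizes are arbitrary):

```python
import numpy as np

batch, seq_len, d_k, d_v = 2, 10, 64, 32
rng = np.random.default_rng(0)
Q = rng.standard_normal((batch, seq_len, d_k))
K = rng.standard_normal((batch, seq_len, d_k))
V = rng.standard_normal((batch, seq_len, d_v))

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)            # [batch, seq_len, seq_len]
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)     # [batch, seq_len, seq_len]
out = weights @ V                                           # [batch, seq_len, d_v]
print(scores.shape, out.shape)                              # (2, 10, 10) (2, 10, 32)
```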
Implementation Details
Key Operations
For each head in multi-head attention (a code sketch follows this list):
- Project input X to Q, K, V using learned weight matrices
- Compute dot product attention: QKᵀ / √dₖ
- Apply mask (optional, for causal attention)
- Apply softmax along key dimension
- Weighted sum with V: softmax(QKᵀ / √dₖ) · V
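A minimal NumPy sketch of this per-head pipeline, assuming one projection matrix per role that is reshaped into heads (a common convention, not the only one); all names and sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads, mask=None):
    """X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_k = d_model // n_heads

    def project(W):
        # Project X, then split into heads: (n_heads, seq_len, d_k)
        return (X @ W).reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = project(W_q), project(W_k), project(W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (n_heads, seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)             # optional causal/padding mask
    weights = softmax(scores, axis=-1)                    # softmax along key dimension
    heads = weights @ V                                   # (n_heads, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                   # final output projection

# Toy usage
rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 64, 10, 8
X = rng.standard_normal((seq_len, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads=n_heads)       # (10, 64)
```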
Computational Cost
For sequence length n and dimension d:
QKᵀ computation: O(n² · d)
Softmax: O(n²)
Weighted sum: O(n² · d)
Total: O(n² · d)
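To make the quadratic term concrete, here is a rough multiply-add count; the constants are an assumption and vary across implementations:

```python
def attention_flops(n, d):
    """Approximate multiply-add count for one attention call (illustrative constants)."""
    qk = 2 * n * n * d        # QKᵀ
    sm = 5 * n * n            # exp, sum, divide in softmax (rough)
    av = 2 * n * n * d        # weighted sum with V
    return qk + sm + av

for n in (512, 1024, 2048):
    print(n, attention_flops(n, d=64))   # roughly 4x more work each time n doubles
```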
Masking (Optional)
Causal Mask
For decoder self-attention, mask future positions:
scores[i,j] = -∞ if j > i (future position)
This ensures position i can only attend to positions ≤ i
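One minimal way to build such a mask in NumPy; using a large negative constant instead of a literal -∞ is an implementation choice that avoids NaNs in the softmax:

```python
import numpy as np

seq_len = 5
# True where attention is allowed: position i may attend to positions j <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.random.default_rng(0).standard_normal((seq_len, seq_len))
masked = np.where(causal, scores, -1e9)   # -1e9 stands in for -∞
# After softmax, row i puts (near-)zero weight on columns j > i
```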
Padding Mask
Mask padding tokens to prevent attention to padding:
scores[i,padding] = -∞ if position is padding
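Padding masks are usually derived from the token ids. A sketch assuming a hypothetical pad id of 0; shapes are illustrative:

```python
import numpy as np

token_ids = np.array([[7, 4, 9, 0, 0]])       # batch of 1; last two positions are padding
key_is_real = (token_ids != 0)                # (batch, seq_len), False at padding
pad_mask = key_is_real[:, None, :]            # broadcast over queries: (batch, 1, seq_len)

scores = np.random.default_rng(0).standard_normal((1, 5, 5))
masked = np.where(pad_mask, scores, -1e9)     # no query may attend to padding keys
```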
Numerical Stability
For very large values in QKᵀ, the exponentials in softmax can overflow, and saturated scores produce near-zero gradients. Common practices (a stable-softmax sketch follows this list):
- Scale: Divide by √dₖ as described
- Subtract max: Compute softmax of (scores - max(scores)) to prevent overflow
- Temperature: Divide scores by temperature T before softmax
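The max-subtraction trick leaves the output unchanged because softmax is invariant to adding a constant to every score in a row. A sketch combining it with an optional temperature:

```python
import numpy as np

def stable_softmax(scores, temperature=1.0, axis=-1):
    scores = scores / temperature                            # optional temperature T
    scores = scores - scores.max(axis=axis, keepdims=True)   # prevents exp overflow
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

big = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(big))   # finite probabilities; a naive exp(1000.0) would overflow
```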
Advantages of Scaled Dot-Product
- Computational efficiency: Matrix multiplication is highly optimized on GPUs
- Memory efficient: Can be computed in chunks for long sequences
- Differentiable: Fully continuous operations
- Parallelizable: All positions computed simultaneously
Comparison with Additive Attention
| Aspect | Scaled Dot-Product | Additive (Bahdanau) |
|---|---|---|
| Computation | Matmul + scale | FFN + tanh |
| Parameters | W^Q, W^K, W^V only | W, U, v (more) |
| Speed | Faster (optimized matmul) | Slower |
| Memory | Less | More |
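For contrast, a sketch of additive (Bahdanau-style) scoring: each query-key pair passes through a small feed-forward layer with tanh, which is where the extra parameters W, U, v come from and why it is slower than a single matmul. Names and sizes here are illustrative:

```python
import numpy as np

def additive_scores(Q, K, W, U, v):
    """Q: (n, d); K: (m, d); W, U: (d, d_att); v: (d_att,). Returns (n, m)."""
    q_proj = Q @ W                                               # (n, d_att)
    k_proj = K @ U                                               # (m, d_att)
    hidden = np.tanh(q_proj[:, None, :] + k_proj[None, :, :])    # (n, m, d_att): every query-key pair
    return hidden @ v                                            # one score per query-key pair

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((4, 8)), rng.standard_normal((6, 8))
W, U = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
v = rng.standard_normal(16)
print(additive_scores(Q, K, W, U, v).shape)   # (4, 6)
```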