11. Scaled Dot-Product Attention

Introduction

Scaled dot-product attention is the attention mechanism introduced in the original Transformer paper, "Attention is All You Need" (Vaswani et al., 2017). It scores each query against every key with a dot product and divides the scores by √dₖ so that, when dₖ is large, the softmax does not saturate and produce vanishing gradients.

The Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Step 1: Compute similarity scores QKᵀ
  Q ∈ ℝ^{seq_len × dₖ}, K ∈ ℝ^{seq_len × dₖ}, so QKᵀ ∈ ℝ^{seq_len × seq_len}

Step 2: Scale by √dₖ
  S = QKᵀ / √dₖ

Step 3: Apply softmax along the key dimension
  A = softmax(S, axis=-1)

Step 4: Multiply by V
  Output = A · V ∈ ℝ^{seq_len × dᵥ}
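The four steps map directly onto a few lines of code. Below is a minimal NumPy sketch of the formula; the function name scaled_dot_product_attention and the random test shapes are illustrative assumptions, not taken from any particular library.

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # Steps 1-2: QKᵀ / √dₖ
    A = softmax(scores, axis=-1)                     # Step 3: softmax over keys
    return A @ V                                     # Step 4: weighted sum of values

# Example: seq_len = 5, dₖ = dᵥ = 8
Q = np.random.randn(5, 8); K = np.random.randn(5, 8); V = np.random.randn(5, 8)
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 8)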

Why Scale?

The scaling factor prevents vanishing gradients when dₖ is large. Assume the components of q and k are independent with mean 0 and variance 1; the dot product q · k is then a sum of dₖ such component products, so:

Var(q · k) = dₖ · Var(qᵢ) · Var(kᵢ) = dₖ

After scaling: Var((q · k) / √dₖ) = 1

Without scaling, the scores grow with dₖ, pushing the softmax into saturated regions where its gradients are close to zero.
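A quick numerical check (a sketch using NumPy with arbitrary dimensions and sample counts) shows the effect: unscaled dot products of random unit-variance vectors have variance ≈ dₖ, while scaled ones have variance ≈ 1.

import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.standard_normal((20_000, d_k))
    k = rng.standard_normal((20_000, d_k))
    dots = (q * k).sum(axis=1)                 # 20k sample dot products
    print(d_k, dots.var(), (dots / np.sqrt(d_k)).var())
    # variance ≈ d_k unscaled, ≈ 1 after dividing by √dₖ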

Matrix Dimensions

Q: [batch, seq_len, dₖ]

K: [batch, seq_len, dₖ]

V: [batch, seq_len, dᵥ]

QKᵀ: [batch, seq_len, seq_len]

softmax(QKᵀ / √dₖ): [batch, seq_len, seq_len]

Output: [batch, seq_len, dᵥ]
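To make the shapes concrete, here is a small batched shape trace; the values batch = 2, seq_len = 4, dₖ = 64, dᵥ = 32 are assumed purely for illustration.

import numpy as np

batch, seq_len, d_k, d_v = 2, 4, 64, 32
Q = np.random.randn(batch, seq_len, d_k)
K = np.random.randn(batch, seq_len, d_k)
V = np.random.randn(batch, seq_len, d_v)

scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
print(scores.shape)                       # (2, 4, 4)  = [batch, seq_len, seq_len]
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
print((weights @ V).shape)                # (2, 4, 32) = [batch, seq_len, dᵥ]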

Implementation Details

Key Operations

For each head in multi-head attention (a single-head sketch follows this list):

  1. Project input X to Q, K, V using learned weight matrices
  2. Compute dot product attention: QKᵀ / √dₖ
  3. Apply mask (optional, for causal attention)
  4. Apply softmax along key dimension
  5. Weighted sum with V: softmax(QKᵀ / √dₖ) · V
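Below is a minimal sketch of one head; the function name attention_head, the optional causal flag, and the sizes d_model = 64, dₖ = dᵥ = 16 are illustrative assumptions, not a reference implementation.

import numpy as np

def attention_head(X, W_q, W_k, W_v, causal=False):
    # 1. Project the input to queries, keys, and values.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # 2. Scaled dot-product scores.
    scores = Q @ K.T / np.sqrt(d_k)
    # 3. Optional causal mask: block attention to future positions.
    if causal:
        n = scores.shape[0]
        scores = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, scores)
    # 4. Softmax along the key dimension.
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)
    # 5. Weighted sum of values.
    return A @ V

d_model, d_k = 64, 16
X = np.random.randn(10, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) * 0.02 for _ in range(3))
print(attention_head(X, W_q, W_k, W_v, causal=True).shape)  # (10, 16)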

Computational Cost

For sequence length n and dimension d:

QKᵀ computation: O(n² · d)

Softmax: O(n²)

Weighted sum: O(n² · d)

Total: O(n² · d)
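The quadratic dependence on sequence length is easy to observe empirically. A rough timing sketch (the sequence lengths and d = 64 are arbitrary choices):

import time
import numpy as np

d = 64
for n in (256, 512, 1024, 2048):
    Q = np.random.randn(n, d); K = np.random.randn(n, d); V = np.random.randn(n, d)
    t0 = time.perf_counter()
    S = Q @ K.T / np.sqrt(d)                  # O(n² · d)
    A = np.exp(S - S.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)             # O(n²)
    _ = A @ V                                 # O(n² · d)
    print(n, round(time.perf_counter() - t0, 4), "s")
# Doubling n roughly quadruples the time, consistent with O(n² · d).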

Masking (Optional)

Causal Mask

For decoder self-attention, mask future positions:

scores[i,j] = -∞ if j > i (future position)

This ensures that position i can only attend to positions ≤ i.
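A sketch of building and applying a causal mask with NumPy (in practice −∞ is often replaced by a large negative constant such as −1e9):

import numpy as np

n = 5
scores = np.random.randn(n, n)
future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i
scores = np.where(future, -np.inf, scores)
print(scores)  # upper triangle (future positions) is -inf, so softmax gives them weight 0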

Padding Mask

Mask padding tokens to prevent attention to padding:

scores[i,padding] = -∞ if position is padding
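A padding-mask sketch, assuming a boolean is_pad vector that marks padded key positions:

import numpy as np

seq_len = 6
scores = np.random.randn(seq_len, seq_len)
is_pad = np.array([False, False, False, False, True, True])  # last two tokens are padding
scores[:, is_pad] = -np.inf   # no query may attend to padded key positions
# After softmax, the columns for padded tokens receive zero attention weight.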

Numerical Stability

For very large or very small values in QKᵀ, the softmax can overflow or produce near-zero gradients. Common practices:

  1. Subtract the row-wise maximum from the scores before exponentiating (the max-subtraction trick); this leaves the softmax output unchanged.
  2. Use a large negative constant (e.g. −1e9) rather than literal −∞ when masking, so masked entries cannot produce NaNs.
  3. Keep the attention scores in float32 even when the rest of the model runs in lower precision.
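A sketch of the max-subtraction trick; it is mathematically equivalent to a plain softmax but safe against overflow:

import numpy as np

def stable_softmax(x, axis=-1):
    # softmax(x) == softmax(x - c) for any constant c; using the row max
    # keeps the largest exponent at exp(0) = 1 and avoids overflow.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))  # no overflow: ≈ [0.09, 0.24, 0.67]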

Advantages of Scaled Dot-Product

Comparison with Additive Attention

Aspect       | Scaled Dot-Product         | Additive (Bahdanau)
Computation  | Matmul + scale             | FFN + tanh
Parameters   | W^Q, W^K, W^V only         | W, U, v (more)
Speed        | Faster (optimized matmul)  | Slower
Memory       | Less                       | More
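For contrast, a sketch of the additive (Bahdanau-style) score for one query–key pair, assuming weight matrices W, U and vector v as in the table; the extra parameters and the tanh feed-forward pass per pair are what make it slower than a single matmul:

import numpy as np

d_q, d_k, d_a = 32, 32, 64            # d_a is an assumed attention hidden size
W = np.random.randn(d_a, d_q) * 0.02
U = np.random.randn(d_a, d_k) * 0.02
v = np.random.randn(d_a) * 0.02

def additive_score(q, k):
    # score(q, k) = vᵀ · tanh(W q + U k)
    return v @ np.tanh(W @ q + U @ k)

q, k = np.random.randn(d_q), np.random.randn(d_k)
print(additive_score(q, k))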

Test Your Understanding

Question 1: What is the formula for scaled dot-product attention?

  • A) softmax(Q + K) · V
  • B) softmax(QKᵀ / √dₖ) · V
  • C) tanh(QKᵀ) · V
  • D) Q · K · V

Question 2: Why do we divide QKᵀ by √dₖ?

  • A) To speed up computation
  • B) To prevent vanishing gradients from large dot products
  • C) To make Q and K orthogonal
  • D) To reduce memory

Question 3: If dₖ = 64 and we don't scale, what happens to variance?

  • A) Variance becomes 0
  • B) Variance becomes 64
  • C) Variance becomes 1
  • D) Variance becomes 8

Question 4: What is the shape of the output?

  • A) [batch, seq_len, seq_len]
  • B) [batch, seq_len, dₖ]
  • C) [batch, seq_len, dᵥ]
  • D) [batch, dₖ, dᵥ]

Question 5: In the attention matrix softmax(QKᵀ / √dₖ), what does row i represent?

  • A) Attention weights from position i to all positions
  • B) Query at position i
  • C) Key at position i
  • D) Value at position i

Question 6: What is the computational complexity of scaled dot-product attention?

  • A) O(n · d)
  • B) O(n² · d)
  • C) O(d²)
  • D) O(n)

Question 7: In causal masking, what do we set for future positions?

  • A) scores[i,j] = 0 if j > i
  • B) scores[i,j] = -∞ if j > i
  • C) scores[i,j] = 1 if j > i
  • D) scores[i,j] = scores[j,i] if j > i

Question 8: Which paper introduced scaled dot-product attention?

  • A) "Neural Machine Translation" (Bahdanau)
  • B) "Attention is All You Need" (Vaswani et al.)
  • C) "Effective Approaches" (Luong)
  • D) "BERT" (Devlin)