06. Alignment Scores

Introduction

Alignment scores are the core mechanism that determines how much attention a query should pay to each key. They measure the similarity or relevance between query and key representations, forming the foundation of all attention mechanisms.

What Are Alignment Scores?

Alignment scores eᵢⱼ represent how well the query at position i aligns with (or is attracted to) the key at position j. These scores form a matrix where:

eᵢⱼ = score(qᵢ, kⱼ)

where qᵢ is the query at position i and kⱼ is the key at position j.

Types of Scoring Functions

1. Dot Product (Scaled)

eᵢⱼ = (qᵢ · kⱼ) / √dₖ

Most common in modern Transformers. Fast and memory-efficient.
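A minimal NumPy sketch of this score for one query against a handful of keys (the dimensions and random values here are illustrative assumptions, not from a real model):

```python
import numpy as np

d_k = 64                        # key/query dimension (assumed)
q = np.random.randn(d_k)        # query at position i
K = np.random.randn(5, d_k)     # keys at positions j = 0..4

# e_ij = (q_i · k_j) / sqrt(d_k), computed for all five keys at once
scores = K @ q / np.sqrt(d_k)   # shape: (5,)
```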

2. Additive (Bahdanau-style)

eᵢⱼ = vᵀ tanh(W₁qᵢ + W₂kⱼ)

More expressive but computationally heavier.
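A sketch of the same idea in NumPy, with hypothetical weight shapes (W₁, W₂, and v would normally be learned, not random):

```python
import numpy as np

d_q, d_k, d_a = 64, 64, 32       # query, key, and attention dims (assumed)
W1 = np.random.randn(d_a, d_q)   # projects the query
W2 = np.random.randn(d_a, d_k)   # projects the key
v  = np.random.randn(d_a)        # scoring vector

def additive_score(q, k):
    # e_ij = vᵀ tanh(W1 q_i + W2 k_j)
    return v @ np.tanh(W1 @ q + W2 @ k)

e = additive_score(np.random.randn(d_q), np.random.randn(d_k))
```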

3. General (Multiplicative)

eᵢⱼ = qᵢᵀ W kⱼ

Similar to dot product but with learned weight matrix.
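Because W can be rectangular, this form also allows queries and keys of different dimensions. A short sketch (shapes assumed):

```python
import numpy as np

d_q, d_k = 64, 48                # query and key dims may differ (assumed)
W = np.random.randn(d_q, d_k)    # learned in practice; random here

def general_score(q, k):
    # e_ij = q_iᵀ W k_j
    return q @ W @ k

e = general_score(np.random.randn(d_q), np.random.randn(d_k))
```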

4. Bilinear (with projection)

eᵢⱼ = (Wqᵢ)ᵀ (Wkⱼ) = qᵢᵀ WᵀWkⱼ

Projects both query and key before dot product.
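The two forms in the equation are algebraically identical; a quick NumPy check (shapes assumed):

```python
import numpy as np

d, d_p = 64, 32                  # input and projection dims (assumed)
W = np.random.randn(d_p, d)
q, k = np.random.randn(d), np.random.randn(d)

# (W q)ᵀ (W k) equals qᵀ (WᵀW) k — projecting then dotting is the same
# as scoring through the Gram matrix WᵀW
lhs = (W @ q) @ (W @ k)
rhs = q @ (W.T @ W) @ k
assert np.isclose(lhs, rhs)
```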

5. Location-based

eᵢⱼ = (W qᵢ)ⱼ

Depends only on the query and the position of the key, ignoring the key's content (Luong's location-based attention). The softmax is applied later, as with the other scoring functions.
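A sketch of this content-free scoring, where W has one row per key position (a hypothetical layout for illustration):

```python
import numpy as np

d, n_keys = 64, 10               # dims assumed
W = np.random.randn(n_keys, d)   # row j scores key position j

q = np.random.randn(d)

# e_ij = (W q_i)_j: the score for key j reads off row j of W q_i,
# so the key's content never enters the computation
scores = W @ q                   # shape: (n_keys,)
```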

Comparison of Scoring Functions

Method          Formula             Parameters   Complexity (per query)   Use Case
Dot Product     q·k / √dₖ           None         O(n·d)                   Transformers
Additive        vᵀ tanh(W₁q + W₂k)  W₁, W₂, v    O(n·d²)                  RNN seq2seq
General         qᵀWk                W            O(n·d²)                  Luong attention
Bilinear        qᵀWᵀWk              W            O(n·d²)                  Graph attention
Location-based  (Wq)ⱼ               W            O(n·d)                   Luong location attention

Matrix Form Representation

For efficiency, we compute all alignment scores at once:

E = QKᵀ / √dₖ

E[i,j] = eᵢⱼ = score between query i and key j

E ∈ ℝ^{seq_len × seq_len}
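The same computation as a NumPy sketch, with an assumed sequence length and head dimension:

```python
import numpy as np

seq_len, d_k = 8, 64
Q = np.random.randn(seq_len, d_k)   # all queries, one per row
K = np.random.randn(seq_len, d_k)   # all keys, one per row

# every pairwise score in a single matrix multiply
E = Q @ K.T / np.sqrt(d_k)          # E[i, j] = e_ij
print(E.shape)                      # (8, 8)
```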

The Softmax Step

After computing raw scores, we apply softmax to normalize:

A = softmax(E, axis=-1)

A[i,j] = αᵢⱼ = exp(eᵢⱼ) / Σₖ exp(eᵢₖ)
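In practice the softmax is computed with a max-subtraction trick so that large scores do not overflow exp; a minimal sketch:

```python
import numpy as np

def softmax(E, axis=-1):
    # subtracting the row max leaves the result unchanged
    # but keeps exp() numerically stable
    E = E - E.max(axis=axis, keepdims=True)
    expE = np.exp(E)
    return expE / expE.sum(axis=axis, keepdims=True)

A = softmax(np.random.randn(8, 8))
print(A.sum(axis=-1))   # every row sums to 1
```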

Softmax ensures:

  • Positivity: every weight αᵢⱼ > 0
  • Normalization: each row sums to 1, Σⱼ αᵢⱼ = 1
  • The weights form a probability distribution over the keys for each query

Scaling and Why It Matters

The scaling factor √dₖ prevents the softmax from saturating (and gradients from vanishing) when dₖ is large:

Var(q · k) = dₖ · Var(qₗ) · Var(kₗ)   (for independent, zero-mean components)

If the components of q and k are drawn from N(0, 1), the dot product has variance dₖ

Dividing by √dₖ brings the variance back down to 1
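This is easy to verify empirically; a small NumPy experiment (sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = (q * k).sum(axis=1)            # 10,000 unscaled dot products
    scaled = raw / np.sqrt(d_k)
    print(f"d_k={d_k}: var(raw)≈{raw.var():.1f}, var(scaled)≈{scaled.var():.2f}")
    # var(raw) grows like d_k; var(scaled) stays near 1
```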

Advanced Scoring Techniques

Multi-head Scores

Each attention head can use different scoring functions, capturing different aspects of alignment.
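With the standard scaled dot product in every head, the per-head score computation reduces to a reshape plus a batched matmul; a sketch with assumed sizes:

```python
import numpy as np

seq_len, d_model, n_heads = 8, 64, 4
d_head = d_model // n_heads

Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)

# split the model dimension into heads: (n_heads, seq_len, d_head)
Qh = Q.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
Kh = K.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

# one independent score matrix per head
E = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq_len, seq_len)
```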

Relative Position Scoring

Scores incorporate the relative position between query and key, as introduced in Shaw et al.'s relative attention and adopted, in variant forms, by later models such as Transformer-XL and T5.
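A loop-based sketch of Shaw-style scoring, where each clipped relative distance gets its own learned key embedding (the clipping window and all weights are illustrative assumptions):

```python
import numpy as np

seq_len, d_k, max_rel = 8, 64, 4
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
# one embedding per clipped relative distance in [-max_rel, max_rel]
R = np.random.randn(2 * max_rel + 1, d_k)

E = np.empty((seq_len, seq_len))
for i in range(seq_len):
    for j in range(seq_len):
        rel = np.clip(j - i, -max_rel, max_rel) + max_rel
        # key content plus a relative-position term
        E[i, j] = Q[i] @ (K[j] + R[rel]) / np.sqrt(d_k)
```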

Rotary Scoring

Rotary Position Embedding (RoPE) encodes position directly into the scoring function by rotating query and key vectors before the dot product, so the score depends on the relative offset i − j.
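A compact sketch of the rotation (the pairing of features and the base constant follow the RoPE paper; the dimensions and vectors are assumed):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # rotate consecutive (even, odd) feature pairs by position-dependent angles
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

d_k = 64
q, k = np.random.randn(d_k), np.random.randn(d_k)
# the resulting dot product depends only on the offset 5 - 3
score = rope(q, pos=5) @ rope(k, pos=3) / np.sqrt(d_k)
```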

Test Your Understanding

Question 1: What does alignment score eᵢⱼ measure?

  • A) Distance between positions i and j
  • B) Relevance between query i and key j
  • C) Output value at position i
  • D) Weight of position i

Question 2: Why do we scale dot product scores by √dₖ?

  • A) To make scores larger
  • B) To prevent vanishing gradients from large dot products
  • C) To match softmax input
  • D) To speed up computation

Question 3: If QKᵀ produces scores in range [-10, 10], what does softmax produce?

  • A) Scores in same range
  • B) Scores that sum to 1 per row
  • C) Scores that sum to d
  • D) Scores in [-1, 1]

Question 4: Which scoring function requires the most parameters?

  • A) Dot product
  • B) Additive (Bahdanau)
  • C) General (multiplicative)
  • D) Location-based

Question 5: What is the shape of matrix E in E = QKᵀ?

  • A) [d, d]
  • B) [seq_len, seq_len]
  • C) [batch, d]
  • D) [batch, seq_len]

Question 6: Which scoring function is used in the original Transformer paper?

  • A) Additive
  • B) General
  • C) Scaled dot-product
  • D) Location-based

Question 7: What happens if we don't scale dot products in high dimensions?

  • A) Faster training
  • B) Near-one-hot distributions after softmax (vanishing gradients)
  • C) Better accuracy
  • D) Lower memory usage

Question 8: In the formula eᵢⱼ = vᵀ tanh(W₁qᵢ + W₂kⱼ), what does tanh do?

  • A) Computes dot product
  • B) Adds non-linearity and bounds output
  • C) Normalizes to [0,1]
  • D) Projects dimensions