## Introduction

Alignment scores are the core quantity that determines how much attention a query should pay to each key. They measure the similarity, or relevance, between query and key representations, and they form the foundation of every attention mechanism.
## What Are Alignment Scores?

The alignment score eᵢⱼ measures how well the query at position i aligns with the key at position j. These scores form a matrix where:
- Rows correspond to query positions
- Columns correspond to key positions
- Each cell (i,j) holds the raw score from query i to key j (before softmax normalization)
In symbols:

eᵢⱼ = score(qᵢ, kⱼ)

where qᵢ is the query at position i and kⱼ is the key at position j.
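To make the indexing concrete, here is a minimal NumPy sketch (the function name is illustrative) that fills the score matrix one cell at a time, using a plain dot product as the score:

```python
import numpy as np

def score_matrix(Q, K):
    """Fill E[i, j] = q_i · k_j one cell at a time.

    Q: (n_q, d) query vectors, K: (n_k, d) key vectors.
    Returns E: (n_q, n_k), the raw (pre-softmax) alignment scores.
    """
    n_q, n_k = Q.shape[0], K.shape[0]
    E = np.zeros((n_q, n_k))
    for i in range(n_q):           # rows: query positions
        for j in range(n_k):       # columns: key positions
            E[i, j] = Q[i] @ K[j]  # score(q_i, k_j), here a dot product
    return E
```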
## Types of Scoring Functions

### 1. Dot Product (Scaled)

eᵢⱼ = qᵢ·kⱼ / √dₖ

Most common in modern Transformers. Fast and memory-efficient, since all scores reduce to a single matrix multiplication.

### 2. Additive (Bahdanau-style)

eᵢⱼ = vᵀ tanh(Wqᵢ + Ukⱼ)

More expressive but computationally heavier, with three sets of learned parameters (W, U, v).

### 3. General (Multiplicative)

eᵢⱼ = qᵢᵀWkⱼ

Similar to the dot product but with a learned weight matrix W between query and key.

### 4. Bilinear (with projection)

eᵢⱼ = (Wqᵢ)·(Wkⱼ) = qᵢᵀWᵀWkⱼ

Projects both query and key through W before taking the dot product.

### 5. Location-based

eᵢⱼ = wᵀqᵢ

Depends only on the query and the relative position, ignoring the key's content entirely.
## Comparison of Scoring Functions

| Method | Formula | Parameters | Complexity (per query) | Use Case |
|---|---|---|---|---|
| Dot Product (scaled) | q·k / √dₖ | None | O(n·d) | Transformers |
| Additive | vᵀ tanh(Wq + Uk) | W, U, v | O(n·d²) | RNN seq2seq |
| General | qᵀWk | W | O(n·d²) | Luong attention |
| Bilinear | qᵀWᵀWk | W | O(n·d²) | Graph attention |
| Location-based | wᵀq | w | O(n·d) | Luong location attention |
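The formulas above translate directly into code. Here is a minimal NumPy sketch of all five scoring functions for a single query-key pair; the weight matrices are random stand-ins for what would be learned parameters in a real model:

```python
import numpy as np

d = 8  # model dimension (illustrative)
rng = np.random.default_rng(0)
q = rng.standard_normal(d)  # one query vector
k = rng.standard_normal(d)  # one key vector

# Random stand-ins for learned parameters
W = rng.standard_normal((d, d))  # general / bilinear projection
U = rng.standard_normal((d, d))  # additive key projection
v = rng.standard_normal(d)       # additive combination vector
w = rng.standard_normal(d)       # location-based weight vector

e_dot = (q @ k) / np.sqrt(d)        # 1. scaled dot product
e_add = v @ np.tanh(W @ q + U @ k)  # 2. additive (Bahdanau)
e_gen = q @ W @ k                   # 3. general (Luong)
e_bil = (W @ q) @ (W @ k)           # 4. bilinear: q^T W^T W k
e_loc = w @ q                       # 5. location-based (ignores k)
```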
## Matrix Form Representation

For efficiency, we compute all alignment scores at once. Stacking the queries and keys row-wise into matrices Q, K ∈ ℝ^{seq_len × dₖ}, the scaled dot-product scores become:

E = QKᵀ / √dₖ, where E[i,j] = eᵢⱼ = score between query i and key j

E ∈ ℝ^{seq_len × seq_len}
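A minimal vectorized sketch of the same computation, equivalent to the double loop shown earlier but as one matrix multiply:

```python
import numpy as np

def all_scores(Q, K):
    """Scaled dot-product scores for every (query, key) pair at once.

    Q: (n, d) queries, K: (n, d) keys.
    Returns E: (n, n) with E[i, j] = q_i · k_j / sqrt(d).
    """
    d_k = Q.shape[-1]
    return (Q @ K.T) / np.sqrt(d_k)
```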
## The Softmax Step

After computing raw scores, we apply softmax row-wise to turn each query's scores into a probability distribution:
A[i,j] = αᵢⱼ = exp(eᵢⱼ) / Σₖ exp(eᵢₖ)
Softmax ensures:
- All attention weights are positive
- All attention weights sum to 1 (per query)
- Higher scores get exponentially more weight
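One implementation detail worth noting: subtracting each row's maximum before exponentiating avoids overflow without changing the result, since softmax is shift-invariant per row. A minimal NumPy sketch:

```python
import numpy as np

def softmax_rows(E):
    """Row-wise softmax: A[i, j] = exp(e_ij) / sum_k exp(e_ik).

    Subtracting each row's max before exponentiating prevents
    overflow and leaves the output unchanged.
    """
    E = E - E.max(axis=-1, keepdims=True)
    expE = np.exp(E)
    return expE / expE.sum(axis=-1, keepdims=True)

# Each row of A is positive and sums to 1 (one distribution per query):
# A = softmax_rows(all_scores(Q, K)); assert np.allclose(A.sum(axis=1), 1.0)
```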
## Scaling and Why It Matters

Dividing by √dₖ keeps the softmax from saturating, which would otherwise cause vanishing gradients when dₖ is large:
- When dₖ is large, raw dot products have magnitude on the order of √dₖ
- Large values pushed through softmax produce near-one-hot distributions
- Near-one-hot distributions yield very small gradients during backpropagation

If the components of q and k are i.i.d. N(0,1), then Var(q·k) = dₖ
Dividing by √dₖ brings the variance back to 1
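This is easy to verify numerically. The following sketch (dimensions are illustrative) samples random queries and keys and compares the variance of raw and scaled dot products:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
# 10,000 independent (q, k) pairs with N(0, 1) components
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = (q * k).sum(axis=1)     # unscaled dot products q · k
scaled = raw / np.sqrt(d_k)   # scaled dot products

print(raw.var())     # close to 512, i.e. d_k
print(scaled.var())  # close to 1.0
```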
## Advanced Scoring Techniques

### Multi-head Scores

Each attention head computes its own score matrix in a separately learned subspace (and could, in principle, use a different scoring function), so different heads capture different aspects of alignment.
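A minimal sketch of per-head scoring, assuming standard scaled dot products in every head; the per-head learned projections that a real Transformer applies first are omitted for brevity:

```python
import numpy as np

def multihead_scores(Q, K, n_heads):
    """Scaled dot-product scores computed independently per head.

    Q, K: (n, d_model) with d_model divisible by n_heads.
    Returns E: (n_heads, n, n), one score matrix per head, each
    living in its own d_model // n_heads dimensional subspace.
    """
    n, d_model = Q.shape
    d_head = d_model // n_heads
    # Split the feature dimension into heads: (n_heads, n, d_head)
    Qh = Q.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Kh = K.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    # Batched matrix multiply: one (n, n) score matrix per head
    return Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
```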
### Relative Position Scoring

Scores incorporate the relative offset between query and key positions rather than absolute positions, as introduced in Shaw et al.'s relative position representations and later adopted by models such as Transformer-XL and T5.
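A minimal sketch of Shaw-et-al.-style scoring, where `R` stands in for the learned table of relative position embeddings (random here, and the clipping distance `max_rel` is illustrative):

```python
import numpy as np

def relative_scores(Q, K, R, max_rel=4):
    """Shaw-style scores with relative position embeddings (a sketch).

    Q, K: (n, d) queries and keys.
    R: (2 * max_rel + 1, d), one vector per clipped relative
       distance j - i in [-max_rel, max_rel] (learned in a real model).
    Returns E with e_ij = q_i · (k_j + R[clip(j - i)]) / sqrt(d).
    """
    n, d = Q.shape
    idx = np.arange(n)
    # rel[i, j] = clipped offset j - i, shifted into [0, 2 * max_rel]
    rel = np.clip(idx[None, :] - idx[:, None], -max_rel, max_rel) + max_rel
    content = Q @ K.T                              # q_i · k_j
    position = np.einsum('id,ijd->ij', Q, R[rel])  # q_i · R[rel(i, j)]
    return (content + position) / np.sqrt(d)
```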
### Rotary Scoring

Rotary Position Embedding (RoPE) encodes position into the scoring function itself: queries and keys are rotated by position-dependent angles, so the resulting dot product depends on content and on the relative offset between positions. RoPE is used in models such as LLaMA and GPT-NeoX.
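A minimal sketch of RoPE, assuming the common convention of rotating consecutive feature pairs with the standard base of 10000:

```python
import numpy as np

def rope(X, base=10000.0):
    """Apply rotary position embedding to X: (n, d) with d even.

    The feature pair (x_2m, x_2m+1) at position p is rotated by the
    angle p * theta_m, where theta_m = base**(-2m / d). Rotating both
    Q and K this way makes q_i · k_j depend on positions only
    through the offset i - j.
    """
    n, d = X.shape
    pos = np.arange(n)[:, None]                 # (n, 1) positions
    theta = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequencies
    ang = pos * theta                           # (n, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = X[:, 0::2], X[:, 1::2]             # even / odd features
    out = np.empty_like(X)
    out[:, 0::2] = x1 * cos - x2 * sin          # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Scores then use the rotated vectors as usual:
# E = rope(Q) @ rope(K).T / np.sqrt(Q.shape[-1])
```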