Introduction
The Query-Key-Value (QKV) mechanism is the foundation of modern attention systems. It provides a flexible framework in which each token in a sequence can query information from every other token based on learned representations, and it subsumes earlier attention variants, such as additive and multiplicative (dot-product) attention, into a single formulation.
Core Concept: Information Retrieval by Similarity
Think of the QKV mechanism like a library search system:
- Query (Q): What information are you looking for?
- Key (K): What information does each source contain?
- Value (V): What is the actual content of each source?
Mathematical Formulation
Q = X · W_Q (Query projection)
K = X · Wₖ (Key projection)
V = X · Wᵥ (Value projection)
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Where:
- X: Input sequence of embeddings
- W_Q, Wₖ, Wᵥ: Learnable projection matrices
- dₖ: Dimension of the query and key vectors
- √dₖ: Scaling factor that keeps dot products from growing large, which would otherwise push the softmax into regions with very small gradients
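A minimal NumPy sketch of this formula, assuming unbatched 2-D arrays for Q, K, and V (the function name and array layout are illustrative choices, not part of the formulation above):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # [seq_len, seq_len] similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # [seq_len, d_v] weighted sum of values
```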
Detailed Step-by-Step
Step 1: Project Input to Q, K, V
Each input token is linearly transformed into three different representations:
qᵢ = xᵢ · W_Q (What am I looking for?)
kⱼ = xⱼ · Wₖ (What do I contain?)
vⱼ = xⱼ · Wᵥ (What can I share?)
Step 2: Compute Similarity Scores
Compute scaled dot products between queries and keys to determine relevance:
scoreᵢⱼ = (qᵢ · kⱼ) / √dₖ
Higher scores mean position i should pay more attention to position j.
Step 3: Normalize with Softmax
Convert the scores for each query position into a probability distribution over all positions:
αᵢⱼ = exp(scoreᵢⱼ) / Σₗ exp(scoreᵢₗ)
Step 4: Weighted Sum of Values
Compute the final output for each position as a weighted sum of the value vectors:
outputᵢ = Σⱼ αᵢⱼ · vⱼ
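The four steps can be traced end to end in a short NumPy sketch; the random matrices stand in for weights that would be learned during training, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 4, 8, 8, 8          # illustrative sizes

X   = rng.standard_normal((seq_len, d_model))    # input token embeddings
W_Q = rng.standard_normal((d_model, d_k))        # learned in practice; random here
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))

# Step 1: project the input into queries, keys, and values
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: similarity scores, scaled by sqrt(d_k)
scores = Q @ K.T / np.sqrt(d_k)                  # [seq_len, seq_len]

# Step 3: softmax so each row is a probability distribution
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 4: weighted sum of the value vectors
output = weights @ V                             # [seq_len, d_v]
```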
Why Three Separate Representations?
Separating Q, K, V allows the model to learn different transformations for different roles:
- Query transformation: Learn what information to seek
- Key transformation: Learn what information to offer
- Value transformation: Learn what information to actually provide
This separation enables more expressive attention patterns than using the same representation for all three roles.
Dimensions and Shapes
Query Q: [batch, seq_len, dₖ]
Key K: [batch, seq_len, dₖ]
Value V: [batch, seq_len, dᵥ]
Attention weights: [batch, seq_len, seq_len]
Output: [batch, seq_len, dᵥ]
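A quick batched shape check (all sizes arbitrary) confirms these dimensions:

```python
import numpy as np

batch, seq_len, d_k, d_v = 2, 5, 16, 32
Q = np.zeros((batch, seq_len, d_k))
K = np.zeros((batch, seq_len, d_k))
V = np.zeros((batch, seq_len, d_v))

scores  = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # [batch, seq_len, seq_len]
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)     # rows sum to 1
output  = weights @ V                              # [batch, seq_len, d_v]

print(weights.shape, output.shape)                 # (2, 5, 5) (2, 5, 32)
```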
Role in Different Attention Types
Self-Attention
Q, K, V all come from the same sequence. Each position attends to all positions in the same sequence.
Cross-Attention
Q comes from one sequence (decoder), K and V come from another sequence (encoder). Enables cross-sequence interaction.
Multi-Head Attention
Multiple QKV projections run in parallel, each learning different aspects of attention.
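All three variants can share one implementation. The sketch below is illustrative (the function and parameter names are assumptions, not a standard API): passing the same sequence as X_q and X_kv gives self-attention, passing decoder states as X_q and encoder states as X_kv gives cross-attention, and splitting the model dimension across heads gives multi-head attention.

```python
import numpy as np

def multi_head_attention(X_q, X_kv, W_Q, W_K, W_V, W_O, num_heads):
    """Illustrative multi-head attention over unbatched sequences."""
    seq_q, d_model = X_q.shape
    seq_kv = X_kv.shape[0]
    d_head = d_model // num_heads

    # project once, then split the last dimension into heads: [heads, seq, d_head]
    Q = (X_q  @ W_Q).reshape(seq_q,  num_heads, d_head).transpose(1, 0, 2)
    K = (X_kv @ W_K).reshape(seq_kv, num_heads, d_head).transpose(1, 0, 2)
    V = (X_kv @ W_V).reshape(seq_kv, num_heads, d_head).transpose(1, 0, 2)

    scores  = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # [heads, seq_q, seq_kv]
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    heads = weights @ V                                     # [heads, seq_q, d_head]

    # concatenate heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_q, d_model)
    return concat @ W_O

# Example: cross-attention with 3 decoder positions attending to 6 encoder positions
rng = np.random.default_rng(0)
d_model = 16
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))
enc = rng.standard_normal((6, d_model))   # "encoder" states (keys/values)
dec = rng.standard_normal((3, d_model))   # "decoder" states (queries)
out = multi_head_attention(dec, enc, W_Q, W_K, W_V, W_O, num_heads=4)  # (3, 16)
```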