05. Query-Key-Value Mechanism

Introduction

The Query-Key-Value (QKV) mechanism is the foundation of modern attention systems. It provides a flexible framework in which each token in a sequence can query information from all other tokens based on learned representations, and it generalizes earlier attention variants (additive, multiplicative) into a single unified form.

Core Concept: Information Retrieval by Similarity

Think of the QKV mechanism like a library search system: the query is the search request you submit, the keys are the index entries describing each book, and the values are the books' actual contents. Matching your query against the index tells you which books are relevant, and the contents of those books form the answer:

          QKV ATTENTION FLOW

Query (Q)      Key (K)       Value (V)
    │             │              │
    └──────┬──────┘              │
           ▼                     │
     ┌───────────┐               │
     │   SCORE   │               │
     │ Q·Kᵀ/√dₖ  │               │
     └───────────┘               │
           │ softmax             │
           ▼                     │
     ┌───────────┐               │
     │   ATTN    │               │
     │  WEIGHTS  │               │
     │    αᵢⱼ    │               │
     └───────────┘               │
           │                     │
           └──────────┬──────────┘
                      ▼
                 ┌──────────┐
                 │  OUTPUT  │
                 │  Σ α·V   │
                 └──────────┘

Mathematical Formulation

Q = X · W_Q (Query projection)

K = X · W_K (Key projection)

V = X · W_V (Value projection)

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Where:

  • X is the input sequence of token embeddings
  • W_Q, W_K, W_V are learned projection matrices
  • dₖ is the key dimension, used to scale the dot products
  • softmax is applied row-wise, over the key positions
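To make the formula concrete, here is a minimal end-to-end sketch in NumPy. The function name scaled_dot_product_attention and the toy sizes are illustrative choices, not part of any particular library:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Compute softmax(QKᵀ/√dₖ)·V for a single sequence."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # similarity scores eᵢⱼ
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax → αᵢⱼ
        return weights @ V                              # weighted sum of values

    # Toy example: 4 tokens, d_model = 8, d_k = d_v = 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
    out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
    print(out.shape)  # (4, 8)

The detailed steps below walk through each line of this computation.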

Detailed Step-by-Step

Step 1: Project Input to Q, K, V

Each input token is linearly transformed into three different representations:

qᵢ = xᵢ · W_Q (What am I looking for?)

kⱼ = xⱼ · W_K (What do I contain?)

vⱼ = xⱼ · W_V (What can I share?)
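As a minimal sketch (the NumPy setup and toy sizes are my own choices), the three projections are just matrix multiplies with separately learned weights:

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d_model, d_k, d_v = 4, 8, 8, 8
    X = rng.normal(size=(seq_len, d_model))   # input token embeddings

    W_Q = rng.normal(size=(d_model, d_k))     # learned query projection
    W_K = rng.normal(size=(d_model, d_k))     # learned key projection
    W_V = rng.normal(size=(d_model, d_v))     # learned value projection

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # one q, k, v row per token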

Step 2: Compute Similarity Scores

Compute dot products between queries and keys to determine relevance:

eᵢⱼ = qᵢ · kⱼᵀ / √dₖ

Higher scores mean position i should pay more attention to position j.
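Continuing the sketch (toy Q and K stand in for the projections from Step 1), the full score matrix for every (i, j) pair is a single matrix multiply:

    import numpy as np

    # Toy Q and K for 3 tokens with dₖ = 4 (random stand-ins)
    rng = np.random.default_rng(1)
    Q = rng.normal(size=(3, 4))
    K = rng.normal(size=(3, 4))

    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # eᵢⱼ for every query-key pair
    print(scores.shape)               # (3, 3): one score per (i, j)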

Step 3: Normalize with Softmax

Convert scores to probability distribution:

αᵢⱼ = softmax(eᵢⱼ) = exp(eᵢⱼ) / Σₖ exp(eᵢₖ)
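A small sketch of a numerically stable row-wise softmax; the max-subtraction trick is standard practice, not specific to this lesson:

    import numpy as np

    def softmax_rows(scores):
        # Subtracting the row max avoids overflow in exp(); it does not
        # change the result because softmax is shift-invariant.
        shifted = scores - scores.max(axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=-1, keepdims=True)

    scores = np.array([[2.0, 0.5, -1.0]])
    weights = softmax_rows(scores)
    print(weights, weights.sum())  # each row sums to 1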

Step 4: Weighted Sum of Values

Compute final output as weighted sum of values:

outputᵢ = Σⱼ αᵢⱼ · vⱼ
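A worked toy example of this weighted sum (hand-picked αᵢⱼ and values, dᵥ = 2):

    import numpy as np

    # Attention row for one token i over 3 positions, plus toy values vⱼ
    weights = np.array([[0.7, 0.2, 0.1]])
    V = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
    output = weights @ V   # Σⱼ αᵢⱼ · vⱼ
    print(output)          # [[0.75 0.25]]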

Why Three Separate Representations?

Separating Q, K, V allows the model to learn different transformations for different roles:

  • W_Q learns what each token should look for in other tokens
  • W_K learns how each token advertises the content it holds
  • W_V learns what content each token actually contributes once selected

This separation enables more expressive attention patterns than using the same representation for all three roles: if queries, keys, and values were all the raw embedding xᵢ, tokens would tend to attend mostly to themselves.

Dimensions and Shapes

Input X: [batch, seq_len, d_model]

Query Q: [batch, seq_len, dₖ]

Key K: [batch, seq_len, dₖ]

Value V: [batch, seq_len, dᵥ]

Attention weights: [batch, seq_len, seq_len]

Output: [batch, seq_len, dᵥ]
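These shapes can be verified with a quick batched sketch (all sizes here are arbitrary toy choices):

    import numpy as np

    batch, seq_len, d_model, d_k, d_v = 2, 5, 16, 8, 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(batch, seq_len, d_model))
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_v))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    output = weights @ V

    assert Q.shape == (batch, seq_len, d_k)
    assert weights.shape == (batch, seq_len, seq_len)
    assert output.shape == (batch, seq_len, d_v)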

Role in Different Attention Types

Self-Attention

Q, K, V all come from the same sequence. Each position attends to all positions in the same sequence.

Cross-Attention

Q comes from one sequence (the decoder), while K and V come from another sequence (the encoder), enabling interaction across sequences.
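A minimal sketch of the cross-attention wiring; the encoder and decoder states are random stand-ins here, and the names enc/dec are my own:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_k = 16, 8
    enc = rng.normal(size=(6, d_model))   # encoder outputs, 6 source tokens
    dec = rng.normal(size=(4, d_model))   # decoder states, 4 target tokens

    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))

    Q = dec @ W_Q                         # queries come from the decoder
    K, V = enc @ W_K, enc @ W_V           # keys and values from the encoder
    scores = Q @ K.T / np.sqrt(d_k)       # shape (4, 6): target × source
    # each decoder position attends over the 6 encoder positions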

Multi-Head Attention

Multiple QKV projections run in parallel, each learning different aspects of attention.
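A compact sketch of the multi-head idea via reshaping, assuming h heads of size d_model/h; names like n_heads and split_heads are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 4, 16, 4
    d_head = d_model // n_heads

    X = rng.normal(size=(seq_len, d_model))
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

    def split_heads(M):
        # [seq, d_model] → [heads, seq, d_head]
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (split_heads(X @ W) for W in (W_Q, W_K, W_V))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                   # [heads, seq, d_head]
    output = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concat heads

Each head attends with its own projections, so different heads can specialize in different relationships.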

Test Your Understanding

Question 1: In the QKV framework, what does the Query represent?

  • A) The information to be retrieved
  • B) The database of information
  • C) The final output
  • D) The attention weight

Question 2: What is the purpose of the scaling factor √dₖ?

  • A) To normalize the values
  • B) To prevent large dot products that cause vanishing gradients
  • C) To speed up computation
  • D) To match dimensions

Question 3: If input X has shape [batch, seq_len, 512], what shape are W_Q, W_K, W_V?

  • A) [512, 512]
  • B) [d_model, d_k] where d_k varies
  • C) [seq_len, seq_len]
  • D) [batch, batch]

Question 4: Why do we need separate projections for Q, K, and V?

  • A) To increase the number of parameters
  • B) To allow learning different transformations for different roles
  • C) To reduce computational cost
  • D) To make attention differentiable

Question 5: In cross-attention, where do Q, K, V come from?

  • A) Q from encoder, K and V from decoder
  • B) Q from decoder, K and V from encoder
  • C) All from encoder
  • D) All from decoder

Question 6: What is the output shape of the attention mechanism?

  • A) [batch, seq_len, seq_len]
  • B) [batch, seq_len, d_k]
  • C) [batch, seq_len, d_v]
  • D) [batch, d_model, d_model]

Question 7: Higher dot product between Q and K indicates what?

  • A) Lower attention weight
  • B) Higher relevance/attention weight
  • C) No relationship
  • D) Forward pass complete

Question 8: The formula Attention(Q,K,V) = softmax(QKᵀ/√d)·V computes which steps?

  • A) Only similarity scores
  • B) Only weighted sum
  • C) Scores + softmax + weighted sum
  • D) Only projection