05. Query-Key-Value Mechanism

Introduction

The Query-Key-Value (QKV) mechanism is the foundation of modern attention systems. It provides a flexible framework in which each token in a sequence can query information from all other tokens based on learned representations, and it generalizes earlier attention variants (additive, multiplicative) into a single unified form.

Core Concept: Information Retrieval by Similarity

Think of the QKV mechanism like a library search system: the query is the search request you submit, the keys are the index entries describing each book, and the values are the books' actual contents. Matching your query against the index tells you which books are relevant, and the contents of those books form the answer:

          QKV ATTENTION FLOW

Query (Q)      Key (K)       Value (V)
    │             │              │
    └──────┬──────┘              │
           ▼                     │
     ┌───────────┐               │
     │   SCORE   │               │
     │ Q·Kᵀ/√dₖ  │               │
     └───────────┘               │
           │ softmax             │
           ▼                     │
     ┌───────────┐               │
     │   ATTN    │               │
     │  WEIGHTS  │               │
     │    αᵢⱼ    │               │
     └───────────┘               │
           │                     │
           └──────────┬──────────┘
                      ▼
                 ┌──────────┐
                 │  OUTPUT  │
                 │  Σ α·V   │
                 └──────────┘

Mathematical Formulation

Q = X · W_Q (Query projection)

K = X · W_K (Key projection)

V = X · W_V (Value projection)

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Where:

  • X is the input sequence of token embeddings
  • W_Q, W_K, W_V are learned projection matrices
  • dₖ is the key dimension, used to scale the dot products
  • softmax is applied row-wise, over the key positions
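To make the formula concrete, here is a minimal end-to-end sketch in NumPy. The function name scaled_dot_product_attention and the toy sizes are illustrative choices, not part of any particular library:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Compute softmax(QKᵀ/√dₖ)·V for a single sequence."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # similarity scores eᵢⱼ
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax → αᵢⱼ
        return weights @ V                              # weighted sum of values

    # Toy example: 4 tokens, d_model = 8, d_k = d_v = 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
    out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
    print(out.shape)  # (4, 8)

The detailed steps below walk through each line of this computation.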

Detailed Step-by-Step

Step 1: Project Input to Q, K, V

Each input token is linearly transformed into three different representations:

qᵢ = xᵢ · W_Q (What am I looking for?)

kⱼ = xⱼ · W_K (What do I contain?)

vⱼ = xⱼ · W_V (What can I share?)
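As a minimal sketch (the NumPy setup and toy sizes are my own choices), the three projections are just matrix multiplies with separately learned weights:

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d_model, d_k, d_v = 4, 8, 8, 8
    X = rng.normal(size=(seq_len, d_model))   # input token embeddings

    W_Q = rng.normal(size=(d_model, d_k))     # learned query projection
    W_K = rng.normal(size=(d_model, d_k))     # learned key projection
    W_V = rng.normal(size=(d_model, d_v))     # learned value projection

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # one q, k, v row per token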

Step 2: Compute Similarity Scores

Compute dot products between queries and keys to determine relevance:

eᵢⱼ = qᵢ · kⱼᵀ / √dₖ

Higher scores mean position i should pay more attention to position j.
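Continuing the sketch (toy Q and K stand in for the projections from Step 1), the full score matrix for every (i, j) pair is a single matrix multiply:

    import numpy as np

    # Toy Q and K for 3 tokens with dₖ = 4 (random stand-ins)
    rng = np.random.default_rng(1)
    Q = rng.normal(size=(3, 4))
    K = rng.normal(size=(3, 4))

    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # eᵢⱼ for every query-key pair
    print(scores.shape)               # (3, 3): one score per (i, j)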

Step 3: Normalize with Softmax

Convert scores to probability distribution:

αᵢⱼ = softmax(eᵢⱼ) = exp(eᵢⱼ) / Σₖ exp(eᵢₖ)
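A small sketch of a numerically stable row-wise softmax; the max-subtraction trick is standard practice, not specific to this lesson:

    import numpy as np

    def softmax_rows(scores):
        # Subtracting the row max avoids overflow in exp(); it does not
        # change the result because softmax is shift-invariant.
        shifted = scores - scores.max(axis=-1, keepdims=True)
        exp = np.exp(shifted)
        return exp / exp.sum(axis=-1, keepdims=True)

    scores = np.array([[2.0, 0.5, -1.0]])
    weights = softmax_rows(scores)
    print(weights, weights.sum())  # each row sums to 1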

Step 4: Weighted Sum of Values

Compute final output as weighted sum of values:

outputᵢ = Σⱼ αᵢⱼ · vⱼ
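A worked toy example of this weighted sum (hand-picked αᵢⱼ and values, dᵥ = 2):

    import numpy as np

    # Attention row for one token i over 3 positions, plus toy values vⱼ
    weights = np.array([[0.7, 0.2, 0.1]])
    V = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
    output = weights @ V   # Σⱼ αᵢⱼ · vⱼ
    print(output)          # [[0.75 0.25]]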

Why Three Separate Representations?

Separating Q, K, V allows the model to learn different transformations for different roles:

  • W_Q learns what each token should look for in other tokens
  • W_K learns how each token advertises the content it holds
  • W_V learns what content each token actually contributes once selected

This separation enables more expressive attention patterns than using the same representation for all three roles: if queries, keys, and values were all the raw embedding xᵢ, tokens would tend to attend mostly to themselves.

Dimensions and Shapes

Input X: [batch, seq_len, d_model]

Query Q: [batch, seq_len, dₖ]

Key K: [batch, seq_len, dₖ]

Value V: [batch, seq_len, dᵥ]

Attention weights: [batch, seq_len, seq_len]

Output: [batch, seq_len, dᵥ]
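These shapes can be verified with a quick batched sketch (all sizes here are arbitrary toy choices):

    import numpy as np

    batch, seq_len, d_model, d_k, d_v = 2, 5, 16, 8, 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(batch, seq_len, d_model))
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_v))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    output = weights @ V

    assert Q.shape == (batch, seq_len, d_k)
    assert weights.shape == (batch, seq_len, seq_len)
    assert output.shape == (batch, seq_len, d_v)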

Role in Different Attention Types

Self-Attention

Q, K, V all come from the same sequence. Each position attends to all positions in the same sequence.

Cross-Attention

Q comes from one sequence (the decoder), while K and V come from another sequence (the encoder), enabling interaction across sequences.
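A minimal sketch of the cross-attention wiring; the encoder and decoder states are random stand-ins here, and the names enc/dec are my own:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_k = 16, 8
    enc = rng.normal(size=(6, d_model))   # encoder outputs, 6 source tokens
    dec = rng.normal(size=(4, d_model))   # decoder states, 4 target tokens

    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))

    Q = dec @ W_Q                         # queries come from the decoder
    K, V = enc @ W_K, enc @ W_V           # keys and values from the encoder
    scores = Q @ K.T / np.sqrt(d_k)       # shape (4, 6): target × source
    # each decoder position attends over the 6 encoder positions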

Multi-Head Attention

Multiple QKV projections run in parallel, each learning different aspects of attention.
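A compact sketch of the multi-head idea via reshaping, assuming h heads of size d_model/h; names like n_heads and split_heads are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 4, 16, 4
    d_head = d_model // n_heads

    X = rng.normal(size=(seq_len, d_model))
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

    def split_heads(M):
        # [seq, d_model] → [heads, seq, d_head]
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (split_heads(X @ W) for W in (W_Q, W_K, W_V))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                   # [heads, seq, d_head]
    output = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concat heads

Each head attends with its own projections, so different heads can specialize in different relationships.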

Test Your Understanding

Question 1: In the QKV framework, what does the Query represent?

  • A) The information to be retrieved
  • B) The database of information
  • C) The final output
  • D) The attention weight

Question 2: What is the purpose of the scaling factor √dₖ?

  • A) To normalize the values
  • B) To prevent large dot products that cause vanishing gradients
  • C) To speed up computation
  • D) To match dimensions

Question 3: If input X has shape [batch, seq_len, 512], what shape are W_Q, W_K, W_V?

  • A) [512, 512]
  • B) [d_model, d_k] where d_k varies
  • C) [seq_len, seq_len]
  • D) [batch, batch]

Question 4: Why do we need separate projections for Q, K, and V?

  • A) To increase the number of parameters
  • B) To allow learning different transformations for different roles
  • C) To reduce computational cost
  • D) To make attention differentiable

Question 5: In cross-attention, where do Q, K, V come from?

  • A) Q from encoder, K and V from decoder
  • B) Q from decoder, K and V from encoder
  • C) All from encoder
  • D) All from decoder

Question 6: What is the output shape of the attention mechanism?

  • A) [batch, seq_len, seq_len]
  • B) [batch, seq_len, d_k]
  • C) [batch, seq_len, d_v]
  • D) [batch, d_model, d_model]

Question 7: Higher dot product between Q and K indicates what?

  • A) Lower attention weight
  • B) Higher relevance/attention weight
  • C) No relationship
  • D) Forward pass complete

Question 8: The formula Attention(Q,K,V) = softmax(QKᵀ/√d)·V computes which steps?

  • A) Only similarity scores
  • B) Only weighted sum
  • C) Scores + softmax + weighted sum
  • D) Only projection