07. Context Vectors

Introduction

Context vectors are the weighted sum of value representations, computed using attention weights. They represent the information aggregated from all source positions, tailored specifically for each query position. The context vector is the final product of the attention mechanism.

Mathematical Definition

Given attention weights αᵢⱼ and values vⱼ, the context vector for query position i is:

cᵢ = Σⱼ αᵢⱼ · vⱼ

This is a weighted average of all values, where weights indicate how much each position contributes.
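
As a concrete illustration, here is a minimal NumPy sketch of this weighted sum for a single query position; the weights and values are made-up placeholders, not model outputs:

```python
import numpy as np

# Illustrative numbers: 3 source positions, d_v = 4
alpha_i = np.array([0.1, 0.5, 0.4])   # attention weights for query position i (sum to 1)
V = np.random.randn(3, 4)             # one value vector v_j per source position

# c_i = sum_j alpha_ij * v_j  -- a weighted average of the value rows
c_i = alpha_i @ V                     # shape: (4,)

# Equivalent explicit loop, for clarity
c_loop = sum(a * v for a, v in zip(alpha_i, V))
assert np.allclose(c_i, c_loop)
```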

```
Position i Context Vector Computation:

Attention weights: [αᵢ₁, αᵢ₂, αᵢ₃, ..., αᵢₙ]
                         │
          ┌──────────────┼──────────────┐
          ▼              ▼              ▼
          v₁             v₂     ...     vₙ
          │              │              │
       αᵢ₁ · v₁       αᵢ₂ · v₂      αᵢₙ · vₙ
          │              │              │
          └──────────────┼──────────────┘
                         ▼
                cᵢ = Σⱼ αᵢⱼ · vⱼ
```

Role in Attention

Context vectors serve as the bridge between attention computation and output: they package the information gathered from all source positions into a single vector per query, which the downstream components (output projection, feed-forward sublayers) then consume.

Context Vector Properties

1. Dimension

For multi-head attention with h heads, if each head produces d_v-dimensional context:

cᵢ = concat(head₁, head₂, ..., headₕ) · Wᴼ

Final dimension: d_model (same as input if d_model = h · d_v)
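
A minimal sketch of this concatenate-and-project step, using random arrays as stand-ins for the per-head contexts (shapes follow the d_v = 64, h = 8, d_model = 512 convention; W_O is an untrained random matrix here):

```python
import numpy as np

h, d_v, d_model = 8, 64, 512
heads = [np.random.randn(d_v) for _ in range(h)]  # per-head context for one position (placeholders)
W_O = np.random.randn(h * d_v, d_model)           # output projection (W^O in the text)

c_i = np.concatenate(heads) @ W_O                 # concat -> (512,), project -> (512,)
print(c_i.shape)                                  # (512,)
```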

2. Variability

Each context vector is unique to its query position, computed independently. This allows different positions to focus on different aspects of the source information.

3. Information Content

The context vector blends information from all values in proportion to their attention weights: positions with higher weights contribute more.
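
A small sketch of both properties, using a hand-picked weight matrix A (assumed, not taken from a trained model) so the effect is visible; each row of A produces its own context vector:

```python
import numpy as np

n, d_v = 3, 4
A = np.array([[0.7, 0.2, 0.1],      # row i holds the weights alpha_ij for query position i
              [0.1, 0.1, 0.8],
              [1/3, 1/3, 1/3]])     # each row sums to 1
V = np.random.randn(n, d_v)

C = A @ V   # row i is c_i: row 0 is dominated by v_1's source, row 1 by v_3's,
            # row 2 is a uniform average -- different queries, different contexts
```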

Comparison: Context Vector vs Hidden State

| Aspect        | Context Vector                 | Hidden State (RNN)         |
|---------------|--------------------------------|----------------------------|
| Source        | Weighted sum of values         | Sequential computation     |
| Information   | All positions, query-specific  | Processed sequence history |
| Computation   | Parallel (all at once)         | Sequential (step by step)  |
| Gradient flow | Direct to all positions        | Through time steps         |

Multi-layer Context Vectors

In deep networks, context vectors from layer l become inputs (or part of inputs) for layer l+1:

xᵢ⁽ˡ⁺¹⁾ = LayerNorm(xᵢ⁽ˡ⁾ + Attention(xᵢ⁽ˡ⁾))

or equivalently: xᵢ⁽ˡ⁺¹⁾ = LayerNorm(xᵢ⁽ˡ⁾ + cᵢ⁽ˡ⁾)
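
A sketch of this layer update in NumPy; the layer_norm helper below is a bare normalization without the learned gain and bias a real implementation would have, and the vectors are random placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (no learned gain/bias in this sketch)
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

d_model = 512
x_l = np.random.randn(d_model)   # x_i^(l): input to layer l at position i
c_l = np.random.randn(d_model)   # c_i^(l): context vector from attention at layer l
x_next = layer_norm(x_l + c_l)   # x_i^(l+1) = LayerNorm(x_i^(l) + c_i^(l))
```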

Special Types of Context Vectors

1. Cross-Attention Context

In encoder-decoder attention, context vectors come from a different sequence:

cᵢᵈᵉᶜ = Σⱼ αᵢⱼ · hⱼᵉⁿᶜ

where hⱼᵉⁿᶜ are the encoder hidden states acting as values, and αᵢⱼ is the weight that decoder query i places on encoder position j.
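
A minimal cross-attention sketch, assuming for brevity that the encoder states serve directly as both keys and values (no learned projections); the softmax helper is written inline:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 64
enc = np.random.randn(10, d)              # encoder hidden states (keys/values source)
dec_q = np.random.randn(5, d)             # decoder queries

A = softmax(dec_q @ enc.T / np.sqrt(d))   # (5, 10): each decoder position attends over encoder
C_dec = A @ enc                           # (5, 64): cross-attention contexts for the decoder
```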

2. Memory-Augmented Context

External memory can be queried to produce context:

c = Attention(query, memory_keys, memory_values)
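
The same pattern works when the keys and values come from an external memory table; a sketch with placeholder memory contents (slot count and dimensions are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, slots = 64, 128
memory_keys = np.random.randn(slots, d)    # external memory (placeholder contents)
memory_values = np.random.randn(slots, d)
query = np.random.randn(d)

w = softmax(query @ memory_keys.T / np.sqrt(d))   # one weight per memory slot
c = w @ memory_values                             # context read from memory
```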

Context Vector in Transformers

The full Transformer attention computation:

Output = Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The output matrix contains context vectors for all positions simultaneously.
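
A matrix-form sketch of this computation; Q, K, V are random placeholders and the softmax helper is our own:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_k, d_v = 6, 64, 64
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)

A = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) attention weights; each row sums to 1
Out = A @ V                           # (n, d_v): row i is the context vector c_i
```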

Interpretability Through Context Vectors

Context vectors can be analyzed to understand what information each position has gathered, for example by checking which source positions dominate each weighted sum.
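
One simple diagnostic, sketched below with an assumed small weight matrix: rank the attention weights per query, or measure the norm of each weighted contribution αᵢⱼ · vⱼ, to see which source positions dominate each context vector:

```python
import numpy as np

# Given attention weights A (n_q, n_kv) and values V (n_kv, d_v) from a model:
A = np.array([[0.05, 0.80, 0.15],
              [0.60, 0.10, 0.30]])
V = np.random.randn(3, 8)

top = A.argmax(axis=1)                    # dominant source position per query
contrib = A[..., None] * V[None, ...]     # (n_q, n_kv, d_v): per-position contributions
norms = np.linalg.norm(contrib, axis=-1)  # how much each source adds to each context
print(top, norms.round(2))
```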

Test Your Understanding

Question 1: What is the formula for computing a context vector cᵢ?

  • A) cᵢ = Σⱼ αᵢⱼ + vⱼ
  • B) cᵢ = Σⱼ αᵢⱼ · vⱼ
  • C) cᵢ = Σⱼ vⱼ
  • D) cᵢ = αᵢ · v

Question 2: What does the context vector represent?

  • A) Single source position
  • B) Weighted sum of all values based on attention
  • C) Decoder hidden state
  • D) Attention score

Question 3: If attention weights are [0.1, 0.5, 0.4] and values are [v1, v2, v3], what is the context vector?

  • A) v1 + v2 + v3
  • B) 0.1·v1 + 0.5·v2 + 0.4·v3
  • C) 0.5·v2 (highest weight only)
  • D) max(v1, v2, v3)

Question 4: In multi-head attention, what happens before final output?

  • A) Context vectors are averaged
  • B) Context vectors are concatenated and linearly transformed
  • C) Context vectors are passed through softmax
  • D) Context vectors are discarded

Question 5: Why is context vector described as "query-specific"?

  • A) All queries produce the same context
  • B) Each query position produces a different context vector based on its attention to values
  • C) Context vectors are the same for all layers
  • D) Context depends only on values, not queries

Question 6: What is the dimension of context vector if d_v = 64 and we have 8 heads?

  • A) 64
  • B) 8
  • C) 512 (8 × 64 = 512)
  • D) 1

Question 7: In cross-attention, where do context vector values come from?

  • A) Same sequence as query
  • B) Different sequence (encoder for decoder queries)
  • C) Random initialization
  • D) Previous layer only

Question 8: What property of softmax ensures attention weights sum to 1?

  • A) Linearity
  • B) Non-linearity
  • C) Normalization property
  • D) Derivative property