07. Context Vectors

Introduction

Context vectors are the weighted sum of value representations, computed using attention weights. They represent the information aggregated from all source positions, tailored specifically for each query position. The context vector is the final product of the attention mechanism.

Mathematical Definition

Given attention weights αᵢⱼ and values vⱼ, the context vector for query position i is:

cᵢ = Σⱼ αᵢⱼ · vⱼ

This is a weighted average of all values, where weights indicate how much each position contributes.
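
As a concrete illustration, here is a minimal NumPy sketch of this weighted sum for a single query position; the weights and values are made-up placeholders, not model outputs:

```python
import numpy as np

# Illustrative numbers: 3 source positions, d_v = 4
alpha_i = np.array([0.1, 0.5, 0.4])   # attention weights for query position i (sum to 1)
V = np.random.randn(3, 4)             # one value vector v_j per source position

# c_i = sum_j alpha_ij * v_j  -- a weighted average of the value rows
c_i = alpha_i @ V                     # shape: (4,)

# Equivalent explicit loop, for clarity
c_loop = sum(a * v for a, v in zip(alpha_i, V))
assert np.allclose(c_i, c_loop)
```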

```
Position i Context Vector Computation:

Attention weights: [αᵢ₁, αᵢ₂, αᵢ₃, ..., αᵢₙ]
                         │
          ┌──────────────┼──────────────┐
          ▼              ▼              ▼
          v₁             v₂     ...     vₙ
          │              │              │
       αᵢ₁ · v₁       αᵢ₂ · v₂      αᵢₙ · vₙ
          │              │              │
          └──────────────┼──────────────┘
                         ▼
                cᵢ = Σⱼ αᵢⱼ · vⱼ
```

Role in Attention

Context vectors serve as the bridge between attention computation and output: they package the information gathered from all source positions into a single vector per query, which the downstream components (output projection, feed-forward sublayers) then consume.

Context Vector Properties

1. Dimension

For multi-head attention with h heads, if each head produces d_v-dimensional context:

cᵢ = concat(head₁, head₂, ..., headₕ) · Wᴼ

Final dimension: d_model (same as input if d_model = h · d_v)
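
A minimal sketch of this concatenate-and-project step, using random arrays as stand-ins for the per-head contexts (shapes follow the d_v = 64, h = 8, d_model = 512 convention; W_O is an untrained random matrix here):

```python
import numpy as np

h, d_v, d_model = 8, 64, 512
heads = [np.random.randn(d_v) for _ in range(h)]  # per-head context for one position (placeholders)
W_O = np.random.randn(h * d_v, d_model)           # output projection (W^O in the text)

c_i = np.concatenate(heads) @ W_O                 # concat -> (512,), project -> (512,)
print(c_i.shape)                                  # (512,)
```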

2. Variability

Each context vector is unique to its query position, computed independently. This allows different positions to focus on different aspects of the source information.

3. Information Content

The context vector blends information from all values in proportion to their attention weights: positions with higher weights contribute more.
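
A small sketch of both properties, using a hand-picked weight matrix A (assumed, not taken from a trained model) so the effect is visible; each row of A produces its own context vector:

```python
import numpy as np

n, d_v = 3, 4
A = np.array([[0.7, 0.2, 0.1],      # row i holds the weights alpha_ij for query position i
              [0.1, 0.1, 0.8],
              [1/3, 1/3, 1/3]])     # each row sums to 1
V = np.random.randn(n, d_v)

C = A @ V   # row i is c_i: row 0 is dominated by v_1's source, row 1 by v_3's,
            # row 2 is a uniform average -- different queries, different contexts
```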

Comparison: Context Vector vs Hidden State

| Aspect        | Context Vector                 | Hidden State (RNN)         |
|---------------|--------------------------------|----------------------------|
| Source        | Weighted sum of values         | Sequential computation     |
| Information   | All positions, query-specific  | Processed sequence history |
| Computation   | Parallel (all at once)         | Sequential (step by step)  |
| Gradient flow | Direct to all positions        | Through time steps         |

Multi-layer Context Vectors

In deep networks, context vectors from layer l become inputs (or part of inputs) for layer l+1:

xᵢ⁽ˡ⁺¹⁾ = LayerNorm(xᵢ⁽ˡ⁾ + Attention(xᵢ⁽ˡ⁾))

or equivalently: xᵢ⁽ˡ⁺¹⁾ = LayerNorm(xᵢ⁽ˡ⁾ + cᵢ⁽ˡ⁾)
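
A sketch of this layer update in NumPy; the layer_norm helper below is a bare normalization without the learned gain and bias a real implementation would have, and the vectors are random placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (no learned gain/bias in this sketch)
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

d_model = 512
x_l = np.random.randn(d_model)   # x_i^(l): input to layer l at position i
c_l = np.random.randn(d_model)   # c_i^(l): context vector from attention at layer l
x_next = layer_norm(x_l + c_l)   # x_i^(l+1) = LayerNorm(x_i^(l) + c_i^(l))
```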

Special Types of Context Vectors

1. Cross-Attention Context

In encoder-decoder attention, context vectors come from a different sequence:

cᵢᵈᵉᶜ = Σⱼ αᵢⱼ · hⱼᵉⁿᶜ

where hⱼᵉⁿᶜ are the encoder hidden states acting as values, and αᵢⱼ is the weight that decoder query i places on encoder position j.
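
A minimal cross-attention sketch, assuming for brevity that the encoder states serve directly as both keys and values (no learned projections); the softmax helper is written inline:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 64
enc = np.random.randn(10, d)              # encoder hidden states (keys/values source)
dec_q = np.random.randn(5, d)             # decoder queries

A = softmax(dec_q @ enc.T / np.sqrt(d))   # (5, 10): each decoder position attends over encoder
C_dec = A @ enc                           # (5, 64): cross-attention contexts for the decoder
```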

2. Memory-Augmented Context

External memory can be queried to produce context:

c = Attention(query, memory_keys, memory_values)
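
The same pattern works when the keys and values come from an external memory table; a sketch with placeholder memory contents (slot count and dimensions are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, slots = 64, 128
memory_keys = np.random.randn(slots, d)    # external memory (placeholder contents)
memory_values = np.random.randn(slots, d)
query = np.random.randn(d)

w = softmax(query @ memory_keys.T / np.sqrt(d))   # one weight per memory slot
c = w @ memory_values                             # context read from memory
```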

Context Vector in Transformers

The full Transformer attention computation:

Output = Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The output matrix contains context vectors for all positions simultaneously.
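
A matrix-form sketch of this computation; Q, K, V are random placeholders and the softmax helper is our own:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_k, d_v = 6, 64, 64
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)

A = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) attention weights; each row sums to 1
Out = A @ V                           # (n, d_v): row i is the context vector c_i
```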

Interpretability Through Context Vectors

Context vectors can be analyzed to understand what information each position has gathered, for example by checking which source positions dominate each weighted sum.
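
One simple diagnostic, sketched below with an assumed small weight matrix: rank the attention weights per query, or measure the norm of each weighted contribution αᵢⱼ · vⱼ, to see which source positions dominate each context vector:

```python
import numpy as np

# Given attention weights A (n_q, n_kv) and values V (n_kv, d_v) from a model:
A = np.array([[0.05, 0.80, 0.15],
              [0.60, 0.10, 0.30]])
V = np.random.randn(3, 8)

top = A.argmax(axis=1)                    # dominant source position per query
contrib = A[..., None] * V[None, ...]     # (n_q, n_kv, d_v): per-position contributions
norms = np.linalg.norm(contrib, axis=-1)  # how much each source adds to each context
print(top, norms.round(2))
```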

Test Your Understanding

Question 1: What is the formula for computing a context vector cᵢ?

  • A) cᵢ = Σⱼ αᵢⱼ + vⱼ
  • B) cᵢ = Σⱼ αᵢⱼ · vⱼ
  • C) cᵢ = Σⱼ vⱼ
  • D) cᵢ = αᵢ · v

Question 2: What does the context vector represent?

  • A) Single source position
  • B) Weighted sum of all values based on attention
  • C) Decoder hidden state
  • D) Attention score

Question 3: If attention weights are [0.1, 0.5, 0.4] and values are [v1, v2, v3], what is the context vector?

  • A) v1 + v2 + v3
  • B) 0.1·v1 + 0.5·v2 + 0.4·v3
  • C) 0.5·v2 (highest weight only)
  • D) max(v1, v2, v3)

Question 4: In multi-head attention, what happens before final output?

  • A) Context vectors are averaged
  • B) Context vectors are concatenated and linearly transformed
  • C) Context vectors are passed through softmax
  • D) Context vectors are discarded

Question 5: Why is context vector described as "query-specific"?

  • A) All queries produce the same context
  • B) Each query position produces a different context vector based on its attention to values
  • C) Context vectors are the same for all layers
  • D) Context depends only on values, not queries

Question 6: What is the dimension of context vector if d_v = 64 and we have 8 heads?

  • A) 64
  • B) 8
  • C) 512 (8 × 64 = 512)
  • D) 1

Question 7: In cross-attention, where do context vector values come from?

  • A) Same sequence as query
  • B) Different sequence (encoder for decoder queries)
  • C) Random initialization
  • D) Previous layer only

Question 8: What property of softmax ensures attention weights sum to 1?

  • A) Linearity
  • B) Non-linearity
  • C) Normalization property
  • D) Derivative property