Introduction
A context vector is a weighted sum of value representations, computed using attention weights. It represents the information aggregated from all source positions, tailored to a specific query position. The context vector is the final product of the attention mechanism.
Mathematical Definition
Given attention weights αᵢⱼ and values vⱼ, the context vector for query position i is:

cᵢ = Σⱼ αᵢⱼ vⱼ

Because the attention weights for each query sum to 1 (they come from a softmax), this is a weighted average of all values, where each weight indicates how much the corresponding position contributes.
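As a minimal NumPy sketch of this sum (all numbers here are made up for illustration):

```python
import numpy as np

# Values v_j for 4 source positions, each d_v = 3 dimensional (toy numbers).
values = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])

# Attention weights alpha_ij for one query position i; they sum to 1.
alpha_i = np.array([0.1, 0.2, 0.3, 0.4])

# Context vector: c_i = sum_j alpha_ij * v_j
c_i = alpha_i @ values
print(c_i)  # [0.5 0.6 0.7]
```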
Role in Attention
Context vectors serve as the bridge between attention computation and output:
- Aggregation: Summarize information from all positions
- Query-specific: Different queries get different context vectors
- Information routing: Attention weights determine which information flows
Context Vector Properties
1. Dimension
For multi-head attention with h heads, if each head produces a d_v-dimensional context cᵢ⁽ᵏ⁾, the per-head contexts are concatenated and projected by the output matrix W_O:

cᵢ = Concat(cᵢ⁽¹⁾, …, cᵢ⁽ʰ⁾) W_O

Final dimension: d_model (same as the input when d_model = h · d_v)
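A shape-level sketch of that bookkeeping, assuming illustrative sizes (d_model = 8, h = 2) and a random stand-in for the learned projection W_O:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2        # sequence length and illustrative sizes
d_v = d_model // h             # 4 dimensions per head

# Per-head context vectors (random stand-ins for real attention outputs).
head_contexts = [rng.normal(size=(n, d_v)) for _ in range(h)]

# Concatenate heads: (n, h * d_v), which equals (n, d_model) here.
concat = np.concatenate(head_contexts, axis=-1)

# Output projection W_O maps the concatenation back to d_model.
W_O = rng.normal(size=(h * d_v, d_model))
out = concat @ W_O
print(concat.shape, out.shape)  # (5, 8) (5, 8)
```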
2. Variability
Each context vector is unique to its query position, computed independently. This allows different positions to focus on different aspects of the source information.
3. Information Content
The context vector contains information from all values, weighted by the attention weights. Positions with higher weights contribute more information.
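Both properties show up in a small sketch: two different queries yield two different weight rows, and each row determines how much every value contributes (the keys, values, and queries below are made up):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy keys and values for 3 source positions (d_k = d_v = 2).
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

# Two different queries produce two different weightings.
Q = np.array([[2.0, 0.0],    # aligns with key 0
              [0.0, 2.0]])   # aligns with key 1

alpha = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (2, 3): one weight row per query
C = alpha @ V                                    # (2, 2): one context per query
print(alpha.round(2))  # each row sums to 1 but differs across queries
print(C.round(2))      # hence the context vectors differ too
```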
Comparison: Context Vector vs Hidden State
| Aspect | Context Vector | Hidden State (RNN) |
|---|---|---|
| Source | Weighted sum of values | Sequential computation |
| Information | All positions, query-specific | Processed sequence history |
| Computation | Parallel (all at once) | Sequential (step by step) |
| Gradient flow | Direct to all positions | Through time steps |
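The computation rows of this table can be made concrete with a sketch: attention produces every context vector with a single matrix product, while an RNN-style recurrence (a toy tanh update here, not a full RNN cell) must step through time:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))

# Attention-style: all context vectors at once (uniform weights, just for shape).
alpha = np.full((n, n), 1.0 / n)
contexts = alpha @ X               # (n, d) in one parallel matrix product

# RNN-style: hidden states must be computed one step at a time.
W = 0.1 * rng.normal(size=(d, d))
h = np.zeros(d)
hidden = []
for t in range(n):                 # inherently sequential loop
    h = np.tanh(X[t] + h @ W)
    hidden.append(h)
```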
Multi-layer Context Vectors
In deep networks, context vectors from layer l become inputs (or part of the inputs) for layer l+1:

xᵢ⁽ˡ⁺¹⁾ = xᵢ⁽ˡ⁾ + cᵢ⁽ˡ⁾

or, with the residual connection followed by normalization as in the Transformer:

xᵢ⁽ˡ⁺¹⁾ = LayerNorm(xᵢ⁽ˡ⁾ + cᵢ⁽ˡ⁾)
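A sketch of this update, with random stand-ins for the layer inputs and context vectors (a real layer computes cᵢ⁽ˡ⁾ from xᵢ⁽ˡ⁾ via attention) and a LayerNorm without the learned gain and bias:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features; learned gain/bias omitted for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
n, d_model = 4, 8
x_l = rng.normal(size=(n, d_model))   # layer-l inputs x_i^(l)
c_l = rng.normal(size=(n, d_model))   # stand-in for the attention contexts c_i^(l)

x_next = layer_norm(x_l + c_l)        # residual add, then LayerNorm
print(x_next.shape)                   # (4, 8): same shape feeds layer l+1
```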
Special Types of Context Vectors
1. Cross-Attention Context
In encoder-decoder attention, context vectors come from a different sequence: the queries are decoder states, while the keys and values come from the encoder outputs:

cᵢ = Σⱼ αᵢⱼ vⱼᵉⁿᶜ
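A sketch with random stand-ins for encoder outputs and decoder states, using identity Q/K/V projections for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 4
enc = rng.normal(size=(7, d))   # encoder outputs: source sequence of length 7
dec = rng.normal(size=(3, d))   # decoder states:  target sequence of length 3

# Queries from the decoder; keys and values from the other (source) sequence.
Q, K, V = dec, enc, enc
alpha = softmax(Q @ K.T / np.sqrt(d))   # (3, 7): target-by-source weights
C = alpha @ V                           # (3, 4): one context per target position
```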
2. Memory-Augmented Context
External memory can be queried to produce context, with the attention weights addressing memory slots mⱼ:

cᵢ = Σⱼ αᵢⱼ mⱼ
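A sketch of such a read, with a random stand-in for the memory matrix; real memory-augmented models (e.g. Neural Turing Machines) layer learned addressing on top of this basic pattern:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_slots = 4, 10
memory = rng.normal(size=(n_slots, d))      # external memory slots m_j
q = rng.normal(size=(d,))                   # query from the model

alpha = softmax(q @ memory.T / np.sqrt(d))  # addressing weights over slots
c = alpha @ memory                          # context read from memory
```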
Context Vector in Transformers
The full Transformer attention computation:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

The output matrix contains the context vectors for all query positions simultaneously, one row per position.
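A compact NumPy version of this computation (shapes illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; each output row is a context vector."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # query-key similarities
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha = e / e.sum(axis=-1, keepdims=True)   # attention weights (rows sum to 1)
    return alpha @ V                            # context vectors for all queries

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
C = attention(Q, K, V)
print(C.shape)  # (5, 8): every position's context at once
```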
Interpretability Through Context Vectors
Context vectors can be analyzed to understand what information each position has gathered:
- Which source positions contributed most (via attention weights; see the sketch after this list)
- What semantic information was aggregated
- How information flows through the network
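The first of these analyses is a one-liner in practice; with made-up weights for a single query:

```python
import numpy as np

# Hypothetical attention weights for one query over 5 source positions.
alpha_i = np.array([0.05, 0.60, 0.10, 0.20, 0.05])

# Rank source positions by how much they contributed to c_i.
top = np.argsort(alpha_i)[::-1]
print(top[:3])  # [1 3 2]: the positions that dominate the context vector
```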