Introduction
Positional encoding is a technique used in Transformers to inject information about the position of each token in a sequence. Self-attention is permutation-invariant (it has no built-in notion of token order), so positional encoding is what gives the model access to sequence order.
The Problem
Self-attention processes tokens in parallel, treating each position equally. Without positional information, the model cannot distinguish between:
- "The cat bit the dog" vs "The dog bit the cat"
- "I saw a man with a telescope" (ambiguous)
- Sequential dependencies in time series
Solution: Sinusoidal Positional Encoding
The original Transformer uses sinusoidal encoding:
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
pos = position in sequence (0, 1, 2, ...)
i = dimension index (0 to d_model/2 - 1)
d_model = embedding dimension
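A minimal NumPy sketch of this encoding (the function name sinusoidal_pe and an even d_model are assumptions for illustration, not a library API):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # Position indices as a column; even dimension indices 2i as a row.
    positions = np.arange(max_len)[:, None]      # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # shape (1, d_model/2), values 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe
```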
Why Sinusoidal?
1. Self-attention can learn relative positions
Sinusoidal encodings have a useful property: the dot product between two encodings depends only on their offset, not on absolute position. Because sin(a)sin(b) + cos(a)cos(b) = cos(a - b), each frequency pair contributes a term in k only:

PE(pos) · PE(pos+k) = Σ_i cos(k / 10000^(2i/d_model))

This allows the model to learn relative position relationships.
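This can be checked numerically. The snippet below reuses the sinusoidal_pe sketch above (the particular max_len, d_model, offset, and positions are arbitrary choices) and prints the same value for every starting position:

```python
import numpy as np

pe = sinusoidal_pe(max_len=512, d_model=64)
k = 7  # fixed offset
for pos in (0, 50, 200, 400):
    print(round(float(np.dot(pe[pos], pe[pos + k])), 6))
# All four printed values are identical: the dot product is a function of k alone.
```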
2. Can extrapolate to unseen positions
Because it is a deterministic function of pos rather than a lookup table, sinusoidal encoding yields a valid vector for any position, including positions longer than anything seen during training. Learned embeddings, by contrast, are undefined beyond the trained maximum length.
3. Multiple frequencies capture different scales
Dimension pair 2i oscillates with wavelength 2π · 10000^(2i/d_model), forming a geometric progression from 2π ≈ 6.3 positions (i = 0, fastest-varying) up to roughly 2π · 10000 ≈ 62,832 positions (highest dimensions, slowest-varying).
This provides both fine-grained and coarse-grained position information.
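A quick way to see the range of scales (d_model = 64 is an arbitrary choice for illustration):

```python
import numpy as np

d_model = 64
two_i = np.arange(0, d_model, 2)
wavelengths = 2 * np.pi * np.power(10000.0, two_i / d_model)
print(wavelengths[0])    # ~6.28: the fastest-varying dimension pair
print(wavelengths[-1])   # ~4.7e4 for d_model=64; approaches 2*pi*10000 as d_model grows
```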
Addition vs Concatenation
Positional encoding is added to the input embeddings:
X_input = X + PE (element-wise addition)
This is more memory-efficient than concatenation: the model dimension stays d_model, so no downstream weight matrix needs to grow, and the network can still learn to separate content from position within the shared space.
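As a sketch of how this looks in practice (token_emb is a random stand-in for learned embeddings, and sinusoidal_pe is the function sketched earlier):

```python
import numpy as np

seq_len, d_model = 10, 64
token_emb = np.random.randn(seq_len, d_model)  # stand-in for learned token embeddings
pe = sinusoidal_pe(seq_len, d_model)           # from the sketch above
x_input = token_emb + pe                       # element-wise addition; shape stays (seq_len, d_model)
```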
Properties
- Fixed (not learned): Sinusoidal PE is a deterministic function
- Same dimension as input: PE has same shape as token embeddings
- Added before attention: PE is summed into the embeddings before the first layer, so every attention computation sees position-aware inputs
Modern Alternatives
- Learned positional embeddings: Trainable position embeddings
- RoPE (Rotary Position Embedding): rotates query and key vectors by position-dependent angles, so attention scores depend on relative offsets
- ALiBi: adds a distance-based linear bias directly to attention scores
- Relative position biases: Encode relative offsets
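For concreteness, here is a rough sketch of the RoPE idea, reusing the same frequency schedule as sinusoidal PE; the function name rope and the shapes are illustrative assumptions, not a library API:

```python
import numpy as np

def rope(x, positions):
    # Rotate each (even, odd) feature pair of x by angle position * frequency.
    d = x.shape[-1]
    freqs = 1.0 / np.power(10000.0, np.arange(0, d, 2) / d)  # same schedule as sinusoidal PE
    angles = positions[:, None] * freqs[None, :]             # shape (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out

# Applied to queries and keys before the dot product, so q . k depends on relative position.
q = np.random.randn(16, 64)
q_rot = rope(q, np.arange(16))
```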