Introduction
Relative positional encoding represents the distance between positions rather than their absolute indices. This is more natural for many tasks, since the relationship between tokens (how far apart they are) often matters more than where they sit in the sequence.
Why Relative Instead of Absolute?
Absolute positional encoding has limitations:
- "Word at position 5" vs "word at position 6" - absolute position matters less than relative distance
- For tasks like translation, what matters is how far apart words are, not where they are in the sentence
- Relative positions generalize better to different sequence lengths
Core Concept
Instead of encoding PE(pos), we encode the offset between positions:
For attention from position i to position j, encode offset = i - j
Or equivalently: distance = j - i (relative position)
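As a concrete illustration, here is a minimal PyTorch sketch (function name is illustrative) of the offset matrix produced by every query/key position pair:

```python
import torch

def relative_offsets(seq_len: int) -> torch.Tensor:
    """Offset matrix where entry [i, j] = i - j."""
    pos = torch.arange(seq_len)
    return pos[:, None] - pos[None, :]

print(relative_offsets(4))
# tensor([[ 0, -1, -2, -3],
#         [ 1,  0, -1, -2],
#         [ 2,  1,  0, -1],
#         [ 3,  2,  1,  0]])
```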
Shaw et al. Formulation
The original relative positional encoding from Shaw et al. (2018) adds a learned embedding w_{i-j} for the offset (i - j) directly to the key:
eᵢⱼ = qᵢᵀ (kⱼ + w_{i-j}) / √d
Transformer-XL (Dai et al., 2019) expands this into four terms by adding learned global bias vectors u and v:
eᵢⱼ = qᵢᵀ kⱼ + qᵢᵀ w_{i-j} + uᵀ kⱼ + vᵀ w_{i-j}
The terms are:
- qᵢᵀ kⱼ: content-based attention between query and key
- qᵢᵀ w_{i-j}: the query attending to the relative offset
- uᵀ kⱼ: a global content bias on each key, independent of the query position
- vᵀ w_{i-j}: a global bias for each offset
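A minimal PyTorch sketch of this four-term score, assuming an embedding table `w` with one row per offset and global bias vectors `u` and `v` (all names are illustrative, not from any paper's released code):

```python
import torch

def rel_attention_scores(q, k, w, u, v):
    """Four-term relative attention logits, as in the formula above.

    q, k: (seq, d) query / key matrices
    w:    (2*seq - 1, d) learned embeddings, one per offset i - j
    u, v: (d,) learned global content / position biases
    A sketch; real implementations vectorize the offset lookup.
    """
    seq = q.size(0)
    pos = torch.arange(seq)
    # Map offset i - j in [-(seq-1), seq-1] to an index in [0, 2*seq - 2]
    idx = (pos[:, None] - pos[None, :]) + (seq - 1)
    w_ij = w[idx]                                    # (seq, seq, d)
    content = q @ k.T                                # q_i . k_j
    rel_query = torch.einsum('id,ijd->ij', q, w_ij)  # q_i . w_{i-j}
    content_bias = k @ u                             # u . k_j, same for every row i
    pos_bias = w_ij @ v                              # v . w_{i-j}
    return content + rel_query + content_bias[None, :] + pos_bias

seq, d = 5, 8
scores = rel_attention_scores(
    torch.randn(seq, d), torch.randn(seq, d),
    torch.randn(2 * seq - 1, d), torch.randn(d), torch.randn(d),
)
print(scores.shape)  # torch.Size([5, 5])
```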
Simplified Relative Position Bias
Modern implementations often use a simpler approach:
eᵢⱼ = (qᵢ · kⱼ) / √d + b_{i-j}
where b_{i-j} is a learned bias term for offset (i-j)
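A sketch of this simplified form, assuming a single learnable scalar bias per offset (models such as T5 additionally learn one bias per attention head and bucket the offsets, which is omitted here):

```python
import math
import torch
import torch.nn as nn

class RelPosBiasAttention(nn.Module):
    """Scaled dot-product attention plus a learned per-offset bias."""

    def __init__(self, d: int, max_len: int):
        super().__init__()
        self.d = d
        self.max_len = max_len
        # One learnable scalar per offset in [-(max_len-1), max_len-1]
        self.bias = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, q, k, v):
        seq = q.size(0)
        assert seq <= self.max_len
        pos = torch.arange(seq)
        # Shift offsets i - j into valid indices of the bias table
        idx = (pos[:, None] - pos[None, :]) + (self.max_len - 1)
        logits = q @ k.T / math.sqrt(self.d) + self.bias[idx]
        return torch.softmax(logits, dim=-1) @ v

attn = RelPosBiasAttention(d=8, max_len=16)
x = torch.randn(5, 8)
print(attn(x, x, x).shape)  # torch.Size([5, 8])
```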
Clipping Distance
To bound the number of learned offset embeddings, offsets are clipped to a maximum distance k:
offset_clipped = clamp(i - j, -k, k)
This means offsets beyond ±k are treated the same
This is useful because for very long distances, the exact offset matters less than "far away."
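In code, clipping is a single clamp on the offset matrix before it indexes the embedding or bias table (a sketch; `k` is the clipping distance from the formula above):

```python
import torch

k = 3  # maximum relative distance kept distinct
pos = torch.arange(8)
offsets = pos[:, None] - pos[None, :]
clipped = offsets.clamp(-k, k)
# Every offset beyond +/-k collapses to the same value,
# so only 2*k + 1 embeddings or biases are ever needed.
print(clipped[0])  # tensor([ 0, -1, -2, -3, -3, -3, -3, -3])
```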
Comparison
| Aspect | Absolute PE | Relative PE |
|---|---|---|
| Encodes | Position number | Distance between positions |
| Attention | Position-specific | Offset-based |
| Generalization | May not extrapolate | Better generalization |
| Used in | BERT, GPT, original Transformer | T5, DeBERTa, XLNet |
Advantages of Relative
- Better generalization: handles sequence lengths not seen during training
- Natural for translation: Preserves relative structure
- More expressive: can model relationships that depend on the distance between tokens
Disadvantages
- More complex: Requires special handling in attention
- Memory for bias: Need to store bias for each offset
- Clipping needed: Cannot have unbounded offset embeddings
Models Using Relative Position
- XLNet: Uses relative positional embeddings
- T5: Uses relative position biases
- DeBERTa: Uses disentangled attention with relative position