Introduction
Learned positional embeddings are trainable parameters that represent position information in sequences. Unlike sinusoidal encoding, which is a fixed mathematical function, learned embeddings are optimized during training along with the rest of the model's weights, so they can adapt to whatever positional patterns the task requires.
How It Works
position_embedding = PE[pos] (lookup by position)
input = token_embedding + position_embedding
Implementation
Each position has a dedicated embedding vector that is learned:
- Parameter shape: [max_seq_len, d_model]
- Lookup: Position p maps to embedding PE[p]
- Combined: Added to the token embedding before the first layer (a minimal sketch follows below)
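A minimal PyTorch-style sketch of this lookup-and-add scheme; the class name, sizes, and variable names are illustrative rather than taken from any particular library:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Sketch: learned positional embeddings added to token embeddings."""

    def __init__(self, vocab_size: int, max_seq_len: int, d_model: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Trainable table of shape [max_seq_len, d_model]; row p is PE[p].
        self.position_embedding = nn.Embedding(max_seq_len, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, seq_len]
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)  # [seq_len]
        tok = self.token_embedding(token_ids)       # [batch, seq_len, d_model]
        pos = self.position_embedding(positions)    # [seq_len, d_model]
        return tok + pos                            # broadcast add over the batch

# Example usage (hypothetical sizes):
emb = LearnedPositionalEmbedding(vocab_size=30522, max_seq_len=512, d_model=768)
x = torch.randint(0, 30522, (2, 16))   # batch of 2 sequences, 16 tokens each
out = emb(x)                           # shape: [2, 16, 768]
```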
Comparison: Learned vs Sinusoidal
| Aspect | Learned Positional | Sinusoidal (Fixed) |
|---|---|---|
| Parameters | Trainable (max_seq_len × d_model) | Fixed (no learnable params) |
| Generalization | Limited to seen positions | Can extrapolate to unseen positions |
| Flexibility | Can learn any pattern | Limited to sinusoidal basis |
| Memory | More (extra parameters) | Less (no extra params) |
| Used in | BERT, GPT, RoBERTa | Original Transformer |
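For contrast, a fixed sinusoidal table in the style of the original Transformer can be built with no trainable parameters at all. This is a sketch of the standard sin/cos formulation, assuming an even d_model:

```python
import math
import torch

def sinusoidal_positions(max_seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal table [max_seq_len, d_model]; nothing here is trained."""
    positions = torch.arange(max_seq_len).unsqueeze(1).float()  # [max_seq_len, 1]
    dims = torch.arange(0, d_model, 2).float()                  # even dimensions
    freqs = torch.exp(-math.log(10000.0) * dims / d_model)      # 10000^(-2i/d_model)
    pe = torch.zeros(max_seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)
    pe[:, 1::2] = torch.cos(positions * freqs)
    return pe

pe = sinusoidal_positions(512, 768)  # same shape a learned table would have
# The learned variant would instead be nn.Embedding(max_seq_len, d_model),
# whose rows are updated by gradient descent.
```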
Advantages
- Flexibility: Can learn position-specific patterns suited to the task
- Simplicity: Easy to implement and understand
- Widely used: Standard choice in most modern models (BERT, GPT)
- Can learn absolute positions: Explicitly represents each position
Disadvantages
- Limited extrapolation: Cannot handle positions beyond the max_seq_len used during training (see the sketch after this list)
- More parameters: Requires learning max_seq_len × d_model additional parameters
- Absolute position bias: May overfit to specific absolute positions rather than relative offsets
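The extrapolation limit is easy to demonstrate: indexing a learned embedding table with a position at or beyond its size simply fails. A small illustration (sizes are hypothetical):

```python
import torch
import torch.nn as nn

max_seq_len, d_model = 512, 768
position_embedding = nn.Embedding(max_seq_len, d_model)

ok = position_embedding(torch.arange(512))       # positions 0..511: fine
try:
    position_embedding(torch.tensor([512]))      # position 512 was never trained...
except IndexError as err:
    print("out-of-range position:", err)         # ...and is not even representable
```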
Usage in Modern Models
BERT (Devlin et al., 2018)
Uses learned positional embeddings with max_position = 512
GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020)
Both use learned positional embeddings, with context lengths of 1024 (GPT-2) and 2048 (GPT-3)
RoBERTa (Liu et al., 2019)
Also uses learned positional embeddings, same as BERT
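One way to see these tables in practice, assuming the Hugging Face transformers library is installed (the attribute path below reflects its BERT implementation and may differ across versions):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# BERT stores its learned positional embeddings as an nn.Embedding of shape
# [max_position_embeddings, hidden_size] = [512, 768] for bert-base.
print(model.config.max_position_embeddings)                 # 512
print(model.embeddings.position_embeddings.weight.shape)    # torch.Size([512, 768])
```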
Key Insight
Research shows that learned positional embeddings often perform comparably to, or better than, sinusoidal encodings in practice, even though sinusoidal encodings can in principle extrapolate to longer sequences. This suggests that most models rarely need to generalize far beyond the positions they saw during training.
Combination with Token Embeddings
Token and positional embeddings are both d_model-dimensional, so they are added element-wise; the resulting sum is the input to the first transformer layer.
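A quick tensor-shape view of that element-wise addition (shapes are illustrative):

```python
import torch

batch, seq_len, d_model = 2, 16, 768
token_emb = torch.randn(batch, seq_len, d_model)   # [batch, seq_len, d_model]
pos_emb = torch.randn(seq_len, d_model)            # [seq_len, d_model], one row per position
combined = token_emb + pos_emb                     # broadcasts over the batch dimension
assert combined.shape == (batch, seq_len, d_model)
```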