Introduction
Learned positional embeddings are trainable parameters that represent position information in sequences. Unlike sinusoidal encoding, which is a fixed mathematical function, learned embeddings are optimized during training along with the rest of the model's weights, so they can adapt to whatever positional patterns the task requires.
How It Works
position_embedding = PE[pos] (lookup by position)
input = token_embedding + position_embedding
Implementation
Each position has a dedicated embedding vector that is learned:
- Parameter shape: [max_seq_len, d_model]
- Lookup: Position p maps to embedding PE[p]
- Combined: Added to the token embedding before the first layer (a minimal sketch follows below)
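A minimal PyTorch-style sketch of this lookup-and-add scheme; the class name, sizes, and variable names are illustrative rather than taken from any particular library:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Sketch: learned positional embeddings added to token embeddings."""

    def __init__(self, vocab_size: int, max_seq_len: int, d_model: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Trainable table of shape [max_seq_len, d_model]; row p is PE[p].
        self.position_embedding = nn.Embedding(max_seq_len, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, seq_len]
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)  # [seq_len]
        tok = self.token_embedding(token_ids)       # [batch, seq_len, d_model]
        pos = self.position_embedding(positions)    # [seq_len, d_model]
        return tok + pos                            # broadcast add over the batch

# Example usage (hypothetical sizes):
emb = LearnedPositionalEmbedding(vocab_size=30522, max_seq_len=512, d_model=768)
x = torch.randint(0, 30522, (2, 16))   # batch of 2 sequences, 16 tokens each
out = emb(x)                           # shape: [2, 16, 768]
```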
Comparison: Learned vs Sinusoidal
| Aspect | Learned Positional | Sinusoidal (Fixed) |
|---|---|---|
| Parameters | Trainable (max_seq_len × d_model) | Fixed (no learnable params) |
| Generalization | Limited to seen positions | Can extrapolate to unseen positions |
| Flexibility | Can learn any pattern | Limited to sinusoidal basis |
| Memory | More (extra parameters) | Less (no extra params) |
| Used in | BERT, GPT, RoBERTa | Original Transformer |
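For contrast, a fixed sinusoidal table in the style of the original Transformer can be built with no trainable parameters at all. This is a sketch of the standard sin/cos formulation, assuming an even d_model:

```python
import math
import torch

def sinusoidal_positions(max_seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal table [max_seq_len, d_model]; nothing here is trained."""
    positions = torch.arange(max_seq_len).unsqueeze(1).float()  # [max_seq_len, 1]
    dims = torch.arange(0, d_model, 2).float()                  # even dimensions
    freqs = torch.exp(-math.log(10000.0) * dims / d_model)      # 10000^(-2i/d_model)
    pe = torch.zeros(max_seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)
    pe[:, 1::2] = torch.cos(positions * freqs)
    return pe

pe = sinusoidal_positions(512, 768)  # same shape a learned table would have
# The learned variant would instead be nn.Embedding(max_seq_len, d_model),
# whose rows are updated by gradient descent.
```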
Advantages
- Flexibility: Can learn position-specific patterns suited to the task
- Simplicity: Easy to implement and understand
- Widely used: Standard choice in most modern models (BERT, GPT)
- Can learn absolute positions: Explicitly represents each position
Disadvantages
- Limited extrapolation: Cannot handle positions beyond the max_seq_len used during training (see the sketch after this list)
- More parameters: Requires learning max_seq_len × d_model additional parameters
- Absolute position bias: May overfit to specific absolute positions rather than relative offsets
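The extrapolation limit is easy to demonstrate: indexing a learned embedding table with a position at or beyond its size simply fails. A small illustration (sizes are hypothetical):

```python
import torch
import torch.nn as nn

max_seq_len, d_model = 512, 768
position_embedding = nn.Embedding(max_seq_len, d_model)

ok = position_embedding(torch.arange(512))       # positions 0..511: fine
try:
    position_embedding(torch.tensor([512]))      # position 512 was never trained...
except IndexError as err:
    print("out-of-range position:", err)         # ...and is not even representable
```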
Usage in Modern Models
BERT (Devlin et al., 2018)
Uses learned positional embeddings with max_position = 512
GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020)
Both use learned positional embeddings, with context lengths of 1024 (GPT-2) and 2048 (GPT-3)
RoBERTa (Liu et al., 2019)
Also uses learned positional embeddings, same as BERT
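One way to see these tables in practice, assuming the Hugging Face transformers library is installed (the attribute path below reflects its BERT implementation and may differ across versions):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# BERT stores its learned positional embeddings as an nn.Embedding of shape
# [max_position_embeddings, hidden_size] = [512, 768] for bert-base.
print(model.config.max_position_embeddings)                 # 512
print(model.embeddings.position_embeddings.weight.shape)    # torch.Size([512, 768])
```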
Key Insight
Research shows that learned positional embeddings often perform comparably to, or better than, sinusoidal encodings in practice, even though sinusoidal encodings can in principle extrapolate to longer sequences. This suggests that most models rarely need to generalize far beyond the positions they saw during training.
Combination with Token Embeddings
Token and positional embeddings are both d_model-dimensional, so they are added element-wise; the resulting sum is the input to the first transformer layer.
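A quick tensor-shape view of that element-wise addition (shapes are illustrative):

```python
import torch

batch, seq_len, d_model = 2, 16, 768
token_emb = torch.randn(batch, seq_len, d_model)   # [batch, seq_len, d_model]
pos_emb = torch.randn(seq_len, d_model)            # [seq_len, d_model], one row per position
combined = token_emb + pos_emb                     # broadcasts over the batch dimension
assert combined.shape == (batch, seq_len, d_model)
```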