14. Learned Positional Embeddings

Introduction

Learned positional embeddings are trainable parameters that represent position information in sequences. Unlike sinusoidal encoding (which is a fixed mathematical function), learned embeddings are optimized during training along with the rest of the model's parameters.

How It Works

PE ∈ ℝ^{max_len × d_model} (learnable parameters)

position_embedding = PE[pos] (lookup by position)

input = token_embedding + position_embedding

Implementation

Each position has a dedicated embedding vector that is learned:
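
A minimal sketch of this lookup table in PyTorch (the max_len and d_model values are illustrative, chosen to match BERT-sized models):

```python
import torch
import torch.nn as nn

# Learned positional embeddings: a trainable table PE of shape [max_len, d_model].
# Each row is the vector for one position and is updated by backpropagation,
# just like any other weight in the model.
max_len, d_model = 512, 768
position_embedding = nn.Embedding(max_len, d_model)

# Lookup by position: PE[pos] for every position in a sequence of length seq_len.
seq_len = 10
positions = torch.arange(seq_len)            # tensor([0, 1, ..., 9])
pos_vectors = position_embedding(positions)  # shape [10, 768], one vector per position
```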

Comparison: Learned vs Sinusoidal

Aspect          | Learned Positional            | Sinusoidal (Fixed)
Parameters      | Trainable (max_len × d_model) | Fixed (no learnable params)
Generalization  | Limited to seen positions     | Can extrapolate to unseen positions
Flexibility     | Can learn any pattern         | Limited to sinusoidal basis
Memory          | More (extra parameters)       | Less (no extra params)
Used in         | BERT, GPT, RoBERTa            | Original Transformer
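
The "Limited to seen positions" entry in the Generalization row is easy to verify: the table is a plain lookup, so a position index at or beyond max_len simply has no row. A small illustrative sketch:

```python
import torch
import torch.nn as nn

pos_table = nn.Embedding(num_embeddings=512, embedding_dim=768)  # trained for positions 0..511

ok = pos_table(torch.arange(512))    # works: every position was seen during training
# pos_table(torch.arange(600))       # IndexError: positions 512..599 have no learned vectors
```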

Advantages

  • Flexible: the model can learn whatever positional pattern best fits the training data
  • Simple to implement: a single trainable lookup table of shape [max_len, d_model]
  • In practice performs comparably to or better than sinusoidal encoding (see Key Insight below)

Disadvantages

  • Cannot extrapolate to positions beyond the max_len seen during training
  • Adds max_len × d_model extra parameters and memory

Usage in Modern Models

BERT (Devlin et al., 2018)

Uses learned positional embeddings with max_position = 512

GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020)

Learned positional embeddings with context length 1024 (GPT-2) or 2048 (GPT-3)

RoBERTa (Liu et al., 2019)

Also uses learned positional embeddings, same as BERT
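
One way to check these shapes directly is to inspect a pretrained checkpoint. A small sketch using the Hugging Face transformers library (assumes the library and the bert-base-uncased weights are available):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# BERT stores its learned positional embeddings as an ordinary embedding table
# with one row per position.
pos = model.embeddings.position_embeddings
print(pos.weight.shape)  # torch.Size([512, 768]) -> max_position = 512, d_model = 768
```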

Key Insight

Research shows that learned positional embeddings often perform comparably to, or better than, sinusoidal encodings in practice, even though sinusoidal encodings can in principle extrapolate to unseen positions. This suggests that models typically do not need to extrapolate far beyond the positions seen during training.

Combination with Token Embeddings

Final input = TokenEmbedding(token_id) + PositionEmbedding(position)

Both vectors are d_model-dimensional, so they can be added element-wise.
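
A minimal sketch of this input layer in PyTorch (the class name TransformerInput and the vocab_size / max_len / d_model values are illustrative):

```python
import torch
import torch.nn as nn

class TransformerInput(nn.Module):
    """Token embedding + learned positional embedding, summed element-wise."""

    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)   # [vocab_size, d_model]
        self.position_embedding = nn.Embedding(max_len, d_model)   # [max_len, d_model]

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: integer ids of shape [batch, seq_len]
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        # Both lookups produce d_model-dimensional vectors, so they can be added;
        # the positional term broadcasts over the batch dimension.
        return self.token_embedding(token_ids) + self.position_embedding(positions)

inputs = torch.randint(0, 30522, (2, 16))                  # batch of 2, 16 tokens each
out = TransformerInput(vocab_size=30522, max_len=512, d_model=768)(inputs)
print(out.shape)                                           # torch.Size([2, 16, 768])
```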

Test Your Understanding

Question 1: What is the shape of learned positional embeddings?

  • A) [d_model, d_model]
  • B) [max_len, d_model]
  • C) [vocab_size, d_model]
  • D) [seq_len, seq_len]

Question 2: Learned positional embeddings are:

  • A) Fixed mathematical function
  • B) Trainable parameters
  • C) Computed from token ids
  • D) Random noise

Question 3: Which models use learned positional embeddings?

  • A) Original Transformer only
  • B) BERT, GPT, RoBERTa
  • C) Only vision transformers
  • D) No major model uses them

Question 4: What is a disadvantage of learned positional embeddings?

  • A) Cannot learn patterns
  • B) Limited extrapolation to unseen positions
  • C) Too few parameters
  • D) Cannot be combined with token embeddings

Question 5: How is positional embedding added to token embedding?

  • A) Concatenation
  • B) Element-wise addition
  • C) Multiplication
  • D) Division

Question 6: What is the main difference from sinusoidal positional encoding?

  • A) Uses sine and cosine functions
  • B) Is trainable vs fixed
  • C) Has different dimension
  • D) Works only with images

Question 7: If max_len = 512 and d_model = 768, how many positional parameters are there?

  • A) 512
  • B) 768
  • C) 512 × 768 = 393,216
  • D) 512 + 768

Question 8: BERT uses which type of positional encoding?

  • A) Sinusoidal
  • B) Learned positional embeddings
  • C) ALiBi
  • D) RoPE