13. Positional Encoding

Introduction

Positional encoding is a technique used in Transformers to inject information about the position of each token in a sequence. Since self-attention is permutation-invariant (it does not, by itself, care about the order of tokens), positional encoding is what gives the model access to sequence order.

The Problem

Self-attention processes all tokens in parallel and treats every position identically. Without positional information, the model cannot distinguish between different orderings of the same tokens: "the dog chased the cat" and "the cat chased the dog" would produce the same set of representations.
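
To make this concrete, here is a minimal NumPy sketch (assuming single-head attention with no learned projections, so queries = keys = values): permuting the input rows only permutes the output rows, so the model has no notion of which ordering it saw.

import numpy as np

# Self-attention without projections: softmax(x x^T / sqrt(d)) x
def attention(x):
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # 5 tokens, d_model = 8
perm = rng.permutation(5)        # a shuffled token order

out = attention(x)
out_perm = attention(x[perm])    # same tokens, different order

# The shuffled sequence's output is exactly the shuffled original output:
print(np.allclose(out[perm], out_perm))  # True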

Solution: Sinusoidal Positional Encoding

The original Transformer uses sinusoidal encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:
pos = position in sequence (0, 1, 2, ...)
i = dimension index (0 to d_model/2 - 1; each i yields one sin/cos pair)
Example with d_model = 4 (simplified):

Position 0: [sin(0), cos(0), sin(0), cos(0)] = [0, 1, 0, 1]
Position 1: [sin(1), cos(1), sin(0.01), cos(0.01)]
Position 2: [sin(2), cos(2), sin(0.02), cos(0.02)]
...

Each sin/cos pair oscillates at its own frequency: the first dimensions (small i) change quickly with position, while the later dimensions (large i) change slowly.
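
As a runnable sketch of the formulas above (the function name sinusoidal_pe is our own):

import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angle = pos / 10000 ** (2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions: sin
    pe[:, 1::2] = np.cos(angle)                # odd dimensions: cos
    return pe

pe = sinusoidal_pe(3, 4)
print(pe[0])  # [0. 1. 0. 1.]
print(pe[1])  # [sin(1) cos(1) sin(0.01) cos(0.01)]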

Why Sinusoidal?

1. Self-attention can learn relative positions

Sinusoidal encodings have a useful property: the dot product between the encodings of two positions depends only on their offset, not on the absolute positions themselves:

PE(pos) · PE(pos + offset) = f(offset)

This follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b): each sin/cos pair contributes cos(offset / 10000^(2i/d_model)) to the dot product.

This allows the model to learn relative position relationships.
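
A quick numerical check of this property, using the same construction as in the sketch above (the exact value printed does not matter, only that the two prints agree):

import numpy as np

max_len, d_model = 100, 64
pos = np.arange(max_len)[:, None]
i = np.arange(d_model // 2)[None, :]
angle = pos / 10000 ** (2 * i / d_model)
pe = np.zeros((max_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angle), np.cos(angle)

offset = 7
print(np.dot(pe[3], pe[3 + offset]))    # some value f(7)
print(np.dot(pe[50], pe[50 + offset]))  # the same value f(7)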

2. Can represent positions beyond those seen in training

Unlike learned position embeddings, which exist only up to the maximum length used in training, sinusoidal encoding is a deterministic function that can be evaluated at any position, so it generalizes to longer sequences.

3. Multiple frequencies capture different scales

The wavelengths of the sin/cos pairs form a geometric progression from 2π up to 10000 · 2π:

For i = 0: wavelength = 2π (≈ 6.28 positions)
For the largest i: wavelength approaches 10000 · 2π (≈ 62,832 positions)

This provides both fine-grained and coarse-grained position information.
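
For concreteness, the wavelength of pair i is 2π · 10000^(2i/d_model); a short computation (d_model = 64 is an arbitrary choice):

import numpy as np

d_model = 64
i = np.arange(d_model // 2)
wavelengths = 2 * np.pi * 10000 ** (2 * i / d_model)
print(wavelengths[0])   # 6.283..., the finest scale (2*pi)
print(wavelengths[-1])  # ~47000, approaching the coarsest scale 2*pi*10000 ≈ 62832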

Addition vs Concatenation

The positional encoding is added element-wise to the token embeddings:

X_input = X + PE

Addition keeps the model width d_model unchanged, so no downstream weight matrix grows; concatenating PE instead would widen every subsequent layer, costing extra parameters and memory.
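
A sketch of the combination step (the embedding table and token ids below are made-up placeholders; sinusoidal_pe repeats the construction from earlier):

import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angle), np.cos(angle)
    return pe

vocab_size, d_model, seq_len = 1000, 64, 10
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # stand-in for a learned table
token_ids = rng.integers(0, vocab_size, size=seq_len)

x = embedding[token_ids]                        # (seq_len, d_model)
x_input = x + sinusoidal_pe(seq_len, d_model)   # same shape: no new parameters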

Properties

Deterministic: computed by a fixed formula, with no learned parameters
Bounded: every value lies in [-1, 1]
Unique: distinct positions receive distinct encodings
Smooth: nearby positions receive similar encodings

Modern Alternatives

Learned absolute embeddings (BERT, GPT-2): one trainable vector per position; simple, but cannot extrapolate past the training length.
Rotary position embedding (RoPE; used in LLaMA-family models): rotates each query/key dimension pair by a position-dependent angle, so relative position enters the attention dot product directly. A sketch follows below.
ALiBi: adds a bias proportional to the distance between tokens to the attention scores, instead of modifying the embeddings.
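
For illustration, a minimal sketch of RoPE in its interleaved-pair form (the function name rope and all sizes here are our own choices, not a reference implementation):

import numpy as np

def rope(x):
    # Rotate each (even, odd) dimension pair of x by pos * theta_i.
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    theta = 10000.0 ** (-2.0 * np.arange(d // 2) / d)  # (d/2,)
    angle = pos * theta[None, :]                       # (seq_len, d/2)
    out = np.empty_like(x)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out[:, 0::2] = x_even * np.cos(angle) - x_odd * np.sin(angle)
    out[:, 1::2] = x_even * np.sin(angle) + x_odd * np.cos(angle)
    return out

# RoPE's key property: the query-key dot product depends only on the offset.
rng = np.random.default_rng(0)
q = np.tile(rng.normal(size=8), (20, 1))  # the same query vector at every position
k = np.tile(rng.normal(size=8), (20, 1))  # the same key vector at every position
rq, rk = rope(q), rope(k)
print(np.isclose(rq[3] @ rk[5], rq[10] @ rk[12]))  # True: both offsets equal 2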

Test Your Understanding

Question 1: Why do we need positional encoding in Transformers?

  • A) Self-attention is permutation-invariant
  • B) To increase parameters
  • C) To make attention faster
  • D) To reduce memory

Question 2: What is the formula for sinusoidal positional encoding?

  • A) PE(pos) = pos / d_model
  • B) PE(pos, 2i) = sin(pos/10000^(2i/d_model))
  • C) PE(pos) = learned_embedding[pos]
  • D) PE(pos) = 1 if pos % 2 == 0 else 0

Question 3: How is positional encoding combined with token embeddings?

  • A) Concatenation
  • B) Element-wise addition
  • C) Matrix multiplication
  • D) Hadamard product

Question 4: What property do sinusoidal encodings have with dot products?

  • A) Always zero
  • B) Always one
  • C) Depends on offset (can learn relative positions)
  • D) Random

Question 5: Can sinusoidal positional encoding generalize to unseen positions?

  • A) No, only learned can
  • B) Yes, because it's a deterministic function
  • C) No, it always returns 0
  • D) Depends on batch size

Question 6: For dimension i in positional encoding, what happens as i increases?

  • A) Frequency decreases (longer period)
  • B) Frequency increases (shorter period)
  • C) Frequency stays same
  • D) Frequency becomes random

Question 7: What is the difference between PE(pos, 2i) and PE(pos, 2i+1)?

  • A) 2i uses sin, 2i+1 uses cos
  • B) 2i uses cos, 2i+1 uses sin
  • C) No difference
  • D) 2i uses tanh, 2i+1 uses sigmoid

Question 8: Which paper introduced sinusoidal positional encoding?

  • A) "BERT"
  • B) "Attention is All You Need"
  • C) "BERT: Pre-training"
  • D) "GPT"