15. Relative Positional Encoding

Introduction

Relative positional encoding represents the distance between positions rather than each position's absolute index. This is more natural for many tasks, since the relationship between tokens (how far apart they are) often matters more than where they happen to sit in the sequence.

Why Relative Instead of Absolute?

Absolute positional encoding has limitations:

  • A fixed table of position embeddings (or encodings) may not extrapolate to sequences longer than those seen during training.
  • The same pair of tokens gets different position signals depending on where it appears in the sequence, even though the relationship between the tokens is unchanged.
  • Attention has to learn position-specific patterns rather than distance-based ones, which transfer less readily across positions.

Core Concept

Instead of encoding PE(pos), we encode the offset between positions:

For attention from position i to position j, encode offset = i - j

Or equivalently, some formulations use j - i; only the sign convention differs.
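
As a concrete illustration (a minimal sketch, not tied to any particular model), the snippet below builds the full matrix of offsets i - j for a short sequence; each attention score eᵢⱼ can then be conditioned on offset[i, j]:

```python
import torch

seq_len = 5
positions = torch.arange(seq_len)

# offset[i, j] = i - j: how far query position i is from key position j.
offset = positions.unsqueeze(1) - positions.unsqueeze(0)
print(offset)
# tensor([[ 0, -1, -2, -3, -4],
#         [ 1,  0, -1, -2, -3],
#         [ 2,  1,  0, -1, -2],
#         [ 3,  2,  1,  0, -1],
#         [ 4,  3,  2,  1,  0]])
```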

Shaw et al. Formulation

The original relative positional encoding for self-attention, from Shaw et al. (2018), adds a learned offset embedding to the keys:

eᵢⱼ = qᵢᵀ (kⱼ + w_{i-j}) / √d

where w_{i-j} is a learned embedding for the (clipped) offset i - j. (Shaw et al. also add a second set of offset embeddings on the value side, which later work often drops.)

Transformer-XL (Dai et al., 2019) expanded this idea into four terms (there, w_{i-j} is derived from sinusoidal encodings rather than a learned table):

eᵢⱼ = qᵢᵀ kⱼ + qᵢᵀ w_{i-j} + uᵀ kⱼ + vᵀ w_{i-j}

The terms are:

  • qᵢᵀ kⱼ: content-to-content attention (the usual dot product)
  • qᵢᵀ w_{i-j}: content-to-position attention
  • uᵀ kⱼ: a global content bias (u is a learned vector shared across query positions)
  • vᵀ w_{i-j}: a global position bias (v is a learned vector shared across query positions)
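
The sketch below shows one way this could look for a single attention head, assuming key-side offset embeddings only and a clipping distance max_rel; the function and parameter names (shaw_relative_attention, rel_key_emb) are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def shaw_relative_attention(q, k, v, rel_key_emb, max_rel):
    """q, k, v: (seq_len, d).  rel_key_emb: (2*max_rel + 1, d), one learned
    embedding per clipped offset i - j in [-max_rel, max_rel]."""
    seq_len, d = q.shape
    pos = torch.arange(seq_len)
    # Clip offsets and shift them to table indices 0 .. 2*max_rel.
    idx = torch.clamp(pos[:, None] - pos[None, :], -max_rel, max_rel) + max_rel
    w = rel_key_emb[idx]                          # (seq_len, seq_len, d)

    # e_ij = (q_i . k_j + q_i . w_{i-j}) / sqrt(d)
    content = q @ k.T                             # (seq_len, seq_len)
    position = torch.einsum("id,ijd->ij", q, w)   # (seq_len, seq_len)
    scores = (content + position) / d ** 0.5

    return F.softmax(scores, dim=-1) @ v

# Toy usage with random tensors.
L, d, max_rel = 6, 8, 4
out = shaw_relative_attention(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d),
                              torch.randn(2 * max_rel + 1, d), max_rel)
print(out.shape)  # torch.Size([6, 8])
```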

Simplified Relative Position Bias

Modern implementations often use a simpler approach:

eᵢⱼ = (qᵢ · kⱼ) / √d + b_{i-j}

where b_{i-j} is a learned scalar bias for the offset (i - j), added directly to the attention logit. This is essentially the scheme used in T5, which groups offsets into buckets.
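
A minimal sketch of this additive bias, assuming one learned scalar per possible offset and no clipping yet (clipping is covered in the next subsection); the names are illustrative:

```python
import torch
import torch.nn.functional as F

def attention_with_rel_bias(q, k, v, rel_bias):
    """rel_bias: (2*seq_len - 1,) learned scalars, one per possible offset
    i - j in [-(seq_len - 1), seq_len - 1]."""
    seq_len, d = q.shape
    pos = torch.arange(seq_len)
    # Shift offset i - j into index 0 .. 2*seq_len - 2 for the bias table.
    idx = (pos[:, None] - pos[None, :]) + (seq_len - 1)
    b = rel_bias[idx]                             # (seq_len, seq_len)

    # e_ij = (q_i . k_j) / sqrt(d) + b_{i-j}
    scores = (q @ k.T) / d ** 0.5 + b
    return F.softmax(scores, dim=-1) @ v

L, d = 6, 8
out = attention_with_rel_bias(torch.randn(L, d), torch.randn(L, d),
                              torch.randn(L, d), torch.randn(2 * L - 1))
print(out.shape)  # torch.Size([6, 8])
```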

Clipping Distance

To limit the number of learned parameters (one embedding or bias per distinct offset), offsets are clipped to a maximum distance k:

offset_clipped = clamp(i - j, -k, k)

All offsets beyond ±k then share the same embedding or bias.

This works well in practice because, for very long distances, the exact offset matters less than simply "far away."
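
A small sketch of the clipping step, assuming k = 3, showing how every offset maps into a table of only 2k + 1 entries:

```python
import torch

k = 3                                   # maximum relative distance kept distinct
seq_len = 8
pos = torch.arange(seq_len)

offset = pos[:, None] - pos[None, :]    # i - j, ranges over -(seq_len-1) .. seq_len-1
clipped = torch.clamp(offset, -k, k)    # everything beyond +/-k collapses to +/-k
index = clipped + k                     # shift to 0 .. 2k for table lookup

# Only 2k + 1 = 7 embeddings/biases are needed, regardless of sequence length.
print(offset[0])   # tensor([ 0, -1, -2, -3, -4, -5, -6, -7])
print(clipped[0])  # tensor([ 0, -1, -2, -3, -3, -3, -3, -3])
print(index[0])    # tensor([3, 2, 1, 0, 0, 0, 0, 0])
```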

Comparison

| Aspect         | Absolute PE                     | Relative PE                 |
|----------------|---------------------------------|-----------------------------|
| Encodes        | Position number                 | Distance between positions  |
| Attention      | Position-specific               | Offset-based                |
| Generalization | May not extrapolate             | Better generalization       |
| Used in        | BERT, GPT, original Transformer | T5, DeBERTa, XLNet          |

Advantages of Relative

  • Generalizes better to sequence lengths not seen in training, since the same offsets recur regardless of where tokens sit.
  • Encodes what usually matters, the distance between tokens, directly in the attention score.
  • Translation-invariant: shifting a whole phrase within the sequence does not change its internal position signals.

Disadvantages

  • More complex to implement than simply adding a position vector to the input embeddings.
  • Adds parameters and computation inside every attention layer for the per-offset embeddings or biases.
  • Harder to combine with some optimized attention kernels and caching schemes that assume plain dot-product scores.

Models Using Relative Position

  • Transformer-XL / XLNet: relative encodings with global content and position biases (the four-term formulation above)
  • T5: simplified learned relative position biases, grouped into buckets and shared across layers
  • DeBERTa: disentangled attention over content and relative position

Test Your Understanding

Question 1: What does relative positional encoding encode?

  • A) Absolute position numbers
  • B) Distance between positions (offset)
  • C) Token content
  • D) Layer number

Question 2: What is the key insight of relative vs absolute position?

  • A) Absolute position matters more
  • B) Distance between words often more important than their exact positions
  • C) Both are identical
  • D) Relative requires more parameters

Question 3: Why do we clip relative positions?

  • A) To increase accuracy
  • B) To limit memory usage; far distances treated similarly
  • C) To make training faster
  • D) To prevent negative values

Question 4: In relative position, for attention from i to j, what offset is used?

  • A) i + j
  • B) i - j or j - i
  • C) max(i, j)
  • D) min(i, j)

Question 5: Which model uses relative positional encoding?

  • A) Original BERT
  • B) GPT-1
  • C) T5
  • D) Original Transformer

Question 6: In the formula eᵢⱼ = (qᵢ · kⱼ)/√d + b_{i-j}, what is b_{i-j}?

  • A) Learned bias for offset (i-j)
  • B) Query vector
  • C) Key vector
  • D) Value vector

Question 7: Why might relative position generalize better?

  • A) Uses sine/cosine functions
  • B) Focuses on relationships that are consistent across different sequence lengths
  • C) Has more parameters
  • D) Uses absolute positions

Question 8: Which paper introduced relative positional encoding for self-attention?

  • A) "Attention is All You Need"
  • B) "BERT"
  • C) "Self-Attention with Relative Position Representations" (Shaw et al.)
  • D) "GPT"