15. Relative Positional Encoding

Introduction

Relative positional encoding represents the distance between positions rather than each position's absolute index. This is more natural for many tasks, since the relationship between tokens (how far apart they are) often matters more than where they happen to sit in the sequence.

Why Relative Instead of Absolute?

Absolute positional encoding has limitations:

  • A fixed table of position embeddings (or encodings) may not extrapolate to sequences longer than those seen during training.
  • The same pair of tokens gets different position signals depending on where it appears in the sequence, even though the relationship between the tokens is unchanged.
  • Attention has to learn position-specific patterns rather than distance-based ones, which transfer less readily across positions.

Core Concept

Instead of encoding PE(pos), we encode the offset between positions:

For attention from position i to position j, encode offset = i - j

Or equivalently, some formulations use j - i; only the sign convention differs.
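
As a concrete illustration (a minimal sketch, not tied to any particular model), the snippet below builds the full matrix of offsets i - j for a short sequence; each attention score eᵢⱼ can then be conditioned on offset[i, j]:

```python
import torch

seq_len = 5
positions = torch.arange(seq_len)

# offset[i, j] = i - j: how far query position i is from key position j.
offset = positions.unsqueeze(1) - positions.unsqueeze(0)
print(offset)
# tensor([[ 0, -1, -2, -3, -4],
#         [ 1,  0, -1, -2, -3],
#         [ 2,  1,  0, -1, -2],
#         [ 3,  2,  1,  0, -1],
#         [ 4,  3,  2,  1,  0]])
```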

Shaw et al. Formulation

The original relative positional encoding for self-attention, from Shaw et al. (2018), adds a learned offset embedding to the keys:

eᵢⱼ = qᵢᵀ (kⱼ + w_{i-j}) / √d

where w_{i-j} is a learned embedding for the (clipped) offset i - j. (Shaw et al. also add a second set of offset embeddings on the value side, which later work often drops.)

Transformer-XL (Dai et al., 2019) expanded this idea into four terms (there, w_{i-j} is derived from sinusoidal encodings rather than a learned table):

eᵢⱼ = qᵢᵀ kⱼ + qᵢᵀ w_{i-j} + uᵀ kⱼ + vᵀ w_{i-j}

The terms are:

  • qᵢᵀ kⱼ: content-to-content attention (the usual dot product)
  • qᵢᵀ w_{i-j}: content-to-position attention
  • uᵀ kⱼ: a global content bias (u is a learned vector shared across query positions)
  • vᵀ w_{i-j}: a global position bias (v is a learned vector shared across query positions)
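
The sketch below shows one way this could look for a single attention head, assuming key-side offset embeddings only and a clipping distance max_rel; the function and parameter names (shaw_relative_attention, rel_key_emb) are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def shaw_relative_attention(q, k, v, rel_key_emb, max_rel):
    """q, k, v: (seq_len, d).  rel_key_emb: (2*max_rel + 1, d), one learned
    embedding per clipped offset i - j in [-max_rel, max_rel]."""
    seq_len, d = q.shape
    pos = torch.arange(seq_len)
    # Clip offsets and shift them to table indices 0 .. 2*max_rel.
    idx = torch.clamp(pos[:, None] - pos[None, :], -max_rel, max_rel) + max_rel
    w = rel_key_emb[idx]                          # (seq_len, seq_len, d)

    # e_ij = (q_i . k_j + q_i . w_{i-j}) / sqrt(d)
    content = q @ k.T                             # (seq_len, seq_len)
    position = torch.einsum("id,ijd->ij", q, w)   # (seq_len, seq_len)
    scores = (content + position) / d ** 0.5

    return F.softmax(scores, dim=-1) @ v

# Toy usage with random tensors.
L, d, max_rel = 6, 8, 4
out = shaw_relative_attention(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d),
                              torch.randn(2 * max_rel + 1, d), max_rel)
print(out.shape)  # torch.Size([6, 8])
```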

Simplified Relative Position Bias

Modern implementations often use a simpler approach:

eᵢⱼ = (qᵢ · kⱼ) / √d + b_{i-j}

where b_{i-j} is a learned scalar bias for the offset (i - j), added directly to the attention logit. This is essentially the scheme used in T5, which groups offsets into buckets.
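
A minimal sketch of this additive bias, assuming one learned scalar per possible offset and no clipping yet (clipping is covered in the next subsection); the names are illustrative:

```python
import torch
import torch.nn.functional as F

def attention_with_rel_bias(q, k, v, rel_bias):
    """rel_bias: (2*seq_len - 1,) learned scalars, one per possible offset
    i - j in [-(seq_len - 1), seq_len - 1]."""
    seq_len, d = q.shape
    pos = torch.arange(seq_len)
    # Shift offset i - j into index 0 .. 2*seq_len - 2 for the bias table.
    idx = (pos[:, None] - pos[None, :]) + (seq_len - 1)
    b = rel_bias[idx]                             # (seq_len, seq_len)

    # e_ij = (q_i . k_j) / sqrt(d) + b_{i-j}
    scores = (q @ k.T) / d ** 0.5 + b
    return F.softmax(scores, dim=-1) @ v

L, d = 6, 8
out = attention_with_rel_bias(torch.randn(L, d), torch.randn(L, d),
                              torch.randn(L, d), torch.randn(2 * L - 1))
print(out.shape)  # torch.Size([6, 8])
```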

Clipping Distance

To limit the number of learned parameters (one embedding or bias per distinct offset), offsets are clipped to a maximum distance k:

offset_clipped = clamp(i - j, -k, k)

All offsets beyond ±k then share the same embedding or bias.

This works well in practice because, for very long distances, the exact offset matters less than simply "far away."
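
A small sketch of the clipping step, assuming k = 3, showing how every offset maps into a table of only 2k + 1 entries:

```python
import torch

k = 3                                   # maximum relative distance kept distinct
seq_len = 8
pos = torch.arange(seq_len)

offset = pos[:, None] - pos[None, :]    # i - j, ranges over -(seq_len-1) .. seq_len-1
clipped = torch.clamp(offset, -k, k)    # everything beyond +/-k collapses to +/-k
index = clipped + k                     # shift to 0 .. 2k for table lookup

# Only 2k + 1 = 7 embeddings/biases are needed, regardless of sequence length.
print(offset[0])   # tensor([ 0, -1, -2, -3, -4, -5, -6, -7])
print(clipped[0])  # tensor([ 0, -1, -2, -3, -3, -3, -3, -3])
print(index[0])    # tensor([3, 2, 1, 0, 0, 0, 0, 0])
```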

Comparison

| Aspect         | Absolute PE                     | Relative PE                 |
|----------------|---------------------------------|-----------------------------|
| Encodes        | Position number                 | Distance between positions  |
| Attention      | Position-specific               | Offset-based                |
| Generalization | May not extrapolate             | Better generalization       |
| Used in        | BERT, GPT, original Transformer | T5, DeBERTa, XLNet          |

Advantages of Relative

  • Generalizes better to sequence lengths not seen in training, since the same offsets recur regardless of where tokens sit.
  • Encodes what usually matters, the distance between tokens, directly in the attention score.
  • Translation-invariant: shifting a whole phrase within the sequence does not change its internal position signals.

Disadvantages

  • More complex to implement than simply adding a position vector to the input embeddings.
  • Adds parameters and computation inside every attention layer for the per-offset embeddings or biases.
  • Harder to combine with some optimized attention kernels and caching schemes that assume plain dot-product scores.

Models Using Relative Position

  • Transformer-XL / XLNet: relative encodings with global content and position biases (the four-term formulation above)
  • T5: simplified learned relative position biases, grouped into buckets and shared across layers
  • DeBERTa: disentangled attention over content and relative position

Test Your Understanding

Question 1: What does relative positional encoding encode?

  • A) Absolute position numbers
  • B) Distance between positions (offset)
  • C) Token content
  • D) Layer number

Question 2: What is the key insight of relative vs absolute position?

  • A) Absolute position matters more
  • B) Distance between words often more important than their exact positions
  • C) Both are identical
  • D) Relative requires more parameters

Question 3: Why do we clip relative positions?

  • A) To increase accuracy
  • B) To limit memory usage; far distances treated similarly
  • C) To make training faster
  • D) To prevent negative values

Question 4: In relative position, for attention from i to j, what offset is used?

  • A) i + j
  • B) i - j or j - i
  • C) max(i, j)
  • D) min(i, j)

Question 5: Which model uses relative positional encoding?

  • A) Original BERT
  • B) GPT-1
  • C) T5
  • D) Original Transformer

Question 6: In the formula eᵢⱼ = (qᵢ · kⱼ)/√d + b_{i-j}, what is b_{i-j}?

  • A) Learned bias for offset (i-j)
  • B) Query vector
  • C) Key vector
  • D) Value vector

Question 7: Why might relative position generalize better?

  • A) Uses sine/cosine functions
  • B) Focuses on relationships that are consistent across different sequence lengths
  • C) Has more parameters
  • D) Uses absolute positions

Question 8: Which paper introduced relative positional encoding for self-attention?

  • A) "Attention is All You Need"
  • B) "BERT"
  • C) "Self-Attention with Relative Position Representations" (Shaw et al.)
  • D) "GPT"