16. Rotary Position Embeddings (RoPE)

Introduction

Rotary Position Embedding (RoPE) is a position encoding method that injects position information by rotating the query and key vectors in the attention mechanism. Unlike methods that add positional embeddings to the input, RoPE integrates positional information directly into the attention computation through rotation matrices.

Core Concept: Rotation in 2D

RoPE builds on the fact that rotating a 2D vector by an angle proportional to a token's position is a simple way to encode that position:

For a 2D vector [x, y], rotate by angle θ:

[x', y'] = [x·cos(θ) - y·sin(θ), x·sin(θ) + y·cos(θ)]

This is equivalent to multiplying by the 2×2 rotation matrix R(θ).
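
To make the rotation concrete, here is a small NumPy sketch (names such as rotate_2d are illustrative, not from the text above):

    import numpy as np

    def rotate_2d(vec, theta):
        """Rotate a 2D vector [x, y] by angle theta (in radians)."""
        rotation = np.array([[np.cos(theta), -np.sin(theta)],
                             [np.sin(theta),  np.cos(theta)]])
        return rotation @ vec

    v = np.array([1.0, 0.0])
    print(rotate_2d(v, np.pi / 2))            # ~[0, 1]: a 90-degree rotation
    print(np.linalg.norm(rotate_2d(v, 0.7)))  # the norm is preserved (1.0)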

RoPE for Attention

In RoPE, we rotate query and key vectors by amounts proportional to their positions:

qₙ (rotated) = R(θₙ) · qₙ

kₘ (rotated) = R(θₘ) · kₘ

Attention(qₙ, kₘ) ∝ (R(θₙ) qₙ)ᵀ (R(θₘ) kₘ) = qₙᵀ R(θₙ)ᵀ R(θₘ) kₘ
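
As a minimal numeric sketch of these equations (hypothetical names, a single 2D pair, one rotation frequency), the query and key are each rotated by an angle proportional to their own position before the dot product:

    import numpy as np

    def rot(theta):
        """2x2 rotation matrix R(theta)."""
        return np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])

    q, k = np.array([0.3, 1.2]), np.array([-0.5, 0.8])
    n, m = 7, 3                  # token positions
    theta = 1.0                  # per-position angle for this single frequency

    q_rot = rot(n * theta) @ q   # rotate query by its position
    k_rot = rot(m * theta) @ k   # rotate key by its position
    print(q_rot @ k_rot)         # attention logit, before scaling and softmax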

Properties of Rotation

Rotation preserves inner products when both vectors are rotated by the same angle:

R(θₙ)qₙ · R(θₙ)kₘ = qₙ · kₘ (when rotated by same angle)

R(θₙ)qₙ · R(θₘ)kₘ = qₙ · (R(θₙ)ᵀ R(θₘ) kₘ) = qₙ · R(θₘ − θₙ) kₘ

The inner product between the rotated vectors depends only on the relative rotation (θₘ − θₙ), which corresponds to the relative position between the two tokens!
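
This relative-position property is easy to check numerically. The sketch below (again with illustrative names and a single rotation frequency) shifts both positions by the same offset and shows that the rotated inner product does not change:

    import numpy as np

    def rot(theta):
        return np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])

    q, k = np.array([0.3, 1.2]), np.array([-0.5, 0.8])
    theta = 0.5

    def score(n, m):
        """Dot product of rotated query (position n) and rotated key (position m)."""
        return (rot(n * theta) @ q) @ (rot(m * theta) @ k)

    print(score(7, 3))       # relative offset 4
    print(score(107, 103))   # same relative offset 4 -> same value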

Multi-Dimensional Implementation

For embedding dimension d, we pair dimensions and rotate each pair:

For dimensions (2i, 2i+1) in the vector:

θₙ,₂ᵢ = n · θᵢ where θᵢ = base^(−2i/d) (the base is typically 10000)

RoPE(qₙ, 2i)   = qₙ,₂ᵢ · cos(θₙ,₂ᵢ) − qₙ,₂ᵢ₊₁ · sin(θₙ,₂ᵢ)

RoPE(qₙ, 2i+1) = qₙ,₂ᵢ · sin(θₙ,₂ᵢ) + qₙ,₂ᵢ₊₁ · cos(θₙ,₂ᵢ)

Example for dimension d = 4 (2 pairs), at position n:

  • Pair 0 (dims 0, 1): rotate by n · base^0 = n
  • Pair 1 (dims 2, 3): rotate by n · base^(−2/d) = n · base^(−1/2)

Each pair has a different rotation frequency, so together the pairs cover both fast-varying and slow-varying position signals.
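
Putting the pairing and the per-pair frequencies together, the following NumPy sketch applies RoPE to one d-dimensional vector at position n (apply_rope is an illustrative name; base = 10000 is assumed here as the common default):

    import numpy as np

    def apply_rope(x, n, base=10000.0):
        """Rotate each (2i, 2i+1) pair of x by the angle n * base**(-2i/d)."""
        d = x.shape[-1]
        assert d % 2 == 0, "RoPE pairs dimensions, so d must be even"
        i = np.arange(d // 2)
        freqs = base ** (-2.0 * i / d)     # theta_i for each pair
        angles = n * freqs                 # theta_{n,2i} = n * theta_i
        cos, sin = np.cos(angles), np.sin(angles)
        x_even, x_odd = x[0::2], x[1::2]   # dimensions 2i and 2i+1
        out = np.empty_like(x)
        out[0::2] = x_even * cos - x_odd * sin
        out[1::2] = x_even * sin + x_odd * cos
        return out

    q = np.arange(4, dtype=np.float64)     # toy query with d = 4
    print(apply_rope(q, n=5))

Rotating a key vector at position m in the same way and taking the dot product with the rotated query reproduces the relative-position behavior shown earlier.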

Advantages of RoPE

  • No learned parameters: the rotation angles are fixed functions of position.
  • Direct relative-position dependence: attention scores depend only on the offset between tokens, as shown above.
  • Compatible with linear attention, since the rotation is applied to queries and keys rather than to the attention scores.
  • Handles long or varying sequence lengths more gracefully than learned absolute embeddings, which cannot represent positions beyond their training range.

Models Using RoPE

RoPE was introduced in the RoFormer paper and has since been adopted by many large language models, including GPT-J, GPT-NeoX, PaLM, and the Llama family.

Comparison with Other Methods

Aspect              Sinusoidal    Learned        RoPE
Method              Addition      Addition       Rotation
Parameters          None          max_len × d    None
Relative position   Indirect      No             Direct
Linear attention    No            No             Yes

Test Your Understanding

Question 1: What does RoPE stand for?

  • A) Relative Position Embedding
  • B) Rotary Position Embedding
  • C) Random Position Encoding
  • D) Recursive Position Embedding

Question 2: How does RoPE encode position information?

  • A) By addition to embeddings
  • B) By rotation of query/key vectors
  • C) By concatenation
  • D) By multiplication with mask

Question 3: What property does rotation preserve for same-position vectors?

  • A) Norm only
  • B) Inner products
  • C) Orthogonality
  • D) Nothing preserved

Question 4: In RoPE, what does the inner product between rotated q and k depend on?

  • A) Absolute positions
  • B) Relative position (θₙ - θₘ)
  • C) Sum of positions
  • D) Product of positions

Question 5: Which dimension pairing does RoPE use?

  • A) All dimensions together
  • B) Pairs (2i, 2i+1) for rotation
  • C) Adjacent dimensions only
  • D) Every third dimension

Question 6: What is a key advantage of RoPE for long contexts?

  • A) Uses less memory
  • B) Better extrapolation to longer sequences
  • C) Faster training
  • D) More parameters

Question 7: Which model family first popularized RoPE?

  • A) BERT
  • B) GPT
  • C) Llama
  • D) T5

Question 8: RoPE is compatible with which type of attention?

  • A) Standard softmax attention only
  • B) Linear attention
  • C) Hard attention only
  • D) Sparse attention only