16. Rotary Position Embeddings (RoPE)

Introduction

Rotary Position Embedding (RoPE) is a position encoding method that injects position information by rotating the query and key vectors in the attention mechanism. Unlike methods that add positional embeddings to the input, RoPE integrates positional information directly into the attention computation through rotation matrices.

Core Concept: Rotation in 2D

RoPE builds on the fact that rotating a 2D vector by an angle proportional to a token's position is a simple way to encode that position:

For a 2D vector [x, y], rotate by angle θ:

[x', y'] = [x·cos(θ) - y·sin(θ), x·sin(θ) + y·cos(θ)]

This is equivalent to multiplying by the 2×2 rotation matrix R(θ).
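
To make the rotation concrete, here is a small NumPy sketch (names such as rotate_2d are illustrative, not from the text above):

    import numpy as np

    def rotate_2d(vec, theta):
        """Rotate a 2D vector [x, y] by angle theta (in radians)."""
        rotation = np.array([[np.cos(theta), -np.sin(theta)],
                             [np.sin(theta),  np.cos(theta)]])
        return rotation @ vec

    v = np.array([1.0, 0.0])
    print(rotate_2d(v, np.pi / 2))            # ~[0, 1]: a 90-degree rotation
    print(np.linalg.norm(rotate_2d(v, 0.7)))  # the norm is preserved (1.0)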

RoPE for Attention

In RoPE, we rotate query and key vectors by amounts proportional to their positions:

qₙ (rotated) = R(θₙ) · qₙ

kₘ (rotated) = R(θₘ) · kₘ

Attention(qₙ, kₘ) ∝ (R(θₙ) qₙ)ᵀ (R(θₘ) kₘ) = qₙᵀ R(θₙ)ᵀ R(θₘ) kₘ
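
As a minimal numeric sketch of these equations (hypothetical names, a single 2D pair, one rotation frequency), the query and key are each rotated by an angle proportional to their own position before the dot product:

    import numpy as np

    def rot(theta):
        """2x2 rotation matrix R(theta)."""
        return np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])

    q, k = np.array([0.3, 1.2]), np.array([-0.5, 0.8])
    n, m = 7, 3                  # token positions
    theta = 1.0                  # per-position angle for this single frequency

    q_rot = rot(n * theta) @ q   # rotate query by its position
    k_rot = rot(m * theta) @ k   # rotate key by its position
    print(q_rot @ k_rot)         # attention logit, before scaling and softmax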

Properties of Rotation

Rotation preserves inner products when both vectors are rotated by the same angle:

R(θₙ)qₙ · R(θₙ)kₘ = qₙ · kₘ (when rotated by same angle)

R(θₙ)qₙ · R(θₘ)kₘ = qₙ · (R(θₙ)ᵀ R(θₘ) kₘ) = qₙ · R(θₘ − θₙ) kₘ

The inner product between the rotated vectors depends only on the relative rotation (θₘ − θₙ), which corresponds to the relative position between the two tokens!
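
This relative-position property is easy to check numerically. The sketch below (again with illustrative names and a single rotation frequency) shifts both positions by the same offset and shows that the rotated inner product does not change:

    import numpy as np

    def rot(theta):
        return np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])

    q, k = np.array([0.3, 1.2]), np.array([-0.5, 0.8])
    theta = 0.5

    def score(n, m):
        """Dot product of rotated query (position n) and rotated key (position m)."""
        return (rot(n * theta) @ q) @ (rot(m * theta) @ k)

    print(score(7, 3))       # relative offset 4
    print(score(107, 103))   # same relative offset 4 -> same value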

Multi-Dimensional Implementation

For embedding dimension d, we pair dimensions and rotate each pair:

For dimensions (2i, 2i+1) in the vector:

θₙ,₂ᵢ = n · θᵢ where θᵢ = base^(−2i/d) (the base is typically 10000)

RoPE(qₙ, 2i)   = qₙ,₂ᵢ · cos(θₙ,₂ᵢ) − qₙ,₂ᵢ₊₁ · sin(θₙ,₂ᵢ)

RoPE(qₙ, 2i+1) = qₙ,₂ᵢ · sin(θₙ,₂ᵢ) + qₙ,₂ᵢ₊₁ · cos(θₙ,₂ᵢ)

Example for dimension d = 4 (2 pairs), at position n:

  • Pair 0 (dims 0, 1): rotate by n · base^0 = n
  • Pair 1 (dims 2, 3): rotate by n · base^(−2/d) = n · base^(−1/2)

Each pair has a different rotation frequency, so together the pairs cover both fast-varying and slow-varying position signals.
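
Putting the pairing and the per-pair frequencies together, the following NumPy sketch applies RoPE to one d-dimensional vector at position n (apply_rope is an illustrative name; base = 10000 is assumed here as the common default):

    import numpy as np

    def apply_rope(x, n, base=10000.0):
        """Rotate each (2i, 2i+1) pair of x by the angle n * base**(-2i/d)."""
        d = x.shape[-1]
        assert d % 2 == 0, "RoPE pairs dimensions, so d must be even"
        i = np.arange(d // 2)
        freqs = base ** (-2.0 * i / d)     # theta_i for each pair
        angles = n * freqs                 # theta_{n,2i} = n * theta_i
        cos, sin = np.cos(angles), np.sin(angles)
        x_even, x_odd = x[0::2], x[1::2]   # dimensions 2i and 2i+1
        out = np.empty_like(x)
        out[0::2] = x_even * cos - x_odd * sin
        out[1::2] = x_even * sin + x_odd * cos
        return out

    q = np.arange(4, dtype=np.float64)     # toy query with d = 4
    print(apply_rope(q, n=5))

Rotating a key vector at position m in the same way and taking the dot product with the rotated query reproduces the relative-position behavior shown earlier.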

Advantages of RoPE

  • No learned parameters: the rotation angles are fixed functions of position.
  • Direct relative-position dependence: attention scores depend only on the offset between tokens, as shown above.
  • Compatible with linear attention, since the rotation is applied to queries and keys rather than to the attention scores.
  • Handles long or varying sequence lengths more gracefully than learned absolute embeddings, which cannot represent positions beyond their training range.

Models Using RoPE

RoPE was introduced in the RoFormer paper and has since been adopted by many large language models, including GPT-J, GPT-NeoX, PaLM, and the Llama family.

Comparison with Other Methods

Aspect              Sinusoidal    Learned        RoPE
Method              Addition      Addition       Rotation
Parameters          None          max_len × d    None
Relative position   Indirect      No             Direct
Linear attention    No            No             Yes

Test Your Understanding

Question 1: What does RoPE stand for?

  • A) Relative Position Embedding
  • B) Rotary Position Embedding
  • C) Random Position Encoding
  • D) Recursive Position Embedding

Question 2: How does RoPE encode position information?

  • A) By addition to embeddings
  • B) By rotation of query/key vectors
  • C) By concatenation
  • D) By multiplication with mask

Question 3: What property does rotation preserve for same-position vectors?

  • A) Norm only
  • B) Inner products
  • C) Orthogonality
  • D) Nothing preserved

Question 4: In RoPE, what does the inner product between rotated q and k depend on?

  • A) Absolute positions
  • B) Relative position (θₙ - θₘ)
  • C) Sum of positions
  • D) Product of positions

Question 5: Which dimension pairing does RoPE use?

  • A) All dimensions together
  • B) Pairs (2i, 2i+1) for rotation
  • C) Adjacent dimensions only
  • D) Every third dimension

Question 6: What is a key advantage of RoPE for long contexts?

  • A) Uses less memory
  • B) Better extrapolation to longer sequences
  • C) Faster training
  • D) More parameters

Question 7: Which model family first popularized RoPE?

  • A) BERT
  • B) GPT
  • C) Llama
  • D) T5

Question 8: RoPE is compatible with which type of attention?

  • A) Standard softmax attention only
  • B) Linear attention
  • C) Hard attention only
  • D) Sparse attention only