Introduction
Rotary Position Embedding (RoPE) is a position encoding method that encodes position information by rotating the key and query vectors in the attention mechanism. Unlike adding positional embeddings to inputs, RoPE integrates positional information directly into the attention computation through rotation matrices.
Core Concept: Rotation in 2D
RoPE builds on a simple fact: rotating a 2D vector [x, y] by an angle θ that depends on the token's position encodes that position in the vector itself:
[x', y'] = [x·cos(θ) - y·sin(θ), x·sin(θ) + y·cos(θ)]
This is equivalent to multiplying by the 2×2 rotation matrix R(θ).
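As a quick illustration, here is the same rotation written as a matrix-vector product. This is a minimal sketch using NumPy; the function name rotate_2d and the example vector are illustrative, not from any particular library.

```python
import numpy as np

def rotate_2d(v: np.ndarray, theta: float) -> np.ndarray:
    """Rotate a 2D vector v by angle theta (radians) via the matrix R(theta)."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ v

# Rotating the unit x-axis vector by 90 degrees gives (approximately) the y-axis vector.
print(rotate_2d(np.array([1.0, 0.0]), np.pi / 2))  # ~[0., 1.]
```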
RoPE for Attention
In RoPE, we rotate query and key vectors by amounts proportional to their positions:
qₙ (rotated) = R(θₙ) · qₙ
kₘ (rotated) = R(θₘ) · kₘ
Attention(qₙ, kₘ) ∝ (R(θₙ) qₙ)ᵀ (R(θₘ) kₘ) = qₙᵀ R(θₙ)ᵀ R(θₘ) kₘ
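A small sketch of this computation for a single 2D pair (NumPy; the rot helper, the frequency 0.1, the positions, and the example vectors are all illustrative assumptions, not values from the text):

```python
import numpy as np

def rot(theta: float) -> np.ndarray:
    """2x2 rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

freq = 0.1                       # illustrative per-pair frequency
n, m = 7, 3                      # query position n, key position m
q = np.array([0.5, 1.2])
k = np.array([0.9, -0.4])

# (R(θₙ) q)ᵀ (R(θₘ) k): the attention logit contributed by this 2D pair
score = (rot(n * freq) @ q) @ (rot(m * freq) @ k)
print(score)
```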
Properties of Rotation
Rotation matrices compose: R(θₙ)ᵀ R(θₘ) = R(θₘ − θₙ), so the inner product between the rotated vectors depends only on the difference between the two rotation angles:
(R(θₙ) qₙ) · (R(θₘ) kₘ) = qₙ · (R(θₙ)ᵀ R(θₘ) kₘ) = qₙ · R(θₘ − θₙ) kₘ
Since each angle is proportional to its token's position, the difference θₘ − θₙ is proportional to m − n: the attention score depends only on the relative position!
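A quick numerical check of this property (again a sketch with illustrative values; the rot and score helpers are assumed, not from the text): shifting both positions by the same offset leaves the score unchanged.

```python
import numpy as np

def rot(theta: float) -> np.ndarray:
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

freq = 0.1
q = np.array([0.5, 1.2])
k = np.array([0.9, -0.4])

def score(n: int, m: int) -> float:
    """Dot product of q rotated to position n and k rotated to position m."""
    return (rot(n * freq) @ q) @ (rot(m * freq) @ k)

# Same relative offset n - m = 4 at different absolute positions: scores match.
print(np.isclose(score(7, 3), score(107, 103)))   # True
```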
Multi-Dimensional Implementation
For embedding dimension d, we pair dimensions and rotate each pair:
θₙ,₂ᵢ = n · θᵢ where θᵢ = base^(−2i/d) (base is typically 10000)
RoPE(qₙ, 2i) = qₙ,₂ᵢ · cos(θₙ,₂ᵢ) − qₙ,₂ᵢ₊₁ · sin(θₙ,₂ᵢ)
RoPE(qₙ, 2i+1) = qₙ,₂ᵢ · sin(θₙ,₂ᵢ) + qₙ,₂ᵢ₊₁ · cos(θₙ,₂ᵢ)
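Putting the pairs together, here is a minimal sketch of applying RoPE to one d-dimensional vector (NumPy; the function name apply_rope is an assumption, and base = 10000 follows the common default noted above):

```python
import numpy as np

def apply_rope(x: np.ndarray, n: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs (2i, 2i+1) of x by n * base^(-2i/d)."""
    d = x.shape[-1]
    assert d % 2 == 0, "embedding dimension must be even"
    i = np.arange(d // 2)
    theta = n * base ** (-2.0 * i / d)        # angle per pair
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[0::2], x[1::2]          # paired dimensions
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin    # new 2i components
    out[1::2] = x_even * sin + x_odd * cos    # new 2i+1 components
    return out

q = np.random.randn(8)
k = np.random.randn(8)
# The score depends only on the relative position: (5 - 2) == (15 - 12).
print(np.isclose(apply_rope(q, 5) @ apply_rope(k, 2),
                 apply_rope(q, 15) @ apply_rope(k, 12)))   # True
```

In a real attention layer the same rotation is applied to both queries and keys before the dot product, and the cos/sin tables are typically precomputed once for all positions.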
Advantages of RoPE
- Direct relative position encoding: Inner product naturally encodes relative position
- No added parameters: Position is encoded through rotation, no positional embeddings needed
- Works with linear attention: Compatible with linear attention variants
- Long context: Tends to extrapolate to longer sequences better than learned absolute embeddings
- Efficient: Rotation is computationally cheap
Models Using RoPE
- Llama (Meta): Uses RoPE (RoPE was introduced in the RoFormer paper, and earlier open models such as GPT-J and GPT-NeoX adopted it before Llama)
- PaLM (Google): Uses RoPE
- ChatGLM: Uses RoPE
- Mistral: Uses RoPE
Comparison with Other Methods
| Aspect | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Method | Addition | Addition | Rotation |
| Parameters | None | max_len × d | None |
| Relative position | Indirect | No | Direct |
| Linear attention | No | No | Yes |