Introduction
ALiBi (Attention with Linear Biases) is a positional encoding method introduced by Press et al. (2021) in the paper "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation". It encodes positional information by adding a linear bias to attention scores based on the distance between positions, without using any learned positional embeddings.
Core Concept
ALiBi adds a penalty to attention scores based on how far apart positions are:
eᵢⱼ = (qᵢ · kⱼ) / √d - m · |i - j|
where m is a head-specific slope (a fixed scalar, not a learned parameter)
The bias is subtracted (not added), so distant positions get lower attention scores by default.
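A minimal sketch of this score computation for a single head (illustrative NumPy, not the paper's code; the function name `alibi_scores` is a placeholder):

```python
import numpy as np

def alibi_scores(q, k, m):
    """ALiBi-biased attention scores for one head, pre-softmax.
    q, k: (seq_len, d) query and key matrices; m: this head's slope."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # (q_i . k_j) / sqrt(d)
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])   # |i - j|
    return scores - m * distance                     # subtract the linear penalty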
Head-Specific Slopes
Each attention head gets a different slope value m, following a geometric sequence:
m = 2^(-8n/h) for head n = 1, …, h (h total heads)
Example for 8 heads:
Head 1: m = 2^(-8·1/8) = 2^(-1) = 0.5
Head 2: m = 2^(-8·2/8) = 2^(-2) = 0.25
Head 3: m = 2^(-8·3/8) = 2^(-3) = 0.125
...
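A short sketch of this slope schedule, assuming the 1-indexed formula above and a power-of-two head count (the paper handles other head counts slightly differently):

```python
def alibi_slopes(num_heads):
    """Geometric slope schedule: head n gets 2^(-8n/num_heads)."""
    return [2 ** (-8 * n / num_heads) for n in range(1, num_heads + 1)]

print(alibi_slopes(8))  # [0.5, 0.25, 0.125, 0.0625, ..., 0.00390625]
```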
Why Linear and Not Sinusoidal?
- Simplicity: No sine/cosine computations needed
- Interpretable: Linear penalty is intuitive (farther = less attention)
- Extrapolation: Linear bias naturally extends to longer sequences
Matrix Form
ALiBi distance-based bias matrix (before scaling by the head-specific slope):
[  0, -1, -2, -3, ... ]
[ -1,  0, -1, -2, ... ]
[ -2, -1,  0, -1, ... ]
[ -3, -2, -1,  0, ... ]
Multiply each entry by the head-specific slope m and add the result to the attention scores; this is equivalent to subtracting m · |i - j|.
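One way this matrix might be materialized in code (illustrative sketch using the non-causal |i - j| form above; `alibi_bias` is a placeholder name and `alibi_slopes` refers to the earlier sketch). Because the bias depends only on positions, it can be precomputed once and reused across the batch:

```python
import numpy as np

def alibi_bias(seq_len, slopes):
    """Per-head ALiBi bias tensor of shape (num_heads, seq_len, seq_len)."""
    pos = np.arange(seq_len)
    distance = -np.abs(pos[:, None] - pos[None, :])      # 0 on the diagonal, more negative with distance
    return np.asarray(slopes)[:, None, None] * distance  # scale by each head's slope

# Usage: add to the raw attention scores (shape: heads x seq_len x seq_len)
# before the softmax, e.g.
#   scores = scores + alibi_bias(seq_len, alibi_slopes(num_heads))
```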
Advantages
- No positional embeddings: Saves parameters and memory
- Extrapolation: Handles sequences longer than training length
- Simplicity: Just add a bias, no complex transformations
- Effective: Works well in practice (used in BloombergGPT)
Used In
- BLOOM: BigScience model uses ALiBi
- BloombergGPT: Financial language model
- Various LLaMA-style variants: some community implementations replace the default rotary embeddings with ALiBi
Comparison
| Aspect | ALiBi | RoPE | Learned PE |
|---|---|---|---|
| Method | Linear bias on attention | Rotation | Added embedding |
| Parameters | None learned (h fixed slopes) | None | max_len × d learned embeddings |
| Extrapolation | Excellent | Good | Limited |
| Complexity | Low | Medium | Low |