17. ALiBi Attention Bias

Introduction

ALiBi (Attention with Linear Biases) is a positional encoding method introduced by Press et al. (2021) in the paper "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation". It encodes positional information by adding a linear bias to attention scores based on the distance between positions, without using any learned positional embeddings.

Core Concept

ALiBi adds a penalty to attention scores based on how far apart positions are:

eᵢⱼ = (qᵢ · kⱼ)/√d − m·|i − j|

where m is a head-specific scalar called the slope

The bias is subtracted (not added), so distant positions get lower attention scores by default.
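
Below is a minimal single-head sketch of this computation in NumPy; the function name alibi_scores and its signature are illustrative, not taken from the paper's released code.

    import numpy as np

    def alibi_scores(q, k, slope):
        """Scaled dot-product scores with ALiBi's distance penalty.

        q, k: (seq_len, d) query/key matrices for a single head.
        slope: the head-specific scalar m.
        """
        seq_len, d = q.shape
        scores = q @ k.T / np.sqrt(d)               # (q_i . k_j) / sqrt(d)
        pos = np.arange(seq_len)
        dist = np.abs(pos[:, None] - pos[None, :])  # |i - j|
        return scores - slope * dist                # subtract m * |i - j|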

Head-Specific Slopes

Each attention head gets a different slope value m, following a geometric sequence:

m = 2^(-8n/h) for head n = 1, …, h (h total heads)

Example for 8 heads (h = 8):
Head 1: m = 2^(-8·1/8) = 2^(-1) = 0.5
Head 2: m = 2^(-8·2/8) = 2^(-2) = 0.25
Head 3: m = 2^(-8·3/8) = 2^(-3) = 0.125
...
Head 8: m = 2^(-8·8/8) = 2^(-8) ≈ 0.0039
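
A short sketch of this schedule, assuming the number of heads h is a power of two (the case the paper handles directly; other head counts use a slightly different interpolated schedule). The name get_alibi_slopes is illustrative:

    def get_alibi_slopes(h):
        """Geometric slope sequence m_n = 2^(-8n/h) for n = 1..h."""
        return [2 ** (-8 * n / h) for n in range(1, h + 1)]

    print(get_alibi_slopes(8))
    # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]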

Why Linear and Not Sinusoidal?

A linear penalty is monotone in distance and extends naturally to positions never seen during training: the bias at distance 2,048 is simply m·2048, whether or not the model trained on sequences that long. Sinusoidal or learned embeddings, by contrast, produce out-of-distribution inputs beyond the training length, which is a key reason standard transformers extrapolate poorly.

Matrix Form

ALiBi distance matrix D, with Dᵢⱼ = |i − j|:

[ 0  1  2  3 … ]
[ 1  0  1  2 … ]
[ 2  1  0  1 … ]
[ 3  2  1  0 … ]

Each entry is multiplied by the head-specific slope m and subtracted from the attention scores (equivalently, −m·D is added before the softmax). In a causal decoder only the lower triangle is ever used, since future positions are masked.
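
A sketch that assembles the full per-head bias tensor from this distance matrix; shapes and the name alibi_bias are illustrative. Adding the returned tensor to the raw attention scores is equivalent to subtracting m·|i − j| for each head:

    import numpy as np

    def alibi_bias(seq_len, slopes):
        """Return a (num_heads, seq_len, seq_len) additive bias tensor."""
        pos = np.arange(seq_len)
        dist = np.abs(pos[:, None] - pos[None, :])  # |i - j|, (seq_len, seq_len)
        m = np.asarray(slopes)[:, None, None]       # (num_heads, 1, 1)
        return -m * dist                            # broadcasts to (h, n, n)

    bias = alibi_bias(4, [0.5, 0.25])               # 2 heads, 4 positions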

Advantages

  • Excellent extrapolation to sequences longer than those seen in training
  • No learned positional parameters: only h fixed slope values
  • Low computational overhead: one bias addition per attention score

Used In

  • BLOOM (BigScience, 176B)
  • MPT (MosaicML)
  • BloombergGPT

Comparison

Aspect          ALiBi                      RoPE       Learned PE
Method          Linear bias on attention   Rotation   Added embedding
Parameters      Head slopes (h values)     None       max_len × d
Extrapolation   Excellent                  Good       Limited
Complexity      Low                        Medium     Low

Test Your Understanding

Question 1: What does ALiBi stand for?

  • A) Attention with Linear Inferences
  • B) Attention with Linear Biases
  • C) Adaptive Linear Attention
  • D) Algebraic Linear Bias

Question 2: In the formula eᵢⱼ = (qᵢ·kⱼ)/√d − m·|i−j|, what does m represent?

  • A) Learning rate
  • B) Head-specific slope scalar
  • C) Dimension
  • D) Token embedding

Question 3: How does ALiBi encode positional information?

  • A) Using sine/cosine
  • B) Adding linear bias based on distance
  • C) Learning embeddings
  • D) Rotation of vectors

Question 4: Why is a subtraction (not addition) used in the bias?

  • A) To make scores smaller
  • B) So distant positions get lower attention scores
  • C) To match softmax input
  • D) To increase gradients

Question 5: What is the formula for head-specific slope m?

  • A) m = 1/n
  • B) m = 2^(-8n/h)
  • C) m = log(2)/n
  • D) m = n/h

Question 6: What is a key advantage of ALiBi?

  • A) Uses more parameters
  • B) Excellent extrapolation to longer sequences
  • C) Requires sine/cosine computation
  • D) Only works with short sequences

Question 7: For 8 heads, what is the slope of the last head (n=8)?

  • A) 2^(-1) = 0.5
  • B) 2^(-8) = 0.0039
  • C) 2^(-16) = 0.000015
  • D) 2^(0) = 1

Question 8: Which model uses ALiBi?

  • A) BERT
  • B) GPT
  • C) BLOOM
  • D) T5