17. ALiBi Attention Bias

Introduction

ALiBi (Attention with Linear Biases) is a positional encoding method introduced by Press et al. (2021) in the paper "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation". It encodes positional information by adding a linear bias to attention scores based on the distance between positions, without using any learned positional embeddings.

Core Concept

ALiBi adds a penalty to attention scores based on how far apart positions are:

eᵢⱼ = (qᵢ · kⱼ)/√d − m·|i − j|

where m is a head-specific scalar called the slope

The bias is subtracted (not added), so distant positions get lower attention scores by default.
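
Below is a minimal single-head sketch of this computation in NumPy; the function name alibi_scores and its signature are illustrative, not taken from the paper's released code.

    import numpy as np

    def alibi_scores(q, k, slope):
        """Scaled dot-product scores with ALiBi's distance penalty.

        q, k: (seq_len, d) query/key matrices for a single head.
        slope: the head-specific scalar m.
        """
        seq_len, d = q.shape
        scores = q @ k.T / np.sqrt(d)               # (q_i . k_j) / sqrt(d)
        pos = np.arange(seq_len)
        dist = np.abs(pos[:, None] - pos[None, :])  # |i - j|
        return scores - slope * dist                # subtract m * |i - j|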

Head-Specific Slopes

Each attention head gets a different slope value m, following a geometric sequence:

m = 2^(-8n/h) for head n = 1, …, h (h total heads)

Example for 8 heads (h = 8):
Head 1: m = 2^(-8·1/8) = 2^(-1) = 0.5
Head 2: m = 2^(-8·2/8) = 2^(-2) = 0.25
Head 3: m = 2^(-8·3/8) = 2^(-3) = 0.125
...
Head 8: m = 2^(-8·8/8) = 2^(-8) ≈ 0.0039
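
A short sketch of this schedule, assuming the number of heads h is a power of two (the case the paper handles directly; other head counts use a slightly different interpolated schedule). The name get_alibi_slopes is illustrative:

    def get_alibi_slopes(h):
        """Geometric slope sequence m_n = 2^(-8n/h) for n = 1..h."""
        return [2 ** (-8 * n / h) for n in range(1, h + 1)]

    print(get_alibi_slopes(8))
    # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]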

Why Linear and Not Sinusoidal?

A linear penalty is monotone in distance and extends naturally to positions never seen during training: the bias at distance 2,048 is simply m·2048, whether or not the model trained on sequences that long. Sinusoidal or learned embeddings, by contrast, produce out-of-distribution inputs beyond the training length, which is a key reason standard transformers extrapolate poorly.

Matrix Form

ALiBi distance matrix D, with Dᵢⱼ = |i − j|:

[ 0  1  2  3 … ]
[ 1  0  1  2 … ]
[ 2  1  0  1 … ]
[ 3  2  1  0 … ]

Each entry is multiplied by the head-specific slope m and subtracted from the attention scores (equivalently, −m·D is added before the softmax). In a causal decoder only the lower triangle is ever used, since future positions are masked.
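
A sketch that assembles the full per-head bias tensor from this distance matrix; shapes and the name alibi_bias are illustrative. Adding the returned tensor to the raw attention scores is equivalent to subtracting m·|i − j| for each head:

    import numpy as np

    def alibi_bias(seq_len, slopes):
        """Return a (num_heads, seq_len, seq_len) additive bias tensor."""
        pos = np.arange(seq_len)
        dist = np.abs(pos[:, None] - pos[None, :])  # |i - j|, (seq_len, seq_len)
        m = np.asarray(slopes)[:, None, None]       # (num_heads, 1, 1)
        return -m * dist                            # broadcasts to (h, n, n)

    bias = alibi_bias(4, [0.5, 0.25])               # 2 heads, 4 positions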

Advantages

  • Excellent extrapolation to sequences longer than those seen in training
  • No learned positional parameters: only h fixed slope values
  • Low computational overhead: one bias addition per attention score

Used In

  • BLOOM (BigScience, 176B)
  • MPT (MosaicML)
  • BloombergGPT

Comparison

Aspect          ALiBi                      RoPE       Learned PE
Method          Linear bias on attention   Rotation   Added embedding
Parameters      Head slopes (h values)     None       max_len × d
Extrapolation   Excellent                  Good       Limited
Complexity      Low                        Medium     Low

Test Your Understanding

Question 1: What does ALiBi stand for?

  • A) Attention with Linear Inferences
  • B) Attention with Linear Biases
  • C) Adaptive Linear Attention
  • D) Algebraic Linear Bias

Question 2: In the formula eᵢⱼ = (qᵢ·kⱼ)/√d − m·|i−j|, what does m represent?

  • A) Learning rate
  • B) Head-specific slope scalar
  • C) Dimension
  • D) Token embedding

Question 3: How does ALiBi encode positional information?

  • A) Using sine/cosine
  • B) Adding linear bias based on distance
  • C) Learning embeddings
  • D) Rotation of vectors

Question 4: Why is a subtraction (not addition) used in the bias?

  • A) To make scores smaller
  • B) So distant positions get lower attention scores
  • C) To match softmax input
  • D) To increase gradients

Question 5: What is the formula for head-specific slope m?

  • A) m = 1/n
  • B) m = 2^(-8n/h)
  • C) m = log(2)/n
  • D) m = n/h

Question 6: What is a key advantage of ALiBi?

  • A) Uses more parameters
  • B) Excellent extrapolation to longer sequences
  • C) Requires sine/cosine computation
  • D) Only works with short sequences

Question 7: For 8 heads, what is the slope of the last head (n=8)?

  • A) 2^(-1) = 0.5
  • B) 2^(-8) = 0.0039
  • C) 2^(-16) = 0.000015
  • D) 2^(0) = 1

Question 8: Which model uses ALiBi?

  • A) BERT
  • B) GPT
  • C) BLOOM
  • D) T5