Introduction
Linear attention is an approach that reduces the O(n²) complexity of standard attention to O(n) by using kernel feature maps to approximate the softmax operation. Instead of computing the full n×n attention matrix, linear attention computes the output in a single pass by exploiting the associativity of matrix multiplication.
The Problem with Standard Attention
Standard: Attention = softmax(QKᵀ)V
QKᵀ is O(n²) - the bottleneck
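For concreteness, here is a minimal NumPy sketch of standard attention (the function names and the √d scaling are the usual convention, not taken from the formula above); the (n, n) score matrix is exactly the term the rest of this section avoids.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # Q, K, V: (n, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n, n) score matrix: the O(n²) term
    return softmax(scores, axis=-1) @ V      # (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
print(standard_attention(Q, K, V).shape)  # (1024, 64)
```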
Linear Attention Reformulation
softmax(QKᵀ)V ≈ φ(Q)(φ(K)ᵀV)
Where φ is a feature map
This uses the associativity of matrix multiplication:
(φ(Q)φ(K)ᵀ)V = φ(Q)(φ(K)ᵀV)
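A quick numeric check of this identity (a throwaway NumPy sketch; sizes and names are illustrative): both groupings give the same result, but only the left one materializes an n×n matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
phi_Q = rng.random((n, d))   # stand-ins for φ(Q) and φ(K); any matrices work here
phi_K = rng.random((n, d))
V = rng.standard_normal((n, d))

left = (phi_Q @ phi_K.T) @ V   # builds an (n, n) intermediate
right = phi_Q @ (phi_K.T @ V)  # builds only a (d, d) intermediate
print(np.allclose(left, right))  # True
```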
How It Works
Feature Map φ
Apply a feature map to the queries and keys before computing attention:
φ(x) = ELU(x) + 1 (commonly used)
Computes: Attention(Q,K,V) = φ(Q)(φ(K)ᵀV) / (φ(Q)(φ(K)ᵀ1)), where 1 is the all-ones vector and the division is applied row-wise (it replaces the softmax denominator).
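A minimal sketch of this formula with the ELU(x) + 1 feature map (assuming single-head, non-causal attention; all names and shapes are illustrative):

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: strictly positive, so the normalizer below cannot be zero
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K, V: (n, d)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                  # (d, d): φ(K)ᵀ V
    Z = Qp @ Kp.sum(axis=0)        # (n,):  φ(Q) (φ(K)ᵀ 1), the row-wise normalizer
    return (Qp @ KV) / Z[:, None]  # (n, d)

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
print(linear_attention(Q, K, V).shape)  # (4096, 64)
```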
Computation Steps
1. Compute φ(K)ᵀV: O(n·d²), a single (d×n)·(n×d) matrix multiply producing a small d×d matrix
2. Compute φ(Q)(φ(K)ᵀV): O(n·d²)
3. Total: O(n·d²), which is linear in the sequence length n (d is a fixed constant)
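As a quick sanity check of these counts, plugging in the table's numbers (leading terms only, constant factors ignored):

```python
# Leading-term operation counts for n = 4096, d = 64.
n, d = 4096, 64

standard = n * n * d   # QKᵀ score matrix: n² · d
linear = n * d * d     # each of steps 1 and 2 above: n · d²

print(f"standard n²·d: {standard:,}")  # 1,073,741,824  (~1.07B)
print(f"linear   n·d²: {linear:,}")    # 16,777,216     (~16.8M)
print(f"ratio (n/d):   {standard // linear}x")  # 64x
```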
Complexity Comparison
| Method | Complexity | For d=64, n=4096 |
|---|---|---|
| Standard Attention | O(n²·d) | ≈1.07B operations |
| Linear Attention | O(n·d²) | ≈16.8M operations |
Limitations
- Approximation: The kernelized attention only approximates (or replaces) the softmax, so its outputs are not identical to standard attention
- Less expressive: May not capture all patterns standard attention does
- Requires careful kernel selection: Wrong kernel hurts performance
Variants
- Performer: Uses random feature maps (FAVOR+) for an unbiased estimate of the softmax kernel
- Linear Transformers: Uses the ELU(x) + 1 feature map described above
- CosFormer: Uses cosine-based kernels
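As a rough illustration of the Performer idea (a heavily simplified sketch of positive random features in the FAVOR+ spirit, not the library's implementation; the feature count m, the seed, and the omitted query/key rescaling are my simplifications): random Gaussian projections give an unbiased but noisy estimate of exp(q·k), which can then be plugged into the linear form above.

```python
import numpy as np

def positive_random_features(x, W):
    # x: (n, d), W: (m, d) with rows drawn from N(0, I)
    # E[φ(q)·φ(k)] = exp(q·k), the (unnormalized) softmax kernel
    m = W.shape[0]
    return np.exp(x @ W.T - 0.5 * (x ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 16, 4096
W = rng.standard_normal((m, d))
q = 0.3 * rng.standard_normal((1, d))
k = 0.3 * rng.standard_normal((1, d))

estimate = float(positive_random_features(q, W) @ positive_random_features(k, W).T)
exact = float(np.exp(q @ k.T))
print(estimate, exact)  # unbiased but noisy estimate vs. the exact kernel value
```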