Introduction
Performer (Choromanski et al., 2020) is a Transformer variant that replaces standard attention with a generalized attention mechanism built on random feature maps, yielding an unbiased approximation of the softmax kernel. It achieves O(n) time and space complexity while providing theoretically grounded approximation guarantees for standard attention.
Problem with Standard Softmax
softmax(QKᵀ/√d) requires computing the full n×n score matrix
For n=4096: 16M entries to compute
For n=16384: 268M entries; memory explodes (the sketch below makes this concrete)
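A minimal NumPy sketch (function and variable names are illustrative, not from the paper) that materializes the full score matrix makes the quadratic cost concrete:

```python
import numpy as np

def standard_attention(Q, K, V):
    # Vanilla softmax attention: the (n, n) score matrix is the bottleneck.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

n = 4096
# The float32 score matrix alone:
print(f"{n * n:,} entries, {n * n * 4 / 2**20:.0f} MiB")  # 16,777,216 entries, 64 MiB
```

At n=16384 the same matrix holds 268M entries, about 1 GiB per attention head in float32, before counting gradients.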
Random Feature Map for Softmax
The softmax kernel can be approximated using random features:
exp(qᵀk/√d) ≈ φ(q)ᵀφ(k)
Where φ(x) is constructed from random projections
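One classical construction is trigonometric random Fourier features, which the Performer paper uses as a baseline: they estimate the Gaussian kernel exp(−‖q−k‖²/2), and a per-input rescaling turns that into the softmax kernel. A minimal sketch (names are mine; the 1/√d is assumed folded into q and k):

```python
import numpy as np

def trig_features(x, W):
    # Random Fourier features for the Gaussian kernel, rescaled by
    # exp(||x||²/2) so that phi(q)·phi(k) estimates exp(qᵀk).
    m = W.shape[1]
    proj = x @ W
    feats = np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(m)
    return np.exp(np.dot(x, x) / 2) * feats

rng = np.random.default_rng(0)
d, m = 16, 100_000
q, k = rng.normal(size=d) * 0.3, rng.normal(size=d) * 0.3
W = rng.normal(size=(d, m))            # random projection directions

print(trig_features(q, W) @ trig_features(k, W), np.exp(q @ k))
```

These features can take negative values, which is exactly what motivates the positive variant described next.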
The FAVOR+ Algorithm
Performer uses "FAVOR+" (Fast Attention Via positive Orthogonal Random Features):
φ(x) = exp(−‖x‖²/2) · [exp(w₁ᵀx), ..., exp(wₘᵀx)] / √m
Where wᵢ ~ N(0, I_d), drawn (block-)orthogonally to reduce variance. The classical trigonometric features above also give an unbiased estimate, but they can be negative, which destabilizes the normalization; FAVOR+'s strictly positive features avoid this.
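A minimal sketch of the positive map with a Monte Carlo check of the approximation (identifiers are my own; again 1/√d is folded into the inputs):

```python
import numpy as np

def positive_features(x, W):
    # FAVOR+-style positive features: exp(Wᵀx - ||x||²/2) / sqrt(m),
    # so every feature value is strictly positive.
    m = W.shape[1]
    return np.exp(x @ W - np.dot(x, x) / 2) / np.sqrt(m)

rng = np.random.default_rng(1)
d, m = 16, 100_000
q, k = rng.normal(size=d) * 0.3, rng.normal(size=d) * 0.3
W = rng.normal(size=(d, m))      # i.i.d. Gaussian; FAVOR+ draws them orthogonal

est = positive_features(q, W) @ positive_features(k, W)
print(est, np.exp(q @ k))        # the estimate tracks the exact kernel
```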
Algorithm Steps
1. Sample a random projection matrix W ∈ ℝ^{d×m} whose columns are (block-)orthogonal Gaussian vectors
2. Compute Φ(Q), Φ(K) ∈ ℝ^{n×m} by applying φ to each row of Q and K
3. Attention ≈ D⁻¹Φ(Q)(Φ(K)ᵀV), with D = diag(Φ(Q)(Φ(K)ᵀ1ₙ)); computing Φ(K)ᵀV and Φ(K)ᵀ1ₙ first means the n×n matrix is never formed
Complexity: O(n·m·d) time and O(n·m + m·d) space, where m ≪ n (see the end-to-end sketch below)
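Putting the steps together, a minimal end-to-end sketch (all names are mine; the 1/√d temperature is absorbed by rescaling Q and K by d^{−1/4}, and W is plain Gaussian here for brevity):

```python
import numpy as np

def favor_attention(Q, K, V, W):
    # Step 2: positive feature maps, applied row-wise.
    m = W.shape[1]
    Phi_Q = np.exp(Q @ W - (Q**2).sum(-1, keepdims=True) / 2) / np.sqrt(m)
    Phi_K = np.exp(K @ W - (K**2).sum(-1, keepdims=True) / 2) / np.sqrt(m)
    # Step 3: associate as Phi(Q) @ (Phi(K)ᵀ V) so no n×n matrix is formed.
    KV = Phi_K.T @ V                          # (m, d)
    normalizer = Phi_Q @ Phi_K.sum(axis=0)    # (n,), the row sums D
    return (Phi_Q @ KV) / normalizer[:, None]

rng = np.random.default_rng(0)
n, d, m = 4096, 64, 256
Q, K, V = (rng.normal(size=(n, d)) / d**0.25 for _ in range(3))
W = rng.normal(size=(d, m))                   # step 1
print(favor_attention(Q, K, V, W).shape)      # (4096, 64), in O(n·m·d)
```

The key design choice is the multiplication order: Φ(K)ᵀV is only m×d, so nothing of size n×n is ever materialized.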
Theoretical Guarantees
- Unbiased: E[φ(q)ᵀφ(k)] = exp(qᵀk/√d) once q and k are rescaled by d^{−1/4} to absorb the temperature
- Low variance: orthogonal random features provably reduce the estimator's variance, concentrating it around its expectation (see the sketch after this list)
- Positive: strictly positive features keep the estimated attention weights non-negative, avoiding the sign cancellations that destabilize trigonometric features
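The orthogonality claim can be made concrete: a sketch (my own helper, not the paper's reference code) that draws block-orthogonal Gaussian projections via QR, rescaling columns to restore Gaussian norms:

```python
import numpy as np

def orthogonal_gaussian(d, m, rng):
    # QR of a Gaussian matrix yields orthonormal columns; rescaling each
    # column by a chi-distributed norm makes every direction marginally
    # Gaussian again, so unbiasedness is preserved.
    blocks = []
    for _ in range(-(-m // d)):                      # ceil(m / d) blocks
        Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
        norms = np.sqrt(rng.chisquare(df=d, size=d))
        blocks.append(Q * norms)
    return np.hstack(blocks)[:, :m]

W = orthogonal_gaussian(d=64, m=256, rng=np.random.default_rng(0))
# Drop-in replacement for the i.i.d. W in the sketches above:
# same expectation, lower variance.
```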
Comparison
| Aspect | Standard | Performer |
|---|---|---|
| Complexity | O(n²·d) | O(n·m·d) |
| Space | O(n²) | O(n·m) |
| Approximation | Exact | Unbiased, with concentration guarantees |
Advantages
- Linear in sequence length: O(n) space and time
- Recoverable: exact attention is recovered in the limit, since the estimator is unbiased and its variance shrinks as the number of random features m grows
- Flexible: Can use different kernels