Introduction
Kernelized attention reformulates the attention mechanism using kernel methods to achieve time and space complexity that is linear in the sequence length. Instead of materializing the full n×n matrix exp(QKᵀ/√d), it uses kernel feature maps to approximate the softmax, enabling efficient computation through the "kernel trick."
The Kernel Trick
Standard softmax:
softmax(QKᵀ)[i,j] = exp(qᵢ·kⱼ/√d) / Σₘ exp(qᵢ·kₘ/√d)
Kernelized (approximate):
kernel(q,k) = exp(q·k/√d) ≈ φ(q)·φ(k)
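Once the similarity is written as φ(q)·φ(k), the attention output can be computed without ever forming the n×n matrix, because matrix multiplication is associative. A minimal sketch with generic non-negative feature matrices; the function names and the small epsilon are illustrative, not from any library:

```python
import numpy as np

# The kernel trick applied to attention: with exp(q·k) replaced by phi(q)·phi(k),
# reassociating the matrix products removes the need for the n×n attention matrix.
def attention_quadratic(phi_q, phi_k, v, eps=1e-6):
    a = phi_q @ phi_k.T                                  # (n, n) similarity matrix
    return (a @ v) / (a.sum(axis=1, keepdims=True) + eps)

def attention_linear(phi_q, phi_k, v, eps=1e-6):
    kv = phi_k.T @ v                                     # (m, d_v): no n×n matrix
    z = phi_k.sum(axis=0)                                # (m,): normalizer summary
    return (phi_q @ kv) / (phi_q @ z + eps)[:, None]

n, m, d_v = 128, 16, 32
rng = np.random.default_rng(0)
phi_q, phi_k = rng.random((n, m)), rng.random((n, m))    # non-negative features
v = rng.standard_normal((n, d_v))
assert np.allclose(attention_quadratic(phi_q, phi_k, v),
                   attention_linear(phi_q, phi_k, v))
```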
Feature Map Approximations
1. Random Feature Maps (Performer)
φ(x) = (1/√m) · [cos(w₁·x), sin(w₁·x), cos(w₂·x), sin(w₂·x), ..., cos(wₘ·x), sin(wₘ·x)]
Where the wᵢ are random vectors drawn from a Gaussian distribution (the classic random Fourier feature construction)
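A minimal sketch of this construction, assuming wᵢ ~ N(0, I); the function and variable names are illustrative. Note that φ(x)·φ(y) here approximates the Gaussian kernel exp(−‖x−y‖²/2); the softmax kernel exp(x·y) differs only by the factors exp(‖x‖²/2)·exp(‖y‖²/2), and Performer's FAVOR+ variant further swaps these trigonometric features for positive ones to reduce variance (not shown here).

```python
import numpy as np

# Trigonometric random features (random Fourier features).
def random_feature_map(x, w):
    m = w.shape[0]
    proj = x @ w.T                               # random projections, shape (n, m)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1) / np.sqrt(m)

d, m = 16, 256
rng = np.random.default_rng(0)
w = rng.standard_normal((m, d))                  # w_i ~ N(0, I)
x, y = 0.1 * rng.standard_normal(d), 0.1 * rng.standard_normal(d)

approx = (random_feature_map(x[None], w) @ random_feature_map(y[None], w).T).item()
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)  # Gaussian kernel value
print(approx, exact)                             # close for large m
```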
2. Polynomial Kernels
K(q,k) = (q·k/√d + c)^p
Approximates the softmax kernel for suitable choices of the degree p and offset c
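For low degrees the feature map can even be written out exactly rather than sampled. A minimal sketch for p = 2; poly_kernel, poly_features, and the values of c and p are illustrative choices:

```python
import numpy as np

# Degree-2 polynomial kernel and its exact, finite-dimensional feature map:
# phi(x)·phi(y) = (x·y + c)^2, so there is no sampling or approximation error.
def poly_kernel(q, k, c=1.0, p=2):
    d = q.shape[-1]
    return (q @ k / np.sqrt(d) + c) ** p

def poly_features(x, c=1.0):
    outer = np.outer(x, x).ravel()               # all x_i * x_j cross terms
    return np.concatenate([outer, np.sqrt(2 * c) * x, [c]])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(4), rng.standard_normal(4)
scale = 4 ** -0.25                               # folds the 1/sqrt(d) into phi
print(poly_kernel(q, k))                         # kernel computed directly
print(poly_features(scale * q) @ poly_features(scale * k))  # identical value
```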
3. ELU Feature Map
φ(x) = ELU(x) + 1
Used in Linear Transformers
Simpler but less accurate approximation
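A minimal sketch of this feature map (alpha is the standard ELU coefficient); it is applied elementwise and keeps every feature strictly positive, so the resulting attention weights stay non-negative:

```python
import numpy as np

# phi(x) = ELU(x) + 1, applied elementwise; outputs are always > 0.
def elu_feature_map(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0)) + 1.0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(elu_feature_map(x))    # e.g. [0.135..., 0.606..., 1., 1.5, 3.]
```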
Kernelized Attention Formula
Attention(Q,K,V) ≈ (φ(Q)φ(K)ᵀ)V / (φ(Q)φ(K)ᵀ)1
Computed as:
S = φ(Q) @ (φ(K)ᵀV) [O(n·d²)]
z = φ(Q) @ (φ(K)ᵀ1) [O(n·d)]
Output = S / z (row-wise normalization)
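A minimal end-to-end sketch of this two-step computation using the ELU + 1 feature map; the shapes, the small epsilon, and the function names are illustrative choices:

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x, np.exp(x) - 1.0) + 1.0

def kernelized_attention(q, k, v, eps=1e-6):
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)   # (n, d) feature maps
    kv = phi_k.T @ v                                  # (d, d_v)   O(n·d·d_v)
    z = phi_k.sum(axis=0)                             # (d,)       O(n·d)
    num = phi_q @ kv                                  # (n, d_v)   O(n·d·d_v)
    den = phi_q @ z + eps                             # (n,)       O(n·d)
    return num / den[:, None]

n, d, d_v = 1024, 64, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, dim)) for dim in (d, d, d_v))
out = kernelized_attention(q, k, v)
print(out.shape)    # (1024, 64) -- the n×n attention matrix is never built
```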
Properties
| Aspect | Standard | Kernelized |
|---|---|---|
| Time | O(n²·d) | O(n·d²) |
| Space | O(n²) | O(n·d + d²) |
| Softmax | Exact | Approximate |
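A quick back-of-the-envelope comparison with illustrative sizes (n = 4096 tokens, d = 64 dimensions) makes the gap concrete:

```python
# Rough operation and memory counts for illustrative sizes n = 4096, d = 64.
n, d = 4096, 64

standard_time = n * n * d           # QK^T followed by AV:          ~O(n²·d)
kernel_time = n * d * d             # phi(K)^T V then phi(Q)(...):  ~O(n·d²)
standard_space = n * n              # the n×n attention matrix
kernel_space = n * d + d * d        # feature maps plus the d×d summary

print(standard_time / kernel_time)      # 64.0 -> ~64x fewer multiply-adds
print(standard_space / kernel_space)    # ~63  -> ~63x less attention memory
```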
Error Analysis
Kernel approximation introduces error that can accumulate:
- Small errors: individual attention weights can differ from their exact softmax values
- But: the overall output is often close enough to preserve downstream quality
- Importance: some attention patterns are affected more than others; sharply peaked (near one-hot) patterns are typically harder to approximate than diffuse ones
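One way to see this empirically is to compare the kernelized output against exact softmax attention on random inputs. A minimal sketch, reusing kernelized_attention from the sketch above; the relative Frobenius-norm error is an illustrative metric, not a standard benchmark:

```python
import numpy as np

def softmax_attention(q, k, v):
    a = np.exp(q @ k.T / np.sqrt(q.shape[-1]))        # exact (quadratic) path
    return (a / a.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 32)) for _ in range(3))
exact = softmax_attention(q, k, v)
approx = kernelized_attention(q, k, v)                # from the earlier sketch
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(rel_err)    # nonzero; how large depends on the feature map and the data
```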