Introduction
Low-rank attention is based on the observation that the attention matrix often has approximately low-rank structure: it can be well approximated by a product of two much smaller matrices, which enables efficient computation without ever materializing the full n×n matrix.
Low-Rank Assumption
Attention matrix A = softmax(QKᵀ/√d) ∈ ℝ^{n×n}
Often approximately low-rank:
A ≈ UVᵀ where U ∈ ℝ^{n×r}, V ∈ ℝ^{n×r}, r << n
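A minimal NumPy sketch of this factorization, using random Q and K and hypothetical sizes (n = 256, d = 64, r = 16): it forms the full attention matrix, takes a truncated SVD to obtain U and Vᵀ, and reports the leading singular values together with the rank-r approximation error. Random data is only illustrative; real attention matrices typically show the faster spectral decay discussed below.

```python
import numpy as np

n, d, r = 256, 64, 16                      # hypothetical sizes for illustration
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

# Full attention matrix A = softmax(QK^T / sqrt(d)), softmax taken row-wise.
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Truncated SVD gives the best rank-r factorization A ≈ U Vᵀ.
U_full, s, Vt = np.linalg.svd(A, full_matrices=False)
U = U_full[:, :r] * s[:r]                  # absorb singular values into the left factor
V = Vt[:r].T                               # V ∈ R^{n×r}

rel_err = np.linalg.norm(A - U @ V.T) / np.linalg.norm(A)
print("leading singular values:", np.round(s[:8], 3))
print(f"relative error of the rank-{r} approximation: {rel_err:.4f}")
```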
Why Low-Rank?
Many attention patterns exhibit redundancy:
- Token similarity: Many tokens attend to similar positions
- Redundant attention: Different queries may produce similar attention patterns
- Spectral decay: Attention matrices often have rapidly decaying singular values
Low-Rank Approximation Methods
1. Nyström Method
Sample a set M of m landmark positions (m << n)
Compute attention between all positions and the landmarks
Approximate the full attention matrix from the landmark blocks:
A ≈ A(:,M) @ A(M,M)⁻¹ @ A(M,:)
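A sketch of the Nyström formula above, with hypothetical sizes and randomly sampled landmarks. For clarity it materializes the full A and then samples its blocks; a practical implementation (Nyströmformer-style) builds the three blocks directly from sub-score softmaxes and never forms the n×n matrix. The pseudo-inverse stands in for A(M,M)⁻¹ for numerical stability.

```python
import numpy as np

n, d, m = 256, 64, 32                      # hypothetical sizes; m landmarks
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

M = rng.choice(n, size=m, replace=False)   # sampled landmark positions
A_cols = A[:, M]                           # A(:, M) ∈ R^{n×m}
A_mm   = A[np.ix_(M, M)]                   # A(M, M) ∈ R^{m×m}
A_rows = A[M, :]                           # A(M, :) ∈ R^{m×n}

# Nystrom reconstruction of the full matrix from the landmark blocks.
A_hat = A_cols @ np.linalg.pinv(A_mm) @ A_rows
print("relative Nystrom error:", np.linalg.norm(A - A_hat) / np.linalg.norm(A))
```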
2. Feature Map Approximation
A = softmax(QKᵀ/√d) ≈ φ(Q)φ(K)ᵀ (up to row-wise normalization)
φ(Q), φ(K) ∈ ℝ^{n×m} are thin (m << n), so the product has rank at most m
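A minimal sketch of feature-map (linear) attention, assuming the commonly used feature map φ(x) = elu(x) + 1 and random inputs; the key point is that associating the products as φ(Q)(φ(K)ᵀV) avoids forming the n×n matrix entirely.

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1: a positive feature map used in linear-attention work.
    return np.where(x > 0, x + 1.0, np.exp(x))

n, d = 256, 64                             # hypothetical sizes; here m = d
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

phi_Q, phi_K = elu_plus_one(Q), elu_plus_one(K)

# Associate to the right so the n×n matrix phi(Q) phi(K)^T is never formed.
KV   = phi_K.T @ V                         # m×d block, cost O(n·m·d)
norm = phi_Q @ phi_K.sum(axis=0)           # row normalizer replacing softmax's denominator
out  = (phi_Q @ KV) / norm[:, None]        # n×d output
print(out.shape)                           # (n, d)
```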
Efficient Computation
Standard: O(n²·d)
Low-rank (A ≈ UWᵀ with U, W ∈ ℝ^{n×r}, r << n; the right factor is written W here to avoid clashing with the value matrix V ∈ ℝ^{n×d}):
Output = AV ≈ U(WᵀV)
First compute S = WᵀV ∈ ℝ^{r×d}
Then output = US ∈ ℝ^{n×d}
Complexity: O(n·r·d) instead of O(n²·d)
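A short NumPy check of the associativity trick above, using random factors and the hypothetical sizes n = 256, r = 16, d = 64; it compares the O(n·r·d) path against the naive path that materializes the n×n matrix.

```python
import numpy as np

n, r, d = 256, 16, 64                      # hypothetical sizes
rng = np.random.default_rng(0)
U = rng.standard_normal((n, r))            # left factor of A ≈ U Wᵀ
W = rng.standard_normal((n, r))            # right factor (W, to avoid clashing with V)
V = rng.standard_normal((n, d))            # value matrix

# Associativity is the whole trick: (U Wᵀ) V = U (Wᵀ V).
S   = W.T @ V                              # r×d, cost O(n·r·d)
out = U @ S                                # n×d, cost O(n·r·d)

out_ref = (U @ W.T) @ V                    # naive path: materializes n×n, O(n²·d)
print(np.allclose(out, out_ref))           # True up to floating-point error
```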
Applications
- Linformer: projects keys and values along the sequence dimension to a fixed length k, so the attention map is n×k rather than n×n
- LRFormer: Transformer with low-rank attention
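As a rough illustration of the Linformer-style projection, the sketch below shrinks K and V along the sequence dimension to length k; the random matrices E and F are stand-ins for the model's learned projections, so this is a simplified sketch rather than the actual Linformer implementation.

```python
import numpy as np

n, d, k = 256, 64, 32                      # hypothetical sizes; k is the projected length
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
E = rng.standard_normal((k, n)) / np.sqrt(n)   # random stand-ins for the
F = rng.standard_normal((k, n)) / np.sqrt(n)   # learned projection matrices

# Project keys and values along the sequence dimension: both become k×d.
scores = Q @ (E @ K).T / np.sqrt(d)        # n×k score matrix instead of n×n
P = np.exp(scores - scores.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)          # row-wise softmax over the k columns
out = P @ (F @ V)                          # n×d output, overall O(n·k·d)
print(out.shape)
```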