38. Low-Rank Attention

Introduction

Low-rank attention is based on the observation that the attention matrix often has low-rank structure. This means the matrix can be well-approximated by a product of smaller matrices, enabling efficient computation without storing the full n×n matrix.

Low-Rank Assumption

Attention matrix A = softmax(QKᵀ/√d) ∈ ℝ^{n×n}

Often approximately low-rank:
A ≈ UVᵀ where U ∈ ℝ^{n×r}, V ∈ ℝ^{n×r}, r << n

Why Low-Rank?

Many attention patterns exhibit redundancy: different tokens frequently attend to the same salient positions, so the rows of A are close to combinations of a few shared patterns, and a small number of factors (rank r) can reconstruct most of the matrix.
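As a rough illustration, the sketch below (NumPy; all names are illustrative) builds an attention matrix from random Q and K and measures how much of its energy the top r singular values capture. With random inputs the decay is milder than in trained models, so this only demonstrates the measurement itself, not a guarantee of low rank.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, r = 256, 64, 16

    Q = rng.normal(size=(n, d))
    K = rng.normal(size=(n, d))

    # Row-wise softmax attention matrix A = softmax(QK^T / sqrt(d))
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)

    # Singular values (descending); the cumulative energy shows how well a
    # rank-r truncation can approximate A. Trained attention typically
    # decays much faster than this random example.
    s = np.linalg.svd(A, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    print(f"energy captured by top {r} of {n} singular values: {energy[r-1]:.3f}")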

Low-Rank Approximation Methods

1. Nyström Method

Sample m landmark positions
Compute attention to landmarks
Approximate full attention from landmark relationships

A ≈ A(:,M) @ A(M,M)⁻¹ @ A(M,:)
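A minimal NumPy sketch of this formula, with m uniformly sampled landmarks and a pseudo-inverse for the m×m core block (all names are illustrative):

    import numpy as np

    def nystrom_approx(A, m, rng):
        # A ≈ A[:, M] @ pinv(A[M, M]) @ A[M, :] for landmark index set M
        n = A.shape[0]
        M = rng.choice(n, size=m, replace=False)   # landmark positions
        C = A[:, M]                                # n x m columns
        W = A[np.ix_(M, M)]                        # m x m core block
        R = A[M, :]                                # m x n rows
        return C @ np.linalg.pinv(W) @ R           # pinv in case W is singular

    rng = np.random.default_rng(0)
    n, d, m = 256, 64, 32
    Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)

    A_hat = nystrom_approx(A, m, rng)
    print("relative error:", np.linalg.norm(A - A_hat) / np.linalg.norm(A))

Note that this sketch forms the full A only to check the error; practical methods in this family (e.g. Nyströmformer) apply softmax to landmark scores directly and never materialize the n×n matrix.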

2. Feature Map Approximation

A = softmax(QKᵀ/√d) ≈ φ(Q)φ(K)ᵀ (up to row normalization)

φ maps each row into an m-dimensional feature space, so φ(Q), φ(K) ∈ ℝ^{n×m} and the product φ(Q)φ(K)ᵀ has rank at most m (m << n); it never needs to be formed explicitly.
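A minimal sketch of the feature-map idea (NumPy). The section does not prescribe a particular φ, so the elementwise map φ(x) = elu(x) + 1 from the linear-attention literature is assumed here; because φ is elementwise, m equals d in this sketch:

    import numpy as np

    def phi(x):
        # elu(x) + 1: one simple positive feature map (an assumption, not the only choice)
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention(Q, K, V):
        Qf, Kf = phi(Q), phi(K)            # n x m feature maps (here m = d)
        KV = Kf.T @ V                      # m x d, computed once: O(n*m*d)
        Z = Qf @ Kf.sum(axis=0)            # per-row normalizer, length n
        return (Qf @ KV) / Z[:, None]      # n x d output; the n x n matrix is never formed

    rng = np.random.default_rng(0)
    n, d = 256, 64
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
    print(linear_attention(Q, K, V).shape)  # (256, 64)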

Efficient Computation

Standard: O(n²·d)

Low-rank: A ≈ UWᵀ with U, W ∈ ℝ^{n×r}, r << n (the second factor is written W here so it is not confused with the value matrix V ∈ ℝ^{n×d}):
Output = AV ≈ (UWᵀ)V
First compute S = WᵀV ∈ ℝ^{r×d}
Then output = US ∈ ℝ^{n×d}

Complexity: O(n·r·d) instead of O(n²·d)
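The saving comes purely from the order of multiplication, as in this sketch (NumPy; U, W are the low-rank factors, V the values):

    import numpy as np

    rng = np.random.default_rng(0)
    n, r, d = 2048, 32, 64
    U = rng.normal(size=(n, r))   # first low-rank factor
    W = rng.normal(size=(n, r))   # second low-rank factor (A ≈ U @ W.T)
    V = rng.normal(size=(n, d))   # values

    # Naive order: materializes an n x n matrix, O(n^2 * d) work and O(n^2) memory.
    out_naive = (U @ W.T) @ V

    # Low-rank order: never forms n x n, O(n * r * d) work.
    S = W.T @ V            # r x d
    out_fast = U @ S       # n x d

    print(np.allclose(out_naive, out_fast))   # True up to floating-point error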

Applications

Methods built on these ideas include Linformer (learned low-rank projections of the keys and values), Nyströmformer (the landmark-based Nyström approximation above), and feature-map methods such as Performer and linear transformers.
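As a rough sketch of the Linformer idea, keys and values can be compressed along the sequence dimension with projections E, F ∈ ℝ^{k×n} (learned in the actual model; random stand-ins here), so the attention matrix becomes n×k instead of n×n:

    import numpy as np

    def linformer_attention(Q, K, V, E, F):
        d = Q.shape[-1]
        K_proj = E @ K                           # k x d: keys compressed along sequence
        V_proj = F @ V                           # k x d: values compressed along sequence
        scores = Q @ K_proj.T / np.sqrt(d)       # n x k instead of n x n
        A = np.exp(scores - scores.max(-1, keepdims=True))
        A /= A.sum(-1, keepdims=True)
        return A @ V_proj                        # n x d output

    rng = np.random.default_rng(0)
    n, d, k = 256, 64, 32
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
    E = rng.normal(size=(k, n)) / np.sqrt(k)     # stand-in for a learned projection
    F = rng.normal(size=(k, n)) / np.sqrt(k)     # stand-in for a learned projection
    print(linformer_attention(Q, K, V, E, F).shape)   # (256, 64)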

Test Your Understanding

Question 1: Low-rank attention exploits the observation that attention matrices are:

  • A) Full rank (no structure)
  • B) Random
  • C) Approximately low-rank (redundant)
  • D) Dense

Question 2: In low-rank approximation A ≈ UVᵀ, what are the dimensions?

  • A) U ∈ ℝ^{n×n}, V ∈ ℝ^{n×n}
  • B) U ∈ ℝ^{n×r}, V ∈ ℝ^{n×r} where r << n
  • C) U ∈ ℝ^{r×r}, V ∈ ℝ^{r×r}
  • D) U ∈ ℝ^{n×d}, V ∈ ℝ^{n×d}

Question 3: Why do attention matrices often have low-rank structure?

  • A) Tokens are random
  • B) Many tokens attend to similar positions, creating redundancy
  • C) Attention is computed randomly
  • D) Softmax forces low-rank

Question 4: Nyström method approximates attention using:

  • A) All positions
  • B) Landmark positions sampled from full sequence
  • C) Random matrices
  • D) No approximation

Question 5: Complexity with low-rank approximation (rank r) is:

  • A) O(n²·d)
  • B) O(n·r·d)
  • C) O(r²·d)
  • D) O(n·d)

Question 6: Low-rank attention avoids storing:

  • A) The full n×n attention matrix
  • B) The Q, K, V matrices
  • C) The input embeddings
  • D) The output

Question 7: The Nyström approximation A ≈ A(:,M) @ A(M,M)⁻¹ @ A(M,:) uses:

  • A) All n positions
  • B) Only m landmark positions
  • C) Random positions
  • D) No positions

Question 8: Linformer uses low-rank attention via:

  • A) Random sampling
  • B) Learned projections that compress the keys and values along the sequence length
  • C) Matrix inversion
  • D) Full attention