38. Low-Rank Attention

Introduction

Low-rank attention is based on the observation that the attention matrix often has low-rank structure. This means the matrix can be well-approximated by a product of smaller matrices, enabling efficient computation without storing the full n×n matrix.

Low-Rank Assumption

Attention matrix A = softmax(QKᵀ/√d) ∈ ℝ^{n×n}

Often approximately low-rank:
A ≈ UVᵀ where U ∈ ℝ^{n×r}, V ∈ ℝ^{n×r}, r << n

Why Low-Rank?

Many attention patterns exhibit redundancy: different tokens frequently attend to the same salient positions, so the rows of A are close to combinations of a few shared patterns, and a small number of factors (rank r) can reconstruct most of the matrix.
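As a rough illustration, the sketch below (NumPy; all names are illustrative) builds an attention matrix from random Q and K and measures how much of its energy the top r singular values capture. With random inputs the decay is milder than in trained models, so this only demonstrates the measurement itself, not a guarantee of low rank.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, r = 256, 64, 16

    Q = rng.normal(size=(n, d))
    K = rng.normal(size=(n, d))

    # Row-wise softmax attention matrix A = softmax(QK^T / sqrt(d))
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)

    # Singular values (descending); the cumulative energy shows how well a
    # rank-r truncation can approximate A. Trained attention typically
    # decays much faster than this random example.
    s = np.linalg.svd(A, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    print(f"energy captured by top {r} of {n} singular values: {energy[r-1]:.3f}")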

Low-Rank Approximation Methods

1. Nyström Method

Sample m landmark positions
Compute attention to landmarks
Approximate full attention from landmark relationships

A ≈ A(:,M) @ A(M,M)⁻¹ @ A(M,:)
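A minimal NumPy sketch of this formula, with m uniformly sampled landmarks and a pseudo-inverse for the m×m core block (all names are illustrative):

    import numpy as np

    def nystrom_approx(A, m, rng):
        # A ≈ A[:, M] @ pinv(A[M, M]) @ A[M, :] for landmark index set M
        n = A.shape[0]
        M = rng.choice(n, size=m, replace=False)   # landmark positions
        C = A[:, M]                                # n x m columns
        W = A[np.ix_(M, M)]                        # m x m core block
        R = A[M, :]                                # m x n rows
        return C @ np.linalg.pinv(W) @ R           # pinv in case W is singular

    rng = np.random.default_rng(0)
    n, d, m = 256, 64, 32
    Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)

    A_hat = nystrom_approx(A, m, rng)
    print("relative error:", np.linalg.norm(A - A_hat) / np.linalg.norm(A))

Note that this sketch forms the full A only to check the error; practical methods in this family (e.g. Nyströmformer) apply softmax to landmark scores directly and never materialize the n×n matrix.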

2. Feature Map Approximation

A = softmax(QKᵀ/√d) ≈ φ(Q)φ(K)ᵀ (up to row normalization)

φ maps each row into an m-dimensional feature space, so φ(Q), φ(K) ∈ ℝ^{n×m} and the product φ(Q)φ(K)ᵀ has rank at most m (m << n); it never needs to be formed explicitly.
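A minimal sketch of the feature-map idea (NumPy). The section does not prescribe a particular φ, so the elementwise map φ(x) = elu(x) + 1 from the linear-attention literature is assumed here; because φ is elementwise, m equals d in this sketch:

    import numpy as np

    def phi(x):
        # elu(x) + 1: one simple positive feature map (an assumption, not the only choice)
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention(Q, K, V):
        Qf, Kf = phi(Q), phi(K)            # n x m feature maps (here m = d)
        KV = Kf.T @ V                      # m x d, computed once: O(n*m*d)
        Z = Qf @ Kf.sum(axis=0)            # per-row normalizer, length n
        return (Qf @ KV) / Z[:, None]      # n x d output; the n x n matrix is never formed

    rng = np.random.default_rng(0)
    n, d = 256, 64
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
    print(linear_attention(Q, K, V).shape)  # (256, 64)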

Efficient Computation

Standard: O(n²·d)

Low-rank: A ≈ UWᵀ with U, W ∈ ℝ^{n×r}, r << n (the second factor is written W here so it is not confused with the value matrix V ∈ ℝ^{n×d}):
Output = AV ≈ (UWᵀ)V
First compute S = WᵀV ∈ ℝ^{r×d}
Then output = US ∈ ℝ^{n×d}

Complexity: O(n·r·d) instead of O(n²·d)
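The saving comes purely from the order of multiplication, as in this sketch (NumPy; U, W are the low-rank factors, V the values):

    import numpy as np

    rng = np.random.default_rng(0)
    n, r, d = 2048, 32, 64
    U = rng.normal(size=(n, r))   # first low-rank factor
    W = rng.normal(size=(n, r))   # second low-rank factor (A ≈ U @ W.T)
    V = rng.normal(size=(n, d))   # values

    # Naive order: materializes an n x n matrix, O(n^2 * d) work and O(n^2) memory.
    out_naive = (U @ W.T) @ V

    # Low-rank order: never forms n x n, O(n * r * d) work.
    S = W.T @ V            # r x d
    out_fast = U @ S       # n x d

    print(np.allclose(out_naive, out_fast))   # True up to floating-point error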

Applications

Methods built on these ideas include Linformer (learned low-rank projections of the keys and values), Nyströmformer (the landmark-based Nyström approximation above), and feature-map methods such as Performer and linear transformers.
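As a rough sketch of the Linformer idea, keys and values can be compressed along the sequence dimension with projections E, F ∈ ℝ^{k×n} (learned in the actual model; random stand-ins here), so the attention matrix becomes n×k instead of n×n:

    import numpy as np

    def linformer_attention(Q, K, V, E, F):
        d = Q.shape[-1]
        K_proj = E @ K                           # k x d: keys compressed along sequence
        V_proj = F @ V                           # k x d: values compressed along sequence
        scores = Q @ K_proj.T / np.sqrt(d)       # n x k instead of n x n
        A = np.exp(scores - scores.max(-1, keepdims=True))
        A /= A.sum(-1, keepdims=True)
        return A @ V_proj                        # n x d output

    rng = np.random.default_rng(0)
    n, d, k = 256, 64, 32
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
    E = rng.normal(size=(k, n)) / np.sqrt(k)     # stand-in for a learned projection
    F = rng.normal(size=(k, n)) / np.sqrt(k)     # stand-in for a learned projection
    print(linformer_attention(Q, K, V, E, F).shape)   # (256, 64)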

Test Your Understanding

Question 1: Low-rank attention exploits the observation that attention matrices are:

  • A) Full rank (no structure)
  • B) Random
  • C) Approximately low-rank (redundant)
  • D) Dense

Question 2: In low-rank approximation A ≈ UVᵀ, what are the dimensions?

  • A) U ∈ ℝ^{n×n}, V ∈ ℝ^{n×n}
  • B) U ∈ ℝ^{n×r}, V ∈ ℝ^{n×r} where r << n
  • C) U ∈ ℝ^{r×r}, V ∈ ℝ^{r×r}
  • D) U ∈ ℝ^{n×d}, V ∈ ℝ^{n×d}

Question 3: Why do attention matrices often have low-rank structure?

  • A) Tokens are random
  • B) Many tokens attend to similar positions, creating redundancy
  • C) Attention is computed randomly
  • D) Softmax forces low-rank

Question 4: Nyström method approximates attention using:

  • A) All positions
  • B) Landmark positions sampled from full sequence
  • C) Random matrices
  • D) No approximation

Question 5: Complexity with low-rank approximation (rank r) is:

  • A) O(n²·d)
  • B) O(n·r·d)
  • C) O(r²·d)
  • D) O(n·d)

Question 6: Low-rank attention avoids storing:

  • A) The full n×n attention matrix
  • B) The Q, K, V matrices
  • C) The input embeddings
  • D) The output

Question 7: The Nyström approximation A ≈ A(:,M) @ A(M,M)⁻¹ @ A(M,:) uses:

  • A) All n positions
  • B) Only m landmark positions
  • C) Random positions
  • D) No positions

Question 8: Linformer uses low-rank attention via:

  • A) Random sampling
  • B) Learned projections that compress the keys and values along the sequence length
  • C) Matrix inversion
  • D) Full attention