35. Linear Attention

Introduction

Linear attention is an approach that reduces the O(n²) complexity of standard attention to O(n) in sequence length by using kernel feature maps to approximate the softmax operation. Instead of computing the full n×n attention matrix, linear attention computes the output in a single pass by exploiting the associativity of matrix multiplication.

The Problem with Standard Attention

Standard: Attention(Q, K, V) = softmax(QKᵀ)V (the 1/√d scaling is omitted here for brevity)

Computing QKᵀ produces an n×n matrix: O(n²·d) compute and O(n²) memory. This is the bottleneck for long sequences.
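To make the bottleneck concrete, here is a minimal NumPy sketch of standard attention that materializes the full n×n matrix (function and variable names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    """Q, K, V: (n, d) arrays. Builds the full n x n attention matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n, n): O(n^2 * d) compute, O(n^2) memory
    weights = softmax(scores, axis=-1)        # (n, n)
    return weights @ V                        # (n, d)
```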

Linear Attention Reformulation

softmax(QKᵀ)V ≈ φ(Q)(φ(K)ᵀV)

where φ is a feature map applied row-wise to the queries and keys.

This exploits the associativity of matrix multiplication:
(φ(Q)φ(K)ᵀ)V = φ(Q)(φ(K)ᵀV)
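A quick numerical check of this associativity, using the ELU + 1 feature map described below (a minimal sketch; names and sizes are illustrative):

```python
import numpy as np

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))

phi = lambda x: np.where(x > 0, x, np.expm1(x)) + 1.0   # ELU(x) + 1, applied elementwise

left  = (phi(Q) @ phi(K).T) @ V   # goes through an n x n intermediate
right = phi(Q) @ (phi(K).T @ V)   # goes through a d x d intermediate

assert np.allclose(left, right)   # identical result, very different cost
```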

How It Works

Feature Map φ

Apply a feature map to the queries and keys before computing attention:

φ(x) = ELU(x) + 1 (a common choice; it keeps the features positive, so the attention weights stay non-negative)

Computes: Attention(Q, K, V) = φ(Q)(φ(K)ᵀV) / (φ(Q)(φ(K)ᵀ𝟙)), where 𝟙 is the all-ones vector and the denominator replaces the softmax normalization

Computation Steps

1. Compute φ(K)ᵀV: a d×d matrix accumulated over n positions — O(n·d²), a single matrix multiply
2. Compute φ(Q)(φ(K)ᵀV): another O(n·d²) matrix multiply
3. Total: O(n·d²) — linear in the sequence length n for a fixed head dimension d (see the sketch below)
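Putting the steps together, a minimal NumPy sketch of linear attention with the ELU + 1 feature map and the normalization above (names and the eps term are illustrative assumptions):

```python
import numpy as np

def elu_plus_one(x):
    """phi(x) = ELU(x) + 1: x + 1 for x > 0, exp(x) otherwise (always positive)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """Q, K, V: (n, d) arrays. Never materializes an n x n matrix."""
    phi_Q, phi_K = elu_plus_one(Q), elu_plus_one(K)   # (n, d)
    KV = phi_K.T @ V                                  # (d, d)  step 1: O(n * d^2)
    Z = phi_K.sum(axis=0)                             # (d,)    phi(K)^T * 1, for normalization
    numerator = phi_Q @ KV                            # (n, d)  step 2: O(n * d^2)
    denominator = phi_Q @ Z + eps                     # (n,)    phi(Q)(phi(K)^T * 1)
    return numerator / denominator[:, None]           # (n, d)
```

The d×d product φ(K)ᵀV acts as a fixed-size summary of the keys and values, which is why nothing of size n×n ever needs to be stored.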

Complexity Comparison

Method               Complexity   For d = 64, n = 4096
Standard Attention   O(n²·d)      ≈ 1.07 billion operations (4096² × 64)
Linear Attention     O(n·d²)      ≈ 16.8 million operations (4096 × 64²)
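A quick back-of-the-envelope check of those counts (per attention head, counting only the dominant matrix product):

```python
n, d = 4096, 64
standard = n * n * d   # QK^T: ~1.07e9 multiply-adds
linear   = n * d * d   # phi(K)^T V: ~1.68e7 multiply-adds
print(standard, linear, standard // linear)   # ratio is n / d = 64
```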

Limitations

Because the softmax is replaced by a kernel feature map, linear attention is an approximation: it may not capture all of the attention patterns that standard softmax attention does, especially very sharp (low-entropy) ones.

Variants

Performer replaces the fixed feature map with random feature maps that give an unbiased estimate of the softmax kernel.
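To illustrate the random-feature idea (a simplified sketch of the estimator, not Performer's full FAVOR+ mechanism; all names and sizes are illustrative): with φ(x) = exp(ωᵀx − ‖x‖²/2) and ω drawn from a standard Gaussian, the expectation of φ(q)·φ(k) equals exp(qᵀk).

```python
import numpy as np

def random_features(x, W):
    """phi(x) = exp(x W^T - ||x||^2 / 2) / sqrt(m), rows of W ~ N(0, I)."""
    m = W.shape[0]
    return np.exp(x @ W.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 4, 100_000                 # more random features -> lower estimator variance
q, k = rng.normal(size=(2, d)) * 0.5
W = rng.normal(size=(m, d))

estimate = random_features(q[None], W) @ random_features(k[None], W).T
print(estimate[0, 0], np.exp(q @ k))   # the two values should be close
```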

Test Your Understanding

Question 1: Linear attention complexity is:

  • A) O(n²)
  • B) O(n)
  • C) O(n·d)
  • D) O(n·d²)

Question 2: The bottleneck in standard attention is:

  • A) V matrix
  • B) QKᵀ computation
  • C) Softmax
  • D) Output projection

Question 3: Linear attention uses:

  • A) Full n×n matrix
  • B) Kernel approximation with feature map
  • C) No attention
  • D) Fixed attention patterns

Question 4: A common feature map φ is:

  • A) ReLU
  • B) ELU + 1
  • C) Sigmoid
  • D) Tanh

Question 5: Linear attention exploits:

  • A) softmax properties
  • B) Associativity of matrix multiplication
  • C) Attention sparsity
  • D) Gradient flow

Question 6: A limitation of linear attention is:

  • A) O(n²) complexity
  • B) May not capture all patterns standard attention does
  • C) Cannot be trained
  • D) Uses too much memory

Question 7: For d=64, n=4096, linear attention operations vs standard:

  • A) Same
  • B) Much fewer (n·d² vs n²·d)
  • C) Much more
  • D) Slightly fewer

Question 8: Performer linear attention uses:

  • A) Fixed feature map
  • B) Random feature maps for unbiased estimation
  • C) No feature map
  • D) Learned feature map