Introduction
Linear attention is an approach that reduces the O(n²) complexity of standard attention to O(n) by using kernel feature maps to approximate the softmax operation. Instead of computing the full n×n attention matrix, linear attention computes the output in a single pass by exploiting the associativity of matrix multiplication.
The Problem with Standard Attention
Standard: Attention = softmax(QKᵀ)V
QKᵀ is O(n²) - the bottleneck
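For concreteness, here is a minimal NumPy sketch of standard attention (the function names and the √d scaling are the usual convention, not taken from the formula above); the (n, n) score matrix is exactly the term the rest of this section avoids.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # Q, K, V: (n, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n, n) score matrix: the O(n²) term
    return softmax(scores, axis=-1) @ V      # (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
print(standard_attention(Q, K, V).shape)  # (1024, 64)
```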
Linear Attention Reformulation
softmax(QKᵀ)V ≈ φ(Q)(φ(K)ᵀV)
Where φ is a feature map
This uses the associativity of matrix multiplication:
(φ(Q)φ(K)ᵀ)V = φ(Q)(φ(K)ᵀV)
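A quick numeric check of this identity (a throwaway NumPy sketch; sizes and names are illustrative): both groupings give the same result, but only the left one materializes an n×n matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
phi_Q = rng.random((n, d))   # stand-ins for φ(Q) and φ(K); any matrices work here
phi_K = rng.random((n, d))
V = rng.standard_normal((n, d))

left = (phi_Q @ phi_K.T) @ V   # builds an (n, n) intermediate
right = phi_Q @ (phi_K.T @ V)  # builds only a (d, d) intermediate
print(np.allclose(left, right))  # True
```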
How It Works
Feature Map φ
Apply a feature map to the queries and keys before computing attention:
φ(x) = ELU(x) + 1 (commonly used)
Computes: Attention(Q,K,V) = φ(Q)(φ(K)ᵀV) / (φ(Q)(φ(K)ᵀ1)), where 1 is the all-ones vector and the division is applied row-wise (it replaces the softmax denominator).
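A minimal sketch of this formula with the ELU(x) + 1 feature map (assuming single-head, non-causal attention; all names and shapes are illustrative):

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: strictly positive, so the normalizer below cannot be zero
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K, V: (n, d)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                  # (d, d): φ(K)ᵀ V
    Z = Qp @ Kp.sum(axis=0)        # (n,):  φ(Q) (φ(K)ᵀ 1), the row-wise normalizer
    return (Qp @ KV) / Z[:, None]  # (n, d)

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
print(linear_attention(Q, K, V).shape)  # (4096, 64)
```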
Computation Steps
1. Compute φ(K)ᵀV: O(n·d²), a single (d×n)·(n×d) matrix multiply producing a small d×d matrix
2. Compute φ(Q)(φ(K)ᵀV): O(n·d²)
3. Total: O(n·d²), which is linear in the sequence length n (d is a fixed constant)
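As a quick sanity check of these counts, plugging in the table's numbers (leading terms only, constant factors ignored):

```python
# Leading-term operation counts for n = 4096, d = 64.
n, d = 4096, 64

standard = n * n * d   # QKᵀ score matrix: n² · d
linear = n * d * d     # each of steps 1 and 2 above: n · d²

print(f"standard n²·d: {standard:,}")  # 1,073,741,824  (~1.07B)
print(f"linear   n·d²: {linear:,}")    # 16,777,216     (~16.8M)
print(f"ratio (n/d):   {standard // linear}x")  # 64x
```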
Complexity Comparison
| Method | Complexity | For d=64, n=4096 |
|---|---|---|
| Standard Attention | O(n²·d) | ≈1.07B operations |
| Linear Attention | O(n·d²) | ≈16.8M operations |
Limitations
- Approximation: The kernelized attention only approximates (or replaces) the softmax, so its outputs are not identical to standard attention
- Less expressive: May not capture all patterns standard attention does
- Requires careful kernel selection: Wrong kernel hurts performance
Variants
- Performer: Uses random feature maps (FAVOR+) for an unbiased estimate of the softmax kernel
- Linear Transformers: Uses the ELU(x) + 1 feature map described above
- CosFormer: Uses cosine-based kernels
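As a rough illustration of the Performer idea (a heavily simplified sketch of positive random features in the FAVOR+ spirit, not the library's implementation; the feature count m, the seed, and the omitted query/key rescaling are my simplifications): random Gaussian projections give an unbiased but noisy estimate of exp(q·k), which can then be plugged into the linear form above.

```python
import numpy as np

def positive_random_features(x, W):
    # x: (n, d), W: (m, d) with rows drawn from N(0, I)
    # E[φ(q)·φ(k)] = exp(q·k), the (unnormalized) softmax kernel
    m = W.shape[0]
    return np.exp(x @ W.T - 0.5 * (x ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 16, 4096
W = rng.standard_normal((m, d))
q = 0.3 * rng.standard_normal((1, d))
k = 0.3 * rng.standard_normal((1, d))

estimate = float(positive_random_features(q, W) @ positive_random_features(k, W).T)
exact = float(np.exp(q @ k.T))
print(estimate, exact)  # unbiased but noisy estimate vs. the exact kernel value
```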