Introduction
The quadratic complexity problem refers to the O(n²) computational and memory requirements of standard self-attention, where n is the sequence length. This quadratic scaling makes it prohibitively expensive to process very long sequences.
The Source of Quadratic Complexity
Attention(Q, K, V) = softmax(QKᵀ/√d)V, where Q, K, V are n × d matrices and d is the head dimension
QKᵀ computation: O(n·d) per query, for each of n queries, so O(n²·d) in total
Attention matrix size: n × n (one score per query–key pair)
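The quadratic cost is easy to see in code. Below is a minimal NumPy sketch of standard scaled dot-product attention (the function name and shapes are illustrative, not from any particular library); the `scores` array is exactly the n × n matrix tabulated below:

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention for (n, d) inputs."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d)

n, d = 2048, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)  # materializes a 2048 x 2048 scores matrix
```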
Memory Requirements
| Sequence Length (n) | Attention Matrix Entries (n²) |
|---|---|
| 512 | 262K |
| 2048 | 4.2M |
| 8192 | 67M |
| 32768 | 1.1B |
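Converting entries to bytes makes the pressure on GPU memory concrete. A back-of-the-envelope sketch, assuming fp32 (4 bytes per entry) and a single attention matrix, before batching or multi-head replication:

```python
def attn_matrix_mib(n, bytes_per_entry=4):
    """Memory for one n x n attention matrix, in MiB (fp32 by default)."""
    return n * n * bytes_per_entry / 2**20

for n in (512, 2048, 8192, 32768):
    print(f"n={n:>6}: {attn_matrix_mib(n):>8.1f} MiB")
# n=   512:      1.0 MiB
# n=  2048:     16.0 MiB
# n=  8192:    256.0 MiB
# n= 32768:   4096.0 MiB  (4 GiB for a single head, single layer)
```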
Why It's a Problem
- Long documents: Processing 10K+ tokens becomes prohibitively expensive in both time and memory
- Memory bound: The full n × n attention matrix may not fit in GPU memory
- Training cost: O(n²) compute in both the forward and backward passes (see the timing sketch below)
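A rough timing loop (reusing the `attention` function from the sketch above; absolute numbers are machine-dependent) shows the quadratic growth directly:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
for n in (512, 1024, 2048, 4096):
    Q, K, V = (rng.standard_normal((n, 64)) for _ in range(3))
    t0 = time.perf_counter()
    attention(Q, K, V)               # naive implementation from above
    dt = time.perf_counter() - t0
    print(f"n={n:>5}: {dt * 1e3:8.1f} ms")
# Each doubling of n roughly quadruples runtime, along with matrix size.
```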
Solutions
| Solution | Complexity | Method |
|---|---|---|
| FlashAttention | O(n) memory (compute still O(n²)) | Tiled, exact computation |
| Sparse attention | O(n·w) time, w = window size | Attend to a fixed subset of positions |
| Linear attention | O(n·d²) time | Kernel feature-map approximation |
| Longformer/BigBird | O(n) time | Sliding window + global + random tokens |
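As one concrete example from the table, linear attention exploits associativity: rewriting softmax(QKᵀ)V as φ(Q)(φ(K)ᵀV) lets the d × d matrix φ(K)ᵀV be computed once, so the n × n matrix never exists. A minimal non-causal sketch, assuming the φ(x) = elu(x) + 1 feature map popularized by linear transformers (Katharopoulos et al., 2020):

```python
import numpy as np

def elu_plus_one(x):
    """phi(x) = elu(x) + 1: a positive feature map for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Non-causal linear attention: O(n * d^2) time, no n x n matrix."""
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)    # (n, d) feature maps
    KV = Kf.T @ V                                # (d, d): summary of keys/values
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T     # (n, 1): normalization term
    return (Qf @ KV) / Z                         # (n, d)

n, d = 8192, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)  # never allocates an 8192 x 8192 matrix
```

The trade-off is that this approximates softmax attention rather than reproducing it, and quality can degrade on tasks that need sharp, content-based retrieval.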
Real-World Impact
Without such techniques, context windows stall at a few thousand tokens (GPT-3, for example, used a 2,048-token context). Many practical applications need much longer contexts:
- Long documents: Books, legal documents (50K+ tokens)
- Code: Large codebases
- Video: Long video understanding