60. Quadratic Complexity Problem

Introduction

The quadratic complexity problem refers to the O(n²) computational and memory requirements of standard self-attention, where n is the sequence length. This quadratic scaling makes it prohibitively expensive to process very long sequences.

The Source of Quadratic Complexity

Attention(Q, K, V) = softmax(QKᵀ/√d)·V

  • QKᵀ computation: O(n·d) for each of n queries, so O(n²·d) in total
  • Attention matrix: n × n, i.e. O(n²) scores to compute and store
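A minimal NumPy sketch of single-head attention makes the n × n score matrix explicit. The toy sizes (n = 2048, d = 64) and the use of NumPy are illustrative assumptions, not taken from any particular model:

```python
import numpy as np

def standard_attention(Q, K, V):
    """Naive single-head attention: materializes the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # shape (n, n): O(n^2 * d) compute, O(n^2) memory
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability before exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ V                               # shape (n, d): another O(n^2 * d)

n, d = 2048, 64                                      # toy sizes, assumed for illustration
Q = np.random.randn(n, d).astype(np.float32)
K = np.random.randn(n, d).astype(np.float32)
V = np.random.randn(n, d).astype(np.float32)
out = standard_attention(Q, K, V)
print(out.shape)   # (2048, 64); the hidden cost is the (2048, 2048) score matrix
```

The output has the same shape as the input, which is why the n × n intermediate is easy to overlook until sequences get long.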

Memory Requirements

| Sequence Length | Attention Matrix Size |
|---|---|
| 512 | 262K entries |
| 2048 | 4.2M entries |
| 8192 | 67M entries |
| 32768 | 1B entries |
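The entry counts above are simply n². A few lines of Python (a throwaway sketch, assuming one fp32 score of 4 bytes per entry, a single head, and batch size 1; multiple heads and larger batches multiply the total) reproduce them and translate them into raw bytes:

```python
for n in (512, 2048, 8192, 32768):
    entries = n * n                      # size of the n x n attention matrix
    fp32_bytes = entries * 4             # 4 bytes per fp32 score, single head, batch size 1
    print(f"n={n:>6}  entries={entries:>13,}  fp32 = {fp32_bytes / 2**20:,.0f} MiB")
# n=   512  entries=      262,144  fp32 = 1 MiB
# n=  2048  entries=    4,194,304  fp32 = 16 MiB
# n=  8192  entries=   67,108,864  fp32 = 256 MiB
# n= 32768  entries=1,073,741,824  fp32 = 4,096 MiB
```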

Why It's a Problem

Because both compute and memory grow with n², doubling the context length roughly quadruples the cost of every attention layer. At 32768 tokens the attention matrix alone already holds about a billion entries per head, so memory, rather than model quality, becomes the binding constraint on context length.
Solutions

| Solution | Complexity | Method |
|---|---|---|
| FlashAttention | O(n) memory | Tiled computation |
| Sparse attention | O(n·w), w = window size | Local + global + random |
| Linear attention | O(n·d²) or O(n) | Kernel approximation |
| Longformer/BigBird | O(n) | Window + global + random |
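As one concrete illustration of the "kernel approximation" row, the sketch below reorders (QKᵀ)V, which costs O(n²·d), into Q(KᵀV), which costs O(n·d²). The elu+1 feature map and the toy sizes are assumptions borrowed from the linear-attention literature, not a prescription:

```python
import numpy as np

def elu_plus_one(x):
    # A simple positive feature map; other kernel choices are possible.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention that never materializes the n x n matrix.

    softmax(QK^T)V is approximated by phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1),
    so the big product is K^T V with shape (d, d) instead of QK^T with shape (n, n).
    """
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)      # feature maps, shape (n, d)
    KV = Kf.T @ V                                  # (d, d): O(n * d^2)
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T       # (n, 1) normalizer
    return (Qf @ KV) / (Z + 1e-6)                  # (n, d): O(n * d^2)

n, d = 8192, 64                                    # toy sizes, assumed for illustration
Q, K, V = (np.random.randn(n, d).astype(np.float32) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)   # (8192, 64), with no 8192 x 8192 intermediate
```

FlashAttention takes a different route: it keeps the exact softmax result but computes it tile by tile so the full score matrix never sits in memory at once.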

Real-World Impact

Without solving quadratic complexity, models would be limited to ~2000 tokens by memory alone. Many practical applications need much longer contexts:

  • Legal and contract document processing (often tens of thousands of tokens)
  • Summarization of books, reports, and research papers
  • Repository-level code understanding
  • Long multi-turn conversation histories

Test Your Understanding

Question 1: Standard self-attention has complexity:

  • A) O(n)
  • B) O(n²)
  • C) O(1)
  • D) O(log n)

Question 2: The quadratic complexity comes from:

  • A) V matrix computation
  • B) QKᵀ matrix (n×n attention scores)
  • C) Query projection
  • D) No complexity

Question 3: For n=8192, attention matrix has how many entries?

  • A) 8192
  • B) 8192² ≈ 67M
  • C) 16K
  • D) 1B

Question 4: Without solving quadratic complexity, max sequence length is limited to approximately:

  • A) 100K tokens
  • B) ~2000 tokens (due to memory)
  • C) 512 tokens
  • D) 100 tokens

Question 5: FlashAttention addresses the quadratic memory cost of attention by:

  • A) Increasing memory
  • B) Tiled computation to achieve O(n) memory
  • C) Using more computation
  • D) Approximation only

Question 6: Which is NOT a solution to quadratic complexity?

  • A) FlashAttention
  • B) Sparse attention
  • C) Standard MHA (still O(n²))
  • D) Linear attention

Question 7: Longformer/BigBird achieve O(n) via:

  • A) Full attention (still O(n²))
  • B) Window + global + random attention patterns
  • C) No attention
  • D) Single layer only

Question 8: A practical application needing long contexts (>10K tokens) is:

  • A) Short text classification
  • B) Legal document processing
  • C) Single word
  • D) Emoji classification