60. Quadratic Complexity Problem

Introduction

The quadratic complexity problem refers to the O(n²) computational and memory requirements of standard self-attention, where n is the sequence length. This quadratic scaling makes it prohibitively expensive to process very long sequences.

The Source of Quadratic Complexity

Attention(Q, K, V) = softmax(QKᵀ/√d)·V

  • QKᵀ computation: O(n·d) for each of n queries, so O(n²·d) in total
  • Attention matrix: n × n, i.e. O(n²) scores to compute and store
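A minimal NumPy sketch of single-head attention makes the n × n score matrix explicit. The toy sizes (n = 2048, d = 64) and the use of NumPy are illustrative assumptions, not taken from any particular model:

```python
import numpy as np

def standard_attention(Q, K, V):
    """Naive single-head attention: materializes the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # shape (n, n): O(n^2 * d) compute, O(n^2) memory
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability before exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ V                               # shape (n, d): another O(n^2 * d)

n, d = 2048, 64                                      # toy sizes, assumed for illustration
Q = np.random.randn(n, d).astype(np.float32)
K = np.random.randn(n, d).astype(np.float32)
V = np.random.randn(n, d).astype(np.float32)
out = standard_attention(Q, K, V)
print(out.shape)   # (2048, 64); the hidden cost is the (2048, 2048) score matrix
```

The output has the same shape as the input, which is why the n × n intermediate is easy to overlook until sequences get long.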

Memory Requirements

| Sequence Length | Attention Matrix Size |
|---|---|
| 512 | 262K entries |
| 2048 | 4.2M entries |
| 8192 | 67M entries |
| 32768 | 1B entries |
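The entry counts above are simply n². A few lines of Python (a throwaway sketch, assuming one fp32 score of 4 bytes per entry, a single head, and batch size 1; multiple heads and larger batches multiply the total) reproduce them and translate them into raw bytes:

```python
for n in (512, 2048, 8192, 32768):
    entries = n * n                      # size of the n x n attention matrix
    fp32_bytes = entries * 4             # 4 bytes per fp32 score, single head, batch size 1
    print(f"n={n:>6}  entries={entries:>13,}  fp32 = {fp32_bytes / 2**20:,.0f} MiB")
# n=   512  entries=      262,144  fp32 = 1 MiB
# n=  2048  entries=    4,194,304  fp32 = 16 MiB
# n=  8192  entries=   67,108,864  fp32 = 256 MiB
# n= 32768  entries=1,073,741,824  fp32 = 4,096 MiB
```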

Why It's a Problem

Because both compute and memory grow with n², doubling the context length roughly quadruples the cost of every attention layer. At 32768 tokens the attention matrix alone already holds about a billion entries per head, so memory, rather than model quality, becomes the binding constraint on context length.
Solutions

| Solution | Complexity | Method |
|---|---|---|
| FlashAttention | O(n) memory | Tiled computation |
| Sparse attention | O(n·w), w = window size | Local + global + random |
| Linear attention | O(n·d²) or O(n) | Kernel approximation |
| Longformer/BigBird | O(n) | Window + global + random |
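As one concrete illustration of the "kernel approximation" row, the sketch below reorders (QKᵀ)V, which costs O(n²·d), into Q(KᵀV), which costs O(n·d²). The elu+1 feature map and the toy sizes are assumptions borrowed from the linear-attention literature, not a prescription:

```python
import numpy as np

def elu_plus_one(x):
    # A simple positive feature map; other kernel choices are possible.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention that never materializes the n x n matrix.

    softmax(QK^T)V is approximated by phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1),
    so the big product is K^T V with shape (d, d) instead of QK^T with shape (n, n).
    """
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)      # feature maps, shape (n, d)
    KV = Kf.T @ V                                  # (d, d): O(n * d^2)
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T       # (n, 1) normalizer
    return (Qf @ KV) / (Z + 1e-6)                  # (n, d): O(n * d^2)

n, d = 8192, 64                                    # toy sizes, assumed for illustration
Q, K, V = (np.random.randn(n, d).astype(np.float32) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)   # (8192, 64), with no 8192 x 8192 intermediate
```

FlashAttention takes a different route: it keeps the exact softmax result but computes it tile by tile so the full score matrix never sits in memory at once.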

Real-World Impact

Without solving quadratic complexity, models would be limited to ~2000 tokens by memory alone. Many practical applications need much longer contexts:

  • Legal and contract document processing (often tens of thousands of tokens)
  • Summarization of books, reports, and research papers
  • Repository-level code understanding
  • Long multi-turn conversation histories

Test Your Understanding

Question 1: Standard self-attention has complexity:

  • A) O(n)
  • B) O(n²)
  • C) O(1)
  • D) O(log n)

Question 2: The quadratic complexity comes from:

  • A) V matrix computation
  • B) QKᵀ matrix (n×n attention scores)
  • C) Query projection
  • D) No complexity

Question 3: For n=8192, attention matrix has how many entries?

  • A) 8192
  • B) 8192² ≈ 67M
  • C) 16K
  • D) 1B

Question 4: Without solving quadratic complexity, max sequence length is limited to approximately:

  • A) 100K tokens
  • B) ~2000 tokens (due to memory)
  • C) 512 tokens
  • D) 100 tokens

Question 5: FlashAttention addresses the quadratic memory cost of attention by:

  • A) Increasing memory
  • B) Tiled computation to achieve O(n) memory
  • C) Using more computation
  • D) Approximation only

Question 6: Which is NOT a solution to quadratic complexity?

  • A) FlashAttention
  • B) Sparse attention
  • C) Standard MHA (still O(n²))
  • D) Linear attention

Question 7: Longformer/BigBird achieve O(n) via:

  • A) Full attention (still O(n²))
  • B) Window + global + random attention patterns
  • C) No attention
  • D) Single layer only

Question 8: A practical application needing long contexts (>10K tokens) is:

  • A) Short text classification
  • B) Legal document processing
  • C) Single word
  • D) Emoji classification