28. Sparse Attention

Introduction

Sparse attention is an approach that reduces the O(n²) computational and memory complexity of standard attention by computing attention only for a subset of position pairs. Rather than attending to all n×n pairs, sparse attention connects only O(n) or O(n log n) of them.
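
To make the idea concrete, here is a minimal NumPy sketch (the function name and shapes are illustrative, not from any particular library) that computes ordinary attention but disallows every pair not permitted by a boolean connectivity mask. This dense version still builds the full n×n score matrix, so it only illustrates the pattern; real sparse kernels avoid materializing the disallowed pairs.

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Attention restricted to pairs (i, j) where mask[i, j] is True.

    Q, K, V: (n, d) arrays; mask: (n, n) boolean connectivity pattern.
    Assumes every row of `mask` allows at least one position (e.g. itself).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (n, n) raw attention scores
    scores = np.where(mask, scores, -np.inf)   # forbid masked-out pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (n, d) outputs
```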

Why Sparse Attention?

Full attention compares every position with every other, so compute and memory grow quadratically with sequence length. For long sequences this quickly becomes prohibitive, which motivates restricting attention to a sparse set of position pairs.

Sparse Attention Patterns

1. Fixed Patterns

a) Global + Local:
- Global tokens attend to all others
- Local tokens attend to neighbors within window w

b) Stride:
- Attend to positions at fixed intervals (e.g., every 3rd position)

c) Random:
- Random positions selected for each token
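
All three fixed patterns can be expressed as boolean masks for the sketch in the introduction. The helper names below are illustrative, not from any library; a minimal NumPy sketch:

```python
import numpy as np

def local_mask(n, w):
    """Each position attends to neighbors within a window of size w."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

def global_mask(n, global_positions):
    """Chosen global positions attend to all others, and all positions attend to them."""
    mask = np.zeros((n, n), dtype=bool)
    mask[global_positions, :] = True   # global tokens see every position
    mask[:, global_positions] = True   # every position sees the global tokens
    return mask

def strided_mask(n, stride):
    """Attend to positions at fixed intervals, e.g. stride=3 -> every 3rd position."""
    idx = np.arange(n)
    return (idx[None, :] - idx[:, None]) % stride == 0

def random_mask(n, r, seed=0):
    """Each position attends to r randomly selected positions (plus itself)."""
    rng = np.random.default_rng(seed)
    mask = np.eye(n, dtype=bool)
    for i in range(n):
        mask[i, rng.choice(n, size=r, replace=False)] = True
    return mask
```

A global + local pattern is then simply `global_mask(n, [0]) | local_mask(n, w)`.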

2. Learned Patterns

Instead of a hand-designed pattern, the connectivity can be learned by the model or split across heads (Sparse Transformers factorize attention this way):

Attention heads use different sparsity patterns
Each head covers a specific part of the connectivity

3. BlockSparse

Divide sequence into blocks
Attention only within certain block pairs
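
A minimal sketch of a block-sparse mask (the function name and the way block pairs are passed in are assumptions for illustration):

```python
import numpy as np

def block_sparse_mask(n, block_size, block_pairs):
    """Allow attention only inside the listed (query_block, key_block) pairs.

    Example: block_pairs = [(b, b) for b in range(n // block_size)]
    gives block-diagonal attention (each block attends only to itself).
    """
    mask = np.zeros((n, n), dtype=bool)
    for qb, kb in block_pairs:
        q0, k0 = qb * block_size, kb * block_size
        mask[q0:q0 + block_size, k0:k0 + block_size] = True
    return mask
```

Because whole blocks are either kept or dropped, the kernel can operate on dense tiles, which maps well onto GPU hardware.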

Sparse Attention Categories

A. Local Attention (Sliding Window)

Each position attends to w neighbors (typically w is small, e.g., 32-512):

Complexity: O(n·w) instead of O(n²)

Captures local patterns but not long-range dependencies
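
To show where the O(n·w) comes from, here is a sketch that never materializes an n×n matrix: it gathers only the w keys and values in each position's window. For simplicity it uses a causal window (each position looks at the w positions ending at itself); names and shapes are illustrative.

```python
import numpy as np

def sliding_window_attention(Q, K, V, w):
    """Local attention over the w positions ending at each query position.

    Builds only (n, w)-shaped score tensors, so time and memory are O(n*w).
    """
    n, d = Q.shape
    offsets = np.arange(-w + 1, 1)                        # (w,) window offsets
    positions = np.arange(n)[:, None] + offsets           # (n, w) neighbor indices
    neighbor = np.clip(positions, 0, None)                # clamp at the left edge
    K_win, V_win = K[neighbor], V[neighbor]               # (n, w, d) gathered keys/values
    scores = np.einsum('nd,nwd->nw', Q, K_win) / np.sqrt(d)
    scores = np.where(positions >= 0, scores, -np.inf)    # ignore clamped duplicates
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum('nw,nwd->nd', weights, V_win)        # (n, d) outputs
```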

B. Global Attention

Special tokens (or all tokens in a designated subset) attend to the entire sequence, and every token attends back to them:

Each global token attends to all n positions: O(n) per global token
Every token attends to the few global tokens: O(n) total
Remaining local connections: O(n·w)

C. Random Attention

Each position attends to r random positions:

Complexity: O(n·r)

Enables information flow across the whole sequence

Combining Sparse Patterns

State-of-the-art sparse methods use combinations:

BigBird: Global + Local + Random

This combination can approximate full attention while maintaining O(n) complexity
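
A minimal sketch of a BigBird-style mask, built as the union of the three patterns (the window size, number of random links, and number of global tokens below are illustrative defaults, not the paper's settings):

```python
import numpy as np

def bigbird_style_mask(n, w=8, r=2, n_global=1, seed=0):
    """Union of sliding-window, global, and random connectivity.

    Each row allows roughly w + r + n_global positions, so the total number of
    attended pairs grows as O(n) rather than O(n^2).
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    local = np.abs(idx[:, None] - idx[None, :]) <= w // 2    # sliding window
    glob = np.zeros((n, n), dtype=bool)
    glob[:n_global, :] = True                                 # global tokens see everything
    glob[:, :n_global] = True                                 # everything sees global tokens
    rand = np.zeros((n, n), dtype=bool)
    for i in range(n):
        rand[i, rng.choice(n, size=r, replace=False)] = True  # r random links per position
    return local | glob | rand

mask = bigbird_style_mask(n=512)
print(f"{mask.sum():,} allowed pairs vs {512 * 512:,} for full attention")
```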

Computational Savings

Method                        | Complexity | For n=4096, w=64
Full Attention                | O(n²)      | 16M operations
Local Window                  | O(n·w)     | 262K operations
BigBird (Global+Local+Random) | O(n)       | 16K operations
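
The first two rows follow directly from the pair counts; a quick check (the BigBird row additionally depends on the chosen window, random, and global constants):

```python
n, w = 4096, 64
full = n * n    # every position pair
local = n * w   # one w-wide window per position
print(f"full: {full:,}  local: {local:,}  ratio: {full // local}x")
# full: 16,777,216  local: 262,144  ratio: 64x
```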

Limitations

Sparse patterns are an approximation: a fixed mask may miss important long-range dependencies that fall outside the attended positions, and the pattern itself (window size, number of global or random connections) becomes an extra design decision.

Test Your Understanding

Question 1: What is the complexity of full attention?

  • A) O(n)
  • B) O(n log n)
  • C) O(n²)
  • D) O(1)

Question 2: Local attention with window size w has complexity:

  • A) O(n²)
  • B) O(n·w)
  • C) O(w²)
  • D) O(n/w)

Question 3: What does BigBird combine?

  • A) Local + Stride
  • B) Global + Local + Random
  • C) Chunk + Random
  • D) Fixed + Learned

Question 4: Random attention enables:

  • A) Faster computation
  • B) Information flow across the sequence
  • C) Lower memory
  • D) All of the above

Question 5: What is a limitation of sparse attention?

  • A) Too much memory
  • B) May miss important long-range dependencies
  • C) Cannot be combined with masks
  • D) Uses more parameters

Question 6: Global attention tokens attend to:

  • A) Only local neighbors
  • B) The entire sequence
  • C) No other tokens
  • D) Random positions

Question 7: For n=4096 with window size w=64, local attention complexity is:

  • A) 4096²
  • B) 4096 × 64
  • C) 64²
  • D) 4096 + 64

Question 8: Sparse attention is useful for:

  • A) Short sequences only
  • B) Long sequences where full attention is too expensive
  • C) Only vision transformers
  • D) Only text