Introduction
Sparse attention reduces the O(n²) computational and memory complexity of standard attention by computing attention only between a subset of position pairs. Rather than attending to all n×n pairs, sparse attention connects only O(n) or O(n log n) of them.
Why Sparse Attention?
- Quadratic bottleneck: Standard attention is O(n²) in sequence length
- Memory intensive: Storing n×n attention matrices is expensive for large n
- Long sequences: Real-world applications (documents, videos) have long contexts
- Not all connections needed: Many attention patterns can be approximated sparsely
Sparse Attention Patterns
1. Fixed Patterns
a) Global + Local:
- Global tokens attend to all others
- Local tokens attend to neighbors within window w
b) Stride:
- Attend to positions at fixed intervals
- e.g., every 3rd position
c) Random:
- Random positions selected for each token (mask sketches for all three fixed patterns follow below)
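As a concrete illustration, here is a minimal PyTorch sketch that builds boolean masks for the three fixed patterns. The helper names (local_window_mask, strided_mask, random_mask) are illustrative, not from any particular library; mask[i, j] == True means position i may attend to position j.

```python
import torch

def local_window_mask(n: int, w: int) -> torch.Tensor:
    # Local part of global + local: attend to neighbors within a window
    # of roughly w positions (w // 2 on each side).
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w // 2

def strided_mask(n: int, stride: int) -> torch.Tensor:
    # Stride pattern: every position attends to every stride-th position.
    idx = torch.arange(n)
    return ((idx[None, :] % stride) == 0).expand(n, n)

def random_mask(n: int, r: int) -> torch.Tensor:
    # Random pattern: each position attends to r randomly sampled positions
    # (duplicates are possible, which is acceptable for a sketch).
    mask = torch.zeros(n, n, dtype=torch.bool)
    cols = torch.randint(0, n, (n, r))
    mask.scatter_(1, cols, True)
    return mask
```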
2. Learned Patterns
Models learn which positions to attend to (Sparse Transformers):
- Use attention heads with different sparsity patterns
- Each head learns a specific connectivity pattern
3. Block-Sparse
- Divide the sequence into blocks
- Compute attention only within selected block pairs (see the sketch below)
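A minimal block-sparse sketch, assuming the block size divides n evenly and the allowed (query-block, key-block) pairs are supplied by the caller; block_sparse_mask is a hypothetical helper:

```python
import torch

def block_sparse_mask(n: int, block: int, allowed_pairs) -> torch.Tensor:
    # Divide the sequence into n // block blocks and allow attention only
    # between the listed (query_block, key_block) pairs.
    assert n % block == 0
    num_blocks = n // block
    block_mask = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)
    for qb, kb in allowed_pairs:
        block_mask[qb, kb] = True
    # Expand each block entry into a block x block tile of token positions.
    return block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)

# Example: block-diagonal attention (each block attends only to itself).
mask = block_sparse_mask(n=8, block=2, allowed_pairs=[(i, i) for i in range(4)])
```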
Sparse Attention Categories
A. Local Attention (Sliding Window)
Each position attends to w neighboring positions (w is typically small, e.g., 32-512):
- Complexity: O(n·w) instead of O(n²)
- Captures local patterns but not long-range dependencies (applying the mask is sketched below)
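To show how such a mask is applied, here is a hedged sketch reusing the hypothetical local_window_mask helper from above: disallowed pairs are set to -inf before the softmax, so they receive zero attention weight.

```python
import math
import torch

def masked_attention(q, k, v, mask):
    # q, k, v: (n, d) tensors; mask: (n, n) boolean, True = may attend.
    scores = (q @ k.T) / math.sqrt(q.size(-1))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

n, d, w = 128, 16, 8
q, k, v = (torch.randn(n, d) for _ in range(3))
out = masked_attention(q, k, v, local_window_mask(n, w))
```

Note that this mask-based demo still materializes the full n×n score matrix; the O(n·w) saving only appears in implementations whose kernels compute the unmasked entries alone.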
B. Global Attention
Special tokens (or a designated subset of tokens) attend to the entire sequence, and all tokens attend back to them:
- Each global token: O(n)
- All tokens to the global tokens: O(n), for a constant number of global tokens
- Remaining local connections: O(n·w) (see the mask sketch below)
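A minimal sketch, assuming the first g positions are designated as global tokens (global_mask is a hypothetical helper):

```python
import torch

def global_mask(n: int, g: int) -> torch.Tensor:
    # The first g positions are global: they attend to the entire sequence,
    # and every position attends back to them.
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:g, :] = True   # global tokens -> all positions
    mask[:, :g] = True   # all positions -> global tokens
    return mask
```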
C. Random Attention
Each position attends to r random positions:
- Complexity: O(n·r)
- Enables information flow across the whole sequence
Combining Sparse Patterns
State-of-the-art sparse methods combine several of these patterns. BigBird, for example, uses global + local + random attention. This combination can approximate full attention while maintaining O(n) complexity (see the combined-mask sketch below).
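Putting the pieces together, here is a hedged sketch of a BigBird-style mask as the union of the hypothetical helpers defined in the sketches above (global_mask, local_window_mask, random_mask). This illustrates the idea only; it is not BigBird's actual implementation.

```python
def bigbird_style_mask(n: int, g: int, w: int, r: int) -> torch.Tensor:
    # Union of global, local-window, and random connectivity.
    return global_mask(n, g) | local_window_mask(n, w) | random_mask(n, r)

mask = bigbird_style_mask(n=4096, g=2, w=64, r=3)
print(mask.float().mean())  # fraction of attended pairs; far below 1.0
```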
Computational Savings
| Method | Complexity | Operations (n = 4096, w = 64) |
|---|---|---|
| Full attention | O(n²) | ≈16.8M |
| Local window | O(n·w) | ≈262K |
| BigBird (global + local + random) | O(n·(w + g + r)) | ≈283K for small g, r; still linear in n |
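A quick arithmetic check of the table's counts (g = 2 and r = 3 are illustrative assumptions, not values from the table):

```python
n, w, g, r = 4096, 64, 2, 3
print(n * n)            # full attention: 16,777,216 (~16.8M pairs)
print(n * w)            # local window:   262,144    (~262K pairs)
print(n * (w + g + r))  # BigBird-style:  282,624    (~283K pairs; linear in n)
```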
Limitations
- May miss important long-range dependencies
- The choice of sparsity pattern significantly affects quality
- Some tasks require dense attention