Introduction
BigBird is a sparse attention mechanism introduced by Zaheer et al. (2020) that can handle sequences up to 8× longer than standard Transformers. It combines three attention patterns, local (sliding window), global, and random attention, and the paper shows that this sparse combination can approximate full attention while reducing the complexity from O(n²) to O(n) in sequence length.
Three Attention Patterns
1. Local Attention (Window)
Each position attends to w neighbors
Default w = 512
Captures local structure
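A minimal sketch of the window pattern as a boolean mask, assuming a symmetric window (w // 2 neighbors on each side of each position); the actual BigBird implementation works on blocks of tokens rather than individual positions:

```python
import numpy as np

def local_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean mask: True at (i, j) when j lies within i's local window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

mask = local_window_mask(n=16, w=4)
print(mask.sum(axis=1))  # each row has at most w + 1 True entries
```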
2. Global Attention
All positions attend to special global tokens
Global tokens attend to all positions
Enables information aggregation
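A sketch of the global pattern, assuming the g global tokens sit at the first g positions (as with [CLS]-style tokens); rows are queries and columns are keys:

```python
import numpy as np

def global_mask(n: int, g: int) -> np.ndarray:
    """Boolean mask for global attention with g global tokens at positions 0..g-1."""
    mask = np.zeros((n, n), dtype=bool)
    mask[:g, :] = True   # global tokens attend to every position
    mask[:, :g] = True   # every position attends to the global tokens
    return mask
```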
3. Random Attention
Each position attends to r random positions
Default r = 3-5 random connections
Enables information spread across sequence
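A sketch of the random pattern, sampling r key positions for each query position; the actual implementation samples random blocks rather than individual tokens:

```python
import numpy as np

def random_mask(n: int, r: int, seed: int = 0) -> np.ndarray:
    """Boolean mask: each query position attends to r randomly chosen key positions."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, rng.choice(n, size=r, replace=False)] = True
    return mask
```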
Theoretical Foundation
BigBird proves that sparse attention can approximate full attention because:
- Graph theory: The attention graph can be made connected, with short paths between any two positions, using only O(n) edges
- Spectral properties: Sparse random connections act like an expander graph and approximate the spectral behavior of full attention
- Message passing: Information flows across the sequence via random connections
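To make the graph view concrete, the sketch below (reusing the hypothetical mask helpers from the sections above) takes the union of the three patterns; the result is the adjacency matrix of the attention graph, whose number of edges grows linearly in n:

```python
def bigbird_mask(n: int, w: int = 6, g: int = 2, r: int = 3, seed: int = 0) -> np.ndarray:
    """Union of the three sparse patterns: the adjacency matrix of the attention graph."""
    return local_window_mask(n, w) | global_mask(n, g) | random_mask(n, r, seed)

adj = bigbird_mask(n=1024)
print(adj.sum(), "sparse edges vs", 1024 * 1024, "for full attention")
```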
Complexity
Total edges per position:
- w local connections
- g global connections (and the g global tokens attend to every position)
- r random connections
Per position: O(w + g + r), a constant that does not grow with n
Full sequence: O(n · (w + g + r)) = O(n) instead of O(n²)
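A back-of-the-envelope check with the typical values from the configuration table below (the 4096-token sequence length is just an illustrative assumption):

```python
n, w, g, r = 4096, 512, 2, 3          # assumed sequence length and table defaults
per_position = w + g + r              # constant, independent of n
sparse_total = n * per_position       # grows linearly: O(n)
full_total = n * n                    # full attention: O(n^2)
print(per_position, sparse_total, full_total, round(full_total / sparse_total, 1))
```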
Configuration
| Parameter | Typical Value | Description |
|---|---|---|
| num_global_tokens | 2 | Special global tokens |
| window_size | 512 | Local attention window |
| num_random_blocks | 3 | Random connections per token |
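As a sketch, the table rows could be collected into a plain config object; the field names follow this document, not any particular library's API:

```python
from dataclasses import dataclass

@dataclass
class SparseAttentionConfig:
    num_global_tokens: int = 2   # special global tokens
    window_size: int = 512       # local attention window
    num_random_blocks: int = 3   # random connections per token
```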
Key Insight
Random connections are crucial: the local window alone leaves distant positions many hops apart, and routing every long-range interaction through a handful of global tokens creates a bottleneck. Random edges bridge distant parts of the sequence, so any two positions are linked by short paths and information can mix across the entire sequence within a few layers.
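A small illustration of this point (a sketch, not the paper's proof), reusing the hypothetical mask helpers above: with the local window alone, the number of hops needed to reach a distant position grows linearly with the sequence length, while adding a few random edges per position keeps it small:

```python
from collections import deque
import numpy as np

def max_hops(adj: np.ndarray, start: int = 0) -> int:
    """Longest shortest-path distance from `start`, via BFS over the boolean adjacency."""
    dist = [-1] * adj.shape[0]
    dist[start] = 0
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(adj[u]):
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist)

n = 512
window_only = local_window_mask(n, w=6)
window_plus_random = window_only | random_mask(n, r=3, seed=0)
print("window only:          ", max_hops(window_only))         # roughly n / (w / 2) hops
print("window + random edges:", max_hops(window_plus_random))  # only a handful of hops
```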