Introduction
Global attention allows designated tokens (global tokens) to attend to, and be attended to by, every position in a sequence. This contrasts with local attention, which only considers neighboring positions. Global attention is essential for capturing long-range dependencies and is typically combined with local attention to keep the overall cost sub-quadratic.
Core Concept
Designate certain tokens as "global" so that they can attend to the entire sequence, as sketched below:
A global token at position g:
- Can attend to: all positions {0, 1, ..., n-1}
- Can be attended to by: all positions {0, 1, ..., n-1}
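To make this pattern concrete, here is a minimal sketch (the helper name `global_local_mask` and the boolean convention "True means query i may attend to key j" are assumptions, not from any specific library) that combines a sliding local window with a set of global positions:

```python
import numpy as np

def global_local_mask(n, window, global_positions):
    """Build an n x n boolean mask where mask[i, j] = True means query i may attend to key j."""
    mask = np.zeros((n, n), dtype=bool)

    # Local attention: each position sees `window` neighbors on either side (plus itself).
    for i in range(n):
        mask[i, max(0, i - window):min(n, i + window + 1)] = True

    # Global attention: global tokens attend everywhere, and every position attends to them.
    for g in global_positions:
        mask[g, :] = True   # global token attends to all positions
        mask[:, g] = True   # all positions attend to the global token

    return mask

# Example: 16 tokens, window of 2, position 0 acts as a [CLS]-style global token.
mask = global_local_mask(16, window=2, global_positions=[0])
print(mask.shape, int(mask.sum()))
```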
Implementation Examples
1. Longformer
Each token attends to:
- A sliding window of local neighbors (512 tokens by default)
- All global tokens

Global tokens: the [CLS] token plus task-specific tokens (e.g., question tokens in QA); see the usage sketch below.
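As an illustration, a minimal sketch using the Hugging Face transformers Longformer, where `global_attention_mask` marks which tokens receive global attention (the checkpoint name and input text here are just examples):

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A long document about global attention ...", return_tensors="pt")

# 1 = global attention, 0 = local (sliding-window) attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the first ([CLS]-style) token global attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```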
2. BigBird
- Global tokens attend to the entire sequence
- All tokens attend to the global tokens
- Local attention within sliding windows
- Random attention for additional connectivity (a simplified mask sketch follows)
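A simplified sketch of such a mask, assuming a few random keys per query (the helper name `bigbird_style_mask` is an assumption, and real BigBird uses blocked sparsity for efficiency, so this is only illustrative):

```python
import numpy as np

def bigbird_style_mask(n, window, global_positions, num_random, seed=0):
    """Boolean attention mask combining local windows, global tokens, and random links."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # Local sliding window around position i.
        mask[i, max(0, i - window):min(n, i + window + 1)] = True
        # A few random keys per query for extra long-range connectivity.
        mask[i, rng.choice(n, size=num_random, replace=False)] = True
    for g in global_positions:
        mask[g, :] = True  # global token attends to the entire sequence
        mask[:, g] = True  # all tokens attend to the global token
    return mask

print(bigbird_style_mask(32, window=3, global_positions=[0], num_random=2).mean())
```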
3. BERT-Style [CLS] Token
- The [CLS] token acts as a global token
- It attends to all input tokens
- Its output representation is used for classification
Information Flow with Global Tokens
Global tokens serve as "information hubs" that collect and distribute information:
Without a global token:
Position 0 ↔ Position 100 (requires roughly 100/w hops through local windows of size w)
With a global token:
Position 0 → Global → Position 100 (2 hops)
Complexity Analysis
| Scenario | Attention complexity |
|---|---|
| Full attention | O(n²) |
| Global only (g global tokens) | O(g·n) in each direction |
| Local only (window w) | O(n·w) |
| Global + local | O(g·n + n·w) |

Here n is the sequence length, w the local window size, and g the number of global tokens.
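A quick back-of-the-envelope comparison with illustrative numbers (n = 4096, w = 512, g = 8; these values are not from any particular model):

```python
# Illustrative numbers: n = 4096 tokens, local window w = 512, g = 8 global tokens.
n, w, g = 4096, 512, 8

full = n * n               # full self-attention: every query attends to every key
local = n * w              # sliding-window attention
global_part = 2 * g * n    # global tokens: g*n attending out + n*g being attended to
combined = local + global_part

print(f"full:     {full:>12,}")        # 16,777,216 score computations
print(f"local:    {local:>12,}")       #  2,097,152
print(f"global:   {global_part:>12,}") #     65,536
print(f"combined: {combined:>12,}")    #  2,162,688  (~8x fewer than full attention)
```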
Types of Global Tokens
1. Learnable Tokens
Add special tokens like [CLS], [SEP] that are trained to aggregate information.
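For instance, a sketch of registering a new learnable token with the Hugging Face tokenizer/model API (the token name [GLB] is purely illustrative, not a standard token):

```python
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Register a new special token and grow the embedding matrix so its vector can be learned.
tokenizer.add_special_tokens({"additional_special_tokens": ["[GLB]"]})
model.resize_token_embeddings(len(tokenizer))
```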
2. Existing Tokens
Designate certain input tokens (e.g., question tokens in QA) as global.
3. Additional Memory Tokens
- Add m memory tokens to the sequence
- Input: [memory tokens] + [actual tokens]
- Memory tokens attend globally and act as a learned summary of the sequence (see the sketch below)
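A minimal PyTorch sketch of this idea (class and parameter names such as `MemoryTokenEncoder` and `num_memory` are hypothetical; for simplicity it uses a standard full-attention encoder, whereas a sparse model would additionally mark the memory positions as global):

```python
import torch
import torch.nn as nn

class MemoryTokenEncoder(nn.Module):
    """Sketch: prepend m learnable memory tokens that act as global summary slots."""

    def __init__(self, d_model=256, num_memory=4, nhead=4, num_layers=2):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_memory, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        m = self.memory.size(0)
        mem = self.memory.expand(x.size(0), -1, -1)
        h = self.encoder(torch.cat([mem, x], dim=1))  # [memory tokens] + [actual tokens]
        # The memory positions now hold a global summary of the sequence.
        return h[:, :m], h[:, m:]

summary, tokens = MemoryTokenEncoder()(torch.randn(2, 128, 256))
print(summary.shape, tokens.shape)  # (2, 4, 256) and (2, 128, 256)
```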
Advantages
- Long-range dependency: Direct paths between any position and global tokens
- Efficiency: O(g·n + n·w) instead of O(n²) when the number of global tokens is small
- Information aggregation: Global tokens naturally aggregate sequence information
Disadvantages
- Bottleneck: Global tokens can become an information bottleneck
- Fixed capacity: A small, fixed number of global tokens may limit expressivity