Introduction
Local attention (also called sliding window attention) is a sparse attention mechanism where each position only attends to a fixed-size window of neighboring positions. It's efficient (O(n·w) instead of O(n²)) and captures local patterns, though it cannot directly model long-range dependencies without additional mechanisms.
Core Concept
For a window size w, position i attends only to positions in [max(0, i − ⌊w/2⌋), min(n − 1, i + ⌊w/2⌋)]:
LocalAttention(Q, K, V, w):
    for each position i:
        attend to positions (i − ⌊w/2⌋) through (i + ⌊w/2⌋), clipped to [0, n − 1]
        compute the softmax-weighted sum of V over those positions
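A minimal NumPy sketch of the centered variant (the name `local_attention` is illustrative, not a library function). For clarity it builds the full n×n score matrix and masks it, so it demonstrates the math rather than the O(n·w) memory savings; an efficient implementation would gather only the w keys inside each window:

```python
import numpy as np

def local_attention(Q, K, V, w):
    """Reference implementation: full scores, banded mask of width w."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                         # (n, n) logits
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w // 2  # |i - j| <= floor(w/2)
    scores = np.where(mask, scores, -np.inf)              # block out-of-window pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the window
    return weights @ V                                    # (n, d) outputs

n, d, w = 16, 8, 5
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(local_attention(Q, K, V, w).shape)  # (16, 8)
```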
Window Sizes
| Model | Window Size | Notes |
|---|---|---|
| GPT-3 | 2048 | Context length; alternates dense and locally banded sparse attention layers |
| Swin Transformer | 7×7 | 2-D windows over image patches |
| Longformer | 512 | Sliding window plus global attention at special tokens |
| BigBird | 512 | Window combined with global and random attention |
Variants
1. Centered Window
Position i attends to [i − ⌊w/2⌋, i + ⌊w/2⌋]
Example: w = 3 at position 5 attends to {4, 5, 6} (all three variants are sketched as masks in code after this list)
2. Left-aligned (Causal Local)
Position i attends to [max(0, i-w+1), i]
Example: w = 3 at position 5 attends to {3, 4, 5}
3. Strided Local
Position i attends to positions spaced at a fixed stride s (i, i − s, i − 2s, …), plus a small local window around i for nearby context
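These variants differ only in which (i, j) pairs the attention mask allows. A minimal NumPy sketch of all three masks (the function names, and the `local` parameter of the strided variant, are illustrative); `mask[i, j] = True` means position i may attend to position j:

```python
import numpy as np

def centered_mask(n, w):
    """mask[i, j] iff |i - j| <= w // 2."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

def causal_local_mask(n, w):
    """mask[i, j] iff i - w + 1 <= j <= i."""
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]   # i - j
    return (diff >= 0) & (diff < w)

def strided_mask(n, s, local=1):
    """mask[i, j] iff i and j are a multiple of stride s apart,
    or j lies within `local` positions of i (the small local context)."""
    idx = np.arange(n)
    diff = np.abs(idx[:, None] - idx[None, :])
    return (diff % s == 0) | (diff <= local)

print(np.where(centered_mask(8, 3)[5])[0])      # [4 5 6]
print(np.where(causal_local_mask(8, 3)[5])[0])  # [3 4 5]
```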
Computational Complexity
For sequence length n, window size w, and head dimension d:
- Standard attention: O(n² · d)
- Local attention: O(n · w · d)
Example: n = 4096, w = 64
- Standard: 4096² = 16,777,216 score entries (× d)
- Local: 4096 · 64 = 262,144 score entries (× d), a 64× reduction (= n/w)
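A quick sanity check of that arithmetic:

```python
n, w = 4096, 64
print(n * n)               # 16777216  (standard attention score entries)
print(n * w)               # 262144    (local attention score entries)
print((n * n) // (n * w))  # 64, the n/w reduction factor
```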
Strengths
- Efficient: Linear in sequence length
- Local patterns: Captures local syntax, structure
- Memory efficient: O(n·w) instead of O(n²)
Limitations
- No long-range links: distant positions cannot attend to each other directly
- Information bottleneck: long-range signals must propagate layer by layer, so the receptive field grows by only about w per layer
Combined with Global Attention
To capture both local and long-range dependencies:
- Global tokens: attend to, and are attended by, ALL positions
- Local tokens: attend only to their window of width w
- Information from global tokens propagates to every position, letting distant parts of the sequence communicate through them
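A sketch of the combined mask in the style of Longformer's sliding-window-plus-global pattern (the function name and `global_idx` parameter are illustrative):

```python
import numpy as np

def global_local_mask(n, w, global_idx):
    """Banded window of width w, plus symmetric global attention
    for the positions listed in global_idx."""
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w // 2  # local band
    mask[global_idx, :] = True   # global tokens attend everywhere
    mask[:, global_idx] = True   # every token attends to global tokens
    return mask

mask = global_local_mask(n=16, w=5, global_idx=[0])  # e.g. a [CLS]-style token
print(mask[10, 0], mask[10, 3])  # True False: global link vs. distant local pair
```

With a pattern like this, any two positions can exchange information within two hops through a global token, while the per-token cost stays roughly O(w + g) for g global tokens.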