Introduction
Block attention is a sparse attention mechanism in which the sequence is divided into blocks and attention is computed within blocks rather than across the entire sequence. This reduces complexity from O(n²) to O(n·b), where b is the block size, and also enables specific block-to-block attention patterns.
Block Partitioning
Sequence of length n is divided into B contiguous blocks
Block size b = n / B
Example: n=1024, block_size=64 → 16 blocks
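As a concrete illustration, here is a minimal NumPy sketch of the partitioning step (the function name `to_blocks` is ours, purely for illustration):

```python
import numpy as np

def to_blocks(x: np.ndarray, block_size: int) -> np.ndarray:
    """Reshape a (n, d) token sequence into (B, b, d) blocks.

    Assumes block_size divides n; real implementations pad otherwise.
    """
    n, d = x.shape
    assert n % block_size == 0, "sequence length must be a multiple of block size"
    return x.reshape(n // block_size, block_size, d)

x = np.random.randn(1024, 8)           # n=1024 tokens, d=8 features
blocks = to_blocks(x, block_size=64)
print(blocks.shape)                     # (16, 64, 8): B=16 blocks of b=64
```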
Attention Patterns
1. Within-Block Only
Attention is restricted to within each block (similar to local attention):
Complexity: O(B · b²) = O(n·b)
Block i can only attend to positions in block i
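A minimal single-head sketch of this pattern (no learned projections, unbatched; helper names are ours):

```python
import numpy as np

def softmax(s: np.ndarray, axis: int = -1) -> np.ndarray:
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def within_block_attention(x: np.ndarray, block_size: int) -> np.ndarray:
    """Self-attention computed independently inside each block.

    x: (n, d) with block_size dividing n. Only B score matrices of
    shape (b, b) are ever materialized, hence O(B · b²) = O(n·b).
    """
    n, d = x.shape
    z = x.reshape(n // block_size, block_size, d)        # (B, b, d)
    scores = z @ z.transpose(0, 2, 1) / np.sqrt(d)       # (B, b, b)
    return (softmax(scores) @ z).reshape(n, d)           # back to (n, d)

y = within_block_attention(np.random.randn(1024, 8), block_size=64)
```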
2. Block-to-Block (Strided)
Block i attends to blocks at stride S:
Block i → blocks: i, i-S, i+S, i-2S, ...
Enables a larger receptive field
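The strided pattern is easiest to express as a block-level mask; a sketch, assuming the stride applies symmetrically in both directions:

```python
import numpy as np

def strided_block_mask(num_blocks: int, stride: int) -> np.ndarray:
    """Boolean (B, B) mask: block i may attend to blocks i, i±S, i±2S, ..."""
    idx = np.arange(num_blocks)
    return np.abs(idx[:, None] - idx[None, :]) % stride == 0

mask = strided_block_mask(num_blocks=16, stride=4)
print(np.where(mask[1])[0])   # block 1 attends to blocks 1, 5, 9, 13
```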
3. Hierarchical Blocks
Different block sizes at different levels:
Level 1: 64 tokens per block
Level 2: 256 tokens per block
Level 3: 1024 tokens per block
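One way to realize such a hierarchy (an illustrative sketch, not a specific published architecture) is to grow the block size with layer depth, so early layers mix locally and later layers mix over progressively wider spans; the depth schedule below is hypothetical:

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def block_attention(x, b):
    """Within-block attention at block size b (single head, no projections)."""
    n, d = x.shape
    z = x.reshape(n // b, b, d)
    return (softmax(z @ z.transpose(0, 2, 1) / np.sqrt(d)) @ z).reshape(n, d)

# Hypothetical schedule mirroring the levels above: two layers per level.
x = np.random.randn(4096, 8)
for b in [64, 64, 256, 256, 1024, 1024]:
    x = block_attention(x, b)   # receptive field grows as blocks coarsen
```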
Memory Efficiency
Standard: O(n²) memory for attention matrix
Block attention: O(B · b²) = O(n·b) memory
Example: n=4096, b=64
Standard: n² ≈ 16.8M entries
Block: n·b ≈ 262K entries (64× less)
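The arithmetic behind these figures:

```python
n, b = 4096, 64
standard = n * n              # dense attention: one score per token pair
block = (n // b) * b * b      # B blocks, each a b×b score matrix = n·b
print(standard, block, standard // block)   # 16777216 262144 64
```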
Used In
- Longformer: sliding-window (local) attention combined with global attention
- Swin Transformer: Window-based blocks
- Image Transformers: Patch-based blocks
Cross-Block Information
To allow information to flow between blocks:
- Hierarchical stacking: Later layers see aggregated block info
- Shifted windows (Swin): Adjacent blocks attend to each other
- Global tokens: Special tokens that attend across blocks
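As one concrete example of the third mechanism, here is a token-level mask with a few global positions (placing them at the start of the sequence is our assumption; models like Longformer instead designate task-specific positions as global):

```python
import numpy as np

def global_block_mask(n: int, block_size: int, num_global: int) -> np.ndarray:
    """Within-block mask plus num_global positions with full attention."""
    pos = np.arange(n)
    mask = (pos[:, None] // block_size) == (pos[None, :] // block_size)
    mask[:num_global, :] = True    # global tokens attend everywhere
    mask[:, :num_global] = True    # every token attends to global tokens
    return mask

mask = global_block_mask(n=1024, block_size=64, num_global=4)
print(mask.mean())   # fraction of token pairs kept, well below 1.0
```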