Introduction
Window attention (used in Swin Transformer) divides the image into non-overlapping windows, where each window contains a fixed number of patches. Self-attention is computed only within each window, making computation linear in image size.
Window-based Processing
- Input: Feature map of size H × W
- Window size: M × M patches (typically M = 7)
- Number of windows: (H/M) × (W/M)
- Each window computes self-attention independently (see the sketch below)
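The partition itself is just a reshape. Here is a minimal PyTorch sketch, assuming a (B, H, W, C) feature-map layout with H and W divisible by M (the function name `window_partition` mirrors common Swin reference code, but treat this as an illustration rather than the exact implementation):

```python
import torch

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows.

    Assumes H and W are divisible by M. Returns a tensor of shape
    (B * (H/M) * (W/M), M*M, C): one sequence of M*M patch tokens per window.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)        # split H and W into a window grid
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # (B, H/M, W/M, M, M, C)
    return x.view(-1, M * M, C)                   # each window becomes a token sequence
```

Self-attention is then applied to each M·M-token sequence independently, e.g. by folding the window dimension into the batch dimension.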
Computational Efficiency
| Method | Complexity |
|---|---|
| Global attention (ViT) | O((HW)²) |
| Window attention (Swin) | O(HW·M²) = O(HW) since M is constant |
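To make the gap concrete, here is a back-of-the-envelope comparison of the attention-score term for a Stage-1-sized feature map (hypothetical numbers for illustration; constant factors and the linear projections are ignored):

```python
H, W, M = 56, 56, 7               # Stage 1 map of a 224x224 input, window size 7

n_tokens = H * W                  # 3136 patch tokens
global_cost = n_tokens ** 2       # O((HW)^2): every token attends to every token
window_cost = n_tokens * M * M    # O(HW * M^2): each token attends within its window

print(global_cost)                # 9834496
print(window_cost)                # 153664
print(global_cost / window_cost)  # 64.0, i.e. a factor of HW / M^2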
Window Attention in Swin
- Stage 1: H/4 × W/4 feature map, window size 7
- Each window: 7×7 = 49 patches
- Stages 2-4: resolution halves at each stage (H/8 × W/8, H/16 × W/16, H/32 × W/32) while windows remain 7×7 (see the worked example below)
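As a worked example, assume a 224×224 input, so the four stages produce 56×56, 28×28, 14×14, and 7×7 feature maps; the window count shrinks while the window size stays fixed:

```python
M = 7
stage_resolutions = [(56, 56), (28, 28), (14, 14), (7, 7)]  # stages 1-4, 224x224 input

for i, (H, W) in enumerate(stage_resolutions, start=1):
    n_windows = (H // M) * (W // M)
    print(f"Stage {i}: {H}x{W} map -> windows: {n_windows}, patches per window: {M * M}")
# Stage 1: 56x56 map -> windows: 64, patches per window: 49
# Stage 2: 28x28 map -> windows: 16, patches per window: 49
# Stage 3: 14x14 map -> windows: 4, patches per window: 49
# Stage 4: 7x7 map -> windows: 1, patches per window: 49
```

At the final stage a single window covers the whole map, so window attention there is effectively global.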
Shifted Window Mechanism
To enable cross-window communication, Swin alternates between two window configurations in consecutive blocks:
- Even blocks: regular window partition
- Odd blocks: windows shifted by (⌊M/2⌋, ⌊M/2⌋) patches (3 for M = 7)
- This creates connections between adjacent windows (see the cyclic-shift sketch below)
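In practice the shift is implemented as a cyclic roll of the feature map before the regular partition, paired with an attention mask for the wrapped-around regions. A minimal sketch of the roll step, assuming the same (B, H, W, C) layout as above (mask omitted for brevity):

```python
import torch

def cyclic_shift(x: torch.Tensor, M: int) -> torch.Tensor:
    """Cyclically shift a (B, H, W, C) feature map by floor(M/2) in both
    spatial dimensions, so the regular window grid now straddles the
    previous block's window boundaries. Swin additionally masks attention
    between patches that wrap around opposite edges; that mask is not
    shown here.
    """
    s = M // 2
    return torch.roll(x, shifts=(-s, -s), dims=(1, 2))
```

After attention, the map is rolled back by (+s, +s) to restore the original spatial layout.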
Benefits
- Linear complexity: O(HW) instead of O((HW)²)
- Cross-window connections: Shifted windows enable information flow
- Hierarchical: reduced resolution at each stage yields a progressively larger receptive field