Introduction
Swin Transformer (Shifted WINdow Transformer) is a hierarchical vision transformer that uses window-based self-attention with shifted windows. It achieves complexity linear in image size by computing self-attention within local windows, while the shifted windows provide cross-window connections.
Architecture Overview
Stages 1-4:
Each stage: Patch Merging (stage 1 uses a linear patch embedding instead) → N blocks alternating Window Attention and Shifted Window Attention (see the patch-merging sketch after this list)
Resolution reduction: H/4 → H/8 → H/16 → H/32
Window size: 7×7 (constant across stages)
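A minimal sketch of the patch-merging step that builds this hierarchy, assuming PyTorch; the module and variable names are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patches and project 4C -> 2C,
    halving spatial resolution while doubling channel width."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]                 # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)                   # stage-1 feature map (H/4 resolution, C=96)
print(PatchMerging(96)(x).shape)                 # torch.Size([1, 28, 28, 192])
```

Each merge halves H and W and doubles C, which is what produces the H/4 → H/8 → H/16 → H/32 pyramid above.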
Shifted Window Attention Detail
Each stage has N transformer blocks (N is even) with alternating window partitions; see the window-partition sketch after this list:
Block 1: Window attention with regular partition
Block 2: Window attention with shifted partition (windows displaced by ⌊M/2⌋ patches in each direction)
Block 3: Back to regular
Block 4: Shifted
...
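Both block types rearrange the feature map into non-overlapping M×M windows before attention. A minimal partition/reverse sketch, assuming PyTorch (function names are illustrative):

```python
import torch

def window_partition(x, M):
    """(B, H, W, C) -> (num_windows*B, M, M, C), non-overlapping MxM windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

def window_reverse(windows, M, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // ((H // M) * (W // M))
    x = windows.view(B, H // M, W // M, M, M, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

x = torch.randn(1, 56, 56, 96)
w = window_partition(x, 7)               # (64, 7, 7, 96): an 8x8 grid of 7x7 windows
assert torch.allclose(window_reverse(w, 7, 56, 56), x)
```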
Efficient Batch Computation
When windows are shifted, the number of windows increases and some become smaller than M×M. Swin handles this with an efficient batched computation plus masking:
Naive: the shifted partition produces (⌈H/M⌉ + 1) × (⌈W/M⌉ + 1) windows, up from ⌈H/M⌉ × ⌈W/M⌉
Efficient: cyclic-shift (roll) the feature map toward the top-left, so the window count stays ⌈H/M⌉ × ⌈W/M⌉
Compute attention over all windows in a single batch
Apply a mask so tokens originating from different sub-windows cannot attend to each other (see the sketch after this list)
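A sketch of the cyclic shift and the attention mask it requires, assuming PyTorch; the sizes and variable names below are illustrative:

```python
import torch

H = W = 14
M, shift = 7, 3                                   # window size, shift = M // 2

# 1. Cyclic shift toward the top-left; the window count stays (H/M) * (W/M).
x = torch.randn(1, H, W, 8)
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# 2. Label each position by the original sub-region it came from.
img_mask = torch.zeros(1, H, W, 1)
cnt = 0
for h in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
    for w in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
        img_mask[:, h, w, :] = cnt
        cnt += 1

# 3. Within each MxM window, allow attention only between equal labels.
mask_windows = img_mask.view(1, H // M, M, W // M, M, 1) \
                       .permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0))  # added to attention logits
print(attn_mask.shape)                            # (num_windows, M*M, M*M)
```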
Relative Position Bias
Swin uses relative position bias in attention:
Attention(Q,K,V) = softmax(QKᵀ/√d + B)V
B ∈ ℝ^{M²×M²} is a learned relative position bias shared across windows; its entries are taken from a smaller learned table of size (2M−1) × (2M−1), indexed by the relative offset between each pair of positions in the window (see the sketch below)
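A sketch of how the M²×M² bias B can be gathered from the smaller learned table, assuming PyTorch and a single attention head (names are illustrative, not the reference code):

```python
import torch
import torch.nn as nn

M = 7
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, 1))   # learned table, one head

coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
coords = coords.flatten(1)                                     # (2, M*M) per-token (y, x)
rel = coords[:, :, None] - coords[:, None, :]                  # (2, M*M, M*M) pairwise offsets
rel = rel.permute(1, 2, 0) + (M - 1)                           # shift offsets to start at 0
index = rel[..., 0] * (2 * M - 1) + rel[..., 1]                # (M*M, M*M) flat table index

B = bias_table[index.view(-1)].view(M * M, M * M)              # added to QKᵀ/√d per window
print(B.shape)                                                 # torch.Size([49, 49])
```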
Comparison with ViT
| Aspect | ViT | Swin |
|---|---|---|
| Attention | Global (all patches) | Window + shifted |
| Complexity | O(n²) | O(n) |
| Structure | Single scale | Hierarchical |
| Position bias | Learned 1D | Learned 2D relative |
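To make the complexity row concrete, a back-of-envelope count of attention pairs for a 224×224 input with 4×4 patches and 7×7 windows (an illustrative comparison at the same token resolution, not a benchmark):

```python
# Count pairwise attention interactions at the finest resolution (56x56 = 3136 tokens).
n = (224 // 4) ** 2            # tokens if global attention were applied at 4x4-patch resolution
M2 = 7 * 7                     # tokens per window
global_pairs = n * n           # ViT-style global attention
window_pairs = n * M2          # Swin: each token attends only within its 7x7 window
print(global_pairs, window_pairs, global_pairs / window_pairs)   # ~64x fewer pairs
```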