Introduction
Sliding window attention (also called local attention) is a mechanism where each position attends only to a fixed-size window of neighboring positions. The term typically refers to a symmetric left-right context around each position; models like Swin Transformer apply a closely related windowing idea over 2D feature maps.
How It Works
For window size w at position i:
Attend to positions: [i-w/2, i+w/2]
Complexity: O(n·w) instead of O(n²)
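As a rough sketch (not taken from any particular library; the function name is illustrative), sliding window attention can be computed by slicing the keys and values to each query's window, so the per-query cost is O(w) rather than O(n):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(q, k, v, w):
    """Naive sliding window attention over a 1D sequence.

    q, k, v: arrays of shape (n, d). Each position i attends only to
    positions [i - w//2, i + w//2], so total work is O(n * w * d).
    """
    n, d = q.shape
    out = np.zeros_like(v)
    half = w // 2
    for i in range(n):
        lo = max(0, i - half)         # clip window at sequence start
        hi = min(n, i + half + 1)     # clip window at sequence end
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)   # (hi - lo,) scores
        out[i] = softmax(scores) @ v[lo:hi]
    return out

# Tiny usage example
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(sliding_window_attention(q, k, v, w=4).shape)  # (16, 8)
```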
Symmetric vs Asymmetric
Symmetric (Swin-style)
Each position attends to w/2 neighbors on each side:
Position i attends to:
max(0, i-w/2) to min(n-1, i+w/2)
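A minimal sketch of the symmetric case as a boolean attention mask (the function name is illustrative): position i may attend to j exactly when |i − j| ≤ w/2.

```python
import numpy as np

def symmetric_window_mask(n, w):
    """mask[i, j] is True when position i may attend to position j."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

print(symmetric_window_mask(6, 4).astype(int))
# Row i has ones in columns max(0, i-2) .. min(5, i+2)
```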
Asymmetric (some GPT variants)
Can attend to w previous positions but not future (causal):
Position i attends to:
max(0, i-w+1) to i
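The causal variant, in the same illustrative style: position i attends only to itself and the w − 1 previous positions.

```python
import numpy as np

def causal_window_mask(n, w):
    """mask[i, j] is True when i may attend to j (past only, within the window)."""
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]   # query index minus key index
    return (dist >= 0) & (dist < w)      # no future, at most w-1 positions back

print(causal_window_mask(6, 3).astype(int))
# Row i has ones in columns max(0, i-2) .. i
```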
Multi-Stage Windows
Models like Swin keep the window size fixed while the feature map shrinks across stages, so each window covers a progressively larger portion of the input:
| Stage | Window Size | Feature Map |
|---|---|---|
| 1 | 7×7 | 56×56 |
| 2 | 7×7 | 28×28 |
| 3 | 7×7 | 14×14 |
| 4 | 7×7 | 7×7 |
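To see why a fixed 7×7 window still gives a hierarchy, here is a back-of-the-envelope calculation assuming Swin-T defaults (224×224 input, 4×4 patch embedding, 2× downsampling between stages, so strides of 4, 8, 16, 32 pixels per feature-map cell):

```python
# Effective window coverage in input pixels, assuming Swin-T defaults.
window = 7
strides = [4, 8, 16, 32]                 # pixels per feature-map cell at each stage
for stage, stride in enumerate(strides, start=1):
    coverage = window * stride           # window extent in input pixels
    print(f"stage {stage}: 7x7 window covers {coverage}x{coverage} pixels")
# stage 1: 28x28, stage 2: 56x56, stage 3: 112x112, stage 4: 224x224
```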
Computation Pattern
For window size w:
┌──────────────────────────────┐
│ Window 0: positions 0..w-1   │
│ Window 1: positions w..2w-1  │
│ Window 2: positions 2w..3w-1 │
│ ...                          │
└──────────────────────────────┘
Each window computes self-attention internally
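A sketch of that partitioned pattern (names illustrative): the sequence is split into non-overlapping windows of length w and standard self-attention runs inside each window independently, assuming n is a multiple of w.

```python
import numpy as np

def windowed_self_attention(q, k, v, w):
    """Non-overlapping window attention: each length-w block attends only to itself.

    q, k, v: (n, d) with n divisible by w. Cost is O((n / w) * w^2 * d) = O(n * w * d).
    """
    n, d = q.shape
    assert n % w == 0, "sketch assumes n is a multiple of the window size"
    qw = q.reshape(n // w, w, d)   # (num_windows, w, d)
    kw = k.reshape(n // w, w, d)
    vw = v.reshape(n // w, w, d)
    scores = qw @ kw.transpose(0, 2, 1) / np.sqrt(d)   # (num_windows, w, w)
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs @ vw).reshape(n, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((12, 8)) for _ in range(3))
print(windowed_self_attention(q, k, v, w=4).shape)  # (12, 8)
```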
Key Properties
- Local receptive field: w neighbors per position
- Linear complexity: O(n·w)
- Hierarchical propagation: stacking windowed layers or stages lets information reach beyond a single window
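To make the propagation point concrete, a back-of-the-envelope sketch (the exact constant is an assumption, not a result from the source): with a causal window of size w, each layer can move information at most w − 1 positions, so after L layers a token can be influenced by roughly L·(w − 1) earlier positions.

```python
# Rough effective context after stacking layers of causal window attention:
# each layer moves information at most (w - 1) positions in a single hop.
def effective_context(window, num_layers):
    return num_layers * (window - 1)

print(effective_context(window=4096, num_layers=32))  # 131040 positions
```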
Extension: Shifted Window
Swin Transformer alternates between regular and shifted window partitions in consecutive blocks:
Block l: windows aligned to the grid
Block l+1: windows shifted by w/2
This enables cross-window connections.
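A 1D sketch of the shift (not Swin's actual 2D implementation; names are illustrative): rolling the sequence by w/2 before partitioning makes the shifted windows straddle the boundaries of the regular ones, so information can cross window edges.

```python
import numpy as np

def window_partition(x, w):
    """Split a length-n sequence into non-overlapping windows of size w."""
    return x.reshape(-1, w)

seq = np.arange(12)
w = 4

regular = window_partition(seq, w)                    # windows aligned to the grid
shifted = window_partition(np.roll(seq, -w // 2), w)  # windows shifted by w/2

print(regular)   # [[0 1 2 3] [4 5 6 7] [8 9 10 11]]
print(shifted)   # [[2 3 4 5] [6 7 8 9] [10 11 0 1]]
```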