Introduction
Window attention (used in Swin Transformer) divides the image into non-overlapping windows, where each window contains a fixed number of patches. Self-attention is computed only within each window, making computation linear in image size.
Window-based Processing
- Input: Feature map of size H × W
- Window size: M × M patches (typically M = 7)
- Number of windows: (H/M) × (W/M)
- Each window computes self-attention independently (see the sketch below)
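The partition itself is just a reshape. Here is a minimal PyTorch sketch, assuming a (B, H, W, C) feature-map layout with H and W divisible by M (the function name `window_partition` mirrors common Swin reference code, but treat this as an illustration rather than the exact implementation):

```python
import torch

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows.

    Assumes H and W are divisible by M. Returns a tensor of shape
    (B * (H/M) * (W/M), M*M, C): one sequence of M*M patch tokens per window.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)        # split H and W into a window grid
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # (B, H/M, W/M, M, M, C)
    return x.view(-1, M * M, C)                   # each window becomes a token sequence
```

Self-attention is then applied to each M·M-token sequence independently, e.g. by folding the window dimension into the batch dimension.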
Computational Efficiency
| Method | Complexity |
|---|---|
| Global attention (ViT) | O((HW)²) |
| Window attention (Swin) | O(HW·M²) = O(HW) since M is constant |
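To make the gap concrete, here is a back-of-the-envelope comparison of the attention-score term for a Stage-1-sized feature map (hypothetical numbers for illustration; constant factors and the linear projections are ignored):

```python
H, W, M = 56, 56, 7               # Stage 1 map of a 224x224 input, window size 7

n_tokens = H * W                  # 3136 patch tokens
global_cost = n_tokens ** 2       # O((HW)^2): every token attends to every token
window_cost = n_tokens * M * M    # O(HW * M^2): each token attends within its window

print(global_cost)                # 9834496
print(window_cost)                # 153664
print(global_cost / window_cost)  # 64.0, i.e. a factor of HW / M^2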
Window Attention in Swin
- Stage 1: H/4 × W/4 feature map, window size 7
- Each window: 7×7 = 49 patches
- Stages 2-4: resolution halves at each stage (H/8 × W/8, H/16 × W/16, H/32 × W/32) while windows remain 7×7 (see the worked example below)
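As a worked example, assume a 224×224 input, so the four stages produce 56×56, 28×28, 14×14, and 7×7 feature maps; the window count shrinks while the window size stays fixed:

```python
M = 7
stage_resolutions = [(56, 56), (28, 28), (14, 14), (7, 7)]  # stages 1-4, 224x224 input

for i, (H, W) in enumerate(stage_resolutions, start=1):
    n_windows = (H // M) * (W // M)
    print(f"Stage {i}: {H}x{W} map -> windows: {n_windows}, patches per window: {M * M}")
# Stage 1: 56x56 map -> windows: 64, patches per window: 49
# Stage 2: 28x28 map -> windows: 16, patches per window: 49
# Stage 3: 14x14 map -> windows: 4, patches per window: 49
# Stage 4: 7x7 map -> windows: 1, patches per window: 49
```

At the final stage a single window covers the whole map, so window attention there is effectively global.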
Shifted Window Mechanism
To enable cross-window communication, Swin alternates between two window configurations in consecutive blocks:
- Even blocks: regular window partition
- Odd blocks: windows shifted by (⌊M/2⌋, ⌊M/2⌋) patches (3 for M = 7)
- This creates connections between adjacent windows (see the cyclic-shift sketch below)
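In practice the shift is implemented as a cyclic roll of the feature map before the regular partition, paired with an attention mask for the wrapped-around regions. A minimal sketch of the roll step, assuming the same (B, H, W, C) layout as above (mask omitted for brevity):

```python
import torch

def cyclic_shift(x: torch.Tensor, M: int) -> torch.Tensor:
    """Cyclically shift a (B, H, W, C) feature map by floor(M/2) in both
    spatial dimensions, so the regular window grid now straddles the
    previous block's window boundaries. Swin additionally masks attention
    between patches that wrap around opposite edges; that mask is not
    shown here.
    """
    s = M // 2
    return torch.roll(x, shifts=(-s, -s), dims=(1, 2))
```

After attention, the map is rolled back by (+s, +s) to restore the original spatial layout.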
Benefits
- Linear complexity: O(HW) instead of O((HW)²)
- Cross-window connections: Shifted windows enable information flow
- Hierarchical: reduced resolution at each stage yields a progressively larger receptive field