43. Window Attention

Introduction

Window attention (used in Swin Transformer) divides the image into non-overlapping windows, where each window contains a fixed number of patches. Self-attention is computed only within each window, making computation linear in image size.

Window-based Processing

Input: Feature map of size H × W

Window size: M × M patches (typically M=7)

Number of windows: (H/M) × (W/M), assuming H and W are divisible by M (the feature map is padded otherwise)

Each window computes self-attention independently
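
The partitioning above can be sketched with a couple of reshapes. This is a minimal numpy illustration (the function name `window_partition` and the 28×28×96 example dimensions are my own, not from the text): each M×M window becomes one group of M² tokens that attend only to each other.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping M x M windows.

    Returns shape (num_windows, M*M, C); each group of M*M rows holds the
    patch features that attend to one another. Assumes M divides H and W.
    """
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)    # split both spatial axes into (blocks, M)
    windows = x.transpose(0, 2, 1, 3, 4)      # gather the two block axes together
    return windows.reshape(-1, M * M, C)      # (num_windows, tokens_per_window, C)

# 28 x 28 feature map, 96 channels, window size 7 -> (28/7) * (28/7) = 16 windows
feat = np.zeros((28, 28, 96), dtype=np.float32)
wins = window_partition(feat, 7)
print(wins.shape)  # (16, 49, 96)
```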

Computational Efficiency

Method                    Complexity
Global attention (ViT)    O((HW)²)
Window attention (Swin)   O(HW·M²) = O(HW), since M is constant
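
The gap between the two rows can be made concrete by counting attention score entries (token pairs) for a 56×56 feature map with M = 7; the helper names below are mine, for illustration only.

```python
def global_attn_pairs(H, W):
    # Every one of the H*W tokens attends to all H*W tokens: (HW)^2 pairs.
    return (H * W) ** 2

def window_attn_pairs(H, W, M):
    # (H/M) * (W/M) windows, each with M^2 tokens attending within the window:
    # num_windows * (M^2)^2 = HW * M^2 pairs -- linear in HW for fixed M.
    return (H // M) * (W // M) * (M * M) ** 2

H, W, M = 56, 56, 7
print(global_attn_pairs(H, W))     # 9834496
print(window_attn_pairs(H, W, M))  # 153664
```

At this resolution window attention computes about 64× fewer pairs, and the gap widens quadratically as the image grows.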

Window Attention in Swin

Stage 1: H/4 × W/4 feature map, window size 7
Each window: 7×7 = 49 patches

Stages 2-4: Patch merging halves the resolution at each stage; the window size stays at 7
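
For a standard 224×224 input with a patch size of 4 (the Swin-T configuration), the stage-1 feature map is 56×56 and halves at each subsequent stage, so the window count shrinks accordingly. A small sketch (the function name is mine):

```python
def swin_window_counts(img=224, patch=4, M=7, stages=4):
    """Windows per stage for a square input whose resolution halves each stage."""
    res = img // patch              # stage-1 feature map side: 224 / 4 = 56
    counts = []
    for _ in range(stages):
        counts.append((res // M) ** 2)  # (res/M)^2 windows of M x M patches
        res //= 2                       # patch merging halves the resolution
    return counts

print(swin_window_counts())  # [64, 16, 4, 1]
```

By the last stage a single 7×7 window covers the whole feature map, so attention there is effectively global.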

Shifted Window Mechanism

To enable cross-window communication, Swin alternates between regular and shifted window partitions in consecutive layers:

Even layers: Regular window partition
Odd layers: Windows shifted by (⌊M/2⌋, ⌊M/2⌋) patches

This creates connections between adjacent windows.
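
In practice the shifted partition is implemented with a cyclic shift of the feature map (`torch.roll` in the original code): regular M×M partitioning of the rolled map is equivalent to shifted-window partitioning of the original. A numpy sketch of just the roll, with illustrative sizes of my choosing:

```python
import numpy as np

M = 7
shift = M // 2  # shift by floor(M/2) = 3

x = np.arange(14 * 14).reshape(14, 14)  # toy 14 x 14 feature map

# Cyclically shift the map up and to the left; a regular window partition of
# `shifted` now corresponds to the shifted-window partition of `x`.
shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))

# Rolling back recovers the original map exactly.
restored = np.roll(shifted, shift=(shift, shift), axis=(0, 1))
print(np.array_equal(restored, x))  # True
```

Because the roll wraps tokens around the image border, Swin additionally applies an attention mask so that wrapped-around tokens do not attend across the original boundary.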

Benefits

  • Complexity linear in image size (H×W), rather than quadratic as in global attention
  • Shifted windows restore cross-window information flow at little extra cost
  • Local windows provide a locality bias that suits hierarchical vision backbones

Test Your Understanding

Question 1: Window attention divides the image into:

  • A) Overlapping windows
  • B) Non-overlapping windows
  • C) Random windows
  • D) Single large window

Question 2: Complexity of window attention with window size M is:

  • A) O((HW)²)
  • B) O(HW·M²)
  • C) O(M²)
  • D) O(1)

Question 3: Swin uses what window size?

  • A) 3×3
  • B) 7×7
  • C) 16×16
  • D) 32×32

Question 4: Shifted windows are used to:

  • A) Make computation slower
  • B) Enable cross-window information flow
  • C) Increase window size
  • D) Reduce accuracy

Question 5: Number of windows on H×W feature map with window size M is:

  • A) H×W
  • B) (H/M)×(W/M)
  • C) M×M
  • D) (H×W)/M

Question 6: Shifted window partition is applied at:

  • A) Every layer
  • B) Odd-numbered layers
  • C) Even-numbered layers
  • D) First layer only

Question 7: When window is shifted by M/2, the shift distance is:

  • A) M
  • B) M/2
  • C) 0
  • D) 2M

Question 8: Window attention makes computation linear in:

  • A) Window size only
  • B) Image resolution (H×W)
  • C) Both constant
  • D) Batch size