31. Sliding Window Attention

Introduction

Sliding window attention (also called local attention) is a mechanism where each position attends only to a fixed-size window of neighboring positions. The term often refers specifically to the symmetric left-right context used in models such as Swin Transformer.

How It Works

For window size w, position i attends to positions:

[i - w/2, i + w/2]

Complexity: O(n·w) instead of O(n²)
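As a concrete sketch, the NumPy mask below (a hypothetical helper, not from any particular library) allows each position at most w + 1 keys, which is where the O(n·w) cost comes from:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask: position i may attend to [i - w//2, i + w//2]."""
    idx = np.arange(n)
    # |i - j| <= w//2 defines the symmetric window around each position
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

mask = sliding_window_mask(n=8, w=4)
print(mask.sum(axis=1))  # each row allows at most w + 1 positions
```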

Symmetric vs Asymmetric

Symmetric (Swin-style)

Each position attends to w/2 neighbors on each side:

Position i attends to:
max(0, i-w/2) to min(n-1, i+w/2)

Asymmetric (some GPT variants)

Each position attends to up to w preceding positions (including itself) but never to future positions (causal):

Position i attends to:
max(0, i-w+1) to i
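A small sketch of the two range computations (function names are illustrative, not from any library):

```python
def symmetric_range(i, w, n):
    """Swin-style: w//2 neighbors on each side, clipped to the sequence."""
    return max(0, i - w // 2), min(n - 1, i + w // 2)

def causal_range(i, w):
    """GPT-style: the w most recent positions, including i itself."""
    return max(0, i - w + 1), i

print(symmetric_range(10, 8, 100))  # (6, 14)
print(causal_range(10, 8))          # (3, 10)
```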

Multi-Stage Windows

Models like Swin use hierarchical windows across stages:

Stage   Window Size   Feature Map
1       7×7           56×56
2       7×7           28×28
3       7×7           14×14
4       7×7           7×7
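Because the window stays 7×7 while the feature map halves at each stage, the window count shrinks until stage 4, where a single window covers the whole 7×7 map and attention is effectively global. A quick check using the values above:

```python
for stage, fmap in enumerate([56, 28, 14, 7], start=1):
    n_windows = (fmap // 7) ** 2  # non-overlapping 7x7 windows per stage
    print(f"stage {stage}: {fmap}x{fmap} map -> {n_windows} windows of 7x7")
# stage 4 has a single window, i.e. full (global) self-attention
```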

Computation Pattern

For window size w:

┌──────────────────────────────┐
│ Window 0: positions 0..w-1   │
│ Window 1: positions w..2w-1  │
│ Window 2: positions 2w..3w-1 │
│ ...                          │
└──────────────────────────────┘

Each window computes self-attention internally.
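A minimal NumPy sketch of this pattern, assuming n is divisible by w and using x directly as queries, keys, and values (the learned projections of a real attention layer are omitted):

```python
import numpy as np

def windowed_self_attention(x, w):
    """Split a (n, d) sequence into n//w windows and attend within each."""
    n, d = x.shape
    windows = x.reshape(n // w, w, d)                  # (num_windows, w, d)
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)              # softmax per window row
    return (probs @ windows).reshape(n, d)

out = windowed_self_attention(np.random.randn(12, 4), w=4)
print(out.shape)  # (12, 4)
```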

Extension: Shifted Window

Swin Transformer alternates regular and shifted windows in consecutive blocks:

Block l:   windows aligned to the grid
Block l+1: windows shifted by w/2

This enables cross-window connections.
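The shift itself is just a cyclic roll of the feature map; a minimal 1-D sketch is below. Real implementations such as Swin also mask attention between tokens that wrap around after the roll, which is omitted here:

```python
import numpy as np

def shift_windows(x, w):
    """Cyclically shift by w // 2 so new windows straddle the old boundaries."""
    return np.roll(x, shift=-(w // 2), axis=0)

x = np.arange(8)
print(shift_windows(x, w=4))  # [2 3 4 5 6 7 0 1]
```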

Test Your Understanding

Question 1: Sliding window attention complexity is:

  • A) O(n²)
  • B) O(n·w)
  • C) O(w²)
  • D) O(n)

Question 2: With window size w=8, position 10 attends to positions:

  • A) 0 to 10
  • B) 6 to 14
  • C) 10 only
  • D) 0 to 80

Question 3: What does shifted window enable?

  • A) Larger windows
  • B) Cross-window connections
  • C) Smaller windows
  • D) No benefit

Question 4: Swin Transformer uses window size:

  • A) 3×3
  • B) 7×7
  • C) 16×16
  • D) 32×32

Question 5: Hierarchical windows across stages help capture:

  • A) Only local patterns
  • B) Larger receptive fields through stacking
  • C) Random patterns
  • D) No context

Question 6: Symmetric sliding window has w/2 neighbors:

  • A) On left only
  • B) On right only
  • C) On each side
  • D) No neighbors

Question 7: Multi-stage sliding window is similar to:

  • A) Single layer CNN
  • B) Hierarchical feature extraction
  • C) Recurrent network
  • D) Fully connected network

Question 8: Position 5 with w=4 in symmetric window attends to:

  • A) {1, 2, 3, 4, 5}
  • B) {3, 4, 5, 6, 7}
  • C) {1, 2, 3, 5}
  • D) {5}