Introduction
Sliding window attention (also called local attention) is a mechanism where each position attends only to a fixed-size window of neighboring positions. The term typically refers to a symmetric left-right context around each position; models like Swin Transformer apply a closely related windowing idea over 2D feature maps.
How It Works
For window size w at position i:
Attend to positions: [i-w/2, i+w/2]
Complexity: O(n·w) instead of O(n²)
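As a rough sketch (not taken from any particular library; the function name is illustrative), sliding window attention can be computed by slicing the keys and values to each query's window, so the per-query cost is O(w) rather than O(n):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(q, k, v, w):
    """Naive sliding window attention over a 1D sequence.

    q, k, v: arrays of shape (n, d). Each position i attends only to
    positions [i - w//2, i + w//2], so total work is O(n * w * d).
    """
    n, d = q.shape
    out = np.zeros_like(v)
    half = w // 2
    for i in range(n):
        lo = max(0, i - half)         # clip window at sequence start
        hi = min(n, i + half + 1)     # clip window at sequence end
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)   # (hi - lo,) scores
        out[i] = softmax(scores) @ v[lo:hi]
    return out

# Tiny usage example
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(sliding_window_attention(q, k, v, w=4).shape)  # (16, 8)
```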
Symmetric vs Asymmetric
Symmetric (Swin-style)
Each position attends to w/2 neighbors on each side:
Position i attends to:
max(0, i-w/2) to min(n-1, i+w/2)
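A minimal sketch of the symmetric case as a boolean attention mask (the function name is illustrative): position i may attend to j exactly when |i − j| ≤ w/2.

```python
import numpy as np

def symmetric_window_mask(n, w):
    """mask[i, j] is True when position i may attend to position j."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

print(symmetric_window_mask(6, 4).astype(int))
# Row i has ones in columns max(0, i-2) .. min(5, i+2)
```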
Asymmetric (some GPT variants)
Can attend to w previous positions but not future (causal):
Position i attends to:
max(0, i-w+1) to i
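The causal variant, in the same illustrative style: position i attends only to itself and the w − 1 previous positions.

```python
import numpy as np

def causal_window_mask(n, w):
    """mask[i, j] is True when i may attend to j (past only, within the window)."""
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]   # query index minus key index
    return (dist >= 0) & (dist < w)      # no future, at most w-1 positions back

print(causal_window_mask(6, 3).astype(int))
# Row i has ones in columns max(0, i-2) .. i
```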
Multi-Stage Windows
Models like Swin keep the window size fixed while the feature map shrinks across stages, so each window covers a progressively larger portion of the input:
| Stage | Window Size | Feature Map |
|---|---|---|
| 1 | 7×7 | 56×56 |
| 2 | 7×7 | 28×28 |
| 3 | 7×7 | 14×14 |
| 4 | 7×7 | 7×7 |
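To see why a fixed 7×7 window still gives a hierarchy, here is a back-of-the-envelope calculation assuming Swin-T defaults (224×224 input, 4×4 patch embedding, 2× downsampling between stages, so strides of 4, 8, 16, 32 pixels per feature-map cell):

```python
# Effective window coverage in input pixels, assuming Swin-T defaults.
window = 7
strides = [4, 8, 16, 32]                 # pixels per feature-map cell at each stage
for stage, stride in enumerate(strides, start=1):
    coverage = window * stride           # window extent in input pixels
    print(f"stage {stage}: 7x7 window covers {coverage}x{coverage} pixels")
# stage 1: 28x28, stage 2: 56x56, stage 3: 112x112, stage 4: 224x224
```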
Computation Pattern
For window size w:
┌──────────────────────────────┐
│ Window 0: positions 0..w-1   │
│ Window 1: positions w..2w-1  │
│ Window 2: positions 2w..3w-1 │
│ ...                          │
└──────────────────────────────┘
Each window computes self-attention internally
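A sketch of that partitioned pattern (names illustrative): the sequence is split into non-overlapping windows of length w and standard self-attention runs inside each window independently, assuming n is a multiple of w.

```python
import numpy as np

def windowed_self_attention(q, k, v, w):
    """Non-overlapping window attention: each length-w block attends only to itself.

    q, k, v: (n, d) with n divisible by w. Cost is O((n / w) * w^2 * d) = O(n * w * d).
    """
    n, d = q.shape
    assert n % w == 0, "sketch assumes n is a multiple of the window size"
    qw = q.reshape(n // w, w, d)   # (num_windows, w, d)
    kw = k.reshape(n // w, w, d)
    vw = v.reshape(n // w, w, d)
    scores = qw @ kw.transpose(0, 2, 1) / np.sqrt(d)   # (num_windows, w, w)
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs @ vw).reshape(n, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((12, 8)) for _ in range(3))
print(windowed_self_attention(q, k, v, w=4).shape)  # (12, 8)
```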
Key Properties
- Local receptive field: w neighbors per position
- Linear complexity: O(n·w)
- Hierarchical propagation: stacking windowed layers or stages lets information reach beyond a single window
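To make the propagation point concrete, a back-of-the-envelope sketch (the exact constant is an assumption, not a result from the source): with a causal window of size w, each layer can move information at most w − 1 positions, so after L layers a token can be influenced by roughly L·(w − 1) earlier positions.

```python
# Rough effective context after stacking layers of causal window attention:
# each layer moves information at most (w - 1) positions in a single hop.
def effective_context(window, num_layers):
    return num_layers * (window - 1)

print(effective_context(window=4096, num_layers=32))  # 131040 positions
```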
Extension: Shifted Window
Swin Transformer alternates between regular and shifted window partitions in consecutive blocks:
Block l: windows aligned to the grid
Block l+1: windows shifted by w/2
This enables cross-window connections.
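A 1D sketch of the shift (not Swin's actual 2D implementation; names are illustrative): rolling the sequence by w/2 before partitioning makes the shifted windows straddle the boundaries of the regular ones, so information can cross window edges.

```python
import numpy as np

def window_partition(x, w):
    """Split a length-n sequence into non-overlapping windows of size w."""
    return x.reshape(-1, w)

seq = np.arange(12)
w = 4

regular = window_partition(seq, w)                    # windows aligned to the grid
shifted = window_partition(np.roll(seq, -w // 2), w)  # windows shifted by w/2

print(regular)   # [[0 1 2 3] [4 5 6 7] [8 9 10 11]]
print(shifted)   # [[2 3 4 5] [6 7 8 9] [10 11 0 1]]
```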