Introduction
Swin Transformer (Shifted WINdow Transformer) is a hierarchical vision transformer that uses window-based self-attention with shifted windows. It achieves complexity linear in image size by computing self-attention within local windows, while the shifted windows provide cross-window connections.
Architecture Overview
Stages 1-4:
Each stage: Patch Merging (stage 1 uses a linear patch embedding instead) → N blocks alternating Window Attention and Shifted Window Attention (see the patch-merging sketch after this list)
Resolution reduction: H/4 → H/8 → H/16 → H/32
Window size: 7×7 (constant across stages)
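A minimal sketch of the patch-merging step that builds this hierarchy, assuming PyTorch; the module and variable names are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patches and project 4C -> 2C,
    halving spatial resolution while doubling channel width."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]                 # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)                   # stage-1 feature map (H/4 resolution, C=96)
print(PatchMerging(96)(x).shape)                 # torch.Size([1, 28, 28, 192])
```

Each merge halves H and W and doubles C, which is what produces the H/4 → H/8 → H/16 → H/32 pyramid above.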
Shifted Window Attention Detail
Each stage has N transformer blocks (N is even) with alternating window partitions; see the window-partition sketch after this list:
Block 1: Window attention with regular partition
Block 2: Window attention with shifted partition (windows displaced by ⌊M/2⌋ patches in each direction)
Block 3: Back to regular
Block 4: Shifted
...
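Both block types rearrange the feature map into non-overlapping M×M windows before attention. A minimal partition/reverse sketch, assuming PyTorch (function names are illustrative):

```python
import torch

def window_partition(x, M):
    """(B, H, W, C) -> (num_windows*B, M, M, C), non-overlapping MxM windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

def window_reverse(windows, M, H, W):
    """Inverse of window_partition."""
    B = windows.shape[0] // ((H // M) * (W // M))
    x = windows.view(B, H // M, W // M, M, M, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

x = torch.randn(1, 56, 56, 96)
w = window_partition(x, 7)               # (64, 7, 7, 96): an 8x8 grid of 7x7 windows
assert torch.allclose(window_reverse(w, 7, 56, 56), x)
```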
Efficient Batch Computation
When windows are shifted, the number of windows increases and some become smaller than M×M. Swin handles this with an efficient batched computation plus masking:
Naive: the shifted partition produces (⌈H/M⌉ + 1) × (⌈W/M⌉ + 1) windows, up from ⌈H/M⌉ × ⌈W/M⌉
Efficient: cyclic-shift (roll) the feature map toward the top-left, so the window count stays ⌈H/M⌉ × ⌈W/M⌉
Compute attention over all windows in a single batch
Apply a mask so tokens originating from different sub-windows cannot attend to each other (see the sketch after this list)
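A sketch of the cyclic shift and the attention mask it requires, assuming PyTorch; the sizes and variable names below are illustrative:

```python
import torch

H = W = 14
M, shift = 7, 3                                   # window size, shift = M // 2

# 1. Cyclic shift toward the top-left; the window count stays (H/M) * (W/M).
x = torch.randn(1, H, W, 8)
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# 2. Label each position by the original sub-region it came from.
img_mask = torch.zeros(1, H, W, 1)
cnt = 0
for h in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
    for w in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
        img_mask[:, h, w, :] = cnt
        cnt += 1

# 3. Within each MxM window, allow attention only between equal labels.
mask_windows = img_mask.view(1, H // M, M, W // M, M, 1) \
                       .permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0))  # added to attention logits
print(attn_mask.shape)                            # (num_windows, M*M, M*M)
```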
Relative Position Bias
Swin uses relative position bias in attention:
Attention(Q,K,V) = softmax(QKᵀ/√d + B)V
B ∈ ℝ^{M²×M²} is a learned relative position bias shared across windows; its entries are taken from a smaller learned table of size (2M−1) × (2M−1), indexed by the relative offset between each pair of positions in the window (see the sketch below)
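A sketch of how the M²×M² bias B can be gathered from the smaller learned table, assuming PyTorch and a single attention head (names are illustrative, not the reference code):

```python
import torch
import torch.nn as nn

M = 7
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, 1))   # learned table, one head

coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
coords = coords.flatten(1)                                     # (2, M*M) per-token (y, x)
rel = coords[:, :, None] - coords[:, None, :]                  # (2, M*M, M*M) pairwise offsets
rel = rel.permute(1, 2, 0) + (M - 1)                           # shift offsets to start at 0
index = rel[..., 0] * (2 * M - 1) + rel[..., 1]                # (M*M, M*M) flat table index

B = bias_table[index.view(-1)].view(M * M, M * M)              # added to QKᵀ/√d per window
print(B.shape)                                                 # torch.Size([49, 49])
```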
Comparison with ViT
| Aspect | ViT | Swin |
|---|---|---|
| Attention | Global (all patches) | Window + shifted |
| Complexity | O(n²) | O(n) |
| Structure | Single scale | Hierarchical |
| Position bias | Learned 1D | Learned 2D relative |
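To make the complexity row concrete, a back-of-envelope count of attention pairs for a 224×224 input with 4×4 patches and 7×7 windows (an illustrative comparison at the same token resolution, not a benchmark):

```python
# Count pairwise attention interactions at the finest resolution (56x56 = 3136 tokens).
n = (224 // 4) ** 2            # tokens if global attention were applied at 4x4-patch resolution
M2 = 7 * 7                     # tokens per window
global_pairs = n * n           # ViT-style global attention
window_pairs = n * M2          # Swin: each token attends only within its 7x7 window
print(global_pairs, window_pairs, global_pairs / window_pairs)   # ~64x fewer pairs
```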