44. Swin Transformer Attention

Introduction

Swin Transformer (Shifted WINdow Transformer) is a hierarchical vision transformer that computes self-attention within local, non-overlapping windows and alternates between regular and shifted window partitions. This keeps computational complexity linear with respect to image size while still allowing information to flow across window boundaries.
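As a quick illustration of why the cost is linear, here is a minimal PyTorch sketch (function name hypothetical) of the window-partition step; attention is then computed independently inside each fixed-size window, so total cost scales with the number of windows:

```python
import torch

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """Split (B, H, W, C) into (num_windows * B, M, M, C) non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

x = torch.randn(1, 56, 56, 96)       # stage-1 feature map for a 224x224 input
print(window_partition(x, 7).shape)  # (64, 7, 7, 96): an 8x8 grid of windows
```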

Architecture Overview

Stages 1-4:
Each stage: Patch Merging → (Window Attention → Shifted Window Attention) × N, with the two block types alternating
(Stage 1 replaces Patch Merging with patch partition + linear embedding)

Resolution reduction: H/4 → H/8 → H/16 → H/32

Window size: 7×7 (M = 7, constant across all stages)
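A minimal sketch of the patch-merging downsampling between stages (module layout assumed, following the paper's description): each 2×2 group of neighboring patches is concatenated to 4C channels, normalized, and linearly projected to 2C, halving the resolution while doubling the channel width:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsample (B, H, W, C) -> (B, H/2, W/2, 2C)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x0 = x[:, 0::2, 0::2, :]  # top-left of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]  # bottom-left
        x2 = x[:, 0::2, 1::2, :]  # top-right
        x3 = x[:, 1::2, 1::2, :]  # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)
```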

Shifted Window Attention Detail

Each stage stacks transformer blocks that alternate between the two window partitions:

Block 1: Window attention with regular partition (W-MSA)
Block 2: Window attention with partition shifted by (⌊M/2⌋, ⌊M/2⌋) (SW-MSA)
Block 3: Back to regular
Block 4: Shifted
...
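In implementations, the shifted partition is usually realized by cyclically rolling the feature map before the regular window partition, then rolling back afterward. A minimal PyTorch sketch (tensor shapes assumed for a 224×224 input at stage 1):

```python
import torch

M = 7
shift = M // 2                  # Swin shifts by floor(M/2) = 3 for M = 7
x = torch.randn(1, 56, 56, 96)  # (B, H, W, C) stage-1 feature map

# SW-MSA blocks: displace the window grid by (shift, shift) via a cyclic
# roll, so the regular partition now cuts across the old window borders.
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
# ... regular window partition + masked attention on `shifted` ...
x = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))  # undo the shift
```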

Efficient Batch Computation

For shifted windows, a naive partition increases the window count and produces windows smaller than M×M. Swin avoids this with cyclic shifting plus attention masking:

Naive: (⌈H/M⌉ + 1) × (⌈W/M⌉ + 1) windows after shifting, some irregularly sized

Efficient: Cyclic-shift (roll) the feature map toward the top-left, keeping the regular ⌈H/M⌉ × ⌈W/M⌉ grid
Compute attention for all windows in one parallel batch
Apply a mask so tokens from different pre-shift regions cannot attend to each other, then reverse the shift (see the sketch below)
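A sketch of the mask construction under these assumptions (helper name hypothetical): each token is labeled by the region it occupied before the cyclic shift, the label map is partitioned into windows, and pairs with mismatched labels receive a large negative value that is added to the attention logits:

```python
import torch

def shifted_window_mask(H: int, W: int, M: int, shift: int) -> torch.Tensor:
    """Per-window additive mask blocking attention between tokens that
    originate from different regions of the un-shifted feature map."""
    img = torch.zeros(1, H, W, 1)
    regions = (slice(0, -M), slice(-M, -shift), slice(-shift, None))
    label = 0
    for h in regions:
        for w in regions:
            img[:, h, w, :] = label  # label each contiguous source region
            label += 1
    # Partition the label map into M x M windows, then compare labels pairwise.
    win = img.view(1, H // M, M, W // M, M, 1)
    win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)
    diff = win.unsqueeze(1) - win.unsqueeze(2)
    return diff.masked_fill(diff != 0, -100.0)  # ~ -inf before the softmax

mask = shifted_window_mask(56, 56, M=7, shift=3)
print(mask.shape)  # (64, 49, 49): one mask per window
```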

Relative Position Bias

Swin uses relative position bias in attention:

Attention(Q,K,V) = softmax(QKᵀ/√d + B)V

B ∈ ℝ^{M²×M²} is the relative position bias added per window; its entries are gathered from a smaller learned table B̂ ∈ ℝ^{(2M−1)×(2M−1)}, since relative offsets along each axis range over [−M+1, M−1]
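A sketch of how B is built (single attention head assumed, PyTorch): the (2M−1)² learned scalars are gathered into the M²×M² bias matrix using precomputed relative-offset indices, following the construction in the paper:

```python
import torch
import torch.nn as nn

M = 7
# One learned scalar per possible 2D relative offset; 1 attention head here.
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, 1))

# For every pair of positions in an M x M window, compute the index of
# their relative offset into the table.
coords = torch.stack(torch.meshgrid(
    torch.arange(M), torch.arange(M), indexing="ij")).flatten(1)  # (2, M²)
rel = coords[:, :, None] - coords[:, None, :]                     # (2, M², M²)
rel = rel.permute(1, 2, 0) + (M - 1)                              # offsets >= 0
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]                 # (M², M²)

B = bias_table[index.view(-1)].view(M * M, M * M, -1)  # (M², M², heads)
```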

Comparison with ViT

Aspect        | ViT                  | Swin
--------------|----------------------|--------------------------
Attention     | Global (all patches) | Window + shifted window
Complexity    | O(n²)                | O(n)
Structure     | Single scale         | Hierarchical
Position bias | Learned 1D absolute  | Learned 2D relative

(n = number of patch tokens)
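To make the complexity row concrete: the paper gives per-layer costs Ω(MSA) = 4hwC² + 2(hw)²C for global attention and Ω(W-MSA) = 4hwC² + 2M²hwC for window attention. A quick check of the attention-term ratio at stage-1 resolution (C = 96 as in Swin-T):

```python
# Ratio of the attention terms: 2*(h*w)**2*C vs 2*M**2*h*w*C -> (h*w)/M**2
h = w = 56  # stage-1 token grid for a 224x224 input
M, C = 7, 96
global_term = 2 * (h * w) ** 2 * C
window_term = 2 * M ** 2 * (h * w) * C
print(global_term / window_term)  # 64.0: the quadratic term shrinks 64x
```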

Test Your Understanding

Question 1: Swin stands for:

  • A) Switch Window
  • B) Shifted WINdow Transformer
  • C) Swivel Window
  • D) Swinging Window

Question 2: Swin achieves O(n) complexity by using:

  • A) Global attention
  • B) Window attention + shifted windows
  • C) No attention
  • D) Full attention

Question 3: Swin has how many stages?

  • A) 1
  • B) 2
  • C) 4
  • D) 8

Question 4: Window size in Swin is:

  • A) 3×3
  • B) 7×7
  • C) 16×16
  • D) Variable

Question 5: Swin uses relative position bias in:

  • A) Attention computation
  • B) Patch embedding
  • C) No position bias
  • D) Only absolute positions

Question 6: Shifted window partition enables:

  • A) Smaller windows
  • B) Cross-window information exchange
  • C) No information exchange
  • D) Slower computation

Question 7: Resolution reduction in Swin happens via:

  • A) Strided convolution
  • B) Patch merging
  • C) Pooling only
  • D) No reduction

Question 8: Compared to ViT, Swin has:

  • A) Same single-scale structure
  • B) Hierarchical structure with decreasing resolution
  • C) Higher complexity
  • D) Worse performance