41. Vision Attention

Introduction

Vision attention refers to attention mechanisms applied to image data. Unlike text, where the input is already a sequence of discrete tokens, images must first be mapped into a form attention can operate on: either by treating image patches as tokens, or by using spatial and channel attention mechanisms designed for 2D feature maps.

Image as Sequence of Patches

Most modern vision transformers convert images to patch sequences:

Input: Image of size H × W × C

Patchify: Divide into p × p patches
Number of patches: N = (H·W) / p²

Each patch is flattened and linearly projected to the embedding dimension
The length-N sequence of patch embeddings becomes the "token" sequence (a minimal sketch follows below)
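A minimal patch-embedding sketch in PyTorch (the module name PatchEmbed and the specific sizes are illustrative assumptions, not from this text). A convolution with kernel size and stride both equal to p is equivalent to flattening each p × p patch and applying a shared linear projection:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Split an image into non-overlapping p x p patches and project each one to an embedding vector.
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2        # N = (H * W) / p^2
        # Conv with kernel = stride = p: "flatten each patch, then apply a linear projection"
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                         # x: (B, C, H, W)
        x = self.proj(x)                          # (B, embed_dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)       # (B, N, embed_dim): the "token" sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                               # torch.Size([1, 196, 768])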

Attention Types in Vision

1. Spatial Attention

Focus on specific regions of the image:

Attention weights are computed across spatial positions ("where to look")
Similar to text self-attention, but operating over the 2D grid of image locations (see the sketch below)
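A minimal sketch of one common form of spatial attention (a CBAM-style per-position gate; this particular design is an assumption here, not something defined above). It pools over channels to build an H × W attention map and rescales each spatial position:

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Learn a per-position gate over the H x W grid: "where should the network look?"
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                   # x: (B, C, H, W)
        avg_map = x.mean(dim=1, keepdim=True)               # (B, 1, H, W): average over channels
        max_map = x.amax(dim=1, keepdim=True)               # (B, 1, H, W): max over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                                     # emphasize informative spatial positions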

2. Channel Attention

Focus on what features to emphasize:

Squeeze-and-Excitation (SE): channel-wise attention

z = GAP(x) → FC → ReLU → FC → sigmoid → per-channel scale

3. Self-Attention over Patches

ViT-style: treat each patch as a token:

Patch 1, Patch 2, ..., Patch N
Each patch attends to all other patches
Captures long-range dependencies across the entire image (see the sketch below)
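A hedged sketch of global self-attention over patch tokens using PyTorch's nn.MultiheadAttention; the dimensions below mirror a ViT-Base-like setup (196 patches, 768-dim embeddings, 12 heads) and are only illustrative:

import torch
import torch.nn as nn

embed_dim, num_heads, num_patches = 768, 12, 196
tokens = torch.randn(1, num_patches, embed_dim)   # (B, N, D) patch tokens from the patchify step

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
out, weights = attn(tokens, tokens, tokens)       # queries = keys = values = patch tokens
print(out.shape)                                  # torch.Size([1, 196, 768])
print(weights.shape)                              # torch.Size([1, 196, 196]): every patch attends to every patch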

Challenges in Vision Attention

Full self-attention over patch tokens scales quadratically with the number of patches, so memory and compute blow up for high-resolution images: a 1024×1024 image with 16×16 patches yields 4096 tokens and roughly 16.8 million pairwise attention scores per head. A flat patch sequence also lacks the multi-scale, hierarchical feature maps that convolutional networks provide, and the 2D spatial layout must be reintroduced through positional encodings.

Solutions

Solution           | Method
Swin Transformer   | Window-based local attention + shifted windows
Pyramid ViT        | Hierarchical feature maps with decreasing resolution
Spatial attention  | Focus on spatial regions
Channel attention  | Focus on feature channels (SE-Net)
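A minimal sketch of the window partitioning behind Swin's local attention (window_partition is an illustrative helper written under the assumption that H and W are divisible by the window size; it is not the library's implementation):

import torch

def window_partition(x, window_size):
    # Split the H x W token grid into non-overlapping windows; self-attention is then
    # computed only among the tokens inside each window (local attention).
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
    return windows                                  # (num_windows * B, window_size^2, C)

x = torch.randn(1, 56, 56, 96)                      # a 56 x 56 grid of 96-dim tokens
print(window_partition(x, 7).shape)                 # torch.Size([64, 49, 96]): 64 windows of 49 tokens

Shifting the windows in alternating layers (the "shifted windows" of Swin) lets information flow between neighbouring windows while keeping attention cost linear in the number of tokens.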

Example: SE (Squeeze-and-Excitation) Attention

Input: x ∈ ℝ^{H×W×C}

Squeeze: global average pooling → z ∈ ℝ^{C}

Excitation: FC (C → C/r) → ReLU → FC (C/r → C) → sigmoid → scale ∈ ℝ^{C}

Output: scale ⊙ x (each channel of x multiplied by its learned weight)
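A minimal SE block sketch in PyTorch (the name SEBlock is illustrative; a reduction ratio r = 16 is a common default but is an assumption here):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze spatial information into per-channel statistics, then learn per-channel scaling weights.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)    # excitation FC 1 (C -> C/r)
        self.fc2 = nn.Linear(channels // reduction, channels)    # excitation FC 2 (C/r -> C)

    def forward(self, x):                                        # x: (B, C, H, W)
        z = x.mean(dim=(2, 3))                                   # squeeze: global average pooling -> (B, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))     # excitation: FC -> ReLU -> FC -> sigmoid
        return x * s.view(x.size(0), -1, 1, 1)                   # channel-wise rescaling of the input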

Test Your Understanding

Question 1: In ViT, an image is converted to a sequence by:

  • A) Pixel-by-pixel
  • B) Dividing into patches and flattening each
  • C) Using convolution only
  • D) No conversion needed

Question 2: For a 224×224 image with 16×16 patches, the number of patches is:

  • A) 14
  • B) 196
  • C) 256
  • D) 224

Question 3: Spatial attention focuses on:

  • A) Which channels to emphasize
  • B) Which spatial regions to focus on
  • C) Only corners
  • D) Random positions

Question 4: SE (Squeeze-and-Excitation) attention is:

  • A) Spatial attention
  • B) Channel attention
  • C) Patch attention
  • D) No attention

Question 5: Full self-attention on high-resolution images is challenging because:

  • A) Too few tokens
  • B) Number of patches becomes very large (n² memory)
  • C) Images don't have structure
  • D) Patches are too small

Question 6: Swin Transformer handles the large number of patches by:

  • A) Using full self-attention
  • B) Using window-based local attention + shifted windows
  • C) Ignoring most patches
  • D) Using RNN

Question 7: In SE attention, the "squeeze" operation is:

  • A) Convolution
  • B) Global average pooling
  • C) Max pooling
  • D) Attention computation

Question 8: Vision attention preserves spatial information via:

  • A) Removing position embeddings
  • B) Using positional encodings designed for 2D
  • C) Using random positions
  • D) Ignoring spatial structure