41. Vision Attention

Introduction

Vision attention refers to attention mechanisms applied to image data. Unlike text, where the input is already a sequence of discrete tokens, images must first be mapped into a form attention can operate on: either by treating image patches as tokens, or by using spatial and channel attention mechanisms designed for 2D feature maps.

Image as Sequence of Patches

Most modern vision transformers convert images to patch sequences:

Input: Image of size H × W × C

Patchify: Divide into p × p patches
Number of patches: N = (H·W) / p²

Each patch is flattened and linearly projected to the embedding dimension
The length-N sequence of patch embeddings becomes the "token" sequence (a minimal sketch follows below)
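A minimal patch-embedding sketch in PyTorch (the module name PatchEmbed and the specific sizes are illustrative assumptions, not from this text). A convolution with kernel size and stride both equal to p is equivalent to flattening each p × p patch and applying a shared linear projection:

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Split an image into non-overlapping p x p patches and project each one to an embedding vector.
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2        # N = (H * W) / p^2
        # Conv with kernel = stride = p: "flatten each patch, then apply a linear projection"
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                         # x: (B, C, H, W)
        x = self.proj(x)                          # (B, embed_dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)       # (B, N, embed_dim): the "token" sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                               # torch.Size([1, 196, 768])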

Attention Types in Vision

1. Spatial Attention

Focus on specific regions of the image:

Attention weights are computed across spatial positions ("where to look")
Similar to text self-attention, but operating over the 2D grid of image locations (see the sketch below)
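A minimal sketch of one common form of spatial attention (a CBAM-style per-position gate; this particular design is an assumption here, not something defined above). It pools over channels to build an H × W attention map and rescales each spatial position:

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Learn a per-position gate over the H x W grid: "where should the network look?"
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                   # x: (B, C, H, W)
        avg_map = x.mean(dim=1, keepdim=True)               # (B, 1, H, W): average over channels
        max_map = x.amax(dim=1, keepdim=True)               # (B, 1, H, W): max over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                                     # emphasize informative spatial positions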

2. Channel Attention

Focus on what features to emphasize:

Squeeze-and-Excitation (SE): channel-wise attention

z = GAP(x) → FC → ReLU → FC → sigmoid → per-channel scale

3. Self-Attention over Patches

ViT-style: treat each patch as a token:

Patch 1, Patch 2, ..., Patch N
Each patch attends to all other patches
Captures long-range dependencies across the entire image (see the sketch below)
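A hedged sketch of global self-attention over patch tokens using PyTorch's nn.MultiheadAttention; the dimensions below mirror a ViT-Base-like setup (196 patches, 768-dim embeddings, 12 heads) and are only illustrative:

import torch
import torch.nn as nn

embed_dim, num_heads, num_patches = 768, 12, 196
tokens = torch.randn(1, num_patches, embed_dim)   # (B, N, D) patch tokens from the patchify step

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
out, weights = attn(tokens, tokens, tokens)       # queries = keys = values = patch tokens
print(out.shape)                                  # torch.Size([1, 196, 768])
print(weights.shape)                              # torch.Size([1, 196, 196]): every patch attends to every patch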

Challenges in Vision Attention

Full self-attention over patch tokens scales quadratically with the number of patches, so memory and compute blow up for high-resolution images: a 1024×1024 image with 16×16 patches yields 4096 tokens and roughly 16.8 million pairwise attention scores per head. A flat patch sequence also lacks the multi-scale, hierarchical feature maps that convolutional networks provide, and the 2D spatial layout must be reintroduced through positional encodings.

Solutions

Solution           | Method
Swin Transformer   | Window-based local attention + shifted windows
Pyramid ViT        | Hierarchical feature maps with decreasing resolution
Spatial attention  | Focus on spatial regions
Channel attention  | Focus on feature channels (SE-Net)
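A minimal sketch of the window partitioning behind Swin's local attention (window_partition is an illustrative helper written under the assumption that H and W are divisible by the window size; it is not the library's implementation):

import torch

def window_partition(x, window_size):
    # Split the H x W token grid into non-overlapping windows; self-attention is then
    # computed only among the tokens inside each window (local attention).
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
    return windows                                  # (num_windows * B, window_size^2, C)

x = torch.randn(1, 56, 56, 96)                      # a 56 x 56 grid of 96-dim tokens
print(window_partition(x, 7).shape)                 # torch.Size([64, 49, 96]): 64 windows of 49 tokens

Shifting the windows in alternating layers (the "shifted windows" of Swin) lets information flow between neighbouring windows while keeping attention cost linear in the number of tokens.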

Example: SE (Squeeze-and-Excitation) Attention

Input: x ∈ ℝ^{H×W×C}

Squeeze: global average pooling → z ∈ ℝ^{C}

Excitation: FC (C → C/r) → ReLU → FC (C/r → C) → sigmoid → scale ∈ ℝ^{C}

Output: scale ⊙ x (each channel of x multiplied by its learned weight)
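A minimal SE block sketch in PyTorch (the name SEBlock is illustrative; a reduction ratio r = 16 is a common default but is an assumption here):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze spatial information into per-channel statistics, then learn per-channel scaling weights.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)    # excitation FC 1 (C -> C/r)
        self.fc2 = nn.Linear(channels // reduction, channels)    # excitation FC 2 (C/r -> C)

    def forward(self, x):                                        # x: (B, C, H, W)
        z = x.mean(dim=(2, 3))                                   # squeeze: global average pooling -> (B, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))     # excitation: FC -> ReLU -> FC -> sigmoid
        return x * s.view(x.size(0), -1, 1, 1)                   # channel-wise rescaling of the input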

Test Your Understanding

Question 1: In ViT, an image is converted to a sequence by:

  • A) Pixel-by-pixel
  • B) Dividing into patches and flattening each
  • C) Using convolution only
  • D) No conversion needed

Question 2: For a 224×224 image with 16×16 patches, the number of patches is:

  • A) 14
  • B) 196
  • C) 256
  • D) 224

Question 3: Spatial attention focuses on:

  • A) Which channels to emphasize
  • B) Which spatial regions to focus on
  • C) Only corners
  • D) Random positions

Question 4: SE (Squeeze-and-Excitation) attention is:

  • A) Spatial attention
  • B) Channel attention
  • C) Patch attention
  • D) No attention

Question 5: Full self-attention on high-resolution images is challenging because:

  • A) Too few tokens
  • B) Number of patches becomes very large (n² memory)
  • C) Images don't have structure
  • D) Patches are too small

Question 6: Swin Transformer handles the large number of patches by:

  • A) Using full self-attention
  • B) Using window-based local attention + shifted windows
  • C) Ignoring most patches
  • D) Using RNN

Question 7: In SE attention, the "squeeze" operation is:

  • A) Convolution
  • B) Global average pooling
  • C) Max pooling
  • D) Attention computation

Question 8: Vision attention preserves spatial information via:

  • A) Removing position embeddings
  • B) Using positional encodings designed for 2D
  • C) Using random positions
  • D) Ignoring spatial structure