Introduction
Vision attention refers to attention mechanisms applied to image data. Unlike text, where tokens are discrete words or subwords, images call for different processing strategies: either treating image patches as tokens or using spatial attention mechanisms designed for 2D data.
Image as Sequence of Patches
Most modern vision transformers convert images to patch sequences:
Input: Image of size H × W × C
Patchify: Divide into p × p patches
Number of patches: N = (H·W) / p²
Each patch flattened and projected to embedding dimension
Sequence length N becomes the "token" sequence
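As a minimal PyTorch sketch of this patch-embedding step (image size, patch size, and embedding dimension below are placeholder values, not taken from the text):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into p x p patches and project each one to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # N = (H*W) / p^2
        # A p x p conv with stride p flattens and projects each patch in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D) -- the patch "tokens"
        return x

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                            # torch.Size([1, 196, 768])
```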
Attention Types in Vision
1. Spatial Attention
Focus on specific regions of the image:
Attention across spatial positions
Similar to text self-attention but operating on 2D coordinates
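One common way to realize spatial attention for CNN feature maps (a sketch similar in spirit to CBAM's spatial branch; the exact design here is an assumption, not specified above) is to pool across channels, predict an H × W attention map, and rescale the feature map with it:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Produce an H x W attention map and use it to reweight spatial positions."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                              # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)              # (B, 1, H, W) channel-average
        mx, _ = x.max(dim=1, keepdim=True)             # (B, 1, H, W) channel-max
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)
        return x * attn                                # emphasize informative regions
```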
2. Channel Attention
Focus on what features to emphasize:
Squeeze-and-Excitation (SE): channel-wise attention
s = GAP(x) → fc → ReLU → fc → sigmoid → channel-wise scale
3. Self-Attention over Patches
ViT-style: treat each patch as a token:
Patch 1, Patch 2, ..., Patch N
Each patch attends to all other patches
Captures long-range dependencies across image
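A minimal PyTorch sketch of patch-level self-attention, using nn.MultiheadAttention as a stand-in for a full ViT block (the dimensions are placeholder values):

```python
import torch
import torch.nn as nn

class PatchSelfAttention(nn.Module):
    """Multi-head self-attention where every patch token attends to every other."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, N, D) patch embeddings
        out, weights = self.attn(tokens, tokens, tokens, need_weights=True)
        return out, weights                    # weights: (B, N, N) patch-to-patch attention

x = torch.randn(2, 196, 768)                   # a 14x14 grid of 16x16 patches
out, w = PatchSelfAttention()(x)
print(out.shape, w.shape)                      # (2, 196, 768) (2, 196, 196)
```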
Challenges in Vision Attention
- Memory: Full attention on high-resolution images is expensive, since cost grows quadratically with the number of patches N
- 2D Structure: Flattening patches discards spatial layout, which must be restored (typically via positional embeddings)
- Various Scales: Objects appear at different sizes, motivating multi-scale or hierarchical attention
Solutions
| Solution | Method |
|---|---|
| Swin Transformer | Window-based local attention + shifted windows |
| Pyramid ViT | Hierarchical feature maps with decreasing resolution |
| Spatial attention | Focus on spatial regions |
| Channel attention | Focus on feature channels (SE-Net) |
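To illustrate the window-based idea behind Swin, here is a rough sketch of partitioning a feature map into non-overlapping windows and attending within each window (not Swin's actual implementation, which also adds shifted windows and relative position biases; the sizes below are placeholder values):

```python
import torch
import torch.nn as nn

def window_partition(x, w):
    """Split a (B, H, W, C) feature map into non-overlapping w x w windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)  # (B * num_windows, w*w, C)

# Attention is computed independently inside each window, so the cost scales
# with the window size rather than the full image resolution.
feat = torch.randn(1, 56, 56, 96)                  # e.g. an early-stage feature map
windows = window_partition(feat, w=7)              # (64, 49, 96)
attn = nn.MultiheadAttention(96, num_heads=3, batch_first=True)
out, _ = attn(windows, windows, windows)
print(out.shape)                                   # torch.Size([64, 49, 96])
```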
Example: SE (Squeeze-and-Excitation) Attention
Input: x ∈ ℝ^{H×W×C}
Squeeze: global average pooling → z ∈ ℝ^{C}
Excitation: fc → ReLU → fc → sigmoid → scale ∈ ℝ^{C}
Output: scale × x (channel-wise scaled)
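Putting the SE steps together, a minimal PyTorch sketch (the reduction ratio of 16 follows the common SE-Net default):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels using globally pooled statistics."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                                      # x: (B, C, H, W)
        z = x.mean(dim=(2, 3))                                 # squeeze: GAP -> (B, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation -> (B, C)
        return x * s.view(x.size(0), -1, 1, 1)                 # channel-wise rescale
```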