Introduction
Vision Transformer (ViT) applies the Transformer architecture (originally designed for text) directly to image classification by treating image patches as tokens. It was introduced by Dosovitskiy et al. (2020) in "An Image is Worth 16x16 Words".
Title Origin
The paper title "An Image is Worth 16x16 Words" refers to the fact that ViT divides an image into 16×16-pixel patches, treating each patch as a "word" (token) in a sequence.
Architecture
Input: Image H × W × C
Patch embedding:
1. Reshape to (H·W/p²) patches of size p×p×C
2. Flatten each patch to vector of size p²·C
3. Linear project to d_model dimensions
Prepend a learnable [CLS] token (optional, used for classification)
Add positional embeddings
Process through a standard Transformer encoder (see the sketch below)
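A minimal PyTorch sketch of these steps. The class and hyperparameter names are illustrative (roughly ViT-Base/16 sizes), not taken from the original codebase; the strided convolution is a standard shortcut for "reshape + flatten + linear projection".

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Illustrative ViT: patchify, embed, add [CLS] + positions, encode, classify."""
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 d_model=768, depth=12, n_heads=12, n_classes=1000):
        super().__init__()
        n_patches = (image_size // patch_size) ** 2          # 196 for 224/16
        # Steps 1-3: patchify + flatten + linear projection, done by one strided conv
        self.patch_embed = nn.Conv2d(in_chans, d_model,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True, norm_first=True)  # pre-norm, as in ViT
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                    # x: (B, 3, 224, 224)
        x = self.patch_embed(x)                              # (B, d_model, 14, 14)
        x = x.flatten(2).transpose(1, 2)                     # (B, 196, d_model)
        cls = self.cls_token.expand(x.shape[0], -1, -1)      # (B, 1, d_model)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # (B, 197, d_model)
        x = self.encoder(x)
        return self.head(x[:, 0])                            # classify from the [CLS] token
```

Calling `SimpleViT()(torch.randn(1, 3, 224, 224))` returns logits of shape `(1, 1000)`; smaller `depth` and `d_model` values make the sketch cheap to run.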
Patch Processing Example
Image: 224×224×3
Patch size: 16×16
Number of patches: (224/16)² = 14² = 196 patches
Each patch: 16×16×3 = 768 values
Project to d_model (e.g., 768)
Sequence length: 196 patch tokens (197 including the [CLS] token); a quick shape check follows below
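A standalone sketch of that arithmetic: cut a dummy image into non-overlapping patches and check the resulting shape.

```python
import torch

img = torch.randn(3, 224, 224)                     # C × H × W image
p = 16                                             # patch size
# Cut into non-overlapping 16×16 patches along H and W, then flatten each patch
patches = img.unfold(1, p, p).unfold(2, p, p)      # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * p * p)
print(patches.shape)                               # torch.Size([196, 768])
```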
Positional Embeddings in ViT
ViT uses learned 1D positional embeddings (self-attention is permutation-invariant, so position information must be added explicitly):
Position embedding: E_pos ∈ ℝ^{(N+1) × d_model}
N patches + 1 [CLS] token
2D position information can be added via 2D-aware positional encodings
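A minimal sketch of what this looks like in code; the variable names are assumptions for illustration, not the paper's notation.

```python
import torch
import torch.nn as nn

N, d_model = 196, 768
# E_pos: one learned row per token, including the [CLS] token
pos_embed = nn.Parameter(torch.zeros(1, N + 1, d_model))
nn.init.trunc_normal_(pos_embed, std=0.02)          # common ViT-style initialization

tokens = torch.randn(8, N + 1, d_model)             # batch of [CLS] + patch embeddings
tokens = tokens + pos_embed                         # broadcast add over the batch
# A 2D-aware variant would instead combine separate row and column embeddings.
```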
Pre-training vs From-Scratch
| Training regime | Requirements | Performance |
|---|---|---|
| From scratch | Very large datasets (e.g., JFT-300M) | Competitive only with big data |
| Fine-tuned | Large-scale pretraining + smaller target dataset | Best results |
| Linear probing | Frozen pretrained features + linear head | Tests representation quality |
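For the linear-probing row, a minimal sketch of the idea: freeze a pretrained backbone and train only a linear head on its features. This assumes the timm library and its pretrained ViT-Base/16 checkpoint are available; the 768-dim feature size and 10-class head are illustrative choices.

```python
import timm
import torch
import torch.nn as nn

# Frozen pretrained ViT backbone; num_classes=0 makes it return pooled features
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

head = nn.Linear(768, 10)                        # only this linear head is trained
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

images = torch.randn(4, 3, 224, 224)             # dummy batch
labels = torch.randint(0, 10, (4,))
with torch.no_grad():
    feats = backbone(images)                     # (4, 768) frozen features
loss = nn.functional.cross_entropy(head(feats), labels)
loss.backward()
optimizer.step()
```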
Key Findings
- Data requirements: ViT needs large pretraining datasets for best performance; with less data its weaker inductive biases hurt
- CNN + ViT: hybrid models (CNN feature maps fed into the Transformer) work better with less data
- Global attention: ViT captures long-range dependencies across the whole image