Introduction
Vision Transformer (ViT) applies the Transformer architecture (originally designed for text) directly to image classification by treating image patches as tokens. It was introduced by Dosovitskiy et al. (2020) in "An Image is Worth 16x16 Words".
Title Origin
The paper title "An Image is Worth 16x16 Words" refers to the fact that ViT divides an image into 16×16-pixel patches, treating each patch as a "word" (token) in a sequence.
Architecture
Input: Image H × W × C
Patch embedding:
1. Reshape to (H·W/p²) patches of size p×p×C
2. Flatten each patch to vector of size p²·C
3. Linear project to d_model dimensions
Prepend a learnable [CLS] token (optional, used for classification)
Add positional embeddings
Process through a standard Transformer encoder (see the sketch below)
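A minimal PyTorch sketch of these steps. The class and hyperparameter names are illustrative (roughly ViT-Base/16 sizes), not taken from the original codebase; the strided convolution is a standard shortcut for "reshape + flatten + linear projection".

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Illustrative ViT: patchify, embed, add [CLS] + positions, encode, classify."""
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 d_model=768, depth=12, n_heads=12, n_classes=1000):
        super().__init__()
        n_patches = (image_size // patch_size) ** 2          # 196 for 224/16
        # Steps 1-3: patchify + flatten + linear projection, done by one strided conv
        self.patch_embed = nn.Conv2d(in_chans, d_model,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True, norm_first=True)  # pre-norm, as in ViT
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                    # x: (B, 3, 224, 224)
        x = self.patch_embed(x)                              # (B, d_model, 14, 14)
        x = x.flatten(2).transpose(1, 2)                     # (B, 196, d_model)
        cls = self.cls_token.expand(x.shape[0], -1, -1)      # (B, 1, d_model)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # (B, 197, d_model)
        x = self.encoder(x)
        return self.head(x[:, 0])                            # classify from the [CLS] token
```

Calling `SimpleViT()(torch.randn(1, 3, 224, 224))` returns logits of shape `(1, 1000)`; smaller `depth` and `d_model` values make the sketch cheap to run.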
Patch Processing Example
Image: 224×224×3
Patch size: 16×16
Number of patches: (224/16)² = 14² = 196 patches
Each patch: 16×16×3 = 768 values
Project to d_model (e.g., 768)
Sequence length: 196 patch tokens (197 including the [CLS] token); a quick shape check follows below
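A standalone sketch of that arithmetic: cut a dummy image into non-overlapping patches and check the resulting shape.

```python
import torch

img = torch.randn(3, 224, 224)                     # C × H × W image
p = 16                                             # patch size
# Cut into non-overlapping 16×16 patches along H and W, then flatten each patch
patches = img.unfold(1, p, p).unfold(2, p, p)      # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * p * p)
print(patches.shape)                               # torch.Size([196, 768])
```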
Positional Embeddings in ViT
ViT uses learned 1D positional embeddings (self-attention is permutation-invariant, so position information must be added explicitly):
Position embedding: E_pos ∈ ℝ^{(N+1) × d_model}
N patches + 1 [CLS] token
2D position information can be added via 2D-aware positional encodings
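A minimal sketch of what this looks like in code; the variable names are assumptions for illustration, not the paper's notation.

```python
import torch
import torch.nn as nn

N, d_model = 196, 768
# E_pos: one learned row per token, including the [CLS] token
pos_embed = nn.Parameter(torch.zeros(1, N + 1, d_model))
nn.init.trunc_normal_(pos_embed, std=0.02)          # common ViT-style initialization

tokens = torch.randn(8, N + 1, d_model)             # batch of [CLS] + patch embeddings
tokens = tokens + pos_embed                         # broadcast add over the batch
# A 2D-aware variant would instead combine separate row and column embeddings.
```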
Pre-training vs From-Scratch
| Training regime | Requirements | Performance |
|---|---|---|
| From scratch | Very large datasets (e.g., JFT-300M) | Competitive only with big data |
| Fine-tuned | Large-scale pretraining + smaller target dataset | Best results |
| Linear probing | Frozen pretrained features + linear head | Tests representation quality |
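For the linear-probing row, a minimal sketch of the idea: freeze a pretrained backbone and train only a linear head on its features. This assumes the timm library and its pretrained ViT-Base/16 checkpoint are available; the 768-dim feature size and 10-class head are illustrative choices.

```python
import timm
import torch
import torch.nn as nn

# Frozen pretrained ViT backbone; num_classes=0 makes it return pooled features
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

head = nn.Linear(768, 10)                        # only this linear head is trained
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

images = torch.randn(4, 3, 224, 224)             # dummy batch
labels = torch.randint(0, 10, (4,))
with torch.no_grad():
    feats = backbone(images)                     # (4, 768) frozen features
loss = nn.functional.cross_entropy(head(feats), labels)
loss.backward()
optimizer.step()
```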
Key Findings
- Data requirements: ViT needs large pretraining datasets for best performance; with less data its weaker inductive biases hurt
- CNN + ViT: hybrid models (CNN feature maps fed into the Transformer) work better with less data
- Global attention: ViT captures long-range dependencies across the whole image