Introduction
Image-text attention is a form of cross-modal attention that connects visual and textual information. It is fundamental to models such as CLIP and DALL-E, and to vision-language models more broadly, for tasks such as image captioning, visual question answering, and text-to-image generation.
Image Representation
Images are typically processed in one of two ways:
1. Patch-based (ViT-style)
Image → patches → linear projection → tokens
For a 224×224 image with 16×16 patches: 14 × 14 = 196 tokens (see the sketch after this list)
2. Region-based (Faster R-CNN style)
Object detection → ROI features → region tokens
Typically 36-100 regions per image
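A minimal PyTorch sketch of the patch-based path (the class name `PatchEmbed` and the dimensions are illustrative, not any specific library's API): a convolution whose kernel and stride equal the patch size cuts the image into non-overlapping patches and linearly projects each one in a single step.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style patch embedding: split image into patches, project each to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A conv with kernel == stride == patch_size is equivalent to cutting
        # non-overlapping patches and applying one shared linear layer to each.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768) -- one token per patch

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```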
Text Representation
Text → tokenize → embed → transformer encoder
Standard token embeddings with positional encoding
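A minimal sketch of the embedding step, assuming token ids already produced by a tokenizer (the vocabulary size 49,408 and context length 77 follow CLIP's text encoder; `TextEmbed` is an illustrative name). The resulting sequence would then be fed to a transformer encoder.

```python
import torch
import torch.nn as nn

class TextEmbed(nn.Module):
    """Token embeddings plus learned positional embeddings."""
    def __init__(self, vocab_size=49408, max_len=77, embed_dim=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, embed_dim)
        self.pos = nn.Embedding(max_len, embed_dim)

    def forward(self, ids):                         # ids: (B, L) integer token ids
        positions = torch.arange(ids.size(1), device=ids.device)
        return self.tok(ids) + self.pos(positions)  # (B, L, embed_dim)

ids = torch.randint(0, 49408, (1, 77))  # stand-in for real tokenizer output
print(TextEmbed()(ids).shape)           # torch.Size([1, 77, 512])
```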
Attention Patterns
Text Querying Image
In image captioning or VQA:
Q = text token (e.g., "cat")
K, V = image features (patches or regions)
Attention shows which image regions correspond to "cat"
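A minimal single-head sketch of this pattern (the function name `cross_attention` is illustrative, and it omits the learned Q/K/V projections and multi-head split of a full implementation): text tokens form the queries, image tokens the keys and values, so each row of the attention map is a distribution over image patches for one word.

```python
import torch

def cross_attention(q, k, v):
    """Single-head scaled dot-product attention: q attends over k/v."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (B, Lq, Lkv)
    weights = scores.softmax(dim=-1)                        # attention map
    return weights @ v, weights

text = torch.randn(1, 7, 512)     # 7 text tokens (e.g., a short caption)
image = torch.randn(1, 196, 512)  # 196 patch tokens from the image encoder

out, attn = cross_attention(text, image, image)
# attn[0, i] is a distribution over the 196 patches for text token i --
# e.g., the row for "cat" should peak on the patches containing the cat.
print(out.shape, attn.shape)  # torch.Size([1, 7, 512]) torch.Size([1, 7, 196])
```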
Image Querying Text
In image-to-text retrieval or grounding (note: CLIP itself compares pooled embeddings rather than cross-attending, as described in the next section; cross-attention fusion models use this pattern):
Q = image features (patches or regions)
K, V = text token embeddings
Attention shows which words each image region aligns with; pooling the scores measures image-text similarity
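Continuing the `cross_attention` sketch above, flipping the direction only swaps the arguments: image tokens become the queries, text tokens the keys and values.

```python
# Reusing cross_attention, text, and image from the previous sketch:
out, attn = cross_attention(image, text, text)
print(attn.shape)  # torch.Size([1, 196, 7]) -- one text distribution per patch
```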
CLIP-style Contrastive Learning
Image encoder: ViT → image embedding I
Text encoder: Transformer → text embedding T
Loss: maximize similarity(I, T) for matched pairs
minimize similarity(I, T) for mismatched pairs
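A minimal sketch of this symmetric contrastive (InfoNCE) objective over a batch, assuming the two encoders have already produced pooled embeddings I and T; in CLIP the temperature is a learned parameter, initialized near the fixed 0.07 used here.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(I, T, temperature=0.07):
    """Symmetric InfoNCE: matched (image, text) pairs sit on the diagonal."""
    I = F.normalize(I, dim=-1)          # unit-norm image embeddings, (B, D)
    T = F.normalize(T, dim=-1)          # unit-norm text embeddings,  (B, D)
    logits = I @ T.t() / temperature    # (B, B) pairwise cosine similarities
    targets = torch.arange(I.size(0))   # image i matches text i
    # Cross-entropy pulls diagonal similarities up and pushes the rest down,
    # applied in both the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```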
Applications
| Application | Attention Direction |
|---|---|
| Image Captioning | Text → Image |
| VQA | Question → Image |
| Image Retrieval | Image → Text / Text → Image |
| Text-to-Image | Text → Image (generation) |