Introduction
Image-text attention is a form of cross-modal attention that connects visual and textual information. It is fundamental to models such as CLIP and DALL-E, and to vision-language models more broadly, for tasks such as image captioning, visual question answering, and text-to-image generation.
Image Representation
Images are typically processed in one of two ways:
1. Patch-based (ViT-style)
Image → patches → linear projection → tokens
For a 224×224 image with 16×16 patches: 14 × 14 = 196 tokens (see the sketch after this list)
2. Region-based (Faster R-CNN style)
Object detection → ROI features → region tokens
Typically 36-100 regions per image
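A minimal PyTorch sketch of the patch-based path (the class name `PatchEmbed` and the dimensions are illustrative, not any specific library's API): a convolution whose kernel and stride equal the patch size cuts the image into non-overlapping patches and linearly projects each one in a single step.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style patch embedding: split image into patches, project each to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A conv with kernel == stride == patch_size is equivalent to cutting
        # non-overlapping patches and applying one shared linear layer to each.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768) -- one token per patch

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```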
Text Representation
Text → tokenize → embed → transformer encoder
Standard token embeddings with positional encoding
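A minimal sketch of the embedding step, assuming token ids already produced by a tokenizer (the vocabulary size 49,408 and context length 77 follow CLIP's text encoder; `TextEmbed` is an illustrative name). The resulting sequence would then be fed to a transformer encoder.

```python
import torch
import torch.nn as nn

class TextEmbed(nn.Module):
    """Token embeddings plus learned positional embeddings."""
    def __init__(self, vocab_size=49408, max_len=77, embed_dim=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, embed_dim)
        self.pos = nn.Embedding(max_len, embed_dim)

    def forward(self, ids):                         # ids: (B, L) integer token ids
        positions = torch.arange(ids.size(1), device=ids.device)
        return self.tok(ids) + self.pos(positions)  # (B, L, embed_dim)

ids = torch.randint(0, 49408, (1, 77))  # stand-in for real tokenizer output
print(TextEmbed()(ids).shape)           # torch.Size([1, 77, 512])
```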
Attention Patterns
Text Querying Image
In image captioning or VQA:
Q = text token (e.g., "cat")
K, V = image features (patches or regions)
Attention shows which image regions correspond to "cat"
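A minimal single-head sketch of this pattern (the function name `cross_attention` is illustrative, and it omits the learned Q/K/V projections and multi-head split of a full implementation): text tokens form the queries, image tokens the keys and values, so each row of the attention map is a distribution over image patches for one word.

```python
import torch

def cross_attention(q, k, v):
    """Single-head scaled dot-product attention: q attends over k/v."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (B, Lq, Lkv)
    weights = scores.softmax(dim=-1)                        # attention map
    return weights @ v, weights

text = torch.randn(1, 7, 512)     # 7 text tokens (e.g., a short caption)
image = torch.randn(1, 196, 512)  # 196 patch tokens from the image encoder

out, attn = cross_attention(text, image, image)
# attn[0, i] is a distribution over the 196 patches for text token i --
# e.g., the row for "cat" should peak on the patches containing the cat.
print(out.shape, attn.shape)  # torch.Size([1, 7, 512]) torch.Size([1, 7, 196])
```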
Image Querying Text
In image-to-text retrieval or grounding (note: CLIP itself compares pooled embeddings rather than cross-attending, as described in the next section; cross-attention fusion models use this pattern):
Q = image features (patches or regions)
K, V = text token embeddings
Attention shows which words each image region aligns with; pooling the scores measures image-text similarity
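Continuing the `cross_attention` sketch above, flipping the direction only swaps the arguments: image tokens become the queries, text tokens the keys and values.

```python
# Reusing cross_attention, text, and image from the previous sketch:
out, attn = cross_attention(image, text, text)
print(attn.shape)  # torch.Size([1, 196, 7]) -- one text distribution per patch
```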
CLIP-style Contrastive Learning
Image encoder: ViT → image embedding I
Text encoder: Transformer → text embedding T
Loss: maximize similarity(I, T) for matched pairs
minimize similarity(I, T) for mismatched pairs
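A minimal sketch of this symmetric contrastive (InfoNCE) objective over a batch, assuming the two encoders have already produced pooled embeddings I and T; in CLIP the temperature is a learned parameter, initialized near the fixed 0.07 used here.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(I, T, temperature=0.07):
    """Symmetric InfoNCE: matched (image, text) pairs sit on the diagonal."""
    I = F.normalize(I, dim=-1)          # unit-norm image embeddings, (B, D)
    T = F.normalize(T, dim=-1)          # unit-norm text embeddings,  (B, D)
    logits = I @ T.t() / temperature    # (B, B) pairwise cosine similarities
    targets = torch.arange(I.size(0))   # image i matches text i
    # Cross-entropy pulls diagonal similarities up and pushes the rest down,
    # applied in both the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```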
Applications
| Application | Attention Direction |
|---|---|
| Image Captioning | Text → Image |
| VQA | Question → Image |
| Image Retrieval | Image → Text / Text → Image |
| Text-to-Image | Text → Image (generation) |