47. Image-Text Attention

Introduction

Image-text attention is a specific form of cross-modal attention that connects visual and textual information. It is fundamental to models such as CLIP and DALL-E, and to vision-language models used for image captioning, visual question answering, and text-to-image generation.

Image Representation

Images are typically processed in one of two ways:

1. Patch-based (ViT-style)

Image → patches → linear projection → tokens

For a 224×224 image with 16×16 patches: 14 × 14 = 196 tokens (a patch-embedding sketch follows this list)

2. Region-based (Faster R-CNN style)

Object detection → ROI features → region tokens

Typically 36-100 regions per image
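
A minimal sketch of the patch-based path, assuming PyTorch; the patch size (16), embedding dimension (768), and image size are illustrative:

import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768   # illustrative ViT-Base-like sizes

# A conv with kernel = stride = patch size cuts the image into non-overlapping
# 16×16 patches and linearly projects each one in a single step.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)         # (batch, channels, height, width)
tokens = patch_embed(image)                 # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): 14 × 14 = 196 patch tokens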

Text Representation

Text → tokenize → embed → transformer encoder

Standard token embeddings with positional encoding
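
A minimal sketch of the text side, assuming PyTorch; the vocabulary size, embedding dimension, and layer count are illustrative, and the random token IDs stand in for real tokenizer output:

import torch
import torch.nn as nn

vocab_size, embed_dim, max_len = 30522, 768, 77   # illustrative sizes

token_embed = nn.Embedding(vocab_size, embed_dim)   # token embeddings
pos_embed = nn.Embedding(max_len, embed_dim)        # learned positional encoding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)

token_ids = torch.randint(0, vocab_size, (1, 12))   # stand-in for tokenizer output
positions = torch.arange(12).unsqueeze(0)
text_tokens = encoder(token_embed(token_ids) + pos_embed(positions))   # (1, 12, 768)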

Attention Patterns

Text Querying Image

In image captioning or VQA:

Q = text token embedding (e.g., the token for "cat")
K, V = image features (patches or regions)

The attention weights show which image regions correspond to "cat"
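
A minimal cross-attention sketch for this direction, assuming PyTorch; the sequence lengths and embedding dimension are illustrative:

import torch
import torch.nn as nn

embed_dim = 768
cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

text_tokens = torch.randn(1, 12, embed_dim)     # caption or question tokens (queries)
image_tokens = torch.randn(1, 196, embed_dim)   # e.g., 196 ViT patch tokens (keys, values)

# Q from text, K and V from the image: each text token attends over image patches.
out, attn_weights = cross_attn(query=text_tokens, key=image_tokens, value=image_tokens)
# attn_weights has shape (1, 12, 196): one weight per (text token, image patch) pair.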

Image Querying Text

In CLIP image-to-text retrieval:

Q = pooled image embedding
K, V = text embeddings of candidate captions

The query-key dot products measure image-text similarity
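
A minimal retrieval sketch under this framing, assuming PyTorch; the embedding dimension and the number of candidate captions are illustrative:

import torch
import torch.nn.functional as F

image_emb = F.normalize(torch.randn(1, 512), dim=-1)    # pooled image embedding (the "query")
text_embs = F.normalize(torch.randn(5, 512), dim=-1)    # 5 candidate caption embeddings (the "keys")

scores = image_emb @ text_embs.t()     # cosine similarities, shape (1, 5)
probs = scores.softmax(dim=-1)         # attention-like distribution over captions
best = probs.argmax(dim=-1)            # index of the retrieved caption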

CLIP-style Contrastive Learning

Image encoder: ViT → image embedding I
Text encoder: Transformer → text embedding T

Loss: maximize similarity(I, T) for matched pairs
minimize similarity(I, T) for mismatched pairs
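
A minimal sketch of the symmetric contrastive (InfoNCE-style) loss described above, assuming PyTorch; the encoders are replaced by random embeddings and the temperature value is illustrative:

import torch
import torch.nn.functional as F

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    # Normalize so dot products are cosine similarities.
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)

    # logits[i, j] = similarity between image i and text j.
    logits = image_embs @ text_embs.t() / temperature
    targets = torch.arange(len(logits))   # matched pairs lie on the diagonal

    # Cross-entropy in both directions raises similarity for matched pairs
    # and lowers it for mismatched pairs within the batch.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Stand-in embeddings for a batch of 4 matched image-text pairs.
loss = clip_style_loss(torch.randn(4, 512), torch.randn(4, 512))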

Applications

Application         Attention Direction
Image Captioning    Text → Image
VQA                 Question → Image
Image Retrieval     Image → Text / Text → Image
Text-to-Image       Text → Image (generation)

Test Your Understanding

Question 1: In CLIP, image and text are encoded into:

  • A) Different spaces
  • B) Same embedding space for contrastive learning
  • C) No embeddings
  • D) Random vectors

Question 2: For image captioning, attention flows from:

  • A) Image to text
  • B) Text to image
  • C) No attention
  • D) Text to text

Question 3: Patch-based image representation divides image into:

  • A) Single pixel
  • B) Rectangular patches
  • C) Circles
  • D) Random shapes

Question 4: In CLIP loss, matched pairs should have:

  • A) Low similarity
  • B) High similarity
  • C) No similarity
  • D) Zero similarity

Question 5: Image-text attention is a type of:

  • A) Self-attention
  • B) Cross-modal attention
  • C) No attention
  • D) Masked attention

Question 6: A 224×224 image with 16×16 patches produces how many tokens?

  • A) 224
  • B) 16
  • C) 196
  • D) 256

Question 7: Region-based image features typically come from:

  • A) Random sampling
  • B) Object detection (e.g., Faster R-CNN)
  • C) No detection
  • D) Direct pixels

Question 8: In VQA, question text queries image regions via:

  • A) Self-attention
  • B) Cross-modal attention (Q from text, K,V from image)
  • C) No attention
  • D) Causal attention