46. Cross-Modal Attention

Introduction

Cross-modal attention connects different modalities (text, image, audio, etc.) in multimodal models. It allows one modality to query and attend to information from another, enabling joint understanding and generation across modalities.

Cross-Modal Architecture

Q = from modality A (e.g., text)
K, V = from modality B (e.g., image)

CrossAttention(Q_modA, K_modB, V_modB)
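The pattern above can be sketched as plain scaled dot-product attention where Q comes from one modality and K, V from another. This is a minimal NumPy sketch; the shapes and variable names (5 text tokens, 9 image patches, dimension 16) are illustrative, not from any specific model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_mod_a, k_mod_b, v_mod_b):
    """Scaled dot-product attention: Q from modality A, K/V from modality B."""
    d_k = q_mod_a.shape[-1]
    scores = q_mod_a @ k_mod_b.T / np.sqrt(d_k)   # (len_A, len_B)
    weights = softmax(scores, axis=-1)            # each query sums to 1 over B
    return weights @ v_mod_b, weights

# Toy example: 5 text-token queries attend over 9 image-patch features.
rng = np.random.default_rng(0)
text_q = rng.normal(size=(5, 16))   # Q from modality A (text)
img_k = rng.normal(size=(9, 16))    # K from modality B (image)
img_v = rng.normal(size=(9, 16))    # V from modality B (image)
out, w = cross_attention(text_q, img_k, img_v)
print(out.shape, w.shape)  # (5, 16) (5, 9)
```

Note that the output has one row per *query* (text token): each text position receives a mixture of image features, weighted by relevance.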

Common Applications

Task                  | Q Source      | K, V Source
Image Captioning      | Text decoder  | Image encoder
VQA                   | Question      | Image regions
Image-text alignment  | Text features | Image features
Video Retrieval       | Text query    | Video frames

How Cross-Modal Attention Works

Example: Image Captioning

At the decoding step that generates the word "cat":

  • Q = decoder hidden state (the context for "cat")
  • K = image patch features (what the image contains)
  • V = the same image patch features
  • The attention weights reveal which image regions relate to "cat"
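To make this concrete, here is a tiny numeric sketch of one decoding step. The 2-D feature vectors are invented for illustration: the decoder state for "cat" is constructed to point in the same direction as patch 2's feature, so that patch wins the attention.

```python
import numpy as np

# Hypothetical 2-D features, chosen by hand for illustration only.
decoder_state = np.array([1.0, 0.0])   # Q: hidden state while generating "cat"
patch_feats = np.array([[0.0, 1.0],    # patch 0: sky
                        [0.2, 0.9],    # patch 1: grass
                        [1.0, 0.1]])   # patch 2: the cat

# One attention step: scores, then softmax over patches.
scores = patch_feats @ decoder_state / np.sqrt(2)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights.argmax())  # 2 -- the cat patch gets the most attention
```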

Bidirectional Cross-Attention

Some models, such as ViLBERT and LXMERT, use bidirectional cross-attention (often called co-attention), in which each modality attends to the other. (CLIP, by contrast, aligns modalities with separate encoders and a contrastive objective rather than cross-attention.)

  • Image→Text attention: Q from image; K, V from text
  • Text→Image attention: Q from text; K, V from image
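A minimal sketch of the two directions, assuming the co-attention style of models like ViLBERT (shapes are illustrative):

```python
import numpy as np

def attend(q, k, v):
    # Scaled dot-product attention with a row-wise softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
text = rng.normal(size=(4, 8))    # 4 text tokens, dim 8
image = rng.normal(size=(6, 8))   # 6 image patches, dim 8

text_enriched = attend(text, image, image)    # Text→Image: Q from text
image_enriched = attend(image, text, text)    # Image→Text: Q from image
print(text_enriched.shape, image_enriched.shape)  # (4, 8) (6, 8)
```

Each direction preserves the query side's sequence length: the text stream stays 4 tokens long but now carries image information, and vice versa.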

Challenges

  • Representation alignment: features from different modalities live in different embedding spaces and must be projected into a shared space before attention is meaningful.
  • Granularity mismatch: text tokens and image patches (or video frames) differ in number and semantic scale.
  • Computational cost: attending across long visual sequences, such as many video frames, is expensive.

Test Your Understanding

Question 1: In cross-modal attention, queries come from one modality while keys/values come from:

  • A) Same modality
  • B) Different modality
  • C) Random modality
  • D) No modality

Question 2: In VQA (Visual Question Answering), Q and K,V come from, respectively:

  • A) Image; text
  • B) Text question; image
  • C) Image; image
  • D) Text; text

Question 3: Vision-language models use cross-modal attention for:

  • A) Language tasks only
  • B) Aligning image and text representations
  • C) Image classification only
  • D) No cross-modal

Question 4: Cross-modal attention is essential for:

  • A) Text-only models
  • B) Multimodal understanding and generation
  • C) Single modality only
  • D) Image-only models

Question 5: In image captioning, when generating "cat", attention shows relevance to:

  • A) Random text
  • B) Image regions showing cats
  • C) No attention
  • D) Padding only

Question 6: Some models, like ViLBERT, use:

  • A) Unidirectional cross-attention only
  • B) Bidirectional cross-attention
  • C) No cross-attention
  • D) Single direction

Question 7: Cross-modal attention enables:

  • A) Only text processing
  • B) Only image processing
  • C) Joint understanding across modalities
  • D) Single modality only

Question 8: A challenge in cross-modal attention is:

  • A) Same representations across modalities
  • B) Representation alignment between different modalities
  • C) No challenge
  • D) Too easy