Introduction
Cross-modal attention is an attention mechanism that connects different modalities (text, image, audio, etc.) in multimodal models. It lets one modality query and attend to information from another, enabling joint understanding and generation across modalities.
Cross-Modal Architecture
Q = queries from modality A (e.g., text)
K, V = keys and values from modality B (e.g., image)
CrossAttention(Q_modA, K_modB, V_modB)
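A minimal sketch of this pattern in PyTorch. The `CrossAttention` class name, `d_model=512`, and the sequence lengths are illustrative assumptions, not taken from any specific model; `nn.MultiheadAttention` is used because it accepts separate query and key/value inputs, which is exactly the cross-modal case.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Illustrative cross-attention: Q from modality A, K/V from modality B."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # Queries come from modality A; keys and values from modality B.
        out, _ = self.attn(query=x_a, key=x_b, value=x_b)
        return out

# Example: 16 text tokens (queries) attending over 196 image patches.
text = torch.randn(2, 16, 512)    # modality A
image = torch.randn(2, 196, 512)  # modality B
fused = CrossAttention(512)(text, image)  # shape: (2, 16, 512)
```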
Common Applications
| Task | Q Source | K, V Source |
|---|---|---|
| Image Captioning | Text decoder | Image encoder |
| VQA | Question | Image regions |
| Text-to-image diffusion | Image latents | Text prompt embeddings |
| Video Retrieval | Text query | Video frames |
How Cross-Modal Attention Works
Example: Image Captioning
At the step where the decoder generates the word "cat":
- Q = decoder hidden state (the context for predicting "cat")
- K = image patch features (what the image contains)
- V = the same image patch features
- The attention weights reveal which image regions the model associates with "cat" (see the sketch below)
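A hypothetical sketch of this single decoder step: one query (the decoder state at the "cat" position) attends over a grid of image patch features, and the resulting weights indicate which patches drive the prediction. All shapes and names here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

d = 512
q = torch.randn(1, 1, d)          # decoder hidden state at the "cat" step
patches = torch.randn(1, 196, d)  # 14x14 grid of image patch features

# Scaled dot-product attention over the patches.
scores = q @ patches.transpose(1, 2) / d ** 0.5  # (1, 1, 196)
weights = F.softmax(scores, dim=-1)              # sums to 1 over patches
context = weights @ patches                      # (1, 1, d) image context

# The highest-weighted patches are the regions related to "cat".
top_regions = weights[0, 0].topk(5).indices
```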
Bilateral Cross-Attention
Some models, such as ViLBERT, use bidirectional cross-attention (co-attention), where each stream queries the other:
- Image→Text attention: Q from image, K, V from text
- Text→Image attention: Q from text, K, V from image
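A sketch of one bidirectional co-attention block in the style of ViLBERT's co-attentional layers, where the two streams swap keys and values. The module and variable names are assumptions for illustration, not ViLBERT's actual API.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Each modality attends to the other in the same block."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # Image queries text: Q from image, K/V from text.
        img_out, _ = self.img_to_txt(query=img, key=txt, value=txt)
        # Text queries image: Q from text, K/V from image.
        txt_out, _ = self.txt_to_img(query=txt, key=img, value=img)
        return img_out, txt_out
```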
Challenges
- Representation alignment: different modalities produce features with different dimensionalities and statistics, so they must be projected into a shared embedding space before attending to each other (see the sketch below)
- Attention patterns: useful attention behavior differs across modalities (e.g., spatially localized over image regions, sequential over text), which may call for modality-specific attention mechanisms
- Alignment supervision: training requires paired data (e.g., image-caption pairs), which is expensive to collect at scale
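A common remedy for the representation-alignment challenge is a learned linear projection per modality into a shared width. A minimal sketch, assuming BERT-sized text features (768) and ViT-L-sized patch features (1024); these dimensions are assumptions, not requirements.

```python
import torch
import torch.nn as nn

d_model = 512
text_proj = nn.Linear(768, d_model)    # text features -> shared space
image_proj = nn.Linear(1024, d_model)  # image patch features -> shared space

text = text_proj(torch.randn(2, 16, 768))
image = image_proj(torch.randn(2, 196, 1024))
# Both now live in the same 512-dim space and can attend to each other.
```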