Introduction
Cross-attention is an attention mechanism where queries come from one sequence and keys/values come from a different sequence. It enables the decoder to access information from the encoder in encoder-decoder architectures, and is fundamental to multimodal models that combine different data types.
Core Concept
┌─────────────────┐ ┌─────────────────┐
│ Sequence A │ │ Sequence B │
│ (Query source) │ │ (Key/Value) │
└────────┬────────┘ └────────┬────────┘
│ │
│ Q K, V
│ │
▼ ▼
┌─────────────────────────────────┐
│ CROSS-ATTENTION │
│ Attention(Q_from_A, K_from_B, │
│ V_from_B) │
└─────────────────────────────────┘
│
▼
Output: Sequence A's positions attending to
information from Sequence B
Mathematical Formulation
Q = X_decoder · W_Q
K = X_encoder · W_K
V = X_encoder · W_V
CrossAttention(Q, K, V) = softmax(QKᵀ/√d_k) · V
where X_decoder supplies the queries, X_encoder supplies the keys and values, and d_k is the key dimension.
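As a concrete reference, here is a minimal single-head sketch of this formulation in PyTorch (names, shapes, and the random toy data are illustrative, not a specific model's API):

```python
import torch
import torch.nn.functional as F

def cross_attention(x_dec, x_enc, w_q, w_k, w_v):
    """Single-head cross-attention following the formulation above.

    x_dec: (tgt_len, d_model)  query-side sequence (e.g., decoder states)
    x_enc: (src_len, d_model)  key/value-side sequence (e.g., encoder output)
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q = x_dec @ w_q                                # (tgt_len, d_k)
    k = x_enc @ w_k                                # (src_len, d_k)
    v = x_enc @ w_v                                # (src_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (tgt_len, src_len)
    weights = F.softmax(scores, dim=-1)            # each query row sums to 1
    return weights @ v                             # (tgt_len, d_k)

# Toy shapes: 5 query positions attend over 7 key/value positions.
d_model, d_k = 16, 8
x_dec, x_enc = torch.randn(5, d_model), torch.randn(7, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
print(cross_attention(x_dec, x_enc, w_q, w_k, w_v).shape)  # torch.Size([5, 8])
```

Note that the output has one row per query position: cross-attention always produces a sequence the length of the query side, regardless of how long the key/value side is.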
Cross-Attention vs Self-Attention
| Aspect | Self-Attention | Cross-Attention |
|---|---|---|
| Q source | Same sequence | Sequence A (decoder) |
| K, V source | Same sequence | Sequence B (encoder) |
| Captures | Within-sequence relationships | Cross-sequence relationships |
| Q and K,V share a source? | Yes (same sequence) | No (different sequences) |
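At the level of code, the only difference is where the keys and values come from. A quick illustration using PyTorch's nn.MultiheadAttention (the same module is reused for both calls purely to show the call pattern; shapes are arbitrary):

```python
import torch
from torch import nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x      = torch.randn(2, 5, 16)   # sequence A: batch=2, len=5, dim=16
memory = torch.randn(2, 7, 16)   # sequence B: len=7

self_out,  _ = attn(x, x, x)             # self-attention: Q, K, V all from x
cross_out, _ = attn(x, memory, memory)   # cross-attention: Q from x, K/V from memory
print(self_out.shape, cross_out.shape)   # both (2, 5, 16): output length follows the queries
```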
Use Cases
1. Machine Translation (Encoder-Decoder)
Decoder queries attend to encoder keys/values (source language):
French encoder output → keys/values
English decoder query → attention output
Result: English token attending to relevant French positions
2. Vision-Language (e.g., Flamingo)
Image encoder → keys/values
Text tokens → queries
Result: Text tokens attending to relevant image regions
3. Document Question Answering
Document encoder → keys/values
Question encoder → queries
Result: Question answering using document context
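All three use cases reduce to the same call pattern: the sequence you want to pull information from supplies the keys and values, and the sequence doing the asking supplies the queries. A minimal sketch using the document-QA case (projection matrices are omitted and all names and sizes are made up for illustration):

```python
import torch

d = 32
document_states = torch.randn(200, d)   # encoder output for a 200-token document -> K, V
question_states = torch.randn(12, d)    # encoder output for a 12-token question  -> Q

scores  = question_states @ document_states.T / d ** 0.5   # (12, 200)
weights = scores.softmax(dim=-1)        # each question token's distribution over document positions
context = weights @ document_states     # (12, d): document information gathered per question token
```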
Key Properties
- Asymmetric: Q and K,V can come from different sources
- Bidirectional (typically): a causal mask is usually not applied in cross-attention, so every query position can attend to every key/value position (see the sketch after this list)
- Flexible: Works with any combination of modalities
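A practical consequence of the "no causal mask" point: the mask most commonly passed to cross-attention is a padding mask over the key/value side, so queries simply skip padded encoder positions. A minimal sketch with PyTorch's nn.MultiheadAttention, assuming a batch whose second source sequence is padded (sizes are illustrative):

```python
import torch
from torch import nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

decoder_states = torch.randn(2, 5, 16)   # queries: all 5 target positions
encoder_states = torch.randn(2, 7, 16)   # keys/values: source, padded to length 7
src_padding = torch.tensor([[False] * 7,
                            [False] * 4 + [True] * 3])  # True marks padded source positions

# No attn_mask (causal mask) here: every target position may attend to
# every non-padded source position.
out, weights = attn(decoder_states, encoder_states, encoder_states,
                    key_padding_mask=src_padding)
print(out.shape)  # (2, 5, 16)
```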
In Transformer Decoder
Each decoder layer has cross-attention after masked self-attention:
Layer = MaskedSelfAttention(Q_decoder)
→ Add & Norm
→ CrossAttention(Q_norm, K_encoder, V_encoder)
→ Add & Norm
→ FFN
→ Add & Norm
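A compact sketch of that layer in PyTorch, following the post-norm ordering listed above (dropout and other training details omitted; this is illustrative, not a specific library's implementation):

```python
import torch
from torch import nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, memory, causal_mask):
        # 1) Masked self-attention over the decoder's own sequence.
        x = self.norm1(x + self.self_attn(x, x, x, attn_mask=causal_mask)[0])
        # 2) Cross-attention: decoder queries, encoder keys/values ("memory").
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        # 3) Position-wise feed-forward network.
        x = self.norm3(x + self.ffn(x))
        return x

# Toy usage: 6 decoder positions attending over 9 encoder positions.
x, memory = torch.randn(2, 6, 512), torch.randn(2, 9, 512)
causal = torch.triu(torch.ones(6, 6, dtype=torch.bool), diagonal=1)  # True = blocked future position
print(DecoderLayer()(x, memory, causal).shape)  # (2, 6, 512)
```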
Multimodal Applications
Cross-attention enables models to connect different modalities:
- Image captioning: Text attending to image features
- VQA: Question attending to image regions
- Speech recognition: Text attending to audio spectrograms
- Video understanding: Text attending to video frames