Introduction
Cross-attention is an attention mechanism where queries come from one sequence and keys/values come from a different sequence. It enables the decoder to access information from the encoder in encoder-decoder architectures, and is fundamental to multimodal models that combine different data types.
Core Concept
┌─────────────────┐ ┌─────────────────┐
│ Sequence A │ │ Sequence B │
│ (Query source) │ │ (Key/Value) │
└────────┬────────┘ └────────┬────────┘
│ │
│ Q K, V
│ │
▼ ▼
┌─────────────────────────────────┐
│ CROSS-ATTENTION │
│ Attention(Q_from_A, K_from_B, │
│ V_from_B) │
└─────────────────────────────────┘
│
▼
Output: Sequence A's positions attending to
information from Sequence B
Mathematical Formulation
Q = X_decoder · W_Q
K = X_encoder · W_K
V = X_encoder · W_V
CrossAttention(Q, K, V) = softmax(QKᵀ/√d_k) · V
where X_decoder supplies the queries, X_encoder supplies the keys and values, and d_k is the key dimension.
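As a concrete reference, here is a minimal single-head sketch of this formulation in PyTorch (names, shapes, and the random toy data are illustrative, not a specific model's API):

```python
import torch
import torch.nn.functional as F

def cross_attention(x_dec, x_enc, w_q, w_k, w_v):
    """Single-head cross-attention following the formulation above.

    x_dec: (tgt_len, d_model)  query-side sequence (e.g., decoder states)
    x_enc: (src_len, d_model)  key/value-side sequence (e.g., encoder output)
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q = x_dec @ w_q                                # (tgt_len, d_k)
    k = x_enc @ w_k                                # (src_len, d_k)
    v = x_enc @ w_v                                # (src_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (tgt_len, src_len)
    weights = F.softmax(scores, dim=-1)            # each query row sums to 1
    return weights @ v                             # (tgt_len, d_k)

# Toy shapes: 5 query positions attend over 7 key/value positions.
d_model, d_k = 16, 8
x_dec, x_enc = torch.randn(5, d_model), torch.randn(7, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
print(cross_attention(x_dec, x_enc, w_q, w_k, w_v).shape)  # torch.Size([5, 8])
```

Note that the output has one row per query position: cross-attention always produces a sequence the length of the query side, regardless of how long the key/value side is.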
Cross-Attention vs Self-Attention
| Aspect | Self-Attention | Cross-Attention |
|---|---|---|
| Q source | Same sequence | Sequence A (decoder) |
| K, V source | Same sequence | Sequence B (encoder) |
| Captures | Within-sequence relationships | Cross-sequence relationships |
| Q and K,V share a source? | Yes (same sequence) | No (different sequences) |
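At the level of code, the only difference is where the keys and values come from. A quick illustration using PyTorch's nn.MultiheadAttention (the same module is reused for both calls purely to show the call pattern; shapes are arbitrary):

```python
import torch
from torch import nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x      = torch.randn(2, 5, 16)   # sequence A: batch=2, len=5, dim=16
memory = torch.randn(2, 7, 16)   # sequence B: len=7

self_out,  _ = attn(x, x, x)             # self-attention: Q, K, V all from x
cross_out, _ = attn(x, memory, memory)   # cross-attention: Q from x, K/V from memory
print(self_out.shape, cross_out.shape)   # both (2, 5, 16): output length follows the queries
```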
Use Cases
1. Machine Translation (Encoder-Decoder)
Decoder queries attend to encoder keys/values (source language):
French encoder output → keys/values
English decoder query → attention output
Result: English token attending to relevant French positions
2. Vision-Language (e.g., Flamingo)
Image encoder → keys/values
Text tokens → queries
Result: Text tokens attending to relevant image regions
3. Document Question Answering
Document encoder → keys/values
Question encoder → queries
Result: Question answering using document context
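All three use cases reduce to the same call pattern: the sequence you want to pull information from supplies the keys and values, and the sequence doing the asking supplies the queries. A minimal sketch using the document-QA case (projection matrices are omitted and all names and sizes are made up for illustration):

```python
import torch

d = 32
document_states = torch.randn(200, d)   # encoder output for a 200-token document -> K, V
question_states = torch.randn(12, d)    # encoder output for a 12-token question  -> Q

scores  = question_states @ document_states.T / d ** 0.5   # (12, 200)
weights = scores.softmax(dim=-1)        # each question token's distribution over document positions
context = weights @ document_states     # (12, d): document information gathered per question token
```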
Key Properties
- Asymmetric: Q and K,V can come from different sources
- Bidirectional (typically): a causal mask is usually not applied in cross-attention, so every query position can attend to every key/value position (see the sketch after this list)
- Flexible: Works with any combination of modalities
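A practical consequence of the "no causal mask" point: the mask most commonly passed to cross-attention is a padding mask over the key/value side, so queries simply skip padded encoder positions. A minimal sketch with PyTorch's nn.MultiheadAttention, assuming a batch whose second source sequence is padded (sizes are illustrative):

```python
import torch
from torch import nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

decoder_states = torch.randn(2, 5, 16)   # queries: all 5 target positions
encoder_states = torch.randn(2, 7, 16)   # keys/values: source, padded to length 7
src_padding = torch.tensor([[False] * 7,
                            [False] * 4 + [True] * 3])  # True marks padded source positions

# No attn_mask (causal mask) here: every target position may attend to
# every non-padded source position.
out, weights = attn(decoder_states, encoder_states, encoder_states,
                    key_padding_mask=src_padding)
print(out.shape)  # (2, 5, 16)
```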
In Transformer Decoder
Each decoder layer has cross-attention after masked self-attention:
Layer = MaskedSelfAttention(Q_decoder)
→ Add & Norm
→ CrossAttention(Q_norm, K_encoder, V_encoder)
→ Add & Norm
→ FFN
→ Add & Norm
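A compact sketch of that layer in PyTorch, following the post-norm ordering listed above (dropout and other training details omitted; this is illustrative, not a specific library's implementation):

```python
import torch
from torch import nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, memory, causal_mask):
        # 1) Masked self-attention over the decoder's own sequence.
        x = self.norm1(x + self.self_attn(x, x, x, attn_mask=causal_mask)[0])
        # 2) Cross-attention: decoder queries, encoder keys/values ("memory").
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        # 3) Position-wise feed-forward network.
        x = self.norm3(x + self.ffn(x))
        return x

# Toy usage: 6 decoder positions attending over 9 encoder positions.
x, memory = torch.randn(2, 6, 512), torch.randn(2, 9, 512)
causal = torch.triu(torch.ones(6, 6, dtype=torch.bool), diagonal=1)  # True = blocked future position
print(DecoderLayer()(x, memory, causal).shape)  # (2, 6, 512)
```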
Multimodal Applications
Cross-attention enables models to connect different modalities:
- Image captioning: Text attending to image features
- VQA: Question attending to image regions
- Speech recognition: Text attending to audio spectrograms
- Video understanding: Text attending to video frames