21. Cross-Attention

Introduction

Cross-attention is an attention mechanism where queries come from one sequence and keys/values come from a different sequence. It enables the decoder to access information from the encoder in encoder-decoder architectures, and is fundamental to multimodal models that combine different data types.

Core Concept

┌─────────────────┐       ┌─────────────────┐
│   Sequence A    │       │   Sequence B    │
│  (Query source) │       │   (Key/Value)   │
└────────┬────────┘       └────────┬────────┘
         │ Q                       │ K, V
         ▼                         ▼
┌─────────────────────────────────┐
│         CROSS-ATTENTION         │
│  Attention(Q_from_A, K_from_B,  │
│            V_from_B)            │
└─────────────────────────────────┘
                │
                ▼
Output: Sequence A's positions attending to
        information from Sequence B

Mathematical Formulation

Q = X_decoder · W_Q
K = X_encoder · W_K
V = X_encoder · W_V

CrossAttention(Q, K, V) = softmax(QKᵀ/√d) · V
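
A minimal sketch of these equations in NumPy (the function and variable names are illustrative, not taken from any library):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def cross_attention(x_a, x_b, W_Q, W_K, W_V):
        # x_a: (len_a, d_model)  query-side sequence (e.g., decoder states)
        # x_b: (len_b, d_model)  key/value-side sequence (e.g., encoder output)
        Q = x_a @ W_Q                              # (len_a, d_k)
        K = x_b @ W_K                              # (len_b, d_k)
        V = x_b @ W_V                              # (len_b, d_v)
        scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (len_a, len_b)
        weights = softmax(scores, axis=-1)         # each query row sums to 1
        return weights @ V                         # (len_a, d_v)

    # Toy example: 4 query positions attend to 6 key/value positions
    rng = np.random.default_rng(0)
    d_model = 8
    x_dec = rng.normal(size=(4, d_model))          # "decoder" side → Q
    x_enc = rng.normal(size=(6, d_model))          # "encoder" side → K, V
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    print(cross_attention(x_dec, x_enc, W_Q, W_K, W_V).shape)   # (4, 8)

Note that the output has one row per query position: the query sequence determines the output length, while the key/value sequence only supplies the information being attended to.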

Cross-Attention vs Self-Attention

Aspect       | Self-Attention                 | Cross-Attention
-------------|--------------------------------|------------------------------
Q source     | Same sequence                  | Sequence A (decoder)
K, V source  | Same sequence                  | Sequence B (encoder)
Captures     | Within-sequence relationships  | Cross-sequence relationships
Symmetric?   | Yes (Q, K, V from same input)  | No (asymmetric)
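
One way to see the relationship: self-attention is simply cross-attention where both sequences are the same tensor. A small PyTorch sketch (shapes and dimensions are chosen arbitrarily for illustration):

    import torch
    import torch.nn as nn

    attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

    a = torch.randn(2, 5, 64)      # Sequence A: batch of 2, 5 positions
    b = torch.randn(2, 9, 64)      # Sequence B: batch of 2, 9 positions

    self_out, _  = attn(a, a, a)   # self-attention: Q, K, V from the same sequence
    cross_out, _ = attn(a, b, b)   # cross-attention: Q from A, K/V from B

    print(self_out.shape)   # torch.Size([2, 5, 64]) — one output per position of A
    print(cross_out.shape)  # torch.Size([2, 5, 64]) — still indexed by A's positions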

Use Cases

1. Machine Translation (Encoder-Decoder)

Decoder queries attend to encoder keys/values (source language):

French encoder output → keys/values
English decoder states → queries
Result: each English token attends to the relevant French source positions
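
An illustrative PyTorch sketch of this direction (the French/English labels and all shapes are assumptions for the example); the returned attention weights act as a soft alignment from each English position to the French source positions:

    import torch
    import torch.nn as nn

    d_model = 64
    cross_attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

    src = torch.randn(1, 10, d_model)   # encoder output for 10 French tokens → K, V
    tgt = torch.randn(1, 6, d_model)    # decoder states for 6 English tokens  → Q

    out, align = cross_attn(tgt, src, src, need_weights=True)
    print(out.shape)                # torch.Size([1, 6, 64]): one vector per English position
    print(align.shape)              # torch.Size([1, 6, 10]): attention over French positions
    print(align[0].argmax(dim=-1))  # most-attended French position for each English token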

2. Vision-Language Models (e.g., Flamingo, BLIP)

Image encoder output → keys/values
Text tokens (language model states) → queries
Result: Text tokens attending to relevant image regions
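
A sketch of the same pattern with image features, assuming a ViT-style 14×14 grid of patch embeddings (all shapes here are illustrative); each text token's attention weights can be reshaped onto the patch grid as a rough spatial map of the regions it attends to:

    import torch
    import torch.nn as nn

    d = 64
    attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)

    patches = torch.randn(1, 196, d)   # image encoder output: 14x14 patch features → K, V
    text    = torch.randn(1, 5, d)     # 5 text token embeddings → Q

    fused, w = attn(text, patches, patches, need_weights=True)
    print(fused.shape)                 # torch.Size([1, 5, 64])
    heatmaps = w.reshape(5, 14, 14)    # per-token attention over the image grid
    print(heatmaps.sum(dim=(1, 2)))    # each map sums to 1 (softmax over patches)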

3. Document Question Answering

Document encoder → keys/values
Question encoder → queries
Result: Question answering using document context

Key Properties

In Transformer Decoder

Each decoder layer has cross-attention after masked self-attention:

Masked Self-Attention (Q, K, V from decoder)
  → Add & Norm
  → Cross-Attention (Q from decoder, K and V from encoder output)
  → Add & Norm
  → FFN
  → Add & Norm
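
A compact sketch of such a layer in PyTorch (post-norm ordering, as in the original Transformer; the class name, dimensions, and tensor shapes are illustrative):

    import torch
    import torch.nn as nn

    class DecoderLayer(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

        def forward(self, x, memory, causal_mask):
            # 1. Masked self-attention over the decoder's own positions
            h, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
            x = self.norm1(x + h)                      # Add & Norm
            # 2. Cross-attention: decoder queries, encoder keys/values
            h, _ = self.cross_attn(x, memory, memory)
            x = self.norm2(x + h)                      # Add & Norm
            # 3. Position-wise feed-forward
            x = self.norm3(x + self.ffn(x))            # Add & Norm
            return x

    tgt = torch.randn(2, 7, 512)     # decoder input (target tokens so far)
    mem = torch.randn(2, 12, 512)    # encoder output
    mask = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)   # causal mask
    print(DecoderLayer()(tgt, mem, mask).shape)   # torch.Size([2, 7, 512])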

Multimodal Applications

Cross-attention enables models to connect different modalities, for example:

  • Text ↔ image (image captioning, visual question answering, text-to-image generation)
  • Text ↔ audio (speech recognition, text-to-speech)
  • Text ↔ video (video captioning, video question answering)

Test Your Understanding

Question 1: In cross-attention, where do queries come from?

  • A) Same sequence as K and V
  • B) A different sequence than K and V
  • C) Random initialization
  • D) Only from encoder

Question 2: What is the key difference between self-attention and cross-attention?

  • A) Cross-attention uses more parameters
  • B) Cross-attention connects two different sequences
  • C) Self-attention is faster
  • D) Cross-attention is not used in Transformers

Question 3: In machine translation, where is cross-attention used?

  • A) Encoder to attend to decoder
  • B) Decoder to attend to encoder source
  • C) Within the encoder only
  • D) Within the decoder only

Question 4: In a vision-language model such as Flamingo or BLIP, what uses cross-attention?

  • A) Image encoder attending to text
  • B) Text encoder attending to image
  • C) Both directions use cross-attention
  • D) Neither uses cross-attention

Question 5: Is cross-attention typically symmetric?

  • A) Yes, always symmetric
  • B) No, Q and K,V come from different sources
  • C) Only in vision models
  • D) Only in text models

Question 6: In the Transformer decoder, where does cross-attention sit?

  • A) After masked self-attention (within each layer)
  • B) Before masked self-attention
  • C) Only in encoder
  • D) At the very end

Question 7: Which modality combinations use cross-attention?

  • A) Text-to-text only
  • B) Image-to-image only
  • C) Any modality combination (text, image, audio, video)
  • D) Only text and image

Question 8: Cross-attention output dimension is typically:

  • A) Same as Q dimension
  • B) Same as K dimension
  • C) Same as V dimension
  • D) Sum of all three