Introduction
Encoder-decoder attention (also called cross-attention) is the attention mechanism used in sequence-to-sequence models where the queries come from the decoder (the output side) and the keys and values come from the encoder (the input side). This lets the decoder focus on the relevant parts of the source sequence when generating each output token.
Architecture Overview
The standard seq2seq architecture with attention consists of three components (sketched in code after this list):
1. Encoder
The encoder processes the source sequence and produces a sequence of hidden states. Each position in the encoder has a corresponding hidden state that captures information about that position and its context.
2. Decoder
The decoder generates the target sequence one token at a time. At each step, it uses previously generated tokens to predict the next token.
3. Attention Mechanism
The attention mechanism connects the decoder to the encoder, allowing each decoder step to "look at" all encoder hidden states.
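A minimal PyTorch sketch of how the three components fit together. Every name and size here is an illustrative assumption (a bidirectional GRU encoder, a multiplicative-style scoring layer, a GRU cell decoder), not the design of any specific paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps source tokens to one hidden state per source position."""
    def __init__(self, vocab, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, hid, batch_first=True, bidirectional=True)

    def forward(self, src):                        # src: (batch, n)
        states, _ = self.rnn(self.embed(src))      # states: (batch, n, 2*hid)
        return states

class DecoderStep(nn.Module):
    """One decoding step: attend over encoder states, then update s_t."""
    def __init__(self, vocab, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.attn_proj = nn.Linear(hid, 2 * hid)   # maps s_{t-1} into encoder space
        self.cell = nn.GRUCell(emb + 2 * hid, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, y_prev, s_prev, enc_states):
        q = self.attn_proj(s_prev).unsqueeze(1)              # (batch, 1, 2*hid)
        scores = (q * enc_states).sum(-1)                    # one score per source position
        alpha = F.softmax(scores, dim=-1)                    # attention weights
        context = (alpha.unsqueeze(-1) * enc_states).sum(1)  # weighted sum of h_j
        s_t = self.cell(torch.cat([self.embed(y_prev), context], dim=-1), s_prev)
        return self.out(s_t), s_t, alpha                     # logits, new state, weights
```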
Mathematical Formulation
For a source sequence of length n and target generation at time step t:
Alignment Scores: eₜⱼ = a(sₜ₋₁, hⱼ) for j = 1 to n
Attention Weights: αₜⱼ = exp(eₜⱼ) / Σₖ exp(eₜₖ) (softmax over source positions)
Context Vector: cₜ = Σⱼ αₜⱼ · hⱼ
Decoder State: sₜ = f(sₜ₋₁, yₜ₋₁, cₜ)
Output: yₜ = g(sₜ, yₜ₋₁, cₜ)
Where:
- sₜ₋₁: Previous decoder hidden state
- hⱼ: Encoder hidden state at position j
- a(·): Alignment (scoring) function
- eₜⱼ: Alignment score between decoder step t and source position j
- αₜⱼ: Attention weight showing how much source position j contributes at output step t
- cₜ: Context vector for decoder step t
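A NumPy sketch of one decoder step under these equations. The alignment function a(·) here is a plain dot product purely for illustration; the shapes and names are assumptions:

```python
import numpy as np

def attention_step(s_prev, H):
    """One attention step: scores, softmax weights, context vector.
    s_prev: previous decoder state s_{t-1}, shape (d,)
    H:      encoder hidden states, shape (n, d), one row per source position
    """
    e = H @ s_prev                    # e_tj = a(s_{t-1}, h_j), dot product here
    alpha = np.exp(e - e.max())       # softmax over source positions j
    alpha /= alpha.sum()
    c = alpha @ H                     # c_t = sum_j alpha_tj * h_j
    return alpha, c

H = np.random.randn(6, 8)             # n = 6 source positions, d = 8
alpha, c = attention_step(np.random.randn(8), H)
print(alpha.sum(), c.shape)           # 1.0 (weights sum to one), (8,)
```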
Alignment Function Types
Additive Attention (Bahdanau)
The original attention mechanism uses a small feed-forward network to compute alignment scores:
eₜⱼ = vᵀ · tanh(W·sₜ₋₁ + U·hⱼ)
Where W, U, and v are learnable parameters.
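A NumPy sketch of this scoring function. The dimensions (d_dec, d_enc, k) and the random arrays standing in for the learned parameters are assumptions for illustration:

```python
import numpy as np

def additive_score(s_prev, H, W, U, v):
    """Bahdanau additive alignment: e_tj = v^T tanh(W s_{t-1} + U h_j)."""
    return np.tanh(s_prev @ W.T + H @ U.T) @ v    # (n,): one score per position

d_dec, d_enc, k, n = 8, 8, 16, 6                  # k = hidden size of the scorer
rng = np.random.default_rng(0)
W = rng.standard_normal((k, d_dec))               # learnable in a real model
U = rng.standard_normal((k, d_enc))
v = rng.standard_normal(k)
e = additive_score(rng.standard_normal(d_dec), rng.standard_normal((n, d_enc)), W, U, v)
print(e.shape)                                    # (6,)
```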
Multiplicative Attention (Luong)
Uses a dot product between the decoder state and each encoder hidden state:
eₜⱼ = sₜᵀ · hⱼ [Dot product]
eₜⱼ = sₜᵀ · W · hⱼ [General: a learned bilinear form]
Note that Luong attention scores with the current decoder state sₜ rather than the previous state sₜ₋₁.
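Both scoring functions in NumPy, with a random stand-in for the learned matrix W:

```python
import numpy as np

def dot_score(s, H):
    """Luong dot alignment: e_tj = s_t . h_j (decoder and encoder dims must match)."""
    return H @ s                                  # (n,)

def general_score(s, H, W):
    """Luong general alignment: e_tj = s_t^T W h_j (W is learned)."""
    return H @ (W.T @ s)                          # row j gives s^T W h_j

rng = np.random.default_rng(1)
s, H, W = rng.standard_normal(8), rng.standard_normal((6, 8)), rng.standard_normal((8, 8))
print(dot_score(s, H).shape, general_score(s, H, W).shape)   # (6,) (6,)
```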
Attention in Different RNN Variants
1. Bahdanau (Bidirectional Encoder)
Uses a bidirectional encoder, so each source position has a hidden state capturing context from both directions. At every step, the decoder's state update takes the previous decoder state, the previously generated token, and the context vector.
2. Luong (Unidirectional Encoder)
Luong et al. proposed several scoring functions (dot, general, and concat) and tested them with unidirectional encoders.
Example: Machine Translation
Source: "The cat sat on the mat" (English)
Target: "Le chat s'est assis sur le tapis" (French)
At the step that generates "chat" ("cat" in French), illustrated numerically after this list:
- The decoder state represents the partial translation "Le "
- The alignment scores are highest for "cat", with some weight on "The"
- The context vector is therefore built mostly from the representations of "cat" and "The"
- This context helps generate "chat"
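A toy computation for this step. The alignment scores below are hand-picked to match the story, not output from a trained model:

```python
import numpy as np

src = ["The", "cat", "sat", "on", "the", "mat"]
e = np.array([1.5, 3.0, -0.5, -1.0, -0.8, -1.2])   # hypothetical scores e_tj
alpha = np.exp(e) / np.exp(e).sum()                 # softmax -> attention weights
for tok, w in zip(src, alpha):
    print(f"{tok:>4s}  {w:.2f}")                    # "cat" ~0.76, "The" ~0.17
# The context vector would be dominated by the representation of "cat".
```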
Encoder-Decoder vs Self-Attention
Key differences (contrasted in the sketch below):
- Directionality: Encoder-decoder attention is cross-attention between two sequences; self-attention operates within one sequence (neither is symmetric in general, since αᵢⱼ ≠ αⱼᵢ)
- Source: Encoder-decoder attention connects two different sequences; self-attention connects positions within the same sequence
- Queries: In encoder-decoder attention, queries come from the decoder while keys and values come from the encoder; in self-attention, queries, keys, and values all come from the same sequence
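A NumPy sketch of the contrast. Real Transformer layers apply learned linear projections to produce Q, K, and V; those are omitted here to keep the difference in sources visible:

```python
import numpy as np

def attend(Q, K, V):
    """Scaled dot-product attention of each row of Q over the rows of K."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
enc = rng.standard_normal((6, 8))   # encoder states, source length 6
dec = rng.standard_normal((3, 8))   # decoder states, target length 3

self_out = attend(enc, enc, enc)    # self-attention: Q, K, V from one sequence
cross_out = attend(dec, enc, enc)   # cross-attention: Q from decoder, K/V from encoder
print(self_out.shape, cross_out.shape)   # (6, 8) (3, 8)
```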
Modern Usage
While Transformers have replaced RNNs in most sequence tasks, encoder-decoder (cross-) attention remains a key component:
- T5 Model: Uses encoder-decoder transformer architecture
- BART: Sequence-to-sequence pretraining with encoder-decoder
- Vision-Language Models: Cross-attention from a language decoder to a visual encoder
- Speech Recognition: Attention between acoustic features and the transcript
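For a concrete look at cross-attention in a modern model, the sketch below inspects T5's cross-attention weights. It assumes the Hugging Face transformers library is installed; cross_attentions is that library's name for these weights in encoder-decoder models:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tok("translate English to French: The cat sat on the mat", return_tensors="pt")
dec_ids = tok("Le chat", return_tensors="pt").input_ids

out = model(input_ids=enc.input_ids, decoder_input_ids=dec_ids, output_attentions=True)
# One tensor per decoder layer, shaped (batch, heads, target_len, source_len)
print(len(out.cross_attentions), out.cross_attentions[0].shape)
```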