Introduction
The Transformer decoder is the generation half of the encoder-decoder Transformer architecture. It produces output sequences autoregressively, one token at a time, attending both to the encoder's output and to its own previously generated tokens. Masked self-attention prevents each position from attending to future positions.
Architecture Overview
```
Input: Previously Generated Tokens (shifted right)
                │
                ▼
┌───────────────────────────────┐
│         DECODER BLOCK         │
│  ┌─────────────────────────┐  │
│  │  Masked Self-Attention  │  │
│  │  (causal - no future)   │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │     Cross-Attention     │  │
│  │   (attend to encoder)   │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │      Feed-Forward       │  │
│  │         Network         │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  └─────────────────────────┘  │
└───────────────────────────────┘
                │
                ▼
  Linear + Softmax (vocab prediction)
```
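As a rough sketch of the final prediction step shown above, the snippet below projects decoder outputs onto the vocabulary and applies a softmax. The sizes `d_model` and `vocab_size`, the weight matrix `W_out`, and the random inputs are illustrative assumptions, not values from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, vocab_size = 512, 32000                      # assumed sizes for illustration
W_out = np.random.randn(d_model, vocab_size) * 0.02   # output projection weights

decoder_output = np.random.randn(7, d_model)   # (seq_len, d_model) from the last decoder block
logits = decoder_output @ W_out                # (seq_len, vocab_size)
probs = softmax(logits)                        # per-position distribution over the vocabulary
next_token = probs[-1].argmax()                # greedy pick for the newest position
```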
Decoder Components
1. Masked Self-Attention (Causal)
Prevents attending to future positions by applying a causal mask: position i can attend only to positions 0 through i.
mask[i,j] = 0 if j ≤ i, else -∞
eᵢⱼ_masked = eᵢⱼ + mask[i,j]
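A minimal NumPy sketch of this masking step, assuming the raw attention scores e (shape (seq_len, seq_len)) have already been computed:

```python
import numpy as np

def causal_mask(seq_len):
    # mask[i, j] = 0 where j <= i, -inf where j > i (future positions)
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

seq_len = 5
e = np.random.randn(seq_len, seq_len)     # raw attention scores (Q @ K^T / sqrt(d_k))
e_masked = e + causal_mask(seq_len)       # -inf entries become weight 0 after softmax

weights = np.exp(e_masked - e_masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax; future positions get weight 0
```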
2. Cross-Attention
Attends to the encoder output (keys and values come from the encoder):
Q — queries come from the decoder (output of the masked self-attention sub-layer)
K — keys come from the encoder output
V — values come from the encoder output
This is how the decoder accesses encoder information.
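A minimal NumPy sketch of cross-attention; the projection matrices Wq, Wk, Wv and the sizes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 64
decoder_states = np.random.randn(4, d_model)   # 4 target positions (from masked self-attention)
encoder_output = np.random.randn(9, d_model)   # 9 source positions (from the encoder stack)

Wq = np.random.randn(d_model, d_model) * 0.1   # query projection (decoder side)
Wk = np.random.randn(d_model, d_model) * 0.1   # key projection (encoder side)
Wv = np.random.randn(d_model, d_model) * 0.1   # value projection (encoder side)

Q = decoder_states @ Wq        # queries come from the decoder
K = encoder_output @ Wk        # keys come from the encoder
V = encoder_output @ Wv        # values come from the encoder

scores = Q @ K.T / np.sqrt(d_model)   # (4, 9): every target position scores every source position
attn = softmax(scores) @ V            # (4, d_model): encoder information mixed into decoder states
```

Note that no causal mask is applied here: every decoder position may attend to every encoder position.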
3. Feed-Forward Network
Identical in form to the encoder FFN: a position-wise transformation that expands to d_ff (2048 in the original Transformer, with d_model = 512) and projects back to d_model.
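A small sketch of the position-wise FFN, using the dimensions and ReLU activation from the original Transformer (the random weights are placeholders):

```python
import numpy as np

d_model, d_ff = 512, 2048                        # dimensions from the original Transformer
W1 = np.random.randn(d_model, d_ff) * 0.02       # expansion weights
W2 = np.random.randn(d_ff, d_model) * 0.02       # projection back to d_model
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

def ffn(x):
    # applied identically and independently at every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU(x W1 + b1) W2 + b2

x = np.random.randn(7, d_model)   # 7 positions
out = ffn(x)                      # same shape: (7, d_model)
```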
Autoregressive Generation
The decoder generates one token at a time:
Step 1: Input = [START], output = "The"
Step 2: Input = [START, "The"], output = "cat"
Step 3: Input = [START, "The", "cat"], output = "sat"
...
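A minimal greedy-decoding loop illustrating this process; `decoder_step` is a hypothetical function that runs the full decoder on the tokens generated so far and returns next-token probabilities:

```python
START, END = 0, 1    # special token ids (illustrative)
MAX_LEN = 20

def generate(decoder_step, encoder_output):
    tokens = [START]
    while len(tokens) < MAX_LEN:
        probs = decoder_step(tokens, encoder_output)   # full decoder pass over tokens so far
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy: pick the argmax
        if next_token == END:
            break
        tokens.append(next_token)   # feed the new token back in (autoregressive)
    return tokens[1:]               # drop the START token
```

Beam search or sampling can replace the greedy argmax, but the token-by-token loop stays the same.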
Full Decoder Stack
N identical layers (N = 6 in the original Transformer)
Each layer contains (sketched below):
- Masked self-attention
- Cross-attention
- Feed-forward network
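Putting the three sub-layers together, a schematic of one decoder layer; `self_attn`, `cross_attn`, `ffn`, and `layer_norm` are placeholders standing in for the pieces sketched above:

```python
def decoder_layer(x, encoder_output, self_attn, cross_attn, ffn, layer_norm):
    # 1. Masked self-attention, followed by Add & Norm (residual + layer norm)
    x = layer_norm(x + self_attn(x))
    # 2. Cross-attention over the encoder output, followed by Add & Norm
    x = layer_norm(x + cross_attn(x, encoder_output))
    # 3. Position-wise feed-forward network, followed by Add & Norm
    x = layer_norm(x + ffn(x))
    return x
```

Stacking N such layers, with the final output fed to the Linear + Softmax head, gives the full decoder.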
Key Differences: Encoder vs Decoder
| Aspect | Encoder | Decoder |
|---|---|---|
| Self-Attention | Unmasked (bidirectional) | Masked (causal) |
| Cross-Attention | None | Yes (Q from decoder, K,V from encoder) |
| Input | Source sequence | Previously generated tokens |
| Output | Contextual embeddings | Vocabulary probabilities |
Used In
- GPT series: Decoder-only language models
- T5: Encoder-decoder for translation/summarization
- BART: Encoder-decoder for generation
- Original Transformer: Neural machine translation (NMT)