19. Transformer Decoder

Introduction

The Transformer decoder is the decoder half of the Transformer architecture. It generates output sequences autoregressively (one token at a time), attending to the encoder's output and its own previously generated tokens. The decoder uses masked self-attention to prevent attending to future positions.

Architecture Overview

Input: Previously Generated Tokens (shifted right)
                │
                ▼
┌───────────────────────────────┐
│         DECODER BLOCK         │
│  ┌─────────────────────────┐  │
│  │  Masked Self-Attention  │  │
│  │  (causal - no future)   │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │     Cross-Attention     │  │
│  │   (attend to encoder)   │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │      Feed-Forward       │  │
│  │        Network          │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  └─────────────────────────┘  │
└───────────────────────────────┘
                │
                ▼
   Linear + Softmax (vocab prediction)

Decoder Components

1. Masked Self-Attention (Causal)

Prevents attending to future positions by applying a causal mask:

Position i can attend only to positions 0 through i:

mask[i,j] = 0 if j ≤ i, else -∞

eᵢⱼ_masked = eᵢⱼ + mask[i,j]
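
As a concrete illustration, here is a minimal PyTorch sketch of building this mask and adding it to raw attention scores before the softmax (the function names are illustrative, not from any particular library):

import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # mask[i, j] = 0 where j <= i, -inf where j > i (future positions)
    full = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(full, diagonal=1)

def masked_attention_weights(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # q, k: (seq_len, d_k); raw scores e_ij = q_i · k_j / sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores + causal_mask(q.size(-2))   # e_ij + mask[i, j]
    return torch.softmax(scores, dim=-1)        # future positions get weight 0

Because exp(-∞) = 0, the softmax assigns exactly zero weight to every future position.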

2. Cross-Attention

Attends to encoder output (keys and values from encoder):

Q (queries) come from the decoder's previous sub-layer
K (keys) come from the encoder output
V (values) come from the encoder output

This is how the decoder accesses encoder information.
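
A sketch of single-head cross-attention under these assumptions (the weight matrices w_q, w_k, w_v are hypothetical stand-ins for learned projections):

import torch

def cross_attention(dec_x, enc_out, w_q, w_k, w_v):
    # Q from decoder states; K and V from encoder output
    q = dec_x @ w_q                                  # (tgt_len, d_k)
    k = enc_out @ w_k                                # (src_len, d_k)
    v = enc_out @ w_v                                # (src_len, d_v)
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    weights = torch.softmax(scores, dim=-1)          # no causal mask here
    return weights @ v                               # (tgt_len, d_v)

Note that cross-attention needs no causal mask: the full source sequence is already available, so every target position may attend to every source position.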

3. Feed-Forward Network

Same as the encoder's FFN: a position-wise transformation that expands from d_model to d_ff and projects back.
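
For reference, a minimal sketch of the position-wise FFN (d_model = 512 and d_ff = 2048 follow the original Transformer; the class name is illustrative):

import torch.nn as nn

class FeedForward(nn.Module):
    # Expands each position from d_model to d_ff, applies ReLU, projects back
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)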

Autoregressive Generation

The decoder generates one token at a time:

Step 1: Input = [START], output = "The"
Step 2: Input = [START, "The"], output = "cat"
Step 3: Input = [START, "The", "cat"], output = "sat"
...
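
This loop can be sketched as greedy decoding; the decoder call signature below (returning per-position vocabulary logits) is an assumption for illustration:

import torch

def greedy_decode(decoder, enc_out, start_id, end_id, max_len=50):
    # Assumes decoder(tokens, enc_out) returns logits of shape (1, len, vocab)
    tokens = [start_id]
    for _ in range(max_len):
        logits = decoder(torch.tensor([tokens]), enc_out)
        next_id = int(logits[0, -1].argmax())      # most likely next token
        tokens.append(next_id)
        if next_id == end_id:                      # stop at end-of-sequence
            break
    return tokens

In practice, beam search or sampling often replaces the argmax, but the token-by-token structure is the same.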

Full Decoder Stack

The decoder stacks N identical layers (N = 6 in the original Transformer).

Each layer has three sub-layers (combined in the sketch after this list):
  - Masked self-attention
  - Cross-attention
  - Feed-forward network
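
Putting the three sub-layers together, here is a minimal post-norm decoder layer in PyTorch (a sketch under the conventions above, not the reference implementation):

import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # Masked self-attention (causal_mask as built earlier) + Add & Norm
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)
        # Cross-attention: Q from decoder, K/V from encoder + Add & Norm
        a, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + a)
        # Feed-forward + Add & Norm
        return self.norm3(x + self.ffn(x))

The full decoder is then N of these layers applied in sequence (e.g. via nn.ModuleList), followed by the final linear projection and softmax over the vocabulary.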

Key Differences: Encoder vs Decoder

Aspect          | Encoder                  | Decoder
----------------|--------------------------|----------------------------------------
Self-Attention  | Unmasked (bidirectional) | Masked (causal)
Cross-Attention | None                     | Yes (Q from decoder; K, V from encoder)
Input           | Source sequence          | Previously generated tokens
Output          | Contextual embeddings    | Vocabulary probabilities

Used In

  • Encoder-decoder models: the original Transformer (machine translation), T5
  • Decoder-only language models: GPT-2, GPT-3, GPT-4

Test Your Understanding

Question 1: Why is decoder self-attention masked?

  • A) To hide position information
  • B) To prevent attending to future (yet-to-be-generated) tokens
  • C) To speed up computation
  • D) To reduce memory

Question 2: What is cross-attention in the decoder?

  • A) Attention within decoder
  • B) Q from decoder attending to K,V from encoder
  • C) Attention between decoder layers
  • D) Self-attention in encoder

Question 3: How does the decoder generate tokens?

  • A) All at once in parallel
  • B) One at a time autoregressively
  • C) Randomly
  • D) From encoder only

Question 4: What is the input to the decoder during generation?

  • A) Only the source sequence
  • B) Previously generated tokens
  • C) Random noise
  • D) Encoder embeddings only

Question 5: Which models use decoder-only architecture?

  • A) BERT
  • B) T5
  • C) GPT-2, GPT-3, GPT-4
  • D) ViT

Question 6: What does the causal mask do to attention scores for future positions?

  • A) Sets them to 0
  • B) Sets them to -∞
  • C) Doubles them
  • D) Leaves them unchanged

Question 7: How many sub-layers does each decoder layer have?

  • A) 1
  • B) 2
  • C) 3 (masked self, cross, FFN)
  • D) 4

Question 8: The decoder's final output is:

  • A) Embeddings for each position
  • B) Vocabulary probability distribution
  • C) Encoder representations
  • D) Attention weights