Introduction
The Transformer decoder is the generation half of the encoder-decoder Transformer architecture. It produces output sequences autoregressively, one token at a time, attending both to the encoder's output and to its own previously generated tokens. Masked self-attention prevents each position from attending to future positions.
Architecture Overview
```
Input: Previously Generated Tokens (shifted right)
                │
                ▼
┌───────────────────────────────┐
│         DECODER BLOCK         │
│  ┌─────────────────────────┐  │
│  │  Masked Self-Attention  │  │
│  │  (causal - no future)   │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │     Cross-Attention     │  │
│  │   (attend to encoder)   │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │      Feed-Forward       │  │
│  │         Network         │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  └─────────────────────────┘  │
└───────────────────────────────┘
                │
                ▼
  Linear + Softmax (vocab prediction)
```
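As a rough sketch of the final prediction step shown above, the snippet below projects decoder outputs onto the vocabulary and applies a softmax. The sizes `d_model` and `vocab_size`, the weight matrix `W_out`, and the random inputs are illustrative assumptions, not values from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, vocab_size = 512, 32000                      # assumed sizes for illustration
W_out = np.random.randn(d_model, vocab_size) * 0.02   # output projection weights

decoder_output = np.random.randn(7, d_model)   # (seq_len, d_model) from the last decoder block
logits = decoder_output @ W_out                # (seq_len, vocab_size)
probs = softmax(logits)                        # per-position distribution over the vocabulary
next_token = probs[-1].argmax()                # greedy pick for the newest position
```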
Decoder Components
1. Masked Self-Attention (Causal)
Prevents attending to future positions by applying a causal mask: position i can attend only to positions 0 through i.
mask[i,j] = 0 if j ≤ i, else -∞
eᵢⱼ_masked = eᵢⱼ + mask[i,j]
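A minimal NumPy sketch of this masking step, assuming the raw attention scores e (shape (seq_len, seq_len)) have already been computed:

```python
import numpy as np

def causal_mask(seq_len):
    # mask[i, j] = 0 where j <= i, -inf where j > i (future positions)
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

seq_len = 5
e = np.random.randn(seq_len, seq_len)     # raw attention scores (Q @ K^T / sqrt(d_k))
e_masked = e + causal_mask(seq_len)       # -inf entries become weight 0 after softmax

weights = np.exp(e_masked - e_masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax; future positions get weight 0
```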
2. Cross-Attention
Attends to the encoder output (keys and values come from the encoder):
Q — queries come from the decoder (output of the masked self-attention sub-layer)
K — keys come from the encoder output
V — values come from the encoder output
This is how the decoder accesses encoder information.
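A minimal NumPy sketch of cross-attention; the projection matrices Wq, Wk, Wv and the sizes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 64
decoder_states = np.random.randn(4, d_model)   # 4 target positions (from masked self-attention)
encoder_output = np.random.randn(9, d_model)   # 9 source positions (from the encoder stack)

Wq = np.random.randn(d_model, d_model) * 0.1   # query projection (decoder side)
Wk = np.random.randn(d_model, d_model) * 0.1   # key projection (encoder side)
Wv = np.random.randn(d_model, d_model) * 0.1   # value projection (encoder side)

Q = decoder_states @ Wq        # queries come from the decoder
K = encoder_output @ Wk        # keys come from the encoder
V = encoder_output @ Wv        # values come from the encoder

scores = Q @ K.T / np.sqrt(d_model)   # (4, 9): every target position scores every source position
attn = softmax(scores) @ V            # (4, d_model): encoder information mixed into decoder states
```

Note that no causal mask is applied here: every decoder position may attend to every encoder position.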
3. Feed-Forward Network
Identical in form to the encoder FFN: a position-wise transformation that expands to d_ff (2048 in the original Transformer, with d_model = 512) and projects back to d_model.
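A small sketch of the position-wise FFN, using the dimensions and ReLU activation from the original Transformer (the random weights are placeholders):

```python
import numpy as np

d_model, d_ff = 512, 2048                        # dimensions from the original Transformer
W1 = np.random.randn(d_model, d_ff) * 0.02       # expansion weights
W2 = np.random.randn(d_ff, d_model) * 0.02       # projection back to d_model
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

def ffn(x):
    # applied identically and independently at every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU(x W1 + b1) W2 + b2

x = np.random.randn(7, d_model)   # 7 positions
out = ffn(x)                      # same shape: (7, d_model)
```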
Autoregressive Generation
The decoder generates one token at a time:
Step 1: Input = [START], output = "The"
Step 2: Input = [START, "The"], output = "cat"
Step 3: Input = [START, "The", "cat"], output = "sat"
...
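A minimal greedy-decoding loop illustrating this process; `decoder_step` is a hypothetical function that runs the full decoder on the tokens generated so far and returns next-token probabilities:

```python
START, END = 0, 1    # special token ids (illustrative)
MAX_LEN = 20

def generate(decoder_step, encoder_output):
    tokens = [START]
    while len(tokens) < MAX_LEN:
        probs = decoder_step(tokens, encoder_output)   # full decoder pass over tokens so far
        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy: pick the argmax
        if next_token == END:
            break
        tokens.append(next_token)   # feed the new token back in (autoregressive)
    return tokens[1:]               # drop the START token
```

Beam search or sampling can replace the greedy argmax, but the token-by-token loop stays the same.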
Full Decoder Stack
N identical layers (N = 6 in the original Transformer)
Each layer contains (sketched below):
- Masked self-attention
- Cross-attention
- Feed-forward network
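Putting the three sub-layers together, a schematic of one decoder layer; `self_attn`, `cross_attn`, `ffn`, and `layer_norm` are placeholders standing in for the pieces sketched above:

```python
def decoder_layer(x, encoder_output, self_attn, cross_attn, ffn, layer_norm):
    # 1. Masked self-attention, followed by Add & Norm (residual + layer norm)
    x = layer_norm(x + self_attn(x))
    # 2. Cross-attention over the encoder output, followed by Add & Norm
    x = layer_norm(x + cross_attn(x, encoder_output))
    # 3. Position-wise feed-forward network, followed by Add & Norm
    x = layer_norm(x + ffn(x))
    return x
```

Stacking N such layers, with the final output fed to the Linear + Softmax head, gives the full decoder.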
Key Differences: Encoder vs Decoder
| Aspect | Encoder | Decoder |
|---|---|---|
| Self-Attention | Unmasked (bidirectional) | Masked (causal) |
| Cross-Attention | None | Yes (Q from decoder, K,V from encoder) |
| Input | Source sequence | Previously generated tokens |
| Output | Contextual embeddings | Vocabulary probabilities |
Used In
- GPT series: Decoder-only language models
- T5: Encoder-decoder for translation/summarization
- BART: Encoder-decoder for generation
- Original Transformer: Neural machine translation (NMT)