02. Encoder-Decoder Attention

Introduction

Encoder-decoder attention is a specific attention mechanism used in sequence-to-sequence models where the queries come from the decoder (output) and the keys and values come from the encoder (input). This allows the decoder to focus on relevant parts of the source sequence when generating each output token.

Architecture Overview

The standard seq2seq architecture with attention consists of:

1. Encoder

The encoder processes the source sequence and produces a sequence of hidden states. Each position in the encoder has a corresponding hidden state that captures information about that position and its context.

2. Decoder

The decoder generates the target sequence one token at a time. At each step, it uses previously generated tokens to predict the next token.

3. Attention Mechanism

The attention mechanism connects the decoder to the encoder, allowing each decoder step to "look at" all encoder hidden states.

┌─────────────────────────────────────────────────────────────┐
│                           ENCODER                           │
│        [Encoder Hidden States: h₁, h₂, h₃, ..., hₙ]         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                          ATTENTION                          │
│      Query: Decoder state        Keys: Encoder states       │
│     Output: Context vector weighted by attention scores     │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                           DECODER                            │
│           [Generates output one token at a time]            │
└─────────────────────────────────────────────────────────────┘

Mathematical Formulation

For a source sequence of length n and target generation at time step t:

Attention Score: eₜⱼ = a(sₜ₋₁, hⱼ)

Attention Weights: αₜⱼ = exp(eₜⱼ) / Σₖ exp(eₜₖ)  (softmax over the encoder positions j = 1, …, n)

Context Vector: cₜ = Σⱼ αₜⱼ · hⱼ

Decoder State: sₜ = f(sₜ₋₁, yₜ₋₁, cₜ)

Output: yₜ = g(sₜ, yₜ₋₁, cₜ)

Where:

  • sₜ is the decoder hidden state at step t (and sₜ₋₁ the previous one)
  • hⱼ is the encoder hidden state for source position j
  • eₜⱼ is the unnormalized alignment score between decoder step t and source position j
  • αₜⱼ is the attention weight obtained by normalizing the scores with a softmax
  • cₜ is the context vector, a weighted sum of the encoder hidden states
  • yₜ is the output token generated at step t
  • a, f, and g are learned functions: the alignment model, the decoder update, and the output layer
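The whole attention step fits in a few lines of NumPy. The sketch below is illustrative only: the shapes, the helper names, and the plain dot-product score used in the toy call are assumptions for demonstration, not any particular paper's architecture.

import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(s_prev, H, score_fn):
    """One decoder time step of encoder-decoder attention.
    s_prev   -- previous decoder state s_{t-1}, shape (d_dec,)
    H        -- encoder hidden states h_1..h_n, shape (n, d_enc)
    score_fn -- alignment function a(s_{t-1}, h_j) returning a scalar e_{tj}
    """
    e = np.array([score_fn(s_prev, h_j) for h_j in H])  # scores e_t1 .. e_tn
    alpha = softmax(e)                                   # attention weights
    c = alpha @ H                                        # context vector c_t
    return c, alpha

# Toy call with a plain dot-product score (assumes matching dimensions)
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))       # 6 source positions, hidden size 8
s_prev = rng.normal(size=8)
c_t, alpha_t = attention_step(s_prev, H, score_fn=np.dot)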

Alignment Function Types

Additive Attention (Bahdanau)

The original attention mechanism uses a feed-forward network to compute alignment scores:

eₜⱼ = vᵀ tanh(W·sₜ₋₁ + U·hⱼ)

Where W, U, and v are learnable parameters.
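A minimal sketch of the additive score in NumPy. The dimensions here are made up for illustration, and W, U, and v would be learned during training in a real model rather than sampled at random.

import numpy as np

d_dec, d_enc, d_att = 8, 8, 16           # illustrative sizes, not from the paper
rng = np.random.default_rng(1)

# Learned parameters in a real model; random placeholders here.
W = rng.normal(size=(d_att, d_dec))
U = rng.normal(size=(d_att, d_enc))
v = rng.normal(size=d_att)

def additive_score(s_prev, h_j):
    # e_tj = v^T tanh(W s_{t-1} + U h_j)
    return v @ np.tanh(W @ s_prev + U @ h_j)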

Multiplicative Attention (Luong)

Uses dot products to compute alignment. Luong et al. score against the current decoder state sₜ rather than the previous state:

eₜⱼ = sₜᵀ·W·hⱼ [General]

eₜⱼ = sₜᵀ·hⱼ [Dot product]
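Both multiplicative variants reduce to simple matrix products. In the sketch below the dimensions and the random W are placeholders; in a real model W is a learned matrix.

import numpy as np

d = 8                                    # assumes matching encoder/decoder sizes
rng = np.random.default_rng(2)
W = rng.normal(size=(d, d))              # learned in practice; placeholder here

def general_score(s_t, h_j):
    # e_tj = s_t^T W h_j   ("general")
    return s_t @ W @ h_j

def dot_score(s_t, h_j):
    # e_tj = s_t^T h_j     ("dot"; requires equal dimensions)
    return s_t @ h_j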

Attention in Different RNN Variants

1. Bahdanau (Bidirectional Encoder)

Uses a bidirectional encoder, so each source position has a hidden state capturing context from both directions. The decoder update takes the previous decoder state, the previously generated token, and the context vector, matching sₜ = f(sₜ₋₁, yₜ₋₁, cₜ) above.

2. Luong (Unidirectional Encoder)

Luong et al. proposed several scoring functions (dot, general, and concat) and evaluated them with standard unidirectional encoders.

Example: Machine Translation

Source: "The cat sat on the mat" (English)

Target: "Le chat s'est assis sur le tapis" (French)

At the step that generates "chat" (French for "cat"):

  • Decoder state represents the partial translation "Le "
  • Attention weights are highest for "cat", with some weight also on "The"
  • Context vector combines representations of "cat" and "The"
  • This context helps the decoder generate "chat"; a toy numeric sketch of this step follows below
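The alignment scores in this sketch are made up (not produced by a trained model), but they show how the softmax turns scores into weights concentrated on "cat".

import numpy as np

source = ["The", "cat", "sat", "on", "the", "mat"]
# Hypothetical alignment scores e_tj for the step that emits "chat"
e = np.array([1.2, 3.5, 0.1, -0.4, 0.0, -0.2])

alpha = np.exp(e - e.max())
alpha /= alpha.sum()
for word, weight in zip(source, alpha):
    print(f"{word:>4}: {weight:.2f}")    # "cat" dominates; "The" gets a smaller share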

Encoder-Decoder vs Self-Attention

Key differences:

  • In encoder-decoder (cross) attention, queries come from the decoder while keys and values come from the encoder, so it connects two different sequences. In self-attention, queries, keys, and values all come from the same sequence.
  • Encoder-decoder attention answers "which source positions matter for this output step?", whereas self-attention answers "which other positions in this sequence matter for this position?"
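One way to see the difference is simply where Q, K, and V come from. The sketch below uses a generic scaled dot-product attention with made-up shapes; it is a schematic contrast, not a full Transformer layer.

import numpy as np

def attend(Q, K, V):
    # Generic scaled dot-product attention: softmax(Q Kᵀ / √d) V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(3)
enc = rng.normal(size=(6, 8))            # encoder states (source length 6)
dec = rng.normal(size=(4, 8))            # decoder states (target length 4)

self_out  = attend(enc, enc, enc)        # self-attention: Q, K, V from one sequence
cross_out = attend(dec, enc, enc)        # encoder-decoder: Q from decoder, K/V from encoder

Here cross_out has one row per decoder position, each a weighted mixture of encoder states, while self_out mixes positions within the encoder sequence itself.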

Modern Usage

While Transformers have replaced RNNs in most tasks, encoder-decoder attention remains a key component:

  • Every Transformer decoder layer contains an encoder-decoder ("cross") attention sub-layer in which decoder queries attend to the encoder's output representations.
  • Encoder-decoder Transformer models for machine translation and summarization rely on this cross-attention to condition generation on the source sequence.
  • The same pattern appears in other sequence-to-sequence tasks, such as speech recognition, where an output sequence is generated from an input in a different language or modality.

Test Your Understanding

Question 1: In encoder-decoder attention, where do queries originate?

  • A) Encoder hidden states
  • B) Decoder hidden states
  • C) Source sequence embeddings
  • D) Target sequence embeddings

Question 2: What does the context vector cₜ represent?

  • A) The decoder hidden state at time t
  • B) A weighted sum of all encoder hidden states
  • C) The target token at position t
  • D) The alignment scores

Question 3: In the formula αₜⱼ = softmax(eₜⱼ), what does j index over?

  • A) Decoder positions
  • B) Encoder positions (1 to n)
  • C) Attention heads
  • D) Layers

Question 4: Which alignment function was used in the original Bahdanau attention?

  • A) Dot product
  • B) General (multiplicative)
  • C) Additive (feed-forward network)
  • D) Cosine similarity

Question 5: What is the key difference between encoder-decoder attention and self-attention?

  • A) Encoder-decoder is symmetric; self-attention is asymmetric
  • B) Encoder-decoder connects two different sequences; self-attention connects positions within the same sequence
  • C) Self-attention uses more parameters
  • D) Encoder-decoder is faster

Question 6: In the translation example, why might "The" have high attention when generating "chat"?

  • A) Because "The" is the most common word
  • B) Because "The" modifies "cat" in the source sentence
  • C) Because of position bias
  • D) Because of residual connections