03. Additive (Bahdanau) Attention

Introduction

Additive attention, also known as Bahdanau attention, was introduced in the seminal paper "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2014). It was the first attention mechanism to enable dynamic alignment in neural machine translation.

Key Innovation

Before Bahdanau attention, seq2seq models compressed the entire source sentence into a single fixed-length context vector, which became an information bottleneck for long sentences. Bahdanau's key innovation was letting the model learn which source positions to focus on for each target word, effectively creating a soft alignment between source and target sequences.

Architecture

The Bahdanau attention mechanism uses a small feed-forward neural network to compute alignment scores. This contrasts with the later multiplicative (Luong) attention, which uses dot products.

Alignment Score: eₜᵢ = vᵀ tanh(W·sₜ₋₁ + U·hᵢ)

Attention Weights: αₜᵢ = softmax(eₜᵢ)  (softmax taken over all source positions i)

Context Vector: cₜ = Σᵢ αₜᵢ · hᵢ

Decoder Hidden: sₜ = f(sₜ₋₁, yₜ₋₁, cₜ)

Where:

  • sₜ₋₁ is the previous decoder hidden state
  • hᵢ is the encoder hidden state for source position i
  • W and U are learned weight matrices, and v is a learned weight vector
  • yₜ₋₁ is the previously generated target token
  • cₜ is the context vector for decoder step t

Step-by-Step Computation

Step 1: Compute Alignment Scores

For each source position i, compute how well it aligns with the decoder's previous hidden state sₜ₋₁:

eₜᵢ = vᵀ tanh(W·sₜ₋₁ + U·hᵢ)
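
A minimal NumPy sketch of this step; the dimensions (n source positions, d hidden units, a attention units) and the random parameter values are assumptions for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, a = 6, 8, 10              # source length, hidden size, attention size

s_prev = rng.normal(size=d)     # decoder state s_{t-1}
H = rng.normal(size=(n, d))     # encoder states h_1..h_n, one per row
W = rng.normal(size=(a, d))     # projects the decoder state
U = rng.normal(size=(a, d))     # projects each encoder state
v = rng.normal(size=a)          # reduces each hidden vector to a scalar score

# e_{t,i} = v^T tanh(W s_{t-1} + U h_i), computed for all i at once
e = np.tanh(W @ s_prev + H @ U.T) @ v
print(e.shape)  # (6,) -- one alignment score per source position
```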

Step 2: Compute Attention Weights

Apply softmax to normalize alignment scores:

αₜᵢ = exp(eₜᵢ) / Σⱼ exp(eₜⱼ)
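
A tiny sketch of the normalization with invented scores; subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

# Toy alignment scores for one decoder step (values invented for illustration)
e = np.array([0.5, 2.0, -1.0, 0.3])

# Numerically stable softmax over source positions
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

print(alpha)        # non-negative weights
print(alpha.sum())  # 1.0
```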

Step 3: Compute Context Vector

Weighted sum of encoder hidden states:

cₜ = Σᵢ αₜᵢ · hᵢ
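
A small sketch with invented weights and encoder states, showing that cₜ is just a weighted average of the encoder states:

```python
import numpy as np

# Invented attention weights and encoder states for illustration
alpha = np.array([0.1, 0.7, 0.2])   # weights alpha_{t,i} from Step 2
H = np.array([[1.0, 0.0],           # h_1
              [0.0, 1.0],           # h_2
              [1.0, 1.0]])          # h_3

# c_t = sum_i alpha_{t,i} * h_i
c = alpha @ H
print(c)  # [0.3 0.9] -- pulled toward h_2, the most attended position
```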

Step 4: Update Decoder State

Combine previous state, input, and context:

sₜ = f(sₜ₋₁, yₜ₋₁, cₜ)
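
A hedged sketch of the update: the original model implements f as a GRU, so the single tanh layer and the combined weight matrix W_f below are simplifying assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, emb = 8, 4

s_prev = rng.normal(size=d)              # s_{t-1}
y_prev = rng.normal(size=emb)            # embedding of the previous target token
c = rng.normal(size=d)                   # context vector c_t from Step 3
W_f = rng.normal(size=(d, d + emb + d))  # hypothetical combined weight matrix

# Stand-in for f: one tanh layer over the concatenation [s_{t-1}; y_{t-1}; c_t]
s_t = np.tanh(W_f @ np.concatenate([s_prev, y_prev, c]))
print(s_t.shape)  # (8,) -- the new decoder state s_t
```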

Why Use tanh?

The tanh activation is crucial for several reasons:

  • It introduces non-linearity, letting the scoring network learn alignment patterns that a purely linear score could not express
  • It bounds its output to [-1, 1], keeping alignment scores stable and preventing them from exploding (demonstrated below)
  • It is smooth and differentiable, so the whole mechanism can be trained end-to-end with backpropagation
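
A quick demonstration of the bounding behavior:

```python
import numpy as np

# tanh squashes arbitrarily large pre-activations into [-1, 1],
# so the scores fed to v (and then softmax) stay in a stable range
x = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])
print(np.tanh(x))  # approximately [-1, -0.964, 0, 0.964, 1]
```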

Example: French to English Translation

Source (French): "Le chat noir mange la souris"

Target (English): "The black cat eats the mouse"

When generating "black" at target position 2:

  • Decoder state s₁ contains information about "The"
  • Alignment scores e₂ᵢ are computed for each source position i
  • The source position of "noir" (i = 3) receives a high weight α₂ᵢ
  • Context vector c₂ therefore emphasizes the representation of "noir"
  • The decoder uses this context to generate "black" (see the toy visualization below)
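
A toy visualization of this step; the weights below are invented to show the expected pattern, not output from a trained model:

```python
# Invented attention weights for the decoder step that emits "black"
src = ["Le", "chat", "noir", "mange", "la", "souris"]
alpha = [0.03, 0.10, 0.80, 0.04, 0.02, 0.01]   # peaks at "noir"

for word, w in zip(src, alpha):
    bar = "#" * int(w * 40)
    print(f"{word:>7}  {bar:<34} {w:.2f}")
```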

Comparison with Other Attention Types

Feature            | Additive (Bahdanau)   | Multiplicative (Luong)
-------------------|-----------------------|-------------------------------------
Scoring method     | Feed-forward network  | Matrix multiplication (dot product)
Parameters         | W, U, v (more)        | W only (fewer)
Computational cost | O(n·d)                | O(n·d), but with cheaper operations
Introduced         | 2014                  | 2015
Used in            | Original NMT systems  | GNMT, standard translation
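
For concreteness, here is a sketch of the two scoring functions side by side; the dimensions are invented, and the multiplicative score shown is Luong's "general" (bilinear) variant:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, a = 5, 8, 10

s = rng.normal(size=d)        # decoder state (the query)
H = rng.normal(size=(n, d))   # encoder states (the keys)

# Additive (Bahdanau): feed-forward scorer with parameters W, U, v
W, U, v = rng.normal(size=(a, d)), rng.normal(size=(a, d)), rng.normal(size=a)
e_add = np.tanh(W @ s + H @ U.T) @ v

# Multiplicative (Luong, "general" variant): bilinear score s^T W h_i
W_mul = rng.normal(size=(d, d))
e_mul = H @ (W_mul @ s)

print(e_add.shape, e_mul.shape)  # both (5,): one score per source position
```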

Advantages of Additive Attention

  • The learned feed-forward scorer is expressive and can capture alignment patterns a plain dot product cannot
  • It handles decoder and encoder states of different dimensionalities naturally, since W and U project both into a shared space
  • It is fully differentiable, so alignment is learned jointly with translation, as the paper's title emphasizes

Limitations

  • More parameters (W, U, v) and more computation per score than a dot product
  • Harder to fuse into a single large matrix multiplication, so it is slower in practice on modern hardware
  • In the original architecture it is tied to sequential RNN decoding, which limits parallelism

Modern Relevance

While Transformers primarily use scaled dot-product attention, the Bahdanau attention concept remains fundamental. The query-key-value framework and the idea of learned alignment patterns stem directly from this work.

Test Your Understanding

Question 1: Who introduced additive attention?

  • A) Vaswani et al.
  • B) Bahdanau et al.
  • C) Luong et al.
  • D) Devlin et al.

Question 2: In the alignment score formula eₜᵢ = vᵀ tanh(W·sₜ₋₁ + U·hᵢ), what does tanh do?

  • A) Normalizes the output to sum to 1
  • B) Adds non-linearity and bounds output to [-1, 1]
  • C) Computes the dot product
  • D) Applies dropout

Question 3: Which weight matrices are used in additive attention?

  • A) Only W
  • B) W, U, and v
  • C) Q, K, V only
  • D) No weight matrices

Question 4: What is the purpose of softmax in attention?

  • A) To bound alignment scores
  • B) To normalize weights to sum to 1
  • C) To add non-linearity
  • D) To compute the dot product

Question 5: How does additive attention differ from multiplicative attention?

  • A) Uses dot product instead of feed-forward
  • B) Uses feed-forward network instead of dot product
  • C) Is non-differentiable
  • D) Cannot be used in translation

Question 6: In the context vector cₜ = Σᵢ αₜᵢ · hᵢ, what does αₜᵢ represent?

  • A) Alignment score before softmax
  • B) Attention weight (probability) for position i
  • C) Encoder hidden state
  • D) Decoder hidden state

Question 7: Why is the output of tanh bounded to [-1, 1]?

  • A) To speed up computation
  • B) To keep alignment scores stable and prevent explosion
  • C) To make it differentiable
  • D) To match the scale of softmax

Question 8: What year was Bahdanau attention introduced?

  • A) 2015
  • B) 2016
  • C) 2014
  • D) 2017