Introduction
Additive attention, also known as Bahdanau attention, was introduced in the seminal paper "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2014). It was the first attention mechanism to enable dynamic alignment in neural machine translation.
Key Innovation
Before Bahdanau attention, seq2seq models relied on a fixed context vector. Bahdanau's key innovation was allowing the model to learn which source positions to focus on for each target word, effectively creating soft alignment between source and target sequences.
Architecture
The Bahdanau attention mechanism uses a small feed-forward neural network to compute alignment scores, in contrast to the later multiplicative (Luong) attention, which scores alignments with dot products.
Alignment Scores: eₜᵢ = vᵀ tanh(W sₜ₋₁ + U hᵢ)
Attention Weights: αₜᵢ = softmax(eₜᵢ)
Context Vector: cₜ = Σᵢ αₜᵢ · hᵢ
Decoder Hidden: sₜ = f(sₜ₋₁, yₜ₋₁, cₜ)
Where:
- sₜ₋₁: Previous decoder hidden state (query)
- hᵢ: Encoder hidden state at position i (key/value)
- W, U: Weight matrices for transforming query and key
- v: Weight vector for scoring
- tanh: Hyperbolic tangent activation (keeps values bounded)
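The following is a minimal NumPy sketch of these equations; the function name, toy dimensions, and random parameters are illustrative assumptions, not code from the paper.

```python
import numpy as np

def additive_attention(s_prev, H, W, U, v):
    """One decoding step of Bahdanau (additive) attention.

    s_prev: previous decoder state s_{t-1}, shape (d_dec,)
    H:      encoder hidden states h_1..h_n, shape (n_src, d_enc)
    W:      (d_att, d_dec), U: (d_att, d_enc), v: (d_att,)
    Returns the context vector c_t and the attention weights alpha_t.
    """
    # e_ti = v^T tanh(W s_{t-1} + U h_i), computed for all source positions at once
    scores = np.tanh(s_prev @ W.T + H @ U.T) @ v        # shape (n_src,)
    # alpha_ti = softmax(e_ti), a distribution over source positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # c_t = sum_i alpha_ti * h_i
    context = weights @ H                               # shape (d_enc,)
    return context, weights

# Toy usage with random parameters (illustrative sizes only)
rng = np.random.default_rng(0)
n_src, d_enc, d_dec, d_att = 6, 8, 8, 10
H = rng.normal(size=(n_src, d_enc))
s_prev = rng.normal(size=d_dec)
W = rng.normal(size=(d_att, d_dec))
U = rng.normal(size=(d_att, d_enc))
v = rng.normal(size=d_att)
context, weights = additive_attention(s_prev, H, W, U, v)
print(weights.round(3), weights.sum())  # attention weights sum to 1
```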
Step-by-Step Computation
Step 1: Compute Alignment Scores
For each source position i, compute how well it aligns with the current decoder state:
eₜᵢ = vᵀ tanh(W sₜ₋₁ + U hᵢ)
Step 2: Compute Attention Weights
Apply softmax to normalize the alignment scores into a probability distribution over source positions:
αₜᵢ = exp(eₜᵢ) / Σⱼ exp(eₜⱼ)
Step 3: Compute Context Vector
Take the weighted sum of the encoder hidden states:
cₜ = Σᵢ αₜᵢ · hᵢ
Step 4: Update Decoder State
Combine the previous state, the previous output, and the context:
sₜ = f(sₜ₋₁, yₜ₋₁, cₜ)
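Steps 1-3 are exactly the computation in the additive_attention sketch above. Step 4 can be sketched as follows, with f simplified to a plain tanh RNN cell (the paper uses a gated recurrent unit; names and dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy hidden/embedding size
s_prev = rng.normal(size=d)            # s_{t-1}: previous decoder state
y_prev = rng.normal(size=d)            # embedding of the previous target word y_{t-1}
c_t = rng.normal(size=d)               # context vector c_t from Steps 1-3
Ws, Wy, Wc = (rng.normal(size=(d, d)) for _ in range(3))

# s_t = f(s_{t-1}, y_{t-1}, c_t), with f simplified to a tanh RNN cell
# (Bahdanau et al. use a gated recurrent unit for f)
s_t = np.tanh(Ws @ s_prev + Wy @ y_prev + Wc @ c_t)
print(s_t.shape)  # (8,)
```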
Why Use tanh?
The tanh activation is crucial for several reasons:
- Bounded output: tanh keeps the hidden activations in [-1, 1], so the alignment scores stay numerically stable
- Non-linearity: Allows learning complex alignment patterns
- Zero-centered: Helps gradient flow during training
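A quick numerical check of these properties:

```python
import numpy as np

x = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])
print(np.tanh(x))  # values are squashed into [-1, 1] and centered around 0
```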
Example: French to English Translation
Source (French): "Le chat noir mange la souris"
Target (English): "The black cat eats the mouse"
When generating "black" at target position 2:
- Decoder state s₁ contains information about "The"
- Alignment scores e₂ᵢ are computed for each source position
- The position of "noir" receives a high weight α₂ᵢ
- Context vector c₂ emphasizes the representation of "noir"
- The decoder uses this context to generate "black"
Note that the alignment is non-monotonic here: source position 3 ("noir") aligns to target position 2 ("black"), which is exactly the kind of reordering soft alignment handles.
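To make the picture concrete, here is a toy snippet with hand-picked attention weights for this step; the numbers are invented for illustration, not taken from a trained model.

```python
src_tokens = ["Le", "chat", "noir", "mange", "la", "souris"]
# Hypothetical attention weights for the step that generates "black"
# (invented for illustration; a trained model produces its own values)
alpha = [0.03, 0.08, 0.80, 0.04, 0.03, 0.02]

best = max(range(len(alpha)), key=lambda i: alpha[i])
print(src_tokens[best], alpha[best])  # -> noir 0.8: the context vector is dominated by "noir"
```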
Comparison with Other Attention Types
| Feature | Additive (Bahdanau) | Multiplicative (Luong) |
|---|---|---|
| Scoring method | Feed-forward network | Matrix multiplication (dot product) |
| Parameters | W, U, v (more) | W only (fewer) |
| Computational cost | Same order, higher constant (extra projections plus tanh) | Same order, lower constant (plain dot products) |
| Introduced | 2014 | 2015 |
| Used in | Original attention-based NMT (RNNsearch), GNMT | Luong et al.'s NMT systems |
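A side-by-side sketch of the two scoring functions, using Luong's "general" variant for the multiplicative case (shapes and names are illustrative assumptions):

```python
import numpy as np

def additive_score(s, h, W, U, v):
    # Bahdanau: e = v^T tanh(W s + U h) -- extra projections plus a tanh
    return v @ np.tanh(W @ s + U @ h)

def multiplicative_score(s, h, W):
    # Luong "general": e = s^T W h -- a single matrix product
    return s @ (W @ h)

rng = np.random.default_rng(1)
d = 4
s, h = rng.normal(size=d), rng.normal(size=d)
W, U, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
print(additive_score(s, h, W, U, v), multiplicative_score(s, h, W))
```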
Advantages of Additive Attention
- Flexibility: Feed-forward network can learn complex alignment patterns
- Interpretability: Attention weights show learned alignments
- Differentiable: End-to-end trainable with backpropagation
- Handles variable lengths: Naturally processes sequences of different sizes
Limitations
- More parameters: Requires the W and U matrices plus the scoring vector v
- Slower computation: The feed-forward scoring is slower in practice than a dot product, which maps directly onto highly optimized matrix multiplication
- Single attention distribution: Computes one set of alignment weights per step, unlike multi-head attention, which attends to several representation subspaces at once
Modern Relevance
While Transformers primarily use scaled dot-product attention, the Bahdanau attention concept remains fundamental. The query-key-value framework and the idea of learned alignment patterns stem directly from this work.
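For contrast, a minimal sketch of scaled dot-product attention; in Bahdanau's terms, the query plays the role of sₜ₋₁ and the keys and values play the role of the encoder states hᵢ:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # scores = Q K^T / sqrt(d_k); weights = softmax(scores); output = weights V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# One decoder query attending over six encoder positions
rng = np.random.default_rng(2)
Q = rng.normal(size=(1, 8))      # plays the role of s_{t-1}
K = V = rng.normal(size=(6, 8))  # plays the role of the encoder states h_i
print(scaled_dot_product_attention(Q, K, V).shape)  # (1, 8), analogous to c_t
```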