Introduction
Additive attention, also known as Bahdanau attention, was introduced in the seminal paper "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2014). It was the first attention mechanism to enable dynamic alignment in neural machine translation.
Key Innovation
Before Bahdanau attention, seq2seq models relied on a fixed context vector. Bahdanau's key innovation was allowing the model to learn which source positions to focus on for each target word, effectively creating soft alignment between source and target sequences.
Architecture
The Bahdanau attention mechanism uses a small feed-forward neural network to compute alignment scores, in contrast to the later multiplicative (Luong) attention, which scores alignments with dot products.
Alignment Scores: eₜᵢ = vᵀ tanh(W sₜ₋₁ + U hᵢ)
Attention Weights: αₜᵢ = softmax(eₜᵢ)
Context Vector: cₜ = Σᵢ αₜᵢ · hᵢ
Decoder Hidden: sₜ = f(sₜ₋₁, yₜ₋₁, cₜ)
Where:
- sₜ₋₁: Previous decoder hidden state (query)
- hᵢ: Encoder hidden state at position i (key/value)
- W, U: Weight matrices for transforming query and key
- v: Weight vector for scoring
- tanh: Hyperbolic tangent activation (keeps values bounded)
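The following is a minimal NumPy sketch of these equations; the function name, toy dimensions, and random parameters are illustrative assumptions, not code from the paper.

```python
import numpy as np

def additive_attention(s_prev, H, W, U, v):
    """One decoding step of Bahdanau (additive) attention.

    s_prev: previous decoder state s_{t-1}, shape (d_dec,)
    H:      encoder hidden states h_1..h_n, shape (n_src, d_enc)
    W:      (d_att, d_dec), U: (d_att, d_enc), v: (d_att,)
    Returns the context vector c_t and the attention weights alpha_t.
    """
    # e_ti = v^T tanh(W s_{t-1} + U h_i), computed for all source positions at once
    scores = np.tanh(s_prev @ W.T + H @ U.T) @ v        # shape (n_src,)
    # alpha_ti = softmax(e_ti), a distribution over source positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # c_t = sum_i alpha_ti * h_i
    context = weights @ H                               # shape (d_enc,)
    return context, weights

# Toy usage with random parameters (illustrative sizes only)
rng = np.random.default_rng(0)
n_src, d_enc, d_dec, d_att = 6, 8, 8, 10
H = rng.normal(size=(n_src, d_enc))
s_prev = rng.normal(size=d_dec)
W = rng.normal(size=(d_att, d_dec))
U = rng.normal(size=(d_att, d_enc))
v = rng.normal(size=d_att)
context, weights = additive_attention(s_prev, H, W, U, v)
print(weights.round(3), weights.sum())  # attention weights sum to 1
```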
Step-by-Step Computation
Step 1: Compute Alignment Scores
For each source position i, compute how well it aligns with the current decoder state:
eₜᵢ = vᵀ tanh(W sₜ₋₁ + U hᵢ)
Step 2: Compute Attention Weights
Apply softmax to normalize the alignment scores into a probability distribution over source positions:
αₜᵢ = exp(eₜᵢ) / Σⱼ exp(eₜⱼ)
Step 3: Compute Context Vector
Take the weighted sum of the encoder hidden states:
cₜ = Σᵢ αₜᵢ · hᵢ
Step 4: Update Decoder State
Combine the previous state, the previous output, and the context:
sₜ = f(sₜ₋₁, yₜ₋₁, cₜ)
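Steps 1-3 are exactly the computation in the additive_attention sketch above. Step 4 can be sketched as follows, with f simplified to a plain tanh RNN cell (the paper uses a gated recurrent unit; names and dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy hidden/embedding size
s_prev = rng.normal(size=d)            # s_{t-1}: previous decoder state
y_prev = rng.normal(size=d)            # embedding of the previous target word y_{t-1}
c_t = rng.normal(size=d)               # context vector c_t from Steps 1-3
Ws, Wy, Wc = (rng.normal(size=(d, d)) for _ in range(3))

# s_t = f(s_{t-1}, y_{t-1}, c_t), with f simplified to a tanh RNN cell
# (Bahdanau et al. use a gated recurrent unit for f)
s_t = np.tanh(Ws @ s_prev + Wy @ y_prev + Wc @ c_t)
print(s_t.shape)  # (8,)
```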
Why Use tanh?
The tanh activation is crucial for several reasons:
- Bounded output: tanh keeps the hidden activations in [-1, 1], so the alignment scores stay numerically stable
- Non-linearity: Allows learning complex alignment patterns
- Zero-centered: Helps gradient flow during training
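A quick numerical check of these properties:

```python
import numpy as np

x = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])
print(np.tanh(x))  # values are squashed into [-1, 1] and centered around 0
```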
Example: French to English Translation
Source (French): "Le chat noir mange la souris"
Target (English): "The black cat eats the mouse"
When generating "black" at target position 2:
- Decoder state s₁ contains information about "The"
- Alignment scores e₂ᵢ are computed for each source position
- The position of "noir" receives a high weight α₂ᵢ
- Context vector c₂ emphasizes the representation of "noir"
- The decoder uses this context to generate "black"
Note that the alignment is non-monotonic here: source position 3 ("noir") aligns to target position 2 ("black"), which is exactly the kind of reordering soft alignment handles.
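To make the picture concrete, here is a toy snippet with hand-picked attention weights for this step; the numbers are invented for illustration, not taken from a trained model.

```python
src_tokens = ["Le", "chat", "noir", "mange", "la", "souris"]
# Hypothetical attention weights for the step that generates "black"
# (invented for illustration; a trained model produces its own values)
alpha = [0.03, 0.08, 0.80, 0.04, 0.03, 0.02]

best = max(range(len(alpha)), key=lambda i: alpha[i])
print(src_tokens[best], alpha[best])  # -> noir 0.8: the context vector is dominated by "noir"
```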
Comparison with Other Attention Types
| Feature | Additive (Bahdanau) | Multiplicative (Luong) |
|---|---|---|
| Scoring method | Feed-forward network | Matrix multiplication (dot product) |
| Parameters | W, U, v (more) | W only (fewer) |
| Computational cost | Same order, higher constant (extra projections plus tanh) | Same order, lower constant (plain dot products) |
| Introduced | 2014 | 2015 |
| Used in | Original attention-based NMT (RNNsearch), GNMT | Luong et al.'s NMT systems |
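A side-by-side sketch of the two scoring functions, using Luong's "general" variant for the multiplicative case (shapes and names are illustrative assumptions):

```python
import numpy as np

def additive_score(s, h, W, U, v):
    # Bahdanau: e = v^T tanh(W s + U h) -- extra projections plus a tanh
    return v @ np.tanh(W @ s + U @ h)

def multiplicative_score(s, h, W):
    # Luong "general": e = s^T W h -- a single matrix product
    return s @ (W @ h)

rng = np.random.default_rng(1)
d = 4
s, h = rng.normal(size=d), rng.normal(size=d)
W, U, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
print(additive_score(s, h, W, U, v), multiplicative_score(s, h, W))
```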
Advantages of Additive Attention
- Flexibility: Feed-forward network can learn complex alignment patterns
- Interpretability: Attention weights show learned alignments
- Differentiable: End-to-end trainable with backpropagation
- Handles variable lengths: Naturally processes sequences of different sizes
Limitations
- More parameters: Requires the W and U matrices plus the scoring vector v
- Slower computation: The feed-forward scoring is slower in practice than a dot product, which maps directly onto highly optimized matrix multiplication
- Single attention distribution: Computes one set of alignment weights per step, unlike multi-head attention, which attends to several representation subspaces at once
Modern Relevance
While Transformers primarily use scaled dot-product attention, the Bahdanau attention concept remains fundamental. The query-key-value framework and the idea of learned alignment patterns stem directly from this work.
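For contrast, a minimal sketch of scaled dot-product attention; in Bahdanau's terms, the query plays the role of sₜ₋₁ and the keys and values play the role of the encoder states hᵢ:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # scores = Q K^T / sqrt(d_k); weights = softmax(scores); output = weights V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# One decoder query attending over six encoder positions
rng = np.random.default_rng(2)
Q = rng.normal(size=(1, 8))      # plays the role of s_{t-1}
K = V = rng.normal(size=(6, 8))  # plays the role of the encoder states h_i
print(scaled_dot_product_attention(Q, K, V).shape)  # (1, 8), analogous to c_t
```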