Introduction
Multiplicative attention, introduced by Luong et al. (2015) in "Effective Approaches to Attention-based Neural Machine Translation", offers a more computationally efficient alternative to additive attention. It uses matrix multiplication (dot products) instead of feed-forward networks for computing alignment scores.
Three Scoring Functions
Luong et al. proposed three types of alignment scoring functions (a code sketch of all three follows the list):
1. Dot Product (Simplest)
score(sₜ, hᵢ) = sₜᵀ·hᵢ. A direct dot product between the decoder state and the encoder state; requires both to have the same dimensionality.
2. General (Multiplicative)
score(sₜ, hᵢ) = sₜᵀ·W·hᵢ. A learnable weight matrix W projects the states to compatible dimensions.
3. Concat (Additive-like)
score(sₜ, hᵢ) = vᵀ·tanh(W·[sₜ; hᵢ]). Concatenates the decoder and encoder states, then applies a feed-forward layer.
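A minimal sketch of the three scoring functions, assuming PyTorch; the function names and shapes are illustrative, not from the paper:

```python
import torch

def score_dot(s, h):
    # s: (d,), h: (d,) -- requires matching dimensionality
    return s @ h

def score_general(s, h, W):
    # s: (d_s,), W: (d_s, d_h), h: (d_h,) -> scalar
    return s @ W @ h

def score_concat(s, h, W, v):
    # W: (d_a, d_s + d_h), v: (d_a,) -> scalar
    return v @ torch.tanh(W @ torch.cat([s, h]))
```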
Complete Attention Computation
eₜᵢ = score(sₜ, hᵢ)   (one alignment score per encoder state)
αₜ = softmax(eₜ)
cₜ = Σᵢ αₜᵢ · hᵢ
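An end-to-end sketch of this computation, assuming the general scoring function and a single decoder step (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def general_attention(s, H, W):
    # s: (d_s,) decoder state; H: (T, d_h) encoder states; W: (d_s, d_h)
    e = H @ (W.T @ s)            # (T,) alignment scores e_t,i = s^T W h_i
    alpha = F.softmax(e, dim=0)  # (T,) attention weights
    c = alpha @ H                # (d_h,) context vector = sum_i alpha_t,i * h_i
    return c, alpha
```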
Output Modes and Attention Scope
Luong et al. also introduced two further design choices:
1. Attentional Hidden State (Input Feeding)
The context vector is concatenated with the decoder hidden state and passed through a tanh layer to produce the attentional hidden state: h̃ₜ = tanh(W_c·[cₜ; sₜ]). With input feeding, h̃ₜ is also fed back as an input to the decoder at the next time step.
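A sketch of the attentional hidden state, assuming PyTorch; W_c is the learnable output projection and the function name is illustrative:

```python
import torch

def attentional_hidden(c, s, W_c):
    # c: (d_h,) context vector; s: (d_s,) decoder state
    # W_c: (d_out, d_h + d_s)
    return torch.tanh(W_c @ torch.cat([c, s]))  # h~_t = tanh(W_c [c_t; s_t])
```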
2. Global vs Local Attention
Global attention: Attends to all source positions (like Bahdanau)
Local attention: Attends only to a window of positions around a predicted alignment point pₜ (sketched below), which reduces cost on long sequences
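A minimal sketch of local attention with a window of width 2D+1, assuming the simpler monotonic variant (local-m, with pₜ set to the decoder step); the Gaussian re-weighting of the paper's local-p variant is omitted:

```python
import torch
import torch.nn.functional as F

def local_attention(s, H, W, p_t, D=2):
    # s: (d_s,), H: (T, d_h), W: (d_s, d_h); window [p_t - D, p_t + D]
    lo, hi = max(0, p_t - D), min(H.shape[0], p_t + D + 1)
    H_win = H[lo:hi]                 # only encoder states inside the window
    e = H_win @ (W.T @ s)            # scores restricted to the window
    alpha = F.softmax(e, dim=0)
    return alpha @ H_win             # context vector from the window only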
Example: Comparing Dot Product vs General
If the decoder state sₜ has dimension 512 and the encoder state hᵢ has dimension 256 (shape check in the code after this list):
- Dot product: Not possible (dimension mismatch)
- General: W ∈ ℝ^{512×256}, compute sₜᵀ·W·hᵢ ∈ ℝ
- Result: Single scalar alignment score
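A quick check of the shapes in this example, with illustrative random values:

```python
import torch

s = torch.randn(512)        # decoder state s_t
h = torch.randn(256)        # encoder state h_i
W = torch.randn(512, 256)   # general-attention projection

score = s @ W @ h           # (1x512)(512x256)(256x1) -> scalar
print(score.shape)          # torch.Size([]) -- a single alignment score
```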
Comparison: Additive vs Multiplicative
| Aspect | Additive (Bahdanau) | Multiplicative (Luong) |
|---|---|---|
| Formula | vᵀ tanh(Ws + Uh) | sᵀ·W·h or sᵀ·h |
| Parameters | W, U, v | W only (or none) |
| Computation | Feed-forward + tanh | Matrix multiply |
| Speed | Slower | Faster |
| Memory | More | Less |
| Flexibility | More expressive | Less expressive |
Advantages of Multiplicative Attention
- Computational efficiency: Matrix multiplication is highly optimized on GPUs
- Fewer parameters: the W matrix is the only learnable parameter (the pure dot-product form has none)
- Simpler gradient flow: Direct multiplication is easier to optimize
- Better parallelism: Easier to parallelize across batch and sequence dimensions
Disadvantages
- Less expressive: Cannot capture complex interactions as well as FFN
- Dimension matching: May need projection for dimension mismatch
- Sensitivity: Dot-product scores grow with the dimensionality of the states and can saturate the softmax, requiring scaling (see below)
Impact on Modern Architectures
The general multiplicative attention (sᵀ·W·h) evolved into what we now call scaled dot-product attention in Transformers (Vaswani et al., 2017):
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ)·V
The scaling factor √dₖ was introduced because dot products grow in magnitude with the key dimension dₖ, pushing the softmax into saturated regions where gradients vanish.
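A minimal sketch of scaled dot-product attention (single head, no masking), assuming PyTorch:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5        # scale keeps scores from growing with d_k
    weights = F.softmax(scores, dim=-1)  # (T_q, T_k) attention weights
    return weights @ V                   # (T_q, d_v) attended values
```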