Introduction
Multiplicative attention, introduced by Luong et al. (2015) in "Effective Approaches to Attention-based Neural Machine Translation", offers a more computationally efficient alternative to additive attention. It uses matrix multiplication (dot products) instead of feed-forward networks for computing alignment scores.
Three Scoring Functions
Luong et al. proposed three types of alignment scoring functions (a code sketch of all three follows the list):
1. Dot Product (Simplest)
score(sₜ, hᵢ) = sₜᵀ·hᵢ. A direct dot product between the decoder state and the encoder state; requires both to have the same dimensionality.
2. General (Multiplicative)
score(sₜ, hᵢ) = sₜᵀ·W·hᵢ. A learnable weight matrix W projects the states to compatible dimensions.
3. Concat (Additive-like)
score(sₜ, hᵢ) = vᵀ·tanh(W·[sₜ; hᵢ]). Concatenates the decoder and encoder states, then applies a feed-forward layer.
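A minimal sketch of the three scoring functions, assuming PyTorch; the function names and shapes are illustrative, not from the paper:

```python
import torch

def score_dot(s, h):
    # s: (d,), h: (d,) -- requires matching dimensionality
    return s @ h

def score_general(s, h, W):
    # s: (d_s,), W: (d_s, d_h), h: (d_h,) -> scalar
    return s @ W @ h

def score_concat(s, h, W, v):
    # W: (d_a, d_s + d_h), v: (d_a,) -> scalar
    return v @ torch.tanh(W @ torch.cat([s, h]))
```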
Complete Attention Computation
eₜᵢ = score(sₜ, hᵢ)   (one alignment score per encoder state)
αₜ = softmax(eₜ)
cₜ = Σᵢ αₜᵢ · hᵢ
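An end-to-end sketch of this computation, assuming the general scoring function and a single decoder step (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def general_attention(s, H, W):
    # s: (d_s,) decoder state; H: (T, d_h) encoder states; W: (d_s, d_h)
    e = H @ (W.T @ s)            # (T,) alignment scores e_t,i = s^T W h_i
    alpha = F.softmax(e, dim=0)  # (T,) attention weights
    c = alpha @ H                # (d_h,) context vector = sum_i alpha_t,i * h_i
    return c, alpha
```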
Output Modes and Attention Scope
Luong et al. also introduced two further design choices:
1. Attentional Hidden State (Input Feeding)
The context vector is concatenated with the decoder hidden state and passed through a tanh layer to produce the attentional hidden state: h̃ₜ = tanh(W_c·[cₜ; sₜ]). With input feeding, h̃ₜ is also fed back as an input to the decoder at the next time step.
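A sketch of the attentional hidden state, assuming PyTorch; W_c is the learnable output projection and the function name is illustrative:

```python
import torch

def attentional_hidden(c, s, W_c):
    # c: (d_h,) context vector; s: (d_s,) decoder state
    # W_c: (d_out, d_h + d_s)
    return torch.tanh(W_c @ torch.cat([c, s]))  # h~_t = tanh(W_c [c_t; s_t])
```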
2. Global vs Local Attention
Global attention: Attends to all source positions (like Bahdanau)
Local attention: Attends only to a window of positions around a predicted alignment point pₜ (sketched below), which reduces cost on long sequences
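A minimal sketch of local attention with a window of width 2D+1, assuming the simpler monotonic variant (local-m, with pₜ set to the decoder step); the Gaussian re-weighting of the paper's local-p variant is omitted:

```python
import torch
import torch.nn.functional as F

def local_attention(s, H, W, p_t, D=2):
    # s: (d_s,), H: (T, d_h), W: (d_s, d_h); window [p_t - D, p_t + D]
    lo, hi = max(0, p_t - D), min(H.shape[0], p_t + D + 1)
    H_win = H[lo:hi]                 # only encoder states inside the window
    e = H_win @ (W.T @ s)            # scores restricted to the window
    alpha = F.softmax(e, dim=0)
    return alpha @ H_win             # context vector from the window only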
Example: Comparing Dot Product vs General
If the decoder state sₜ has dimension 512 and the encoder state hᵢ has dimension 256 (shape check in the code after this list):
- Dot product: Not possible (dimension mismatch)
- General: W ∈ ℝ^{512×256}, compute sₜᵀ·W·hᵢ ∈ ℝ
- Result: Single scalar alignment score
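A quick check of the shapes in this example, with illustrative random values:

```python
import torch

s = torch.randn(512)        # decoder state s_t
h = torch.randn(256)        # encoder state h_i
W = torch.randn(512, 256)   # general-attention projection

score = s @ W @ h           # (1x512)(512x256)(256x1) -> scalar
print(score.shape)          # torch.Size([]) -- a single alignment score
```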
Comparison: Additive vs Multiplicative
| Aspect | Additive (Bahdanau) | Multiplicative (Luong) |
|---|---|---|
| Formula | vᵀ tanh(Ws + Uh) | sᵀ·W·h or sᵀ·h |
| Parameters | W, U, v | W only (or none) |
| Computation | Feed-forward + tanh | Matrix multiply |
| Speed | Slower | Faster |
| Memory | More | Less |
| Flexibility | More expressive | Less expressive |
Advantages of Multiplicative Attention
- Computational efficiency: Matrix multiplication is highly optimized on GPUs
- Fewer parameters: the W matrix is the only learnable parameter (the pure dot-product form has none)
- Simpler gradient flow: Direct multiplication is easier to optimize
- Better parallelism: Easier to parallelize across batch and sequence dimensions
Disadvantages
- Less expressive: Cannot capture complex interactions as well as FFN
- Dimension matching: May need projection for dimension mismatch
- Sensitivity: Dot-product scores grow with the dimensionality of the states and can saturate the softmax, requiring scaling (see below)
Impact on Modern Architectures
The general multiplicative attention (sᵀ·W·h) evolved into what we now call scaled dot-product attention in Transformers (Vaswani et al., 2017):
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ)·V
The scaling factor √dₖ was introduced because dot products grow in magnitude with the key dimension dₖ, pushing the softmax into saturated regions where gradients vanish.
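A minimal sketch of scaled dot-product attention (single head, no masking), assuming PyTorch:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5        # scale keeps scores from growing with d_k
    weights = F.softmax(scores, dim=-1)  # (T_q, T_k) attention weights
    return weights @ V                   # (T_q, d_v) attended values
```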