04. Multiplicative (Luong) Attention

Introduction

Multiplicative attention, introduced by Luong et al. (2015) in "Effective Approaches to Attention-based Neural Machine Translation", offers a more computationally efficient alternative to additive attention. It uses matrix multiplication (dot products) instead of feed-forward networks for computing alignment scores.

Three Scoring Functions

Luong et al. proposed three types of alignment scoring functions:

1. Dot Product (Simplest)

eₜᵢ = sₜᵀ · hᵢ

Direct dot product between the decoder state and an encoder state. Requires the two states to have the same dimensionality.

2. General (Multiplicative)

eₜᵢ = sₜᵀ · W · hᵢ

Uses a learnable weight matrix W to project states to compatible dimensions.

3. Concat (Additive-like)

eₜᵢ = vᵀ · tanh(W[sₜ; hᵢ])

Concatenates the decoder and encoder states, then applies a small feed-forward layer (similar in spirit to Bahdanau's additive scorer). A minimal code sketch of all three scorers follows.
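
To make the scorers concrete, here is a minimal NumPy sketch; the shapes and variable names are illustrative, not from the paper:

```python
import numpy as np

def score_dot(s_t, h_i):
    # Dot product: requires s_t and h_i to have the same dimension.
    return s_t @ h_i

def score_general(s_t, h_i, W):
    # General: W lets states of different (or same) sizes be compared.
    return s_t @ W @ h_i

def score_concat(s_t, h_i, W, v):
    # Concat: feed the concatenated states through tanh, then project with v.
    return v @ np.tanh(W @ np.concatenate([s_t, h_i]))

# Illustrative dimensions
s_t = np.random.randn(512)             # decoder state
h_i = np.random.randn(512)             # encoder state (same size, so the dot score works)
W_general = np.random.randn(512, 512)
W_concat = np.random.randn(256, 1024)  # maps [s_t; h_i] (1024-dim) to 256-dim
v = np.random.randn(256)

print(score_dot(s_t, h_i), score_general(s_t, h_i, W_general), score_concat(s_t, h_i, W_concat, v))
```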

Complete Attention Computation

Using the dot-product score as a running example, the decoder at step t scores every encoder state, normalizes the scores with a softmax, and forms the context vector as a weighted sum:

eₜ = [sₜᵀ · h₁, sₜᵀ · h₂, ..., sₜᵀ · hₙ]

αₜ = softmax(eₜ)

cₜ = Σᵢ αₜᵢ · hᵢ
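
The same three steps in a minimal NumPy sketch (dimensions and names are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max()                 # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def dot_product_attention(s_t, H):
    """s_t: decoder state of shape (d,), H: encoder states of shape (n, d)."""
    e_t = H @ s_t                   # alignment scores, one per source position
    alpha_t = softmax(e_t)          # attention weights
    c_t = alpha_t @ H               # context vector: sum_i alpha_ti * h_i
    return c_t, alpha_t

s_t = np.random.randn(512)
H = np.random.randn(10, 512)        # 10 encoder states
c_t, alpha_t = dot_product_attention(s_t, H)
print(c_t.shape, alpha_t.sum())     # (512,), weights sum to 1
```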

Output Modes

Beyond the scoring functions, Luong et al. described how the context vector is turned into an output and which source positions it is computed over (a short sketch of both ideas follows this list):

1. Attentional Hidden State (with Input Feeding)

The context vector is concatenated with the decoder hidden state to form an attentional hidden state:

ĥₜ = tanh(W[cₜ; sₜ])

ĥₜ is used to predict the output word; with input feeding, it is also concatenated with the input at the next decoder step so the model can keep track of past alignment decisions.

2. Global vs Local Attention

Global attention: Attends to all source positions (like Bahdanau)

Local attention: Attends only to a window of source positions centered on an alignment point for the current step
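
A rough sketch of both ideas; the exact parameterization in the paper differs in details (e.g. how the alignment point is predicted), and the names here are illustrative:

```python
import numpy as np

def attentional_hidden_state(c_t, s_t, W_c):
    # h^_t = tanh(W_c [c_t; s_t]): used to predict the output word and, with
    # input feeding, concatenated with the input at the next decoder step.
    return np.tanh(W_c @ np.concatenate([c_t, s_t]))

def local_window(H, p_t, D):
    # Local attention: restrict scoring to encoder states within D positions
    # of the alignment point p_t (the paper additionally favors positions
    # near p_t with a Gaussian weighting).
    lo, hi = max(0, p_t - D), min(len(H), p_t + D + 1)
    return H[lo:hi]

c_t, s_t = np.random.randn(512), np.random.randn(512)
W_c = np.random.randn(512, 1024)            # maps [c_t; s_t] (1024-dim) back to 512-dim
print(attentional_hidden_state(c_t, s_t, W_c).shape)   # (512,)

H = np.random.randn(10, 512)
print(local_window(H, p_t=5, D=2).shape)    # (5, 512): positions 3..7
```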

Example: Comparing Dot Product vs General

If decoder state sₜ has dimension 512 and encoder state hᵢ has dimension 256 (a numeric check follows this list):

  • Dot product: Not possible (dimension mismatch)
  • General: W ∈ ℝ^{512×256}, compute sₜᵀ·W·hᵢ ∈ ℝ
  • Result: Single scalar alignment score
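
A quick check of these shapes in NumPy (values are random; only the dimensions matter):

```python
import numpy as np

s_t = np.random.randn(512)       # decoder state
h_i = np.random.randn(256)       # encoder state
W = np.random.randn(512, 256)    # projects the 256-dim encoder state into the 512-dim space

# s_t @ h_i would raise a shape error: (512,) and (256,) are incompatible.
score = s_t @ W @ h_i            # (512,) @ (512, 256) -> (256,), then @ (256,) -> scalar
print(score)                     # a single scalar alignment score
```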

Comparison: Additive vs Multiplicative

| Aspect      | Additive (Bahdanau)  | Multiplicative (Luong) |
|-------------|----------------------|------------------------|
| Formula     | vᵀ tanh(Ws + Uh)     | sᵀ·W·h or sᵀ·h         |
| Parameters  | W, U, v              | W only (or none)       |
| Computation | Feed-forward + tanh  | Matrix multiply        |
| Speed       | Slower               | Faster                 |
| Memory      | More                 | Less                   |
| Flexibility | More expressive      | Less expressive        |

Advantages of Multiplicative Attention

  • Computational efficiency: scores reduce to matrix multiplications, which are highly optimized on modern hardware
  • Fewer parameters: only W for the general form, and none at all for the pure dot product
  • Lower memory usage than the additive feed-forward scorer

Disadvantages

  • Less expressive than the additive scorer's feed-forward network
  • The pure dot product requires the encoder and decoder states to share the same dimensionality
  • Without scaling, dot products can grow large in high dimensions and saturate the softmax (addressed by the √d factor below)

Impact on Modern Architectures

The general multiplicative attention (sᵀ·W·h) evolved into what we now call scaled dot-product attention in Transformers:

Attention(Q, K, V) = softmax(QKᵀ / √d) · V

The scaling factor √d was introduced because unscaled dot products grow with the dimension d, pushing the softmax into saturated regions where gradients vanish.
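
A minimal single-head sketch of scaled dot-product attention, with batching and masking omitted:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # scale to keep scores in a moderate range
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

Q = np.random.randn(4, 64)
K = np.random.randn(10, 64)
V = np.random.randn(10, 64)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 64)
```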

Test Your Understanding

Question 1: How many scoring functions did Luong et al. propose?

  • A) 1
  • B) 2
  • C) 3
  • D) 4

Question 2: What is the simplest multiplicative scoring function?

  • A) General
  • B) Concat
  • C) Dot product
  • D) Additive

Question 3: In eₜᵢ = sₜᵀ · W · hᵢ, what are the dimensions if sₜ ∈ ℝ^{512} and hᵢ ∈ ℝ^{256}?

  • A) W ∈ ℝ^{512×256}
  • B) W ∈ ℝ^{256×512}
  • C) W ∈ ℝ^{512×512}
  • D) W ∈ ℝ^{256×256}

Question 4: What is the advantage of multiplicative over additive attention?

  • A) More parameters
  • B) Better expressiveness
  • C) Computational efficiency
  • D) More expressive FFN

Question 5: Why was the scaling factor √d introduced in dot-product attention?

  • A) To speed up computation
  • B) To prevent gradient vanishing from large dot products
  • C) To match softmax requirements
  • D) To reduce memory usage

Question 6: What does local attention attend to?

  • A) All source positions
  • B) Only positions around current step
  • C) Only the first position
  • D) Random positions

Question 7: Which scoring function is used in modern Transformers?

  • A) Additive
  • B) Concat
  • C) Scaled dot-product
  • D) General with tanh

Question 8: In the concat scoring function, what does [sₜ; hᵢ] represent?

  • A) Element-wise product
  • B) Concatenation along feature dimension
  • C) Dot product
  • D) Matrix multiplication