36. Kernelized Attention

Introduction

Kernelized attention reformulates the attention mechanism using kernel methods to achieve time and space complexity that is linear in the sequence length. Instead of computing every exp(qᵢ·kⱼ/√d) similarity explicitly, it approximates the softmax with kernel feature maps, enabling efficient computation through the "kernel trick."

The Kernel Trick

Standard softmax:
softmax(QKᵀ/√d)[i,j] = exp(qᵢ·kⱼ/√d) / Σₗ exp(qᵢ·kₗ/√d)

Kernelized (approximate), for a suitable feature map φ:
exp(qᵢ·kⱼ/√d) ≈ φ(qᵢ)·φ(kⱼ)
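Why this reordering helps: once the similarity is a plain inner product, the associativity of matrix multiplication lets us avoid ever forming the n×n attention matrix. A minimal NumPy sketch (the feature maps here are random placeholders, chosen only to demonstrate the reordering):

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 6, 4, 8                   # sequence length, head dim, feature dim
phi_Q = rng.random((n, m))          # stand-in for φ(Q)
phi_K = rng.random((n, m))          # stand-in for φ(K)
V = rng.random((n, d))

quadratic = (phi_Q @ phi_K.T) @ V   # builds an n×n intermediate: O(n²)
linear = phi_Q @ (phi_K.T @ V)      # only an m×d intermediate: O(n·m·d)

assert np.allclose(quadratic, linear)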

Feature Map Approximations

1. Random Feature Maps (Performer)

φ(x) = (1/√m) · [cos(x·w₁), sin(x·w₁), cos(x·w₂), sin(x·w₂), …, cos(x·wₘ), sin(x·wₘ)]

where the wᵢ are random projection vectors drawn i.i.d. from a Gaussian distribution and m is the number of random features; an equivalent variant uses √(2/m)·cos(x·wᵢ + bᵢ) with random phases bᵢ. (Performer's FAVOR+ mechanism replaces these trigonometric features with positive exponential features for numerical stability.)
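A minimal NumPy sketch of this construction, using the classic random Fourier features of Rahimi and Recht for the Gaussian kernel exp(−‖q−k‖²/2); the softmax kernel exp(q·k) equals this Gaussian kernel up to the per-vector factors exp(‖q‖²/2) and exp(‖k‖²/2), so approximating one approximates the other:

import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 4096                     # input dim, number of random features

W = rng.standard_normal((m, d))     # wᵢ ~ N(0, I)
b = rng.uniform(0, 2 * np.pi, m)    # random phases

def phi(x):
    # E[φ(q)·φ(k)] = exp(−‖q−k‖²/2) when W is Gaussian
    return np.sqrt(2.0 / m) * np.cos(x @ W.T + b)

q = rng.standard_normal(d) * 0.3
k = rng.standard_normal(d) * 0.3
exact = np.exp(-np.sum((q - k) ** 2) / 2)
print(exact, phi(q) @ phi(k))       # close for large m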

2. Polynomial Kernels

K(q,k) = (q·k/√d + c)^p

Admits an exact finite-dimensional feature map; since exp is a sum of polynomial terms (its Taylor series), suitable degrees p and offsets c can approximate softmax attention over a bounded range. (The exponent p is the kernel degree, not the head dimension d.)
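Unlike the softmax kernel, a low-degree polynomial kernel needs no randomness at all. A small NumPy sketch of the exact feature map for degree p = 2 with c = 1 (the √d scaling is dropped here for clarity):

import numpy as np

def poly2_features(x, c=1.0):
    # Exact feature map for (x·y + c)²:
    # (x·y + c)² = Σᵢⱼ xᵢxⱼyᵢyⱼ + 2c·Σᵢ xᵢyᵢ + c²
    return np.concatenate([np.outer(x, x).ravel(),
                           np.sqrt(2 * c) * x,
                           [c]])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(4), rng.standard_normal(4)
exact = (q @ k + 1.0) ** 2
assert np.allclose(exact, poly2_features(q) @ poly2_features(k))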

3. ELU Feature Map

φ(x) = ELU(x) + 1

Used in Linear Transformers (Katharopoulos et al., 2020)
Simpler and cheaper than random features, but a looser approximation of softmax
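A one-line sketch of this feature map; the +1 keeps every feature strictly positive, so the attention weights stay non-negative and the normalizer cannot vanish:

import numpy as np

def elu_feature_map(x):
    # φ(x) = ELU(x) + 1 with α = 1: x + 1 where x > 0, exp(x) elsewhere
    return np.where(x > 0, x + 1.0, np.exp(x))

print(elu_feature_map(np.array([-2.0, 0.0, 3.0])))  # [0.135.. 1. 4.]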

Kernelized Attention Formula

Attention(Q,K,V) ≈ (φ(Q)(φ(K)ᵀV)) / (φ(Q)(φ(K)ᵀ1))

where 1 is the all-ones vector and the division is applied row-wise.

Computed as:
S = φ(Q) @ (φ(K)ᵀV) [O(n·d²)]
z = φ(K)ᵀ1, then divide each row of S by φ(Q)·z [O(n·d)]
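Putting the pieces together, a minimal non-causal NumPy sketch using the ELU feature map from above (a toy reference, not an optimized implementation):

import numpy as np

def elu_feature_map(x):
    return np.where(x > 0, x + 1.0, np.exp(x))      # φ(x) = ELU(x) + 1

def kernelized_attention(Q, K, V, phi=elu_feature_map):
    phi_Q, phi_K = phi(Q), phi(K)                   # (n, m)
    KV = phi_K.T @ V                                # (m, d): O(n·m·d)
    z = phi_K.sum(axis=0)                           # (m,):   O(n·m)
    return (phi_Q @ KV) / (phi_Q @ z)[:, None]      # (n, d): O(n·m·d)

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(kernelized_attention(Q, K, V).shape)          # (128, 16)

For autoregressive models, the same computation is done with running prefix sums of φ(kⱼ)vⱼᵀ and φ(kⱼ) over positions, which is what gives linear attention its RNN-like constant cost per generated token.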

Properties

Aspect           Standard     Kernelized
Time             O(n²·d)      O(n·d²) or O(n)
Space            O(n²)        O(n) or O(d²)
Approximation    Exact        Approximate

Error Analysis

Kernel approximation introduces error that can accumulate. With random feature maps the estimate of the softmax kernel is unbiased, but its variance shrinks only slowly as the number of features m grows (the standard deviation falls roughly as 1/√m), and it is larger for large-magnitude queries and keys. In a deep stack of attention layers these per-layer errors can compound, which is why Performer uses positive features and periodically redraws its random projections during training.
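A quick way to see this is to measure the relative error of the random-feature estimate as the number of features m grows, reusing the trigonometric construction sketched earlier (all numbers here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
d = 16
q = rng.standard_normal(d) * 0.3
k = rng.standard_normal(d) * 0.3
exact = np.exp(-np.sum((q - k) ** 2) / 2)

for m in (64, 256, 1024, 4096):
    W = rng.standard_normal((m, d))
    b = rng.uniform(0, 2 * np.pi, m)
    approx = (np.sqrt(2.0 / m) * np.cos(q @ W.T + b)) @ \
             (np.sqrt(2.0 / m) * np.cos(k @ W.T + b))
    print(m, abs(approx - exact) / exact)           # shrinks roughly as 1/√m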

Test Your Understanding

Question 1: Kernelized attention uses the kernel trick to:

  • A) Make attention more accurate
  • B) Reduce complexity from O(n²) to O(n) or O(n·d²)
  • C) Increase memory
  • D) Replace softmax

Question 2: What does φ(x) represent in kernelized attention?

  • A) Raw input
  • B) Feature map of input
  • C) Loss function
  • D) Attention weight

Question 3: Performer uses which type of feature map?

  • A) Fixed deterministic
  • B) Random feature maps
  • C) Learned feature maps
  • D) No feature map

Question 4: ELU feature map is used in:

  • A) Performer
  • B) Linear Transformers
  • C) BigBird
  • D) Longformer

Question 5: The kernel trick replaces:

  • A) exp(q·k) with φ(q)·φ(k)ᵀ
  • B) Q with K
  • C) V with Q
  • D) softmax with sum

Question 6: Polynomial kernel K(q,k) = (q·k/√d + c)^d approximates:

  • A) ReLU attention
  • B) Softmax attention
  • C) Linear attention
  • D) No approximation

Question 7: Kernelized attention only approximates standard attention, rather than matching it exactly, because:

  • A) Too fast
  • B) Kernel approximation introduces error
  • C) Uses different activation
  • D) Has fewer parameters

Question 8: In kernelized attention, complexity is measured in terms of:

  • A) n² only
  • B) n·d (sequence × dimension)
  • C) O(n·d²) or O(n), depending on the kernel
  • D) d² only