36. Kernelized Attention

Introduction

Kernelized attention reformulates the attention mechanism using kernel methods to achieve time and space complexity that is linear in the sequence length. Instead of computing every exp(qᵢ·kⱼ/√d) similarity explicitly, it approximates the softmax with kernel feature maps, enabling efficient computation through the "kernel trick."

The Kernel Trick

Standard softmax:
softmax(QKᵀ/√d)[i,j] = exp(qᵢ·kⱼ/√d) / Σₗ exp(qᵢ·kₗ/√d)

Kernelized (approximate), for a suitable feature map φ:
exp(qᵢ·kⱼ/√d) ≈ φ(qᵢ)·φ(kⱼ)
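Why this reordering helps: once the similarity is a plain inner product, the associativity of matrix multiplication lets us avoid ever forming the n×n attention matrix. A minimal NumPy sketch (the feature maps here are random placeholders, chosen only to demonstrate the reordering):

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 6, 4, 8                   # sequence length, head dim, feature dim
phi_Q = rng.random((n, m))          # stand-in for φ(Q)
phi_K = rng.random((n, m))          # stand-in for φ(K)
V = rng.random((n, d))

quadratic = (phi_Q @ phi_K.T) @ V   # builds an n×n intermediate: O(n²)
linear = phi_Q @ (phi_K.T @ V)      # only an m×d intermediate: O(n·m·d)

assert np.allclose(quadratic, linear)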

Feature Map Approximations

1. Random Feature Maps (Performer)

φ(x) = (1/√m) · [cos(x·w₁), sin(x·w₁), cos(x·w₂), sin(x·w₂), …, cos(x·wₘ), sin(x·wₘ)]

where the wᵢ are random projection vectors drawn i.i.d. from a Gaussian distribution and m is the number of random features; an equivalent variant uses √(2/m)·cos(x·wᵢ + bᵢ) with random phases bᵢ. (Performer's FAVOR+ mechanism replaces these trigonometric features with positive exponential features for numerical stability.)
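A minimal NumPy sketch of this construction, using the classic random Fourier features of Rahimi and Recht for the Gaussian kernel exp(−‖q−k‖²/2); the softmax kernel exp(q·k) equals this Gaussian kernel up to the per-vector factors exp(‖q‖²/2) and exp(‖k‖²/2), so approximating one approximates the other:

import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 4096                     # input dim, number of random features

W = rng.standard_normal((m, d))     # wᵢ ~ N(0, I)
b = rng.uniform(0, 2 * np.pi, m)    # random phases

def phi(x):
    # E[φ(q)·φ(k)] = exp(−‖q−k‖²/2) when W is Gaussian
    return np.sqrt(2.0 / m) * np.cos(x @ W.T + b)

q = rng.standard_normal(d) * 0.3
k = rng.standard_normal(d) * 0.3
exact = np.exp(-np.sum((q - k) ** 2) / 2)
print(exact, phi(q) @ phi(k))       # close for large m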

2. Polynomial Kernels

K(q,k) = (q·k/√d + c)^p

Admits an exact finite-dimensional feature map; since exp is a sum of polynomial terms (its Taylor series), suitable degrees p and offsets c can approximate softmax attention over a bounded range. (The exponent p is the kernel degree, not the head dimension d.)
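Unlike the softmax kernel, a low-degree polynomial kernel needs no randomness at all. A small NumPy sketch of the exact feature map for degree p = 2 with c = 1 (the √d scaling is dropped here for clarity):

import numpy as np

def poly2_features(x, c=1.0):
    # Exact feature map for (x·y + c)²:
    # (x·y + c)² = Σᵢⱼ xᵢxⱼyᵢyⱼ + 2c·Σᵢ xᵢyᵢ + c²
    return np.concatenate([np.outer(x, x).ravel(),
                           np.sqrt(2 * c) * x,
                           [c]])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(4), rng.standard_normal(4)
exact = (q @ k + 1.0) ** 2
assert np.allclose(exact, poly2_features(q) @ poly2_features(k))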

3. ELU Feature Map

φ(x) = ELU(x) + 1

Used in Linear Transformers (Katharopoulos et al., 2020)
Simpler and cheaper than random features, but a looser approximation of softmax
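A one-line sketch of this feature map; the +1 keeps every feature strictly positive, so the attention weights stay non-negative and the normalizer cannot vanish:

import numpy as np

def elu_feature_map(x):
    # φ(x) = ELU(x) + 1 with α = 1: x + 1 where x > 0, exp(x) elsewhere
    return np.where(x > 0, x + 1.0, np.exp(x))

print(elu_feature_map(np.array([-2.0, 0.0, 3.0])))  # [0.135.. 1. 4.]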

Kernelized Attention Formula

Attention(Q,K,V) ≈ (φ(Q)(φ(K)ᵀV)) / (φ(Q)(φ(K)ᵀ1))

where 1 is the all-ones vector and the division is applied row-wise.

Computed as:
S = φ(Q) @ (φ(K)ᵀV) [O(n·d²)]
z = φ(K)ᵀ1, then divide each row of S by φ(Q)·z [O(n·d)]
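Putting the pieces together, a minimal non-causal NumPy sketch using the ELU feature map from above (a toy reference, not an optimized implementation):

import numpy as np

def elu_feature_map(x):
    return np.where(x > 0, x + 1.0, np.exp(x))      # φ(x) = ELU(x) + 1

def kernelized_attention(Q, K, V, phi=elu_feature_map):
    phi_Q, phi_K = phi(Q), phi(K)                   # (n, m)
    KV = phi_K.T @ V                                # (m, d): O(n·m·d)
    z = phi_K.sum(axis=0)                           # (m,):   O(n·m)
    return (phi_Q @ KV) / (phi_Q @ z)[:, None]      # (n, d): O(n·m·d)

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(kernelized_attention(Q, K, V).shape)          # (128, 16)

For autoregressive models, the same computation is done with running prefix sums of φ(kⱼ)vⱼᵀ and φ(kⱼ) over positions, which is what gives linear attention its RNN-like constant cost per generated token.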

Properties

Aspect           Standard     Kernelized
Time             O(n²·d)      O(n·d²) or O(n)
Space            O(n²)        O(n) or O(d²)
Approximation    Exact        Approximate

Error Analysis

Kernel approximation introduces error that can accumulate. With random feature maps the estimate of the softmax kernel is unbiased, but its variance shrinks only slowly as the number of features m grows (the standard deviation falls roughly as 1/√m), and it is larger for large-magnitude queries and keys. In a deep stack of attention layers these per-layer errors can compound, which is why Performer uses positive features and periodically redraws its random projections during training.
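A quick way to see this is to measure the relative error of the random-feature estimate as the number of features m grows, reusing the trigonometric construction sketched earlier (all numbers here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
d = 16
q = rng.standard_normal(d) * 0.3
k = rng.standard_normal(d) * 0.3
exact = np.exp(-np.sum((q - k) ** 2) / 2)

for m in (64, 256, 1024, 4096):
    W = rng.standard_normal((m, d))
    b = rng.uniform(0, 2 * np.pi, m)
    approx = (np.sqrt(2.0 / m) * np.cos(q @ W.T + b)) @ \
             (np.sqrt(2.0 / m) * np.cos(k @ W.T + b))
    print(m, abs(approx - exact) / exact)           # shrinks roughly as 1/√m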

Test Your Understanding

Question 1: Kernelized attention uses the kernel trick to:

  • A) Make attention more accurate
  • B) Reduce complexity from O(n²) to O(n) or O(n·d²)
  • C) Increase memory
  • D) Replace softmax

Question 2: What does φ(x) represent in kernelized attention?

  • A) Raw input
  • B) Feature map of input
  • C) Loss function
  • D) Attention weight

Question 3: Performer uses which type of feature map?

  • A) Fixed deterministic
  • B) Random feature maps
  • C) Learned feature maps
  • D) No feature map

Question 4: ELU feature map is used in:

  • A) Performer
  • B) Linear Transformers
  • C) BigBird
  • D) Longformer

Question 5: The kernel trick replaces:

  • A) exp(q·k) with φ(q)·φ(k)ᵀ
  • B) Q with K
  • C) V with Q
  • D) softmax with sum

Question 6: Polynomial kernel K(q,k) = (q·k/√d + c)^d approximates:

  • A) ReLU attention
  • B) Softmax attention
  • C) Linear attention
  • D) No approximation

Question 7: Kernelized attention only approximates standard attention, rather than matching it exactly, because:

  • A) Too fast
  • B) Kernel approximation introduces error
  • C) Uses different activation
  • D) Has fewer parameters

Question 8: In kernelized attention, complexity is measured in terms of:

  • A) n² only
  • B) n·d (sequence × dimension)
  • C) O(n·d²) or O(n), depending on the kernel
  • D) d² only