37. Performer Attention (Rethinking Attention with Performers)

Introduction

Performer (Choromanski et al., 2020, "Rethinking Attention with Performers") is a Transformer variant whose attention uses random feature maps to give an unbiased approximation of the softmax kernel. This reduces attention to linear time and memory in the sequence length while remaining a theoretically justified approximation of standard attention.

Problem with Standard Softmax

Standard attention, softmax(QKᵀ/√d), requires materializing the full n×n score matrix:

For n = 4096: about 16.8M entries per head
For n = 16384: about 268M entries per head, at which point memory becomes the bottleneck
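
As a rough illustration, the sketch below (my own, not from the paper; n = 4096 and d = 64 are arbitrary illustrative sizes) shows what a single head's score matrix costs in float32:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 4096, 64                          # illustrative sequence length and head dimension
    Q = rng.standard_normal((n, d), dtype=np.float32)
    K = rng.standard_normal((n, d), dtype=np.float32)

    scores = Q @ K.T / d**0.5                # the full n x n score matrix standard attention materializes
    print(scores.shape)                      # (4096, 4096) -> ~16.8M entries
    print(scores.nbytes / 2**20, "MiB")      # 64.0 MiB in float32, per head, per layer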

Random Feature Map for Softmax

The softmax kernel can be approximated using random features:

exp(qᵀk/√d) ≈ φ(q)ᵀφ(k)

Where φ(x) ∈ ℝᵐ is built from m random projections of x. Crucially, the estimate is unbiased: E[φ(q)ᵀφ(k)] = exp(qᵀk/√d).
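
The identity behind the positive feature map introduced below is E[exp(ωᵀq - ‖q‖²/2) · exp(ωᵀk - ‖k‖²/2)] = exp(qᵀk) for ω ~ N(0, I). Here is a minimal Monte-Carlo sanity check of it (my own sketch; the toy dimension, sample count, and small vector scale are arbitrary choices that keep the plain estimator's variance low):

    import numpy as np

    rng = np.random.default_rng(0)
    d, m = 16, 200_000                 # toy dimension and number of random samples
    q = 0.3 * rng.standard_normal(d)   # small-norm toy vectors keep the estimator's variance low
    k = 0.3 * rng.standard_normal(d)

    W = rng.standard_normal((m, d))    # rows are omega_i ~ N(0, I)
    samples = np.exp(W @ q - q @ q / 2) * np.exp(W @ k - k @ k / 2)

    print(samples.mean())              # Monte-Carlo estimate of the softmax kernel exp(q.k)
    print(np.exp(q @ k))               # exact value; in Performer the 1/sqrt(d) scaling is folded into q and k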

The FAVOR+ Algorithm

Performer uses "FAVOR+" (Fast Attention Via positive Orthogonal Random features). Classical random Fourier features of the form [cos(wᵢᵀx + bᵢ), sin(wᵢᵀx + bᵢ)] can also approximate the softmax kernel, but they can produce negative estimates and become unstable where attention weights are close to zero. FAVOR+ instead uses positive random features:

φ(x) = exp(-‖x‖²/2) · [exp(w₁ᵀx), ..., exp(wₘᵀx)] / √m

Where wᵢ ~ N(0, I), and the wᵢ are drawn so that they are exactly orthogonal, which provably reduces the variance of the estimate
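
Below is a compact NumPy sketch of this feature map (my own illustration, not the paper's reference code; the block-orthogonal construction with chi-distributed row norms follows the paper's description, while d = 16, m = 4096, and the small vector scale are arbitrary toy choices):

    import numpy as np

    def orthogonal_gaussian(m, d, rng):
        """m x d matrix whose rows behave like N(0, I) samples but are orthogonal within each d-row block."""
        blocks = []
        for _ in range(-(-m // d)):                                        # ceil(m / d) blocks
            ortho, _ = np.linalg.qr(rng.standard_normal((d, d)))           # d orthonormal directions
            norms = np.linalg.norm(rng.standard_normal((d, d)), axis=1)    # norms of genuine Gaussian rows
            blocks.append(ortho.T * norms[:, None])
        return np.vstack(blocks)[:m]

    def positive_features(X, W):
        """FAVOR+ positive feature map phi(x) = exp(W x - |x|^2 / 2) / sqrt(m), applied to each row of X."""
        m = W.shape[0]
        return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(m)

    rng = np.random.default_rng(0)
    d, m = 16, 4096
    W = orthogonal_gaussian(m, d, rng)
    q = 0.3 * rng.standard_normal((1, d))        # small-norm toy query and key
    k = 0.3 * rng.standard_normal((1, d))
    print((positive_features(q, W) @ positive_features(k, W).T).item())   # roughly exp(q.k)
    print(np.exp(q @ k.T).item())                                         # exact kernel value

With a single draw of W the match is only approximate; the error shrinks as m grows, and orthogonal rows give a lower-variance estimate than i.i.d. Gaussian rows.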

Algorithm Steps

1. Sample a random projection matrix W ∈ ℝ^{m×d} whose rows wᵢ are Gaussian and made orthogonal.

2. Compute the feature matrices Φ(Q), Φ(K) ∈ ℝ^{n×m} by applying the positive feature map φ to each (scaled) query and key row.

3. Use associativity to avoid ever forming an n×n matrix: first compute Φ(K)ᵀV ∈ ℝ^{m×d} and Φ(K)ᵀ1ₙ ∈ ℝᵐ, then

   Attention ≈ D⁻¹ · Φ(Q) · (Φ(K)ᵀV),  where D = diag(Φ(Q) · (Φ(K)ᵀ1ₙ))

   supplies the row normalization that the softmax denominator provides in standard attention.

Complexity: O(n·m·d) time and O(n·m + m·d) extra space, where m << n
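
Putting the steps together, here is a minimal single-head, non-causal sketch (my own code under the assumptions above, not the paper's implementation; it uses i.i.d. Gaussian features for brevity where the paper would use orthogonal ones, and the causal variant would additionally need prefix sums over positions):

    import numpy as np

    def phi(X, W):
        """Positive random feature map phi(x) = exp(W x - |x|^2 / 2) / sqrt(m), applied to each row of X."""
        return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(W.shape[0])

    def performer_attention(Q, K, V, m=256, rng=None):
        """FAVOR+-style non-causal attention in O(n*m*d); never materializes the n x n score matrix."""
        if rng is None:
            rng = np.random.default_rng(0)
        d = Q.shape[-1]
        W = rng.standard_normal((m, d))                      # i.i.d. Gaussian projections (paper: orthogonal)
        Qp, Kp = phi(Q / d**0.25, W), phi(K / d**0.25, W)    # fold the 1/sqrt(d) scaling into Q and K
        KtV = Kp.T @ V                                       # m x d summary of keys and values
        normalizer = Qp @ Kp.sum(axis=0)                     # n-vector replacing the softmax denominator
        return (Qp @ KtV) / normalizer[:, None]

    def exact_attention(Q, K, V):
        """Standard softmax attention, for comparison (materializes the n x n matrix)."""
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (weights / weights.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(0)
    n, d = 512, 16
    Q, K, V = (0.5 * rng.standard_normal((n, d)) for _ in range(3))
    approx = performer_attention(Q, K, V, m=1024, rng=rng)
    exact = exact_attention(Q, K, V)
    print(np.mean(np.abs(approx - exact)) / np.mean(np.abs(exact)))   # mean relative error; shrinks as m grows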

Theoretical Guarantees

FAVOR+ gives an unbiased estimate of the softmax kernel: E[φ(q)ᵀφ(k)] = exp(qᵀk/√d). Because the features are strictly positive, the estimate cannot change sign and its relative error stays controlled even where attention weights are near zero, the regime in which trigonometric random features break down. Using orthogonal wᵢ provably lowers the estimator's variance, and the paper establishes uniform convergence of the approximate attention with a number of features m that does not need to grow with the sequence length n.

Comparison

Aspect        | Standard | Performer
------------- | -------- | ------------------------
Complexity    | O(n²·d)  | O(n·m·d)
Space         | O(n²)    | O(n·m)
Approximation | Exact    | Theoretically justified
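
To make these asymptotics concrete, a back-of-the-envelope comparison (the values n = 16384, d = 64, m = 256 are my own illustrative assumptions, not from the paper):

    n, d, m = 16_384, 64, 256

    standard_time = n * n * d            # ~1.7e10 multiply-adds just for the score matrix
    performer_time = n * m * d           # ~2.7e8
    standard_space = n * n               # ~2.7e8 score-matrix entries
    performer_space = n * m              # ~4.2e6 feature-matrix entries

    print(standard_time / performer_time)     # 64.0 -> ~64x fewer operations
    print(standard_space / performer_space)   # 64.0 -> ~64x less attention-specific memory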

Advantages

  • Linear O(n·m·d) time and O(n·m) memory in the sequence length, versus quadratic for standard attention
  • Unbiased, theoretically justified approximation of softmax attention
  • Scales to long sequences (e.g., n = 16384) where the full n×n score matrix no longer fits comfortably in memory

Test Your Understanding

Question 1: In terms of sequence length n, number of random features m, and head dimension d, what complexity does Performer attention achieve?

  • A) O(n²)
  • B) O(n)
  • C) O(n·m·d)
  • D) O(d²)

Question 2: What does FAVOR+ stand for?

  • A) Fast Attention Via Orthogonal Random Features
  • B) Fast Attention Via positive Orthogonal Random Features
  • C) Fast Algorithm Via positive Random Features
  • D) Full Attention Via Orthogonal Random

Question 3: The random feature map φ(x) uses:

  • A) Fixed projections
  • B) Random Gaussian projections wᵢ ~ N(0, I), made orthogonal
  • C) Learned projections
  • D) No projections

Question 4: Performer provides:

  • A) No theoretical guarantee
  • B) Theoretically justified unbiased approximation
  • C) Exact standard attention
  • D) Random results

Question 5: The approximation E[φ(q)ᵀφ(k)] equals:

  • A) 0
  • B) 1
  • C) exp(qᵀk/√d)
  • D) q·k

Question 6: Performer was introduced in:

  • A) "Attention is All You Need"
  • B) "Rethinking Attention with Performers"
  • C) "BERT"
  • D) "Longformer"

Question 7: The dimension m in the random feature map is:

  • A) Equal to n
  • B) Much smaller than n (m << n)
  • C) Equal to d
  • D) Zero

Question 8: What advantage does Performer have over simple linear attention?

  • A) More parameters
  • B) Theoretically justified unbiased approximation
  • C) Slower computation
  • D) Higher memory