67. Transformer Alternatives

Introduction

Transformer alternatives are neural network architectures designed to solve the same tasks as Transformers using different mechanisms, most often to avoid the O(n²) cost of self-attention or to capture different inductive biases.

Main Alternatives

1. State Space Models (SSM)

Models such as S4 and Mamba use a state space formulation:

h'(t) = A·h(t) + B·x(t)
y(t) = C·h(t) + D·x(t)

Discretizing this system yields a recurrence over the sequence that can be evaluated in linear time, O(n).
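A minimal NumPy sketch of this recurrence, assuming a discretized system with toy shapes (real S4/Mamba layers use structured A matrices and, in Mamba, input-dependent selective gating):

import numpy as np

def ssm_scan(x, A, B, C, D):
    # Run h_t = A·h_{t-1} + B·x_t and y_t = C·h_t + D·x_t step by step.
    # One state update per token, so the cost grows linearly with length, O(n).
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t
        ys.append(C @ h + D @ x_t)
    return np.stack(ys)

rng = np.random.default_rng(0)
n, d_in, d_state, d_out = 16, 4, 8, 4        # toy sizes, purely illustrative
x = rng.normal(size=(n, d_in))
A = 0.9 * np.eye(d_state)                    # stable toy transition matrix
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state)) * 0.1
D = rng.normal(size=(d_out, d_in)) * 0.1
print(ssm_scan(x, A, B, C, D).shape)         # (16, 4)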

2. Recurrent Memory

Recurrent networks (LSTM/GRU) extended with attention-like gating or memory mechanisms.
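As a rough sketch of the gated-recurrence idea, here is a standard GRU step in NumPy (the attention-like extensions vary by model and are not shown):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    # One recurrent step: gates decide how much of the memory h to keep.
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_new = np.tanh(Wh @ x + Uh @ (r * h))    # candidate state
    return (1 - z) * h + z * h_new

rng = np.random.default_rng(0)
d_h, d_x = 8, 4                               # toy sizes, purely illustrative
h, x = np.zeros(d_h), rng.normal(size=d_x)
W = lambda a, b: rng.normal(size=(a, b)) * 0.1
print(gru_step(h, x, W(d_h, d_x), W(d_h, d_h), W(d_h, d_x), W(d_h, d_h),
               W(d_h, d_x), W(d_h, d_h)).shape)   # (8,)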

3. Perceiver/Perceiver IO

A small latent array attends to the full input via cross-attention, which lets the model process arbitrary modalities at a cost linear in the input length.
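A minimal sketch of the latent cross-attention step, with illustrative shapes and without the learned query/key/value projections a real Perceiver uses:

import numpy as np

def cross_attention(latents, inputs):
    # latents: (m, d), inputs: (n, d). The m latents attend to the n inputs,
    # so the cost is O(n·m) rather than O(n²).
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)          # (m, n) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over inputs
    return weights @ inputs                           # (m, d) updated latents

rng = np.random.default_rng(0)
n, m, d = 4096, 64, 32                                # long input, small latent
inputs = rng.normal(size=(n, d))
latents = rng.normal(size=(m, d))
print(cross_attention(latents, inputs).shape)         # (64, 32)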

4. FFN-only models

Replacing attention with FFNs (e.g., MLP-Mixer, gMLP).
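A bare-bones sketch of the Mixer idea with random placeholder weights (a real MLP-Mixer adds layer norm, GELU activations, and residual connections):

import numpy as np

def token_mix(x, w1, w2):
    # Mix information across tokens: the MLP acts along the token axis.
    h = np.maximum(0, w1 @ x)        # (hidden, channels)
    return w2 @ h                    # (tokens, channels)

def channel_mix(x, w1, w2):
    # Mix information across channels, independently for each token.
    h = np.maximum(0, x @ w1)        # (tokens, hidden)
    return h @ w2                    # (tokens, channels)

rng = np.random.default_rng(0)
tokens, channels, hidden = 196, 64, 128     # toy sizes, purely illustrative
x = rng.normal(size=(tokens, channels))
x = token_mix(x, rng.normal(size=(hidden, tokens)) * 0.05,
              rng.normal(size=(tokens, hidden)) * 0.05)
x = channel_mix(x, rng.normal(size=(channels, hidden)) * 0.05,
                rng.normal(size=(hidden, channels)) * 0.05)
print(x.shape)                       # (196, 64)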

Comparison

Architecture | Complexity | Use Case
Transformer | O(n²) | Universal
Mamba (SSM) | O(n) | Long sequences
Linear Transformer | O(n) | Efficient attention
Perceiver | O(n·m), small m | Multimodal
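A back-of-the-envelope comparison of the per-layer operation counts from the table, assuming a sequence length n = 8192 and a Perceiver latent size m = 256 (illustrative numbers only):

n, m = 8192, 256
print(f"Transformer self-attention ~ n^2  = {n**2:,}")   # 67,108,864
print(f"SSM / linear attention     ~ n    = {n:,}")       # 8,192
print(f"Perceiver cross-attention  ~ n·m  = {n*m:,}")     # 2,097,152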

Why Explore Alternatives?

In short: efficiency and inductive bias. Self-attention scales as O(n²) in sequence length, which becomes prohibitive for long sequences, and alternative architectures can encode different inductive biases that may suit particular data types or hardware better.

Test Your Understanding

Question 1: The main motivation for Transformer alternatives is:

  • A) Higher accuracy
  • B) Address O(n²) attention complexity
  • C) More parameters
  • D) No motivation

Question 2: State Space Models (like Mamba) have complexity:

  • A) O(n²)
  • B) O(n)
  • C) O(log n)
  • D) O(n³)

Question 3: Perceiver uses:

  • A) Full cross-attention
  • B) Small latent array for cross-attention
  • C) No attention
  • D) Full self-attention

Question 4: MLP-Mixer replaces attention with:

  • A) Recurrence
  • B) FFNs only
  • C) Attention still
  • D) No replacement

Question 5: SSM stands for:

  • A) State Space Model
  • B) Standard System Model
  • C) Simple State Machine
  • D) No meaning

Question 6: gMLP uses:

  • A) Attention
  • B) Gating with FFNs, no attention
  • C) Convolution only
  • D) RNN

Question 7: Mamba achieves O(n) via:

  • A) More parameters
  • B) State space formulation with selective gating
  • C) Less accuracy
  • D) No efficiency

Question 8: Alternative architectures may capture different:

  • A) Random noise
  • B) Inductive biases
  • C) Nothing
  • D) The same patterns