67. Transformer Alternatives

Introduction

Transformer alternatives are neural network architectures designed to solve the same tasks as Transformers using different mechanisms, most often to avoid the O(n²) cost of self-attention or to capture different inductive biases.

Main Alternatives

1. State Space Models (SSM)

Models such as S4 and Mamba use a state space formulation:

h'(t) = A·h(t) + B·x(t)
y(t) = C·h(t) + D·x(t)

Discretizing this system yields a recurrence over the sequence that can be evaluated in linear time, O(n).
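A minimal NumPy sketch of this recurrence, assuming a discretized system with toy shapes (real S4/Mamba layers use structured A matrices and, in Mamba, input-dependent selective gating):

import numpy as np

def ssm_scan(x, A, B, C, D):
    # Run h_t = A·h_{t-1} + B·x_t and y_t = C·h_t + D·x_t step by step.
    # One state update per token, so the cost grows linearly with length, O(n).
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t
        ys.append(C @ h + D @ x_t)
    return np.stack(ys)

rng = np.random.default_rng(0)
n, d_in, d_state, d_out = 16, 4, 8, 4        # toy sizes, purely illustrative
x = rng.normal(size=(n, d_in))
A = 0.9 * np.eye(d_state)                    # stable toy transition matrix
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state)) * 0.1
D = rng.normal(size=(d_out, d_in)) * 0.1
print(ssm_scan(x, A, B, C, D).shape)         # (16, 4)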

2. Recurrent Memory

Recurrent networks (LSTM/GRU) extended with attention-like gating or memory mechanisms.
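As a rough sketch of the gated-recurrence idea, here is a standard GRU step in NumPy (the attention-like extensions vary by model and are not shown):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    # One recurrent step: gates decide how much of the memory h to keep.
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_new = np.tanh(Wh @ x + Uh @ (r * h))    # candidate state
    return (1 - z) * h + z * h_new

rng = np.random.default_rng(0)
d_h, d_x = 8, 4                               # toy sizes, purely illustrative
h, x = np.zeros(d_h), rng.normal(size=d_x)
W = lambda a, b: rng.normal(size=(a, b)) * 0.1
print(gru_step(h, x, W(d_h, d_x), W(d_h, d_h), W(d_h, d_x), W(d_h, d_h),
               W(d_h, d_x), W(d_h, d_h)).shape)   # (8,)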

3. Perceiver/Perceiver IO

A small latent array attends to the full input via cross-attention, which lets the model process arbitrary modalities at a cost linear in the input length.
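A minimal sketch of the latent cross-attention step, with illustrative shapes and without the learned query/key/value projections a real Perceiver uses:

import numpy as np

def cross_attention(latents, inputs):
    # latents: (m, d), inputs: (n, d). The m latents attend to the n inputs,
    # so the cost is O(n·m) rather than O(n²).
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)          # (m, n) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over inputs
    return weights @ inputs                           # (m, d) updated latents

rng = np.random.default_rng(0)
n, m, d = 4096, 64, 32                                # long input, small latent
inputs = rng.normal(size=(n, d))
latents = rng.normal(size=(m, d))
print(cross_attention(latents, inputs).shape)         # (64, 32)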

4. FFN-only models

Replacing attention with FFNs (e.g., MLP-Mixer, gMLP).
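A bare-bones sketch of the Mixer idea with random placeholder weights (a real MLP-Mixer adds layer norm, GELU activations, and residual connections):

import numpy as np

def token_mix(x, w1, w2):
    # Mix information across tokens: the MLP acts along the token axis.
    h = np.maximum(0, w1 @ x)        # (hidden, channels)
    return w2 @ h                    # (tokens, channels)

def channel_mix(x, w1, w2):
    # Mix information across channels, independently for each token.
    h = np.maximum(0, x @ w1)        # (tokens, hidden)
    return h @ w2                    # (tokens, channels)

rng = np.random.default_rng(0)
tokens, channels, hidden = 196, 64, 128     # toy sizes, purely illustrative
x = rng.normal(size=(tokens, channels))
x = token_mix(x, rng.normal(size=(hidden, tokens)) * 0.05,
              rng.normal(size=(tokens, hidden)) * 0.05)
x = channel_mix(x, rng.normal(size=(channels, hidden)) * 0.05,
                rng.normal(size=(hidden, channels)) * 0.05)
print(x.shape)                       # (196, 64)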

Comparison

Architecture | Complexity | Use Case
Transformer | O(n²) | Universal
Mamba (SSM) | O(n) | Long sequences
Linear Transformer | O(n) | Efficient attention
Perceiver | O(n·m), small m | Multimodal
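A back-of-the-envelope comparison of the per-layer operation counts from the table, assuming a sequence length n = 8192 and a Perceiver latent size m = 256 (illustrative numbers only):

n, m = 8192, 256
print(f"Transformer self-attention ~ n^2  = {n**2:,}")   # 67,108,864
print(f"SSM / linear attention     ~ n    = {n:,}")       # 8,192
print(f"Perceiver cross-attention  ~ n·m  = {n*m:,}")     # 2,097,152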

Why Explore Alternatives?

In short: efficiency and inductive bias. Self-attention scales as O(n²) in sequence length, which becomes prohibitive for long sequences, and alternative architectures can encode different inductive biases that may suit particular data types or hardware better.

Test Your Understanding

Question 1: The main motivation for Transformer alternatives is:

  • A) Higher accuracy
  • B) Address O(n²) attention complexity
  • C) More parameters
  • D) No motivation

Question 2: State Space Models (like Mamba) have complexity:

  • A) O(n²)
  • B) O(n)
  • C) O(log n)
  • D) O(n³)

Question 3: Perceiver uses:

  • A) Full cross-attention
  • B) Small latent array for cross-attention
  • C) No attention
  • D) Full self-attention

Question 4: MLP-Mixer replaces attention with:

  • A) Recurrence
  • B) FFNs only
  • C) Attention still
  • D) No replacement

Question 5: SSM stands for:

  • A) State Space Model
  • B) Standard System Model
  • C) Simple State Machine
  • D) No meaning

Question 6: gMLP uses:

  • A) Attention
  • B) Gating with FFNs, no attention
  • C) Convolution only
  • D) RNN

Question 7: Mamba achieves O(n) via:

  • A) More parameters
  • B) State space formulation with selective gating
  • C) Less accuracy
  • D) No efficiency

Question 8: Alternative architectures may capture different:

  • A) Random noise
  • B) Inductive biases
  • C) Nothing
  • D) The same patterns