Introduction
Transformer alternatives are neural network architectures that target the same tasks as Transformers but use different mechanisms, typically to avoid the O(n²) cost of self-attention or to capture different inductive biases.
Main Alternatives
1. State Space Models (SSM)
Mamba and S4 are built on a state space formulation:
h' = A·h + B·x
y = C·h + D·x
Linear time complexity O(n)
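In practice, S4 and Mamba discretize these continuous-time equations and run the resulting recurrence as a scan. The sketch below shows why the cost is linear: each token requires one fixed-size state update. The function name `ssm_scan` and the toy matrices are illustrative assumptions, not any library's API.

```python
import numpy as np

def ssm_scan(x, A, B, C, D):
    """Run a (discretized) linear state space model over a sequence.

    x : (seq_len, d_in)    input sequence
    A : (d_state, d_state) state transition matrix
    B : (d_state, d_in)    input projection
    C : (d_out, d_state)   output projection
    D : (d_out, d_in)      skip connection
    Returns y : (seq_len, d_out)
    """
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):        # one fixed-size update per token -> O(seq_len)
        h = A @ h + B @ x[t]           # h_t = A h_{t-1} + B x_t
        ys.append(C @ h + D @ x[t])    # y_t = C h_t + D x_t
    return np.stack(ys)

# toy usage with arbitrary fixed matrices (real models learn/parameterize these)
rng = np.random.default_rng(0)
d_in, d_state, d_out, seq_len = 4, 8, 4, 16
x = rng.normal(size=(seq_len, d_in))
A = 0.9 * np.eye(d_state)                       # stable toy transition
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state)) * 0.1
D = np.eye(d_out, d_in)
print(ssm_scan(x, A, B, C, D).shape)            # (16, 4)
```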
2. Recurrent Memory
Recurrent architectures (LSTM/GRU) augmented with attention-like memory or readout mechanisms; a minimal sketch of the idea follows.
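A minimal sketch of the general idea, not any specific published model: a plain recurrent update runs in O(n), and an attention-like readout summarizes the stored hidden states at the end. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_with_attention_readout(x, W_h, W_x, W_q):
    """Recurrent memory plus an attention-like readout over past states.

    x   : (seq_len, d_in)
    W_h : (d_h, d_h), W_x : (d_h, d_in), W_q : (d_h, d_h)
    """
    h = np.zeros(W_h.shape[0])
    memory = []
    for t in range(x.shape[0]):
        h = np.tanh(W_h @ h + W_x @ x[t])    # recurrent update, O(seq_len) overall
        memory.append(h)
    memory = np.stack(memory)                # (seq_len, d_h) stored states
    q = W_q @ h                              # query derived from the final state
    weights = softmax(memory @ q)            # attention-like scores over the memory
    return weights @ memory                  # weighted summary of the sequence

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
out = rnn_with_attention_readout(
    x,
    W_h=rng.normal(size=(8, 8)) * 0.1,
    W_x=rng.normal(size=(8, 4)) * 0.1,
    W_q=rng.normal(size=(8, 8)) * 0.1,
)
print(out.shape)  # (8,)
```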
3. Perceiver/Perceiver IO
A small set of m learned latents cross-attends to the n input tokens, so arbitrary modalities can be processed at O(n·m) rather than O(n²) cost (see the sketch below).
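A rough sketch of the core cross-attention step, assuming single-head attention and omitting the query/key/value projections and layer stacking that the real Perceiver uses; the function name and shapes are illustrative.

```python
import numpy as np

def cross_attention(latents, inputs):
    """Perceiver-style cross-attention: m latents attend over n inputs.

    The score matrix is (m, n), so the cost is O(n·m) rather than O(n²);
    with a small, fixed m this stays cheap even for very long inputs.
    latents : (m, d)   learned latent array
    inputs  : (n, d)   tokens from any modality, projected to width d
    """
    scores = latents @ inputs.T / np.sqrt(latents.shape[1])   # (m, n)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)             # row-wise softmax
    return weights @ inputs                                    # (m, d)

rng = np.random.default_rng(0)
m, n, d = 32, 10_000, 64          # 32 latents summarize 10k input tokens
latents = rng.normal(size=(m, d))
inputs = rng.normal(size=(n, d))
print(cross_attention(latents, inputs).shape)   # (32, 64)
```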
4. FFN-only models
Replacing attention entirely with MLPs that mix information across tokens and channels (e.g., MLP-Mixer, gMLP); a simplified block is sketched below.
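A stripped-down, illustrative version of a Mixer-style block, assuming single linear layers where MLP-Mixer uses two-layer MLPs with GELU and LayerNorm; it only shows that tokens can exchange information without any attention. Note that the token-mixing weights are tied to a fixed sequence length.

```python
import numpy as np

def mixer_block(x, W_token, W_channel):
    """One simplified Mixer-style block: no attention at all.

    Token mixing applies a linear map across the sequence dimension,
    channel mixing applies a linear map across features.
    x         : (seq_len, d)
    W_token   : (seq_len, seq_len)   mixes information between positions
    W_channel : (d, d)               mixes information between channels
    """
    x = x + (W_token @ x)       # token mixing across positions
    x = x + (x @ W_channel)     # channel mixing within each token
    return x

rng = np.random.default_rng(0)
seq_len, d = 16, 8
x = rng.normal(size=(seq_len, d))
out = mixer_block(
    x,
    W_token=rng.normal(size=(seq_len, seq_len)) * 0.1,
    W_channel=rng.normal(size=(d, d)) * 0.1,
)
print(out.shape)   # (16, 8)
```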
Comparison
| Architecture | Complexity | Use Case |
|---|---|---|
| Transformer | O(n²) | Universal |
| Mamba (SSM) | O(n) | Long sequences |
| Linear Transformer | O(n) | Efficient attention |
| Perceiver | O(n·m), m ≪ n latents | Multimodal |
Why Explore Alternatives?
- Efficiency: O(n²) attention is expensive for long sequences
- Different inductive biases: some tasks benefit from recurrence
- Hardware fit: different operations map onto accelerators differently and may run faster in practice