Introduction
BigBird is a sparse attention mechanism introduced by Zaheer et al. (2020) that can handle sequences up to 8× longer than standard Transformers. It combines three attention patterns, local (sliding window), global, and random attention, and the paper shows that this sparse combination can approximate full attention while reducing the complexity from O(n²) to O(n) in sequence length.
Three Attention Patterns
1. Local Attention (Window)
Each position attends to w neighbors
Default w = 512
Captures local structure
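A minimal sketch of the window pattern as a boolean mask, assuming a symmetric window (w // 2 neighbors on each side of each position); the actual BigBird implementation works on blocks of tokens rather than individual positions:

```python
import numpy as np

def local_window_mask(n: int, w: int) -> np.ndarray:
    """Boolean mask: True at (i, j) when j lies within i's local window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

mask = local_window_mask(n=16, w=4)
print(mask.sum(axis=1))  # each row has at most w + 1 True entries
```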
2. Global Attention
All positions attend to special global tokens
Global tokens attend to all positions
Enables information aggregation
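A sketch of the global pattern, assuming the g global tokens sit at the first g positions (as with [CLS]-style tokens); rows are queries and columns are keys:

```python
import numpy as np

def global_mask(n: int, g: int) -> np.ndarray:
    """Boolean mask for global attention with g global tokens at positions 0..g-1."""
    mask = np.zeros((n, n), dtype=bool)
    mask[:g, :] = True   # global tokens attend to every position
    mask[:, :g] = True   # every position attends to the global tokens
    return mask
```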
3. Random Attention
Each position attends to r random positions
Default r = 3-5 random connections
Enables information spread across sequence
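A sketch of the random pattern, sampling r key positions for each query position; the actual implementation samples random blocks rather than individual tokens:

```python
import numpy as np

def random_mask(n: int, r: int, seed: int = 0) -> np.ndarray:
    """Boolean mask: each query position attends to r randomly chosen key positions."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, rng.choice(n, size=r, replace=False)] = True
    return mask
```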
Theoretical Foundation
BigBird proves that sparse attention can approximate full attention because:
- Graph theory: The attention graph can be made connected, with short paths between any two positions, using only O(n) edges
- Spectral properties: Sparse random connections act like an expander graph and approximate the spectral behavior of full attention
- Message passing: Information flows across the sequence via random connections
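To make the graph view concrete, the sketch below (reusing the hypothetical mask helpers from the sections above) takes the union of the three patterns; the result is the adjacency matrix of the attention graph, whose number of edges grows linearly in n:

```python
def bigbird_mask(n: int, w: int = 6, g: int = 2, r: int = 3, seed: int = 0) -> np.ndarray:
    """Union of the three sparse patterns: the adjacency matrix of the attention graph."""
    return local_window_mask(n, w) | global_mask(n, g) | random_mask(n, r, seed)

adj = bigbird_mask(n=1024)
print(adj.sum(), "sparse edges vs", 1024 * 1024, "for full attention")
```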
Complexity
Total edges per position:
- w local connections
- g global connections (and the g global tokens attend to every position)
- r random connections
Per position: O(w + g + r), a constant that does not grow with n
Full sequence: O(n · (w + g + r)) = O(n) instead of O(n²)
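A back-of-the-envelope check with the typical values from the configuration table below (the 4096-token sequence length is just an illustrative assumption):

```python
n, w, g, r = 4096, 512, 2, 3          # assumed sequence length and table defaults
per_position = w + g + r              # constant, independent of n
sparse_total = n * per_position       # grows linearly: O(n)
full_total = n * n                    # full attention: O(n^2)
print(per_position, sparse_total, full_total, round(full_total / sparse_total, 1))
```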
Configuration
| Parameter | Typical Value | Description |
|---|---|---|
| num_global_tokens | 2 | Special global tokens |
| window_size | 512 | Local attention window |
| num_random_blocks | 3 | Random connections per token |
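As a sketch, the table rows could be collected into a plain config object; the field names follow this document, not any particular library's API:

```python
from dataclasses import dataclass

@dataclass
class SparseAttentionConfig:
    num_global_tokens: int = 2   # special global tokens
    window_size: int = 512       # local attention window
    num_random_blocks: int = 3   # random connections per token
```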
Key Insight
Random connections are crucial: the local window alone leaves distant positions many hops apart, and routing every long-range interaction through a handful of global tokens creates a bottleneck. Random edges bridge distant parts of the sequence, so any two positions are linked by short paths and information can mix across the entire sequence within a few layers.
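A small illustration of this point (a sketch, not the paper's proof), reusing the hypothetical mask helpers above: with the local window alone, the number of hops needed to reach a distant position grows linearly with the sequence length, while adding a few random edges per position keeps it small:

```python
from collections import deque
import numpy as np

def max_hops(adj: np.ndarray, start: int = 0) -> int:
    """Longest shortest-path distance from `start`, via BFS over the boolean adjacency."""
    dist = [-1] * adj.shape[0]
    dist[start] = 0
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(adj[u]):
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist)

n = 512
window_only = local_window_mask(n, w=6)
window_plus_random = window_only | random_mask(n, r=3, seed=0)
print("window only:          ", max_hops(window_only))         # roughly n / (w / 2) hops
print("window + random edges:", max_hops(window_plus_random))  # only a handful of hops
```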