34. BigBird Attention

Introduction

BigBird is a sparse attention mechanism introduced by Zaheer et al. (2020) that can handle sequences up to 8× longer than standard full-attention Transformers (4096 tokens versus BERT's 512). It combines three attention patterns (local sliding window, global, and random) and comes with a theoretical guarantee that this sparse scheme can approximate full attention while reducing complexity from O(n²) to O(n).
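
As a quick orientation, here is a minimal usage sketch with the Hugging Face transformers implementation; the checkpoint name and the attention_type flag come from that library rather than the paper:

```python
# Minimal sketch using the Hugging Face implementation of BigBird.
from transformers import BigBirdTokenizer, BigBirdModel

tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdModel.from_pretrained(
    "google/bigbird-roberta-base",
    attention_type="block_sparse",  # the sparse local + global + random pattern
)

inputs = tokenizer("A very long document ...", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```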

Three Attention Patterns

1. Local Attention (Window)

Each position attends to its w nearest neighbors (a sliding window centered on the position)
Typical w = 512

Captures local structure

2. Global Attention

All positions attend to a small set of special global tokens (such as [CLS])
Global tokens attend to all positions

Enables information aggregation

3. Random Attention

Each position attends to r randomly chosen positions
Typical r = 3-5 random connections

Enables information spread across sequence
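
To make the combined pattern concrete, below is a minimal sketch that builds a boolean BigBird-style attention mask from the three patterns. It works at the level of individual positions for readability; the reference implementation operates on blocks of tokens for efficiency, and the parameter names here are illustrative, not a library API.

```python
import numpy as np

def bigbird_mask(seq_len, window=3, num_global=2, num_random=3, seed=0):
    """Boolean attention mask combining local, global, and random patterns.

    mask[i, j] == True means position i may attend to position j.
    Position-level sketch; the reference implementation works block-wise.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1. Local: sliding window of `window` positions on each side.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # 2. Global: the first `num_global` tokens attend everywhere,
    #    and every token attends to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 3. Random: each position attends to `num_random` random positions.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

m = bigbird_mask(seq_len=64)
print(f"density: {m.mean():.2%}")  # far below 100% (full attention)
```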

Theoretical Foundation

BigBird proves that sparse attention can approximate full attention: the sparse pattern is a universal approximator of sequence-to-sequence functions and is even Turing complete. The intuition is graph-theoretic. Viewing attention as a graph over positions, the union of window, global, and random edges stays connected with short paths between any two positions (random edges make the graph behave like an expander), so information can propagate between any pair of tokens within a few layers.

Complexity

Edges per position:
- w local connections
- g global connections (and the g global tokens themselves attend to all n positions)
- r random connections

Per position: O(w + g + r), a constant independent of n

Full sequence: O(n · (w + g + r)) = O(n), instead of O(n²) for full attention
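
A quick back-of-the-envelope comparison with illustrative values (n = 4096, w = 192, g = 2, r = 3) shows the saving:

```python
# Back-of-the-envelope edge counts; the parameter values are illustrative.
n, w, g, r = 4096, 192, 2, 3

sparse_edges = n * (w + g + r) + g * n  # per-token edges + full global rows
full_edges = n * n

print(f"sparse: {sparse_edges:,} edges")  # 815,104
print(f"full:   {full_edges:,} edges")    # 16,777,216
print(f"ratio:  {full_edges / sparse_edges:.1f}x fewer")  # ~20.6x
```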

Configuration

Parameter           Typical Value   Description
num_global_tokens   2               Special global tokens
window_size         512             Local attention window
num_random_blocks   3               Random connections per token
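
For reference, a minimal sketch of the corresponding knobs in the Hugging Face BigBirdConfig, which is block-based (locality and randomness are specified in blocks of tokens, so the names differ from the table above):

```python
from transformers import BigBirdConfig, BigBirdModel

# The Hugging Face config is block-based: locality and randomness are
# specified in blocks of `block_size` tokens rather than per token.
config = BigBirdConfig(
    attention_type="block_sparse",  # vs. "original_full"
    block_size=64,                  # tokens per block
    num_random_blocks=3,            # random blocks each query block attends to
    max_position_embeddings=4096,
)
model = BigBirdModel(config)  # randomly initialized, for illustration
```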

Key Insight

Random connections are crucial because local windows alone leave distant positions many hops apart, effectively isolated clusters, and forcing all long-range interaction through a handful of global tokens creates a bottleneck. Random connections bridge these clusters directly, keeping the path between any two positions in the sequence short.
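
A small self-contained experiment illustrates this. Treating the attention pattern as a graph and running breadth-first search shows how random edges shorten the longest path between positions; the setup (position-level graph, undirected random edges, no global tokens) is a simplifying assumption made to isolate the effect:

```python
from collections import deque
import random

def max_bfs_distance(adj):
    """Longest shortest-path distance from position 0."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

n, window, r = 1024, 3, 3
random.seed(0)

# Window-only graph: each position linked to `window` neighbors per side.
adj = {i: {j for j in range(max(0, i - window), min(n, i + window + 1))}
       for i in range(n)}
print("window only:   ", max_bfs_distance(adj))  # ~n / window hops (341 here)

# Add r random edges per position (undirected, for this sketch).
for i in range(n):
    for j in random.sample(range(n), r):
        adj[i].add(j)
        adj[j].add(i)
print("window + random:", max_bfs_distance(adj))  # drops to a handful of hops
```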

Test Your Understanding

Question 1: BigBird combines how many attention patterns?

  • A) 1
  • B) 2
  • C) 3 (local + global + random)
  • D) 4

Question 2: What is the purpose of random attention in BigBird?

  • A) Make computation faster
  • B) Bridge isolated clusters, enable information spread
  • C) Reduce memory
  • D) Replace global attention

Question 3: BigBird can handle sequences up to:

  • A) 2× longer than standard
  • B) 4× longer than standard
  • C) 8× longer than standard
  • D) Same length only

Question 4: What happens without random connections?

  • A) Faster training
  • B) Some positions may form isolated clusters
  • C) Better accuracy
  • D) Less memory

Question 5: The three attention patterns in BigBird are:

  • A) Local, stride, chunk
  • B) Local, global, random
  • C) Global, sparse, dense
  • D) Window, block, full

Question 6: BigBird complexity is:

  • A) O(n²)
  • B) O(n)
  • C) O(n·w)
  • D) O(1)

Question 7: Default number of random connections per token is:

  • A) 1
  • B) 3-5
  • C) 512
  • D) 100

Question 8: BigBird provides theoretical guarantee that:

  • A) Sparse attention exactly equals full attention
  • B) Sparse attention can approximate full attention
  • C) Random is not needed
  • D) O(n²) is required