Introduction
Global attention allows designated tokens (global tokens) to attend to, and be attended to by, every position in a sequence. This contrasts with local attention, which only considers neighboring positions. Global attention is essential for capturing long-range dependencies and is typically combined with local attention to keep the overall cost sub-quadratic.
Core Concept
Designate certain tokens as "global" so that they can attend to the entire sequence, as sketched below:
A global token at position g:
- Can attend to: all positions {0, 1, ..., n-1}
- Can be attended to by: all positions {0, 1, ..., n-1}
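To make this pattern concrete, here is a minimal sketch (the helper name `global_local_mask` and the boolean convention "True means query i may attend to key j" are assumptions, not from any specific library) that combines a sliding local window with a set of global positions:

```python
import numpy as np

def global_local_mask(n, window, global_positions):
    """Build an n x n boolean mask where mask[i, j] = True means query i may attend to key j."""
    mask = np.zeros((n, n), dtype=bool)

    # Local attention: each position sees `window` neighbors on either side (plus itself).
    for i in range(n):
        mask[i, max(0, i - window):min(n, i + window + 1)] = True

    # Global attention: global tokens attend everywhere, and every position attends to them.
    for g in global_positions:
        mask[g, :] = True   # global token attends to all positions
        mask[:, g] = True   # all positions attend to the global token

    return mask

# Example: 16 tokens, window of 2, position 0 acts as a [CLS]-style global token.
mask = global_local_mask(16, window=2, global_positions=[0])
print(mask.shape, int(mask.sum()))
```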
Implementation Examples
1. Longformer
Each token attends to:
- A sliding window of local neighbors (512 tokens by default)
- All global tokens

Global tokens: the [CLS] token plus task-specific tokens (e.g., question tokens in QA); see the usage sketch below.
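As an illustration, a minimal sketch using the Hugging Face transformers Longformer, where `global_attention_mask` marks which tokens receive global attention (the checkpoint name and input text here are just examples):

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A long document about global attention ...", return_tensors="pt")

# 1 = global attention, 0 = local (sliding-window) attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the first ([CLS]-style) token global attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```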
2. BigBird
- Global tokens attend to the entire sequence
- All tokens attend to the global tokens
- Local attention within sliding windows
- Random attention for additional connectivity (a simplified mask sketch follows)
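A simplified sketch of such a mask, assuming a few random keys per query (the helper name `bigbird_style_mask` is an assumption, and real BigBird uses blocked sparsity for efficiency, so this is only illustrative):

```python
import numpy as np

def bigbird_style_mask(n, window, global_positions, num_random, seed=0):
    """Boolean attention mask combining local windows, global tokens, and random links."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # Local sliding window around position i.
        mask[i, max(0, i - window):min(n, i + window + 1)] = True
        # A few random keys per query for extra long-range connectivity.
        mask[i, rng.choice(n, size=num_random, replace=False)] = True
    for g in global_positions:
        mask[g, :] = True  # global token attends to the entire sequence
        mask[:, g] = True  # all tokens attend to the global token
    return mask

print(bigbird_style_mask(32, window=3, global_positions=[0], num_random=2).mean())
```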
3. BERT-Style [CLS] Token
- The [CLS] token acts as a global token
- It attends to all input tokens
- Its output representation is used for classification
Information Flow with Global Tokens
Global tokens serve as "information hubs" that collect and distribute information:
Without a global token:
Position 0 ↔ Position 100 (requires roughly 100/w hops through local windows of size w)
With a global token:
Position 0 → Global → Position 100 (2 hops)
Complexity Analysis
| Scenario | Attention complexity |
|---|---|
| Full attention | O(n²) |
| Global only (g global tokens) | O(g·n) in each direction |
| Local only (window w) | O(n·w) |
| Global + local | O(g·n + n·w) |

Here n is the sequence length, w the local window size, and g the number of global tokens.
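A quick back-of-the-envelope comparison with illustrative numbers (n = 4096, w = 512, g = 8; these values are not from any particular model):

```python
# Illustrative numbers: n = 4096 tokens, local window w = 512, g = 8 global tokens.
n, w, g = 4096, 512, 8

full = n * n               # full self-attention: every query attends to every key
local = n * w              # sliding-window attention
global_part = 2 * g * n    # global tokens: g*n attending out + n*g being attended to
combined = local + global_part

print(f"full:     {full:>12,}")        # 16,777,216 score computations
print(f"local:    {local:>12,}")       #  2,097,152
print(f"global:   {global_part:>12,}") #     65,536
print(f"combined: {combined:>12,}")    #  2,162,688  (~8x fewer than full attention)
```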
Types of Global Tokens
1. Learnable Tokens
Add special tokens like [CLS], [SEP] that are trained to aggregate information.
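For instance, a sketch of registering a new learnable token with the Hugging Face tokenizer/model API (the token name [GLB] is purely illustrative, not a standard token):

```python
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Register a new special token and grow the embedding matrix so its vector can be learned.
tokenizer.add_special_tokens({"additional_special_tokens": ["[GLB]"]})
model.resize_token_embeddings(len(tokenizer))
```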
2. Existing Tokens
Designate certain input tokens (e.g., question tokens in QA) as global.
3. Additional Memory Tokens
- Add m memory tokens to the sequence
- Input: [memory tokens] + [actual tokens]
- Memory tokens attend globally and act as a learned summary of the sequence (see the sketch below)
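A minimal PyTorch sketch of this idea (class and parameter names such as `MemoryTokenEncoder` and `num_memory` are hypothetical; for simplicity it uses a standard full-attention encoder, whereas a sparse model would additionally mark the memory positions as global):

```python
import torch
import torch.nn as nn

class MemoryTokenEncoder(nn.Module):
    """Sketch: prepend m learnable memory tokens that act as global summary slots."""

    def __init__(self, d_model=256, num_memory=4, nhead=4, num_layers=2):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_memory, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        m = self.memory.size(0)
        mem = self.memory.expand(x.size(0), -1, -1)
        h = self.encoder(torch.cat([mem, x], dim=1))  # [memory tokens] + [actual tokens]
        # The memory positions now hold a global summary of the sequence.
        return h[:, :m], h[:, m:]

summary, tokens = MemoryTokenEncoder()(torch.randn(2, 128, 256))
print(summary.shape, tokens.shape)  # (2, 4, 256) and (2, 128, 256)
```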
Advantages
- Long-range dependency: Direct paths between any position and global tokens
- Efficiency: O(g·n + n·w) instead of O(n²) when the number of global tokens is small
- Information aggregation: Global tokens naturally aggregate sequence information
Disadvantages
- Bottleneck: Global tokens can become an information bottleneck
- Fixed capacity: A small, fixed number of global tokens may limit expressivity