30. Global Attention

Introduction

Global attention allows certain tokens (global tokens) to attend to, and be attended to by, all positions in a sequence, in contrast to local attention, which considers only neighboring positions. Global attention is essential for capturing long-range dependencies and is typically combined with local attention for efficiency.

Core Concept

Certain tokens are designated as "global" and can attend to the entire sequence:

Global Token at position g:
Can attend to: ALL positions {0, 1, ..., n-1}
Can be attended to by: ALL positions {0, 1, ..., n-1}
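
A minimal NumPy sketch of this pattern (the function name and the boolean-mask convention are illustrative assumptions, not taken from any particular library):

```python
import numpy as np

def global_local_mask(n, window, global_idx):
    """Boolean mask where mask[i, j] = True means position i may attend to j."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # Local sliding window: `window` neighbors on each side, plus self.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True
    for g in global_idx:
        mask[g, :] = True   # the global token attends to all positions
        mask[:, g] = True   # all positions attend to the global token
    return mask

# Position 0 as the single global token in a length-16 sequence.
print(global_local_mask(n=16, window=2, global_idx=[0]).astype(int))
```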

Implementation Examples

1. Longformer

Each token attends to:
- a sliding window of 512 neighboring tokens (256 to each side)
- all global tokens

Global tokens: the [CLS] token plus task-specific tokens (e.g., question tokens in QA)
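
In practice, Hugging Face's transformers library exposes this through a `global_attention_mask` argument; a usage sketch (the checkpoint is the published allenai one, and the input text is a placeholder):

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document ...", return_tensors="pt")

# 0 = local sliding-window attention, 1 = global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # make the leading <s> (CLS-like) token global

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```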

2. BigBird

- Global tokens attend to the entire sequence
- All tokens attend to the global tokens
- Local attention within sliding windows
- Random attention links for extra connectivity
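
Extending the earlier mask sketch with random links gives the flavor of BigBird's pattern (real BigBird operates on blocks of tokens, so this per-token version is a simplification):

```python
import numpy as np

def bigbird_style_mask(n, window, global_idx, num_random, seed=0):
    """Simplified per-token sketch of BigBird's local + global + random pattern."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                                          # local window
        mask[i, rng.choice(n, size=num_random, replace=False)] = True  # random links
    for g in global_idx:
        mask[g, :] = True   # global token attends everywhere
        mask[:, g] = True   # everyone attends to the global token
    return mask

print(bigbird_style_mask(n=16, window=1, global_idx=[0], num_random=2).astype(int))
```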

3. BERT-Style [CLS] Token

- The [CLS] token acts as a global token
- It attends to all input tokens (in BERT, attention is already full, so every token does; the pattern matters once attention is made sparse)
- Its final hidden state serves as the classification representation
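
For reference, extracting the [CLS] representation from a Hugging Face BERT looks like this (standard transformers calls; taking position 0 is the common convention, though pooling choices vary):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("An example sentence to classify.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# [CLS] sits at position 0; its hidden state is the usual classification feature.
cls_repr = outputs.last_hidden_state[:, 0]
print(cls_repr.shape)  # torch.Size([1, 768])
```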

Information Flow with Global Tokens

Global tokens serve as "information hubs" that collect and distribute information:

Without global token: Position 0 ← ... ← Position 100 (many hops through local attention)
With global token:    Position 0 → Global → Position 100 (2 hops)
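
The hop counts are easy to check: with a local window of w neighbors per side, covering a distance d takes roughly ceil(d / w) hops, versus a constant 2 via the global hub (the values below are assumed for illustration):

```python
import math

d, w = 100, 8                    # distance between positions; window per side (assumed)
hops_local = math.ceil(d / w)    # chaining through overlapping local windows
hops_global = 2                  # source -> global token -> target
print(hops_local, "vs", hops_global)  # 13 vs 2
```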

Complexity Analysis

Scenario                  | Complexity
Full attention            | O(n²)
All global (g tokens)     | O(g·n) in each direction
Local only (window w)     | O(n·w)
Global + local            | O(g·n + n·w)
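
Plugging in concrete (assumed) numbers makes the savings tangible; the counts below are attention score pairs, ignoring constant factors:

```python
n, w, g = 4096, 256, 4   # sequence length, window per side, global tokens (assumed)

full_pairs   = n * n             # O(n^2): 16,777,216
local_pairs  = n * (2 * w + 1)   # O(n*w): 2,101,248
global_pairs = 2 * g * n         # O(g*n): 32,768 (global rows + columns)

# Roughly 8x fewer pairs than full attention at this length.
print(full_pairs, local_pairs + global_pairs)
```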

Types of Global Tokens

1. Learnable Tokens

Add special tokens like [CLS], [SEP] that are trained to aggregate information.

2. Existing Tokens

Designate certain input tokens (e.g., question tokens in QA) as global.

3. Additional Memory Tokens

- Add m memory tokens to the sequence
- Input: [memory tokens] + [actual tokens]
- Memory tokens attend globally and capture a summary of the sequence
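
A sketch of the memory-token setup (all sizes and the parameter layout are assumptions for illustration):

```python
import torch

batch, seq_len, d_model, m = 2, 128, 64, 8            # assumed sizes
memory = torch.nn.Parameter(torch.randn(m, d_model))  # m learned memory tokens
x = torch.randn(batch, seq_len, d_model)              # actual token embeddings

# Prepend memory tokens; positions 0..m-1 would then be marked global
# in the attention mask (as in the mask sketch under "Core Concept").
x_aug = torch.cat([memory.expand(batch, -1, -1), x], dim=1)
print(x_aug.shape)  # torch.Size([2, 136, 64])
```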

Advantages

- Any two positions can exchange information within 2 hops via a global token
- Overall cost stays linear in n, O(g·n + n·w), instead of quadratic
- Task-critical tokens (e.g., [CLS], question tokens) see the entire input

Disadvantages

- All long-range information must flow through a few global tokens, which can become a bottleneck
- Choosing which tokens to make global is task-specific
- Mixed sparse/dense attention patterns complicate efficient implementation

Test Your Understanding

Question 1: Global tokens can attend to:

  • A) Only neighbors
  • B) All positions in the sequence
  • C) No other tokens
  • D) Only previous positions

Question 2: In Longformer, a global token attends to how many positions?

  • A) 0
  • B) 512
  • C) All positions
  • D) 128

Question 3: What is the [CLS] token in BERT?

  • A) Padding token
  • B) Global/CLS token for classification
  • C) Separator token
  • D) Unknown token

Question 4: Global tokens act as:

  • A) Local connectors
  • B) Information hubs for long-range dependency
  • C) Memory only
  • D) Padding only

Question 5: What is the complexity of global attention with g global tokens?

  • A) O(n²)
  • B) O(g·n)
  • C) O(g²)
  • D) O(n/g)

Question 6: How many hops to go from position 0 to position 1000 with global token?

  • A) 1000
  • B) 1
  • C) 2 (via global token)
  • D) 500

Question 7: Global attention alone has complexity:

  • A) O(n·w)
  • B) O(n²)
  • C) O(g·n)
  • D) O(n)

Question 8: Global tokens can become a bottleneck because:

  • A) They attend to too many positions
  • B) All information must flow through limited global tokens
  • C) They are too few
  • D) They use too much memory