33. Longformer Attention

Introduction

Longformer, introduced by Beltagy et al. (2020) in "Longformer: The Long-Document Transformer", is a Transformer variant designed for long sequences of up to 16K tokens (4,096 in the base model; up to 16K with the Longformer-Encoder-Decoder, LED). It combines sliding-window attention, dilated sliding windows, and task-specific global attention to achieve O(n) complexity while still capturing both local and long-range dependencies.

Attention Pattern

Longformer uses a combination of three attention patterns:

1. Sliding Window Attention

Each token attends to a fixed window of w tokens around it (w/2 on each side)
Default w = 512 (256 tokens on each side)

Complexity: O(n·w)
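
As a sketch of the idea (not the optimized banded-matrix kernel Longformer actually uses), the pattern can be expressed as a boolean mask; the function name and toy sizes below are illustrative assumptions:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True means attention is allowed."""
    pos = torch.arange(seq_len)
    # Token i may attend to token j when |i - j| <= window // 2
    return (pos[:, None] - pos[None, :]).abs() <= window // 2

mask = sliding_window_mask(seq_len=8, window=4)  # 2 neighbors on each side
print(mask.int())
```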

2. Global Attention

Special tokens (e.g., [CLS], question tokens) attend to all positions
All positions can attend to these global tokens

Used for: classification, question answering
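
A rough illustration of how global attention widens the local mask: the chosen global positions (e.g., the [CLS] index) become fully connected in both directions. The helper names here are assumptions for this sketch, not Longformer's API:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    return (pos[:, None] - pos[None, :]).abs() <= window // 2

def add_global_attention(local_mask: torch.Tensor, global_idx: list) -> torch.Tensor:
    mask = local_mask.clone()
    mask[global_idx, :] = True   # global tokens attend to all positions
    mask[:, global_idx] = True   # all positions attend to the global tokens
    return mask

# [CLS] at position 0 gets global attention on top of a width-4 local window.
mask = add_global_attention(sliding_window_mask(seq_len=8, window=4), global_idx=[0])
print(mask.int())
```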

3. Dilated Attention

Analogous to dilated convolution, the sliding window skips positions:
With dilation d, attended positions within the window are spaced d apart

Example: dilation = 2 → a token attends to relative offsets 0, 2, 4, 6, ...
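
A sketch of the dilated variant under the same toy setup: the mask keeps only positions that fall on the dilated grid inside a correspondingly enlarged window (names and sizes are illustrative):

```python
import torch

def dilated_window_mask(seq_len: int, window: int, dilation: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    offset = pos[:, None] - pos[None, :]
    in_window = offset.abs() <= (window // 2) * dilation  # window stretched by the dilation
    on_grid = offset % dilation == 0                      # keep every d-th position only
    return in_window & on_grid

# dilation=2: each token attends to relative offsets 0, ±2, ±4 within its window.
print(dilated_window_mask(seq_len=10, window=4, dilation=2).int())
```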

Complexity Comparison

Model                | Complexity    | Operations for 16K tokens
Standard Transformer | O(n²)         | 256M
Longformer           | O(n·w + n·g)  | ~16M
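
A quick back-of-the-envelope check of the figures above; the constants (n = 16,000, and roughly 1,000 attended positions per token once the window and global links are counted) are illustrative assumptions rather than exact kernel counts:

```python
n = 16_000                  # sequence length
per_token = 1_000           # ~512-token window plus global-attention links (assumed)

standard = n * n            # full self-attention scores every pair of tokens
longformer = n * per_token  # scores grow linearly with sequence length

print(f"standard:   {standard / 1e6:.0f}M operations")    # 256M
print(f"longformer: {longformer / 1e6:.0f}M operations")   # 16M
print(f"reduction:  {standard / longformer:.0f}x")         # 16x
```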

Configuration Options

attention_window: size of sliding window (default 512)
attention_dilation: per-layer dilation factors (default [1, 1, 1, 1])
num_global_tokens: number of global attention tokens
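
These options could be gathered into a small config object; the dataclass below is a hypothetical sketch for this lesson, not the library's actual class (in Hugging Face transformers, for instance, the window size is set via LongformerConfig(attention_window=...)):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LongformerAttentionConfig:
    attention_window: int = 512                # sliding-window size
    attention_dilation: List[int] = field(     # per-layer dilation factors
        default_factory=lambda: [1, 1, 1, 1]
    )
    num_global_tokens: int = 1                 # e.g., only [CLS] for classification

cfg = LongformerAttentionConfig(attention_window=512, num_global_tokens=1)
print(cfg)
```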

Usage in Training

Longformer uses a special attention pattern that can be implemented efficiently.
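
One way to use it in practice is through the Hugging Face transformers implementation, where global attention is marked per token with a global_attention_mask; below is a minimal sketch using the public allenai/longformer-base-4096 checkpoint:

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document ...", return_tensors="pt")

# 1 = global attention, 0 = sliding-window attention only.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the leading special token global attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```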

Applications

Longformer targets long-document NLP tasks: document classification, long-document question answering (e.g., TriviaQA, WikiHop), coreference resolution, and, with the LED variant, long-document summarization.

Test Your Understanding

Question 1: Longformer is designed for:

  • A) Short sequences only
  • B) Long sequences (up to 16K tokens)
  • C) Image processing
  • D) Speech recognition only

Question 2: What is the sliding window size in Longformer?

  • A) 64
  • B) 128
  • C) 512 (256 each side)
  • D) 4096

Question 3: Global attention in Longformer is used for:

  • A) All tokens
  • B) Special tokens like [CLS]
  • C) No global attention
  • D) Padding tokens

Question 4: What does "dilated" attention mean?

  • A) Attend to all positions
  • B) Skip positions with gaps
  • C) Only attend to neighbors
  • D) No dilation

Question 5: Longformer's complexity is:

  • A) O(n²)
  • B) O(n)
  • C) O(n·w + n·g)
  • D) O(w²)

Question 6: For 16K tokens, compared to standard O(n²), Longformer reduces operations by:

  • A) 2×
  • B) 8×
  • C) 16×
  • D) 1000×

Question 7: Global tokens can attend to:

  • A) Only neighbors
  • B) The entire sequence
  • C) No positions
  • D) Random positions

Question 8: Longformer was introduced in the paper:

  • A) "Attention is All You Need"
  • B) "Longformer: The Long-Document Transformer"
  • C) "BERT"
  • D) "GPT-3"