Introduction
Longformer is a Transformer variant for long documents introduced by Beltagy et al. (2020). Its pretrained checkpoints handle sequences of up to 4,096 tokens, and the attention pattern itself scales linearly to much longer inputs (e.g., the 16K-token example in the table below). It combines sliding window attention with global attention and, optionally, dilated windows to achieve O(n) complexity while still capturing both local and long-range dependencies.
Attention Pattern
Longformer uses a combination of three attention patterns (a combined mask sketch follows the list):
1. Sliding Window Attention
- Each token attends to a window of w neighboring tokens (w/2 on each side)
- Default w = 512 (256 on each side)
- Complexity: O(n·w)
2. Global Attention
- Special tokens (e.g., [CLS], question tokens) attend to all positions
- All positions can attend to these global tokens
- Used for: classification, question answering
3. Dilated Attention
Analogous to dilated convolution, the window skips positions:
- Gap of d between attended positions
- Example: dilation d = 2 → attends to positions 0, 2, 4, 6, ...
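The three patterns compose into a single sparse attention mask. Below is a minimal NumPy sketch of that combined mask; the function name and arguments are illustrative, and a real implementation never materializes the full n × n matrix but computes the banded attention directly in O(n·w) memory.

```python
import numpy as np

def longformer_attention_mask(n, window, dilation=1, global_idx=()):
    """mask[i, j] = True means query token i may attend to key token j."""
    mask = np.zeros((n, n), dtype=bool)
    half = window // 2                       # e.g. window=512 -> 256 per side
    for i in range(n):
        # Dilated sliding window: offsets are multiples of `dilation`
        for offset in range(-half, half + 1):
            j = i + offset * dilation
            if 0 <= j < n:
                mask[i, j] = True
    # Global tokens attend everywhere and are attended to from everywhere
    for g in global_idx:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# Toy example: 16 tokens, window of 4 (2 per side), dilation 2, token 0 global
m = longformer_attention_mask(16, window=4, dilation=2, global_idx=[0])
print(m.astype(int))
```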
Complexity Comparison
| Model | Complexity | Operations for n = 16K tokens |
|---|---|---|
| Standard Transformer | O(n²) | ≈256M |
| Longformer | O(n·w + n·g) | ≈8M for the window term (w = 512), plus n·g for g global tokens |
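A quick back-of-the-envelope check of the table, treating 16K as 16,000 tokens and ignoring constant factors (the number of global tokens g below is an assumed, illustrative value):

```python
n = 16_000   # sequence length
w = 512      # sliding window size
g = 64       # number of global tokens (illustrative assumption)

standard = n * n            # every token attends to every token
longformer = n * w + n * g  # banded window plus global rows/columns

print(f"standard transformer: {standard:,} operations")   # 256,000,000
print(f"longformer:           {longformer:,} operations") # 9,216,000
```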
Configuration Options
- attention_window: size of the sliding window (default 512)
- attention_dilation: dilation factor per layer (default [1, 1, 1, 1])
- num_global_tokens: number of tokens given global attention
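As a concrete example, here is how the window size is set in the Hugging Face transformers implementation (a sketch, assuming that library; there, dilation exists only in the original allenai/longformer release, and global attention is supplied per input as a global_attention_mask rather than through a num_global_tokens option):

```python
from transformers import LongformerConfig, LongformerModel

config = LongformerConfig(
    attention_window=512,  # sliding window size; can also be a per-layer list
)
model = LongformerModel(config)  # randomly initialized, for illustration only
print(model.config.attention_window)
```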
Usage in Training
Longformer uses a special attention pattern that can be implemented efficiently:
- Full self-attention for short sequences: when the sequence is shorter than attention_window, the window already covers every token
- Segment-level (chunked) computation: for long sequences, attention is computed within each segment and across neighboring segments
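A simplified sketch of that dispatch logic (illustrative only, not the actual library code):

```python
def choose_attention_impl(seq_len: int, attention_window: int) -> str:
    """Pick an attention strategy for a given input length (illustrative)."""
    if seq_len <= attention_window:
        # the window already covers every token, so dense attention is just as cheap
        return "full self-attention"
    # long input: pad to a multiple of the window, then attend within each
    # segment and across neighboring segments
    padded_len = -(-seq_len // attention_window) * attention_window
    return f"chunked sliding-window attention (padded to {padded_len} tokens)"

print(choose_attention_impl(300, 512))     # full self-attention
print(choose_attention_impl(10_000, 512))  # chunked ... (padded to 10240 tokens)
```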
Applications
- Long document classification: Papers, legal documents
- Question answering: Long contexts with global tokens for question
- Summarization: Long articles
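For example, question answering with the Hugging Face transformers API might look like the sketch below; the checkpoint name and the exact way question tokens are marked as global are illustrative choices, not prescriptive.

```python
import torch
from transformers import LongformerTokenizer, LongformerForQuestionAnswering

name = "allenai/longformer-large-4096-finetuned-triviaqa"  # example checkpoint
tokenizer = LongformerTokenizer.from_pretrained(name)
model = LongformerForQuestionAnswering.from_pretrained(name)

question = "Who introduced Longformer?"
context = "Longformer was introduced by Beltagy, Peters, and Cohan in 2020. ..."
inputs = tokenizer(question, context, return_tensors="pt")

# Global attention on the question tokens (1 = global, 0 = sliding window only)
global_attention_mask = torch.zeros_like(inputs["input_ids"])
first_sep = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero()[0].item()
global_attention_mask[0, : first_sep + 1] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0, start : end + 1]))
```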