Introduction
Hierarchical attention is an approach where attention is computed at multiple levels of abstraction, typically from low-level tokens to higher-level summaries. This enables processing of very long sequences by first aggregating information locally, then globally.
Multi-Level Architecture
Level 1: Token-level attention (standard self-attention within patches)
Level 2: Patch-level attention (attend to patch summaries)
Level 3: Document-level attention (attend to section summaries)
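To make the three levels concrete, here is a sketch of the tensor shapes flowing through such a stack (all names and dimensions are illustrative, not taken from any specific implementation):

# Illustrative shapes for a 3-level hierarchy (hypothetical names).
# tokens:   (batch, num_patches, patch_len, d_model)   -- Level 1 input
#   Level 1: self-attention within each patch, pool over patch_len
# patches:  (batch, num_patches, d_model)              -- Level 2 input
#   Level 2: self-attention across patch summaries, pool per section
# sections: (batch, num_sections, d_model)             -- Level 3 input
#   Level 3: attention across section summaries
# document: (batch, d_model)                           -- final summary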
Example: Document Classification
Tokens → Word-level attention → Phrase representations
↓
Phrases → Phrase-level attention → Sentence representations
↓
Sentences → Sentence-level attention → Document representation
↓
Classification
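The diagram above maps directly onto a small model. The following is a minimal PyTorch sketch of that pipeline, in the spirit of hierarchical attention networks for document classification (module names, pooling choices, and hyperparameters are illustrative; the phrase level is collapsed into the word level for brevity):

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    # Learned-query attention that pools a sequence to a single vector.
    def __init__(self, d_model):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model))
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                # (batch, seq, d)
        scores = torch.tanh(self.proj(x)) @ self.query   # (batch, seq)
        weights = scores.softmax(dim=-1).unsqueeze(-1)   # (batch, seq, 1)
        return (weights * x).sum(dim=1)                  # (batch, d)

class HierarchicalClassifier(nn.Module):
    def __init__(self, vocab_size, d_model, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.word_attn = AttentionPool(d_model)   # words -> sentence vector
        self.sent_attn = AttentionPool(d_model)   # sentences -> doc vector
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, docs):                  # docs: (batch, sents, words)
        b, s, w = docs.shape
        x = self.embed(docs.view(b * s, w))           # (b*s, words, d)
        sents = self.word_attn(x).view(b, s, -1)      # (batch, sents, d)
        doc = self.sent_attn(sents)                   # (batch, d)
        return self.head(doc)                         # (batch, classes)

model = HierarchicalClassifier(vocab_size=30000, d_model=128, num_classes=5)
logits = model(torch.randint(0, 30000, (2, 6, 20)))   # 2 docs, 6 sents, 20 words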
How It Works
1. Local Aggregation
First, aggregate information within local groups (patches, sentences):
For each local window:
Compute self-attention
Aggregate to single representation
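As a concrete instance of this step, here is a hedged PyTorch sketch that runs self-attention inside fixed-size windows and mean-pools each window to one summary vector (the window size, mean pooling, and function names are illustrative choices):

import torch
import torch.nn as nn

def aggregate_windows(x, window, attn):
    # x: (batch, seq, d_model); assumes seq is divisible by window.
    b, n, d = x.shape
    wins = x.view(b * (n // window), window, d)      # one row per window
    out, _ = attn(wins, wins, wins)                  # self-attention per window
    return out.mean(dim=1).view(b, n // window, d)   # one vector per window

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 128, 64)
summaries = aggregate_windows(x, 16, attn)           # -> (2, 8, 64)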
2. Higher-Level Attention
Then attend across these aggregated representations:
Higher-level attention:
Query: higher-level unit (sentence)
Key/Value: aggregated lower-level (phrases)
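In code this is ordinary cross-attention with the roles of the inputs split: queries come from the higher level, keys and values from the step-1 summaries. A minimal PyTorch sketch (shapes and names are illustrative):

import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
sentences = torch.randn(2, 10, 64)   # queries: 10 sentence-level units
phrases = torch.randn(2, 40, 64)     # keys/values: 40 phrase summaries
out, weights = cross_attn(query=sentences, key=phrases, value=phrases)
# out: (2, 10, 64) -- each sentence enriched with phrase-level context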
Benefits
- Handles long sequences: attention cost drops from O(n²) toward O(n log n) or O(n) in sequence length (see the back-of-the-envelope numbers after this list)
- Captures structure: Hierarchical patterns in data
- Computational efficiency: each query attends to a local window plus a handful of summaries rather than to every token directly
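A back-of-the-envelope comparison makes the first benefit concrete (the numbers are illustrative): for n = 16,384 tokens, full self-attention scores n² ≈ 2.7 × 10⁸ pairs, while windowed attention with window w = 128 plus attention over the n/w = 128 summaries scores about n·w + (n/w)² ≈ 2.1 × 10⁶ pairs, roughly a 127× reduction:

n, w = 16_384, 128              # sequence length, window size (illustrative)
full = n * n                    # pairwise scores, full self-attention
hier = n * w + (n // w) ** 2    # windowed scores + summary-level scores
print(full, hier, full / hier)  # 268435456, 2113536, ~127x fewer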
Example: Transformer-XL
Transformer-XL applies a related idea, segment-level recurrence: hidden states from earlier segments are cached and reused as extra context, acting like a coarse summary of the past:
Process segment 1 → cache hidden states
Process segment 2 with cached context from segment 1
Enables very long effective context
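The caching pattern can be sketched generically as below (this is not Transformer-XL's actual code: it omits the relative positional encodings and uses a plain MultiheadAttention; the stop-gradient on the cache mirrors the original recipe):

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
segments = torch.randn(4, 2, 32, 64)   # 4 segments, batch 2, length 32
cache = None
for seg in segments:
    # Keys/values span the cached previous segment plus the current one,
    # so each segment can attend into context from the segment before it.
    memory = seg if cache is None else torch.cat([cache, seg], dim=1)
    out, _ = attn(query=seg, key=memory, value=memory)
    cache = seg.detach()               # cache without backpropagating into it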