Introduction
Hierarchical attention is an approach where attention is computed at multiple levels of abstraction, typically from low-level tokens to higher-level summaries. This enables processing of very long sequences by first aggregating information locally, then globally.
Multi-Level Architecture
Level 1: Token-level attention (standard self-attention within patches)
Level 2: Patch-level attention (attend to patch summaries)
Level 3: Document-level attention (attend to section summaries)
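To make the three levels concrete, here is a sketch of the tensor shapes flowing through such a stack (all names and dimensions are illustrative, not taken from any specific implementation):

# Illustrative shapes for a 3-level hierarchy (hypothetical names).
# tokens:   (batch, num_patches, patch_len, d_model)   -- Level 1 input
#   Level 1: self-attention within each patch, pool over patch_len
# patches:  (batch, num_patches, d_model)              -- Level 2 input
#   Level 2: self-attention across patch summaries, pool per section
# sections: (batch, num_sections, d_model)             -- Level 3 input
#   Level 3: attention across section summaries
# document: (batch, d_model)                           -- final summary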
Example: Document Classification
Tokens → Word-level attention → Phrase representations
↓
Phrases → Phrase-level attention → Sentence representations
↓
Sentences → Sentence-level attention → Document representation
↓
Classification
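The diagram above maps directly onto a small model. The following is a minimal PyTorch sketch of that pipeline, in the spirit of hierarchical attention networks for document classification (module names, pooling choices, and hyperparameters are illustrative; the phrase level is collapsed into the word level for brevity):

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    # Learned-query attention that pools a sequence to a single vector.
    def __init__(self, d_model):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model))
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                # (batch, seq, d)
        scores = torch.tanh(self.proj(x)) @ self.query   # (batch, seq)
        weights = scores.softmax(dim=-1).unsqueeze(-1)   # (batch, seq, 1)
        return (weights * x).sum(dim=1)                  # (batch, d)

class HierarchicalClassifier(nn.Module):
    def __init__(self, vocab_size, d_model, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.word_attn = AttentionPool(d_model)   # words -> sentence vector
        self.sent_attn = AttentionPool(d_model)   # sentences -> doc vector
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, docs):                  # docs: (batch, sents, words)
        b, s, w = docs.shape
        x = self.embed(docs.view(b * s, w))           # (b*s, words, d)
        sents = self.word_attn(x).view(b, s, -1)      # (batch, sents, d)
        doc = self.sent_attn(sents)                   # (batch, d)
        return self.head(doc)                         # (batch, classes)

model = HierarchicalClassifier(vocab_size=30000, d_model=128, num_classes=5)
logits = model(torch.randint(0, 30000, (2, 6, 20)))   # 2 docs, 6 sents, 20 words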
How It Works
1. Local Aggregation
First, aggregate information within local groups (patches, sentences):
For each local window:
Compute self-attention
Aggregate to single representation
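As a concrete instance of this step, here is a hedged PyTorch sketch that runs self-attention inside fixed-size windows and mean-pools each window to one summary vector (the window size, mean pooling, and function names are illustrative choices):

import torch
import torch.nn as nn

def aggregate_windows(x, window, attn):
    # x: (batch, seq, d_model); assumes seq is divisible by window.
    b, n, d = x.shape
    wins = x.view(b * (n // window), window, d)      # one row per window
    out, _ = attn(wins, wins, wins)                  # self-attention per window
    return out.mean(dim=1).view(b, n // window, d)   # one vector per window

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 128, 64)
summaries = aggregate_windows(x, 16, attn)           # -> (2, 8, 64)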
2. Higher-Level Attention
Then attend across these aggregated representations:
Higher-level attention:
Query: higher-level unit (sentence)
Key/Value: aggregated lower-level (phrases)
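In code this is ordinary cross-attention with the roles of the inputs split: queries come from the higher level, keys and values from the step-1 summaries. A minimal PyTorch sketch (shapes and names are illustrative):

import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
sentences = torch.randn(2, 10, 64)   # queries: 10 sentence-level units
phrases = torch.randn(2, 40, 64)     # keys/values: 40 phrase summaries
out, weights = cross_attn(query=sentences, key=phrases, value=phrases)
# out: (2, 10, 64) -- each sentence enriched with phrase-level context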
Benefits
- Handles long sequences: attention cost drops from O(n²) toward O(n log n) or O(n) in sequence length (see the back-of-the-envelope numbers after this list)
- Captures structure: Hierarchical patterns in data
- Computational efficiency: each query attends to a local window plus a handful of summaries rather than to every token directly
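A back-of-the-envelope comparison makes the first benefit concrete (the numbers are illustrative): for n = 16,384 tokens, full self-attention scores n² ≈ 2.7 × 10⁸ pairs, while windowed attention with window w = 128 plus attention over the n/w = 128 summaries scores about n·w + (n/w)² ≈ 2.1 × 10⁶ pairs, roughly a 127× reduction:

n, w = 16_384, 128              # sequence length, window size (illustrative)
full = n * n                    # pairwise scores, full self-attention
hier = n * w + (n // w) ** 2    # windowed scores + summary-level scores
print(full, hier, full / hier)  # 268435456, 2113536, ~127x fewer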
Example: Transformer-XL
Transformer-XL applies a related idea, segment-level recurrence: hidden states from earlier segments are cached and reused as extra context, acting like a coarse summary of the past:
Process segment 1 → cache hidden states
Process segment 2 with cached context from segment 1
Enables very long effective context
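The caching pattern can be sketched generically as below (this is not Transformer-XL's actual code: it omits the relative positional encodings and uses a plain MultiheadAttention; the stop-gradient on the cache mirrors the original recipe):

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
segments = torch.randn(4, 2, 32, 64)   # 4 segments, batch 2, length 32
cache = None
for seg in segments:
    # Keys/values span the cached previous segment plus the current one,
    # so each segment can attend into context from the segment before it.
    memory = seg if cache is None else torch.cat([cache, seg], dim=1)
    out, _ = attn(query=seg, key=memory, value=memory)
    cache = seg.detach()               # cache without backpropagating into it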