63. Hierarchical Attention

Introduction

Hierarchical attention is an approach where attention is computed at multiple levels of abstraction, typically from low-level tokens up to higher-level summaries. This enables processing of very long sequences by first aggregating information locally and then attending globally over the resulting summaries.

Multi-Level Architecture

Level 1: Token-level attention (standard self-attention within patches)

Level 2: Patch-level attention (attend to patch summaries)

Level 3: Document-level attention (attend to section summaries)

Example: Document Classification

Tokens → Word-level attention → Phrase representations
    ↓
Phrases → Phrase-level attention → Sentence representations
    ↓
Sentences → Sentence-level attention → Document representation
    ↓
Classification
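
Below is a minimal PyTorch sketch of this flow. For brevity it collapses the phrase level and pools words directly into sentence vectors; the module names (AttnPool, HierarchicalClassifier) and all shapes are illustrative assumptions, not a reference implementation.

import torch
import torch.nn as nn

class AttnPool(nn.Module):
    """Pool a sequence of vectors into one vector via learned attention weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                              # x: (batch, seq, dim)
        w = torch.softmax(self.score(x), dim=1)        # attention weight per position
        return (w * x).sum(dim=1)                      # (batch, dim)

class HierarchicalClassifier(nn.Module):
    def __init__(self, vocab, dim, n_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.word_pool = AttnPool(dim)                 # words     -> sentence vectors
        self.sent_pool = AttnPool(dim)                 # sentences -> document vector
        self.head = nn.Linear(dim, n_classes)

    def forward(self, tokens):                         # tokens: (batch, n_sents, n_words)
        b, s, w = tokens.shape
        x = self.embed(tokens)                         # (b, s, w, dim)
        sents = self.word_pool(x.view(b * s, w, -1))   # one vector per sentence
        doc = self.sent_pool(sents.view(b, s, -1))     # one vector per document
        return self.head(doc)                          # (b, n_classes)

logits = HierarchicalClassifier(vocab=1000, dim=64, n_classes=5)(
    torch.randint(0, 1000, (2, 4, 12)))                # 2 docs, 4 sentences, 12 words each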

How It Works

1. Local Aggregation

First, aggregate information within local groups (patches, sentences):

For each local window:
    Compute self-attention among the tokens in the window
    Aggregate the window into a single summary representation (e.g., by pooling)
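
A minimal sketch of this step, assuming fixed-size non-overlapping windows, scaled dot-product attention inside each window, and mean-pooling as the aggregation; the function name local_aggregate and the shapes are illustrative.

import math
import torch

def local_aggregate(x, window):
    """x: (seq, dim). Returns one summary vector per window: (seq // window, dim)."""
    seq, dim = x.shape
    blocks = x[: seq - seq % window].view(-1, window, dim)      # (n_windows, window, dim)
    scores = blocks @ blocks.transpose(1, 2) / math.sqrt(dim)   # self-attention within each window
    mixed = torch.softmax(scores, dim=-1) @ blocks              # (n_windows, window, dim)
    return mixed.mean(dim=1)                                    # mean-pool each window to one vector

summaries = local_aggregate(torch.randn(128, 32), window=16)    # -> (8, 32)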

2. Higher-Level Attention

Then attend across these aggregated representations:

Higher-level attention:
    Query: higher-level unit (e.g., a sentence)
    Key/Value: aggregated lower-level representations (e.g., phrases)
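
Continuing the sketch above, higher-level queries (e.g., sentence vectors) attend over the aggregated window summaries rather than over every token; again the names and shapes are illustrative.

import math
import torch

def higher_level_attention(queries, summaries):
    """queries: (n_q, dim) higher-level units; summaries: (n_kv, dim) aggregated lower level."""
    dim = queries.shape[-1]
    scores = queries @ summaries.T / math.sqrt(dim)   # (n_q, n_kv), far smaller than token-level
    attn = torch.softmax(scores, dim=-1)
    return attn @ summaries                           # (n_q, dim): global context per query

context = higher_level_attention(torch.randn(4, 32), torch.randn(8, 32))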

Benefits

  • Sub-quadratic cost: full attention is only computed within small local groups and then over a much shorter sequence of summaries, so overall complexity drops from O(n²) toward O(n log n) or O(n).
  • Long-sequence processing: documents and sequences that are infeasible under full attention become tractable.
  • Structural fit: many inputs are naturally hierarchical (tokens → patches, words → sentences → documents), and the model can exploit that structure.

Example: Transformer-XL

Transformer-XL applies a related, two-level idea through segment-level recurrence:

Process segment 1 → cache its hidden states
Process segment 2 → attend over its own tokens plus the cached states from segment 1
Repeat across segments → the effective context grows far beyond a single segment length
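
A rough sketch of the recurrence idea, under simplifying assumptions: a single attention operation, no relative positional encodings, and no causal masking, all of which the real Transformer-XL handles; names are illustrative.

import math
import torch

def segment_attention(x, cache):
    """x: (seg_len, dim) current segment; cache: (mem_len, dim) states of the previous segment, or None."""
    kv = x if cache is None else torch.cat([cache, x], dim=0)   # extend context with the cached memory
    scores = x @ kv.T / math.sqrt(x.shape[-1])
    out = torch.softmax(scores, dim=-1) @ kv
    return out, x.detach()        # this segment's states become the next segment's cache (no gradient)

cache = None
for segment in torch.randn(3, 64, 32):                          # three consecutive segments
    out, cache = segment_attention(segment, cache)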

Test Your Understanding

Question 1: Hierarchical attention computes attention at:

  • A) Single level only
  • B) Multiple levels of abstraction
  • C) Token level only
  • D) No abstraction

Question 2: First level typically aggregates:

  • A) Everything at once
  • B) Local information (within patches/sentences)
  • C) No aggregation
  • D) Documents

Question 3: Hierarchical attention helps with:

  • A) Short sequences only
  • B) Long sequence processing
  • C) No benefit
  • D) Single token

Question 4: After local aggregation, higher-level attention attends to:

  • A) All tokens directly
  • B) Aggregated representations
  • C) Nothing
  • D) Random

Question 5: Complexity of hierarchical attention is typically:

  • A) O(n²)
  • B) O(n log n) or O(n)
  • C) O(n³)
  • D) O(1)

Question 6: Transformer-XL uses hierarchical attention via:

  • A) Full attention
  • B) Segment-level recurrence
  • C) No caching
  • D) Single segment

Question 7: For document classification, levels might be:

  • A) Token → Sentence → Document
  • B) Document only
  • C) Word only
  • D) Random

Question 8: Hierarchical attention captures:

  • A) Only flat structure
  • B) Hierarchical patterns in data
  • C) No structure
  • D) Random structure