Introduction
Chunked attention computes attention over fixed-size pieces (chunks) of the sequence separately, then combines the results. Breaking the sequence into manageable chunks makes it possible to process contexts too long for full attention to fit in memory.
The Problem
Full attention materializes an n × n score matrix, so memory grows as O(n²) in the sequence length n. For very long sequences this becomes infeasible.
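To make this concrete: at n = 65,536 tokens, the score matrix alone holds 65,536² ≈ 4.3 × 10⁹ entries, about 16 GiB in fp32 for a single head at a single layer.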
Chunked Attention Solution
Given sequence length n and chunk size C:
- Number of chunks: n/C
- Compute attention within each chunk: a C × C score matrix per chunk
- Total score memory falls from O(n²) to O(n·C) across the n/C chunks (see the sketch below)
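A minimal sketch of that memory arithmetic, assuming fp32 scores, a single attention head, and a chunk size that divides n evenly; the concrete values of n and C are only illustrative:

```python
# Score-matrix memory, fp32, single head (illustrative values)
n, C = 65_536, 1_024                    # sequence length, chunk size (assumed)
num_chunks = n // C                     # n/C chunks; assumes C divides n

full_bytes    = n * n * 4               # full attention: O(n^2) entries
chunked_bytes = num_chunks * C * C * 4  # chunked: O(n*C) entries in total

print(f"chunks:  {num_chunks}")                     # 64
print(f"full:    {full_bytes / 2**30:.2f} GiB")     # 16.00 GiB
print(f"chunked: {chunked_bytes / 2**30:.2f} GiB")  # 0.25 GiB
```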
Implementation Approaches
1. Independent Chunk Attention
Process each chunk entirely on its own: queries never see keys or values outside their chunk.
Fast and simple, but cross-chunk relationships are lost.
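A runnable NumPy sketch of this approach; the function name, the softmax helper, and the assumption that chunk_size divides the sequence length are mine, not a fixed API:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def independent_chunk_attention(q, k, v, chunk_size):
    """Attention computed separately per chunk; no cross-chunk mixing.

    q, k, v: (seq_len, d) arrays. Assumes chunk_size divides seq_len.
    """
    n, d = q.shape
    out = np.empty_like(v)
    for start in range(0, n, chunk_size):
        s = slice(start, start + chunk_size)
        # (C, C) chunk-local scores: queries only meet their own chunk's keys
        scores = q[s] @ k[s].T / np.sqrt(d)
        out[s] = softmax(scores) @ v[s]
    return out

# Example usage with random inputs
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4096, 64)) for _ in range(3))
out = independent_chunk_attention(q, k, v, chunk_size=512)  # (4096, 64)
```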
2. Chunked with Cross-Chunk
Use multiple passes or a hierarchical combination (one possible realization is sketched below):
Pass 1: Within-chunk attention
Pass 2: Cross-chunk attention
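One possible realization of the two passes, sketched under my own assumptions: pass 2 lets every token attend to a mean-pooled key/value summary of each chunk, and the two pass outputs are simply averaged. It reuses softmax and independent_chunk_attention from the sketch above:

```python
def two_pass_chunk_attention(q, k, v, chunk_size):
    """Pass 1: within-chunk attention. Pass 2: cross-chunk attention
    against one mean-pooled summary per chunk (an assumed design)."""
    n, d = q.shape
    local = independent_chunk_attention(q, k, v, chunk_size)  # pass 1

    num_chunks = n // chunk_size
    # One (key, value) summary per chunk via mean pooling: (n/C, d) each
    k_sum = k.reshape(num_chunks, chunk_size, d).mean(axis=1)
    v_sum = v.reshape(num_chunks, chunk_size, d).mean(axis=1)

    # Every token attends to all chunk summaries: only (n, n/C) scores
    cross = softmax(q @ k_sum.T / np.sqrt(d)) @ v_sum         # pass 2

    # Averaging the two passes is an illustrative choice, not the only one
    return 0.5 * (local + cross)
```

Because the cross-chunk pass only builds an (n, n/C) score matrix, it stays far cheaper than full attention while still letting information flow between chunks.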
3. Sliding Window Chunk
Chunks overlap: queries in chunk i attend to chunks i and i+1, so information flows between neighboring chunks.
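A sketch of this overlapping variant under the same assumptions, with the convention that the final chunk, having no successor, attends only to itself; it reuses the softmax helper from above:

```python
def sliding_window_chunk_attention(q, k, v, chunk_size):
    """Queries in chunk i attend to keys/values in chunks i and i+1.

    The last chunk has no successor, so it attends only to itself.
    Assumes chunk_size divides seq_len.
    """
    n, d = q.shape
    out = np.empty_like(v)
    for start in range(0, n, chunk_size):
        q_s = slice(start, start + chunk_size)
        kv_end = min(start + 2 * chunk_size, n)   # own chunk + the next one
        kv_s = slice(start, kv_end)
        scores = q[q_s] @ k[kv_s].T / np.sqrt(d)  # (C, ≤2C) scores
        out[q_s] = softmax(scores) @ v[kv_s]
    return out
```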
Use Cases
- Long-document understanding: books, reports, and transcripts processed chunk by chunk
- Video processing: attention over frame sequences, one chunk of frames at a time
- Streaming: continuous data processed a chunk at a time as it arrives