52. Chunked Attention

Introduction

Chunked attention splits the sequence into chunks, computes attention on each chunk separately, and then combines the results. It is used to handle long sequences whose attention matrix does not fit in memory, making very long contexts tractable by breaking them into manageable pieces.

The Problem

Full attention requires O(n²) memory for the attention matrix, where n is the sequence length. For very long sequences, this becomes infeasible.
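As a rough illustration of the scale involved, the short Python sketch below estimates the size of a full fp32 attention score matrix for a few sequence lengths; the single-head, batch-size-1 setup is an assumption chosen only to keep the arithmetic simple.

    # Rough size of a full fp32 attention score matrix (single head, batch 1).
    # Illustrative assumption only; real models add heads, batches, activations.
    def attn_matrix_bytes(n: int, bytes_per_element: int = 4) -> int:
        return n * n * bytes_per_element

    for n in (4_096, 32_768, 131_072):
        print(f"n={n:>7}: {attn_matrix_bytes(n) / 2**30:.2f} GiB")
    # n=   4096: 0.06 GiB
    # n=  32768: 4.00 GiB
    # n= 131072: 64.00 GiB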

Chunked Attention Solution

  • Sequence length: n; chunk size: C
  • Number of chunks: n/C
  • Attention is computed within each chunk (worked numbers below)
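For concreteness, here is the arithmetic for one hypothetical configuration (n = 4096, C = 512), matching the chunk-count formula above:

    # Hypothetical sizes, for illustration only.
    n, C = 4096, 512
    num_chunks = n // C                    # 4096 / 512 = 8 chunks
    full_scores = n * n                    # entries in the full attention matrix
    chunked_scores = num_chunks * C * C    # entries across all within-chunk matrices
    print(num_chunks, full_scores // chunked_scores)   # 8 chunks, 8x fewer score entries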

Implementation Approaches

1. Independent Chunk Attention

Each chunk is processed independently. This is fast and memory-efficient, but it loses cross-chunk relationships: tokens in one chunk never attend to tokens in another (see the sketch below).
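A minimal NumPy sketch of independent chunk attention, assuming a single head, no masking, and a sequence length divisible by the chunk size; the function and variable names are illustrative, not from any particular library.

    import numpy as np

    def independent_chunk_attention(q, k, v, chunk_size):
        """Within-chunk attention only: each chunk of queries attends to the
        keys/values of the same chunk. Assumes seq_len % chunk_size == 0."""
        seq_len, d = q.shape
        out = np.empty_like(v)
        for start in range(0, seq_len, chunk_size):
            s = slice(start, start + chunk_size)
            scores = q[s] @ k[s].T / np.sqrt(d)          # (C, C) instead of (n, n)
            w = np.exp(scores - scores.max(axis=-1, keepdims=True))
            w /= w.sum(axis=-1, keepdims=True)
            out[s] = w @ v[s]
        return out

    # Example: 4096 tokens, 64-dim head, 512-token chunks -> 8 independent chunks.
    q, k, v = (np.random.randn(4096, 64) for _ in range(3))
    out = independent_chunk_attention(q, k, v, chunk_size=512)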

2. Chunked with Cross-Chunk

Cross-chunk relationships can be recovered with multiple passes or a hierarchical combination (sketched after the two passes below):

Pass 1: Within-chunk attention
Pass 2: Cross-chunk attention
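One possible way to realize the two passes is sketched below in NumPy: pass 1 runs within-chunk attention, and pass 2 lets per-chunk summaries (mean-pooled here, purely as an illustrative assumption) attend to each other before the cross-chunk context is added back to every token. This is one hedged combination scheme, not the only way to combine chunks.

    import numpy as np

    def attention(q, k, v):
        """Plain scaled dot-product attention, no masking."""
        scores = q @ k.T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v

    def two_pass_chunked_attention(q, k, v, chunk_size):
        seq_len, d = q.shape
        num_chunks = seq_len // chunk_size
        # Pass 1: within-chunk attention.
        local = np.empty_like(v)
        for start in range(0, seq_len, chunk_size):
            s = slice(start, start + chunk_size)
            local[s] = attention(q[s], k[s], v[s])
        # Pass 2: cross-chunk attention over per-chunk summaries
        # (mean pooling is an illustrative choice), broadcast back to tokens.
        summaries = local.reshape(num_chunks, chunk_size, d).mean(axis=1)
        cross = attention(summaries, summaries, summaries)      # (num_chunks, d)
        return local + np.repeat(cross, chunk_size, axis=0)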

3. Sliding Window Chunk

Chunks overlap: chunk i attends to chunks i and i+1, so information can flow across chunk boundaries.
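A sketch of the overlapping variant under the same single-head, no-masking assumptions: each query chunk reads keys/values from its own chunk and the next one, and the final chunk simply has no successor.

    import numpy as np

    def sliding_window_chunk_attention(q, k, v, chunk_size):
        """Chunk i's queries attend to the keys/values of chunks i and i+1."""
        seq_len, d = q.shape
        out = np.empty_like(v)
        for start in range(0, seq_len, chunk_size):
            qs = slice(start, start + chunk_size)
            ks = slice(start, min(start + 2 * chunk_size, seq_len))  # this chunk + next
            scores = q[qs] @ k[ks].T / np.sqrt(d)                    # at most (C, 2C)
            w = np.exp(scores - scores.max(axis=-1, keepdims=True))
            w /= w.sum(axis=-1, keepdims=True)
            out[qs] = w @ v[ks]
        return out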

Use Cases

  • Long document understanding: Process long documents in chunks
  • Video processing: Frame sequences in chunks
  • Streaming: Process continuous data streams

Test Your Understanding

Question 1: Chunked attention is used to handle:

  • A) Short sequences
  • B) Long sequences that don't fit in memory
  • C) No specific purpose
  • D) Fixed length only

Question 2: With chunk size C and sequence length n, number of chunks is:

  • A) C
  • B) n
  • C) n/C
  • D) C/n

Question 3: Independent chunk attention has what drawback?

  • A) Too fast
  • B) Loses cross-chunk relationships
  • C) No memory savings
  • D) Uses too much memory

Question 4: Overlapping chunks in sliding window chunk allow:

  • A) No information flow
  • B) Cross-chunk information flow
  • C) Slower processing
  • D) No benefit

Question 5: For sequence length 4096 with chunk size 512, number of chunks is:

  • A) 8
  • B) 512
  • C) 4096
  • D) 16

Question 6: Chunked attention requires:

  • A) Full sequence in memory
  • B) Only one chunk in memory at a time
  • C) No computation
  • D) Single chunk only

Question 7: A use case for chunked attention is:

  • A) Only image processing
  • B) Long document understanding
  • C) No use case
  • D) Fixed sequences only

Question 8: Multi-pass chunked attention:

  • A) Single pass only
  • B) Within-chunk first, then cross-chunk attention
  • C) No passes
  • D) Random passes