32. Block Attention

Introduction

Block attention is a sparse attention mechanism in which the sequence is divided into blocks and attention is computed within blocks (or between selected blocks) rather than across the entire sequence. This reduces complexity from O(n²) to O(n·b), where b is the block size, and also enables structured block-to-block attention patterns.

Block Partitioning

Sequence length n divided into B blocks
Block size b = n / B

Example: n=1024, block_size=64 → 16 blocks
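As a minimal sketch of the partitioning step (assuming a PyTorch tensor of shape [batch, n, d] and that n is divisible by the block size), splitting into blocks is just a reshape:

```python
import torch

def partition_into_blocks(x, block_size):
    # x: [batch, n, d]; assumes n is divisible by block_size
    batch, n, d = x.shape
    num_blocks = n // block_size
    # -> [batch, num_blocks, block_size, d]
    return x.reshape(batch, num_blocks, block_size, d)

x = torch.randn(2, 1024, 64)
blocks = partition_into_blocks(x, block_size=64)
print(blocks.shape)  # torch.Size([2, 16, 64, 64]) -> 16 blocks of 64 tokens
```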

Attention Patterns

1. Within-Block Only

Attention only within each block (like local attention):

Complexity: O(B · b²) = O(n·b)

Block i can only attend to positions in block i
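A minimal within-block attention sketch (single head, no padding mask; the function name is illustrative, not from a particular library). Each block computes its own b×b attention matrix, which is where the O(B · b²) cost comes from:

```python
import torch
import torch.nn.functional as F

def block_attention(q, k, v, block_size):
    # q, k, v: [batch, n, d]; each block attends only to itself
    batch, n, d = q.shape
    num_blocks = n // block_size
    # reshape to [batch, num_blocks, block_size, d]
    q, k, v = (t.reshape(batch, num_blocks, block_size, d) for t in (q, k, v))
    # scores within each block: [batch, num_blocks, block_size, block_size]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    weights = F.softmax(scores, dim=-1)
    out = weights @ v                      # [batch, num_blocks, block_size, d]
    return out.reshape(batch, n, d)        # back to [batch, n, d]

q = k = v = torch.randn(2, 1024, 64)
out = block_attention(q, k, v, block_size=64)
print(out.shape)  # torch.Size([2, 1024, 64])
```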

2. Block-to-Block (Strided)

Block i attends to blocks at stride S:

Block i → blocks: i, i-S, i+S, i-2S, ...

Enables larger receptive field
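One way to express the strided pattern is a block-level boolean mask (a sketch; the block count and stride below are illustrative), which can then be applied to block-to-block attention scores:

```python
import torch

def strided_block_mask(num_blocks, stride):
    # mask[i, j] is True when block i may attend to block j,
    # i.e. j is block i itself or a multiple of `stride` away
    idx = torch.arange(num_blocks)
    dist = (idx[:, None] - idx[None, :]).abs()
    return dist % stride == 0

print(strided_block_mask(num_blocks=8, stride=2).int())
```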

3. Hierarchical Blocks

Different block sizes at different levels:

Level 1: 64 tokens per block
Level 2: 256 tokens per block
Level 3: 1024 tokens per block
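A tiny illustration (using the level sizes from the example above) of how the same token position maps to a different block index at each level, from fine to coarse:

```python
def block_index(position, block_size):
    # which block a token position falls into at a given level
    return position // block_size

levels = {1: 64, 2: 256, 3: 1024}
pos = 700
for level, size in levels.items():
    print(f"level {level} (block size {size}): block {block_index(pos, size)}")
# level 1 (block size 64): block 10
# level 2 (block size 256): block 2
# level 3 (block size 1024): block 0
```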

Memory Efficiency

Standard: O(n²) memory for attention matrix

Block attention: O(B · b²) = O(n·b) memory

Example: n=4096, b=64 (B=64 blocks)
Standard: n² ≈ 16.8M entries
Block: B · b² ≈ 262K entries (64× fewer)
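These counts follow directly from the formulas; a quick check in plain Python:

```python
n, b = 4096, 64
standard = n * n            # full attention matrix
block = (n // b) * b * b    # B blocks, each b x b
print(standard, block, standard // block)  # 16777216 262144 64
```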

Used In

  • Swin Transformer: window attention, i.e. block attention within fixed-size windows of image patches
  • BigBird: block-sparse attention combining local blocks with global and random connections
  • Sparse Transformer: strided, block-structured attention patterns for long sequences

Cross-Block Information

Within-block attention alone keeps information confined to each block. Common ways to let information flow between blocks (one of them is sketched below):

  • Shifted windows: offset the block boundaries in alternating layers (as in Swin Transformer)
  • Global tokens: a few tokens that attend to, and are attended by, every block
  • Hierarchical stacking: coarser block levels that summarize and connect the finer ones
  • Strided block-to-block attention, as in pattern 2 above
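A concrete sketch of the shifted-window idea (an assumed implementation detail, not spelled out in this text): roll the sequence by half a block, apply within-block attention, then roll back. This simplified version ignores the wrap-around at the sequence ends, which Swin Transformer handles with an extra mask.

```python
import torch

def shifted_block_attention(q, k, v, block_size, block_attn_fn):
    # shift the sequence by half a block so block boundaries move,
    # letting neighbouring blocks exchange information across layers
    shift = block_size // 2
    q, k, v = (torch.roll(t, shifts=-shift, dims=1) for t in (q, k, v))
    out = block_attn_fn(q, k, v, block_size)   # e.g. the within-block attention above
    return torch.roll(out, shifts=shift, dims=1)
```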

Test Your Understanding

Question 1: Block attention divides sequence into:

  • A) Single large block
  • B) Multiple blocks of equal size
  • C) Random sized blocks
  • D) No blocks

Question 2: For n=1024 with block size 64, how many blocks?

  • A) 1024
  • B) 64
  • C) 16
  • D) 8

Question 3: Memory for block attention is:

  • A) O(n²)
  • B) O(n·b)
  • C) O(b²)
  • D) O(n)

Question 4: To enable cross-block information flow, we can use:

  • A) Larger blocks
  • B) Shifted windows, global tokens, or hierarchical stacking
  • C) Smaller blocks
  • D) Remove attention

Question 5: In block attention, positions in block 3 attend to:

  • A) All positions
  • B) Only positions in block 3
  • C) No other positions
  • D) Only even blocks

Question 6: Swin Transformer's window attention is:

  • A) Block attention with fixed windows
  • B) Full sequence attention
  • C) Random attention
  • D) No attention

Question 7: Strided block attention block i attends to blocks:

  • A) Only block i
  • B) Blocks at stride distance (i±S, i±2S, ...)
  • C) All blocks
  • D) No blocks

Question 8: Hierarchical blocks help capture:

  • A) Only fine-grained patterns
  • B) Multi-scale patterns from local to global
  • C) No useful patterns
  • D) Single scale only