Introduction
Block attention is a sparse attention mechanism in which the sequence is divided into blocks and attention is computed within blocks rather than across the entire sequence. This reduces complexity from O(n²) to O(n·b), where b is the block size, and also enables specific block-to-block attention patterns.
Block Partitioning
Sequence of length n is divided into B contiguous blocks
Block size b = n / B
Example: n=1024, block_size=64 → 16 blocks
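As a concrete illustration, here is a minimal NumPy sketch of the partitioning step (the function name `to_blocks` is ours, purely for illustration):

```python
import numpy as np

def to_blocks(x: np.ndarray, block_size: int) -> np.ndarray:
    """Reshape a (n, d) token sequence into (B, b, d) blocks.

    Assumes block_size divides n; real implementations pad otherwise.
    """
    n, d = x.shape
    assert n % block_size == 0, "sequence length must be a multiple of block size"
    return x.reshape(n // block_size, block_size, d)

x = np.random.randn(1024, 8)           # n=1024 tokens, d=8 features
blocks = to_blocks(x, block_size=64)
print(blocks.shape)                     # (16, 64, 8): B=16 blocks of b=64
```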
Attention Patterns
1. Within-Block Only
Attention is restricted to within each block (similar to local attention):
Complexity: O(B · b²) = O(n·b)
Block i can only attend to positions in block i
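A minimal single-head sketch of this pattern (no learned projections, unbatched; helper names are ours):

```python
import numpy as np

def softmax(s: np.ndarray, axis: int = -1) -> np.ndarray:
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def within_block_attention(x: np.ndarray, block_size: int) -> np.ndarray:
    """Self-attention computed independently inside each block.

    x: (n, d) with block_size dividing n. Only B score matrices of
    shape (b, b) are ever materialized, hence O(B · b²) = O(n·b).
    """
    n, d = x.shape
    z = x.reshape(n // block_size, block_size, d)        # (B, b, d)
    scores = z @ z.transpose(0, 2, 1) / np.sqrt(d)       # (B, b, b)
    return (softmax(scores) @ z).reshape(n, d)           # back to (n, d)

y = within_block_attention(np.random.randn(1024, 8), block_size=64)
```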
2. Block-to-Block (Strided)
Block i attends to blocks at stride S:
Block i → blocks: i, i-S, i+S, i-2S, ...
Enables a larger receptive field
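The strided pattern is easiest to express as a block-level mask; a sketch, assuming the stride applies symmetrically in both directions:

```python
import numpy as np

def strided_block_mask(num_blocks: int, stride: int) -> np.ndarray:
    """Boolean (B, B) mask: block i may attend to blocks i, i±S, i±2S, ..."""
    idx = np.arange(num_blocks)
    return np.abs(idx[:, None] - idx[None, :]) % stride == 0

mask = strided_block_mask(num_blocks=16, stride=4)
print(np.where(mask[1])[0])   # block 1 attends to blocks 1, 5, 9, 13
```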
3. Hierarchical Blocks
Different block sizes at different levels:
Level 1: 64 tokens per block
Level 2: 256 tokens per block
Level 3: 1024 tokens per block
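One way to realize such a hierarchy (an illustrative sketch, not a specific published architecture) is to grow the block size with layer depth, so early layers mix locally and later layers mix over progressively wider spans; the depth schedule below is hypothetical:

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def block_attention(x, b):
    """Within-block attention at block size b (single head, no projections)."""
    n, d = x.shape
    z = x.reshape(n // b, b, d)
    return (softmax(z @ z.transpose(0, 2, 1) / np.sqrt(d)) @ z).reshape(n, d)

# Hypothetical schedule mirroring the levels above: two layers per level.
x = np.random.randn(4096, 8)
for b in [64, 64, 256, 256, 1024, 1024]:
    x = block_attention(x, b)   # receptive field grows as blocks coarsen
```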
Memory Efficiency
Standard: O(n²) memory for attention matrix
Block attention: O(B · b²) = O(n·b) memory
Example: n=4096, b=64
Standard: n² ≈ 16.8M entries
Block: n·b ≈ 262K entries (64× less)
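The arithmetic behind these figures:

```python
n, b = 4096, 64
standard = n * n              # dense attention: one score per token pair
block = (n // b) * b * b      # B blocks, each a b×b score matrix = n·b
print(standard, block, standard // block)   # 16777216 262144 64
```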
Used In
- Longformer: sliding-window (local) attention combined with global attention
- Swin Transformer: Window-based blocks
- Image Transformers: Patch-based blocks
Cross-Block Information
To allow information to flow between blocks:
- Hierarchical stacking: Later layers see aggregated block info
- Shifted windows (Swin): Adjacent blocks attend to each other
- Global tokens: Special tokens that attend across blocks
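As one concrete example of the third mechanism, here is a token-level mask with a few global positions (placing them at the start of the sequence is our assumption; models like Longformer instead designate task-specific positions as global):

```python
import numpy as np

def global_block_mask(n: int, block_size: int, num_global: int) -> np.ndarray:
    """Within-block mask plus num_global positions with full attention."""
    pos = np.arange(n)
    mask = (pos[:, None] // block_size) == (pos[None, :] // block_size)
    mask[:num_global, :] = True    # global tokens attend everywhere
    mask[:, :num_global] = True    # every token attends to global tokens
    return mask

mask = global_block_mask(n=1024, block_size=64, num_global=4)
print(mask.mean())   # fraction of token pairs kept, well below 1.0
```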