Introduction
Local attention (also called sliding window attention) is a sparse attention mechanism where each position only attends to a fixed-size window of neighboring positions. It's efficient (O(n·w) instead of O(n²)) and captures local patterns, though it cannot directly model long-range dependencies without additional mechanisms.
Core Concept
For a window size w, position i attends only to positions in [max(0, i − ⌊w/2⌋), min(n − 1, i + ⌊w/2⌋)]:
LocalAttention(Q, K, V, w):
    for each position i:
        attend to positions (i − ⌊w/2⌋) through (i + ⌊w/2⌋), clipped to [0, n − 1]
        compute the softmax-weighted sum of V over those positions
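A minimal NumPy sketch of the centered variant (the name `local_attention` is illustrative, not a library function). For clarity it builds the full n×n score matrix and masks it, so it demonstrates the math rather than the O(n·w) memory savings; an efficient implementation would gather only the w keys inside each window:

```python
import numpy as np

def local_attention(Q, K, V, w):
    """Reference implementation: full scores, banded mask of width w."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                         # (n, n) logits
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w // 2  # |i - j| <= floor(w/2)
    scores = np.where(mask, scores, -np.inf)              # block out-of-window pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the window
    return weights @ V                                    # (n, d) outputs

n, d, w = 16, 8, 5
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(local_attention(Q, K, V, w).shape)  # (16, 8)
```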
Window Sizes
| Model | Window Size | Notes |
|---|---|---|
| GPT-3 | 2048 | Context length; alternates dense and locally banded sparse attention layers |
| Swin Transformer | 7×7 | 2-D windows over image patches |
| Longformer | 512 | Sliding window plus global attention at special tokens |
| BigBird | 512 | Window combined with global and random attention |
Variants
1. Centered Window
Position i attends to [i − ⌊w/2⌋, i + ⌊w/2⌋]
Example: w = 3 at position 5 attends to {4, 5, 6} (all three variants are sketched as masks in code after this list)
2. Left-aligned (Causal Local)
Position i attends to [max(0, i-w+1), i]
Example: w = 3 at position 5 attends to {3, 4, 5}
3. Strided Local
Position i attends to positions spaced at a fixed stride s (i, i − s, i − 2s, …), plus a small local window around i for nearby context
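These variants differ only in which (i, j) pairs the attention mask allows. A minimal NumPy sketch of all three masks (the function names, and the `local` parameter of the strided variant, are illustrative); `mask[i, j] = True` means position i may attend to position j:

```python
import numpy as np

def centered_mask(n, w):
    """mask[i, j] iff |i - j| <= w // 2."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

def causal_local_mask(n, w):
    """mask[i, j] iff i - w + 1 <= j <= i."""
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]   # i - j
    return (diff >= 0) & (diff < w)

def strided_mask(n, s, local=1):
    """mask[i, j] iff i and j are a multiple of stride s apart,
    or j lies within `local` positions of i (the small local context)."""
    idx = np.arange(n)
    diff = np.abs(idx[:, None] - idx[None, :])
    return (diff % s == 0) | (diff <= local)

print(np.where(centered_mask(8, 3)[5])[0])      # [4 5 6]
print(np.where(causal_local_mask(8, 3)[5])[0])  # [3 4 5]
```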
Computational Complexity
For sequence length n, window size w, and head dimension d:
- Standard attention: O(n² · d)
- Local attention: O(n · w · d)
Example: n = 4096, w = 64
- Standard: 4096² = 16,777,216 score entries (× d)
- Local: 4096 · 64 = 262,144 score entries (× d), a 64× reduction (= n/w)
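A quick sanity check of that arithmetic:

```python
n, w = 4096, 64
print(n * n)               # 16777216  (standard attention score entries)
print(n * w)               # 262144    (local attention score entries)
print((n * n) // (n * w))  # 64, the n/w reduction factor
```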
Strengths
- Efficient: Linear in sequence length
- Local patterns: Captures local syntax, structure
- Memory efficient: O(n·w) instead of O(n²)
Limitations
- No long-range links: distant positions cannot attend to each other directly
- Information bottleneck: long-range signals must propagate layer by layer, so the receptive field grows by only about w per layer
Combined with Global Attention
To capture both local and long-range dependencies:
- Global tokens: attend to, and are attended by, ALL positions
- Local tokens: attend only to their window of width w
- Information from global tokens propagates to every position, letting distant parts of the sequence communicate through them
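A sketch of the combined mask in the style of Longformer's sliding-window-plus-global pattern (the function name and `global_idx` parameter are illustrative):

```python
import numpy as np

def global_local_mask(n, w, global_idx):
    """Banded window of width w, plus symmetric global attention
    for the positions listed in global_idx."""
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w // 2  # local band
    mask[global_idx, :] = True   # global tokens attend everywhere
    mask[:, global_idx] = True   # every token attends to global tokens
    return mask

mask = global_local_mask(n=16, w=5, global_idx=[0])  # e.g. a [CLS]-style token
print(mask[10, 0], mask[10, 3])  # True False: global link vs. distant local pair
```

With a pattern like this, any two positions can exchange information within two hops through a global token, while the per-token cost stays roughly O(w + g) for g global tokens.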