29. Local Attention

Introduction

Local attention (also called sliding window attention) is a sparse attention mechanism where each position only attends to a fixed-size window of neighboring positions. It's efficient (O(n·w) instead of O(n²)) and captures local patterns, though it cannot directly model long-range dependencies without additional mechanisms.

Core Concept

For a window size w, position i can only attend to positions [max(0, i-w/2), min(n-1, i+w/2)]:

```
LocalAttention(Q, K, V, w):
    for each position i:
        attend to positions (i - w/2) through (i + w/2), clipped to [0, n-1]
        output[i] = softmax-weighted sum of the values at those positions
```
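
As a concrete illustration, here is a minimal NumPy sketch of centered local attention. The function name and the per-position loop are illustrative choices for clarity; production implementations vectorize the windows and fuse this into a single kernel.

```python
import numpy as np

def local_attention(Q, K, V, w):
    """Centered sliding-window attention over (n, d) query/key/value arrays."""
    n, d = Q.shape
    half = w // 2
    out = np.zeros_like(V)
    for i in range(n):
        lo, hi = max(0, i - half), min(n - 1, i + half)  # clip the window at the boundaries
        scores = Q[i] @ K[lo:hi + 1].T / np.sqrt(d)      # scores only for the windowed keys
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                         # softmax over the window only
        out[i] = weights @ V[lo:hi + 1]                  # weighted sum of windowed values
    return out

# Tiny usage example
Q = K = V = np.random.randn(16, 8)
out = local_attention(Q, K, V, w=5)  # each position sees at most 5 neighbors
```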

Window Sizes

| Model | Window Size | Notes |
|---|---|---|
| GPT-3 | 2048 | Large context |
| Swin Transformer | 7×7 | For images |
| Longformer | 512 | Plus global attention at special tokens |
| BigBird | 512 | Plus global and random attention |

Variants

1. Centered Window

Position i attends to [i-w/2, i+w/2]

Example: w=3 for position 5:
Attends to: {4, 5, 6}

2. Left-aligned (Causal Local)

Position i attends to [max(0, i-w+1), i]

Example: w=3 for position 5:
Attends to: {3, 4, 5}

3. Strided Local

Position i attends to positions spaced a fixed stride s apart (i, i±s, i±2s, …), usually combined with a small dense local context around i. This dilated pattern widens the receptive field without attending to more positions; see the mask sketch below.
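
All three variants can be expressed as boolean masks over the n×n score matrix, with disallowed entries set to −∞ before the softmax. The following sketch builds one mask per variant; the function names and the exact strided pattern are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def centered_mask(n, w):
    """Allow j with |i - j| <= w // 2."""
    i, j = np.indices((n, n))
    return np.abs(i - j) <= w // 2

def causal_local_mask(n, w):
    """Allow j in [i - w + 1, i]: a left-aligned window with no future positions."""
    i, j = np.indices((n, n))
    return (j <= i) & (j >= i - w + 1)

def strided_mask(n, stride, local=1):
    """Allow positions at multiples of `stride` from i, plus `local` dense neighbors."""
    i, j = np.indices((n, n))
    dist = np.abs(i - j)
    return (dist % stride == 0) | (dist <= local)

# The worked examples above: window w=3 at position 5
print(np.nonzero(centered_mask(8, 3)[5])[0])      # [4 5 6]
print(np.nonzero(causal_local_mask(8, 3)[5])[0])  # [3 4 5]
```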

Computational Complexity

For sequence length n and window size w:

Standard attention: O(n² · d)
Local attention: O(n · w · d)

Example: n = 4096, w = 64
  • Standard: 4096² = 16,777,216 attention scores (× d)
  • Local: 4096 × 64 = 262,144 attention scores (× d), a 64× reduction
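
A quick sanity check of that arithmetic:

```python
n, w = 4096, 64
standard = n * n          # 16,777,216 pairwise scores
local = n * w             # 262,144 scores
print(standard // local)  # 64
```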

Strengths

  • O(n·w) time and memory instead of O(n²), so it scales to much longer sequences
  • Captures local patterns (neighboring words or pixels), which dominate many tasks
  • Easy to combine with other sparse patterns (global, random, strided)

Limitations

  • No direct connection between positions farther apart than the window
  • Long-range dependencies can only be modeled indirectly, e.g., by stacking layers or adding global tokens
  • Window size is a fixed trade-off between cost and context

Combined with Global Attention

To capture both local and long-range dependencies:

  • Global tokens (e.g., designated special tokens) attend to, and are attended by, all positions
  • All other tokens attend only to their local window of size w

Because every token is at most one hop from a global token, long-range information can propagate through the global tokens even though regular tokens only see their local window.
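
A hedged sketch of how local and global patterns combine into a single mask follows; the helper name and the choice of which positions are global are illustrative, and models like Longformer/BigBird use specialized sparse kernels rather than a dense n×n mask.

```python
import numpy as np

def combined_mask(n, w, global_positions):
    """Local window of width w for every token, plus full rows and
    columns for designated global tokens (symmetric global attention)."""
    i, j = np.indices((n, n))
    local = np.abs(i - j) <= w // 2
    g = np.zeros(n, dtype=bool)
    g[list(global_positions)] = True
    # A pair (i, j) is allowed if it lies within the window, or if either token is global.
    return local | g[:, None] | g[None, :]

mask = combined_mask(n=16, w=4, global_positions=[0])  # token 0 acts like a [CLS] token
print(mask[0].all(), mask[:, 0].all())                 # True True -> full row and column
```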

Test Your Understanding

Question 1: What is the complexity of local attention with window size w?

  • A) O(n²)
  • B) O(n·w)
  • C) O(w²)
  • D) O(n)

Question 2: Local attention cannot directly capture:

  • A) Local patterns
  • B) Long-range dependencies
  • C) Syntax
  • D) Word meaning

Question 3: A window size of w=3 at position 5 (centered) attends to positions:

  • A) {4, 5, 6}
  • B) {1, 2, 3}
  • C) {5}
  • D) {0, 1, 2, 3, 4, 5}

Question 4: How does local attention handle boundaries?

  • A) Ignores boundary positions
  • B) Clips window to valid positions
  • C) Uses wrap-around
  • D) Fails at boundaries

Question 5: For n=1024 and w=128, local attention computes how many connections per position?

  • A) 1024
  • B) 128
  • C) 896
  • D) 8

Question 6: Local attention alone cannot model long-range because:

  • A) Window is too small
  • B) No direct connection between distant positions
  • C) Too much memory
  • D) Cannot handle boundaries

Question 7: To enable long-range with local attention, we need:

  • A) Smaller windows
  • B) Global attention tokens + hierarchical propagation
  • C) Larger windows
  • D) No additional mechanism

Question 8: Swin Transformer uses window size:

  • A) 3×3
  • B) 7×7
  • C) 16×16
  • D) 512