27. Attention Masks

Introduction

Attention masks are binary or additive (real-valued) masks that control which positions in a sequence can attend to which other positions. They are essential in many attention scenarios, including causal (autoregressive) generation, ignoring padding tokens, and implementing specialized sparse attention patterns.

Types of Attention Masks

1. Causal Mask (Future Blocking)

Prevents each position from attending to future positions
Additive mask with 0 on and below the diagonal (allowed) and -∞ strictly above it (blocked), so the allowed region is lower triangular
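
A minimal PyTorch sketch of building this mask (the helper name causal_mask is illustrative):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # -inf strictly above the diagonal (future positions), 0 elsewhere
    full = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(full, diagonal=1)

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```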

2. Padding Mask

Blocks attention to padding tokens
True/1 for padding, False/0 for real tokens (the PyTorch key_padding_mask convention; BERT-style attention_mask uses the opposite encoding, see the table below)
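
A sketch of building such a mask from per-sequence lengths, following the PyTorch key_padding_mask convention (the helper name and shapes are illustrative):

```python
import torch

def key_padding_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    # True where the position index is past the sequence's real length
    positions = torch.arange(max_len).unsqueeze(0)   # (1, max_len)
    return positions >= lengths.unsqueeze(1)         # (batch, max_len)

print(key_padding_mask(torch.tensor([3, 5]), max_len=5))
# tensor([[False, False, False,  True,  True],
#         [False, False, False, False, False]])
```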

3. Combined Mask

mask = causal_mask OR padding_mask

For decoder self-attention: block future positions and block padding (a position is blocked if either rule applies)
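
A sketch combining both rules into a single additive mask (names and shapes are illustrative):

```python
import torch

def combined_mask(lengths: torch.Tensor, seq_len: int) -> torch.Tensor:
    # Causal part: -inf strictly above the diagonal
    causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    # Padding part: True where the key position is padding
    pad = torch.arange(seq_len).unsqueeze(0) >= lengths.unsqueeze(1)  # (B, L)
    # A position is blocked if it is future OR padding
    mask = causal.unsqueeze(0).repeat(len(lengths), 1, 1)             # (B, L, L)
    return mask.masked_fill(pad.unsqueeze(1), float("-inf"))
```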

Mask Application in Computation

Step 1: Compute attention scores
scores = QKᵀ / √d, where d is the key dimension

Step 2: Apply mask (add -∞ for positions to block)
scores = scores + mask

Step 3: Softmax
attention_weights = softmax(scores, dim=-1)

Step 4: Weighted sum
output = attention_weights · V
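
The four steps as a runnable sketch (a standalone function for illustration, not any particular library's API):

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(Q, K, V, mask=None):
    # Step 1: scaled dot-product scores, shape (..., L_q, L_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    # Step 2: additive mask, -inf at blocked positions
    if mask is not None:
        scores = scores + mask
    # Step 3: softmax over the key dimension; blocked weights become ~0
    weights = F.softmax(scores, dim=-1)
    # Step 4: weighted sum of the values
    return weights @ V
```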

Mask Representations

| Mask Type | Value for Blocked | Value for Allowed | Used In |
|---|---|---|---|
| Causal | -∞ | 0 | Decoder self-attention |
| Padding | -∞ | 0 | Variable-length sequences |
| BERT attention_mask | 0 | 1 | Bidirectional models (ignoring padding) |
| Key padding | -∞ | 0 | Cross-attention with padding |
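
Converting between these conventions is a common source of bugs. A sketch turning a BERT-style 1/0 mask into the additive 0/-∞ form (naively computing (1 - mask) * float("-inf") would produce NaNs at allowed positions, so masked_fill is used instead):

```python
import torch

def to_additive(attention_mask: torch.Tensor) -> torch.Tensor:
    # 1 = real token, 0 = padding  ->  0 = allowed, -inf = blocked
    additive = torch.zeros_like(attention_mask, dtype=torch.float)
    return additive.masked_fill(attention_mask == 0, float("-inf"))

print(to_additive(torch.tensor([1, 1, 1, 0, 0])))
# tensor([0., 0., 0., -inf, -inf])
```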

Advanced Masks

1. Chunk Mask

For local attention within chunks:

Chunk size C, sequence length N
Within each chunk: full attention
Between chunks: no attention
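
A sketch of the resulting block-diagonal additive mask (helper name illustrative):

```python
import torch

def chunk_mask(seq_len: int, chunk_size: int) -> torch.Tensor:
    # Positions attend only within their own chunk (block-diagonal pattern)
    chunk_ids = torch.arange(seq_len) // chunk_size
    allowed = chunk_ids.unsqueeze(0) == chunk_ids.unsqueeze(1)
    return torch.zeros(seq_len, seq_len).masked_fill(~allowed, float("-inf"))
```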

2. Stride Mask

For attention with fixed gaps:

Allow attention between positions whose distance is a multiple of S
Block positions at all other offsets
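
A sketch under that definition (helper name illustrative):

```python
import torch

def stride_mask(seq_len: int, stride: int) -> torch.Tensor:
    # Allowed iff |i - j| is a multiple of `stride` (includes the diagonal)
    pos = torch.arange(seq_len)
    dist = (pos.unsqueeze(0) - pos.unsqueeze(1)).abs()
    blocked = dist % stride != 0
    return torch.zeros(seq_len, seq_len).masked_fill(blocked, float("-inf"))
```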

3. Arbitrary Attention Patterns

For sparse attention patterns, fixed or computed dynamically at runtime:

Custom mask M over all (i, j) pairs
Additive form: M[i,j] = 0 for allowed, -∞ for blocked
Boolean form: M[i,j] = 1 for allowed, 0 for blocked
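
A toy example of an arbitrary pattern in both representations (the pattern itself is made up for illustration):

```python
import torch

N = 6
# Boolean form: each position attends to itself and one random partner
M = torch.eye(N, dtype=torch.bool)
M[torch.arange(N), torch.randint(N, (N,))] = True

# Additive form, ready to be added to the attention scores
additive = torch.zeros(N, N).masked_fill(~M, float("-inf"))
```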

Efficiency Considerations
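
A full N×N mask costs O(N²) memory per sequence (more if materialized per batch element and head rather than broadcast), so masks are typically built once and broadcast across the batch and heads. For the common causal case, fused kernels can apply the pattern without materializing the mask at all; a sketch assuming PyTorch 2.x:

```python
import torch
import torch.nn.functional as F

Q = K = V = torch.randn(2, 8, 128, 64)   # (batch, heads, seq_len, head_dim)

# is_causal=True lets the fused kernel enforce causality without
# building a 128 x 128 additive mask in memory
out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
```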

Test Your Understanding

Question 1: What value is typically used to block attention in masks?

  • A) 0
  • B) 1
  • C) -∞
  • D) NaN

Question 2: The combined mask in decoder is:

  • A) causal_mask + padding_mask
  • B) causal_mask OR padding_mask
  • C) causal_mask - padding_mask
  • D) No combined mask

Question 3: For chunk mask with chunk size 4, positions 0-3 can attend to:

  • A) Only within same chunk
  • B) All chunks
  • C) No other positions
  • D) Only chunk 0

Question 4: In BERT's attention_mask, what does 0 mean?

  • A) Attend normally
  • B) Ignore (padding)
  • C) High attention
  • D) No mask applied

Question 5: Is the mask applied before or after softmax?

  • A) After softmax
  • B) Before softmax (by adding -∞)
  • C) During softmax
  • D) Not applied at all

Question 6: Dynamic masks are useful for:

  • A) Fixed patterns only
  • B) Variable patterns computed at runtime
  • C) Always the same mask
  • D) No dynamic masks exist

Question 7: After applying -∞ mask and softmax, blocked positions have attention weight approximately:

  • A) 1
  • B) 0.5
  • C) 0
  • D) -1

Question 8: How many masks are combined in decoder self-attention?

  • A) 0
  • B) 1
  • C) 2 (causal + padding)
  • D) 3