25. Causal Masking

Introduction

Causal masking (also called autoregressive masking or future masking) is a technique used in autoregressive models to prevent each position from attending to later positions in the sequence. It ensures the model can only see context from the past and present, not from the future.

Why Causal Masking?

When generating a sequence autoregressively:

  • Each token must be predicted using only the tokens that come before it, never the tokens that come after.
  • Without masking, attention during training could read the answer from future positions, so the model would learn to copy rather than predict, and training would no longer match the one-token-at-a-time setting of generation.
  • Causal masking enforces this left-to-right constraint directly in the attention scores.

The Causal Mask Matrix

For sequence length 4, causal mask M:

              To position (j)
               0    1    2    3
            ┌────┬────┬────┬────┐
          0 │ 1  │ 0  │ 0  │ 0  │
From        ├────┼────┼────┼────┤
position  1 │ 1  │ 1  │ 0  │ 0  │
(i)         ├────┼────┼────┼────┤
          2 │ 1  │ 1  │ 1  │ 0  │
            ├────┼────┼────┼────┤
          3 │ 1  │ 1  │ 1  │ 1  │
            └────┴────┴────┴────┘

1 = can attend, 0 = cannot attend

In practice, we use -∞ for forbidden positions so softmax produces 0 attention weight.

Mathematical Formulation

Let M be the causal mask matrix:

M[i,j] = 0 if j ≤ i (can attend)
M[i,j] = -∞ if j > i (cannot attend)

Attention with mask:
A = softmax(QKᵀ/√d + M)

Then A[i,j] = 0 for all j > i (because softmax(-∞) = 0)
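
As a concrete check, here is a minimal sketch (assuming PyTorch and small random Q, K of shape [seq_len, d], with an illustrative d = 8) that applies the additive mask and confirms every attention weight above the diagonal is zero:

import math
import torch

seq_len, d = 4, 8
Q = torch.randn(seq_len, d)
K = torch.randn(seq_len, d)

# M[i, j] = 0 where j <= i, -inf where j > i
M = torch.full((seq_len, seq_len), float('-inf')).triu(diagonal=1)

A = torch.softmax(Q @ K.T / math.sqrt(d) + M, dim=-1)
print(torch.all(A.triu(diagonal=1) == 0))   # tensor(True): no attention to the future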

Implementation Methods

Method 1: Additive Mask

scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # QK^T / sqrt(d)
scores_masked = scores + M                        # M holds -inf at future positions
A = torch.softmax(scores_masked, dim=-1)          # softmax row-wise; future weights become 0

Method 2: Lower Triangular

mask = torch.tril(torch.ones(seq_len, seq_len))          # 1 on/below the diagonal, 0 above
scores = scores.masked_fill(mask == 0, float('-inf'))    # block attention to future positions

Method 3: Bias Addition (ALiBi style)

In ALiBi-style models, the linear distance bias is built into the same additive tensor as the causal structure: future positions are set to -∞ in that bias, so no separate mask tensor is needed.
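
A minimal sketch of that idea (the single slope value 0.5 is an illustrative assumption; ALiBi itself uses a different slope per attention head): the distance penalty and the -∞ causal block live in one additive bias.

import torch

def alibi_style_bias(seq_len, slope=0.5):
    # Pairwise offsets: dist[i, j] = i - j (how far key j lies behind query i)
    pos = torch.arange(seq_len)
    dist = pos.view(-1, 1) - pos.view(1, -1)
    bias = -slope * dist.clamp(min=0).float()          # linear penalty grows with distance
    bias = bias.masked_fill(dist < 0, float('-inf'))   # future positions (j > i) stay blocked
    return bias                                        # added to QK^T/sqrt(d), like M above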

Causal in Training vs Generation

Training (Parallel)

All positions are computed in parallel, with the causal mask applied to the attention scores, so every next-token prediction is trained in a single forward pass.
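
A minimal sketch of that parallel setup (the logits tensor is a random stand-in for the output of any causally masked decoder; batch size, sequence length, and vocabulary size are illustrative assumptions):

import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 16, 1000
tokens = torch.randint(0, vocab, (batch, seq_len))   # training sequences
logits = torch.randn(batch, seq_len, vocab)          # stand-in for model(tokens) with causal masking

# Position i predicts token i + 1, so targets are the inputs shifted left by one
pred = logits[:, :-1, :].reshape(-1, vocab)
target = tokens[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)                 # one loss over all positions at once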

Generation (Autoregressive)

Generate one token at a time. When extending context, the causal mask naturally allows attending to all previous tokens.

Step 1: tokens[0:1] → predict token[1]
Step 2: tokens[0:2] → predict token[2]
Step 3: tokens[0:3] → predict token[3]
...
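
A minimal greedy-decoding sketch of these steps (the model argument is a placeholder assumption for any causally masked decoder returning [batch, seq_len, vocab] logits; the prefix is re-encoded each step, with no KV cache):

import torch

def generate(model, prompt_ids, max_new_tokens):
    # prompt_ids: [1, prompt_len] token ids
    tokens = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                                    # causal mask applied inside the model
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick from the last position
        tokens = torch.cat([tokens, next_id], dim=1)              # extend the context by one token
    return tokens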

Causal Mask vs Padding Mask

Aspect         | Causal Mask                       | Padding Mask
---------------|-----------------------------------|---------------------------------
Purpose        | Block future positions            | Block padding tokens
Shape          | Lower triangular                  | Depends on padding pattern
Used in        | Decoder (autoregressive)          | Both encoder and decoder
Combined with  | Often combined with padding mask  | Often combined with causal mask
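
In practice the two are often merged into a single additive mask before the softmax; a minimal sketch, assuming right-padding and a boolean pad_mask of shape [batch, seq_len] where True marks real tokens:

import torch

def combined_mask(pad_mask):
    # pad_mask: [batch, seq_len] bool, True = real token, False = padding
    batch, seq_len = pad_mask.shape
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # queries may see keys j <= i
    allowed = causal.unsqueeze(0) & pad_mask.unsqueeze(1)                # key must also be a real token
    mask = torch.zeros(batch, seq_len, seq_len)
    return mask.masked_fill(~allowed, float('-inf'))                     # add to attention scores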

Test Your Understanding

Question 1: What does causal masking prevent?

  • A) Attending to padding
  • B) Attending to future positions during generation
  • C) Attending to past positions
  • D) Self-attention

Question 2: In a sequence of length 5, which positions can position 3 attend to?

  • A) Only position 3
  • B) Positions 0, 1, 2, 3
  • C) Positions 0, 1, 2, 3, 4
  • D) Positions 3, 4 only

Question 3: What value is used to block attention to future positions?

  • A) 0
  • B) 1
  • C) -∞
  • D) ∞

Question 4: The causal mask matrix for length 3 is:

  • A) All zeros
  • B) Lower triangular
  • C) Upper triangular
  • D) All ones

Question 5: Why is -∞ used instead of 0 for masking?

  • A) 0 is too small to block
  • B) softmax(-∞) = 0, effectively blocking attention
  • C) -∞ is faster to compute
  • D) 0 would cause gradient issues

Question 6: In training, how is causal masking applied?

  • A) Token by token sequentially
  • B) All positions in parallel with masking
  • C) Not used during training
  • D) Only on the first layer

Question 7: Causal masking is used in which part of Transformers?

  • A) Encoder only
  • B) Decoder only (masked self-attention)
  • C) Both encoder and decoder
  • D) Neither

Question 8: What happens to future positions after softmax with -∞ mask?

  • A) They get high attention
  • B) They get zero attention weight
  • C) They get random attention
  • D) They cause errors