Introduction
Padding masking is a technique for hiding padding tokens from the attention mechanism when sequences of variable lengths are processed together. Because a batch must be a rectangular tensor, shorter sequences are extended with padding tokens until every sequence matches the longest one; the padding mask tells attention to ignore these artificial tokens.
The Problem
When batching sequences:
Sequence 1: "Hello world" [len=2]
Sequence 2: "AI is great" [len=3]
Sequence 3: "NLP" [len=1]
Batched (padded to len=3):
Seq 1: [Hello, world, <pad>]
Seq 2: [AI, is, great]
Seq 3: [NLP, <pad>, <pad>]
We need to tell the attention mechanism to ignore the padding tokens, as in the sketch below.
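A minimal sketch of this batching step in PyTorch; the token IDs below are made-up placeholders, not a real vocabulary:

import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token IDs for the three sequences above
seqs = [
    torch.tensor([101, 102]),        # "Hello world"  (len=2)
    torch.tensor([201, 202, 203]),   # "AI is great"  (len=3)
    torch.tensor([301]),             # "NLP"          (len=1)
]

PAD_ID = 0  # assumed padding token ID
batch = pad_sequence(seqs, batch_first=True, padding_value=PAD_ID)
padding_mask = batch.eq(PAD_ID)  # True wherever a position is padding

print(batch)         # shape (3, 3), shorter rows filled with PAD_ID
print(padding_mask)  # [[F, F, T], [F, F, F], [F, T, T]]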
Padding Mask Implementation
Create padding mask:
mask[i,j] = True if position j is padding
mask[i,j] = False if position j is real token
In attention scores:
scores[i,j] = scores[i,j] if not padding_mask[i,j]
scores[i,j] = -∞ if padding_mask[i,j]
After softmax, attention to padding ≈ 0
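A minimal sketch of this step, assuming raw attention scores of shape (batch, query, key) and the boolean mask convention above (True = padding):

import torch

scores = torch.randn(3, 3, 3)  # (batch, query_pos, key_pos), e.g. Q @ K^T / sqrt(d)
padding_mask = torch.tensor([[False, False, True],
                             [False, False, False],
                             [False, True,  True]])  # (batch, key_pos)

# Broadcast the key-side mask over the query dimension, then
# overwrite masked positions with -inf before the softmax
masked = scores.masked_fill(padding_mask.unsqueeze(1), float("-inf"))
attn = torch.softmax(masked, dim=-1)
# attn[:, :, j] is 0 for every padded key position j

Using -inf rather than a large negative constant guarantees exactly zero attention weight after the softmax, as long as at least one key per row remains unmasked.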
Padding Mask in Self-Attention
For each batch item i:
Find which positions are padding
Set attention scores to those positions = -∞
Apply softmax → 0 attention to padding
Key property: the padding mask is per-sample, so it varies across the batch (unlike the causal mask, which is identical for every sample); see the sketch below.
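A minimal sketch of how the per-sample mask is applied in multi-head self-attention; the reshape to (B, 1, 1, S) broadcasts each sample's mask across all heads and all query positions (shapes and names are illustrative):

import torch

B, H, S, D = 3, 4, 3, 8  # batch, heads, sequence length, head dim
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
padding_mask = torch.tensor([[False, False, True],
                             [False, False, False],
                             [False, True,  True]])  # (B, S), True = pad

scores = q @ k.transpose(-2, -1) / D ** 0.5  # (B, H, S, S)
# (B, S) -> (B, 1, 1, S): one mask per sample, shared by every head
scores = scores.masked_fill(padding_mask[:, None, None, :], float("-inf"))
attn = torch.softmax(scores, dim=-1)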
Different Mask Representations
Boolean Mask
True = ignore (padding)
False = attend
mask = [[False, False, True],   # seq 1: last is pad
        [False, False, False],  # seq 2: no padding
        [False, True,  True]]   # seq 3: last two are pad
Attention Mask (for BERT-style models)
1 = attend (real token)
0 = ignore (padding)
attention_mask = [[1, 1, 0],  # seq 1: last is pad
                  [1, 1, 1],  # seq 2: no padding
                  [1, 0, 0]]  # seq 3: last two are pad
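The two conventions are inverses of each other, so converting between them is a one-liner. As a sketch: Hugging Face-style models take the 0/1 attention_mask (1 = real token), while torch.nn.MultiheadAttention takes a boolean key_padding_mask (True = ignore):

import torch

attention_mask = torch.tensor([[1, 1, 0],
                               [1, 1, 1],
                               [1, 0, 0]])  # 1 = attend

# 0/1 convention -> boolean convention (True = ignore)
key_padding_mask = attention_mask == 0

# boolean convention -> 0/1 convention
attention_mask_again = (~key_padding_mask).long()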
Combined Masks
In the decoder, we often need both causal and padding masks:
combined_mask = causal_mask OR padding_mask
For decoder self-attention:
- Block future positions (causal)
- Block padding positions (padding)
For decoder cross-attention:
- Only apply the padding mask (no causal constraint, since all encoder positions are visible); see the sketch below
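A minimal sketch of building the combined decoder self-attention mask, using the boolean convention (True = blocked):

import torch

S = 3
padding_mask = torch.tensor([[False, False, True],
                             [False, False, False],
                             [False, True,  True]])  # (B, S), True = pad

# Causal mask: True strictly above the diagonal, i.e. future positions
causal_mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)

# (1, S, S) OR (B, 1, S) -> (B, S, S):
# a key position is blocked if it is in the future OR it is padding
combined_mask = causal_mask.unsqueeze(0) | padding_mask.unsqueeze(1)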
Padding Token Considerations
- Learnable vs Fixed: Some models learn padding embeddings, others keep them fixed
- Position in sequence: padding typically goes at the end (right padding), though left padding is common when batching decoder-only models for generation
- Attention must not flow: Even if padding embeddings are learned, we must mask them to prevent meaningless attention