26. Padding Masking

Introduction

Padding masking is a technique for hiding padding tokens from the attention mechanism when batching sequences of variable lengths. To process such a batch, padding tokens are appended so that every sequence reaches the same length; the padding mask tells attention to ignore these artificial tokens.

The Problem

When batching sequences:

Sequence 1: "Hello world"  [len=2]
Sequence 2: "AI is great"  [len=3]
Sequence 3: "NLP"          [len=1]

Batched (padded to len=3):
Seq 1: [Hello, world, PAD]
Seq 2: [AI, is, great]
Seq 3: [NLP, PAD, PAD]

We need to tell attention to ignore the PAD tokens.
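As a minimal sketch of this batching step, the snippet below pads token-id sequences to a common length. The token ids and PAD_ID = 0 are illustrative assumptions, not values from the text.

PAD_ID = 0  # illustrative padding token id

batch = [
    [11, 12],        # "Hello world"  (len 2)
    [21, 22, 23],    # "AI is great"  (len 3)
    [31],            # "NLP"          (len 1)
]

max_len = max(len(seq) for seq in batch)
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

print(padded)  # [[11, 12, 0], [21, 22, 23], [31, 0, 0]]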

Padding Mask Implementation

Create padding mask:
mask[i,j] = True if position j is padding
mask[i,j] = False if position j is real token

In attention scores:
scores[i,j] = scores[i,j] if not padding_mask[i,j]
scores[i,j] = -∞ if padding_mask[i,j]

After softmax, attention to padding ≈ 0
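A minimal PyTorch sketch of this step, assuming raw scores of shape (batch, seq_len, seq_len) and a boolean mask with True marking padding (the function and variable names are illustrative):

import torch

def masked_attention_weights(scores, padding_mask):
    # scores: (batch, seq_len, seq_len) raw attention scores
    # padding_mask: (batch, seq_len) bool, True where the key position is padding
    # Broadcast the key-side mask over the query dimension and fill with -inf.
    scores = scores.masked_fill(padding_mask.unsqueeze(1), float("-inf"))
    # After softmax, masked positions receive ~0 attention weight.
    return torch.softmax(scores, dim=-1)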

Padding Mask in Self-Attention

For each batch item i:
Find which positions are padding
Set the attention scores for those positions to -∞
Apply softmax → 0 attention to padding

Key property: Padding mask is per-sample, varies across batch
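One common way to build such a per-sample mask is from the true sequence lengths. The sketch below is one possible PyTorch implementation, returning a shape that broadcasts over heads and query positions in multi-head attention:

import torch

def padding_mask_from_lengths(lengths, max_len):
    # lengths: (batch,) true sequence lengths
    # Returns a bool mask, True where a key position is padding.
    positions = torch.arange(max_len)                       # (max_len,)
    mask = positions.unsqueeze(0) >= lengths.unsqueeze(1)   # (batch, max_len)
    return mask[:, None, None, :]                           # (batch, 1, 1, max_len)

lengths = torch.tensor([2, 3, 1])
print(padding_mask_from_lengths(lengths, 3).squeeze())
# tensor([[False, False,  True],
#         [False, False, False],
#         [False,  True,  True]])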

Different Mask Representations

Boolean Mask

True = ignore (padding)
False = attend

mask = [[False, False, True],    # seq 1: last is pad
        [False, False, False],   # seq 2: no padding
        [False, True,  True]]    # seq 3: last two are pad

Attention Mask (for BERT-style models)

1 = attend (real token)
0 = ignore (padding)

attention_mask = [[1, 1, 0],
                  [1, 1, 1],
                  [1, 0, 0]]
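The two conventions are easy to convert between. The sketch below (names are illustrative) turns a BERT-style 1/0 mask into the True-means-ignore boolean mask and into an additive mask that can be summed onto the attention scores:

import torch

attention_mask = torch.tensor([[1, 1, 0],
                               [1, 1, 1],
                               [1, 0, 0]])

# True-means-ignore boolean convention
padding_mask = attention_mask == 0

# Additive convention: 0 for real tokens, a very large negative value for padding
additive_mask = (1 - attention_mask).float() * torch.finfo(torch.float32).min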

Combined Masks

In the decoder, we often need both causal and padding masks (see the sketch after the lists below):

combined_mask = causal_mask OR padding_mask

For decoder self-attention:
- Block future positions (causal)
- Block padding positions (padding)

For decoder cross-attention:
- Only apply the padding mask over the encoder's padded positions (no causal mask in cross-attention)
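A sketch of combining the two masks with OR, assuming the boolean True-means-ignore convention used above (the function and variable names are illustrative):

import torch

def combined_decoder_mask(padding_mask):
    # padding_mask: (batch, seq_len) bool, True where the position is padding
    seq_len = padding_mask.size(1)
    # Causal mask: True strictly above the diagonal (future positions are blocked).
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    # A position is blocked if it is in the future OR it is padding.
    return causal.unsqueeze(0) | padding_mask.unsqueeze(1)  # (batch, seq_len, seq_len)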

Padding Token Considerations

The padding token is a real entry in the vocabulary (often id 0), so attention masking alone is not enough: padding positions are usually also excluded from the training loss and skipped when pooling hidden states into a sequence representation.
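For example, PyTorch's cross-entropy loss can skip padding targets via ignore_index; PAD_ID = 0 below is an illustrative assumption:

import torch.nn as nn

PAD_ID = 0  # illustrative padding token id

# Targets equal to PAD_ID contribute nothing to the loss or its gradient.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)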

Test Your Understanding

Question 1: What is the purpose of padding masking?

  • A) Speed up computation
  • B) Ignore artificial padding tokens in variable-length sequences
  • C) Increase model accuracy
  • D) Reduce memory

Question 2: In a batch with sequences of different lengths, where is padding added?

  • A) Beginning of shorter sequences
  • B) End of shorter sequences (typically)
  • C) Random positions
  • D) Not added

Question 3: What value is set for padding positions in the attention mask?

  • A) 0
  • B) 1
  • C) -∞
  • D) ∞

Question 4: A sequence "Hello world" padded to length 5 becomes:

  • A) [PAD, PAD, PAD, Hello, world]
  • B) [Hello, world, PAD, PAD, PAD]
  • C) [Hello, PAD, world, PAD, PAD]
  • D) [PAD, Hello, PAD, world, PAD]

Question 5: The padding mask is typically:

  • A) Same for all items in batch
  • B) Per-sample (varies within batch)
  • C) Only used in encoder
  • D) Always lower triangular

Question 6: In the decoder with both causal and padding masks, we:

  • A) Apply them separately
  • B) Combine using OR logic
  • C) Only use causal
  • D) Only use padding

Question 7: After applying padding mask and softmax, attention to padding tokens is approximately:

  • A) 1 (high)
  • B) 0.5
  • C) 0
  • D) Random

Question 8: Boolean mask with True means:

  • A) Attend to this position
  • B) Ignore this position (padding)
  • C) Use this position for output
  • D) Compute attention normally