Introduction
Self-attention (also called intra-attention) is an attention mechanism where all queries, keys, and values come from the same sequence. This allows each position in the sequence to attend to every position in that same sequence, capturing dependencies and relationships within the sequence itself. It is the foundational mechanism of the Transformer architecture.
Core Concept
In self-attention, the input sequence is transformed into Q, K, V representations, and then each position attends to all positions in the same sequence:
Q = X · W_Q (Query from same sequence)
K = X · W_K (Key from same sequence)
V = X · W_V (Value from same sequence)
output = Attention(Q, K, V)
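A minimal NumPy sketch of this computation (illustrative only: a single head, randomly initialized projection matrices, no positional encoding; the scaled dot-product form is spelled out under Detailed Computation below):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention: Q, K, and V all come from the same input X."""
    Q = X @ W_Q                                # (n, d_k) queries
    K = X @ W_K                                # (n, d_k) keys
    V = X @ W_V                                # (n, d_v) values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n, n) scaled dot-product scores
    # Row-wise softmax: each position's weights over all positions sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # (n, d_v) weighted sum of values

# Toy example: 6 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)  # (6, 8)
```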
Key Properties
- All-to-all: Each position can attend to all positions, including itself
- Order-agnostic: No inherent notion of position (requires positional encoding); see the check after this list
- Parallel: All attention computations are parallelizable
- Long-range: Can capture dependencies regardless of distance
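The order-agnostic property can be checked directly: permuting the input rows permutes the output rows in exactly the same way, so without positional encodings the model cannot distinguish orderings. A small self-contained check (same minimal single-head sketch as above, illustrative dimensions):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

perm = rng.permutation(6)                       # shuffle the token order
out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)
# Same rows, just reordered: self-attention is permutation-equivariant
print(np.allclose(out_shuffled, out[perm]))     # True
```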
Comparison with Other Attention Types
| Aspect | Self-Attention | Encoder-Decoder Attention | Cross-Attention |
|---|---|---|---|
| Q source | Same sequence | Decoder sequence | Sequence A |
| K, V source | Same sequence | Encoder sequence | Sequence B |
| Captures | Within-sequence relationships | Cross-sequence alignment | Cross-sequence relationships |
| Used in | BERT, GPT, ViT | Translation (T5) | Multimodal models |
Detailed Computation
Step 1: Linear Projections
Q = X · W_Q, K = X · W_K, V = X · W_V
Step 2: Attention Scores
eᵢⱼ = (qᵢ · kⱼ) / √dₖ, which creates an attention score matrix E ∈ ℝ^{n×n}
Step 3: Softmax and Weighted Sum
αᵢⱼ = softmax(eᵢⱼ) over j, then outputᵢ = Σⱼ αᵢⱼ · vⱼ
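The three steps map onto a few lines of NumPy almost one-to-one. A sketch with illustrative dimensions (n = 4 positions, d_model = 16, dₖ = dᵥ = 8), printing the shapes of the intermediate score matrix E and the weights α:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, d_k = 4, 16, 8
X = rng.normal(size=(n, d_model))

# Step 1: linear projections of the same sequence X
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: scaled dot-product scores, E has shape (n, n)
E = Q @ K.T / np.sqrt(d_k)

# Step 3: softmax over each row, then weighted sum of the values
alpha = np.exp(E - E.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)
output = alpha @ V

print(E.shape, alpha.shape, output.shape)  # (4, 4) (4, 4) (4, 8)
print(alpha.sum(axis=-1))                  # each row of attention weights sums to 1
```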
Self-Attention in Transformers
The Transformer uses self-attention in both encoder and decoder:
Encoder Self-Attention
Each position in the encoder can attend to all positions of the input sequence.
Decoder Self-Attention (Masked)
Each position can attend only to itself and earlier positions (causal masking).
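A minimal sketch of causal masking under the same scaled dot-product formulation as above: scores for future positions (j > i) are set to −∞ before the softmax, so their weights become exactly zero.

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Masked (decoder-style) self-attention: position i sees only positions j <= i."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    n = scores.shape[0]
    # Causal mask: block attention to future positions (strict upper triangle)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # masked entries contribute 0
    return weights @ V

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(X, W_Q, W_K, W_V)  # shape (5, 8); row i uses only tokens 0..i
```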
Advantages of Self-Attention
- Direct long-range dependencies: Any position can directly attend to any other
- Parallel computation: No sequential dependency like in RNNs
- Better gradient flow: Shorter paths for gradient propagation
- Interpretability: Attention weights show relationships between positions
- Versatility: Works for various data types (text, image, audio)
Disadvantages
- Quadratic complexity: O(n²) memory and computation for sequence length n (see the estimate after this list)
- No inherent position information: Requires positional encoding
- Memory intensive: Storing full attention matrix for long sequences is expensive
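A back-of-the-envelope estimate makes the quadratic growth concrete; the numbers below assume a single float32 attention matrix per head and ignore all other activations:

```python
# Memory for one n x n float32 attention matrix (4 bytes per entry), per head
for n in (512, 2048, 8192, 32768):
    megabytes = n * n * 4 / 2**20
    print(f"n = {n:6d}: {megabytes:10.1f} MiB")
# n =    512:        1.0 MiB
# n =   2048:       16.0 MiB
# n =   8192:      256.0 MiB
# n =  32768:     4096.0 MiB
```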
Visual Example
For the sentence "The cat sat on the mat":
When processing "cat", self-attention can directly connect it to "sat" (the action it performs) and to "mat" (its location), even though "mat" is several words away.
Illustrative attention weights for the query "cat":
[The: 0.05, cat: 0.10, sat: 0.45, on: 0.10, the: 0.05, mat: 0.25]
"sat" gets high weight (verb relationship)
"mat" gets medium weight (location relationship)
Variants
- Multi-head self-attention: Multiple attention heads in parallel (see the sketch after this list)
- Masked self-attention: Prevents attending to future positions
- Cross-attention: Q from one sequence, K and V from another
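For reference, a compact NumPy sketch of multi-head self-attention under the same assumptions as the earlier sketches (illustrative dimensions; W_O is the output projection): the model dimension is split across h heads, each head runs the single-head computation, and the concatenated results are projected back.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention: h independent heads, concatenated and projected."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                     # each (n, d_model)

    def split(M):
        # Split the model dimension into h heads of size d_head -> (h, n, d_head)
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n) per-head scores
    heads = softmax(scores) @ Vh                            # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # (n, d_model)
    return concat @ W_O

rng = np.random.default_rng(3)
n, d_model, h = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (6, 16)
```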