Introduction
Self-attention (also called intra-attention) is an attention mechanism where all queries, keys, and values come from the same sequence. This allows each position in the sequence to attend to every position in that same sequence, capturing dependencies and relationships within the sequence itself. It is the foundational mechanism of the Transformer architecture.
Core Concept
In self-attention, the input sequence is transformed into Q, K, V representations, and then each position attends to all positions in the same sequence:
Q = X · W_Q (Query from same sequence)
K = X · W_K (Key from same sequence)
V = X · W_V (Value from same sequence)
output = Attention(Q, K, V)
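A minimal NumPy sketch of this computation (illustrative only: a single head, randomly initialized projection matrices, no positional encoding; the scaled dot-product form is spelled out under Detailed Computation below):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention: Q, K, and V all come from the same input X."""
    Q = X @ W_Q                                # (n, d_k) queries
    K = X @ W_K                                # (n, d_k) keys
    V = X @ W_V                                # (n, d_v) values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n, n) scaled dot-product scores
    # Row-wise softmax: each position's weights over all positions sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # (n, d_v) weighted sum of values

# Toy example: 6 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)  # (6, 8)
```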
Key Properties
- All-to-all: Each position can attend to all positions, including itself
- Order-agnostic: No inherent notion of position (requires positional encoding); see the check after this list
- Parallel: All attention computations are parallelizable
- Long-range: Can capture dependencies regardless of distance
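The order-agnostic property can be checked directly: permuting the input rows permutes the output rows in exactly the same way, so without positional encodings the model cannot distinguish orderings. A small self-contained check (same minimal single-head sketch as above, illustrative dimensions):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

perm = rng.permutation(6)                       # shuffle the token order
out = self_attention(X, W_Q, W_K, W_V)
out_shuffled = self_attention(X[perm], W_Q, W_K, W_V)
# Same rows, just reordered: self-attention is permutation-equivariant
print(np.allclose(out_shuffled, out[perm]))     # True
```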
Comparison with Other Attention Types
| Aspect | Self-Attention | Encoder-Decoder Attention | Cross-Attention |
|---|---|---|---|
| Q source | Same sequence | Decoder sequence | Sequence A |
| K, V source | Same sequence | Encoder sequence | Sequence B |
| Captures | Within-sequence relationships | Cross-sequence alignment | Cross-sequence relationships |
| Used in | BERT, GPT, ViT | Translation (T5) | Multimodal models |
Detailed Computation
Step 1: Linear Projections
Q = X · W_Q, K = X · W_K, V = X · W_V
Step 2: Attention Scores
eᵢⱼ = (qᵢ · kⱼ) / √dₖ, which creates an attention score matrix E ∈ ℝ^{n×n}
Step 3: Softmax and Weighted Sum
αᵢⱼ = softmax(eᵢⱼ) over j, then outputᵢ = Σⱼ αᵢⱼ · vⱼ
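The three steps map onto a few lines of NumPy almost one-to-one. A sketch with illustrative dimensions (n = 4 positions, d_model = 16, dₖ = dᵥ = 8), printing the shapes of the intermediate score matrix E and the weights α:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, d_k = 4, 16, 8
X = rng.normal(size=(n, d_model))

# Step 1: linear projections of the same sequence X
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: scaled dot-product scores, E has shape (n, n)
E = Q @ K.T / np.sqrt(d_k)

# Step 3: softmax over each row, then weighted sum of the values
alpha = np.exp(E - E.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)
output = alpha @ V

print(E.shape, alpha.shape, output.shape)  # (4, 4) (4, 4) (4, 8)
print(alpha.sum(axis=-1))                  # each row of attention weights sums to 1
```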
Self-Attention in Transformers
The Transformer uses self-attention in both encoder and decoder:
Encoder Self-Attention
Each position in the encoder can attend to all positions of the input sequence.
Decoder Self-Attention (Masked)
Each position can attend only to itself and earlier positions (causal masking).
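A minimal sketch of causal masking under the same scaled dot-product formulation as above: scores for future positions (j > i) are set to −∞ before the softmax, so their weights become exactly zero.

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Masked (decoder-style) self-attention: position i sees only positions j <= i."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    n = scores.shape[0]
    # Causal mask: block attention to future positions (strict upper triangle)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # masked entries contribute 0
    return weights @ V

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(X, W_Q, W_K, W_V)  # shape (5, 8); row i uses only tokens 0..i
```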
Advantages of Self-Attention
- Direct long-range dependencies: Any position can directly attend to any other
- Parallel computation: No sequential dependency like in RNNs
- Better gradient flow: Shorter paths for gradient propagation
- Interpretability: Attention weights show relationships between positions
- Versatility: Works for various data types (text, image, audio)
Disadvantages
- Quadratic complexity: O(n²) memory and computation for sequence length n (see the estimate after this list)
- No inherent position information: Requires positional encoding
- Memory intensive: Storing full attention matrix for long sequences is expensive
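A back-of-the-envelope estimate makes the quadratic growth concrete; the numbers below assume a single float32 attention matrix per head and ignore all other activations:

```python
# Memory for one n x n float32 attention matrix (4 bytes per entry), per head
for n in (512, 2048, 8192, 32768):
    megabytes = n * n * 4 / 2**20
    print(f"n = {n:6d}: {megabytes:10.1f} MiB")
# n =    512:        1.0 MiB
# n =   2048:       16.0 MiB
# n =   8192:      256.0 MiB
# n =  32768:     4096.0 MiB
```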
Visual Example
For the sentence "The cat sat on the mat":
When processing "cat", self-attention can directly connect it to "sat" (the action it performs) and to "mat" (its location), even though "mat" is several words away.
Illustrative attention weights for the query "cat":
[The: 0.05, cat: 0.10, sat: 0.45, on: 0.10, the: 0.05, mat: 0.25]
"sat" gets high weight (verb relationship)
"mat" gets medium weight (location relationship)
Variants
- Multi-head self-attention: Multiple attention heads in parallel (see the sketch after this list)
- Masked self-attention: Prevents attending to future positions
- Cross-attention: Q from one sequence, K and V from another
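For reference, a compact NumPy sketch of multi-head self-attention under the same assumptions as the earlier sketches (illustrative dimensions; W_O is the output projection): the model dimension is split across h heads, each head runs the single-head computation, and the concatenated results are projected back.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention: h independent heads, concatenated and projected."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                     # each (n, d_model)

    def split(M):
        # Split the model dimension into h heads of size d_head -> (h, n, d_head)
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n) per-head scores
    heads = softmax(scores) @ Vh                            # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # (n, d_model)
    return concat @ W_O

rng = np.random.default_rng(3)
n, d_model, h = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (6, 16)
```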