10. Self-Attention

Introduction

Self-attention (also called intra-attention) is an attention mechanism where all queries, keys, and values come from the same sequence. This allows each position in the sequence to attend to all other positions, capturing dependencies and relationships within the sequence itself. It is the foundational mechanism of the Transformer architecture.

Core Concept

In self-attention, the input sequence is transformed into Q, K, V representations, and then each position attends to all positions in the same sequence:

Q = X · W_Q (Query from the same sequence)

K = X · W_K (Key from the same sequence)

V = X · W_V (Value from the same sequence)

output = Attention(Q, K, V)
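
As a concrete illustration, here is a minimal NumPy sketch of these equations. The sequence length, dimensions, and random projection matrices are toy assumptions standing in for learned parameters, not values from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d, d_k = 6, 8, 8                 # toy sizes: sequence length, model dim, key dim
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))         # one input sequence: n positions, d features each
W_Q = rng.normal(size=(d, d_k))     # random stand-ins for the learned projections
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_k))

# Self-attention: Q, K, and V are all computed from the same sequence X
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)     # (n, n): every position scores every position
weights = softmax(scores, axis=-1)  # each row sums to 1
output = weights @ V                # (n, d_k): each position mixes all value vectors

print(output.shape)                 # (6, 8)
```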

Key Properties

  • Q, K, and V are all derived from the same input sequence, so every position can attend directly to every other position.
  • Computing all pairwise scores costs O(n²) time and memory in the sequence length n.
  • The operation has no inherent notion of position, so positional encodings must supply word order.
  • All positions are processed in parallel rather than sequentially as in an RNN.

Comparison with Other Attention Types

| Aspect | Self-Attention | Encoder-Decoder Attention | Cross-Attention |
|---|---|---|---|
| Q source | Same sequence | Decoder sequence | Sequence A |
| K, V source | Same sequence | Encoder sequence | Sequence B |
| Captures | Within-sequence relationships | Cross-sequence alignment | Cross-sequence relationships |
| Used in | BERT, GPT, ViT | Translation (T5) | Multimodal models |
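
To make the contrast in the table concrete, the sketch below calls the same attention module in both modes, using PyTorch's nn.MultiheadAttention; the sequences and sizes are arbitrary toy values.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

a = torch.randn(1, 7, d_model)    # sequence A (e.g. decoder states)
b = torch.randn(1, 12, d_model)   # sequence B (e.g. encoder states)

self_out, _  = attn(a, a, a)      # self-attention: Q, K, V from the same sequence
cross_out, _ = attn(a, b, b)      # cross-attention: Q from A, K and V from B

print(self_out.shape, cross_out.shape)  # both (1, 7, 64): output length follows the queries
```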

Detailed Computation

Step 1: Linear Projections

For each position i:

qᵢ = xᵢ · W_Q,  kᵢ = xᵢ · W_K,  vᵢ = xᵢ · W_V

Step 2: Attention Scores

eᵢⱼ = (qᵢ · kⱼ) / √dₖ

This yields the score matrix E ∈ ℝ^{n×n}, where n is the sequence length.

Step 3: Softmax and Weighted Sum

αᵢⱼ = softmax(eᵢ)ⱼ

outputᵢ = Σⱼ αᵢⱼ · vⱼ
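
A NumPy sketch that follows Steps 1–3 literally, with explicit loops over i and j so that eᵢⱼ, αᵢⱼ, and outputᵢ are easy to trace; the dimensions and random weights are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 8
X = rng.normal(size=(n, d))                       # input sequence, one row per position
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

# Step 1: linear projections q_i, k_i, v_i for every position i
q = X @ W_Q
k = X @ W_K
v = X @ W_V

# Step 2: scaled dot-product scores e_ij = (q_i · k_j) / sqrt(d_k)
E = np.empty((n, n))
for i in range(n):
    for j in range(n):
        E[i, j] = q[i] @ k[j] / np.sqrt(d_k)

# Step 3: softmax over j, then the weighted sum of value vectors
alpha = np.exp(E - E.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)         # each row of alpha sums to 1
output = alpha @ v                                # output_i = sum_j alpha_ij * v_j

print(E.shape, alpha.shape, output.shape)         # (4, 4) (4, 4) (4, 8)
```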

Self-Attention in Transformers

The Transformer uses self-attention in both encoder and decoder:

Encoder Self-Attention

Each position in the encoder can attend to all positions in the encoder's input sequence:

Sublayer output = LayerNorm(x + MultiHead(x, x, x))
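
A minimal sketch of this sublayer, assuming PyTorch's nn.MultiheadAttention with batch-first toy inputs; the post-norm ordering mirrors the formula above.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 10, d_model)   # (batch, sequence length, features), toy values

# Encoder self-attention sublayer: queries, keys, and values are all x
attn_out, _ = attn(x, x, x)
out = norm(x + attn_out)          # residual connection followed by LayerNorm

print(out.shape)                  # torch.Size([2, 10, 64])
```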

Decoder Self-Attention (Masked)

Each position can only attend to itself and earlier positions (causal masking):

Sublayer output = LayerNorm(x + MaskedMultiHead(x, x, x))
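
The same sketch with a causal mask added. With nn.MultiheadAttention, a boolean attn_mask entry of True marks a pair the query may not attend to, which is how the "no future positions" rule is enforced here.

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 10
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

x = torch.randn(2, seq_len, d_model)

# Causal mask: True marks pairs a query may NOT attend to,
# i.e. everything strictly after the current position.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

attn_out, weights = attn(x, x, x, attn_mask=causal_mask)
out = norm(x + attn_out)          # residual connection followed by LayerNorm

# The averaged attention weights are zero above the diagonal: no looking ahead.
print(torch.allclose(weights.triu(1), torch.zeros_like(weights)))  # True
```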

Advantages of Self-Attention

  • Direct connections between any pair of positions, so long-range dependencies can be modeled regardless of distance.
  • All positions are computed in parallel, unlike the step-by-step recurrence of an RNN.
  • Gradient paths between distant positions are short, which eases training.

Disadvantages

  • Quadratic O(n²) cost in sequence length, which becomes expensive for long inputs.
  • No inherent notion of word order, so positional encodings have to be added.
  • The n × n attention matrix can dominate memory usage for long sequences.

Visual Example

For sentence "The cat sat on the mat":

When processing "cat", self-attention can directly connect it to "sat" (the action it performs) and to "mat" (where that action happens), even when the related words are several positions apart.

Illustrative attention weights from "cat" to all positions:

[The: 0.05, cat: 0.10, sat: 0.45, on: 0.10, the: 0.05, mat: 0.25]

"sat" gets high weight (verb relationship)

"mat" gets medium weight (location relationship)

Variants

The decoder-side masked (causal) self-attention described above is the most common variant; in practice, self-attention is almost always applied in its multi-head form, as in MultiHead(x, x, x).

Test Your Understanding

Question 1: In self-attention, where do Q, K, V come from?

  • A) Q from encoder, K,V from decoder
  • B) All from the same sequence
  • C) All from different sequences
  • D) From external memory

Question 2: What property makes self-attention different from cross-attention?

  • A) Uses more parameters
  • B) Q, K, V come from the same sequence
  • C) Is faster
  • D) Uses residual connections

Question 3: What is the computational complexity of self-attention?

  • A) O(n) linear
  • B) O(n²) quadratic
  • C) O(log n)
  • D) O(1)

Question 4: Why does self-attention need positional encoding?

  • A) To increase parameters
  • B) Self-attention has no inherent notion of position
  • C) To make it faster
  • D) To reduce memory

Question 5: In the example with "cat", why might "sat" get high attention weight?

  • A) They are adjacent
  • B) "cat" is the subject performing "sat"
  • C) They are at the same position
  • D) Random chance

Question 6: What advantage does self-attention have over RNNs?

  • A) Lower memory
  • B) Better parallelization and shorter gradient paths
  • C) Uses fewer parameters
  • D) More interpretable

Question 7: What is masked self-attention used for?

  • A) Image processing
  • B) Preventing attending to future positions in decoders
  • C) Speeding up computation
  • D) Reducing parameters

Question 8: Self-attention allows direct connections between any positions, enabling:

  • A) Slower training
  • B) Long-range dependency modeling regardless of distance
  • C) Sequential processing
  • D) Higher memory usage only