18. Transformer Encoder

Introduction

The Transformer encoder is the encoder half of the original Transformer architecture, introduced in "Attention is All You Need" (2017). It processes the input sequence and creates a continuous representation that captures contextual information about each position. The encoder is bidirectional, meaning each position can attend to all other positions.

Architecture Overview

Input → Token Embeddings + Positional Encoding
                 │
                 ▼
┌─────────────────────────────────┐
│          ENCODER BLOCK          │
│  ┌───────────────────────────┐  │
│  │ Multi-Head Self-Attention │  │
│  │  (no mask - full access)  │  │
│  └───────────────────────────┘  │
│                │                │
│                ▼                │
│  ┌───────────────────────────┐  │
│  │        Add & Norm         │  │
│  │   (Residual Connection)   │  │
│  └───────────────────────────┘  │
│                │                │
│                ▼                │
│  ┌───────────────────────────┐  │
│  │       Feed-Forward        │  │
│  │          Network          │  │
│  └───────────────────────────┘  │
│                │                │
│                ▼                │
│  ┌───────────────────────────┐  │
│  │        Add & Norm         │  │
│  │   (Residual Connection)   │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘
                 │
                 ▼
         (repeat N times)
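The embedding step at the top of the diagram can be sketched in a few lines of PyTorch. This is a minimal illustration, not the only way to do it: the vocabulary size, batch size, sequence length, and d_model are made-up example values, and sinusoidal_positional_encoding is a hypothetical helper implementing the sine/cosine scheme from the original paper.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sine/cosine positional encodings as described in the original paper."""
    position = torch.arange(seq_len).unsqueeze(1)                        # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                         # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                         # odd dimensions
    return pe

# Illustrative values: vocab size 30000, batch of 2, sequence length 10, d_model 512.
token_ids = torch.randint(0, 30000, (2, 10))
embed = nn.Embedding(30000, 512)
x = embed(token_ids) + sinusoidal_positional_encoding(10, 512)   # encoder input, (2, 10, 512)
```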

Encoder Block Components

1. Multi-Head Self-Attention

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where d_k is the dimension of the key vectors.

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · W^O

headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)

where Wᵢ^Q, Wᵢ^K, Wᵢ^V are the per-head projection matrices and W^O is the output projection.

No masking is applied, allowing full bidirectional attention.
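As a rough sketch, the attention formula above can be written directly in PyTorch. The function below is illustrative, and the final lines show how PyTorch's built-in nn.MultiheadAttention could be used for an unmasked self-attention pass; the shapes and hyperparameters are example values.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, with no mask (full bidirectional access)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# PyTorch also ships a multi-head module; an unmasked self-attention pass might look like:
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)                 # (batch, seq_len, d_model)
out, attn_weights = mha(x, x, x)            # Q = K = V = x, no attn_mask supplied
```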

2. Residual Connection & Layer Normalization

x_layer = LayerNorm(x + SubLayer(x))

Each sub-layer (attention, FFN) is wrapped in a residual connection followed by layer normalization; the residual path lets gradients flow directly through the stack, which makes deeper networks trainable.
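One way to express this wiring is a small wrapper module. PostNormResidual is a hypothetical name for the sketch below; the dropout on the sub-layer output mirrors the training setup described in the original paper.

```python
import torch
import torch.nn as nn

class PostNormResidual(nn.Module):
    """Hypothetical wrapper: output = LayerNorm(x + SubLayer(x)), the post-norm arrangement."""
    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual path (x) plus transformed path, then layer normalization.
        return self.norm(x + self.dropout(self.sublayer(x)))
```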

3. Feed-Forward Network

FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂

Typically: d_model → d_ff → d_model

d_ff is usually 4× d_model (e.g., 2048 for d_model=512)
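The two linear layers and the ReLU map directly onto the formula above. The sketch below uses the original paper's base sizes (d_model = 512, d_ff = 2048) as illustrative defaults.

```python
import torch.nn as nn

def feed_forward(d_model: int = 512, d_ff: int = 2048) -> nn.Sequential:
    """Position-wise FFN: d_model -> d_ff -> d_model, matching FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂."""
    return nn.Sequential(
        nn.Linear(d_model, d_ff),   # x·W₁ + b₁  (expand to d_ff)
        nn.ReLU(),                  # max(0, ·)
        nn.Linear(d_ff, d_model),   # ·W₂ + b₂   (project back to d_model)
    )
```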

Full Encoder Stack

N identical layers (6 in original Transformer)

Each layer has:

  • Multi-head self-attention
  • Feed-forward network

Output: contextualized representations for each input position
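If PyTorch is available, the whole stack can be assembled from its built-in encoder modules; the hyperparameters below simply mirror the original base configuration and are not the only valid choice.

```python
import torch
import torch.nn as nn

# One encoder block: self-attention + FFN, each followed by Add & Norm (post-norm by default).
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048,
                                   dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # N = 6 identical layers

x = torch.randn(2, 10, 512)   # embeddings + positional encoding: (batch, seq_len, d_model)
out = encoder(x)              # contextualized representation per position: (2, 10, 512)
```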

Key Properties

  • Bidirectional: every position can attend to every other position (no causal mask)
  • Parallel: all positions are processed simultaneously, unlike a recurrent encoder
  • Output: one contextualized vector per input position, not a single summary vector

Used In

  • Encoder-only models such as BERT, RoBERTa, and DistilBERT
  • The encoder half of encoder-decoder models such as the original Transformer, T5, and BART

BERT's Use of Encoder

BERT uses a stack of bidirectional encoder layers to build contextual representations:

Input: [CLS] The cat sat [MASK] the mat [SEP]

Encoder output: contextual embeddings for each token

The [CLS] token's final representation is typically used for classification tasks
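With the Hugging Face transformers library (assuming it and the bert-base-uncased checkpoint are available), the per-token outputs and the [CLS] vector can be inspected roughly like this:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The tokenizer adds [CLS] and [SEP] automatically; [MASK] is inserted here by hand.
inputs = tokenizer("The cat sat [MASK] the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state   # (1, seq_len, 768): one contextual vector per token
cls_vector = hidden[:, 0]            # the [CLS] position, commonly fed to a classification head
```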

Test Your Understanding

Question 1: What type of attention does the Transformer encoder use?

  • A) Masked self-attention
  • B) Unmasked bidirectional self-attention
  • C) Cross-attention
  • D) No attention

Question 2: How many identical layers does the original Transformer encoder have?

  • A) 1
  • B) 6
  • C) 12
  • D) 24

Question 3: What does the feed-forward network typically expand to?

  • A) d_model / 4
  • B) d_model
  • C) 4 × d_model
  • D) d_model²

Question 4: What is the purpose of the residual connection?

  • A) To reduce parameters
  • B) To allow gradient flow and enable deeper networks
  • C) To add positional information
  • D) To mask attention

Question 5: Which model uses encoder-only architecture?

  • A) GPT-2
  • B) T5
  • C) BERT
  • D) BART

Question 6: The encoder output is:

  • A) Single vector
  • B) Contextualized representations for each position
  • C) Attention weights only
  • D) Logits for vocabulary

Question 7: Where is LayerNorm applied in the original Transformer encoder?

  • A) Before sub-layer
  • B) After sub-layer with residual (post-norm)
  • C) Only after attention
  • D) Only after FFN

Question 8: What makes the encoder bidirectional?

  • A) Two separate layers
  • B) No causal mask applied
  • C) Bidirectional RNN
  • D) Input is reversed