Introduction
The Transformer encoder is the encoder half of the original Transformer architecture, introduced in "Attention is All You Need" (2017). It processes the input sequence and creates a continuous representation that captures contextual information about each position. The encoder is bidirectional, meaning each position can attend to all other positions.
Architecture Overview
Input Token Embeddings + Positional Encoding
                │
                ▼
┌───────────────────────────────┐
│         ENCODER BLOCK         │
│  ┌─────────────────────────┐  │
│  │Multi-Head Self-Attention│  │
│  │ (no mask - full access) │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  │  (Residual Connection)  │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │      Feed-Forward       │  │
│  │         Network         │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  │  (Residual Connection)  │  │
│  └─────────────────────────┘  │
└───────────────────────────────┘
                │
                ▼  (repeat N times)
Encoder Block Components
1. Multi-Head Self-Attention
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · Wᴼ
headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)
where dₖ = d_model / h is the per-head dimension.
No masking is applied, allowing full bidirectional attention.
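Below is a minimal PyTorch sketch of unmasked scaled dot-product self-attention; the function name, shapes, and toy dimensions are illustrative assumptions, not from the paper:

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k). No mask is applied, so every
    # position attends to every other position (bidirectional).
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V                                 # (batch, seq, d_k)

# Toy usage: 2 sequences, 5 tokens, d_k = 8.
# Self-attention means Q, K, and V all come from the same input.
x = torch.randn(2, 5, 8)
out = scaled_dot_product_attention(x, x, x)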
2. Residual Connection & Layer Normalization
x_layer = LayerNorm(x + SubLayer(x))
Each sub-layer (attention, FFN) has a residual connection around it.
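As a minimal PyTorch sketch of this post-norm pattern (the AddNorm name and the single-argument sublayer interface are illustrative assumptions):

import torch.nn as nn

class AddNorm(nn.Module):
    # Wraps a sub-layer with a residual connection followed by
    # LayerNorm: the "post-norm" arrangement of the original paper.
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))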
3. Feed-Forward Network
FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂
Typically: d_model → d_ff → d_model
d_ff is usually 4× d_model (e.g., 2048 for d_model=512)
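A direct sketch of the FFN equation in PyTorch, using the base-model sizes quoted above (d_model = 512, d_ff = 2048); the class name is illustrative:

import torch.nn as nn

class FeedForward(nn.Module):
    # FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied position-wise:
    # the same two linear maps are used at every sequence position.
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # d_model -> d_ff
        self.w2 = nn.Linear(d_ff, d_model)   # d_ff -> d_model
        self.relu = nn.ReLU()                # the max(0, ·) nonlinearity

    def forward(self, x):
        return self.w2(self.relu(self.w1(x)))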
Full Encoder Stack
N identical layers (6 in original Transformer)
Each layer has:
- Multi-head self-attention
- Feed-forward network
Output: contextualized representations for each input position
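Putting the pieces together, one plausible sketch of the block and the N-layer stack in PyTorch (dropout omitted for brevity; nn.MultiheadAttention handles the head splitting internally):

import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: self-attention (no attn_mask -> bidirectional) + Add & Norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: feed-forward + Add & Norm
        return self.norm2(x + self.ffn(x))

class Encoder(nn.Module):
    def __init__(self, n_layers=6, **kwargs):
        super().__init__()
        self.layers = nn.ModuleList(
            [EncoderBlock(**kwargs) for _ in range(n_layers)])

    def forward(self, x):
        for layer in self.layers:   # N identical layers
            x = layer(x)
        return x                    # one contextualized vector per position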
Key Properties
- Bidirectional: Every position attends to all positions
- Parallel processing: All positions processed simultaneously
- Order-invariant by default: self-attention ignores token order, so position information must be injected via the positional encoding (see the sketch after this list)
- Layer stacking: Stacking layers builds hierarchical representations
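A minimal sketch of the sinusoidal positional encoding from the original paper, assuming an even d_model:

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len).unsqueeze(1).float()      # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))     # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe  # added to the token embeddings before the first block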
Used In
- BERT: Encoder-only for language understanding
- RoBERTa: A robustly optimized BERT variant, also encoder-only
- ViT: Vision Transformer, which applies the encoder to image patches
- T5 encoder: Part of encoder-decoder T5
BERT's Use of Encoder
BERT uses the bidirectional encoder to build contextual representations:
Input: [CLS] The cat sat [MASK] the mat [SEP]
Encoder output: contextual embeddings for each token
[CLS] token used for classification tasks
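For illustration, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, the per-token encoder outputs and the [CLS] vector can be read off like this:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state  # (1, seq_len, 768): one vector per token
cls_vector = token_embeddings[:, 0]           # [CLS] embedding, used for classification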