Introduction
The Transformer encoder is the encoder half of the original Transformer architecture, introduced in "Attention is All You Need" (2017). It processes the input sequence and creates a continuous representation that captures contextual information about each position. The encoder is bidirectional, meaning each position can attend to all other positions.
Architecture Overview
Input Token Embeddings + Positional Encoding
                │
                ▼
┌───────────────────────────────┐
│         ENCODER BLOCK         │
│  ┌─────────────────────────┐  │
│  │Multi-Head Self-Attention│  │
│  │ (no mask - full access) │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  │  (Residual Connection)  │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │      Feed-Forward       │  │
│  │         Network         │  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  ┌─────────────────────────┐  │
│  │       Add & Norm        │  │
│  │  (Residual Connection)  │  │
│  └─────────────────────────┘  │
└───────────────────────────────┘
                │
                ▼  (repeat N times)
Encoder Block Components
1. Multi-Head Self-Attention
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · Wᴼ
headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)
where dₖ = d_model / h is the per-head dimension.
No masking is applied, allowing full bidirectional attention.
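Below is a minimal PyTorch sketch of unmasked scaled dot-product self-attention; the function name, shapes, and toy dimensions are illustrative assumptions, not from the paper:

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k). No mask is applied, so every
    # position attends to every other position (bidirectional).
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V                                 # (batch, seq, d_k)

# Toy usage: 2 sequences, 5 tokens, d_k = 8.
# Self-attention means Q, K, and V all come from the same input.
x = torch.randn(2, 5, 8)
out = scaled_dot_product_attention(x, x, x)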
2. Residual Connection & Layer Normalization
x_layer = LayerNorm(x + SubLayer(x))
Each sub-layer (attention, FFN) has a residual connection around it.
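As a minimal PyTorch sketch of this post-norm pattern (the AddNorm name and the single-argument sublayer interface are illustrative assumptions):

import torch.nn as nn

class AddNorm(nn.Module):
    # Wraps a sub-layer with a residual connection followed by
    # LayerNorm: the "post-norm" arrangement of the original paper.
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))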
3. Feed-Forward Network
FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂
Typically: d_model → d_ff → d_model
d_ff is usually 4× d_model (e.g., 2048 for d_model=512)
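A direct sketch of the FFN equation in PyTorch, using the base-model sizes quoted above (d_model = 512, d_ff = 2048); the class name is illustrative:

import torch.nn as nn

class FeedForward(nn.Module):
    # FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied position-wise:
    # the same two linear maps are used at every sequence position.
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # d_model -> d_ff
        self.w2 = nn.Linear(d_ff, d_model)   # d_ff -> d_model
        self.relu = nn.ReLU()                # the max(0, ·) nonlinearity

    def forward(self, x):
        return self.w2(self.relu(self.w1(x)))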
Full Encoder Stack
N identical layers (6 in original Transformer)
Each layer has:
- Multi-head self-attention
- Feed-forward network
Output: contextualized representations for each input position
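Putting the pieces together, one plausible sketch of the block and the N-layer stack in PyTorch (dropout omitted for brevity; nn.MultiheadAttention handles the head splitting internally):

import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: self-attention (no attn_mask -> bidirectional) + Add & Norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: feed-forward + Add & Norm
        return self.norm2(x + self.ffn(x))

class Encoder(nn.Module):
    def __init__(self, n_layers=6, **kwargs):
        super().__init__()
        self.layers = nn.ModuleList(
            [EncoderBlock(**kwargs) for _ in range(n_layers)])

    def forward(self, x):
        for layer in self.layers:   # N identical layers
            x = layer(x)
        return x                    # one contextualized vector per position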
Key Properties
- Bidirectional: Every position attends to all positions
- Parallel processing: All positions processed simultaneously
- Order-invariant by default: self-attention ignores token order, so position information must be injected via the positional encoding (see the sketch after this list)
- Layer stacking: Stacking layers builds hierarchical representations
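A minimal sketch of the sinusoidal positional encoding from the original paper, assuming an even d_model:

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len).unsqueeze(1).float()      # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))     # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe  # added to the token embeddings before the first block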
Used In
- BERT: Encoder-only for language understanding
- RoBERTa: A robustly optimized BERT variant, also encoder-only
- ViT: Vision Transformer, which applies the encoder to image patches
- T5 encoder: Part of encoder-decoder T5
BERT's Use of Encoder
BERT uses the bidirectional encoder to build contextual representations:
Input: [CLS] The cat sat [MASK] the mat [SEP]
Encoder output: contextual embeddings for each token
[CLS] token used for classification tasks
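For illustration, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, the per-token encoder outputs and the [CLS] vector can be read off like this:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state  # (1, seq_len, 768): one vector per token
cls_vector = token_embeddings[:, 0]           # [CLS] embedding, used for classification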