Introduction
Graph Attention Networks (GAT) apply attention mechanisms to graph-structured data. Instead of attending over tokens in a sequence, a GAT layer operates on nodes in a graph: attention coefficients, computed from node features, determine how much each node attends to each of its neighbors.
Graph Notation
Graph G = (V, E)
Nodes V = {v₁, v₂, ..., vₙ}
Edges E = connections between nodes
Node features: h = {h₁, h₂, ..., hₙ}
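The notation above can be made concrete with a small toy graph; the specific node count, edge list, and feature dimension below are illustrative assumptions, and self-loops are added so each node can also attend to itself:

```python
import numpy as np

# Toy graph (illustrative): 4 nodes, undirected edges,
# and a 2-dimensional feature vector per node.
n_nodes = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # E, as (v_i, v_j) pairs
h = np.random.randn(n_nodes, 2)            # node features h_1 .. h_n

# Adjacency matrix with self-loops, so each node attends to itself too
A = np.eye(n_nodes)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0                # undirected: symmetric
```

The adjacency matrix `A` is what restricts attention to neighbors in the layer equations that follow.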
GAT Layer
For node i attending to a neighbor j ∈ 𝒩(i) (the neighborhood of i, usually including i itself):
eᵢⱼ = LeakyReLU(aᵀ[W·hᵢ || W·hⱼ])
αᵢⱼ = softmaxⱼ(eᵢⱼ) = exp(eᵢⱼ) / Σₖ∈𝒩(i) exp(eᵢₖ)
h'ᵢ = σ(Σⱼ∈𝒩(i) αᵢⱼ · W·hⱼ)
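A single-head GAT layer can be sketched in NumPy directly from these equations. This is a minimal sketch, not a reference implementation: the function name, the choice of ELU for σ, and the LeakyReLU slope of 0.2 are assumptions, and `A` is assumed to contain self-loops so every softmax row has at least one valid entry:

```python
import numpy as np

def gat_layer(h, A, W, a, slope=0.2):
    """Single-head GAT layer sketch (names and sigma = ELU are assumptions).

    h: (n, f_in) node features
    A: (n, n) adjacency with self-loops
    W: (f_in, f_out) shared linear transform
    a: (2 * f_out,) attention vector
    """
    n = h.shape[0]
    Wh = h @ W                                   # W·h_i for all nodes at once
    # e_ij = LeakyReLU(aᵀ [W·h_i || W·h_j])
    e = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e[i, j] = np.concatenate([Wh[i], Wh[j]]) @ a
    e = np.where(e > 0, e, slope * e)            # LeakyReLU
    # softmax over neighbors only: mask non-edges with -inf
    e = np.where(A > 0, e, -np.inf)
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # alpha_ij, rows sum to 1
    out = att @ Wh                               # sum_j alpha_ij · W·h_j
    return np.where(out > 0, out, np.exp(out) - 1)  # sigma = ELU
```

Masking non-edges with `-inf` before the softmax is what makes attention sparse: αᵢⱼ is exactly zero wherever there is no edge.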
Key Difference from Transformer
| Transformer | GAT |
|---|---|
| Full sequence (all-to-all) | Graph edges (neighbors only) |
| Positional encoding | No position (graph structure is position) |
| Sequence length often fixed (padded to a maximum) | Variable node count and degree |
Multi-Head GAT
For K attention heads:
h'ᵢ = ||ₖ₌₁ᴷ σ(Σⱼ∈𝒩(i) αᵢⱼᵏ · Wᵏ·hⱼ)
|| denotes concatenation across the K heads; on a final prediction layer, the heads are typically averaged instead of concatenated.
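A self-contained multi-head sketch, under the same assumptions as before (σ = ELU, LeakyReLU slope 0.2, adjacency with self-loops; the function and parameter names are hypothetical). It uses the standard decomposition aᵀ[x || y] = a₁ᵀx + a₂ᵀy to compute all eᵢⱼ by broadcasting:

```python
import numpy as np

def gat_multi_head(h, A, Ws, attn_vecs, concat=True):
    """Multi-head GAT sketch: one (W, a) pair per head.

    concat=True gives h'_i = ||_k sigma(...); concat=False averages
    the heads, as is typical for a final output layer.
    """
    outs = []
    for W, a in zip(Ws, attn_vecs):
        Wh = h @ W
        f = Wh.shape[1]
        # e_ij = aᵀ[Wh_i || Wh_j] = a[:f]·Wh_i + a[f:]·Wh_j (broadcast)
        e = (Wh @ a[:f])[:, None] + (Wh @ a[f:])[None, :]
        e = np.where(e > 0, e, 0.2 * e)              # LeakyReLU
        e = np.where(A > 0, e, -np.inf)              # neighbors only
        att = np.exp(e - e.max(axis=1, keepdims=True))
        att = att / att.sum(axis=1, keepdims=True)   # alpha_ij^k
        z = att @ Wh
        outs.append(np.where(z > 0, z, np.exp(z) - 1))  # sigma = ELU
    return np.concatenate(outs, axis=1) if concat else np.mean(outs, axis=0)
```

With K heads of output size F', concatenation yields features of size K·F', while averaging keeps the size at F'.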
Applications
- Citation networks: Paper classification
- Social networks: User modeling
- Molecular graphs: Drug property prediction
- Knowledge graphs: Entity classification