64. Graph Attention Networks (GAT)

Introduction

Graph Attention Networks (GAT; Veličković et al., 2018) apply attention mechanisms to graph-structured data. Instead of tokens in a sequence, GAT operates on nodes in a graph, where attention determines how strongly each node attends to its neighbors when updating its representation.

Graph Notation

Graph G = (V, E)
Nodes V = {v₁, v₂, ..., vₙ}
Edges E ⊆ V × V (connections between nodes)

Node features: h = {h₁, h₂, ..., hₙ}, one feature vector hᵢ per node
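
To make the notation concrete, here is a minimal sketch of a four-node graph stored as an adjacency matrix plus a node-feature matrix (PyTorch is assumed; all values are illustrative):

```python
import torch

# V = {v1, v2, v3, v4}; undirected edges E = {(1,2), (2,3), (3,4), (1,3)}.
# Self-loops are included on the diagonal, since each node should also
# attend to itself during aggregation.
adj = torch.tensor([
    [1., 1., 1., 0.],   # v1 connects to: itself, v2, v3
    [1., 1., 1., 0.],   # v2 connects to: v1, itself, v3
    [1., 1., 1., 1.],   # v3 connects to: v1, v2, itself, v4
    [0., 0., 1., 1.],   # v4 connects to: v3, itself
])

# h = {h1, h2, h3, h4}: one 8-dimensional feature vector per node
h = torch.randn(4, 8)
```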

GAT Layer

For node i attending to a neighbor j ∈ N(i):

eᵢⱼ = LeakyReLU(aᵀ[W·hᵢ || W·hⱼ])

αᵢⱼ = softmaxⱼ(eᵢⱼ) = exp(eᵢⱼ) / Σₖ exp(eᵢₖ), normalized over k ∈ N(i)

h'ᵢ = σ(Σⱼ αᵢⱼ · W·hⱼ), summed over j ∈ N(i)

where W is a shared learnable weight matrix, a is a learnable attention vector, || denotes concatenation, σ is a nonlinearity (ELU in the original paper), and N(i) is the neighborhood of node i, typically including i itself via a self-loop.
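
The three equations map almost line-for-line onto code. Below is a minimal single-head sketch in PyTorch; the names GATLayer, in_dim, and out_dim are illustrative rather than a library API, and the adjacency matrix is assumed to contain self-loops so every row of the softmax is well-defined:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.out_dim = out_dim
        self.W = nn.Linear(in_dim, out_dim, bias=False)     # shared weight matrix W
        self.a = nn.Parameter(torch.empty(2 * out_dim, 1))  # attention vector a
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, adj):
        # h: (N, in_dim) node features; adj: (N, N) adjacency with self-loops
        Wh = self.W(h)                                      # W·h_i for every node, (N, out_dim)
        # Split aᵀ[W·h_i || W·h_j] into a_srcᵀ·W·h_i + a_dstᵀ·W·h_j
        e_src = Wh @ self.a[: self.out_dim]                 # (N, 1)
        e_dst = Wh @ self.a[self.out_dim :]                 # (N, 1)
        e = F.leaky_relu(e_src + e_dst.T, negative_slope=0.2)  # e_ij, (N, N)
        # Restrict attention to graph neighbors: softmax only over j ∈ N(i)
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)                     # α_ij
        return F.elu(alpha @ Wh)                            # h'_i = σ(Σ_j α_ij · W·h_j)
```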

Key Difference from Transformer

Transformer                     | GAT
--------------------------------|-------------------------------------------
Full sequence (all-to-all)      | Graph edges (neighbors only)
Positional encoding             | No positional encoding (graph structure is the position)
Fixed maximum sequence length   | Variable node count
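
One way to see the relationship, using the hypothetical GATLayer sketch above: an all-ones adjacency matrix lets every node attend to every other node, recovering transformer-style all-to-all attention (minus positional encoding), while a sparse adjacency restricts attention to graph neighbors:

```python
import torch

N, d = 4, 8
h = torch.randn(N, d)
layer = GATLayer(d, d)                      # from the sketch above

full = torch.ones(N, N)                     # all-to-all, transformer-like
ring = (torch.eye(N)                        # self-loops
        + torch.roll(torch.eye(N), 1, 0)    # edge to the next node
        + torch.roll(torch.eye(N), -1, 0))  # edge to the previous node

print(layer(h, full).shape)                 # torch.Size([4, 8])
print(layer(h, ring).shape)                 # torch.Size([4, 8])
```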

Multi-Head GAT

For K attention heads (k = 1, ..., K):

h'ᵢ = ||ₖ₌₁ᴷ σ(Σⱼ αᵢⱼᵏ · Wᵏ·hⱼ)

where || denotes concatenation across heads, and each head k has its own weight matrix Wᵏ and attention coefficients αᵢⱼᵏ. (On the final layer, the original paper averages the heads instead of concatenating them.)
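
A sketch of the multi-head version, reusing the hypothetical GATLayer from above: K independent heads run in parallel and their outputs are concatenated along the feature dimension, so the output width is K·out_dim:

```python
import torch
import torch.nn as nn

class MultiHeadGAT(nn.Module):
    def __init__(self, in_dim, out_dim, num_heads):
        super().__init__()
        # K independent heads, each with its own Wᵏ and aᵏ
        self.heads = nn.ModuleList(GATLayer(in_dim, out_dim)
                                   for _ in range(num_heads))

    def forward(self, h, adj):
        # ||ₖ: concatenate per-head outputs -> (N, num_heads * out_dim)
        return torch.cat([head(h, adj) for head in self.heads], dim=-1)
```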

Applications

GAT is applied to graph-structured data, for example:

  • Social networks (node classification, link prediction)
  • Molecules (molecular property prediction)
  • Knowledge graphs
  • Citation networks (Cora, Citeseer, and Pubmed in the original paper)

Test Your Understanding

Question 1: GAT operates on:

  • A) Sequences
  • B) Graphs (nodes and edges)
  • C) Images
  • D) Audio

Question 2: In GAT, attention is computed between:

  • A) All nodes
  • B) Only neighboring nodes (connected by edges)
  • C) Random nodes
  • D) All-to-all like transformer

Question 3: GAT does NOT need:

  • A) Graph structure
  • B) Positional encoding (graph structure IS position)
  • C) Node features
  • D) Attention

Question 4: Node features in GAT are:

  • A) Fixed
  • B) Transformed via W matrix and aggregated by attention
  • C) Ignored
  • D) Random

Question 5: GAT is used for:

  • A) Text classification only
  • B) Graph-structured data (social networks, molecules, knowledge graphs)
  • C) Image classification only
  • D) Speech recognition

Question 6: Multi-head GAT concatenates outputs from:

  • A) All layers
  • B) Multiple attention heads
  • C) No heads
  • D) Single node

Question 7: The key difference from the transformer is:

  • A) Same architecture
  • B) GAT uses graph edges to limit attention; the transformer attends all-to-all
  • C) Transformer uses edges
  • D) No difference

Question 8: eᵢⱼ = LeakyReLU(aᵀ[W·hᵢ || W·hⱼ]) computes:

  • A) Node feature
  • B) Attention score between nodes i and j
  • C) Edge weight
  • D) Graph property