64. Graph Attention Networks (GAT)

Introduction

Graph Attention Networks (GAT; Veličković et al., 2018) apply attention mechanisms to graph-structured data. Instead of tokens in a sequence, GAT operates on nodes in a graph, where attention determines how strongly each node attends to its neighbors when updating its representation.

Graph Notation

Graph G = (V, E)
Nodes V = {v₁, v₂, ..., vₙ}
Edges E ⊆ V × V (connections between nodes)

Node features: h = {h₁, h₂, ..., hₙ}, one feature vector hᵢ per node
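
To make the notation concrete, here is a minimal sketch of a four-node graph stored as an adjacency matrix plus a node-feature matrix (PyTorch is assumed; all values are illustrative):

```python
import torch

# V = {v1, v2, v3, v4}; undirected edges E = {(1,2), (2,3), (3,4), (1,3)}.
# Self-loops are included on the diagonal, since each node should also
# attend to itself during aggregation.
adj = torch.tensor([
    [1., 1., 1., 0.],   # v1 connects to: itself, v2, v3
    [1., 1., 1., 0.],   # v2 connects to: v1, itself, v3
    [1., 1., 1., 1.],   # v3 connects to: v1, v2, itself, v4
    [0., 0., 1., 1.],   # v4 connects to: v3, itself
])

# h = {h1, h2, h3, h4}: one 8-dimensional feature vector per node
h = torch.randn(4, 8)
```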

GAT Layer

For node i attending to a neighbor j ∈ N(i):

eᵢⱼ = LeakyReLU(aᵀ[W·hᵢ || W·hⱼ])

αᵢⱼ = softmaxⱼ(eᵢⱼ) = exp(eᵢⱼ) / Σₖ exp(eᵢₖ), normalized over k ∈ N(i)

h'ᵢ = σ(Σⱼ αᵢⱼ · W·hⱼ), summed over j ∈ N(i)

where W is a shared learnable weight matrix, a is a learnable attention vector, || denotes concatenation, σ is a nonlinearity (ELU in the original paper), and N(i) is the neighborhood of node i, typically including i itself via a self-loop.
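
The three equations map almost line-for-line onto code. Below is a minimal single-head sketch in PyTorch; the names GATLayer, in_dim, and out_dim are illustrative rather than a library API, and the adjacency matrix is assumed to contain self-loops so every row of the softmax is well-defined:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.out_dim = out_dim
        self.W = nn.Linear(in_dim, out_dim, bias=False)     # shared weight matrix W
        self.a = nn.Parameter(torch.empty(2 * out_dim, 1))  # attention vector a
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, adj):
        # h: (N, in_dim) node features; adj: (N, N) adjacency with self-loops
        Wh = self.W(h)                                      # W·h_i for every node, (N, out_dim)
        # Split aᵀ[W·h_i || W·h_j] into a_srcᵀ·W·h_i + a_dstᵀ·W·h_j
        e_src = Wh @ self.a[: self.out_dim]                 # (N, 1)
        e_dst = Wh @ self.a[self.out_dim :]                 # (N, 1)
        e = F.leaky_relu(e_src + e_dst.T, negative_slope=0.2)  # e_ij, (N, N)
        # Restrict attention to graph neighbors: softmax only over j ∈ N(i)
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)                     # α_ij
        return F.elu(alpha @ Wh)                            # h'_i = σ(Σ_j α_ij · W·h_j)
```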

Key Difference from Transformer

Transformer                     | GAT
--------------------------------|-------------------------------------------
Full sequence (all-to-all)      | Graph edges (neighbors only)
Positional encoding             | No positional encoding (graph structure is the position)
Fixed maximum sequence length   | Variable node count
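
One way to see the relationship, using the hypothetical GATLayer sketch above: an all-ones adjacency matrix lets every node attend to every other node, recovering transformer-style all-to-all attention (minus positional encoding), while a sparse adjacency restricts attention to graph neighbors:

```python
import torch

N, d = 4, 8
h = torch.randn(N, d)
layer = GATLayer(d, d)                      # from the sketch above

full = torch.ones(N, N)                     # all-to-all, transformer-like
ring = (torch.eye(N)                        # self-loops
        + torch.roll(torch.eye(N), 1, 0)    # edge to the next node
        + torch.roll(torch.eye(N), -1, 0))  # edge to the previous node

print(layer(h, full).shape)                 # torch.Size([4, 8])
print(layer(h, ring).shape)                 # torch.Size([4, 8])
```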

Multi-Head GAT

For K attention heads (k = 1, ..., K):

h'ᵢ = ||ₖ₌₁ᴷ σ(Σⱼ αᵢⱼᵏ · Wᵏ·hⱼ)

where || denotes concatenation across heads, and each head k has its own weight matrix Wᵏ and attention coefficients αᵢⱼᵏ. (On the final layer, the original paper averages the heads instead of concatenating them.)
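
A sketch of the multi-head version, reusing the hypothetical GATLayer from above: K independent heads run in parallel and their outputs are concatenated along the feature dimension, so the output width is K·out_dim:

```python
import torch
import torch.nn as nn

class MultiHeadGAT(nn.Module):
    def __init__(self, in_dim, out_dim, num_heads):
        super().__init__()
        # K independent heads, each with its own Wᵏ and aᵏ
        self.heads = nn.ModuleList(GATLayer(in_dim, out_dim)
                                   for _ in range(num_heads))

    def forward(self, h, adj):
        # ||ₖ: concatenate per-head outputs -> (N, num_heads * out_dim)
        return torch.cat([head(h, adj) for head in self.heads], dim=-1)
```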

Applications

GAT is applied to graph-structured data, for example:

  • Social networks (node classification, link prediction)
  • Molecules (molecular property prediction)
  • Knowledge graphs
  • Citation networks (Cora, Citeseer, and Pubmed in the original paper)

Test Your Understanding

Question 1: GAT operates on:

  • A) Sequences
  • B) Graphs (nodes and edges)
  • C) Images
  • D) Audio

Question 2: In GAT, attention is computed between:

  • A) All nodes
  • B) Only neighboring nodes (connected by edges)
  • C) Random nodes
  • D) All-to-all like transformer

Question 3: GAT does NOT need:

  • A) Graph structure
  • B) Positional encoding (graph structure IS position)
  • C) Node features
  • D) Attention

Question 4: Node features in GAT are:

  • A) Fixed
  • B) Transformed via W matrix and aggregated by attention
  • C) Ignored
  • D) Random

Question 5: GAT is used for:

  • A) Text classification only
  • B) Graph-structured data (social networks, molecules, knowledge graphs)
  • C) Image classification only
  • D) Speech recognition

Question 6: Multi-head GAT concatenates outputs from:

  • A) All layers
  • B) Multiple attention heads
  • C) No heads
  • D) Single node

Question 7: The key difference from the transformer is:

  • A) Same architecture
  • B) GAT uses graph edges to limit attention; the transformer attends all-to-all
  • C) Transformer uses edges
  • D) No difference

Question 8: eᵢⱼ = LeakyReLU(aᵀ[W·hᵢ || W·hⱼ]) computes:

  • A) Node feature
  • B) Attention score between nodes i and j
  • C) Edge weight
  • D) Graph property