Introduction
Deformable attention is an attention mechanism that learns where to attend: instead of aggregating features at fixed grid points, each query predicts a small set of sampling locations. It was introduced in Deformable DETR (2020) to address a limitation of standard attention on images, which computes attention over fixed, regular positions regardless of where objects of interest actually lie.
The Problem with Grid Attention
Standard attention on images typically operates over fixed positions, such as regular grids or windows (e.g., 7×7). This is inefficient, and the rigid sampling pattern cannot adapt when objects of interest are not aligned to the grid.
Deformable Convolution vs Attention
Deformable Convolution (DConv)
Offset sampling: p = p₀ + p_k + Δp_k (regular kernel position p_k plus learned offset Δp_k)
Features: y(p₀) = Σₖ w_k · x(p₀ + p_k + Δp_k)
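As a concrete illustration, here is a minimal NumPy sketch of deformable-convolution sampling at a single output location. The 3×3 grid, the `bilinear` helper, and all names are assumptions for illustration, not the paper's implementation; real offsets are fractional, so bilinear interpolation is needed.

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly interpolate feature map x of shape (H, W) at fractional (py, px)."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    wy, wx = py - y0, px - x0            # fractional parts
    y0, y1 = np.clip([y0, y0 + 1], 0, H - 1)
    x0, x1 = np.clip([x0, x0 + 1], 0, W - 1)
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def deform_conv_point(x, weights, offsets, p0):
    """y(p0) = Σ_k w_k · x(p0 + p_k + Δp_k) for an assumed 3x3 kernel."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # p_k
    out = 0.0
    for k, (dy, dx) in enumerate(grid):
        py = p0[0] + dy + offsets[k, 0]   # fractional sampling position
        px = p0[1] + dx + offsets[k, 1]
        out += weights[k] * bilinear(x, py, px)
    return out
```

With all offsets zero this reduces to an ordinary 3×3 convolution at p₀.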
Deformable Attention
Multi-scale deformable attention:
y(q) = Σₖ A_qk · x(p_q + Δp_qk), with Σₖ A_qk = 1
Where:
K = number of sampling points (e.g., 4×4 = 16)
p_q = reference point of query q
Δp_qk = learned offset for query q at sampling point k
A_qk = attention weight, normalized over the K points (e.g., by softmax)
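The formula above can be sketched for a single query in NumPy. This is a simplified single-scale sketch under stated assumptions: it omits the value projection and the sum over feature levels and heads, and all function and variable names are hypothetical.

```python
import numpy as np

def bilinear(x, py, px):
    """Sample feature map x of shape (H, W) at fractional location (py, px)."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    wy, wx = py - y0, px - x0
    y0, y1 = np.clip([y0, y0 + 1], 0, H - 1)
    x0, x1 = np.clip([x0, x0 + 1], 0, W - 1)
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def deformable_attn_query(x, p_q, offsets, attn_logits):
    """y(q) = Σ_k A_qk · x(p_q + Δp_qk), with A normalized by softmax over K."""
    e = np.exp(attn_logits - attn_logits.max())
    A = e / e.sum()                       # Σ_k A_qk = 1
    y = 0.0
    for k, (dy, dx) in enumerate(offsets):
        y += A[k] * bilinear(x, p_q[0] + dy, p_q[1] + dx)
    return y
```

Because the weights sum to 1, the output is a convex combination of the K sampled features.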
Key Components
1. Offset Generation
From the query feature, predict an offset Δp_qk for each of the K sampling points
Offset head: a fully connected layer mapping the query feature to 2K values (one x and one y offset per point)
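A minimal sketch of such an offset head, assuming a query dimension of 256 and K = 16; the near-zero weight initialization (so sampling starts close to the reference point) and all names are illustrative assumptions.

```python
import numpy as np

d, K = 256, 16                          # assumed query dim and sampling points
rng = np.random.default_rng(0)

# Hypothetical offset head: one fully connected layer, d -> 2K, initialized
# near zero so initial samples land close to the reference point.
W_off = rng.normal(scale=0.01, size=(2 * K, d))
b_off = np.zeros(2 * K)

q = rng.normal(size=d)                  # a query feature
offsets = (W_off @ q + b_off).reshape(K, 2)   # one (Δy, Δx) per sampling point
```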
2. Sampling Points
For each query, sample K points at learned offsets from the query's reference point p_q:
Sampling positions: p_q + Δp_qk
K is typically 4×4 = 16 points per query
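Computing the sampling positions is then a single addition, writing p_q for the query's reference point; the feature-map size, the offset scale, and the clamping to the map bounds below are illustrative assumptions.

```python
import numpy as np

H, W, K = 32, 32, 16                    # assumed feature-map size and K
p_q = np.array([10.0, 20.0])            # reference point (y, x) of one query
offsets = np.random.default_rng(1).normal(scale=2.0, size=(K, 2))  # Δp_qk

positions = p_q + offsets               # sampling positions p_q + Δp_qk
positions = np.clip(positions, 0.0, [H - 1.0, W - 1.0])  # keep samples in-bounds
```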
Advantages
- Flexible receptive field: Attends to locations relevant to object, not fixed grid
- Multi-scale: Can sample from different feature levels
- Efficient: O(K) per query instead of O(HW) for full attention over an H×W feature map
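The efficiency gain is easy to quantify; for an assumed 100×100 feature map and K = 16 (both illustrative numbers):

```python
# Per-query cost: full attention vs. deformable attention
H, W, K = 100, 100, 16          # assumed feature-map size and point count
full_attention = H * W          # dense attention touches every position
deformable = K                  # deformable attention samples only K points
ratio = full_attention // deformable
print(ratio)                    # 625x fewer sampled locations per query
```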