Introduction
Deformable attention is an attention mechanism that learns where to attend: instead of aggregating features at fixed grid points, each query predicts a small set of sampling locations. It was introduced in Deformable DETR (2020) to address a limitation of standard attention on images, which computes attention over fixed, regular positions regardless of where objects of interest actually lie.
The Problem with Grid Attention
Standard attention on images typically operates over fixed positions, such as regular grids or windows (e.g., 7×7). This is inefficient, and the rigid sampling pattern cannot adapt when objects of interest are not aligned to the grid.
Deformable Convolution vs Attention
Deformable Convolution (DConv)
Offset sampling: p = p₀ + p_k + Δp_k (regular kernel position p_k plus learned offset Δp_k)
Features: y(p₀) = Σₖ w_k · x(p₀ + p_k + Δp_k)
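As a concrete illustration, here is a minimal NumPy sketch of deformable-convolution sampling at a single output location. The 3×3 grid, the `bilinear` helper, and all names are assumptions for illustration, not the paper's implementation; real offsets are fractional, so bilinear interpolation is needed.

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly interpolate feature map x of shape (H, W) at fractional (py, px)."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    wy, wx = py - y0, px - x0            # fractional parts
    y0, y1 = np.clip([y0, y0 + 1], 0, H - 1)
    x0, x1 = np.clip([x0, x0 + 1], 0, W - 1)
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def deform_conv_point(x, weights, offsets, p0):
    """y(p0) = Σ_k w_k · x(p0 + p_k + Δp_k) for an assumed 3x3 kernel."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # p_k
    out = 0.0
    for k, (dy, dx) in enumerate(grid):
        py = p0[0] + dy + offsets[k, 0]   # fractional sampling position
        px = p0[1] + dx + offsets[k, 1]
        out += weights[k] * bilinear(x, py, px)
    return out
```

With all offsets zero this reduces to an ordinary 3×3 convolution at p₀.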
Deformable Attention
Multi-scale deformable attention:
y(q) = Σₖ A_qk · x(p_q + Δp_qk), with Σₖ A_qk = 1
Where:
K = number of sampling points (e.g., 4×4 = 16)
p_q = reference point of query q
Δp_qk = learned offset for query q at sampling point k
A_qk = attention weight, normalized over the K points (e.g., by softmax)
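The formula above can be sketched for a single query in NumPy. This is a simplified single-scale sketch under stated assumptions: it omits the value projection and the sum over feature levels and heads, and all function and variable names are hypothetical.

```python
import numpy as np

def bilinear(x, py, px):
    """Sample feature map x of shape (H, W) at fractional location (py, px)."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    wy, wx = py - y0, px - x0
    y0, y1 = np.clip([y0, y0 + 1], 0, H - 1)
    x0, x1 = np.clip([x0, x0 + 1], 0, W - 1)
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def deformable_attn_query(x, p_q, offsets, attn_logits):
    """y(q) = Σ_k A_qk · x(p_q + Δp_qk), with A normalized by softmax over K."""
    e = np.exp(attn_logits - attn_logits.max())
    A = e / e.sum()                       # Σ_k A_qk = 1
    y = 0.0
    for k, (dy, dx) in enumerate(offsets):
        y += A[k] * bilinear(x, p_q[0] + dy, p_q[1] + dx)
    return y
```

Because the weights sum to 1, the output is a convex combination of the K sampled features.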
Key Components
1. Offset Generation
From the query feature, predict an offset Δp_qk for each of the K sampling points
Offset head: a fully connected layer mapping the query feature to 2K values (one x and one y offset per point)
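A minimal sketch of such an offset head, assuming a query dimension of 256 and K = 16; the near-zero weight initialization (so sampling starts close to the reference point) and all names are illustrative assumptions.

```python
import numpy as np

d, K = 256, 16                          # assumed query dim and sampling points
rng = np.random.default_rng(0)

# Hypothetical offset head: one fully connected layer, d -> 2K, initialized
# near zero so initial samples land close to the reference point.
W_off = rng.normal(scale=0.01, size=(2 * K, d))
b_off = np.zeros(2 * K)

q = rng.normal(size=d)                  # a query feature
offsets = (W_off @ q + b_off).reshape(K, 2)   # one (Δy, Δx) per sampling point
```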
2. Sampling Points
For each query, sample K points at learned offsets from the query's reference point p_q:
Sampling positions: p_q + Δp_qk
K is typically 4×4 = 16 points per query
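Computing the sampling positions is then a single addition, writing p_q for the query's reference point; the feature-map size, the offset scale, and the clamping to the map bounds below are illustrative assumptions.

```python
import numpy as np

H, W, K = 32, 32, 16                    # assumed feature-map size and K
p_q = np.array([10.0, 20.0])            # reference point (y, x) of one query
offsets = np.random.default_rng(1).normal(scale=2.0, size=(K, 2))  # Δp_qk

positions = p_q + offsets               # sampling positions p_q + Δp_qk
positions = np.clip(positions, 0.0, [H - 1.0, W - 1.0])  # keep samples in-bounds
```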
Advantages
- Flexible receptive field: Attends to locations relevant to object, not fixed grid
- Multi-scale: Can sample from different feature levels
- Efficient: O(K) per query instead of O(HW) for full attention over an H×W feature map
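The efficiency gain is easy to quantify; for an assumed 100×100 feature map and K = 16 (both illustrative numbers):

```python
# Per-query cost: full attention vs. deformable attention
H, W, K = 100, 100, 16          # assumed feature-map size and point count
full_attention = H * W          # dense attention touches every position
deformable = K                  # deformable attention samples only K points
ratio = full_attention // deformable
print(ratio)                    # 625x fewer sampled locations per query
```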