Introduction
Video attention extends attention mechanisms to video data, which combines spatial (image) and temporal (time) dimensions. Video understanding requires attending to objects across both space and time, making attention design more complex than for static images.
Video as Spatio-Temporal Sequence
1. Frame-by-Frame (Late Temporal)
Video → extract frames → process each frame with image model → temporal modeling
Tokens: [frame_0 tokens, frame_1 tokens, ..., frame_T tokens]
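The frame-by-frame layout above can be sketched with a minimal tokenizer: each frame is split independently into non-overlapping patch tokens, and temporal modeling happens later. The function and parameter names here are illustrative, not from any specific library.

```python
import numpy as np

def frame_tokens(video, patch=16):
    """Late-temporal tokenization: patch each frame independently.

    video: array of shape (T, H, W, C).
    Returns (T, (H//patch) * (W//patch), patch * patch * C),
    i.e. one patch-token sequence per frame.
    """
    T, H, W, C = video.shape
    hp, wp = H // patch, W // patch
    # Split H and W into patch grids, then flatten each patch into a token.
    x = video.reshape(T, hp, patch, wp, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(T, hp * wp, patch * patch * C)

video = np.zeros((16, 224, 224, 3))
tokens = frame_tokens(video)
print(tokens.shape)  # (16, 196, 768)
```

With 224×224 frames and 16×16 patches, each frame yields 14×14 = 196 tokens; an image model attends within each frame's 196 tokens before any temporal aggregation.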
2. Tube/Patch-by-Patch (Early Temporal)
Video → spatiotemporal tubes/patches → flatten to sequence
Each "tube" is a small volume across time
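Tube tokenization can be sketched the same way, except each token now spans a small temporal window as well as a spatial patch. The tube size (2 frames × 16×16 pixels below) is an illustrative choice, not a fixed standard.

```python
import numpy as np

def tube_tokens(video, t_patch=2, patch=16):
    """Early-temporal tokenization: flatten spatio-temporal tubes.

    video: array of shape (T, H, W, C). Each tube covers t_patch
    consecutive frames and a patch x patch spatial region.
    Returns (num_tubes, tube_dim).
    """
    T, H, W, C = video.shape
    tt, hp, wp = T // t_patch, H // patch, W // patch
    # Split time, height, and width into tube grids, then flatten each tube.
    x = video.reshape(tt, t_patch, hp, patch, wp, patch, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(tt * hp * wp, t_patch * patch * patch * C)

tokens = tube_tokens(np.zeros((16, 224, 224, 3)))
print(tokens.shape)  # (1568, 1536)
```

Note the trade-off: tubes halve the sequence length along time (8 × 196 = 1568 tokens instead of 16 × 196), at the cost of a larger per-token dimension.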
Video Attention Patterns
1. Spatial Attention
Within each frame, like standard image attention.
2. Temporal Attention
Across frames at the same spatial position, tracking objects over time.
3. Spatio-Temporal Attention
Full attention across both space and time.
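The three patterns differ only in how the token tensor is reshaped before attention. A minimal sketch, using a plain single-head self-attention helper (all names here are illustrative):

```python
import numpy as np

def attention(x):
    """Single-head self-attention over (batch, seq, dim)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)  # softmax over the key axis
    return w @ x

# Tokens shaped (T, N, D): T frames, N spatial tokens per frame, D channels.
T, N, D = 4, 6, 8
x = np.random.default_rng(0).normal(size=(T, N, D))

# 1. Spatial: attend within each frame (batch over T, sequence length N).
spatial = attention(x)

# 2. Temporal: attend across frames at each spatial position
#    (batch over N, sequence length T).
temporal = attention(x.transpose(1, 0, 2)).transpose(1, 0, 2)

# 3. Spatio-temporal: full attention over all T*N tokens at once.
full = attention(x.reshape(1, T * N, D)).reshape(T, N, D)
```

Factorized models (spatial then temporal) pay O(T·N²) + O(N·T²) for the score matrices, whereas full spatio-temporal attention pays O(T²·N²).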
Complexity Challenge
Video: T frames, each H×W pixels
If treated as tokens: sequence length = T × (H/p) × (W/p)
For 16 frames, 224×224, 16×16 patches:
16 × 196 = 3136 tokens (still manageable)
For longer videos: becomes very large → need efficient attention
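The token-count arithmetic above is easy to check directly (the 300-frame example below is an illustrative extrapolation, not from the original text):

```python
def num_tokens(T, H, W, p):
    """Sequence length for T frames of H x W pixels with p x p patches."""
    return T * (H // p) * (W // p)

print(num_tokens(16, 224, 224, 16))   # 3136 -- still manageable
print(num_tokens(300, 224, 224, 16))  # 58800 -- full attention is quadratic in this
```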
Efficient Video Attention
- Space-only first: Attend within frames, then aggregate temporally
- Key frame selection: Attend only to key frames
- Memory-efficient: Use chunked attention or sparse patterns