49. Video Attention

Introduction

Video attention extends attention mechanisms to video data, which combines spatial (image) and temporal (time) dimensions. Video understanding requires attending to objects across both space and time, making attention design more complex than for static images.

Video as Spatio-Temporal Sequence

1. Frame-by-Frame (Late Temporal)

Video → extract frames → process each frame with image model → temporal modeling

Tokens: [frame_0 tokens, frame_1 tokens, ..., frame_{T-1} tokens]

2. Tube/Patch-by-Patch (Early Temporal)

Video → spatiotemporal tubes/patches → flatten to sequence

Each "tube" is a small volume that covers one spatial patch over several consecutive frames
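
The two tokenization schemes above can be sketched with plain array reshapes. This is a minimal illustration, not a reference implementation: it assumes NumPy, a video stored as a (T, H, W, C) array, and non-overlapping patches. The function names frame_patches and tubelets and the tube depth t are hypothetical choices made for this example.

```python
import numpy as np

def frame_patches(video, p):
    """Late temporal: cut every frame into non-overlapping p x p patches."""
    T, H, W, C = video.shape
    assert H % p == 0 and W % p == 0, "frame size must be divisible by p"
    x = video.reshape(T, H // p, p, W // p, p, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)            # (T, H/p, W/p, p, p, C)
    return x.reshape(T * (H // p) * (W // p), p * p * C)

def tubelets(video, t, p):
    """Early temporal: cut the video into non-overlapping t x p x p tubes."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    x = video.reshape(T // t, t, H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)         # (T/t, H/p, W/p, t, p, p, C)
    return x.reshape((T // t) * (H // p) * (W // p), t * p * p * C)

video = np.zeros((16, 224, 224, 3), dtype=np.float32)
print(frame_patches(video, 16).shape)   # (3136, 768)
print(tubelets(video, 2, 16).shape)     # (1568, 1536)
```

With a tube depth of 2, the tube-based sequence is half as long as the per-frame one, because each token already covers two frames.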

Video Attention Patterns

1. Spatial Attention

Within each frame, like standard image attention.

2. Temporal Attention

Across frames at the same spatial position, so each position can follow how its content changes over time.

3. Spatio-Temporal Attention

Full attention across both space and time.
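
The three patterns differ only in which tokens are allowed to attend to each other, which can be shown by reshaping a (T, N, D) token array before attention. The sketch below assumes NumPy and a deliberately simplified single-head attention with Q = K = V = x (no learned projections); the function names are hypothetical and chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Self-attention over the second-to-last axis, using Q = K = V = x."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def spatial_attention(x):
    # x: (T, N, D). The N tokens of a frame attend only to each other.
    return attend(x)

def temporal_attention(x):
    # Each spatial position attends to itself across the T frames.
    xt = np.swapaxes(x, 0, 1)                 # (N, T, D)
    return np.swapaxes(attend(xt), 0, 1)      # back to (T, N, D)

def spatiotemporal_attention(x):
    # All T*N tokens attend to all T*N tokens.
    T, N, D = x.shape
    return attend(x.reshape(T * N, D)).reshape(T, N, D)

x = np.random.default_rng(0).normal(size=(4, 9, 8))  # 4 frames, 9 patches, dim 8
print(spatial_attention(x).shape, temporal_attention(x).shape,
      spatiotemporal_attention(x).shape)
```

All three functions return an array of the same shape as the input; only the size of the attention matrix changes (N × N per frame, T × T per position, or TN × TN overall).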

Complexity Challenge

Video: T frames, each H×W pixels

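If treated as tokens: sequence length = T × (H/p) × (W/p), where p is the patch size

For 16 frames, 224×224, 16×16 patches:
(224/16) × (224/16) = 14 × 14 = 196 tokens per frame
16 × 196 = 3136 tokens (still manageable)

For longer videos, the sequence becomes very large, and because full attention compares every token with every other token, its cost grows quadratically with sequence length → need efficient attention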
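
The token arithmetic can be checked with a few lines of Python. This is a small sketch assuming one token per non-overlapping p × p patch per frame; the helper name num_tokens is hypothetical.

```python
def num_tokens(T, H, W, p):
    """Tokens for T frames of H x W pixels with p x p patches."""
    assert H % p == 0 and W % p == 0, "frame size must be divisible by p"
    return T * (H // p) * (W // p)

for T in (16, 64, 256):
    n = num_tokens(T, 224, 224, 16)
    # Full attention scores every token against every token: n * n pairs.
    print(f"T={T:4d}  tokens={n:6d}  attention pairs={n * n:,}")
```

For T = 16 this gives 3136 tokens and 9,834,496 token pairs; multiplying the number of frames by 4 multiplies the number of pairs by 16.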

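Efficient Video Attention

Efficient designs avoid full spatio-temporal attention, for example by applying spatial attention within each frame first and temporal attention across frames second, or by using sparse attention patterns.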
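
One way to see why splitting attention into a spatial step followed by a temporal step helps is to count attention pairs. The sketch below is a rough cost model only, not a benchmark: it counts query-key pairs, ignores constant factors and the feature dimension, and uses hypothetical helper names.

```python
def full_pairs(T, N):
    """Full spatio-temporal attention: all T*N tokens attend to all T*N tokens."""
    return (T * N) ** 2

def factorized_pairs(T, N):
    """Spatial step (T frames, N x N pairs each) plus
    temporal step (N positions, T x T pairs each)."""
    return T * N * N + N * T * T

T, N = 16, 196   # 16 frames, 196 patches per frame
print(full_pairs(T, N))        # 9834496
print(factorized_pairs(T, N))  # 664832
```

Under this model the factorized form needs roughly 15 times fewer pairs for this input, and the gap widens as T grows.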

Test Your Understanding

Question 1: Video adds which dimension compared to images?

  • A) Color
  • B) Time (temporal)
  • C) No dimension
  • D) Depth

Question 2: Video attention requires attending to:

  • A) Space only
  • B) Time only
  • C) Both space and time
  • D) No attention

Question 3: For 16 frames, 224×224, with 16×16 patches, token count is:

  • A) 16
  • B) 196
  • C) 3136 (16 × 196)
  • D) 4096

Question 4: Temporal attention attends to:

  • A) Same spatial position across different frames
  • B) Different spatial positions in same frame
  • C) No attention
  • D) Random positions

Question 5: A challenge with video attention is:

  • A) Too few tokens
  • B) Very long sequences (many frames × patches)
  • C) Too simple
  • D) No memory issue

Question 6: Spatiotemporal attention attends:

  • A) Only within one frame
  • B) Only across frames
  • C) Both within frames and across frames
  • D) No attention

Question 7: Video tubes are:

  • A) Only spatial patches
  • B) Spatiotemporal volumes across time
  • C) Single pixels
  • D) No concept

Question 8: Efficient video attention strategies include:

  • A) Full attention always
  • B) Space-only first then temporal, or sparse patterns
  • C) No strategies needed
  • D) Use maximum memory