65. Reinforcement Learning Attention

Introduction

Reinforcement learning (RL) attention uses RL algorithms to learn attention policies. Since hard attention makes a discrete, non-differentiable selection, standard backpropagation cannot be used to train it. RL provides a way to optimize which positions to attend to.

Why RL for Attention?

Soft attention computes a differentiable weighted average over all positions, but hard attention requires a discrete selection:

Hard attention: z ~ Categorical(α), where α are the attention weights

This sampling step is non-differentiable, so gradients cannot flow through it.

RL (specifically REINFORCE) estimates the gradient with the score-function estimator:

∇L = E_{z~π}[r · ∇ log π(z)]
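The contrast is easy to see in code. Below is a minimal PyTorch sketch (shapes and variable names are illustrative, not from any particular implementation): soft attention backpropagates into the scores, while the index sampled by hard attention has no gradient path.

    import torch

    scores = torch.randn(5, requires_grad=True)    # attention logits over 5 positions
    values = torch.randn(5, 8)                     # one 8-dim value vector per position
    weights = torch.softmax(scores, dim=0)

    # Soft attention: a differentiable weighted average.
    soft_out = weights @ values                    # shape (8,)
    soft_out.sum().backward()
    print(scores.grad)                             # well-defined gradients

    # Hard attention: sample one position; sampling has no gradient.
    z = torch.multinomial(weights, num_samples=1)  # discrete sample (LongTensor)
    hard_out = values[z.item()]                    # indexing by an int detaches from `scores`
    # There is no gradient path from `hard_out` back to `scores`.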

REINFORCE for Attention

1. Sample which position to attend to: z ~ π.

2. Compute the reward r from task performance (accuracy, BLEU, etc.).

3. Update with the gradient ∇L = r · ∇ log π(z), where π is the attention distribution (the policy). A code sketch follows this list.
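Here is a minimal REINFORCE step in PyTorch. It is a sketch under assumptions: reward_fn stands in for whatever task metric produces the scalar reward, and the baseline argument (a standard variance-reduction trick) is optional.

    import torch

    def reinforce_attention_step(scores, values, reward_fn, baseline=0.0):
        # scores: (n,) attention logits with requires_grad=True
        # values: (n, d) value vectors, one per position
        policy = torch.distributions.Categorical(logits=scores)
        z = policy.sample()                      # 1. sample a position
        r = reward_fn(values[z])                 # 2. scalar task reward
        # 3. score-function estimator: ascend on r · ∇ log π(z)
        loss = -(r - baseline) * policy.log_prob(z)
        loss.backward()                          # gradient lands in scores.grad
        return z, r

    scores = torch.randn(5, requires_grad=True)
    values = torch.randn(5, 8)
    z, r = reinforce_attention_step(scores, values,
                                    reward_fn=lambda v: float(v.mean() > 0))
    # scores.grad now holds a single-sample REINFORCE gradient estimate

Subtracting a baseline from the reward leaves the estimator unbiased but reduces its variance, which matters in practice because single-sample REINFORCE estimates are noisy.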

Attention as Policy

Attention weights define a policy over positions: the softmax over attention scores is a categorical distribution π(i | query) = softmax(scores)ᵢ, and sampling i ~ π selects the position to attend to.
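In code, the mapping is direct: the attention weight vector is the policy's probability table. A small illustrative snippet (the entropy term is a common exploration bonus in RL generally, not something specific to attention):

    import torch

    weights = torch.tensor([0.1, 0.6, 0.2, 0.1])       # attention weights = policy
    policy = torch.distributions.Categorical(probs=weights)

    position = policy.sample()             # the "action": which position to read
    log_prob = policy.log_prob(position)   # the ∇ log π(z) term in REINFORCE
    entropy = policy.entropy()             # often added as an exploration bonus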

Variants

1. Hard Attention with RL

Select a single position and train with REINFORCE, as in the step sketch above.

2. Attention Weight Prediction

Predict which positions are important and add the prediction as a bias to the attention scores; see the sketch below.
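A hedged sketch of this variant: a small predictor (importance_net is an illustrative name, not from any specific paper) scores each position, and the result is added to the attention logits before the softmax, so the whole path stays differentiable.

    import torch
    import torch.nn as nn

    d = 64
    importance_net = nn.Linear(d, 1)             # predicts a scalar bias per position

    def biased_attention(query, keys, values):
        # query: (d,)   keys, values: (n, d)
        logits = keys @ query / d ** 0.5         # scaled dot-product scores
        bias = importance_net(keys).squeeze(-1)  # predicted importance per position
        weights = torch.softmax(logits + bias, dim=0)
        return weights @ values                  # soft attention stays differentiable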

3. Meta-Learning Attention

Learn to learn attention patterns via RL.

Test Your Understanding

Question 1: RL for attention is needed because:

  • A) Attention is too slow
  • B) Hard attention is non-differentiable
  • C) Attention is too fast
  • D) No reason

Question 2: In RL attention, attention weights define:

  • A) Policy over positions
  • B) Loss function
  • C) No policy
  • D) Gradient

Question 3: REINFORCE estimates gradients using:

  • A) Backpropagation
  • B) Reward and log probability
  • C) No estimation
  • D) Second derivatives

Question 4: Hard attention samples:

  • A) Weighted average
  • B) Single position from distribution
  • C) All positions
  • D) No position

Question 5: Reward in RL attention is typically:

  • A) Random
  • B) Task performance (accuracy, BLEU)
  • C) Zero
  • D) Constant

Question 6: In REINFORCE gradient ∇L = r · ∇ log π(position), π is:

  • A) Loss
  • B) Attention distribution (policy)
  • C) Reward
  • D) Action

Question 7: Why can't we use standard backprop for hard attention?

  • A) Too slow
  • B) Sampling is non-differentiable operation
  • C) Too accurate
  • D) No reason

Question 8: RL attention has been used for:

  • A) Only images
  • B) Image captioning (Show, Attend and Tell)
  • C) Text only
  • D> No applications