Introduction
Reinforcement-learning attention uses RL algorithms to learn attention policies. Since hard attention involves a non-differentiable sampling step, standard backpropagation cannot train it directly; RL provides a way to optimize which positions to attend to.
Why RL for Attention?
Soft attention is differentiable, but hard attention requires discrete selection:
Hard attention: z ~ categorical(attention_weights)
This sampling is non-differentiable
RL (REINFORCE) can estimate the gradient via the score function:
∇L = E_{z∼p}[ r · ∇ log p(z) ]
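A minimal sketch of this score-function estimator, using NumPy only (the names `scores`, `reward`, and `reinforce_grad` are illustrative, not from any library):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_grad(scores, reward, rng):
    """REINFORCE estimate of dL/d(scores) for one sampled attention position.

    For a softmax policy, grad log pi(z) w.r.t. the logits is
    (one_hot(z) - pi), so the estimate is reward * (one_hot(z) - pi).
    """
    pi = softmax(scores)
    z = rng.choice(len(pi), p=pi)        # sample a hard-attention position
    one_hot = np.zeros_like(pi)
    one_hot[z] = 1.0
    return z, reward * (one_hot - pi)    # score-function gradient estimate

rng = np.random.default_rng(0)
z, g = reinforce_grad(np.array([1.0, 2.0, 0.5]), reward=1.0, rng=rng)
```

Note that the gradient estimate sums to zero across positions, as it must for a softmax parameterization: raising one logit's probability lowers the others'.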
REINFORCE for Attention
1. Sample which position to attend to: z ~ π(·)
2. Observe the reward r = task performance (accuracy, BLEU, etc.)
3. Update with the gradient ∇L = r · ∇ log π(z), where π is the attention distribution
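The steps above can be put together in a toy training loop. This is a sketch under assumed conditions: the "useful" position is index 2, the reward is 1 for attending there and 0 otherwise, and a running-mean baseline is subtracted from the reward to reduce variance:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.zeros(4)          # policy parameters: one logit per position
baseline, lr = 0.0, 0.5

for step in range(500):
    pi = softmax(logits)
    z = rng.choice(4, p=pi)                           # 1. sample a position
    r = 1.0 if z == 2 else 0.0                        # 2. observe task reward
    one_hot = np.eye(4)[z]
    logits += lr * (r - baseline) * (one_hot - pi)    # 3. REINFORCE update
    baseline += 0.1 * (r - baseline)                  # running-mean baseline

pi = softmax(logits)   # the learned policy concentrates on position 2
```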
Attention as Policy
Attention weights define a policy over positions:
- Policy: Distribution over positions to attend to
- Action: Which position to select (for hard attention)
- Reward: Downstream task performance
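This policy view can be made concrete with scaled dot-product scores; the query/key shapes here are illustrative assumptions:

```python
import numpy as np

def attention_policy(query, keys):
    """Interpret scaled dot-product attention weights as a policy over positions."""
    scores = keys @ query / np.sqrt(len(query))   # one score per position
    e = np.exp(scores - scores.max())
    return e / e.sum()                            # policy: distribution over positions

rng = np.random.default_rng(1)
q = rng.standard_normal(8)            # query vector
K = rng.standard_normal((5, 8))       # 5 positions, key dim 8
pi = attention_policy(q, K)
action = rng.choice(5, p=pi)          # hard attention = sampled action
```

Soft attention would instead take the expectation over positions under π; hard attention commits to the single sampled `action`, which is what makes RL necessary.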
Variants
1. Hard Attention with RL
Select single position, train with REINFORCE.
2. Attention Weight Prediction
Predict which positions are important, use as attention bias.
3. Meta-Learning Attention
Learn to learn attention patterns via RL.
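Variant 2 can be sketched as follows; the `predicted_bias` would come from a learned importance predictor, here just an assumed input added to the soft-attention logits:

```python
import numpy as np

def biased_attention(scores, predicted_bias):
    """Soft attention with a predicted importance bias added to the logits."""
    logits = scores + predicted_bias
    e = np.exp(logits - logits.max())
    return e / e.sum()

scores = np.array([0.2, 0.1, 0.3])
bias = np.array([0.0, 2.0, 0.0])      # predictor flags position 1 as important
w = biased_attention(scores, bias)    # attention shifts toward position 1
```

Because the bias enters before the softmax, the whole pipeline stays differentiable, so only the importance predictor itself needs RL (or can be trained with auxiliary supervision).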