53. Ring Attention

Introduction

Ring attention is a distributed attention mechanism in which the attention computation is split across multiple devices arranged in a ring topology. Each device handles one slice of the sequence, and key/value blocks are passed around the ring so that every device can compute full (exact) attention for its slice.

Motivation

Even with linear attention, the activations for very long sequences can exceed the memory of a single device. Ring attention distributes both the sequence and the attention computation across multiple GPUs/NPUs.

Ring Topology

Devices are arranged in a ring:

Device 0 → Device 1 → Device 2 → ... → Device N-1 → Device 0

  • Each device holds one chunk of Q, K, V
  • K and V chunks are passed around the ring; each device keeps its own Q chunk

Algorithm

For each of the C ring steps:
1. Device i keeps its query chunk Q_i throughout
2. Device i sends its current K, V chunk to device i-1 (wrapping around) and receives a chunk from device i+1
3. Device i computes partial attention between Q_i and the K, V chunk it just received
4. The partial results are accumulated (with an online softmax, rescaling the running sums so the final result exactly matches full attention) and the process repeats

After C steps (one per device), every device has seen every K, V chunk and holds the full attention output for its Q_i.
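The loop above can be sketched as a single-process simulation. This is not the distributed implementation (there is no real communication here); it just models each "device" as one chunk and walks the ring in software, using the online-softmax accumulation described in step 4. The function name and structure are illustrative, not from any library:

```python
import numpy as np

def ring_attention(q_chunks, k_chunks, v_chunks):
    """Simulate ring attention: each 'device' i holds q_chunks[i] and, over
    C ring steps, sees every K/V chunk, accumulating exact attention with
    an online softmax (running max, denominator, and numerator)."""
    C = len(q_chunks)                        # number of devices in the ring
    d = q_chunks[0].shape[-1]
    outputs = []
    for i in range(C):                       # one pass per simulated device
        q = q_chunks[i] / np.sqrt(d)
        m = np.full(q.shape[0], -np.inf)     # running row-wise max of scores
        l = np.zeros(q.shape[0])             # running softmax denominator
        acc = np.zeros_like(q)               # running weighted sum of V
        for step in range(C):
            j = (i + step) % C               # K/V chunk arriving at this step
            s = q @ k_chunks[j].T            # partial attention scores
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)        # rescale old accumulators
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ v_chunks[j]
            m = m_new
        outputs.append(acc / l[:, None])     # normalize: full output for Q_i
    return np.concatenate(outputs, axis=0)
```

Because of the online-softmax rescaling, the concatenated outputs are numerically identical to computing softmax attention over the whole sequence at once.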

Communication Pattern

Step 0: Each device holds its own chunk (Device 0 has K_0, V_0)
Step 1: K_0, V_0 moves to Device N-1; Device 0 receives K_1, V_1
Step 2: K_1, V_1 moves to Device N-2; Device 0 receives K_2, V_2
...

All devices exchange K, V in this ring fashion. Each transfer is point-to-point between ring neighbors, so communication can be overlapped with computation.
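The rotation schedule can be made concrete with a few lines of Python. The helper below (a hypothetical name, for illustration only) lists which K/V chunk each device holds at each step, assuming chunks move to the next-lower device index as above:

```python
def ring_schedule(C):
    """For C devices, return which K/V chunk each device holds at each
    ring step, assuming chunks move to the next-lower device index."""
    return [[(i + step) % C for i in range(C)] for step in range(C)]

for step, held in enumerate(ring_schedule(4)):
    # step 0: every device holds its own chunk; later steps rotate them
    print(f"step {step}: device -> chunk {held}")
```

Reading down any column shows that each device sees every chunk exactly once over the C steps.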

Properties

Aspect              Ring Attention
Communication       K, V passed point-to-point around the ring (not all-to-all)
Memory per device   O(n/C), where C = number of devices
Steps               C (one per device)
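A back-of-envelope calculation shows what O(n/C) means in practice. All of the numbers below are hypothetical, chosen only to illustrate the scaling:

```python
# Memory for one device's Q, K, V chunks (hypothetical numbers).
n = 1_000_000        # total sequence length
C = 8                # devices in the ring
d = 128              # head dimension
bytes_per_value = 2  # fp16
chunk_rows = n // C  # O(n/C) rows of the sequence per device
per_device = 3 * chunk_rows * d * bytes_per_value  # Q, K, and V chunks
print(f"{per_device / 2**20:.1f} MiB per device")
```

Doubling the number of devices halves the per-device chunk size, which is what lets ring attention scale to sequence lengths no single device could hold.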

Test Your Understanding

Question 1: Ring attention distributes attention computation across:

  • A) Single device
  • B) Multiple devices in ring topology
  • C) Random devices
  • D) No distribution

Question 2: In ring attention, each device holds:

  • A) Full sequence
  • B) A chunk of Q, K, V
  • C) Only Q
  • D) No data

Question 3: Ring attention is used for:

  • A) Single device only
  • B) Very long sequences with distributed computation
  • C) Short sequences
  • D) No purpose

Question 4: K and V are passed around the ring to:

  • A) No communication
  • B) Enable each device to compute attention with all K, V
  • C) Store data
  • D) Delete data

Question 5: With 4 devices, ring attention requires how many steps?

  • A) 1
  • B) 2
  • C) 4 (one per device)
  • D) 16

Question 6: Memory per device in ring attention is:

  • A) O(n)
  • B) O(n/C) where C is number of devices
  • C) O(1)
  • D) O(n²)

Question 7: Ring attention was introduced for:

  • A) Single GPU training
  • B) Long sequence distributed training
  • C) CPU-only systems
  • D) No specific context

Question 8: After completing all ring steps, each device has:

  • A) No result
  • B) Full attention output for its Q chunk
  • C) Only partial result
  • D) Random output