51. Memory Attention

Introduction

Memory attention applies the attention mechanism to an external memory module that stores information. It extends a Transformer's attention beyond the current sequence to large external stores of knowledge, much as computer memory holds data that can be retrieved via addressing.

Memory-Augmented Neural Networks

The idea was introduced in Memory Networks (and their end-to-end variant, MemN2N) and in Neural Turing Machines. A read is an attention operation over the memory slots:

Read: Attention(query, Memory_keys, Memory_values)

Output: Weighted combination of memory values
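A minimal NumPy sketch of this read operation (the function name memory_read and the toy dimensions are illustrative, not taken from any particular paper):

```python
import numpy as np

def memory_read(query, mem_keys, mem_values):
    """Read from memory with scaled dot-product attention over the slots."""
    d = query.shape[-1]
    scores = mem_keys @ query / np.sqrt(d)      # similarity of the query to each slot's key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax -> attention weights over slots
    return weights @ mem_values                 # weighted combination of memory values

# Toy example: 4 memory slots with 8-dim keys and values.
rng = np.random.default_rng(0)
mem_keys = rng.normal(size=(4, 8))
mem_values = rng.normal(size=(4, 8))
query = rng.normal(size=8)
print(memory_read(query, mem_keys, mem_values).shape)   # (8,)
```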

Different Types of Memory

1. Key-Value Memory

Memory is stored as keys K_mem and values V_mem:

weights = Attention(query, K_mem)
Output = Σ weights × V_mem
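As an illustration, here is a toy key-value memory with a write (append a slot) and a read (attend over all slots). The KeyValueMemory class and its interface are assumptions made for this sketch, not a standard API:

```python
import numpy as np

class KeyValueMemory:
    """Minimal key-value memory: write appends a slot, read attends over all slots."""

    def __init__(self, d_key, d_val):
        self.keys = np.empty((0, d_key))
        self.values = np.empty((0, d_val))

    def write(self, key, value):
        # Append a new (key, value) slot; real systems may overwrite or evict instead.
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])

    def read(self, query):
        scores = self.keys @ query / np.sqrt(self.keys.shape[1])   # Attention(query, K_mem)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                                    # softmax over slots
        return weights @ self.values                                # Output = sum of weights * V_mem

rng = np.random.default_rng(1)
memory = KeyValueMemory(d_key=8, d_val=8)
for _ in range(4):
    memory.write(rng.normal(size=8), rng.normal(size=8))
print(memory.read(rng.normal(size=8)).shape)   # (8,)
```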

2. Content-Based Memory

Memory is addressed by content: slots are weighted by their similarity to the query (for example, cosine similarity), so retrieval depends on what is stored rather than where it is stored.
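A sketch of content-based addressing using cosine similarity with a sharpness parameter beta; the function name and the beta parameter are illustrative choices in the spirit of NTM-style content addressing:

```python
import numpy as np

def content_addressing(query, memory, beta=1.0):
    """Weight each memory slot by cosine similarity to the query.

    beta is a sharpness parameter: larger beta makes the addressing more peaked.
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    similarity = m @ q                       # cosine similarity per slot
    w = np.exp(beta * similarity)
    return w / w.sum()                       # addressing weights, sum to 1

rng = np.random.default_rng(2)
memory = rng.normal(size=(5, 8))
weights = content_addressing(rng.normal(size=8), memory, beta=5.0)
print(weights.round(3), weights.sum())
```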

3. Addressable Memory

Combines content-based and location-based addressing, as in the Neural Turing Machine: a read head can find slots by similarity to the query or move relative to its previous position.
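A rough sketch of combining the two, loosely following the Neural Turing Machine's gate-then-shift scheme (the sharpening step is omitted; the function name and the fixed shift offsets are assumptions of this sketch):

```python
import numpy as np

def address(content_w, prev_w, gate, shift_kernel):
    """Combine content-based and location-based addressing.

    gate in [0, 1] interpolates between fresh content weights and the previous
    weights; shift_kernel then rotates the result, moving the head relative to
    its last position.
    """
    gated = gate * content_w + (1 - gate) * prev_w            # content vs. location
    shifted = np.zeros_like(gated)
    for offset, s in zip((-1, 0, 1), shift_kernel):
        shifted += s * np.roll(gated, offset)                 # circular shift over slots
    return shifted / shifted.sum()

content_w = np.array([0.1, 0.7, 0.1, 0.05, 0.05])   # weights found by content similarity
prev_w    = np.array([0.0, 0.0, 1.0, 0.0, 0.0])     # where the head was last step
print(address(content_w, prev_w, gate=0.5, shift_kernel=[0.1, 0.8, 0.1]))
```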

Transformer as Memory

A standard Transformer can itself be viewed as having memory:

Previous tokens act as "memory" for the current token. In self-attention, the current query attends to the keys and values of earlier tokens, so those stored keys/values play the role of memory slots.
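A small sketch of this view as a growing key/value cache: each step "writes" the current token's key and value, and the next query "reads" from everything stored so far (variable names are illustrative):

```python
import numpy as np

def attend(q, K, V):
    """One query attends to all cached keys/values (scaled dot-product attention)."""
    w = np.exp(K @ q / np.sqrt(q.shape[-1]))
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(3)
d = 8
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
out = None
for step in range(4):
    k, v, q = rng.normal(size=(3, d))       # current token's key, value, query
    K_cache = np.vstack([K_cache, k])       # "write": the token joins the memory
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)       # "read": attend over all tokens so far
print(out.shape)                             # (8,)
```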

External Memory in Models
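In practice, external memory of this kind appears in retrieval-augmented and memory-augmented Transformer variants, where attention reads from a store of key-value entries that can be much larger than the model's context window.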

Test Your Understanding

Question 1: Memory attention uses attention to access:

  • A) Current sequence only
  • B) External memory module
  • C) No memory
  • D) Random data

Question 2: Key-Value memory has:

  • A) Keys only
  • B) Values only
  • C) Both keys and values
  • D) No structure

Question 3: In key-value memory attention, we compute:

  • A) Random output
  • B) Attention weights from query to keys, weighted sum of values
  • C) No attention
  • D) Identity output

Question 4: Neural Turing Machines use:

  • A) No memory
  • B) Differentiable memory with attention
  • C) Fixed memory only
  • D) Random access

Question 5: Memory attention is similar to:

  • A) Self-attention (but memory can be external)
  • B) No attention
  • C) Only cross-attention
  • D) Fully connected

Question 6: Content-based memory addressing uses:

  • A) Random locations
  • B) Similarity to query to find relevant memory
  • C) No addressing
  • D) Sequential access only

Question 7: Transformer tokens can be viewed as:

  • A) No memory
  • B) Previous tokens acting as memory for current token
  • C) External database
  • D) Random noise

Question 8: Memory Networks introduced:

  • A) Memory-less models
  • B) Memory with differentiable attention for reading/writing
  • C) No new concepts
  • D) Recurrent networks only