50. Retrieval Attention

Introduction

Retrieval attention refers to attention mechanisms in which a query attends to a large external database or knowledge base rather than to the current input sequence. It is fundamental to retrieval-augmented generation (RAG) models, memory-augmented networks, and dense retrieval systems.

Retrieval-Augmented Generation (RAG)

RAG combines retrieval with generation:

Query → retrieve relevant documents → attend to retrieved content → generate response

The generator attends to the retrieved document embeddings, conditioning its output on the retrieved evidence.
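
A minimal end-to-end sketch of this pipeline in NumPy. The random "index" and the three numbered steps are purely illustrative; a real system would use learned encoders for step 1 and a decoder for step 3:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_docs, k = 64, 1000, 5

index = rng.standard_normal((n_docs, d))   # precomputed document embeddings
query = rng.standard_normal(d)             # encoded user query

# 1. Retrieve: top-k documents by inner-product similarity
scores = index @ query                     # (n_docs,)
top_k = np.argsort(scores)[-k:]

# 2. Attend: softmax over the retrieved documents only
logits = scores[top_k] / np.sqrt(d)
weights = np.exp(logits - logits.max())
weights /= weights.sum()

# 3. Generate: a real decoder would condition on this fused context;
#    here we only form the attention-weighted mixture of retrieved docs
context = weights @ index[top_k]           # (d,)
```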

Attention in Retrieval

1. Query-Document Attention

Q = query embedding
K, V = document embeddings (from the retrieval index)

The attention weights indicate which retrieved documents are most relevant to the query.
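
A minimal NumPy sketch of this computation; `doc_attention` is an illustrative name, and in practice K and V would come from a learned encoder over the indexed documents:

```python
import numpy as np

def doc_attention(q, K, V):
    """Scaled dot-product attention where K and V come from a retrieval
    index rather than from the current sequence.
    q: (d,) query embedding; K, V: (N, d) document keys/values."""
    logits = K @ q / np.sqrt(q.shape[-1])   # per-document relevance
    w = np.exp(logits - logits.max())
    w /= w.sum()                            # attention weights over documents
    return w @ V, w                         # fused context and relevance scores
```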

2. Cross-Attention for Fusion

After retrieval, the model cross-attends between the query tokens and the tokens of the retrieved content, fusing external evidence into the representation.
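
A sketch of this fusion step, assuming both sides are already embedded; the learned Q/K/V projections a real model would apply are omitted to keep the pattern visible:

```python
import numpy as np

def cross_attend(query_tokens, retrieved_tokens):
    """Every query token attends over all retrieved-content tokens.
    query_tokens: (Tq, d); retrieved_tokens: (Tr, d)."""
    d = query_tokens.shape[-1]
    logits = query_tokens @ retrieved_tokens.T / np.sqrt(d)   # (Tq, Tr)
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)                        # rows sum to 1
    return w @ retrieved_tokens                               # (Tq, d)
```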

Bi-Encoder vs Cross-Encoder Retrieval

| Aspect | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Representation | Query and doc encoded separately | Query and doc encoded together |
| Attention | No cross-attention during encoding | Cross-attention during encoding |
| Speed | Fast (precompute index, then compare) | Slow (must encode each pair) |
| Quality | Good but approximate | Better but slower |
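
To make the trade-off concrete, a sketch using the sentence-transformers library; the checkpoint names are common public models chosen only for illustration, not a requirement of the technique:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, CrossEncoder

query = "what is retrieval attention?"
docs = ["Attention over an external document index.",
        "A recipe for sourdough bread."]

# Bi-encoder: encode query and docs independently; document vectors can
# be precomputed offline, so ranking is a cheap dot product.
bi = SentenceTransformer("all-MiniLM-L6-v2")
q_emb = bi.encode(query)
bi_scores = bi.encode(docs) @ q_emb

# Cross-encoder: each (query, doc) pair is encoded jointly, so nothing
# can be precomputed, but cross-attention sees both texts at once.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = ce.predict([(query, d) for d in docs])
```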

Memory-Augmented Attention

Key-Value Memory: the memory stores pairs (K_mem, V_mem), and a read is computed as Attention(query, K_mem, V_mem). The mechanism is the same as standard attention, except that the memory can be very large.
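
A sketch of a memory read, assuming the memory is a plain array; the top-k restriction shown is a common optimization for very large memories, not part of the definition:

```python
import numpy as np

def memory_read(q, K_mem, V_mem, top_k=32):
    """Read from an external key-value memory. The math is standard
    attention, but with M entries (possibly millions) it is common to
    restrict the softmax to the nearest keys; exact top-k is used here
    where a large system would use an approximate index."""
    scores = K_mem @ q / np.sqrt(q.shape[-1])    # (M,) key similarities
    idx = np.argsort(scores)[-top_k:]            # nearest memory slots
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    return w @ V_mem[idx]                        # (d,) retrieved value
```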

Test Your Understanding

Question 1: Retrieval attention is used in:

  • A) RAG (Retrieval-Augmented Generation)
  • B) Memory-augmented networks
  • C) Dense retrieval systems
  • D) All of the above

Question 2: In RAG, attention is computed between the query and:

  • A) Random noise
  • B) Retrieved document embeddings
  • C) No attention
  • D) Only recent tokens

Question 3: Bi-encoder retrieval:

  • A) Encodes query and doc together with cross-attention
  • B) Encodes query and doc separately, no cross-attention
  • C) Uses no encoding
  • D) Is slower than cross-encoder

Question 4: Cross-encoder is:

  • A) Faster than bi-encoder
  • B) Slower but more accurate
  • C) No different from bi-encoder
  • D) Not used

Question 5: Memory-augmented attention uses:

  • A) Query attends to KV memory
  • B) No attention
  • C) Only value
  • D) Fixed memory

Question 6: Retrieval attention differs from standard attention in that:

  • A) Attends to external database/documents rather than current sequence
  • B) Is exactly the same
  • C) Has no purpose
  • D) Uses less memory

Question 7: Dense retrieval uses:

  • A) Sparse vectors
  • B) Learned dense embeddings for similarity search
  • C) No embeddings
  • D) Random vectors

Question 8: For large-scale retrieval, cross-attention is often avoided because:

  • A) Too slow with large document collections
  • B) Too accurate
  • C) Uses too little memory
  • D) Not useful