Introduction
Retrieval attention refers to attention mechanisms used in retrieval systems, where a query attends to a large external database or knowledge base. It is fundamental to retrieval-augmented generation (RAG), memory-augmented networks, and dense retrieval systems.
Retrieval-Augmented Generation (RAG)
Combines retrieval with generation:
Query → retrieve relevant documents → attend to retrieved content → generate response
The model attends over the retrieved document embeddings when generating the response.
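The pipeline above can be sketched end to end. This is a minimal toy in numpy, not a real RAG system: `rag_step` and its retrieval-by-dot-product are illustrative assumptions, and real systems use learned encoders and an ANN index.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rag_step(query_vec, doc_embeds, top_k=2):
    """Retrieve top-k documents by dot-product similarity, then
    attend over the retrieved embeddings to form a context vector."""
    scores = doc_embeds @ query_vec       # similarity of query to every doc
    top = np.argsort(scores)[-top_k:]     # indices of the top-k docs
    retrieved = doc_embeds[top]           # (top_k, d)
    attn = softmax(retrieved @ query_vec) # attention weights over retrieved docs
    context = attn @ retrieved            # weighted sum -> context vector
    return context, top, attn

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))            # 5 docs, embedding dim 8
q = rng.normal(size=8)
ctx, idx, w = rag_step(q, docs)           # ctx would feed the generator
```

In a full RAG model, `ctx` (or the retrieved documents themselves) would be passed to the generator as conditioning context.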
Attention in Retrieval
1. Query-Document Attention
Q = query embedding
K, V = document embeddings (from the retrieval index)
Attention weights = which retrieved documents are most relevant to the query
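The Q/K/V assignment above is standard scaled dot-product attention with a single query vector; a minimal sketch (function name and shapes are illustrative):

```python
import numpy as np

def query_document_attention(q, doc_embeds):
    """A single query vector attends over document embeddings,
    which serve as both keys and values."""
    d = q.shape[-1]
    scores = doc_embeds @ q / np.sqrt(d)   # relevance of each doc to the query
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                  # softmax over documents
    return weights @ doc_embeds, weights   # attended doc representation

rng = np.random.default_rng(1)
K = rng.normal(size=(10, 16))              # 10 retrieved docs, dim 16
q = rng.normal(size=16)
out, w = query_document_attention(q, K)
```

The returned `weights` are exactly the "which documents matter" signal described above.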
2. Cross-Attention for Fusion
After retrieval, the model cross-attends between the query tokens and the tokens of the retrieved content, fusing the two into a single representation.
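A token-level sketch of that fusion step, assuming the query and retrieved content are already token embeddings of the same dimension (real models would also apply learned W_Q, W_K, W_V projections, omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_tokens, retrieved_tokens):
    """Each query token attends over all retrieved tokens."""
    d = query_tokens.shape[-1]
    scores = query_tokens @ retrieved_tokens.T / np.sqrt(d)  # (n_q, n_r)
    weights = softmax(scores, axis=-1)                       # per query token
    return weights @ retrieved_tokens                        # fused (n_q, d)

rng = np.random.default_rng(1)
fused = cross_attend(rng.normal(size=(4, 16)),   # 4 query tokens
                     rng.normal(size=(12, 16)))  # 12 retrieved tokens
```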
Bi-Encoder vs Cross-Encoder Retrieval
| Aspect | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Representation | Query and doc separately encoded | Query and doc encoded together |
| Attention | No cross-attention in encoding | Cross-attention during encoding |
| Speed | Fast (index and compare) | Slow (must encode pairs) |
| Quality | Good but approximate | Higher (full query-doc interaction) |
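The speed difference in the table comes down to encoder-call counts: a bi-encoder can cache document encodings offline, while a cross-encoder must jointly encode every (query, document) pair at query time. A toy illustration (the `encode` function and the `q + d` joint-encoding stand-in are crude assumptions, not how real cross-encoders work):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 8))               # toy "encoder": a fixed nonlinearity

def encode(x):
    return np.tanh(W @ x)

docs = rng.normal(size=(100, 8))
q = rng.normal(size=8)

# Bi-encoder: docs encoded once and cached in an index;
# at query time only the query is encoded, then a cheap dot product.
doc_index = np.array([encode(d) for d in docs])  # offline, reusable
bi_scores = doc_index @ encode(q)                # 1 encoder call per query

# Cross-encoder: every (query, doc) pair passes through the encoder jointly.
cross_scores = np.array([encode(q + d).sum() for d in docs])  # N calls per query
```

With N documents, the bi-encoder does O(1) encoder calls per query against a precomputed index, while the cross-encoder does O(N), which is why cross-encoders are typically used only to rerank a small retrieved candidate set.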
Memory-Augmented Attention
Key-Value Memory: (K_mem, V_mem)
Attention(query, K_mem, V_mem)
This works like standard attention, except the memory (K_mem, V_mem) can be very large and persist across inputs.
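Because the memory can be very large, a common approximation is to restrict the softmax to the top-k closest keys (in practice found with an approximate nearest-neighbor index; the exact top-k search below is a simplification):

```python
import numpy as np

def memory_attention(q, K_mem, V_mem, top_k=32):
    """Attend over a large external key-value memory, restricting
    the softmax to the top-k highest-scoring keys."""
    scores = K_mem @ q / np.sqrt(q.shape[-1])  # score against every memory key
    top = np.argsort(scores)[-top_k:]          # keep only the top-k entries
    e = np.exp(scores[top] - scores[top].max())
    w = e / e.sum()                            # softmax over the top-k only
    return w @ V_mem[top]                      # weighted sum of their values

rng = np.random.default_rng(3)
K_mem = rng.normal(size=(1000, 16))            # 1000 memory slots
V_mem = rng.normal(size=(1000, 16))
q = rng.normal(size=16)
out = memory_attention(q, K_mem, V_mem)
```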