Introduction
Retrieval attention refers to attention mechanisms used in retrieval systems, where a query attends to a large external database or knowledge base. It is fundamental to retrieval-augmented generation (RAG), memory-augmented networks, and dense retrieval systems.
Retrieval-Augmented Generation (RAG)
Combines retrieval with generation:
Query → retrieve relevant documents → attend to retrieved content → generate response
The model attends over the retrieved document embeddings when generating the response.
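The pipeline above can be sketched end to end. This is a minimal toy in numpy, not a real RAG system: `rag_step` and its retrieval-by-dot-product are illustrative assumptions, and real systems use learned encoders and an ANN index.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rag_step(query_vec, doc_embeds, top_k=2):
    """Retrieve top-k documents by dot-product similarity, then
    attend over the retrieved embeddings to form a context vector."""
    scores = doc_embeds @ query_vec       # similarity of query to every doc
    top = np.argsort(scores)[-top_k:]     # indices of the top-k docs
    retrieved = doc_embeds[top]           # (top_k, d)
    attn = softmax(retrieved @ query_vec) # attention weights over retrieved docs
    context = attn @ retrieved            # weighted sum -> context vector
    return context, top, attn

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))            # 5 docs, embedding dim 8
q = rng.normal(size=8)
ctx, idx, w = rag_step(q, docs)           # ctx would feed the generator
```

In a full RAG model, `ctx` (or the retrieved documents themselves) would be passed to the generator as conditioning context.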
Attention in Retrieval
1. Query-Document Attention
Q = query embedding
K, V = document embeddings (from the retrieval index)
Attention weights = which retrieved documents are most relevant to the query
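The Q/K/V assignment above is standard scaled dot-product attention with a single query vector; a minimal sketch (function name and shapes are illustrative):

```python
import numpy as np

def query_document_attention(q, doc_embeds):
    """A single query vector attends over document embeddings,
    which serve as both keys and values."""
    d = q.shape[-1]
    scores = doc_embeds @ q / np.sqrt(d)   # relevance of each doc to the query
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                  # softmax over documents
    return weights @ doc_embeds, weights   # attended doc representation

rng = np.random.default_rng(1)
K = rng.normal(size=(10, 16))              # 10 retrieved docs, dim 16
q = rng.normal(size=16)
out, w = query_document_attention(q, K)
```

The returned `weights` are exactly the "which documents matter" signal described above.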
2. Cross-Attention for Fusion
After retrieval, the model cross-attends between the query tokens and the tokens of the retrieved content, fusing the two into a single representation.
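A token-level sketch of that fusion step, assuming the query and retrieved content are already token embeddings of the same dimension (real models would also apply learned W_Q, W_K, W_V projections, omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_tokens, retrieved_tokens):
    """Each query token attends over all retrieved tokens."""
    d = query_tokens.shape[-1]
    scores = query_tokens @ retrieved_tokens.T / np.sqrt(d)  # (n_q, n_r)
    weights = softmax(scores, axis=-1)                       # per query token
    return weights @ retrieved_tokens                        # fused (n_q, d)

rng = np.random.default_rng(1)
fused = cross_attend(rng.normal(size=(4, 16)),   # 4 query tokens
                     rng.normal(size=(12, 16)))  # 12 retrieved tokens
```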
Bi-Encoder vs Cross-Encoder Retrieval
| Aspect | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Representation | Query and doc separately encoded | Query and doc encoded together |
| Attention | No cross-attention in encoding | Cross-attention during encoding |
| Speed | Fast (index and compare) | Slow (must encode pairs) |
| Quality | Good but approximate | Higher (full query-doc interaction) |
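The speed difference in the table comes down to encoder-call counts: a bi-encoder can cache document encodings offline, while a cross-encoder must jointly encode every (query, document) pair at query time. A toy illustration (the `encode` function and the `q + d` joint-encoding stand-in are crude assumptions, not how real cross-encoders work):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 8))               # toy "encoder": a fixed nonlinearity

def encode(x):
    return np.tanh(W @ x)

docs = rng.normal(size=(100, 8))
q = rng.normal(size=8)

# Bi-encoder: docs encoded once and cached in an index;
# at query time only the query is encoded, then a cheap dot product.
doc_index = np.array([encode(d) for d in docs])  # offline, reusable
bi_scores = doc_index @ encode(q)                # 1 encoder call per query

# Cross-encoder: every (query, doc) pair passes through the encoder jointly.
cross_scores = np.array([encode(q + d).sum() for d in docs])  # N calls per query
```

With N documents, the bi-encoder does O(1) encoder calls per query against a precomputed index, while the cross-encoder does O(N), which is why cross-encoders are typically used only to rerank a small retrieved candidate set.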
Memory-Augmented Attention
Key-Value Memory: (K_mem, V_mem)
Attention(query, K_mem, V_mem)
This works like standard attention, except the memory (K_mem, V_mem) can be very large and persist across inputs.
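Because the memory can be very large, a common approximation is to restrict the softmax to the top-k closest keys (in practice found with an approximate nearest-neighbor index; the exact top-k search below is a simplification):

```python
import numpy as np

def memory_attention(q, K_mem, V_mem, top_k=32):
    """Attend over a large external key-value memory, restricting
    the softmax to the top-k highest-scoring keys."""
    scores = K_mem @ q / np.sqrt(q.shape[-1])  # score against every memory key
    top = np.argsort(scores)[-top_k:]          # keep only the top-k entries
    e = np.exp(scores[top] - scores[top].max())
    w = e / e.sum()                            # softmax over the top-k only
    return w @ V_mem[top]                      # weighted sum of their values

rng = np.random.default_rng(3)
K_mem = rng.normal(size=(1000, 16))            # 1000 memory slots
V_mem = rng.normal(size=(1000, 16))
q = rng.normal(size=16)
out = memory_attention(q, K_mem, V_mem)
```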