Introduction
The lost-in-the-middle problem describes how language models retrieve information far more reliably from the beginning or end of a long context than from the middle: when the relevant information sits in the middle, accuracy drops significantly. The effect was documented in "Lost in the Middle" (Liu et al., 2024).
The Phenomenon
When asked to retrieve a specific fact from a long document:
Position of fact → Retrieval accuracy
Beginning: ~90%
End: ~90%
Middle: ~60-70%
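The measurement behind these numbers can be sketched as a "needle in a haystack" prompt builder: place a known fact at a chosen position inside filler text, then ask the model to retrieve it. The filler sentence, the needle, and the question below are all illustrative placeholders, not the stimuli from the paper.

```python
# Hypothetical sketch: construct retrieval prompts with a known fact
# ("needle") placed at the beginning, middle, or end of filler context.
FILLER = ["The weather was unremarkable and nothing notable happened."] * 400
NEEDLE = "The secret code is 7412."

def build_prompt(position: str) -> str:
    """Insert NEEDLE into the filler sentences at the requested position."""
    sentences = list(FILLER)
    idx = {"beginning": 0,
           "middle": len(sentences) // 2,
           "end": len(sentences)}[position]
    sentences.insert(idx, NEEDLE)
    context = " ".join(sentences)
    return context + "\n\nQuestion: What is the secret code?"

prompt = build_prompt("middle")
```

Scoring each position is then a matter of checking whether the model's answer contains the needle's fact.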
Why Does This Happen?
1. Attention Flow
Information must flow through many transformer layers, and the representations of middle-context tokens may not be preserved all the way to the later layers that drive the output.
2. Positional Bias
Models trained with positional encodings may favor beginning/end positions.
3. Attention Dilution
As context grows, middle positions receive less distinct attention.
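The dilution intuition can be made concrete with a toy calculation (an assumption for illustration, not a measurement of any real model's internals): under near-uniform attention, the weight on any single middle token shrinks like 1/n as the context length n grows.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Uniform logits -> uniform attention: each of n tokens gets weight 1/n,
# so a middle token's share of attention vanishes as context grows.
for n in (1_000, 8_000, 32_000):
    weights = softmax(np.zeros(n))
    print(n, weights[n // 2])
```

Boundary tokens, by contrast, often receive extra mass from positional biases, which compounds the disadvantage of the middle.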
Experimental Evidence
Researchers inserted a random fact into documents of varying lengths and asked models to retrieve it:
| Context Length | Beginning | Middle | End |
|---|---|---|---|
| 1K tokens | 95% | 90% | 95% |
| 8K tokens | 85% | 60% | 85% |
| 32K tokens | 80% | 45% | 80% |
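Restating the table as data makes the trend explicit: the gap between edge and middle accuracy widens as the context grows.

```python
# The table above, reproduced as data, to quantify the degradation.
table = {
    1_000:  {"beginning": 0.95, "middle": 0.90, "end": 0.95},
    8_000:  {"beginning": 0.85, "middle": 0.60, "end": 0.85},
    32_000: {"beginning": 0.80, "middle": 0.45, "end": 0.80},
}

for length, acc in table.items():
    edge = (acc["beginning"] + acc["end"]) / 2
    gap = edge - acc["middle"]
    print(f"{length:>6} tokens: middle trails edges by {gap:.0%}")
```

At 1K tokens the middle trails the edges by only 5 points, but by 32K tokens the gap reaches 35 points.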
Implications
- RAG systems: May miss information in middle of retrieved docs
- Long document QA: Middle information may be ignored
- Code understanding: Important code in middle may be missed
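For RAG systems, one commonly suggested mitigation is to reorder retrieved documents so the most relevant ones land at the beginning and end of the prompt, pushing weaker matches into the middle. The function below is a minimal sketch of that idea, not any particular library's API.

```python
# Sketch of a long-context reordering mitigation: alternate top-ranked
# documents between the front and back of the prompt, so the least
# relevant ones end up in the (poorly attended) middle.
def reorder_for_long_context(docs_by_relevance):
    """docs_by_relevance: list sorted most-relevant first.
    Returns the docs with the strongest ones at both edges."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"]))
```

With five documents ranked d1 (best) to d5 (worst), the result places d1 first and d2 last, leaving d4 and d5 near the middle.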