# Reranking

## Theory

### Why first-pass retrieval is imprecise
Hybrid search returns `top_k=20` candidates ranked by approximate bi-encoder distance plus BM25 score. Neither method reads the query and document together — they score each independently. A highly relevant chunk can land at rank 18.
### Cross-encoder vs bi-encoder
| | Bi-encoder | Cross-encoder |
|---|---|---|
| Input | Query alone, doc alone | [query, doc] concatenated |
| Attention | Independent | Full cross-attention |
| Speed | Fast — precomputed index | Slow — runtime per candidate |
| Precision | Coarser | Higher |
| Role | First-pass retrieval | Reranking shortlist |
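The contrast in the table can be sketched with toy scoring functions. Everything below is illustrative — the vectors, the `doc_index`, and the overlap-based "cross-encoder" are stand-ins for real models, not an actual implementation:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Bi-encoder path: query and documents are embedded independently,
# so document vectors can be precomputed and stored in an index.
doc_index = {
    "doc_a": [0.9, 0.1, 0.0],  # toy precomputed embeddings
    "doc_b": [0.2, 0.8, 0.1],
}

def bi_encoder_score(query_vec, doc_id):
    # Fast: one lookup + one cosine, no joint attention over the pair.
    return cosine(query_vec, doc_index[doc_id])

# Cross-encoder path: reads query and document *together*; nothing can
# be precomputed, so it runs at query time for every candidate.
def cross_encoder_score(query, doc):
    # Stand-in for a model forward pass over "[CLS] query [SEP] doc";
    # here simple token overlap plays the role of cross-attention.
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / max(len(q), 1)

fast = bi_encoder_score([0.8, 0.3, 0.1], "doc_a")
slow = cross_encoder_score("rerank with cross attention",
                           "cross attention reads query and doc jointly")
print(round(fast, 3), round(slow, 3))
```

The asymmetry is the whole point: the bi-encoder's per-document work happens at indexing time, while the cross-encoder's work happens per query per candidate — which is why it is reserved for a small shortlist.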
### Two-stage pipeline
- Hybrid search (vector + BM25) retrieves `top_k=20` candidates — optimized for recall
- Cross-encoder scores each candidate against the query with full attention
- Keep the `top_n=5` highest-scoring chunks for the LLM context
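The two stages above can be wired together as follows. The scoring functions are stubs (crude term counts and Jaccard overlap) standing in for a real hybrid index and a real cross-encoder model; the corpus and query are invented for the sketch:

```python
def first_pass_score(query, doc):
    # Stub for fused vector + BM25 scoring: crude term counts.
    return sum(doc.count(w) for w in query.split())

def cross_encoder_score(query, doc):
    # Stub for joint (query, doc) scoring: Jaccard token overlap.
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

def hybrid_search(query, corpus, top_k=20):
    """Stage 1: cheap independent scores, wide net for recall."""
    scored = [(first_pass_score(query, doc), doc) for doc in corpus]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]

def rerank(query, candidates, top_n=5):
    """Stage 2: expensive joint scoring of each (query, candidate) pair."""
    scored = [(cross_encoder_score(query, doc), doc) for doc in candidates]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_n]]

corpus = [f"chunk {i} about reranking pipelines" for i in range(30)] + [
    "reranking uses a cross encoder to rescore candidates"
]
query = "cross encoder reranking"
shortlist = hybrid_search(query, corpus, top_k=20)   # recall-oriented
final = rerank(query, shortlist, top_n=5)            # precision-oriented
print(len(shortlist), len(final), final[0])
```

Note that the reranker never sees the full corpus — only the 20-candidate shortlist — which is what keeps its per-query cost bounded.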
Common rerankers include `cross-encoder/ms-marco-MiniLM-L-6-v2`, `bge-reranker-v2`, and the Cohere Rerank API. Latency cost is 20–80 ms per query — usually acceptable given the precision gains.
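Since all of these options ultimately score (query, document) pairs, they can sit behind one small interface, making the reranker swappable. The `Reranker` protocol and `OverlapReranker` stub below are hypothetical names for illustration — a real implementation would wrap a model such as a sentence-transformers `CrossEncoder` or the Cohere client instead of token overlap:

```python
from typing import Protocol, Sequence

class Reranker(Protocol):
    """Anything that scores a batch of docs against one query."""
    def score(self, query: str, docs: Sequence[str]) -> list[float]: ...

class OverlapReranker:
    """Stub reranker using Jaccard token overlap; swap in a wrapper
    around a real cross-encoder model or rerank API in production."""
    def score(self, query, docs):
        q = set(query.split())
        return [len(q & set(d.split())) / len(q | set(d.split()))
                for d in docs]

def top_n(reranker: Reranker, query: str, docs: Sequence[str], n: int = 5):
    # Score every candidate, then keep the n highest-scoring docs.
    scores = reranker.score(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:n]]
```

Keeping the reranker behind an interface like this also makes the latency trade-off easy to revisit: a heavier model changes one constructor call, not the pipeline.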
Next: the evaluation lesson covers how to measure whether reranking actually improves your system's answers.