# Reranking

## Theory

### Why first-pass retrieval is imprecise
Hybrid search returns `top_k=20` candidates ranked by approximate bi-encoder distance plus BM25 score. Neither method reads the query and document together — they score each independently. A highly relevant chunk can land at rank 18.
### Cross-encoder vs bi-encoder
| | Bi-encoder | Cross-encoder |
|---|---|---|
| Input | Query alone, doc alone | [query, doc] concatenated |
| Attention | Independent | Full cross-attention |
| Speed | Fast — precomputed index | Slow — runtime per candidate |
| Precision | Coarser | Higher |
| Role | First-pass retrieval | Reranking shortlist |
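The contrast in the table can be sketched with toy scoring functions. Everything below is illustrative — the vectors, the `doc_index`, and the overlap-based "cross-encoder" are stand-ins for real models, not an actual implementation:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Bi-encoder path: query and documents are embedded independently,
# so document vectors can be precomputed and stored in an index.
doc_index = {
    "doc_a": [0.9, 0.1, 0.0],  # toy precomputed embeddings
    "doc_b": [0.2, 0.8, 0.1],
}

def bi_encoder_score(query_vec, doc_id):
    # Fast: one lookup + one cosine, no joint attention over the pair.
    return cosine(query_vec, doc_index[doc_id])

# Cross-encoder path: reads query and document *together*; nothing can
# be precomputed, so it runs at query time for every candidate.
def cross_encoder_score(query, doc):
    # Stand-in for a model forward pass over "[CLS] query [SEP] doc";
    # here simple token overlap plays the role of cross-attention.
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / max(len(q), 1)

fast = bi_encoder_score([0.8, 0.3, 0.1], "doc_a")
slow = cross_encoder_score("rerank with cross attention",
                           "cross attention reads query and doc jointly")
print(round(fast, 3), round(slow, 3))
```

The asymmetry is the whole point: the bi-encoder's per-document work happens at indexing time, while the cross-encoder's work happens per query per candidate — which is why it is reserved for a small shortlist.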
### Two-stage pipeline
- Hybrid search (vector + BM25) retrieves `top_k=20` candidates — optimized for recall
- Cross-encoder scores each candidate against the query with full attention
- Keep the `top_n=5` highest-scoring chunks for the LLM context
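The two stages above can be wired together as follows. The scoring functions are stubs (crude term counts and Jaccard overlap) standing in for a real hybrid index and a real cross-encoder model; the corpus and query are invented for the sketch:

```python
def first_pass_score(query, doc):
    # Stub for fused vector + BM25 scoring: crude term counts.
    return sum(doc.count(w) for w in query.split())

def cross_encoder_score(query, doc):
    # Stub for joint (query, doc) scoring: Jaccard token overlap.
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

def hybrid_search(query, corpus, top_k=20):
    """Stage 1: cheap independent scores, wide net for recall."""
    scored = [(first_pass_score(query, doc), doc) for doc in corpus]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]

def rerank(query, candidates, top_n=5):
    """Stage 2: expensive joint scoring of each (query, candidate) pair."""
    scored = [(cross_encoder_score(query, doc), doc) for doc in candidates]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_n]]

corpus = [f"chunk {i} about reranking pipelines" for i in range(30)] + [
    "reranking uses a cross encoder to rescore candidates"
]
query = "cross encoder reranking"
shortlist = hybrid_search(query, corpus, top_k=20)   # recall-oriented
final = rerank(query, shortlist, top_n=5)            # precision-oriented
print(len(shortlist), len(final), final[0])
```

Note that the reranker never sees the full corpus — only the 20-candidate shortlist — which is what keeps its per-query cost bounded.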
Common rerankers include `cross-encoder/ms-marco-MiniLM-L-6-v2`, `bge-reranker-v2`, and the Cohere Rerank API. Latency cost is 20–80 ms per query — usually acceptable given the precision gains.
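Since all of these options ultimately score (query, document) pairs, they can sit behind one small interface, making the reranker swappable. The `Reranker` protocol and `OverlapReranker` stub below are hypothetical names for illustration — a real implementation would wrap a model such as a sentence-transformers `CrossEncoder` or the Cohere client instead of token overlap:

```python
from typing import Protocol, Sequence

class Reranker(Protocol):
    """Anything that scores a batch of docs against one query."""
    def score(self, query: str, docs: Sequence[str]) -> list[float]: ...

class OverlapReranker:
    """Stub reranker using Jaccard token overlap; swap in a wrapper
    around a real cross-encoder model or rerank API in production."""
    def score(self, query, docs):
        q = set(query.split())
        return [len(q & set(d.split())) / len(q | set(d.split()))
                for d in docs]

def top_n(reranker: Reranker, query: str, docs: Sequence[str], n: int = 5):
    # Score every candidate, then keep the n highest-scoring docs.
    scores = reranker.score(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:n]]
```

Keeping the reranker behind an interface like this also makes the latency trade-off easy to revisit: a heavier model changes one constructor call, not the pipeline.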
Next: the evaluation lesson covers how to measure whether reranking actually improves your system's answers.