Reranking

Act 3 · ~4 min

Theory

Why first-pass retrieval is imprecise

Hybrid search returns top_k=20 candidates ranked by a blend of approximate bi-encoder distance and BM25 score. Neither method reads the query and document together; each is scored independently, so a highly relevant chunk can land at rank 18.

Cross-encoder vs bi-encoder

|           | Bi-encoder               | Cross-encoder               |
|-----------|--------------------------|-----------------------------|
| Input     | Query alone, doc alone   | [query, doc] concatenated   |
| Attention | Independent              | Full cross-attention        |
| Speed     | Fast (precomputed index) | Slow (runtime per candidate) |
| Precision | Coarser                  | Higher                      |
| Role      | First-pass retrieval     | Reranking shortlist         |
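
To make the contrast concrete, here is a minimal sketch using the sentence-transformers library. The bi-encoder model name (all-MiniLM-L6-v2), the query, and the document text are illustrative assumptions; the cross-encoder is the MiniLM reranker named later in this lesson.

```python
# Minimal sketch, assuming the sentence-transformers library.
# Model names and example texts are illustrative assumptions.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "How do I rotate an API key?"
doc = "API keys can be rotated from the security settings page."

# Bi-encoder: query and doc are embedded independently, so document
# vectors can be precomputed and indexed; relevance is cosine similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_emb, d_emb = bi_encoder.encode([query, doc])
print("bi-encoder score:", float(util.cos_sim(q_emb, d_emb)))

# Cross-encoder: the concatenated pair is scored with full cross-attention.
# Nothing can be precomputed, so this cost is paid per candidate at query time.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("cross-encoder score:", float(cross_encoder.predict([(query, doc)])[0]))
```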

Two-stage pipeline

  1. Hybrid search (vector + BM25) retrieves top_k=20 — optimized for recall
  2. Cross-encoder scores each candidate against the query with full attention
  3. Keep top_n=5 highest-scoring chunks for the LLM context
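
Wired together, stages 2 and 3 look roughly like the sketch below. `hybrid_search` is a hypothetical stand-in for whatever stage-1 retriever the system uses; only the reranking stages are implemented here.

```python
# Sketch of pipeline stages 2-3, assuming sentence-transformers.
# `hybrid_search` is a hypothetical stage-1 retriever returning chunk strings.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Stage 2: score every (query, candidate) pair with full cross-attention.
    scores = reranker.predict([(query, doc) for doc in candidates])
    # Stage 3: keep the top_n highest-scoring chunks for the LLM context.
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# candidates = hybrid_search(query, top_k=20)  # stage 1: recall-oriented
# context_chunks = rerank(query, candidates, top_n=5)
```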

Common rerankers include cross-encoder/ms-marco-MiniLM-L-6-v2, bge-reranker-v2, and the Cohere Rerank API. The latency cost is typically 20–80 ms per query, usually acceptable given the precision gains.
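
For the hosted option, a sketch against the Cohere Rerank API follows; the client call and model name match the Cohere Python SDK as of this writing, but treat both as assumptions to verify, and note the example texts are illustrative.

```python
# Sketch of the same rerank step against Cohere's hosted Rerank API.
# Assumes `pip install cohere` and a CO_API_KEY in the environment;
# the model name is an assumption that may change between API versions.
import cohere

co = cohere.Client()  # picks up CO_API_KEY from the environment

query = "How do I rotate an API key?"
candidates = [
    "API keys can be rotated from the security settings page.",
    "Our office is closed on public holidays.",
]

response = co.rerank(model="rerank-english-v3.0", query=query,
                     documents=candidates, top_n=1)
# Each result carries the index of the original document and a relevance score.
top_chunks = [candidates[r.index] for r in response.results]
```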

Next: the evaluation lesson covers how to measure whether reranking actually improves your system's answers.