First-stage retrieval (dense, sparse, or hybrid) casts a wide net to find candidate documents. But the top-k from this initial pass often contains marginal or irrelevant results. Reranking applies a more powerful (and more expensive) model to reorder these candidates, which can substantially improve precision in the top positions. This post covers the three dominant reranking approaches and how to deploy them in production RAG pipelines.
Two-Stage Retrieval Architecture
The two-stage paradigm separates retrieval into a recall-optimized first stage and a precision-optimized second stage. The first stage retrieves a large candidate set (e.g., top-100) cheaply. The reranker then scores each candidate against the query with full cross-attention, promoting the most relevant documents to the top positions.
This architecture is standard in production search systems. The key design decision is how many candidates to pass to the reranker. More candidates raise the ceiling on recall, because the reranker can only promote documents the first stage actually returned, but every extra candidate adds reranking latency. Typical values are 50–200.
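One way to choose the candidate count is to measure first-stage recall at several cutoffs on a labeled query set. The sketch below is illustrative only: the document IDs and relevance labels are made up, and `recall_at_k` is a helper name introduced here, not part of any library.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved IDs."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = relevant_set.intersection(retrieved[:k])
    return len(hits) / len(relevant_set)


# Hypothetical first-stage ranking for one query, best match first.
ranking = ["d7", "d2", "d9", "d4", "d1", "d5"]
# d8 is relevant but was never retrieved, so no reranker can surface it.
relevant = {"d2", "d1", "d8"}

for k in (2, 4, 6):
    print(k, round(recall_at_k(ranking, relevant, k), 2))
```

If recall stops improving beyond some cutoff, passing more candidates to the reranker mostly adds latency.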
Cross-Encoders
A cross-encoder processes the query and document jointly through a Transformer, allowing full cross-attention between query and document tokens. This produces far more accurate relevance scores than bi-encoders (which encode query and document independently), but it requires one Transformer forward pass per (query, document) pair. The cost per query therefore grows linearly, O(n), with the number of candidates n, and none of that work can be precomputed.
The key difference: bi-encoders produce independent embeddings that are compared with dot product; cross-encoders produce a single relevance score from the concatenated [query, SEP, document] input.
Bi-Encoder (Stage 1)
Query and document encoded independently. Score = dot product. Can precompute document embeddings. Scales to millions of docs.
Cross-Encoder (Stage 2)
Query and document encoded jointly. Full cross-attention. Must run for each (query, doc) pair. Limited to ~100-500 candidates.
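To make the stage 1 side of this comparison concrete, here is a minimal sketch of bi-encoder scoring with NumPy. The embeddings are random stand-ins rather than output from a real model, and `bi_encoder_top_k` is a hypothetical helper name. The point is that once document embeddings are precomputed, scoring the whole collection is a single matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for precomputed document embeddings: 1,000 docs, 64 dims, unit-normalized.
doc_embs = rng.normal(size=(1000, 64))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)


def bi_encoder_top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest dot-product scores, best first."""
    scores = doc_embs @ query_emb              # one matrix-vector product covers every doc
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k
    return top[np.argsort(-scores[top])]       # sort only the k survivors


# A query embedding placed close to document 42, so that document should rank first.
query_emb = doc_embs[42] + 0.05 * rng.normal(size=64)
query_emb /= np.linalg.norm(query_emb)

candidates = bi_encoder_top_k(query_emb, doc_embs, k=100)
print(candidates[:5])
```

The resulting candidate indices are what a cross-encoder would then rescore pair by pair.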
# Cross-encoder reranking with sentence-transformers
from sentence_transformers import CrossEncoder
import numpy as np

# Load a cross-encoder fine-tuned for reranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, documents: list, top_k: int = 10):
    """Rerank documents using a cross-encoder."""
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs, batch_size=32)
    ranked_indices = np.argsort(scores)[::-1][:top_k]
    return [
        {"document": documents[i], "score": float(scores[i])}
        for i in ranked_indices
    ]

# Example usage
query = "What is retrieval-augmented generation?"
candidates = [
    "RAG combines retrieval with language model generation...",
    "Product quantization compresses high-dimensional vectors...",
    "Retrieval-augmented models access external knowledge...",
]
results = rerank(query, candidates)
for r in results:
    print(f"[{r['score']:.3f}] {r['document'][:60]}...")
Popular cross-encoder models, ordered by quality/speed tradeoff:
cross-encoder/ms-marco-MiniLM-L-6-v2 — fast, good baseline (22M params)
BAAI/bge-reranker-v2-m3 — multilingual variant with strong cross-lingual performance
ColBERT: Late Interaction
ColBERT (Contextualized Late Interaction over BERT) is a middle ground between bi-encoders and cross-encoders. It encodes query and document independently into per-token embeddings, then computes relevance via a MaxSim operation: for each query token, find the maximum cosine similarity to any document token, then sum these maxima.
$$\mathrm{score}(q, d) = \sum_{i} \max_{j} \, \mathrm{sim}(q_i, d_j)$$
This late interaction preserves the ability to precompute document token embeddings (like a bi-encoder) while capturing fine-grained token-level matching (like a cross-encoder). ColBERT achieves ~95% of cross-encoder quality at ~10× the speed.
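As a small worked example of MaxSim, assume a query with two tokens and a document with three tokens, with made-up similarity values. Rows of the matrix S are query tokens and columns are document tokens:

```latex
S = \begin{pmatrix} 0.9 & 0.2 & 0.4 \\ 0.1 & 0.7 & 0.3 \end{pmatrix},
\qquad
\mathrm{score}(q, d) = \max(0.9,\, 0.2,\, 0.4) + \max(0.1,\, 0.7,\, 0.3) = 0.9 + 0.7 = 1.6
```

Each query token contributes only its best match, so a document scores well when every query token finds a close counterpart somewhere in the document.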
# ColBERT-style late interaction scoring
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Compute ColBERT MaxSim between query and document token embeddings.

    Args:
        query_embs: (num_query_tokens, dim) normalized embeddings
        doc_embs: (num_doc_tokens, dim) normalized embeddings
    Returns:
        MaxSim relevance score.
    """
    # Similarity matrix: (num_query_tokens, num_doc_tokens)
    sim_matrix = np.dot(query_embs, doc_embs.T)
    # For each query token, take max similarity across doc tokens
    max_sims = sim_matrix.max(axis=1)
    return float(max_sims.sum())

# Using RAGatouille for easy ColBERT v2 usage
from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# `documents` is your list of passage strings, defined elsewhere
rag.index(
    collection=documents,
    index_name="my_index",
    max_document_length=256,
    split_documents=True,
)
results = rag.search(query="What is RAG?", k=10)
ColBERT tradeoff: ColBERT stores per-token embeddings, which requires more storage than single-vector bi-encoders (roughly 50–100× more per document). This is the price for late interaction. For large corpora, combine ColBERT with a coarse first-stage retriever.
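A back-of-the-envelope calculation shows where a ratio in that range can come from. The figures below are assumptions chosen for illustration, not measurements: a 256-token document, 128-dimensional token vectors, a 384-dimensional single-vector embedding, and uncompressed float32 storage on both sides. `embedding_bytes` is a hypothetical helper.

```python
def embedding_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store num_vectors dense vectors of the given dimension."""
    return num_vectors * dim * bytes_per_value


# Assumed figures: one vector per token for late interaction,
# one vector per document for the bi-encoder, float32 everywhere.
colbert = embedding_bytes(num_vectors=256, dim=128)
single = embedding_bytes(num_vectors=1, dim=384)

print(colbert, single, round(colbert / single, 1))
```

Shorter documents, larger single-vector dimensions, or compressed token embeddings all shrink the ratio, so it is worth recomputing with your own numbers.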
Cohere Rerank API
For teams that prefer a managed solution, Cohere's Rerank API provides state-of-the-art reranking as a service. It accepts a query and a list of documents, returning relevance scores. The model is a large cross-encoder trained on extensive relevance data.
# Cohere Rerank API integration
import cohere

co = cohere.Client("YOUR_API_KEY")

def cohere_rerank(query: str, documents: list, top_k: int = 10):
    """Rerank using Cohere's hosted cross-encoder."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_k,
        return_documents=True,
    )
    return [
        {
            "index": r.index,
            "score": r.relevance_score,
            "text": r.document.text,
        }
        for r in response.results
    ]

# Usage in a RAG pipeline
# (first_stage_retrieve and llm stand in for your own retriever and LLM client)
candidates = first_stage_retrieve(query, top_k=100)
reranked = cohere_rerank(query, candidates, top_k=10)
context = "\n\n".join([r["text"] for r in reranked])
answer = llm.generate(prompt=f"Context:\n{context}\n\nQuestion: {query}")
Cohere Rerank offers three model tiers:
rerank-english-v3.0 — highest quality for English
rerank-multilingual-v3.0 — supports 100+ languages
rerank-english-v2.0 — legacy, faster but lower quality
Latency Tradeoffs
Reranking adds latency to your pipeline. The total retrieval time is: T = T_first_stage + T_rerank(n), where n is the number of candidates. Understanding the latency profile of each reranker is critical for meeting SLA requirements.
| Reranker | Latency for 100 docs (ms) | NDCG@10 | Hosting |
| --- | --- | --- | --- |
| MiniLM-L-6 (cross-enc.) | 45 | 0.68 | Self-hosted |
| BGE-reranker-large | 130 | 0.74 | Self-hosted |
| ColBERT v2 | 25 | 0.71 | Self-hosted |
| Cohere rerank-v3 | 180 | 0.76 | API |
| No reranking | 0 | 0.55 | n/a |
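For readers who want to reproduce the quality column on their own data, here is a minimal sketch of NDCG@10. It makes two simplifying assumptions: gains are linear in the relevance label (some evaluations use an exponential gain instead), and the ideal ordering is computed only from the documents in the returned list. The function names are introduced here for illustration.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in ranked order, before and after a hypothetical reranking step.
before = [0, 1, 0, 0, 1]
after = [1, 1, 0, 0, 0]

print(round(ndcg_at_k(before), 3), round(ndcg_at_k(after), 3))
```

Averaging this value over a labeled query set gives a number comparable in spirit to the table, though absolute values depend heavily on the dataset.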
Batching matters: Cross-encoder latency scales roughly linearly with candidate count. If your SLA is 200ms total and the first stage takes 20ms, you have 180ms for reranking. At ~1.3ms per candidate (BGE-large on GPU), that's 180 / 1.3 ≈ 138 candidates at most. Profile on your hardware.
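The budget arithmetic above can be wrapped in a small helper. This is a sketch that assumes purely linear per-candidate cost with no fixed overhead (model warm-up, network round trips), which real systems will not match exactly. `max_candidates` is a hypothetical name.

```python
import math


def max_candidates(sla_ms: float, first_stage_ms: float, per_candidate_ms: float) -> int:
    """Largest candidate count whose reranking fits in the remaining latency budget."""
    budget_ms = sla_ms - first_stage_ms
    if budget_ms <= 0:
        return 0
    return math.floor(budget_ms / per_candidate_ms)


print(max_candidates(sla_ms=200, first_stage_ms=20, per_candidate_ms=1.3))  # 138
```

Plugging in your own measured per-candidate latency gives a starting point for the first-stage cutoff, which you can then tighten to leave headroom for tail latency.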
Production Pipeline Implementation
Here is a complete two-stage retrieval pipeline with configurable reranking backend:
from abc import ABC, abstractmethod
from typing import List, Dict
import numpy as np

class BaseReranker(ABC):
    @abstractmethod
    def rerank(self, query: str, docs: List[str], top_k: int) -> List[Dict]:
        ...

class CrossEncoderReranker(BaseReranker):
    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        from sentence_transformers import CrossEncoder
        self.model = CrossEncoder(model_name)

    def rerank(self, query, docs, top_k=10):
        pairs = [[query, d] for d in docs]
        scores = self.model.predict(pairs, batch_size=32)
        ranked = np.argsort(scores)[::-1][:top_k]
        return [{"doc": docs[i], "score": float(scores[i])} for i in ranked]

class CohereReranker(BaseReranker):
    def __init__(self, api_key, model="rerank-english-v3.0"):
        import cohere
        self.client = cohere.Client(api_key)
        self.model = model

    def rerank(self, query, docs, top_k=10):
        resp = self.client.rerank(
            model=self.model, query=query,
            documents=docs, top_n=top_k
        )
        return [{"doc": docs[r.index], "score": r.relevance_score} for r in resp.results]

class TwoStageRetriever:
    def __init__(self, first_stage, reranker: BaseReranker):
        self.first_stage = first_stage
        self.reranker = reranker

    def retrieve(self, query: str, first_k=100, final_k=10):
        # Stage 1: cheap, recall-optimized retrieval
        candidates = self.first_stage.search(query, top_k=first_k)
        # Stage 2: expensive, precision-optimized reranking
        reranked = self.reranker.rerank(query, candidates, top_k=final_k)
        return reranked
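To check the wiring without downloading a model or calling an API, the same flow can be exercised with toy stand-ins. Everything in this sketch is hypothetical: `KeywordFirstStage` and `OverlapReranker` are simple word-overlap components that only mimic the `search` and `rerank` interfaces used above, and `two_stage` mirrors the body of `TwoStageRetriever.retrieve` so the block runs on its own.

```python
from typing import Dict, List


class KeywordFirstStage:
    """Toy first stage: keeps documents sharing at least one word with the query."""

    def __init__(self, corpus: List[str]):
        self.corpus = corpus

    def search(self, query: str, top_k: int) -> List[str]:
        terms = set(query.lower().split())
        hits = [doc for doc in self.corpus if terms & set(doc.lower().split())]
        return hits[:top_k]


class OverlapReranker:
    """Toy reranker: scores each document by the number of shared query words."""

    def rerank(self, query: str, docs: List[str], top_k: int = 10) -> List[Dict]:
        terms = set(query.lower().split())
        scored = [
            {"doc": d, "score": float(len(terms & set(d.lower().split())))}
            for d in docs
        ]
        return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]


def two_stage(first_stage, reranker, query: str, first_k: int = 100, final_k: int = 10):
    """Retrieve broadly with the first stage, then rerank and truncate."""
    candidates = first_stage.search(query, top_k=first_k)
    return reranker.rerank(query, candidates, top_k=final_k)


corpus = [
    "vector quantization for retrieval",
    "retrieval augmented generation explained",
    "cooking pasta at home",
]
results = two_stage(
    KeywordFirstStage(corpus), OverlapReranker(),
    "retrieval augmented generation", final_k=2,
)
for r in results:
    print(r["score"], r["doc"])
```

Because the stand-ins follow the same method signatures, swapping in a real first stage and reranker should not require changes to the calling code.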
Reranking is one of the highest-leverage improvements you can add to a RAG pipeline. In the comparison above, the cross-encoders improve NDCG@10 by 13 to 21 points over first-stage retrieval alone (0.55 without reranking versus 0.68 to 0.76 with it), and the implementation overhead is minimal. Start with a small cross-encoder for development, benchmark latency on your hardware, and scale up to larger models or API-based solutions as quality requirements dictate.