
Reranking: Cross-Encoders, ColBERT, Cohere

First-stage retrieval (dense, sparse, or hybrid) casts a wide net to find candidate documents. But the top-k from this initial pass often contains marginal or irrelevant results. Reranking applies a more powerful (and more expensive) model to reorder these candidates, often with a large gain in precision at the top of the list. This post covers the three dominant reranking approaches and how to deploy them in production RAG pipelines.

Two-Stage Retrieval Architecture

The two-stage paradigm separates retrieval into a recall-optimized first stage and a precision-optimized second stage. The first stage retrieves a large candidate set (e.g., top-100) cheaply. The reranker then scores each candidate against the query with full cross-attention, promoting the most relevant documents to the top positions.

[Figure: Query (user input) → Stage 1: bi-encoder / BM25, retrieve top-100 (~15 ms latency) → Stage 2: cross-encoder, rerank top-100 → top-10 (~150 ms latency) → top 10. Total pipeline: query → 100 candidates → 10 reranked results → LLM context.]

This architecture is widely used in production search systems. The key design decision is how many candidates to pass to the reranker. More candidates means higher potential recall but also higher reranking latency. Typical values are 50–200.
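One way to pick the candidate count is to measure first-stage recall at several cutoffs on a small labeled query set and stop increasing the count once recall flattens out. The sketch below assumes you already have, per query, the first-stage ranking as a list of document ids and a set of known-relevant ids; the names `runs`, `qrels`, and the toy data are made up for illustration.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def recall_sweep(
    runs: List[Sequence[str]],
    qrels: List[Set[str]],
    cutoffs: Sequence[int] = (10, 50, 100, 200),
) -> Dict[int, float]:
    """Mean first-stage recall@k over all queries, for each candidate cutoff."""
    return {
        k: sum(recall_at_k(run, rel, k) for run, rel in zip(runs, qrels)) / len(runs)
        for k in cutoffs
    }


# Toy data (hypothetical): two queries, first-stage rankings given as doc ids.
runs = [["d1", "d7", "d3", "d9"], ["d4", "d2", "d8", "d5"]]
qrels = [{"d3", "d9"}, {"d4", "d6"}]
sweep = recall_sweep(runs, qrels, cutoffs=(1, 2, 4))
print(sweep)
```

A reranker can only reorder what the first stage returns, so the recall at your chosen cutoff is an upper bound on what the second stage can surface.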

Cross-Encoders

A cross-encoder processes the query and document jointly through a Transformer, allowing full cross-attention between query and document tokens. This produces far more accurate relevance scores than bi-encoders (which encode query and document independently), but it costs one full Transformer forward pass per candidate: O(n) model inferences per query, where n is the number of candidates, and none of them can be precomputed.

The key difference: bi-encoders produce independent embeddings that are compared with a dot product; cross-encoders produce a single relevance score from the concatenated [query, SEP, document] input.

Bi-Encoder (Stage 1)

Query and document encoded independently. Score = dot product. Can precompute document embeddings. Scales to millions of docs.

Cross-Encoder (Stage 2)

Query and document encoded jointly. Full cross-attention. Must run for each (query, doc) pair. Limited to ~100-500 candidates.
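To make the bi-encoder side of this comparison concrete, here is a minimal NumPy sketch of first-stage scoring. The embeddings are random stand-ins (a real system would get them from a bi-encoder model), and the helper names `normalize` and `bi_encoder_top_k` are illustrative. The point it shows is that document embeddings are computed once, offline, and each query costs a single matrix-vector product.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def bi_encoder_top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k documents with the highest dot-product score."""
    scores = doc_embs @ query_emb  # one matrix-vector product covers every document
    return np.argsort(scores)[::-1][:k]


# Stand-in embeddings: 1000 documents, 64 dimensions, computed once and stored.
rng = np.random.default_rng(0)
doc_embs = normalize(rng.normal(size=(1000, 64)))

# Build the query as a slightly noisy copy of document 42 so the best match is known.
query_emb = normalize(doc_embs[42] + 0.05 * rng.normal(size=64))

top = bi_encoder_top_k(query_emb, doc_embs, k=5)
print(top)
```

A cross-encoder has no equivalent of `doc_embs`: nothing about a document can be computed before the query arrives, which is why it is reserved for the short candidate list.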

    # Cross-encoder reranking with sentence-transformers
    from sentence_transformers import CrossEncoder
    import numpy as np

    # Load a cross-encoder fine-tuned for reranking
    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')


    def rerank(query: str, documents: list, top_k: int = 10):
        """Rerank documents using a cross-encoder."""
        pairs = [[query, doc] for doc in documents]
        scores = reranker.predict(pairs, batch_size=32)
        ranked_indices = np.argsort(scores)[::-1][:top_k]
        return [
            {"document": documents[i], "score": float(scores[i])}
            for i in ranked_indices
        ]


    # Example usage
    query = "What is retrieval-augmented generation?"
    candidates = [
        "RAG combines retrieval with language model generation...",
        "Product quantization compresses high-dimensional vectors...",
        "Retrieval-augmented models access external knowledge...",
    ]
    results = rerank(query, candidates)
    for r in results:
        print(f"[{r['score']:.3f}] {r['document'][:60]}...")

Popular cross-encoder models, ordered by quality/speed tradeoff:

ColBERT: Late Interaction

ColBERT (Contextualized Late Interaction over BERT) is a middle ground between bi-encoders and cross-encoders. It encodes query and document independently into per-token embeddings, then computes relevance via a MaxSim operation: for each query token, find the maximum cosine similarity to any document token, then sum these maxima.

$$\mathrm{score}(q, d) = \sum_{i} \max_{j} \, \mathrm{sim}(q_i, d_j)$$

This late interaction preserves the ability to precompute document token embeddings (like a bi-encoder) while capturing fine-grained token-level matching (like a cross-encoder). ColBERT achieves ~95% of cross-encoder quality at ~10× the speed.

    # ColBERT-style late interaction scoring
    import numpy as np


    def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
        """Compute ColBERT MaxSim between query and document token embeddings.

        Args:
            query_embs: (num_query_tokens, dim) normalized embeddings
            doc_embs: (num_doc_tokens, dim) normalized embeddings

        Returns:
            MaxSim relevance score.
        """
        # Similarity matrix: (num_query_tokens, num_doc_tokens)
        sim_matrix = np.dot(query_embs, doc_embs.T)
        # For each query token, take max similarity across doc tokens
        max_sims = sim_matrix.max(axis=1)
        return float(max_sims.sum())


    # Using RAGatouille for easy ColBERT v2 usage
    from ragatouille import RAGPretrainedModel

    # `documents` is the corpus to index: a list of passage strings
    documents = [
        "RAG combines retrieval with language model generation...",
        "Product quantization compresses high-dimensional vectors...",
        "Retrieval-augmented models access external knowledge...",
    ]

    rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
    rag.index(
        collection=documents,
        index_name="my_index",
        max_document_length=256,
        split_documents=True,
    )
    results = rag.search(query="What is RAG?", k=10)

ColBERT tradeoff: ColBERT stores per-token embeddings, which requires more storage than single-vector bi-encoders (roughly 50–100× more per document). This is the price for late interaction. For large corpora, combine ColBERT with a coarse first-stage retriever.
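A back-of-the-envelope calculation shows where a multiplier of that size can come from. Every number below is an assumption chosen for illustration: a 384-dimensional single-vector encoder, 128-dimensional per-token embeddings, documents that fill all 256 tokens, and uncompressed float32 values. Real indexes typically apply compression, so treat this as a rough upper bound rather than a measurement.

```python
def index_size_bytes(num_docs: int, vectors_per_doc: int, dim: int, bytes_per_value: int) -> int:
    """Raw embedding storage, ignoring index overhead and any compression."""
    return num_docs * vectors_per_doc * dim * bytes_per_value


# Assumed figures, for illustration only.
NUM_DOCS = 1_000_000

# Single-vector bi-encoder: one 384-dim float32 vector per document.
single_vector = index_size_bytes(NUM_DOCS, vectors_per_doc=1, dim=384, bytes_per_value=4)

# Late interaction: one 128-dim float32 vector per token, 256 tokens per document.
multi_vector = index_size_bytes(NUM_DOCS, vectors_per_doc=256, dim=128, bytes_per_value=4)

ratio = multi_vector / single_vector
print(f"single-vector: {single_vector / 1e9:.2f} GB")
print(f"per-token:     {multi_vector / 1e9:.2f} GB ({ratio:.0f}x)")
```

Shorter documents or a wider single-vector model shrink the ratio, so it is worth plugging in your own token counts and dimensions before sizing storage.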

Cohere Rerank API

For teams that prefer a managed solution, Cohere's Rerank API provides state-of-the-art reranking as a service. It accepts a query and a list of documents, returning relevance scores. The model is a large cross-encoder trained on extensive relevance data.

    # Cohere Rerank API integration
    import cohere

    co = cohere.Client("YOUR_API_KEY")


    def cohere_rerank(query: str, documents: list, top_k: int = 10):
        """Rerank using Cohere's hosted cross-encoder."""
        response = co.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=documents,
            top_n=top_k,
            return_documents=True,
        )
        return [
            {
                "index": r.index,
                "score": r.relevance_score,
                "text": r.document.text,
            }
            for r in response.results
        ]


    # Usage in a RAG pipeline.
    # `first_stage_retrieve` and `llm` are placeholders for your own retriever and LLM client.
    query = "What is retrieval-augmented generation?"
    candidates = first_stage_retrieve(query, top_k=100)
    reranked = cohere_rerank(query, candidates, top_k=10)
    context = "\n\n".join([r["text"] for r in reranked])
    answer = llm.generate(prompt=f"Context:\n{context}\n\nQuestion: {query}")

Cohere Rerank offers three model tiers:

Latency Tradeoffs

Reranking adds latency to your pipeline. The total retrieval time is: T = T_first_stage + T_rerank(n), where n is the number of candidates. Understanding the latency profile of each reranker is critical for meeting SLA requirements.

| Reranker                | Latency, 100 docs (ms) | NDCG@10 | Hosting     |
|-------------------------|------------------------|---------|-------------|
| MiniLM-L-6 (cross-enc.) | 45                     | 0.68    | Self-hosted |
| BGE-reranker-large      | 130                    | 0.74    | Self-hosted |
| ColBERT v2              | 25                     | 0.71    | Self-hosted |
| Cohere rerank-v3        | 180                    | 0.76    | API         |
| No reranking            | 0                      | 0.55    | -           |

Batching matters: Cross-encoder latency scales linearly with candidate count. If your SLA is 200 ms total and the first stage takes 20 ms, you have 180 ms left for reranking. At ~1.3 ms per candidate (BGE-reranker-large on GPU, per the table above), that is 180 / 1.3 ≈ 138 candidates at most. Profile on your hardware.
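The budget arithmetic above is easy to wrap in a helper so you can rerun it with your own measurements. This is a sketch under the same simplifying assumption as the text: reranking cost is purely linear in the candidate count, with no fixed overhead for batching or network round trips. The function name `max_candidates` is illustrative.

```python
import math


def max_candidates(sla_ms: float, first_stage_ms: float, per_candidate_ms: float) -> int:
    """Largest candidate count whose reranking time fits in the remaining budget."""
    budget_ms = sla_ms - first_stage_ms
    if budget_ms <= 0 or per_candidate_ms <= 0:
        return 0
    return math.floor(budget_ms / per_candidate_ms)


# Figures from the text: 200 ms SLA, 20 ms first stage, ~1.3 ms per candidate.
n = max_candidates(sla_ms=200, first_stage_ms=20, per_candidate_ms=1.3)
print(n)
```

In practice you would replace `per_candidate_ms` with a tail-latency figure (for example p95) measured on your own hardware, since averages hide the slow requests that break an SLA.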

Production Pipeline Implementation

Here is a complete two-stage retrieval pipeline with configurable reranking backend:

    from abc import ABC, abstractmethod
    from typing import List, Dict
    import numpy as np


    class BaseReranker(ABC):
        @abstractmethod
        def rerank(self, query: str, docs: List[str], top_k: int) -> List[Dict]:
            ...


    class CrossEncoderReranker(BaseReranker):
        def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
            from sentence_transformers import CrossEncoder
            self.model = CrossEncoder(model_name)

        def rerank(self, query, docs, top_k=10):
            pairs = [[query, d] for d in docs]
            scores = self.model.predict(pairs, batch_size=32)
            ranked = np.argsort(scores)[::-1][:top_k]
            return [{"doc": docs[i], "score": float(scores[i])} for i in ranked]


    class CohereReranker(BaseReranker):
        def __init__(self, api_key, model="rerank-english-v3.0"):
            import cohere
            self.client = cohere.Client(api_key)
            self.model = model

        def rerank(self, query, docs, top_k=10):
            resp = self.client.rerank(
                model=self.model, query=query, documents=docs, top_n=top_k
            )
            return [{"doc": docs[r.index], "score": r.relevance_score}
                    for r in resp.results]


    class TwoStageRetriever:
        def __init__(self, first_stage, reranker: BaseReranker):
            self.first_stage = first_stage
            self.reranker = reranker

        def retrieve(self, query: str, first_k=100, final_k=10):
            # Stage 1: cheap, recall-optimized retrieval
            candidates = self.first_stage.search(query, top_k=first_k)
            # Stage 2: expensive, precision-optimized reranking
            reranked = self.reranker.rerank(query, candidates, top_k=final_k)
            return reranked

Reranking is one of the highest-leverage improvements you can add to a RAG pipeline. In the comparison table above, the rerankers raise NDCG@10 from 0.55 to between 0.68 and 0.76, a gain of 13 to 21 points over retrieval without reranking, and the implementation overhead is minimal. Start with a small cross-encoder for development, benchmark latency on your hardware, and scale up to larger models or API-based solutions as quality requirements dictate.
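To check the gain on your own data, you need NDCG@k for the ranking before and after reranking. The sketch below uses the linear-gain variant of DCG and normalizes against the ideal ordering of the retrieved list only, which is a common simplification (a stricter version would build the ideal ranking from every relevant document in the corpus). The relevance labels are made up for illustration.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels (1 = relevant, 0 = not), listed in ranked order.
before = [0, 1, 0, 0, 1]  # first-stage order
after = [1, 1, 0, 0, 0]   # the same documents after reranking

print(f"NDCG@10 before: {ndcg_at_k(before):.3f}")
print(f"NDCG@10 after:  {ndcg_at_k(after):.3f}")
```

Averaging this over a few hundred labeled queries, once per reranker, gives you a quality column comparable to the one in the latency table.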