AI & Technology

On-Device Re-ranking vs. Cloud-Based Re-ranking: Which Retrieval Strategy Cuts Latency for RAG in 2025

May 9 · 7 min read · AI-assisted · human-reviewed

Retrieval-Augmented Generation (RAG) pipelines have become the default architecture for grounding LLM outputs in factual data. Yet many engineering teams overlook the re-ranking stage, even though it is often the primary source of latency and cost inflation. Re-ranking is the step where an initial set of retrieved documents is re-ordered by a more expensive model so that the most relevant results reach the LLM. Without it, RAG systems return noisy context that degrades generation quality. In 2025, the critical decision is where to run that re-ranker: locally on the client device or remotely on a cloud endpoint. This article compares on-device re-ranking with cloud-based re-ranking across four dimensions: latency budgets, accuracy ceilings, operational cost, and deployment complexity. By the end, you will have a concrete framework for choosing a re-ranking strategy that fits your production constraints.

Why re-ranking became the bottleneck RAG pipelines need to solve first

Most RAG tutorials stop at retrieval with a vector database or BM25. In production, retrieving the top-k documents and feeding them directly to the generator leads to hallucinations and irrelevant answers. Re-ranking applies a cross-encoder—a transformer that jointly scores a query and a candidate document—to refine the initial list. This step typically adds 50 to 500 milliseconds per query, depending on model size and hardware. For latency-sensitive applications like customer support chat or real-time code assistants, that delay is unacceptable.
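
To make the step concrete, here is a minimal re-ranking pass using the sentence-transformers CrossEncoder class; the model name and documents are placeholders, and the first-stage retrieval is assumed to have already produced the candidates:

    from sentence_transformers import CrossEncoder

    # Load a distilled cross-encoder (placeholder model name; any
    # joint query-document scoring model with this interface works).
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, docs: list[str], top_k: int = 5) -> list[str]:
        # Score each (query, document) pair jointly; this is the
        # expensive step that adds latency per candidate.
        scores = reranker.predict([(query, doc) for doc in docs])
        # Keep only the highest-scoring documents for the generator.
        ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]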

In 2024, teams using cloud-based re-ranker APIs from Cohere or mixedbread.ai reported per-query costs of $0.002 to $0.01 for high-throughput pipelines. At 10 million queries per month, re-ranking alone can exceed $50,000 in API fees. On-device re-ranking eliminates those per-query costs but introduces new constraints: model size limits, lower accuracy ceilings, and battery drain on mobile devices. The choice is not just technical: it directly impacts the monthly cloud bill and the user experience.

On-device re-ranking: Distilled cross-encoders and ONNX runtime trade-offs

Running a re-ranker locally means trading model capacity for speed. The most common approach in 2025 uses a distilled cross-encoder such as ms-marco-MiniLM-L-6-v3 or bge-reranker-v2-minicpm, converted to ONNX or CoreML format. These models have 6 to 12 transformer layers, compared to the 22 layers of full-scale re-rankers like Cohere's rerank-english-v3.0.
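
One common conversion path looks like the following sketch, which uses Hugging Face Optimum to export a cross-encoder to ONNX (the checkpoint name and output directory are illustrative; a CoreML export would go through Apple's coremltools instead):

    # pip install optimum[onnxruntime]
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from transformers import AutoTokenizer

    model_id = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # illustrative checkpoint

    # export=True converts the PyTorch weights to an ONNX graph on load.
    model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Persist the ONNX model and tokenizer for shipping to devices.
    model.save_pretrained("./reranker-onnx")
    tokenizer.save_pretrained("./reranker-onnx")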

Latency and throughput numbers on consumer hardware

On an M2 MacBook Air, a 6-layer MiniLM re-ranker processes a single query-document pair in approximately 8 to 12 milliseconds. For a top-20 retrieval, that translates to 160 to 240 milliseconds of re-ranking latency—acceptable for many applications. On a Pixel 9 smartphone using Qualcomm's AI Engine, the same model takes 30 to 50 milliseconds per pair, meaning re-ranking 20 documents requires 600 to 1,000 milliseconds. That pushes total query latency over one second.

Accuracy ceiling and the relevance gap

Standardized benchmarks like MS MARCO passage re-ranking show that full-scale cross-encoders achieve an MRR@10 of about 0.380, while the best distilled models reach 0.360 to 0.370. In domain-specific settings—legal document retrieval or medical literature—the gap widens. Fine-tuning the distilled model on in-domain data reduces the gap to 1-2%, but requires labeled data and retraining pipelines. For general-purpose RAG, on-device accuracy is often sufficient. For enterprise compliance or diagnostic use cases, even that small loss may be unacceptable.

Cloud-based re-ranking: Full-scale models and the cost of serial batching

Cloud re-ranking APIs run models like Cohere Rerank 3 or Mixedbread Rerank, which use 22-layer or 24-layer transformers with 330 million to 500 million parameters. These models achieve state-of-the-art relevance scores but introduce variable network latency and per-query billing.
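
For reference, a hosted re-ranking call is a single batched request; this sketch uses Cohere's Python SDK, though the exact signature may differ across SDK versions, so treat it as an outline rather than canonical usage:

    import cohere

    co = cohere.Client("YOUR_API_KEY")  # placeholder key

    docs = ["...candidate one...", "...candidate two..."]  # top-k from retrieval

    # One batched call scores all query-document pairs server-side.
    response = co.rerank(
        model="rerank-english-v3.0",
        query="How do I rotate API keys?",
        documents=docs,
        top_n=2,
    )
    best = [docs[r.index] for r in response.results]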

Network overhead and tail latency

A single API call that sends 20 query-document pairs in a batch typically completes in 200-400 milliseconds on the server side. Adding network round-trip time—50 to 150 milliseconds under normal conditions—results in total latency of 250 to 550 milliseconds. The problem is tail latency: during peak hours, cloud re-ranking can spike to 1.2 seconds due to queuing and rate limiting. For synchronous user-facing RAG, those spikes degrade the experience.
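
A common mitigation is a client-side deadline with a local fallback, so that a tail-latency spike degrades relevance slightly instead of blocking the response. A sketch, assuming cloud_rerank and local_rerank are hypothetical wrappers around the cloud and on-device re-rankers from the earlier examples:

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    executor = ThreadPoolExecutor(max_workers=4)

    def rerank_with_deadline(query, docs, deadline_s=0.6):
        # Fire the cloud call, but refuse to wait past the deadline.
        future = executor.submit(cloud_rerank, query, docs)
        try:
            return future.result(timeout=deadline_s)
        except TimeoutError:
            # Tail-latency spike: fall back to the local distilled model.
            return local_rerank(query, docs)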

Cost modeling for high-throughput systems

Using Cohere's hosted re-ranker at $0.001 per query (pricing as of Q1 2025) with 10 million monthly queries results in $10,000 per month. For a startup scaling to 50 million queries, the cost reaches $50,000 monthly. At that scale, on-device re-ranking becomes financially attractive even if it requires building custom inference pipelines.
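
The break-even arithmetic is simple enough to keep in a few lines of code; the unit price below is the article's working assumption, not a quoted rate:

    def monthly_rerank_cost(queries_per_month: int,
                            price_per_query: float = 0.001) -> float:
        # The cloud bill scales linearly with query volume.
        return queries_per_month * price_per_query

    print(monthly_rerank_cost(10_000_000))  # 10000.0 -> $10,000/month
    print(monthly_rerank_cost(50_000_000))  # 50000.0 -> $50,000/month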

Hybrid re-ranking: The pragmatic middle ground for production RAG

Several production systems in 2025 adopt a hybrid approach: use a lightweight on-device re-ranker as a first pass to reduce the candidate set from 100 to 10, then send only those 10 to a cloud re-ranker for final scoring. This cuts cloud API calls by 90% while maintaining near-perfect relevance.

Implementation sketch

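A minimal sketch of the cascade, assuming the local CrossEncoder and the hosted Cohere client from the earlier examples (both are stand-ins for whatever clients you actually run):

    def hybrid_rerank(query: str, candidates: list[str]) -> list[str]:
        # Pass 1 (on-device): the cheap distilled model prunes 100 -> 10.
        scores = reranker.predict([(query, d) for d in candidates])
        pruned = [d for d, _ in sorted(zip(candidates, scores),
                                       key=lambda p: p[1], reverse=True)[:10]]

        # Pass 2 (cloud): the full-scale model does the final ordering
        # on only 10 documents, cutting API volume by roughly 90%.
        response = co.rerank(model="rerank-english-v3.0", query=query,
                             documents=pruned, top_n=5)
        return [pruned[r.index] for r in response.results]
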
This pattern reduces cloud re-ranker usage by 90% while adding only a few hundred milliseconds of on-device compute. The total latency remains under 1 second for most devices, and the cloud bill drops proportionally.

When on-device re-ranking is the only viable option

Three scenarios force teams to keep re-ranking fully on-device: air-gapped deployments, real-time voice assistants, and high-frequency trading applications. In air-gapped environments such as military or nuclear research facilities, no data can leave the local network, so an on-device distilled re-ranker is the only compliant option. For voice assistants, sub-200-millisecond response times are required to avoid awkward pauses, and a cloud round-trip alone can consume that budget. High-frequency trading systems face even tighter deadlines, where any network round-trip is off the table. A model quantized to INT8 and running on a dedicated NPU can keep re-ranking below 50 milliseconds total.
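
For the quantization step, ONNX Runtime's dynamic quantization converts the exported model's weights to INT8 in a single call; the paths below are placeholders, and delegating execution to an NPU is a separate, vendor-specific step:

    from onnxruntime.quantization import quantize_dynamic, QuantType

    quantize_dynamic(
        model_input="./reranker-onnx/model.onnx",       # FP32 export from earlier
        model_output="./reranker-onnx/model.int8.onnx",
        weight_type=QuantType.QInt8,                    # 8-bit integer weights
    )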

When cloud re-ranking justifies its cost

Applications where a single incorrect retrieved document causes significant harm—clinical decision support, legal discovery, financial auditing—benefit from the highest possible re-ranking accuracy. In these domains, the cost of an error (misdiagnosis, missing a relevant precedent, failing to flag a compliance violation) far exceeds the API fees. Cloud re-ranking with a full-scale model also simplifies deployment: no need to ship model binaries for multiple device architectures, no concerns about memory pressure on phones, and no retraining for new domain data. The team can focus on prompt engineering and evaluation rather than model optimization.

Benchmarking your own latency and accuracy requirements

Before choosing a strategy, run a controlled experiment with your own data. Take 1,000 real user queries and the top-20 retrieved documents for each, then measure three things for both an on-device distilled re-ranker and a cloud re-ranker: re-ranking latency (median and tail), a relevance metric such as MRR@10, and the effective cost per query.
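
A minimal harness for the latency half of that experiment might look like this, where reranker is any of the wrapper functions sketched above (relevance and cost need your own judgments and pricing):

    import time
    import statistics

    def latency_profile(reranker, queries, docs_by_query):
        samples = []
        for q in queries:
            start = time.perf_counter()
            reranker(q, docs_by_query[q])
            samples.append(time.perf_counter() - start)
        samples.sort()
        return {
            "p50_ms": statistics.median(samples) * 1000,
            "p95_ms": samples[int(0.95 * len(samples))] * 1000,
        }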

Use tools like RAGAS or DeepEval to measure end-to-end answer faithfulness, not just retrieval metrics. In one internal benchmark at a fintech company, the on-device re-ranker achieved 96% of the cloud re-ranker's faithfulness score while reducing infrastructure costs by 85%. For their moderate-latency loan application assistant, that was the winning trade-off.

The 2025 landscape shifts that could change your decision

Two trends are reshaping this comparison. First, Apple's CoreML and Google's AI Edge APIs now support on-device cross-encoders with hardware acceleration, closing the latency gap. By mid-2025, even budget Android phones can run 6-layer re-rankers in under 200 milliseconds for 10 documents. Second, cloud re-ranking providers are offering tiered pricing based on batch size and latency SLAs, making the cost predictable for high-volume users. If your workload is latency-tolerant but volume-heavy, a fixed-price cloud plan may beat the engineering cost of maintaining on-device inference.

Start by profiling your retrieval top-k size. If you can reduce it to 10 or fewer documents with a high-quality first-stage retriever (using a bi-encoder or learned sparse retrieval), re-ranking becomes cheap enough to run on-device even without distillation. Teams that invest in retrieval quality often find they can skip re-ranking entirely for a large fraction of queries, as the gating sketch below illustrates. That is the ultimate win: no re-ranking latency, no re-ranking cost, and no distillation-induced accuracy loss on the queries that matter most.
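
One way to operationalize that skip is a confidence gate on first-stage scores; the margin threshold here is a hypothetical tuning knob you would calibrate on your own traffic, not a published recipe:

    def needs_reranking(retrieval_scores: list[float], margin: float = 0.15) -> bool:
        # If the top hit already dominates the runner-up by a wide
        # margin, re-ranking is unlikely to change the final order.
        top, runner_up = sorted(retrieval_scores, reverse=True)[:2]
        return (top - runner_up) < margin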

About this article. This piece was drafted with the help of an AI writing assistant and reviewed by a human editor for accuracy and clarity before publication. It is general information only — not professional medical, financial, legal or engineering advice. Spotted an error? Tell us. Read more about how we work and our editorial disclaimer.
