Databricks (DBRX), published Jun 4, 2026

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

Today we’re announcing a major update that makes Agent Bricks Knowledge Assistant both faster and higher quality. Answer generation is now 2x faster and search is more than 3x faster, bringing Time To First Token (TTFT) to around two seconds.¹ Knowledge Assistant users will get noticeably faster answers across their use cases, with no reconfiguration required and no tradeoff in quality.

These gains are powered by Instructed-Retriever-1, a retrieval-specialized model built for parallel test-time scaling. Unlike standard agentic retrieval, where an agent works sequentially and reasons over each result before deciding its next step, our approach fans this work out in parallel. Instructed-Retriever-1 is a single model trained for both retrieval stages: query generation to increase recall and reranking to increase precision, with both stages run in parallel to keep latency low.

In this post, we describe how this approach achieves Pareto-optimal performance, how we train one model to support the full retrieval pipeline, and how we validate performance on realistic enterprise workloads.

Figure: On KARLBench, Knowledge Assistant with Instructed-Retriever-1 improves both search latency and retrieval quality.

1. Parallel Test-Time Scaling for Search

Our previous research demonstrated that quality can improve with additional test-time compute. However, most agentic search systems today spend that compute on sequential operations, like tool calls, reason-act loops, and chain-of-thought reasoning. These methods do improve search quality, but they come at the expense of substantially higher latency and cost.

For training Instructed-Retriever-1, we take a different route: rather than scaling compute sequentially, we parallelize it during the initial search phase. By broadening the range of retrieved evidence and selecting the most relevant context up front, we achieve highly effective search with significantly lower latency.

Improving the initial search depends heavily on the training harness. Our harness provides the model with user instructions and the precise schema of the underlying retrieval index, and it propagates them to all the subsequent stages of query and filter generation, reranking, and answer generation. We described how this can be achieved in our earlier Instructed Retriever blog, and we use the same search harness in training our Instructed-Retriever-1 model. This approach is especially important for enterprise questions, which often involve domain-specific constraints such as time period, organization, document type, or product area.

Parallel query and filter generation improves candidate-set recall by simultaneously exploring multiple formulations and aspects of the same request. This allows the system to search more broadly while keeping latency low.

Broader search creates an aggregation challenge. Different formulations may return overlapping or only partially relevant chunks. To select the most useful context from the merged candidate set, we use a multi-pivot groupwise reranker. Candidates are ranked in parallel groups, each anchored by one or more pivot chunks, and the group rankings are merged into a final ordering. This captures the key benefits of comparing evidence in context while keeping reranking efficient.

Together, these stages provide two test-time scaling knobs: increasing the number of query and filter formulations improves recall, while increasing the number of pivots improves precision. Because both stages can use parallelism, the system can trade additional test-time compute for higher-quality context while preserving low latency.
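To make the two stages concrete, here is a minimal Python sketch of parallel query fan-out followed by pivot-anchored groupwise reranking. It is an illustration under stated assumptions, not the production harness: the corpus, the word-overlap scorer, and every function name (`search`, `fan_out`, `rank_group`, `multi_pivot_rerank`) are hypothetical stand-ins for an index call and a model call. The merge rule in particular is one guess at how shared pivots could calibrate rankings across groups. It places each chunk by how many pivots outrank it, which is not a rule this post specifies.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy corpus standing in for a retrieval index (hypothetical data).
CORPUS = {
    "c1": "q3 2025 revenue report for the cloud division",
    "c2": "q3 2025 hiring plan for the cloud division",
    "c3": "2024 annual revenue summary",
    "c4": "cloud division product roadmap",
    "c5": "office relocation notice",
}

def overlap(text, chunk_id):
    """Toy relevance score: number of words shared by the text and the chunk."""
    return len(set(text.lower().split()) & set(CORPUS[chunk_id].split()))

def search(query, k=3):
    """Hypothetical index call: top-k chunks with at least one shared word."""
    hits = [cid for cid in CORPUS if overlap(query, cid) > 0]
    return sorted(hits, key=lambda cid: (-overlap(query, cid), cid))[:k]

def fan_out(queries, k=3):
    """Run every query formulation concurrently, then merge without duplicates."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        result_lists = list(pool.map(lambda q: search(q, k), queries))
    # dict.fromkeys keeps first-seen order while dropping repeated chunk ids.
    return list(dict.fromkeys(cid for hits in result_lists for cid in hits))

def rank_group(question, group):
    """Stand-in for one model call that orders a small group of chunks."""
    return sorted(group, key=lambda cid: (-overlap(question, cid), cid))

def multi_pivot_rerank(question, candidates, num_pivots=1, group_size=2):
    """Rank groups that share the same pivots in parallel, then merge them."""
    pivots, rest = candidates[:num_pivots], candidates[num_pivots:]
    groups = [pivots + rest[i:i + group_size]
              for i in range(0, len(rest), group_size)] or [pivots]
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        rankings = list(pool.map(lambda g: rank_group(question, g), groups))
    keys = {}
    for ranking in rankings:
        pivots_seen = 0
        for pos, cid in enumerate(ranking):
            is_pivot = cid in pivots
            # Merge key (an assumption): how many pivots outrank the chunk,
            # then non-pivots ahead of the pivot that closes the tier,
            # then the chunk's position inside its own group.
            keys.setdefault(cid, (pivots_seen, is_pivot, pos, cid))
            pivots_seen += is_pivot
    return sorted(keys, key=keys.get)

queries = ["q3 2025 revenue cloud division", "annual revenue summary"]
question = "q3 2025 revenue for the cloud division"
candidates = fan_out(queries)   # ['c1', 'c2', 'c4', 'c3']
final = multi_pivot_rerank(question, candidates, num_pivots=1)
```

In this toy run the first formulation alone returns `c1`, `c2`, `c4`. The second formulation adds `c3`, which is the recall effect of fanning out. Threads are used on the assumption that real search and model calls are I/O-bound. Raising `num_pivots` gives the merge more reference points shared across groups, which is the intuition behind the precision knob.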

Figure: The search harness used for Instructed-Retriever-1.

2. Training Instructed-Retriever-1

Parallel test-time scaling for search requires a model that can do two things well: generate effective searches and judge retrieved evidence. We trained Instructed-Retriever-1 as a single retrieval-specialized model that supports parallel query generation and reranking. The result is a model that matches Claude Sonnet 4.5 retrieval quality on KARLBench while maintaining low latency.

Figure: Retrieval quality on KARLBench after training, evaluated across reranking configurations. Instructed-Retriever-1 matches Claude Sonnet 4.5 retrieval quality. Across models, pivot-based reranking improves Recall@10 over the no-reranker setting, and two pivots further improve quality over one pivot.

To prepare the data for training, we build synthetic enterprise-style retrieval environments from a broad pretraining corpus, independently from our evaluation benchmark. We create them using the agentic data synthesis approach described in the KARL report. The resulting environments reflect the kinds of tasks Knowledge Assistant must handle, including factual lookup, summarization, recommendation, problem solving, and decision support over corpora that combine unstructured documents with structured metadata.

The model is trained in two stages to capture multiple search capabilities. The resulting model supports both query and filter generation, as well as verification-style retrieval capabilities, enabling the two stages that make parallel test-time scaling useful in practice.

3. Validating Instructed-Retriever-1 in Production

Improving retrieval only matters if it works on realistic workloads and fits within production latency constraints. We evaluate Instructed-Retriever-1 on a large-scale internal dataset representative of Knowledge Assistant usage, measuring whether the two scaling mechanisms introduced above improve retrieval quality: parallel query and filter generation for recall, and multi-pivot reranking for precision.
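For readers less familiar with the Recall@10 metric reported for KARLBench, the sketch below uses the standard textbook definition: the fraction of relevant chunks that appear among the top k results. The post does not spell out how the benchmark labels relevance or handles edge cases, so the function name, the chunk ids, and the empty-set convention are assumptions made for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant chunks that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        # Convention chosen here for questions with no labeled evidence.
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

# Hypothetical example: three supporting chunks, two of them in the top 4.
ranked = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}
score = recall_at_k(ranked, relevant, k=4)  # 2 of 3 relevant chunks found
```

Because the denominator is the full set of relevant chunks, this definition rewards retrieving several supporting pieces of evidence for one question, not just a single matching document.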

Figure: Demonstration of Knowledge Assistant powered by Instructed-Retriever-1.

Retrieval quality on realistic workloads

Our evaluation dataset is based on real-world Knowledge Assistant workloads, where useful answers often require multiple pieces of supporting evidence rather than a single ground-truth document. We evaluate retrieval in two stages. First, we measure query generation latency and quality across all candidate systems. For quality, we use LLM-judge rubric scores for specificity, breadth, and relevance. These metrics capture whether generated queries are targeted, cover the important aspects of the request, and remain useful for answering the question.

Figure:…
