SambaNova Systems · published Apr 22, 2026

Many-Shot Prompting: A Practical Guide to In-Context Learning at Scale

by Shubhangi Upasani, Chen Wu, Jay Rainton, Bo Li, Urmish Thakker, Changran Hu, Qizheng Zhang

April 22, 2026

TL;DR: What We Found

We ran thousands of experiments on many-shot in-context learning (ICL) across multiple benchmarks, model sizes, and prompting strategies. Here are the headline findings:

Many-shot ICL works, but only for certain tasks. Structured classification and information extraction see large, consistent gains. Open-ended generation tasks like machine translation barely move.

More examples ≠ better results after a point. Performance typically plateaus around 50–70 examples per class, then stalls or degrades as the context window saturates.

How you select examples matters more than how many you use. Cross-label similarity-based selection at low shot counts (n=1 per class) delivered our best result: 90.2% accuracy vs. a 43% zero-shot baseline.

Smaller models scale more gracefully with many-shot prompts. LLaMA-8B continued improving where 70B started to over-condition and degrade.

Reinforced ICL (chain-of-thought demos) peaks early. Just 4 reasoning traces matched or beat 32 on GPQA Diamond.

In recent years, large language models (LLMs) have undergone substantial advances in their ability to process and reason over extended context lengths. Architectural innovations and optimized attention mechanisms have expanded context windows from a few thousand tokens to hundreds of thousands, enabling models to condition on vast amounts of in-context information.

This progress has transformed the traditional few-shot in-context learning (ICL) [6] paradigm into what can now be characterized as many-shot ICL, where the number of exemplars or demonstrations provided within the prompt can scale to hundreds or even thousands.

In this post, we share empirical observations and practical insights from our exploration of many-shot ICL across a range of tasks, domains, and a few model backbones. We also explore the Reinforced ICL and Dynamic ICL regimes. We analyze when and why many-shot prompting leads to consistent performance gains, outline failure modes observed during large-context evaluations, and discuss best practices for constructing effective long-context prompts. Specifically, we show:

Variation in model accuracy under different n-shot configurations

Best performing example selection strategy for prompt construction

Success and failure cases for many-shot ICL

Scaling law for Reinforced ICL with number of examples in the prompt

Preliminaries

In-context learning

In-context learning (ICL) [1] is a simple but powerful idea: large language models (LLMs) can learn new tasks directly from examples placed in their input context, without updating any model parameters. By observing a few input–output pairs, the model infers the underlying pattern and applies it to new queries. This makes ICL a lightweight alternative to fine-tuning or pre-training when adapting models to new tasks.
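To make the idea concrete, here is a minimal sketch of how input–output pairs can be laid out in a prompt. The `Input:`/`Output:` template, the instruction text, the function name `build_icl_prompt`, and the example messages and labels are all illustrative assumptions, not the template used in our experiments.

```python
from typing import List, Tuple


def build_icl_prompt(
    demos: List[Tuple[str, str]],
    query: str,
    instruction: str = "Classify the customer message into an intent.",
) -> str:
    """Format input-output demonstrations followed by the unanswered query."""
    blocks = [instruction]
    for text, label in demos:
        blocks.append(f"Input: {text}\nOutput: {label}")
    # The final block leaves the output empty so the model completes it.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Hypothetical demonstrations, for illustration only.
demos = [
    ("My card has not arrived yet.", "card_arrival"),
    ("How do I reset my PIN?", "change_pin"),
]
prompt = build_icl_prompt(demos, "I forgot my PIN, what now?")
print(prompt)
```

No weights change here: the only thing that adapts the model to the task is the text of the prompt itself.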

As context lengths have grown and serving systems like LMCache [3] and MoonCake [4] have improved deployment efficiency, ICL has evolved into a practical way to teach models through demonstration. It also offers a unique lens for studying how LLMs generalize, compress patterns, and “learn” from context alone.

Many-shot ICL

Recent progress in long-context architectures has unlocked many-shot ICL, where we can fit hundreds or even thousands of demonstrations into a single prompt. Intuitively, the more examples we provide, the more signal the model has to learn from, and several studies [2] show that performance often improves as the number of shots increases.

However, scaling up is not always straightforward. The gains from many-shot prompting depend heavily on how examples are organized: their ordering, diversity, template format, and even the phrasing of instructions. In practice, many-shot ICL can be surprisingly brittle if prompts aren’t constructed carefully. In our work, we explore these sensitivities and share what worked (and what didn’t) when building effective many-shot prompts. We also show success and failure domains for many-shot ICL.
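As a rough illustration of two of these knobs, class balance and ordering, the sketch below draws a fixed number of demonstrations per label and then shuffles them so that no label appears as one contiguous run. The function name `sample_many_shot`, the seeding scheme, and the toy pool are assumptions made for this example; this is not necessarily the procedure behind our reported numbers.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def sample_many_shot(pool: List[Example], n_per_class: int, seed: int = 0) -> List[Example]:
    """Pick up to n_per_class examples for every label, then shuffle the order."""
    rng = random.Random(seed)  # a fixed seed keeps the prompt reproducible
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for text, label in pool:
        by_label[label].append((text, label))

    selected: List[Example] = []
    for label in sorted(by_label):  # sorted for determinism across runs
        items = by_label[label]
        selected.extend(rng.sample(items, min(n_per_class, len(items))))

    # Interleave labels instead of presenting them class by class.
    rng.shuffle(selected)
    return selected


# Toy pool: 3 hypothetical labels with 5 examples each.
pool = [(f"message {i} about {lab}", lab) for lab in ("a", "b", "c") for i in range(5)]
shots = sample_many_shot(pool, n_per_class=2, seed=42)
print(len(shots), shots[:3])
```

Changing only the seed gives a different ordering of the same kind of sample, which is a cheap way to check how sensitive a task is to example order.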

Reinforced ICL

In Reinforced ICL [2], the goal isn’t to show the model final answers but to demonstrate how to get there. Instead of giving input–output pairs, we include a few examples of chain-of-thought reasoning in the prompt. The model then imitates these reasoning patterns to arrive at its own answers.

This approach works especially well in domains like MATH or GSM8K, where the reasoning process matters more than the exact output format. Reinforced ICL can be seen as a lightweight, unsupervised way to teach reasoning strategies purely through examples in context.
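A minimal sketch of what such a prompt might look like is shown below. The `Trace` type, the `Question:`/`Reasoning:`/`Answer:` template, and the default cap of 4 traces (chosen to echo the TL;DR finding above) are assumptions for illustration, and the example traces are made up.

```python
from typing import List, NamedTuple


class Trace(NamedTuple):
    question: str
    reasoning: str
    answer: str


def build_reasoning_prompt(traces: List[Trace], query: str, max_traces: int = 4) -> str:
    """Show worked reasoning traces, then ask the model to reason about the query."""
    blocks = []
    for t in traces[:max_traces]:  # keep only the first max_traces demonstrations
        blocks.append(
            f"Question: {t.question}\nReasoning: {t.reasoning}\nAnswer: {t.answer}"
        )
    # End on an open "Reasoning:" so the model writes its own chain of thought.
    blocks.append(f"Question: {query}\nReasoning:")
    return "\n\n".join(blocks)


# Hypothetical traces, for illustration only.
traces = [
    Trace(f"What is {i} + {i}?", f"Adding {i} to itself gives {2 * i}.", str(2 * i))
    for i in range(1, 7)
]
cot_prompt = build_reasoning_prompt(traces, "What is 9 + 9?")
print(cot_prompt)
```

The difference from a plain input–output prompt is only in what each demonstration contains: the intermediate reasoning is part of the example, and the prompt ends before the reasoning for the new question.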

Dynamic ICL

Traditional ICL uses a fixed set of examples: every new query sees the same static prompt. Dynamic ICL, in contrast, builds the prompt on the fly. For each query, the model (or a retrieval system) selects the most relevant examples based on embedding similarity, heuristics, or task metadata.

By tailoring the context to each input, dynamic ICL tends to produce better results than static prompts, especially in varied or open-ended domains. Because it swaps irrelevant examples for more relevant ones, it often yields meaningful gains in consistency and accuracy in practice. In our work, we explore different selection strategies for constructing Dynamic ICL prompts.
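The sketch below shows one plausible reading of per-query selection: for each label, keep the example most similar to the query (n=1 per class). To stay self-contained it uses a bag-of-words cosine similarity as a stand-in for a real embedding model; that substitution, the function names, and the toy pool are all assumptions and not the setup used in our experiments.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_per_label(pool: List[Example], query: str, n_per_class: int = 1) -> List[Example]:
    """For every label, keep the n_per_class examples most similar to the query."""
    q = embed(query)
    by_label: Dict[str, List[Tuple[float, str]]] = {}
    for text, label in pool:
        by_label.setdefault(label, []).append((cosine(q, embed(text)), text))

    selected: List[Example] = []
    for label, scored in by_label.items():
        scored.sort(key=lambda s: s[0], reverse=True)  # most similar first
        selected.extend((text, label) for _, text in scored[:n_per_class])
    return selected


# Hypothetical pool, for illustration only.
pool = [
    ("My new card still has not arrived", "card_arrival"),
    ("When will my card be delivered", "card_arrival"),
    ("I want to change my PIN", "change_pin"),
    ("How can I reset the PIN on my card", "change_pin"),
]
picked = select_per_label(pool, "my card has not arrived yet")
print(picked)
```

Keeping at least one example for every label means the prompt still shows the model the full label space, while similarity decides which example represents each label for this particular query.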

Experiments and Results

To evaluate our hypotheses, we conducted experiments across a range of datasets drawn from two primary benchmarks: LongICLBench and EvalHarness.

LongICLBench [5] focuses on long-context learning in extreme label classification settings, covering six datasets with label spaces ranging from 28 to 174 classes and input lengths extending from roughly 2k to 50k tokens. This benchmark provides an ideal testbed for studying many-shot ICL, as it stresses both the model’s ability to leverage a large number of demonstrations and its capacity to maintain coherence over long input sequences. For experiments 1-4 described below, we used the Banking77 dataset from LongICLBench.
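With large label spaces, the context budget limits how many shots per class can fit at all. The back-of-the-envelope helper below makes that arithmetic explicit. Banking77 has 77 intent classes; the context size, the average tokens per demonstration, and the reserved budget are illustrative assumptions, not measurements from our runs.

```python
def max_shots_per_class(
    context_tokens: int,
    num_classes: int,
    tokens_per_example: int,
    reserved_tokens: int = 2000,
) -> int:
    """Upper bound on balanced shots per class that fit in the context window.

    reserved_tokens covers the instruction, the query, and the generated answer.
    """
    if num_classes <= 0 or tokens_per_example <= 0:
        raise ValueError("num_classes and tokens_per_example must be positive")
    usable = context_tokens - reserved_tokens
    if usable <= 0:
        return 0
    return usable // (num_classes * tokens_per_example)


# Assumed numbers: a 128k-token window and about 30 tokens per demonstration.
budget = max_shots_per_class(context_tokens=128_000, num_classes=77, tokens_per_example=30)
print(budget)
```

This is only a capacity bound: it says how many examples can fit, not how many are useful.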

In addition, we included selected tasks from EvalHarness, emphasizing those that require complex contextual reasoning, such as machine translation, information extraction, summarization, and question answering. Together, these datasets span a…
