How to choose the right open model for production
Published 1/8/2026
Authors
Nicholas Broad, Dan Waters
How do you choose the right open model for your workload? With over 2 million open models on Hugging Face, it’s hard to know which one is best for any specific task. There are countless leaderboards and benchmarks comparing dozens of models, and noteworthy new releases practically every week. How can anyone possibly know which model they should use, or how to even begin the selection process?

Why choose an open model?

Why choose an open model for your workload? There are many reasons why choosing an open vs. closed model makes sense for enterprise tasks, including transparency, adaptability, and control.

Open models are transparent because details of their weights, training data, and architecture are known, making them fitting candidates for introspection and analysis into their decision making. Transparency helps identify the sources of issues like overfitting and bias, which can help organizations gradually increase confidence in AI decision making over time if choosing to adapt the model.

Open models are adaptable because fine-tuning techniques from the research community can be applied. Proprietary adaptation methods for closed models may not provide the same level of customization or alignment as a collection of open post-training approaches, which might include supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL).

Finally, open models provide an additional level of control to organizations. Open models can be run anywhere and are not tied to any proprietary architecture or stack. This enables your team to innovate upon open community research, maintaining full ownership and auditability over the model artifacts created along the way. When you own your AI, you’re more invested in building for the future needs of your organization.

Legal considerations for open models

Some open models have strict licensing requirements that disqualify them from commercial or production use.
For some companies, this means only using models with Apache-2.0 or MIT licenses. For others, it could mean only using models made in the U.S. or France. Be sure to consult with your legal team before starting your model selection process, because this will limit the field of candidate models to consider. There is a table in the next section with relevant information about licenses and region of origin. The Llama license is the most restrictive, but all are generally permissive for commercial purposes.

Closed vs. open models: Task comparison

It’s common to begin the model selection process with experience using proprietary models like GPT-5, Claude Opus, or Gemini 3. These providers typically offer three tiers, trading off speed and cost versus capability (e.g. GPT-5 pro, GPT-5 mini, GPT-5 nano). We will refer to these three tiers as low, medium, and high, and suggest similarly capable open models to try for each. The low tier is the cheapest, fastest, and least capable, whereas the high tier is the most expensive, slowest, and most capable.

Model size & capabilities

Here are some rough guidelines to help narrow down your model selection by parameter size based on current performance of closed-model systems. If the task requires the closed model to be from the high tier, the open model equivalent should be at least 300B total parameters. For the medium tier, the open model should be between 70B and 250B total parameters. For the low tier, the model should likely be under 32B total parameters. After fine-tuning, there are instances where an even smaller model can be used, but this is not always the case. These are simple guidelines to use as a starting point, and proper evaluation is still required.
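The size guidelines above can be written down as a small lookup, which is handy when shortlisting candidates programmatically. This is only a sketch of the rough thresholds stated in the text; the function name `suggested_tier` is hypothetical, and sizes that fall between the stated ranges are deliberately reported as uncovered rather than guessed.

```python
from typing import Optional


def suggested_tier(total_params_b: float) -> Optional[str]:
    """Map a model's total parameter count (in billions) to the
    closed-model tier it is a rough candidate to replace.

    Thresholds follow the guidelines in the text:
      high   -> at least 300B total parameters
      medium -> between 70B and 250B total parameters
      low    -> under 32B total parameters
    Returns None for sizes the guidelines do not cover.
    """
    if total_params_b >= 300:
        return "high"
    if 70 <= total_params_b <= 250:
        return "medium"
    if total_params_b < 32:
        return "low"
    return None  # falls in a gap between the stated ranges


if __name__ == "__main__":
    for size in (3, 27, 80, 480, 50):
        print(f"{size}B -> {suggested_tier(size)}")
```

A result of `None` is a cue to evaluate the model against both neighboring tiers instead of assuming one.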
The following model families are recommended for general-purpose evaluation versus proprietary models:

| Name | Country of Origin | Available Sizes | License |
| --- | --- | --- | --- |
| Deepseek | China | 650B | MIT |
| Kimi | China | 1T | Modified MIT |
| Qwen | China | 1B – 480B | Apache-2.0 |
| GLM | China | 100B – 350B | MIT |
| Meta Llama | U.S. | 7B – 400B | Llama |
| Google Gemma | U.S. | 1B – 27B | Gemma |
| Mistral | France | 3B – 675B | Apache-2.0 |
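Once legal has defined which licenses and regions of origin are acceptable, the table above can be filtered mechanically. The sketch below encodes the table rows as data and applies an allow-list; the `ModelFamily` class and `allowed_families` function are hypothetical names for illustration, and the license strings are matched exactly as they appear in the table (so "Modified MIT" is not treated as "MIT").

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class ModelFamily:
    name: str
    country: str
    sizes: str
    license: str


# Rows copied from the table above.
FAMILIES = [
    ModelFamily("Deepseek", "China", "650B", "MIT"),
    ModelFamily("Kimi", "China", "1T", "Modified MIT"),
    ModelFamily("Qwen", "China", "1B – 480B", "Apache-2.0"),
    ModelFamily("GLM", "China", "100B – 350B", "MIT"),
    ModelFamily("Meta Llama", "U.S.", "7B – 400B", "Llama"),
    ModelFamily("Google Gemma", "U.S.", "1B – 27B", "Gemma"),
    ModelFamily("Mistral", "France", "3B – 675B", "Apache-2.0"),
]


def allowed_families(
    families: Iterable[ModelFamily],
    allowed_licenses: Optional[Set[str]] = None,
    allowed_countries: Optional[Set[str]] = None,
) -> List[str]:
    """Return names of families that pass both allow-lists.

    A value of None for either allow-list means "no restriction".
    """
    result = []
    for family in families:
        if allowed_licenses is not None and family.license not in allowed_licenses:
            continue
        if allowed_countries is not None and family.country not in allowed_countries:
            continue
        result.append(family.name)
    return result


if __name__ == "__main__":
    print(allowed_families(FAMILIES, allowed_licenses={"Apache-2.0", "MIT"}))
    print(allowed_families(FAMILIES, allowed_countries={"U.S.", "France"}))
```

Exact string matching is a deliberately conservative choice here: any license variant your legal team has not explicitly approved is excluded by default.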
Plotting model quality versus parameter size gives us a clear view into the performance of each model family at its various available parameter sizes.
Tradeoffs to consider during model selection

When choosing a model, there are three main dimensions: cost, speed, and quality. Cost and quality are directly related, and are determined primarily by the size of the model. The larger the model, the higher the expected quality of its output, and the more expensive it is to run. Speed generally trades off against quality: the higher the quality, the lower the speed. For instance, here are three different potential configurations for the same task:

Deepseek-v3.2: running on 8x GB200s (most costly, $$$); max load 2 RPS; 99% accuracy

Qwen3-Next-80B-A3B-Instruct: running on 4x H200s (good price-performance mix, $$); max load 10 RPS; 96% accuracy

Ministral-3-3B-Instruct-2512: running on 2x H100s (fastest and cheapest, but lower accuracy, $); max load 20 RPS; 94% accuracy
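To make the tradeoff concrete, the three configurations above can be treated as data and filtered against explicit requirements. The following is a minimal sketch, not a sizing tool: the `Config` class and `cheapest_config` function are hypothetical, the cost tier is just the number of dollar signs from the list, and it assumes a single deployment must absorb peak load (in practice a configuration could be replicated to serve more requests per second).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    model: str
    hardware: str
    cost_tier: int   # 1 = $, 2 = $$, 3 = $$$ (from the list above)
    max_rps: float   # maximum sustained requests per second
    accuracy: float  # task accuracy as a fraction


# Figures copied from the three example configurations above.
CONFIGS = [
    Config("Deepseek-v3.2", "8x GB200", 3, 2, 0.99),
    Config("Qwen3-Next-80B-A3B-Instruct", "4x H200", 2, 10, 0.96),
    Config("Ministral-3-3B-Instruct-2512", "2x H100", 1, 20, 0.94),
]


def cheapest_config(
    configs: Iterable[Config], min_accuracy: float, peak_rps: float
) -> Optional[Config]:
    """Return the lowest-cost configuration that meets both the
    accuracy floor and the throughput requirement, or None."""
    candidates = [
        c for c in configs
        if c.accuracy >= min_accuracy and c.max_rps >= peak_rps
    ]
    return min(candidates, key=lambda c: c.cost_tier, default=None)


if __name__ == "__main__":
    choice = cheapest_config(CONFIGS, min_accuracy=0.95, peak_rps=5)
    print(choice.model if choice else "no single configuration qualifies")
```

When the function returns `None`, the requirements conflict for a single deployment, which is a signal to relax one of them, replicate a smaller configuration, or consider fine-tuning.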
The best option for you depends on your requirements for quality, cost margins, and throughput. It’s worth taking the time to thoroughly define these requirements; they will be instrumental in guiding your selection.

Evaluating the model

While a given model might have published academic benchmarks, both your data and task are likely to be unique. Success in AI initiatives requires disciplined experimentation and a clear understanding of the metrics you’ve chosen to represent performance. Techniques like LLM-as-a-judge evaluations enable a better approximation of performance on famously difficult-to-evaluate LLM tasks, like numerical scoring, classification, and open-ended comparison. Other tasks, like reranking, can be evaluated with…
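As an illustration of the LLM-as-a-judge idea mentioned above, here is a minimal pass/fail (classification-style) evaluation loop. It is a sketch under stated assumptions: `generate` and `judge` are hypothetical callables you would back with real model calls, and the toy stand-ins below exist only so the example runs without any API access.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interfaces: in practice each would wrap a model API call.
Generate = Callable[[str], str]          # prompt -> candidate answer
Judge = Callable[[str, str, str], bool]  # (prompt, answer, reference) -> pass?


def judged_accuracy(
    examples: Iterable[Tuple[str, str]], generate: Generate, judge: Judge
) -> float:
    """Fraction of (prompt, reference) examples whose generated answer
    the judge marks as passing. Returns 0.0 for an empty dataset."""
    total = 0
    passed = 0
    for prompt, reference in examples:
        answer = generate(prompt)
        if judge(prompt, answer, reference):
            passed += 1
        total += 1
    return passed / total if total else 0.0


# Toy stand-ins so the sketch is runnable offline.
def toy_generate(prompt: str) -> str:
    return prompt.upper()


def toy_judge(prompt: str, answer: str, reference: str) -> bool:
    # A real judge would prompt an LLM with a rubric; this one only
    # does a case-insensitive string comparison.
    return answer.strip().lower() == reference.strip().lower()


if __name__ == "__main__":
    data = [("paris", "Paris"), ("rome", "Milan")]
    print(judged_accuracy(data, toy_generate, toy_judge))
```

Keeping the judge behind a plain callable makes it straightforward to run the same dataset against each candidate model and compare scores on your own task rather than on published benchmarks.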