WritingTogether AITogether AIpublished Aug 15, 2025seen 5d

Fine-Tuning Small Open-Source LLMs to Outperform Large Closed-Source Models by 60% on Specialized Tasks

Open original ↗

Captured source

source ↗

Fine-Tuning Small Open-Source LLMs to Outperform Large Closed-Source Models by 60% on Specialized Tasks

⚡️ FlashAttention-4: up to 1.3× faster than cuDNN on NVIDIA Blackwell →

Introducing Together AI's new look →

🔎 ATLAS: runtime-learning accelerators delivering up to 4x faster LLM inference →

⚡ Together GPU Clusters: self-service NVIDIA GPUs, now generally available →

📦 Batch Inference API: Process billions of tokens at 50% lower cost for most models →

🪛 Fine-Tuning Platform Upgrades: Larger Models, Longer Contexts →

All blog posts

Fine-Tuning

Published 8/15/2025

Fine-Tuning Small Open-Source LLMs to Outperform Large Closed-Source Models by 60% on Specialized Tasks

Authors

Charles O'Neill, Mudith Jayasekara, David Nugent, James Zou

Table of contents

40+ Models Chosen for Production...40+ Models Chosen for Production...40+ Models Chosen for Production...

Links in this article

Learn about Parsed Together Fine-Tuning Direct Preference Optimization Continued Fine-Tuning

TL;DR: Parsed, using Together AI’s fine-tuning platform, shows how small open-source models—when paired with rigorous evaluation and task-specific optimization—can outperform the largest proprietary reasoning models on complex real-world tasks like healthcare scribing. Their approach delivers 60% better accuracy , 10–100x lower inference cost , and greater transparency and reliability. This partnership highlights Parsed’s deep evaluation expertise and Together AI’s low-effort, production-ready fine-tuning stack. Introduction ​​The conventional wisdom that open-source LLMs represent a fundamental performance-capability tradeoff stems from outdated assumptions. Early LLM comparisons showed proprietary models dramatically outperforming open alternatives. However, Chinchilla scaling laws revealed that compute-optimal training requires balanced parameter-to-token ratios rather than pure parameter scaling, demonstrating diminishing returns beyond 70B parameters for most tasks. Further, when the distribution of the task we require a model to generate for is sufficiently constrained, a much smaller model optimized specifically for that distribution can outperform bigger models many times its size. The research team at Parsed builds programmatic, domain-aligned evaluation systems that decompose tasks into granular checks, reflect expert judgment, and integrate directly into model training and deployment pipelines. Parsed’s evaluation-first methodology––designing and implementing an evaluation framework before developing the model or adaptation strategy––can save 50-80% in inference costs compared to the big labs while providing higher quality outputs for the customer. For some of their customers this represents a saving of millions of dollars a year. The Parsed platform uses this evaluation system to drive continual reinforcement learning, something only possible with open weight models due to full parameter access and algorithmic flexibility. Parsed’s team also uses mechanistic interpretability techniques to attribute every output token back to the input token, providing first-principles explanations of model behavior. This makes Parsed’s models not just cheaper and higher-performing, but also transparent, auditable, and self-improving over time. Open Models and Task-Specific Inference Task-specific fine-tuning exploits the fundamental principle that specialized models require lower entropy output distributions than general-purpose models. When a 27B parameter model is fine-tuned for a specific task, it can dedicate its full representational capacity to a narrower probability space, effectively increasing its bits-per-parameter efficiency by 2-3 orders of magnitude compared to general inference. This specialization enables aggressive optimization strategies unavailable to general models, such as domain-specific vocabulary optimization. The compound effect is dramatic. As an example, Parsed’s fine-tuned a Gemma 3 27B model to achieve 60% better performance than Claude Sonnet 4 while requiring 10x-100x less compute per inference on a healthcare use case (explained below). Parsed’s experience at Together.ai suggests this performance isn't anomalous; medical, legal and scientific domain-specific fine-tuning consistently shows 40-100% improvements over base models. The technical prerequisites for successfully challenging the open source performance tradeoff center on three critical factors: high-quality task-specific data curation, rigorous evaluation frameworks, and iterative optimization cycles. Successful fine-tuning requires tens of thousands of high-quality examples with consistent formatting and comprehensive coverage of the task distribution. The technical stack, provided on Together.ai ’s cloud, enables organizations to achieve state-of-the-art performance on specialized tasks at significantly reduced inference cost compared to the large proprietary models. Parsed Healthcare Use Case: Scribe One type of customer that Parsed has worked closely with in the healthcare space are ambient scribes, who transcribe clinician-patient interactions and write clinical notes in the style of that doctor for the particular interaction. Whilst on the surface this may seem like a simple and innocuous summarisation problem, there are several aspects that make it challenging even for the largest frontier models. First, transcriptions can be as long as a couple of hours, meaning the language model must process upwards of 30,000 tokens of transcript (not including the prompt and other information), which leads to degraded quality, increased hallucinations, and lower adherence to instruction following. Second, it is difficult to describe precisely the many levels of nested instructions and requirements that define the contours of this task; for instance, it is hard to set a threshold for exactly how much information is relevant, particularly when the doctor has the option to choose different levels of detail. Finally, the transcript itself often contains errors (due to noisy doctor rooms, complex medical terminology not handled well by the speech-to-text model, etc.) and thus the LLM generating the clinical note must be able to infer misspellings, mis-quotes, incorrect language all without making unreasonable leaps about what was said. Fortunately, with the right optimization setup, all of these issues can be solved. Parsed has worked with several scribes now. By defining a rigorous evaluation harness that…

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Substantial claim in a research post, not a release.