parasail-ai/curator
forked from bespokelabsai/curator
Description: Synthetic data curation for post-training and structured data extraction
Language: Python
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2025-05-07T17:46:19Z
Pushed: 2026-06-09T23:29:49Z
Default branch: main
Fork: yes
Parent repository: bespokelabsai/curator
Archived: no
README:
Bespoke Curator
Bulk Inference and Scalable Data Curation for Post-Training
🎉 What's New
- [2026.03.14] [Tinker integration for fine-tuning](examples/poem_finetuning_example.py): Go from curated data to a LoRA fine-tuned model in a few lines of Python using the Tinker SDK.
- [2025.12.05] Launched OpenThoughts-Agents, whose data was curated using Curator.
- [2025.04.09] Launching the Reasoning Datasets Competition with HuggingFace and Together.ai. Win prizes worth $5,000 USD!
- [2025.04.03] We used Bespoke Curator to create the OpenThoughts2-1M dataset, which was used to train OpenThinker2-32B that outperforms DeepSeek-R1-32B. The dataset started trending on HuggingFace.
- [2025.03.12] Gemini Batch support added: the Gemini batch API is extremely challenging to use, and we made it much simpler! :)
- [2025.03.05] Claude 3.7 Sonnet Thinking and batch mode support added.
- [2025.02.26] Code Execution Support added: You can now run code (generated by Curator) using CodeExecutor. We support four backends: local (called multiprocessing), Ray, Docker and e2b.
- [2025.02.06] We used Bespoke Curator to create [s1K-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1), a high-quality sample-efficient reasoning dataset.
- [2025.01.30] Batch Processing Support for OpenAI, Anthropic, and other compatible APIs: Cut Token Costs in Half 🔥🔥🔥. Through our partnership with kluster.ai, new users using Curator can access open-source models like DeepSeek-R1 and receive a $25 credit (limits apply). EDIT: Promotion has come to an end.
- [2025.01.27] We used Bespoke Curator to create OpenThoughts-114k, a high-quality reasoning dataset (trending on HuggingFace).
- [2025.01.22] We used Bespoke Curator to create Bespoke-Stratos-17k, a high-quality reasoning dataset (trending on HuggingFace).
- [2025.01.15] Curator launched 🎉
Overview
Bespoke Curator makes it easy to create synthetic data pipelines. Whether you are training a model or extracting structured data, Curator will prepare high-quality data quickly and robustly.
- Rich Python-based library for generating and curating synthetic data.
- Viewer to monitor data while it is being generated.
- First-class support for structured outputs.
- Built-in performance optimizations for asynchronous operations, caching, and fault recovery at every scale.
- Support for a wide range of inference options via LiteLLM, vLLM, and popular batch APIs.
Check out our full documentation for getting started, tutorials, guides and detailed reference.
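To make the performance bullet above more concrete, here is a minimal sketch of how caching, retries, and bounded asynchronous concurrency can fit together. It is not Curator's implementation: `CachedAsyncRunner`, `call_model`, and the in-memory cache are hypothetical names chosen for illustration, and a real pipeline would persist its cache and back off between retries.

```python
import asyncio
import hashlib
import json


class CachedAsyncRunner:
    """Toy runner (hypothetical, not part of Curator): caches results by
    prompt hash, retries transient failures, and bounds concurrency."""

    def __init__(self, call_model, max_concurrency=4, max_retries=3):
        self.call_model = call_model  # assumed async callable: prompt -> result
        self.max_concurrency = max_concurrency
        self.max_retries = max_retries
        self.cache = {}  # in-memory only; a real cache would live on disk

    @staticmethod
    def _key(prompt):
        # A stable hash of the prompt serves as the cache key.
        return hashlib.sha256(json.dumps(prompt, sort_keys=True).encode()).hexdigest()

    async def _one(self, prompt, semaphore):
        key = self._key(prompt)
        if key in self.cache:
            return self.cache[key]  # cache hit: no model call
        async with semaphore:  # at most max_concurrency calls in flight
            for attempt in range(self.max_retries):
                try:
                    result = await self.call_model(prompt)
                    break
                except RuntimeError:
                    if attempt == self.max_retries - 1:
                        raise  # retries exhausted
                    await asyncio.sleep(0)  # placeholder for a real backoff delay
        self.cache[key] = result
        return result

    async def run(self, prompts):
        # Results come back in the same order as the input prompts.
        semaphore = asyncio.Semaphore(self.max_concurrency)
        return await asyncio.gather(*(self._one(p, semaphore) for p in prompts))
```

Calling `asyncio.run(runner.run(prompts))` a second time with the same prompts is served from the cache, which is the behaviour that makes an interrupted job cheap to resume.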
🛠️ Installation
pip install bespokelabs-curator
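As a quick sanity check after installing, the standard library can report which version of the distribution is present. This is a small sketch using `importlib.metadata`; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(dist_name="bespokelabs-curator"):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "bespokelabs-curator is not installed")
```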
📕 Examples
Finetuning/Distillation
| Task | Link(s) | Goal |
|----------|--------------|-------------|
| Product feature extraction | | Finetuning a model to identify features of a product |
| Sentiment analysis | | Aspect-based sentiment analysis of restaurant reviews and finetuning using Together.ai |
| RAFT for domain-specific RAG | Code | Implement Retrieval Augmented Fine-Tuning (RAFT) that processes domain-specific documents, generates questions, and prepares data for fine-tuning LLMs. |
| Poem generation & LoRA fine-tuning | Code | End-to-end pipeline: curate poem data with Curator, then LoRA fine-tune with TinkerTrainer |
Data Generation
| Task | Link(s) | Goal |
|----------|--------------|-------------|
| Reasoning dataset generation (Bespoke Stratos) | Code | Generate the Bespoke-Stratos-17k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
| Reasoning dataset generation (Open Thoughts) | Code | Generate the Open-Thoughts-114k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
| Multimodal | Code | Demonstrates multimodal capabilities by generating recipes from food images |
| Ungrounded Question Answer generation | Code | Generate diverse question-answer pairs using techniques similar to the CAMEL paper |
| Code Execution | | Execute code generated with Curator |
| 3Blue1Brown video generation | Code | Generate videos similar to 3Blue1Brown and render them using code execution! |
| Synthetic charts | Code | Generate charts synthetically. |
| Function calling | Code | Generate data for finetuning for function calling. |
🚀 Quickstart
Using curator.LLM for Bulk Inference
from typing import Dict, Literal

from bespokelabs import curator
from datasets import Dataset
from pydantic import BaseModel, Field


class Sentiment(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Sentiment of the review")


class SentimentAnalyzer(curator.LLM):
    def prompt(self, product: Dict):
        return f"Determine the sentiment of the product from the review: {product['review']}"

    def parse(self, product: Dict, response: Sentiment):
        return [{"name": product["name"], "sentiment": response.sentiment}]


# You can easily have a million rows here.
# Curator takes care of parallelism, retries, and caches responses.
dataset = [{"name": "Curator", "review": "Already saved hours in one day of use."},
           {"name": "Bespoke…
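The excerpt above is cut off before the analyzer is invoked. As a rough illustration of the prompt/parse pattern it sets up, the sketch below runs the same two methods offline against a stub. Everything here is an assumption made for illustration: `ToySentimentAnalyzer` is not the `curator.LLM` class, `keyword_model` is a hypothetical rule-based stand-in for a real model, a dataclass replaces the pydantic model, and the second row is made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sentiment:
    sentiment: str  # "positive", "negative", or "neutral"


class ToySentimentAnalyzer:
    """Stand-in for the prompt/parse pattern; not the curator.LLM class."""

    def __init__(self, model: Callable[[str], Sentiment]):
        self.model = model  # hypothetical callable replacing a real LLM backend

    def prompt(self, product: Dict) -> str:
        return f"Determine the sentiment of the product from the review: {product['review']}"

    def parse(self, product: Dict, response: Sentiment) -> List[Dict]:
        return [{"name": product["name"], "sentiment": response.sentiment}]

    def __call__(self, rows: List[Dict]) -> List[Dict]:
        # One prompt per input row; each parse() may emit one or more output rows.
        results = []
        for row in rows:
            response = self.model(self.prompt(row))
            results.extend(self.parse(row, response))
        return results


def keyword_model(prompt: str) -> Sentiment:
    """Hypothetical rule-based stub so the sketch runs without any API key."""
    text = prompt.lower()
    if "saved" in text or "great" in text:
        return Sentiment("positive")
    if "broken" in text or "slow" in text:
        return Sentiment("negative")
    return Sentiment("neutral")


rows = [
    {"name": "Curator", "review": "Already saved hours in one day of use."},
    {"name": "Widget", "review": "Arrived broken."},  # made-up row for illustration
]
analyzer = ToySentimentAnalyzer(keyword_model)
results = analyzer(rows)
```

The split keeps row-to-prompt logic and response-to-row logic separate, so the surrounding machinery (here a plain loop) can be swapped for something parallel without touching either method.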
Notability
notability 3.0/10: Routine fork of a repo