Fork · Parasail · published May 7, 2025

parasail-ai/curator

forked from bespokelabsai/curator


Description: Synthetic data curation for post-training and structured data extraction

Language: Python

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2025-05-07T17:46:19Z

Pushed: 2026-06-09T23:29:49Z

Default branch: main

Fork: yes

Parent repository: bespokelabsai/curator

Archived: no

README:

Bespoke Curator

Bulk Inference and Scalable Data Curation for Post-Training

🎉 What's New

  • [2026.03.14] [Tinker integration for fine-tuning](examples/poem_finetuning_example.py): Go from curated data to a LoRA fine-tuned model in a few lines of Python using the Tinker SDK.
  • [2025.12.05] Launched OpenThoughts-Agents, whose data was curated using Curator.
  • [2025.04.09] Launching the Reasoning Datasets Competition with HuggingFace and Together.ai. Win prizes worth $5,000 USD!
  • [2025.04.03] We used Bespoke Curator to create the OpenThoughts2-1M dataset, which was used to train OpenThinker2-32B, a model that outperforms DeepSeek-R1-32B. The dataset started trending on HuggingFace.
  • [2025.03.12] Gemini Batch support added: the Gemini batch API is extremely challenging to use, and we made it much simpler! :)
  • [2025.03.05] Claude 3.7 Sonnet Thinking and batch mode support added.
  • [2025.02.26] Code Execution Support added: You can now run code (generated by Curator) using CodeExecutor. We support four backends: local (called multiprocessing), Ray, Docker and e2b.
  • [2025.02.06] We used Bespoke Curator to create [s1K-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1), a high-quality sample-efficient reasoning dataset.
  • [2025.01.30] Batch Processing Support for OpenAI, Anthropic, and other compatible APIs: Cut Token Costs in Half 🔥🔥🔥. Through our partnership with kluster.ai, new users using Curator can access open-source models like DeepSeek-R1 and receive a $25 credit (limits apply). EDIT: Promotion has come to an end.
  • [2025.01.27] We used Bespoke Curator to create OpenThoughts-114k, a high-quality reasoning dataset (trending on HuggingFace).
  • [2025.01.22] We used Bespoke Curator to create Bespoke-Stratos-17k, a high-quality reasoning dataset (trending on HuggingFace).
  • [2025.01.15] Curator launched 🎉

Overview

Bespoke Curator makes it easy to create synthetic data pipelines. Whether you are training a model or extracting structured data, Curator will prepare high-quality data quickly and robustly.

  • Rich Python-based library for generating and curating synthetic data.
  • Viewer to monitor data while it is being generated.
  • First-class support for structured outputs.
  • Built-in performance optimizations for asynchronous operations, caching, and fault recovery at every scale.
  • Support for a wide range of inference options via LiteLLM, vLLM, and popular batch APIs.
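To make the ideas of asynchronous operation, caching, and fault recovery concrete, here is a minimal, library-free sketch. It is not Curator's implementation: `CachedRunner`, `fake_model_call`, and every parameter name are hypothetical and exist only to illustrate the general technique of bounded concurrency with a response cache and simple retries.

```python
import asyncio
import hashlib


async def fake_model_call(prompt: str) -> str:
    """Hypothetical stand-in for a real inference request (not a Curator API)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return prompt.upper()


class CachedRunner:
    """Illustrative only: runs prompts concurrently, caches results, retries failures."""

    def __init__(self, call, max_concurrency=4, max_retries=2):
        self.call = call
        self.max_concurrency = max_concurrency
        self.max_retries = max_retries
        self.cache = {}      # prompt hash -> response
        self.calls_made = 0  # counts attempts actually sent to the backend

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    async def _one(self, prompt, sem):
        key = self._key(prompt)
        if key in self.cache:  # cache hit: skip the backend entirely
            return self.cache[key]
        async with sem:  # cap the number of in-flight requests
            for attempt in range(self.max_retries + 1):
                try:
                    self.calls_made += 1
                    result = await self.call(prompt)
                    break
                except Exception:
                    if attempt == self.max_retries:
                        raise  # retries exhausted, surface the error
        self.cache[key] = result
        return result

    async def _run_all(self, prompts):
        sem = asyncio.Semaphore(self.max_concurrency)
        return await asyncio.gather(*(self._one(p, sem) for p in prompts))

    def run(self, prompts):
        """Return responses in the same order as the input prompts."""
        return asyncio.run(self._run_all(prompts))
```

Running the same prompts a second time through one `CachedRunner` instance is served from the in-memory cache, so `calls_made` does not grow. A production system would persist the cache and add backoff between retries; both are omitted here for brevity.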

(Image: CLI in action)

Check out our full documentation for getting started, tutorials, guides and detailed reference.

🛠️ Installation

pip install bespokelabs-curator

📕 Examples

Finetuning/Distillation

| Task | Link(s) | Goal |
|----------|--------------|-------------|
| Product feature extraction | | Finetuning a model to identify features of a product |
| Sentiment analysis | | Aspect-based sentiment analysis of restaurant reviews and finetuning using Together.ai |
| RAFT for domain-specific RAG | Code | Implement Retrieval Augmented Fine-Tuning (RAFT) that processes domain-specific documents, generates questions, and prepares data for fine-tuning LLMs. |
| Poem generation & LoRA fine-tuning | Code | End-to-end pipeline: curate poem data with Curator, then LoRA fine-tune with TinkerTrainer |

Data Generation

| Task | Link(s) | Goal |
|----------|--------------|-------------|
| Reasoning dataset generation (Bespoke Stratos) | Code | Generate the Bespoke-Stratos-17k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
| Reasoning dataset generation (Open Thoughts) | Code | Generate the Open-Thoughts-114k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
| Multimodal | Code | Demonstrates multimodal capabilities by generating recipes from food images |
| Ungrounded Question Answer generation | Code | Generate diverse question-answer pairs using techniques similar to the CAMEL paper |
| Code Execution | | Execute code generated with Curator |
| 3Blue1Brown video generation | Code | Generate videos similar to 3Blue1Brown and render them using code execution! |
| Synthetic charts | Code | Generate charts synthetically. |
| Function calling | Code | Generate data for finetuning for function calling. |

🚀 Quickstart

Using curator.LLM for Bulk Inference

from typing import Dict
from bespokelabs import curator
from datasets import Dataset
from pydantic import BaseModel, Field
from typing import Literal

class Sentiment(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Sentiment of the review")

class SentimentAnalyzer(curator.LLM):

    def prompt(self, product: Dict):
        return f"Determine the sentiment of the product from the review: {product['review']}"

    def parse(self, product: Dict, response: Sentiment):
        return [{"name": product["name"], "sentiment": response.sentiment}]

# You can easily have a million rows here.
# Curator takes care of parallelism, retries, and caches responses.
dataset = [{"name": "Curator", "review": "Already saved hours in one day of use."},
           {"name": "Bespoke…
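The excerpt is cut off before the analyzer is invoked, so the following is a library-free sketch of the `prompt`/`parse` pattern rather than a completion of the original code. It does not import or call Curator. `MiniLLM`, `keyword_backend`, the dataclass version of `Sentiment`, and the second dataset row are all hypothetical stand-ins, and the "model" is a trivial keyword matcher so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sentiment:
    """Hypothetical plain-dataclass stand-in for the structured response."""
    sentiment: str


class MiniLLM:
    """Hypothetical base class: builds a prompt per row, then parses the response."""

    def __init__(self, backend: Callable[[str], Sentiment]):
        self.backend = backend

    def prompt(self, row: Dict) -> str:
        raise NotImplementedError

    def parse(self, row: Dict, response: Sentiment) -> List[Dict]:
        raise NotImplementedError

    def __call__(self, rows: List[Dict]) -> List[Dict]:
        out = []
        for row in rows:
            response = self.backend(self.prompt(row))
            out.extend(self.parse(row, response))  # parse may emit several rows
        return out


class SentimentAnalyzer(MiniLLM):
    def prompt(self, product: Dict) -> str:
        return f"Determine the sentiment of the product from the review: {product['review']}"

    def parse(self, product: Dict, response: Sentiment) -> List[Dict]:
        return [{"name": product["name"], "sentiment": response.sentiment}]


def keyword_backend(prompt: str) -> Sentiment:
    """Assumed toy backend: keyword matching in place of a real model call."""
    text = prompt.lower()
    if "saved" in text or "great" in text:
        return Sentiment("positive")
    if "broken" in text or "slow" in text:
        return Sentiment("negative")
    return Sentiment("neutral")


rows = [
    {"name": "Curator", "review": "Already saved hours in one day of use."},
    {"name": "Widget", "review": "It arrived on Tuesday."},  # hypothetical row
]
results = SentimentAnalyzer(keyword_backend)(rows)
```

The design point the sketch illustrates is the separation of concerns: `prompt` maps an input row to text, `parse` maps the row plus a structured response to output rows, and everything in between (here, a plain loop) can be swapped out without touching either method.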
