What does this fork signal mean?

Parasail forked bespokelabsai/curator as parasail-ai/curator. The fork points to upstream code the lab may be inspecting, patching, or building on. Key details: repo parasail-ai/curator · parent bespokelabsai/curator · routine fork of a repo. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

Parasail Fork: parasail-ai/curator

Captured source

source ↗

GitHub/github.com/parasail-ai/curator

parasail-ai/curator repository metadata

published May 7, 2025 · seen Jun 5 · captured Jun 11 · HTTP 200 · method plain

parasail-ai/curator

Description: Synthetic data curation for post-training and structured data extraction

Language: Python

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2025-05-07T17:46:19Z

Pushed: 2026-06-09T23:29:49Z

Default branch: main

Fork: yes

Parent repository: bespokelabsai/curator

Archived: no

README:

Bespoke Curator

Bulk Inference and Scalable Data Curation for Post-Training

🎉 What's New

[2026.03.14] [Tinker integration for fine-tuning](examples/poem_finetuning_example.py): Go from curated data to a LoRA fine-tuned model in a few lines of Python using the Tinker SDK.
[2025.12.05] Launched OpenThoughts-Agents whose data was curated using Curator.
[2025.04.09] Launching Reasoning Datasets Competition with HuggingFace and Together.ai. Win $5000 USD worth of prizes!
[2025.04.03] We used Bespoke Curator to create OpenThoughts2-1M dataset, which was used to train OpenThinker2-32B that outperforms DeepSeek-R1-32B. The dataset started trending on HuggingFace.
[2025.03.12] Gemini Batch support added: Gemini batch API is extremely challenging, and we made it much simpler! :)
[2025.03.05] Claude 3.7 Sonnet Thinking and batch mode support added.
[2025.02.26] Code Execution Support added: You can now run code (generated by Curator) using CodeExecutor. We support four backends: local (called multiprocessing), Ray, Docker and e2b.
[2025.02.06] We used Bespoke Curator to create [s1K-1.1](https://huggingface.co/datasets/simplescaling/s1K-1.1), a high-quality sample-efficient reasoning dataset.
[2025.01.30] Batch Processing Support for OpenAI, Anthropic, and other compatible APIs: Cut Token Costs in Half 🔥🔥🔥. Through our partnership with kluster.ai, new users using Curator can access open-source models like DeepSeek-R1 and receive a $25 credit (limits apply). EDIT: Promotion has come to an end.
[2025.01.27] We used Bespoke Curator to create OpenThoughts-114k, a high-quality reasoning dataset (trending on HuggingFace).
[2025.01.22] We used Bespoke Curator to create Bespoke-Stratos-17k, a high-quality reasoning dataset (trending on HuggingFace).
[2025.01.15] Curator launched 🎉

Overview

Bespoke Curator makes it easy to create synthetic data pipelines. Whether you are training a model or extracting structured data, Curator will prepare high-quality data quickly and robustly.

Rich Python-based library for generating and curating synthetic data.
Viewer to monitor data while it is being generated.
First-class support for structured outputs.
Built-in performance optimizations for asynchronous operations, caching, and fault recovery at every scale.
Support for a wide range of inference options via LiteLLM, vLLM, and popular batch APIs.

[Image: CLI in action]

Check out our full documentation for getting started, tutorials, guides and detailed reference.

🛠️ Installation

pip install bespokelabs-curator

📕 Examples

Finetuning/Distillation

| Task | Link(s) | Goal |
|----------|--------------|-------------|
| Product feature extraction | | Finetuning a model to identify features of a product |
| Sentiment analysis | | Aspect-based sentiment analysis of restaurant reviews and finetuning using Together.ai |
| RAFT for domain-specific RAG | Code | Implement Retrieval Augmented Fine-Tuning (RAFT) that processes domain-specific documents, generates questions, and prepares data for fine-tuning LLMs. |
| Poem generation & LoRA fine-tuning | Code | End-to-end pipeline: curate poem data with Curator, then LoRA fine-tune with TinkerTrainer |

Data Generation

| Task | Link(s) | Goal |
|----------|--------------|-------------|
| Reasoning dataset generation (Bespoke Stratos) | Code | Generate the Bespoke-Stratos-17k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
| Reasoning dataset generation (Open Thoughts) | Code | Generate the Open-Thoughts-114k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
| Multimodal | Code | Demonstrates multimodal capabilities by generating recipes from food images |
| Ungrounded Question Answer generation | Code | Generate diverse question-answer pairs using techniques similar to the CAMEL paper |
| Code Execution | | Execute code generated with Curator |
| 3Blue1Brown video generation | Code | Generate videos similar to 3Blue1Brown and render them using code execution! |
| Synthetic charts | Code | Generate charts synthetically. |
| Function calling | Code | Generate data for finetuning for function calling. |

🚀 Quickstart

Using `curator.LLM` for Bulk Inference

from typing import Dict
from bespokelabs import curator
from datasets import Dataset
from pydantic import BaseModel, Field
from typing import Literal

class Sentiment(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Sentiment of the review")

class SentimentAnalyzer(curator.LLM):

    def prompt(self, product: Dict):
        return f"Determine the sentiment of the product from the review: {product['review']}"

    def parse(self, product: Dict, response: Sentiment):
        return [{"name": product["name"], "sentiment": response.sentiment}]

# You can easily have a million rows here.
# Curator takes care of parallelism, retries, and caches responses.
dataset = [{"name": "Curator", "review": "Already saved hours in one day of use."},
{"name": "Bespoke...

Excerpt shown — open the source for the full document.
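
Because the excerpt cuts off before the pipeline is run, the sketch below shows the prompt/parse pattern on its own, using only the standard library. It is a hypothetical stand-in, not Curator's implementation or API: `run`, `fake_respond`, and the dataclass version of `Sentiment` are names assumed here for illustration, and the in-memory dictionary is only a toy analogue of the response caching mentioned in the excerpt's comments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sentiment:
    # Stand-in for the pydantic response model in the excerpt (assumption).
    sentiment: str


class SentimentAnalyzer:
    """Hypothetical stand-in for a curator.LLM subclass; makes no API calls."""

    def prompt(self, product: Dict) -> str:
        return f"Determine the sentiment of the product from the review: {product['review']}"

    def parse(self, product: Dict, response: Sentiment) -> List[Dict]:
        return [{"name": product["name"], "sentiment": response.sentiment}]


def run(analyzer: SentimentAnalyzer, rows: List[Dict],
        respond: Callable[[str], Sentiment]) -> List[Dict]:
    """Apply prompt -> respond -> parse to each row, caching responses by prompt."""
    cache: Dict[str, Sentiment] = {}
    out: List[Dict] = []
    for row in rows:
        text = analyzer.prompt(row)
        if text not in cache:
            # A real pipeline would call a model here; this sketch calls `respond`.
            cache[text] = respond(text)
        out.extend(analyzer.parse(row, cache[text]))
    return out


calls: List[str] = []


def fake_respond(prompt_text: str) -> Sentiment:
    # Toy rule standing in for a model; records each call for inspection.
    calls.append(prompt_text)
    return Sentiment("positive" if "saved" in prompt_text else "neutral")


rows = [
    {"name": "Curator", "review": "Already saved hours in one day of use."},
    {"name": "Curator", "review": "Already saved hours in one day of use."},
]
results = run(SentimentAnalyzer(), rows, fake_respond)
print(results)
```

Both rows produce the same prompt, so `fake_respond` is called once and the second row is served from the cache. The real library's parallelism, retries, and persistent caching are not modeled here.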

Notability

notability 3.0/10

Routine fork of a repo