Repo · Meta AI (Llama) · published Mar 27, 2025

meta-llama/synthetic-data-kit

Python


Description: Tool for generating high quality Synthetic datasets

Language: Python

License: MIT

Stars: 1597

Forks: 219

Open issues: 48

Created: 2025-03-27T06:40:42Z

Pushed: 2025-10-28T20:10:55Z

Default branch: main

Fork: no

Archived: no

README:

Synthetic Data Kit

Tool for generating high-quality synthetic datasets to fine-tune LLMs.

Generate Reasoning Traces, QA Pairs, save them to a fine-tuning format with a simple CLI.

> Check out our guide on using the tool to unlock task-specific reasoning in the Llama-3 family

What does Synthetic Data Kit offer?

Fine-tuning large language models is easy. There are many mature tools that you can use to fine-tune the Llama model family with various post-training techniques.

Why target data preparation?

Multiple tools support standardized formats. However, most of the time your dataset is not structured as "user"/"assistant" threads or in a format that plays well with fine-tuning packages.

This toolkit simplifies the journey of:

  • Using an LLM (vLLM or any local/external API endpoint) to generate examples
  • A modular 4-command flow
  • Converting your existing files to fine-tuning friendly formats
  • Creating synthetic datasets
  • Supporting various formats of post-training fine-tuning

How does Synthetic Data Kit offer it?

The tool is designed to follow a simple CLI structure with 4 commands:

  • ingest: Parse various file formats.
  • create: Generate your fine-tuning format: QA pairs, QA pairs with CoT, or summary format.
  • curate: Use Llama as a judge to curate high-quality examples.
  • save-as: Save the curated examples to the format your fine-tuning workflow requires.

You can override any parameter or detail by either using the CLI or overriding the default YAML config.
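The curate step can be pictured as a threshold filter over judge ratings. The sketch below is only an illustration of that idea, not the tool's implementation: the `rating` field name, the sample pairs, and the `filter_by_rating` helper are all hypothetical.

```python
from typing import Iterable


def filter_by_rating(pairs: Iterable[dict], threshold: float = 7.0) -> list[dict]:
    """Keep only the QA pairs whose judge rating meets the threshold.

    Assumption: each pair carries a numeric "rating" key. That field name
    is hypothetical and not taken from the tool's output format.
    """
    return [p for p in pairs if float(p.get("rating", 0.0)) >= threshold]


# Hypothetical judge output used purely for illustration.
rated_pairs = [
    {"question": "What does ingest do?", "answer": "It parses documents.", "rating": 9.0},
    {"question": "What is X?", "answer": "Unclear.", "rating": 4.5},
    {"question": "What does save-as do?", "answer": "It exports a training format.", "rating": 7.0},
]

kept = filter_by_rating(rated_pairs, threshold=7.0)
print(f"kept {len(kept)} of {len(rated_pairs)} pairs")
```

Raising the threshold trades dataset size for quality, which is why the CLI and the YAML config both expose it.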

Installation

From PyPI

# Create a new environment

conda create -n synthetic-data python=3.10

conda activate synthetic-data

pip install synthetic-data-kit

(Alternatively) From Source

git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .

To get an overview of commands type:

synthetic-data-kit --help

1. Tool Setup

  • The tool can process both individual files and entire directories.
# Create directory structure for the 4-stage pipeline
mkdir -p data/{input,parsed,generated,curated,final}

# Or use the legacy structure (still supported)
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}
  • You also need an LLM backend to generate your dataset. If using vLLM:
# Start vLLM server
# Note: you will need your Hugging Face access token from: https://huggingface.co/settings/tokens
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
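If you prefer to set up the 4-stage directory layout from a script rather than the shell, a minimal Python equivalent of the `mkdir -p` command above could look like this. The helper name `create_pipeline_dirs` is hypothetical; the directory names come from the command above.

```python
from pathlib import Path

# Stage directories from the 4-stage layout shown above.
STAGE_DIRS = ("input", "parsed", "generated", "curated", "final")


def create_pipeline_dirs(root: str = "data") -> list[Path]:
    """Create <root>/<stage> for every stage; safe to call repeatedly."""
    created = []
    for name in STAGE_DIRS:
        path = Path(root) / name
        # parents=True and exist_ok=True mirror the behaviour of `mkdir -p`.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    for directory in create_pipeline_dirs("data"):
        print(directory)
```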

2. Usage

The flow follows 4 simple steps: ingest, create, curate, save-as. You can process individual files or entire directories. All data is now stored in Lance format by default.

# Check if your backend is running
synthetic-data-kit system-check

# SINGLE FILE PROCESSING (Original approach)
# Parse a document to a Lance dataset
synthetic-data-kit ingest docs/report.pdf
# This saves the file to data/parsed/report.lance

# Generate QA pairs (default)
synthetic-data-kit create data/parsed/report.lance --type qa

OR

# Generate Chain of Thought (CoT) reasoning examples
synthetic-data-kit create data/parsed/report.lance --type cot

# Both of these save a file to data/generated/report_qa_pairs.json

# Filter content based on quality
synthetic-data-kit curate data/generated/report_qa_pairs.json

# Convert to Alpaca fine-tuning format and save as a Hugging Face Arrow file
synthetic-data-kit save-as data/curated/report_cleaned.json --format alpaca --storage hf
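When scripting around the single-file flow, it helps to know where each stage writes its output. The sketch below derives those paths from the input file name, following the paths shown in the comments above. It assumes the same naming pattern holds for other inputs, which the README does not state explicitly; `expected_outputs` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path


def expected_outputs(source: str, data_root: str = "data") -> dict[str, str]:
    """Guess the per-stage output paths for one input document.

    Assumption: the naming pattern from the README's report.pdf example
    (report.lance, report_qa_pairs.json, report_cleaned.json) generalises.
    """
    stem = Path(source).stem
    root = Path(data_root)
    return {
        "parsed": (root / "parsed" / f"{stem}.lance").as_posix(),
        "generated": (root / "generated" / f"{stem}_qa_pairs.json").as_posix(),
        "curated": (root / "curated" / f"{stem}_cleaned.json").as_posix(),
    }


paths = expected_outputs("docs/report.pdf")
for stage, path in paths.items():
    print(f"{stage}: {path}")
```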

2.1 Batch Directory Processing (New)

Process entire directories of files with a single command:

# Parse all documents in a directory
synthetic-data-kit ingest ./documents/
# Processes all .pdf, .html, .docx, .pptx, .txt files
# Saves parsed text files to data/parsed/

# Generate QA pairs for all text files
synthetic-data-kit create ./data/parsed/ --type qa
# Processes all .txt files in the directory
# Saves QA pairs to data/generated/

# Curate all generated files
synthetic-data-kit curate ./data/generated/ --threshold 8.0
# Processes all .json files in the directory
# Saves curated files to data/curated/

# Convert all curated files to training format
synthetic-data-kit save-as ./data/curated/ --format alpaca
# Processes all .json files in the directory
# Saves final files to data/final/
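The four batch commands above can also be driven from Python. This is a rough sketch that only uses the subcommands and flags shown in this README; the helper names `build_batch_commands` and `run_pipeline` are hypothetical, and the runner defaults to a dry run so nothing executes unless you opt in.

```python
import shlex
import subprocess


def build_batch_commands(
    docs_dir: str,
    data_root: str = "./data",
    threshold: float = 8.0,
    fmt: str = "alpaca",
) -> list[list[str]]:
    """Return the ingest, create, curate, save-as commands as argument lists."""
    cli = "synthetic-data-kit"
    return [
        [cli, "ingest", docs_dir],
        [cli, "create", f"{data_root}/parsed/", "--type", "qa"],
        [cli, "curate", f"{data_root}/generated/", "--threshold", str(threshold)],
        [cli, "save-as", f"{data_root}/curated/", "--format", fmt],
    ]


def run_pipeline(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(cmd, check=True)


commands = build_batch_commands("./documents/")
run_pipeline(commands, dry_run=True)
```

Passing argument lists to `subprocess.run` avoids shell quoting problems with paths that contain spaces.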

2.2 Preview Mode

Use --preview to see what files would be processed without actually processing them:

# Preview files before processing
synthetic-data-kit ingest ./documents --preview
# Shows: directory stats, file counts by extension, list of files

synthetic-data-kit create ./data/parsed --preview
# Shows: .txt files that would be processed
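To get a feel for what a preview reports, the following sketch computes a similar summary (file counts by extension and a file list) for the extensions that ingest is described as handling. It is an approximation written for illustration, not the tool's own logic. In particular, it only looks at the top level of the directory, and whether the tool recurses into subdirectories is not stated here.

```python
from collections import Counter
from pathlib import Path

# Extensions listed in the batch ingest comments above.
SUPPORTED = {".pdf", ".html", ".docx", ".pptx", ".txt"}


def preview_directory(directory: str) -> dict:
    """Summarise the supported files directly inside `directory`."""
    files = sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
    counts = Counter(p.suffix.lower() for p in files)
    return {
        "total": len(files),
        "by_extension": dict(counts),
        "files": [p.name for p in files],
    }


if __name__ == "__main__":
    print(preview_directory("."))
```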

Configuration

The toolkit uses a YAML configuration file (default: configs/config.yaml).

Note: this can be overridden either via CLI arguments or by passing a custom YAML file.

# Example configuration using vLLM
llm:
  provider: "vllm"

vllm:
  api_base: "http://localhost:8000/v1"
  model: "meta-llama/Llama-3.3-70B-Instruct"
  sleep_time: 0.1

generation:
  temperature: 0.7
  chunk_size: 4000
  num_pairs: 25
  max_context_length: 8000

curate:
  threshold: 7.0
  batch_size: 8

or using an API endpoint:

# Example configuration using the llama API
llm:
  provider: "api-endpoint"

api-endpoint:
  api_base: "https://api.llama.com/v1"
  api_key: "llama-api-key"
  model: "Llama-4-Maverick-17B-128E-Instruct-FP8"
  max_retries: 3
  sleep_time: 0.5

Customizing Configuration

Create an overriding configuration file and use it with the -c flag:

synthetic-data-kit -c my_config.yaml ingest docs/paper.pdf
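One way to think about an overriding file is as a partial config layered over the defaults. The sketch below shows that layering with a recursive merge over plain dictionaries, using the values from the vLLM example above. Whether the tool merges key by key or replaces the default file entirely is an assumption here, and `merge_config` is a hypothetical helper.

```python
import copy

# Defaults copied from the vLLM example configuration above.
DEFAULTS = {
    "llm": {"provider": "vllm"},
    "vllm": {
        "api_base": "http://localhost:8000/v1",
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "sleep_time": 0.1,
    },
    "generation": {
        "temperature": 0.7,
        "chunk_size": 4000,
        "num_pairs": 25,
        "max_context_length": 8000,
    },
    "curate": {"threshold": 7.0, "batch_size": 8},
}


def merge_config(base: dict, override: dict) -> dict:
    """Return a new dict where `override` values win, section by section."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A hypothetical my_config.yaml that only changes two settings.
custom = {"generation": {"num_pairs": 30}, "curate": {"threshold": 8.5}}
effective = merge_config(DEFAULTS, custom)
print(effective["generation"], effective["curate"])
```

Under this model an override file only needs to list the settings that differ from the defaults.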

Examples

Processing a Single PDF Document

# Ingest PDF
synthetic-data-kit ingest research_paper.pdf

# Generate QA pairs
synthetic-data-kit create data/parsed/research_paper.txt -n 30

# Curate data
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5

# Save in OpenAI fine-tuning format (JSON)
synthetic-data-kit save-as data/curated/research_paper_cleaned.json -f ft

# Save in OpenAI fine-tuning format (HF dataset)
synthetic-data-kit save-as…
