Fork · Nous Research · published Jan 13, 2025

NousResearch/datatrove

forked from huggingface/datatrove

Description: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

License: Apache-2.0

Stars: 8

Forks: 2

Open issues: 0

Created: 2025-01-13T20:59:53Z

Pushed: 2025-01-13T21:07:42Z

Default branch: main

Fork: yes

Parent repository: huggingface/datatrove

Archived: no

README:

DataTrove

DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.

DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a slurm cluster. Its (relatively) low memory usage and multi-step design make it ideal for large workloads, such as processing an LLM's training data.

Local, remote and other file systems are supported through fsspec.

Table of contents

  • [Installation](#installation)
  • [Quickstart examples](#quickstart-examples)
  • [Terminology](#terminology)
  • [Pipeline](#pipeline)
  • [DataTrove Document](#datatrove-document)
  • [Types of pipeline blocks](#types-of-pipeline-blocks)
  • [Full pipeline](#full-pipeline)
  • [Executors](#executors)
  • [LocalPipelineExecutor](#localpipelineexecutor)
  • [SlurmPipelineExecutor](#slurmpipelineexecutor)
  • [Logging](#logging)
  • [DataFolder / paths](#datafolder--paths)
  • [Practical guides](#practical-guides)
  • [Reading data](#reading-data)
  • [Extracting text](#extracting-text)
  • [Filtering data](#filtering-data)
  • [Saving data](#saving-data)
  • [Deduplicating data](#deduplicating-data)
  • [Summary Statistics](#summary-statistics)
  • [Custom blocks](#custom-blocks)

      • [Simple data](#simple-data)
      • [Custom function](#custom-function)
      • [Custom block](#custom-block)

  • [Contributing](#contributing)
  • [Citation](#citation)

Installation

pip install datatrove[FLAVOUR]

Available flavours (combine them with a comma, e.g. [processing,s3]):

  • all installs everything: pip install datatrove[all]
  • io dependencies to read warc/arc/wet files and arrow/parquet formats: pip install datatrove[io]
  • processing dependencies for text extraction, filtering and tokenization: pip install datatrove[processing]
  • s3 for S3 support: pip install datatrove[s3]
  • cli for command line tools: pip install datatrove[cli]

Quickstart examples

You can check the following [examples](examples):

  • [fineweb.py](examples/fineweb.py) full reproduction of the FineWeb dataset
  • [process_common_crawl_dump.py](examples/process_common_crawl_dump.py) full pipeline to read Common Crawl warc files, extract their text content, filter it and save the resulting data to s3. Runs on slurm
  • [tokenize_c4.py](examples/tokenize_c4.py) reads data directly from the Hugging Face Hub to tokenize the English portion of the C4 dataset using the gpt2 tokenizer
  • [minhash_deduplication.py](examples/minhash_deduplication.py) full pipeline to run minhash deduplication of text data
  • [sentence_deduplication.py](examples/sentence_deduplication.py) example to run sentence level exact deduplication
  • [exact_substrings.py](examples/exact_substrings.py) example to run ExactSubstr (requires this repo)

Terminology

  • pipeline: a list of processing steps to execute (read data, filter, write to disk, etc)
  • executor: runs a specific pipeline on a given execution environment (slurm, multi cpu machine, etc)
  • job: the execution of a pipeline on a given executor
  • task: a job is composed of multiple tasks, which are used to parallelize execution, usually by having each task process a shard of data. Datatrove keeps track of which tasks have completed, and when you relaunch, only incomplete tasks will run.
  • file: an individual input file (.json, .csv, etc).

> [!TIP]
> Note that each file will be processed by a single task. Datatrove does not automatically split a file into multiple parts, so to fully parallelize you should have multiple medium-sized files rather than a single large file.

  • shard: a group of input data (usually a group of files), which will be assigned to a specific task. Each task will process a different, non-overlapping shard of data from the full list of input files
  • worker: compute resource that will execute a single task at a time, e.g., if you have 50 cpu cores you can run a LocalPipelineExecutor with workers=50, to execute 50 tasks simultaneously (one per cpu). Once a worker is done with a task, it will start processing another waiting task

> [!TIP]
> Your number of tasks controls how much you can parallelize and also how much time each individual processing unit will take. If you have a small number of tasks (and they each therefore have to process a large number of files) and they fail, you will have to restart from scratch, whereas if you have a larger number of small tasks each failed task will take way less time to rerun.

> [!CAUTION]
> If you set more tasks than files, some tasks will not process any data, so there is usually no point in setting tasks higher than the number of files.
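
The completion tracking described above can be pictured with a small sketch. This is not DataTrove's implementation: the `run_job` helper, the marker-file naming and the `completions` folder are all hypothetical, chosen only to show why a relaunch reruns just the tasks that did not finish.

```python
import tempfile
from pathlib import Path


def run_job(total_tasks, completions_dir, process_task):
    """Run every task that has no completion marker yet; return the ranks that ran."""
    completions_dir.mkdir(parents=True, exist_ok=True)
    ran = []
    for rank in range(total_tasks):
        marker = completions_dir / f"{rank:05d}"  # hypothetical marker naming
        if marker.exists():
            continue  # finished in an earlier launch, skip it
        try:
            process_task(rank)
        except Exception:
            continue  # no marker is written, so the task stays incomplete
        marker.touch()
        ran.append(rank)
    return ran


failing = {3}  # pretend task 3 crashes on the first launch


def flaky_task(rank):
    if rank in failing:
        raise RuntimeError(f"task {rank} failed")


with tempfile.TemporaryDirectory() as tmp:
    completions = Path(tmp) / "completions"
    first_launch = run_job(5, completions, flaky_task)
    failing.clear()  # the underlying problem has been fixed
    second_launch = run_job(5, completions, flaky_task)

print(first_launch)   # [0, 1, 2, 4]
print(second_launch)  # [3]
```

With many small tasks, the second launch only repeats a small slice of the work, which is the trade-off the tip above describes.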

Example

Consider a job that processes 10000 files on a machine with 100 cpu cores (workers). If we choose to use 1000 tasks, each one will process a shard of 10 files. Setting workers=100 means that we can process 100 tasks at a time.
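
The arithmetic in this example can be checked with a short sketch. It assumes files are assigned to tasks round-robin, which may differ from the strategy the library actually uses; `shard_for_task` and the file names are hypothetical.

```python
import math


def shard_for_task(files, rank, tasks):
    """Round-robin slice: task `rank` takes every `tasks`-th file."""
    return files[rank::tasks]


files = [f"file_{i:05d}.jsonl" for i in range(10_000)]  # hypothetical names
tasks, workers = 1_000, 100

shards = [shard_for_task(files, rank, tasks) for rank in range(tasks)]
files_per_task = {len(shard) for shard in shards}
# Rough number of rounds if every task took the same amount of time.
waves = math.ceil(tasks / workers)

print(files_per_task)  # {10}
print(waves)           # 10

# With more tasks than files, the extra tasks receive empty shards.
small = files[:3]
empty_tasks = sum(1 for rank in range(5) if not shard_for_task(small, rank, 5))
print(empty_tasks)  # 2
```

The last lines illustrate the caution about setting tasks higher than the number of files: two of the five tasks have nothing to process.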

Pipeline

DataTrove Document

Each pipeline block processes data in the datatrove [Document](src/datatrove/data.py) format:

  • text the actual text content for each sample
  • id a unique id (string) for this sample
  • metadata a dictionary where any additional info may be stored
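
As a rough picture of these three fields, here is a simplified stand-in written as a dataclass. It is an illustration only, not the class defined in src/datatrove/data.py, which may carry additional fields or behaviour; the `language` metadata key is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Simplified stand-in for the library's Document, for illustration only."""

    text: str  # the actual text content of the sample
    id: str  # a unique id for the sample
    metadata: dict[str, Any] = field(default_factory=dict)  # any additional info


doc = Document(text="Hello, world!", id="sample-0")
doc.metadata["language"] = "en"  # hypothetical metadata key
print(doc)
```

Using `default_factory` gives every document its own metadata dictionary, so blocks can annotate one sample without touching another.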

Types of pipeline blocks

Each pipeline block takes a generator of Document as input and returns another generator of Document.

  • [readers](src/datatrove/pipeline/readers) read data from different formats and yield Document
  • [writers](src/datatrove/pipeline/writers) save Document to disk/cloud in different formats
  • [extractors](src/datatrove/pipeline/extractors) extract text content from raw formats (such as webpage html)
  • [filters](src/datatrove/pipeline/filters) filter out (remove) some Documents based on specific rules/criteria
  • [stats](src/datatrove/pipeline/stats) blocks to collect statistics on the dataset

-…
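
The generator-in, generator-out contract described above can be sketched in plain Python. The functions below (`read_lines`, `min_length_filter`, `collect_writer`) and the minimal `Document` class are hypothetical stand-ins, not DataTrove's own readers, filters or writers; they only show how blocks chain lazily.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Document:
    """Minimal stand-in for the library's Document."""

    text: str
    id: str
    metadata: dict = field(default_factory=dict)


def read_lines(lines: Iterable[str]) -> Iterator[Document]:
    """Reader: turn raw input into Documents."""
    for i, line in enumerate(lines):
        yield Document(text=line, id=str(i))


def min_length_filter(docs: Iterable[Document], min_chars: int) -> Iterator[Document]:
    """Filter: drop documents shorter than min_chars."""
    for doc in docs:
        if len(doc.text) >= min_chars:
            yield doc


def collect_writer(docs: Iterable[Document], sink: list) -> Iterator[Document]:
    """Writer: store each document, then pass it along unchanged."""
    for doc in docs:
        sink.append(doc.text)
        yield doc


raw = ["short", "a considerably longer line of text", "ok"]
sink = []
stream = collect_writer(min_length_filter(read_lines(raw), min_chars=10), sink)
kept_ids = [doc.id for doc in stream]  # iterating drives the whole chain lazily

print(kept_ids)  # ['1']
print(sink)      # ['a considerably longer line of text']
```

Because each block is a generator, documents flow through one at a time and nothing is held in memory beyond what a block chooses to keep.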

