NVIDIA/physicsnemo-curator
Python
Captured source
source ↗NVIDIA/physicsnemo-curator
Description: Accelerated ETL toolkit for building AI-ready datasets across multiple scientific and engineering domains
Language: Python
License: Apache-2.0
Stars: 54
Forks: 18
Open issues: 3
Created: 2025-06-02T19:17:46Z
Pushed: 2026-06-08T20:06:50Z
Default branch: main
Fork: no
Archived: no
README:
PhysicsNeMo Curator
   
PhysicsNeMo Curator is an accelerated ETL toolkit for building AI-ready datasets across multiple scientific and engineering domains, including CAE, weather/climate, molecular dynamics, and more. Curator is designed to be a flexible and customizable package that provides core pipeline components for users to create their own data processing pipelines.
**Docs** | [Getting Started](#getting-started) | **Domains** | **Extending** | **Examples** | [Contributing](#contributing-to-physicsnemo-curator)
---
> [!WARNING] > This package is in beta and subject to extensive changes. There are no guarantees for API stability.
Key Features
- Fluent pipeline API — chain
Source → Filter → Sinkwith a single
expression, then execute in parallel
- Lazy generator semantics — sources and filters yield items lazily;
pipeline[i] processes only the *i*-th item
- Multiple domains — first-class support for unstructured meshes
(physicsnemo.mesh.Mesh), gridded data arrays (xarray.DataArray), and atomic/molecular data (nvalchemi.data.AtomicData)
- Pluggable execution — sequential, thread pool, process pool, Loky,
Dask, or Prefect backends
- Registry & CLI — all sources, filters, and sinks are discoverable
via a global registry and optional interactive CLI
- Extensible — write custom sources, filters, and sinks with minimal
boilerplate (guide)
Getting Started
Requirements
- Python >= 3.11
- OS: Linux x86_64
- Rust toolchain (for building the native extension from source)
Installation
git clone git@github.com:NVIDIA/physicsnemo-curator.git cd physicsnemo-curator # Install all dev dependencies and build the Rust extension uv sync --group dev uv run maturin develop # (Optional) Install pre-commit hooks uv run pre-commit install
Quick Start Sample
Curate a simple global weather dataset:
# First install the data array dependency group uv sync --extra da
from datetime import datetime, timedelta
from physicsnemo_curator.domains.da.filters.stats import DataArrayStatsFilter
from physicsnemo_curator.domains.da.sinks.zarr_writer import ZarrSink
from physicsnemo_curator.domains.da.sources.era5 import ERA5Source
from physicsnemo_curator.run import run_pipeline
# Hourly timestamps for one day
times = [datetime(2020, 1, 1) + timedelta(hours=h) for h in range(24)]
# Source → Filter → Sink
pipeline = (
ERA5Source(times=times, variables=["u10m", "v10m", "t2m"], backend="arco")
.filter(DataArrayStatsFilter(output="output/stats.zarr", dims=("time",)))
.write(ZarrSink(output_path="output/dataset.zarr"))
)
# Execute in parallel
results = run_pipeline(pipeline, n_jobs=4, backend="process_pool")Optional Dependencies
Install domain-specific extras as needed:
# Mesh domain (CAE, CFD) pip install physicsnemo-curator[mesh] # DataArray domain (weather/climate) pip install physicsnemo-curator[da] # Atomic domain (molecular dynamics) pip install physicsnemo-curator[atm] # Dashboard pip install physicsnemo-curator[dashboard]
CLI
PhysicsNeMo Curator includes the psnc command-line tool with an interactive full-screen pipeline wizard powered by Textual.
psnc
Dashboard
pip install 'physicsnemo-curator[dashboard]' psnc dashboard pipeline.db
Contributing to PhysicsNeMo Curator
PhysicsNeMo Curator is an open source project and its success is rooted in community contributions. Thank you for contributing so others can build on your work.
For guidance, please refer to the [contributing guidelines](CONTRIBUTING.md). See also:
how to write custom sources, filters, and sinks
style conventions, benchmarking, and AI-assisted development
Ecosystem
PhysicsNeMo Curator is part of NVIDIA's open-source Physics-ML ecosystem:
| Package | Description | |---|---| | PhysicsNeMo | Core framework for building, training, and fine-tuning physics-ML models | | PhysicsNeMo CFD | Pretrained AI models for computational fluid dynamics | | Earth-2 Studio | Pretrained AI models for weather and climate | | ALCHEMI Toolkit | GPU-first framework for AI-driven atomic simulations | | ALCHEMI Toolkit Ops | GPU-optimized primitives for neighbor lists, dispersion, and electrostatics |
Communication
- GitHub Discussions — new data formats, transformations, Physics-ML research
- GitHub Issues — bug reports, feature requests, installation issues
License
PhysicsNeMo Curator is provided under the Apache License 2.0. See [LICENSE.txt](./LICENSE.txt) for the full license text.
Notability
Scored, but no written rationale attached yet.
NVIDIA has a repo signal matching data demand.