OpenBMB/ForgeTrain
Language: Python
License: Apache-2.0
Stars: 224
Forks: 21
Open issues: 2
Created: 2026-05-14T08:01:19Z
Pushed: 2026-05-26T13:15:23Z
Default branch: main
Fork: no
Archived: no
README:
ForgeTrain
An LLM Pretraining Framework Built End-to-End by an Autonomous Agent Loop + a matching Harness scaffolding *(coming soon)*
🤖 100% AI-Authored · 🚀 44.13% MFU on H100 · 📈 +10% over Megatron-LM · ✅ Production-Validated
[English](./README.md) | [中文](./README_zh.md)
---
> **An LLM pretraining framework written end-to-end by an AI Agent Loop with zero human edits, plus the Harness that produced the pretraining framework *(coming soon)*.**
>
> **Current release: v0.1.0** (NVIDIA H100 · MiniCPM4-0.5B / MiniCPM4-8B training frameworks; matching Harness *coming soon*)
---
✨ Highlights
- 🤖 100% Agent-Loop Authored — the entire framework produced by an AI Agent running in auto-loop mode, with zero manual edits
- 🔄 Self-Diagnosing Agent Loop — read reference → implement → launch job → parse logs → root-cause → patch → pass gate → commit, fully autonomous
- 🚀 44.13% MFU on H100, roughly 10% higher (relative) than the ~40% Megatron-LM baseline; validated on 64× H100 with BF16, DP-only
- ✅ Production-Validated — MiniCPM4-0.5B fully pretrained, real model weights produced (not a demo)
- 🛠️ GEMM + Attention kernels authored by the agent loop — per-op MFU up to 90%; FlashAttention written from scratch, outperforms Transformer Engine / FA3, on par with FA4
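For readers new to the metric: MFU (model FLOPs utilization) is the share of the hardware's theoretical peak FLOP/s that a training run spends on model math. The sketch below is a rough estimate, not ForgeTrain's own accounting. It assumes the common `6 × parameters × tokens` approximation for training FLOPs (which ignores attention FLOPs) and a dense BF16 peak of 989 TFLOP/s per H100 SXM; neither figure comes from this README. The helper names are hypothetical, and the throughput in the usage line is an invented number for illustration only.

```python
# Assumption: dense BF16 peak of one H100 SXM GPU, in FLOP/s.
H100_BF16_PEAK_FLOPS = 989e12


def estimate_mfu(n_params, tokens_per_second, n_gpus,
                 peak_flops_per_gpu=H100_BF16_PEAK_FLOPS):
    """Estimate MFU with the 6 * N * tokens approximation (forward + backward)."""
    achieved_flops = 6.0 * n_params * tokens_per_second
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return new / baseline - 1.0


# The headline comparison: 44.13% MFU against a ~40% baseline.
print(f"{relative_gain(44.13, 40.0):.1%}")  # 10.3%

# Hypothetical throughput (1M tokens/s on 8 GPUs), for illustration only.
print(f"{estimate_mfu(0.5e9, 1_000_000, 8):.2%}")
```

The first print shows why the gap reads as "about 10%": it is a relative gain of 44.13 over 40, not a difference of ten percentage points.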
---
🗺️ Roadmap
- Reproduction live demo
- Huawei MiniCPM5-1B training framework
- Training framework self-generates the Harness scaffolding
---
Feature Comparison
| Feature | ForgeTrain | Megatron-LM |
|---------|:-:|:-:|
| MFU on H100 (MiniCPM4-0.5B, BF16, DP) | 44.13% | ~40% |
| 100% AI-Authored Code | ✅ | ❌ |
| CuTeDSL custom GEMMs (AOT C-export) | ✅ (5 GEMMs) | ❌ |
| Custom FlashAttention (on par with FA4) | ✅ (self-built CuTeDSL impl) | ❌ (uses upstream TE / FA) |
| Checkpoint → HuggingFace export | ✅ (one script) | Manual |
Also supports CUDA Graph, Triton fused kernels, and comm-compute overlap out of the box.
> Comparison based on Megatron-LM v0.15 on the same hardware (H100, SM90). ForgeTrain v1 is scoped to MiniCPM4-0.5B (DP-only) and MiniCPM4-8B (TP=2) × BF16; Megatron-LM supports broader model families and parallelism strategies.
---
📢 News
- 📌 [2026-05] ForgeTrain v0.1.0 released — first public release of the training engine; the Harness that produced it is *coming soon*. MiniCPM4-0.5B pretrained on 64× H100, achieving 44.13% MFU.
---
Table of Contents
- [Highlights](#-highlights)
- [Roadmap](#-roadmap)
- [Feature Comparison](#feature-comparison)
- [News](#-news)
- [Agent-Friendly Quick Deploy](#-agent-friendly-quick-deploy)
- [Repository Layout](#repository-layout)
- [Quick Start](#quick-start)
- [Core Technology](#core-technology)
- [Performance](#performance)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgments](#acknowledgments)
- [Citation](#citation)
---
🤖 Agent-Friendly Quick Deploy
> This repo was produced by an AI Agent and is friendliest to AI Agents. Paste the prompt below into Cursor / Claude Code / Codex / Cline — it will read the README, install dependencies, run the smoke test and report the MFU, without you typing commands one at a time.
🟢 5-step minimal pretraining demo (paste into your Coding Agent)
Following this project's exports/train_engine_0.5B/README.md,
run a 5-step minimal pretraining demo on the current node:
1. Check the environment (Python ≥ 3.11, CUDA ≥ 12.x, H100, PyTorch ≥ 2.4)
and install anything missing;
2. Install the repo: pip install -e . and HF deps: pip install datasets transformers;
3. Import smoke test:
   PYTHONPATH=src python -c "from training_engine_tensor import config; print('OK')"
4. Run 5 steps on HF GSM8K:
   torchrun --standalone --nproc-per-node=1 \
     -m training_engine_tensor pretrain \
     --num-steps 5 --global-batch-size 1 --micro-batch-size 1 \
     --seq-length 4096 \
     --hf-dataset openai/gsm8k --hf-dataset-config main \
     --hf-text-template "Question: {question}\nAnswer: {answer}" \
     --tokenizer-path openbmb/MiniCPM4-0.5B \
     --save-dir ./checkpoints/demo
5. Print the final loss, step time, and MFU.
If anything fails, dig into the source on your own; do not ask me.

> Full single-node 8× H100 and multi-node commands are in the [Quick Start](#quick-start) section below.
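The `--hf-text-template` value in step 4 reads like a Python format string whose placeholders name dataset columns (the `main` config of GSM8K has `question` and `answer` columns). The sketch below shows that rendering, assuming `str.format`-style substitution and assuming the literal `\n` from the shell is meant as a newline. The engine's actual implementation is not shown in this excerpt, `render_example` is a hypothetical helper, and the sample record is invented.

```python
def render_example(template: str, record: dict) -> str:
    """Fill a text template from one dataset record (hypothetical helper)."""
    # Inside shell double quotes, "\n" arrives as a backslash followed by "n".
    # Assumption: the template intends a real newline there.
    return template.replace("\\n", "\n").format(**record)


template = r"Question: {question}\nAnswer: {answer}"
record = {"question": "What is 2 + 3?", "answer": "2 + 3 = 5\n#### 5"}

text = render_example(template, record)
print(text)
```

Each rendered string would then be tokenized and packed into `--seq-length` sized sequences; how ForgeTrain packs them is not covered here.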
---
Repository Layout
This repo bundles a family of subprojects in a strict producer / product relationship:
| Subdirectory | Role |
|---|---|
| harness/ *(coming soon)* | Harness: the scaffolding that drives an Agent Loop to autonomously build a training framework |
| exports/train_engine_0.5B/ | TrainingEngine (0.5B): produced end-to-end by harness/ *(coming soon)*; targets MiniCPM4-0.5B at 44.13% MFU on 8× H100 |
| exports/train_engine_8b/ | TrainingEngine (8B): also produced by harness/ *(coming soon)*; targets MiniCPM4-8B with TP=2 / DP=4 at 50.9% MFU on a single 8× H100 host |
    harness/  ──(bash agent-loop.sh, zero human input)──▶  exports/train_engine_0.5B/
    Harness (coming soon)                                  exports/train_engine_8b/
    producer (gates + prompts + control plane)             product (a runnable training framework)
Each subdirectory has its own README with full CLI docs, config reference, layout, performance baselines, and limitations.
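The TP=2 / DP=4 layout quoted for the 8B engine follows from `world_size = TP × DP` on an 8-GPU host. The sketch below shows one common rank-to-group mapping, assuming tensor-parallel ranks are contiguous (the ordering Megatron-LM uses by default). ForgeTrain's actual mapping is not documented in this excerpt, and `parallel_groups` is a hypothetical helper.

```python
def parallel_groups(world_size: int, tp: int):
    """Split ranks into tensor-parallel and data-parallel groups.

    Assumption: TP ranks are contiguous, so ranks 0..tp-1 share one model replica.
    """
    if world_size % tp != 0:
        raise ValueError("world_size must be divisible by tp")
    dp = world_size // tp
    # Each TP group holds the shards of one model replica.
    tp_groups = [list(range(i * tp, (i + 1) * tp)) for i in range(dp)]
    # Each DP group holds the same shard across all replicas.
    dp_groups = [list(range(j, world_size, tp)) for j in range(tp)]
    return tp_groups, dp_groups


tp_groups, dp_groups = parallel_groups(world_size=8, tp=2)
print(tp_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(dp_groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

With `tp=1` the same function gives the DP-only layout of the 0.5B engine: one data-parallel group containing every rank.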
---
Quick Start
> Environment: Python ≥ 3.11 · CUDA 12.x · PyTorch ≥ 2.4 · NVIDIA H100 80GB (SM90). Full pretraining requires 8× H100; early alignment stages run on a single GPU.
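The requirements above can be checked before installing anything. The script below is not part of the repo, and `version_tuple` and `check_environment` are hypothetical names. It compares only major and minor versions, and it treats compute capability (9, 0) on GPU 0 as a stand-in for "H100", which other SM90 parts would also satisfy.

```python
import re
import sys


def version_tuple(version: str) -> tuple:
    """Return (major, minor) from strings such as '2.4.1+cu121' or '12.4'."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:2])


def check_environment() -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} is older than 3.11"
        )
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if version_tuple(torch.__version__) < (2, 4):
        problems.append(f"PyTorch {torch.__version__} is older than 2.4")
    # torch.version.cuda is None on CPU-only builds.
    if torch.version.cuda is None or version_tuple(torch.version.cuda) < (12, 0):
        problems.append(f"PyTorch CUDA build is {torch.version.cuda}, need 12.x")
    if not torch.cuda.is_available():
        problems.append("no CUDA device is visible")
    elif torch.cuda.get_device_capability(0) != (9, 0):
        problems.append("GPU 0 is not SM90")
    return problems


if __name__ == "__main__":
    found = check_environment()
    print("OK" if not found else "\n".join(found))
```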
Use the training framework directly → exports/train_engine_0.5B/
---
Use the training framework
Goal: take the ready-made framework and run pretraining on your H100s.
1. Install
    git clone https://github.com/OpenBMB/ForgeTrain.git
    cd ForgeTrain/exports/train_engine_0.5B
    pip install -e .
    pip install datasets transformers   # HuggingFace data path (required)
2. Verify install
    PYTHONPATH=src python -c "from training_engine_tensor import config; print('OK')"

Expected output: OK
3. Precompile operators (first run only; subsequent runs reuse the cache)
    PYTHONPATH=src CUSTOM_GEMM=1 OP_ATTENTION=v1 \
      python scripts/precompile_ops.py
Warms up AOT export + cpp_extension builds for the 5 CuTeDSL GEMMs, persisting under ${ENGINE_ROOT}/.persist_cache/. Subsequent jobs…