Repo · OpenBMB (MiniCPM) · published May 14, 2026

OpenBMB/ForgeTrain

Python


OpenBMB/ForgeTrain

Language: Python

License: Apache-2.0

Stars: 224

Forks: 21

Open issues: 2

Created: 2026-05-14T08:01:19Z

Pushed: 2026-05-26T13:15:23Z

Default branch: main

Fork: no

Archived: no

README:

ForgeTrain

An LLM Pretraining Framework Built End-to-End by an Autonomous Agent Loop + a matching Harness scaffolding *(coming soon)*

🤖 100% AI-Authored · 🚀 44.13% MFU on H100 · 📈 +10% over Megatron-LM · ✅ Production-Validated

[English](./README.md) | [中文](./README_zh.md)

---

> **An LLM pretraining framework written end-to-end by an AI Agent Loop with zero human edits, plus the Harness that produced the pretraining framework *(coming soon)*.**
>
> **Current release: v0.1.0** (NVIDIA H100 · MiniCPM4-0.5B / MiniCPM4-8B training frameworks; matching Harness *coming soon*)

---

✨ Highlights

  • 🤖 100% Agent-Loop Authored: the entire framework was produced by an AI Agent running in auto-loop mode, with zero manual edits
  • 🔄 Self-Diagnosing Agent Loop: read reference → implement → launch job → parse logs → root-cause → patch → pass gate → commit, fully autonomous
  • 🚀 44.13% MFU on H100: about 10% higher (relative) than the Megatron-LM baseline (~40%), validated on 64× H100 with BF16, DP-only
  • ✅ Production-Validated: MiniCPM4-0.5B fully pretrained, producing real model weights (not a demo)
  • 🛠️ GEMM + Attention kernels authored by the agent loop: per-op MFU up to 90%; FlashAttention written from scratch, outperforms Transformer Engine / FA3, on par with FA4
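
The Self-Diagnosing Agent Loop highlight describes a retry loop wrapped around a gate. The sketch below only illustrates that control flow. The Harness is not yet released, so every name here (`run_agent_loop`, `Attempt`, and the three callables) is hypothetical and says nothing about how the real Harness is built.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    """Outcome of one launch: raw log text and whether the gate passed."""
    log: str
    passed: bool


def run_agent_loop(
    implement: Callable[[Optional[str]], str],
    launch: Callable[[str], Attempt],
    diagnose: Callable[[str], str],
    max_iters: int = 5,
) -> Optional[str]:
    """Implement -> launch -> parse logs -> root-cause -> patch, until the gate passes.

    Returns the accepted patch (ready to commit), or None if the budget runs out.
    """
    diagnosis: Optional[str] = None
    for _ in range(max_iters):
        patch = implement(diagnosis)       # first pass implements, later passes patch
        attempt = launch(patch)            # run the job and collect its log
        if attempt.passed:                 # gate passed: a caller would commit here
            return patch
        diagnosis = diagnose(attempt.log)  # root-cause the failure for the next pass
    return None


# Toy stand-ins so the sketch runs without any agent or GPU.
def toy_implement(diagnosis: Optional[str]) -> str:
    return "v2" if diagnosis else "v1"


def toy_launch(patch: str) -> Attempt:
    ok = patch == "v2"
    return Attempt(log="" if ok else "NaN loss at step 3", passed=ok)


def toy_diagnose(log: str) -> str:
    return f"root cause: {log}"


result = run_agent_loop(toy_implement, toy_launch, toy_diagnose)
print(result)
```

In this toy run the first attempt fails the gate, the diagnosis feeds the second attempt, and the second attempt is accepted.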

---

🗺️ Roadmap

  • Reproduction live demo
  • Huawei MiniCPM5-1B training framework
  • Training framework self-generates the Harness scaffolding

---

Feature Comparison

| Feature | ForgeTrain | Megatron-LM |
|---------|:-:|:-:|
| MFU on H100 (MiniCPM4-0.5B, BF16, DP) | 44.13% | ~40% |
| 100% AI-Authored Code | ✅ | ❌ |
| CuTeDSL custom GEMMs (AOT C-export) | ✅ (5 GEMMs) | ❌ |
| Custom FlashAttention (on par with FA4) | ✅ (self-built CuTeDSL impl) | ❌ (uses upstream TE / FA) |
| Checkpoint → HuggingFace export | ✅ (one script) | Manual |

Also supports CUDA Graph, Triton fused kernels, and comm-compute overlap out of the box.

> Comparison based on Megatron-LM v0.15 on the same hardware (H100, SM90). ForgeTrain v1 is scoped to MiniCPM4-0.5B (DP-only) and MiniCPM4-8B (TP=2) × BF16; Megatron-LM supports broader model families and parallelism strategies.
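
To make the MFU figures concrete, here is a minimal sketch of the arithmetic. It assumes the common estimate of roughly 6 × parameters FLOPs per trained token (which ignores attention FLOPs) and a dense BF16 peak of 989 TFLOP/s per H100 SXM GPU. This excerpt does not say how ForgeTrain computes its reported MFU, so both are assumptions, and the throughput value in the example is invented purely to exercise the formula.

```python
H100_BF16_PEAK_FLOPS = 989e12  # assumed dense BF16 peak per H100 SXM GPU


def mfu(params: float, tokens_per_sec_per_gpu: float,
        peak_flops: float = H100_BF16_PEAK_FLOPS) -> float:
    """Model FLOPs utilization using the ~6 * params FLOPs-per-token estimate."""
    achieved = 6.0 * params * tokens_per_sec_per_gpu
    return achieved / peak_flops


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return new / baseline - 1.0


# Headline numbers from the comparison: 44.13% vs a ~40% baseline.
gain = relative_gain(0.4413, 0.40)
print(f"relative gain: {gain:.1%}")

# Hypothetical throughput and a nominal 0.5B parameter count, for illustration only.
example = mfu(params=0.5e9, tokens_per_sec_per_gpu=145_000)
print(f"example MFU: {example:.2%}")
```

The "+10%" claim is therefore a relative gain; in absolute terms the gap is about 4 percentage points of MFU.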

---

📢 News

  • 📌 [2026-05] ForgeTrain v0.1.0 released — first public release of the training engine; the Harness that produced it is *coming soon*. MiniCPM4-0.5B pretrained on 64× H100, achieving 44.13% MFU.

---

Table of Contents

  • [Highlights](#-highlights)
  • [Roadmap](#-roadmap)
  • [Feature Comparison](#feature-comparison)
  • [News](#-news)
  • [Agent-Friendly Quick Deploy](#-agent-friendly-quick-deploy)
  • [Repository Layout](#repository-layout)
  • [Quick Start](#quick-start)
  • [Core Technology](#core-technology)
  • [Performance](#performance)
  • [Contributing](#contributing)
  • [License](#license)
  • [Acknowledgments](#acknowledgments)
  • [Citation](#citation)

---

🤖 Agent-Friendly Quick Deploy

> This repo was produced by an AI Agent and is friendliest to AI Agents. Paste the prompt below into Cursor / Claude Code / Codex / Cline — it will read the README, install dependencies, run the smoke test and report the MFU, without you typing commands one at a time.

🟢 5-step minimal pretraining demo (paste into your Coding Agent)

Following this project's exports/train_engine_0.5B/README.md,
run a 5-step minimal pretraining demo on the current node:

1. Check the environment (Python ≥ 3.11, CUDA ≥ 12.x, H100, PyTorch ≥ 2.4)
and install anything missing;
2. Install the repo: pip install -e . and HF deps: pip install datasets transformers;
3. Import smoke test:
PYTHONPATH=src python -c "from training_engine_tensor import config; print('OK')"
4. Run 5 steps on HF GSM8K:
torchrun --standalone --nproc-per-node=1 \
-m training_engine_tensor pretrain \
--num-steps 5 --global-batch-size 1 --micro-batch-size 1 \
--seq-length 4096 \
--hf-dataset openai/gsm8k --hf-dataset-config main \
--hf-text-template "Question: {question}\nAnswer: {answer}" \
--tokenizer-path openbmb/MiniCPM4-0.5B \
--save-dir ./checkpoints/demo
5. Print the final loss, step time, and MFU.

If anything fails, dig into the source on your own — do not ask me.

> Full single-node 8× H100 and multi-node commands are in the [Quick Start](#quick-start) section below.
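
The `--hf-text-template` value in step 4 looks like a Python-style format string whose `{question}` and `{answer}` placeholders match GSM8K's column names. The sketch below shows how such a template could be rendered with `str.format`. Whether ForgeTrain renders it this way internally, and how the CLI treats the `\n` escape, are assumptions not confirmed by this excerpt. The `render` helper and the record are invented for illustration.

```python
TEMPLATE = "Question: {question}\nAnswer: {answer}"


def render(template: str, record: dict) -> str:
    """Fill `{field}` placeholders from one dataset record."""
    return template.format(**record)


# Invented record shaped like GSM8K's `question` / `answer` columns.
record = {"question": "What is 2 + 3?", "answer": "2 + 3 = 5\n#### 5"}
text = render(TEMPLATE, record)
print(text)
```

A record missing one of the named fields would raise `KeyError` under this approach, which is a useful early signal that the template and dataset config do not match.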

---

Repository Layout

This repo bundles a family of subprojects in a strict producer / product relationship:

| Subdirectory | Role |
|---|---|
| harness/ *(coming soon)* | Harness: the scaffolding that drives an Agent Loop to autonomously build a training framework |
| exports/train_engine_0.5B/ | TrainingEngine (0.5B): produced end-to-end by harness/ *(coming soon)*; targets MiniCPM4-0.5B at 44.13% MFU on 8× H100 |
| exports/train_engine_8b/ | TrainingEngine (8B): also produced by harness/ *(coming soon)*; targets MiniCPM4-8B with TP=2 / DP=4 at 50.9% MFU on a single 8× H100 host |

harness/ ──(bash agent-loop.sh, zero human input)──▶ exports/train_engine_0.5B/
Harness (coming soon)                                exports/train_engine_8b/
producer (gates + prompts + control plane)           product (a runnable training framework)

Each subdirectory has its own README with full CLI docs, config reference, layout, performance baselines, and limitations.

---

Quick Start

> Environment: Python ≥ 3.11 · CUDA 12.x · PyTorch ≥ 2.4 · NVIDIA H100 80GB (SM90). Full pretraining requires 8× H100; early alignment stages run on a single GPU.
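
The version floors above can be checked before installing anything. This is a minimal sketch using the standard library plus an optional PyTorch import. The helper names `version_tuple` and `check_environment` are hypothetical and not part of ForgeTrain.

```python
import re
import sys


def version_tuple(version: str, parts: int = 2) -> tuple:
    """Parse the leading numeric fields, e.g. '2.4.1+cu121' -> (2, 4)."""
    fields = []
    for piece in version.split(".")[:parts]:
        match = re.match(r"\d+", piece)
        fields.append(int(match.group()) if match else 0)
    return tuple(fields)


def check_environment() -> dict:
    """Return a pass/fail report for the requirements listed above."""
    report = {"python>=3.11": sys.version_info[:2] >= (3, 11)}
    try:
        import torch  # optional: only needed for the GPU-side checks
    except ImportError:
        report["torch>=2.4"] = False
        return report
    report["torch>=2.4"] = version_tuple(torch.__version__) >= (2, 4)
    # torch.version.cuda is None on CPU-only builds.
    report["cuda 12.x"] = (torch.version.cuda or "").startswith("12.")
    report["cuda available"] = torch.cuda.is_available()
    if report["cuda available"]:
        # H100 reports compute capability (9, 0), i.e. SM90.
        report["sm90"] = torch.cuda.get_device_capability(0) == (9, 0)
    return report


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

The check only reads versions and device properties, so it is safe to run on a login node without a GPU; the GPU entries are simply absent or reported as missing there.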

Use the training framework directly → exports/train_engine_0.5B/

---

Use the training framework

Goal: take the ready-made framework and run pretraining on your H100s.

1. Install

git clone https://github.com/OpenBMB/ForgeTrain.git
cd ForgeTrain/exports/train_engine_0.5B
pip install -e .
pip install datasets transformers # HuggingFace data path (required)

2. Verify install

PYTHONPATH=src python -c "from training_engine_tensor import config; print('OK')"

Expected output: OK
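
The same smoke test can be scripted so that a failure yields a one-line reason instead of a traceback. The sketch below uses `importlib` from the standard library. The module path `training_engine_tensor.config` comes from the command above, while the `smoke_test` helper is hypothetical. As with the shell command, `src` must be on `PYTHONPATH` for the import to resolve.

```python
import importlib


def smoke_test(module_name: str) -> tuple:
    """Try to import `module_name`; return (ok, message)."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # any import-time failure counts as a failed smoke test
        return False, f"{type(exc).__name__}: {exc}"
    return True, "OK"


ok, message = smoke_test("training_engine_tensor.config")
print(message)
```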

3. Precompile operators (first run only; subsequent runs reuse the cache)

PYTHONPATH=src CUSTOM_GEMM=1 OP_ATTENTION=v1 \
python scripts/precompile_ops.py

Warms up AOT export + cpp_extension builds for the 5 CuTeDSL GEMMs, persisting under ${ENGINE_ROOT}/.persist_cache/. Subsequent jobs…
