What does this repo signal mean?

NVIDIA published NVIDIA/maxToki (Python). Repository signals like this one can expose tooling, eval, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo NVIDIA/maxToki · language Python · an autoregressive single-cell language model built on BioNeMo/NeMo/Megatron. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

NVIDIA Repo: NVIDIA/maxToki

Captured source

GitHub: github.com/NVIDIA/maxToki

NVIDIA/maxToki repository metadata

published Mar 30, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

NVIDIA/maxToki

Description: MaxToki: an autoregressive single-cell language model built on BioNeMo/NeMo/Megatron.

Language: Python

License: NOASSERTION

Stars: 28

Forks: 9

Open issues: 0

Created: 2026-03-30T23:20:25Z

Pushed: 2026-05-07T14:45:29Z

Default branch: main

Fork: no

Archived: no

README:

MaxToki

MaxToki is a temporal AI model for predicting the drivers of cell state progression over time, providing a generalizable framework to decode and control dynamic cellular trajectories. The temporal training is composed of two tasks: 1) predict past, intervening, or future cell states along a trajectory autoregressively (NextCell), and 2) predict the time elapsed between cell state observations as a regression task (TimeBetweenCells). Training uses NeMo and Megatron-LM for distributed GPU execution.

See our manuscript for details.
See the model repository on Hugging Face for the pretrained MaxToki models.

Hardware Requirements

| | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA A100 | H100 80GB |
| VRAM | 40 GB | 80 GB |
| CUDA | 12.x | 12.4+ |
| Driver | 525+ | latest |

TransformerEngine requires CUDA. The model cannot run on CPU. A single A100 or H100 is sufficient for development and fine-tuning; full-scale pretraining benefits from multiple GPUs.

Setup

All dependencies (NeMo via PyPI, Megatron-LM, TransformerEngine, Apex) are pinned in the container. Running outside the container is not supported.

Clone with submodules

Megatron-LM is a git submodule; initialize it with:

git submodule update --init --recursive

Build the image

DOCKER_BUILDKIT=1 docker build --target dev -t maxtoki-dev -f Dockerfile .

Launch the container

docker run --rm -it --gpus all \
--network host \
--shm-size=4g \
-e TMPDIR=/tmp \
-e NUMBA_CACHE_DIR=/tmp/ \
-w /workspaces/maxToki \
-v "$(pwd)":/workspaces/maxToki \
-v "$HOME/.cache":/home/ubuntu/.cache \
--user root \
maxtoki-dev \
bash -c "usermod -u $(id -u) ubuntu && groupmod -g $(id -g) ubuntu && \
su - ubuntu -c 'cd /workspaces/maxToki && \
source .devcontainer/postCreateCommand.sh && exec bash'"

This opens a bash shell with the repo mounted at /workspaces/maxToki and all bionemo/NeMo sub-packages installed in editable mode. Optionally add -v /data:/home/ubuntu/data if you have a local /data directory, and pass -e WANDB_API_KEY=... for experiment tracking.

Repository Structure

sub-packages/
    bionemo-maxtoki/    # MaxToki model, training, inference, and checkpoint conversion
        src/            # Contains the actual module and source code
        test/           # Contains the tests relevant to maxtoki
    bionemo-llm/        # Shared LLM primitives (Lightning module, LR scheduler, callbacks)
    bionemo-core/       # Core utilities from bionemo-framework
    bionemo-testing/    # General purpose test helpers
3rdparty/
    Megatron-LM/        # Pinned Megatron-LM submodule (NeMo is installed from PyPI)

Architecture

MaxToki is based on the LLaMA decoder model architecture. The first stage pretraining employs an autoregressive training objective to generate rank value encoded transcriptomes using standard cross-entropy loss. In the second stage temporal training, the context length is extended to accommodate an input of multiple single-cell transcriptomes along a cell state trajectory, and the model is trained with a mixed loss (MaxTokiLossWithReduction) objective that balances the tasks of cell state generation (cross-entropy loss) and timelapse prediction (MSE loss) using a configurable mixture ratio.
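The mixed objective described above can be illustrated with a minimal, framework-free sketch. This is not the code of MaxTokiLossWithReduction: the function names, the `mse_weight` parameter, and the convex-combination form `(1 - w) * CE + w * MSE` are assumptions made for illustration, since the excerpt only states that a configurable mixture ratio balances the two losses and that masking is per task.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy for one position, computed from raw logits."""
    peak = max(logits)  # subtract the max before exp() for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def masked_mean(values: Sequence[float], mask: Sequence[int]) -> float:
    """Mean of the values whose mask entry is 1; 0.0 if nothing is selected."""
    selected = [v for v, m in zip(values, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


def mixed_loss(token_logits, token_targets, token_mask,
               time_preds, time_targets, time_mask,
               mse_weight: float = 0.5) -> float:
    """Hypothetical CE + MSE mixture; mse_weight stands in for the mixture ratio."""
    ce_terms = [cross_entropy(l, t) for l, t in zip(token_logits, token_targets)]
    mse_terms = [(p - y) ** 2 for p, y in zip(time_preds, time_targets)]
    ce = masked_mean(ce_terms, token_mask)   # cell state generation positions
    mse = masked_mean(mse_terms, time_mask)  # timelapse regression positions
    return (1.0 - mse_weight) * ce + mse_weight * mse
```

In this sketch each task has its own mask, so a batch with no timelapse targets contributes only the cross-entropy term, scaled by `1 - mse_weight`.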

Key classes:

| Class | File | Description |
|---|---|---|
| MaxTokiBaseConfig | api.py | Base config extending NeMo's Llama32Config1B |
| MaxTokiConfig | model.py | Pretraining config; selects the loss and model classes and exposes a variety of transformer-related parameters |
| MaxTokiMultitaskFineTuneConfig | model.py | Temporal training config; attaches the regression head and a custom multitask loss |
| MaxTokiLossWithReduction | model.py | Mixed CE + MSE loss with per-task masking |
| MaxTokiTokenizer | tokenizer.py | Wraps the gene token dictionary; handles special tokens and loss mask generation. The tokenizer is generally a pass-through: rank value encoded token IDs are expected as inputs. |
| MaxTokiDataModule | datamodule.py | Lightning DataModule for single-cell datasets that may include time tokens. |
| FinetuneLlamaModel | model.py | MCoreGPTModel subclass with the regression head attached |

Data Preparation

Raw counts from single-cell RNAseq data (.h5ad files) must be processed before training. The pipeline has three stages, all in bionemo.maxtoki.data_prep:

Stage 1: Tokenize — converts .h5ad files into rank value encoding token sequences.

python -m bionemo.maxtoki.data_prep tokenize \
--data-directory /path/to/h5ad_files \
--output-directory /path/to/output \
--output-prefix my_dataset \
--nproc 8

--token-dictionary-file and --gene-median-file are required. --gene-mapping-file (Ensembl mapping) is optional and only needed when mapping gene names to Ensembl IDs.
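To make the role of the token dictionary and gene median files concrete, here is a rough sketch of one common formulation of rank value encoding: normalize each gene's count by the cell total and by a per-gene median, then emit token IDs in descending order of the normalized value. The function name, the dictionary-based inputs, and the tie-breaking rule are assumptions for illustration; the excerpt does not show how bionemo.maxtoki.data_prep implements this step.

```python
from typing import Dict, List, Optional


def rank_value_encode(counts: Dict[str, float],
                      gene_medians: Dict[str, float],
                      token_dictionary: Dict[str, int],
                      max_len: Optional[int] = None) -> List[int]:
    """Hypothetical rank value encoding of one cell's raw counts."""
    total = sum(counts.values())
    scored = []
    for gene, count in counts.items():
        # Skip undetected genes and genes missing from either lookup table.
        if count <= 0 or gene not in token_dictionary or gene not in gene_medians:
            continue
        median = gene_medians[gene]
        if median <= 0:
            continue
        # Normalize by cell depth, then by the gene's median expression.
        scored.append((count / total / median, gene))
    # Highest normalized value first; gene name breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    tokens = [token_dictionary[gene] for _, gene in scored]
    return tokens if max_len is None else tokens[:max_len]
```

With counts `{"A": 10, "B": 5}` and medians `{"A": 10.0, "B": 1.0}`, gene B ranks ahead of gene A despite its lower raw count, because its count is high relative to its median.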

Stage 2: Assemble cell paragraphs — groups cells from the same trajectory into training sequences that include time-lapse tokens.

python -m bionemo.maxtoki.data_prep assemble-paragraphs \
--data-directory /path/to/tokenized.dataset \
--output-directory /path/to/output \
--output-prefix my_paragraphs \
--max-timepoint 730 \
--time-group-columns donor_id timepoint \
--num-examples 10000000

| Argument | Default | Description |
|---|---|---|
| --max-timepoint | required | Maximum time value; sets the numeric range for time tokens. |
| --time-group-columns | none | Column names used to group cells into trajectories. |
| --min-timepoints | 3 | Minimum observations per paragraph. |
| --max-timepoints | 4 | Maximum observations per paragraph. |
| --task-ratio | 0.5 | Fraction of samples used for timelapse vs next-cell tasks. |
| --model-input-size | 16384 | Sequences longer than this are truncated. |
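The defaults in the table above interact roughly as in the following sketch. It is an illustration only, not the assemble-paragraphs implementation: the function names are hypothetical, and it assumes that `task_ratio` is the share of timelapse samples and that truncation keeps the leading tokens, neither of which the excerpt confirms.

```python
import random
from typing import List, Optional, Tuple


def sample_paragraph_plan(rng: random.Random,
                          n_available: int,
                          min_timepoints: int = 3,
                          max_timepoints: int = 4,
                          task_ratio: float = 0.5) -> Optional[Tuple[int, str]]:
    """Pick how many observations a paragraph uses and which task it trains."""
    if n_available < min_timepoints:
        return None  # trajectory too short to form a paragraph
    n_obs = rng.randint(min_timepoints, min(max_timepoints, n_available))
    # Assumption: task_ratio is the probability of a timelapse sample.
    task = "timelapse" if rng.random() < task_ratio else "next_cell"
    return n_obs, task


def truncate(tokens: List[int], model_input_size: int = 16384) -> List[int]:
    """Assumption: over-long sequences keep their first model_input_size tokens."""
    return tokens[:model_input_size]


# Example: a trajectory with 5 observed timepoints.
plan = sample_paragraph_plan(random.Random(0), n_available=5)
```

Passing a seeded `random.Random` keeps the sampling reproducible, which is convenient when comparing dataset builds across runs.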

Stage 3: Assemble queries — builds evaluation query datasets from cell paragraphs.

python -m bionemo.maxtoki.data_prep assemble-queries \
--blueprint-dictionary-file /path/to/blueprint.pkl \
--time-token-dictionary-file /path/to/time_dictionary.pkl \
--cell-paragraph-dataset-file /path/to/paragraphs.dataset \...

Excerpt shown — open the source for the full document.

Notability

notability 1.0/10

Very low stars, minimal traction