nvidia/NV-KERMT-70M-v2
> Source code, training scripts, and inference utilities for this model:
> [github.com/NVIDIA-BioNeMo/KERMT](https://github.com/NVIDIA-BioNeMo/KERMT)
> (v2.0 branch / v2.0.0 release tag)
Model Overview
Description:
Contrastive KERMT (Kinetic GROVER Multi-Task) is a graph-transformer foundation model pretrained to learn chemically meaningful molecular representations for downstream ADMET (absorption, distribution, metabolism, excretion, toxicity) property prediction in drug discovery. The model encodes a 2D molecular graph into a latent representation under a single joint probabilistic objective that combines SMILES reconstruction, in-batch contrastive discrimination, and chemistry-specific self-supervision (atom-context, bond-context, and functional group prediction), all formulated as unit-weighted log-probability factors. The released checkpoint was pretrained for 100 epochs on a corpus combining an 11M-molecule ZINC15+ChEMBL base pool (following the pretraining-data protocol of Rong et al. 2020) with Biogen ADMET, ExpansionRX, and ChEMBL-MT (~125K additional molecules), and is intended as a starting point for downstream multi-task ADMET fine-tuning. Contrastive KERMT was developed by NVIDIA as part of the KERMT v2.0 release. This model is ready for commercial or non-commercial use.
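As a schematic reading of the "unit-weighted log-probability factors" wording above, the joint pretraining objective can be written as a plain sum of five log-probability terms, each with coefficient 1. The subscript names are illustrative labels introduced here, and the conditioning variables of each factor are deliberately left out because this card does not specify them.

```latex
\log p_{\text{joint}}
  = \underbrace{\log p_{\text{recon}}}_{\text{SMILES reconstruction}}
  + \underbrace{\log p_{\text{contrast}}}_{\text{in-batch contrastive discrimination}}
  + \underbrace{\log p_{\text{atom}}}_{\text{atom-context prediction}}
  + \underbrace{\log p_{\text{bond}}}_{\text{bond-context prediction}}
  + \underbrace{\log p_{\text{fg}}}_{\text{functional group prediction}}
```

Under this reading no per-term loss weights need tuning, since every factor enters the sum with weight 1.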
License/Terms of Use:
Copyright © 2026, NVIDIA Corporation. All rights reserved.
The source code is made available under Apache License, Version 2.0. See LICENSE in the source repository at https://github.com/NVIDIA-BioNeMo/KERMT.
The model weights are made available under the NVIDIA Open Model License.
Deployment Geography:
Global
Use Case:
Computational chemistry and machine-learning researchers in drug discovery — particularly those working on ADMET / Drug Metabolism and Pharmacokinetics (DMPK) prediction — who need a pretrained molecular graph encoder that can be fine-tuned on multi-endpoint ADMET datasets, used as a feature extractor for property-prediction pipelines, or studied as a baseline in molecular-representation-learning research. The released checkpoint is a pretrained backbone; users are expected to fine-tune it on their own labeled datasets for specific ADMET endpoints before using predictions in downstream workflows.
Release Date:
NGC 06/10/2026 via https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/kermt-contrastive
Hugging Face 06/10/2026 via https://huggingface.co/nvidia/NV-KERMT-70M-v2
Reference(s):
- Adrian, M., Chung, Y., Boyd, K., Paliwal, S., Veccham, S.P., Cheng, A.C. *Multitask finetuning and acceleration of chemical pretrained models for small molecule drug property prediction.* arXiv:2510.12719 (2025). https://arxiv.org/abs/2510.12719 — KERMT (the v1 baseline this work extends).
- Rong, Y. et al. *Self-Supervised Graph Transformer on Large-Scale Molecular Data.* NeurIPS 33, 12559–12571 (2020). https://papers.nips.cc/paper/2020/hash/3fe230348e9a12c13120749e3f9fa4cd-Abstract.html — GROVER, the underlying graph-transformer architecture.
- Sterling, T., Irwin, J. J. *ZINC 15 – Ligand Discovery for Everyone.* J. Chem. Inf. Model. 55(11), 2324–2337 (2015). DOI: 10.1021/acs.jcim.5b00559 — ZINC15 base corpus.
- Mendez, D. et al. *ChEMBL: towards direct deposition of bioassay data.* Nucleic Acids Research 47(D1), D930–D940 (2019). — ChEMBL base corpus.
- Fang, C., Wang, Y., Grater, R. et al. *Prospective Validation of Machine Learning Algorithms for ADMET Prediction.* J. Chem. Inf. Model. 63(11), 3263–3274 (2023). — Biogen ADMET dataset (in-domain augmentation + finetune benchmark).
- Contrastive KERMT manuscript (in preparation; arXiv URL to be added on publication).
Model Architecture:
Architecture Type: Transformer (graph-transformer with local message passing + global self-attention)
Network Architecture: KERMT graph-transformer encoder (extension of GROVER) with a probabilistic latent head, an in-batch contrastive auxiliary variable, a SMILES-reconstruction transformer decoder, and chemistry-specific vocabulary prediction heads. Encoder: hidden size 800, 6 message-passing-plus-attention layers, 4 attention heads per layer, 1 multi-task (MT) block, PReLU activation, dropout 0.1. Decoder: 3 transformer layers, 8 attention heads, 512 hidden / latent dimension, FFN hidden 2048, rotary positional encoding (RoPE). Latent dimension 512.
This model was developed based on KERMT (Adrian et al. 2025, arXiv:2510.12719), in turn based on GROVER (Rong et al. 2020).
Number of model parameters: 7.06 × 10^7
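The hyperparameters listed above can be collected into a small configuration sketch. The class and field names below are hypothetical and are not taken from the KERMT repository; only the numeric values come from this card. The per-head width assumes the hidden size is split evenly across attention heads, and the storage estimate assumes weights are held at a fixed number of bytes per parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Encoder values as listed in this card (field names are illustrative)."""
    hidden_size: int = 800
    num_layers: int = 6        # message-passing-plus-attention layers
    num_attn_heads: int = 4    # attention heads per layer
    num_mt_blocks: int = 1
    activation: str = "PReLU"
    dropout: float = 0.1


@dataclass(frozen=True)
class DecoderConfig:
    """SMILES decoder values as listed in this card (field names are illustrative)."""
    num_layers: int = 3
    num_attn_heads: int = 8
    hidden_size: int = 512     # also the latent dimension
    ffn_hidden_size: int = 2048
    positional_encoding: str = "RoPE"


NUM_PARAMETERS = 7.06e7  # parameter count stated in this card


def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head width, assuming an even split of the hidden size across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden size must be divisible by the number of heads")
    return hidden_size // num_heads


def weight_megabytes(num_params: float, bytes_per_param: int) -> float:
    """Raw weight storage in decimal megabytes (10**6 bytes), excluding any overhead."""
    return num_params * bytes_per_param / 1e6


enc, dec = EncoderConfig(), DecoderConfig()
print(head_dim(enc.hidden_size, enc.num_attn_heads))  # 200
print(head_dim(dec.hidden_size, dec.num_attn_heads))  # 64
print(weight_megabytes(NUM_PARAMETERS, 4))            # 282.4 (32-bit floats)
```

At 4 bytes per parameter the raw weights come to roughly 282 MB, and roughly half that at 2 bytes per parameter; optimizer state and activations during fine-tuning are not included in this estimate.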
Input(s):
Input Type(s): Text (SMILES string representing a 2D molecular structure)
Input Format(s): UTF-8 SMILES (Simplified Molecular Input Line Entry System)
Input Parameters: One-Dimensional (1D) text
Other Properties Related to Input: The input is a canonical SMILES string parseable by RDKit (an open-source cheminformatics toolkit); molecules are internally featurized into 2D atom-and-bond graphs prior to encoding. Recommended maximum sequence length for the SMILES decoder is 512 tokens (the value used at pretraining time); molecules whose canonical SMILES exceed this length should be truncated or omitted. Inputs are not text in the natural-language sense and are not subject to natural-language preprocessing (no tokenization in the human-language sense; characters are mapped via a chemistry-specific tokenizer matching the bundled SMILES vocabulary).
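The length recommendation above can be checked before inference with a sketch like the following. It uses a regular expression that is commonly used for SMILES tokenization as a stand-in; the tokenizer bundled with the model may split strings differently, so the counts are an approximation. The function names are hypothetical. Canonicalization with RDKit is assumed to have happened upstream and is not shown, which keeps the sketch free of dependencies. Over-length molecules are set aside here because cutting a SMILES string mid-way can leave rings or branches unclosed.

```python
import re

# Stand-in tokenizer: a regex commonly used for SMILES. This is an assumption;
# the model's bundled SMILES vocabulary may tokenize differently.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

MAX_DECODER_TOKENS = 512  # recommended maximum stated in this card


def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def split_by_length(smiles_list, max_tokens: int = MAX_DECODER_TOKENS):
    """Return (kept, dropped) lists, dropping strings longer than max_tokens."""
    kept, dropped = [], []
    for smiles in smiles_list:
        if len(tokenize_smiles(smiles)) <= max_tokens:
            kept.append(smiles)
        else:
            dropped.append(smiles)
    return kept, dropped


kept, dropped = split_by_length(["CC(=O)Oc1ccccc1C(=O)O", "C" * 600])
print(len(kept), len(dropped))  # 1 1
```

Multi-character atoms such as `Cl` and `Br` and bracketed atoms such as `[Na+]` each count as one token under this pattern, so token counts can be smaller than character counts.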
Output(s):
Output Type(s): Numerical tensors (molecular embeddings) and, when downstream task-specific heads are present, scalar ADMET property predictions. Optionally, generated SMILES strings via the pretraining-time SMILES decoder.
Output Format(s):
- Molecular embeddings: float tensors of shape (batch_size, hidden_size=800) for atom-level and bond-level readouts; (batch_size, latent_dim=512) for the cMIM projected latent.
- Property predictions (after fine-tuning): float tensors of shape (batch_size, num_endpoints); values are continuous regression outputs, one per ADMET endpoint.
- Generated SMILES (pretrain-time decoder only): UTF-8 SMILES string.
Output Parameters: One-Dimensional (1D) embedding / prediction vectors.
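The shape contract between embeddings and property predictions can be illustrated with a minimal, dependency-free sketch. The linear head below is an untrained stand-in with hypothetical function names; the fine-tuning heads in the KERMT repository may be structured differently. It only shows how a (batch_size, 512) latent maps to (batch_size, num_endpoints) regression outputs.

```python
import random

LATENT_DIM = 512  # projected latent width stated in this card


def make_linear_head(in_dim: int, num_endpoints: int, seed: int = 0):
    """Create random weights of shape (num_endpoints, in_dim) and a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
               for _ in range(num_endpoints)]
    bias = [0.0] * num_endpoints
    return weights, bias


def apply_head(embeddings, weights, bias):
    """Map (batch_size, in_dim) embeddings to (batch_size, num_endpoints) outputs."""
    in_dim = len(weights[0])
    predictions = []
    for row in embeddings:
        if len(row) != in_dim:
            raise ValueError(f"expected embedding width {in_dim}, got {len(row)}")
        predictions.append([
            sum(w * x for w, x in zip(w_row, row)) + b
            for w_row, b in zip(weights, bias)
        ])
    return predictions


# Two placeholder embeddings standing in for encoder outputs.
batch = [[0.0] * LATENT_DIM, [1.0] * LATENT_DIM]
weights, bias = make_linear_head(LATENT_DIM, num_endpoints=3)
preds = apply_head(batch, weights, bias)
print(len(preds), len(preds[0]))  # 2 3
```

In practice the head would be trained on labeled ADMET data, typically with a tensor library rather than nested lists; the nested-list form is used here only to make the shapes explicit.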
Other Properties Related to Output: Embeddings are intended as inputs to downstream property-prediction heads, similarity...