
microsoft/Dayhoff-170M-GRS-SS-50000

Published: Apr 3, 2026; task: text-generation; license: MIT; library: transformers; parameters: 170M; downloads: 55; likes: 0

Model Card for Dayhoff

Dayhoff is an Atlas of both protein sequence data and generative language models — a centralized resource that brings together 3.34 billion protein sequences across 1.7 billion clusters of metagenomic and natural protein sequences (GigaRef), 46 million structure-derived synthetic sequences (BackboneRef), and 16 million multiple sequence alignments (OpenProteinSet). These models can natively predict zero-shot mutation effects on fitness, scaffold structural motifs by conditioning on evolutionary or structural context, and perform guided generation of novel proteins within specified families. Learning from metagenomic and structure-based synthetic data from the Dayhoff Atlas increased the cellular expression rates of generated proteins, highlighting the real-world value of expanding the scale, diversity, and novelty of protein sequence data.

The Dayhoff architecture is a hybrid of state-space Mamba layers and Transformer self-attention, interleaved with Mixture-of-Experts modules to maximize capacity while preserving efficiency. It natively handles long contexts, allowing both single sequences and unrolled MSAs to be modeled. Trained with an autoregressive objective in both N→C and C→N directions, Dayhoff supports order-agnostic infilling and scales to billions of parameters.
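As a rough illustration of what training in both directions means, the sketch below builds the two token orderings for one protein and lists the next-token prediction targets for each. The direction markers `<n2c>` and `<c2n>` are hypothetical placeholders chosen for this example; the special tokens Dayhoff actually uses are not stated here.

```python
# Hypothetical direction markers (assumption: not Dayhoff's real special tokens).
N_TO_C = "<n2c>"
C_TO_N = "<c2n>"


def directional_views(sequence: str) -> dict[str, list[str]]:
    """Return the token list for each reading direction of one protein."""
    residues = list(sequence)
    return {
        "n_to_c": [N_TO_C, *residues],            # N-terminus first
        "c_to_n": [C_TO_N, *reversed(residues)],  # C-terminus first
    }


def next_token_pairs(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Pair every prefix with the token an autoregressive model predicts next."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


views = directional_views("MKV")
for name, tokens in views.items():
    print(name, next_token_pairs(tokens))
```

Seeing both orderings lets a model condition on either a prefix or a suffix of a sequence, which is one way to read the card's claim about order-agnostic infilling.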

Model Details

Model Description

  • Developed by: Kevin K. Yang, Sarah Alamdari, Alex J. Lee, Kaeli Kaymak-Loveless, Samir Char, Garyk Brixi, Carles Domingo-Enrich, Chentong Wang, Suyue Lyu, Nicolo Fusi, Neil Tenenholtz, Ava P. Amini
  • Model type: Hybrid state-space-model transformer architecture with mixture-of-experts
  • License: MIT

Model Sources

  • Repository: https://github.com/microsoft/dayhoff

Uses

Downstream Use

Dayhoff is intended for broad research use on protein language modeling. The model has been used and assessed on the following capabilities:

1. Unconditional design of protein sequences
2. Zero-shot mutation effect prediction on ProteinGym
3. Designing scaffolds for structural motifs in sequence space on RFDiffusion and MotifBench
4. Homolog conditioning with Dayhoff-3b-GR-HM and Dayhoff-3b-GR-HM-c
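Zero-shot mutation effect prediction with an autoregressive model is commonly done by comparing the log-likelihood of a mutant sequence with that of the wild type. The sketch below shows that scoring rule in plain Python on toy logits. It is an assumption that Dayhoff's evaluation follows this recipe; the function names and the three-letter vocabulary are made up for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_likelihood(step_logits, token_ids):
    """Sum of log p(token_t | earlier tokens).

    step_logits[t] holds the logits the model produced for position t,
    and token_ids[t] is the token observed at that position.
    """
    if len(step_logits) != len(token_ids):
        raise ValueError("need one logit vector per token")
    return sum(log_softmax(row)[tok] for row, tok in zip(step_logits, token_ids))


def mutation_score(wild_type_ll, mutant_ll):
    """Log-likelihood ratio: positive means the mutant is scored as more likely."""
    return mutant_ll - wild_type_ll


# Toy 3-letter vocabulary: 0="A", 1="G", 2="V".
# Simplification: the same logits are reused for both sequences. A real
# autoregressive model would be run once per sequence, because the logits
# after the mutated position depend on the mutated residue.
toy_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.4, 2.2]]
wild_type = [0, 1, 2]  # "AGV"
mutant = [0, 0, 2]     # "AAV", a G-to-A substitution at position 2

score = mutation_score(
    sequence_log_likelihood(toy_logits, wild_type),
    sequence_log_likelihood(toy_logits, mutant),
)
print(f"mutation score: {score:.3f}")
```

With a Hugging Face causal language model, the per-position logits would typically come from a forward pass, where the logits at position t are used to predict the token at position t+1.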

Bias, Risks, and Limitations

This model should not be used to generate anything that is not a protein sequence or a set of homologous protein sequences. It is not meant for natural language or other biological sequences, such as DNA sequences. Not all sequences are guaranteed to be realistic. It remains difficult to generate high-quality sequences with no sequence homology to any natural sequence.

How to Get Started with the Model

The simplest way to use these models and datasets is via the HuggingFace interface. You will need PyTorch, mamba-ssm, causal-conv1d, and flash-attn.

Requirements:

  • PyTorch: 2.7.1
  • CUDA: 12.8 or later

We recommend using uv and creating a clean environment.

uv venv dayhoff
source dayhoff/bin/activate

In that new environment, install PyTorch 2.7.1.

uv pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128

Now, we need to install mamba-ssm, flash-attn, causal-conv1d, and their prerequisites.

uv pip install wheel packaging
uv pip install --no-build-isolation flash-attn causal-conv1d mamba-ssm

To import from HuggingFace, you will need to install these versions:

uv pip install datasets==3.2.0 #for HF datasets
uv pip install transformers==4.51.3
uv pip install huggingface_hub~=0.34.4
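After installation, it can help to confirm that the environment matches the versions listed above. The following is a minimal sketch using only the standard library; the helper name `check_environment` and the report format are illustrative, not part of the Dayhoff tooling. `huggingface_hub` is only checked for presence here because the install command allows a range of compatible versions.

```python
from importlib import metadata

# Exact versions taken from the install commands above.
PINNED = {"torch": "2.7.1", "datasets": "3.2.0", "transformers": "4.51.3"}
# Packages that only need to be present.
UNPINNED = ["flash-attn", "causal-conv1d", "mamba-ssm", "huggingface_hub"]


def check_environment(pinned=PINNED, unpinned=UNPINNED):
    """Return a {package: status} report for the current Python environment."""
    report = {}
    for name in [*pinned, *unpinned]:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        expected = pinned.get(name)
        # Ignore local build tags such as "+cu128" when comparing versions.
        if expected is not None and installed.split("+")[0] != expected:
            report[name] = f"installed {installed}, expected {expected}"
        else:
            report[name] = f"ok ({installed})"
    return report


for package, status in check_environment().items():
    print(f"{package}: {status}")
```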

Sample protein generation code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

# Fix the random seed so sampling is reproducible, and create tensors on the GPU by default.
set_seed(0)
torch.set_default_device("cuda")

model_id = "microsoft/Dayhoff-170M-GRS-SS-50000"
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prompt with the beginning-of-sequence token only (unconditional generation).
inputs = tokenizer(tokenizer.bos_token, return_tensors="pt", return_token_type_ids=False)

# Sample up to 50 tokens in total, prompt included.
outputs = model.generate(inputs["input_ids"], max_length=50, do_sample=True)
sequence = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(sequence)

For detailed instructions on package usage, please refer to the README in the model repository.
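Since not every generated sequence is guaranteed to be realistic, a light post-processing step can be useful before downstream analysis. The sketch below keeps only sequences made of the 20 canonical amino acids and formats them as FASTA. The helper names are hypothetical, the `decoded` list stands in for the `sequence` output of the sample code, and stripping spaces is an assumption in case the tokenizer separates tokens with whitespace.

```python
CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_canonical(sequence: str) -> bool:
    """True for non-empty sequences that use only the 20 canonical residues."""
    return bool(sequence) and set(sequence) <= CANONICAL_AMINO_ACIDS


def to_fasta(sequences, prefix="generated", width=60):
    """Format sequences as FASTA records, wrapping residues at `width` columns."""
    records = []
    for index, seq in enumerate(sequences):
        lines = [seq[start:start + width] for start in range(0, len(seq), width)]
        records.append(f">{prefix}_{index}\n" + "\n".join(lines))
    return "".join(record + "\n" for record in records)


# Stand-in for the decoded model output (assumption: one string per generation).
decoded = ["MKVLAAGIVG", "MK XB", ""]
cleaned = [s.replace(" ", "") for s in decoded]
kept = [s for s in cleaned if is_canonical(s)]
print(to_fasta(kept), end="")
```

This filter only checks the alphabet. It says nothing about whether a sequence folds or expresses, which is what metrics such as pLDDT are meant to estimate.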

Evaluation

Results

See the preprint for the latest benchmark results and evaluations.

Model perplexity on held-out test sequences for Dayhoff models.

| Model | UniRef50 | GigaRef | Aligned homologs | Unaligned homologs |
|------------------|---------:|--------:|-----------------:|-------------------:|
| 170m-UR50 | 11.62 | 11.88 | | |
| 170m-UR90 | 11.52 | 11.85 | | |
| 170m-GR | 13.67 | 9.36 | | |
| 170m-UR50-BRn | 11.78 | 12.03 | | |
| 170m-UR50-BRq | 11.67 | 11.91 | | |
| 170m-UR50-BRu | 11.66 | 11.87 | | |
| 3b-UR90 | 8.95 | 9.64 | | |
| 3b-GR-HM | 11.95 | 6.68 | 4.34 | 4.60 |
| 3b-GR-HM-c | 10.11 | 9.21 | 3.57 | 3.56 |
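For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below computes it from a list of per-token log-probabilities. It assumes natural-log probabilities and per-token averaging; the exact evaluation protocol behind the table is described in the preprint, not here.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that guesses uniformly among the 20 canonical amino acids
# assigns probability 1/20 to every residue.
uniform_log_probs = [math.log(1 / 20)] * 100
print(round(perplexity(uniform_log_probs), 6))
```

A uniform guess over 20 amino acids gives a perplexity of 20, which offers a reference point for reading the values in the table.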

Quality of generated sequences as measured by ESMFold pLDDT and scPerplexity. Dataset statistics are for 1024 randomly-sampled sequences. Model statistics are for 1024 generations at T=1 in the N-to-C direction.

| Model or dataset | pLDDT (mean ± s.d.) | scPerplexity (mean ± s.d.) |
|-------------------------|---------------------|----------------------------|
| Natural sequences | | |
| UniRef50 | 0.653 ± 0.196 | 9.45 ± 2.89 |
| GigaRef-clusters | 0.619 ± 0.199 | 9.69 ± 2.83 |
| GigaRef-singletons | 0.561 ± 0.201 | 10.07 ± 2.88 |
| Generated sequences | | |
| 170m-UR50 | 0.421 ± 0.132 | 11.97 ± 2.14 |
| 170m-UR90 | 0.407 ± 0.125 | 12.12 ± 2.14 |
| 170m-GR | 0.422 ± 0.129 | 11.83 ± 2.12 |
| 170m-UR50-BRu | 0.441 ± 0.157 | 11.71 ± 2.18 |
| 170m-UR50-BRq | 0.434 ± 0.152 | 11.72 ± 2.24 |
| 170m-UR50-BRn | 0.432 ± 0.131 | 11.77 ± 2.24 |
| 3b-UR90 | 0.454 ± 0.150 | 11.79 ± 2.38 |
| 3b-GR-HM | 0.406 ± 0.126 | 11.50 ± 2.16 |
| 3b-GR-HM-c | 0.423 ± 0.132 | 11.91 ± 2.18 |

ProteinGym zero-shot performance: Spearman's correlation coefficient on ProteinGym substitutions and indels.

| Input | Model | Parameters | Substitutions | Indels |
|------------------------|----------------|-----------:|--------------:|-------:|
| Single sequence | 170m-UR50 | 170M | 0.353 | 0.479 |
| | 170m-UR90 | 170M | 0.354 | 0.483 |
| | 170m-GR | 170M…

Excerpt shown — open the source for the full document.
