OpenAI · published Aug 22, 2025

Accelerating life sciences research

OpenAI and Retro Biosciences achieve 50x increase in expressing stem cell reprogramming markers.

At OpenAI, we believe that AI can meaningfully accelerate life science innovation. To test this belief, we collaborated with the Applied AI team at Retro Bio, a longevity biotech startup, to create GPT‑4b micro, a miniature version of GPT‑4o specialized for protein engineering, and to study its impact.

We are excited to share that we’ve successfully used GPT‑4b micro to design novel and significantly enhanced variants of the Yamanaka factors, a set of proteins whose role in generating induced pluripotent stem cells (iPSCs) and rejuvenating cells led to a Nobel Prize. They have also been used to develop therapeutics to combat blindness, reverse diabetes, treat infertility, and address organ shortages.

In vitro, these redesigned proteins achieved more than 50-fold higher expression of stem cell reprogramming markers than wild-type controls. They also demonstrated enhanced DNA damage repair capabilities, indicating higher rejuvenation potential compared to baseline. This finding, made in early 2025, has now been validated by replication across multiple donors, cell types, and delivery methods, with confirmation of full pluripotency and genomic stability in derived iPSC lines. To ensure the findings are discoverable and replicable to benefit the life sciences industry, we are now sharing insights into the research and development of GPT‑4b micro.

Fibroblasts (day 1)

Cells reprogrammed with SOX2, KLF4, OCT4, MYC (day 10)

Cells reprogrammed with RetroSOX, RetroKLF, OCT4, MYC (day 10)

Figure 1: Left: initial conditions; center: standard proteins; right: our novel proteins. The increase in round, white cells indicates an increase in reprogramming. More specifically, these are phase-contrast images of human fibroblasts before induction (left) and 10 days after transduction with wild‑type Yamanaka factors (middle) or our re-engineered variants (right). The engineered cocktail promotes the emergence of colonies with the compact, round morphology typically seen during progression toward an iPSC‑like state.

An experimental GPT model for protein engineering

To test our belief that AI can be used to accelerate life sciences research, we designed and trained a custom model—GPT‑4b micro—to possess a broad base of knowledge and skills across biology, with a particular focus on steerability and flexibility to enable advanced use cases such as protein engineering. We initialized it from a scaled-down version of GPT‑4o to take advantage of GPT models’ existing knowledge, then further trained it on a dataset composed mostly of protein sequences, along with biological text and tokenized 3D structure data, elements most protein language models omit.

A large portion of the data was enriched to contain additional contextual information about the proteins in the form of textual descriptions, co-evolutionary homologous sequences, and groups of proteins that are known to interact. This context allows GPT‑4b micro to be prompted to generate sequences with specific desired properties and, since most of the data is structure-free, the model handles proteins with intrinsically disordered regions just as well as structured proteins. This is particularly useful for targets like the Yamanaka factors, whose activity depends on forming numerous transient interactions with a diverse array of binding partners, rather than adopting a single stable structure (Figure 2).

Figure 2: Visualization of the 3D structure of the Yamanaka factors KLF4 (left) and SOX2 (right). Notice that most of each protein is unstructured, with flexible arms that attach to other proteins. Source: AlphaFold Protein Structure Database (left, right)

By training on proteins enriched with additional evolutionary and functional context, we substantially increased the effective context length of our training examples beyond that of standalone sequences. Consequently, we found that we could run prompts as large as 64,000 tokens during inference and continue to observe gains in controllability and output quality. While common in text LLMs, this context size is unprecedented in protein sequence models.
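The source does not describe how prompts were actually assembled, so the following is only a minimal sketch of the general idea: filling a fixed token budget with contextual items (descriptions, homologs, interaction partners) around a target sequence. The `ContextItem` class, the `kind` labels, the `pack_context` helper, and the one-token-per-character count are all hypothetical stand-ins, not the GPT‑4b micro prompt format or tokenizer.

```python
from dataclasses import dataclass

# Prompt size mentioned in the text; everything else here is an assumption.
TOKEN_BUDGET = 64_000


@dataclass
class ContextItem:
    kind: str  # hypothetical label, e.g. "description", "homolog", "interactor"
    text: str


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per character."""
    return len(text)


def pack_context(target: str, items: list[ContextItem],
                 budget: int = TOKEN_BUDGET) -> list[ContextItem]:
    """Greedily keep context items, in the given order, while they fit the budget."""
    remaining = budget - count_tokens(target)
    if remaining < 0:
        raise ValueError("target sequence alone exceeds the token budget")
    chosen = []
    for item in items:
        cost = count_tokens(item.text)
        if cost <= remaining:
            chosen.append(item)
            remaining -= cost
    return chosen


# Usage with placeholder strings (not real protein sequences).
target = "M" * 317
homologs = [ContextItem("homolog", "A" * 300) for _ in range(300)]
selected = pack_context(target, homologs)
print(len(selected), "of", len(homologs), "homologs fit in the budget")
```

A greedy pass in priority order is the simplest policy; a real pipeline might instead rank items by relevance or sample them, which the source does not specify.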

During development we observed the emergence of scaling laws similar to those seen in language models: larger models trained on larger datasets yielded predictable gains in perplexity and downstream protein benchmarks, allowing us to iterate at small scale before training the final GPT‑4b micro model. However, in silico evaluations of protein AI models are often of limited value, because it is unclear whether such improvements translate to increased utility in the real world. To demonstrate that the model is capable of accelerating therapeutic development, we worked with Retro’s scientists, who used the model to re-engineer proteins relevant to their cell-reprogramming research program.
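For readers unfamiliar with the perplexity metric mentioned above, here is a minimal sketch of its standard definition: the exponential of the mean negative log-likelihood per token. The function name and inputs are illustrative assumptions; the source does not publish its evaluation code or any perplexity values.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(mean negative natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Reference point: a model that guesses uniformly among the 20 standard
# amino acids assigns log(1/20) to every residue.
uniform = [math.log(1 / 20)] * 10
print(perplexity(uniform))
```

The uniform baseline works out to a perplexity of 20 (up to floating-point rounding), so any value below 20 on amino-acid tokens means the model predicts residues better than chance; lower is better.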

AI-assisted reengineering of SOX2 and KLF4 to increase stem cell reprogramming efficiency

The Yamanaka factors, OCT4, SOX2, KLF4, and MYC (OSKM), are among the most important proteins in regenerative biology today. They are named after Shinya Yamanaka, whose discovery that they can reprogram adult cells into pluripotent stem cells earned him the Nobel Prize in Physiology or Medicine in 2012. Unfortunately, reprogramming with them is inefficient: typically less than 0.1% of cells convert during treatment, and the process can take three weeks or more. Efficiency drops further in cells from aged or diseased donors, so finding more efficient variants remains an active and important research focus.

Directly optimizing the protein sequences is hard. SOX2 contains 317 amino acids and KLF4 contains 513; the number of possible variants is on the order of 10^1000, so traditional “directed-evolution” screens that mutate a handful of residues at a time can explore only a minuscule fraction of the design space. A leading academic effort tested a few thousand SOX2 mutants and found a handful of triple mutants with a modest gain, while 15 years of work on chimeric SOX proteins has yielded variants that differ from natural SOX constituents by only five residues.
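The size of this design space can be checked with a short calculation. The sketch below assumes 20 standard amino acids and fixed-length substitutions only (no insertions or deletions); the function names are illustrative. Under those assumptions, the joint SOX2 plus KLF4 space comes to roughly 10^1080, which is one reading consistent with the "order of 10^1000" figure above.

```python
import math

AMINO_ACIDS = 20   # standard alphabet (assumption: substitutions only)
SOX2_LEN = 317     # lengths as stated in the text
KLF4_LEN = 513


def log10_sequence_space(length: int, alphabet: int = AMINO_ACIDS) -> float:
    """log10 of the number of distinct sequences of a given length."""
    return length * math.log10(alphabet)


def count_k_mutants(length: int, k: int, alphabet: int = AMINO_ACIDS) -> int:
    """Variants differing from a reference at exactly k positions."""
    return math.comb(length, k) * (alphabet - 1) ** k


print(f"SOX2 alone: ~10^{log10_sequence_space(SOX2_LEN):.0f}")             # ~10^412
print(f"KLF4 alone: ~10^{log10_sequence_space(KLF4_LEN):.0f}")             # ~10^667
print(f"Joint:      ~10^{log10_sequence_space(SOX2_LEN + KLF4_LEN):.0f}")  # ~10^1080
print(f"SOX2 triple mutants: {count_k_mutants(SOX2_LEN, 3):,}")            # 36,071,686,770
```

Even the triple mutants of SOX2 alone number about 3.6 × 10^10, so a screen of a few thousand mutants samples well under a millionth of that subset, let alone the full space.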

The team at Retro built a wet lab screening platform using human fibroblast (skin and connective tissue) cells, initially validating it with…
