Repo: Cerebras, published Mar 28, 2025


Cerebras/exome_bench

Language: Python

License: Apache-2.0

Stars: 11

Forks: 1

Open issues: 0

Created: 2025-03-28T18:27:00Z

Pushed: 2026-02-19T23:21:15Z

Default branch: main

Fork: no

Archived: no

README:

> ExomeBench consists of datasets and code. Datasets are licensed under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0). Code provided in ExomeBench is licensed under Apache 2.0.
>
> ExomeBench is a research benchmark. It is not a diagnostic tool and should not be used to make clinical decisions.

1. Project Overview

ExomeBench is a benchmark dataset for evaluating models in clinical genomics, focused on the interpretation of genetic variants in exome regions. This repo contains the code to fine-tune and evaluate models on the ExomeBench dataset using the Hugging Face Transformers library.

The ExomeBench dataset is derived from ClinVar (Nov 2024 release), a publicly accessible database maintained by the National Center for Biotechnology Information (NCBI). ClinVar provides comprehensive information on the clinical significance of genetic variants and their associations with human diseases. This dataset focuses on variants located in exome-specific regions and includes input sequences generated from the Human Reference Genome (HRG, GRCh38).

This dataset provides a valuable resource for researchers and practitioners working on genetic variant analysis and its clinical implications. Exome-specific regions are critically important because they encompass all protein-coding regions of the genome, where disease-associated variants are most likely to occur. By focusing on exome-specific regions and using sequences from the Human Reference Genome, this dataset enables robust evaluation of models on clinically significant tasks.

2. Data Curation

Data Collection

  • Source: Variants are sourced from the ClinVar database.
  • Clinical Significance: ClinVar provides detailed information on the clinical significance of each variant and its association with human diseases.

Data Filtering

  • Assertion Criteria: We include only variants with at least one submitter providing an interpretation and satisfying the assertion criteria for reliability.
  • Variant Type: Only single-nucleotide variants (SNVs) are selected.
  • Exome-Specific Regions: Only variants located in exome-specific regions (GENCODE v38) are kept.
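
The variant-type and region filters above can be sketched in plain Python. This is a minimal illustration, not the repository's pipeline (the installation section indicates that dataset generation relies on bedtools). The `Variant` record, the helper names, and the 1-based inclusive coordinate convention are all assumptions made for the example.

```python
from bisect import bisect_right
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical variant record; field names are illustrative."""
    chrom: str
    pos: int  # 1-based position (assumed convention)
    ref: str
    alt: str


BASES = set("ACGT")


def is_snv(v):
    """True when ref and alt are single, distinct nucleotides."""
    return (len(v.ref) == 1 and len(v.alt) == 1
            and v.ref in BASES and v.alt in BASES and v.ref != v.alt)


def build_index(exons):
    """Merge (chrom, start, end) intervals (1-based, inclusive) per chromosome."""
    by_chrom = {}
    for chrom, start, end in exons:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1] + 1:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def in_exon(v, index):
    """Binary-search the merged intervals for one that contains v.pos."""
    intervals = index.get(v.chrom, [])
    i = bisect_right(intervals, (v.pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= v.pos <= intervals[i][1]


def filter_variants(variants, index):
    """Keep only SNVs that fall inside an exon interval."""
    return [v for v in variants if is_snv(v) and in_exon(v, index)]
```

Merging the intervals first keeps the lookup to a single binary search per variant, which matters when many exon annotations overlap.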

Sequence Generation

  • Human Reference Genome (HRG, GRCh38): For each ClinVar variant, an input sequence is generated from the HRG.
  • Sequence Length: The sequence length is a parameter, typically set to 100 base pairs (bp).
  • Variant Positioning: The variant is centered within the sequence, which is read from the reference FASTA file.
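
The windowing step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the chromosome is passed as an in-memory string rather than read from a FASTA file, the position is 1-based, the alternate allele is substituted into the window, and for an even length the variant is placed at index `length // 2` (an even-length window has no exact center, so the actual convention may differ). The function name `variant_window` is hypothetical.

```python
def variant_window(chrom_seq, pos, ref, alt, length=100):
    """Return a `length`-bp window with the alternate allele at index length // 2.

    chrom_seq: reference sequence of one chromosome (string)
    pos:       1-based position of the SNV (assumed convention)
    """
    idx = pos - 1  # convert to 0-based
    if chrom_seq[idx].upper() != ref.upper():
        raise ValueError("reference allele does not match the genome at pos")
    left = length // 2
    start = idx - left
    end = start + length
    if start < 0 or end > len(chrom_seq):
        raise ValueError("window extends past the chromosome boundary")
    window = chrom_seq[start:end].upper()
    # Substitute the alternate allele at the variant position.
    return window[:left] + alt.upper() + window[left + 1:]
```

Checking the reference allele before substituting catches coordinate or genome-build mismatches early, which is a common source of silent errors when joining variant tables to a reference.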

Dataset Format

Each dataset entry consists of two main fields:

  • sequence (str): A DNA sequence centered around the variant.
  • label (int): Task-specific integer-encoded class index.
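
As a small illustration of this format, the sketch below types one entry and checks it. The `ExomeBenchEntry` and `validate_entry` names are hypothetical, and the checks assume that labels run from 0 to `num_classes - 1` and that sequences use the alphabet A, C, G, T, N; neither assumption is stated in the source.

```python
from typing import TypedDict


class ExomeBenchEntry(TypedDict):
    sequence: str  # DNA sequence centered on the variant
    label: int     # task-specific class index


def validate_entry(entry, num_classes, length=100):
    """Raise ValueError if the entry breaks the assumed format."""
    seq = entry["sequence"]
    if len(seq) != length:
        raise ValueError(f"expected {length} bp, got {len(seq)}")
    if set(seq.upper()) - set("ACGTN"):
        raise ValueError("sequence contains non-nucleotide characters")
    if not 0 <= entry["label"] < num_classes:
        raise ValueError("label is outside the assumed range")


# Example entry for a three-class task with an illustrative sequence.
example: ExomeBenchEntry = {"sequence": "ACGT" * 25, "label": 2}
validate_entry(example, num_classes=3)
```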

3. Tasks

ExomeBench includes five supervised tasks, each framed as a classification problem:

  • Pathogenic Variant Prediction (PV)
      Classify exome variants into four clinical significance categories: *pathogenic*, *likely pathogenic*, *likely benign*, or *benign*. Variants from the same gene are split across train/test to prevent leakage.
  • Phenotype Association
      • Cancer-Predisposing Syndrome (CPS): Determine if a variant is linked to Hereditary Cancer-Predisposing Syndrome.
      • Cardiovascular Phenotype (CP): Predict whether a variant is associated with cardiovascular conditions.
  • Gene Localization
      • BRCA Classification (BRCA): Identify whether a variant belongs to *BRCA1*, *BRCA2*, or neither.
      • Top 5 Genes Prediction (TFG): Classify a variant into one of the five most frequently represented genes in the dataset.
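
To make the classification framing concrete, here is a hedged sketch of the class counts implied by the task descriptions, plus a label function for the BRCA task. Only the `brca` key appears in the source (as a `--task_name` value); the other task keys are hypothetical, the binary framing of CPS and CP is inferred from their descriptions, and the integer assigned to each class is an assumption.

```python
# Number of classes per task, inferred from the task descriptions.
# Keys other than "brca" are hypothetical names.
NUM_CLASSES = {"pv": 4, "cps": 2, "cp": 2, "brca": 3, "tfg": 5}

# Assumed integer encoding for the BRCA task; the real encoding may differ.
BRCA_LABELS = {"BRCA1": 0, "BRCA2": 1}
NEITHER = 2


def brca_label(gene_symbol):
    """Map a gene symbol to the assumed BRCA task class index."""
    return BRCA_LABELS.get(gene_symbol.upper(), NEITHER)
```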

4. SOTA Model Performances

Please see our [experiments](experiments) folder for details on the hyperparameters and the [results](results) folder for the best model performance metrics. Below we provide the best Matthews correlation coefficient (MCC) on the test set for each task across different models.

> Note: For some models and tasks, the seed settings in the STRAND paper differed slightly from those used in this repository, which may lead to minor variations in the reported results. As a result, on a highly saturated task such as TFG, the ordering of models by MCC may differ slightly from the ordering reported in the paper.
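
For readers unfamiliar with the metric, the sketch below computes the standard multiclass generalization of MCC from true and predicted labels. It is an illustration of the formula only; the repository's own metric code may differ. Returning 0.0 when the denominator is zero follows a common convention (scikit-learn's `matthews_corrcoef` does the same). With $c$ correct predictions out of $s$ samples, $t_k$ true and $p_k$ predicted occurrences of class $k$, the value is $(c\,s - \sum_k t_k p_k) / \sqrt{(s^2 - \sum_k p_k^2)(s^2 - \sum_k t_k^2)}$.

```python
from collections import Counter
from math import sqrt


def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from two label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                    # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))    # correct predictions
    t_counts = Counter(y_true)                         # occurrences per true class
    p_counts = Counter(y_pred)                         # occurrences per predicted class
    classes = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in classes)
    var_true = s * s - sum(n * n for n in t_counts.values())
    var_pred = s * s - sum(n * n for n in p_counts.values())
    denominator = sqrt(var_true * var_pred)
    # A constant prediction (or a single true class) gives a zero denominator.
    return numerator / denominator if denominator else 0.0
```

Unlike accuracy, this score stays near zero for a model that always predicts the majority class, which is why it suits tasks with imbalanced labels.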

5. Installation & Setup

Prerequisites

Setup with Conda (Recommended)

1. Clone the Repository

    git clone https://github.com/Cerebras/exome_bench.git
    cd exome_bench

2. Create Conda Environment

    conda create -n exome_bench python=3.10
    conda activate exome_bench

3. Install bedtools (Required for Dataset Generation)

    conda install -c bioconda bedtools

4. Install Python Packages

    pip install -r requirements.txt
    pip install pybedtools==0.12.0

> Note: If you're only using the pre-generated datasets from Hugging Face, you don't need to install bedtools or pybedtools (skip step 3 and the pybedtools line in step 4). These are only required for regenerating datasets from ClinVar and reference genome files.

6. Fine-tuning and Evaluation

You can easily fine-tune and evaluate your model on the ExomeBench tasks:

Evaluate the Model

You can evaluate your model on the test split by specifying pretrained_model and task_name in eval_only mode.

    python main.py \
        --pretrained_model InstaDeepAI/nucleotide-transformer-2.5b-multi-species \
        --output_dir results \
        --mode eval_only \
        --task_name brca

Fine-Tune and Evaluate the Model

You can fine-tune the model and provide training arguments with a YAML file.

    python main.py \
        --config configs/brca.yaml

(Optional) Perform Hyper-Parameter Optimization, Fine-Tune, and Evaluate…
