Cerebras/exome_bench
Python
Language: Python
License: Apache-2.0
Stars: 11
Forks: 1
Open issues: 0
Created: 2025-03-28T18:27:00Z
Pushed: 2026-02-19T23:21:15Z
Default branch: main
Fork: no
Archived: no
README:
> ExomeBench consists of datasets and code. Datasets are licensed under Creative Commons Attribution–Non-Commercial 4.0 (CC BY NC 4.0). Code provided in ExomeBench is licensed under Apache 2.0.

> ExomeBench is a research benchmark. It is not a diagnostic tool and should not be used to make clinical decisions.
1. Project Overview
ExomeBench is a benchmark dataset designed for the evaluation of models in the field of clinical genomics, specifically focusing on the interpretation of genetic variants in exome regions. This repo contains the code to fine-tune and evaluate models on the ExomeBench dataset using the Hugging Face Transformers library.
The ExomeBench dataset is derived from ClinVar (Nov 2024 release), a publicly accessible database maintained by the National Center for Biotechnology Information (NCBI). ClinVar provides comprehensive information on the clinical significance of genetic variants and their associations with human diseases. This dataset focuses on variants located in exome-specific regions and includes input sequences generated from the Human Reference Genome (HRG, GRCh38).
The dataset is intended for researchers and practitioners working on genetic variant analysis and its clinical implications. Exome-specific regions matter because they encompass all protein-coding regions of the genome, where disease-associated variants are most likely to occur. By restricting variants to these regions and drawing sequences from the Human Reference Genome, the dataset supports model evaluation on clinically significant tasks.
2. Data Curation
Data Collection
- Source: Variants are sourced from the ClinVar database.
- Clinical Significance: ClinVar provides detailed information on the clinical significance of each variant and its association with human diseases.
Data Filtering
- Assertion Criteria: We include only variants with at least one submitter providing an interpretation and satisfying the assertion criteria for reliability.
- Variant Type: Only single-nucleotide variants (SNVs) are selected.
- Exome-Specific Regions: Only variants located in exome-specific regions (GENCODE v38) are retained.
Sequence Generation
- Human Reference Genome (HRG, GRCh38): For each variant, generate input sequences from the HRG using the variants from the ClinVar database.
- Sequence Length: The length of the sequences is a parameter, typically set to 100 base pairs (bp).
- Variant Positioning: The variant is centered within the sequence, which is read in from a FASTA file.
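The sequence-generation steps above can be sketched as follows. This is a minimal illustration, not the repository's code: `centered_window` and `apply_snv` are hypothetical helpers, the reference is passed in as an already-loaded string rather than read from a FASTA file, and the handling of windows near a contig end (shifting the window inward) is an assumption. The source does not say whether the input sequence carries the reference or the alternate allele, so substitution is shown as a separate, optional step.

```python
def centered_window(reference: str, pos: int, length: int = 100) -> tuple[str, int]:
    """Return (window, offset) for a variant at 1-based position `pos`.

    `window` is a slice of `reference` of up to `length` bases, and `offset`
    is the 0-based index of the variant base inside that window.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    idx = pos - 1  # convert the 1-based coordinate to a 0-based index
    if not 0 <= idx < len(reference):
        raise ValueError("variant position lies outside the reference")

    start = idx - length // 2
    end = start + length
    # Assumption: near a contig end, shift the window inward instead of padding.
    if start < 0:
        start, end = 0, min(length, len(reference))
    elif end > len(reference):
        end = len(reference)
        start = max(0, end - length)
    return reference[start:end], idx - start


def apply_snv(window: str, offset: int, ref: str, alt: str) -> str:
    """Substitute a single-nucleotide variant after checking the reference base."""
    if window[offset].upper() != ref.upper():
        raise ValueError("reference allele does not match the sequence")
    return window[:offset] + alt + window[offset + 1:]


# Toy 200 bp stand-in for a chromosome; real input would come from GRCh38.
reference = "ACGT" * 50
window, offset = centered_window(reference, pos=101, length=100)
mutated = apply_snv(window, offset, ref="A", alt="G")
```

Checking the reference allele before substituting is a cheap guard against off-by-one coordinate errors, which are easy to introduce when mixing 1-based variant coordinates with 0-based string indexing.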
Dataset Format
Each dataset entry consists of two main fields:
- sequence (str): A DNA sequence centered around the variant.
- label (int): Task-specific integer-encoded class index.
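As a rough illustration of this two-field format, the sketch below types an entry and checks it for basic consistency. The names `ExomeBenchEntry` and `is_valid_entry` are hypothetical, and the allowed alphabet (`A`, `C`, `G`, `T`, `N`) is an assumption rather than something the dataset documentation states.

```python
from typing import TypedDict


class ExomeBenchEntry(TypedDict):
    sequence: str  # DNA sequence centered around the variant
    label: int     # task-specific integer-encoded class index


# Assumption: sequences use the standard nucleotide alphabet plus N.
VALID_BASES = set("ACGTN")


def is_valid_entry(entry: dict, num_classes: int) -> bool:
    """Check that an entry has a plausible sequence and an in-range label."""
    seq = entry.get("sequence")
    label = entry.get("label")
    if not isinstance(seq, str) or not seq:
        return False
    if not set(seq.upper()) <= VALID_BASES:
        return False
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(label, bool) or not isinstance(label, int):
        return False
    return 0 <= label < num_classes


example: ExomeBenchEntry = {"sequence": "ACGTACGTAC", "label": 2}
ok = is_valid_entry(example, num_classes=4)
```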
3. Tasks
ExomeBench includes five supervised tasks, each framed as a classification problem:
- Pathogenic Variant Prediction (PV)
  Classify exome variants into four clinical significance categories: *pathogenic*, *likely pathogenic*, *likely benign*, or *benign*. Variants from the same gene are split across train/test to prevent leakage.
- Phenotype Association
  - Cancer-Predisposing Syndrome (CPS): Determine if a variant is linked to Hereditary Cancer-Predisposing Syndrome.
  - Cardiovascular Phenotype (CP): Predict whether a variant is associated with cardiovascular conditions.
- Gene Localization
  - BRCA Classification (BRCA): Identify whether a variant belongs to *BRCA1*, *BRCA2*, or neither.
  - Top 5 Genes Prediction (TFG): Classify a variant into one of the five most frequently represented genes in the dataset.
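To make the TFG label construction concrete, here is a minimal sketch of how "the five most frequently represented genes" could be turned into integer class indices. The helper `top_k_gene_labels`, the placeholder gene names, and the choice to drop variants outside the top five are all assumptions; the actual ExomeBench label order and filtering may differ.

```python
from collections import Counter


def top_k_gene_labels(genes: list[str], k: int = 5) -> dict[str, int]:
    """Map the k most frequent gene names to class indices 0..k-1.

    Counter.most_common orders by descending count; genes with equal
    counts keep the order in which they were first encountered.
    """
    counts = Counter(genes)
    return {gene: index for index, (gene, _) in enumerate(counts.most_common(k))}


# Placeholder gene names, one per variant (hypothetical data).
variant_genes = (
    ["GENE_A"] * 5 + ["GENE_B"] * 4 + ["GENE_C"] * 3
    + ["GENE_D"] * 2 + ["GENE_E"] * 2 + ["GENE_F"]
)
gene_to_label = top_k_gene_labels(variant_genes, k=5)

# Assumption: variants whose gene is outside the top k are left out of the task.
labels = [gene_to_label[g] for g in variant_genes if g in gene_to_label]
```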
4. SOTA Model Performances
Please see our [experiments](experiments) folder for details on the hyperparameters and the [results](results) folder for the best model performance metrics. Below we provide the best Matthews correlation coefficient (MCC) on the test set for each task across different models.
> Note: For some models and tasks, the seed settings in the STRAND paper differed slightly from those used in this repository, which may lead to minor variations in the reported results. As a result, on saturated tasks such as TFG, you might observe a small discrepancy in the ordering of models by MCC compared to the ordering reported in the paper.
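For readers unfamiliar with the metric, the sketch below computes the multiclass generalization of the Matthews correlation coefficient in plain Python. It is illustrative only: `multiclass_mcc` is a hypothetical helper, and the repository may well compute MCC through a library instead. Returning 0.0 when the denominator vanishes (for example, when every prediction is the same class) is a convention chosen here.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true: list[int], y_pred: list[int]) -> float:
    """Multiclass MCC:

        (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))

    where c = correct predictions, s = total samples, t_k = number of times
    class k truly occurs, and p_k = number of times class k is predicted.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)

    # Counter returns 0 for classes that never appear on one side.
    sum_pt = sum(p_counts[k] * t_counts[k] for k in set(t_counts) | set(p_counts))
    sum_p2 = sum(v * v for v in p_counts.values())
    sum_t2 = sum(v * v for v in t_counts.values())

    denominator = sqrt((s * s - sum_p2) * (s * s - sum_t2))
    if denominator == 0:
        return 0.0  # convention for degenerate cases
    return (c * s - sum_pt) / denominator


perfect = multiclass_mcc([0, 1, 2, 1], [0, 1, 2, 1])
partial = multiclass_mcc([1, 1, 0, 0], [1, 0, 0, 0])
```

Unlike accuracy, MCC accounts for the full confusion matrix, which makes it more informative on class-imbalanced tasks.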
5. Installation & Setup
Prerequisites
Setup with Conda (Recommended)
1. Clone the Repository
git clone https://github.com/Cerebras/exome_bench.git
cd exome_bench
2. Create Conda Environment
conda create -n exome_bench python=3.10
conda activate exome_bench
3. Install bedtools (Required for Dataset Generation)
conda install -c bioconda bedtools
4. Install Python Packages
pip install -r requirements.txt
pip install pybedtools==0.12.0
> Note: If you're only using the pre-generated datasets from Hugging Face, you don't need to install bedtools or pybedtools (skip step 3 and the pybedtools install in step 4). These are only required for regenerating datasets from ClinVar and reference genome files.
6. Fine-tuning and Evaluation
You can easily fine-tune and evaluate your model on the ExomeBench tasks:
Evaluate the Model
You can evaluate your model on the test split by specifying pretrained_model and task_name in eval_only mode.
python main.py \
  --pretrained_model InstaDeepAI/nucleotide-transformer-2.5b-multi-species \
  --output_dir results \
  --mode eval_only \
  --task_name brca
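If you want to script this evaluation from Python, the command above can be assembled as an argument list. The sketch below uses only the flags shown in the command; `build_eval_command` is a hypothetical helper that is not part of the repository, and it builds the command without running it.

```python
import shlex


def build_eval_command(
    pretrained_model: str, task_name: str, output_dir: str = "results"
) -> list[str]:
    """Build the eval-only invocation of main.py as an argv list."""
    return [
        "python", "main.py",
        "--pretrained_model", pretrained_model,
        "--output_dir", output_dir,
        "--mode", "eval_only",
        "--task_name", task_name,
    ]


cmd = build_eval_command(
    "InstaDeepAI/nucleotide-transformer-2.5b-multi-species", task_name="brca"
)
# shlex.join quotes arguments safely for display or logging.
command_line = shlex.join(cmd)
# To execute from the repository root: subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems when model names or output paths contain unusual characters.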
Fine-Tune and Evaluate the Model
You can fine-tune the model and provide training arguments with a yaml file.
python main.py \
  --config configs/brca.yaml
(Optional) Perform Hyper-Parameter Optimization, Fine-Tune, and Evaluate…