OmniGenBench

A Unified Framework for Genomic Foundation Model Development and Benchmarking

OmniGenBench is a toolkit for developing, evaluating, and deploying genomic foundation models across a range of biological sequence analysis tasks. Its architecture is built on four abstract base classes (OmniModel, OmniDataset, OmniTokenizer, OmniMetric). Together they provide task-specific abstractions, standardized evaluation protocols, and integration with the HuggingFace ecosystem.
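As a rough illustration of how abstract base classes can separate responsibilities such as tokenization and metric computation, the sketch below uses Python's standard abc module. The class names SketchTokenizer and SketchMetric, and the method names tokenize and compute, are hypothetical stand-ins chosen for this example. They are not the actual interfaces of OmniTokenizer or OmniMetric.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class SketchTokenizer(ABC):
    """Hypothetical stand-in for a tokenizer base class (not the library's API)."""

    @abstractmethod
    def tokenize(self, sequence: str) -> List[int]:
        ...


class SketchMetric(ABC):
    """Hypothetical stand-in for a metric base class (not the library's API)."""

    @abstractmethod
    def compute(self, y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
        ...


class NucleotideTokenizer(SketchTokenizer):
    """Maps each DNA base to an integer id; any other character maps to 4."""

    VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

    def tokenize(self, sequence: str) -> List[int]:
        return [self.VOCAB.get(base, 4) for base in sequence.upper()]


class Accuracy(SketchMetric):
    """Fraction of positions where the prediction equals the label."""

    def compute(self, y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
        if len(y_true) != len(y_pred):
            raise ValueError("y_true and y_pred must have the same length")
        if len(y_true) == 0:
            return {"accuracy": 0.0}
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return {"accuracy": correct / len(y_true)}


token_ids = NucleotideTokenizer().tokenize("ACGTN")
scores = Accuracy().compute([1, 0, 1, 1], [1, 0, 0, 1])
print(token_ids)  # [0, 1, 2, 3, 4]
print(scores)     # {'accuracy': 0.75}
```

Code written against the base classes never needs to know which concrete subclass it receives, which is what lets new tokenizers or metrics be added without touching existing code.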

Framework Capabilities

  • 30+ Pre-trained Models: Production-ready genomic foundation models from HuggingFace Hub, supporting DNA and RNA sequences across multiple species (plants, animals, microbes)

  • 80+ Standardized Benchmarks: Comprehensive evaluation across five benchmark suites:

    • RGB: 12 RNA structure and function tasks

    • BEACON: 13 multi-domain RNA tasks

    • PGB: 7 plant genomics categories with long-range context

    • GUE: 36 DNA general understanding datasets

    • GB: 9 classic DNA classification benchmarks

  • Three-Line Inference API: A minimal-code deployment pattern that loads a model from HuggingFace Hub, predicts on sequences, and interprets the results

  • Distributed Training Infrastructure: Three trainer backends with complementary strengths:

    • native: Pure PyTorch training loop for explicit control and debugging

    • accelerate: HuggingFace Accelerate for multi-GPU/multi-node distributed training

    • hf_trainer: HuggingFace Trainer API for full ecosystem integration

  • Extensible Architecture: Plug-and-play customization through abstract base classes. Add custom models, datasets, tokenizers, or metrics without modifying core code

  • Interpretability Tools: Built-in embedding extraction (encode()), attention visualization (extract_attention_scores()), and similarity computation via EmbeddingMixin

  • RNA Sequence Design: Structure-to-sequence generation using a genetic algorithm guided by a masked language model, producing sequences intended to fold into a target secondary structure
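To make the structure-to-sequence idea concrete, here is a deliberately simplified genetic algorithm, assuming a toy fitness function. It does not reproduce the framework's method: mutations are uniformly random rather than guided by a masked language model, and fitness only counts whether paired positions in the target dot-bracket string hold bases that can pair (Watson-Crick or G-U wobble), with no real folding prediction. All function names and default parameters are hypothetical.

```python
import random
from typing import List, Tuple

# Watson-Crick pairs plus G-U wobble pairs
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
BASES = "ACGU"


def parse_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return sorted (i, j) index pairs for matching brackets in dot-bracket notation."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def fitness(seq: str, pairs: List[Tuple[int, int]]) -> float:
    """Toy score: fraction of target pairs whose bases can pair (no folding model)."""
    if not pairs:
        return 1.0
    return sum((seq[i], seq[j]) in VALID_PAIRS for i, j in pairs) / len(pairs)


def design(structure: str, population: int = 50, generations: int = 30,
           mutation_rate: float = 0.1, seed: int = 0) -> str:
    """Evolve a sequence toward the target structure under the toy fitness."""
    n = len(structure)
    if n < 2 or population < 4:
        raise ValueError("need a structure of length >= 2 and population >= 4")
    rng = random.Random(seed)
    pairs = parse_pairs(structure)
    pool = ["".join(rng.choice(BASES) for _ in range(n)) for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=lambda s: fitness(s, pairs), reverse=True)
        if fitness(pool[0], pairs) == 1.0:
            break
        parents = pool[: population // 2]      # elitism: the best half survives unchanged
        children = []
        while len(parents) + len(children) < population:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = list(a[:cut] + b[cut:])
            for k in range(n):                 # uniform random point mutation
                if rng.random() < mutation_rate:
                    child[k] = rng.choice(BASES)
            children.append("".join(child))
        pool = parents + children
    return max(pool, key=lambda s: fitness(s, pairs))


target = "(((...)))"
best = design(target)
print(best, fitness(best, parse_pairs(target)))
```

A realistic pipeline would replace the toy fitness with a comparison between the target and a structure predicted by a folding tool, and would replace the random mutation step with proposals from a language model.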

Quick Start

Python API: Three-line inference workflow for rapid deployment

import numpy as np
from omnigenbench import ModelHub

# Load fine-tuned model from HuggingFace Hub (auto-cached on first use)
model = ModelHub.load("yangheng/ogb_tfb_finetuned")

# Predict transcription factor binding (919-way multi-label classification)
predictions = model.inference("ATCGATCGATCGATCG" * 20)
# Returns: {'predictions': array([1, 0, 1, ...]),
#           'probabilities': array([0.92, 0.08, 0.87, ...])}

# Interpret results: indices of transcription factors predicted to bind
binding_tfs = np.where(predictions['predictions'] == 1)[0]
print(f"Predicted binding: {len(binding_tfs)}/919 transcription factors")
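If the returned dictionary has the shape shown in the comments above, the probabilities can also be used to apply a custom decision threshold or to rank the most confident labels. The sketch below runs on a mock list of probabilities, so no model download is needed. The helper name summarize_multilabel, the 0.5 default threshold, and the mock values are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def summarize_multilabel(probabilities: Iterable[float], threshold: float = 0.5,
                         top_k: int = 3) -> Tuple[List[int], List[Tuple[int, float]]]:
    """Return indices at or above the threshold, and the top_k (index, probability) pairs."""
    probs = [float(p) for p in probabilities]
    positive = [i for i, p in enumerate(probs) if p >= threshold]
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    return positive, ranked[:top_k]


# Mock probabilities standing in for predictions['probabilities'] (made-up values)
mock_probabilities = [0.92, 0.08, 0.87, 0.40, 0.55]
positive, top = summarize_multilabel(mock_probabilities, threshold=0.5, top_k=3)
print(positive)  # [0, 2, 4]
print(top)       # [(0, 0.92), (2, 0.87), (4, 0.55)]
```

The function accepts any iterable of floats, so a NumPy array of probabilities can be passed in directly.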

Command-Line Interface: Production workflows with zero-code execution

# Automated benchmarking: evaluate model across 12 RNA tasks with multi-seed averaging
ogb autobench \
    --model yangheng/OmniGenome-186M \
    --benchmark RGB \
    --seeds 0 1 2 \
    --trainer accelerate
# Output: Mean ± std per metric (e.g., MCC: 0.742 ± 0.015, F1: 0.863 ± 0.009)

# Automated training: fine-tune on custom dataset with distributed training
ogb autotrain \
    --dataset ./my_promoters \
    --model yangheng/OmniGenome-186M \
    --epochs 50 \
    --batch-size 32 \
    --trainer accelerate

# Automated inference: batch prediction on genomic sequences
ogb autoinfer \
    --model yangheng/ogb_tfb_finetuned \
    --input-file sequences.json \
    --output-file predictions.json

# RNA sequence design: generate sequences realizing target secondary structure
ogb rna_design \
    --structure "(((...)))" \
    --model yangheng/OmniGenome-186M \
    --num-population 200 \
    --num-generation 100
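The autobench example above reports each metric as a mean and standard deviation across seeds. The following sketch shows one way to compute such a summary from per-seed results using only the standard library. The function name aggregate_runs and the metric values are made up for illustration, and the use of the sample standard deviation (n - 1 denominator) is an assumption. The tool may compute its spread differently.

```python
from statistics import mean, stdev
from typing import Dict, List


def aggregate_runs(runs: List[Dict[str, float]]) -> Dict[str, str]:
    """Summarize each metric as 'mean ± std' across runs (sample std, n - 1)."""
    if not runs:
        raise ValueError("at least one run is required")
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[name] = f"{mean(values):.3f} ± {spread:.3f}"
    return summary


# One dictionary per seed; the numbers are invented for this example
runs = [
    {"mcc": 0.70, "f1": 0.85},
    {"mcc": 0.74, "f1": 0.86},
    {"mcc": 0.78, "f1": 0.87},
]
summary = aggregate_runs(runs)
print(summary)  # {'mcc': '0.740 ± 0.040', 'f1': '0.860 ± 0.010'}
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two models exceeds seed-to-seed variation.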