OmniGenBench

A Unified Framework for Genomic Foundation Model Development and Benchmarking

OmniGenBench is a toolkit for developing, evaluating, and deploying genomic foundation models across diverse biological sequence analysis tasks. Its architecture is built on four abstract base classes (OmniModel, OmniDataset, OmniTokenizer, OmniMetric), which provide task-specific abstractions, standardized evaluation protocols, and integration with the HuggingFace ecosystem, addressing the computational challenges specific to genomic machine learning.
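To make the extension pattern concrete, here is a minimal sketch of how an abstract base class lets new behavior plug in without changes to core code. `BaseMetric`, `compute`, `AccuracyMetric`, and `evaluate` are hypothetical stand-ins written for this illustration; they are not OmniGenBench's actual `OmniMetric` interface, which is specified in the API Reference.

```python
from abc import ABC, abstractmethod
from typing import Dict, Sequence


class BaseMetric(ABC):
    """Hypothetical stand-in for a metric base class (not the real OmniMetric)."""

    @abstractmethod
    def compute(self, y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
        """Return a mapping from metric name to value."""


class AccuracyMetric(BaseMetric):
    """Custom metric added by subclassing; the base class is left untouched."""

    def compute(self, y_true, y_pred):
        if len(y_true) != len(y_pred):
            raise ValueError("y_true and y_pred must have the same length")
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return {"accuracy": correct / len(y_true)}


def evaluate(metric: BaseMetric, y_true, y_pred):
    """Framework-side code depends only on the abstract interface."""
    return metric.compute(y_true, y_pred)


result = evaluate(AccuracyMetric(), [1, 0, 1, 1], [1, 0, 0, 1])
print(result)  # {'accuracy': 0.75}
```

Because `evaluate` only calls the abstract method, any subclass that implements `compute` can be passed in, which is the general idea behind extending a framework through its base classes.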

Documentation Structure

Installation - System requirements and installation procedures
User Guide - Core workflows, model downloading, and API usage patterns
Command Line Interface - Reference for the ogb commands
Architecture & Design Philosophy - Architectural principles and extension mechanisms
Troubleshooting Guide - Common issues and solutions
API Reference - Complete API specification

Key Features

30+ Pre-trained Models: Production-ready genomic foundation models from HuggingFace Hub, supporting DNA and RNA sequences across multiple species (plants, animals, microbes)
80+ Standardized Benchmarks: Comprehensive evaluation across five benchmark suites:
- RGB: 12 RNA structure and function tasks
- BEACON: 13 multi-domain RNA tasks
- PGB: 7 plant genomics categories with long-range context
- GUE: 36 DNA general understanding datasets
- GB: 9 classic DNA classification benchmarks
Three-Line Inference API: Minimal-code deployment pattern (load from HuggingFace Hub, predict on sequences, interpret results)
Distributed Training Infrastructure: Three trainer backends with complementary strengths:
- native: Pure PyTorch training loop for explicit control and debugging
- accelerate: HuggingFace Accelerate for multi-GPU/multi-node distributed training
- hf_trainer: HuggingFace Trainer API for full ecosystem integration
Extensible Architecture: Plug-and-play customization through abstract base classes; add custom models, datasets, tokenizers, or metrics without modifying core code
Interpretability Tools: Built-in embedding extraction (encode()), attention visualization (extract_attention_scores()), and similarity computation via EmbeddingMixin
RNA Sequence Design: Structure-to-sequence generation using genetic algorithms enhanced with masked language model guidance for target secondary structure realization
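
The target secondary structure for RNA design is written in dot-bracket notation, where matched parentheses mark paired bases and dots mark unpaired ones. As a small illustration of what such a target encodes, the sketch below extracts the base pairs from a dot-bracket string. The helper name `dot_bracket_pairs` is made up for this example and is not part of OmniGenBench.

```python
from typing import List, Tuple


def dot_bracket_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return 0-based (i, j) index pairs for each matched '(' and ')'."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


pairs = dot_bracket_pairs("(((...)))")
print(pairs)  # [(0, 8), (1, 7), (2, 6)]
```

A candidate sequence can only fold into the target if the bases at each paired position are able to pair with each other (for example G-C, A-U, or the G-U wobble pair).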

Quick Start

Python API: Three-line inference workflow for rapid deployment

import numpy as np
from omnigenbench import ModelHub

# Load fine-tuned model from HuggingFace Hub (auto-cached on first use)
model = ModelHub.load("yangheng/ogb_tfb_finetuned")

# Predict transcription factor binding (919-way multi-label classification)
predictions = model.inference("ATCGATCGATCGATCG" * 20)
# Returns: {'predictions': array([1, 0, 1, ...]),
#           'probabilities': array([0.92, 0.08, 0.87, ...])}

# Interpret results: indices of the transcription factors predicted to bind
binding_tfs = np.where(predictions['predictions'] == 1)[0]
print(f"Predicted binding: {len(binding_tfs)}/919 transcription factors")
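
In multi-label classification each label is decided independently, commonly by thresholding its probability. The sketch below shows how a `probabilities` vector like the one in the example could be turned into labels or ranked. The 0.5 threshold, the helper names, and the sample values are assumptions for illustration, not a statement about how the model computes `predictions`.

```python
from typing import List, Tuple


def threshold_labels(probabilities: List[float], threshold: float = 0.5) -> List[int]:
    """Mark each label as 1 when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def top_k(probabilities: List[float], k: int) -> List[Tuple[int, float]]:
    """Return the k (label_index, probability) pairs with highest probability."""
    ranked = sorted(enumerate(probabilities), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Illustrative values only; a real output would have one entry per label.
probabilities = [0.92, 0.08, 0.87, 0.41]

labels = threshold_labels(probabilities)
print(labels)                     # [1, 0, 1, 0]
print(top_k(probabilities, k=2))  # [(0, 0.92), (2, 0.87)]
```

Raising the threshold trades recall for precision, so a stricter cutoff may be preferable when only high-confidence binding calls are of interest.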

Command-Line Interface: Production workflows with zero-code execution

# Automated benchmarking: evaluate model across 12 RNA tasks with multi-seed averaging
ogb autobench \
    --model yangheng/OmniGenome-186M \
    --benchmark RGB \
    --seeds 0 1 2 \
    --trainer accelerate
# Output: Mean ± std per metric (e.g., MCC: 0.742 ± 0.015, F1: 0.863 ± 0.009)

# Automated training: fine-tune on custom dataset with distributed training
ogb autotrain \
    --dataset ./my_promoters \
    --model yangheng/OmniGenome-186M \
    --epochs 50 \
    --batch-size 32 \
    --trainer accelerate

# Automated inference: batch prediction on genomic sequences
ogb autoinfer \
    --model yangheng/ogb_tfb_finetuned \
    --input-file sequences.json \
    --output-file predictions.json

# RNA sequence design: generate sequences realizing target secondary structure
ogb rna_design \
    --structure "(((...)))" \
    --model yangheng/OmniGenome-186M \
    --num-population 200 \
    --num-generation 100
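
The `ogb autobench` example reports each metric as mean ± standard deviation over seeds. A minimal sketch of that aggregation follows. The per-seed scores are made-up numbers, `summarize` is a hypothetical helper, and the use of the sample standard deviation (n - 1) is an assumption; the tool may compute it differently.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(scores_by_metric: Dict[str, List[float]]) -> Dict[str, str]:
    """Format each metric as 'mean ± std' across seeds (sample std, n - 1)."""
    summary = {}
    for name, scores in scores_by_metric.items():
        # stdev needs at least two values; report 0.0 for a single seed.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[name] = f"{mean(scores):.3f} ± {spread:.3f}"
    return summary


# Hypothetical per-seed results for seeds 0, 1, 2.
per_seed = {"MCC": [0.70, 0.75, 0.80], "F1": [0.85, 0.86, 0.87]}
summary = summarize(per_seed)
print(summary)  # {'MCC': '0.750 ± 0.050', 'F1': '0.860 ± 0.010'}
```

Reporting the spread alongside the mean shows how sensitive a result is to random initialization and data ordering, which is the reason for running several seeds.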
