OmniGenBench
A Unified Framework for Genomic Foundation Model Development and Benchmarking
OmniGenBench is a comprehensive toolkit for developing, evaluating, and deploying genomic foundation models across diverse biological sequence analysis tasks. The framework implements a principled software architecture grounded in four core abstract base classes (OmniModel, OmniDataset, OmniTokenizer, OmniMetric), providing task-specific abstractions, standardized evaluation protocols, and seamless integration with the HuggingFace ecosystem to address the unique computational challenges of genomic machine learning.
Documentation Structure
Installation - System requirements and installation procedures
User Guide - Core workflows, model downloading, and API usage patterns
Command Line Interface - Command-line interface reference
Architecture & Design Philosophy - Architectural principles and extension mechanisms
Troubleshooting Guide - Common issues and solutions
API Reference - Complete API specification
Framework Capabilities
30+ Pre-trained Models: Production-ready genomic foundation models from HuggingFace Hub, supporting DNA and RNA sequences across multiple species (plants, animals, microbes)
80+ Standardized Benchmarks: Comprehensive evaluation tasks grouped into five benchmark suites:
RGB: 12 RNA structure and function tasks
BEACON: 13 multi-domain RNA tasks
PGB: 7 plant genomics categories with long-range context
GUE: 36 DNA general understanding datasets
GB: 9 classic DNA classification benchmarks
Three-Line Inference API: A minimal-code deployment pattern that loads a model from the HuggingFace Hub, predicts on sequences, and interprets the results
Distributed Training Infrastructure: Three trainer backends with complementary strengths:
native: Pure PyTorch training loop for explicit control and debugging
accelerate: HuggingFace Accelerate for multi-GPU/multi-node distributed training
hf_trainer: HuggingFace Trainer API for full ecosystem integration
Extensible Architecture: Plug-and-play customization through abstract base classes - extend with custom models, datasets, tokenizers, or metrics without modifying core code
Interpretability Tools: Built-in embedding extraction (encode()), attention visualization (extract_attention_scores()), and similarity computation via EmbeddingMixin
RNA Sequence Design: Structure-to-sequence generation using genetic algorithms enhanced with masked language model guidance for target secondary structure realization
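The plug-and-play extension pattern listed under Extensible Architecture can be illustrated with a small, library-independent sketch. The names `SketchMetric`, `BinaryMCC`, and `evaluate` are hypothetical and are not part of OmniGenBench; the real `OmniMetric` interface may differ, so treat this only as an outline of the idea that framework code depends on an abstract base class while users supply subclasses.

```python
from abc import ABC, abstractmethod
import math


class SketchMetric(ABC):
    """Hypothetical stand-in for a metric base class (NOT OmniGenBench's OmniMetric)."""

    @abstractmethod
    def compute(self, y_true, y_pred):
        """Return a dict mapping metric names to values."""


class BinaryMCC(SketchMetric):
    """Custom metric: Matthews correlation coefficient for 0/1 labels."""

    def compute(self, y_true, y_pred):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        # By convention, MCC is reported as 0 when the denominator vanishes.
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0
        return {"mcc": mcc}


def evaluate(metric: SketchMetric, y_true, y_pred):
    """Framework-side code: relies only on the base-class interface."""
    return metric.compute(y_true, y_pred)


# Toy labels for illustration only (not benchmark data).
result = evaluate(BinaryMCC(), [1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 1])
print(result)
```

Because `evaluate` accepts any `SketchMetric`, a new metric can be added by writing a subclass, without touching the evaluation code.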
Quick Start
Python API: Three-line inference workflow for rapid deployment
from omnigenbench import ModelHub
# Load fine-tuned model from HuggingFace Hub (auto-cached on first use)
model = ModelHub.load("yangheng/ogb_tfb_finetuned")
# Predict transcription factor binding sites (919-way multi-label classification)
predictions = model.inference("ATCGATCGATCGATCG" * 20)
# Returns: {'predictions': array([1, 0, 1, ...]),
# 'probabilities': array([0.92, 0.08, 0.87, ...])}
# Interpret results: identify binding sites
import numpy as np
binding_tfs = np.where(predictions['predictions'] == 1)[0]
print(f"Predicted binding sites: {len(binding_tfs)}/919 transcription factors")
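As a further illustration of interpreting a multi-label result, the sketch below works on a hand-written dictionary that only mimics the shape shown in the comments above, shortened to six labels. The values are made up, and the assumption that binary predictions correspond to thresholding probabilities at 0.5 is ours, not something the framework documentation states here.

```python
import numpy as np

# Hypothetical output with the same keys as the result shown above (6 labels, made-up values).
output = {
    "predictions": np.array([1, 0, 1, 0, 0, 1]),
    "probabilities": np.array([0.92, 0.08, 0.87, 0.41, 0.12, 0.66]),
}

# Assumption: a label counts as predicted when its probability is at least 0.5.
threshold = 0.5
recomputed = (output["probabilities"] >= threshold).astype(int)

# Rank labels by confidence and keep the top three indices.
top_k = 3
ranked = np.argsort(output["probabilities"])[::-1][:top_k]

print("Labels above threshold:", np.where(recomputed == 1)[0].tolist())
print("Top labels by probability:", ranked.tolist())
```

Ranking by probability is often more informative than the binary vector alone when many of the 919 labels are predicted at once.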
Command-Line Interface: Production workflows with zero-code execution
# Automated benchmarking: evaluate a model across the 12 RGB RNA tasks with multi-seed averaging
ogb autobench \
--model yangheng/OmniGenome-186M \
--benchmark RGB \
--seeds 0 1 2 \
--trainer accelerate
# Output: Mean ± std per metric (e.g., MCC: 0.742 ± 0.015, F1: 0.863 ± 0.009)
# Automated training: fine-tune on custom dataset with distributed training
ogb autotrain \
--dataset ./my_promoters \
--model yangheng/OmniGenome-186M \
--epochs 50 \
--batch-size 32 \
--trainer accelerate
# Automated inference: batch prediction on genomic sequences
ogb autoinfer \
--model yangheng/ogb_tfb_finetuned \
--input-file sequences.json \
--output-file predictions.json
# RNA sequence design: generate sequences realizing target secondary structure
ogb rna_design \
--structure "(((...)))" \
--model yangheng/OmniGenome-186M \
--num-population 200 \
--num-generation 100
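The `--structure` argument above is written in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. The following sketch shows how such a string maps to base-pair indices. The helper `dot_bracket_pairs` is a hypothetical name introduced for illustration and is not an OmniGenBench function; it handles only the simple `(`, `)`, `.` alphabet without pseudoknots.

```python
def dot_bracket_pairs(structure: str) -> list[tuple[int, int]]:
    """Return sorted (open, close) index pairs for a dot-bracket string."""
    stack = []   # positions of '(' that are still waiting for a partner
    pairs = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


pairs = dot_bracket_pairs("(((...)))")
print(pairs)  # [(0, 8), (1, 7), (2, 6)]
```

For the nine-character target `(((...)))`, this yields a three-base-pair stem closing a three-nucleotide loop, which is the structure a designed sequence is expected to fold into.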