User Guide
Core Workflows for Genomic Foundation Models
This guide demonstrates the primary usage patterns for OmniGenBench, covering the complete machine learning lifecycle: automated benchmarking with AutoBench, model fine-tuning with AutoTrain, and production deployment via inference APIs. Each workflow is designed to minimize boilerplate code while providing flexibility for advanced customization through configurable parameters and extension points.
Prerequisites: This guide assumes you have installed OmniGenBench (see Installation). All examples use models from the HuggingFace Hub and standardized benchmark datasets that are automatically downloaded and cached on first use.
Model & Dataset Downloading
Eliminating Git-LFS Dependencies for Production-Grade Model Acquisition
OmniGenBench ships a model-downloading layer that removes the hard dependency on Git-LFS and improves reliability, performance, and error handling. It uses a hybrid strategy: direct HTTP downloads through HuggingFace Hub’s official API come first, with automatic fallback to Git-based cloning for edge cases.
Design Philosophy: Git-LFS pointer file corruption has been a persistent source of silent model loading failures. Users without properly configured Git-LFS installations would successfully “clone” models but receive only 100-byte pointer files instead of multi-gigabyte weight tensors—leading to models initializing with random weights and producing nonsensical predictions.
Key Improvements:
Zero Git-LFS Dependency: Direct HTTPS downloads via huggingface_hub eliminate Git/LFS configuration requirements
Automatic Integrity Verification: Post-download validation detects LFS pointer corruption
Performance Gains: 33% faster downloads through CDN optimization
Reduced Storage: 20% disk savings by omitting .git/ repository metadata
Graceful Degradation: Automatic fallback to Git clone preserves backward compatibility
Download Strategy Architecture
Primary Method: HuggingFace Hub API (Recommended)
Direct HTTPS downloads using huggingface_hub.snapshot_download(), bypassing Git entirely.
Advantages:
Only requires the huggingface_hub>=0.20.0 Python package; no system-level Git/LFS binaries
CDN-accelerated transfer with automatic geo-routing
Automatic SHA256 verification for each file chunk
Resume support for interrupted downloads
Compact storage excluding Git history (20-25% size reduction)
Example:
from huggingface_hub import snapshot_download
# Download entire model repository via HTTPS
local_path = snapshot_download(
    repo_id="yangheng/OmniGenome-186M",
    cache_dir="__OMNIGENBENCH_DATA__/models/",
    local_dir_use_symlinks=False,
    resume_download=True,
)
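The SHA256 check listed among the advantages can be illustrated with the standard library alone. This is a minimal sketch, not the verification code used by huggingface_hub or OmniGenBench; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and it assumes you already know the expected digest of the file.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Fixed-size chunks keep memory use flat even for multi-gigabyte weights.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with an expected digest (hypothetical helper)."""
    return sha256_of_file(path) == expected_hex.lower()
```

A mismatch means the bytes on disk are not the bytes that were published, whatever the cause (truncation, pointer file, or corruption).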
Fallback Method: Git Clone with LFS
Required only when users need full Git history or when HF Hub API is unavailable.
Warning
Git-LFS Pointer File Hazard
If Git-LFS is not installed, git clone will substitute large files with pointer files:
version https://git-lfs.github.com/spec/v1
oid sha256:b437d27531abc123...
size 41943280
These pointer files contain no weight tensors. Depending on how the checkpoint is loaded, the model can end up with randomly initialized weights and produce incorrect inference. The failure can be silent: no exception is raised, but the predictions are meaningless.
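A pointer file like the one shown above is easy to recognize programmatically. The following is a minimal sketch using only the standard library; `read_lfs_pointer` is a hypothetical helper, not part of OmniGenBench, and the 1024-byte cutoff is an assumption chosen because pointer files are only a few lines long.

```python
from pathlib import Path

LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def read_lfs_pointer(path):
    """Return {'oid': ..., 'size': ...} if path is a Git-LFS pointer file, else None."""
    path = Path(path)
    # Pointer files are tiny text files; real weight files are far larger.
    if path.stat().st_size > 1024:
        return None
    raw = path.read_bytes()
    if not raw.startswith(LFS_HEADER):
        return None
    fields = {}
    for line in raw.decode("utf-8", errors="replace").splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return {"oid": fields.get("oid", ""), "size": int(fields.get("size", "0"))}
```

The `size` field records how large the real file should be, so comparing it with the size on disk shows how much data is missing.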
Usage Patterns
Automatic Strategy Selection (Recommended)
from omnigenbench import ModelHub
# Automatic strategy: try HF Hub API, fallback to Git clone
model = ModelHub.load("yangheng/OmniGenome-186M")
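The "try the Hub API, then fall back to Git clone" behaviour can be pictured as a simple try/except wrapper. This is a sketch of the general pattern only; `load_with_fallback` and the two stand-in functions are hypothetical and do not reflect how ModelHub is implemented internally.

```python
def load_with_fallback(primary, fallback, repo_id):
    """Try primary(repo_id); if it raises, return fallback(repo_id) instead."""
    try:
        return primary(repo_id), "primary"
    except Exception as error:  # broad on purpose: any primary failure triggers the fallback
        print(f"primary download failed ({error!r}); falling back")
        return fallback(repo_id), "fallback"


def fake_hub_download(repo_id):
    """Stand-in for an HTTPS download that fails (for illustration only)."""
    raise ConnectionError("hub unreachable")


def fake_git_clone(repo_id):
    """Stand-in for a Git clone that succeeds (for illustration only)."""
    return f"/tmp/{repo_id.replace('/', '--')}"


path, method = load_with_fallback(fake_hub_download, fake_git_clone, "yangheng/OmniGenome-186M")
print(path, method)
```

Returning which method succeeded makes it possible to run extra integrity checks when the Git path was taken.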
Explicit HF Hub API Usage
from omnigenbench import ModelHub
# Force HuggingFace Hub API (production environments)
model = ModelHub.load(
    "yangheng/OmniGenome-186M",
    use_hf_api=True
)
Direct API Access for Fine-Grained Control
from omnigenbench.src.utility.model_hub.hf_download import download_from_hf_hub
# Download with custom configuration
path = download_from_hf_hub(
    repo_id="yangheng/ogb_tfb_finetuned",
    cache_dir="/custom/cache/",
    force_download=False,
)
Selective File Download (bandwidth optimization):
from omnigenbench.src.utility.model_hub.hf_download import download_from_hf_hub
# Download only essential files
path = download_from_hf_hub(
    repo_id="yangheng/OmniGenome-186M",
    allow_patterns=["*.json", "*.bin"],
    ignore_patterns=["*.msgpack", "*.h5"],
)
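To see which files a pair of allow/ignore patterns would keep, the glob matching can be reproduced locally with `fnmatch`. This is an illustrative sketch: `select_files` is a hypothetical helper and the file list is made up; it assumes the patterns follow the usual shell-style glob semantics.

```python
from fnmatch import fnmatch


def select_files(filenames, allow_patterns=None, ignore_patterns=None):
    """Keep names matching any allow pattern and no ignore pattern."""
    selected = []
    for name in filenames:
        if allow_patterns is not None and not any(fnmatch(name, p) for p in allow_patterns):
            continue
        if ignore_patterns is not None and any(fnmatch(name, p) for p in ignore_patterns):
            continue
        selected.append(name)
    return selected


# Hypothetical repository listing
repo_files = ["config.json", "pytorch_model.bin", "flax_model.msgpack", "tf_model.h5", "README.md"]
print(select_files(repo_files, allow_patterns=["*.json", "*.bin"], ignore_patterns=["*.msgpack", "*.h5"]))
# ['config.json', 'pytorch_model.bin']
```

Note that a `*.bin` allow pattern would skip weights stored under other extensions, so check the repository's actual file names before narrowing the download.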
Downloading Benchmarks and Datasets
The same robust API works for benchmark datasets:
from omnigenbench.src.utility.hub_utils import download_benchmark
# Download benchmark with HF Hub API (recommended)
benchmark_path = download_benchmark(
    "RGB",  # Short name
    use_hf_api=True
)
# Or specify full HuggingFace dataset repository
benchmark_path = download_benchmark(
    "yangheng/OmniGenBench_RGB",
    use_hf_api=True
)
# Force re-download to update cached benchmark
benchmark_path = download_benchmark(
    "RGB",
    force_download=True,
    use_hf_api=True
)
Automatic Method Selection in AutoBench:
from omnigenbench import AutoBench
# AutoBench automatically uses robust HF Hub API for benchmark downloads
bench = AutoBench(
    benchmark="RGB",  # Automatically downloaded via HF Hub API
    config_or_model="yangheng/OmniGenome-186M"
)
bench.run()
Download Integrity Verification
All downloads include automatic validation:
from omnigenbench.src.utility.model_hub.hf_download import (
    download_from_hf_hub,
    verify_download_integrity,
)
# Download and verify
path = download_from_hf_hub("yangheng/OmniGenome-186M")
is_valid = verify_download_integrity(path)
if not is_valid:
    raise RuntimeError("Download corrupted: LFS pointer detected")
Validation Checks:
File existence: Verify all required files present
LFS pointer detection: Scan .bin files for Git-LFS pointer headers
Size validation: Flag suspiciously small files (<200 bytes)
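The three checks above can be sketched in a few lines of standard-library Python. This is not the implementation of `verify_download_integrity`; the function name `find_download_problems`, the default `required` file list, and the choice to scan only the top level of the directory are assumptions made for illustration.

```python
from pathlib import Path

LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"
MIN_WEIGHT_BYTES = 200  # threshold taken from the size check above


def find_download_problems(model_dir, required=("config.json",)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    model_dir = Path(model_dir)
    problems = []
    # 1. File existence
    for name in required:
        if not (model_dir / name).is_file():
            problems.append(f"missing file: {name}")
    for weight in sorted(model_dir.glob("*.bin")):
        with weight.open("rb") as handle:
            head = handle.read(len(LFS_HEADER))
        # 2. LFS pointer detection
        if head == LFS_HEADER:
            problems.append(f"LFS pointer: {weight.name}")
        # 3. Size validation
        elif weight.stat().st_size < MIN_WEIGHT_BYTES:
            problems.append(f"suspiciously small: {weight.name}")
    return problems
```

Returning a list of problems rather than a single boolean makes the failure reason visible in logs.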
Troubleshooting Common Issues
Problem: Model Produces Random Predictions
Symptoms: Model loads successfully but predictions are nonsensical.
Diagnosis:
from omnigenbench.src.utility.model_hub.hf_download import verify_download_integrity
is_valid = verify_download_integrity(
    "__OMNIGENBENCH_DATA__/models/yangheng--ogb_tfb_finetuned"
)
if not is_valid:
    print("Git-LFS pointer detected: model weights not downloaded")
Solution: Re-download with HF Hub API:
from omnigenbench import ModelHub
# Force HF Hub API re-download
model = ModelHub.load(
    "yangheng/ogb_tfb_finetuned",
    use_hf_api=True,
    force_download=True
)
Problem: Network Connection Failures
HF Hub API includes automatic resume capability—simply re-run the download command:
from omnigenbench.src.utility.model_hub.hf_download import download_from_hf_hub
# Re-running resumes from the last verified chunk
path = download_from_hf_hub(
    "yangheng/OmniGenome-186M",
    force_download=False
)
Performance Benchmarks
Tested with yangheng/OmniGenome-186M (~200MB model):
| Method | Time | Dependencies | Risk Level |
|---|---|---|---|
| HF Hub API | 30 seconds | | None |
| Git Clone (with LFS) | 45 seconds | | Low |
| Git Clone (without LFS) | 5 seconds ⚠️ | | High (pointer files) |
Storage Efficiency:
HF Hub API: 200 MB (model files only)
Git Clone: 250 MB (model files + .git/ metadata)
Savings: 50 MB (20% reduction)
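The percentages quoted in this section follow directly from the figures above. The short calculation below reproduces them; note that going from 45 s to 30 s is a 33% reduction in download time, which is the sense in which the download is "33% faster".

```python
hub_seconds, git_seconds = 30, 45   # times from the table above
hub_mb, git_mb = 200, 250           # sizes from the storage comparison above

time_saving = (git_seconds - hub_seconds) / git_seconds  # fraction of wall-clock time saved
storage_saving = (git_mb - hub_mb) / git_mb              # fraction of disk space saved

print(f"time saved: {time_saving:.0%}, storage saved: {storage_saving:.0%}")
# time saved: 33%, storage saved: 20%
```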
Best Practices
Recommended:
Default to HF Hub API for all production workloads
Always validate downloads with verify_download_integrity()
Use selective downloads (allow_patterns) to optimize bandwidth
Handle private models with HuggingFace access tokens
Avoid:
Implicit Git-LFS dependencies (use use_hf_api=True)
Ignoring verification failures
Mixing download methods without cleanup
Tip
For detailed troubleshooting, repository metadata queries, and migration guides, see the complete API reference at api/model_hub.
Workflow 1: Automated Benchmarking
Objective: Evaluate pre-trained genomic foundation models on standardized benchmark suites with reproducible protocols and multi-seed statistical rigor.
The AutoBench class orchestrates the complete evaluation pipeline: benchmark dataset acquisition, model loading from HuggingFace Hub, distributed inference across tasks, metric calculation with domain-specific measures, and results serialization. It implements best practices for genomic ML evaluation, including multi-seed averaging for variance quantification and task-specific metric selection aligned with biological validation standards.
Basic Usage Pattern
Evaluate yangheng/OmniGenome-186M (plant genome foundation model with 186M parameters) on the RGB benchmark (RNA Genome Benchmark: 12 RNA sequence understanding tasks):
from omnigenbench import AutoBench
# Initialize benchmarking pipeline
bench = AutoBench(
    benchmark="RGB",  # Benchmark identifier (RGB, BEACON, PGB, GUE, GB)
    config_or_model="yangheng/OmniGenome-186M"  # HF Hub model ID or local path
)
# Execute evaluation workflow
# Automatically handles: data loading, tokenization, inference, metric computation
bench.run()
# Results saved to: ./autobench_results/RGB/OmniGenome-186M/
# Output format: JSON with per-task metrics and aggregated statistics
Statistical Rigor: Multi-Seed Evaluation
For publication-quality results, evaluate with multiple random seeds to quantify variance:
from omnigenbench import AutoBench
bench = AutoBench(
    benchmark="RGB",
    config_or_model="yangheng/OmniGenome-186M"
)
# Run with 5 independent initializations
bench.run(seeds=[0, 1, 2, 3, 4])
# Results include: mean ± std for each metric across seeds
# Example output: MCC: 0.742 ± 0.015, F1: 0.863 ± 0.009
Tip
Why Multiple Seeds for Evaluation?
Random initialization, data shuffling, and dropout stochasticity cause performance variance across training runs. Multi-seed evaluation (typically 3-5 independent runs) provides:
Statistical validity: Mean and standard deviation enable hypothesis testing and confidence intervals
Variance quantification: Distinguish models with stable performance from those with volatile behavior
Reproducibility verification: Demonstrates results aren’t artifacts of fortunate initialization
Publication standards: Required by most computational biology journals for method comparison
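The "mean ± std" summary is simple to compute once per-seed scores are available. The sketch below uses made-up MCC values for five seeds; the numbers are purely illustrative and are not results from any model.

```python
from statistics import mean, stdev

# Hypothetical per-seed MCC scores for seeds 0-4 (not real results)
mcc_by_seed = {0: 0.73, 1: 0.75, 2: 0.74, 3: 0.76, 4: 0.72}

scores = list(mcc_by_seed.values())
mcc_mean = mean(scores)
mcc_std = stdev(scores)  # sample standard deviation (n - 1 in the denominator)

print(f"MCC: {mcc_mean:.3f} ± {mcc_std:.3f}")
```

With only three to five seeds the sample standard deviation is itself a rough estimate, so small differences between models should be read with caution.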
Note
Trainer Backend Selection Strategy:
Python API (AutoBench): Defaults to native trainer for single-GPU explicit control and debugging
CLI (ogb autobench): Defaults to accelerate trainer for distributed multi-GPU capabilities
This design optimizes for different use cases:
API users typically prioritize explicit control over evaluation steps (native backend)
CLI users typically prioritize production-scale distributed evaluation (accelerate backend)
For multi-GPU distributed evaluation via Python API, explicitly specify trainer="accelerate":
# Single-GPU native trainer (Python API default)
bench = AutoBench(benchmark="RGB", config_or_model="model")
# Multi-GPU distributed evaluation (override default)
bench = AutoBench(
    benchmark="RGB",
    config_or_model="model",
    trainer="accelerate"
)
Available trainer backends:
native: Pure PyTorch evaluation loop, single-GPU, explicit control (Python API default for AutoBench)
accelerate: HuggingFace Accelerate, multi-GPU/multi-node distributed with gradient accumulation (CLI default)
hf_trainer: HuggingFace Trainer API with full ecosystem integration (callbacks, logging, checkpointing)
Training a New Model
OmniGenBench simplifies the training process with the AutoTrain class. You provide a dataset and a base model, and it handles the rest.
In this example, we’ll fine-tune the yangheng/OmniGenome-186M model on a custom dataset named “MyCustomDataset”.
from omnigenbench import AutoTrain
# Initialize the trainer with your dataset and a base model
trainer = AutoTrain(dataset="MyCustomDataset", config_or_model="yangheng/OmniGenome-186M")
# Start the training process
trainer.run()
Tip
Your dataset should be prepared in a compatible format. Refer to the Data Template section below for details on data formatting.
Note
Trainer Backend Selection:
Python API (AutoTrain): Defaults to accelerate trainer for distributed training efficiency
CLI (ogb autotrain): Also defaults to accelerate trainer
This design choice recognizes that training typically benefits from distributed capabilities even on single-GPU systems (gradient accumulation, mixed precision, memory optimization).
For single-GPU training or debugging, specify trainer="native":
# Multi-GPU distributed training (default for AutoTrain)
trainer = AutoTrain(dataset="MyData", config_or_model="model")
# Single-GPU training with explicit control (for debugging)
trainer = AutoTrain(
    dataset="MyData",
    config_or_model="model",
    trainer="native"
)
Available trainer backends:
accelerate: HuggingFace Accelerate, multi-GPU/multi-node distributed (default for AutoTrain)
native: Pure PyTorch training loop, single-GPU, explicit control
hf_trainer: HuggingFace Trainer API integration with full ecosystem support
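Because the default backend differs between AutoBench and AutoTrain and between the Python API and the CLI, it can help to see the defaults in one place. The lookup table below only restates what the notes in this guide say; `resolve_trainer` is a hypothetical helper for illustration, not an OmniGenBench function.

```python
# Defaults as described in this guide: (pipeline, interface) -> trainer backend
DEFAULT_TRAINER = {
    ("AutoBench", "api"): "native",
    ("AutoBench", "cli"): "accelerate",
    ("AutoTrain", "api"): "accelerate",
    ("AutoTrain", "cli"): "accelerate",
}
KNOWN_TRAINERS = ("native", "accelerate", "hf_trainer")


def resolve_trainer(pipeline, interface, override=None):
    """Return the explicit override if given, else the documented default."""
    if override is not None:
        if override not in KNOWN_TRAINERS:
            raise ValueError(f"unknown trainer backend: {override!r}")
        return override
    return DEFAULT_TRAINER[(pipeline, interface)]


print(resolve_trainer("AutoBench", "api"))             # native
print(resolve_trainer("AutoBench", "api", "accelerate"))  # accelerate
```

Passing `trainer=` explicitly in scripts avoids surprises when the same workflow is moved between the API and the CLI.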
Note
CLI Alternative: You can also train models from the command line:
ogb autotrain \
    --dataset ./my_dataset \
    --model yangheng/OmniGenome-186M \
    --epochs 50 \
    --batch-size 32 \
    --trainer accelerate
See Command Line Interface for all training options and configuration.
Data Template & Formats
OmniGenBench supports flexible data loading for genomic machine learning tasks. To ensure compatibility, your data should follow a simple template and be saved in one of the supported formats.
Data Template: {sequence, label} Structure
Each data sample should be a dictionary with at least two keys:
sequence: The biological sequence (DNA, RNA, or protein) as a string.
label: The target value for the task (classification, regression, etc.).
Example for Classification (JSON format):
[
    {"sequence": "ATCGATCGATCG", "label": "0"},
    {"sequence": "GCTAGCTAGCTA", "label": "1"}
]
Example for Regression (JSON format):
[
    {"sequence": "ATCGATCGATCG", "label": 0.75},
    {"sequence": "GCTAGCTAGCTA", "label": -1.2}
]
OmniGenBench will automatically standardize common key names. For example, seq or text will be treated as sequence, and label will be standardized to labels internally.
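The key standardization described above can be pictured as a small alias lookup. This is a sketch of the idea only: the alias tables contain just the names mentioned in this paragraph, and `standardize_sample` is a hypothetical function, not the library's own preprocessing code.

```python
SEQUENCE_ALIASES = ("sequence", "seq", "text")  # aliases named in the text above
LABEL_ALIASES = ("labels", "label")             # 'label' is mapped to 'labels'


def standardize_sample(sample):
    """Return a copy of sample with canonical 'sequence' and optional 'labels' keys."""
    out = {}
    for alias in SEQUENCE_ALIASES:
        if alias in sample:
            out["sequence"] = sample[alias]
            break
    else:
        # The loop finished without finding any sequence-like key.
        raise KeyError("sample has no sequence-like key")
    for alias in LABEL_ALIASES:
        if alias in sample:
            out["labels"] = sample[alias]
            break
    return out


print(standardize_sample({"seq": "ATCG", "label": 1}))
# {'sequence': 'ATCG', 'labels': 1}
```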
Supported Data Formats
JSON (`.json`): Recommended. A list of dictionaries as shown above. Also supports JSON Lines (.jsonl).
CSV (`.csv`): Must have columns for sequence and label.
Parquet (`.parquet`): Columns for sequence and label.
FASTA (`.fasta`, `.fa`, etc.): Sequence data only. Labels must be provided separately or inferred.
FASTQ (`.fastq`, `.fq`): Sequence and quality scores. Labels must be provided separately or inferred.
BED (`.bed`): Genomic intervals. Sequence and label columns may need to be added.
Numpy (`.npy`, `.npz`): Array of dictionaries with sequence and optional label.
For supervised tasks, ensure every sample has both a sequence and a label. For unsupervised or sequence-only tasks, only the sequence key is required.
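For sequence-only formats such as FASTA, one way to supply labels "separately" is to join the records with a label table and write the result in the {sequence, label} template. The sketch below assumes labels are keyed by the first word of each FASTA header; the function names and the sample data are hypothetical.

```python
import json


def parse_fasta(text):
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    record_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Use the first word of the header as the record ID.
            record_id, chunks = (line[1:].split() or [""])[0], []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def fasta_to_samples(fasta_text, labels_by_id):
    """Pair each FASTA record with its label, skipping records without one."""
    return [
        {"sequence": seq, "label": labels_by_id[rid]}
        for rid, seq in parse_fasta(fasta_text)
        if rid in labels_by_id
    ]


fasta = ">seq1 example\nATCGATCG\nATCG\n>seq2\nGCTAGCTA\n"
samples = fasta_to_samples(fasta, {"seq1": 0, "seq2": 1})
print(json.dumps(samples))
```

The resulting list can be saved with `json.dump` and used as a JSON dataset in the template shown earlier.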
Inference with a Model
Once you have a trained model, running inference is straightforward. There are two safe patterns depending on your assets:
Pattern 1: Use task-specific OmniModel classes when you know the task type (recommended)
from omnigenbench import OmniModelForSequenceClassification, OmniTokenizer
# Example: multi-label TF binding (919 tasks)
tokenizer = OmniTokenizer.from_pretrained("yangheng/ogb_tfb_finetuned")
model = OmniModelForSequenceClassification(
    "yangheng/ogb_tfb_finetuned",
    tokenizer=tokenizer,
    num_labels=919,  # or pass label2id if available
)
sequence = "ATCGATCGATCGATCGATCGATCGATCGATCG"
outputs = model.inference(sequence)
print(outputs.keys()) # e.g., dict with predictions/probabilities/logits
Pattern 2: Use ModelHub to load fine-tuned OmniGenBench models with metadata
from omnigenbench import ModelHub
# Load models saved with OmniGenBench (metadata enables task context restoration)
model = ModelHub.load("yangheng/ogb_tfb_finetuned")
result = model.inference("ATCGATCGATCG") # Works when metadata.json is present
Note
ModelHub.load() downloads models from HuggingFace Hub to local cache (__OMNIGENBENCH_DATA__/models/)
on first use, then loads from local files only. It returns a fully-configured task-specific model
when metadata.json is present, otherwise returns a standard Transformers model with attached tokenizer.
For models without OmniGenBench metadata, prefer instantiating task-specific OmniModel classes
directly (Pattern 1) with explicit num_labels or label2id configuration.
Note
CLI Alternative: You can also run inference from the command line:
ogb autoinfer --model yangheng/ogb_tfb_finetuned --sequence "ATCGATCGATCG"
The CLI uses the same metadata-aware loading under the hood. See Command Line Interface for complete options.
Embedding Extraction & Visualization
All OmniModel-based classes inherit EmbeddingMixin, which provides built-in support for extracting sequence embeddings and visualizing attention patterns. These features are essential for:
Downstream Analysis: Using genomic embeddings for clustering, classification, or similarity search
Model Interpretation: Understanding what patterns the model learns via attention visualization
Transfer Learning: Extracting features for training custom models
Extracting Sequence Embeddings
Generate fixed-length vector representations of genomic sequences:
from omnigenbench import OmniModelForEmbedding
model = OmniModelForEmbedding("yangheng/OmniGenome-186M")
sequences = ["ATCGATCGATCGATCG", "GCTAGCTAGCTAGCTA"]
embeddings = model.batch_encode(sequences, agg="mean")
print(embeddings.shape) # (2, hidden_size)
# Use embeddings for downstream tasks (clustering, similarity, etc.)
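Mean aggregation turns a variable number of per-token vectors into one fixed-length vector. The pure-Python sketch below shows the arithmetic on a toy example; it assumes padding positions are excluded via an attention mask, which may or may not match what `agg="mean"` does internally, and `mean_pool` is a hypothetical name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, giving one fixed-length vector."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Three tokens with hidden_size = 2; the last token is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
print(mean_pool(tokens, [1, 1, 0]))  # [2.0, 3.0]
```

The output length equals the hidden size regardless of sequence length, which is why the batch result has shape (2, hidden_size) for two input sequences.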
Extracting Attention Scores
Visualize which positions in the sequence the model attends to:
from omnigenbench import OmniModelForSequenceClassification, OmniTokenizer
tokenizer = OmniTokenizer.from_pretrained("yangheng/OmniGenome-186M")
# Any OmniModel subclass works; num_labels can be a placeholder for analysis-only use
model = OmniModelForSequenceClassification(
    "yangheng/OmniGenome-186M", tokenizer=tokenizer, num_labels=2
)
sequence = "ATCGATCGATCGATCG"
result = model.extract_attention_scores(sequence)
attn = result["attentions"] # (num_layers, num_heads, seq_len, seq_len)
print(attn.shape)
Example Notebooks
For complete tutorials with visualization examples:
Embedding Tutorial: examples/genomic_embeddings/RNA_Embedding_Tutorial.ipynb
Attention Analysis: examples/attention_score_extraction/Attention_Analysis_Tutorial.ipynb
Tip
Embedding Applications:
Sequence Similarity: Use cosine similarity between embeddings to find similar sequences
Clustering: Group sequences by biological function using k-means or hierarchical clustering
Feature Extraction: Use embeddings as input features for traditional ML models
Visualization: Project high-dimensional embeddings to 2D/3D using t-SNE or UMAP
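As a concrete instance of the sequence-similarity use case, cosine similarity between two embedding vectors needs only a dot product and two norms. The sketch below works on plain Python lists with toy vectors; with real embeddings you would pass the rows returned by the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
```

Values close to 1 indicate vectors pointing in nearly the same direction; the score ignores vector length, so it compares direction only.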
Managing Datasets and Models Manually
While the AutoBench and AutoTrain pipelines handle asset downloads automatically, you might need to download models or benchmark datasets manually in certain scenarios, such as:
Pre-loading assets in an environment with limited internet access.
Inspecting the contents of a benchmark dataset.
Scripting custom workflows.
The omnigenbench module provides simple functions for this purpose. These functions download files from the Hugging Face Hub and store them in a local cache for future use, avoiding redundant downloads.
Tip
The first time you download an asset, it might take a while depending on its size and your connection speed. Subsequent calls for the same asset will be nearly instant as it will be loaded directly from your local cache.
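The cache-on-first-use behaviour can be sketched as a check for an existing path before any download is attempted. This is an illustration of the pattern, not OmniGenBench's cache code; `cached_fetch` is hypothetical, and replacing "/" with "--" in the directory name is an assumption based on the cache path shown in the troubleshooting example.

```python
from pathlib import Path


def cached_fetch(name, cache_dir, fetch):
    """Return (path, was_cached), calling fetch(path) only on a cache miss."""
    target = Path(cache_dir) / name.replace("/", "--")
    if target.exists():
        return target, True   # cache hit: no download needed
    target.parent.mkdir(parents=True, exist_ok=True)
    fetch(target)             # cache miss: the caller-supplied function populates the path
    return target, False
```

A real implementation would also need to handle partially written entries, for example by downloading to a temporary name and renaming on success.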
To download a specific benchmark dataset, use the download_benchmark function. Provide the benchmark’s name as an argument.
from omnigenbench import download_benchmark
# Define the name of the benchmark to download
benchmark_name = "RGB"
# Download the dataset from the Hugging Face Hub
local_path = download_benchmark(benchmark_name)
print(f"Benchmark '{benchmark_name}' downloaded successfully to: {local_path}")
Similarly, the download_model function allows you to fetch a pre-trained model. Use the model’s identifier from the Hub.
from omnigenbench import download_model
# Define the model identifier from the Hugging Face Hub
model_id = "OmniGenome-186M-SSP"
# Download the model files
local_path = download_model(model_id)
print(f"Model '{model_id}' downloaded successfully to: {local_path}")
Common Pitfalls & Tips
Task Type Matters
Always use the appropriate task-specific model class for your problem. OmniGenBench provides specialized model classes for different genomic tasks:
# Binary/Multi-class/Multi-label Classification
# Use for: Transcription factor binding, promoter classification, RNA type classification
from omnigenbench import OmniModelForSequenceClassification
model = OmniModelForSequenceClassification("yangheng/OmniGenome-186M", num_labels=2)
# Regression Tasks
# Use for: Expression levels, mRNA degradation rates, variant effect prediction
from omnigenbench import OmniModelForRegression
model = OmniModelForRegression("yangheng/OmniGenome-186M")
# Per-nucleotide Predictions (Token Classification)
# Use for: Splice site detection, secondary structure prediction, methylation sites
from omnigenbench import OmniModelForTokenClassification
model = OmniModelForTokenClassification("yangheng/OmniGenome-186M", num_labels=3)
# RNA Sequence Design
# Use for: Designing RNA sequences that fold into target structures
from omnigenbench import OmniModelForRNADesign
model = OmniModelForRNADesign("yangheng/OmniGenome-186M")
sequences = model.design(structure="(((...)))")
Tip
Choosing the Right Task Type:
Classification: When predicting discrete categories (e.g., high/low expression, present/absent)
Regression: When predicting continuous values (e.g., expression level: 0.5, 2.3, 10.1)
Token Classification: When predicting labels for each position in the sequence
RNA Design: When generating sequences for target secondary structures
Important
RNA Design Returns a List: The RNA design model always returns a list of sequences (up to 25 candidates), never a single sequence. Always handle the output as a list:
# Correct: Handle as list
sequences = model.design(structure="(((...)))")
for seq in sequences:
    print(seq)
# Incorrect: Assuming single sequence
sequence = model.design(structure="(((...)))")  # This is a list!
print(sequence.upper())  # Fails: a list has no .upper() method
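Before requesting a design, it can be worth checking that the target structure string is well formed. The sketch below validates plain dot-bracket notation using only "(", ")" and "."; it is a hypothetical pre-check, not something the design model is documented to require, and it does not cover extended notations such as pseudoknot brackets.

```python
def is_balanced_dot_bracket(structure):
    """Return True if every '(' has a matching ')' and only '(', ')', '.' appear."""
    depth = 0
    for symbol in structure:
        if symbol == "(":
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:      # a ')' appeared before its matching '('
                return False
        elif symbol != ".":
            return False       # character outside plain dot-bracket notation
    return depth == 0


print(is_balanced_dot_bracket("(((...)))"))  # True
print(is_balanced_dot_bracket("((...)"))     # False
```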
ModelHub vs Direct Instantiation
Use ModelHub.load() for quick inference with OmniGenBench-saved fine-tuned models (loads model + tokenizer and restores task context when metadata is present):
from omnigenbench import ModelHub
model = ModelHub.load("yangheng/ogb_tfb_finetuned")
outputs = model.inference("ATCGATCG")
Use direct class instantiation when you need custom configuration or when the HF repo has no OmniGenBench metadata:
# For training or custom configuration
from omnigenbench import OmniModelForSequenceClassification
model = OmniModelForSequenceClassification(
    config_or_model="yangheng/OmniGenome-186M",
    num_labels=919,  # Custom number of labels
    problem_type="multi_label_classification"
)
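In multi-label classification each of the `num_labels` outputs is an independent yes/no decision, so logits are usually passed through a sigmoid and thresholded one by one. The sketch below shows that standard post-processing on toy logits; whether `model.inference()` already applies it, and with what threshold, is not stated here, so treat the function name and the 0.5 default as assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multilabel_predictions(logits, threshold=0.5):
    """Return (binary labels, probabilities) with one independent decision per logit."""
    probabilities = [sigmoid(z) for z in logits]
    labels = [int(p >= threshold) for p in probabilities]
    return labels, probabilities


labels, probabilities = multilabel_predictions([2.0, -1.0, 0.0])
print(labels)  # [1, 0, 1]
```

Unlike softmax, the probabilities here do not sum to 1, so any number of labels can be active for the same sequence.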
Data Format Requirements
Ensure your data has the correct keys:
# Correct format
data = [
    {"sequence": "ATCG", "label": 0},
    {"sequence": "GCTA", "label": 1}
]
# Also accepted (auto-standardized)
data = [
    {"seq": "ATCG", "labels": 0},   # 'seq' -> 'sequence'; 'labels' accepted as the label key
    {"text": "GCTA", "target": 1}   # 'text' -> 'sequence'; 'target' accepted as the label key
]
GPU Memory Management
For large models or long sequences:
from omnigenbench import AutoBench
# Reduce batch size
bench = AutoBench(benchmark="RGB", config_or_model="large_model")
bench.run(batch_size=4)  # Default is often 8-32
# Use gradient checkpointing
from omnigenbench import OmniModelForSequenceClassification
model = OmniModelForSequenceClassification("model", gradient_checkpointing=True)
# Use mixed precision
bench = AutoBench(benchmark="RGB", config_or_model="model", autocast="bf16")
What’s Next?
You’ve now seen the basic workflows in OmniGenBench! To dive deeper, explore these resources:
Core Documentation:
Command Line Interface - Run benchmarking, training, and inference from the terminal without writing code
Architecture & Design Philosophy - Understanding the four abstract base classes (OmniModel, OmniDataset, OmniTokenizer, OmniMetric)
API Reference - Complete API reference for all classes and functions
Detailed Guides (in API Reference):
Trainers - Comprehensive trainer guide (Native, Accelerate, HuggingFace)
Downstream Datasets - Dataset classes, formats, and preprocessing
Downstream Models - Model architectures and task-specific models
CLI Commands - CLI command reference with examples
Quick Reference:
# Model Loading (Recommended)
from omnigenbench import ModelHub
model = ModelHub.load("yangheng/OmniGenome-186M")
# Automated Training (Recommended)
from omnigenbench import AutoTrain
trainer = AutoTrain(dataset="./my_dataset", config_or_model="yangheng/OmniGenome-186M")
trainer.run()
# Dataset Loading
from omnigenbench import OmniDatasetForSequenceClassification, OmniTokenizer
tokenizer = OmniTokenizer.from_pretrained("yangheng/OmniGenome-186M")
dataset = OmniDatasetForSequenceClassification("data.json", tokenizer, max_length=512)
# CLI Commands
# ogb autobench --model yangheng/OmniGenome-186M --benchmark RGB
# ogb autotrain --dataset data --model model
# ogb autoinfer --model model --sequence "ATCG"