Baseline Models (Simple Backbones and Heads)¶

This page documents the simple baseline models and backbones provided in omnigenbench.src.model.baselines. These components are lightweight and useful for quick baselines and ablations across classification, regression, and token-level tasks.

class omnigenbench.src.model.baselines.OmniBPNetBaseline(tokenizer, *args, **kwargs)[source]¶

Bases: OmniModel

A lightweight BPNet-like dilated-convolution baseline.

The model converts token ids to one-hot nucleotide channels (A, C, G, T), applies an initial convolution followed by a stack of exponentially dilated 1D convolutions with residual connections, global-average-pools over positions, and classifies with a sigmoid output layer.

Parameters:

tokenizer (Any) – Tokenizer that can convert tokens to ids for “A”, “C”, “G”, “T”.
num_labels (int) – Number of outputs.
n_filters (int, optional) – Number of channels in convolution blocks, by default 64.
n_dilated_layers (int, optional) – Number of dilated layers, by default 9.
conv1_kernel_size (int, optional) – Kernel size of the first conv, by default 25.
dil_kernel_size (int, optional) – Kernel size of dilated convs, by default 3.
dropout (float, optional) – Dropout after global pooling, by default 0.1.

Pseudocode

X = one_hot(input_ids)  # (B, 4, L)
H = relu(Conv1D(4->C, k=25)(X))
for i in range(n_dilated_layers):
    R = H
    H = relu(DilatedConv1D(C->C, k=3, d=2**i)(H))
    H = H + R
g = GlobalAvgPool1D(H)  # (B, C)
y = sigmoid(Linear(C->num_labels)(Dropout(g)))
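
The dilation schedule in the pseudocode determines how much sequence context each output position can see. The following is a minimal sketch of that arithmetic, assuming stride 1 throughout and dilation 2**i for layer i as written above; `receptive_field` is a hypothetical helper, not part of omnigenbench.

```python
def receptive_field(conv1_kernel_size=25, dil_kernel_size=3, n_dilated_layers=9):
    """Receptive field, in input positions, of the conv stack (stride 1 assumed)."""
    rf = conv1_kernel_size
    for i in range(n_dilated_layers):
        dilation = 2 ** i
        # A kernel of size k with dilation d spans (k - 1) * d + 1 positions,
        # so each layer widens the receptive field by (k - 1) * d.
        rf += (dil_kernel_size - 1) * dilation
    return rf


print(receptive_field())  # 1047 with the documented defaults
```

With the documented defaults, each pooled feature can therefore depend on roughly a kilobase of input, which is worth comparing against the sequence lengths in the target dataset.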

forward(**inputs)[source]¶

Forward pass returning probabilities and last hidden state.

Returns: dict – Keys: logits (probabilities, despite the key name), last_hidden_state (pooled), and labels if provided.

classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)[source]¶

Loads a pre-trained model and tokenizer.

Parameters:

save_directory (str) – Path of the directory containing the saved model.
tokenizer – The tokenizer to use.
map_location – Optional device mapping applied when loading the saved weights.
kwargs – Additional keyword arguments.

Returns:

An instance of OmniModel.

inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)[source]¶: Return thresholded predictions along with confidence scores.

loss_function(logits, labels)[source]¶

Calculates the loss. This method should be implemented by concrete model classes to define how the loss is calculated for their specific task (classification, regression, etc.).

Parameters:

logits (torch.Tensor) – The model’s output logits.
labels (torch.Tensor) – The ground truth labels.

Returns:

torch.Tensor – The calculated loss.

Raises:

NotImplementedError – If the method is not implemented by the subclass.

Example

>>> # In a classification model
>>> loss = model.loss_function(logits, labels)

predict(sequence_or_inputs, **kwargs)[source]¶: Return probabilities for each label.

save_pretrained(save_directory: str, overwrite: bool = True)[source]¶

class omnigenbench.src.model.baselines.OmniBasenjiBaseline(tokenizer, *args, **kwargs)[source]¶

Bases: OmniModel

Basenji-like 1D CNN with dilations adapted for tokenizer-driven inputs.

This baseline maps token ids to A/C/G/T channels, stacks conv+pool blocks, applies dilated residual blocks, then global-average-pools and classifies.
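
The token-id-to-channel mapping described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the vocabulary and its ids are hypothetical, `one_hot_nucleotides` is a made-up helper, and leaving non-ACGT tokens (padding, "N") as all-zero columns is an assumed convention.

```python
def one_hot_nucleotides(input_ids, id_to_channel):
    """Return a 4 x L matrix (rows A, C, G, T) for one sequence of token ids."""
    length = len(input_ids)
    channels = [[0.0] * length for _ in range(4)]
    for pos, token_id in enumerate(input_ids):
        channel = id_to_channel.get(token_id)
        if channel is not None:
            channels[channel][pos] = 1.0
        # Tokens outside A/C/G/T stay all-zero (assumption for this sketch).
    return channels


# Hypothetical vocabulary; a real tokenizer supplies its own ids.
vocab = {"<pad>": 0, "A": 4, "C": 5, "G": 6, "T": 7, "N": 8}
id_to_channel = {vocab[base]: i for i, base in enumerate("ACGT")}

ids = [vocab[ch] for ch in "ACGTN"] + [vocab["<pad>"]]
x = one_hot_nucleotides(ids, id_to_channel)
for base, row in zip("ACGT", x):
    print(base, row)
```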

forward(**inputs)[source]¶

Forward pass returning probabilities and last hidden state.

Returns: dict – Keys: logits (probabilities, despite the key name), last_hidden_state (pooled), and labels if provided.

classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)[source]¶

Loads a pre-trained model and tokenizer.

Parameters:

save_directory (str) – Path of the directory containing the saved model.
tokenizer – The tokenizer to use.
map_location – Optional device mapping applied when loading the saved weights.
kwargs – Additional keyword arguments.

Returns:

An instance of OmniModel.

inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)[source]¶: Return thresholded predictions along with confidence scores.

loss_function(logits, labels)[source]¶

Calculates the loss. This method should be implemented by concrete model classes to define how the loss is calculated for their specific task (classification, regression, etc.).

Parameters:

logits (torch.Tensor) – The model’s output logits.
labels (torch.Tensor) – The ground truth labels.

Returns:

torch.Tensor – The calculated loss.

Raises:

NotImplementedError – If the method is not implemented by the subclass.

Example

>>> # In a classification model
>>> loss = model.loss_function(logits, labels)

predict(sequence_or_inputs, **kwargs)[source]¶: Return probabilities for each label.

save_pretrained(save_directory: str, overwrite: bool = True)[source]¶

class omnigenbench.src.model.baselines.OmniCNNBaseline(tokenizer, *args, **kwargs)[source]¶

Bases: OmniModel

A simple 1D-CNN baseline with global max pooling for multi-label tasks.

This legacy-style model builds an embedding layer followed by multiple convolutional filters with kernel sizes specified in kernel_sizes, concatenates the features, pools them, and applies a linear classifier with a sigmoid for multi-label probabilities.

Parameters:

tokenizer (Any) – Tokenizer providing vocab_size or get_vocab() and pad_token_id.
num_labels (int) – Number of output labels.
embed_dim (int, optional) – Token embedding dimension, by default 128.
num_filters (int, optional) – Number of output channels per convolutional filter, by default 128.
kernel_sizes (tuple[int, ...], optional) – Convolution kernel sizes, by default (3, 5, 7).
dropout (float, optional) – Dropout probability, by default 0.1.

Inputs

input_ids (torch.LongTensor of shape (batch_size, seq_len))
attention_mask (torch.LongTensor of shape (batch_size, seq_len), optional)
labels (torch.FloatTensor of shape (batch_size, num_labels), optional)

Outputs

dict –
- logits: (batch_size, num_labels) in [0,1] after sigmoid.
- last_hidden_state: pooled hidden vector (batch_size, hidden_size).
- labels: passthrough of input labels if provided.

Notes

The loss is binary cross-entropy (nn.BCELoss), which expects float labels in {0, 1}.
See predict() and inference() for convenience wrappers.

Pseudocode¶

x = Embedding(input_ids)
x = Dropout(x)
feats = [ReLU(Conv1D_k(x_T)) for k in kernel_sizes]
feats = concat(feats, dim=channels).T
pooled = masked_global_max_pool(feats, attention_mask)
logits = sigmoid(Linear(pooled))
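
The masked_global_max_pool step is what keeps padding positions from influencing the pooled vector. A minimal pure-Python sketch for a single sequence follows; it assumes features are laid out as one row per position and that the mask uses 1 for valid tokens, as documented for attention_mask. The real implementation presumably operates on batched tensors.

```python
def masked_global_max_pool(feats, attention_mask):
    """Per-channel max over valid positions.

    feats: list of L rows, each with C feature values.
    attention_mask: list of L values, 1 for valid tokens and 0 for padding.
    """
    valid = [row for row, keep in zip(feats, attention_mask) if keep]
    if not valid:
        raise ValueError("attention_mask has no valid positions")
    n_channels = len(valid[0])
    return [max(row[c] for row in valid) for c in range(n_channels)]


feats = [[0.2, 1.5], [0.9, 0.1], [7.0, 9.0]]  # last row is a padding position
mask = [1, 1, 0]
print(masked_global_max_pool(feats, mask))  # [0.9, 1.5]
```

Without the mask, the large values at the padding position would dominate both channels.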

forward(**inputs)[source]¶

Forward pass.

Parameters:

input_ids (torch.LongTensor) – Token ids of shape (batch_size, seq_len).
attention_mask (torch.LongTensor, optional) – Mask of shape (batch_size, seq_len) with 1 for valid tokens.
labels (torch.FloatTensor, optional) – Multi-hot label matrix (batch_size, num_labels).

Returns:

dict – Dictionary with logits, last_hidden_state, and optional labels.

classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)[source]¶

Loads a pre-trained model and tokenizer.

Parameters:

save_directory (str) – Path of the directory containing the saved model.
tokenizer – The tokenizer to use.
map_location – Optional device mapping applied when loading the saved weights.
kwargs – Additional keyword arguments.

Returns:

An instance of OmniModel.

inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)[source]¶

Return binary predictions with a threshold.

Parameters: threshold (float, optional) – Decision threshold in [0,1] applied to probabilities, by default 0.5.

loss_function(logits, labels)[source]¶

Binary cross-entropy loss with logits already in probability space.

Notes

This legacy baseline uses BCELoss assuming inputs were passed through a sigmoid already. Prefer BCEWithLogitsLoss for numerical stability in new code.
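
The stability concern in the note can be illustrated without PyTorch. The sketch below compares binary cross-entropy computed from a sigmoid probability with the usual logit-space formulation; the function names and the clamp value `eps` are choices made for this illustration and do not reproduce the exact behaviour of nn.BCELoss or nn.BCEWithLogitsLoss.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_prob(p, y, eps=1e-12):
    """BCE on a probability; p is clamped so that log(0) never occurs."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_with_logits(x, y):
    """BCE on a raw score x, using max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


# Moderate scores: both formulations agree.
print(bce_from_prob(sigmoid(2.0), 1.0), bce_with_logits(2.0, 1.0))

# Confidently wrong prediction: sigmoid(40.0) rounds to 1.0 in float64, so the
# probability-space loss saturates at the clamp, while the logit-space loss
# still reports the true magnitude.
print(bce_from_prob(sigmoid(40.0), 0.0))  # about 27.6
print(bce_with_logits(40.0, 0.0))         # 40.0
```

The saturated value also means the gradient signal for badly misclassified examples is lost, which is the practical reason to prefer the logit-space loss in new code.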

predict(sequence_or_inputs, **kwargs)[source]¶

Return probabilities for each label.

This calls the internal convenience routine to accept either raw sequences or already-tokenized inputs, then returns the probabilities.

Returns: dict – Keys: predictions (alias of logits), logits, last_hidden_state.

save_pretrained(save_directory: str, overwrite: bool = True)[source]¶

class omnigenbench.src.model.baselines.OmniDeepSTARRBaseline(tokenizer, *args, **kwargs)[source]¶

Bases: OmniModel

DeepSTARR-like CNN with global pooling and MLP head adapted for tokenizer inputs.

forward(**inputs)[source]¶

Forward pass returning probabilities and last hidden state.

Returns: dict – Keys: logits (probabilities, despite the key name), last_hidden_state (pooled), and labels if provided.

classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)[source]¶

Loads a pre-trained model and tokenizer.

Parameters:

save_directory (str) – Path of the directory containing the saved model.
tokenizer – The tokenizer to use.
map_location – Optional device mapping applied when loading the saved weights.
kwargs – Additional keyword arguments.

Returns:

An instance of OmniModel.

inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)[source]¶: Return thresholded predictions along with confidence scores.

loss_function(logits, labels)[source]¶

Calculates the loss. This method should be implemented by concrete model classes to define how the loss is calculated for their specific task (classification, regression, etc.).

Parameters:

logits (torch.Tensor) – The model’s output logits.
labels (torch.Tensor) – The ground truth labels.

Returns:

torch.Tensor – The calculated loss.

Raises:

NotImplementedError – If the method is not implemented by the subclass.

Example

>>> # In a classification model
>>> loss = model.loss_function(logits, labels)

predict(sequence_or_inputs, **kwargs)[source]¶: Return probabilities for each label.

save_pretrained(save_directory: str, overwrite: bool = True)[source]¶

class omnigenbench.src.model.baselines.OmniGenericBaseline(tokenizer, *args, **kwargs)[source]¶

Bases: OmniModel

Generic baseline model wiring a simple backbone to a selected head.

This class provides a flexible way to create small baselines for different tasks by choosing among CNN/LSTM/BPNet-style backbones and head types.

Parameters:

tokenizer (Any) – Tokenizer instance used for vocabulary size and padding index. Must not be None.
backbone_type (str) – One of {“cnn”, “rnn”, “bpnet”, “deepstarr”, “basenji”}.
task_name (str) – One of {“multilabel_classification”, “classification”, “regression”, “token_classification”, “token_regression”}.
num_labels (int) – Number of outputs for the head.
label2id (dict, optional) – Mapping from label string to id; default builds {“0”:0, …}.
**kwargs – Other keyword arguments are forwarded to the chosen backbone configuration (e.g., embed_dim, hidden_dim, n_filters).

property device¶: Return the actual device of the model parameters rather than a cached value.

forward(**inputs)[source]¶

Forward pass through backbone and task head.

Returns: dict – Always includes logits and passthrough labels if present. Includes last_hidden_state and optionally sequence_output when the backbone provides it (e.g., for token-level heads).

classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)[source]¶: Load model, config, and tokenizer from a directory.

inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)[source]¶

Convenience inference wrapper producing final predictions.

Behavior depends on task_name:

- multilabel_classification: thresholded sigmoid probabilities.
- classification: argmax over softmax probabilities.
- others: returns post-processed outputs.
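
The task-dependent behavior can be sketched for a single example as follows. This is an illustration only: `postprocess` is a hypothetical function, it assumes raw unnormalised scores as input, treating a probability equal to the threshold as positive is an assumed convention, and other tasks are simply passed through here.

```python
import math


def postprocess(scores, task_name, threshold=0.5):
    """Turn raw scores for one example into a final prediction."""
    if task_name == "multilabel_classification":
        # Independent sigmoid per label, then threshold.
        probs = [1.0 / (1.0 + math.exp(-z)) for z in scores]
        return [int(p >= threshold) for p in probs]
    if task_name == "classification":
        # Softmax with the max subtracted for numerical stability, then argmax.
        shift = max(scores)
        exps = [math.exp(z - shift) for z in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        return max(range(len(probs)), key=probs.__getitem__)
    # Regression and token-level tasks: returned unchanged in this sketch.
    return scores


scores = [2.0, -1.0, 0.3]
print(postprocess(scores, "multilabel_classification"))  # [1, 0, 1]
print(postprocess(scores, "classification"))             # 0
```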

loss_function(logits, labels)[source]¶: Compute task-appropriate training loss.

predict(sequence_or_inputs, **kwargs)[source]¶: Return task-appropriate probabilities from head postprocess.

save_pretrained(save_directory: str, overwrite: bool = True)[source]¶

Save model weights, config, tokenizer and metadata to a directory.

Files¶

config.json: Backbone/head configuration.
pytorch_model.bin: Model weights.
tokenizer: Saved via tokenizer’s save_pretrained when available.
metadata.json: Lightweight metadata (class and library).

class omnigenbench.src.model.baselines.OmniRNNBaseline(tokenizer, *args, **kwargs)[source]¶

Bases: OmniModel

A simple BiLSTM baseline for sequence modeling.

Embeds tokens, applies a multi-layer LSTM (optionally bidirectional), then mean-pools over valid tokens and classifies with a sigmoid layer for multi-label probabilities.

Parameters:

tokenizer (Any) – Tokenizer with vocab_size and pad_token_id.
num_labels (int) – Number of output labels.
embed_dim (int, optional) – Embedding dimension, by default 128.
hidden_dim (int, optional) – LSTM hidden size per direction, by default 256.
num_layers (int, optional) – Number of LSTM layers, by default 1.
bidirectional (bool, optional) – Whether to use a bidirectional LSTM, by default True.
dropout (float, optional) – Dropout probability, by default 0.1.

Outputs

dict –
- logits: probabilities after sigmoid (batch_size, num_labels).
- last_hidden_state: pooled vector (batch_size, hidden_size).
- labels: passthrough if provided.

Pseudocode

x = Embedding(input_ids)
x = Dropout(x)
seq_out, _ = LSTM(x)
pooled = masked_mean(seq_out, attention_mask)
logits = sigmoid(Linear(pooled))
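
The masked_mean step averages LSTM outputs over valid tokens only, so that padding does not dilute the pooled vector. A minimal single-sequence sketch in plain Python, assuming one row of features per position and a mask with 1 for valid tokens; the library presumably does the equivalent on batched tensors.

```python
def masked_mean(seq_out, attention_mask):
    """Per-feature mean over positions where attention_mask is 1."""
    total = [0.0] * len(seq_out[0])
    count = 0
    for row, keep in zip(seq_out, attention_mask):
        if keep:
            count += 1
            for c, value in enumerate(row):
                total[c] += value
    if count == 0:
        raise ValueError("attention_mask has no valid positions")
    return [t / count for t in total]


seq_out = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
print(masked_mean(seq_out, mask))       # [2.0, 3.0]
print(masked_mean(seq_out, [1, 1, 1]))  # padding included, for contrast
```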

forward(**inputs)[source]¶

Forward pass producing multi-label probabilities and hidden state.

Returns: dict – Dictionary with keys logits, last_hidden_state, optional labels.

classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)[source]¶

Loads a pre-trained model and tokenizer.

Parameters:

save_directory (str) – Path of the directory containing the saved model.
tokenizer – The tokenizer to use.
map_location – Optional device mapping applied when loading the saved weights.
kwargs – Additional keyword arguments.

Returns:

An instance of OmniModel.

inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)[source]¶: Return binary predictions using the specified probability threshold.

loss_function(logits, labels)[source]¶: Binary cross-entropy loss on probabilities.

predict(sequence_or_inputs, **kwargs)[source]¶: Return probabilities for each label (alias of forward logits).

save_pretrained(save_directory: str, overwrite: bool = True)[source]¶

omnigenbench.src.model.baselines.create_baseline(tokenizer, *, backbone_type: str, task_name: str, num_labels: int, label2id: dict | None = None, **kwargs)[source]¶

Factory for building baselines via OmniGenericBaseline.

Example

model = create_baseline(
    tokenizer,
    backbone_type="deepstarr",
    task_name="multilabel_classification",
    num_labels=8,
)
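
Because backbone_type and task_name are plain strings, a typo would only surface when the model is built. One possible pre-check is sketched below, using the value sets listed for OmniGenericBaseline; `check_baseline_args` is a hypothetical helper, and whether create_baseline performs similar validation itself is not stated here.

```python
# Allowed values as documented for OmniGenericBaseline.
BACKBONE_TYPES = {"cnn", "rnn", "bpnet", "deepstarr", "basenji"}
TASK_NAMES = {
    "multilabel_classification",
    "classification",
    "regression",
    "token_classification",
    "token_regression",
}


def check_baseline_args(backbone_type, task_name, num_labels):
    """Raise ValueError early if the arguments fall outside the documented sets."""
    if backbone_type not in BACKBONE_TYPES:
        raise ValueError(
            f"unknown backbone_type {backbone_type!r}; expected one of {sorted(BACKBONE_TYPES)}"
        )
    if task_name not in TASK_NAMES:
        raise ValueError(
            f"unknown task_name {task_name!r}; expected one of {sorted(TASK_NAMES)}"
        )
    if num_labels < 1:
        raise ValueError("num_labels must be a positive integer")


check_baseline_args("deepstarr", "multilabel_classification", 8)  # passes silently
```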