Baseline Models (Simple Backbones and Heads)
This page documents the simple baseline models and backbones provided in
omnigenbench.src.model.baselines. These components are lightweight and
useful for quick baselines and ablations across classification, regression,
and token-level tasks.
- class omnigenbench.src.model.baselines.OmniBPNetBaseline(tokenizer, *args, **kwargs)
Bases: OmniModel
A lightweight BPNet-like dilated-convolution baseline.
The model converts token ids to one-hot nucleotides (A,C,G,T), applies a first convolution and a stack of exponentially-dilated 1D convolutions with residual connections, averages globally, then classifies with a sigmoid.
- Parameters:
tokenizer (Any) – Tokenizer that can convert tokens to ids for “A”, “C”, “G”, “T”.
num_labels (int) – Number of outputs.
n_filters (int, optional) – Number of channels in convolution blocks, by default 64.
n_dilated_layers (int, optional) – Number of dilated layers, by default 9.
conv1_kernel_size (int, optional) – Kernel size of the first conv, by default 25.
dil_kernel_size (int, optional) – Kernel size of dilated convs, by default 3.
dropout (float, optional) – Dropout after global pooling, by default 0.1.
Pseudocode
----------
    X = one_hot(input_ids)                          # (B, 4, L)
    H = relu(Conv1D(4->C, k=25)(X))
    for i in range(n_layers):
        R = H
        H = relu(DilatedConv1D(C->C, k=3, d=2**i)(H))
        H = H + R
    g = GlobalAvgPool1D(H)                          # (B, C)
    y = sigmoid(Linear(C->num_labels)(Dropout(g)))
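As a rough aid for choosing n_dilated_layers, the receptive field of the stack in the pseudocode can be estimated by hand. The sketch below assumes stride-1 convolutions and the dilation schedule d = 2**i from the pseudocode; the helper name receptive_field is hypothetical and not part of omnigenbench.

```python
def receptive_field(conv1_kernel_size=25, dil_kernel_size=3, n_dilated_layers=9):
    """Estimate how many input positions influence one output position.

    Assumes stride-1 convolutions and dilation 2**i for layer i.
    Residual connections do not widen the receptive field.
    """
    rf = conv1_kernel_size
    for i in range(n_dilated_layers):
        dilation = 2 ** i
        # A kernel of size k with dilation d spans (k - 1) * d + 1 positions,
        # so it extends the receptive field by (k - 1) * d.
        rf += (dil_kernel_size - 1) * dilation
    return rf


default_rf = receptive_field()            # documented default hyperparameters
no_dilation_rf = receptive_field(n_dilated_layers=0)
```

Under these assumptions the documented defaults give 25 + 2 * (2**9 - 1) = 1047 positions, so each pooled feature can depend on roughly a kilobase of sequence context.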
- forward(**inputs)
Forward pass returning probabilities and last hidden state.
- Returns:
dict – Keys: logits (probabilities), last_hidden_state (pooled), and optionally labels.
- classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)
Loads a pre-trained model and tokenizer.
- Parameters:
config_or_model – The name or path of the pre-trained model.
tokenizer – The tokenizer to use.
args – Additional positional arguments.
kwargs – Additional keyword arguments.
- Returns:
An instance of OmniModel.
- inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)
Return thresholded predictions along with confidence scores.
- loss_function(logits, labels)
Calculates the loss. This method should be implemented by concrete model classes to define how the loss is calculated for their specific task (classification, regression, etc.).
- Parameters:
logits (torch.Tensor) – The model’s output logits.
labels (torch.Tensor) – The ground truth labels.
- Returns:
torch.Tensor – The calculated loss.
- Raises:
NotImplementedError – If the method is not implemented by the subclass.
Example
>>> # In a classification model
>>> loss = model.loss_function(logits, labels)
- class omnigenbench.src.model.baselines.OmniBasenjiBaseline(tokenizer, *args, **kwargs)
Bases: OmniModel
Basenji-like 1D CNN with dilations adapted for tokenizer-driven inputs.
This baseline maps token ids to A/C/G/T channels, stacks conv+pool blocks, applies dilated residual blocks, then global-average-pools and classifies.
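The mapping from token ids to A/C/G/T channels can be illustrated without any deep-learning dependency. This is a minimal sketch, not the library's implementation: the vocabulary dict stands in for a real tokenizer, the helper name one_hot_channels is hypothetical, and treating every other id (padding, N, special tokens) as an all-zero column is an assumption.

```python
NUCLEOTIDES = ("A", "C", "G", "T")


def one_hot_channels(input_ids, token_to_id):
    """Map a list of token ids to 4 channels of length L (channel-major)."""
    # Channel index for each nucleotide id; any other id is absent from the table.
    channel_of = {token_to_id[n]: c for c, n in enumerate(NUCLEOTIDES)}
    channels = [[0.0] * len(input_ids) for _ in NUCLEOTIDES]
    for pos, tok in enumerate(input_ids):
        c = channel_of.get(tok)
        if c is not None:
            channels[c][pos] = 1.0
    return channels


# Hypothetical vocabulary; a real tokenizer would supply these ids.
vocab = {"<pad>": 0, "A": 4, "C": 5, "G": 6, "T": 7}
ids = [vocab[t] for t in "ACGT"] + [vocab["<pad>"]]
x = one_hot_channels(ids, vocab)  # 4 rows (A, C, G, T), 5 columns
```

Looking up the four ids once and building the channels from that table keeps the encoding independent of where the tokenizer happens to place the nucleotides in its vocabulary.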
- forward(**inputs)
Forward pass returning probabilities and last hidden state.
- Returns:
dict – Keys: logits (probabilities), last_hidden_state (pooled), and optionally labels.
- classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)
Loads a pre-trained model and tokenizer.
- Parameters:
config_or_model – The name or path of the pre-trained model.
tokenizer – The tokenizer to use.
args – Additional positional arguments.
kwargs – Additional keyword arguments.
- Returns:
An instance of OmniModel.
- inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)
Return thresholded predictions along with confidence scores.
- loss_function(logits, labels)
Calculates the loss. This method should be implemented by concrete model classes to define how the loss is calculated for their specific task (classification, regression, etc.).
- Parameters:
logits (torch.Tensor) – The model’s output logits.
labels (torch.Tensor) – The ground truth labels.
- Returns:
torch.Tensor – The calculated loss.
- Raises:
NotImplementedError – If the method is not implemented by the subclass.
Example
>>> # In a classification model
>>> loss = model.loss_function(logits, labels)
- class omnigenbench.src.model.baselines.OmniCNNBaseline(tokenizer, *args, **kwargs)
Bases: OmniModel
A simple 1D-CNN baseline with global max pooling for multi-label tasks.
This legacy-style model builds an embedding layer followed by multiple convolutional filters with kernel sizes specified in kernel_sizes, concatenates the features, pools them, and applies a linear classifier with a sigmoid for multi-label probabilities.
- Parameters:
tokenizer (Any) – Tokenizer providing vocab_size or get_vocab(), and pad_token_id.
num_labels (int) – Number of output labels.
embed_dim (int, optional) – Token embedding dimension, by default 128.
num_filters (int, optional) – Number of output channels per convolutional filter, by default 128.
kernel_sizes (tuple[int, ...], optional) – Convolution kernel sizes, by default (3, 5, 7).
dropout (float, optional) – Dropout probability, by default 0.1.
Inputs
------
input_ids (torch.LongTensor of shape (batch_size, seq_len))
attention_mask (torch.LongTensor of shape (batch_size, seq_len), optional)
labels (torch.FloatTensor of shape (batch_size, num_labels), optional)
Outputs
-------
dict –
logits: (batch_size, num_labels), in [0, 1] after sigmoid.
last_hidden_state: pooled hidden vector (batch_size, hidden_size).
labels: passthrough of input labels if provided.
Notes
The loss is binary cross-entropy (nn.BCELoss), which expects float labels in {0, 1}.
See predict() and inference() for convenience wrappers.
Pseudocode
    x = Embedding(input_ids)
    x = Dropout(x)
    feats = [Conv1D_k(ReLU)(x_T) for k in kernel_sizes]
    feats = concat(feats, dim=channels).T
    pooled = masked_global_max_pool(feats, attention_mask)
    logits = sigmoid(Linear(pooled))
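The masked_global_max_pool step in the pseudocode is the part most easily gotten wrong, since padded positions must not contribute to the maximum. The following is a minimal pure-Python sketch for a single sequence, not the library's code; the fallback to zeros for a fully masked sequence is an assumption.

```python
def masked_global_max_pool(feats, attention_mask):
    """Max over valid positions for each channel of one sequence.

    feats: list of length seq_len, each entry a list of channel values.
    attention_mask: list of length seq_len, 1 for valid tokens, 0 for padding.
    """
    n_channels = len(feats[0])
    pooled = [float("-inf")] * n_channels
    for row, keep in zip(feats, attention_mask):
        if not keep:
            continue  # padded positions never win the max
        for c, value in enumerate(row):
            pooled[c] = max(pooled[c], value)
    # Assumption: a fully masked sequence falls back to zeros rather than -inf.
    return [0.0 if p == float("-inf") else p for p in pooled]


feats = [[0.2, -1.0], [0.9, -0.5], [5.0, 7.0]]  # last position is padding
mask = [1, 1, 0]
pooled = masked_global_max_pool(feats, mask)
```

Skipping masked positions, rather than multiplying them by zero, matters when all valid activations in a channel are negative: a zeroed padding position would otherwise become the maximum.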
- forward(**inputs)
Forward pass.
- Parameters:
input_ids (torch.LongTensor) – Token ids of shape (batch_size, seq_len).
attention_mask (torch.LongTensor, optional) – Mask of shape (batch_size, seq_len) with 1 for valid tokens.
labels (torch.FloatTensor, optional) – Multi-hot label matrix of shape (batch_size, num_labels).
- Returns:
dict – Dictionary with logits, last_hidden_state, and optionally labels.
- classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)
Loads a pre-trained model and tokenizer.
- Parameters:
config_or_model – The name or path of the pre-trained model.
tokenizer – The tokenizer to use.
args – Additional positional arguments.
kwargs – Additional keyword arguments.
- Returns:
An instance of OmniModel.
- inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)
Return binary predictions with a threshold.
- Parameters:
threshold (float, optional) – Decision threshold in [0, 1] applied to probabilities, by default 0.5.
- loss_function(logits, labels)
Binary cross-entropy loss on logits that are already in probability space.
Notes
This legacy baseline uses BCELoss, assuming its inputs have already been passed through a sigmoid. Prefer BCEWithLogitsLoss for numerical stability in new code.
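The stability concern in the note can be seen with plain floating-point arithmetic. The sketch below is illustrative only and does not call PyTorch: both function names are hypothetical, and the clamp of the log terms at -100 is an assumption chosen to resemble the guard described in the PyTorch BCELoss documentation.

```python
import math


def bce_on_probability(logit, label):
    """BCE computed from a sigmoid output, with log terms clamped at -100."""
    p = 1.0 / (1.0 + math.exp(-logit))

    def clamped_log(v):
        return max(math.log(v), -100.0) if v > 0.0 else -100.0

    return -(label * clamped_log(p) + (1.0 - label) * clamped_log(1.0 - p))


def bce_on_logit(logit, label):
    """Algebraically equivalent BCE evaluated directly on the logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Moderate logit: both formulations agree.
moderate = (bce_on_probability(2.0, 1.0), bce_on_logit(2.0, 1.0))
# Confidently wrong prediction: the sigmoid saturates to exactly 1.0.
extreme = (bce_on_probability(40.0, 0.0), bce_on_logit(40.0, 0.0))
```

For the moderate logit the two values match. For a logit of 40 with label 0, the sigmoid rounds to exactly 1.0 in double precision, so the probability-space version can only return its clamp value, while the logit-space version returns a loss of about 40 and keeps a meaningful gradient signal.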
- class omnigenbench.src.model.baselines.OmniDeepSTARRBaseline(tokenizer, *args, **kwargs)
Bases: OmniModel
DeepSTARR-like CNN with global pooling and MLP head adapted for tokenizer inputs.
- forward(**inputs)
Forward pass returning probabilities and last hidden state.
- Returns:
dict – Keys: logits (probabilities), last_hidden_state (pooled), and optionally labels.
- classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)
Loads a pre-trained model and tokenizer.
- Parameters:
config_or_model – The name or path of the pre-trained model.
tokenizer – The tokenizer to use.
args – Additional positional arguments.
kwargs – Additional keyword arguments.
- Returns:
An instance of OmniModel.
- inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)
Return thresholded predictions along with confidence scores.
- loss_function(logits, labels)
Calculates the loss. This method should be implemented by concrete model classes to define how the loss is calculated for their specific task (classification, regression, etc.).
- Parameters:
logits (torch.Tensor) – The model’s output logits.
labels (torch.Tensor) – The ground truth labels.
- Returns:
torch.Tensor – The calculated loss.
- Raises:
NotImplementedError – If the method is not implemented by the subclass.
Example
>>> # In a classification model
>>> loss = model.loss_function(logits, labels)
- class omnigenbench.src.model.baselines.OmniGenericBaseline(tokenizer, *args, **kwargs)
Bases: OmniModel
Generic baseline model wiring a simple backbone to a selected head.
This class provides a flexible way to create small baselines for different tasks by choosing among CNN/LSTM/BPNet-style backbones and head types.
- Parameters:
tokenizer (Any) – Tokenizer instance used for vocabulary size and padding index. Must not be None.
backbone_type (str) – One of {“cnn”, “rnn”, “bpnet”, “deepstarr”, “basenji”}.
task_name (str) – One of {“multilabel_classification”, “classification”, “regression”, “token_classification”, “token_regression”}.
num_labels (int) – Number of outputs for the head.
label2id (dict, optional) – Mapping from label string to id; default builds {“0”:0, …}.
Other keyword arguments are forwarded to the chosen backbone configuration (e.g. embed_dim, hidden_dim, n_filters, etc.).
- property device
Return the actual device of the model parameters rather than a cached value.
- forward(**inputs)
Forward pass through backbone and task head.
- Returns:
dict – Always includes logits, and passes through labels if present. Includes last_hidden_state and optionally sequence_output when the backbone provides it (e.g., for token-level heads).
- classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)
Load model, config, and tokenizer from a directory.
- inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)
Convenience inference wrapper producing final predictions.
Behavior depends on task_name:
- multilabel_classification: thresholded sigmoid probabilities.
- classification: argmax over softmax probabilities.
- others: returns post-processed outputs.
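The task-dependent behavior listed for inference can be sketched as a small dispatch on task_name. This is a hedged illustration for a single example, not the library's implementation: the function name postprocess is hypothetical, the inputs are assumed to be raw scores, and using >= for the threshold comparison is an assumption.

```python
import math


def postprocess(scores, task_name, threshold=0.5):
    """Turn one example's raw scores into a prediction, by task type."""
    if task_name == "multilabel_classification":
        # Independent sigmoid per label, then a per-label decision.
        probs = [1.0 / (1.0 + math.exp(-s)) for s in scores]
        return [1 if p >= threshold else 0 for p in probs]
    if task_name == "classification":
        shift = max(scores)  # subtract the max for a stable softmax
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Index of the most probable class.
        return max(range(len(probs)), key=probs.__getitem__)
    # Assumption: other tasks return their outputs unchanged in this sketch.
    return list(scores)


multilabel = postprocess([2.0, -1.0, 0.0], "multilabel_classification")
single = postprocess([0.1, 2.5, -3.0], "classification")
```

The multi-label branch can switch on several labels at once because each label is scored independently, whereas the classification branch always returns exactly one class index.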
- predict(sequence_or_inputs, **kwargs)
Return task-appropriate probabilities from the head's post-processing step.
- save_pretrained(save_directory: str, overwrite: bool = True)
Save model weights, config, tokenizer and metadata to a directory.
Files
- config.json: Backbone/head configuration.
- pytorch_model.bin: Model weights.
- tokenizer: Saved via the tokenizer's save_pretrained when available.
- metadata.json: Lightweight metadata (class and library).
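Before loading a saved baseline it can be handy to check that a directory contains the files listed above. The helper below is a hypothetical sketch that only uses the standard library; it does not call omnigenbench, and it skips the tokenizer entry because its on-disk form depends on the tokenizer.

```python
import json
import tempfile
from pathlib import Path

# File names taken from the documented list; tokenizer files are not checked.
EXPECTED_FILES = ("config.json", "pytorch_model.bin", "metadata.json")


def missing_files(save_directory):
    """Return the documented files that are absent from save_directory."""
    root = Path(save_directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    # Write a stand-in config only; the key used here is made up for the demo.
    (Path(tmp) / "config.json").write_text(json.dumps({"demo": True}))
    report = missing_files(tmp)
```

An empty list from missing_files suggests the directory has the expected layout; it says nothing about whether the contents are valid.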
- class omnigenbench.src.model.baselines.OmniRNNBaseline(tokenizer, *args, **kwargs)
Bases: OmniModel
A simple BiLSTM baseline for sequence modeling.
Embeds tokens, applies a multi-layer LSTM (optionally bidirectional), then mean-pools over valid tokens and classifies with a sigmoid layer for multi-label probabilities.
- Parameters:
tokenizer (Any) – Tokenizer with vocab_size and pad_token_id.
num_labels (int) – Number of output labels.
embed_dim (int, optional) – Embedding dimension, by default 128.
hidden_dim (int, optional) – LSTM hidden size per direction, by default 256.
num_layers (int, optional) – Number of LSTM layers, by default 1.
bidirectional (bool, optional) – Whether to use a bidirectional LSTM, by default True.
dropout (float, optional) – Dropout probability, by default 0.1.
Outputs
-------
dict –
logits: probabilities after sigmoid, (batch_size, num_labels).
last_hidden_state: pooled vector (batch_size, hidden_size).
labels: passthrough if provided.
Pseudocode
----------
    x = Embedding(input_ids)
    x = Dropout(x)
    seq_out, _ = LSTM(x)
    pooled = masked_mean(seq_out, attention_mask)
    logits = sigmoid(Linear(pooled))
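The masked_mean step averages LSTM outputs over valid tokens only, so that padding does not dilute the pooled vector. A minimal pure-Python sketch for one sequence follows; it is not the library's code, and dividing by 1 for an all-padding sequence is an assumption.

```python
def masked_mean(seq_out, attention_mask):
    """Average hidden vectors over the valid positions of one sequence.

    seq_out: list of length seq_len, each entry a list of hidden values.
    attention_mask: list of length seq_len, 1 for valid tokens, 0 for padding.
    """
    hidden = len(seq_out[0])
    sums = [0.0] * hidden
    n_valid = 0
    for row, keep in zip(seq_out, attention_mask):
        if keep:
            n_valid += 1
            for h, value in enumerate(row):
                sums[h] += value
    # Assumption: guard against an all-padding sequence by dividing by 1.
    denom = max(n_valid, 1)
    return [s / denom for s in sums]


seq_out = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]  # last step is padding
pooled = masked_mean(seq_out, [1, 1, 0])
```

Dividing by the number of valid tokens instead of seq_len keeps the pooled vector comparable across sequences of different lengths within a padded batch.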
- forward(**inputs)
Forward pass producing multi-label probabilities and hidden state.
- Returns:
dict – Dictionary with keys logits, last_hidden_state, and optionally labels.
- classmethod from_pretrained(save_directory: str, tokenizer=None, map_location=None, **kwargs)
Loads a pre-trained model and tokenizer.
- Parameters:
config_or_model – The name or path of the pre-trained model.
tokenizer – The tokenizer to use.
args – Additional positional arguments.
kwargs – Additional keyword arguments.
- Returns:
An instance of OmniModel.
- inference(sequence_or_inputs, threshold: float = 0.5, **kwargs)
Return binary predictions using the specified probability threshold.
- omnigenbench.src.model.baselines.create_baseline(tokenizer, *, backbone_type: str, task_name: str, num_labels: int, label2id: dict | None = None, **kwargs)
Factory for building baselines via OmniGenericBaseline.
Example
    model = create_baseline(
        tokenizer,
        backbone_type="deepstarr",
        task_name="multilabel_classification",
        num_labels=8,
    )