Downstream Datasets

Note

You are viewing the API reference documentation.

This page provides detailed API documentation for the dataset classes. For a comprehensive feature guide with complete examples, see ../OMNIDATASET_FEATURES.

  • Quick Reference: See below for class signatures

  • Complete Guide: ../OMNIDATASET_FEATURES (87KB guide with 9 sections)

  • Design Philosophy: Architecture & Design Philosophy (understanding the OmniDataset abstraction)

This module provides templated dataset processing classes that inherit from the abstract OmniDataset class, the base class for dataset handling in the OmniGenBench framework.

Dataset Categories:

  • OmniDatasetForSequenceClassification: Sequence classification tasks (e.g., promoter prediction)

  • OmniDatasetForSequenceRegression: Sequence regression tasks (e.g., translation efficiency)

  • OmniDatasetForTokenClassification: Token (nucleotide) classification tasks (e.g., transcription factor binding (TFB) prediction)

  • OmniDatasetForTokenRegression: Token (nucleotide) regression tasks

  • OmniDatasetForMultiLabelClassification: Multi-label classification tasks

OmniDataset

Specialized dataset classes for the OmniGenBench framework.

This module provides specialized dataset classes for various genomic tasks, all inheriting from the abstract OmniDataset. These classes handle data preparation for token classification, sequence classification, multi-label classification, token regression, and sequence regression, integrating with tokenizers and managing metadata.

class omnigenbench.src.dataset.omni_dataset.OmniDatasetForMultiLabelClassification(dataset_name_or_path, tokenizer, max_length=None, label_indices=None, **kwargs)[source]

Bases: OmniDataset

Dataset class for multi-label classification tasks in genomics.

This class extends OmniDataset to prepare input sequences and their corresponding multi-label targets. It’s designed for tasks where each sequence can belong to multiple categories simultaneously, such as transcription factor binding prediction where a DNA sequence can bind to multiple transcription factors.

Variables:
  • metadata – Dictionary containing dataset metadata including library information

  • label_indices – Optional list of label indices to select specific labels from the dataset

prepare_input(instance, **kwargs)[source]

Prepare a single data instance for multi-label classification.

This method handles both string sequences and dictionary instances containing sequence and multi-label information. It tokenizes the input sequence and prepares multi-hot encoded labels for classification.

Parameters:
  • instance – A single data instance. Can be a string representing the sequence or a dictionary with ‘seq’/‘sequence’ and ‘labels’/‘label’ keys. For multi-label tasks, labels should be a list or array of binary values.

  • **kwargs – Additional keyword arguments for tokenization, such as ‘padding’ and ‘truncation’.

Returns:

dict

A dictionary of tokenized inputs, including ‘input_ids’, ‘attention_mask’, and ‘labels’ (tensor of multi-hot encoded labels).

Raises:

Exception – If the input instance format is unknown or if a dictionary instance does not contain a ‘seq’ or ‘sequence’ key.

Example

For an instance like:

    {
        "sequence": "ACGTAGCTAGCTAGCTAGC…",
        "label": [0, 1, 0, 0, 1, …, 0],  # Multi-hot encoded labels for TF binding
    }

the method returns tokenized inputs with the labels as a float tensor for multi-label classification.
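The instance handling described above can be illustrated without the library. The following is a minimal sketch under stated assumptions, not the OmniGenBench implementation: the helper name `normalize_multilabel_instance` is hypothetical, tokenization is left out, labels are returned as a plain list of floats rather than a tensor, and `ValueError`/`TypeError` stand in for the generic `Exception` listed under Raises.

```python
def normalize_multilabel_instance(instance, label_indices=None):
    """Return (sequence, labels) from a string or dict instance.

    Hypothetical helper: mirrors the documented input formats only.
    """
    if isinstance(instance, str):
        # A bare string carries a sequence but no labels.
        return instance, None
    if isinstance(instance, dict):
        # Accept either documented key for the sequence.
        sequence = instance.get("seq", instance.get("sequence"))
        if sequence is None:
            raise ValueError("instance needs a 'seq' or 'sequence' key")
        # Accept either documented key for the labels.
        labels = instance.get("labels", instance.get("label"))
        if labels is not None:
            if label_indices is not None:
                # Assumption: label_indices selects a subset of label columns.
                labels = [labels[i] for i in label_indices]
            # Multi-hot targets are stored as floats.
            labels = [float(v) for v in labels]
        return sequence, labels
    raise TypeError(f"unknown instance format: {type(instance).__name__}")


example = {"sequence": "ACGTAGCTAGCT", "label": [0, 1, 0, 0, 1, 0]}
sequence, labels = normalize_multilabel_instance(example, label_indices=[1, 2, 4])
print(sequence, labels)
```

Selecting columns before the float conversion keeps the output length equal to the number of requested labels, which is what a model head sized to `len(label_indices)` would expect.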

print_label_statistics()[source]

Print statistics about the multi-label distribution in the dataset. This includes the number of positive labels per sample and per label class.
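As a rough illustration of the two statistics named above, the sketch below counts positives per sample and per label class over a list of multi-hot rows. The function name `multilabel_statistics` and the list-of-lists input are assumptions made for the example; the actual method prints its report and may compute more than this.

```python
def multilabel_statistics(label_rows):
    """Count positive labels per sample (row) and per class (column)."""
    positives_per_sample = [sum(row) for row in label_rows]
    # zip(*rows) transposes the matrix so each tuple is one label class.
    positives_per_class = [sum(column) for column in zip(*label_rows)]
    return positives_per_sample, positives_per_class


rows = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
per_sample, per_class = multilabel_statistics(rows)
print("positives per sample:", per_sample)
print("positives per class: ", per_class)
```

A class with zero positives (the third column here) is worth spotting early, since a model cannot learn it from the data.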

class omnigenbench.src.dataset.omni_dataset.OmniDatasetForSequenceClassification(dataset_name_or_path, tokenizer, max_length=None, **kwargs)[source]

Bases: OmniDataset

Dataset class for sequence classification tasks in genomics.

This class extends OmniDataset to prepare input sequences and their corresponding sequence-level labels. It’s designed for tasks where the entire sequence needs to be classified into one of several categories.

Variables:
  • metadata – Dictionary containing dataset metadata including library information

  • label2id – Mapping from label strings to integer IDs

prepare_input(instance, **kwargs)[source]

Prepare a single data instance for sequence classification.

This method handles both string sequences and dictionary instances containing sequence and label information. It tokenizes the input sequence and prepares sequence-level labels for classification.

Parameters:
  • instance – A single data instance. Can be a string representing the sequence or a dictionary with ‘seq’/‘sequence’ and ‘labels’/‘label’ keys.

  • **kwargs – Additional keyword arguments for tokenization, such as ‘padding’ and ‘truncation’.

Returns:

dict

A dictionary of tokenized inputs, including ‘input_ids’, ‘attention_mask’, and ‘labels’ (tensor of sequence-level labels).

Raises:

Exception – If the input instance format is unknown or if a dictionary instance does not contain a ‘label’ or ‘labels’ key, or if the label is not an integer.
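The label handling implied by `label2id` and the integer check under Raises can be sketched as follows. This is an assumption-laden illustration, not the library's code: `encode_class_label` is a hypothetical helper, and `ValueError` stands in for the generic `Exception` the documentation mentions.

```python
def encode_class_label(label, label2id=None):
    """Map a string label to its integer ID and validate the result."""
    if label2id is not None and isinstance(label, str) and label in label2id:
        label = label2id[label]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(label, bool) or not isinstance(label, int):
        raise ValueError(f"label must be an integer, got {label!r}")
    return label


label2id = {"non-promoter": 0, "promoter": 1}
print(encode_class_label("promoter", label2id))
print(encode_class_label(0, label2id))
```

Labels that are already integers pass through unchanged, so the same helper works for datasets with and without string class names.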

class omnigenbench.src.dataset.omni_dataset.OmniDatasetForSequenceRegression(dataset_name_or_path, tokenizer, max_length=None, **kwargs)[source]

Bases: OmniDataset

Dataset class for sequence regression tasks in genomics.

This class extends OmniDataset to prepare input sequences and their corresponding sequence-level regression targets. It’s designed for tasks where the entire sequence needs to be assigned a continuous value.

Variables:

metadata – Dictionary containing dataset metadata including library information

prepare_input(instance, **kwargs)[source]

Prepare a single data instance for sequence regression.

This method handles both string sequences and dictionary instances containing sequence and regression target information. It tokenizes the input sequence and prepares sequence-level regression targets.

Parameters:
  • instance – A single data instance. Can be a string representing the sequence or a dictionary with ‘seq’/‘sequence’ and ‘labels’/‘label’ keys.

  • **kwargs – Additional keyword arguments for tokenization, such as ‘padding’ and ‘truncation’.

Returns:

dict

A dictionary of tokenized inputs, including ‘input_ids’, ‘attention_mask’, and ‘labels’ (tensor of sequence-level regression targets).

Raises:

Exception – If the input instance format is unknown or if a dictionary instance does not contain a ‘label’ or ‘labels’ key.
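To make the missing-key condition concrete, here is a small sketch of pulling a sequence-level regression target out of a dictionary instance. The helper `extract_regression_target` is hypothetical, covers only the dictionary case, returns a plain list of floats instead of a tensor, and uses `KeyError`/`TypeError` where the documentation lists a generic `Exception`.

```python
def extract_regression_target(instance):
    """Return the regression target(s) of a dict instance as floats."""
    if not isinstance(instance, dict):
        raise TypeError(f"unknown instance format: {type(instance).__name__}")
    if "label" in instance:
        raw = instance["label"]
    elif "labels" in instance:
        raw = instance["labels"]
    else:
        raise KeyError("instance needs a 'label' or 'labels' key")
    # Assumption: a scalar target is wrapped so the output is always a list.
    values = raw if isinstance(raw, (list, tuple)) else [raw]
    return [float(v) for v in values]


print(extract_regression_target({"seq": "ACGU", "label": "0.75"}))
print(extract_regression_target({"sequence": "ACGU", "labels": [1, 2.5]}))
```

Always returning a list keeps single-target and multi-target datasets on the same code path.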

class omnigenbench.src.dataset.omni_dataset.OmniDatasetForTokenClassification(dataset_name_or_path, tokenizer, max_length=None, **kwargs)[source]

Bases: OmniDataset

Dataset class for token-level classification tasks in genomics.

This class extends OmniDataset to support tokenizing genomic sequences and aligning them with token-level labels for tasks like sequence tagging.

Variables:
  • metadata (dict) – Dataset metadata including library name, version, and task type.

  • label2id (dict) – Mapping from label strings to integer IDs used for training.

prepare_input(instance, **kwargs)[source]

Prepare a single data instance for token classification.

This method handles both string sequences and dictionary instances containing sequence and label information. It tokenizes the input sequence and prepares token-level labels for classification.

Parameters:
  • instance – A single data instance. Can be a string representing the sequence or a dictionary with ‘seq’/‘sequence’ and ‘labels’/‘label’ keys.

  • **kwargs – Additional keyword arguments for tokenization, such as ‘padding’ and ‘truncation’.

Returns:

dict

A dictionary of tokenized inputs, including ‘input_ids’, ‘attention_mask’, and ‘labels’ (tensor of token-level labels).

Raises:

Exception – If the input instance format is unknown or if a dictionary instance does not contain a ‘seq’ or ‘sequence’ key.
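Aligning token-level labels with a fixed-length tokenized input usually means truncating and padding the label list. The sketch below shows that idea under several assumptions that are not confirmed by this page: one label per token, no special tokens added by the tokenizer, and `-100` as the padding value (the default `ignore_index` of PyTorch's cross-entropy loss). The name `align_token_labels` and the example `label2id` are hypothetical.

```python
IGNORE_INDEX = -100  # assumption: positions with this value are skipped by the loss


def align_token_labels(labels, label2id, max_length, ignore_index=IGNORE_INDEX):
    """Convert per-token labels to IDs, then truncate or pad to max_length."""
    ids = [label2id[label] for label in labels][:max_length]
    return ids + [ignore_index] * (max_length - len(ids))


# Hypothetical label set for a structure-style tagging task.
label2id = {"(": 0, ")": 1, ".": 2}
print(align_token_labels("((..))", label2id, max_length=8))
print(align_token_labels("((..))", label2id, max_length=4))
```

If the tokenizer does add special tokens, the label list would need matching ignore entries at those positions so that labels and `input_ids` stay the same length.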

class omnigenbench.src.dataset.omni_dataset.OmniDatasetForTokenRegression(dataset_name_or_path, tokenizer, max_length=None, **kwargs)[source]

Bases: OmniDataset

Dataset class for token regression tasks in genomics.

This class extends OmniDataset to prepare input sequences and their corresponding token-level regression targets. It’s designed for tasks where each token in a sequence needs to be assigned a continuous value.

Variables:

metadata – Dictionary containing dataset metadata including library information

prepare_input(instance, **kwargs)[source]

Prepare a single data instance for token regression.

This method handles both string sequences and dictionary instances containing sequence and regression target information. It tokenizes the input sequence and prepares token-level regression targets.

Parameters:
  • instance – A single data instance. Can be a string representing the sequence or a dictionary with ‘seq’/‘sequence’ and ‘labels’/‘label’ keys.

  • **kwargs – Additional keyword arguments for tokenization, such as ‘padding’ and ‘truncation’.

Returns:

dict

A dictionary of tokenized inputs, including ‘input_ids’, ‘attention_mask’, and ‘labels’ (tensor of token-level regression targets).

Raises:

Exception – If the input instance format is unknown or if a dictionary instance does not contain a ‘seq’ or ‘sequence’ key.
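For token-level regression, padded positions carry no real target, so it can help to keep a mask next to the padded values. The following sketch assumes one target per token and a sentinel padding value of `-100.0`; both are illustrative choices, and `pad_token_targets` is a hypothetical helper rather than part of the documented API.

```python
def pad_token_targets(targets, max_length, pad_value=-100.0):
    """Truncate or pad per-token float targets and mark the real positions."""
    values = [float(t) for t in targets][:max_length]
    n_pad = max_length - len(values)
    mask = [1] * len(values) + [0] * n_pad  # 1 = real target, 0 = padding
    return values + [pad_value] * n_pad, mask


padded, mask = pad_token_targets([0.1, 0.9, 0.4], max_length=5)
print(padded)
print(mask)
```

The mask lets a downstream loss average over real positions only, instead of relying on the sentinel value being recognized.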