Tokenizers
BPE Tokenizer
- class omnigenbench.src.tokenizer.bpe_tokenizer.OmniBPETokenizer(base_tokenizer=None, **kwargs)
Bases: OmniTokenizer
A Byte Pair Encoding (BPE) tokenizer for genomic sequences.
This tokenizer uses BPE tokenization for genomic sequences and provides validation to ensure the base tokenizer is BPE-based. It supports sequence preprocessing and handles various input formats.
- Variables:
base_tokenizer – The underlying BPE tokenizer
metadata – Dictionary containing tokenizer metadata
Example
>>> from omnigenbench import OmniBPETokenizer
>>> from transformers import AutoTokenizer
>>> base_tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
>>> tokenizer = OmniBPETokenizer(base_tokenizer)
>>> sequence = "ACGUAGGUAUCGUAGA"
>>> tokens = tokenizer.tokenize(sequence)
>>> print(tokens[:5])
['▁A', 'C', 'G', 'U', 'A']
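For readers unfamiliar with byte pair encoding, the merge procedure the class name refers to can be sketched in plain Python. This is a toy illustration of the general BPE idea, not the code path used by OmniBPETokenizer or its base tokenizer; the function names `learn_bpe_merges`, `apply_merge`, and `bpe_tokenize` are hypothetical and do not exist in omnigenbench.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (toy version)."""
    # Start from single-character symbols (individual nucleotides).
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def bpe_tokenize(sequence, merges):
    """Tokenize a string by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


merges = learn_bpe_merges(["ACGUACGU"], num_merges=2)
tokens = bpe_tokenize("ACGU", merges)
```

Joining the resulting tokens always reproduces the input string, which is the property that lets a BPE tokenizer decode back to the original sequence.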
- decode(sequence, **kwargs)
Decode a sequence using the base BPE tokenizer.
- Parameters:
sequence – Input sequence to decode (can be token IDs or tokens)
**kwargs – Additional keyword arguments
- Returns:
str – Decoded sequence
- Raises:
AssertionError – If the base tokenizer is not BPE-based
Example
>>> token_ids = [1, 2, 3, 4, 5]
>>> sequence = tokenizer.decode(token_ids)
>>> print(sequence)
ACGUAGGUAUCGUAGA
- encode(sequence, **kwargs)
Encode a sequence using the base BPE tokenizer.
- Parameters:
sequence (str) – Input sequence to encode
**kwargs – Additional keyword arguments
- Returns:
list – List of token IDs
- Raises:
AssertionError – If the base tokenizer is not BPE-based
Example
>>> sequence = "ACGUAGGUAUCGUAGA"
>>> token_ids = tokenizer.encode(sequence)
>>> print(len(token_ids))
17
- static from_pretrained(config_or_model, **kwargs)
Create a BPE tokenizer from a pre-trained model.
- Parameters:
config_or_model (str) – Name or path of the pre-trained model
**kwargs – Additional keyword arguments
- Returns:
OmniBPETokenizer – Initialized BPE tokenizer
Example
>>> tokenizer = OmniBPETokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
>>> print(type(tokenizer))
<class 'omnigenbench.src.tokenizer.bpe_tokenizer.OmniBPETokenizer'>
- tokenize(sequence, **kwargs)
Tokenize a sequence using the base BPE tokenizer.
- Parameters:
sequence (str) – Input sequence to tokenize
**kwargs – Additional keyword arguments
- Returns:
list – List of tokens
Example
>>> sequence = "ACGUAGGUAUCGUAGA"
>>> tokens = tokenizer.tokenize(sequence)
>>> print(tokens[:5])
['▁A', 'C', 'G', 'U', 'A']
- omnigenbench.src.tokenizer.bpe_tokenizer.is_bpe_tokenization(tokens, threshold=0.1)
Check if the tokenization is BPE-based by analyzing token characteristics.
This function examines the tokens to determine if they follow BPE tokenization patterns by analyzing token length distributions and special token patterns.
- Parameters:
tokens (list) – List of tokens to analyze
threshold (float, optional) – Threshold for determining BPE tokenization. Defaults to 0.1
- Returns:
bool – True if tokens appear to be BPE-based, False otherwise
Example
>>> tokens = ["▁hello", "▁world", "▁how", "▁are", "▁you"]
>>> is_bpe = is_bpe_tokenization(tokens)
>>> print(is_bpe)
True
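To make the idea of a token-pattern check concrete, here is a minimal sketch of one possible heuristic. It is not the library's implementation: the function `looks_like_bpe`, the helper `strip_marker`, the marker set, and the exact decision rule are all assumptions chosen for illustration. It treats a token list as BPE-like when enough tokens carry a subword marker or are longer than one character.

```python
# Subword markers commonly emitted by subword tokenizers (assumed set).
SUBWORD_MARKERS = ("▁", "Ġ", "##")


def strip_marker(token):
    """Remove a leading subword marker, if present."""
    for marker in SUBWORD_MARKERS:
        if token.startswith(marker):
            return token[len(marker):]
    return token


def looks_like_bpe(tokens, threshold=0.1):
    """Hypothetical heuristic; NOT the omnigenbench implementation."""
    if not tokens:
        return False
    total = len(tokens)
    # Fraction of tokens that start with a subword marker.
    marked = sum(1 for tok in tokens if tok.startswith(SUBWORD_MARKERS))
    # Fraction of tokens that are longer than one character without the marker.
    multi_char = sum(1 for tok in tokens if len(strip_marker(tok)) > 1)
    return marked / total >= threshold or multi_char / total >= threshold
```

Under this rule a purely single-character tokenization such as `['A', 'C', 'G', 'U']` is rejected, while a list containing merged units such as `['ACG', 'U']` is accepted.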
K-mers Tokenizer
- class omnigenbench.src.tokenizer.kmers_tokenizer.OmniKmersTokenizer(base_tokenizer=None, k=3, overlap=0, max_length=512, **kwargs)
Bases: OmniTokenizer
A k-mer based tokenizer for genomic sequences.
This tokenizer breaks genomic sequences into overlapping k-mers and uses a base tokenizer to convert them into token IDs. It supports various k-mer sizes and overlap configurations for different genomic applications.
- Variables:
base_tokenizer – The underlying tokenizer for converting k-mers to IDs
k – Size of k-mers
overlap – Number of overlapping positions between consecutive k-mers
max_length – Maximum sequence length for tokenization
metadata – Dictionary containing tokenizer metadata
Example
>>> from omnigenbench import OmniKmersTokenizer
>>> from transformers import AutoTokenizer
>>> base_tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
>>> tokenizer = OmniKmersTokenizer(base_tokenizer, k=4, overlap=2)
>>> sequence = "ACGUAGGUAUCGUAGA"
>>> tokens = tokenizer.tokenize(sequence)
>>> print(tokens)
[['ACGU', 'GUAG', 'UAGG', 'AGGU', 'GGUA', 'GUAU', 'UAUC', 'AUCG', 'UCGU', 'CGUA', 'GUAG', 'UAGA']]
- decode(input_ids, **kwargs)
Decode input IDs using the base tokenizer.
- Parameters:
input_ids – Input IDs to decode
**kwargs – Additional keyword arguments
- Returns:
Decoded sequence
- encode(input_ids, **kwargs)
Encode input IDs using the base tokenizer.
- Parameters:
input_ids – Input IDs to encode
**kwargs – Additional keyword arguments
- Returns:
Encoded input IDs
- encode_plus(sequence, **kwargs)
Encode a sequence with additional information.
This method is not yet implemented for the k-mers tokenizer.
- Parameters:
sequence – Input sequence
**kwargs – Additional keyword arguments
- Raises:
NotImplementedError – This method is not implemented yet
- static from_pretrained(config_or_model, **kwargs)
Create a k-mers tokenizer from a pre-trained model.
- Parameters:
config_or_model (str) – Name or path of the pre-trained model
**kwargs – Additional keyword arguments
- Returns:
OmniKmersTokenizer – Initialized k-mers tokenizer
Example
>>> tokenizer = OmniKmersTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
>>> print(type(tokenizer))
<class 'omnigenbench.src.tokenizer.kmers_tokenizer.OmniKmersTokenizer'>
- tokenize(sequence, **kwargs)
Convert sequence(s) into k-mers.
This method breaks the input sequence(s) into overlapping k-mers based on the configured k-mer size and overlap parameters.
- Parameters:
sequence (str or list) – Input sequence(s) to convert to k-mers
**kwargs – Additional keyword arguments
- Returns:
list – List of k-mer lists for each input sequence
Example
>>> sequence = "ACGUAGGUAUCGUAGA"
>>> k_mers = tokenizer.tokenize(sequence)
>>> print(k_mers[0][:3])
['ACGU', 'GUAG', 'UAGG']
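How `k` and `overlap` interact can be illustrated with a small standalone sketch. It assumes that consecutive windows advance by `k - overlap` positions and that a trailing fragment shorter than `k` is dropped. Both are assumptions about the windowing rule, and the library's actual boundary handling may differ, so its output need not match this sketch. The function name `kmer_windows` is hypothetical.

```python
def kmer_windows(sequence, k=3, overlap=0):
    """Split `sequence` into k-mers that share `overlap` characters.

    Assumed rule (not confirmed against omnigenbench): the window start
    advances by `k - overlap`, and incomplete trailing windows are dropped.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0 <= overlap < k:
        raise ValueError("overlap must satisfy 0 <= overlap < k")
    step = k - overlap
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


# Overlapping windows: each k-mer shares its last two bases with the next one.
overlapping = kmer_windows("ACGUAGGUAUCGUAGA", k=4, overlap=2)

# Non-overlapping windows: the sequence is cut into consecutive codon-sized pieces.
disjoint = kmer_windows("ACGUAG", k=3, overlap=0)
```

With `overlap=0` the windows partition the sequence, while `overlap=k-1` yields the densest tiling with a stride of one base.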
Single Nucleotide Tokenizer
- class omnigenbench.src.tokenizer.single_nucleotide_tokenizer.OmniSingleNucleotideTokenizer(base_tokenizer=None, **kwargs)
Bases: OmniTokenizer
Tokenizer for single nucleotide tokenization in genomics.
This tokenizer converts genomic sequences into individual nucleotide tokens, where each nucleotide (A, T, C, G, U) becomes a separate token. It’s designed for genomic sequence processing where fine-grained nucleotide-level analysis is required.
The tokenizer supports various preprocessing options including U/T conversion and whitespace addition between nucleotides. It also handles special tokens like BOS (beginning of sequence) and EOS (end of sequence) tokens.
- Variables:
u2t (bool) – Whether to convert ‘U’ to ‘T’.
t2u (bool) – Whether to convert ‘T’ to ‘U’.
add_whitespace (bool) – Whether to add whitespace between nucleotides.
- decode(sequence, **kwargs)
Converts a list of token IDs back into a sequence.
This method decodes token IDs back into genomic sequences using the underlying base tokenizer.
- Parameters:
sequence (list) – A list of token IDs.
**kwargs – Additional arguments for decoding.
- Returns:
str – The decoded sequence.
Example
>>> sequence = tokenizer.decode([1, 2, 3, 4])
>>> print(sequence)  # "ATCG"
- encode(sequence, **kwargs)
Converts a sequence into a list of token IDs.
This method encodes genomic sequences into token IDs using the underlying base tokenizer.
- Parameters:
sequence (str) – The input sequence to encode.
**kwargs – Additional arguments for encoding.
- Returns:
list – A list of token IDs.
Example
>>> token_ids = tokenizer.encode("ATCGATCG")
>>> print(token_ids)  # [1, 2, 3, 4, 1, 2, 3, 4]
- encode_plus(sequence, **kwargs)
Encodes a sequence with additional information.
This method provides enhanced encoding with additional information like attention masks and token type IDs.
- Parameters:
sequence (str) – The input sequence to encode.
**kwargs – Additional arguments for encoding.
- Returns:
dict – A dictionary containing encoded information.
Example
>>> encoded = tokenizer.encode_plus("ATCGATCG")
>>> print(encoded.keys())  # dict_keys(['input_ids', 'attention_mask'])
- static from_pretrained(config_or_model, **kwargs)
Loads a single nucleotide tokenizer from a pre-trained model.
This method creates a single nucleotide tokenizer wrapper around a Hugging Face tokenizer loaded from a pre-trained model.
- Parameters:
config_or_model (str) – The name or path of the pre-trained model.
**kwargs – Additional arguments for the tokenizer.
- Returns:
OmniSingleNucleotideTokenizer – An instance of the tokenizer.
Example
>>> tokenizer = OmniSingleNucleotideTokenizer.from_pretrained("model_name")
- tokenize(sequence, **kwargs)
Converts a sequence into a list of individual nucleotide tokens.
This method tokenizes genomic sequences by treating each nucleotide as a separate token. It handles both single sequences and lists of sequences.
- Parameters:
sequence (str or list) – A single sequence or list of sequences to tokenize.
**kwargs – Additional arguments (not used in this implementation).
- Returns:
list – A list of token lists, where each inner list contains individual nucleotide tokens.
Example
>>> # Tokenize a single sequence
>>> tokens = tokenizer.tokenize("ATCGATCG")
>>> print(tokens)  # [['A', 'T', 'C', 'G', 'A', 'T', 'C', 'G']]
>>> # Tokenize multiple sequences
>>> tokens = tokenizer.tokenize(["ATCGATCG", "GCTAGCTA"])
>>> print(tokens)  # [['A', 'T', 'C', 'G', ...], ['G', 'C', 'T', 'A', ...]]
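The preprocessing options described for this class (`u2t`, `t2u`, `add_whitespace`) and the per-nucleotide split can be sketched without the library. This is an illustration under stated assumptions, not the class's actual code: the helper names `preprocess` and `split_nucleotides` are hypothetical, and rejecting `u2t` and `t2u` together is a choice made for this sketch only.

```python
def preprocess(sequence, u2t=False, t2u=False, add_whitespace=False):
    """Apply the U/T conversion and whitespace options to one sequence."""
    if u2t and t2u:
        # Assumption of this sketch: the two conversions are mutually exclusive.
        raise ValueError("u2t and t2u cannot both be enabled")
    if u2t:
        sequence = sequence.replace("U", "T")  # RNA alphabet -> DNA alphabet
    if t2u:
        sequence = sequence.replace("T", "U")  # DNA alphabet -> RNA alphabet
    if add_whitespace:
        sequence = " ".join(sequence)  # "ATCG" -> "A T C G"
    return sequence


def split_nucleotides(sequences):
    """Return one list of single-character tokens per input sequence."""
    if isinstance(sequences, str):
        sequences = [sequences]  # accept a single sequence or a list
    return [list(seq) for seq in sequences]


dna = preprocess("AUCGAUCG", u2t=True)
spaced = preprocess("ATCG", add_whitespace=True)
tokens = split_nucleotides(["ATCG", "GCTA"])
```

Inserting whitespace between nucleotides is a common way to make a word-level base tokenizer treat each base as its own token.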