transformer_lens.utilities.tokenize_utils module

tokenize_utils.

This module contains utility functions related to tokenization.

transformer_lens.utilities.tokenize_utils.get_attention_mask(tokenizer: PreTrainedTokenizerBase, tokens: Tensor, prepend_bos: bool) → Tensor

Computes the attention mask for the tokenized input. NOTE: only the leftmost leading pads (when padding_side == left) or the rightmost trailing pads (when padding_side == right) are treated as real pad tokens that should not be attended to.

Parameters:
  • tokenizer (PreTrainedTokenizerBase) – The tokenizer used for tokenization.

  • tokens (torch.Tensor) – The tokenized input.

  • prepend_bos (bool) – If True, a BOS token is prepended to the input.

Returns:

The attention mask for the input.

Return type:

torch.Tensor
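
A minimal usage sketch, assuming a GPT-2 tokenizer with left padding; the model name and prompts are illustrative placeholders, not part of this API:

from transformers import AutoTokenizer

from transformer_lens.utilities.tokenize_utils import get_attention_mask

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
tokenizer.padding_side = "left"

# Two prompts of different lengths, so the shorter one gets leading pads.
tokens = tokenizer(["Hello world", "Hi"], padding=True, return_tensors="pt")["input_ids"]

# Leading pads are masked out (0); real tokens are attended to (1).
mask = get_attention_mask(tokenizer, tokens, prepend_bos=False)
print(mask)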

transformer_lens.utilities.tokenize_utils.get_input_with_manually_prepended_bos(bos_token: str, input: str | list[str]) → str | list[str]

Manually prepends the BOS token to the input.

Parameters:
  • bos_token (str) – The BOS token to prepend.

  • input (str | list[str]) – The input to prepend the BOS token to.

Returns:

The input with the BOS token manually prepended.

Return type:

str | list[str]
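
A minimal sketch; "<|endoftext|>" is GPT-2's BOS string and stands in for any tokenizer.bos_token (the expected outputs in comments follow directly from the description above):

from transformer_lens.utilities.tokenize_utils import (
    get_input_with_manually_prepended_bos,
)

bos = "<|endoftext|>"  # e.g. tokenizer.bos_token for GPT-2

print(get_input_with_manually_prepended_bos(bos, "Hello world"))
# expected: "<|endoftext|>Hello world"

print(get_input_with_manually_prepended_bos(bos, ["first doc", "second doc"]))
# expected: ["<|endoftext|>first doc", "<|endoftext|>second doc"]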

transformer_lens.utilities.tokenize_utils.get_tokenizer_with_bos(tokenizer: PreTrainedTokenizerBase) → PreTrainedTokenizerBase

Returns the tokenizer initialized with add_bos_token=True. Such a tokenizer should be set as the default tokenizer because, for some tokenizers (e.g. LlamaTokenizer), tokenization differs depending on whether the BOS token is prepended automatically or manually.

Note: For tokenizers without a BOS token (e.g., T5), this returns the original tokenizer unchanged since add_bos_token=True would fail in transformers v5+ when bos_token is None.

Parameters:

tokenizer (PreTrainedTokenizerBase) – The tokenizer to initialize with add_bos_token=True.

Returns:

The tokenizer initialized with add_bos_token=True, or the original tokenizer if it has no BOS token.

Return type:

PreTrainedTokenizerBase
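
A minimal sketch using a GPT-2 tokenizer as a stand-in (the motivating case above is LlamaTokenizer); whether an add_bos_token attribute exists depends on the tokenizer class, hence the guarded getattr:

from transformers import AutoTokenizer

from transformer_lens.utilities.tokenize_utils import get_tokenizer_with_bos

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer_with_bos = get_tokenizer_with_bos(tokenizer)

# For tokenizers that support the flag, add_bos_token is now True;
# tokenizers without a BOS token (e.g. T5) come back unchanged.
print(getattr(tokenizer_with_bos, "add_bos_token", None))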

transformer_lens.utilities.tokenize_utils.get_tokens_with_bos_removed(tokenizer: PreTrainedTokenizerBase, tokens: Tensor) → Tensor

Removes the BOS token from the beginning of each sequence in tokens. The last dimension of tokens must be the sequence length.

Parameters:
  • tokenizer (PreTrainedTokenizerBase) – The tokenizer used to tokenize the input.

  • tokens (torch.Tensor) – The tokenized input.

Returns:

The tokenized input with the BOS token removed.

Return type:

torch.Tensor
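
A minimal sketch with a GPT-2 tokenizer; the BOS token is prepended by hand here so there is something to strip:

import torch
from transformers import AutoTokenizer

from transformer_lens.utilities.tokenize_utils import get_tokens_with_bos_removed

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer("Hello world", return_tensors="pt")["input_ids"]

# Prepend the BOS token manually: shape (1, seq_len) -> (1, seq_len + 1).
bos = torch.tensor([[tokenizer.bos_token_id]])
tokens_with_bos = torch.cat([bos, tokens], dim=-1)

# Stripping the BOS restores the original sequence length.
trimmed = get_tokens_with_bos_removed(tokenizer, tokens_with_bos)
assert trimmed.shape[-1] == tokens_with_bos.shape[-1] - 1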

transformer_lens.utilities.tokenize_utils.tokenize_and_concatenate(dataset: Dataset | IterableDataset, tokenizer: PreTrainedTokenizerBase, streaming: bool = False, max_length: int = 1024, column_name: str = 'text', add_bos_token: bool = True, num_proc: int = 10, set_format: bool = True) → Dataset | IterableDataset

Tokenizes each document, joins documents with an EOS token between them, and reshapes the result into (batch, sequence_length) rows.

Useful for training language models on a large text corpus without per-document truncation or padding. Models with absolute position embeddings also benefit, since concatenation avoids early-token bias (e.g. news articles always starting with “CNN”).

Parameters:
  • dataset – The dataset to tokenize. Accepts both arrow Dataset and IterableDataset (e.g. when loaded with streaming=True).

  • tokenizer (PreTrainedTokenizerBase) – The tokenizer. Must have bos_token_id and eos_token_id.

  • streaming (bool, optional) – If True, avoids parallelism. Defaults to False.

  • max_length (int, optional) – The length of the context window of the sequence. Defaults to 1024.

  • column_name (str, optional) – The name of the text column in the dataset. Defaults to ‘text’.

  • add_bos_token (bool, optional) – Whether to prepend bos_token_id to each output row. Defaults to True.

  • num_proc (int, optional) – Number of processes for parallel tokenization. Ignored when streaming=True. Defaults to 10.

  • set_format (bool, optional) – If True, calls set_format(type="torch") on the result. Set False for IterableDataset (which doesn’t support format setting); wrap the output in (torch.LongTensor(ex["tokens"]) for ex in tokenized_dataset) instead. Defaults to True.

Returns:

Tokenized dataset with the token sequences in a single column "tokens". Returns the same dataset type as the input (Dataset or IterableDataset).

Return type:

Dataset | IterableDataset
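
A minimal end-to-end sketch; "stas/openwebtext-10k" is just a small public stand-in corpus with a "text" column, and max_length and num_proc are illustrative values:

from datasets import load_dataset
from transformers import AutoTokenizer

from transformer_lens.utilities.tokenize_utils import tokenize_and_concatenate

tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = load_dataset("stas/openwebtext-10k", split="train")

tokenized = tokenize_and_concatenate(
    dataset,
    tokenizer,
    max_length=128,
    column_name="text",
    add_bos_token=True,
    num_proc=4,
)

# With set_format=True (the default), the "tokens" column is a torch tensor
# of shape (num_rows, max_length).
print(tokenized["tokens"].shape)

For a streamed corpus, pass streaming=True together with set_format=False and wrap the output as described under the set_format parameter.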