transformer_lens.utilities.tokenize_utils module¶
tokenize_utils.
This module contains utility functions related to tokenization
- transformer_lens.utilities.tokenize_utils.get_attention_mask(tokenizer: PreTrainedTokenizerBase, tokens: Tensor, prepend_bos: bool) Tensor¶
Computes the attention mask for the tokenized input. NOTE: Only the leftmost leading pads (when padding_side == left) or rightmost trailing pads (when padding_side == right) are considered as real pad tokens that should not be attended.
- Parameters:
tokenizer (PreTrainedTokenizerBase) – The tokenizer used for tokenization.
tokens (torch.Tensor) – The tokenized input.
prepend_bos (bool) – If True, a BOS token is prepended to the input.
- Returns:
The attention mask for the input.
- Return type:
torch.Tensor
- transformer_lens.utilities.tokenize_utils.get_input_with_manually_prepended_bos(bos_token: str, input: str | list[str]) str | list[str]¶
Manually prepends the bos token to the input.
- Parameters:
bos_token (str) – The BOS token to prepend.
input (str | list[str]) – The input to prepend the bos token to.
- Returns:
The input with the bos token manually prepended.
- Return type:
str | list[str]
- transformer_lens.utilities.tokenize_utils.get_tokenizer_with_bos(tokenizer: PreTrainedTokenizerBase) PreTrainedTokenizerBase¶
Returns the tokenizer initialized with add_bos_token=True. Such a tokenizer should be set as the default tokenizer because the tokenization of some tokenizers like LlamaTokenizer are different when bos token is automatically/manually prepended.
Note: For tokenizers without a BOS token (e.g., T5), this returns the original tokenizer unchanged since add_bos_token=True would fail in transformers v5+ when bos_token is None.
- Parameters:
tokenizer (PreTrainedTokenizerBase) – The tokenizer to initialize with add_bos_token=True.
- Returns:
- The tokenizer initialized with add_bos_token=True,
or the original tokenizer if it has no BOS token.
- Return type:
PreTrainedTokenizerBase
- transformer_lens.utilities.tokenize_utils.get_tokens_with_bos_removed(tokenizer: PreTrainedTokenizerBase, tokens: Tensor) Tensor¶
Removes the bos token from the beginning of each sequence in tokens. The last dimension of tokens must be the sequence length.
- Parameters:
tokenizer (PreTrainedTokenizerBase) – The tokenizer used to tokenize the input.
tokens (torch.Tensor) – The tokenized input.
- Returns:
The tokenized input with the bos token removed.
- Return type:
torch.Tensor
- transformer_lens.utilities.tokenize_utils.tokenize_and_concatenate(dataset: Dataset, tokenizer: PreTrainedTokenizerBase, streaming: bool = False, max_length: int = 1024, column_name: str = 'text', add_bos_token: bool = True, num_proc: int = 10) Dataset¶
Helper function to tokenizer and concatenate a dataset of text. This converts the text to tokens, concatenates them (separated by EOS tokens) and then reshapes them into a 2D array of shape (____, sequence_length), dropping the last batch. Tokenizers are much faster if parallelised, so we chop the string into 20, feed it into the tokenizer, in parallel with padding, then remove padding at the end.
This tokenization is useful for training language models, as it allows us to efficiently train on a large corpus of text of varying lengths (without, eg, a lot of truncation or padding). Further, for models with absolute positional encodings, this avoids privileging early tokens (eg, news articles often begin with CNN, and models may learn to use early positional encodings to predict these)
- Parameters:
dataset (Dataset) – The dataset to tokenize, assumed to be a HuggingFace text dataset.
tokenizer (PreTrainedTokenizerBase) – The tokenizer. Assumed to have a bos_token_id and an eos_token_id.
streaming (bool, optional) – Whether the dataset is being streamed. If True, avoids using parallelism. Defaults to False.
max_length (int, optional) – The length of the context window of the sequence. Defaults to 1024.
column_name (str, optional) – The name of the text column in the dataset. Defaults to ‘text’.
add_bos_token (bool, optional) – . Defaults to True.
- Returns:
Returns the tokenized dataset, as a dataset of tensors, with a single column called “tokens”
- Return type:
Dataset