transformer_lens.utilities.tokenize_utils module¶
This module contains utility functions related to tokenization.
- transformer_lens.utilities.tokenize_utils.get_attention_mask(tokenizer: PreTrainedTokenizerBase, tokens: Tensor, prepend_bos: bool) Tensor¶
Computes the attention mask for the tokenized input. NOTE: Only the leftmost leading pads (when padding_side == "left") or rightmost trailing pads (when padding_side == "right") are treated as real pad tokens that should not be attended to.
- Parameters:
tokenizer (PreTrainedTokenizerBase) – The tokenizer used for tokenization.
tokens (torch.Tensor) – The tokenized input.
prepend_bos (bool) – If True, a BOS token is prepended to the input.
- Returns:
The attention mask for the input.
- Return type:
torch.Tensor
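A minimal usage sketch (the "gpt2" checkpoint and the pad-token setup are illustrative, not part of this API):

    from transformers import AutoTokenizer
    from transformer_lens.utilities.tokenize_utils import get_attention_mask

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

    # Right-padded batch: the second sequence is shorter and gets trailing pads.
    batch = tokenizer(["Hello there, world", "Hi"], padding=True, return_tensors="pt")
    tokens = batch["input_ids"]

    mask = get_attention_mask(tokenizer, tokens, prepend_bos=False)
    # mask has the same shape as tokens: 1 at positions to attend, 0 at the pads.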
- transformer_lens.utilities.tokenize_utils.get_input_with_manually_prepended_bos(bos_token: str, input: str | list[str]) str | list[str]¶
Manually prepends the BOS token to the input.
- Parameters:
bos_token (str) – The BOS token to prepend.
input (str | list[str]) – The input to prepend the BOS token to.
- Returns:
The input with the BOS token manually prepended.
- Return type:
str | list[str]
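A minimal sketch; "<s>" is an illustrative BOS string, and the outputs shown assume plain string concatenation as described above:

    from transformer_lens.utilities.tokenize_utils import get_input_with_manually_prepended_bos

    single = get_input_with_manually_prepended_bos("<s>", "Hello world")
    # -> "<s>Hello world"

    batched = get_input_with_manually_prepended_bos("<s>", ["first doc", "second doc"])
    # -> ["<s>first doc", "<s>second doc"]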
- transformer_lens.utilities.tokenize_utils.get_tokenizer_with_bos(tokenizer: PreTrainedTokenizerBase) PreTrainedTokenizerBase¶
Returns the tokenizer initialized with add_bos_token=True. Such a tokenizer should be set as the default, because some tokenizers (e.g. LlamaTokenizer) tokenize differently depending on whether the BOS token is prepended automatically or manually.
Note: For tokenizers without a BOS token (e.g., T5), this returns the original tokenizer unchanged since add_bos_token=True would fail in transformers v5+ when bos_token is None.
- Parameters:
tokenizer (PreTrainedTokenizerBase) – The tokenizer to initialize with add_bos_token=True.
- Returns:
The tokenizer initialized with add_bos_token=True, or the original tokenizer if it has no BOS token.
- Return type:
PreTrainedTokenizerBase
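A minimal sketch of the intended pattern (the "gpt2" checkpoint is illustrative; Llama-family tokenizers are the motivating case):

    from transformers import AutoTokenizer
    from transformer_lens.utilities.tokenize_utils import get_tokenizer_with_bos

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer_with_bos = get_tokenizer_with_bos(tokenizer)
    # Use tokenizer_with_bos as the default: where the tokenizer class supports
    # add_bos_token, it now prepends the BOS token automatically when encoding.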
- transformer_lens.utilities.tokenize_utils.get_tokens_with_bos_removed(tokenizer: PreTrainedTokenizerBase, tokens: Tensor) Tensor¶
Removes the BOS token from the beginning of each sequence in tokens. The last dimension of tokens must be the sequence length.
- Parameters:
tokenizer (PreTrainedTokenizerBase) – The tokenizer used to tokenize the input.
tokens (torch.Tensor) – The tokenized input.
- Returns:
The tokenized input with the BOS token removed.
- Return type:
torch.Tensor
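A minimal sketch with hand-built token ids (the "gpt2" checkpoint and the ids 100–400 are illustrative):

    import torch
    from transformers import AutoTokenizer
    from transformer_lens.utilities.tokenize_utils import get_tokens_with_bos_removed

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokens = torch.tensor([
        [tokenizer.bos_token_id, 100, 200],
        [tokenizer.bos_token_id, 300, 400],
    ])
    trimmed = get_tokens_with_bos_removed(tokenizer, tokens)
    # trimmed has shape (2, 2): the leading BOS position is dropped from each row.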
- transformer_lens.utilities.tokenize_utils.tokenize_and_concatenate(dataset: Dataset | IterableDataset, tokenizer: PreTrainedTokenizerBase, streaming: bool = False, max_length: int = 1024, column_name: str = 'text', add_bos_token: bool = True, num_proc: int = 10, set_format: bool = True) Dataset | IterableDataset¶
Tokenize each document, join the documents with a token-level EOS between them, and reshape into (batch, sequence_length) rows. Useful for training language models on a large text corpus without per-document truncation or padding. Models with absolute position embeddings also benefit, since this avoids early-token bias (e.g. news articles starting with “CNN”).
- Parameters:
dataset – The dataset to tokenize. Accepts both an arrow Dataset and an IterableDataset (e.g. when loaded with streaming=True).
tokenizer (PreTrainedTokenizerBase) – The tokenizer. Must have bos_token_id and eos_token_id.
streaming (bool, optional) – If True, avoids parallelism. Defaults to False.
max_length (int, optional) – The length of the context window of the sequence. Defaults to 1024.
column_name (str, optional) – The name of the text column in the dataset. Defaults to ‘text’.
add_bos_token (bool, optional) – Whether to prepend bos_token_id to each output row. Defaults to True.
num_proc (int, optional) – Number of processes for parallel tokenization. Ignored when streaming=True. Defaults to 10.
set_format (bool, optional) – If True, calls set_format(type="torch") on the result. Set False for an IterableDataset (which doesn’t support format setting); wrap the output in (torch.LongTensor(ex["tokens"]) for ex in tokenized_dataset) instead. Defaults to True.
- Returns:
The tokenized dataset, with a single column "tokens" containing the token sequences. Returns the same dataset type as the input (Dataset or IterableDataset).
- Return type:
Dataset | IterableDataset