transformer_lens.utilities.tokenize_utils module

tokenize_utils.

This module contains utility functions related to tokenization.

transformer_lens.utilities.tokenize_utils.get_attention_mask(tokenizer: PreTrainedTokenizerBase, tokens: Tensor, prepend_bos: bool) → Tensor

Computes the attention mask for the tokenized input. NOTE: only the leftmost leading pads (when padding_side == left) or the rightmost trailing pads (when padding_side == right) are treated as real pad tokens that should not be attended to.

Parameters:
  • tokenizer (PreTrainedTokenizerBase) – The tokenizer used for tokenization.

  • tokens (torch.Tensor) – The tokenized input.

  • prepend_bos (bool) – If True, a BOS token is prepended to the input.

Returns:

The attention mask for the input.

Return type:

torch.Tensor
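
A minimal usage sketch, assuming a GPT-2 tokenizer with left padding; the model name and prompts are illustrative placeholders, not part of this API:

from transformers import AutoTokenizer

from transformer_lens.utilities.tokenize_utils import get_attention_mask

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
tokenizer.padding_side = "left"

# Two prompts of different lengths, so the shorter one gets leading pads.
tokens = tokenizer(["Hello world", "Hi"], padding=True, return_tensors="pt")["input_ids"]

# Leading pads are masked out (0); real tokens are attended to (1).
mask = get_attention_mask(tokenizer, tokens, prepend_bos=False)
print(mask)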

transformer_lens.utilities.tokenize_utils.get_input_with_manually_prepended_bos(bos_token: str, input: str | list[str]) → str | list[str]

Manually prepends the BOS token to the input.

Parameters:
  • bos_token (str) – The BOS token to prepend.

  • input (str | list[str]) – The input to prepend the BOS token to.

Returns:

The input with the BOS token manually prepended.

Return type:

str | list[str]
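
A minimal sketch; "<|endoftext|>" is GPT-2's BOS string and stands in for any tokenizer.bos_token (the expected outputs in comments follow directly from the description above):

from transformer_lens.utilities.tokenize_utils import (
    get_input_with_manually_prepended_bos,
)

bos = "<|endoftext|>"  # e.g. tokenizer.bos_token for GPT-2

print(get_input_with_manually_prepended_bos(bos, "Hello world"))
# expected: "<|endoftext|>Hello world"

print(get_input_with_manually_prepended_bos(bos, ["first doc", "second doc"]))
# expected: ["<|endoftext|>first doc", "<|endoftext|>second doc"]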

transformer_lens.utilities.tokenize_utils.get_tokenizer_with_bos(tokenizer: PreTrainedTokenizerBase) → PreTrainedTokenizerBase

Returns the tokenizer initialized with add_bos_token=True. Such a tokenizer should be set as the default tokenizer because, for some tokenizers (e.g. LlamaTokenizer), tokenization differs depending on whether the BOS token is prepended automatically or manually.

Note: For tokenizers without a BOS token (e.g., T5), this returns the original tokenizer unchanged since add_bos_token=True would fail in transformers v5+ when bos_token is None.

Parameters:

tokenizer (PreTrainedTokenizerBase) – The tokenizer to initialize with add_bos_token=True.

Returns:

The tokenizer initialized with add_bos_token=True, or the original tokenizer if it has no BOS token.

Return type:

PreTrainedTokenizerBase
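
A minimal sketch using a GPT-2 tokenizer as a stand-in (the motivating case above is LlamaTokenizer); whether an add_bos_token attribute exists depends on the tokenizer class, hence the guarded getattr:

from transformers import AutoTokenizer

from transformer_lens.utilities.tokenize_utils import get_tokenizer_with_bos

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer_with_bos = get_tokenizer_with_bos(tokenizer)

# For tokenizers that support the flag, add_bos_token is now True;
# tokenizers without a BOS token (e.g. T5) come back unchanged.
print(getattr(tokenizer_with_bos, "add_bos_token", None))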

transformer_lens.utilities.tokenize_utils.get_tokens_with_bos_removed(tokenizer: PreTrainedTokenizerBase, tokens: Tensor) → Tensor

Removes the BOS token from the beginning of each sequence in tokens. The last dimension of tokens must be the sequence length.

Parameters:
  • tokenizer (PreTrainedTokenizerBase) – The tokenizer used to tokenize the input.

  • tokens (torch.Tensor) – The tokenized input.

Returns:

The tokenized input with the BOS token removed.

Return type:

torch.Tensor
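
A minimal sketch with a GPT-2 tokenizer; the BOS token is prepended by hand here so there is something to strip:

import torch
from transformers import AutoTokenizer

from transformer_lens.utilities.tokenize_utils import get_tokens_with_bos_removed

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer("Hello world", return_tensors="pt")["input_ids"]

# Prepend the BOS token manually: shape (1, seq_len) -> (1, seq_len + 1).
bos = torch.tensor([[tokenizer.bos_token_id]])
tokens_with_bos = torch.cat([bos, tokens], dim=-1)

# Stripping the BOS restores the original sequence length.
trimmed = get_tokens_with_bos_removed(tokenizer, tokens_with_bos)
assert trimmed.shape[-1] == tokens_with_bos.shape[-1] - 1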

transformer_lens.utilities.tokenize_utils.tokenize_and_concatenate(dataset: Dataset | IterableDataset, tokenizer: PreTrainedTokenizerBase, streaming: bool = False, max_length: int = 1024, column_name: str = 'text', add_bos_token: bool = True, num_proc: int = 10, set_format: bool = True) → Dataset | IterableDataset

Tokenizes each document, joins documents with an EOS token between them, and reshapes the result into (batch, sequence_length) rows.

Useful for training language models on a large text corpus without per-document truncation or padding. Models with absolute position embeddings also benefit, since concatenation avoids early-token bias (e.g. news articles always starting with “CNN”).

Parameters:
  • dataset – The dataset to tokenize. Accepts both arrow Dataset and IterableDataset (e.g. when loaded with streaming=True).

  • tokenizer (PreTrainedTokenizerBase) – The tokenizer. Must have bos_token_id and eos_token_id.

  • streaming (bool, optional) – If True, avoids parallelism. Defaults to False.

  • max_length (int, optional) – The length of the context window of the sequence. Defaults to 1024.

  • column_name (str, optional) – The name of the text column in the dataset. Defaults to ‘text’.

  • add_bos_token (bool, optional) – Whether to prepend bos_token_id to each output row. Defaults to True.

  • num_proc (int, optional) – Number of processes for parallel tokenization. Ignored when streaming=True. Defaults to 10.

  • set_format (bool, optional) – If True, calls set_format(type="torch") on the result. Set False for IterableDataset (which doesn’t support format setting); wrap the output in (torch.LongTensor(ex["tokens"]) for ex in tokenized_dataset) instead. Defaults to True.

Returns:

Tokenized dataset with the token sequences in a single column "tokens". Returns the same dataset type as the input (Dataset or IterableDataset).

Return type:

Dataset | IterableDataset
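
A minimal end-to-end sketch; "stas/openwebtext-10k" is just a small public stand-in corpus with a "text" column, and max_length and num_proc are illustrative values:

from datasets import load_dataset
from transformers import AutoTokenizer

from transformer_lens.utilities.tokenize_utils import tokenize_and_concatenate

tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = load_dataset("stas/openwebtext-10k", split="train")

tokenized = tokenize_and_concatenate(
    dataset,
    tokenizer,
    max_length=128,
    column_name="text",
    add_bos_token=True,
    num_proc=4,
)

# With set_format=True (the default), the "tokens" column is a torch tensor
# of shape (num_rows, max_length).
print(tokenized["tokens"].shape)

For a streamed corpus, pass streaming=True together with set_format=False and wrap the output as described under the set_format parameter.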