transformer_lens.utilities.hf_utils module


This module contains utility functions related to HuggingFace.

transformer_lens.utilities.hf_utils.download_file_from_hf(repo_name, file_name, subfolder='.', cache_dir='/home/runner/.cache/huggingface/hub', force_is_torch=False, **kwargs)

Helper function to download files from the HuggingFace Hub, from subfolder/file_name in repo_name, saving locally to cache_dir and returning the loaded file (if a JSON or Torch object) or the file path otherwise.

If it’s a Torch file without the “.pth” extension, set force_is_torch=True to load it as a Torch object.

transformer_lens.utilities.hf_utils.get_dataset(dataset_name: str, **kwargs) → Dataset

Returns a small HuggingFace dataset, for easy testing and exploration. Accesses several convenience datasets with 10,000 elements (dealing with the enormous 100GB - 2TB datasets is a lot of effort!). Note that it returns a Dataset (i.e. a dictionary containing all the data), not a DataLoader (an iterator over the data plus some fancy features). But you can easily convert it to a DataLoader.

Each dataset has a ‘text’ field, which contains the relevant text; some also have several metadata fields.

Kwargs will be passed to the HuggingFace dataset loading function, e.g. “data_dir”.

Possible inputs:

* openwebtext (approximately the GPT-2 training data, https://huggingface.co/datasets/openwebtext)
* pile (The Pile, a big mess of tons of diverse data, https://pile.eleuther.ai/)
* c4 (Colossal, Cleaned, Common Crawl - basically openwebtext but bigger, https://huggingface.co/datasets/c4)
* code (Codeparrot Clean, a Python code dataset, https://huggingface.co/datasets/codeparrot/codeparrot-clean)
* c4_code (c4 + code - the 20K data points from c4-10k and code-10k. This is the mix of datasets used to train my interpretability-friendly models, though note that they are not in the correct ratio! There are 10K texts for each, but about 22M tokens of code and 5M tokens of C4)
* wiki (Wikipedia, generated from the 20220301.en split of https://huggingface.co/datasets/wikipedia)
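Since the function returns a plain Dataset rather than an iterator, a rough sketch of batching the data yourself looks like the following. This is illustrative only: the toy `rows` list stands in for the actual `dataset["text"]` column, and a real pipeline would typically wrap the dataset in `torch.utils.data.DataLoader` instead.

```python
def batched(texts, batch_size):
    """Yield successive fixed-size batches from a list of strings."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]


# Toy stand-in for dataset["text"] from e.g. get_dataset("c4").
rows = [f"document {i}" for i in range(10)]
batches = list(batched(rows, batch_size=4))
```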

transformer_lens.utilities.hf_utils.get_hf_token() → str | None

Get HuggingFace token from environment. Returns None if not set.

transformer_lens.utilities.hf_utils.get_rotary_pct_from_config(config: Any) → float

Get the rotary percentage from a config object.

In transformers v5, rotary_pct was moved to rope_parameters[‘partial_rotary_factor’]. This function handles both the old and new config formats.

Parameters:

config – Config object (HuggingFace or custom)

Returns:

The rotary percentage (0.0 to 1.0)

Return type:

float

transformer_lens.utilities.hf_utils.keep_single_column(dataset: Dataset, col_name: str)

Acts on a HuggingFace dataset to delete all columns apart from a single column name; useful when we want to tokenize and mix together different strings.
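The equivalent logic on a plain dict-of-lists table looks like the following. This toy version only illustrates the effect; the real function operates on a HuggingFace Dataset (e.g. via `Dataset.remove_columns`).

```python
def keep_single_column_sketch(columns: dict, col_name: str) -> dict:
    """Drop every column except col_name from a dict-of-lists table."""
    if col_name not in columns:
        raise KeyError(f"No column named {col_name!r}")
    return {col_name: columns[col_name]}


# Toy stand-in for a dataset with 'text' plus metadata columns.
table = {"text": ["a", "b"], "meta": [1, 2], "url": ["x", "y"]}
```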