transformer_lens.utilities.hf_utils module¶
This module contains utility functions related to HuggingFace.
- transformer_lens.utilities.hf_utils.download_file_from_hf(repo_name, file_name, subfolder='.', cache_dir='/home/runner/.cache/huggingface/hub', force_is_torch=False, **kwargs)¶
Helper function to download a file from the HuggingFace Hub, from subfolder/file_name in repo_name, saving it locally to cache_dir. Returns the loaded object if the file is JSON or a Torch object, and the file path otherwise.
If it is a Torch file without the “.pth” extension, set force_is_torch=True to load it as a Torch object.
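A minimal sketch of the post-download loading rule described above (JSON and Torch files are deserialized, everything else is returned as a path). The helper name `load_downloaded_file` is hypothetical; the real function also handles the download itself.

```python
import json


def load_downloaded_file(file_path: str, force_is_torch: bool = False):
    """Sketch: deserialize JSON and Torch files, return other paths as-is."""
    if file_path.endswith(".json"):
        with open(file_path) as f:
            return json.load(f)
    if file_path.endswith(".pth") or force_is_torch:
        # torch is assumed available, as in TransformerLens itself
        import torch

        return torch.load(file_path)
    # Not a recognized serialized format: hand back the local path
    return file_path
```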
- transformer_lens.utilities.hf_utils.get_dataset(dataset_name: str, **kwargs) Dataset¶
Returns a small HuggingFace dataset, for easy testing and exploration. Accesses several convenience datasets with 10,000 elements each (dealing with the enormous 100GB - 2TB full datasets is a lot of effort!). Note that it returns a Dataset (i.e. a dictionary containing all the data), not a DataLoader (an iterator over the data with some extra features), but it can easily be converted to a DataLoader.
Each dataset has a ‘text’ field containing the relevant data; some also have several metadata fields.
Kwargs are passed to the HuggingFace dataset loading function, e.g. “data_dir”.
Possible inputs:
- openwebtext (approximately the GPT-2 training data, https://huggingface.co/datasets/openwebtext)
- pile (The Pile, a big mess of tons of diverse data, https://pile.eleuther.ai/)
- c4 (Colossal, Cleaned, Common Crawl - basically openwebtext but bigger, https://huggingface.co/datasets/c4)
- code (Codeparrot Clean, a Python code dataset, https://huggingface.co/datasets/codeparrot/codeparrot-clean)
- c4_code (c4 + code - the 20K data points from c4-10k and code-10k. This is the mix of datasets used to train my interpretability-friendly models, though note that they are not in the correct ratio! There are 10K texts for each, but about 22M tokens of code and 5M tokens of C4.)
- wiki (Wikipedia, generated from the 20220301.en split of https://huggingface.co/datasets/wikipedia)
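A sketch of how the alias-to-dataset lookup above might work. The repo identifiers are taken from the URLs in the docstring; note the real function loads small ~10K-element convenience subsets, whose exact repo names are not listed here, so both the mapping values and the helper name `resolve_dataset_repo` are assumptions for illustration.

```python
from typing import Dict

# Full-dataset repo names from the docstring URLs (the real function
# uses ~10K-element subsets whose repo names are not given here).
DATASET_REPOS: Dict[str, str] = {
    "openwebtext": "openwebtext",
    "c4": "c4",
    "code": "codeparrot/codeparrot-clean",
    "wiki": "wikipedia",
}


def resolve_dataset_repo(dataset_name: str) -> str:
    """Map a convenience alias to a HuggingFace dataset repo name."""
    if dataset_name not in DATASET_REPOS:
        raise ValueError(f"Unknown dataset alias: {dataset_name!r}")
    return DATASET_REPOS[dataset_name]
```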
- transformer_lens.utilities.hf_utils.get_hf_token() str | None¶
Get HuggingFace token from environment. Returns None if not set.
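A minimal sketch of the behavior, assuming the token is read from the `HF_TOKEN` environment variable (the standard variable used by huggingface_hub tooling; the exact variable TransformerLens checks is an assumption). The helper name is hypothetical.

```python
import os
from typing import Optional


def get_hf_token_sketch() -> Optional[str]:
    """Return the HuggingFace token from the environment, or None if unset."""
    # HF_TOKEN is assumed here; an empty string is treated as unset.
    return os.environ.get("HF_TOKEN") or None
```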
- transformer_lens.utilities.hf_utils.get_rotary_pct_from_config(config: Any) float¶
Get the rotary percentage from a config object.
In transformers v5, rotary_pct was moved to rope_parameters[‘partial_rotary_factor’]. This function handles both the old and new config formats.
- Parameters:
config – Config object (HuggingFace or custom)
- Returns:
The rotary percentage (0.0 to 1.0)
- Return type:
float
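A sketch of the old-vs-new config handling described above: transformers v5 moved rotary_pct into rope_parameters['partial_rotary_factor'], so the new location is checked first and the old attribute is used as a fallback. The default of 1.0 when neither is present is an assumption.

```python
from typing import Any


def rotary_pct_sketch(config: Any) -> float:
    """Return the rotary percentage, handling old and new config formats."""
    # New format (transformers v5+): rope_parameters['partial_rotary_factor']
    rope_parameters = getattr(config, "rope_parameters", None)
    if rope_parameters and "partial_rotary_factor" in rope_parameters:
        return float(rope_parameters["partial_rotary_factor"])
    # Old format: a top-level rotary_pct attribute
    # (falling back to 1.0, i.e. fully rotary, is an assumption)
    return float(getattr(config, "rotary_pct", 1.0))
```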
- transformer_lens.utilities.hf_utils.keep_single_column(dataset: Dataset, col_name: str)¶
Acts on a HuggingFace dataset to delete all columns apart from the single named column - useful when we want to tokenize and mix together different strings.
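The same behavior, sketched on a plain dict of columns rather than a HuggingFace Dataset (so the example needs no dependencies); the real function instead drops columns from a Dataset object. The helper name is hypothetical.

```python
from typing import Dict, List


def keep_single_column_sketch(columns: Dict[str, List], col_name: str) -> Dict[str, List]:
    """Keep only the named column, discarding all others."""
    if col_name not in columns:
        raise KeyError(f"Column {col_name!r} not found")
    return {col_name: columns[col_name]}
```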