transformer_lens.utilities.hf_utils module


This module contains utility functions related to HuggingFace.

transformer_lens.utilities.hf_utils.download_file_from_hf(repo_name, file_name, subfolder='.', cache_dir='/home/runner/.cache/huggingface/hub', force_is_torch=False, **kwargs)

Helper function to download files from the HuggingFace Hub, from subfolder/file_name in repo_name, saving locally to cache_dir and returning the loaded file (if a JSON or Torch object) or the file path otherwise.

If it’s a Torch file without the “.pth” extension, set force_is_torch=True to load it as a Torch object.

transformer_lens.utilities.hf_utils.get_dataset(dataset_name: str, **kwargs) → Dataset

Returns a small HuggingFace dataset, for easy testing and exploration. Accesses several convenience datasets with 10,000 elements (dealing with the enormous 100GB - 2TB datasets is a lot of effort!). Note that it returns a Dataset (i.e. a dictionary containing all the data), not a DataLoader (an iterator over the data plus some fancy features). But you can easily convert it to a DataLoader.

Each dataset has a ‘text’ field, which contains the relevant text; some also have several metadata fields.

Kwargs will be passed to the HuggingFace dataset loading function, e.g. “data_dir”.

Possible inputs:

* openwebtext (approximately the GPT-2 training data, https://huggingface.co/datasets/openwebtext)
* pile (The Pile, a big mess of tons of diverse data, https://pile.eleuther.ai/)
* c4 (Colossal, Cleaned, Common Crawl - basically openwebtext but bigger, https://huggingface.co/datasets/c4)
* code (Codeparrot Clean, a Python code dataset, https://huggingface.co/datasets/codeparrot/codeparrot-clean)
* c4_code (c4 + code - the 20K data points from c4-10k and code-10k. This is the mix of datasets used to train my interpretability-friendly models, though note that they are not in the correct ratio! There are 10K texts for each, but about 22M tokens of code and 5M tokens of C4)
* wiki (Wikipedia, generated from the 20220301.en split of https://huggingface.co/datasets/wikipedia)
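Since the function returns a plain Dataset rather than an iterator, a rough sketch of batching the data yourself looks like the following. This is illustrative only: the toy `rows` list stands in for the actual `dataset["text"]` column, and a real pipeline would typically wrap the dataset in `torch.utils.data.DataLoader` instead.

```python
def batched(texts, batch_size):
    """Yield successive fixed-size batches from a list of strings."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]


# Toy stand-in for dataset["text"] from e.g. get_dataset("c4").
rows = [f"document {i}" for i in range(10)]
batches = list(batched(rows, batch_size=4))
```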

transformer_lens.utilities.hf_utils.get_hf_token() → str | None

Get HuggingFace token from environment. Returns None if not set.

transformer_lens.utilities.hf_utils.get_rotary_pct_from_config(config: Any) → float

Get the rotary percentage from a config object.

In transformers v5, rotary_pct was moved to rope_parameters[‘partial_rotary_factor’]. This function handles both the old and new config formats.

Parameters:

config – Config object (HuggingFace or custom)

Returns:

The rotary percentage (0.0 to 1.0)

Return type:

float

transformer_lens.utilities.hf_utils.keep_single_column(dataset: Dataset, col_name: str)

Acts on a HuggingFace dataset to delete all columns apart from a single column name; useful when we want to tokenize and mix together different strings.
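The equivalent logic on a plain dict-of-lists table looks like the following. This toy version only illustrates the effect; the real function operates on a HuggingFace Dataset (e.g. via `Dataset.remove_columns`).

```python
def keep_single_column_sketch(columns: dict, col_name: str) -> dict:
    """Drop every column except col_name from a dict-of-lists table."""
    if col_name not in columns:
        raise KeyError(f"No column named {col_name!r}")
    return {col_name: columns[col_name]}


# Toy stand-in for a dataset with 'text' plus metadata columns.
table = {"text": ["a", "b"], "meta": [1, 2], "url": ["x", "y"]}
```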