transformer_lens.evals#

Evaluation Helpers.

This module contains some rough evals for models. If you want to do anything rigorous you are likely better off using the HuggingFace evaluate library, but these helpers are here if you want to, e.g., cheaply and roughly compare models you’ve trained against baselines.

class transformer_lens.evals.IOIDataset(tokenizer, templates: List[str] | None = None, names: List[str] | None = None, nouns: Dict[str, List[str]] | None = None, num_samples: int = 1000, symmetric: bool = False, prepend_bos: bool = True)#

Bases: Dataset

Dataset for Indirect Object Identification tasks. Paper: https://arxiv.org/pdf/2211.00593.pdf

Example:

>>> from transformer_lens.evals import ioi_eval, IOIDataset
>>> from transformer_lens.HookedTransformer import HookedTransformer

>>> model = HookedTransformer.from_pretrained('gpt2-small')
Loaded pretrained model gpt2-small into HookedTransformer

>>> # Evaluate like this, printing the logit difference
>>> print(round(ioi_eval(model, num_samples=100)["Logit Difference"], 3))
5.476

>>> # Can use custom dataset
>>> ds = IOIDataset(
...     tokenizer=model.tokenizer,
...     num_samples=100,
...     templates=['[A] met with [B]. [B] gave the [OBJECT] to [A]'],
...     names=['Alice', 'Bob', 'Charlie'],
...     nouns={'OBJECT': ['ball', 'book']},
... )
>>> print(round(ioi_eval(model, dataset=ds)["Logit Difference"], 3))
5.397
__getitem__(idx)#
__len__()#
static get_default_names()#
static get_default_nouns()#
static get_default_templates()#
get_sample(symmetric=False) → List[Dict[str, str]]#
transformer_lens.evals.evaluate(model, truncate=100, batch_size=8, tokenizer=None)#
transformer_lens.evals.evaluate_on_dataset(model, data_loader, truncate=100, device='cuda')#
transformer_lens.evals.induction_loss(model, tokenizer=None, batch_size=4, subseq_len=384, prepend_bos=None, device='cuda')#

Generates a batch of random sequences repeated twice, and measures model performance on the second half. Tests whether a model has induction heads.

By default a beginning-of-sequence (BOS) token is prepended: when the prepend_bos flag is left as None, model.cfg.default_prepend_bos is used, whose default is True unless specified otherwise. Prepending BOS gives the model a resting position, and some models were trained with it.
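The input construction described above can be sketched as follows. This is an illustrative, hypothetical helper (the exact construction inside induction_loss may differ): each sequence is a random subsequence repeated twice, optionally preceded by a BOS token, and a model with induction heads should predict the second half far better than chance.

```python
import random

def make_induction_batch(batch_size, subseq_len, vocab_size, bos_token=None):
    """Build sequences of the form [BOS?] + rand + rand, where the same
    random subsequence appears twice back to back."""
    batch = []
    for _ in range(batch_size):
        subseq = [random.randrange(vocab_size) for _ in range(subseq_len)]
        seq = subseq + subseq  # second half is an exact repeat of the first
        if bos_token is not None:
            seq = [bos_token] + seq  # optional resting position
        batch.append(seq)
    return batch

# Matches the defaults above: batch_size=4, subseq_len=384.
batch = make_induction_batch(batch_size=4, subseq_len=384,
                             vocab_size=50257, bos_token=50256)
# Each sequence: 1 BOS token + 384 random tokens + the same 384 tokens again.
```

Loss would then be measured only on the repeated second half, where an induction head can succeed by copying.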

transformer_lens.evals.ioi_eval(model, dataset=None, batch_size=8, num_samples=1000, tokenizer=None, symmetric=False)#

Evaluate the Model on the Indirect Object Identification Task.

Parameters:
  • model – HookedTransformer model.

  • dataset – PyTorch Dataset that returns a dict with keys “prompt”, “IO”, and “S”.

  • batch_size – Batch size to use.

  • num_samples – Number of samples to use.

  • tokenizer – Tokenizer to use.

  • symmetric – Whether to use the symmetric version of the task.

Returns:

Average logit difference and accuracy.
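The metric can be illustrated with a small sketch (hypothetical helper, not the library’s implementation): at the final prompt position, compare the logit assigned to the indirect-object (IO) name token against the subject (S) name token; the logit difference averages IO − S, and accuracy is the fraction of examples where IO wins.

```python
def ioi_metrics(final_logits, io_ids, s_ids):
    """final_logits: per-example logit vectors (lists of floats) at the
    final prompt position; io_ids / s_ids: token ids of the IO and S
    names for each example."""
    diffs = [logits[io] - logits[s]
             for logits, io, s in zip(final_logits, io_ids, s_ids)]
    return {
        "Logit Difference": sum(diffs) / len(diffs),
        "Accuracy": sum(d > 0 for d in diffs) / len(diffs),
    }

# Toy example: two examples over a 3-token vocabulary.
out = ioi_metrics(
    final_logits=[[2.0, 5.0, 1.0], [0.5, 0.0, 3.0]],
    io_ids=[1, 2],
    s_ids=[0, 0],
)
# Diffs are 3.0 and 2.5, so the mean logit difference is 2.75
# and both examples are "correct" (IO logit > S logit).
```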

transformer_lens.evals.make_code_data_loader(tokenizer, batch_size=8)#

Evaluate on the CodeParrot dataset, a dump of Python code.

All models seem to get significantly lower loss here than on natural-language datasets (even models not trained on code, like GPT-2), presumably because code is much easier to predict than natural language.

transformer_lens.evals.make_mmlu_data_loader(subjects: str | List[str] | None = None, split: str = 'test', num_samples: int | None = None)#

Load MMLU (Massive Multitask Language Understanding) dataset.

MMLU tests model performance on 57 subjects across STEM, humanities, social sciences, and more. Each question is multiple choice with 4 options (A, B, C, D).

Paper: https://arxiv.org/abs/2009.03300 Dataset: https://huggingface.co/datasets/cais/mmlu

Parameters:
  • subjects – Subject(s) to evaluate on. Can be:
      – None: use all 57 subjects (default)
      – str: a single subject name (e.g., “abstract_algebra”)
      – List[str]: multiple subjects

  • split – Which split to use - “test”, “validation”, or “dev”. Default is “test”.

  • num_samples – Optional limit on number of samples per subject. If None, uses all samples.

Returns:

  • “question”: str

  • “choices”: List[str] (4 choices)

  • “answer”: int (0-3, correct choice index)

  • “subject”: str

Return type:

List of dictionaries with MMLU examples, each containing

Examples:

>>> from transformer_lens.evals import make_mmlu_data_loader

>>> # Load specific subject
>>> mmlu_data = make_mmlu_data_loader(subjects="college_mathematics")  

>>> # Load multiple subjects
>>> mmlu_data = make_mmlu_data_loader(  
...     subjects=["abstract_algebra", "astronomy", "college_chemistry"]
... )
transformer_lens.evals.make_owt_data_loader(tokenizer, batch_size=8)#

Evaluate on OpenWebText, an open-source replication of the GPT-2 training corpus (text from Reddit links with >3 karma).

I think the Mistral models were trained on this dataset, so they get very good performance.

transformer_lens.evals.make_pile_data_loader(tokenizer, batch_size=8)#

Evaluate on the first 10k texts from The Pile.

The Pile is EleutherAI’s general-purpose English dataset, made up of 22 subsets including academic papers, books, and internet content.

transformer_lens.evals.make_wiki_data_loader(tokenizer, batch_size=8)#

Evaluate on WikiText-2, a dump of Wikipedia articles. (This uses the train set because it’s larger; I don’t really expect anyone to bother quarantining the validation set nowadays.)

Note that there’s likely to be dataset leakage into training data (though I believe GPT-2 was explicitly trained on non-Wikipedia data).

transformer_lens.evals.mmlu_eval(model, tokenizer=None, subjects: str | List[str] | None = None, split: str = 'test', num_samples: int | None = None)#

Evaluate a model on the MMLU benchmark.

MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating language models on 57 subjects across STEM, humanities, social sciences, and more. Each question is multiple-choice with 4 options.

For each question, all four answer choices (A-D) are shown in the prompt and the model’s log probability for each answer letter token is compared. This is a zero-shot evaluation; standard MMLU benchmarks typically use 5-shot prompting for higher accuracy.

Paper: https://arxiv.org/abs/2009.03300

Parameters:
  • model – HookedTransformer model to evaluate.

  • tokenizer – Tokenizer to use. If None, uses model.tokenizer.

  • subjects – Subject(s) to evaluate on. Can be None (all 57 subjects), a single subject string, or a list of subjects. See MMLU_SUBJECTS for valid names.

  • split – Which split to use - “test”, “validation”, or “dev”. Default is “test”.

  • num_samples – Optional limit on number of samples per subject. If None, uses all samples.

Returns:

  • “accuracy”: Overall accuracy (0-1)

  • “num_correct”: Number of correct predictions

  • “num_total”: Total number of questions

  • “subject_scores”: Dict mapping subject names to their accuracy

Return type:

Dictionary containing

Examples:

>>> from transformer_lens import HookedTransformer
>>> from transformer_lens.evals import mmlu_eval

>>> model = HookedTransformer.from_pretrained("gpt2-small")  
>>> results = mmlu_eval(model, subjects="abstract_algebra", num_samples=10)  
>>> print(f"Accuracy: {results['accuracy']:.2%}")  
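The zero-shot scoring rule described above can be sketched with a hypothetical helper (not the library’s code): the model’s log probability for each answer-letter token is compared, and the question counts as correct when the highest-scoring letter matches the labeled answer.

```python
LETTERS = ["A", "B", "C", "D"]

def score_question(letter_logprobs, answer_idx):
    """letter_logprobs: the log probability the model assigns to each of
    the four answer-letter tokens at the final prompt position;
    answer_idx: index (0-3) of the correct choice."""
    prediction = max(range(4), key=lambda i: letter_logprobs[i])
    return prediction == answer_idx

# Toy example: the model puts the most mass on "C" and the answer is C.
correct = score_question([-3.2, -2.9, -0.4, -5.0], answer_idx=2)
```

Overall accuracy is then the mean of this indicator over all questions, and per-subject accuracy is the same mean restricted to each subject.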
transformer_lens.evals.sanity_check(model)#

Very basic eval - just feeds a string into the model (in this case, the first paragraph of Circuits: Zoom In) and returns the loss. It’s a rough and quick sanity check: if the loss is <5 the model is probably OK; if the loss is >7, something has gone wrong.

Note that this is a very basic eval, and doesn’t really tell you much about the model’s performance.
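The rule of thumb above amounts to a simple threshold check, sketched here as a hypothetical helper (not part of the library):

```python
def interpret_sanity_loss(loss):
    """Apply the rough thresholds from sanity_check's docstring:
    loss < 5 is probably fine, loss > 7 suggests a bug, and values
    in between are inconclusive."""
    if loss < 5:
        return "probably OK"
    if loss > 7:
        return "something's gone wrong"
    return "borderline"
```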