transformer_lens.evals#

Evaluation Helpers.

This module contains some rough evals for models. If you want to do anything rigorous you are likely better off using the HuggingFace evaluate library, but these helpers are here if you want to, e.g., cheaply and roughly compare models you’ve trained against baselines.

class transformer_lens.evals.IOIDataset(tokenizer, templates: List[str] | None = None, names: List[str] | None = None, nouns: Dict[str, List[str]] | None = None, num_samples: int = 1000, symmetric: bool = False, prepend_bos: bool = True)#

Bases: Dataset

Dataset for Indirect Object Identification tasks. Paper: https://arxiv.org/pdf/2211.00593.pdf

Example:

>>> from transformer_lens.evals import ioi_eval, IOIDataset
>>> from transformer_lens.HookedTransformer import HookedTransformer

>>> model = HookedTransformer.from_pretrained('gpt2-small')
Loaded pretrained model gpt2-small into HookedTransformer

>>> # Evaluate like this, printing the logit difference
>>> print(round(ioi_eval(model, num_samples=100)["Logit Difference"], 3))
5.476

>>> # Can use custom dataset
>>> ds = IOIDataset(
...     tokenizer=model.tokenizer,
...     num_samples=100,
...     templates=['[A] met with [B]. [B] gave the [OBJECT] to [A]'],
...     names=['Alice', 'Bob', 'Charlie'],
...     nouns={'OBJECT': ['ball', 'book']},
... )
>>> print(round(ioi_eval(model, dataset=ds)["Logit Difference"], 3))
5.397
__getitem__(idx)#
__len__()#
static get_default_names()#
static get_default_nouns()#
static get_default_templates()#
get_sample(symmetric=False) → List[Dict[str, str]]#
transformer_lens.evals.evaluate(model, truncate=100, batch_size=8, tokenizer=None)#
transformer_lens.evals.evaluate_on_dataset(model, data_loader, truncate=100, device='cuda')#
transformer_lens.evals.induction_loss(model, tokenizer=None, batch_size=4, subseq_len=384, prepend_bos=None, device='cuda')#

Generates a batch of random sequences repeated twice, and measures model performance on the second half. Tests whether a model has induction heads.

By default a beginning-of-sequence (BOS) token is prepended: when the prepend_bos flag is left as None, model.cfg.default_prepend_bos is used, whose default is True unless specified otherwise. Prepending BOS gives the model a resting position, and some models were trained with it.
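The input construction described above can be sketched as follows. This is an illustrative, hypothetical helper (the exact construction inside induction_loss may differ): each sequence is a random subsequence repeated twice, optionally preceded by a BOS token, and a model with induction heads should predict the second half far better than chance.

```python
import random

def make_induction_batch(batch_size, subseq_len, vocab_size, bos_token=None):
    """Build sequences of the form [BOS?] + rand + rand, where the same
    random subsequence appears twice back to back."""
    batch = []
    for _ in range(batch_size):
        subseq = [random.randrange(vocab_size) for _ in range(subseq_len)]
        seq = subseq + subseq  # second half is an exact repeat of the first
        if bos_token is not None:
            seq = [bos_token] + seq  # optional resting position
        batch.append(seq)
    return batch

# Matches the defaults above: batch_size=4, subseq_len=384.
batch = make_induction_batch(batch_size=4, subseq_len=384,
                             vocab_size=50257, bos_token=50256)
# Each sequence: 1 BOS token + 384 random tokens + the same 384 tokens again.
```

Loss would then be measured only on the repeated second half, where an induction head can succeed by copying.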

transformer_lens.evals.ioi_eval(model, dataset=None, batch_size=8, num_samples=1000, tokenizer=None, symmetric=False)#

Evaluate the Model on the Indirect Object Identification Task.

Parameters:
  • model – HookedTransformer model.

  • dataset – PyTorch Dataset that returns a dict with keys “prompt”, “IO”, and “S”.

  • batch_size – Batch size to use.

  • num_samples – Number of samples to use.

  • tokenizer – Tokenizer to use.

  • symmetric – Whether to use the symmetric version of the task.

Returns:

Average logit difference and accuracy.
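The metric can be illustrated with a small sketch (hypothetical helper, not the library’s implementation): at the final prompt position, compare the logit assigned to the indirect-object (IO) name token against the subject (S) name token; the logit difference averages IO − S, and accuracy is the fraction of examples where IO wins.

```python
def ioi_metrics(final_logits, io_ids, s_ids):
    """final_logits: per-example logit vectors (lists of floats) at the
    final prompt position; io_ids / s_ids: token ids of the IO and S
    names for each example."""
    diffs = [logits[io] - logits[s]
             for logits, io, s in zip(final_logits, io_ids, s_ids)]
    return {
        "Logit Difference": sum(diffs) / len(diffs),
        "Accuracy": sum(d > 0 for d in diffs) / len(diffs),
    }

# Toy example: two examples over a 3-token vocabulary.
out = ioi_metrics(
    final_logits=[[2.0, 5.0, 1.0], [0.5, 0.0, 3.0]],
    io_ids=[1, 2],
    s_ids=[0, 0],
)
# Diffs are 3.0 and 2.5, so the mean logit difference is 2.75
# and both examples are "correct" (IO logit > S logit).
```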

transformer_lens.evals.make_code_data_loader(tokenizer, batch_size=8)#

Evaluate on the CodeParrot dataset, a dump of Python code.

All models seem to get significantly lower loss here than on natural-language datasets (even models not trained on code, like GPT-2), presumably because code is much easier to predict than natural language.

transformer_lens.evals.make_mmlu_data_loader(subjects: str | List[str] | None = None, split: str = 'test', num_samples: int | None = None)#

Load MMLU (Massive Multitask Language Understanding) dataset.

MMLU tests model performance on 57 subjects across STEM, humanities, social sciences, and more. Each question is multiple choice with 4 options (A, B, C, D).

Paper: https://arxiv.org/abs/2009.03300 Dataset: https://huggingface.co/datasets/cais/mmlu

Parameters:
  • subjects – Subject(s) to evaluate on. Can be:
      – None: use all 57 subjects (default)
      – str: a single subject name (e.g., “abstract_algebra”)
      – List[str]: multiple subjects

  • split – Which split to use - “test”, “validation”, or “dev”. Default is “test”.

  • num_samples – Optional limit on number of samples per subject. If None, uses all samples.

Returns:

  • “question”: str

  • “choices”: List[str] (4 choices)

  • “answer”: int (0-3, correct choice index)

  • “subject”: str

Return type:

List of dictionaries with MMLU examples, each containing

Examples:

>>> from transformer_lens.evals import make_mmlu_data_loader

>>> # Load specific subject
>>> mmlu_data = make_mmlu_data_loader(subjects="college_mathematics")  

>>> # Load multiple subjects
>>> mmlu_data = make_mmlu_data_loader(  
...     subjects=["abstract_algebra", "astronomy", "college_chemistry"]
... )
transformer_lens.evals.make_owt_data_loader(tokenizer, batch_size=8)#

Evaluate on OpenWebText, an open-source replication of the GPT-2 training corpus (text from Reddit links with >3 karma).

I think the Mistral models were trained on this dataset, so they get very good performance.

transformer_lens.evals.make_pile_data_loader(tokenizer, batch_size=8)#

Evaluate on the first 10k texts from The Pile.

The Pile is EleutherAI’s general-purpose English dataset, made up of 22 subsets including academic papers, books, and internet content.

transformer_lens.evals.make_wiki_data_loader(tokenizer, batch_size=8)#

Evaluate on WikiText-2, a dump of Wikipedia articles. (This uses the train set because it’s larger; I don’t really expect anyone to bother quarantining the validation set nowadays.)

Note that there’s likely to be dataset leakage into training data (though I believe GPT-2 was explicitly trained on non-Wikipedia data).

transformer_lens.evals.mmlu_eval(model, tokenizer=None, subjects: str | List[str] | None = None, split: str = 'test', num_samples: int | None = None)#

Evaluate a model on the MMLU benchmark.

MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating language models on 57 subjects across STEM, humanities, social sciences, and more. Each question is multiple-choice with 4 options.

For each question, all four answer choices (A-D) are shown in the prompt and the model’s log probability for each answer letter token is compared. This is a zero-shot evaluation; standard MMLU benchmarks typically use 5-shot prompting for higher accuracy.

Paper: https://arxiv.org/abs/2009.03300

Parameters:
  • model – HookedTransformer model to evaluate.

  • tokenizer – Tokenizer to use. If None, uses model.tokenizer.

  • subjects – Subject(s) to evaluate on. Can be None (all 57 subjects), a single subject string, or a list of subjects. See MMLU_SUBJECTS for valid names.

  • split – Which split to use - “test”, “validation”, or “dev”. Default is “test”.

  • num_samples – Optional limit on number of samples per subject. If None, uses all samples.

Returns:

  • “accuracy”: Overall accuracy (0-1)

  • “num_correct”: Number of correct predictions

  • “num_total”: Total number of questions

  • “subject_scores”: Dict mapping subject names to their accuracy

Return type:

Dictionary containing

Examples:

>>> from transformer_lens import HookedTransformer
>>> from transformer_lens.evals import mmlu_eval

>>> model = HookedTransformer.from_pretrained("gpt2-small")  
>>> results = mmlu_eval(model, subjects="abstract_algebra", num_samples=10)  
>>> print(f"Accuracy: {results['accuracy']:.2%}")  
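The zero-shot scoring rule described above can be sketched with a hypothetical helper (not the library’s code): the model’s log probability for each answer-letter token is compared, and the question counts as correct when the highest-scoring letter matches the labeled answer.

```python
LETTERS = ["A", "B", "C", "D"]

def score_question(letter_logprobs, answer_idx):
    """letter_logprobs: the log probability the model assigns to each of
    the four answer-letter tokens at the final prompt position;
    answer_idx: index (0-3) of the correct choice."""
    prediction = max(range(4), key=lambda i: letter_logprobs[i])
    return prediction == answer_idx

# Toy example: the model puts the most mass on "C" and the answer is C.
correct = score_question([-3.2, -2.9, -0.4, -5.0], answer_idx=2)
```

Overall accuracy is then the mean of this indicator over all questions, and per-subject accuracy is the same mean restricted to each subject.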
transformer_lens.evals.sanity_check(model)#

Very basic eval - just feeds a string into the model (in this case, the first paragraph of Circuits: Zoom In) and returns the loss. It’s a rough and quick sanity check: if the loss is <5 the model is probably OK; if the loss is >7, something has gone wrong.

Note that this is a very basic eval, and doesn’t really tell you much about the model’s performance.
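The rule of thumb above amounts to a simple threshold check, sketched here as a hypothetical helper (not part of the library):

```python
def interpret_sanity_loss(loss):
    """Apply the rough thresholds from sanity_check's docstring:
    loss < 5 is probably fine, loss > 7 suggests a bug, and values
    in between are inconclusive."""
    if loss < 5:
        return "probably OK"
    if loss > 7:
        return "something's gone wrong"
    return "borderline"
```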