# Tutorials
Start with the Main Demo to learn how the library works and to see its basic features in action.
## Where To Start
To see what using the library for exploratory analysis looks like in practice, check out my notebook analysing Indirect Object Identification, or my recording of myself doing research!
## Demos
Activation Patching in TransformerLens - Accompanies the [Exploratory Analysis Demo](https://colab.research.google.com/github/TransformerLensOrg/TransformerLens/blob/main/demos/Exploratory%20Analysis%20Demo.ipynb). This demo explains how to use Activation Patching in TransformerLens, a mechanistic interpretability technique that uses causal intervention to identify which activations in a model matter for producing an output.
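If you want a feel for the mechanics before opening the notebook, here is a minimal hand-rolled sketch: patch clean activations into a corrupted run, one (layer, position) of the residual stream at a time, and see how much each patch restores the clean answer. The prompts and answer token are illustrative; the demo (and the library’s `transformer_lens.patching` helpers) covers this much more thoroughly.

```python
import torch
from functools import partial
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")

# Illustrative clean/corrupted prompt pair differing in a single name
clean = model.to_tokens("When Mary and John went to the store, John gave a drink to")
corrupt = model.to_tokens("When Mary and John went to the store, Mary gave a drink to")
answer = model.to_single_token(" Mary")  # correct completion of the clean prompt

_, clean_cache = model.run_with_cache(clean)

def patch_resid(resid, hook, pos):
    # Overwrite the corrupted residual stream at one position with the clean value
    resid[:, pos] = clean_cache[hook.name][:, pos]
    return resid

results = torch.zeros(model.cfg.n_layers, clean.shape[1])
for layer in range(model.cfg.n_layers):
    for pos in range(clean.shape[1]):
        logits = model.run_with_hooks(
            corrupt,
            fwd_hooks=[(utils.get_act_name("resid_pre", layer),
                        partial(patch_resid, pos=pos))],
        )
        # How much does this single patch restore the clean answer's logit?
        results[layer, pos] = logits[0, -1, answer].item()
```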
Attribution Patching - Attribution Patching is an incomplete project that uses gradients to take a linear approximation to activation patching. It’s a good approximation when patching in small activations, like the outputs of individual attention heads, and a poor one when patching in large activations, like the residual stream.
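The core trick is cheap: instead of one forward pass per patch, you take one clean forward pass plus one corrupted forward-and-backward pass, then estimate each patch’s effect as (clean activation − corrupted activation) · gradient. A minimal sketch for per-head outputs (prompts and answer token are again illustrative):

```python
import torch
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

model = HookedTransformer.from_pretrained("gpt2")

clean = model.to_tokens("When Mary and John went to the store, John gave a drink to")
corrupt = model.to_tokens("When Mary and John went to the store, Mary gave a drink to")
answer = model.to_single_token(" Mary")

_, clean_cache = model.run_with_cache(clean)

# Cache corrupted activations and their gradients w.r.t. the metric
acts, grads = {}, {}
is_z = lambda name: name.endswith("hook_z")  # per-head attention outputs
model.reset_hooks()
model.add_hook(is_z, lambda act, hook: acts.update({hook.name: act.detach()}), "fwd")
model.add_hook(is_z, lambda grad, hook: grads.update({hook.name: grad.detach()}), "bwd")
logits = model(corrupt)
logits[0, -1, answer].backward()
model.reset_hooks()

# First-order estimate of the metric change from patching each head's clean output in
for layer in range(model.cfg.n_layers):
    name = utils.get_act_name("z", layer)
    per_head = ((clean_cache[name] - acts[name]) * grads[name]).sum(dim=(0, 1, 3))
    print(f"Layer {layer}: {per_head}")
```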
Exploratory Analysis - Probably the best place to start after the Main Demo. Demonstrates how to use TransformerLens to perform exploratory analysis - it focuses less on rigour and more on quickly getting a grasp of what’s going on. Uses a lot of useful interpretability techniques, like logit attribution and activation patching. Steal liberally from this!
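As a taste of one technique the notebook uses, direct logit attribution decomposes the final residual stream into per-component contributions and projects each onto the answer token’s unembedding direction. A minimal sketch (the prompt and answer token are illustrative):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in the city of")
answer = model.to_single_token(" Paris")

logits, cache = model.run_with_cache(tokens)

# Per-component pieces of the final residual stream (embeddings, each attn/MLP layer)
resid_stack, labels = cache.decompose_resid(layer=-1, return_labels=True)
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1)  # apply final LayerNorm

logit_dir = model.W_U[:, answer]  # unembedding direction for the answer token
contributions = resid_stack[:, 0, -1, :] @ logit_dir  # final position only

for label, value in zip(labels, contributions):
    print(f"{label}: {value.item():+.3f}")
```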
Grokking - “Grokking” is a phenomenon where a model first learns to memorise the training data (minimising training loss) but then, if trained for a lot longer, learns to generalise, leading to a sharp decrease in test loss as well. This demo trains a model on modular addition, verifies that it groks, and does some initial analysis. The demo is light on explanation, so you’ll probably want to pair it with Neel’s video series on the paper it’s based on.
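For context, the dataset behind the setup is tiny: every pair of residues mod p, labelled with their sum mod p, with only a small fraction used for training. A minimal sketch (p = 113 matches the paper; the exact train fraction here is illustrative):

```python
import torch

p = 113  # the modulus used in the grokking-on-modular-addition work
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # all (a, b) pairs
labels = (pairs[:, 0] + pairs[:, 1]) % p

# A small training fraction is what makes the memorise-then-generalise dynamic visible
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))
train_pairs, train_labels = pairs[perm[:n_train]], labels[perm[:n_train]]
test_pairs, test_labels = pairs[perm[n_train:]], labels[perm[n_train:]]
```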
Head Detector - Shows how to use TransformerLens to automatically detect several common types of attention head, and how to write your own custom detection algorithms for new head types!
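Under the hood, detection mostly means scoring each head’s attention pattern against a template. A hand-rolled previous-token-head detector, for instance, looks roughly like this (the prompt is arbitrary):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")
_, cache = model.run_with_cache(tokens)

# Score each head by its average attention from token i back to token i-1
scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]  # [batch, head, query_pos, key_pos]
    prev_token_attn = pattern.diagonal(offset=-1, dim1=-2, dim2=-1)
    scores[layer] = prev_token_attn.mean(dim=(0, -1))

# The highest-scoring heads are the previous-token-head candidates
for idx in scores.flatten().topk(5).indices:
    layer, head = divmod(idx.item(), model.cfg.n_heads)
    print(f"L{layer}H{head}: {scores[layer, head]:.3f}")
```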
Interactive Neuroscope - A very hacky demo, but this is a feature, not a bug. Shows how to quickly create useful web-based visualisations of data, even if you’re not a professional front-end developer. This demo creates an interactive Neuroscope - a visualisation of a neuron’s activations on text that dynamically updates as you edit the text.
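The demo builds its UI with Gradio; a stripped-down version of the idea, with a hypothetical neuron index and a crude HTML heatmap, fits in a few lines:

```python
import gradio as gr
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, NEURON = 9, 652  # hypothetical neuron to watch

def show_activations(text):
    if not text:
        return ""
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    acts = cache["post", LAYER][0, :, NEURON]  # MLP neuron activation per token
    max_act = acts.abs().max().clamp(min=1e-6)
    # Crude heatmap: redder background = higher activation (no HTML escaping - it's a sketch)
    spans = [
        f'<span style="background-color: rgba(255,0,0,{(a / max_act).clamp(min=0).item():.2f})">{tok}</span>'
        for tok, a in zip(model.to_str_tokens(text), acts)
    ]
    return "".join(spans)

# live=True re-runs the model on every edit, giving the dynamic updates
gr.Interface(show_activations, inputs="text", outputs="html", live=True).launch()
```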
LLaMA - Converts Meta’s LLaMA model (the 7B-parameter version for now, until multi-GPU support is added) to TransformerLens.
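Assuming you already have the weights in Hugging Face format locally (the path below is a placeholder), the conversion amounts to loading the HF model and handing it to `HookedTransformer.from_pretrained`, roughly as the notebook does:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformer_lens import HookedTransformer

MODEL_PATH = "/path/to/llama-7b"  # placeholder: your local HF-format weights

hf_model = LlamaForCausalLM.from_pretrained(MODEL_PATH)
tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH)

# Wrap the HF weights in a HookedTransformer so hooks and caching work as usual
model = HookedTransformer.from_pretrained(
    "llama-7b",
    hf_model=hf_model,
    tokenizer=tokenizer,
    fold_ln=False,
    center_writing_weights=False,
    center_unembed=False,
)
```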
Main Demo - The main demo. This is where to start if you’re new to TransformerLens. Shows a lot of great features for getting started, including the available models, how to access model activations, and other generally useful things you should know about.
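The core loop you’ll use everywhere is only a few lines: load a model by name, run it, and get every intermediate activation back in a cache. A minimal sketch:

```python
from transformer_lens import HookedTransformer

# Load any supported model by name (the demo lists the options)
model = HookedTransformer.from_pretrained("gpt2")

# One call tokenises, runs the model, and caches every intermediate activation
tokens = model.to_tokens("TransformerLens makes activations easy to inspect")
logits, cache = model.run_with_cache(tokens)

print(model.cfg.n_layers, model.cfg.n_heads)  # basic model shape
print(cache["pattern", 0].shape)    # layer-0 attention patterns [batch, head, q_pos, k_pos]
print(cache["resid_pre", 3].shape)  # residual stream entering block 3
```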
No Position Experiment - The accompanying notebook to Neel’s real-time research video. Trains a model with no positional embeddings to predict the previous token, and makes a start at analysing what’s going on there!
Othello-GPT - A demo notebook porting the weights of the Othello-GPT model from the excellent Emergent World Representations paper to TransformerLens. Neel’s sequence on investigating this model is also well worth reading if you’re interested in this topic!
SVD Interpreter Demo - Based on the Conjecture post about how the singular value decompositions of transformer matrices are surprisingly interpretable, this demo shows how to use TransformerLens to reproduce that result and investigate further.
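The demo uses the library’s SVD interpreter helper; the underlying idea, done by hand for a single head’s OV circuit (the layer/head choice is illustrative), is to take the SVD of the weight matrix and push each top singular direction through the unembedding to see which tokens it promotes:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer, head = 8, 10  # illustrative choice of head

# The head's OV circuit: a d_model -> d_model map through the attention head
W_OV = model.W_V[layer, head] @ model.W_O[layer, head]
U, S, Vh = torch.linalg.svd(W_OV)

# Rows of Vh are the head's output directions; project them onto the vocabulary
for i in range(5):
    token_scores = Vh[i] @ model.W_U  # [d_vocab]
    top_idx = token_scores.topk(8).indices
    print(i, [model.tokenizer.decode(t.item()) for t in top_idx])
```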
Tracr to TransformerLens - Tracr is a cool new DeepMind tool that compiles programs written in RASP to transformer weights. This is a (hacky!) script to convert Tracr weights from their JAX form to a TransformerLens HookedTransformer in PyTorch.