Getting Started in Mechanistic Interpretability
Mechanistic interpretability is a young, small field with a lot of open problems. This means there's both a lot of low-hanging fruit and a low bar to entry - if you would like to help, please try working on one! The standard answer to "why has no one done this yet?" is simply that there aren't enough people. Key resources:
ARENA Mechanistic Interpretability Tutorials from Callum McDougall. A comprehensive practical introduction to mech interp, written in TransformerLens - full of snippets to copy, and each tutorial comes with exercises and solutions! Notable tutorials:
Coding GPT-2 from scratch, with an accompanying video tutorial from me (1 2) - a good introduction to transformers
Introduction to Mech Interp and TransformerLens: An introduction to TransformerLens and mech interp via studying induction heads. Covers the foundational concepts of the library
Indirect Object Identification: a replication of the Interpretability in the Wild paper, covering standard mech interp techniques such as direct logit attribution, activation patching, and path patching
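To give a flavour of one of the techniques mentioned above: direct logit attribution exploits the fact that (ignoring LayerNorm) the map from the residual stream to the logits is linear, so each component's contribution to a given token's logit is just its output projected onto that token's unembedding direction. Here is a minimal toy sketch with random numpy arrays - all dimensions and names are illustrative, not from any real model or the TransformerLens API:

```python
import numpy as np

# Illustrative toy dimensions (not from a real model).
d_model, n_components, d_vocab = 16, 4, 50
rng = np.random.default_rng(0)

# The final residual stream is a sum of component outputs
# (embedding, attention heads, MLPs) at the final position.
component_outputs = rng.normal(size=(n_components, d_model))
W_U = rng.normal(size=(d_model, d_vocab))  # unembedding matrix

final_residual = component_outputs.sum(axis=0)
logits = final_residual @ W_U

# Direct logit attribution: project each component's output onto
# the unembedding column of the token we care about.
token = 7
contributions = component_outputs @ W_U[:, token]

# Because the map is linear, the per-component contributions
# sum exactly to the token's total logit.
assert np.isclose(contributions.sum(), logits[token])
```

In a real transformer, LayerNorm before the unembedding makes this only approximately linear; the ARENA exercises and TransformerLens handle that detail properly.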
A Pragmatic Vision for Interpretability: My take on where mech interp should be heading and what a useful research project actually looks like - essential reading before picking a problem to work on
How Can Interpretability Researchers Help AGI Go Well: The case for why interpretability matters for AI safety, and how to orient your research toward the things that actually move the needle
A Comprehensive Mechanistic Interpretability Explainer: To look up all the jargon and unfamiliar terms you’re going to come across!
Neel Nanda’s YouTube channel: A range of mech interp video content, including paper walkthroughs and walkthroughs of doing research