Getting Started in Mechanistic Interpretability
Mechanistic interpretability is a young, small field with a lot of open problems. This means there's both a lot of low-hanging fruit and a low bar to entry - if you would like to help, please try working on one! The standard answer to "why has no one done this yet?" is simply that there aren't enough people. Key resources:
ARENA Mechanistic Interpretability Tutorials from Callum McDougall. A comprehensive practical introduction to mech interp, written in TransformerLens - full of snippets to copy, and each tutorial comes with exercises and solutions! Notable tutorials:
Coding GPT-2 from scratch, with an accompanying video tutorial from me (1 2) - a good introduction to transformers
Introduction to Mech Interp and TransformerLens: An introduction to TransformerLens and mech interp via studying induction heads. Covers the foundational concepts of the library
Indirect Object Identification: a replication of the Interpretability in the Wild paper, covering standard mech interp techniques such as direct logit attribution, activation patching, and path patching
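To give a flavour of one of the techniques mentioned above: direct logit attribution exploits the fact that (ignoring LayerNorm) the map from the residual stream to the logits is linear, so each component's contribution to a given token's logit is just its output projected onto that token's unembedding direction. Here is a minimal toy sketch with random numpy arrays - all dimensions and names are illustrative, not from any real model or the TransformerLens API:

```python
import numpy as np

# Illustrative toy dimensions (not from a real model).
d_model, n_components, d_vocab = 16, 4, 50
rng = np.random.default_rng(0)

# The final residual stream is a sum of component outputs
# (embedding, attention heads, MLPs) at the final position.
component_outputs = rng.normal(size=(n_components, d_model))
W_U = rng.normal(size=(d_model, d_vocab))  # unembedding matrix

final_residual = component_outputs.sum(axis=0)
logits = final_residual @ W_U

# Direct logit attribution: project each component's output onto
# the unembedding column of the token we care about.
token = 7
contributions = component_outputs @ W_U[:, token]

# Because the map is linear, the per-component contributions
# sum exactly to the token's total logit.
assert np.isclose(contributions.sum(), logits[token])
```

In a real transformer, LayerNorm before the unembedding makes this only approximately linear; the ARENA exercises and TransformerLens handle that detail properly.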
A Pragmatic Vision for Interpretability: My take on where mech interp should be heading and what a useful research project actually looks like - essential reading before picking a problem to work on
How Can Interpretability Researchers Help AGI Go Well: The case for why interpretability matters for AI safety, and how to orient your research toward the things that actually move the needle
A Comprehensive Mechanistic Interpretability Explainer: To look up all the jargon and unfamiliar terms you’re going to come across!
Neel Nanda’s YouTube channel: A range of mech interp video content, including paper walkthroughs and walkthroughs of doing research