Getting Started in Mechanistic Interpretability
Mechanistic interpretability is a very young and small field with a lot of open problems. This means that there is a lot of low-hanging fruit and that the bar for entry is low - if you would like to help, please try working on one! The standard answer to “why has no one done this yet” is just that there aren’t enough people!
Key resources:
ARENA Mechanistic Interpretability Tutorials from Callum McDougall. A comprehensive practical introduction to mech interp, written in TransformerLens. The tutorials are full of snippets to copy and come with exercises and solutions! Notable tutorials:
Coding GPT-2 from scratch, with an accompanying video tutorial from me (1, 2) - a good introduction to transformers
Introduction to Mech Interp and TransformerLens: An introduction to TransformerLens and mech interp via studying induction heads. Covers the foundational concepts of the library
Indirect Object Identification: a replication of the Interpretability in the Wild paper that covers standard techniques in mech interp such as direct logit attribution, activation patching and path patching
A Comprehensive Mechanistic Interpretability Explainer: To look up all the jargon and unfamiliar terms you’re going to come across!
Neel Nanda’s YouTube channel: A range of mech interp video content, including paper walkthroughs and walkthroughs of doing research
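To give a rough feel for activation patching, one of the techniques mentioned under the Indirect Object Identification tutorial above, here is a minimal sketch in plain Python. It does not use TransformerLens: the two-unit "model", its weights (`W_IN`, `W_OUT`), and the helpers `run` and `patching_effect` are all hypothetical, chosen only to illustrate the idea. The recipe is to run the model on a clean input and cache its activations, run it on a corrupted input, then overwrite one activation with its clean value and see how much of the clean output is restored.

```python
# Toy illustration of activation patching (NOT TransformerLens code).
# The model, weights, and helper names are hypothetical, for illustration only.

def relu(x):
    return max(0.0, x)

W_IN = [[1.0, 0.0], [0.0, 1.0]]  # hidden unit i reads only input i
W_OUT = [2.0, 0.5]               # output weight for each hidden unit

def run(inputs, patch=None):
    """Run the toy model; `patch` maps hidden index -> value to overwrite."""
    hidden = [relu(sum(w * x for w, x in zip(row, inputs))) for row in W_IN]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    output = sum(w * h for w, h in zip(W_OUT, hidden))
    return output, hidden

clean_input = [3.0, 1.0]
corrupted_input = [0.0, 1.0]  # differs from the clean input only in input 0

# Step 1: clean run, caching hidden activations.
clean_out, clean_hidden = run(clean_input)
# Step 2: corrupted run as the baseline.
corrupted_out, _ = run(corrupted_input)

def patching_effect(idx):
    """Fraction of the clean-vs-corrupted output gap restored by patching unit idx."""
    patched_out, _ = run(corrupted_input, patch={idx: clean_hidden[idx]})
    return (patched_out - corrupted_out) / (clean_out - corrupted_out)

# Step 3: patch each hidden unit in turn.
effects = [patching_effect(i) for i in range(2)]
print(effects)
```

In this toy setup, patching hidden unit 0 restores the full output difference and patching unit 1 restores none, which localises the behaviour to unit 0. Real models need the hook-based tooling covered in the tutorials, but the logic is the same.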