Gallery
Research done involving TransformerLens:
Progress Measures for Grokking via Mechanistic Interpretability (ICLR Spotlight, 2023) by Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
Finding Neurons in a Haystack: Case Studies with Sparse Probing by Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
Towards Automated Circuit Discovery for Mechanistic Interpretability by Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
Actually, Othello-GPT Has A Linear Emergent World Representation by Neel Nanda
A circuit for Python docstrings in a 4-layer attention-only transformer by Stefan Heimersheim and Jett Janiak
A Toy Model of Universality (ICML, 2023) by Bilal Chughtai, Lawrence Chan, Neel Nanda
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models (ICLR Workshop RTML, 2023) by Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
Eliciting Latent Predictions from Transformers with the Tuned Lens by Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt
User-contributed examples of the library in action:
Induction Heads Phase Change Replication: A partial replication of In-Context Learning and Induction Heads by Connor Kissane
Decision Transformer Interpretability: A set of scripts for training decision transformers that uses TransformerLens to view intermediate activations, perform attribution, and run ablations (a minimal example of caching activations follows below). A write-up of the initial work can be found here.
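The Decision Transformer Interpretability entry relies on TransformerLens's activation caching. The sketch below is an illustrative example, not code from that project: it shows how run_with_cache exposes intermediate activations, with the model name "gpt2" and the prompt chosen arbitrarily.

```python
# Minimal sketch of caching intermediate activations with TransformerLens.
# Illustrative only; the model name and prompt are arbitrary choices.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("TransformerLens caches every intermediate activation")

# run_with_cache returns the logits plus an ActivationCache keyed by hook name.
logits, cache = model.run_with_cache(tokens)

# Activations can be looked up by (name, layer), e.g. layer 0's attention pattern
# or the residual stream after block 0.
attn_pattern = cache["pattern", 0]   # shape: [batch, n_heads, query_pos, key_pos]
resid_post = cache["resid_post", 0]  # shape: [batch, pos, d_model]
print(attn_pattern.shape, resid_post.shape)
```

Ablation-style experiments in the same spirit are typically run with model.run_with_hooks, where a hook function can zero out or replace an activation during the forward pass.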