ICML 2025 Tutorial on Mechanistic Interpretability for Language Models

Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), yielding many novel insights while also introducing new challenges. Given how quickly the topic is attracting the attention of the ML/AI community, this tutorial aims to provide a comprehensive overview of MI for LMs, including its historical context, the techniques used to implement and evaluate MI, findings and applications based on MI, and future challenges. The tutorial is organized around a Beginner's Roadmap to MI that the presenters have carefully curated, with the aim of enabling researchers new to MI to quickly pick up the field and apply MI techniques in their LM applications.

This tutorial follows the outline of our survey paper. If you find it helpful, please cite the work:


@article{rai2024practical,
    title={A practical review of mechanistic interpretability for transformer-based language models},
    author={Rai, Daking and Zhou, Yilun and Feng, Shi and Saparov, Abulhair and Yao, Ziyu},
    journal={arXiv preprint arXiv:2407.02646},
    year={2024}
}