ICML 2025 Tutorial on Mechanistic Interpretability for Language Models

Ziyu Yao, Daking Rai
Department of Computer Science, George Mason University
Mechanistic interpretability (MI) is an emerging subfield of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), yielding many novel insights while also introducing new challenges. Given how quickly this topic is attracting the ML/AI community's attention, the goal of this tutorial is to provide a comprehensive overview of MI for LMs, including its historical context, techniques for implementing and evaluating MI, findings and applications based on MI, and open challenges. The tutorial is organized around a Beginner's Roadmap to MI that the presenters have carefully curated, aiming to help researchers new to MI quickly pick up the field and leverage MI techniques in their LM applications.

This tutorial follows the outline of our survey paper. If you find the tutorial helpful, please cite the paper:

@article{rai2024practical,
  title={A practical review of mechanistic interpretability for transformer-based language models},
  author={Rai, Daking and Zhou, Yilun and Feng, Shi and Saparov, Abulhair and Yao, Ziyu},
  journal={arXiv preprint arXiv:2407.02646},
  year={2024}
}

Schedule

Monday, July 14th, 2025.
https://icml.cc/virtual/2025/tutorial/40007
Details coming soon!

Presenters

Dr. Ziyu Yao

Assistant Professor, GMU CS

Daking Rai

PhD Student, GMU CS

Acknowledgement