ARENA: AI interpretability
Mechanistic interpretability for transformer architectures (and more!).
During my second batch at RC, I joined a study group to work through the materials from Callum McDougall's Alignment Research Engineer Accelerator (ARENA) program, which is normally delivered in person at LISA, the London Initiative for Safe AI. ARENA builds on the MLAB AI alignment bootcamp by Redwood Research; access to MLAB's curriculum can be requested at the link.
I may get around to writing up notes on the material covered, but in the meantime please enjoy the resource and reference links below.
Curriculum
Chapter 0 - Fundamentals
- prereqs
- Python/PyTorch exercise: raytracing
- CNNs & ResNets
- optimization & hyperparameters
- backprop (see the training-loop sketch after this list)
- VAEs & GANs
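The optimization and backprop items above boil down to the standard PyTorch training loop. Here is a minimal sketch of that loop (my own illustration, not code from the ARENA exercises); the tiny model, the Adam learning rate, and the random regression data are placeholder assumptions.

```python
# Minimal sketch of an optimization/backprop loop in PyTorch.
# Model, data, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Fake regression data standing in for a real dataset.
x = torch.randn(256, 10)
y = torch.randn(256, 1)

for step in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # backprop: populate .grad on each parameter
    optimizer.step()               # gradient-based update (Adam here)
```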
Chapter 1 - Transformer Interpretability
Chapter 2 - Reinforcement Learning
- intro to RL
- Q-learning (model-free) and DQN (see the tabular sketch after this list)
- proximal policy optimization
- reinforcement learning from human feedback
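The Q-learning material in Chapter 2 centers on a single temporal-difference update rule. Below is a minimal tabular sketch (my own illustration, not ARENA code); the state/action counts, hyperparameters, and the epsilon-greedy helper are illustrative assumptions, and the reset/step interface they would plug into is not shown.

```python
# Minimal sketch of tabular (model-free) Q-learning.
# Sizes and hyperparameters are illustrative assumptions.
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

def q_learning_step(state, action, reward, next_state, done):
    # Temporal-difference update:
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    target = reward + (0.0 if done else gamma * Q[next_state].max())
    Q[state, action] += alpha * (target - Q[state, action])

def choose_action(state):
    # epsilon-greedy exploration over the current Q estimates
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(Q[state].argmax())
```

DQN replaces the table `Q` with a neural network and adds a replay buffer and target network, but the update target has the same form.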
Chapter 3 - LLM Evaluations
Recommended reading
ImageNet Classification with Deep Convolutional Neural Networks
A Mathematical Framework for Transformer Circuits
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets (arXiv:2201.02177)
Progress measures for grokking via mechanistic interpretability (arXiv:2301.05217)
Additional resources
On the Measure of Intelligence (Chollet)
CMU: Intro to machine learning
Neural networks: zero to hero (Karpathy)
Neel Nanda (July 2024): An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 (AI Alignment Forum)
Multimodal interpretability in 2024
See also my informal notes on neural networks and safe AI.