ARENA: AI interpretability
🌱 notes 🌱
During my second batch at RC, I joined a study group working through the materials from Callum McDougall's Alignment Research Engineer Accelerator (ARENA) program, which is normally delivered by LISA, the London Initiative for Safe AI. ARENA is based on MLAB, the AI alignment bootcamp by Redwood Research; access to MLAB's curriculum can be requested at the link.
Notes in the works!
Recommended reading
ImageNet Classification with Deep Convolutional Neural Networks
A Mathematical Framework for Transformer Circuits
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Additional resources
See also my informal notes on neural networks and AI
CMU: Intro to machine learning
Neural networks: zero to hero (Karpathy)
Google: ML crash course: Linear regression
Neel Nanda 2024.07: An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 — AI Alignment Forum