ARENA: AI interpretability

Mechanistic interpretability for transformer architectures (and more!).

During my second batch at RC, I joined a study group covering the materials from Callum McDougall’s Alignment Research Engineer Accelerator (ARENA) program, which is normally delivered by LISA, the London Initiative for Safe AI. ARENA is based on MLAB, the AI alignment bootcamp by Redwood Research; access to MLAB’s curriculum can be requested at the link.

I may get around to writing up notes on the material covered; in the meantime, please enjoy the resource and reference links below.


Curriculum

Chapter 0 - Fundamentals

Chapter 1 - Transformer Interpretability

Chapter 2 - Reinforcement Learning

Chapter 3 - LLM Evaluations


ImageNet Classification with Deep Convolutional Neural Networks

Generative Adversarial Nets

Attention is All You Need

A Mathematical Framework for Transformer Circuits

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs (LessWrong)

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets (arXiv:2201.02177)

Progress measures for grokking via mechanistic interpretability (arXiv:2301.05217)


Additional resources

On the Measure of Intelligence (Chollet)

CMU: Intro to machine learning

Neural networks: zero to hero (Karpathy)

Neel Nanda (July 2024): An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 (AI Alignment Forum)

Multimodal interpretability in 2024

See also my informal notes on neural networks and safe AI.