AI Alignment Research

Research interests

Technical AI safety and alignment, broadly:

  • AI evaluations and measurement methodology, mechanistic interpretability, AI control, scalable oversight, multi-agent safety, model organisms of misalignment.

Independent research program:

  • Measurement foundations of multi-agent AI systems: evaluation-invariant observables, perturbation–response analysis of collective fragility, and lengthening recovery times as early warnings of catastrophic transitions. The methodological orientation draws on physics, systems biology, and statistical mechanics to build a first-principles measurement theory that complements benchmark-based evaluations.

Active projects

emmy: Evaluation-invariant measurements for alignment of multi-agent systems

(The name honors Emmy Noether, 1882–1935, whose foundational work connecting symmetries to invariants underlies the framing of evaluation-invariant measurement.)

activation tomography: Reconstruction methods for model activations

liar liar: To err is human. Is the model scheming to deceive… or is it just wrong?