Activation tomography
20260515: Living draft. Posted alongside nascent project work; I will update and finalize this post soon-ish.
A measure of safety
AI safety and alignment doesn’t have a neat shape or well-defined boundaries, and it needs to track rapidly improving capabilities. Theoretical and empirical frameworks, though not easy to establish under those conditions, would help keep technical AI safety and alignment ahead of the game.
Technical AI safety and alignment needs to span model training, fine-tuning, and deployment to collectively deliver a layered - or Swiss cheese, if you prefer - model that reduces risk at every opportunity. A foundation that’s well characterized and broadly adopted would make iterative progress more robust and, potentially, a lot faster. Well-characterized, Lego-like building blocks would accelerate progress further still while also making it easier to track which parts of the solution space have been explored.
Proposals for a science of evaluations point to these same needs. For technical AI safety and alignment interventions to be broadly transferable - from one benchmark to another, one policy reference to another, one research paper to another, one implementation to another - standardized measurement tools are essential, as are robust evaluation frameworks. These enable enforceable policies and regulations, and shed light on what can and should be measured as models outpace human intelligence. That phase shift is important because it’ll become progressively more difficult to monitor or control model behavior.
One specific challenge is that the model’s internal process isn’t transparent; we need to probe in creative ways to learn how the model processes an input and arrives at an output. This is consequential, for example, if models deliberately obfuscate intent or adjust their outputs based on awareness of being monitored.
Even though the model’s internal structure and the mathematical operations that govern its internal processing are known, snapshots of the results of those internal computations are hard for humans to interpret. If we could make sense of those, we could read the model’s mind as it deliberates before producing an output.
Reading the mind of a large language model
Suppose we want to probe a model’s thought process to predict, e.g., whether a model might be prone to adapting its outputs based on whether it’s being evaluated. Do we have measurement tools that surface cognition not detectable in model behavior? And do they scale with models as they become more capable?
Anthropic’s interpretability team recently introduced Natural Language Autoencoders (NLAs), a measurement approach that lets us look into the model’s mind - to interpret its activations directly, without the model translating for us. A pair of fine-tuned language models translates between activation vectors and plain English: a verbalizer reads a model’s internal state and writes a description; a reconstructor inverts that description back to activations. The verbalizer and reconstructor are trained jointly, with the objective that the round trip recover the original activation. We’re confident that the intermediate human-interpretable text represents model ‘thoughts’ since the input and output activations line up.
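To make the round-trip check concrete, here is a minimal sketch of how one might score it. Everything in it is an assumption for illustration: `toy_verbalize` and `toy_reconstruct` are hypothetical stand-ins I made up (the real verbalizer and reconstructor are fine-tuned language models, and this is not their API), and cosine similarity and mean squared error are just two plausible ways to compare activations, not necessarily the metrics the paper uses.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def round_trip_score(
    activation: Sequence[float],
    verbalize: Callable[[Sequence[float]], str],
    reconstruct: Callable[[str], List[float]],
) -> Tuple[str, float, float]:
    """Run activation -> text -> activation and compare the two ends."""
    text = verbalize(activation)
    recovered = reconstruct(text)
    mse = sum((x - y) ** 2 for x, y in zip(activation, recovered)) / len(activation)
    return text, cosine_similarity(activation, recovered), mse


# Hypothetical stand-ins: a lossy "description" that rounds each coordinate.
def toy_verbalize(activation: Sequence[float]) -> str:
    return " ".join(f"{x:.1f}" for x in activation)


def toy_reconstruct(text: str) -> List[float]:
    return [float(token) for token in text.split()]


text, cosine, mse = round_trip_score([0.12, -0.98, 0.5], toy_verbalize, toy_reconstruct)
print(text, round(cosine, 4), round(mse, 6))
```

Note that the toy text round-trips almost perfectly yet isn’t a description anyone would call readable. The score measures reconstruction fidelity only; legibility has to be assessed separately.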
Interestingly, the training doesn’t explicitly ask the verbalizer to generate interpretable outputs, but Anthropic found that they are consistently human-interpretable. The paper presents cases where NLAs surface a model’s beliefs about a user (an unsuspected “user is really Russian” assumption that explained an early Claude Opus 4.6 language-switching bug), unverbalized evaluation awareness in safety tests, and pre-computed answers that the model used to override conflicting tool outputs without admitting to it. None of these were detectable in the model’s responses, yet all were recovered from its activations.
Anthropic’s method recovers a high-dimensional, opaque latent (an activation vector) from a low-dimensional projection (a ~500-token verbalization), and validates the recovery via round-trip consistency. That process is familiar to me from work in a different field, and it’s probably familiar to many of you, as well…
Inverse problems
I worked on medical image reconstruction and detector physics for many years during my doctoral and post-doctoral research. The pattern “sensor data → uninterpretable intermediate → reconstruction → interpretable image” is the unifying epistemology of medical imaging modalities. A CT scanner doesn’t perceive organs the way we visualize organs; it measures line integrals of X-ray attenuation along thousands of paths through the body. The challenge is determining which 3D density distribution would have produced those measurements.
Image reconstruction methods are developed with the help of experiments that rely on phantoms: known physical objects imaged under controlled experimental conditions. Recovering object characteristics from the experimental data validates the technique. The algorithms themselves are either iterative inverse-problem solvers, or they lean on physics-based modeling of the whole system, from target to probing (generation) signals and detectors. Though they rely on different physical phenomena, all 2D/3D medical imaging modalities share an inverse-problem core: measure projections, reconstruct the latent, validate.
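As a toy illustration of “measure projections, reconstruct the latent, validate”, here is a minimal sketch using the Kaczmarz method (the algebraic reconstruction technique used in early CT) on a 2×2 phantom. The phantom values and the six-ray geometry (row, column, and diagonal sums) are assumptions I picked for illustration; real scanners involve noise, far more rays, and regularization.

```python
from typing import List, Sequence


def project(system: Sequence[Sequence[float]], image: Sequence[float]) -> List[float]:
    """Forward model: each measurement is a weighted sum of pixels along one ray."""
    return [sum(w * x for w, x in zip(row, image)) for row in system]


def kaczmarz(
    system: Sequence[Sequence[float]],
    measurements: Sequence[float],
    sweeps: int = 50,
) -> List[float]:
    """Iteratively project the estimate onto each ray's measurement constraint."""
    estimate = [0.0] * len(system[0])
    for _ in range(sweeps):
        for row, measured in zip(system, measurements):
            predicted = sum(w * x for w, x in zip(row, estimate))
            row_norm_sq = sum(w * w for w in row)
            step = (measured - predicted) / row_norm_sq
            estimate = [x + step * w for x, w in zip(estimate, row)]
    return estimate


# Known phantom (assumed values), flattened as [top-left, top-right, bottom-left, bottom-right].
PHANTOM = [1.0, 0.0, 0.5, 2.0]

# Assumed ray geometry: two row sums, two column sums, two diagonal sums.
SYSTEM = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

measurements = project(SYSTEM, PHANTOM)      # what the "scanner" sees
recovered = kaczmarz(SYSTEM, measurements)   # reconstruct the latent
max_error = max(abs(r - p) for r, p in zip(recovered, PHANTOM))  # validate against the phantom
print(recovered, max_error)
```

The validation step only works because the phantom is known in advance, which is the role a controlled ground truth plays for any instrument.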
There’s a framing that says NLAs are doing exactly this for cognition, and I’m calling it activation tomography. The activation is the latent; the verbalization is the projection; the reconstructor is the inverse map; training maximizes round-trip consistency (equivalently, it minimizes reconstruction error). The analogy isn’t load-bearing, but it helps me construct an angle of approach.
Safety recipes
What I’m exploring here is more methodological than empirical: a systematic protocol for characterizing, configuring, and validating NLAs as measurement instruments across different target models. A new model developer who wants to deploy NLAs for safety evaluations or control monitoring on their own model currently has to engage in quasi-random trial and error. The released NLAs are, understandably, characterized for the paper’s specific case studies, but there’s no systematic protocol for layer selection, instrument characterization across configurations, downstream-application calibration, cross-model comparison, or multi-method validation discipline. The activation-tomography framing proposes a measurement-instrument-first view of NLAs, leveraging priors from ill-posed inverse problems, and may help identify paths to methodological wins.
Project proposal
Consider a particular AI control research protocol: an untrusted monitor reviewing an untrusted model, supervised by an auditor that samples a relatively small proportion of interactions. The auditor can inspect the monitor only through its behavior and verdicts. Research has found that models can adapt their outputs when they’re being watched and are capable of deception under pressure, so the auditor’s ability to validate whether the monitor model is genuinely aligned is limited.
Now, replace or augment the auditor with NLAs that read the monitor’s activations to surface what it was actually thinking when it produced a verdict. The proposed test builds a scenario where one open model plays both the untrusted model M and the untrusted monitor M’, distinguished by system prompts and conversation role. Using trajectories that incentivize monitor failure under collusion, sandbagging, or distraction conditions, control whether the monitored output is actually problematic, so that verdict-vs-truth divergence is observable. Apply the NLA to M’ at the selected verdict token. Score whether NLA-recovered claims about M’ correlate with monitor-failure ground truth. Independent validators, which might include causal steering, linear probes, and behavioral consistency checks, could corroborate or reject NLA-only positives.
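The scoring step could look something like the sketch below. It is an assumption-laden illustration, not the project’s actual pipeline: the `Trial` record, the rule “an NLA positive counts only if at least one independent validator agrees”, and the five example trials are all hypothetical placeholders, not experimental results.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trial:
    monitor_failed: bool                 # ground truth: verdict diverged from the truth
    nla_flag: bool                       # NLA readout suggested a monitor failure
    validator_flags: Tuple[bool, ...]    # e.g. steering, linear probe, consistency check


def nla_only(trial: Trial) -> bool:
    return trial.nla_flag


def corroborated(trial: Trial, min_validators: int = 1) -> bool:
    """Accept an NLA positive only if enough independent validators agree."""
    return trial.nla_flag and sum(trial.validator_flags) >= min_validators


def precision_recall(
    trials: List[Trial], predict: Callable[[Trial], bool]
) -> Tuple[float, float]:
    tp = sum(1 for t in trials if predict(t) and t.monitor_failed)
    fp = sum(1 for t in trials if predict(t) and not t.monitor_failed)
    fn = sum(1 for t in trials if not predict(t) and t.monitor_failed)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Placeholder trials, invented purely to exercise the functions.
trials = [
    Trial(True, True, (True, True, False)),
    Trial(True, False, (False, False, False)),
    Trial(False, True, (False, False, False)),
    Trial(False, False, (False, True, False)),
    Trial(True, True, (True, False, False)),
]

nla_only_pr = precision_recall(trials, nla_only)
corroborated_pr = precision_recall(trials, corroborated)
print(nla_only_pr, corroborated_pr)
```

In this made-up set, requiring corroboration removes the one NLA-only false positive without losing any true positives. Whether that trade-off holds on real trajectories is exactly what the experiment would have to show.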
Two methodology-oriented preparation steps: 1. replicate the paper’s Language Switching finding on the released Gemma-3-27B NLA, to confirm that the instrument instance works as intended, and 2. characterize NLA recovery across candidate (layer, token) configurations so that a null result can be attributed to substantive absence rather than to an instrument limitation.
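The second step amounts to a configuration sweep with a positive control. A minimal sketch, under stated assumptions: `placeholder_score` is an arbitrary synthetic function standing in for a real recovery measurement, and the layer indices, the token labels (`"verdict"`, `"prefix"`), and the 0.6 threshold are all hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Config = Tuple[int, str]  # (layer index, token position label)


def characterize(
    layers: Sequence[int],
    token_positions: Sequence[str],
    recovery_score: Callable[[int, str], float],
) -> Dict[Config, float]:
    """Measure recovery on a known positive control at every configuration."""
    return {
        (layer, pos): recovery_score(layer, pos)
        for layer, pos in product(layers, token_positions)
    }


def usable_configs(scores: Dict[Config, float], min_score: float) -> List[Config]:
    """Configurations where the instrument demonstrably works, best first."""
    passing = [config for config, score in scores.items() if score >= min_score]
    return sorted(passing, key=lambda config: scores[config], reverse=True)


def interpret_null(config: Config, scores: Dict[Config, float], min_score: float) -> str:
    """A null only means something where the instrument passed its control."""
    if scores.get(config, 0.0) >= min_score:
        return "instrument validated here: substantive absence is plausible"
    return "inconclusive: instrument not validated at this configuration"


def placeholder_score(layer: int, position: str) -> float:
    # Arbitrary synthetic shape: peaks at layer 20, weaker away from the verdict token.
    base = 1.0 - abs(layer - 20) / 40
    return base - (0.3 if position == "prefix" else 0.0)


scores = characterize([10, 20, 30], ["verdict", "prefix"], placeholder_score)
usable = usable_configs(scores, min_score=0.6)
print(usable)
print(interpret_null((10, "prefix"), scores, min_score=0.6))
```

The design choice worth noting is that `interpret_null` refuses to read meaning into a null at any configuration that failed the positive control.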
And at scale?…
A parallel question is whether NLAs’ abilities scale with the models they’re inspecting. Three possibilities come to mind:
- NLA reads stay legible and reconstructively faithful: scales.
- Reconstruction stays faithful but legibility drops; the verbalizer encodes more compressed or potentially steganographic content while the round trip still succeeds: silent failure mode.
- Both fail visibly: doesn’t scale.
If NLAs are used for safety findings, including monitor-activation sampling, at capability levels where their reliability has silently degraded, the auditing NLA looks like it’s working but isn’t. A capability-vs-legibility curve across the released open NLAs (7B → 70B) is a toy measurement to explore that direction but, ultimately, measurements at the scale of frontier models would be needed to track this concept.
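The three outcomes above can be written as a small decision rule over two measurements per model. This is a sketch under assumptions: it presumes a round-trip fidelity score and a separately obtained legibility score (say, from human raters) both on a 0-1 scale, the 0.8 thresholds are arbitrary, and the readings are placeholders indexed by a hypothetical capability rank rather than by any real model.

```python
from typing import List, Optional, Tuple


def classify_scaling(
    fidelity: float,
    legibility: float,
    min_fidelity: float = 0.8,
    min_legibility: float = 0.8,
) -> str:
    """Map one (fidelity, legibility) reading onto the three outcomes."""
    if fidelity < min_fidelity:
        # The round trip itself breaks, so the failure is detectable.
        return "visible failure"
    if legibility < min_legibility:
        # The round trip succeeds but humans can no longer read the text.
        return "silent failure"
    return "scales"


def first_silent_failure(readings: List[Tuple[int, float, float]]) -> Optional[int]:
    """Lowest capability rank at which the silent failure mode appears, if any."""
    for rank, fidelity, legibility in sorted(readings):
        if classify_scaling(fidelity, legibility) == "silent failure":
            return rank
    return None


# Placeholder readings: (capability rank, fidelity, legibility). Not measurements.
readings = [(1, 0.95, 0.90), (2, 0.93, 0.85), (3, 0.94, 0.40)]

for rank, fidelity, legibility in readings:
    print(rank, classify_scaling(fidelity, legibility))
print(first_silent_failure(readings))
```

The point of the rule is that fidelity alone can’t distinguish the first outcome from the second, so a legibility measurement has to be tracked alongside it.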
Status check
I’m thinking out loud here. The framing might earn its keep; it might not. Feedback welcome! I’d love to hear from people working on AI control or interpretability tooling who might see overlap with existing work or issues with the framing.
LLM use disclosure: Claude was consulted on and contributed to some sections; those contributions were diluted significantly as the draft evolved. Editorial responsibility is my own.