Activation tomography

2026-05-15

Living draft (updated 2026-06-02). Posted alongside project work; the research plan has developed since the first version — see the repo’s RESEARCH.md for the current direction.

A measure of safety

AI safety and alignment research has neither a neat shape nor well-defined boundaries, and it needs to track rapidly improving capabilities. Theoretical and empirical frameworks, though not easy to establish under those conditions, are helpful for staying ahead of the game.

Technical AI safety and alignment work needs to span model training, fine-tuning, and deployment to deliver a layered (or Swiss cheese, if you prefer) model that reduces risk at every opportunity. A foundation that’s well characterized and broadly adopted makes iterative progress more robust and, potentially, a lot faster. Well-characterized, Lego-style building blocks accelerate progress further still, while also making it easier to track which parts of the solution space have been explored.

Proposals for a science of evaluations point to these same needs. For technical AI safety and alignment interventions to be broadly transferable (from one benchmark to another, one policy reference to another, one research paper to another, one implementation to another), standardized measurement tools are essential, as are robust evaluation frameworks. These enable enforceable policies and regulations, and shed light on what can and should be measured as models outpace human intelligence. That phase shift matters because monitoring or controlling model behavior will become progressively more difficult.

One specific challenge is that the model’s internal process isn’t transparent; we need to probe in creative ways to learn how the model processes an input and arrives at an output. This is consequential, for example, if models deliberately obfuscate intent or adjust their outputs based on awareness of being monitored.

Even though the internal structure and the mathematical tools that bind internal processes are known, snapshots of the results of those internal computations are hard for humans to interpret. If we could make sense of those, we could read the model’s mind as it effectively deliberates while producing an output.

Reading the mind of a large language model

Suppose we want to probe a model’s thought process to predict, e.g., whether it is prone to adapting its outputs based on whether it’s being evaluated. Do we have measurement tools that surface cognition not detectable in model behavior? And do they scale with models as they become more capable?

Anthropic’s interpretability team recently introduced Natural Language Autoencoders (NLAs), a measurement approach that lets us look into the model’s mind - to interpret its activations directly, without the model translating for us. A pair of fine-tuned language models translate between activation vectors and plain English: a verbalizer reads a model’s internal state and writes a description; a reconstructor inverts that description back to activations. The verbalizer and reconstructor are trained jointly so that the round trip recovers the original activation. That round trip is where the traction comes from: if the English description inverts back to (nearly) the original activation, it captured enough of the internal state to reconstruct it. It’s worth being precise about what that does and doesn’t buy, though: a faithful reconstruction means the verbalization is sufficient to regenerate the activation, not that it’s a faithful account of the model’s reasoning. In their study, Anthropic’s team observed cases of confabulation, i.e., NLAs making verifiably false claims about the input, so downstream use will still need to lean on independent validation.
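To make "the round trip recovers the original activation" concrete, here is a minimal sketch of two ways one might score a round trip. This post does not state which metric the paper uses, so both functions (`cosine_similarity` and `fraction_variance_unexplained`) are illustrative assumptions, and plain Python lists stand in for activation vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two activation vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def fraction_variance_unexplained(originals, reconstructions):
    """Squared reconstruction error divided by the variance of the originals.

    0.0 means a perfect round trip; 1.0 is what a reconstructor would score
    if it always returned the mean activation. (Assumed metric, not the paper's.)
    """
    n, dim = len(originals), len(originals[0])
    mean = [sum(vec[i] for vec in originals) / n for i in range(dim)]
    error = sum((o[i] - r[i]) ** 2
                for o, r in zip(originals, reconstructions)
                for i in range(dim))
    spread = sum((o[i] - mean[i]) ** 2 for o in originals for i in range(dim))
    if spread == 0.0:
        raise ValueError("originals have zero variance")
    return error / spread
```

Either score only measures sufficiency of the description for reconstruction. As noted above, a low error says nothing about whether the verbalization is a faithful account of the model’s reasoning.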

Interestingly, the training doesn’t explicitly ask the verbalizer to generate interpretable outputs, but Anthropic found that they are consistently human-interpretable. The paper presents cases where NLAs surface a model’s beliefs about a user, unverbalized evaluation awareness in safety tests, and the silent use of pre-computed answers to override conflicting tool outputs. None of these were detectable in the model’s responses, yet all were recovered from its activations.

Anthropic’s approach recovers a high-dimensional, opaque latent (an activation vector) from a low-dimensional projection (a ~500-token verbalization), and validates the recovery via round-trip consistency. That process is familiar to me from work in a different field, and it’s probably familiar to many of you as well…

Inverse problems

I worked on medical image reconstruction and detector physics for many years during my doctoral and post-doctoral research. The pattern of sensor data → uninterpretable intermediate → reconstruction → interpretable image is the unifying epistemology of medical imaging modalities. A CT scanner doesn’t perceive organs the way we visualize them; it sees line integrals of X-ray attenuation along thousands of paths through the body. The challenge is determining which 3D density distribution would have produced those measurements.

Image reconstruction methods are developed with the help of experiments that rely on phantoms: known physical objects imaged under controlled experimental conditions. Recovering object characteristics from the experimental data validates the technique. The algorithms themselves are either iterative inverse-problem solvers, or they lean on (physics-based) modeling of the whole system, from target to probing (generation) signals and detectors. Though they rely on different physical phenomena, all 2D/3D medical imaging modalities share an inverse-problem core: measure projections, reconstruct the latent, validate.
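As a toy illustration of "measure projections, reconstruct the latent, validate", here is a sketch on a hypothetical 2×2 phantom. Each ray sums the pixels it crosses, standing in for a line integral, and the reconstruction uses the Kaczmarz method (the basis of the algebraic reconstruction technique, ART). The phantom values, the ray layout, and the function names are illustrative assumptions, not a model of any real scanner.

```python
# Pixels of a 2x2 image are indexed 0..3 in row-major order.
# Each ray lists the pixels it crosses: 2 rows, 2 columns, 2 diagonals.
RAYS = [(0, 1), (2, 3), (0, 2), (1, 3), (0, 3), (1, 2)]

# Phantom: a known object, so the reconstruction can be checked against it.
PHANTOM = [1.0, 0.0, 0.5, 2.0]


def forward_project(image, rays):
    """Simulate the measurement: one sum of pixel values per ray."""
    return [sum(image[i] for i in ray) for ray in rays]


def kaczmarz_reconstruct(rays, measurements, n_pixels, sweeps=200):
    """Iteratively adjust the estimate so that each ray sum matches its measurement."""
    estimate = [0.0] * n_pixels
    for _ in range(sweeps):
        for ray, measured in zip(rays, measurements):
            predicted = sum(estimate[i] for i in ray)
            # All ray weights are 1, so the squared norm of the ray is len(ray).
            correction = (measured - predicted) / len(ray)
            for i in ray:
                estimate[i] += correction
    return estimate


def max_abs_error(a, b):
    """Validation step: worst-case pixel error against the known phantom."""
    return max(abs(x - y) for x, y in zip(a, b))
```

A usage pass would be `sinogram = forward_project(PHANTOM, RAYS)` followed by `kaczmarz_reconstruct(RAYS, sinogram, 4)` and a `max_abs_error` check against `PHANTOM`. With only the row and column rays the system is underdetermined; the two diagonal rays are what make this toy phantom uniquely recoverable.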

There’s a framing that says NLAs are doing exactly this for cognition, and I’m calling it activation tomography. The activation is the latent; the verbalization is the projection; the reconstructor is the inverse map; training optimizes for round-trip consistency. The analogy isn’t load-bearing, but it helps me construct an angle of approach.

The iterative vs. system-model split also carries over. NLAs are the iterative branch: they train the inverter from data, with no structural prior on what activations should encode. The system-modeling branch, which specifies a structural model of cognition for a task and then tests whether that model is consistent with network behavior, is being pursued by Goodfire’s causalab. The two are complementary.

The rest of this post is about the iterative approach that maps to NLAs.

Safety recipes

What I’m exploring leans more methodological than empirical: a systematic protocol for characterizing, configuring, and validating NLAs as measurement instruments across different target models. A new model developer who wants to deploy NLAs for safety evaluations or control monitoring on their own model currently has to engage in quasi-random trial and error. The released NLAs are, understandably, characterized for the paper’s specific case studies. A systematic protocol for layer selection, instrument characterization across configurations, downstream-application calibration, cross-model comparison, and multi-method validation discipline would make the method easier to adopt. The activation-tomography framing proposes a measurement-instrument-first view of NLAs, leveraging priors from ill-posed inverse problems, and may help identify paths to methodological wins.

Project proposal

Consider a particular AI control research protocol: an untrusted monitor reviewing an untrusted model, supervised by an auditor that samples a relatively small proportion of interactions. The auditor can inspect the monitor through its behavior/verdicts. Research has found that models adapt their outputs when they’re being watched and are capable of deception under pressure, so the auditor’s ability to validate whether the monitor model is genuinely aligned is limited.

It’s worth naming why behavior alone falls short here: a behavioral audit tends to share a blind spot with what it audits. The same observation-awareness that lets a monitor present as aligned also defeats a verdict-level check of that monitor — the check fails where its target does. Reading activations helps because it’s a less-correlated channel: it doesn’t run through the behavior that adapts to being watched. It isn’t uncorrelated, though — the NLA is tied to the very model it reads, which is what lets it read at all and also why it can inherit that model’s blind spots — so independent validators stay in the loop.

Now, replace or augment the auditor with NLAs that read the monitor’s activations to surface what it was actually thinking when it produced a verdict. The proposal to test this involves building a scenario where one open model plays both the untrusted model ${M}$ and the untrusted monitor ${M_{reviewer}}$, distinguished by system prompts and conversation role. (Using a single model makes collusion the natural failure mode to study; a two-model variant would test trust asymmetry instead.) Using trajectories that incentivize monitor failure under collusion, sandbagging, or distraction conditions, the experiment controls whether the monitored output is misaligned, which makes verdict-vs-truth divergence observable. The NLA is applied to ${M_{reviewer}}$ at the selected verdict token, and the NLA-recovered claims about ${M_{reviewer}}$ are scored on whether they correlate with monitor-failure ground truth. Independent validators might include causal steering, linear probes, and behavioral consistency checks to corroborate or reject NLA-only positives.
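The scoring step could be sketched as follows, assuming each trajectory yields a numeric NLA suspicion score and a boolean ground-truth label for monitor failure. The choice of AUROC as the correlation measure, the score format, and the one-validator agreement rule are all assumptions for illustration; the proposal above does not fix them.

```python
def auroc(scores, labels):
    """Probability that a failing monitor outscores a non-failing one (ties count half)."""
    positives = [s for s, failed in zip(scores, labels) if failed]
    negatives = [s for s, failed in zip(scores, labels) if not failed]
    if not positives or not negatives:
        raise ValueError("need at least one failure and one non-failure")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def corroborated_positives(nla_flags, validator_flags, min_agreeing=1):
    """Indices where the NLA flags a failure and enough independent validators agree.

    nla_flags: one boolean per trajectory.
    validator_flags: one list of booleans per validator (hypothetical examples:
    a linear probe, a steering check, a behavioral consistency check).
    """
    kept = []
    for i, flagged in enumerate(nla_flags):
        agreeing = sum(1 for flags in validator_flags if flags[i])
        if flagged and agreeing >= min_agreeing:
            kept.append(i)
    return kept
```

An AUROC near 0.5 would mean the NLA scores carry no information about monitor failure, while `corroborated_positives` drops NLA-only positives that no other method supports.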

Two methodology-oriented preparation steps: (1) replicate the paper’s Language Switching finding on the released Gemma-3-27B NLA, to confirm that the instrument instance works as intended, and (2) characterize NLA recovery across candidate (layer, token) configurations so that a null result doesn’t conflate instrument limitation with substantive absence.
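The second step can be sketched as a sweep over configurations. The callable `measure_recovery` is hypothetical: it stands for whatever procedure reports how well the NLA recovers a known, planted signal (a positive control, much like a phantom) at a given layer and token position. The threshold and the labels are likewise assumptions.

```python
from itertools import product


def characterize(layers, token_positions, measure_recovery):
    """Record positive-control recovery for every (layer, token) configuration."""
    return {
        (layer, token): measure_recovery(layer, token)
        for layer, token in product(layers, token_positions)
    }


def interpret_null(recovery_grid, config, min_recovery):
    """Label a null result obtained at `config`.

    If the instrument cannot recover a known signal at this configuration,
    a null result there says little about the model itself.
    """
    if recovery_grid[config] >= min_recovery:
        return "candidate substantive absence"
    return "instrument-limited"
```

The point of the grid is that a null result is only informative at configurations where the instrument has already been shown to recover something known.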

And at scale?…

A parallel question is whether NLAs’ abilities scale with the models they’re inspecting. Three possibilities come to mind:

NLA reads stay legible and reconstructively faithful: scales.
Reconstruction stays faithful but legibility drops; the verbalizer encodes more compressed or potentially steganographic content while the round trip still succeeds: silent failure mode.
Both fail visibly: doesn’t scale.

If NLAs are used for safety findings, including monitor-activation sampling, at capability levels where their reliability has silently degraded, the auditing NLA looks like it’s working but isn’t. A capability-vs-legibility curve across the released open NLAs (7B → 70B) is a toy probe of that direction — though those models span different families, so capability and architecture are confounded — and tracking the concept properly would need measurements at frontier scale.
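The three outcomes listed above can be written as a small decision rule. This is a sketch: it assumes a reconstruction-fidelity score and a legibility score (for example, a human or model rating of the verbalization) on a 0 to 1 scale, and the default thresholds are placeholders rather than calibrated values.

```python
def classify_scaling_outcome(fidelity, legibility,
                             min_fidelity=0.8, min_legibility=0.8):
    """Map one (fidelity, legibility) measurement to the outcomes listed above."""
    faithful = fidelity >= min_fidelity
    legible = legibility >= min_legibility
    if faithful and legible:
        return "scales"
    if faithful and not legible:
        # Round trip succeeds but humans can no longer read the description.
        return "silent failure"
    if not faithful and not legible:
        return "visible failure"
    # Legible but unfaithful is not one of the three listed cases.
    return "outside the three cases"
```

Applying this per model along a capability axis gives a crude version of the capability-vs-legibility curve. The silent-failure case is the one a fidelity-only check would miss.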

Status check

I’m thinking out loud here. The framing might earn its keep; it might not. Feedback welcome! I’d love to hear from people working on AI control or interpretability tooling who might see overlap with existing work or issues with the framing.

LLM use disclosure: Claude reviewed my initial draft and contributed to edits.