Contents

AI: align + control + ?

Safety layers

Recently, I listened to a conversation in which Anthropic AI safety researchers discussed AI Control as an add-on to aligned models, and found myself thinking again about how layering mitigations for AI safety will necessarily extend beyond stacking multiple techniques in and over a single model.

Some of the control (and robustness) work involves using small models to monitor larger (more capable) models for misalignment, since misaligned characteristics emerge as models scale. This dual-model approach seems prudent, and experiments indicate it's effective. Still, model alignment and control (as currently contemplated) both target a single model's misalignment tendencies.

Additional layers of protection might be added by aggregating outputs from different models (some equivalent of air-gapped, so they can’t confer). Or could independent models somehow choose (or fall into) misalignment together? Seemingly impossible if effectively air-gapped, right? Then I recall that Einstein was skeptical of quantum entanglement as “spooky action at a distance” and I start giving the near-certain overlaps in training data the side-eye.

It’s worth noting that if a bad actor has access to a stand-alone model, or a misaligned model becomes incorrigible and chooses to act autonomously, such techniques wouldn’t apply in any case.

Coherent control of quantum states

I was introduced to coherent control of quantum systems in graduate school. (Completely unrelated to my research in medical imaging, but I knew researchers in the field.) The fundamental idea is that creating coherent superpositions of quantum states of a system generates interference, making the resulting population distribution sensitive to the relative phases of the pathway amplitudes. An ultrafast laser pulse carrying suitable spectral bandwidth establishes the coherent superposition, and the pulse’s phase/amplitude structure sets those relative phases — thus controlling which states are preferentially populated. The pulse shape is adaptively optimized by measuring the target state yield.
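As a toy numerical sketch of that phase sensitivity (the amplitude magnitudes are invented, and the whole system is reduced to a single relative-phase knob): two pathway amplitudes to the same target state interfere, and the resulting population swings between constructive and destructive extremes as their relative phase varies.

```python
import numpy as np

# Two pathways to the same target state, each with a complex amplitude.
# The transition probability depends on their relative phase - the
# hallmark of coherent control. (Toy numbers, purely illustrative.)
a1, a2 = 0.6, 0.8  # hypothetical pathway amplitude magnitudes

def population(relative_phase):
    """Target-state population from two interfering pathway amplitudes."""
    amplitude = a1 + a2 * np.exp(1j * relative_phase)
    return np.abs(amplitude) ** 2

constructive = population(0.0)    # phases aligned: (0.6 + 0.8)^2 = 1.96
destructive = population(np.pi)   # phases opposed: (0.6 - 0.8)^2 = 0.04
print(constructive, destructive)  # a 49x contrast from phase alone
```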

A technique being developed at the time was the use of evolutionary algorithms for adaptive feedback control. The feedback prompts modifications to the pulse to optimize its interaction with the target system.
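A minimal sketch of that adaptive feedback loop, with a stand-in "experiment" reduced to one pulse-phase parameter (the yield function below is invented, not a physical model): mutate the pulse, measure the target yield, keep whatever improves it.

```python
import numpy as np

rng = np.random.default_rng(0)

def yield_from_pulse(phase):
    # Stand-in for the measurement: target-state yield as a function of a
    # single pulse-phase parameter (an assumed toy landscape, maximized
    # at phase = 1.3).
    return np.cos(phase - 1.3) ** 2

# Simplest (1+1) evolution strategy: mutate, measure, keep the best.
best_phase, best_yield = 0.0, yield_from_pulse(0.0)
for _ in range(200):
    candidate = best_phase + rng.normal(scale=0.2)
    y = yield_from_pulse(candidate)
    if y > best_yield:
        best_phase, best_yield = candidate, y

print(best_phase, best_yield)  # typically climbs to near phase = 1.3
```

Real pulse shapers expose hundreds of spectral phase/amplitude parameters rather than one, which is exactly why population-based evolutionary search was attractive.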

If you’re curious, I recommend this slightly dated but informative overview by a pioneer of coherent control, Princeton’s Herschel Rabitz - Control of quantum phenomena: past, present and future. It covers the prior 20+ years’ development of the technique.

Interpretability and control of LLMs

Quantum coherent control raises a structural question about LLMs: if capabilities are represented as circuit contributions in activation space, and those contributions combine via vector addition, does the framework of interference and phase-based control carry over? The interference turns out to be real-valued and geometric rather than complex and wave-like, but the parallel is more than metaphorical.

In a transformer’s residual stream, multiple circuits write vector contributions that sum linearly. The downstream effect depends on the relative geometry of those vectors; their angles in activation space function analogously to phase. When circuits write to overlapping subspaces, their contributions can reinforce or cancel depending on relative orientation, which is real-valued interference. Feature superposition and feature splitting, already documented in mechanistic interpretability research, are arguably interference phenomena described in different language.
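A toy illustration of this real-valued interference (the write vectors and readout direction are invented): two circuits add their contributions into a shared stream, and the downstream effect depends entirely on their relative orientation.

```python
import numpy as np

# Two "circuits" write vectors into a shared residual stream; a downstream
# readout projects onto a fixed direction. Relative angle plays the role
# of phase: aligned writes reinforce, opposed writes cancel. (Toy vectors.)
readout = np.array([1.0, 0.0])

def downstream(write_a, write_b):
    return (write_a + write_b) @ readout  # contributions sum linearly

aligned = downstream(np.array([0.7, 0.2]), np.array([0.5, -0.1]))   # 1.2
opposed = downstream(np.array([0.7, 0.2]), np.array([-0.5, 0.3]))   # 0.2
print(aligned, opposed)  # same first circuit, opposite interference
```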

This raises concrete questions. Can mechanistic interpretability characterize the relative geometry between circuit outputs for a given capability? Ablation studies, where one circuit is knocked out and nonlinear output changes are measured, may already be the LLM equivalent of the two-slit experiment: evidence that pathways interact rather than contribute independently. If so, the coherent control framework suggests a methodology: rather than interpreting circuits individually, map the interference structure between them, then steer by modulating their relative orientations (via activation patching, steering vectors, or structured input design).
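As a toy illustration of that non-additivity (the model and numbers are invented), consider two pathways feeding a shared nonlinearity: the effect of ablating both differs from the sum of ablating each alone, and that interaction term is the "two-slit" signature.

```python
import numpy as np

# Toy model: two circuit outputs feed a shared nonlinearity. If pathways
# contributed independently, ablation effects would add; a nonzero
# interaction term is evidence they interact. (Illustrative numbers only.)
def model(circuit_a_on, circuit_b_on):
    x = 1.0 * circuit_a_on + 1.5 * circuit_b_on
    return np.tanh(x)  # the nonlinearity couples the pathways

full = model(1, 1)
effect_a = full - model(0, 1)    # output change from ablating circuit A
effect_b = full - model(1, 0)    # output change from ablating circuit B
effect_ab = full - model(0, 0)   # output change from ablating both
interaction = effect_ab - (effect_a + effect_b)
print(interaction)  # nonzero: pathways interact rather than add
```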

The analogy has limits. LLM activations lack complex phase, so there’s no equivalent of topological or geometric phase effects. A circuit is a pathway, not a state; the closer analog of an eigenstate would be an input that passes through a circuit unchanged up to scaling. And the “Hamiltonian” (the equivalent of the known dynamics that makes quantum control tractable) is an active target of mechanistic interpretability research.
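A minimal sketch of that eigenstate analog, under the simplifying assumption that a circuit's action on the residual stream is a linear map W (a hypothetical 2x2 matrix): the inputs it returns unchanged up to scaling are exactly its eigenvectors.

```python
import numpy as np

# Hypothetical linear "circuit" acting on a 2-dimensional residual stream.
W = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# For a symmetric W, eigh returns eigenvalues in ascending order and
# eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(W)
v = eigvecs[:, 1]  # eigenvector for the larger eigenvalue (3.0)

# The circuit scales this input without rotating it: W v = lambda v,
# the analog of an eigenstate for this linearized picture.
assert np.allclose(W @ v, eigvals[1] * v)
```

Real circuits include nonlinearities (attention, MLP activations), so this linear picture is at best a local approximation.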

Entanglement

Quantum entanglement offers a further parallel: when two subsystems interact, their states become correlated such that neither can be fully described independently. Phase information lives in the joint state, not in either subsystem alone.
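This can be made concrete with the standard Bell-state example: the joint two-qubit state is pure, yet tracing out either qubit leaves a maximally mixed subsystem, with no trace of the joint state's coherence.

```python
import numpy as np

# Bell state (|00> + |11>)/sqrt(2) in the basis |00>, |01>, |10>, |11>.
phi = np.zeros(4)
phi[0] = phi[3] = 1 / np.sqrt(2)
rho = np.outer(phi, phi.conj())  # density matrix of the pure joint state

# Partial trace over qubit B: reshape to indices (a, b, a', b') and
# contract the two B indices.
rho_a = np.trace(rho.reshape(2, 2, 2, 2), axis1=1, axis2=3)
print(rho_a)  # 0.5 * identity: no coherence survives in the subsystem
```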

The entanglement analogy extends this to multi-agent settings: agents sharing training signal or environment may develop correlated internal representations, where single-agent interpretability loses essential information, just as partial trace over one subsystem destroys coherence in an entangled state.

The connection to AI Control (e.g., Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats) is structural rather than superficial: both quantum coherent control and adaptive AI deployment achieve control over a system whose internals are opaque, through structured interaction and output feedback alone. The critical disanalogy is adversarial: a quantum system doesn’t model its controller. Still, the shared methodological stance - optimize what you can modulate while circumventing a requirement of full internal access - may point toward a useful synthesis between interpretability-based and behavior-based approaches to alignment.


Just leaving this here as a placeholder of sorts for the time being. And, to be clear, I think of these layers as prudent caution for safe(r) use of untrusted systems. In a perfect world, the systems’ emergent behaviors would actually improve alignment - a topic for another day.