AI: align + control + ?
Safety layers
Recently, I listened to a conversation in which Anthropic AI safety researchers discussed AI Control as an add-on to aligned models, and found myself again thinking that layering mitigations for AI safety will necessarily extend beyond stacking multiple techniques in and over a single model.
Some of the control (and, also, robustness) work involves using small models to monitor larger (more capable) models for misalignment, since misaligned characteristics emerge as models scale. This dual-model approach seems prudent, and experiments indicate it’s effective. Still, model alignment and control (as currently contemplated) both target a single model’s misalignment tendencies.
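To make the layering a bit more concrete, here’s a minimal sketch (in Python) of what that small-model-watching-big-model pattern might look like. The model calls are stubs and the suspicion threshold is invented; this illustrates the general control pattern, not any particular published protocol.

```python
# Minimal sketch of trusted monitoring: a smaller, trusted model scores the
# output of a larger, untrusted model for signs of misalignment, and
# suspicious outputs are escalated for audit instead of being used directly.
# The stubbed model calls and the threshold below are illustrative assumptions.

from dataclasses import dataclass

SUSPICION_THRESHOLD = 0.2  # hypothetical; would be tuned on known-bad examples


@dataclass
class Decision:
    action: str   # "accept" or "escalate"
    score: float  # monitor's suspicion score in [0, 1]
    output: str


def untrusted_generate(task: str) -> str:
    """Stand-in for a call to the large, capable (untrusted) model."""
    return f"<proposed solution for: {task}>"


def trusted_monitor_score(task: str, output: str) -> float:
    """Stand-in for the small, trusted model rating how suspicious the
    untrusted model's output looks for this task."""
    return 0.05  # placeholder score


def monitored_step(task: str) -> Decision:
    output = untrusted_generate(task)
    score = trusted_monitor_score(task, output)
    if score > SUSPICION_THRESHOLD:
        return Decision("escalate", score, output)  # route to human audit
    return Decision("accept", score, output)


if __name__ == "__main__":
    print(monitored_step("parse user-supplied YAML safely"))
```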
Additional layers of protection might be added by aggregating outputs from different models (some equivalent of air-gapped, so they can’t confer). Or could independent models somehow choose (or fall into) misalignment together? Seemingly impossible if effectively air-gapped, right? Then I recall that Einstein was skeptical of quantum entanglement as “spooky action at a distance” and I start giving the near-certain overlaps in training data the side-eye.
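For what it’s worth, a toy version of that aggregation layer might look like the sketch below. The “models” are stand-in lambdas and exact-match voting is a crude proxy for real semantic agreement; note that agreement alone wouldn’t rule out correlated failures stemming from shared training data.

```python
# Toy sketch of aggregating independent ("air-gapped") models: accept an
# answer only if a quorum of separately queried models agree, otherwise
# withhold and escalate. Exact-match voting stands in for a real semantic
# comparison; all models and thresholds here are illustrative.

from collections import Counter


def aggregate(task: str, models, quorum: float = 0.6) -> str | None:
    # Each model is queried independently, with no shared conversation state.
    answers = [model(task) for model in models]
    answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= quorum:
        return answer  # broad agreement across independent models
    return None        # disagreement: withhold the answer and escalate


if __name__ == "__main__":
    models = [
        lambda t: "42",  # stand-ins for three separately trained models
        lambda t: "42",
        lambda t: "7",
    ]
    print(aggregate("What is 6 * 7?", models))  # 2/3 agree -> "42"
```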
It’s worth noting that if a bad actor has access to a stand-alone model, or a misaligned model becomes incorrigible and chooses to act autonomously, such techniques wouldn’t apply in any case.
Coherent control
I was introduced to coherent control of quantum systems in graduate school. (Completely unrelated to my research in medical imaging, but I knew researchers in the field.) The fundamental idea is that eliciting coherence between the quantum states of a system generates interference. That interference carries phase information that can be used to control the system’s state. The coherence can be elicited with ultrafast laser pulses carrying suitable phase content, optimized based on interference measurements.
A technique being developed at the time was the use of evolutionary algorithms for adaptive feedback control: a measurement of the system’s response feeds back into modifications of the pulse shape, iteratively optimizing its interaction with the target system.
If you’re curious, I recommend this slightly dated but informative overview by a pioneer of coherent control, Princeton’s Herschel Rabitz - Control of quantum phenomena: past, present and future. It covers the prior 20+ years’ development of the technique.
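To convey the flavor of that feedback loop, here’s a toy evolutionary optimization, not an actual pulse-shaping experiment: a candidate pulse is parameterized by spectral phases, a measured figure of merit serves as feedback, and mutation nudges the phases toward a stronger signal.

```python
# Toy adaptive feedback loop: mutate a pulse's spectral phase parameters,
# "measure" a figure of merit, and keep whatever improves it. The measure()
# function is a synthetic stand-in for the real experimental feedback signal.

import math
import random

N_PHASES = 16          # spectral phase parameters on the pulse shaper
POPULATION = 20        # candidate pulses per generation
GENERATIONS = 50
MUTATION_SIGMA = 0.3   # radians


def measure(phases: list[float]) -> float:
    """Synthetic feedback: reward phases that line up (a crude proxy for
    constructive interference driving the target transition)."""
    return abs(sum(math.cos(p) for p in phases)) / len(phases)


def mutate(phases: list[float]) -> list[float]:
    return [p + random.gauss(0.0, MUTATION_SIGMA) for p in phases]


def optimize() -> float:
    best = [random.uniform(-math.pi, math.pi) for _ in range(N_PHASES)]
    best_score = measure(best)
    for _ in range(GENERATIONS):
        candidates = [mutate(best) for _ in range(POPULATION)]
        for candidate in candidates:
            score = measure(candidate)  # feedback from the "experiment"
            if score > best_score:
                best, best_score = candidate, score
    return best_score


if __name__ == "__main__":
    print(f"best figure of merit: {optimize():.3f}")
```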
Adaptive feedback
This topic came to mind in the context of representing capabilities as ‘states’ in an LLM’s network structure. Admittedly, it’s a stretch to propose the existence of associated wave-like characteristics which might interfere.
Still, contemplating whether there’s some sense in which it might be so, a few naïve questions arise: Is mechanistic interpretability capable of shedding light on whether LLMs have some analog to states capable of interference? Is there any plausibility in the idea of a transformer circuit as an eigenstate? What test might measure whether LLM system states (presuming there’s a suitable definition) have the ability to interfere? Basically, what’s the LLM equivalent of the two-slit experiment? And, if the concept holds any water, what’s the mechanism for coaxing the LLM into coherent states and then using that for alignment or control purposes?
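One naïve way to make the interference question operational, purely as a toy and not a claim about real LLMs: check whether a network’s response to two ‘state’ directions applied together departs from the sum of its responses to each alone. A nonzero cross term is at least one crude analog of an interference term.

```python
# Toy superposition check: for a linear map, f(a + b) == f(a) + f(b), so any
# nonzero cross term is one naive analog of "interference." The tiny random
# network and the two direction vectors are entirely synthetic.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 32))
W2 = rng.normal(size=(8, 64))


def f(x: np.ndarray) -> np.ndarray:
    """Toy nonlinear 'readout' standing in for downstream behavior.
    Note f(0) = 0, so the superposition check below is fair."""
    return W2 @ np.tanh(W1 @ x)


a = rng.normal(size=32)  # two synthetic "state" directions
b = rng.normal(size=32)

cross_term = f(a + b) - (f(a) + f(b))
print(f"norm of cross term: {np.linalg.norm(cross_term):.3f}")  # 0 => pure superposition
```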
I don’t yet have anything particularly deep to share on this line of thought, but wanted to jot it down, particularly after (finally) pulling up Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats.
Just leaving this here as a placeholder of sorts for the time being. And, to be clear, I think of these layers as prudent caution for safe(r) use of untrusted systems. In a perfect world, the systems’ emergent behaviors would actually improve alignment - a topic for another day.