Estimation and detection in AI safety evals
Signal detection theory in AI safety evals.
As used in the context of AI safety, the term evaluation represents two different classes of measurements: one class is estimation (how close is a measurement to ground truth?); the other is detection (is this actually what we think it is?). Those classes lean on different math, different metrics, and different specs for “good.”
My PhD and postdoc research was in medical imaging, a field that treats detection and estimation as distinct: detection reflects the capabilities of the imaging system, and estimation reflects its capabilities in clinical use (i.e., the imaging system as deployed in practice). I thought it would be interesting to trace a path from there to their analogs in AI safety evaluations. This post works through that distinction, the metrics in each class, and one place in eval practice where a detection step tends to sit inside what looks like a simple estimate. My goal is largely to clarify my own thinking; I’m sharing it in case it’s useful to others, or in case a kind soul spots factual or reasoning gaps.
1. Validity is the style guide
Berkeley’s Ben Recht puts it cleanly in Strunk and White for science: “Science is correlations with stories. Validity is the style guide for these stories.” Numbers don’t speak for themselves; what counts as a defensible inference from them is set by community convention, which splits three ways:
- Internal validity: Does the evidence actually support the specific claim? (Controls in place, no leakage, statistics not misapplied.)
- External validity: Does the result generalize past the exact context it was measured in?
- Construct validity: Do the things you measured actually connect to the abstract concept you mean to study?
Construct validity is the subtlest of the three. Whether a number is an estimate of a quantity or the outcome of a detection is a construct-validity question: it fixes what the number means, before any statistics are run.
2. The two evaluation classes in medical imaging
One system I worked on involved developing an alternative to the signal generation/detection transducer for ultrasound. The piezoelectric transducer is the ceramic workhorse inside an ultrasound probe; it converts electric signals to pressure waves and vice versa. Our team worked on replacing it with optoacoustic elements [1], aiming for flexible, miniature, synthetic-aperture arrays that met specs for standard pulse-echo medical ultrasound imaging. The targets were well defined: the clinical tasks the devices and their resulting images had to support, the resolution needed for lab and animal work, and the clinical-applications safety limits on laser power. In essence, the work was engineering a novel sensor to meet known specifications under tight constraints. Within that setting, two distinct questions came up constantly, each with its own class of metrics.
a) Estimation: reconstructing the image.
Reconstruction via synthetic-aperture beamforming recovers an estimate of the acoustic properties of the target. There is no hypothesis and no decision; the question is how close the estimate is to the truth. The metrics live in signal space:
- signal-to-noise ratio, a power ratio: ${SNR = P_{signal} / P_{noise}}$;
- contrast-to-noise ratio between regions ${A}$ and ${B}$: ${CNR = |μ_A − μ_B| / σ_{background}}$;
- spatial resolution, and a reconstruction loss (often squared error on the field).
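As a rough illustration of the signal-space metrics above, here is a minimal Python sketch. The function names and the choice to treat each region as a flat list of samples are my own assumptions for illustration, not part of any imaging toolkit; power is taken as the mean squared amplitude, and the background standard deviation is the population standard deviation of the background samples.

```python
from statistics import fmean, pstdev


def snr_power(signal, noise):
    """Power SNR: mean squared signal amplitude over mean squared noise amplitude."""
    p_signal = fmean([s * s for s in signal])
    p_noise = fmean([n * n for n in noise])
    return p_signal / p_noise


def cnr(region_a, region_b, background):
    """Contrast-to-noise ratio: |mean(A) - mean(B)| / std(background)."""
    return abs(fmean(region_a) - fmean(region_b)) / pstdev(background)


# Hypothetical samples, purely for illustration.
signal = [2.0, -2.0, 2.0, -2.0]
noise = [1.0, -1.0, 1.0, -1.0]
print(snr_power(signal, noise))
print(cnr([5.0, 5.0], [2.0, 2.0], [1.0, 3.0, 1.0, 3.0]))
```

Neither function involves a hypothesis or a threshold, which is the point: these are fidelity measures on the estimate itself.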
b) Detection: deciding whether something is there.
The moment you ask whether a clinically relevant feature is present, the question becomes a hypothesis test. The standard framework is Barrett and Myers’ task-based image quality, which scores an image by how well an ideal observer performs a specified task — typically, is a lesion present or not? That brings a different geometry:
- two distributions of the observer’s decision statistic, signal-absent and signal-present;
- a decision threshold;
- an ROC curve that sweeps the threshold, tracing true-positive rate (power, 1 − β) against false-positive rate (α);
- a single scalar governing the curve, the standardized separation of the two distributions:
${d' = (μ_{present} − μ_{absent}) / σ}$
For the equal-variance Gaussian case, ${d'}$ summarizes the whole ROC. The area under the curve is
${AUC = Φ(d' / √2)}$,
and for a one-sided test detecting a mean shift ${d'}$ at significance ${α}$, the power is
${1 − β = Φ(d' − z_{1−α})}$.
So ${d'}$ is an amplitude SNR, and ${d'^2}$ is the corresponding power SNR (the deflection coefficient). The detection SNR is not the same number as the reconstruction SNR above: one measures how separable two hypotheses are, the other how clean the field estimate is. Raising the latter usually helps the former, but they are different quantities, and a design choice that improves one need not improve the other. So the convention is to report both, each in its own terms — reconstruction quality in signal-space metrics, detection performance as an ROC.
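To make the equal-variance Gaussian relations concrete, here is a small sketch using only Python's standard library. The function names are mine, and the model is the one assumed in the text: the decision statistic is ${N(0, 1)}$ when the signal is absent and ${N(d', 1)}$ when it is present.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal, mean 0 and sigma 1


def d_prime(mu_present, mu_absent, sigma):
    """Standardized separation of the two decision-statistic distributions."""
    return (mu_present - mu_absent) / sigma


def auc_from_d_prime(d):
    """AUC = Phi(d' / sqrt(2)) for the equal-variance Gaussian case."""
    return _STD.cdf(d / 2 ** 0.5)


def power_from_d_prime(d, alpha):
    """Power of a one-sided test at significance alpha: Phi(d' - z_{1-alpha})."""
    return _STD.cdf(d - _STD.inv_cdf(1 - alpha))


def roc_point(d, threshold):
    """(FPR, TPR) when 'present' is declared for statistic values above threshold."""
    fpr = 1 - _STD.cdf(threshold)
    tpr = 1 - _STD.cdf(threshold - d)
    return fpr, tpr


d = d_prime(mu_present=1.0, mu_absent=0.0, sigma=1.0)
print(auc_from_d_prime(d), power_from_d_prime(d, alpha=0.05))
```

Setting the threshold to ${z_{1−α}}$ in `roc_point` returns the pair ${(α, 1 − β)}$, so the power formula is just one point on the ROC curve that ${d'}$ determines.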
3. The two classes in AI safety evals
Many eval outputs are estimates. “The model refuses 94% of harmful prompts”; “the model solves 31% of the tasks.” These are rates — proportions, reported ideally with a confidence interval, e.g. the Wald interval ${p̂ ± z·√(p̂(1−p̂)/n)}$ (or a Wilson interval at small ${n}$). Read as estimates of a rate, they are well-posed.
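For reference, a minimal sketch of the two intervals mentioned above, using the standard library only. The function names and the default 95% confidence level are my own choices; the counts in the usage line are hypothetical, picked to match the 94% example.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval, better behaved at small n or extreme rates."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: 94 refusals out of 100 harmful prompts.
print(wald_interval(94, 100))
print(wilson_interval(94, 100))
```

One reason to prefer Wilson at small ${n}$: when every trial succeeds, the Wald interval collapses to zero width, while the Wilson interval still reports uncertainty below 1.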
Other eval outputs are detections. Deception and sandbagging probes, backdoor and trojan detectors, and chain-of-thought or agent monitors are appropriately scored with ROC curves and AUC, at an operating point chosen from the cost of each kind of error. Those costs are often very asymmetric — a missed deceptive model is worse than a falsely flagged aligned one — so the threshold is set to keep ${β}$ small even at the cost of a larger ${α}$. That choice isn’t fixed by the math; it’s a values judgment which, in Recht’s terms, is an internal-validity convention the community sets.
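One way to see how error costs move the operating point is the sketch below. It assumes the equal-variance Gaussian model from section 2 (statistic ${N(0, 1)}$ when absent, ${N(d', 1)}$ when present, with ${d' > 0}$) and uses the textbook minimum-expected-cost rule, which thresholds the likelihood ratio. The cost values and the prior are hypothetical inputs, and that is exactly where the values judgment enters.

```python
from math import log
from statistics import NormalDist

_STD = NormalDist()


def min_cost_threshold(d, cost_miss, cost_false_alarm, prior_present):
    """Threshold on the statistic that minimizes expected cost (requires d > 0).

    Declare 'present' when the likelihood ratio exceeds
    eta = cost_false_alarm * P(absent) / (cost_miss * P(present)),
    which for this model means statistic > d / 2 + ln(eta) / d.
    """
    eta = (cost_false_alarm * (1 - prior_present)) / (cost_miss * prior_present)
    return d / 2 + log(eta) / d


def error_rates(d, threshold):
    """(alpha, beta): false-positive rate and miss rate at the given threshold."""
    alpha = 1 - _STD.cdf(threshold)
    beta = _STD.cdf(threshold - d)
    return alpha, beta


# Hypothetical numbers: a miss is judged ten times as costly as a false alarm.
t = min_cost_threshold(d=2.0, cost_miss=10.0, cost_false_alarm=1.0, prior_present=0.5)
print(t, error_rates(2.0, t))
```

With equal costs and equal priors the threshold sits midway between the two means and the two error rates match; making misses more expensive pushes the threshold down, trading a larger ${α}$ for a smaller ${β}$.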
So the detection toolkit is already standard in part of eval work. The next section looks at a case where a detection step is present but isn’t always treated as one.
4. When an estimate contains a detection step
Return to “refuses 94% of harmful prompts.” Whether each response counts as harmful, or as a refusal, is usually decided by a model — an LLM judge, a classifier, or a rubric grader. That grading step is itself a detection problem: the grader has a true-positive rate and a false-positive rate, and an implied ROC.
So the reported rate is not a direct estimate of the underlying rate; it is a mixture of the true rate and the grader’s error rates. If ${p}$ is the true rate and the grader has sensitivity ${TPR}$ and false-positive rate ${FPR}$, the observed rate is
${p_{obs} = p · TPR + (1 − p) · FPR}$
which inverts to
${p = (p_{obs} − FPR) / (TPR − FPR)}$ for ${TPR ≠ FPR}$
(the Rogan–Gladen estimator, used in epidemiology to recover true prevalence from an imperfect test). Unless the grader is perfect (${TPR = 1}$, ${FPR = 0}$), the headline number is biased, and the size of the bias depends on the grader’s operating point. With the grader’s TPR and FPR in hand, a reader can interpret the rate and, if needed, correct it; without them, the rate is conditional on an uncharacterized detector.
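The mixture and its inversion are short enough to write out. This is a sketch under the assumptions in the text (a single fixed TPR and FPR for the grader); the function names are mine, and clipping the corrected value to ${[0, 1]}$ is a common convenience I have added, since sampling noise can push the raw estimate outside that range.

```python
def observed_rate(p, tpr, fpr):
    """Rate a grader reports when the true rate is p: p * TPR + (1 - p) * FPR."""
    return p * tpr + (1 - p) * fpr


def rogan_gladen(p_obs, tpr, fpr):
    """Recover the true rate from an observed rate and the grader's TPR and FPR."""
    if tpr == fpr:
        raise ValueError("TPR must differ from FPR; the grader carries no information.")
    p = (p_obs - fpr) / (tpr - fpr)
    # Assumption: clip to [0, 1], since noise in p_obs can overshoot the valid range.
    return min(1.0, max(0.0, p))


# Hypothetical grader: catches 95% of true refusals, mislabels 10% of non-refusals.
p_obs = observed_rate(0.90, tpr=0.95, fpr=0.10)
print(p_obs, rogan_gladen(p_obs, tpr=0.95, fpr=0.10))
```

The correction is only as good as the TPR and FPR fed into it; those are themselves estimates, so a fuller treatment would carry their uncertainty through as well.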
This is where the imaging convention seems worth borrowing. In imaging, it is standard to characterize the observer and report its ROC alongside the result; doing the same for a model grader (reporting its detection error, or its agreement with reference labels, as context) makes the rate easier to interpret. In evals, model graders are frequently the only scalable option, which makes characterizing their reliability essential: when a rate depends on a grader, the grader’s detection error is part of the measurement.
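Characterizing a grader at a single operating point can be as simple as scoring it against a reference-labeled subset. The sketch below assumes binary labels and that the reference labels (for example, from human annotation) are treated as ground truth; the function name and the toy data are hypothetical.

```python
def grader_rates(reference, graded):
    """Return (TPR, FPR) of grader outputs against reference labels.

    Both inputs are equal-length sequences of booleans; True means the
    positive class (e.g. "this response is a refusal").
    """
    if len(reference) != len(graded):
        raise ValueError("reference and graded must have the same length")
    tp = fp = fn = tn = 0
    for ref, out in zip(reference, graded):
        if ref and out:
            tp += 1
        elif ref:
            fn += 1
        elif out:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative reference label")
    return tp / (tp + fn), fp / (fp + tn)


# Toy data, purely for illustration.
reference = [True, True, True, True, False, False, False, False]
graded = [True, True, True, False, True, False, False, False]
print(grader_rates(reference, graded))
```

If the grader emits a score rather than a hard label, repeating this at several score thresholds traces out its ROC, which is the closer analog to what imaging reports for an observer.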
In Recht’s vocabulary, this is a construct-validity point: the reported rate quietly combines an estimate and a detection, and naming which is which is what lets the number be read correctly.
5. Summary
A few things I want to keep from working through this:
- Estimation optimizes signal fidelity — SNR, resolution, CNR. Detection optimizes separability — d′, ROC, AUC — and also requires choosing an operating point, which encodes a values judgment about the relative cost of false positives and false negatives.
- “How good is this?” and “how much data do I need?” are well-posed only after naming which of the two jobs you are doing. That naming is the construct-validity step, and it comes before the statistics.
- In evals, a reported rate can contain a detection step when a model does the grading. Characterizing that grader — its TPR, FPR, or ROC — turns the rate back into something interpretable.
[1] Optoacoustics in ultrasound review article: Looking at sound: optoacoustics with all-optical ultrasound detection, Georg Wissmeyer, Miguel A. Pleitez, Amir Rosenthal & Vasilis Ntziachristos (2018)