Estimation and detection in AI safety evals

2026-05-30

Signal detection theory in AI safety evals.

As used in the context of AI safety, the term evaluation represents two different classes of measurements: one class is estimation (how close is a measurement to ground truth?); the other is detection (are we picking up the desired signal?). Each class leans on its own metrics and its own spec for “good.”

Medical imaging was the subject of my PhD and postdoctoral research. It’s a field that takes care to handle estimation and detection separately: estimation reflects measurement system characteristics, while detection reflects clinical decision support capabilities. I thought it might be interesting to trace a path from there to analogs in AI safety evaluations. Working through that, I couldn’t avoid noticing that standard eval practice sometimes silently embeds an uncharacterized detector inside an estimation protocol. I’m far from the first to notice, and recent work on resolving this is summarized in the Recent work… section. I’m writing this to clarify my own thinking, and sharing it in case it’s useful to others. I welcome all feedback/corrections.

TL;DR

Estimation optimizes signal fidelity — SNR, resolution, CNR. Detection optimizes separability — d′, ROC, AUC — and also requires choosing an operating point, which encodes a values judgment about the relative cost of false positives and false negatives.
“How good is this?” and “how much data do I need?” are well-posed only after naming which of the two jobs you are doing: estimation or detection. That naming is the construct validity step, and it comes before the statistics.
In evals, a reported rate depends on detection statistics when a model does the grading. Characterizing that grader — its TPR, FPR, or ROC — makes the rate more robustly interpretable.

Validity is the style guide

As Ben Recht wrote in Strunk and White for science: “Science is correlations with stories. Validity is the style guide for these stories.” Numbers don’t speak for themselves; what counts as a defensible inference from them is set by community convention. In turn, community convention splits three ways:

Internal validity: Does the evidence actually support the specific claim?
External validity: Does the result generalize past the exact context it was measured in?
Construct validity: Do the things you measured actually connect to the abstract concept you mean to study?

Construct validity can be subtle, and it often raises the question: Is the measurement a valid estimate of a quantity, or is it polluted by detection artifacts?

The two evaluation classes in medical imaging

One system I worked on involved developing an alternative to the de facto standard signal generation/detection element for ultrasound imaging: the piezoelectric transducer. It’s the ceramic workhorse of ultrasound probes, converting electric signals to pressure waves (and vice versa) via the piezoelectric effect. Our team worked on replacing piezos with optoacoustic elements¹ and a laser source, aiming for flexible, miniature, synthetic aperture arrays that met specs for standard pulse-echo medical ultrasound imaging.

I was responsible for the detection side of things, and one of my research objectives was to develop a novel sensor to meet known specifications under tight constraints. The targets were well defined, including the clinical spec (tasks to support), the instrument quality spec (resolution and contrast required), and the safety spec (e.g., clinic-suitable limits on laser power). Throughout, two distinct considerations anchored the research, each with its own class of metrics.

a) Estimation: image reconstruction.

Reconstruction via synthetic aperture beamforming recovers an estimate of the acoustic properties of the target. No hypothesis, no decision; just a question: How close is the estimate? The system is tested under typical operating conditions, using target phantom object(s) with known acoustic properties. System parameters are swept, and the resulting technical characterization both informs downstream use and qualifies the images produced. The signal-fidelity and image-quality metrics include:

signal-to-noise ratio, a power ratio: SNR = ${P_{signal} / P_{noise}}$;
contrast-to-noise ratio between regions A and B: CNR = ${|μ_A − μ_B| / σ_{background}}$;
spatial resolution, the full width at half maximum of the point-spread function: ${δ}$ = FWHM(PSF);
reconstruction loss, the (mean) squared error on the field: MSE = ${(1/N) \sum_i |\hat{f}_i − f_i|^2}$.
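The four estimation metrics above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a beamforming pipeline: the function names are my own, the inputs are assumed to be plain sample arrays, and the FWHM helper assumes a single-peaked PSF that falls below half maximum on both sides of the sampled window.

```python
import numpy as np

def snr_power(signal, noise):
    """Power SNR: mean signal power divided by mean noise power."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)
    return np.mean(signal**2) / np.mean(noise**2)

def cnr(region_a, region_b, background):
    """Contrast-to-noise ratio: |mean(A) - mean(B)| / std(background)."""
    return abs(np.mean(region_a) - np.mean(region_b)) / np.std(background)

def fwhm(x, psf):
    """Full width at half maximum of a single-peaked PSF sampled on x.

    Assumes the PSF drops below half maximum inside the window on both
    sides; the two crossings are located by linear interpolation.
    """
    x = np.asarray(x, dtype=float)
    psf = np.asarray(psf, dtype=float)
    half = psf.max() / 2.0
    above = np.where(psf >= half)[0]
    lo, hi = above[0], above[-1]
    # np.interp needs increasing sample points, so order each pair by PSF value.
    left = np.interp(half, [psf[lo - 1], psf[lo]], [x[lo - 1], x[lo]])
    right = np.interp(half, [psf[hi + 1], psf[hi]], [x[hi + 1], x[hi]])
    return right - left

def mse(estimate, truth):
    """Mean squared error between a reconstructed field and the known phantom."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.mean(np.abs(estimate - truth) ** 2)

# Sanity check on a unit-variance Gaussian PSF, where FWHM = 2*sqrt(2*ln 2).
x = np.linspace(-10, 10, 2001)
print(fwhm(x, np.exp(-x**2 / 2)), 2 * np.sqrt(2 * np.log(2)))
```

None of these functions involves a hypothesis or a threshold; each compares a measurement to a known reference, which is what marks them as estimation metrics.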

b) Detection: lesion identification.

The moment you ask whether a clinically relevant feature is present, the question becomes a hypothesis test. The standard framework is Barrett and Myers’ task-based image quality, which scores an image by how well an ideal observer performs a specified task. Here, that’s typically “Is a lesion present or not?” and the characterization leans on a different set of metrics:

two distributions of the observer’s decision statistic, signal-absent and signal-present;
a decision threshold;
an ROC curve that sweeps the threshold, tracing true-positive rate (power, 1 − β) against false-positive rate (α);
a single scalar governing the curve — the standardized separation of the two distributions: ${d' = (μ_{present} − μ_{absent}) / σ}$

For the equal-variance Gaussian case, ${d'}$ summarizes the whole ROC. The area under the curve is

AUC = ${Φ(d' / √2)}$,

and, for a one-sided test detecting a mean shift ${d'}$ at significance ${α}$, the power is

${1 − β = Φ(d' − z_{1−α})}$.

So ${d'}$ is an amplitude SNR, and ${d'^2}$ is the corresponding power SNR (the deflection coefficient). The detection SNR is not the same number as the reconstruction SNR above: one measures how separable two hypotheses are, the other how clean the field estimate is. Raising the latter usually helps the former, but they are different quantities, and a design choice that improves one need not improve the other. So the convention is to report both, each in its own terms — reconstruction quality in signal-space metrics, detection performance as an ROC.
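The equal-variance Gaussian relations above are easy to evaluate with Python's standard library. The sketch below is only a calculator for those three formulas; the function names are illustrative, and nothing here applies once the two distributions have unequal variances or are not Gaussian.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit variance

def d_prime(mu_present, mu_absent, sigma):
    """Standardized separation of the signal-present and signal-absent distributions."""
    return (mu_present - mu_absent) / sigma

def auc_from_d_prime(d):
    """AUC = Phi(d' / sqrt(2)), valid for the equal-variance Gaussian case."""
    return STD_NORMAL.cdf(d / 2**0.5)

def power_from_d_prime(d, alpha):
    """Power of a one-sided test at significance alpha: Phi(d' - z_{1 - alpha})."""
    z = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(d - z)

for d in (0.0, 1.0, 2.0, 3.0):
    print(d, round(auc_from_d_prime(d), 3), round(power_from_d_prime(d, 0.05), 3))
```

At `d = 0` the two hypotheses are indistinguishable, so the AUC is 0.5 and the power equals the significance level; both quantities rise monotonically with `d`.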

The two classes in AI safety evals

Many eval outputs are estimates: “the model refuses 94% of harmful prompts”; “the model solves 31% of the tasks.” These are proportions that are, ideally, reported with a confidence interval, e.g. the Wald interval ${p̂ ± z·√(p̂(1−p̂)/n)}$ (alternatively, the Wilson interval for small ${n}$, for ${p̂}$ near 0 or 1, or if you simply don’t trust Wald²). Read as pure estimates of a proportion, they are well-posed.
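For concreteness, here is a small sketch of both intervals using only the standard library. The function names and the example counts (47 refusals out of 50 prompts, matching a 94% rate) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval, which stays inside [0, 1] up to rounding."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical eval: 47 of 50 harmful prompts refused.
print("Wald:  ", wald_interval(47, 50))
print("Wilson:", wilson_interval(47, 50))
```

With these counts the Wald upper bound lands above 1, which is one of the practical reasons to prefer Wilson when the proportion sits near a boundary or the sample is small.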

Other eval outputs are detections. Deception and sandbagging probes, backdoor and trojan detectors, and chain-of-thought or agent monitors are appropriately scored with ROC curves and AUC, at an operating point chosen from the cost of each kind of error. Those costs are often very asymmetric — a missed deceptive model is worse than a falsely flagged aligned one — so the threshold is set to keep ${β}$ small even at the cost of a larger ${α}$. That choice isn’t mathematically derived; it’s a values judgment which, in Recht’s terms, is an internal validity convention.
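One way to make the operating-point choice explicit is the textbook minimum-expected-cost rule. The sketch below assumes the equal-variance Gaussian model (signal-absent statistic with mean 0, signal-present with mean `d`, unit variance, `d > 0`); the prior and the two costs are hypothetical inputs, and they are exactly where the values judgment enters.

```python
from math import log
from statistics import NormalDist

def bayes_threshold(d, prior_present, cost_fn, cost_fp):
    """Threshold on the decision statistic that minimizes expected cost.

    Flag "present" when the likelihood ratio exceeds
    eta = cost_fp * (1 - prior) / (cost_fn * prior); for unit-variance
    Gaussians separated by d this becomes t > d/2 + ln(eta)/d.
    """
    eta = (cost_fp * (1 - prior_present)) / (cost_fn * prior_present)
    return d / 2 + log(eta) / d

def error_rates(d, threshold):
    """Return (alpha, beta): false-positive and false-negative rates at a threshold."""
    phi = NormalDist().cdf
    alpha = 1 - phi(threshold)   # absent case scores above the threshold
    beta = phi(threshold - d)    # present case scores below the threshold
    return alpha, beta

# Hypothetical monitor with d' = 2 and a 50/50 prior.
for cost_fn in (1.0, 10.0, 100.0):
    t = bayes_threshold(2.0, 0.5, cost_fn, cost_fp=1.0)
    print(cost_fn, round(t, 3), [round(r, 4) for r in error_rates(2.0, t)])
```

Raising the cost of a miss pushes the threshold down, trading a larger α for a smaller β. The detector itself (its `d`) is unchanged; only the declared costs move the operating point.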

When an estimation relies on an uncharacterized detector

Return to “refuses 94% of harmful prompts.” Whether each response counts as harmful, or as a refusal, is usually decided by a model — an LLM judge, a classifier, or a rubric grader. That grading step is itself a detection problem: the grader has a true-positive rate and a false-positive rate, and an implied ROC.

So the reported rate is not a direct estimate of the underlying rate; it is a mixture of the true rate and the grader’s error rates. If ${p}$ is the true rate and the grader has sensitivity TPR and false-positive rate FPR, the observed rate is

${p_{obs} = p · \mathrm{TPR} + (1 − p) · \mathrm{FPR}}$

which inverts to

${p = (p_{obs} − \mathrm{FPR}) / (\mathrm{TPR} − \mathrm{FPR})}$, for ${\mathrm{TPR} ≠ \mathrm{FPR}}$

(the Rogan–Gladen estimator, used in epidemiology to recover true prevalence from an imperfect test). Unless the grader is perfect (TPR = 1, FPR = 0), the reported result is generally biased: ${p_{obs} − p = (1 − p) · \mathrm{FPR} − p · (1 − \mathrm{TPR})}$, which is zero only when false positives and false negatives happen to cancel. The size of the bias depends on the grader’s operating point. With the grader’s TPR and FPR in hand, a reader can interpret the rate and, if needed, correct it; without them, the rate is conditional on an uncharacterized detector.
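A worked example makes the size of the effect concrete. The grader error rates below (TPR 0.90, FPR 0.10) are hypothetical, and clipping the corrected value to [0, 1] is a common practical convention rather than part of the formula.

```python
def observed_rate(p_true, tpr, fpr):
    """Rate a grader reports when the true rate is p_true."""
    return p_true * tpr + (1 - p_true) * fpr

def rogan_gladen(p_obs, tpr, fpr):
    """Invert the mixture to recover the true rate from the observed one."""
    if tpr == fpr:
        raise ValueError("grader is uninformative: TPR equals FPR")
    p = (p_obs - fpr) / (tpr - fpr)
    # Sampling noise can push the raw value outside [0, 1]; clip by convention.
    return min(1.0, max(0.0, p))

# Hypothetical: the model truly refuses 94% of harmful prompts.
p_obs = observed_rate(0.94, tpr=0.90, fpr=0.10)
print(p_obs)                            # what the grader would report
print(rogan_gladen(p_obs, 0.90, 0.10))  # corrected back to the true rate
```

Here a true 94% refusal rate would be reported as roughly 85%. A reader who sees only the headline number has no way to tell that apart from a real nine-point gap in safety behavior.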

The reality for evals is that model graders are frequently the only scalable option. This is where the imaging convention of characterizing the observer and reporting its ROC alongside the estimation result seems worth borrowing. Reporting a grader model’s detection error, or its agreement with reference labels, makes for a more robust estimate that doesn’t gloss over limitations attributable to the grader.

In Recht’s vocabulary, this is a construct validity point: the reported rate quietly combines an estimate and a detection, and naming which is which is what lets the number be read correctly.

Of course, I’m not the first to make this observation and there’s recent work on reducing the bias introduced by insufficiently characterized LLM detector-judges. Also, some safety benchmarks now do treat their grader as an instrument rather than an oracle, though generally disclosing detector error separately from headline results.

Recent work on characterizing LLM judges for evals

In Efficient Inference for Noisy LLM-as-a-Judge Evaluation (Chen et al., 2026), two classes of solution are compared: direct misclassification correction per Rogan–Gladen, and debiased estimates that calibrate the detector’s residuals on ‘gold-labeled’ data. The latter technique is based on prediction-powered inference (PPI), which was introduced by Angelopoulos et al. (2023) and extended to PPI++ and RePPI. Chen et al. use semiparametric efficiency theory to propose a third class based on efficient influence function (EIF) theory.

How to Correctly Report LLM-as-a-Judge Evaluations (Lee et al., 2025) uses Rogan–Gladen because it stays unbiased under distribution shift between the calibration set and the test set; matching confidence intervals are also reported. Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges (Feng et al., 2026) estimates the judge’s TPR and FPR from a human-labeled calibration set and propagates the estimation uncertainty into a variance-corrected decision threshold. It establishes a finite-sample bound on the probability of certifying an unsafe model, and threshold selection is motivated by keeping that costly error rare.

One tradeoff may be unavoidable, and it comes down to how closely the eval data resemble the gold-labeled data. Rogan–Gladen better tolerates a calibration set drawn from a different distribution than the eval set, but typically at the price of higher variance. PPI relies on the calibration and test residuals matching, and achieves lower variance when they do.
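The bias side of this tradeoff can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: the judge has fixed TPR 0.9 and FPR 0.2 on both sets, the only shift is in prevalence (true rate 0.5 in the calibration set, 0.9 in the eval set), and the PPI function is the basic mean estimator (judge mean on the eval set minus the judge's mean residual on the gold-labeled set), not PPI++, RePPI, or the EIF-based estimator.

```python
import numpy as np

def simulate_judge(rng, n, p_true, tpr, fpr):
    """Draw gold labels and binary judge verdicts with fixed TPR and FPR."""
    gold = (rng.random(n) < p_true).astype(int)
    u = rng.random(n)
    judge = np.where(gold == 1, u < tpr, u < fpr).astype(int)
    return gold, judge

def rogan_gladen_estimate(judge_eval, judge_cal, gold_cal):
    """Estimate TPR and FPR on the calibration set, then invert the mixture."""
    tpr_hat = judge_cal[gold_cal == 1].mean()
    fpr_hat = judge_cal[gold_cal == 0].mean()
    return (judge_eval.mean() - fpr_hat) / (tpr_hat - fpr_hat)

def ppi_mean_estimate(judge_eval, judge_cal, gold_cal):
    """Basic PPI mean: judge mean on eval data minus the calibration residual."""
    rectifier = (judge_cal - gold_cal).mean()
    return judge_eval.mean() - rectifier

rng = np.random.default_rng(0)
gold_cal, judge_cal = simulate_judge(rng, 20_000, p_true=0.5, tpr=0.9, fpr=0.2)
_, judge_eval = simulate_judge(rng, 200_000, p_true=0.9, tpr=0.9, fpr=0.2)

rg = rogan_gladen_estimate(judge_eval, judge_cal, gold_cal)
ppi = ppi_mean_estimate(judge_eval, judge_cal, gold_cal)
print(f"true rate 0.90 | Rogan-Gladen {rg:.3f} | basic PPI {ppi:.3f}")
```

In expectation, the judge reports 0.83 on the eval set and its mean residual on the calibration set is 0.05, so the basic PPI estimate sits near 0.78 while Rogan–Gladen recovers 0.90. The sketch says nothing about variance, which is where PPI is expected to do better when the two sets do match.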

Published evals with characterized graders

In terms of published evals, HarmBench and JailbreakBench are two examples that characterize the grader. HarmBench fine-tunes its own attack-success classifier and reports its agreement with human labels on a held-out, manually annotated validation set. Depending on behavior type and classifier variant, agreement ranges from the mid-80s to the low-90s percent. Agreement is reported for a single operating point, not a swept curve. JailbreakBench takes grader characterization a bit further, reporting agreement, false-positive rate, and false-negative rate for six candidate classifiers, scored against the majority vote of three expert annotators. Classifier selection targets a relatively low false-positive rate. The tradeoff is accepting a measured success rate that may be depressed in exchange for not mislabeling benign behavior as a jailbreak; effectively, an operating-point decision. Still, and notably, that error isn’t propagated into a corrected rate or widened interval on the reported metric.
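The kind of single-operating-point characterization described above amounts to a confusion matrix against reference labels. The sketch below is a generic illustration, not either benchmark's actual code; the function names and the toy labels are hypothetical, and it assumes binary labels, an odd number of annotators, and at least one positive and one negative reference item.

```python
from collections import Counter

def majority_vote(annotations):
    """Most common label among one item's annotators (odd count assumed)."""
    return Counter(annotations).most_common(1)[0][0]

def grader_report(grader_labels, reference_labels):
    """Agreement, false-positive rate, and false-negative rate vs. reference labels."""
    pairs = list(zip(grader_labels, reference_labels))
    tp = sum(1 for g, r in pairs if g == 1 and r == 1)
    fp = sum(1 for g, r in pairs if g == 1 and r == 0)
    tn = sum(1 for g, r in pairs if g == 0 and r == 0)
    fn = sum(1 for g, r in pairs if g == 0 and r == 1)
    return {
        "agreement": (tp + tn) / len(pairs),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }

# Toy data: three annotators per item, 1 = jailbreak succeeded.
annotations = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
reference = [majority_vote(a) for a in annotations]
print(grader_report([1, 1, 1, 0], reference))
```

Agreement alone hides the split between the two error types, which is why reporting FPR and FNR separately is the more useful characterization: those are the two numbers a Rogan–Gladen style correction needs (with TPR = 1 − FNR).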

An important caveat is that even a well-characterized grader can be fragile. A judge’s false-positive and false-negative rates can move substantially under benign changes in response formatting, such as bullet points or narrative tone, as shown in, for example, Know Thy Judge and Safer or Luckier?

It’s good news that the frontier has moved from treating the grader as an oracle to treating it as a characterized instrument, yet it’s still important to watch for the raw observed rate ${p_{obs}}$ solo-headlining while the grader’s error is reported separately. Ideally, the grader’s curve would be swept into a reported ROC, and any point estimate inverted onto the true rate.
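Sweeping a grader's curve is mechanically simple when the grader exposes a continuous score and a set of reference labels is available. The sketch below is a minimal empirical ROC, with hypothetical function names and toy data; it assumes binary reference labels with at least one positive and one negative item.

```python
import numpy as np

def roc_points(scores, labels):
    """Empirical ROC: flag an item when score >= threshold, sweep the threshold down."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Start at +inf (nothing flagged), then visit each distinct score, high to low.
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tpr = np.array([((scores >= t) & (labels == 1)).sum() / n_pos for t in thresholds])
    fpr = np.array([((scores >= t) & (labels == 0)).sum() / n_neg for t in thresholds])
    return fpr, tpr, thresholds

def auc_trapezoid(fpr, tpr):
    """Area under the empirical ROC by the trapezoid rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Toy grader scores against reference labels (1 = harmful).
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr, thresholds = roc_points(scores, labels)
print(list(zip(thresholds, fpr, tpr)))
print(auc_trapezoid(fpr, tpr))
```

Each row of the printed sweep is a candidate operating point, so the (TPR, FPR) pair at the threshold actually used for the headline number can be read off directly and fed into the correction for ${p_{obs}}$.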

¹ Review article on optoacoustics in ultrasound: Looking at sound: optoacoustics with all-optical ultrasound detection, Georg Wissmeyer, Miguel A. Pleitez, Amir Rosenthal & Vasilis Ntziachristos (2018).
² I was alerted to issues with Wald by this detailed but concise Math Stack Exchange comment, which is easier to start with than the paper linked in the body; the paper covers much more than the Wald question.