ADR-047: Verifier Calibration & Judge Reliability¶
Status: Implemented 2026-07-03 (epic #417, v0.31.0) Date: 2026-07-03 Decision Makers: Chris Joseph, LLM Council Council Review: 2026-07-03 (4 models, balanced) — feedback incorporated: screening thresholds defined, blocking-path invariant operationalized, DoD covers P3/P4, reference consistency Related: ADR-036 (Phase 2 calibration report — this implements that slice), ADR-034 (verify), ADR-015/017/018 (bias & randomization), ADR-022 (tiers), #397/#403 (error surfacing)
Context¶
Operational evidence from driving five release epics through the verify gate (2026-07-01→03):
- Confidence is miscalibrated. On real code with zero blocking issues, confidence pins at 0.49–0.6 — below the 0.7 PASS threshold — producing multi-round "asymptotes" where each re-verify surfaces finer or inverting nits instead of converging (documented across PRs #364/#387/#395…). PASS is effectively unreachable for scrutinized prose/code even when reviewers find nothing blocking.
- UNCLEAR conflates three unlike things: infra failure (chairman call errored — billing outage, #397), genuine low confidence, and global timeout. Callers (CI gates, epic-loop) need to treat these differently; today they all read as exit code 2.
- Cost asymmetry: every gate spends a full council (~4 models × 3 stages) even for changes a cheap screen could pass/flag in seconds.
2026 judge research matches: multi-agent judges can amplify rather than
cancel bias; lightweight sub-second screening judges are production practice;
confidence signals require calibration against observed outcomes. We own the
calibration corpus: every verify run persists a full transcript + verdict to
.council/logs.
Decision¶
Phase 1 — UNCLEAR disambiguation (semantics first)¶
Split UNCLEAR into machine-readable causes on VerifyResponse:
unclear_reason ∈ {infra_failure, low_confidence, timeout} (infra detection
uses the #403 error_status; timeout uses timeout_fired). Exit code stays 2
(compat); the reason field lets automation apply distinct policies (retry
infra, accept-and-audit low-confidence, re-tier timeout). Loop/skill docs
updated to consume it.
Phase 2 — Confidence calibration from the transcript corpus¶
Build a calibration analysis over .council/logs (verdict, confidence, rubric
scores, blocking count vs. eventual disposition where recorded): fit a simple
monotonic recalibration mapping (e.g. isotonic/binned) and surface BOTH raw
and calibrated confidence on responses. The PASS rule may then use calibrated
confidence — behind a flag, default off, until the mapping is validated
(ADR-036 P2's cross-model calibration report is a by-product).
Phase 3 — Lightweight screening judge¶
An opt-in pre-gate: a single quick-tier model scores the change against the
rubric in seconds; unambiguous outcomes short-circuit to a cheap
PASS-with-audit-note, everything else proceeds to the full council unchanged.
Concrete gates (council feedback; initial values, env-tunable): a change
may be screen-passed only if ALL hold — screen rubric score ≥ 9/10 on every
dimension; diff < 5,000 chars; no changed file matches a risk glob
(**/auth*, **/security*, **/crypto*, **/*payment*); and the request
is not blocking-capable. Blocking-capable (operationalized): any request
carrying evidence items with strength=blocking, or rubric_focus in
{security}, is NEVER screened — full council always runs. Default OFF;
shadow-first: every screen decision is logged with its score so the screen's
own precision is measured before trusting it (like ADR-044 P2).
Phase 4 — Bias-amplification check (report only)¶
Extend the ADR-015/018 bias pipeline with a reviewer-agreement decomposition per ADR-036 P2 (do reviewers converge because the work is good, or because of shared bias/position effects?). Report-only; no gating.
Consequences¶
Positive: the flagship CI-gate feature becomes trustworthy (distinct UNCLEAR causes; confidence that means something); screening cuts gate cost for easy changes; aligns with current judge-reliability research; the asymptote pattern gets a principled fix instead of a per-session cap heuristic.
Negative / risks: calibration data is observational and modest-N (mitigation: publish CIs, flag-gated adoption, keep raw confidence); a screening judge is itself a judge (mitigation: shadow-first with logged decisions; never screens BLOCKING-capable paths silently); changing PASS semantics affects automation (mitigation: compat exit codes + additive fields only).
Definition of Done (per phase — council feedback: all phases covered)¶
- P1:
unclear_reasonon every UNCLEAR response + tests for all three causes; verify tool description documents it. - P2: calibration analysis reproducible from
.council/logs; raw AND calibrated confidence on responses; flag-off behaviour unchanged (test). - P3: screen decisions logged with scores; blocking-capable invariant enforced by test; flag-off byte-identical.
- P4: report renders from existing bias store; no gating side-effects (test).
- All phases: README verify section, CLAUDE.md, CHANGELOG; epic-loop guidance updated to consume new fields.
References¶
docs/roadmap-2026-h2.mditem 4 (sources: agent-as-judge survey, bias amplification, lightweight judges); memory: verify asymptote + #397 infra misdiagnosis; ADR-036 §Phase 2.