ADR-051: Verify Findings Channel & Verdict–Evidence Consistency¶
Status: Implemented 2026-07-05 (epic #484, C1–C6; C1–C3 in v0.36.0, C4–C6 in v0.37.0. Behind LLM_COUNCIL_STRUCTURED_FINDINGS, default-OFF; the default-ON flip is a separate future breaking release.) — proposed 2026-07-04 (rev 2: literature claims verified via adversarial deep-research, 105 agents, 3-vote, checked 2026-07-04, with the anchor paper's numbers independently confirmed against the primary arXiv source. See "Research verification".)
Date: 2026-07-04
Decision Makers: llm-council maintainers (review requested)
Proposed by: maintainer triage of a downstream field report (amiable-dev/epic-loop)
Relates to: ADR-025b (Jury Mode / binary verdict), ADR-034 (verification), ADR-042 (evidence injection), ADR-047 (verifier calibration & unclear taxonomy), ADR-016 (rubric scoring)
Tracking: #482
Implementation spec: ADR-051-implementation-spec.md (enforcement mechanism, response schema, flagged migration, doc checklist, child breakdown)
Context¶
verify()'s primary machine-readable contract is its verdict plus a
blocking_issues list — automation (CI gates, epic-loop's green-chase cap)
routes on the blocking count. Field data from a downstream consumer
(amiable-dev/epic-loop: 27 calls / 25 completed across two tiers, every
verification_id logged in docs/assessments/council-verify-stats.md) plus
an independent second corpus (this repo's ADR-049/050 delivery epics, same
council config) show a correctness defect:
blocking_issuesis[]on effectively every call — pass, fail, and unclear alike — including FAILs whose own rationale names findings the council calls "critical".
Root cause — verdict and findings are decoupled by construction¶
The two fields come from different, unlinked sources:
- Verdict — the structured ADR-025b BINARY
VerdictResult(verification/verdict_extractor.py::_verdict_from_structured), authoritative since #355. A clean go/no-go + confidence. blocking_issues— a regex scrape of the chairman's prose (extract_blocking_issues), matching only severity tokens anchored at line start:^\s*[-*]?\s*\**(CRITICAL|MAJOR|MINOR)\**:\s+….
Modern chairmen (gemini-3.1-pro, opus-4.8) write findings as prose, which the
strict regex never matches, so blocking_issues == [] even when the verdict
is fail. There is no structured findings[] anywhere in the pipeline,
and blocking-strength evidence (ADR-042) does not feed blocking_issues
(it drives screening/budget only). This is the classic LLM-judge
verdict–evidence decoupling pathology: a rejection with no machine-readable
justification.
#355 backstory (why it swung to never-fires): the regex was originally
loose ((CRITICAL|MAJOR|MINOR)[:\s]+ anywhere), which fabricated blocking
issues from approval prose ("the critical issues have been resolved"). #355
tightened it to strict line-anchored markers — trading false positives for
universal false negatives. A third regex is not the fix; a structured
channel is.
Secondary observations from the same corpus¶
- UNCLEAR discards the inner verdict.
verdict_extractor.pysoftenspass→unclearwhen the calibrated confidence is below threshold, but drops the fact that the structured verdict was e.g. "approved @ 0.85". Theunclear_reasontaxonomy (ADR-047 P1) exists; the inner verdict/confidence do not surface. completenessdoes not discriminate. Scores sit in a narrow 5.3–6.6 band across trivial and complex changes (or fall to the syntheticmean_score * 0.9estimate), yet carry 0.20 rubric weight — a Q&A-oriented axis acting as a constant drag on code-review verdicts.- Balanced tier churns; high tier converges. Balanced-tier FAILs surfaced fresh marginal findings each round instead of confirming fixes (7/9 failed, one arithmetically false, one re-litigated). High tier (+ a factual informational evidence item) produced substantively real findings every time and twice gave explicit fix-acknowledgement. Consistent with the overconfidence / calibration literature (arXiv 2508.06225; see Research verification).
- ADR-042 evidence firewall: flag-then-penalize. The council correctly firewalled imperative language inside an informational evidence item, then cited that flagged "steering" sentence in the rejection rationale — counting it against the submission rather than flag-and-ignore.
- Structured non-verdicts work.
input_too_largeandunclear_reason=low_confidencewere clean, routable outcomes — keep them.
Research verification (rev 2)¶
The literature scaffolding was checked by adversarial deep-research (105 agents, 3-vote) against primary sources; the load-bearing statistic was then independently confirmed by the maintainer against the arXiv HTML.
| Claim | Status | Evidence |
|---|---|---|
| The cited "86.3% acc / 6.4% ECE vs 77.4%" figures are real | VERIFIED | arXiv 2508.06225 (Tian et al., Overconfidence in LLM-as-a-Judge): LLM-as-a-Fuser = 86.29% / 6.42% ECE vs single-judge Self-Confidence baseline 77.43% / 11.78% on JudgeBench (Tables 1 & 4, confirmed against the v3 HTML). Majority-vote / confidence-weighted-vote = 80.00%. |
| Method name "critique-fusion" | CORRECTED | The method is LLM-as-a-Fuser (fuses decisions and rationales; ships a TH-Score calibration metric). The ADR's "critique-fusion" label was a misremember; the decoy papers 2508.16889 (ObjexMT) and 2601.05420 (single-judge debiasing) report neither the method nor these numbers. |
| Verdict–evidence decoupling / post-hoc rationalization is an established LLM-judge pathology | VERIFIED | Multi-paper finding: superficial cues drive verdicts the rationale never mentions; judges trust asserted reasoning over observable evidence. Directly supports Parts 1–3. |
| Structured rationale alone fixes it | REFUTED (important nuance) | Free-text/JSON rationale does not close the gap; a protocol that locks cited evidence BEFORE scoring ("Proof-Before-Preference") substantially reduces rationalization. ⇒ Part 1 must emit findings before the verdict, not alongside it. |
| LLM judges are overconfident / miscalibrated; reasoning models are better-calibrated judges | VERIFIED | Overconfidence documented (2508.06225 + others); reasoning/extended-CoT models strictly better calibrated in 33/36 model×dataset settings — corroborates the observed "high tier converges, balanced tier churns". |
| Critique/evidence fusion is necessary to beat independent tallying | MIXED — do not overclaim | Some fusion aggregators beat tallying, but PoLL ("Replacing Judges with Juries", arXiv 2404.18796) shows a diverse panel with simple independent voting already beats a single strong judge, ~7× cheaper. llm-council already does diverse-panel tallying. ⇒ Part 5 is "does the Fuser beat our existing panel", not "adopt fusion". |
Council review (rev 2, high tier)¶
The Council endorsed the core diagnosis and the structured-findings fix, and
raised amendments, all folded in above:
1. Proof-Before-Preference needs enforcement, not a prompt → Part 1 → the
fork review (spec §1) landed on a mechanical gate: LLM emits findings,
host code computes the verdict.
2. Consistency guard → the mechanical gate makes pass+critical /
fail-without-critical structurally impossible, so Part 2 downgrades to a
defensive invariant assertion + severity-calibration telemetry.
3. inner_verdict is a footgun as a top-level field → Part 3 nests it
under diagnostics, telemetry-only.
4. P4 (completeness) is orthogonal → deferred to its own follow-up.
5. P5 must be bounded → pre-registered accept thresholds.
6. Breaking change → MAJOR bump / feature flag, not a CHANGELOG line.
One council claim was rejected on the code: blocking_issues is not
List[str] (no type-crash risk) — it is already List[BlockingIssueResponse],
the exact object shape findings[] produces.
Decision (proposed)¶
Five parts, ordered by leverage. Parts 1–3 are the core structural fix; 4 is deferred to a follow-up; 5 is a bounded spike.
1. Structured findings channel (P0)¶
Have the chairman emit findings as structured data, not prose to be scraped. Extend the ADR-025b verdict output (or add a sibling structured field the chairman is prompted to fill) with:
findings: [ { severity: critical|major|minor,
description: str,
location: str | null, # file:line where derivable
dimension: str | null } ] # which rubric axis it maps to
Populate blocking_issues from findings filtered to
severity == critical (and any ADR-042 blocking-evidence dispositions),
not from the prose regex. Non-critical findings (major/minor/info)
stay in findings[] (not dropped) but do not gate. No type break (Council
rev-2): blocking_issues is already List[BlockingIssueResponse]
({severity, description, location: Optional[str]}, schemas.py), the exact
shape findings[] produces — this is object→object, and location already
permits null (a holistic finding with no line). Keep extract_blocking_issues
only as a fallback for models that don't return structured findings, and set
findings_source: structured|fallback + fallback_reason on the response so
consumers see when the channel degraded.
Findings precede the verdict — enforced by a mechanical gate, not a
prompt (Council rev-2 + fork review). Research verification refuted the
weaker version: a structured rationale emitted alongside the verdict does not
fix decoupling. The fork review (implementation spec §1) then rejected a second
LLM verdict call too — it re-judges and can approve over its own critical
finding. The chosen mechanism: the chairman (LLM) emits severity-tagged
findings[]; the verdict is computed by deterministic host code
(verdict = policy(findings), v1: any critical ⇒ fail). The verdict is a
provable function of the findings, single-hop, and impossible to decouple —
see ADR-051-implementation-spec.md §1.
2. Verdict–evidence consistency guard¶
Under the mechanical gate (Part 1) the verdict is policy(findings), so the
two inconsistencies this guard targeted — fail with empty findings, and
pass with a critical finding — are structurally impossible. The guard
therefore downgrades to a defensive invariant assertion: if it ever fires,
that is a bug in the gate policy, not model behavior, and it is logged as
verdict_evidence_mismatch. The residual risk moves to severity
mis-labelling (a real critical tagged major), surfaced by
findings-count/severity telemetry rather than a verdict-vs-findings check.
A zero-finding FAIL is legitimate and must survive (Council rev-2): a
holistic rejection ("fundamentally wrong approach") may have no line-level
finding. The marker is observability, never a forced flip. If a bounded re-run
is enabled behind a flag it must be non-coercive — it may ask the model to
localize its existing failure (fill findings), never to reconsider the
verdict. Coercing localization risks fabricated findings or a false pass
just to satisfy the parser.
3. Surface the inner verdict on UNCLEAR¶
When a structured "approved @ c" is softened to unclear because
c < threshold, carry inner_verdict and inner_confidence (and the
calibrated value) so automation can distinguish "approved but under threshold"
from "genuinely undecided" without parsing prose. Nest these under a
diagnostics: {} object (Council rev-2) and document them as telemetry-only,
so consumers don't parse inner_verdict to bypass the low-confidence safety
gate — the softening to unclear is the contract; the inner state is for
threshold tuning and observability, not control flow.
4. Recalibrate completeness — DEFERRED to a follow-up (Council rev-2)¶
completeness (0.20 weight) does not discriminate for code review — drop its
weight in the code-review rubric profile or redefine it as a code-relevant axis
("tests/edge-cases covered"). But this is a scoring-heuristic tweak,
orthogonal to the evidence-plumbing defect (it lives in the stage-2 rubric
path, verdict_extractor.py:135, not the blocking_issues regex), so it is
deferred out of this ADR to keep the P0/P1 structural fix unblocked.
Re-measure completeness after P0/P1 lands to confirm the flatness isn't
entangled with the findings channel, then reweight in its own change.
5. (Spike) evaluate a fuser aggregator against our existing panel¶
Evaluate whether an LLM-as-a-Fuser aggregator (one strong model fusing the panel's decisions and rationales — Tian et al., arXiv 2508.06225: 86.29% acc / 6.42% ECE vs 77.43% for the best single judge on JudgeBench) beats llm-council's current diverse-panel independent tallying.
Framed carefully, because the research verification found the "fusion is necessary" premise is mixed: PoLL ("Replacing Judges with Juries", arXiv 2404.18796) shows a diverse panel with simple independent voting already beats a single strong judge ~7× cheaper — and that is essentially what llm-council already does. So the question is not "single judge → fusion" (a straw man; we don't use a single judge) but "existing panel-tallying → fusion: does it add enough calibration/accuracy to justify the extra synthesis cost?" A research spike, not a committed change — and pre-registered to keep it bounded (Council rev-2): before running it, fix the thresholds that would justify adopting a Fuser — a minimum accuracy/calibration delta over the existing panel, a maximum added latency, and a maximum cost multiple. If the spike doesn't clear all three, the incumbent panel-tallying stays. Note the 2508.06225 baseline is Self-Confidence (a per-judge confidence method), not self-consistency.
Consequences¶
Positive. Gates key on structured findings instead of prose regex, so
blocking_issues reflects reality; the verdict–evidence mismatch becomes
observable; UNCLEAR carries the inner verdict for cleaner routing; the rubric
stops dragging on a non-signal. Directly unblocks consumers whose automation
keys on blocking count (the reported downstream pain).
Negative / cost. The chairman prompt gains a structured-output
requirement (a compatibility surface across models; needs the same
JSON-won't-parse fallback the rubric already has). blocking_issues
semantics change is a breaking change under Hyrum's Law (Council rev-2):
consumers built gating logic against always-[], and populating it on FAIL
wakes that dormant logic. It warrants a MAJOR version bump (or a transition
feature flag defaulting to the new behavior with an opt-out) — not just a
CHANGELOG line. Parts 1–3 touch the verdict-extraction hot path and need golden
tests against both the structured and fallback paths; the mechanical gate keeps
this to a single chairman hop (no extra generation call — the verdict is
host code).
Neutral. input_too_large / unclear_reason structured outcomes are
unchanged (keep). No new external dependency.
Compliance / Validation¶
- Unit: chairman returns structured
findings⇒blocking_issues= critical-severity findings; a FAIL with findings never yieldsblocking_issues: []. - Unit: prose-only chairman (no structured findings) ⇒ fallback regex path,
response flags
findings_source: fallback. - Unit: the verdict policy is a pure function of
findings— anycritical⇒fail, none ⇒pass;pass-with-critical /fail-without-critical are unreachable (theverdict_evidence_mismatchassertion never fires in normal operation) present. - Unit: softened UNCLEAR carries
inner_verdict/inner_confidence. - Regression: the #355 corpus (approval prose like "critical issues resolved") must NOT fabricate blocking issues under the structured or fallback path.
- Corpus replay: re-run the epic-loop 25-call log (verification_ids in
council-verify-stats.md) and assert FAILs now carry non-empty findings.