ADR-051 Implementation Spec — Verify Findings Channel
Companion to: ADR-051 (Proposed) · Tracking: #482
Status: Draft spec 2026-07-05 · resolves the ADR's open implementation forks before /adr-epic.
This spec pins the how (the ADR pins the what/why): the enforcement
mechanism, the concrete response schema, the flagged migration, the exhaustive
documentation surface + a new drift guard, and the child breakdown. It is
written to be deep-read by /adr-epic.
1. Enforcement mechanism — two-phase generation (decided)
Context. ADR-051 Part 1 requires the chairman to enumerate findings[]
before committing its verdict ("Proof-Before-Preference"), because a rationale
emitted alongside the verdict does not fix verdict–evidence decoupling. Today
there is no structured-output plumbing: verdict.py:parse_binary_verdict
regex-parses the chairman's prose. So every option below is net-new.
Decision: mechanical gate — LLM findings, deterministic verdict. (Revised after the Council fork review, 2026-07-05, high tier — it challenged plain two-phase and 3/4 models converged on this hybrid.)
- Phase 1 (LLM): ONE chairman call emits severity-tagged findings[] (severity, description, cited location) and the human-readable synthesis prose. No verdict is requested.
- Phase 2 (deterministic host code): the verdict is computed from the findings, not generated — v1 policy: any severity == "critical" finding (or an ADR-042 blocking-evidence disposition) ⇒ fail; otherwise pass. The policy is explicit, auditable code (tunable later, e.g. "N majors ⇒ fail"). Confidence continues to come from the existing deliberation/agreement signal (calculate_confidence_from_agreement), independent of the verdict.
The one decisive reason: the verdict is a provable function of the
findings — it is literally verdict = policy(findings) in code, the strongest
possible form of "findings-first." A generated Phase-2 verdict is only
hopefully causal; the Council showed a second LLM call that sees the code
re-judges freshly and can emit ACCEPT over its own CRITICAL finding (the
"Yes-Man contradiction").
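As a minimal sketch of what "verdict = policy(findings)" could look like in host code, the block below computes the v1 verdict with no model call. It is illustrative only: the dataclass stands in for the spec's Pydantic model, and the `blocking_evidence` parameter is a hypothetical boolean representing an ADR-042 blocking-evidence disposition, whose real shape is not defined in this spec.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional

Severity = Literal["critical", "major", "minor", "info"]
Verdict = Literal["pass", "fail"]


@dataclass(frozen=True)
class Finding:
    # Dataclass stand-in for the spec's Pydantic Finding model.
    severity: Severity
    description: str
    location: Optional[str] = None


def policy(findings: List[Finding], blocking_evidence: bool = False) -> Verdict:
    """v1 gate: any critical finding, or a blocking-evidence disposition, fails."""
    # blocking_evidence is a hypothetical flag for an ADR-042 disposition.
    if blocking_evidence:
        return "fail"
    if any(f.severity == "critical" for f in findings):
        return "fail"
    return "pass"


findings = [
    Finding("major", "retry loop has no backoff", "client.py:88"),
    Finding("critical", "token written to logs", "auth.py:42"),
]
print(policy(findings))      # fail
print(policy(findings[:1]))  # pass
```

Because the function reads only severities, the synthesis prose cannot influence the result, which is the property the decision relies on.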
| Option | Verdict genuinely = f(findings)? | LLM hops | Failure mode left open |
|---|---|---|---|
| Mechanical gate (chosen) | Yes — computed in code | 1 (same as today) | Severity mis-labelling by the model (a real "critical" tagged "major" won't fail) — localized, auditable, tunable via the rubric + policy |
| Two-phase (both LLM) | No — Phase 2 can re-judge / Yes-Man contradict | 2 (extra hop, waterfall-budget risk) | Contradiction + latency; needs info-starvation + a code guard anyway → collapses toward the mechanical gate |
| Constrained decoding | No — JSON field order ≠ reasoning order | 1 | Rationalization survives; net-new per-provider adapters |
| Single free-form + parse | No | 1 | Weakest; ≈ today's prose fragility |
Why this beats plain two-phase (Council): it is single-hop (no waterfall-budget penalty, which was Gemini's objection), makes the Yes-Man contradiction structurally impossible (so the Part-2 guard becomes a defensive invariant, see below), and makes soft-fail safe (a deterministic verdict from stranded findings, never an untrusted LLM inference). It also lands the ADR's root fix maximally: the verdict is no longer decoupled from the evidence; it is derived from it.
Graceful degradation (soft-fail, ADR-011/024). If the Phase-1 findings
emission fails or won't parse (or the flag is off), fall back to the legacy
single synthesis + parse_binary_verdict + prose-regex path and mark
findings_source: fallback, fallback_reason: <cause>. The verdict computation
(Phase 2) is pure code and never fails. Verify never crashes on a bad model
output.
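The soft-fail path might be sketched as below. This assumes the Phase-1 output arrives as a JSON object with a top-level "findings" list, which the spec does not fix; the function names and the `fallback_reason` strings are hypothetical. On a fallback result the caller would continue into the legacy parse_binary_verdict path, which is not shown.

```python
import json
from typing import Any, Dict, List, Tuple

VALID_SEVERITIES = {"critical", "major", "minor", "info"}


def parse_findings(raw: str) -> List[Dict[str, Any]]:
    """Parse Phase-1 output; raise ValueError on anything malformed."""
    payload = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(payload, dict) or not isinstance(payload.get("findings"), list):
        raise ValueError("missing findings[] list")
    for item in payload["findings"]:
        if not isinstance(item, dict) or item.get("severity") not in VALID_SEVERITIES:
            raise ValueError("finding with unknown severity")
        if not isinstance(item.get("description"), str):
            raise ValueError("finding without description")
    return payload["findings"]


def collect_findings(raw: str, flag_on: bool) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    """Return (findings, diagnostics); bad model output degrades, never raises."""
    if not flag_on:
        return [], {"findings_source": "fallback", "fallback_reason": "flag_off"}
    try:
        findings = parse_findings(raw)
    except ValueError as exc:
        return [], {"findings_source": "fallback", "fallback_reason": f"parse_error: {exc}"}
    return findings, {"findings_source": "structured", "fallback_reason": None}
```

Keeping the parse step strict and the wrapper lenient means a half-valid payload is treated as a fallback rather than as a partial set of structured findings.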
Constrained decoding is explicitly deferred to a future per-chairman-model optimization for how Phase 1 emits findings (e.g. json_schema / responseSchema / tool-use) — a robustness upgrade to parsing, never the verdict mechanism.
The Council validation ran 2026-07-05, high tier (OpenRouter billing restored). It did not rubber-stamp two-phase — the mechanical-gate pivot is its convergent recommendation, folded in here.
2. Response schema (concrete)
New Pydantic models in verification/schemas.py (and mirrored in types.py):

    from typing import Literal, Optional

    from pydantic import BaseModel


    class Finding(BaseModel):
        severity: Literal["critical", "major", "minor", "info"]
        description: str
        location: Optional[str] = None   # "file.py:42" or "global"/None for holistic
        dimension: Optional[str] = None  # which rubric axis, when derivable


    class VerifyDiagnostics(BaseModel):  # telemetry-only; NOT control flow
        inner_verdict: Optional[str] = None  # "approved"/"rejected" pre-softening
        inner_confidence: Optional[float] = None
        inner_confidence_calibrated: Optional[float] = None
        verdict_evidence_mismatch: Optional[str] = None  # invariant assertion; should never fire
        findings_source: Literal["structured", "fallback"] = "fallback"
        fallback_reason: Optional[str] = None
        verdict_source: Literal["mechanical", "legacy"] = "legacy"  # mechanical = policy(findings)

VerifyResponse gains:
- findings: List[Finding] — the full structured list (all severities).
- diagnostics: VerifyDiagnostics — nested, telemetry-only.
Verdict is derived, not parsed. On the structured path the verdict is
verdict_source: mechanical = policy(findings) (v1: any critical ⇒ fail,
else pass); confidence stays from the deliberation/agreement signal. On the
fallback path it is legacy (parse_binary_verdict + prose regex).
blocking_issues is derived, unchanged in type
(List[BlockingIssueResponse] — already {severity, description, location}, so
no type break): blocking_issues = [f for f in findings if f.severity ==
"critical"] plus any ADR-042 blocking-evidence dispositions. Non-critical
findings live only in findings[].
Invariants (now structural, not hoped-for). Because the verdict is
policy(findings), fail-with-no-critical and pass-with-critical are
impossible by construction on the mechanical path — the Part-2
verdict_evidence_mismatch marker is a defensive assertion that should never
fire (if it does, it's a code bug in the gate policy, and it's logged). Tests:
fail ⇒ blocking_issues non-empty; findings[] ⊇ blocking_issues; the policy
is a pure function (property test over synthetic findings).
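The invariant tests listed above could be approximated with a seeded random sweep, as in the sketch below. It uses only the standard library; a property-testing library such as Hypothesis would be the more usual choice in a real suite. The helper names are hypothetical, and ADR-042 blocking-evidence dispositions are left out for brevity, so fail and "critical present" coincide exactly here.

```python
import random
from dataclasses import dataclass
from typing import List

SEVERITIES = ["critical", "major", "minor", "info"]


@dataclass(frozen=True)
class Finding:
    # Dataclass stand-in for the spec's Pydantic Finding model.
    severity: str
    description: str


def policy(findings: List[Finding]) -> str:
    return "fail" if any(f.severity == "critical" for f in findings) else "pass"


def derive_blocking_issues(findings: List[Finding]) -> List[Finding]:
    return [f for f in findings if f.severity == "critical"]


def check_invariants(trials: int = 500, seed: int = 0) -> int:
    """Run the structural invariants over synthetic findings lists."""
    rng = random.Random(seed)
    for _ in range(trials):
        findings = [
            Finding(rng.choice(SEVERITIES), f"synthetic finding {i}")
            for i in range(rng.randint(0, 6))
        ]
        verdict = policy(findings)
        blocking = derive_blocking_issues(findings)
        assert verdict == policy(list(findings))         # deterministic
        assert (verdict == "fail") == bool(blocking)     # fail iff a critical exists
        assert all(b in findings for b in blocking)      # findings is a superset
    return trials
```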
3. Migration & versioning — flagged, non-breaking epic; deliberate flip
The blast radius is a breaking contract change (blocking_issues:
always-[] → populated on FAIL; Hyrum's Law — epic-loop keys its green-chase
cap on the count). De-risked as a two-step:
- Epic ships behind LLM_COUNCIL_STRUCTURED_FINDINGS, default OFF. Flag off ⇒ byte-identical to today (legacy path, findings: [], blocking_issues via regex). This whole epic is therefore a non-breaking, opt-in minor — consumers (epic-loop) flip it on, migrate their gate logic off "always-empty", and validate.
- A separate, deliberate flip to default-ON is the breaking release — MAJOR bump (or a clearly-### BREAKING minor for a 0.x line) with a migration note. Not bundled into the build epic.
New env var (documented in docs/reference/environment-variables.md, enforced
by the drift guard): LLM_COUNCIL_STRUCTURED_FINDINGS (default false in the
epic; the flip changes the default, not the code).
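One way the flag could be read is sketched below. The accepted truthy spellings and the function name are assumptions, not something this spec defines. The default is a parameter so that the later flip changes a single value rather than the parsing logic.

```python
import os
from typing import Mapping, Optional

FLAG_NAME = "LLM_COUNCIL_STRUCTURED_FINDINGS"
_TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings, not from the spec


def structured_findings_enabled(
    env: Optional[Mapping[str, str]] = None, default: bool = False
) -> bool:
    """Return True when the structured-findings path should run."""
    source = os.environ if env is None else env
    raw = source.get(FLAG_NAME)
    if raw is None or raw.strip() == "":
        return default  # unset or blank: fall back to the shipped default
    return raw.strip().lower() in _TRUTHY


print(structured_findings_enabled({}))                   # False
print(structured_findings_enabled({FLAG_NAME: "true"}))  # True
```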
4. Documentation surface (the checklist — DoD, not afterthought)
Every child that changes the contract updates its slice; the consolidated docs child (C6) closes the list. All of these reference the verify contract and MUST be reconciled:
- docs/guides/verify.md — findings/diagnostics fields, findings_source, the consistency-guard marker, the flag, exit-code semantics unchanged.
- docs/api.md — POST /v1/council/verify response schema (new fields).
- docs/guides/mcp.md — the verify MCP tool output fields.
- docs/guides/skills.md, docs/blog/12-cicd-quality-gates.md — gate examples.
- Bundled skills (must stay in sync with the shipped tool, sync-tested): council-verify/SKILL.md + references/{rubrics.md, unclear-routing.md}; council-gate/SKILL.md + references/ci-cd-rubric.md; council-review/SKILL.md + references/code-review-rubric.md.
- CHANGELOG.md (with a ### BREAKING entry on the flip), CLAUDE.md (verification module note), docs/reference/environment-variables.md (flag).
- A consumer migration guide (docs/guides/verify.md#migrating or a note): "stop keying on blocking_issues == []; key on findings/severity."
New drift guard (highest-leverage completeness guarantee). Extend
tests/test_docs_drift.py: assert every field on VerifyResponse (and each
Finding/VerifyDiagnostics field) appears by name in docs/guides/verify.md
or docs/api.md. Turns "did we document the new response fields?" into a red
build — the gap the current guards (env / ADR-nav / snippet) don't cover.
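The field-name assertion could take roughly the following shape. This is a sketch under assumptions: a dataclass stands in for the Pydantic models (with Pydantic v2 the names would come from `model_fields`), the docs text is passed in as a string rather than read from docs/guides/verify.md and docs/api.md, and `undocumented_fields` is a hypothetical helper name.

```python
import re
from dataclasses import dataclass, fields
from typing import Iterable, Optional, Set


@dataclass
class Finding:
    # Dataclass stand-in for the spec's Pydantic Finding model.
    severity: str
    description: str
    location: Optional[str] = None
    dimension: Optional[str] = None


def undocumented_fields(models: Iterable[type], docs_text: str) -> Set[str]:
    """Return field names that never appear as a whole word in the docs."""
    missing = set()
    for model in models:
        for f in fields(model):
            # Word boundaries stop "findings_source" from satisfying "findings".
            if not re.search(rf"\b{re.escape(f.name)}\b", docs_text):
                missing.add(f.name)
    return missing


docs = "Each finding has a severity, a description and an optional location."
print(undocumented_fields([Finding], docs))  # {'dimension'}
```

In the test itself the assertion would simply be that the returned set is empty, so the failure message names the exact fields that still need documenting.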
5. Child breakdown for /adr-epic (sequenced)
Per-decision granularity; foundation-first; the breaking flip is out of the
epic. Non-critical/info findings are retained in findings[].
- C1 — flag + additive schema (foundation, non-breaking). Add LLM_COUNCIL_STRUCTURED_FINDINGS (default off), the Finding/VerifyDiagnostics models, and the additive VerifyResponse fields (empty by default). Flag-off ⇒ byte-identical (test-pinned). Env-reference + drift-guard field assertion.
- C2 — structured findings emission (behind flag). One chairman call emits severity-tagged findings[] (+ synthesis prose); populate findings[]; soft-fail to the legacy path (findings_source/fallback_reason).
- C3 — mechanical verdict + derive blocking_issues. verdict = policy(findings) (any critical ⇒ fail) as a pure host function (verdict_source: mechanical); blocking_issues = findings[critical]; prose regex demoted to the flagged fallback; #355 regression pinned (approval prose must not fabricate criticals).
- C4 — consistency invariant + severity-calibration telemetry. Assert the structural invariant (fail ⇔ critical present) and log verdict_evidence_mismatch if it ever fires (a gate-policy bug); emit findings-count / severity-distribution telemetry so severity mis-labelling (the named residual failure mode) is observable over time.
- C5 — diagnostics.inner_verdict / inner_confidence on softened UNCLEAR.
- C6 — docs sweep + drift guard + migration guide. The §4 checklist, bundled-skill sync, CHANGELOG (flag), CLAUDE.md. Flag still default-off.
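For C4, the invariant check and the severity-distribution telemetry might look like the sketch below. The marker strings and function names are hypothetical; the spec fixes only the verdict_evidence_mismatch field name. The check mirrors the fail ⇔ critical-present invariant as stated and does not model ADR-042 dispositions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

SEVERITIES = ("critical", "major", "minor", "info")


def severity_distribution(severities: Iterable[str]) -> Dict[str, int]:
    """Count findings per severity, always reporting all four buckets."""
    counts = Counter(severities)
    return {s: counts.get(s, 0) for s in SEVERITIES}


def verdict_evidence_mismatch(verdict: str, severities: List[str]) -> Optional[str]:
    """Return a marker if the verdict contradicts the findings, else None."""
    has_critical = "critical" in severities
    if verdict == "fail" and not has_critical:
        return "fail_without_critical"  # hypothetical marker value
    if verdict == "pass" and has_critical:
        return "pass_with_critical"     # hypothetical marker value
    return None


print(severity_distribution(["critical", "minor", "minor"]))
# {'critical': 1, 'major': 0, 'minor': 2, 'info': 0}
```

Reporting zero-count buckets keeps the telemetry shape stable, which makes a drift in the critical-to-major ratio easier to chart over time.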
Out of this epic (per ADR-051 + Council rev-2):
- P4 completeness reweight — a separate follow-up PR; re-measure after
C1–C6 land (it lives in the stage-2 rubric path, verdict_extractor.py:135,
not the findings channel).
- P5 LLM-as-a-Fuser spike — a separate research task after the epic (needs
structured findings to exist); pre-registered accept thresholds; produces a
go/no-go report, spawning its own ADR only if it clears them.
- Default-ON flip — a deliberate breaking release after consumers migrate.
6. Test plan (across the epic)
- Flag-off byte-identical (C1).
- Findings emission: a Phase-1 failure / unparseable output degrades to the legacy path, never crashes (C2).
- Mechanical verdict is a pure function — property test over synthetic findings[]: policy(findings) is deterministic; any critical ⇒ fail; no critical ⇒ pass; verdict never depends on prose (C3).
- blocking_issues invariants: FAIL ⇒ non-empty; findings ⊇ blocking_issues; #355 approval-prose regression (C3).
- Structural invariant: pass-with-critical / fail-without-critical cannot be produced; the verdict_evidence_mismatch assertion never fires in normal operation (C4).
- Softened UNCLEAR carries diagnostics.inner_verdict (C5).
- Drift guard: an undocumented VerifyResponse field fails CI (C1/C6).
- Corpus replay: re-run the epic-loop 25-call log (verification_ids in council-verify-stats.md) with the flag on; assert FAILs now carry non-empty findings. (OpenRouter credits restored 2026-07-05.)