Verification & CI Gating
LLM Council's most distinctive surface: multi-model verification of code,
documents, or any work product, with machine-actionable verdicts (ADR-034).
Four models deliberate over your change; a chairman renders pass / fail /
unclear with confidence, rubric scores, and blocking issues.
Three ways in
| Surface | Invocation | Use for |
|---|---|---|
| MCP tool | verify(snapshot_id, target_paths, tier, ...) | agent sessions (Claude Code, Cursor) |
| CLI | llm-council gate --snapshot <sha> [--tier ...] | CI/CD pipelines (exit code 0/1/2) |
| HTTP | POST /v1/council/verify | services |
llm-council gate --snapshot $(git rev-parse HEAD) \
--file-paths src/module.py --tier balanced --rubric-focus Security
Exit codes: 0 PASS · 1 FAIL · 2 UNCLEAR.
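A CI step can wrap the gate command and translate the exit code into a verdict. The following is a minimal sketch, not the project's own tooling: `build_gate_command`, `verdict_from_exit_code`, and `run_gate` are hypothetical helper names, and only the flags shown in the example above are used.

```python
import subprocess
from typing import List, Optional

# Exit-code contract from the text above: 0 PASS, 1 FAIL, 2 UNCLEAR.
EXIT_VERDICTS = {0: "PASS", 1: "FAIL", 2: "UNCLEAR"}


def build_gate_command(snapshot: str, file_path: str, tier: str = "balanced",
                       rubric_focus: Optional[str] = None) -> List[str]:
    """Assemble an argv list using only flags that appear in this section."""
    cmd = ["llm-council", "gate", "--snapshot", snapshot,
           "--file-paths", file_path, "--tier", tier]
    if rubric_focus is not None:
        cmd += ["--rubric-focus", rubric_focus]
    return cmd


def verdict_from_exit_code(code: int) -> str:
    """Map an exit code to a verdict; anything undocumented becomes ERROR."""
    return EXIT_VERDICTS.get(code, "ERROR")


def run_gate(cmd: List[str]) -> str:
    """Run the gate and return its verdict (requires llm-council on PATH)."""
    completed = subprocess.run(cmd, check=False)
    return verdict_from_exit_code(completed.returncode)
```

Treating any code outside 0/1/2 as `ERROR` is a design choice of this sketch: it keeps a crashed or missing binary from being mistaken for a review outcome.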
Tiers
| Tier | Budget | Max input | Use |
|---|---|---|---|
| quick | ~30s | 15K chars | sanity checks, small diffs |
| balanced | ~90s | 30K chars | default; routine verification |
| high | ~180s | 50K chars | security-critical reviews |
| reasoning | ~600s | 50K chars | complex architectural decisions |
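The table can be turned into a small lookup that picks the cheapest tier whose input limit fits a change. This is an illustrative sketch: `smallest_tier_for` is a hypothetical helper, and it assumes "15K chars" means 15,000 characters.

```python
from typing import Optional

# (tier, approximate budget in seconds, max input in characters), in table order.
# Assumption: "K" means 1,000 characters.
TIERS = [
    ("quick", 30, 15_000),
    ("balanced", 90, 30_000),
    ("high", 180, 50_000),
    ("reasoning", 600, 50_000),
]


def smallest_tier_for(input_chars: int, minimum: str = "quick") -> Optional[str]:
    """Return the first tier at or above `minimum` whose limit fits the input."""
    names = [name for name, _budget, _limit in TIERS]
    start = names.index(minimum)  # raises ValueError for an unknown tier name
    for name, _budget, max_chars in TIERS[start:]:
        if input_chars <= max_chars:
            return name
    return None  # too large for every tier: reduce the input scope instead
```

Passing `minimum="high"` lets a pipeline force a stricter tier for sensitive paths while still rejecting oversized inputs.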
Reading an UNCLEAR verdict (ADR-047)
The exit code stays 2 for every UNCLEAR cause. This is a deliberate
compatibility contract (ADR-047): existing automation that keys on exit codes
keeps working, and the unclear_reason value is the routing signal you layer
policies on top of:
- infra_failure: the chairman call itself errored (billing, auth, rate limit). Check your gateway/billing, then retry; never treat it as a review outcome.
- low_confidence: deliberation completed below the confidence threshold. Common policy: accept-and-audit when blocking_issues is empty.
- timeout: the tier deadline fired. Re-tier or reduce input scope.
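The three causes above map naturally onto a routing function. The sketch below is one possible policy, not a prescribed one: `route_unclear` and the action strings it returns are hypothetical, and it assumes the caller has already extracted `unclear_reason` and `blocking_issues` from the response.

```python
from typing import Any, Optional, Sequence


def route_unclear(unclear_reason: Optional[str],
                  blocking_issues: Sequence[Any]) -> str:
    """Pick a follow-up action for an UNCLEAR verdict (exit code 2)."""
    if unclear_reason == "infra_failure":
        # Not a review outcome: fix gateway/billing, then retry.
        return "retry"
    if unclear_reason == "timeout":
        # The tier deadline fired: choose another tier or send less input.
        return "retier_or_reduce_scope"
    if unclear_reason == "low_confidence":
        # Common policy from the list above: accept and audit when nothing blocks.
        return "accept_and_audit" if not blocking_issues else "human_review"
    # Unknown or missing reason: fail safe by escalating to a person.
    return "human_review"
```

Escalating unknown reasons to `human_review` keeps the policy conservative if new causes are added later.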
Calibrated confidence (ADR-047)
Every response carries confidence (raw) and confidence_calibrated (raw
passed through a monotonic mapping fitted against your recorded human
dispositions). Build the mapping from your own transcript corpus:
llm-council calibration-report # analyze .council/logs
llm-council calibration-report --fit # fit mapping from dispositions
The PASS threshold uses the calibrated value only when
LLM_COUNCIL_CALIBRATED_CONFIDENCE=true is set (default off).
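To make "monotonic mapping" concrete, here is a sketch of one standard way to fit such a mapping: the pool-adjacent-violators algorithm (isotonic regression). The source does not say which method `calibration-report --fit` uses, so this is a stand-in technique. The sample format is also an assumption: each sample pairs a raw confidence with 1 if the human disposition agreed with the verdict and 0 otherwise.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def _mean(block: List[float]) -> float:
    return block[0] / block[1]


def fit_monotonic(samples: Sequence[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Pool adjacent violators; return (upper raw bound, calibrated value) knots."""
    # Each block is [sum of outcomes, sample count, largest raw score in block].
    blocks: List[List[float]] = []
    for raw, outcome in sorted(samples):
        if blocks and blocks[-1][2] == raw:
            # Equal raw scores must share one calibrated value.
            blocks[-1][0] += outcome
            blocks[-1][1] += 1
        else:
            blocks.append([float(outcome), 1.0, raw])
        # Merge backwards while an earlier block has a higher mean than a later one.
        while len(blocks) > 1 and _mean(blocks[-2]) > _mean(blocks[-1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(block[2], _mean(block)) for block in blocks]


def calibrate(raw: float, knots: Sequence[Tuple[float, float]]) -> float:
    """Step lookup: use the first block whose upper bound covers `raw`."""
    uppers = [upper for upper, _value in knots]
    index = min(bisect_left(uppers, raw), len(knots) - 1)
    return knots[index][1]
```

With samples `[(0.6, 0), (0.7, 1), (0.8, 0), (0.9, 1), (0.95, 1)]`, the out-of-order pair at 0.7 and 0.8 is pooled into one block with value 0.5, so the fitted mapping never decreases as raw confidence rises.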
Screening judge (ADR-047, opt-in)
A single quick-tier model can pre-screen easy changes
(LLM_COUNCIL_SCREENING=shadow|active; default off). Blocking-capable
requests (blocking evidence, security focus, risk-glob paths) are never
screened — the full council always runs for those. Start with shadow and
read .council/screening/decisions.jsonl before trusting active.
Evidence injection (ADR-042)
Feed upstream tool output (linters, scanners) as structured evidence; the
council must disposition each item. strength: blocking items make the
request blocking-capable.
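The three conditions that make a request blocking-capable (blocking evidence, security focus, risk-glob paths) can be sketched as a single predicate. Everything here is illustrative: `is_blocking_capable` is a hypothetical helper, the evidence items are assumed to be dicts with a `strength` key, and `RISK_GLOBS` is an invented example list rather than the tool's real configuration.

```python
from fnmatch import fnmatchcase
from typing import Mapping, Optional, Sequence

# Hypothetical risk globs; a real deployment would supply its own.
RISK_GLOBS = ("src/auth/*", "*.tf", "migrations/*")


def is_blocking_capable(evidence: Sequence[Mapping[str, str]],
                        rubric_focus: Optional[str],
                        target_paths: Sequence[str],
                        risk_globs: Sequence[str] = RISK_GLOBS) -> bool:
    """True when the full council should run and screening must be skipped."""
    # 1. Any evidence item marked strength: blocking.
    if any(item.get("strength") == "blocking" for item in evidence):
        return True
    # 2. A security rubric focus (compared case-insensitively in this sketch).
    if rubric_focus is not None and rubric_focus.lower() == "security":
        return True
    # 3. Any target path matching a risk glob.
    return any(fnmatchcase(path, glob)
               for path in target_paths for glob in risk_globs)
```

A predicate like this is useful when auditing shadow-mode decisions: any screened request for which it returns `True` would point to a misconfiguration.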
Prompt-cache cost note (ADR-049)
Verification prompts are assembled stable-prefix-first and cached on
Anthropic council members (0.1× read price on repeat rounds; verified on
the OpenRouter route). Multi-round verify sessions on the same subject are
therefore much cheaper than round 1. The verify path uses a 1-hour cache
TTL by default (rounds typically land 3–11 minutes apart);
LLM_COUNCIL_PROMPT_CACHE_TTL=5m|1h overrides it, and
LLM_COUNCIL_PROMPT_CACHING=false disables injection entirely.
input_metrics reports cached_tokens (reads), cache_write_tokens, and
cache_session_id — zero reads across rounds means a broken prefix or a
lapsed TTL.
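The cache signals above lend themselves to a quick health check over a session's rounds. This sketch assumes each round's `input_metrics` is available as a dict with a `cached_tokens` key. The `input_tokens` parameter is a hypothetical total that includes cached tokens, and the cost estimate ignores any cache-write surcharge.

```python
from typing import List, Mapping, Sequence

# Cache reads are billed at 0.1x the normal input price, per the note above.
CACHE_READ_MULTIPLIER = 0.1


def cold_rounds(rounds: Sequence[Mapping[str, int]]) -> List[int]:
    """Return 1-based numbers of rounds after the first with zero cache reads."""
    return [number for number, metrics in enumerate(rounds, start=1)
            if number > 1 and metrics.get("cached_tokens", 0) == 0]


def relative_input_cost(input_tokens: int, cached_tokens: int) -> float:
    """Input cost as a fraction of what the same round would cost uncached."""
    uncached_tokens = input_tokens - cached_tokens
    billed = uncached_tokens + CACHE_READ_MULTIPLIER * cached_tokens
    return billed / input_tokens
```

For example, a round with 10,000 input tokens of which 9,000 were cache reads costs roughly 19% of the uncached price under these assumptions. Any round number returned by `cold_rounds` is worth checking for a changed prefix or a lapsed TTL.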
Operational tips
- Scope target_paths to the files that changed; whole-file expansion of pre-existing code invites off-scope findings.
- Repeated re-verification of the same scrutinized files hits diminishing returns; act on verdicts rather than re-rolling them.
- Every run persists a full transcript under .council/logs/<timestamp-id>/ (the audit MCP tool retrieves them).
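One way to scope verification to changed files is to ask git for them. The sketch below uses the standard `git diff --name-only` command with `--diff-filter=ACMR`, which keeps added, copied, modified, and renamed files and so excludes deletions. `changed_paths` and `parse_name_only` are hypothetical helper names, and how the resulting list is passed to the council is left to the caller.

```python
import subprocess
from typing import List


def parse_name_only(output: str) -> List[str]:
    """Split `git diff --name-only` output into a clean list of paths."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def changed_paths(base: str, head: str = "HEAD") -> List[str]:
    """List files added, copied, modified, or renamed between two commits."""
    completed = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=ACMR", base, head],
        check=True,           # raise if git fails (for example, unknown revision)
        capture_output=True,
        text=True,
    )
    return parse_name_only(completed.stdout)
```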