
Verification & CI Gating

Verification is LLM Council's most distinctive surface: multi-model review of code, documents, or any other work product, with machine-actionable verdicts (ADR-034). Four models deliberate over your change, and a chairman renders a verdict of pass, fail, or unclear, along with a confidence score, rubric scores, and blocking issues.

Three ways in

| Surface | Invocation | Use for |
| --- | --- | --- |
| MCP tool | verify(snapshot_id, target_paths, tier, ...) | agent sessions (Claude Code, Cursor) |
| CLI | llm-council gate --snapshot <sha> [--tier ...] | CI/CD pipelines (exit code 0/1/2) |
| HTTP | POST /v1/council/verify | services |

llm-council gate --snapshot $(git rev-parse HEAD) \
  --file-paths src/module.py --tier balanced --rubric-focus Security

Exit codes: 0 PASS · 1 FAIL · 2 UNCLEAR.
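
For pipelines that wrap the CLI in a script rather than calling it directly, the exit-code contract can be sketched in Python as follows. This is a minimal sketch, not part of LLM Council: the helper names `verdict_from_exit_code` and `run_gate` are hypothetical, treating any other exit code as a tooling error is an assumption, and only the flags shown in the command above are used.

```python
import subprocess

# Exit-code contract stated above: 0 PASS, 1 FAIL, 2 UNCLEAR.
VERDICT_BY_EXIT_CODE = {0: "PASS", 1: "FAIL", 2: "UNCLEAR"}


def verdict_from_exit_code(code: int) -> str:
    """Map a gate exit code to a verdict name.

    Assumption: any code outside 0/1/2 is a tooling error, not a verdict.
    """
    return VERDICT_BY_EXIT_CODE.get(code, "ERROR")


def run_gate(snapshot: str, file_path: str, tier: str = "balanced") -> str:
    """Run `llm-council gate` for one file and return the verdict name."""
    cmd = [
        "llm-council", "gate",
        "--snapshot", snapshot,
        "--file-paths", file_path,
        "--tier", tier,
    ]
    # check=False: a non-zero exit is an expected outcome here, not an exception.
    completed = subprocess.run(cmd, check=False)
    return verdict_from_exit_code(completed.returncode)
```

A CI step could then fail the build on "FAIL" and hand "UNCLEAR" to a separate policy instead of treating both the same way.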

Tiers

| Tier | Budget | Max input | Use |
| --- | --- | --- | --- |
| quick | ~30s | 15K chars | sanity checks, small diffs |
| balanced | ~90s | 30K chars | routine verification (default) |
| high | ~180s | 50K chars | security-critical reviews |
| reasoning | ~600s | 50K chars | complex architectural decisions |
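
One way to use the input limits is to pick the cheapest tier that still fits the change. The sketch below is illustrative only: `smallest_fitting_tier` is a hypothetical helper, and it assumes "15K chars" means 15,000 characters and that the limits apply to the total input size.

```python
# Limits copied from the tier table; "15K chars" is read as 15_000 characters
# (an assumption about how the limit is counted).
TIER_MAX_CHARS = {
    "quick": 15_000,
    "balanced": 30_000,
    "high": 50_000,
    "reasoning": 50_000,
}
TIER_ORDER = ["quick", "balanced", "high", "reasoning"]


def smallest_fitting_tier(input_chars, minimum="quick"):
    """Return the first tier at or above `minimum` whose limit fits the input.

    Returns None when the input exceeds every tier, which signals that the
    scope should be reduced before verifying.
    """
    start = TIER_ORDER.index(minimum)
    for tier in TIER_ORDER[start:]:
        if input_chars <= TIER_MAX_CHARS[tier]:
            return tier
    return None
```

Passing `minimum="high"` keeps a security-critical review from being downgraded just because the diff is small.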

Reading an UNCLEAR verdict (ADR-047)

The exit code stays 2 for every UNCLEAR cause. That is a deliberate compatibility contract (ADR-047): existing automation that keys on exit codes keeps working, and unclear_reason is the routing signal on which you layer policies:

  • infra_failure — the chairman call itself errored (billing, auth, rate limit). Check your gateway/billing, then retry; never treat it as a review outcome.
  • low_confidence — deliberation completed below the confidence threshold. Common policy: accept-and-audit when blocking_issues is empty.
  • timeout — the tier deadline fired. Re-tier or reduce input scope.
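
The three causes above can be turned into a small routing policy. The following is a sketch under stated assumptions: it takes the verification result as an already-parsed dict, the `verdict` key name and the returned action strings are hypothetical, and only `unclear_reason` and `blocking_issues` come from the text above.

```python
def route_unclear(result: dict) -> str:
    """Pick a follow-up action for an UNCLEAR verdict.

    Assumption: `result` is the parsed verification response and exposes the
    verdict under a "verdict" key; the action names are illustrative.
    """
    if str(result.get("verdict", "")).lower() != "unclear":
        return "not_unclear"

    reason = result.get("unclear_reason")
    if reason == "infra_failure":
        # The chairman call errored: never a review outcome, so fix and retry.
        return "check_gateway_then_retry"
    if reason == "timeout":
        return "retier_or_reduce_scope"
    if reason == "low_confidence":
        # Common policy from the list above: accept when nothing blocks.
        if not result.get("blocking_issues"):
            return "accept_and_audit"
        return "escalate_to_human"
    # Unknown or missing reason: fail safe rather than guess.
    return "escalate_to_human"
```

Because the exit code is 2 in every case, a policy like this has to read the response body rather than the process status.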

Calibrated confidence (ADR-047)

Every response carries confidence (raw) and confidence_calibrated (raw passed through a monotonic mapping fitted against your recorded human dispositions). Build the mapping from your own transcript corpus:

llm-council calibration-report          # analyze .council/logs
llm-council calibration-report --fit    # fit mapping from dispositions

The PASS threshold uses the calibrated value only when LLM_COUNCIL_CALIBRATED_CONFIDENCE=true is set (default off).
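
To make "monotonic mapping" concrete, the sketch below fits a non-decreasing step function with the pool-adjacent-violators algorithm (isotonic regression). This is one standard way to build such a mapping and is offered as an illustration only; the source does not say which method `calibration-report --fit` uses, and the function names and the 0/1 encoding of human dispositions are assumptions.

```python
def fit_monotonic(raw_scores, dispositions):
    """Fit a non-decreasing step mapping from raw confidence to calibrated value.

    `dispositions` is assumed to hold 1 where a human agreed with the verdict
    and 0 where they did not. Returns a list of (upper_bound, value) steps.
    """
    blocks = []  # each block: [sum_of_labels, count, largest_raw_score]
    for score, label in sorted(zip(raw_scores, dispositions)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds this one's,
        # which is exactly the monotonicity violation being pooled away.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrate(mapping, raw_score):
    """Look up the calibrated value for a raw score in the fitted steps."""
    for upper, value in mapping:
        if raw_score <= upper:
            return value
    return mapping[-1][1]  # above every fitted score: use the top step
```

Whatever the fitting method, the property that matters is that a higher raw confidence never maps to a lower calibrated one.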

Screening judge (ADR-047, opt-in)

A single quick-tier model can pre-screen easy changes (LLM_COUNCIL_SCREENING=shadow|active; default off). Blocking-capable requests (blocking evidence, security focus, risk-glob paths) are never screened — the full council always runs for those. Start with shadow and read .council/screening/decisions.jsonl before trusting active.
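
The exclusion rule above amounts to a predicate over the request. The sketch below shows one possible reading of it; it is not the project's implementation. The function name, the example globs, and the shape of the evidence items (dicts with a `strength` key) are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical risk globs; a real deployment would configure its own.
RISK_GLOBS = ("*auth/*", "*migrations/*", "*.tf")


def is_blocking_capable(evidence, rubric_focus, target_paths, risk_globs=RISK_GLOBS):
    """Return True when a request must go to the full council.

    Mirrors the three conditions named above: blocking evidence, a security
    focus, or a target path that matches a risk glob.
    """
    has_blocking_evidence = any(
        item.get("strength") == "blocking" for item in evidence
    )
    security_focus = (rubric_focus or "").lower() == "security"
    touches_risky_path = any(
        fnmatchcase(path, glob) for path in target_paths for glob in risk_globs
    )
    return has_blocking_evidence or security_focus or touches_risky_path
```

Only requests for which this returns False would be candidates for the screening judge.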

Evidence injection (ADR-042)

Feed upstream tool output (linters, scanners) as structured evidence; the council must disposition each item. Items marked strength: blocking make the request blocking-capable.
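
As an illustration of what structured evidence might look like, the sketch below converts linter findings into evidence items. Only the `strength` field and its `blocking` value come from the text above; the other field names (`source`, `summary`), the `advisory` value, and the shape of the linter findings are assumptions, so check the ADR-042 schema before relying on them.

```python
def to_evidence(tool, findings, blocking_severities=("error",)):
    """Convert tool findings into evidence items.

    Assumptions: each finding is a dict with "severity" and "message" keys,
    and non-blocking items use a hypothetical "advisory" strength.
    """
    items = []
    for finding in findings:
        if finding["severity"] in blocking_severities:
            strength = "blocking"
        else:
            strength = "advisory"
        items.append(
            {"source": tool, "summary": finding["message"], "strength": strength}
        )
    return items


# Example: one error and one warning from a hypothetical linter run.
evidence = to_evidence(
    "example-linter",
    [
        {"severity": "error", "message": "possible SQL injection in query()"},
        {"severity": "warning", "message": "unused import os"},
    ],
)
```

Mapping severities to strengths at this boundary keeps the decision about what counts as blocking in one place.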

Prompt-cache cost note (ADR-049)

Verification prompts are assembled stable-prefix-first and cached on Anthropic council members (0.1× read price on repeat rounds; verified on the OpenRouter route). Multi-round verify sessions on the same subject are therefore much cheaper than round 1. The verify path uses a 1-hour cache TTL by default (rounds typically land 3–11 minutes apart); LLM_COUNCIL_PROMPT_CACHE_TTL=5m|1h overrides it, and LLM_COUNCIL_PROMPT_CACHING=false disables injection entirely. input_metrics reports cached_tokens (reads), cache_write_tokens, and cache_session_id — zero reads across rounds means a broken prefix or a lapsed TTL.
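
The zero-reads check described above can be automated across a session. The sketch below assumes you have collected each round's input_metrics as a dict, in round order; the function name is hypothetical, and only the `cached_tokens` field is taken from the text.

```python
def rounds_without_cache_reads(metrics_by_round):
    """Return the 1-based numbers of rounds after the first with zero cache reads.

    Round 1 is skipped because it is expected to write the cache rather than
    read from it. Any round listed here points at a broken prefix or a
    lapsed TTL.
    """
    missed = []
    for number, metrics in enumerate(metrics_by_round[1:], start=2):
        if metrics.get("cached_tokens", 0) == 0:
            missed.append(number)
    return missed


# Example session: round 1 writes, round 2 reads, round 3 misses.
session = [
    {"cached_tokens": 0, "cache_write_tokens": 12000},
    {"cached_tokens": 11800, "cache_write_tokens": 0},
    {"cached_tokens": 0, "cache_write_tokens": 12000},
]
missed_rounds = rounds_without_cache_reads(session)
```

A non-empty result is worth investigating before running further rounds, since each miss pays full input price again.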

Operational tips

  • Scope target_paths to the files that changed — whole-file expansion of pre-existing code invites off-scope findings.
  • Repeated re-verification of the same scrutinized files hits diminishing returns; act on verdicts rather than re-rolling them.
  • Every run persists a full transcript under .council/logs/<timestamp-id>/ (the audit MCP tool retrieves them).