Skip to content

ADR-048: Council Quality Benchmark & Golden-Dataset Regression

Status: Implemented 2026-07-03 (epic #421, v0.32.0) Date: 2026-07-03 Decision Makers: Chris Joseph, LLM Council Council Review: 2026-07-03 (4 models, balanced) — feedback incorporated: spend caps, CLI spec, dataset governance, per-phase DoD Related: ADR-036 (Phase 3 — this implements the golden-dataset/eval-bridge slice), ADR-011 (cost ground truth), ADR-044 (router needs this data), ADR-047 (calibration corpus overlaps)


Context

The core product claim — a deliberating council beats a single frontier model — is unproven in our own repo, and the counter-question that matters commercially is sharper: when is it worth the cost? We are uniquely positioned to answer: heterogeneous-model ensembles are called out in the test-time-compute literature as underexplored, and (unusually) we hold per-query cost ground truth (ADR-011) next to quality metrics (ADR-036 CSS/DDI/SAS, rubric scores).

Meanwhile nothing guards council quality against silent drift: model pool changes, prompt tweaks, and routing features (ADR-044) all shift output with no regression signal. ADR-036 Phase 3 specified golden datasets and DeepEval/RAGAS bridges but was deferred.

Decision

Phase 1 — Golden dataset + drift regression harness

  • A small curated question set (~20–30 items across domains: coding, reasoning, factual, judgment) committed to the repo with expected-quality envelopes (rubric ranges, CSS floor, key-content assertions — not exact-string).
  • A harness (llm-council bench run) executes the set against a configured council and reports per-item metrics + aggregate deltas vs. a committed baseline snapshot.
  • Runs on demand and nightly (scheduled CI), never per-PR — each run costs real API spend; the harness prints its own ADR-011 cost total.
  • Spend caps (council feedback): a hard per-run cap (LLM_COUNCIL_BENCH_MAX_USD, default $2.00) enforced via the ADR-011 estimator before each item and actuals after; the run aborts gracefully at the cap with partial results marked partial. The scheduled workflow sets an explicit cap and a monthly guard (skip if month-to-date bench spend, summed from run artefacts, exceeds LLM_COUNCIL_BENCH_MONTHLY_USD, default $30).
  • CLI (council feedback): llm-council bench run [--config PATH] [--max-usd N] [--items id,…], bench baseline --set (snapshot current as baseline), bench report [--format md|json]. Exit codes: 0 within envelope, 1 drift beyond envelope, 2 aborted/partial.

Phase 2 — Council-vs-single-model benchmark (quality per dollar)

  • The same question set run across configurations: each council member solo, the full council, and (flag-on) ADR-044 graduated depth.
  • Output: quality-per-dollar table using ADR-011 actual costs — the empirical answer to "when does deliberation pay?", and the tuning data ADR-044's thresholds need.
  • Scoring: rubric scores from a fixed judge configuration, with the ADR-047 calibration caveats attached; publish methodology with results.

Dataset governance (council feedback): the dataset is versioned in-repo (bench/dataset/vN/), changes only by PR review; items must be original or verifiably licensed, contain no PII, and record provenance + rationale; envelope changes require a note justifying the new expected range (no silent goal-post moves).

Phase 3 — Publication + eval-framework bridges

  • Results page in the docs site, regenerated by the harness (methodology, costs, dates, model versions — reproducibility first).
  • Thin DeepEval/RAGAS adapters so external eval suites can drive the council as a target (ADR-036 P3 commitment).

Consequences

Positive: evidence for the core claim (or honest boundaries of it) in a crowding ecosystem; a drift alarm for every future routing/prompt change; the dataset doubles as ADR-047 calibration input and ADR-044 threshold-tuning input — three consumers, one asset.

Negative / risks: benchmark runs cost real money (mitigation: small set, scheduled not per-PR, cost printed via ADR-011); judge-based scoring inherits judge bias (mitigation: fixed judge config, publish raw outputs, ADR-047 calibration notes); a public "council loses at X" result is possible (mitigation: that is the point — publish honestly with cost context).

Definition of Done (per phase — council feedback: all phases covered)

  • P1: harness unit-tested with mocked councils (no live spend in CI unit tests); dataset v1 committed with governance doc; baseline snapshot; spend-cap abort tested.
  • P2: config matrix runs produce the quality-per-dollar table from ADR-011 actuals; methodology documented alongside results.
  • P3: docs-site page regenerated by the harness; DeepEval/RAGAS adapters with at least one round-trip example each.
  • All phases: README bench section, CLAUDE.md, CHANGELOG; scheduled workflow reviewed for both caps.

References

  • docs/roadmap-2026-h2.md item 5; ADR-036 §Phase 3; ADR-011 usage schema.