# ADR-036: Output Quality Quantification Framework

| Field | Value |
|---|---|
| Status | Draft |
| Date | 2025-12-29 |
| Author | Chris (amiable-dev) |
| Supersedes | - |
| Related ADRs | ADR-015 (Bias Auditing), ADR-016 (Structured Rubric Scoring), ADR-018 (Cross-Session Bias Aggregation), ADR-025b (Jury Mode), ADR-030 (Scoring Refinements) |
## Context
LLM Council implements a sophisticated multi-stage peer review system where multiple LLMs deliberate to produce synthesized answers. While existing ADRs address internal mechanics (rubric scoring, bias detection, consensus mechanisms), we lack a comprehensive framework for quantifying the accuracy and quality of the council's final output in a way that:
- Provides meaningful metrics to users about answer reliability
- Enables regression testing and quality monitoring over time
- Supports differentiated value propositions between OSS and paid tiers
- Integrates with established LLM evaluation frameworks in the ecosystem
The current gap: users receive a synthesized answer but have limited visibility into how confident the council is, how much agreement existed, and whether the answer meets objective quality thresholds.
### Industry Context
The LLM evaluation landscape has matured significantly, with frameworks like DeepEval, RAGAS, Promptfoo, TruLens, and LangSmith providing standardized approaches to quantifying LLM output quality. Key emerging patterns:
- LLM-as-a-Judge has become the dominant paradigm, replacing traditional metrics (BLEU/ROUGE) that poorly capture semantic nuance
- Multi-dimensional rubric scoring with weighted dimensions (similar to our ADR-016)
- Faithfulness/groundedness metrics for RAG-style systems
- Confidence calibration to detect overconfident or underconfident outputs
- Statistical aggregation across sessions for meaningful trend analysis
LLM Council's unique architecture—where multiple LLMs already evaluate each other in Stage 2—provides inherent advantages for quality quantification that single-model systems cannot match.
## Decision
Implement a three-tier Output Quality Quantification (OQQ) framework that exposes quality metrics at increasing levels of sophistication:
### Tier 1: Core Metrics (OSS)
Available in the open-source package, included in all council responses:
#### 1.1 Consensus Strength Score (CSS)
A normalized 0.0–1.0 score indicating the level of agreement among council members during Stage 2 peer review.
Calculation:
```python
import statistics

def consensus_strength_score(rankings: list[dict]) -> float:
    """
    Calculate consensus strength from Stage 2 rankings.

    High CSS (>0.8): strong agreement, high confidence in synthesis.
    Medium CSS (0.5-0.8): moderate agreement, nuanced topic.
    Low CSS (<0.5): significant disagreement, consider debate mode.
    """
    # Aggregate Borda scores per response and normalize to the 0-1 range
    borda_scores = aggregate_borda_scores(rankings)
    normalized = normalize_scores(borda_scores)

    # Score spread: a large gap between best and worst indicates a clear winner
    spread = max(normalized) - min(normalized)

    # Variance: low variance means the council's scores cluster tightly (agreement)
    variance = statistics.variance(normalized)

    # CSS balances a clear winner (high spread) with tight clustering (low variance)
    css = (spread * 0.6) + ((1 - variance) * 0.4)
    return round(css, 3)
```
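The `aggregate_borda_scores` and `normalize_scores` helpers above are left abstract. A minimal sketch, assuming each Stage 2 ranking maps a response ID to its rank position (1 = best); this data shape is an illustrative assumption, not the committed ranking schema:

```python
from collections import defaultdict

def aggregate_borda_scores(rankings: list[dict]) -> dict[str, float]:
    """Sum Borda points per response: rank 1 of N earns N-1 points, last place earns 0."""
    totals: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for response_id, rank in ranking.items():
            totals[response_id] += n - rank
    return dict(totals)

def normalize_scores(borda_scores: dict[str, float]) -> list[float]:
    """Scale Borda totals to the 0-1 range; identical totals normalize to all zeros."""
    values = list(borda_scores.values())
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]
```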
Interpretation:
| CSS Range | Interpretation | Recommended Action |
|-----------|---------------|-------------------|
| 0.85–1.0 | Strong consensus | High confidence in synthesis |
| 0.70–0.84 | Moderate consensus | Synthesis reliable, note minority views |
| 0.50–0.69 | Weak consensus | Consider `include_dissent=true` |
| < 0.50 | Significant disagreement | Recommend debate mode |
#### 1.2 Deliberation Depth Index (DDI)
Quantifies how thoroughly the council considered the query.
Components:

- Response Diversity (0.0–1.0): Semantic dissimilarity of Stage 1 responses
- Review Coverage (0.0–1.0): Percentage of valid rankings received
- Critique Richness (0.0–1.0): Average length/detail of Stage 2 justifications
```python
from statistics import mean

def deliberation_depth_index(
    stage1_responses: list[str],
    stage2_rankings: list[dict],
    stage2_justifications: list[str],
) -> float:
    """
    Higher DDI indicates more thorough deliberation.
    """
    # Response diversity via pairwise embedding cosine distances
    embeddings = embed_responses(stage1_responses)
    diversity = average_pairwise_distance(embeddings)

    # Review coverage: each of N models is expected to rank all N responses
    expected_reviews = len(stage1_responses) ** 2
    actual_reviews = len(stage2_rankings)
    coverage = actual_reviews / expected_reviews

    # Critique richness: non-trivial justifications (50+ tokens saturate at 1.0)
    avg_justification_tokens = mean(len(j.split()) for j in stage2_justifications)
    richness = min(1.0, avg_justification_tokens / 50)

    # Weighted combination
    ddi = (diversity * 0.35) + (coverage * 0.35) + (richness * 0.30)
    return round(ddi, 3)
```
#### 1.3 Synthesis Attribution Score (SAS)
When using `include_details=true`, indicates how well the final synthesis traces back to peer-reviewed responses.
```python
from statistics import mean

def synthesis_attribution_score(
    synthesis: str,
    winning_responses: list[str],
    all_responses: list[str],
) -> dict:
    """
    Measures attribution of the synthesis to source responses.
    """
    synthesis_embedding = embed(synthesis)

    # How closely does the synthesis align with the top-ranked responses?
    winner_similarity = mean(
        cosine_similarity(synthesis_embedding, embed(r))
        for r in winning_responses
    )

    # Is the synthesis grounded in ANY council response?
    max_similarity = max(
        cosine_similarity(synthesis_embedding, embed(r))
        for r in all_responses
    )

    # Hallucination risk: the synthesis diverges from all source responses
    hallucination_risk = 1.0 - max_similarity

    return {
        "winner_alignment": round(winner_similarity, 3),
        "max_source_alignment": round(max_similarity, 3),
        "hallucination_risk": round(hallucination_risk, 3),
        "grounded": max_similarity > 0.6,
    }
```
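The `embed`, `embed_responses`, `cosine_similarity`, `average_pairwise_distance`, and `mean_embedding` helpers used by DDI, SAS, and the temporal consistency check below are also left abstract. A minimal sketch built on `sentence-transformers` and NumPy; the model choice and signatures are illustrative (in practice `LLM_COUNCIL_EMBEDDING_MODEL` would select the backend):

```python
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative local embedding model; swap for the configured embedding backend.
_model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> np.ndarray:
    return _model.encode(text, normalize_embeddings=True)

def embed_responses(texts: list[str]) -> list[np.ndarray]:
    return list(_model.encode(texts, normalize_embeddings=True))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_pairwise_distance(embeddings: list[np.ndarray]) -> float:
    """Mean cosine distance across all response pairs (higher = more diverse)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return float(np.mean([1.0 - cosine_similarity(a, b) for a, b in pairs]))

def mean_embedding(embeddings: list[np.ndarray]) -> np.ndarray:
    """Element-wise mean vector of a set of embeddings."""
    return np.mean(np.stack(embeddings), axis=0)
```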
### Tier 2: Enhanced Analytics (Paid - Standard)
Extended metrics for organizations requiring quality assurance:
#### 2.1 Temporal Consistency Score (TCS)
Track whether council responses to similar queries remain consistent over time. Requires persistence (builds on ADR-018).
```python
from datetime import datetime

def temporal_consistency_score(
    current_response: str,
    historical_responses: list[tuple[str, datetime]],
    query_similarity_threshold: float = 0.85,
) -> dict:
    """
    Measure how consistent the current response is with responses to
    semantically similar past queries.
    """
    current_embedding = embed(current_response)

    # Keep only historical responses that are semantically close to the current one
    similar_historical = [
        (resp, ts) for resp, ts in historical_responses
        if cosine_similarity(embed(resp), current_embedding) > query_similarity_threshold
    ]
    if len(similar_historical) < 3:
        return {"status": "insufficient_data", "score": None}

    # Drift from the historical consensus (mean embedding of similar responses)
    historical_mean = mean_embedding([embed(r) for r, _ in similar_historical])
    drift = 1.0 - cosine_similarity(current_embedding, historical_mean)

    return {
        "status": "calculated",
        "score": round(1.0 - drift, 3),
        "sample_size": len(similar_historical),
        "drift_detected": drift > 0.2,
    }
```
#### 2.2 Cross-Model Calibration Report
Extends ADR-015 bias auditing with calibration curves per model.
```yaml
calibration_report:
  session_count: 147
  confidence_level: "high"  # Per ADR-018 thresholds
  models:
    "anthropic/claude-opus-4.5":
      mean_score_given: 7.2
      mean_score_received: 7.8
      calibration_delta: -0.6    # Tends to score harshly
      self_preference_rate: 0.12  # 12% of the time ranks self #1
    "openai/gpt-4o":
      mean_score_given: 7.6
      mean_score_received: 7.4
      calibration_delta: +0.2
      self_preference_rate: 0.08
  recommendations:
    - "claude-opus-4.5 scores 0.6 points below median; consider calibration weight"
    - "Cross-session length bias: r=0.42 (moderate positive correlation)"
```
#### 2.3 Rubric Dimension Breakdown
When ADR-016 rubric scoring is enabled, expose per-dimension aggregates:
```json
{
  "rubric_breakdown": {
    "accuracy": {
      "mean": 8.2,
      "std": 0.9,
      "weight": 0.35,
      "ceiling_applied": false
    },
    "relevance": {
      "mean": 8.7,
      "std": 0.4,
      "weight": 0.10
    },
    "completeness": {
      "mean": 7.4,
      "std": 1.2,
      "weight": 0.20,
      "flag": "high variance suggests split opinions"
    },
    "conciseness": {
      "mean": 6.8,
      "std": 0.7,
      "weight": 0.15
    },
    "clarity": {
      "mean": 8.1,
      "std": 0.5,
      "weight": 0.20
    },
    "weighted_composite": 7.86,
    "accuracy_ceiling_effect": "none"
  }
}
```
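A minimal sketch of how the per-dimension aggregates and weighted composite might be produced from ADR-016 rubric scores; the input shape (dimension name mapped to the list of reviewer scores) and the weight table are illustrative assumptions mirroring the example above:

```python
from statistics import mean, stdev

# Illustrative weights mirroring the example above; ADR-016 defines the canonical set.
DIMENSION_WEIGHTS = {
    "accuracy": 0.35,
    "relevance": 0.10,
    "completeness": 0.20,
    "conciseness": 0.15,
    "clarity": 0.20,
}

def rubric_breakdown(dimension_scores: dict[str, list[float]]) -> dict:
    """Aggregate per-dimension rubric scores into means, spreads, and a weighted composite."""
    breakdown: dict = {}
    composite = 0.0
    for dimension, weight in DIMENSION_WEIGHTS.items():
        scores = dimension_scores[dimension]
        dim_mean = mean(scores)
        breakdown[dimension] = {
            "mean": round(dim_mean, 2),
            "std": round(stdev(scores), 2) if len(scores) > 1 else 0.0,
            "weight": weight,
        }
        composite += dim_mean * weight
    breakdown["weighted_composite"] = round(composite, 2)
    return breakdown
```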
### Tier 3: Enterprise Quality Assurance (Paid - Enterprise)
Comprehensive quality pipeline with external validation:
#### 3.1 External Evaluation Framework Integration
Integrate with industry-standard evaluation frameworks for validation:
DeepEval Integration:
```python
from typing import Optional

from deepeval.metrics import AnswerRelevancyMetric, GEval, HallucinationMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

async def validate_council_output(
    query: str,
    council_response: str,
    ground_truth: Optional[str] = None,
    context: Optional[list[str]] = None,
) -> dict:
    """
    Validate council output using DeepEval metrics.
    """
    test_case = LLMTestCase(
        input=query,
        actual_output=council_response,
        expected_output=ground_truth,
        context=context,            # evaluated by HallucinationMetric
        retrieval_context=context,
    )
    metrics = []

    # Answer relevancy (always)
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    await relevancy.a_measure(test_case)
    metrics.append(("relevancy", relevancy.score, relevancy.reason))

    # Hallucination (if context provided)
    if context:
        hallucination = HallucinationMetric(threshold=0.5)
        await hallucination.a_measure(test_case)
        metrics.append(("hallucination_free", 1.0 - hallucination.score, hallucination.reason))

    # Correctness (if ground truth provided)
    if ground_truth:
        correctness = GEval(
            name="Correctness",
            criteria="Factual accuracy compared to ground truth",
            evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
            threshold=0.7,
        )
        await correctness.a_measure(test_case)
        metrics.append(("correctness", correctness.score, correctness.reason))

    return {
        "validation_framework": "deepeval",
        "metrics": {name: {"score": score, "reason": reason} for name, score, reason in metrics},
        "overall_pass": all(score >= 0.7 for _, score, _ in metrics),
    }
```
RAGAS Integration (for RAG-enhanced council queries):
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

async def validate_rag_council_output(
    query: str,
    council_response: str,
    retrieved_context: list[str],
    ground_truth: str,
) -> dict:
    """
    For council queries that included RAG context.
    """
    dataset = Dataset.from_dict({
        "question": [query],
        "answer": [council_response],
        "contexts": [retrieved_context],
        "ground_truth": [ground_truth],
    })
    result = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )
    return {
        "validation_framework": "ragas",
        "metrics": result.to_dict(),
        "faithfulness_score": result["faithfulness"],
        "rag_quality_pass": result["faithfulness"] >= 0.8,
    }
```
#### 3.2 Golden Dataset Regression Testing
Maintain organization-specific golden datasets for regression testing:
```yaml
# llm_council_golden.yaml
golden_datasets:
  - name: "compliance_queries"
    description: "Regulatory compliance questions with verified answers"
    queries:
      - id: "gdpr-001"
        query: "What are the key requirements for GDPR data portability?"
        ground_truth: "Article 20 requires..."
        minimum_scores:
          accuracy: 8
          completeness: 7
        required_mentions:
          - "Article 20"
          - "machine-readable format"
          - "30 days"
        forbidden_claims:
          - "automatic"  # Common hallucination
  - name: "technical_architecture"
    description: "Cloud architecture decision questions"
    evaluation_mode: "expert_review"  # Requires human validation
    queries:
      - id: "arch-001"
        query: "Microservices vs monolith for a 10-person startup?"
        rubric:
          - dimension: "trade_off_coverage"
            weight: 0.3
            criteria: "Mentions both approaches' pros/cons"
          - dimension: "context_awareness"
            weight: 0.3
            criteria: "Considers team size and startup constraints"
          - dimension: "actionability"
            weight: 0.4
            criteria: "Provides clear recommendation with reasoning"
```
Regression Test Runner:
```python
from datetime import datetime

async def run_regression_suite(
    golden_dataset: str,
    council_config: CouncilConfig,
) -> RegressionReport:
    """
    Run the council against a golden dataset and compare to historical baselines.
    """
    dataset = load_golden_dataset(golden_dataset)
    results = []

    for case in dataset.queries:
        # Run the full three-stage council on the golden query
        stage1, stage2, stage3, metadata = await run_full_council(
            case.query,
            confidence=council_config.confidence_level,
        )
        # Validate the synthesis against the case criteria
        validation = await validate_against_case(stage3, case)
        results.append({
            "case_id": case.id,
            "passed": validation.passed,
            "scores": validation.scores,
            "regressions": validation.regressions_from_baseline,
        })

    return RegressionReport(
        dataset=golden_dataset,
        timestamp=datetime.utcnow(),
        pass_rate=sum(r["passed"] for r in results) / len(results),
        results=results,
        baseline_comparison=compare_to_baseline(results, dataset),
    )
```
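The `validate_against_case` helper referenced above is not defined here. A minimal sketch covering the string-based criteria from the golden schema; minimum-score checks and baseline comparison would slot in alongside, and `CaseValidation` is an illustrative holder:

```python
from dataclasses import dataclass, field

@dataclass
class CaseValidation:
    passed: bool
    scores: dict = field(default_factory=dict)
    regressions_from_baseline: list = field(default_factory=list)
    failures: list = field(default_factory=list)

async def validate_against_case(stage3_response: str, case: GoldenCase) -> CaseValidation:
    """Check required mentions and forbidden claims against the synthesized answer."""
    text = stage3_response.lower()
    failures = []
    for phrase in case.required_mentions:
        if phrase.lower() not in text:
            failures.append(f"missing required mention: {phrase}")
    for phrase in case.forbidden_claims:
        if phrase.lower() in text:
            failures.append(f"contains forbidden claim: {phrase}")
    return CaseValidation(passed=not failures, failures=failures)
```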
#### 3.3 Continuous Quality Monitoring Dashboard
Real-time quality metrics exposed via API and webhooks:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class QualityMonitoringEvent:
    """Emitted after each council deliberation for monitoring systems."""
    timestamp: datetime
    request_id: str
    query_hash: str  # Privacy-preserving hash

    # Tier 1 metrics
    consensus_strength: float
    deliberation_depth: float
    synthesis_attribution: dict

    # Tier 2 metrics (if enabled)
    rubric_breakdown: Optional[dict]
    bias_indicators: Optional[dict]

    # Tier 3 metrics (if validation enabled)
    external_validation: Optional[dict]

    # Alerts
    alerts: list[str]  # e.g., ["low_consensus", "hallucination_risk"]


# Webhook integration
async def emit_quality_event(event: QualityMonitoringEvent):
    """Emit to configured monitoring endpoints."""
    for webhook in config.monitoring_webhooks:
        await dispatch_webhook(webhook, event.to_dict())
```
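A minimal `dispatch_webhook` sketch using `httpx`; the timeout and error handling are illustrative, not a committed design:

```python
import logging

import httpx

logger = logging.getLogger(__name__)

async def dispatch_webhook(url: str, payload: dict, timeout_s: float = 5.0) -> None:
    """POST a quality event to a monitoring endpoint; failures are logged, never raised."""
    try:
        async with httpx.AsyncClient(timeout=timeout_s) as client:
            response = await client.post(url, json=payload)
            response.raise_for_status()
    except httpx.HTTPError as exc:
        # Monitoring must never break the council request path.
        logger.warning("Quality webhook %s failed: %s", url, exc)
```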
## Metric Exposure Strategy
### API Response Enhancement
All council responses include a `quality_metrics` object:
```jsonc
{
  "stage3_response": "The synthesized answer...",
  "metadata": {
    "duration_ms": 4523,
    "models_participated": ["claude-opus-4.5", "gpt-4o", "gemini-3-pro"],
    "quality_metrics": {
      "tier": "standard",  // "core" | "standard" | "enterprise"
      "core": {
        "consensus_strength": 0.82,
        "deliberation_depth": 0.74,
        "synthesis_attribution": {
          "winner_alignment": 0.89,
          "hallucination_risk": 0.08,
          "grounded": true
        }
      },
      "standard": {  // Paid tier only
        "rubric_breakdown": { ... },
        "calibration_notes": ["claude-opus-4.5 scored 0.5 below peer average"]
      },
      "enterprise": {  // Enterprise tier only
        "external_validation": {
          "deepeval_relevancy": 0.91,
          "deepeval_hallucination_free": 0.94
        },
        "regression_baseline_delta": 0.03
      }
    },
    "quality_alerts": []  // Empty = no concerns
  }
}
```
### CLI Quality Reports
```bash
# Per-session quality summary
llm-council quality-report --session latest

# Cross-session quality trends (requires persistence)
llm-council quality-report --sessions 50 --format json

# Regression test against golden dataset
llm-council regression-test --dataset compliance_queries --output report.html
```
### MCP Tool Enhancement
Update the `consult_council` tool to accept quality options:
```python
@mcp_tool
async def consult_council(
    query: str,
    confidence: str = "high",
    include_details: bool = False,
    verdict_type: str = "synthesis",
    include_dissent: bool = False,
    # New quality options
    quality_metrics: bool = True,        # Include Tier 1 metrics
    validate_output: bool = False,       # Run external validation (Tier 3)
    ground_truth: Optional[str] = None,  # For correctness validation
) -> CouncilResponse:
    ...
```
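For example, an enterprise caller could request external validation alongside the core metrics (a hypothetical invocation; the access pattern mirrors the response schema shown earlier):

```python
response = await consult_council(
    "What are the key requirements for GDPR data portability?",
    include_details=True,
    quality_metrics=True,   # Tier 1 metrics in metadata
    validate_output=True,   # Tier 3 external validation
    ground_truth="Article 20 requires...",
)
core = response.metadata["quality_metrics"]["core"]
if core["consensus_strength"] < 0.5:
    # Per the CSS interpretation table, low consensus suggests debate mode.
    ...
```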
## OSS vs Paid Tier Differentiation
| Capability | OSS (Free) | Standard (Paid) | Enterprise (Paid) |
|---|---|---|---|
| Consensus Strength Score | ✅ | ✅ | ✅ |
| Deliberation Depth Index | ✅ | ✅ | ✅ |
| Synthesis Attribution | ✅ | ✅ | ✅ |
| Per-response rubric breakdown | ❌ | ✅ | ✅ |
| Cross-model calibration reports | ❌ | ✅ | ✅ |
| Temporal consistency tracking | ❌ | ✅ | ✅ |
| Cross-session bias aggregation | Basic (ADR-018) | Enhanced | Full |
| DeepEval/RAGAS integration | ❌ | ❌ | ✅ |
| Golden dataset regression testing | ❌ | ❌ | ✅ |
| Continuous monitoring webhooks | ❌ | ❌ | ✅ |
| Quality SLA guarantees | ❌ | ❌ | ✅ |
| Historical quality data retention | 7 days | 90 days | Unlimited |
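A minimal sketch of the tier gating this table implies, assuming the tier comes from `LLM_COUNCIL_QUALITY_TIER`; the function and mapping names are illustrative:

```python
import os

# Metric sections each tier may expose, per the table above.
TIER_SECTIONS = {
    "core": ["core"],
    "standard": ["core", "standard"],
    "enterprise": ["core", "standard", "enterprise"],
}

def gate_quality_metrics(quality_metrics: dict) -> dict:
    """Strip metric sections the configured tier is not entitled to."""
    tier = os.getenv("LLM_COUNCIL_QUALITY_TIER", "core")
    gated = {"tier": tier}
    for section in TIER_SECTIONS.get(tier, ["core"]):
        if section in quality_metrics:
            gated[section] = quality_metrics[section]
    return gated
```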
## Value Proposition Alignment
OSS Users (Developers, Hobbyists):

- Get immediate feedback on answer reliability
- Core metrics sufficient for most use cases
- Builds trust and drives adoption

Standard Tier (Teams, Small Organizations):

- Understand why quality varies (rubric dimensions, calibration)
- Track quality over time for regression detection
- Justify council usage to stakeholders

Enterprise Tier (Regulated Industries, Large Organizations):

- External validation for audit trails
- Golden dataset integration for compliance
- SLA-backed quality guarantees
- Continuous monitoring for production systems
## Implementation Phases
### Phase 1: Core Metrics (OSS) — 2 weeks
- [ ] Implement `consensus_strength_score()` in `council.py`
- [ ] Implement `deliberation_depth_index()` with embedding service
- [ ] Implement `synthesis_attribution_score()`
- [ ] Add `quality_metrics` to response metadata
- [ ] Update MCP tool schema
- [ ] Documentation and examples
### Phase 2: Enhanced Analytics — 3 weeks
- [ ] Temporal consistency tracking (extend ADR-018 persistence)
- [ ] Cross-model calibration report generation
- [ ] Rubric breakdown exposure (extend ADR-016)
- [ ] CLI `quality-report` command
- [ ] Tier gating infrastructure
### Phase 3: Enterprise Quality Assurance — 4 weeks
- [ ] DeepEval integration module
- [ ] RAGAS integration module
- [ ] Golden dataset schema and loader
- [ ] Regression test runner
- [ ] Monitoring webhook system
- [ ] Dashboard API endpoints
## Consequences
### Positive
- User Trust: Quantified quality metrics build confidence in council outputs
- Regression Detection: Quality trends surface degradation before users notice
- Differentiated Value: Clear tier separation justifies paid offerings
- Ecosystem Alignment: Integration with established frameworks (DeepEval, RAGAS) provides familiar patterns
- Auditability: Enterprise customers can demonstrate due diligence for AI-assisted decisions
### Negative
- Complexity: Additional computation per request (mitigated by optional flags)
- Latency: External validation adds ~500-2000ms (Tier 3 only, async option available)
- Dependencies: External frameworks introduce versioning concerns
- Storage: Historical quality data requires persistence infrastructure
### Risks
- Metric Gaming: Users may over-optimize for metrics rather than actual quality
  - Mitigation: Multiple complementary metrics, emphasize holistic interpretation
- False Confidence: High scores on flawed queries could mislead users
  - Mitigation: Clear documentation of metric limitations, confidence intervals
- Framework Churn: External evaluation frameworks evolve rapidly
  - Mitigation: Abstraction layer, version pinning, graceful degradation
## References
- DeepEval Documentation
- RAGAS Metrics
- LLM-as-a-Judge: MT-Bench and Chatbot Arena
- G-Eval: NLG Evaluation using GPT-4
- ADR-015: Per-Session Bias Audit
- ADR-016: Structured Rubric Scoring
- ADR-018: Cross-Session Bias Aggregation
- ADR-025b: Jury Mode (Binary Verdicts)
- ADR-030: Scoring Refinements
## Appendix A: Example Quality Report
```text
╔══════════════════════════════════════════════════════════════════╗
║                    LLM Council Quality Report                      ║
║                  Session: 2025-12-29T14:32:00Z                     ║
╠══════════════════════════════════════════════════════════════════╣
║ CORE METRICS (Tier 1)                                              ║
╠═══════════════════════╦══════════╦═══════════════════════════════╣
║ Consensus Strength    ║   0.84   ║ ████████░░ Strong consensus   ║
║ Deliberation Depth    ║   0.71   ║ ███████░░░ Adequate depth     ║
║ Winner Alignment      ║   0.89   ║ █████████░ Well-grounded      ║
║ Hallucination Risk    ║   0.06   ║ █░░░░░░░░░ Low risk           ║
╠═══════════════════════╩══════════╩═══════════════════════════════╣
║ RUBRIC BREAKDOWN (Tier 2)                                          ║
╠═══════════════════════╦══════════╦═══════════════════════════════╣
║ Accuracy (35%)        ║   8.4    ║ ████████░░                    ║
║ Relevance (10%)       ║   9.1    ║ █████████░                    ║
║ Completeness (20%)    ║   7.2    ║ ███████░░░ ⚠ High variance    ║
║ Conciseness (15%)     ║   6.9    ║ ███████░░░                    ║
║ Clarity (20%)         ║   8.0    ║ ████████░░                    ║
╠═══════════════════════╬══════════╬═══════════════════════════════╣
║ Weighted Composite    ║   7.93   ║ Above baseline (+0.3)         ║
╠═══════════════════════╩══════════╩═══════════════════════════════╣
║ CALIBRATION NOTES                                                  ║
║ • claude-opus-4.5 scored 0.4 below peer average (harsh reviewer)   ║
║ • No position bias detected (variance: 0.12)                       ║
║ • Length correlation: r=0.18 (acceptable)                          ║
╠══════════════════════════════════════════════════════════════════╣
║ ALERTS: None                                                       ║
╚══════════════════════════════════════════════════════════════════╝
```
## Appendix B: Environment Variables
| Variable | Description | Default |
|---|---|---|
| `LLM_COUNCIL_QUALITY_METRICS` | Enable quality metrics in responses | `true` |
| `LLM_COUNCIL_QUALITY_TIER` | Quality tier (`core`, `standard`, `enterprise`) | `core` |
| `LLM_COUNCIL_EMBEDDING_MODEL` | Model for DDI/SAS embeddings | `text-embedding-3-small` |
| `LLM_COUNCIL_EXTERNAL_VALIDATION` | Enable DeepEval/RAGAS validation | `false` |
| `LLM_COUNCIL_GOLDEN_DATASET_PATH` | Path to golden dataset YAML | - |
| `LLM_COUNCIL_QUALITY_WEBHOOK_URL` | Monitoring webhook endpoint | - |
| `LLM_COUNCIL_QUALITY_RETENTION_DAYS` | Historical data retention (days) | 7 (OSS) |
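A minimal sketch of loading these variables into a typed config; the defaults mirror the table and the dataclass itself is illustrative:

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualityConfig:
    metrics_enabled: bool
    tier: str
    embedding_model: str
    external_validation: bool
    golden_dataset_path: Optional[str]
    webhook_url: Optional[str]
    retention_days: int

def load_quality_config() -> QualityConfig:
    """Read quality settings from the environment, falling back to the documented defaults."""
    return QualityConfig(
        metrics_enabled=os.getenv("LLM_COUNCIL_QUALITY_METRICS", "true").lower() == "true",
        tier=os.getenv("LLM_COUNCIL_QUALITY_TIER", "core"),
        embedding_model=os.getenv("LLM_COUNCIL_EMBEDDING_MODEL", "text-embedding-3-small"),
        external_validation=os.getenv("LLM_COUNCIL_EXTERNAL_VALIDATION", "false").lower() == "true",
        golden_dataset_path=os.getenv("LLM_COUNCIL_GOLDEN_DATASET_PATH"),
        webhook_url=os.getenv("LLM_COUNCIL_QUALITY_WEBHOOK_URL"),
        retention_days=int(os.getenv("LLM_COUNCIL_QUALITY_RETENTION_DAYS", "7")),
    )
```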