ADR-010: Consensus Mechanism - Normalized Score Averaging¶
- Status: Proposed (Revised)
- Date: 2024-12-12
- Deciders: LLM Council (Unanimous on revision)
- Technical Story: Select the optimal ranking aggregation for 3-5 LLM reviewers
Context and Problem Statement¶
The council currently uses Normalized Borda Count to aggregate peer rankings:
- Each reviewer ranks all responses (1st, 2nd, 3rd, ...)
- Points are normalized: with N responses, 1st place receives (N-1)/(N-1) = 1.0 and last place receives 0
- Self-votes are excluded to prevent bias
- The average Borda score determines the final ranking
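For reference, a minimal sketch of this normalized Borda computation (a hypothetical helper; the input format is an assumption, not the production code):

```python
from collections import defaultdict
from typing import Dict, List

def normalized_borda(rankings_by_reviewer: Dict[str, List[str]]) -> Dict[str, float]:
    """Average normalized Borda points: 1st place -> 1.0, last place -> 0.0, self-votes dropped."""
    points = defaultdict(list)
    for reviewer, ranking in rankings_by_reviewer.items():
        ranking = [m for m in ranking if m != reviewer]   # exclude self-vote
        n = len(ranking)
        for position, model in enumerate(ranking):        # position 0 = 1st place
            points[model].append((n - 1 - position) / (n - 1) if n > 1 else 1.0)
    return {m: sum(p) / len(p) for m, p in points.items()}
```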
Critical insight: We also collect 1-10 scores from each reviewer, but currently discard this data by converting to ranks.
The Real Problems (Not Theoretical Voting Issues)¶
| Problem | Description |
|---|---|
| LLM biases | Models prefer verbose responses, familiar styles |
| Score calibration | GPT scores harshly (avg 6), Claude generously (avg 8) |
| Small sample size | 3-5 voters means high statistical noise |
| Close decisions | Need to know when top responses are effectively tied |
Why Ranks Are Wrong¶
Converting scores to ranks deletes information:
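For example (illustrative scores only, not actual council data):

```python
# Two hypothetical sets of raw 1-10 scores for three responses.
# Both collapse to the identical ranking 1st/2nd/3rd once converted to ranks.
scenario_a = {"response_1": 9.0, "response_2": 8.9, "response_3": 4.0}
scenario_b = {"response_1": 9.0, "response_2": 5.1, "response_3": 5.0}
```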
In Scenario A, the top two are effectively tied. In Scenario B, there's a clear winner. Rank-based methods (Borda, Schulze) treat these identically.
Decision Drivers¶
- Simplicity: Minimize implementation and maintenance cost
- Use available data: We already collect scores - use them
- Handle calibration: Different LLMs score differently
- Detect ties: Know when decisions are too close to call
- Solve actual problems: LLM biases, not strategic voting
Considered Options¶
Option A: Copeland's Method¶
Count pairwise wins (how many other responses each beats head-to-head).
Pros:
- Simple: "Response A beat 7 of 9 competitors head-to-head"
- Low complexity: O(N²R) where R = reviewers

Cons:
- Collapses margin information (a 5-4 win counts the same as a 9-0 win)
- Frequently produces ties with few voters
- Worse than Borda for close decisions
Verdict: Good as tiebreaker, not primary mechanism.
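A minimal sketch of Copeland scoring over per-reviewer rankings (the input format is assumed for illustration):

```python
from itertools import combinations
from typing import Dict, List

def copeland_scores(rankings_by_reviewer: Dict[str, List[str]]) -> Dict[str, int]:
    """Count head-to-head wins: +1 for each opponent a response beats by reviewer majority."""
    candidates = sorted({m for r in rankings_by_reviewer.values() for m in r})
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        a_pref = sum(
            1 for ranking in rankings_by_reviewer.values()
            if a in ranking and b in ranking and ranking.index(a) < ranking.index(b)
        )
        b_pref = sum(
            1 for ranking in rankings_by_reviewer.values()
            if a in ranking and b in ranking and ranking.index(b) < ranking.index(a)
        )
        if a_pref > b_pref:
            wins[a] += 1
        elif b_pref > a_pref:
            wins[b] += 1
    return wins
```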
Option B: Schulze Method (Beatpath)¶
Build pairwise preference graph, find strongest paths via Floyd-Warshall.
Pros:
- Condorcet-consistent (respects pairwise majority)
- Clone-proof, monotonic, excellent strategic robustness
- O(N³) complexity: trivial for N≤10 (~1000 ops, sub-millisecond)
- Path strengths encode margin information

Cons:
- Internals (strongest paths) harder to explain
- Still purely ordinal (no score magnitude)
Verdict: Strong candidate for primary ranking.
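For comparison, a compact sketch of the Schulze strongest-path computation over the same assumed input (not part of the chosen design):

```python
from typing import Dict, List

def schulze_ranking(rankings_by_reviewer: Dict[str, List[str]]) -> List[str]:
    """Order candidates by Schulze beatpath strength (standard Floyd-Warshall variant)."""
    cands = sorted({m for r in rankings_by_reviewer.values() for m in r})
    # d[a][b] = number of reviewers preferring a over b
    d = {a: {b: 0 for b in cands} for a in cands}
    for ranking in rankings_by_reviewer.values():
        for i, a in enumerate(ranking):
            for b in ranking[i + 1:]:
                d[a][b] += 1
    # p[a][b] = strength of the strongest path from a to b
    p = {a: {b: (d[a][b] if d[a][b] > d[b][a] else 0) for b in cands} for a in cands}
    for k in cands:
        for i in cands:
            if i == k:
                continue
            for j in cands:
                if j in (i, k):
                    continue
                p[i][j] = max(p[i][j], min(p[i][k], p[k][j]))
    # Rank by how many opponents each candidate beats via strongest paths
    beats = {a: sum(1 for b in cands if a != b and p[a][b] > p[b][a]) for a in cands}
    return sorted(cands, key=lambda a: -beats[a])
```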
Option C: Kemeny-Young¶
Find ranking that minimizes total disagreement (Kendall tau distance) with all reviewers.
Pros:
- "Most consensus ranking" - very interpretable
- Captures nuanced trade-offs in close calls
- Hard to manipulate strategically

Cons:
- NP-hard: O(N!) in brute force
- N=10 → 3.6M permutations (feasible but requires optimization)
- More implementation complexity than Schulze
Verdict: Theoretically excellent, but Schulze achieves similar results with less complexity.
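A brute-force Kemeny-Young sketch under the same assumed input, mainly to show why it does not scale:

```python
from itertools import combinations, permutations
from typing import Dict, List, Tuple

def kemeny_ranking(rankings_by_reviewer: Dict[str, List[str]]) -> Tuple[str, ...]:
    """Brute-force Kemeny-Young: the ordering minimizing total pairwise disagreement."""
    cands = sorted({m for r in rankings_by_reviewer.values() for m in r})

    def disagreement(order: Tuple[str, ...]) -> int:
        pos = {m: i for i, m in enumerate(order)}
        total = 0
        for ranking in rankings_by_reviewer.values():
            for a, b in combinations(ranking, 2):  # the reviewer prefers a over b
                if pos[a] > pos[b]:
                    total += 1
        return total

    # O(N!) search over all orderings - feasible only for small N
    return min(permutations(cands), key=disagreement)
```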
Option D: Instant Runoff Voting (IRV)¶
Eliminate lowest first-preference candidate iteratively.
Pros:
- Intuitive for users familiar with elections
- Low complexity: O(N²R)

Cons:
- Non-monotonic (improving rank can hurt you)
- Ignores depth of rankings
- Designed for large electorates; fails with 3-10 voters
Verdict: Not recommended for this use case.
Option E: Range/Score Voting¶
Use raw 1-10 scores instead of rankings.
Pros:
- Captures intensity of preference
- Can detect when all responses are poor
- Very interpretable: "average score 8.3/10"

Cons:
- Score calibration varies dramatically between models
- Vulnerable to min/max strategic voting
- Requires normalization (z-score per reviewer)
Verdict: Good supplementary signal, not standalone.
Option F: Bradley-Terry Model¶
Probabilistic model estimating "strength" from pairwise comparisons.
Pros:
- Outputs probabilities and confidence intervals
- Quantifies "how close" the decision was
- Handles missing comparisons naturally
- O(N² × iterations), converges quickly

Cons:
- Statistical interpretation may confuse users
- Requires iterative fitting (MLE)
Verdict: Excellent for uncertainty quantification; use as secondary layer.
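A sketch of a standard Bradley-Terry fit via minorization-maximization updates (the win_matrix input format is an assumption for illustration):

```python
import numpy as np

def bradley_terry(win_matrix: np.ndarray, iters: int = 200) -> np.ndarray:
    """Estimate strengths p_i from win_matrix[i, j] = number of times i beat j."""
    n = win_matrix.shape[0]
    p = np.ones(n)                                # start with equal strengths
    wins = win_matrix.sum(axis=1)                 # total wins per candidate
    games = win_matrix + win_matrix.T             # total comparisons per pair
    for _ in range(iters):
        denom = np.array([
            sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        p = wins / np.maximum(denom, 1e-12)       # MM update
        p /= p.sum()                              # normalize for identifiability
    return p
```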
Option G: Weighted Borda¶
Same as Borda, but weight votes by reviewer reliability.
Pros:
- Incremental improvement to current system
- Can incorporate reviewer quality signals
- Same O(NR) complexity

Cons:
- Weight computation creates feedback loops
- Risks entrenching biases if weights are wrong
Verdict: Easy upgrade path if reliability metrics available.
Option H: Bucket Consensus (Tiers)¶
Group responses into quality buckets (Excellent/Good/Poor) instead of strict ordering.
Pros:
- Reduces noise from artificial fine-grained distinctions
- Natural for LLM outputs ("good enough" vs "bad")
- Very interpretable: "3 excellent, 2 good, 1 poor"

Cons:
- Loses within-tier ordering
- Bucket boundaries are arbitrary
Verdict: Excellent for user-facing presentation layer.
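If a tier layer is ever added on top of the chosen mechanism, one possible mapping from normalized scores to buckets (cutoffs are arbitrary placeholders, as the cons note):

```python
def to_tier(z: float) -> str:
    """Map a normalized (z-score) mean to a coarse quality bucket (placeholder cutoffs)."""
    if z >= 0.5:
        return "excellent"
    if z >= -0.5:
        return "good"
    return "poor"
```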
Option I: Hybrid (Rank + Score)¶
Combine ordinal ranking with cardinal score magnitude.
Pros:
- Uses all available information
- Distinguishes "strong 2nd" from "weak 2nd"

Cons:
- Inherits weaknesses of both approaches
- Requires tuning a blending parameter (α) between rank and score components
Verdict: Principled but adds complexity.
Decision Outcome¶
Chosen: Normalized Score Averaging
After critical re-evaluation, the council unanimously rejected the complex tiered architecture (Schulze + Bradley-Terry + Buckets) as "engineering theater" - solving theoretical problems we don't have while ignoring our actual challenges.
Why Complex Voting Methods Are Wrong Here¶
| Method | What It Solves | Why It's Irrelevant |
|---|---|---|
| Schulze | Strategic voting, clone attacks | LLMs don't strategize |
| Bradley-Terry | Uncertainty from limited pairwise data | We have full scores already |
| Condorcet methods | Rock-paper-scissors cycles | Quality is transitive in LLM evals |
With 3-5 voters, Schulze is more sensitive to noise than Borda, not less. A single outlier can flip pairwise majorities unpredictably.
The Recommended Mechanism¶
Normalized Score Averaging with Confidence-Based Tie Detection
```python
import numpy as np
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AggregateResult:
    model: str
    mean_score: float        # Normalized mean (z-score scale)
    std_error: float         # Standard error of the mean
    vote_count: int
    is_tied_with_next: bool = False


def aggregate_scores(
    scores_by_reviewer: Dict[str, Dict[str, float]],
    exclude_self_votes: bool = True,
) -> List[AggregateResult]:
    """
    Aggregate reviewer scores using z-score normalization.

    Args:
        scores_by_reviewer: {reviewer_model: {candidate_model: score}}
        exclude_self_votes: Whether to exclude self-evaluations

    Returns:
        List of results sorted by mean score (best first)
    """
    # Step 1: Z-normalize per reviewer (fixes calibration bias)
    normalized = {}
    for reviewer, scores in scores_by_reviewer.items():
        # Exclude self-vote if configured
        if exclude_self_votes:
            scores = {k: v for k, v in scores.items() if k != reviewer}
        if not scores:
            continue
        values = list(scores.values())
        mean = np.mean(values)
        std = np.std(values)
        # Fallback if no variance (all same score)
        if std < 0.001:
            normalized[reviewer] = {k: 0.0 for k in scores}
        else:
            normalized[reviewer] = {
                k: (v - mean) / std for k, v in scores.items()
            }

    # Step 2: Aggregate normalized scores per candidate
    candidate_scores = defaultdict(list)
    for reviewer, scores in normalized.items():
        for candidate, score in scores.items():
            candidate_scores[candidate].append(score)

    # Step 3: Calculate mean, standard error, and rank
    results = []
    for candidate, scores in candidate_scores.items():
        n = len(scores)
        mean = np.mean(scores)
        std_error = np.std(scores) / np.sqrt(n) if n > 1 else 0.0
        results.append(AggregateResult(
            model=candidate,
            mean_score=round(float(mean), 3),
            std_error=round(float(std_error), 3),
            vote_count=n,
        ))

    # Sort by mean score (highest first)
    results.sort(key=lambda x: -x.mean_score)

    # Step 4: Flag statistical ties (overlapping 95% confidence intervals)
    for i in range(len(results) - 1):
        curr, next_ = results[i], results[i + 1]
        # 95% CI uses ~1.96 * std_error
        curr_lower = curr.mean_score - 1.96 * curr.std_error
        next_upper = next_.mean_score + 1.96 * next_.std_error
        if curr_lower < next_upper:
            results[i].is_tied_with_next = True

    return results
```
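A quick usage sketch with made-up scores (model names reused from the example output below; self-votes are present in the input but excluded by the function):

```python
# Hypothetical raw 1-10 scores; harsh and generous reviewers end up on a common z-score scale.
raw_scores = {
    "gpt-4o":      {"gpt-4o": 7.0, "claude-opus": 6.0, "gemini-pro": 5.5},
    "claude-opus": {"gpt-4o": 8.5, "claude-opus": 8.0, "gemini-pro": 7.5},
    "gemini-pro":  {"gpt-4o": 7.0, "claude-opus": 6.5, "gemini-pro": 7.0},
}
for result in aggregate_scores(raw_scores):
    tie = " (tied with next)" if result.is_tied_with_next else ""
    print(f"{result.model}: {result.mean_score:+.3f} ± {result.std_error:.3f}{tie}")
```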
How This Solves Our Actual Problems¶
| Problem | Solution |
|---|---|
| Score calibration | Z-normalization: harsh reviewer (avg 6) and generous reviewer (avg 8) both center to 0 |
| LLM biases | Normalization puts every reviewer on a common scale, so individual reviewer biases become noise that averages out across the council |
| Small sample size | Standard error tells you when N is too small to decide |
| Close decisions | Overlapping confidence intervals explicitly flag ties |
Configuration¶
```python
# config.py additions
DEFAULT_RANKING_METHOD = "normalized_scores"   # "borda", "normalized_scores"
DEFAULT_TIE_THRESHOLD = 1.96                   # Z-score for 95% confidence interval
DEFAULT_FALLBACK_TO_BORDA = True               # Use Borda as tiebreaker
```
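One possible way to wire these settings together (illustrative only; rank_responses and normalized_borda are assumed names from the sketches above, not existing code):

```python
def rank_responses(scores_by_reviewer, rankings_by_reviewer):
    """Hypothetical dispatcher: prefer normalized scores, fall back to Borda when degenerate."""
    if DEFAULT_RANKING_METHOD == "normalized_scores":
        results = aggregate_scores(scores_by_reviewer)
        # Degenerate case: every reviewer gave identical scores, so all z-scores collapsed to 0
        if results and any(r.mean_score != 0.0 for r in results):
            return [r.model for r in results]
    if DEFAULT_FALLBACK_TO_BORDA:
        borda = normalized_borda(rankings_by_reviewer)   # helper from the earlier sketch
        return sorted(borda, key=lambda m: -borda[m])
    return []
```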
Example Output¶
```json
{
  "rankings": [
    {"model": "gpt-4o", "mean_score": 0.82, "std_error": 0.15, "tied": false},
    {"model": "claude-opus", "mean_score": 0.45, "std_error": 0.22, "tied": true},
    {"model": "gemini-pro", "mean_score": 0.31, "std_error": 0.18, "tied": false}
  ],
  "interpretation": "gpt-4o is the clear winner. claude-opus and gemini-pro are statistically tied."
}
```
Migration Path¶
- Phase 1: Collect scores alongside ranks (already done)
- Phase 2: Implement normalized score averaging in parallel with Borda
- Phase 3: Compare results, validate on historical data (see the comparison sketch after this list)
- Phase 4: Switch default to normalized scores
- Phase 5: Keep Borda as optional tiebreaker
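For the Phase 3 comparison, one possible validation metric is the fraction of candidate pairs that the Borda and normalized-score rankings order the same way (a sketch; the helper name is an assumption, not a committed metric):

```python
from itertools import combinations
from typing import List

def pairwise_agreement(ranking_a: List[str], ranking_b: List[str]) -> float:
    """Fraction of candidate pairs that both rankings order identically."""
    pos_a = {m: i for i, m in enumerate(ranking_a)}
    pos_b = {m: i for i, m in enumerate(ranking_b)}
    pairs = list(combinations(pos_a, 2))
    agree = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
    )
    return agree / len(pairs) if pairs else 1.0
```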
What to Invest In Instead¶
The council recommends spending the "complexity budget" saved from not implementing Schulze on:
- Better prompts: Explicitly instruct reviewers to "penalize unnecessary verbosity"
- Bias audits: Track correlation between scores and response length
- Rubrics: Score on specific criteria (accuracy, conciseness, helpfulness) not holistic vibes
- Response order randomization: Mitigate positional bias
Consequences¶
Positive¶
- Simpler: ~30 lines vs. hundreds for Schulze
- Uses all data: Scores contain magnitude information that ranks discard
- Built-in confidence: Know when decisions are uncertain
- Interpretable: "Model A scored 0.8σ above mean" is clear
- Handles calibration: Z-scores fix harsh/generous reviewers automatically
Negative¶
- Requires scores (we already have them)
- Z-scores can be unstable with very low variance (handled by fallback)
Risks¶
- If all reviewers give identical scores, z-normalization fails → fallback to Borda
- Systematic biases (all LLMs prefer verbosity) still need prompt engineering to fix
Complexity Comparison¶
| Method | Implementation | Solves Calibration? | Detects Ties? | Uses Score Magnitude? |
|---|---|---|---|---|
| Borda | Simple | No | Poorly | No |
| Schulze | Complex | No | No | No |
| Normalized Scores | Simple | Yes | Yes | Yes |