
ADR-010: Consensus Mechanism - Normalized Score Averaging

Status: Proposed (Revised)
Date: 2024-12-12
Deciders: LLM Council (Unanimous on revision)
Technical Story: Select the optimal ranking aggregation for 3-5 LLM reviewers

Context and Problem Statement

The council currently uses Normalized Borda Count to aggregate peer rankings:

  • Each reviewer ranks all responses (1st, 2nd, 3rd, ...)
  • Points are assigned linearly: rank r out of N earns (N - r)/(N - 1), so 1st = (N-1)/(N-1) = 1.0 and last = 0
  • Self-votes are excluded to prevent bias
  • The average Borda score determines the final ranking
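
For reference, a minimal sketch of this aggregation, assuming rankings arrive as a {reviewer: [models ordered best to worst]} mapping (the actual data shapes in the codebase may differ):

from collections import defaultdict
from typing import Dict, List

def normalized_borda(rankings: Dict[str, List[str]]) -> Dict[str, float]:
    """Average normalized Borda points, excluding self-votes.

    rankings: {reviewer_model: [candidate_model, ...] ordered best to worst}
    Returns {candidate_model: average points in [0, 1]}.
    """
    points = defaultdict(list)
    for reviewer, ordering in rankings.items():
        # Drop the reviewer's own response before assigning points
        ordering = [m for m in ordering if m != reviewer]
        n = len(ordering)
        if n < 2:
            continue
        for rank, model in enumerate(ordering, start=1):
            points[model].append((n - rank) / (n - 1))
    return {m: sum(p) / len(p) for m, p in points.items()}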

Critical insight: We also collect 1-10 scores from each reviewer, but currently discard that information by converting the scores to ranks.

The Real Problems (Not Theoretical Voting Issues)

| Problem | Description |
| --- | --- |
| LLM biases | Models prefer verbose responses, familiar styles |
| Score calibration | GPT scores harshly (avg 6), Claude generously (avg 8) |
| Small sample size | 3-5 voters means high statistical noise |
| Close decisions | Need to know when top responses are effectively tied |

Why Ranks Are Wrong

Converting scores to ranks deletes information:

Scenario A: Scores [10, 9.9, 2] → Ranks [1, 2, 3]
Scenario B: Scores [6, 3, 1]   → Ranks [1, 2, 3]

In Scenario A, the top two are effectively tied. In Scenario B, there's a clear winner. Rank-based methods (Borda, Schulze) treat these identically.

Decision Drivers

  • Simplicity: Minimize implementation and maintenance cost
  • Use available data: We already collect scores - use them
  • Handle calibration: Different LLMs score differently
  • Detect ties: Know when decisions are too close to call
  • Solve actual problems: LLM biases, not strategic voting

Considered Options

Option A: Copeland's Method

Count pairwise wins (how many other responses each beats head-to-head).
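
A bare-bones sketch of the pairwise-win count, assuming rankings are available as a {reviewer: [models ordered best to worst]} mapping (illustrative only):

from itertools import combinations
from typing import Dict, List

def copeland_scores(rankings: Dict[str, List[str]]) -> Dict[str, int]:
    """Count head-to-head wins: +1 per opponent a candidate beats on a majority of ballots."""
    candidates = sorted({m for order in rankings.values() for m in order})
    wins = {m: 0 for m in candidates}
    for a, b in combinations(candidates, 2):
        # Majority preference between a and b across reviewers that ranked both
        a_over_b = sum(
            1 for order in rankings.values()
            if a in order and b in order and order.index(a) < order.index(b)
        )
        b_over_a = sum(
            1 for order in rankings.values()
            if a in order and b in order and order.index(b) < order.index(a)
        )
        if a_over_b > b_over_a:
            wins[a] += 1
        elif b_over_a > a_over_b:
            wins[b] += 1
    return wins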

Pros:

  • Simple: "Response A beat 7 of 9 competitors head-to-head"
  • Low complexity: O(N²R) where R = reviewers

Cons:

  • Collapses margin information (5-4 win = 9-0 win)
  • Frequently produces ties with few voters
  • Worse than Borda for close decisions

Verdict: Good as tiebreaker, not primary mechanism.

Option B: Schulze Method (Beatpath)

Build pairwise preference graph, find strongest paths via Floyd-Warshall.
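
A bare-bones sketch of the beatpath computation (pairwise preference counts, then a Floyd-Warshall-style widest-path pass), under the same assumed input shape; it omits tie handling:

from typing import Dict, List

def schulze_ranking(rankings: Dict[str, List[str]]) -> List[str]:
    """Order candidates by Schulze beatpath strength."""
    cands = sorted({m for order in rankings.values() for m in order})
    # d[a][b] = number of reviewers preferring a over b
    d = {a: {b: 0 for b in cands} for a in cands}
    for order in rankings.values():
        for i, a in enumerate(order):
            for b in order[i + 1:]:
                d[a][b] += 1
    # p[a][b] = strength of the strongest path from a to b
    p = {a: {b: (d[a][b] if d[a][b] > d[b][a] else 0) for b in cands} for a in cands}
    for k in cands:
        for i in cands:
            if i == k:
                continue
            for j in cands:
                if j in (i, k):
                    continue
                p[i][j] = max(p[i][j], min(p[i][k], p[k][j]))
    # A candidate ranks above another if its beatpath to it is stronger than the reverse
    return sorted(
        cands,
        key=lambda a: sum(p[a][b] > p[b][a] for b in cands if b != a),
        reverse=True,
    )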

Pros:

  • Condorcet-consistent (respects pairwise majority)
  • Clone-proof, monotonic, excellent strategic robustness
  • O(N³) complexity, trivial for N≤10 (~1000 ops, sub-millisecond)
  • Path strengths encode margin information

Cons:

  • Internals (strongest paths) harder to explain
  • Still purely ordinal (no score magnitude)

Verdict: Strong candidate for primary ranking.

Option C: Kemeny-Young

Find ranking that minimizes total disagreement (Kendall tau distance) with all reviewers.
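
Brute force over permutations is workable at council scale (a handful of responses); a sketch under the same assumed input shape:

from itertools import permutations
from typing import Dict, List, Tuple

def kemeny_ranking(rankings: Dict[str, List[str]]) -> Tuple[str, ...]:
    """Return the ordering with minimal total pairwise disagreement (Kendall tau distance)."""
    cands = sorted({m for order in rankings.values() for m in order})

    def disagreements(candidate_order: Tuple[str, ...]) -> int:
        pos = {m: i for i, m in enumerate(candidate_order)}
        total = 0
        for order in rankings.values():
            for i, a in enumerate(order):
                for b in order[i + 1:]:
                    # Reviewer prefers a over b; count a disagreement if the
                    # candidate ordering puts b ahead of a.
                    if pos[b] < pos[a]:
                        total += 1
        return total

    return min(permutations(cands), key=disagreements)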

Pros: - "Most consensus ranking" - very interpretable - Captures nuanced trade-offs in close calls - Hard to manipulate strategically

Cons: - NP-hard: O(N!) in brute force - N=10 → 3.6M permutations (feasible but requires optimization) - More implementation complexity than Schulze

Verdict: Theoretically excellent, but Schulze achieves similar results with less complexity.

Option D: Instant Runoff Voting (IRV)

Eliminate lowest first-preference candidate iteratively.

Pros:

  • Intuitive for users familiar with elections
  • Low complexity: O(N²R)

Cons:

  • Non-monotonic (improving rank can hurt you)
  • Ignores depth of rankings
  • Designed for large electorates; fails with 3-10 voters

Verdict: Not recommended for this use case.

Option E: Range/Score Voting

Use raw 1-10 scores instead of rankings.

Pros:

  • Captures intensity of preference
  • Can detect when all responses are poor
  • Very interpretable: "average score 8.3/10"

Cons:

  • Score calibration varies dramatically between models
  • Vulnerable to min/max strategic voting
  • Requires normalization (z-score per reviewer)

Verdict: Good supplementary signal, not standalone.

Option F: Bradley-Terry Model

Probabilistic model estimating "strength" from pairwise comparisons.
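
A sketch of the standard MM (Zermelo) fitting iteration; the pairwise win counts would have to be derived from the reviewers' rankings, and the function shown is illustrative rather than part of the codebase:

from typing import Dict, Tuple

def bradley_terry(wins: Dict[Tuple[str, str], int], iters: int = 100) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths from pairwise win counts via MM updates.

    wins[(a, b)] = number of reviewers who preferred a over b.
    Returns strengths normalized to sum to 1; P(a beats b) = s[a] / (s[a] + s[b]).
    """
    cands = sorted({m for pair in wins for m in pair})
    s = {m: 1.0 for m in cands}
    for _ in range(iters):
        new_s = {}
        for a in cands:
            w_a = sum(wins.get((a, b), 0) for b in cands if b != a)
            denom = sum(
                (wins.get((a, b), 0) + wins.get((b, a), 0)) / (s[a] + s[b])
                for b in cands if b != a
            )
            # Small floor so a candidate with zero wins doesn't zero out the model
            new_s[a] = max(w_a / denom, 1e-6) if denom > 0 else s[a]
        total = sum(new_s.values())
        s = {m: v / total for m, v in new_s.items()}
    return s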

Pros:

  • Outputs probabilities and confidence intervals
  • Quantifies "how close" the decision was
  • Handles missing comparisons naturally
  • O(N² × iterations), converges quickly

Cons:

  • Statistical interpretation may confuse users
  • Requires iterative fitting (MLE)

Verdict: Excellent for uncertainty quantification; use as secondary layer.

Option G: Weighted Borda

Same as Borda, but weight votes by reviewer reliability.
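
A sketch, weighting each reviewer's normalized Borda points by an assumed per-reviewer reliability weight (where those weights come from is left open, which is exactly the feedback-loop risk noted in the Cons below):

from collections import defaultdict
from typing import Dict, List

def weighted_borda(rankings: Dict[str, List[str]],
                   weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted average of normalized Borda points; missing weights default to 1.0."""
    totals = defaultdict(float)
    weight_sums = defaultdict(float)
    for reviewer, ordering in rankings.items():
        w = weights.get(reviewer, 1.0)
        ordering = [m for m in ordering if m != reviewer]  # exclude self-vote
        n = len(ordering)
        if n < 2:
            continue
        for rank, model in enumerate(ordering, start=1):
            totals[model] += w * (n - rank) / (n - 1)
            weight_sums[model] += w
    return {m: totals[m] / weight_sums[m] for m in totals}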

Pros:

  • Incremental improvement to current system
  • Can incorporate reviewer quality signals
  • Same O(NR) complexity

Cons:

  • Weight computation creates feedback loops
  • Risks entrenching biases if weights are wrong

Verdict: Easy upgrade path if reliability metrics available.

Option H: Bucket Consensus (Tiers)

Group responses into quality buckets (Excellent/Good/Poor) instead of strict ordering.
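
One possible sketch, thresholding mean normalized scores into tiers; the cutoff values are arbitrary placeholders, which is exactly the boundary problem noted in the Cons below:

from typing import Dict

def bucket_tiers(mean_scores: Dict[str, float],
                 excellent_cutoff: float = 0.5,
                 poor_cutoff: float = -0.5) -> Dict[str, str]:
    """Map each candidate's mean normalized score to a coarse quality tier."""
    tiers = {}
    for model, score in mean_scores.items():
        if score >= excellent_cutoff:
            tiers[model] = "excellent"
        elif score <= poor_cutoff:
            tiers[model] = "poor"
        else:
            tiers[model] = "good"
    return tiers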

Pros:

  • Reduces noise from artificial fine-grained distinctions
  • Natural for LLM outputs ("good enough" vs "bad")
  • Very interpretable: "3 excellent, 2 good, 1 poor"

Cons:

  • Loses within-tier ordering
  • Bucket boundaries are arbitrary

Verdict: Excellent for user-facing presentation layer.

Option I: Hybrid (Rank + Score)

Combine ordinal ranking with cardinal score magnitude.
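
A sketch of the blend, with an assumed tuning parameter α weighting the ordinal (Borda) component against the cardinal (normalized-score) component; both inputs are assumed to already be on comparable scales:

from typing import Dict

def hybrid_score(borda: Dict[str, float],
                 norm_scores: Dict[str, float],
                 alpha: float = 0.5) -> Dict[str, float]:
    """Blend rank- and score-based signals: alpha * Borda + (1 - alpha) * normalized score."""
    return {
        model: alpha * borda[model] + (1 - alpha) * norm_scores.get(model, 0.0)
        for model in borda
    }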

Pros:

  • Uses all available information
  • Distinguishes "strong 2nd" from "weak 2nd"

Cons:

  • Inherits weaknesses of both
  • Requires tuning α parameter

Verdict: Principled but adds complexity.

Decision Outcome

Chosen: Normalized Score Averaging

After critical re-evaluation, the council unanimously rejected the complex tiered architecture (Schulze + Bradley-Terry + Buckets) as "engineering theater": solving theoretical problems we don't have while ignoring our actual challenges.

Why Complex Voting Methods Are Wrong Here

| Method | What It Solves | Why It's Irrelevant |
| --- | --- | --- |
| Schulze | Strategic voting, clone attacks | LLMs don't strategize |
| Bradley-Terry | Uncertainty from limited pairwise data | We have full scores already |
| Condorcet methods | Rock-paper-scissors cycles | Quality is transitive in LLM evals |

With 3-5 voters, Schulze is more sensitive to noise than Borda, not less. A single outlier can flip pairwise majorities unpredictably.

Normalized Score Averaging with Confidence-Based Tie Detection

import numpy as np
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class AggregateResult:
    model: str
    mean_score: float      # Normalized mean (z-score scale)
    std_error: float       # Standard error of mean
    vote_count: int
    is_tied_with_next: bool = False

def aggregate_scores(
    scores_by_reviewer: Dict[str, Dict[str, float]],
    exclude_self_votes: bool = True
) -> List[AggregateResult]:
    """
    Aggregate reviewer scores using z-score normalization.

    Args:
        scores_by_reviewer: {reviewer_model: {candidate_model: score}}
        exclude_self_votes: Whether to exclude self-evaluations

    Returns:
        List of results sorted by mean score (best first)
    """
    # Step 1: Z-normalize per reviewer (fixes calibration bias)
    normalized = {}
    for reviewer, scores in scores_by_reviewer.items():
        # Exclude self-vote if configured
        if exclude_self_votes:
            scores = {k: v for k, v in scores.items() if k != reviewer}

        if not scores:
            continue

        values = list(scores.values())
        mean = np.mean(values)
        std = np.std(values)

        # Fallback if no variance (all same score)
        if std < 0.001:
            normalized[reviewer] = {k: 0.0 for k in scores}
        else:
            normalized[reviewer] = {
                k: (v - mean) / std for k, v in scores.items()
            }

    # Step 2: Aggregate normalized scores per candidate
    candidate_scores = defaultdict(list)
    for reviewer, scores in normalized.items():
        for candidate, score in scores.items():
            candidate_scores[candidate].append(score)

    # Step 3: Calculate mean, standard error, and rank
    results = []
    for candidate, scores in candidate_scores.items():
        n = len(scores)
        mean = np.mean(scores)
        std_error = np.std(scores, ddof=1) / np.sqrt(n) if n > 1 else 0.0  # SEM uses the sample (ddof=1) std

        results.append(AggregateResult(
            model=candidate,
            mean_score=round(mean, 3),
            std_error=round(std_error, 3),
            vote_count=n
        ))

    # Sort by mean score (highest first)
    results.sort(key=lambda x: -x.mean_score)

    # Step 4: Flag statistical ties (overlapping 95% confidence intervals)
    for i in range(len(results) - 1):
        curr, next_ = results[i], results[i + 1]
        # 95% CI uses ~1.96 * std_error
        curr_lower = curr.mean_score - 1.96 * curr.std_error
        next_upper = next_.mean_score + 1.96 * next_.std_error

        if curr_lower < next_upper:
            results[i].is_tied_with_next = True

    return results
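
For illustration, a hypothetical invocation with three reviewers (model names and raw scores are made up):

example_scores = {
    "gpt-4o":      {"gpt-4o": 9.0, "claude-opus": 7.0, "gemini-pro": 6.0},
    "claude-opus": {"gpt-4o": 8.0, "claude-opus": 9.0, "gemini-pro": 8.0},
    "gemini-pro":  {"gpt-4o": 7.0, "claude-opus": 5.0, "gemini-pro": 6.0},
}

for result in aggregate_scores(example_scores):
    flag = " (statistically tied with next)" if result.is_tied_with_next else ""
    print(f"{result.model}: {result.mean_score} +/- {result.std_error}{flag}")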

How This Solves Our Actual Problems

| Problem | Solution |
| --- | --- |
| Score calibration | Z-normalization: harsh reviewer (avg 6) and generous reviewer (avg 8) both center to 0 |
| LLM biases | Normalization spreads preferences; biases become noise that averages out |
| Small sample size | Standard error tells you when N is too small to decide |
| Close decisions | Overlapping confidence intervals explicitly flag ties |

Configuration

# config.py additions
DEFAULT_RANKING_METHOD = "normalized_scores"  # "borda", "normalized_scores"
DEFAULT_TIE_THRESHOLD = 1.96  # Z-score for 95% confidence interval
DEFAULT_FALLBACK_TO_BORDA = True  # Use Borda as tiebreaker

Example Output

{
  "rankings": [
    {"model": "gpt-4o", "mean_score": 0.82, "std_error": 0.15, "tied": false},
    {"model": "claude-opus", "mean_score": 0.45, "std_error": 0.22, "tied": true},
    {"model": "gemini-pro", "mean_score": 0.31, "std_error": 0.18, "tied": false}
  ],
  "interpretation": "gpt-4o is the clear winner. claude-opus and gemini-pro are statistically tied."
}

Migration Path

  1. Phase 1: Collect scores alongside ranks (already done)
  2. Phase 2: Implement normalized score averaging in parallel with Borda
  3. Phase 3: Compare results, validate on historical data
  4. Phase 4: Switch default to normalized scores
  5. Phase 5: Keep Borda as optional tiebreaker

What to Invest In Instead

The council recommends spending the "complexity budget" saved from not implementing Schulze on:

  1. Better prompts: Explicitly instruct reviewers to "penalize unnecessary verbosity"
  2. Bias audits: Track correlation between scores and response length (see the sketch after this list)
  3. Rubrics: Score on specific criteria (accuracy, conciseness, helpfulness) not holistic vibes
  4. Response order randomization: Mitigate positional bias
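
A minimal sketch of the bias audit mentioned above, correlating raw scores with response lengths (function and argument names are hypothetical):

import numpy as np
from typing import List

def length_score_correlation(scores: List[float], response_lengths: List[int]) -> float:
    """Pearson correlation between raw review scores and response lengths.

    A strongly positive value suggests reviewers are rewarding verbosity.
    """
    return float(np.corrcoef(scores, response_lengths)[0, 1])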

Consequences

Positive

  • Simpler: ~30 lines vs. hundreds for Schulze
  • Uses all data: Scores contain magnitude information ranks discard
  • Built-in confidence: Know when decisions are uncertain
  • Interpretable: "Model A scored 0.8σ above mean" is clear
  • Handles calibration: Z-scores fix harsh/generous reviewers automatically

Negative

  • Requires scores (we already have them)
  • Z-scores can be unstable with very low variance (handled by fallback)

Risks

  • If all reviewers give identical scores, z-normalization fails → fallback to Borda
  • Systematic biases (all LLMs prefer verbosity) still need prompt engineering to fix

Complexity Comparison

| Method | Implementation | Solves Calibration? | Detects Ties? | Uses Score Magnitude? |
| --- | --- | --- | --- | --- |
| Borda | Simple | No | Poorly | No |
| Schulze | Complex | No | No | No |
| Normalized Scores | Simple | Yes | Yes | Yes |

References