# ADR-007: Council Scoring Methodology

- Status: Proposed
- Date: 2024-11-29
- Deciders: LLM Council
- Technical Story: Define the scoring algorithm for aggregating peer rankings into leaderboard positions

## Context and Problem Statement
The council collects rankings from multiple LLM reviewers. We need a robust methodology to:
- Aggregate individual rankings into a final score
- Handle ties, abstentions, and partial rankings
- Produce consistent leaderboard positions
- Be resistant to gaming/manipulation
- Scale across different council sizes
The current implementation uses Borda Count, but we need to formalize how edge cases are handled.
## Decision Drivers
- Robustness: Handle missing votes, ties, abstentions gracefully
- Fairness: No systematic bias toward any model
- Interpretability: Users should understand why Model X ranks higher
- Resistance to Gaming: Hard to manipulate rankings
- Consistency: Same inputs should always produce same outputs
## Considered Options

### Option A: Simple Average Rank
Average the position each model receives across all reviewers.
Formula: Score = Σ(position) / number of rankings received (lower is better)
Pros:
- Simple to understand
- Easy to compute

Cons:
- Sensitive to outliers
- Doesn't capture magnitude of preference
- Abstentions break the math
### Option B: Borda Count (Current)
Assign points based on position: 1st = (N-1) points, last = 0.
Formula: Score = Σ(N - 1 - position) / votes_received, where N is the number of candidates and position is 0-indexed (higher is better)
Pros:
- Well-established in voting theory
- Handles different council sizes naturally
- Rewards consensus

Cons:
- Sensitive to number of candidates
- Doesn't use raw scores
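For intuition, a toy worked example (the labels and council size are illustrative, not real council output): with N = 4 candidates, a single reviewer's ranking awards 3, 2, 1, 0 points from first to last.

```python
# Toy example: N = 4 candidates, one reviewer's ranking (labels are illustrative)
N = 4
ranking = ["B", "D", "A", "C"]  # ordered first to last

# Borda points: 1st = N-1 = 3, ..., last = 0
points = {label: (N - 1) - position for position, label in enumerate(ranking)}
print(points)  # {'B': 3, 'D': 2, 'A': 1, 'C': 0}
```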
### Option C: Borda Count + Score Weighting (Hybrid)
Combine Borda position with raw reviewer scores.
Formula: Score = α * normalized_borda + (1-α) * normalized_raw_score, where α ∈ [0, 1] weights the ranking term
Pros:
- Uses all available information
- Captures both ranking and magnitude
- More nuanced than pure Borda

Cons:
- Raw scores poorly calibrated across models
- More complex to explain
- Requires tuning the α parameter
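A minimal sketch of how the blend could work, assuming min-max normalization to [0, 1]; the normalization scheme and the default α below are illustrative, not a decided design:

```python
def hybrid_score(borda: float, raw: float,
                 borda_min: float, borda_max: float,
                 raw_min: float, raw_max: float,
                 alpha: float = 0.7) -> float:
    """Blend a min-max-normalized Borda score with a normalized raw score."""
    def normalize(value: float, lo: float, hi: float) -> float:
        return (value - lo) / (hi - lo) if hi > lo else 0.0

    return (alpha * normalize(borda, borda_min, borda_max)
            + (1 - alpha) * normalize(raw, raw_min, raw_max))
```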
### Option D: Elo Rating System
Treat each pairwise comparison as a "match" and compute Elo ratings.
Pros:
- Excellent for tracking skill over time
- Well-understood in competitive domains
- Handles transitive preferences well

Cons:
- Complex to implement correctly
- Requires match history (not just single queries)
- May be overkill for this use case
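For reference only (this option was not chosen), a minimal Elo update for a single pairwise comparison might look like the following; the K-factor and starting ratings are illustrative:

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Update two ratings after one pairwise comparison treated as a 'match'."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# e.g. two models starting at 1500; A wins the comparison
print(elo_update(1500.0, 1500.0, a_won=True))  # (1516.0, 1484.0)
```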
## Decision Outcome
Chosen option: Option B (Borda Count) with formalized edge case handling.
### Rationale
- Borda Count is proven and well-understood
- LLMs are better at relative ranking than absolute scoring
- Raw scores are poorly calibrated across different reviewers
- Simpler to explain to users
- Matches our current implementation (minimal changes)
## Formalized Algorithm

### Core Borda Calculation
```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BordaResult:
    borda_score: float
    vote_count: int
    win_count: int
    final_rank: int = 0


def calculate_borda_score(
    rankings: List[Dict],            # List of reviewer rankings
    label_to_model: Dict[str, str],  # Map labels to model names
    exclude_self_votes: bool = True,
) -> Dict[str, BordaResult]:
    """
    Calculate Borda scores for each model.

    Returns dict mapping model -> BordaResult with:
    - borda_score: float (average Borda points)
    - vote_count: int
    - win_count: int (times ranked #1)
    - final_rank: int
    """
    N = len(label_to_model)  # Number of candidates
    model_points = defaultdict(list)
    model_wins = defaultdict(int)

    for ranking in rankings:
        reviewer = ranking['model']
        parsed = ranking['parsed_ranking']

        # Skip abstentions
        if parsed.get('abstained'):
            continue

        ranking_list = parsed.get('ranking', [])
        for position, label in enumerate(ranking_list):
            if label not in label_to_model:
                continue
            author_model = label_to_model[label]

            # Exclude self-votes if configured
            if exclude_self_votes and reviewer == author_model:
                continue

            # Borda points: 1st = (N-1), last = 0
            points = (N - 1) - position
            model_points[author_model].append(points)
            if position == 0:
                model_wins[author_model] += 1

    # Calculate averages; models with no votes get score 0 and sink to last place
    results = {}
    for model in label_to_model.values():
        points = model_points.get(model, [])
        results[model] = BordaResult(
            borda_score=sum(points) / len(points) if points else 0.0,
            vote_count=len(points),
            win_count=model_wins[model],
        )

    # Sort: score desc, then win_count desc, then alphabetical (deterministic)
    sorted_models = sorted(
        results.items(),
        key=lambda x: (-x[1].borda_score, -x[1].win_count, x[0])
    )

    # Assign final ranks; ties in borda_score are broken by win_count,
    # alphabetical order only affects display order, not rank
    current_rank = 1
    prev_key = None
    for i, (model, result) in enumerate(sorted_models):
        key = (result.borda_score, result.win_count)
        if prev_key is not None and key != prev_key:
            current_rank = i + 1
        result.final_rank = current_rank
        prev_key = key

    return {model: result for model, result in sorted_models}
```
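A usage sketch with made-up data, showing the input shape the function expects (the models, labels, and values are hypothetical) and how abstentions and self-votes are dropped:

```python
# Illustrative input only; field names follow the function above
rankings = [
    {   # "gpt-x" ranks the labeled responses; "A" is its own answer (self-vote, excluded)
        "model": "gpt-x",
        "parsed_ranking": {"abstained": False, "ranking": ["A", "B", "C"]},
    },
    {   # this reviewer abstained, so its ranking is skipped entirely
        "model": "claude-y",
        "parsed_ranking": {"abstained": True},
    },
    {
        "model": "gemini-z",
        "parsed_ranking": {"abstained": False, "ranking": ["B", "A", "C"]},
    },
]
label_to_model = {"A": "gpt-x", "B": "claude-y", "C": "gemini-z"}

results = calculate_borda_score(rankings, label_to_model)
for model, r in results.items():
    print(model, r.borda_score, r.vote_count, r.win_count, r.final_rank)
# claude-y 1.5 2 1 1
# gpt-x    1.0 1 0 2
# gemini-z 0.0 1 0 3
```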
### Edge Case Handling
| Edge Case | Handling | Rationale |
|---|---|---|
| Self-vote | Exclude from aggregation | Prevents self-preference bias |
| Abstention | Skip entirely | Don't penalize models for reviewer refusals |
| Partial ranking | Use available positions only | Some reviewers may only rank top 3 |
| Tie in scores | Use win_count as tiebreaker | More #1 votes = higher rank |
| Tie in wins | Alphabetical by model name | Deterministic ordering |
| No votes received | Score = 0, rank = last | A meaningful score requires at least one vote |
| Single reviewer | Return rankings as-is | Mark as "low confidence" |
| Score/rank mismatch | Trust ranking, ignore scores | Ranking is more reliable |
### Score/Rank Mismatch Resolution (Issue #13)
When a reviewer's ranking order doesn't match their scores:
Resolution: Always use the explicit ranking order. Scores are supplementary.
```python
from typing import Dict, List


def resolve_mismatch(parsed_ranking: Dict) -> List[str]:
    """
    If ranking and scores conflict, trust ranking.
    Scores are only used for display/debugging.
    """
    ranking = parsed_ranking.get('ranking', [])
    scores = parsed_ranking.get('scores', {})

    if ranking:
        return ranking  # Always prefer explicit ranking

    # Fallback: derive ranking from scores (highest score first)
    if scores:
        sorted_scores = sorted(scores.items(), key=lambda x: -x[1])
        return [label for label, _ in sorted_scores]

    return []
```
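For illustration, with made-up parsed output: when the scores contradict the stated order, the explicit ranking wins, and the score-derived fallback only fires when no ranking is present.

```python
# Illustrative data: the scores disagree with the explicit ranking
parsed = {
    "ranking": ["B", "A", "C"],
    "scores": {"A": 9, "B": 7, "C": 4},
}
print(resolve_mismatch(parsed))  # ['B', 'A', 'C'] -- ranking wins

# No explicit ranking: fall back to score order
print(resolve_mismatch({"scores": {"A": 9, "C": 4}}))  # ['A', 'C']
```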
### Confidence Indicators
Add confidence metadata to help users interpret results:
```python
from dataclasses import dataclass


@dataclass
class LeaderboardEntry:
    model: str
    borda_score: float
    final_rank: int
    vote_count: int
    win_count: int
    confidence: str  # "high", "medium", "low"


def calculate_confidence(vote_count: int, total_possible: int) -> str:
    """Label result confidence by the share of possible votes actually received."""
    coverage = vote_count / total_possible if total_possible else 0.0
    if coverage >= 0.8:
        return "high"
    elif coverage >= 0.5:
        return "medium"
    else:
        return "low"
```
### Leaderboard Aggregation (Cross-Query)
For the public leaderboard, aggregate across all telemetry:
```sql
-- Aggregate Borda scores across all queries in time window
SELECT
    model,
    category,
    -- Use mean of means (each query contributes equally)
    AVG(borda_score) as aggregate_borda,
    SUM(vote_count) as total_votes,
    SUM(win_count) as total_wins,
    COUNT(DISTINCT er.event_id) as appearances
FROM event_rankings er
JOIN telemetry_events te ON er.event_id = te.event_id
WHERE te.timestamp > NOW() - INTERVAL '30 days'
GROUP BY model, category
ORDER BY aggregate_borda DESC;
```
## Consequences

### Positive
- Clear, deterministic algorithm
- Handles all edge cases explicitly
- Resistant to score manipulation (uses rankings)
- Consistent with existing implementation
### Negative
- Doesn't use raw scores (potentially useful signal)
- May not capture close decisions well
### Risks
- Borda Count can favor "safe" answers that don't offend any reviewer
- May need to revisit if gaming is detected