Detecting Evaluator Bias¶
GPT-4 scores harshly (avg 6.2). Claude scores generously (avg 7.8). Here's how to detect and account for it.
When multiple LLMs evaluate each other's work, they don't grade on the same curve. Some models are harsh critics; others give everyone gold stars. If you don't account for this, your "consensus" is just noise.
We built a bias auditing system to detect these patterns. Here's what we learned.
The Three Biases¶
1. Reviewer Calibration Bias¶
Different models have different scoring baselines:
GPT-4 scores: [6, 7, 5, 6] mean: 6.0
Claude scores: [8, 9, 8, 7] mean: 8.0
Gemini scores: [7, 7, 8, 7] mean: 7.25
If you average these raw scores, Claude's 4th-place candidate (score 7) ties with GPT's 1st-place candidate (score 7). That's not consensus—that's calibration noise.
Detection:
import statistics
from typing import Dict

def audit_reviewer_calibration(
    scores: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """
    Detect harsh and generous reviewers.

    IMPORTANT: This assumes all reviewers graded the same set of responses.
    If reviewers grade different subsets, this comparison is invalid.
    """
    calibration = {}
    for reviewer, reviewer_scores in scores.items():
        values = list(reviewer_scores.values())
        calibration[reviewer] = {
            "mean": statistics.mean(values),
            "std": statistics.stdev(values) if len(values) > 1 else 0,
        }

    # Find median baseline
    all_means = [c["mean"] for c in calibration.values()]
    median_mean = statistics.median(all_means)
    std_of_means = statistics.stdev(all_means) if len(all_means) > 2 else 1

    # Flag outliers (z-score relative to other reviewers)
    for reviewer, stats in calibration.items():
        z_score = (stats["mean"] - median_mean) / std_of_means if std_of_means > 0 else 0
        stats["z_score"] = round(z_score, 2)
        stats["classification"] = (
            "harsh" if z_score < -1 else
            "generous" if z_score > 1 else
            "neutral"
        )

    return calibration
Example output (with means 6.0, 7.25, 8.0 → median 7.25, std ≈ 1.0):
{
  "openai/gpt-4": {"mean": 6.0, "z_score": -1.25, "classification": "harsh"},
  "anthropic/claude": {"mean": 8.0, "z_score": 0.75, "classification": "neutral"},
  "google/gemini": {"mean": 7.25, "z_score": 0.0, "classification": "neutral"}
}
Note: a reviewer is only flagged when |z| exceeds 1.0. Claude at z = 0.75 sits within one standard deviation of the median, so despite the highest mean it's classified as neutral.
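For reference, here's a minimal call that produces the classifications above. The response IDs resp_a through resp_d are placeholders for anonymized candidate responses, not part of the library:

# Hypothetical input: scores keyed by reviewer, then by response ID
scores = {
    "openai/gpt-4": {"resp_a": 6, "resp_b": 7, "resp_c": 5, "resp_d": 6},
    "anthropic/claude": {"resp_a": 8, "resp_b": 9, "resp_c": 8, "resp_d": 7},
    "google/gemini": {"resp_a": 7, "resp_b": 7, "resp_c": 8, "resp_d": 7},
}

for reviewer, stats in audit_reviewer_calibration(scores).items():
    print(reviewer, stats["classification"])  # harsh, neutral, neutral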
2. Length-Score Correlation¶
Verbose responses often score higher, regardless of quality:
import statistics
import math
from typing import Dict, List, Tuple

def _pearson_correlation(x: List[float], y: List[float]) -> float:
    """Pure Python Pearson correlation coefficient."""
    n = len(x)
    if n < 3:
        return 0.0
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)
    sum_sq_y = sum((yi - mean_y) ** 2 for yi in y)
    denominator = math.sqrt(sum_sq_x * sum_sq_y)
    if denominator == 0:
        return 0.0
    return numerator / denominator

def calculate_length_correlation(
    responses: List[Dict],
    scores: Dict[str, Dict[str, float]]
) -> Tuple[float, str]:
    """Calculate Pearson correlation between length and score."""
    # Get word counts
    word_counts = {r["model"]: len(r["response"].split()) for r in responses}

    # Get average scores per response
    avg_scores = {}
    for model in word_counts:
        model_scores = [s[model] for s in scores.values() if model in s]
        avg_scores[model] = statistics.mean(model_scores) if model_scores else 0

    # Calculate correlation
    models = list(avg_scores.keys())
    x = [word_counts[m] for m in models]
    y = [avg_scores[m] for m in models]
    if len(models) < 3:
        return 0.0, "insufficient_data"

    r = _pearson_correlation(x, y)
    interpretation = (
        "strong_positive" if r > 0.7 else
        "moderate_positive" if r > 0.3 else
        "weak" if r > -0.3 else
        "moderate_negative" if r > -0.7 else
        "strong_negative"
    )
    return round(r, 3), interpretation
Healthy range: -0.2 to 0.2 (weak correlation)
Warning sign: r > 0.7 suggests reviewers are rewarding verbosity rather than quality.
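To show how the inputs fit together, here's a toy run; the responses, scores, and model/reviewer names are made up for illustration:

# Hypothetical responses and reviewer scores
responses = [
    {"model": "model_a", "response": "Short answer. " * 10},
    {"model": "model_b", "response": "Much longer answer. " * 60},
    {"model": "model_c", "response": "Medium answer. " * 30},
]
scores = {
    "reviewer_1": {"model_a": 6.0, "model_b": 8.5, "model_c": 7.0},
    "reviewer_2": {"model_a": 6.5, "model_b": 9.0, "model_c": 7.5},
}

r, interpretation = calculate_length_correlation(responses, scores)
print(r, interpretation)  # ~0.99, "strong_positive": scores track length in this toy data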
3. Position Bias¶
The first response shown often gets an unfair advantage. Detecting this requires tracking the display order for each review session, not a fixed model-to-position mapping:
import statistics
from collections import defaultdict
from typing import Dict, List, Tuple

def calculate_position_bias(
    session_data: List[Dict]
) -> Tuple[float, bool]:
    """
    Detect if presentation order affects scores.

    Each session_data entry must include:
    - display_order: List[str]  # Models in order shown to reviewer
    - scores: Dict[str, float]  # Reviewer's scores for each model
    """
    position_scores = defaultdict(list)
    for session in session_data:
        display_order = session["display_order"]
        scores = session["scores"]
        for position, model in enumerate(display_order):
            if model in scores:
                position_scores[position].append(scores[model])

    if len(position_scores) < 2:
        return 0.0, False

    # Calculate mean score per position
    position_means = [
        statistics.mean(scores)
        for scores in position_scores.values()
        if scores
    ]

    # Variance of position means indicates bias
    variance = statistics.variance(position_means) if len(position_means) > 1 else 0

    # High variance = position affects scores
    bias_detected = variance > 0.5
    return round(variance, 3), bias_detected
If Position 0 averages 7.5 and Position 3 averages 6.2 across many sessions, you have position bias.
Mitigation: Randomize response order for each reviewer. Track the randomization and analyze cross-session.
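A sketch of what that tracking can look like. The helper and its collect_scores callback are illustrative, not part of the library, but the session dict matches what calculate_position_bias expects:

import random

def build_review_session(responses, collect_scores):
    """Shuffle the candidates for one reviewer and record the order shown."""
    display_order = [r["model"] for r in responses]
    random.shuffle(display_order)           # fresh order for every review
    scores = collect_scores(display_order)  # reviewer's scores, keyed by model
    return {"display_order": display_order, "scores": scores}

# Later, across many recorded sessions:
# variance, biased = calculate_position_bias(session_log)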
The Statistical Honesty Problem¶
Here's the uncomfortable truth: with 4-5 models, single-session bias detection lacks statistical power.
| Metric | Data Points | Minimum for Significance |
|---|---|---|
| Length correlation | 4-5 pairs | 30+ pairs |
| Position bias | 1 ordering | 20+ orderings |
| Reviewer calibration | ~12 scores | 50+ scores |
A single session can detect extreme anomalies (r > 0.9), but cannot provide statistical proof of systematic bias. These are indicators, not evidence.
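To make the length-correlation row concrete, here's the 95% confidence interval for one session, using the same Fisher transform as the aggregation code below. With n = 5 length/score pairs, even an observed r = 0.7 is compatible with almost any true correlation (a worked example, not project output):

import math

r, n = 0.7, 5                          # one session: 5 length/score pairs
z = 0.5 * math.log((1 + r) / (1 - r))  # Fisher z-transform
se = 1 / math.sqrt(n - 3)              # standard error of z

def to_r(v: float) -> float:
    return (math.exp(2 * v) - 1) / (math.exp(2 * v) + 1)

print(round(to_r(z - 1.96 * se), 2), round(to_r(z + 1.96 * se), 2))  # about -0.48 to 0.98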
Cross-Session Aggregation¶
Real insights require aggregating across sessions:
import math
import statistics
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class BiasMetricRecord:
    session_id: str
    length_correlation: Optional[float]

@dataclass
class AggregatedBiasResult:
    length_correlation: float
    length_correlation_ci: Tuple[float, float]  # 95% confidence interval
    sample_size: int
    confidence_level: str  # "insufficient", "preliminary", "moderate", "high"

def run_aggregated_bias_audit(
    records: List[BiasMetricRecord],
    min_sessions: int = 10
) -> Optional[AggregatedBiasResult]:
    """Aggregate bias metrics across multiple sessions."""
    # Filter valid correlations (must be in range (-1, 1), exclusive)
    correlations = [
        r.length_correlation
        for r in records
        if r.length_correlation is not None and -1 < r.length_correlation < 1
    ]
    if len(correlations) < min_sessions:
        return AggregatedBiasResult(
            length_correlation=0,
            length_correlation_ci=(0, 0),
            sample_size=len(correlations),
            confidence_level="insufficient"
        )

    # Fisher z-transform for pooling correlations
    z_values = [0.5 * math.log((1 + r) / (1 - r)) for r in correlations]
    pooled_z = statistics.mean(z_values)
    pooled_r = (math.exp(2 * pooled_z) - 1) / (math.exp(2 * pooled_z) + 1)

    # 95% CI using Fisher z standard error
    # Note: For meta-analysis, SE = 1/sqrt(n-3) per correlation
    # With small per-session n, we use session count as proxy
    n = len(z_values)
    if n <= 3:
        # Not enough data for CI
        return AggregatedBiasResult(
            length_correlation=round(pooled_r, 3),
            length_correlation_ci=(-1, 1),
            sample_size=n,
            confidence_level="insufficient"
        )

    se = 1 / math.sqrt(n - 3)
    z_lower = pooled_z - 1.96 * se
    z_upper = pooled_z + 1.96 * se
    ci_lower = (math.exp(2 * z_lower) - 1) / (math.exp(2 * z_lower) + 1)
    ci_upper = (math.exp(2 * z_upper) - 1) / (math.exp(2 * z_upper) + 1)

    # Confidence level based on session count
    confidence = (
        "high" if n >= 50 else
        "moderate" if n >= 20 else
        "preliminary"
    )
    return AggregatedBiasResult(
        length_correlation=round(pooled_r, 3),
        length_correlation_ci=(round(ci_lower, 3), round(ci_upper, 3)),
        sample_size=n,
        confidence_level=confidence
    )
Key insight: We store bias metrics from every session to a JSONL file. After 50+ sessions, we can make statistically valid claims about reviewer behavior.
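A minimal sketch of that persistence loop, assuming one JSON object per line (the bias_metrics.jsonl path is illustrative):

import json
from pathlib import Path

LOG = Path("bias_metrics.jsonl")  # illustrative location

def append_record(record: BiasMetricRecord) -> None:
    """Append one session's metrics as a single JSON line."""
    with LOG.open("a") as f:
        f.write(json.dumps({
            "session_id": record.session_id,
            "length_correlation": record.length_correlation,
        }) + "\n")

def load_records() -> list:
    """Read every stored session back for aggregation."""
    return [
        BiasMetricRecord(d["session_id"], d["length_correlation"])
        for d in map(json.loads, LOG.read_text().splitlines())
    ]

# result = run_aggregated_bias_audit(load_records())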
Reviewer Profiles¶
Over time, you build profiles of each reviewer:
$ llm-council bias-report

=== Reviewer Profiles (50 sessions) ===

openai/gpt-4o
  Mean score: 6.2 (harsh, z=-1.3)
  Score variance: 1.8 (discriminating)
  Reliability: high (50 samples)

anthropic/claude-3-5-sonnet
  Mean score: 7.8 (generous, z=+1.1)
  Score variance: 0.9 (compressed range)
  Reliability: high (50 samples)

google/gemini-1.5-pro
  Mean score: 7.1 (neutral, z=+0.2)
  Score variance: 1.4 (balanced)
  Reliability: high (50 samples)
What this tells you:
- GPT-4 is a harsh grader (good for catching errors)
- Claude compresses scores toward the top (less discriminating)
- Gemini is your neutral baseline
What We Don't Do¶
We don't auto-adjust scores. This was a deliberate decision:
"If a reviewer is 'harsh,' they might simply be the domain expert holding the standard high. Automatically penalizing their scores is a UX minefield."
Instead, we:
1. Report bias indicators in metadata
2. Flag extreme anomalies
3. Let users decide how to respond
The bias audit is diagnostic, not corrective.
Privacy Considerations¶
We never store raw queries. Bias records contain:
{
  "schema_version": "1.1.0",
  "session_id": "uuid",
  "reviewer_id": "google/gemini-3-pro",
  "model_id": "anthropic/claude-opus-4.5",
  "position": 2,
  "response_length_chars": 1200,
  "score_value": 8.5,
  "query_hash": null
}
Query hashes are opt-in (for grouping similar queries) and use salted HMAC—you can't reverse them to get the original query.
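As a sketch of what the opt-in hashing can look like (the salt handling and digest truncation here are illustrative, not the project's exact scheme):

import hmac
import hashlib

def hash_query(query: str, salt: bytes) -> str:
    """One-way salted digest: groups identical queries without storing them."""
    return hmac.new(salt, query.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# Same query + same salt -> same hash; a different salt -> unlinkable hashes.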
Practical Takeaways¶
- Expect calibration differences. GPT-4 and Claude don't grade on the same scale. Use rankings, not raw scores (see the sketch after this list).
- Watch for length bias. If r > 0.7, your reviewers are rewarding verbosity. Add explicit instructions: "Penalize unnecessary wordiness."
- Randomize presentation order. Position bias is real. Shuffle responses before each review.
- Aggregate across sessions. Single-session metrics are indicators; 50+ sessions give you statistical confidence.
- Don't auto-correct. Report bias, and let humans decide what to do with it.
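Here's the rank-based aggregation the first takeaway points to, as a minimal sketch: convert each reviewer's scores to ranks on their own scale before combining, so a harsh reviewer's 7 and a generous reviewer's 7 stop being treated as equal.

import statistics
from typing import Dict, List

def average_ranks(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each candidate's rank across reviewers (1 = that reviewer's best)."""
    ranks: Dict[str, List[int]] = {}
    for reviewer_scores in scores.values():
        ordered = sorted(reviewer_scores, key=reviewer_scores.get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    # Note: exact ties get arbitrary adjacent ranks in this sketch.
    return {model: statistics.mean(r) for model, r in ranks.items()}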
This is post 5 of 7. Next: The Accuracy Ceiling
LLM Council is open source: github.com/amiable-dev/llm-council