# The Accuracy Ceiling
A beautifully written hallucination scored 7.2/10. We capped it at 4.0. Here's why.
We learned this the hard way: a confident lie can outscore a hesitant truth.
When we first implemented multi-dimensional scoring, a response with fabricated citations, perfect grammar, and clear organization scored 7.2/10. It was completely wrong, but it sounded authoritative.
That's when we introduced the accuracy ceiling.
## The Problem: Weighted Averages Fail
Our rubric scores responses on five dimensions:
| Dimension | Weight |
|---|---|
| Accuracy | 35% |
| Relevance | 10% |
| Completeness | 20% |
| Conciseness | 15% |
| Clarity | 20% |
A well-written hallucination might score:
- Accuracy: 3/10 (fabricated facts)
- Relevance: 10/10 (directly addresses the question)
- Completeness: 9/10 (comprehensive, just wrong)
- Conciseness: 9/10 (efficiently written)
- Clarity: 10/10 (beautifully organized)
Weighted average: 0.35(3) + 0.10(10) + 0.20(9) + 0.15(9) + 0.20(10)
= 1.05 + 1.0 + 1.8 + 1.35 + 2.0
= 7.2/10
A 7.2 would rank this response in the top half. That's a problem.
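To make the failure concrete, here is a minimal sketch of the naive approach (the function name is illustrative, not from the LLM Council codebase):

```python
def naive_weighted_score(scores: dict) -> float:
    """Plain weighted average with no ceiling - the version that fails."""
    weights = {
        "accuracy": 0.35,
        "relevance": 0.10,
        "completeness": 0.20,
        "conciseness": 0.15,
        "clarity": 0.20,
    }
    return round(sum(scores[dim] * weights[dim] for dim in weights), 2)

# The confident hallucination sails through at 7.2
print(naive_weighted_score({
    "accuracy": 3, "relevance": 10, "completeness": 9,
    "conciseness": 9, "clarity": 10,
}))  # 7.2
```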
## The Solution: Accuracy as a Ceiling
Instead of treating accuracy as just another weighted dimension, we make it a ceiling on the overall score:
```python
from typing import Dict

def calculate_score_with_ceiling(scores: Dict[str, int]) -> float:
    """
    Calculate weighted score with accuracy acting as a ceiling.

    Raises KeyError if accuracy is missing - fail-safe by design.
    """
    weights = {
        "accuracy": 0.35,
        "relevance": 0.10,
        "completeness": 0.20,
        "conciseness": 0.15,
        "clarity": 0.20,
    }

    # Require accuracy - missing accuracy is an error, not a default of 10
    if "accuracy" not in scores:
        raise KeyError("accuracy score is required")
    accuracy = scores["accuracy"]

    # Calculate base weighted score (only for present dimensions)
    base_score = sum(
        scores[dim] * weights[dim]
        for dim in weights
        if dim in scores
    )

    # Apply accuracy ceiling
    if accuracy < 5:
        ceiling = 4.0  # Max 40%
    elif accuracy < 7:
        ceiling = 7.0  # Max 70%
    else:
        ceiling = 10.0  # No ceiling

    return round(min(base_score, ceiling), 2)
```
Now run our confident hallucination through the function:
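```python
print(calculate_score_with_ceiling({
    "accuracy": 3,       # fabricated facts
    "relevance": 10,
    "completeness": 9,
    "conciseness": 9,
    "clarity": 10,
}))  # 4.0 - base score is 7.2, capped by the accuracy ceiling
```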
The beautiful lie drops from 7.2 to 4.0. It cannot rank in the top half.
## The Threshold Logic
The ceiling thresholds map to behavioral definitions:
| Accuracy | Meaning | Ceiling | Rationale |
|---|---|---|---|
| < 5 | Significant errors or hallucinations | 4.0 | Fundamentally unreliable |
| 5-6 | Mixed accuracy, some errors | 7.0 | Useful but flawed |
| 7+ | Mostly or completely accurate | None | Quality differentiates |
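The middle tier is worth seeing in action. Using `calculate_score_with_ceiling` from above, a response with mixed accuracy but otherwise perfect scores is held to 7.0:

```python
print(calculate_score_with_ceiling({
    "accuracy": 6,       # mixed accuracy, some errors
    "relevance": 10,
    "completeness": 10,
    "conciseness": 10,
    "clarity": 10,
}))  # 7.0 - base score is 8.6, capped at the 7.0 ceiling
```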
These aren't arbitrary. They align with our scoring anchors:
- Accuracy 3-4, "Significant errors": major misconceptions or outdated information. The response cannot be trusted.
- Accuracy 5-6, "Mixed accuracy": several minor factual errors, but the main point may be valid.
- Accuracy 7-8, "Mostly accurate": a date slightly off, but the core message is correct.
If a reviewer scores accuracy below 5, they're saying "this response has significant errors." Such a response should not rank highly, regardless of how well it's written.
## Safety Gate: Hard Failures
Some content should never rank, period. We add a safety gate before rubric scoring:
```python
import re
from typing import List, Tuple

SAFETY_PATTERNS = {
    "dangerous_instructions": r"(how to|instructions for).*(bomb|explosive|weapon)",
    "malware_hacking": r"(hack into|exploit|bypass).*(account|system|security)",
    "pii_exposure": r"\b\d{3}-\d{2}-\d{4}\b",  # SSN pattern
}

EXCLUSION_CONTEXTS = [
    "to prevent this attack",
    "for educational purposes",
    "i cannot provide",
    "this is dangerous and",
    "security researchers",
    "defensive measures",
]

def is_defensive_context(response: str) -> bool:
    """Check if harmful pattern appears in defensive/educational context."""
    lower_response = response.lower()
    return any(ctx in lower_response for ctx in EXCLUSION_CONTEXTS)

def check_response_safety(response: str) -> Tuple[bool, List[str]]:
    """
    Check response for harmful content.

    Returns (passed, flagged_patterns).
    Educational/defensive content is allowed.
    """
    # Skip safety check for defensive content
    if is_defensive_context(response):
        return True, []

    flagged = []
    for pattern_name, pattern in SAFETY_PATTERNS.items():
        # re.DOTALL allows .* to match across newlines
        if re.search(pattern, response, re.IGNORECASE | re.DOTALL):
            flagged.append(pattern_name)

    return len(flagged) == 0, flagged

def apply_safety_gate(score: float, safety_result: Tuple[bool, List[str]]) -> float:
    """Cap score if safety check fails."""
    passed, _ = safety_result
    if not passed:
        return 0.0  # Hard cap
    return score
```
A bomb-making guide with perfect clarity and organization still scores 0.
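For instance, running a deliberately toy harmful response through the functions above:

```python
response = "Here are instructions for building a bomb: first..."
safety_result = check_response_safety(response)
print(safety_result)  # (False, ['dangerous_instructions'])

# Even a near-perfect rubric score is forced to zero
print(apply_safety_gate(9.8, safety_result))  # 0.0
```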
## Real Example
Query: "What's the capital of Australia?"
Response A:
Canberra is the capital of Australia. It was chosen as a compromise between Sydney and Melbourne in 1908.
Scores: Accuracy=10, Relevance=10, Completeness=9, Conciseness=10, Clarity=10
Weighted: 0.35(10) + 0.10(10) + 0.20(9) + 0.15(10) + 0.20(10)
= 3.5 + 1.0 + 1.8 + 1.5 + 2.0
= 9.8 → Final: 9.8 (no ceiling)
Response B:
Sydney is the capital of Australia and its largest city, known for the Opera House.
Scores: Accuracy=2, Relevance=10, Completeness=8, Conciseness=10, Clarity=10
Weighted: 0.35(2) + 0.10(10) + 0.20(8) + 0.15(10) + 0.20(10)
= 0.7 + 1.0 + 1.6 + 1.5 + 2.0
= 6.8 → Final: 4.0 (ceiling applied, accuracy < 5)
Response B is wrong (Sydney is not the capital). Without the ceiling, it would score 6.8. With the ceiling, it drops to 4.0—properly penalized for fundamental incorrectness.
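Both finals fall straight out of `calculate_score_with_ceiling`:

```python
response_a = {"accuracy": 10, "relevance": 10, "completeness": 9,
              "conciseness": 10, "clarity": 10}
response_b = {"accuracy": 2, "relevance": 10, "completeness": 8,
              "conciseness": 10, "clarity": 10}

print(calculate_score_with_ceiling(response_a))  # 9.8 - no ceiling
print(calculate_score_with_ceiling(response_b))  # 4.0 - capped from 6.8
```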
## The Five Dimensions
With the ceiling mechanism, our rubric becomes:
| Dimension | Weight | Role |
|---|---|---|
| Accuracy | 35% + ceiling | Primary gate |
| Relevance | 10% | Stays on topic |
| Completeness | 20% | Addresses all parts |
| Conciseness | 15% | Efficient |
| Clarity | 20% | Well-organized |
Why these weights?
- Accuracy dominates because wrong answers are worse than incomplete ones
- Relevance is low because off-topic but accurate is better than on-topic hallucination
- Conciseness is lower than completeness because missing information is worse than extra information
- Clarity matters because a correct but confusing answer isn't useful
## Scoring Anchors
To reduce inter-reviewer noise, we define what each score means:
Accuracy Anchors:
| Score | Definition | Example |
|---|---|---|
| 9-10 | Completely accurate, no errors | All facts verifiable |
| 7-8 | Mostly accurate, minor imprecisions | One date slightly off |
| 5-6 | Mixed accuracy, some errors | Several minor errors |
| 3-4 | Significant errors | Major misconceptions |
| 1-2 | Mostly incorrect or hallucinated | Fabricated facts |
Conciseness Anchors:
| Score | Definition | Example |
|---|---|---|
| 9-10 | Every word adds value | Dense, efficient |
| 7-8 | Mostly efficient, minor padding | Slight verbosity |
| 5-6 | Some unnecessary content | Redundant explanations |
| 3-4 | Significant padding | Filler phrases |
| 1-2 | Extremely verbose | Bloated, buries the answer |
These anchors give reviewers concrete targets. "Is this a 7 or an 8?" becomes answerable.
## Configuration
```bash
# Enable rubric scoring (off by default)
export LLM_COUNCIL_RUBRIC_SCORING=true

# Enable safety gate
export LLM_COUNCIL_SAFETY_GATE=true

# Custom weights (must sum to 1.0)
export LLM_COUNCIL_WEIGHT_ACCURACY=0.35
export LLM_COUNCIL_WEIGHT_RELEVANCE=0.10
export LLM_COUNCIL_WEIGHT_COMPLETENESS=0.20
export LLM_COUNCIL_WEIGHT_CONCISENESS=0.15
export LLM_COUNCIL_WEIGHT_CLARITY=0.20
```
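For illustration, here is one way those weight variables might be read and validated; this loader is a sketch, not the actual LLM Council configuration code:

```python
import os

# Defaults mirror the rubric weights above
DEFAULT_WEIGHTS = {
    "accuracy": 0.35,
    "relevance": 0.10,
    "completeness": 0.20,
    "conciseness": 0.15,
    "clarity": 0.20,
}

def load_weights_from_env() -> dict:
    """Illustrative loader: read LLM_COUNCIL_WEIGHT_* vars, enforce sum == 1.0."""
    weights = {
        dim: float(os.environ.get(f"LLM_COUNCIL_WEIGHT_{dim.upper()}", str(default)))
        for dim, default in DEFAULT_WEIGHTS.items()
    }
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return weights
```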
## Fallback Behavior
If a reviewer ignores the rubric and gives a single holistic score, we fall back to the original ranking-based aggregation. The system gracefully degrades.
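A minimal sketch of that dispatch, assuming a hypothetical `aggregate_rankings` as a stand-in for the original path:

```python
def aggregate_rankings(review: dict) -> float:
    """Stand-in for the original ranking-based aggregation (not shown here)."""
    ...

def score_review(review: dict) -> float:
    """Prefer rubric scoring; degrade gracefully to ranking aggregation."""
    scores = review.get("scores", {})
    if "accuracy" in scores:  # the rubric was followed, at least in part
        return calculate_score_with_ceiling(scores)
    # Single holistic score only - fall back to ranking-based aggregation
    return aggregate_rankings(review)
```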
## The Principle
A confident lie is worse than a hesitant truth.
The accuracy ceiling encodes this principle into the scoring algorithm. No amount of eloquence can overcome fundamental incorrectness.
This is post 6 of 7. Next: Shadow Mode & Model Auditions
LLM Council is open source: github.com/amiable-dev/llm-council