ADR-030: Scoring Refinements¶
Status: ACCEPTED (Revised per Council Review 2025-12-24)
Date: 2025-12-24
Decision Makers: Engineering, Architecture
Extends: ADR-026 (Dynamic Model Intelligence)
Council Review: Reasoning tier (gpt-5.2-pro, claude-opus-4.5, gemini-3-pro-preview, grok-4.1-fast)
Context¶
The current scoring implementation (ADR-026 Phase 1) has limitations identified during council review:
- Linear cost scoring fails for exponential price differences
- Quality tier floors need benchmark evidence
- No circuit breaker for failing models
Decision¶
1. Log-Ratio Cost Scoring (Council-Revised Formula)¶
Council Feedback: log(price + 1) is effectively linear for small values (< 0.1).
Problem Analysis:
# Original formula behavior for typical API prices:
log(0.001 + 1) = 0.000999... # Nearly linear
log(0.01 + 1) = 0.00995... # Still linear
log(0.1 + 1) = 0.0953... # Starting to curve
Solution: Log-Ratio Normalization
import math
def get_cost_score(price: float, reference_high: float = 0.015) -> float:
"""
Log-ratio scoring for exponential pricing differences.
Uses log(price/reference) which properly handles small values.
Council-recommended formula.
Args:
price: Cost per 1K tokens
reference_high: Reference "expensive" price (high-tier average)
Returns:
Score between 0.0 (expensive) and 1.0 (cheap/free)
"""
if price <= 0:
return 1.0 # Free models get perfect cost score
if reference_high <= 0:
return 0.5 # Invalid reference, neutral score
# Minimum price floor to avoid log(0)
MIN_PRICE = 0.0001 # $0.0001 per 1K tokens
effective_price = max(price, MIN_PRICE)
# Log-ratio: how many orders of magnitude from reference?
# log(price/ref) = log(price) - log(ref)
# Normalized to [0, 1] where cheaper = higher score
log_ratio = math.log10(effective_price / reference_high)
# Map log ratio to score:
# - price == reference_high → log_ratio = 0 → score = 0.5
# - price == reference_high / 10 → log_ratio = -1 → score = 0.75
# - price == reference_high * 10 → log_ratio = 1 → score = 0.25
score = 0.5 - (log_ratio * 0.25)
return max(0.0, min(1.0, score))
# Alternative: Exponential decay (also council-approved)
def get_cost_score_exponential(price: float, reference_high: float = 0.015) -> float:
"""
Exponential decay scoring.
score = exp(-price / reference_high)
Simpler formula, natural decay curve.
"""
if price <= 0:
return 1.0
decay_rate = 1.0 / reference_high
score = math.exp(-price * decay_rate)
return max(0.0, min(1.0, score))
Comparison Table (reference_high = $0.015):
| Price | Linear | log(price+1) | Log-Ratio | Exp Decay |
|---|---|---|---|---|
| $0.000 | 1.00 | 1.00 | 1.00 | 1.00 |
| $0.001 | 0.93 | 0.96 | 0.79 | 0.94 |
| $0.003 | 0.80 | 0.87 | 0.68 | 0.82 |
| $0.015 | 0.00 | 0.52 | 0.50 | 0.37 |
| $0.030 | -1.00* | 0.32 | 0.43 | 0.14 |
| $0.150 | -9.00* | 0.00 | 0.25 | 0.00 |
*Linear formula breaks for prices > reference
Rationale: Log-ratio properly reflects that the difference between $0.001 and $0.003 (3x) is as significant as between $0.010 and $0.030 (3x).
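As a sanity check on that claim, equal price ratios produce equal score deltas under the log-ratio formula. A quick sketch, using get_cost_score as defined above with the default $0.015 reference:
# Equal 3x price ratios yield the same score delta wherever they sit on the price scale.
delta_low = get_cost_score(0.001) - get_cost_score(0.003)    # ~0.119
delta_high = get_cost_score(0.010) - get_cost_score(0.030)   # ~0.119
assert abs(delta_low - delta_high) < 1e-9                    # both equal 0.25 * log10(3)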
2. Quality Tier Scores with Benchmark Evidence (Council Requirement)¶
Council Feedback: Quality tier floors must be justified with benchmark data.
# Updated with benchmark citations
QUALITY_TIER_SCORES = {
# FRONTIER: Top-tier models (GPT-4o, Claude Opus 4, Gemini Ultra)
# Benchmark: MMLU 87-90%, HumanEval 90%+
QualityTier.FRONTIER: 0.95,
# STANDARD: Strong models (GPT-4o-mini, Claude Sonnet 3.5, Gemini Pro)
# Benchmark: MMLU 80-86%, HumanEval 85-90%
# Justification: GPT-4o-mini matches GPT-4 (2023) on most tasks
QualityTier.STANDARD: 0.85, # +0.10 from original 0.75
# ECONOMY: Cost-optimized (GPT-3.5-turbo, Claude Haiku, Gemini Flash)
# Benchmark: MMLU 70-79%, HumanEval 70-85%
# Justification: Flash models now rival previous-gen standards
QualityTier.ECONOMY: 0.70, # +0.15 from original 0.55
# LOCAL: Self-hosted models (Llama, Mistral, Qwen)
# Benchmark: Varies widely (MMLU 55-80%, HumanEval 40-80%)
# Justification: the best LOCAL models match ECONOMY-tier quality
QualityTier.LOCAL: 0.50, # +0.10 from original 0.40
}
# Benchmark sources (per Council requirement)
QUALITY_TIER_BENCHMARK_SOURCES = {
QualityTier.FRONTIER: [
"https://openai.com/index/gpt-4o-system-card",
"https://www.anthropic.com/news/claude-3-5-sonnet",
"https://deepmind.google/technologies/gemini/ultra/",
],
QualityTier.STANDARD: [
"https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/",
"https://www.anthropic.com/news/claude-3-haiku",
],
QualityTier.ECONOMY: [
"https://openai.com/blog/chatgpt-turbo",
"https://deepmind.google/technologies/gemini/flash/",
],
QualityTier.LOCAL: [
"https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard",
],
}
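To keep these floors honest as models are re-tiered, a minimal sanity check can assert the intended ordering. A sketch, assuming the QualityTier enum and the QUALITY_TIER_SCORES dict shown above:
def validate_tier_scores(scores: dict) -> None:
    """Tier floors must stay in [0, 1] and strictly decrease from FRONTIER to LOCAL."""
    ordered = [
        QualityTier.FRONTIER,
        QualityTier.STANDARD,
        QualityTier.ECONOMY,
        QualityTier.LOCAL,
    ]
    values = [scores[tier] for tier in ordered]
    assert all(0.0 <= v <= 1.0 for v in values)
    assert all(a > b for a, b in zip(values, values[1:]))
validate_tier_scores(QUALITY_TIER_SCORES)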
3. Circuit Breaker with State Machine (Council-Revised)¶
Council Feedback:
- Lower threshold from 50% to 20-30%
- Add minimum request count before tripping
- Implement proper Closed → Open → Half-Open pattern
from enum import Enum, auto
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional, Deque
from collections import deque
import threading
class CircuitState(Enum):
"""Standard circuit breaker states."""
CLOSED = auto() # Normal operation, tracking failures
OPEN = auto() # Tripped, rejecting requests
HALF_OPEN = auto() # Testing recovery, limited requests
@dataclass
class CircuitBreakerConfig:
"""Configuration for circuit breaker behavior."""
failure_threshold: float = 0.25 # 25% failure rate (Council: 20-30%)
min_requests: int = 5 # Minimum requests before evaluation
window_seconds: int = 600 # 10 minute sliding window
cooldown_seconds: int = 1800 # 30 minute cooldown when OPEN
half_open_max_requests: int = 3 # Probes before closing
half_open_success_threshold: float = 0.67 # 2/3 success to close
@dataclass
class CircuitBreaker:
"""
Per-model circuit breaker with proper state machine.
Implements standard pattern:
- CLOSED: Normal operation, counts failures in sliding window
- OPEN: Rejects all requests, waits for cooldown
- HALF-OPEN: Allows limited probes, closes on success or reopens on failure
"""
model_id: str
config: CircuitBreakerConfig = field(default_factory=CircuitBreakerConfig)
state: CircuitState = CircuitState.CLOSED
# Sliding window for CLOSED state
_request_times: Deque[tuple[datetime, bool]] = field(
default_factory=lambda: deque(maxlen=1000)
)
# Half-open tracking
_half_open_requests: int = 0
_half_open_successes: int = 0
# State transition timestamps
_opened_at: Optional[datetime] = None
_lock: threading.Lock = field(default_factory=threading.Lock)
def is_available(self) -> tuple[bool, Optional[str]]:
"""
Check if model is available for selection.
Returns:
(is_available, reason_if_unavailable)
"""
with self._lock:
now = datetime.utcnow()
if self.state == CircuitState.CLOSED:
return (True, None)
if self.state == CircuitState.OPEN:
# Check if cooldown has elapsed
if self._opened_at and now >= self._opened_at + timedelta(
seconds=self.config.cooldown_seconds
):
self._transition_to_half_open()
return (True, None) # Allow probe request
# total_seconds() keeps this correct even if cooldown_seconds exceeds one day
remaining = int((
self._opened_at + timedelta(seconds=self.config.cooldown_seconds) - now
).total_seconds()) if self._opened_at else 0
return (False, f"circuit_open (cooldown: {remaining}s)")
if self.state == CircuitState.HALF_OPEN:
if self._half_open_requests < self.config.half_open_max_requests:
return (True, None) # Allow probe
return (False, "circuit_half_open (probes exhausted)")
return (True, None) # Default: available
def record_result(self, success: bool) -> None:
"""Record request outcome and evaluate state transitions."""
with self._lock:
now = datetime.utcnow()
if self.state == CircuitState.CLOSED:
self._request_times.append((now, success))
self._prune_old_requests(now)
self._evaluate_closed_state()
elif self.state == CircuitState.HALF_OPEN:
self._half_open_requests += 1
if success:
self._half_open_successes += 1
self._evaluate_half_open_state()
def _prune_old_requests(self, now: datetime) -> None:
"""Remove requests outside the sliding window."""
cutoff = now - timedelta(seconds=self.config.window_seconds)
while self._request_times and self._request_times[0][0] < cutoff:
self._request_times.popleft()
def _evaluate_closed_state(self) -> None:
"""Check if circuit should trip to OPEN."""
total = len(self._request_times)
if total < self.config.min_requests:
return # Not enough data to evaluate
failures = sum(1 for _, success in self._request_times if not success)
failure_rate = failures / total
if failure_rate >= self.config.failure_threshold:
self._transition_to_open(failure_rate)
def _evaluate_half_open_state(self) -> None:
"""Check if circuit should close or reopen."""
if self._half_open_requests >= self.config.half_open_max_requests:
success_rate = self._half_open_successes / self._half_open_requests
if success_rate >= self.config.half_open_success_threshold:
self._transition_to_closed()
else:
self._transition_to_open(1.0 - success_rate)
def _transition_to_open(self, failure_rate: float) -> None:
"""Trip the circuit breaker."""
self.state = CircuitState.OPEN
self._opened_at = datetime.utcnow()
self._half_open_requests = 0
self._half_open_successes = 0
# Emit metric
_emit_circuit_event("circuit_opened", self.model_id, failure_rate)
def _transition_to_half_open(self) -> None:
"""Enter half-open state for recovery testing."""
self.state = CircuitState.HALF_OPEN
self._half_open_requests = 0
self._half_open_successes = 0
_emit_circuit_event("circuit_half_open", self.model_id, None)
def _transition_to_closed(self) -> None:
"""Close the circuit (recovery complete)."""
self.state = CircuitState.CLOSED
self._request_times.clear()
self._opened_at = None
_emit_circuit_event("circuit_closed", self.model_id, None)
# Global circuit breaker registry
_circuit_breakers: dict[str, CircuitBreaker] = {}
_registry_lock = threading.Lock()
def get_circuit_breaker(model_id: str) -> CircuitBreaker:
"""Get or create circuit breaker for model."""
with _registry_lock:
if model_id not in _circuit_breakers:
_circuit_breakers[model_id] = CircuitBreaker(model_id=model_id)
return _circuit_breakers[model_id]
def check_circuit_breaker(model_id: str) -> tuple[bool, Optional[str]]:
"""
Check if model is available (circuit not open).
Returns:
(is_available, unavailable_reason)
"""
breaker = get_circuit_breaker(model_id)
return breaker.is_available()
def _emit_circuit_event(event: str, model_id: str, failure_rate: Optional[float]) -> None:
"""Emit observability event for circuit state change."""
import logging
logger = logging.getLogger(__name__)
logger.warning(
f"Circuit breaker: {event}",
extra={
"event": event,
"model_id": model_id,
"failure_rate": failure_rate,
}
)
# Metrics export: emit_layer_event() automatically notifies subscribed
# MetricsAdapters (see observability/metrics_adapter.py)
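For illustration, a usage sketch of the registry functions above. The candidate filtering and the call_model provider call are assumptions here; the actual selection-pipeline integration is tracked separately (Issue #142):
def select_available_models(candidate_ids: list[str]) -> list[str]:
    """Drop candidates whose circuit is OPEN (or out of HALF_OPEN probes)."""
    available = []
    for model_id in candidate_ids:
        ok, reason = check_circuit_breaker(model_id)
        if ok:
            available.append(model_id)
        # else: reason explains the block, e.g. "circuit_open (cooldown: 1234s)"
    return available
def invoke_with_breaker(model_id: str, prompt: str) -> str:
    """Route a request and feed the outcome back into the model's breaker."""
    breaker = get_circuit_breaker(model_id)
    try:
        response = call_model(model_id, prompt)  # hypothetical provider call
    except Exception:
        breaker.record_result(success=False)
        raise
    breaker.record_result(success=True)
    return response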
Configuration¶
council:
model_intelligence:
scoring:
# Cost scoring algorithm
cost_scale: log_ratio # 'linear', 'log_ratio', or 'exponential'
cost_reference_high: 0.015 # Reference expensive price
# Quality tier scores (with benchmark justification)
quality_tier_scores:
frontier: 0.95 # MMLU 87-90%
standard: 0.85 # MMLU 80-86%
economy: 0.70 # MMLU 70-79%
local: 0.50 # MMLU 55-80%
circuit_breaker:
enabled: true
failure_threshold: 0.25 # 25% (Council: 20-30%)
min_requests: 5 # Minimum before evaluation
window_seconds: 600 # 10 minute window
cooldown_seconds: 1800 # 30 minute cooldown
half_open_max_requests: 3 # Probes before closing
half_open_success_threshold: 0.67
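A sketch of mapping this YAML onto the CircuitBreakerConfig dataclass from this ADR. The file path, loader function, and use of PyYAML are assumptions; only the key names and defaults come from the block above:
import yaml  # PyYAML, assumed available
def build_circuit_breaker_config(path: str = "council.yaml") -> CircuitBreakerConfig:
    """Read council.model_intelligence.circuit_breaker into CircuitBreakerConfig."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    cb = raw["council"]["model_intelligence"]["circuit_breaker"]
    return CircuitBreakerConfig(
        failure_threshold=cb.get("failure_threshold", 0.25),
        min_requests=cb.get("min_requests", 5),
        window_seconds=cb.get("window_seconds", 600),
        cooldown_seconds=cb.get("cooldown_seconds", 1800),
        half_open_max_requests=cb.get("half_open_max_requests", 3),
        half_open_success_threshold=cb.get("half_open_success_threshold", 0.67),
    )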
Environment Variables¶
| Variable | Type | Default | Purpose |
|---|---|---|---|
| LLM_COUNCIL_COST_SCALE | str | log_ratio | Cost scoring algorithm |
| LLM_COUNCIL_CIRCUIT_BREAKER | bool | true | Enable circuit breaker |
| LLM_COUNCIL_CIRCUIT_THRESHOLD | float | 0.25 | Failure rate threshold |
| LLM_COUNCIL_CIRCUIT_MIN_REQUESTS | int | 5 | Min requests before trip |
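A sketch of reading these variables at startup; the helper name and boolean parsing are assumptions, while the variable names and defaults come from the table above:
import os
def load_env_overrides() -> dict:
    """Collect scoring and circuit-breaker overrides from the environment."""
    return {
        "cost_scale": os.getenv("LLM_COUNCIL_COST_SCALE", "log_ratio"),
        "circuit_breaker_enabled":
            os.getenv("LLM_COUNCIL_CIRCUIT_BREAKER", "true").lower() in ("1", "true", "yes"),
        "failure_threshold": float(os.getenv("LLM_COUNCIL_CIRCUIT_THRESHOLD", "0.25")),
        "min_requests": int(os.getenv("LLM_COUNCIL_CIRCUIT_MIN_REQUESTS", "5")),
    }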
Observability (Council Requirement)¶
# Metrics to emit
scoring.cost_score{model_id, algorithm}
scoring.quality_score{model_id, tier}
circuit.state_change{model_id, from_state, to_state, failure_rate}
circuit.request_blocked{model_id, state, cooldown_remaining}
circuit.probe_result{model_id, success}
# Structured logging
{
"event": "circuit_state_change",
"model_id": "openai/gpt-4o",
"from_state": "CLOSED",
"to_state": "OPEN",
"failure_rate": 0.28,
"requests_in_window": 25,
"cooldown_seconds": 1800
}
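The structured log above carries from_state and to_state, which the minimal _emit_circuit_event shown earlier does not. A sketch of a richer emitter matching that payload; the field names are taken from the example, the helper itself is an assumption:
import logging
from typing import Optional
logger = logging.getLogger(__name__)
def emit_circuit_state_change(
    model_id: str,
    from_state: CircuitState,  # assumes CircuitState from the code above
    to_state: CircuitState,
    failure_rate: Optional[float],
    requests_in_window: int,
    cooldown_seconds: int,
) -> None:
    """Emit the circuit.state_change event with the full structured payload."""
    logger.warning(
        "circuit_state_change",
        extra={
            "event": "circuit_state_change",
            "model_id": model_id,
            "from_state": from_state.name,
            "to_state": to_state.name,
            "failure_rate": failure_rate,
            "requests_in_window": requests_in_window,
            "cooldown_seconds": cooldown_seconds,
        },
    )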
Consequences¶
Positive¶
- Log-ratio accurately reflects order-of-magnitude price differences
- Quality scores backed by benchmark evidence
- Circuit breaker prevents cascading failures
- Proper state machine enables safe recovery testing
- Min-requests prevents false-positive tripping
Negative¶
- Log-ratio less intuitive than linear
- Quality benchmarks may become stale
- Circuit breaker adds latency (lock contention)
- Half-open probes may fail on unlucky requests
Risks & Mitigations¶
| Risk | Mitigation |
|---|---|
| Log scoring edge cases | MIN_PRICE floor, clamp to [0, 1] |
| Circuit breaker too sensitive | min_requests requirement, configurable threshold |
| Stale quality benchmarks | Document sources, quarterly review |
| Half-open probe bias | Multiple probes (3), high success threshold (67%) |
Testing Strategy¶
import pytest  # needed for pytest.approx below
class TestScoringRefinements:
def test_log_ratio_cost_scoring(self):
"""Log-ratio properly handles order-of-magnitude differences."""
assert get_cost_score(0.001, 0.015) > get_cost_score(0.003, 0.015)
assert get_cost_score(0.015, 0.015) == pytest.approx(0.5, abs=0.01)
def test_cost_score_free_models(self):
"""Free models get perfect cost score."""
assert get_cost_score(0.0, 0.015) == 1.0
def test_circuit_breaker_min_requests(self):
"""Circuit doesn't trip below min_requests."""
breaker = CircuitBreaker(model_id="test")
for _ in range(4): # Below min_requests=5
breaker.record_result(False)
assert breaker.state == CircuitState.CLOSED
def test_circuit_breaker_trips_at_threshold(self):
"""Circuit trips at failure threshold."""
breaker = CircuitBreaker(model_id="test")
for _ in range(3):
breaker.record_result(True)
for _ in range(2):
breaker.record_result(False) # 2/5 = 40% > 25%
assert breaker.state == CircuitState.OPEN
def test_circuit_breaker_half_open_recovery(self):
"""Half-open state allows recovery."""
breaker = CircuitBreaker(model_id="test")
breaker._transition_to_open(0.5)
breaker._transition_to_half_open()
for _ in range(3):
breaker.record_result(True) # 3/3 = 100% > 67%
assert breaker.state == CircuitState.CLOSED
def test_circuit_breaker_half_open_reopen(self):
"""Half-open reopens on continued failures."""
breaker = CircuitBreaker(model_id="test")
breaker._transition_to_half_open()
breaker.record_result(True)
breaker.record_result(False)
breaker.record_result(False) # 1/3 = 33% < 67%
assert breaker.state == CircuitState.OPEN
Implementation Plan¶
- [x] Implement log-ratio cost scoring function (Issue #138)
- [x] Update QUALITY_TIER_SCORES with benchmark sources (Issue #139)
- [x] Implement CircuitBreaker class with state machine (Issue #140)
- [x] Add circuit breaker registry (per-model) (Issue #141)
- [x] Integrate circuit breaker with selection pipeline (Issue #142)
- [x] Add metrics and structured logging (L4_CIRCUIT_BREAKER_OPEN/CLOSE events)
- [x] Make all parameters configurable via YAML (ScoringConfig, CircuitBreakerConfig)
- [x] Add comprehensive tests (126 new tests across 4 test files)