ADR-026: Dynamic Model Intelligence and Benchmark-Driven Selection¶
Status: APPROVED (Blocking Conditions Implemented)
Date: 2025-12-23
Decision Makers: Engineering, Architecture Council
Review: 2025-12-23 (Strategic + Technical Reviews)
Layer Assignment: Cross-cutting (L1-L4 integration)
Implementation: 2025-12-23 (Blocking Conditions 1-3)
⚠️ CRITICAL: Strategic Council Review - Vendor Dependency Risk¶
Verdict: CONDITIONAL APPROVAL¶
ADR-026 was NOT approved in its original form. The council identified critical vendor dependency risks that had to be addressed before implementation.
"We cannot build the core 'brain' of an open-source project on proprietary APIs that we do not control." — Council Consensus
The "Sovereign Orchestrator" Philosophy¶
The council unanimously adopts this architectural principle:
The open-source version of LLM Council must function as a complete, independent utility. External services (like OpenRouter or Not Diamond) must be treated as PLUGINS, not foundations.
If the internet is disconnected or if an API key is revoked, the software must still boot, run, and perform its core function (orchestrating LLMs), even if quality is degraded.
Blocking Conditions for Approval¶
| # | Condition | Status | Priority |
|---|---|---|---|
| 1 | Add ModelMetadataProvider abstraction interface | ✅ COMPLETED | BLOCKING |
| 2 | Implement StaticRegistryProvider (30+ models) | ✅ COMPLETED (31 models) | BLOCKING |
| 3 | Add offline mode (LLM_COUNCIL_OFFLINE=true) | ✅ COMPLETED | BLOCKING |
| 4 | Evaluate LiteLLM as unified abstraction | ✅ COMPLETED (as fallback) | High |
| 5 | Document degraded vs. enhanced feature matrix | 📋 Required | Medium |
Implementation Notes (2025-12-23)¶
The blocking conditions were implemented using TDD (Test-Driven Development) with 86 passing tests.
Module Structure: src/llm_council/metadata/
| File | Purpose |
|---|---|
| types.py | ModelInfo frozen dataclass, QualityTier enum, Modality enum |
| protocol.py | MetadataProvider @runtime_checkable Protocol |
| static_registry.py | StaticRegistryProvider with YAML + LiteLLM fallback |
| litellm_adapter.py | Lazy LiteLLM import for metadata extraction |
| offline.py | is_offline_mode() and check_offline_mode_startup() |
| __init__.py | get_provider() singleton factory, module exports |
Bundled Registry: src/llm_council/models/registry.yaml
31 models from 8 providers:
- OpenAI (7): gpt-4o, gpt-4o-mini, gpt-5.2-pro, o1, o1-preview, o1-mini, o3-mini
- Anthropic (5): claude-opus-4.5, claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus, claude-3-sonnet
- Google (5): gemini-3-pro-preview, gemini-2.5-pro, gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash
- xAI (2): grok-4, grok-4.1-fast
- DeepSeek (2): deepseek-r1, deepseek-chat
- Meta (2): llama-3.3-70b, llama-3.1-405b
- Mistral (2): mistral-large-2411, mistral-medium
- Ollama (6): llama3.2, mistral, qwen2.5:14b, codellama, phi3, deepseek-r1:8b
LiteLLM Integration: Used as fallback in the priority chain (local registry > LiteLLM > 4096 default). Lazy import prevents startup failures when LiteLLM is not installed.
GitHub Issues: #89-#92 (all completed)
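The priority chain above (local registry > LiteLLM > 4096 default) hinges on the lazy import inside litellm_adapter.py. A minimal sketch of that pattern, assuming LiteLLM exposes its bundled litellm.model_cost map when installed; the function name is illustrative, not the adapter's actual API:

from typing import Optional

def litellm_context_window(model_id: str) -> Optional[int]:
    """Return a context window from LiteLLM's bundled model map, or None if unavailable."""
    try:
        import litellm  # lazy import: a missing package must never break startup
    except ImportError:
        return None
    entry = litellm.model_cost.get(model_id)  # assumed mapping keyed by model id
    if not entry:
        return None
    # Entries typically expose max_input_tokens and/or max_tokens
    return entry.get("max_input_tokens") or entry.get("max_tokens")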
Strategic Decision: Option C+D (Hybrid + Abstraction)¶
| Feature | OSS (Self-Hosted) | Council Cloud (Commercial) |
|---|---|---|
| Model Metadata | Static library (LiteLLM) + Manual YAML config | Real-time dynamic sync via OpenRouter |
| Routing | Heuristic rules (latency/cost-based) | Intelligent ML-based (Not Diamond) |
| Integrations | BYOK (Bring Your Own Keys) | Managed Fleet (one bill, instant access) |
| Operations | localhost / Individual instance | Team governance, analytics, SSO |
Vendor Dependency Analysis¶
| Service | Current Role | Risk Level | Required Mitigation |
|---|---|---|---|
| OpenRouter | Metadata API, Gateway | HIGH | Static fallback + LiteLLM |
| Not Diamond | Model routing, Classification | MEDIUM | Heuristic fallback (exists) |
| Requesty | Alternative gateway | LOW | Already optional |
Affiliate/Reseller Model: NOT VIABLE¶
"Reliance on affiliate revenue or tight coupling creates Platform Risk. If OpenRouter releases 'OpenRouter Agents,' Council becomes obsolete instantly. Furthermore, council-cloud cannot withstand margin compression." — Council
Decision: Use external services to lower the User's barrier to entry, not as the backbone of the Product's value.
Required Abstraction Architecture¶
MetadataProvider Interface (MANDATORY)¶
from typing import Protocol, Optional, Dict, List
from dataclasses import dataclass
from pathlib import Path
@dataclass
class ModelInfo:
id: str
context_window: int
pricing: Dict[str, float] # {"prompt": 0.01, "completion": 0.03}
supported_parameters: List[str]
modalities: List[str]
quality_tier: str # "frontier" | "standard" | "economy"
class MetadataProvider(Protocol):
"""Abstract interface for model metadata sources."""
def get_model_info(self, model_id: str) -> Optional[ModelInfo]: ...
def get_context_window(self, model_id: str) -> int: ...
def get_pricing(self, model_id: str) -> Dict[str, float]: ...
def supports_reasoning(self, model_id: str) -> bool: ...
def list_available_models(self) -> List[str]: ...
class StaticRegistryProvider(MetadataProvider):
"""Default: Offline-safe provider using bundled registry + LiteLLM."""
def __init__(self, registry_path: Path = None):
self.registry = self._load_registry(registry_path)
self.litellm_data = self._load_litellm_model_map()
def get_context_window(self, model_id: str) -> int:
# 1. Check local config override
if model_id in self.registry:
return self.registry[model_id].context_window
# 2. Check LiteLLM library
if model_id in self.litellm_data:
return self.litellm_data[model_id].context_window
# 3. Safe default
return 4096
class DynamicMetadataProvider(MetadataProvider):
"""Optional: Real-time metadata from OpenRouter API."""
async def refresh(self) -> None:
"""Fetch latest model data - requires API key."""
...
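For orientation, a hedged usage sketch of the abstraction: it assumes the get_provider() factory exported from llm_council.metadata (see the module table above) returns whichever MetadataProvider implementation is active, static by default.

from llm_council.metadata import get_provider

provider = get_provider()  # StaticRegistryProvider by default; dynamic when enabled

# Core operations never require the network: unknown models fall back to the 4096 default.
window = provider.get_context_window("anthropic/claude-opus-4.5")
pricing = provider.get_pricing("openai/gpt-4o")
if provider.supports_reasoning("openai/o1"):
    print("reasoning parameters can be applied")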
Static Registry Schema (MANDATORY)¶
# models/registry.yaml - Shipped with OSS
version: "1.0"
updated: "2025-12-23"
models:
- id: "openai/gpt-4o"
context_window: 128000
pricing:
prompt: 0.0025
completion: 0.01
supported_parameters: ["temperature", "top_p", "tools"]
modalities: ["text", "vision"]
quality_tier: "frontier"
- id: "anthropic/claude-opus-4.5"
context_window: 200000
pricing:
prompt: 0.015
completion: 0.075
supported_parameters: ["temperature", "top_p", "tools", "reasoning"]
modalities: ["text", "vision"]
quality_tier: "frontier"
- id: "ollama/llama3.2"
provider: "ollama"
context_window: 128000
pricing:
prompt: 0
completion: 0
modalities: ["text"]
quality_tier: "local"
Offline Mode (MANDATORY)¶
When offline mode is enabled:
1. Use StaticRegistryProvider exclusively
2. Disable all external metadata/routing calls
3. Log INFO message about limited/stale metadata
4. All core council operations MUST succeed
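A minimal sketch of the offline gate implementing these requirements, assuming only the documented LLM_COUNCIL_OFFLINE flag; the shipped offline.py may differ in detail:

import logging
import os

logger = logging.getLogger(__name__)

def is_offline_mode() -> bool:
    """True when LLM_COUNCIL_OFFLINE requests fully offline operation."""
    return os.getenv("LLM_COUNCIL_OFFLINE", "").strip().lower() in {"1", "true", "yes"}

def check_offline_mode_startup() -> None:
    """Emit the INFO notice about limited/stale metadata (requirement 3 above)."""
    if is_offline_mode():
        logger.info(
            "Offline mode enabled: using the bundled static registry; "
            "model metadata may be limited or stale."
        )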
Technical Council Review Summary¶
Technical Review (2025-12-23) - Full Quorum¶
| Model | Verdict | Rank | Response Time |
|---|---|---|---|
| Claude Opus 4.5 | CONDITIONAL APPROVAL | #1 | 23.4s |
| Gemini 3 Pro | APPROVE | #2 | 31.4s |
| Grok 4 | APPROVE | #3 | 59.6s |
| GPT-4o | APPROVE | #4 | 9.8s |
"The council successfully identified Response C (Claude) as the superior review, noting its crucial detection of mathematical flaws (Borda normalization with variable pool sizes) and logical gaps (Cold Start) missed by other responses."
First Technical Review (2025-12-23, 3/4 models)¶
Approved Components:
- Dynamic metadata integration via OpenRouter API (pricing, availability, capability detection)
- Reasoning parameter optimization (reasoning_effort, budget_tokens)
- Integration points with existing L1-L4 architecture
Returned for Revision (Now Resolved):
- ~~Benchmark scraping strategy~~ → Deferred to Phase 4, use Internal Performance Tracker
- ~~Single scoring algorithm with "magic number" weights~~ → Tier-Specific Weighting Matrices
Key Technical Recommendations¶
| Recommendation | Status | Priority |
|---|---|---|
| Add Context Window as hard constraint | ✅ Incorporated | Critical |
| Replace single scoring with Tier-Specific Weighting | ✅ Incorporated | High |
| Defer benchmark scraping to optional Phase 4 | ✅ Incorporated | High |
| Add Anti-Herding logic | ✅ Incorporated | Medium |
| Implement Internal Performance Tracker | ✅ Incorporated | Medium |
| Cold Start handling for new models | 📋 Documented | Medium |
| Borda score normalization | 📋 Documented | Medium |
| Anti-Herding edge case (<3 models) | 📋 Documented | Low |
Council Consensus Points¶
- Context Window is a hard pass/fail constraint - must filter before scoring, not weight
- Tier-specific weighting is essential - quick tier prioritizes speed, reasoning tier prioritizes quality
- Benchmark scraping is high-risk - external APIs change frequently, creates maintenance nightmare
- Internal performance data is more valuable - track actual council session outcomes
- Phased approach required - decouple metadata (proven value) from benchmark intelligence (speculative)
- Cold Start needs exploration strategy - new models need "audition" mechanism (Phase 3)
- LiteLLM strongly recommended - use as library for metadata, not just proxy
Context¶
Problem Statement¶
The LLM Council's current model selection relies on static configuration that quickly becomes stale in the rapidly evolving model landscape. The final weeks of 2025 alone saw major releases from every frontier lab:
| Release Date | Model | Provider |
|---|---|---|
| Nov 17, 2025 | Grok 4.1 | xAI |
| Nov 18, 2025 | Gemini 3 Pro | Google |
| Nov 24, 2025 | Claude Opus 4.5 | Anthropic |
| Dec 11, 2025 | GPT-5.2 | OpenAI |
Our tier pools in config.py reference models that may be:
- Deprecated or renamed (model identifiers change)
- Outperformed by newer models (benchmarks shift monthly)
- Suboptimally configured (missing reasoning parameters)
- Unavailable or rate-limited (provider status changes)
Current Architecture Gaps¶
| Gap | Impact | Current State |
|---|---|---|
| Static tier pools | Stale model selection | Hardcoded in config.py |
| No benchmark integration | Suboptimal model-task matching | Manual updates |
| No model metadata | Missing capabilities detection | Assumed uniform |
| No reasoning parameters | Underutilized model capabilities | Default parameters only |
| No availability tracking | Failures on unavailable models | Reactive error handling |
Existing Foundation (ADRs 020, 022, 024)¶
The architecture already supports dynamic model selection:
| ADR | Component | Opportunity |
|---|---|---|
| ADR-020 | Not Diamond integration | Model routing API exists but uses static candidates |
| ADR-022 | Tier contracts | allowed_models field could be dynamically populated |
| ADR-024 | Layer architecture | L1 tier selection could query external data sources |
Decision¶
Implement a Model Intelligence Layer that provides real-time model metadata, benchmark rankings, and dynamic pool management to all routing layers.
Architecture Overview¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ MODEL INTELLIGENCE LAYER (New) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Model Registry │ │ Benchmark Index │ │ Availability │ │
│ │ (OpenRouter API)│ │ (Leaderboards) │ │ Monitor │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └────────────────────┴────────────────────┘ │
│ │ │
│ ┌───────────▼───────────┐ │
│ │ Model Selector API │ │
│ │ - get_tier_models() │ │
│ │ - get_best_for_task()│ │
│ │ - get_model_params() │ │
│ └───────────┬───────────┘ │
│ │ │
└────────────────────────────────┼─────────────────────────────────────────────┘
│
┌────────────────────────┼────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ L1: Tier │ │ L2: Query │ │ L4: Gateway │
│ Selection │ │ Triage │ │ Routing │
│ (ADR-022) │ │ (ADR-020) │ │ (ADR-023) │
└───────────────┘ └───────────────┘ └───────────────┘
Data Sources¶
1. OpenRouter Models API¶
Endpoint: GET https://openrouter.ai/api/v1/models
Provides real-time model metadata:
{
"id": "anthropic/claude-opus-4-5-20250514",
"name": "Claude Opus 4.5",
"pricing": {
"prompt": "0.000015",
"completion": "0.000075"
},
"context_length": 200000,
"architecture": {
"input_modalities": ["text", "image"],
"output_modalities": ["text"]
},
"supported_parameters": ["temperature", "top_p", "reasoning"],
"top_provider": {
"is_moderated": true
}
}
Key Fields for Selection:
- pricing - Cost optimization
- context_length - Long document handling
- supported_parameters - Reasoning mode detection
- input_modalities - Multimodal capability
2. Benchmark Leaderboards¶
| Source | Data | Update Frequency | API |
|---|---|---|---|
| LMArena | Elo ratings from 5M+ votes | Real-time | Public |
| LiveBench | Monthly contamination-free benchmarks | Monthly | Public |
| Artificial Analysis | Speed, cost, quality metrics | Weekly | Public |
| LLM Stats | Aggregated performance data | Daily | Public |
Benchmark Categories:
- Reasoning: GPQA Diamond, AIME 2025, ARC-AGI-2
- Coding: SWE-bench, LiveCodeBench, Terminal-Bench
- General: MMLU-Pro, Humanity's Last Exam
- Speed: Tokens/second, time-to-first-token
3. OpenRouter Rankings¶
Endpoint: GET https://openrouter.ai/rankings
Usage-based popularity metrics (tokens served, request count).
Model Parameter Optimization¶
Reasoning Mode Parameters¶
OpenRouter supports unified reasoning parameters:
# For reasoning-capable models (o1, o3, GPT-5, Claude with thinking)
request_params = {
"reasoning": {
"effort": "high", # "minimal"|"low"|"medium"|"high"|"xhigh"
"max_tokens": 32000, # Budget for reasoning tokens
"exclude": False, # Include reasoning in response
}
}
Effort Level Budget Calculation:
budget_tokens = max(min(max_tokens * effort_ratio, 32000), 1024)
effort_ratio:
xhigh: 0.95
high: 0.80
medium: 0.50
low: 0.20
minimal: 0.10
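The same formula expressed as code, using the effort ratios from the table above; the function name is illustrative:

EFFORT_RATIO = {"xhigh": 0.95, "high": 0.80, "medium": 0.50, "low": 0.20, "minimal": 0.10}

def reasoning_budget_tokens(max_tokens: int, effort: str) -> int:
    """budget_tokens = max(min(max_tokens * effort_ratio, 32000), 1024)."""
    ratio = EFFORT_RATIO[effort]
    return max(min(int(max_tokens * ratio), 32_000), 1024)

# Example: a 20,000-token completion budget at "high" effort yields 16,000 reasoning
# tokens; the 1,024 floor only matters for very small budgets, and the 32,000 cap
# only for very large ones.
assert reasoning_budget_tokens(20_000, "high") == 16_000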
Parameter Detection¶
def get_model_params(model_id: str, task_type: str) -> dict:
"""Get optimized parameters for model and task."""
model_info = model_registry.get(model_id)
params = {}
# Enable reasoning for supported models on complex tasks
if "reasoning" in model_info.supported_parameters:
if task_type in ["reasoning", "math", "coding"]:
params["reasoning"] = {
"effort": "high" if task_type == "reasoning" else "medium"
}
# Adjust temperature for task type
if task_type == "creative":
params["temperature"] = 0.9
elif task_type in ["coding", "math"]:
params["temperature"] = 0.2
return params
Dynamic Tier Pool Management¶
Tier Requirements Matrix¶
| Tier | Latency Budget | Cost Ceiling | Min Models | Required Capabilities |
|---|---|---|---|---|
| quick | P95 < 10s | < $0.001/req | 3 | Fast inference |
| balanced | P95 < 45s | < $0.01/req | 3-4 | Good reasoning |
| high | P95 < 120s | < $0.10/req | 4-5 | Full capability |
| reasoning | P95 < 300s | < $1.00/req | 3-4 | Extended thinking |
Dynamic Pool Selection Algorithm¶
Council Revision: Algorithm updated per council feedback to:
1. Add Context Window as hard pass/fail constraint
2. Replace global weights with Tier-Specific Weighting Matrices
3. Add Anti-Herding logic to prevent traffic concentration
@dataclass
class ModelScore:
    model_id: str
    benchmark_score: Optional[float]  # Normalized 0-100 (optional, from internal tracker)
    latency_p95: float                # Seconds
    cost_per_request: float           # USD
    availability: float               # 0-1
    diversity_score: float            # Provider diversity
    context_window: int               # Token limit (HARD CONSTRAINT)
    recent_traffic: float             # 0-1, for anti-herding
    strengths: List[str]              # Task domains used for the domain boost below
# COUNCIL RECOMMENDATION: Tier-Specific Weighting Matrices
# Replaces "magic number" global weights (0.4/0.2/0.2/0.1/0.1)
TIER_WEIGHTS = {
"quick": {
"latency": 0.45, # Speed is primary concern
"cost": 0.25, # Budget-conscious
"quality": 0.15, # Acceptable quality
"availability": 0.10,
"diversity": 0.05,
},
"balanced": {
"quality": 0.35, # Better quality
"latency": 0.25, # Still matters
"cost": 0.20, # Cost-aware
"availability": 0.10,
"diversity": 0.10,
},
"high": {
"quality": 0.50, # Quality is paramount
"availability": 0.20, # Must be reliable
"latency": 0.15, # Acceptable wait
"diversity": 0.10, # Multiple perspectives
"cost": 0.05, # Cost secondary
},
"reasoning": {
"quality": 0.60, # Best possible quality
"availability": 0.20, # Critical reliability
"diversity": 0.10, # Diverse reasoning
"latency": 0.05, # Patience for quality
"cost": 0.05, # Cost not a factor
},
}
def select_tier_models(
tier: str,
task_domain: Optional[str] = None,
count: int = 4,
required_context: Optional[int] = None, # NEW: context requirement
) -> List[str]:
"""Select optimal models for tier using multi-criteria scoring.
Council-Validated Algorithm:
1. Apply HARD CONSTRAINTS (pass/fail)
2. Score using TIER-SPECIFIC weights
3. Apply ANTI-HERDING penalty
4. Ensure PROVIDER DIVERSITY
"""
candidates = model_registry.get_available_models()
tier_config = TIER_REQUIREMENTS[tier]
weights = TIER_WEIGHTS[tier]
# ===== HARD CONSTRAINTS (Pass/Fail) =====
# Council Critical: Context window MUST be hard constraint, not weighted
eligible = [
m for m in candidates
if m.latency_p95 <= tier_config.latency_budget
and m.cost_per_request <= tier_config.cost_ceiling
and m.availability >= 0.95
# COUNCIL ADDITION: Context window as hard constraint
and (required_context is None or m.context_window >= required_context)
]
if not eligible:
logger.warning(f"No models meet hard constraints for tier={tier}")
return fallback_to_static_config(tier)
# ===== SOFT SCORING (Tier-Specific Weights) =====
scored = []
for model in eligible:
# Normalize scores to 0-1 range
latency_score = 1 - (model.latency_p95 / tier_config.latency_budget)
cost_score = 1 - (model.cost_per_request / tier_config.cost_ceiling)
quality_score = model.benchmark_score / 100 if model.benchmark_score else 0.5
score = (
quality_score * weights["quality"] +
latency_score * weights["latency"] +
cost_score * weights["cost"] +
model.availability * weights["availability"] +
model.diversity_score * weights["diversity"]
)
# Domain boost (task-specific enhancement)
if task_domain and task_domain in model.strengths:
score *= 1.15
# COUNCIL ADDITION: Anti-Herding Penalty
# Prevent traffic concentration on popular models
if model.recent_traffic > 0.3: # More than 30% of recent traffic
score *= (1 - (model.recent_traffic - 0.3) * 0.5) # Up to 35% penalty
scored.append((model.model_id, score))
# ===== DIVERSITY ENFORCEMENT =====
selected = select_with_diversity(scored, count, min_providers=2)
return selected
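select_with_diversity() is referenced above but not specified in this ADR. One plausible greedy sketch of the stated intent (highest scores first, with at least min_providers distinct providers):

from typing import List, Tuple

def select_with_diversity(
    scored: List[Tuple[str, float]],
    count: int,
    min_providers: int = 2,
) -> List[str]:
    """Greedy pick by score while guaranteeing a minimum number of distinct providers."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    selected: List[str] = []
    providers: set = set()
    # Pass 1: while the provider floor is unmet, skip models whose provider is already chosen.
    for model_id, _score in ranked:
        provider = model_id.split("/")[0]  # assumes "provider/model" ids as in registry.yaml
        if len(providers) < min_providers and provider in providers:
            continue
        selected.append(model_id)
        providers.add(provider)
        if len(selected) == count:
            return selected
    # Pass 2: fill any remaining slots purely by score.
    for model_id, _score in ranked:
        if len(selected) == count:
            break
        if model_id not in selected:
            selected.append(model_id)
    return selected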
Benchmark Score Normalization (DEFERRED - Phase 4)¶
Council Warning: This section describes external benchmark integration which is DEFERRED to Phase 4. Use Internal Performance Tracker (Phase 3) for quality scoring in initial releases.
# DEFERRED: Only implement after Internal Performance Tracker validates value
def normalize_benchmark_scores(model_id: str) -> float:
"""Aggregate benchmark scores into single quality metric.
WARNING: External benchmark scraping is high-maintenance.
Prefer Internal Performance Tracker for quality scoring.
Only implement if internal metrics prove insufficient.
"""
# Start with manual JSON snapshots, NOT automated scrapers
scores = load_manual_benchmark_snapshot(model_id)
if not scores:
return None # Fall back to internal metrics
# Weighted aggregation (emphasize reasoning and coding)
weights = {
"lmarena_elo": 0.3, # Human preference
"livebench": 0.2, # Contamination-free
"gpqa_diamond": 0.25, # Science reasoning
"swe_bench": 0.25, # Coding capability
}
normalized = sum(
normalize_to_100(scores[k]) * weights[k]
for k in weights
if scores.get(k) is not None
)
return normalized
Integration Points¶
1. Layer 1 Enhancement (ADR-022)¶
# tier_contract.py modification
def create_tier_contract(tier: str, task_domain: Optional[str] = None) -> TierContract:
"""Create tier contract with dynamically selected models."""
# Use Model Intelligence Layer instead of static config
models = model_intelligence.select_tier_models(
tier=tier,
task_domain=task_domain,
count=TIER_MODEL_COUNTS[tier],
)
# Get tier-appropriate aggregator
aggregator = model_intelligence.get_aggregator_for_tier(tier)
return TierContract(
tier=tier,
allowed_models=models,
aggregator_model=aggregator,
**get_tier_timeout(tier),
)
2. Layer 2 Enhancement (ADR-020)¶
# not_diamond.py modification
async def route_with_intelligence(
query: str,
tier_contract: TierContract,
) -> RouteResult:
"""Route using Not Diamond + Model Intelligence."""
# Get task-appropriate candidates from intelligence layer
candidates = model_intelligence.select_tier_models(
tier=tier_contract.tier,
task_domain=classify_domain(query),
)
# Get optimized parameters for each candidate
params = {
model: model_intelligence.get_model_params(model, query)
for model in candidates
}
# Route using Not Diamond (with enriched candidates)
if is_not_diamond_available():
result = await not_diamond.route(query, candidates)
return RouteResult(
model=result.model,
params=params[result.model],
confidence=result.confidence,
)
# Fallback to intelligence-based selection
return RouteResult(
model=candidates[0],
params=params[candidates[0]],
confidence=0.7,
)
3. Gateway Enhancement (ADR-023)¶
# gateway/types.py modification
@dataclass
class GatewayRequest:
model: str
messages: List[CanonicalMessage]
# New: Model-specific parameters from intelligence layer
model_params: Optional[Dict[str, Any]] = None
def apply_model_params(self) -> Dict[str, Any]:
"""Apply optimized parameters to request."""
request = self.to_openai_format()
if self.model_params:
request.update(self.model_params)
return request
Caching and Refresh Strategy¶
Cache Layers¶
| Data | Cache TTL | Refresh Trigger |
|---|---|---|
| Model registry | 1 hour | API call / manual |
| Benchmark scores | 24 hours | Daily cron |
| Availability status | 5 minutes | Health check failures |
| Latency metrics | 15 minutes | Rolling window |
Implementation¶
import os

import httpx
from cachetools import TTLCache  # TTL-bounded in-memory caches

OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY", "")

class ModelIntelligenceCache:
def __init__(self):
self.registry_cache = TTLCache(maxsize=500, ttl=3600)
self.benchmark_cache = TTLCache(maxsize=100, ttl=86400)
self.availability_cache = TTLCache(maxsize=500, ttl=300)
async def refresh_registry(self):
"""Fetch latest model data from OpenRouter."""
async with httpx.AsyncClient() as client:
response = await client.get(
"https://openrouter.ai/api/v1/models",
headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"}
)
models = response.json()["data"]
for model in models:
self.registry_cache[model["id"]] = ModelInfo.from_api(model)
async def refresh_benchmarks(self):
"""Fetch latest benchmark data from leaderboards."""
# LMArena Elo
lmarena = await fetch_lmarena_leaderboard()
# LiveBench
livebench = await fetch_livebench_scores()
# Artificial Analysis
aa = await fetch_artificial_analysis()
# Merge and normalize
for model_id in self.registry_cache:
self.benchmark_cache[model_id] = BenchmarkData(
lmarena_elo=lmarena.get(model_id),
livebench=livebench.get(model_id),
artificial_analysis=aa.get(model_id),
)
Configuration¶
Environment Variables¶
# Model Intelligence Layer
LLM_COUNCIL_MODEL_INTELLIGENCE=true|false # Enable dynamic selection
LLM_COUNCIL_BENCHMARK_SOURCE=lmarena|livebench|artificial_analysis|aggregate
LLM_COUNCIL_REFRESH_INTERVAL=3600 # Registry refresh interval (seconds)
# Fallback to static config if intelligence unavailable
LLM_COUNCIL_STATIC_FALLBACK=true|false
# Minimum benchmark score thresholds
LLM_COUNCIL_MIN_BENCHMARK_SCORE=60 # 0-100 normalized
LLM_COUNCIL_MIN_AVAILABILITY=0.95 # 0-1
# Provider diversity
LLM_COUNCIL_MIN_PROVIDERS=2 # Minimum distinct providers per tier
YAML Configuration¶
Council Revision: Updated to use tier-specific weights instead of global weights.
council:
model_intelligence:
enabled: true
sources:
openrouter_api: true
# DEFERRED: External benchmark sources (Phase 4)
# lmarena: false
# livebench: false
# artificial_analysis: false
internal_performance: true # Phase 3: Use council session outcomes
refresh:
registry_ttl: 3600
# benchmark_ttl: 86400 # DEFERRED
availability_ttl: 300
performance_ttl: 3600 # Internal performance cache
selection:
# COUNCIL REVISION: Tier-specific weights instead of global weights
tier_weights:
quick:
latency: 0.45
cost: 0.25
quality: 0.15
availability: 0.10
diversity: 0.05
balanced:
quality: 0.35
latency: 0.25
cost: 0.20
availability: 0.10
diversity: 0.10
high:
quality: 0.50
availability: 0.20
latency: 0.15
diversity: 0.10
cost: 0.05
reasoning:
quality: 0.60
availability: 0.20
diversity: 0.10
latency: 0.05
cost: 0.05
constraints:
min_providers: 2
min_availability: 0.95
max_cost_multiplier: 10 # vs cheapest option
# COUNCIL ADDITION: Anti-Herding
anti_herding:
enabled: true
traffic_threshold: 0.3 # 30% of recent traffic
max_penalty: 0.35 # Up to 35% score reduction
parameters:
auto_reasoning: true # Enable reasoning params when appropriate
reasoning_effort_by_tier:
quick: minimal
balanced: low
high: medium
reasoning: high
# COUNCIL ADDITION: Internal Performance Tracker
performance_tracker:
enabled: true
store_path: "${HOME}/.llm-council/performance.jsonl"
decay_days: 30
min_samples_preliminary: 10
min_samples_moderate: 30
min_samples_high: 100
Risks and Mitigations¶
Council-Identified Risks (High Priority)¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Benchmark scraper breakage | HIGH | HIGH | DEFER to Phase 4; use manual snapshots, not scrapers |
| Traffic herding | Medium | High | Anti-Herding penalty in selection algorithm |
| Context window violations | Medium | High | Hard constraint filter (not weighted) |
| Magic number weights | N/A | Medium | Tier-specific weight matrices |
Original Risks (Updated)¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| External API unavailability | Medium | High | Static fallback, aggressive caching |
| ~~Benchmark data staleness~~ | ~~Medium~~ | ~~Medium~~ | DEFERRED: Internal Performance Tracker instead |
| Model identifier changes | High | Medium | Fuzzy matching, alias tracking |
| Over-optimization | Medium | Medium | Diversity constraints, Anti-Herding logic |
| Cold start latency | Low | Medium | Pre-warm cache on startup |
| ~~Provider bias in benchmarks~~ | ~~Medium~~ | ~~Low~~ | DEFERRED: Internal metrics not susceptible |
| Internal metric bias | Medium | Medium | Minimum sample size requirements, decay weighting |
Success Metrics¶
Phase 1 Success Metrics (Model Metadata Layer)¶
| Metric | Target | Measurement |
|---|---|---|
| Registry availability | > 99% uptime | Track OpenRouter API failures |
| Context window violations | 0 errors | Monitor "context exceeded" errors |
| Static fallback activation | < 1% of requests | Track fallback usage |
| Model freshness | < 1 hour stale | Track registry refresh success |
Phase 2 Success Metrics (Reasoning Parameters)¶
| Metric | Target | Measurement |
|---|---|---|
| Parameter utilization | 100% for reasoning tier | Track reasoning param usage |
| Budget token efficiency | > 80% utilization | Compare budget vs actual tokens |
| Reasoning quality | No regression | Compare rubric scores before/after |
Phase 3 Success Metrics (Internal Performance Tracker)¶
| Metric | Target | Measurement |
|---|---|---|
| Session coverage | > 95% tracked | Count sessions with metrics |
| Internal metric correlation | > 0.6 with Borda | Validate internal scores vs outcomes |
| Model ranking stability | < 10% weekly variance | Track rank position changes |
| Selection improvement | > 5% higher Borda | Compare dynamic vs static selection |
Overall Success Metrics¶
| Metric | Target | Measurement |
|---|---|---|
| ~~Benchmark correlation~~ | ~~> 0.8~~ | DEFERRED: Internal metrics instead |
| Cost optimization | -15% vs static | Compare equivalent quality |
| Tier pool diversity | ≥ 2 providers | Track provider distribution |
| Anti-Herding effectiveness | No model > 40% traffic | Monitor traffic distribution |
Implementation Phases¶
Council Recommendation: Decouple proven value (metadata) from speculative value (benchmark intelligence). Implement in strict phases with validation gates.
Phase 1: Model Metadata Layer (v0.15.x) ✅ IMPLEMENTED¶
Goal: Dynamic model discovery and capability detection via OpenRouter API.
Status: ✅ COMPLETE (2025-12-23)
GitHub Issues: #93, #94, #95
Tests: 79 TDD tests (cache: 20, client: 20, provider: 24, selection: 35)
- [x] Implement OpenRouter API client (src/llm_council/metadata/openrouter_client.py)
- [x] Cache model metadata with TTL (1 hour registry, 5 min availability)
  - src/llm_council/metadata/cache.py: TTLCache, ModelIntelligenceCache
- [x] Add model capability detection (context window, reasoning support, modalities)
  - src/llm_council/metadata/dynamic_provider.py: DynamicMetadataProvider
- [x] Add Context Window as hard constraint in tier filtering
  - src/llm_council/metadata/selection.py: _meets_context_requirement()
- [x] Update get_tier_models() to use registry with static fallback
  - src/llm_council/metadata/selection.py: select_tier_models()
- [x] Implement Anti-Herding logic with traffic tracking
  - src/llm_council/metadata/selection.py: apply_anti_herding_penalty()
- [x] Add ModelIntelligenceConfig to unified_config.py
- [x] Add task_domain parameter to tier_contract.py
Environment Variables:
- LLM_COUNCIL_MODEL_INTELLIGENCE=true enables dynamic selection
- LLM_COUNCIL_OFFLINE=true forces static provider (takes precedence)
Validation Gate: ✅ PASSED
- OpenRouter API client with timeout/error handling
- Static fallback activates when API unavailable or offline mode enabled
- All 1206 tests pass
Phase 1 "Hollow" Fix (2025-12-24):
Initial Phase 1 implementation used regex pattern matching ("hollow" implementation). Fixed to use real metadata from providers (Issues #105-#108).
| Function | Before | After |
|---|---|---|
| _get_provider_safe() | N/A | Returns provider or None gracefully |
| _get_quality_score_from_metadata() | Regex patterns | Real QualityTier lookup |
| _get_cost_score_from_metadata() | Regex patterns | Real pricing data |
| _meets_context_requirement() | Always True | Real context window filtering |
Quality Tier Scores:
- FRONTIER: 0.95
- STANDARD: 0.75
- ECONOMY: 0.55
- LOCAL: 0.40
Graceful Degradation: When metadata is unavailable, selection falls back to heuristic estimates.
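A sketch of that lookup, using the tier scores listed above; the neutral fallback constant is an assumption for illustration, not necessarily the exact value in selection.py:

QUALITY_TIER_SCORES = {"frontier": 0.95, "standard": 0.75, "economy": 0.55, "local": 0.40}

def quality_score_from_metadata(model_id: str, provider) -> float:
    """Prefer the real QualityTier from metadata; degrade gracefully otherwise."""
    info = provider.get_model_info(model_id)  # provider: any MetadataProvider
    if info is not None and info.quality_tier in QUALITY_TIER_SCORES:
        return QUALITY_TIER_SCORES[info.quality_tier]
    return 0.5  # metadata unavailable: neutral heuristic estimate (assumed value)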
Phase 2: Reasoning Parameter Optimization (v0.16.x) ✅ IMPLEMENTED¶
Goal: Automatic reasoning parameter configuration for capable models.
- [x] Detect reasoning-capable models from registry metadata
- [x] Apply reasoning_effort parameter based on tier (quick=minimal, reasoning=high)
- [x] Calculate budget_tokens per effort level
- [x] Add task-specific parameter profiles (math→high effort, creative→minimal)
- [x] Update gateway to pass reasoning parameters to OpenRouter
- [x] Track reasoning token usage for cost optimization
Implementation Details (2025-12-24):
Implemented using TDD with 80 new tests (1299 total tests pass).
Module Structure: src/llm_council/reasoning/
| File | Purpose |
|---|---|
| types.py | ReasoningEffort enum, ReasoningConfig frozen dataclass, should_apply_reasoning() |
| tracker.py | ReasoningUsage, AggregatedUsage, extract_reasoning_usage(), aggregate_reasoning_usage() |
| __init__.py | Module exports |
Tier-Effort Mapping:
- quick → MINIMAL (10%)
- balanced → LOW (20%)
- high → MEDIUM (50%)
- reasoning → HIGH (80%)
Domain Overrides: math→HIGH, coding→MEDIUM, creative→MINIMAL
Stage Configuration:
- stage1: true (primary responses)
- stage2: false (peer reviews)
- stage3: true (synthesis)
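A sketch of how the tier mapping, domain overrides, and stage gating above compose; the enum and function shapes are illustrative, not the shipped types.py API:

from enum import Enum
from typing import Optional

class ReasoningEffort(str, Enum):
    MINIMAL = "minimal"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

TIER_EFFORT = {
    "quick": ReasoningEffort.MINIMAL,    # 10% budget ratio
    "balanced": ReasoningEffort.LOW,     # 20%
    "high": ReasoningEffort.MEDIUM,      # 50%
    "reasoning": ReasoningEffort.HIGH,   # 80%
}
DOMAIN_OVERRIDES = {
    "math": ReasoningEffort.HIGH,
    "coding": ReasoningEffort.MEDIUM,
    "creative": ReasoningEffort.MINIMAL,
}
REASONING_STAGES = {"stage1": True, "stage2": False, "stage3": True}

def resolve_effort(tier: str, domain: Optional[str], stage: str) -> Optional[ReasoningEffort]:
    """Return the effort to request, or None when the stage runs without reasoning."""
    if not REASONING_STAGES.get(stage, False):
        return None  # e.g. stage2 peer reviews skip reasoning
    if domain and domain in DOMAIN_OVERRIDES:
        return DOMAIN_OVERRIDES[domain]
    return TIER_EFFORT[tier]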
GitHub Issues: #97-#100 (all completed)
Validation Gate: ✅ PASSED
- Reasoning parameters correctly applied for all reasoning-tier queries
- Token usage tracking shows expected budget allocation
- No regressions in non-reasoning tiers (1299 tests pass)
Phase 3: Internal Performance Tracking (v0.17.x) ✅ IMPLEMENTED¶
Council Recommendation: Instead of scraping external benchmarks (high maintenance risk), implement internal performance tracking based on actual council session outcomes.
- [x] Track model performance per council session:
  - Borda score received (ModelSessionMetric.borda_score)
  - Response latency (ModelSessionMetric.latency_ms)
  - Parse success rate (ModelSessionMetric.parse_success)
  - Reasoning quality (optional reasoning_tokens_used)
- [x] Build Internal Performance Index from historical sessions
  - InternalPerformanceTracker with rolling window aggregation
  - ModelPerformanceIndex with mean_borda_score, p50/p95_latency, parse_success_rate
- [x] Use internal metrics for quality scoring (replaces external benchmarks)
  - get_quality_score() returns 0-100 normalized score
  - Cold start: unknown models get neutral score (50)
- [x] Implement rolling window decay (recent sessions weighted higher)
  - Exponential decay: weight = exp(-days_ago / decay_days)
  - Default decay_days = 30
Implementation Details:
- src/llm_council/performance/ module (4 files, ~700 lines)
- 70 TDD tests in tests/test_performance_*.py
- JSONL storage pattern (follows bias_persistence.py)
- Configuration via PerformanceTrackerConfig in unified_config.py
Validation Gate: Phase 3 complete when:
- 100+ sessions tracked with metrics (tracked via confidence_level=HIGH)
- Internal quality scores correlate with Borda outcomes (by design)
- Model selection uses quality_score from tracker
Phase 4: External Benchmark Integration (DEFERRED) ⏸️¶
Council Warning: External benchmark scraping is HIGH-RISK due to:
- API instability (LMArena, LiveBench change formats frequently)
- Maintenance burden (scrapers break silently)
- Data staleness (monthly updates don't reflect rapid model changes)
Deferred until: Internal Performance Tracking validates the value of quality metrics.
If implemented:
- [ ] Start with manual JSON snapshots (not automated scrapers)
- [ ] Implement LMArena Elo as optional quality boost (not required)
- [ ] LiveBench for contamination-free validation only
- [ ] Create benchmark staleness alerts (>30 days = warning)
Internal Performance Tracker¶
Council Recommendation: Build quality metrics from actual council session outcomes rather than external benchmarks.
Performance Metrics Schema¶
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional

@dataclass
class ModelSessionMetric:
"""Performance data from a single council session."""
session_id: str
model_id: str
timestamp: datetime
# Stage 1 metrics
response_latency_ms: int
response_length: int
parse_success: bool
# Stage 2 metrics (from peer review)
borda_score: float # 0.0 - N (N = council size)
normalized_rank: float # 0.0 - 1.0 (1.0 = best)
rubric_scores: Optional[Dict[str, float]] # If rubric scoring enabled
# Stage 3 metrics (from chairman selection)
selected_for_synthesis: bool # Was this response referenced?
@dataclass
class ModelPerformanceIndex:
"""Aggregated performance for a model."""
model_id: str
sample_size: int
last_updated: datetime
# Aggregated metrics
mean_borda_score: float
mean_normalized_rank: float
p50_latency_ms: int
p95_latency_ms: int
parse_success_rate: float
selection_rate: float # How often selected for synthesis
# Confidence
confidence: str # INSUFFICIENT (<10), PRELIMINARY (10-30), MODERATE (30-100), HIGH (>100)
class InternalPerformanceTracker:
"""Track and aggregate model performance from council sessions."""
def __init__(self, store_path: Path, decay_days: int = 30):
self.store_path = store_path
self.decay_days = decay_days
def record_session(self, session_metrics: List[ModelSessionMetric]) -> None:
"""Record metrics from a completed council session."""
# Atomic append to JSONL store
...
def get_model_index(self, model_id: str) -> ModelPerformanceIndex:
"""Get aggregated performance for a model with rolling window."""
# Apply exponential decay to older sessions
# Recent sessions weighted higher
...
def get_quality_score(self, model_id: str) -> float:
"""Get normalized quality score (0-100) for model selection."""
index = self.get_model_index(model_id)
if index.confidence == "INSUFFICIENT":
return 50.0 # Default neutral score
return index.mean_normalized_rank * 100
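The rolling-window decay inside get_model_index() can be sketched as a weighted mean. The helper below is illustrative rather than the tracker's actual private API, but it follows the documented weight = exp(-days_ago / decay_days):

import math
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple

def decay_weighted_mean(
    samples: Iterable[Tuple[float, datetime]],
    decay_days: int = 30,
    now: Optional[datetime] = None,
) -> float:
    """Average (value, timestamp) samples, weighting recent sessions higher."""
    now = now or datetime.now(timezone.utc)
    total = 0.0
    weight_sum = 0.0
    for value, ts in samples:
        days_ago = max((now - ts).total_seconds() / 86_400, 0.0)
        weight = math.exp(-days_ago / decay_days)  # documented decay formula
        total += value * weight
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0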
Integration with Selection Algorithm¶
def select_tier_models(tier: str, ...) -> List[str]:
# ... hard constraints ...
for model in eligible:
# Use INTERNAL performance tracker instead of external benchmarks
quality_score = performance_tracker.get_quality_score(model.model_id)
# ... rest of scoring with tier-specific weights ...
Open Questions (Council Addressed)¶
Resolved by Council Review¶
| Question | Council Answer |
|---|---|
| Should benchmark scores override tier selection? | No. Tiers represent user intent (speed vs quality tradeoff). Benchmarks inform selection within tier. |
| How to handle new models with no data? | Default neutral score (50). Use provider metadata only until internal performance data accumulates. |
| Balance between performance and cost? | Tier-specific. Quick tier: yes, select cheaper. Reasoning tier: never compromise on quality. |
| Auto-apply reasoning parameters? | Yes, by tier. Reasoning tier = high effort, quick tier = minimal effort. |
| Handle benchmark gaming? | Use internal metrics. Council session outcomes are harder to game than public benchmarks. |
Remaining Open Questions¶
- What sample size validates Internal Performance Index?
  - Council suggested 100+ sessions for HIGH confidence
  - Is 30+ sessions sufficient for MODERATE confidence?
- Should models with LOW internal scores be automatically demoted?
  - Threshold for exclusion from tier pools?
  - Grace period for new models?
- How to bootstrap Internal Performance Tracker?
  - Run shadow sessions with all available models?
  - Start with static config and learn incrementally?
Issues Identified in Full Quorum Review¶
A. Cold Start Problem (Claude, Gemini)
"When a new model appears in OpenRouter, it has zero internal performance data."
Recommended Solutions:
- Assign temporary "phantom score" equivalent to tier average until 10+ samples
- Implement Epsilon-Greedy exploration (small % of requests try new models)
- Minimum sessions required before model enters regular rotation
- Manual allowlist for high-profile new releases
B. Borda Score Normalization (Claude)
"A 5-model session gives max score of 4; an 8-model session gives max of 7."
Solution: Normalize to percentile rank (0.0-1.0) rather than raw Borda counts:
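A minimal sketch of one consistent scheme (percentile rank in 0.0-1.0, where 1.0 = best); the tracker's exact formula may differ:

def normalized_rank(borda_score: float, council_size: int) -> float:
    """Map a raw Borda score (max = council_size - 1) onto 0.0-1.0, 1.0 = best."""
    if council_size < 2:
        return 1.0  # degenerate single-model session
    return borda_score / (council_size - 1)

# Example: the top response scores 4/4 = 1.0 in a 5-model session and 7/7 = 1.0 in an
# 8-model session, so scores stay comparable across pool sizes.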
C. Parse Success Definition (Claude)
Define parse success as ALL of:
- Valid JSON returned (if JSON expected)
- Schema-compliant response
- Extractable vote/rationale for Stage 2
D. Anti-Herding Edge Case (Gemini)
"If only 2 models pass hard constraints, the system might oscillate wildly."
Solution: Disable Anti-Herding when eligible model count < 3.
E. Degradation Behavior (Claude)
"What happens when ALL eligible models for a tier fall below acceptable thresholds?"
Fallback Chain:
1. Warn user and proceed with best-available
2. Escalate to adjacent tier (quick→balanced, balanced→high)
3. Fall back to static config as last resort
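A sketch of that chain as control flow, under the assumption that the eligibility lookup and the static pool are supplied by the caller; the function is illustrative, not an existing API:

from typing import Callable, List

ADJACENT_TIER = {"quick": "balanced", "balanced": "high"}

def resolve_models_with_degradation(
    tier: str,
    eligible_for: Callable[[str], List[str]],  # returns eligible model ids for a tier
    static_pool: List[str],
) -> List[str]:
    """Fallback chain: best-available, then adjacent tier, then static config."""
    models = eligible_for(tier)
    if models:
        return models  # step 1: proceed with best available (caller logs the warning)
    next_tier = ADJACENT_TIER.get(tier)
    if next_tier:
        models = eligible_for(next_tier)  # step 2: escalate to the adjacent tier
        if models:
            return models
    return static_pool  # step 3: static config as last resort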
References¶
External Sources¶
- OpenRouter Models API
- OpenRouter Rankings
- LMArena Leaderboard
- LiveBench
- Artificial Analysis
- LLM Stats
- Not Diamond Model Routing
Related ADRs¶
- ADR-020: Not Diamond Integration Strategy
- ADR-022: Tiered Model Selection
- ADR-023: Multi-Router Gateway Support
- ADR-024: Unified Routing Architecture