
ADR-026: Dynamic Model Intelligence and Benchmark-Driven Selection

Status: APPROVED (Blocking Conditions Implemented)
Date: 2025-12-23
Decision Makers: Engineering, Architecture Council
Review: 2025-12-23 (Strategic + Technical Reviews)
Layer Assignment: Cross-cutting (L1-L4 integration)
Implementation: 2025-12-23 (Blocking Conditions 1-3)


⚠️ CRITICAL: Strategic Council Review - Vendor Dependency Risk

Verdict: CONDITIONAL APPROVAL

ADR-026 was NOT APPROVED in its original form: the council identified critical vendor dependency risks that had to be addressed before implementation. The blocking conditions below have since been implemented.

"We cannot build the core 'brain' of an open-source project on proprietary APIs that we do not control." — Council Consensus

The "Sovereign Orchestrator" Philosophy

The council unanimously adopts this architectural principle:

The open-source version of LLM Council must function as a complete, independent utility. External services (like OpenRouter or Not Diamond) must be treated as PLUGINS, not foundations.

If the internet is disconnected or if an API key is revoked, the software must still boot, run, and perform its core function (orchestrating LLMs), even if quality is degraded.

Blocking Conditions for Approval

# Condition Status Priority
1 Add ModelMetadataProvider abstraction interface ✅ COMPLETED BLOCKING
2 Implement StaticRegistryProvider (30+ models) ✅ COMPLETED (31 models) BLOCKING
3 Add offline mode (LLM_COUNCIL_OFFLINE=true) ✅ COMPLETED BLOCKING
4 Evaluate LiteLLM as unified abstraction ✅ COMPLETED (as fallback) High
5 Document degraded vs. enhanced feature matrix 📋 Required Medium

Implementation Notes (2025-12-23)

The blocking conditions were implemented using TDD (Test-Driven Development) with 86 passing tests.

Module Structure: src/llm_council/metadata/

File Purpose
types.py ModelInfo frozen dataclass, QualityTier enum, Modality enum
protocol.py MetadataProvider @runtime_checkable Protocol
static_registry.py StaticRegistryProvider with YAML + LiteLLM fallback
litellm_adapter.py Lazy LiteLLM import for metadata extraction
offline.py is_offline_mode() and check_offline_mode_startup()
__init__.py get_provider() singleton factory, module exports

Bundled Registry: src/llm_council/models/registry.yaml

31 models from 8 providers:
  • OpenAI (7): gpt-4o, gpt-4o-mini, gpt-5.2-pro, o1, o1-preview, o1-mini, o3-mini
  • Anthropic (5): claude-opus-4.5, claude-3-5-sonnet, claude-3-5-haiku, claude-3-opus, claude-3-sonnet
  • Google (5): gemini-3-pro-preview, gemini-2.5-pro, gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash
  • xAI (2): grok-4, grok-4.1-fast
  • DeepSeek (2): deepseek-r1, deepseek-chat
  • Meta (2): llama-3.3-70b, llama-3.1-405b
  • Mistral (2): mistral-large-2411, mistral-medium
  • Ollama (6): llama3.2, mistral, qwen2.5:14b, codellama, phi3, deepseek-r1:8b

LiteLLM Integration: Used as fallback in the priority chain (local registry > LiteLLM > 4096 default). Lazy import prevents startup failures when LiteLLM is not installed.
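
A minimal sketch of the lazy-import pattern described above. It assumes LiteLLM exposes its bundled model map as litellm.model_cost (the attribute name may differ across versions), and the helper name mirrors the constructor call shown in the provider code later; treat both as assumptions, not the shipped implementation.

# Hypothetical sketch of litellm_adapter.py's lazy-import approach.
from typing import Dict


def _load_litellm_model_map() -> Dict[str, dict]:
    """Return LiteLLM's bundled model metadata map, or {} if LiteLLM is absent."""
    try:
        import litellm  # imported lazily so a missing dependency never breaks startup
    except ImportError:
        return {}
    # litellm.model_cost is assumed to be a dict keyed by model id
    return dict(getattr(litellm, "model_cost", {}))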

GitHub Issues: #89-#92 (all completed)

Strategic Decision: Option C+D (Hybrid + Abstraction)

Feature OSS (Self-Hosted) Council Cloud (Commercial)
Model Metadata Static library (LiteLLM) + Manual YAML config Real-time dynamic sync via OpenRouter
Routing Heuristic rules (latency/cost-based) Intelligent ML-based (Not Diamond)
Integrations BYOK (Bring Your Own Keys) Managed Fleet (one bill, instant access)
Operations localhost / Individual instance Team governance, analytics, SSO

Vendor Dependency Analysis

Service Current Role Risk Level Required Mitigation
OpenRouter Metadata API, Gateway HIGH Static fallback + LiteLLM
Not Diamond Model routing, Classification MEDIUM Heuristic fallback (exists)
Requesty Alternative gateway LOW Already optional

Affiliate/Reseller Model: NOT VIABLE

"Reliance on affiliate revenue or tight coupling creates Platform Risk. If OpenRouter releases 'OpenRouter Agents,' Council becomes obsolete instantly. Furthermore, council-cloud cannot withstand margin compression." — Council

Decision: Use external services to lower the User's barrier to entry, not as the backbone of the Product's value.


Required Abstraction Architecture

MetadataProvider Interface (MANDATORY)

from pathlib import Path
from typing import Protocol, runtime_checkable, Optional, Dict, List
from dataclasses import dataclass

@dataclass
class ModelInfo:
    id: str
    context_window: int
    pricing: Dict[str, float]  # {"prompt": 0.01, "completion": 0.03}
    supported_parameters: List[str]
    modalities: List[str]
    quality_tier: str  # "frontier" | "standard" | "economy" | "local"

@runtime_checkable
class MetadataProvider(Protocol):
    """Abstract interface for model metadata sources."""

    def get_model_info(self, model_id: str) -> Optional[ModelInfo]: ...
    def get_context_window(self, model_id: str) -> int: ...
    def get_pricing(self, model_id: str) -> Dict[str, float]: ...
    def supports_reasoning(self, model_id: str) -> bool: ...
    def list_available_models(self) -> List[str]: ...

class StaticRegistryProvider(MetadataProvider):
    """Default: Offline-safe provider using bundled registry + LiteLLM."""

    def __init__(self, registry_path: Optional[Path] = None):
        self.registry = self._load_registry(registry_path)
        self.litellm_data = self._load_litellm_model_map()

    def get_context_window(self, model_id: str) -> int:
        # 1. Check local config override
        if model_id in self.registry:
            return self.registry[model_id].context_window
        # 2. Check LiteLLM library
        if model_id in self.litellm_data:
            return self.litellm_data[model_id].context_window
        # 3. Safe default
        return 4096

class DynamicMetadataProvider(MetadataProvider):
    """Optional: Real-time metadata from OpenRouter API."""

    async def refresh(self) -> None:
        """Fetch latest model data - requires API key."""
        ...
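
Callers are expected to go through the get_provider() factory rather than instantiate providers directly. A hedged usage sketch follows; the exact exports of llm_council.metadata are assumed from the module table above, and the sample values come from the bundled registry shown below.

from llm_council.metadata import MetadataProvider, get_provider

provider = get_provider()  # StaticRegistryProvider by default; dynamic when enabled

# Because MetadataProvider is a @runtime_checkable Protocol, conformance can be
# checked structurally at runtime:
assert isinstance(provider, MetadataProvider)

print(provider.get_context_window("anthropic/claude-opus-4.5"))  # 200000 from the registry
print(provider.get_pricing("openai/gpt-4o"))                     # {"prompt": 0.0025, "completion": 0.01}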

Static Registry Schema (MANDATORY)

# models/registry.yaml - Shipped with OSS
version: "1.0"
updated: "2025-12-23"
models:
  - id: "openai/gpt-4o"
    context_window: 128000
    pricing:
      prompt: 0.0025
      completion: 0.01
    supported_parameters: ["temperature", "top_p", "tools"]
    modalities: ["text", "vision"]
    quality_tier: "frontier"

  - id: "anthropic/claude-opus-4.5"
    context_window: 200000
    pricing:
      prompt: 0.015
      completion: 0.075
    supported_parameters: ["temperature", "top_p", "tools", "reasoning"]
    modalities: ["text", "vision"]
    quality_tier: "frontier"

  - id: "ollama/llama3.2"
    provider: "ollama"
    context_window: 128000
    pricing:
      prompt: 0
      completion: 0
    modalities: ["text"]
    quality_tier: "local"
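
A sketch of how the bundled registry might be parsed into ModelInfo records (the dataclass defined in the interface sketch above), assuming PyYAML and the schema shown here; load_registry is illustrative, not the shipped _load_registry.

from pathlib import Path
from typing import Dict

import yaml  # PyYAML


def load_registry(path: Path) -> Dict[str, ModelInfo]:
    """Parse models/registry.yaml into a {model_id: ModelInfo} map."""
    data = yaml.safe_load(path.read_text())
    registry: Dict[str, ModelInfo] = {}
    for entry in data.get("models", []):
        registry[entry["id"]] = ModelInfo(
            id=entry["id"],
            context_window=entry["context_window"],
            pricing=entry.get("pricing", {}),
            supported_parameters=entry.get("supported_parameters", []),
            modalities=entry.get("modalities", []),
            quality_tier=entry.get("quality_tier", "standard"),
        )
    return registry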

Offline Mode (MANDATORY)

# Force offline operation - MUST work without any external calls
export LLM_COUNCIL_OFFLINE=true

When offline mode is enabled:
  1. Use StaticRegistryProvider exclusively
  2. Disable all external metadata/routing calls
  3. Log an INFO message about limited/stale metadata
  4. All core council operations MUST succeed
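
A hedged sketch of the offline switch, using the helper names from the module table (is_offline_mode, get_provider) and the provider classes sketched above; the real factory logic is more involved, so treat this as a simplified illustration.

import logging
import os

logger = logging.getLogger(__name__)


def is_offline_mode() -> bool:
    """True when LLM_COUNCIL_OFFLINE is set to a truthy value."""
    return os.getenv("LLM_COUNCIL_OFFLINE", "").lower() in {"1", "true", "yes"}


def get_provider() -> MetadataProvider:
    """Simplified factory: offline mode always forces the bundled registry."""
    if is_offline_mode():
        logger.info("LLM_COUNCIL_OFFLINE=true: using bundled registry; metadata may be stale.")
        return StaticRegistryProvider()
    # Otherwise the dynamic provider may be used when model intelligence is enabled
    return DynamicMetadataProvider()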


Technical Council Review Summary

Technical Review (2025-12-23) - Full Quorum

Model Verdict Rank Response Time
Claude Opus 4.5 CONDITIONAL APPROVAL #1 23.4s
Gemini 3 Pro APPROVE #2 31.4s
Grok 4 APPROVE #3 59.6s
GPT-4o APPROVE #4 9.8s

"The council successfully identified Response C (Claude) as the superior review, noting its crucial detection of mathematical flaws (Borda normalization with variable pool sizes) and logical gaps (Cold Start) missed by other responses."

First Technical Review (2025-12-23, 3/4 models)

Approved Components:
  • Dynamic metadata integration via OpenRouter API (pricing, availability, capability detection)
  • Reasoning parameter optimization (reasoning_effort, budget_tokens)
  • Integration points with existing L1-L4 architecture

Returned for Revision (Now Resolved):
  • ~~Benchmark scraping strategy~~ → Deferred to Phase 4, use Internal Performance Tracker
  • ~~Single scoring algorithm with "magic number" weights~~ → Tier-Specific Weighting Matrices

Key Technical Recommendations

Recommendation Status Priority
Add Context Window as hard constraint ✅ Incorporated Critical
Replace single scoring with Tier-Specific Weighting ✅ Incorporated High
Defer benchmark scraping to optional Phase 4 ✅ Incorporated High
Add Anti-Herding logic ✅ Incorporated Medium
Implement Internal Performance Tracker ✅ Incorporated Medium
Cold Start handling for new models 📋 Documented Medium
Borda score normalization 📋 Documented Medium
Anti-Herding edge case (<3 models) 📋 Documented Low

Council Consensus Points

  1. Context Window is a hard pass/fail constraint - must filter before scoring, not weight
  2. Tier-specific weighting is essential - quick tier prioritizes speed, reasoning tier prioritizes quality
  3. Benchmark scraping is high-risk - external APIs change frequently, creates maintenance nightmare
  4. Internal performance data is more valuable - track actual council session outcomes
  5. Phased approach required - decouple metadata (proven value) from benchmark intelligence (speculative)
  6. Cold Start needs exploration strategy - new models need "audition" mechanism (Phase 3)
  7. LiteLLM strongly recommended - use as library for metadata, not just proxy

Context

Problem Statement

The LLM Council's current model selection relies on static configuration that quickly becomes stale in a rapidly evolving model landscape. The weeks around December 2025 alone saw major releases from every frontier lab:

Release Date Model Provider
Nov 17, 2025 Grok 4.1 xAI
Nov 18, 2025 Gemini 3 Pro Google
Nov 24, 2025 Claude Opus 4.5 Anthropic
Dec 11, 2025 GPT-5.2 OpenAI

Our tier pools in config.py reference models that may be:
  • Deprecated or renamed (model identifiers change)
  • Outperformed by newer models (benchmarks shift monthly)
  • Suboptimally configured (missing reasoning parameters)
  • Unavailable or rate-limited (provider status changes)

Current Architecture Gaps

Gap Impact Current State
Static tier pools Stale model selection Hardcoded in config.py
No benchmark integration Suboptimal model-task matching Manual updates
No model metadata Missing capabilities detection Assumed uniform
No reasoning parameters Underutilized model capabilities Default parameters only
No availability tracking Failures on unavailable models Reactive error handling

Existing Foundation (ADRs 020, 022, 024)

The architecture already supports dynamic model selection:

ADR Component Opportunity
ADR-020 Not Diamond integration Model routing API exists but uses static candidates
ADR-022 Tier contracts allowed_models field could be dynamically populated
ADR-024 Layer architecture L1 tier selection could query external data sources

Decision

Implement a Model Intelligence Layer that provides real-time model metadata, benchmark rankings, and dynamic pool management to all routing layers.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MODEL INTELLIGENCE LAYER (New)                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │ Model Registry  │  │ Benchmark Index │  │ Availability    │              │
│  │ (OpenRouter API)│  │ (Leaderboards)  │  │ Monitor         │              │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘              │
│           │                    │                    │                        │
│           └────────────────────┴────────────────────┘                        │
│                                │                                             │
│                    ┌───────────▼───────────┐                                 │
│                    │   Model Selector API   │                                │
│                    │   - get_tier_models()  │                                │
│                    │   - get_best_for_task()│                                │
│                    │   - get_model_params() │                                │
│                    └───────────┬───────────┘                                 │
│                                │                                             │
└────────────────────────────────┼─────────────────────────────────────────────┘
        ┌────────────────────────┼────────────────────────────────┐
        │                        │                                │
        ▼                        ▼                                ▼
┌───────────────┐       ┌───────────────┐                ┌───────────────┐
│ L1: Tier      │       │ L2: Query     │                │ L4: Gateway   │
│ Selection     │       │ Triage        │                │ Routing       │
│ (ADR-022)     │       │ (ADR-020)     │                │ (ADR-023)     │
└───────────────┘       └───────────────┘                └───────────────┘

Data Sources

1. OpenRouter Models API

Endpoint: GET https://openrouter.ai/api/v1/models

Provides real-time model metadata:

{
  "id": "anthropic/claude-opus-4-5-20250514",
  "name": "Claude Opus 4.5",
  "pricing": {
    "prompt": "0.000015",
    "completion": "0.000075"
  },
  "context_length": 200000,
  "architecture": {
    "input_modalities": ["text", "image"],
    "output_modalities": ["text"]
  },
  "supported_parameters": ["temperature", "top_p", "reasoning"],
  "top_provider": {
    "is_moderated": true
  }
}

Key Fields for Selection:
  • pricing - Cost optimization
  • context_length - Long document handling
  • supported_parameters - Reasoning mode detection
  • input_modalities - Multimodal capability

2. Benchmark Leaderboards

Source Data Update Frequency API
LMArena Elo ratings from 5M+ votes Real-time Public
LiveBench Monthly contamination-free benchmarks Monthly Public
Artificial Analysis Speed, cost, quality metrics Weekly Public
LLM Stats Aggregated performance data Daily Public

Benchmark Categories:
  • Reasoning: GPQA Diamond, AIME 2025, ARC-AGI-2
  • Coding: SWE-bench, LiveCodeBench, Terminal-Bench
  • General: MMLU-Pro, Humanity's Last Exam
  • Speed: Tokens/second, time-to-first-token

3. OpenRouter Rankings

Endpoint: GET https://openrouter.ai/rankings

Usage-based popularity metrics (tokens served, request count).


Model Parameter Optimization

Reasoning Mode Parameters

OpenRouter supports unified reasoning parameters:

# For reasoning-capable models (o1, o3, GPT-5, Claude with thinking)
request_params = {
    "reasoning": {
        "effort": "high",  # "minimal"|"low"|"medium"|"high"|"xhigh"
        "max_tokens": 32000,  # Budget for reasoning tokens
        "exclude": False,  # Include reasoning in response
    }
}

Effort Level Budget Calculation:

budget_tokens = max(min(max_tokens * effort_ratio, 32000), 1024)

effort_ratio:
  xhigh: 0.95
  high: 0.80
  medium: 0.50
  low: 0.20
  minimal: 0.10
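
The same calculation in Python, a direct transcription of the formula and ratios above (the function name is illustrative):

EFFORT_RATIO = {"xhigh": 0.95, "high": 0.80, "medium": 0.50, "low": 0.20, "minimal": 0.10}


def reasoning_budget_tokens(max_tokens: int, effort: str) -> int:
    """budget_tokens = max(min(max_tokens * effort_ratio, 32000), 1024)."""
    return max(min(int(max_tokens * EFFORT_RATIO[effort]), 32_000), 1024)


# Example: a 16k-token request at "high" effort gets a 12,800-token reasoning budget.
assert reasoning_budget_tokens(16_000, "high") == 12_800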

Parameter Detection

def get_model_params(model_id: str, task_type: str) -> dict:
    """Get optimized parameters for model and task."""
    model_info = model_registry.get(model_id)

    params = {}

    # Enable reasoning for supported models on complex tasks
    if "reasoning" in model_info.supported_parameters:
        if task_type in ["reasoning", "math", "coding"]:
            params["reasoning"] = {
                "effort": "high" if task_type == "reasoning" else "medium"
            }

    # Adjust temperature for task type
    if task_type == "creative":
        params["temperature"] = 0.9
    elif task_type in ["coding", "math"]:
        params["temperature"] = 0.2

    return params

Dynamic Tier Pool Management

Tier Requirements Matrix

Tier Latency Budget Cost Ceiling Min Models Required Capabilities
quick P95 < 10s < $0.001/req 3 Fast inference
balanced P95 < 45s < $0.01/req 3-4 Good reasoning
high P95 < 120s < $0.10/req 4-5 Full capability
reasoning P95 < 300s < $1.00/req 3-4 Extended thinking
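
The selection algorithm below references a TIER_REQUIREMENTS table; one way the matrix above could be encoded, with field names chosen to match the attribute access in that code (latency_budget, cost_ceiling) and min_models taking the lower bound of the ranges shown. This is a sketch, not the shipped structure.

from dataclasses import dataclass


@dataclass(frozen=True)
class TierRequirements:
    latency_budget: float   # seconds, P95
    cost_ceiling: float     # USD per request
    min_models: int


TIER_REQUIREMENTS = {
    "quick":     TierRequirements(latency_budget=10,  cost_ceiling=0.001, min_models=3),
    "balanced":  TierRequirements(latency_budget=45,  cost_ceiling=0.01,  min_models=3),
    "high":      TierRequirements(latency_budget=120, cost_ceiling=0.10,  min_models=4),
    "reasoning": TierRequirements(latency_budget=300, cost_ceiling=1.00,  min_models=3),
}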

Dynamic Pool Selection Algorithm

Council Revision: Algorithm updated per council feedback to:
  1. Add Context Window as hard pass/fail constraint
  2. Replace global weights with Tier-Specific Weighting Matrices
  3. Add Anti-Herding logic to prevent traffic concentration

@dataclass
class ModelScore:
    model_id: str
    benchmark_score: float  # Normalized 0-100 (optional, from internal tracker)
    latency_p95: float      # Seconds
    cost_per_request: float # USD
    availability: float     # 0-1
    diversity_score: float  # Provider diversity
    context_window: int     # Token limit (HARD CONSTRAINT)
    recent_traffic: float   # 0-1, for anti-herding

# COUNCIL RECOMMENDATION: Tier-Specific Weighting Matrices
# Replaces "magic number" global weights (0.4/0.2/0.2/0.1/0.1)
TIER_WEIGHTS = {
    "quick": {
        "latency": 0.45,      # Speed is primary concern
        "cost": 0.25,         # Budget-conscious
        "quality": 0.15,      # Acceptable quality
        "availability": 0.10,
        "diversity": 0.05,
    },
    "balanced": {
        "quality": 0.35,      # Better quality
        "latency": 0.25,      # Still matters
        "cost": 0.20,         # Cost-aware
        "availability": 0.10,
        "diversity": 0.10,
    },
    "high": {
        "quality": 0.50,      # Quality is paramount
        "availability": 0.20, # Must be reliable
        "latency": 0.15,      # Acceptable wait
        "diversity": 0.10,    # Multiple perspectives
        "cost": 0.05,         # Cost secondary
    },
    "reasoning": {
        "quality": 0.60,      # Best possible quality
        "availability": 0.20, # Critical reliability
        "diversity": 0.10,    # Diverse reasoning
        "latency": 0.05,      # Patience for quality
        "cost": 0.05,         # Cost not a factor
    },
}

def select_tier_models(
    tier: str,
    task_domain: Optional[str] = None,
    count: int = 4,
    required_context: Optional[int] = None,  # NEW: context requirement
) -> List[str]:
    """Select optimal models for tier using multi-criteria scoring.

    Council-Validated Algorithm:
    1. Apply HARD CONSTRAINTS (pass/fail)
    2. Score using TIER-SPECIFIC weights
    3. Apply ANTI-HERDING penalty
    4. Ensure PROVIDER DIVERSITY
    """

    candidates = model_registry.get_available_models()
    tier_config = TIER_REQUIREMENTS[tier]
    weights = TIER_WEIGHTS[tier]

    # ===== HARD CONSTRAINTS (Pass/Fail) =====
    # Council Critical: Context window MUST be hard constraint, not weighted
    eligible = [
        m for m in candidates
        if m.latency_p95 <= tier_config.latency_budget
        and m.cost_per_request <= tier_config.cost_ceiling
        and m.availability >= 0.95
        # COUNCIL ADDITION: Context window as hard constraint
        and (required_context is None or m.context_window >= required_context)
    ]

    if not eligible:
        logger.warning(f"No models meet hard constraints for tier={tier}")
        return fallback_to_static_config(tier)

    # ===== SOFT SCORING (Tier-Specific Weights) =====
    scored = []
    for model in eligible:
        # Normalize scores to 0-1 range
        latency_score = 1 - (model.latency_p95 / tier_config.latency_budget)
        cost_score = 1 - (model.cost_per_request / tier_config.cost_ceiling)
        quality_score = model.benchmark_score / 100 if model.benchmark_score else 0.5

        score = (
            quality_score * weights["quality"] +
            latency_score * weights["latency"] +
            cost_score * weights["cost"] +
            model.availability * weights["availability"] +
            model.diversity_score * weights["diversity"]
        )

        # Domain boost (task-specific enhancement)
        if task_domain and task_domain in model.strengths:
            score *= 1.15

        # COUNCIL ADDITION: Anti-Herding Penalty
        # Prevent traffic concentration on popular models
        if model.recent_traffic > 0.3:  # More than 30% of recent traffic
            score *= (1 - (model.recent_traffic - 0.3) * 0.5)  # Up to 35% penalty

        scored.append((model.model_id, score))

    # ===== DIVERSITY ENFORCEMENT =====
    selected = select_with_diversity(scored, count, min_providers=2)

    return selected
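
select_with_diversity() is referenced above but not shown. One plausible greedy implementation that honors the min_providers constraint is sketched below; it is an assumption about behavior, not the shipped code.

from typing import List, Tuple


def select_with_diversity(
    scored: List[Tuple[str, float]], count: int, min_providers: int = 2
) -> List[str]:
    """Pick the highest-scoring models, then swap the weakest pick for the best
    model from an unused provider if fewer than min_providers are represented."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    selected = [model_id for model_id, _ in ranked[:count]]

    def provider(model_id: str) -> str:
        return model_id.split("/")[0]  # "openai/gpt-4o" -> "openai"

    if len({provider(m) for m in selected}) < min_providers:
        for model_id, _ in ranked[count:]:
            if provider(model_id) not in {provider(m) for m in selected}:
                selected[-1] = model_id  # replace the lowest-scoring pick
                break
    return selected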

Benchmark Score Normalization (DEFERRED - Phase 4)

Council Warning: This section describes external benchmark integration which is DEFERRED to Phase 4. Use Internal Performance Tracker (Phase 3) for quality scoring in initial releases.

# DEFERRED: Only implement after Internal Performance Tracker validates value
def normalize_benchmark_scores(model_id: str) -> Optional[float]:
    """Aggregate benchmark scores into single quality metric.

    WARNING: External benchmark scraping is high-maintenance.
    Prefer Internal Performance Tracker for quality scoring.
    Only implement if internal metrics prove insufficient.
    """

    # Start with manual JSON snapshots, NOT automated scrapers
    scores = load_manual_benchmark_snapshot(model_id)

    if not scores:
        return None  # Fall back to internal metrics

    # Weighted aggregation (emphasize reasoning and coding)
    weights = {
        "lmarena_elo": 0.3,      # Human preference
        "livebench": 0.2,        # Contamination-free
        "gpqa_diamond": 0.25,    # Science reasoning
        "swe_bench": 0.25,       # Coding capability
    }

    normalized = sum(
        normalize_to_100(scores[k]) * weights[k]
        for k in weights
        if scores.get(k) is not None
    )

    return normalized

Integration Points

1. Layer 1 Enhancement (ADR-022)

# tier_contract.py modification
def create_tier_contract(tier: str, task_domain: Optional[str] = None) -> TierContract:
    """Create tier contract with dynamically selected models."""

    # Use Model Intelligence Layer instead of static config
    models = model_intelligence.select_tier_models(
        tier=tier,
        task_domain=task_domain,
        count=TIER_MODEL_COUNTS[tier],
    )

    # Get tier-appropriate aggregator
    aggregator = model_intelligence.get_aggregator_for_tier(tier)

    return TierContract(
        tier=tier,
        allowed_models=models,
        aggregator_model=aggregator,
        **get_tier_timeout(tier),
    )

2. Layer 2 Enhancement (ADR-020)

# not_diamond.py modification
async def route_with_intelligence(
    query: str,
    tier_contract: TierContract,
) -> RouteResult:
    """Route using Not Diamond + Model Intelligence."""

    # Classify the query once and reuse the domain for selection and tuning
    task_domain = classify_domain(query)

    # Get task-appropriate candidates from intelligence layer
    candidates = model_intelligence.select_tier_models(
        tier=tier_contract.tier,
        task_domain=task_domain,
    )

    # Get optimized parameters for each candidate (keyed by task type, not raw query)
    params = {
        model: model_intelligence.get_model_params(model, task_domain)
        for model in candidates
    }

    # Route using Not Diamond (with enriched candidates)
    if is_not_diamond_available():
        result = await not_diamond.route(query, candidates)
        return RouteResult(
            model=result.model,
            params=params[result.model],
            confidence=result.confidence,
        )

    # Fallback to intelligence-based selection
    return RouteResult(
        model=candidates[0],
        params=params[candidates[0]],
        confidence=0.7,
    )

3. Gateway Enhancement (ADR-023)

# gateway/types.py modification
@dataclass
class GatewayRequest:
    model: str
    messages: List[CanonicalMessage]
    # New: Model-specific parameters from intelligence layer
    model_params: Optional[Dict[str, Any]] = None

    def apply_model_params(self) -> Dict[str, Any]:
        """Apply optimized parameters to request."""
        request = self.to_openai_format()
        if self.model_params:
            request.update(self.model_params)
        return request
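
Usage sketch: parameters produced by get_model_params() ride along on the request and are merged just before dispatch. CanonicalMessage is assumed to take role/content keyword arguments, and the message content is a placeholder.

request = GatewayRequest(
    model="openai/o1",
    messages=[CanonicalMessage(role="user", content="Prove the claim step by step.")],
    model_params={"reasoning": {"effort": "high"}, "temperature": 0.2},
)

payload = request.apply_model_params()  # OpenAI-format dict with reasoning/temperature merged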

Caching and Refresh Strategy

Cache Layers

Data Cache TTL Refresh Trigger
Model registry 1 hour API call / manual
Benchmark scores 24 hours Daily cron
Availability status 5 minutes Health check failures
Latency metrics 15 minutes Rolling window

Implementation

import httpx
from cachetools import TTLCache


class ModelIntelligenceCache:
    def __init__(self):
        self.registry_cache = TTLCache(maxsize=500, ttl=3600)
        self.benchmark_cache = TTLCache(maxsize=100, ttl=86400)
        self.availability_cache = TTLCache(maxsize=500, ttl=300)

    async def refresh_registry(self):
        """Fetch latest model data from OpenRouter."""
        async with httpx.AsyncClient() as client:
            response = await client.get(
                "https://openrouter.ai/api/v1/models",
                headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"}
            )
            models = response.json()["data"]

            for model in models:
                self.registry_cache[model["id"]] = ModelInfo.from_api(model)

    async def refresh_benchmarks(self):
        """Fetch latest benchmark data from leaderboards."""
        # LMArena Elo
        lmarena = await fetch_lmarena_leaderboard()
        # LiveBench
        livebench = await fetch_livebench_scores()
        # Artificial Analysis
        aa = await fetch_artificial_analysis()

        # Merge and normalize
        for model_id in self.registry_cache:
            self.benchmark_cache[model_id] = BenchmarkData(
                lmarena_elo=lmarena.get(model_id),
                livebench=livebench.get(model_id),
                artificial_analysis=aa.get(model_id),
            )

Configuration

Environment Variables

# Model Intelligence Layer
LLM_COUNCIL_MODEL_INTELLIGENCE=true|false  # Enable dynamic selection
LLM_COUNCIL_BENCHMARK_SOURCE=lmarena|livebench|artificial_analysis|aggregate
LLM_COUNCIL_REFRESH_INTERVAL=3600  # Registry refresh interval (seconds)

# Fallback to static config if intelligence unavailable
LLM_COUNCIL_STATIC_FALLBACK=true|false

# Minimum benchmark score thresholds
LLM_COUNCIL_MIN_BENCHMARK_SCORE=60  # 0-100 normalized
LLM_COUNCIL_MIN_AVAILABILITY=0.95   # 0-1

# Provider diversity
LLM_COUNCIL_MIN_PROVIDERS=2  # Minimum distinct providers per tier
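
A sketch of how these variables might be surfaced as a typed config object. The class name ModelIntelligenceConfig appears in the Phase 1 notes (unified_config.py); the field names and defaults here are assumptions for illustration.

import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelIntelligenceConfig:
    enabled: bool = False
    refresh_interval: int = 3600
    static_fallback: bool = True
    min_availability: float = 0.95
    min_providers: int = 2

    @classmethod
    def from_env(cls) -> "ModelIntelligenceConfig":
        """Read the LLM_COUNCIL_* variables documented above."""
        return cls(
            enabled=os.getenv("LLM_COUNCIL_MODEL_INTELLIGENCE", "false").lower() == "true",
            refresh_interval=int(os.getenv("LLM_COUNCIL_REFRESH_INTERVAL", "3600")),
            static_fallback=os.getenv("LLM_COUNCIL_STATIC_FALLBACK", "true").lower() == "true",
            min_availability=float(os.getenv("LLM_COUNCIL_MIN_AVAILABILITY", "0.95")),
            min_providers=int(os.getenv("LLM_COUNCIL_MIN_PROVIDERS", "2")),
        )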

YAML Configuration

Council Revision: Updated to use tier-specific weights instead of global weights.

council:
  model_intelligence:
    enabled: true
    sources:
      openrouter_api: true
      # DEFERRED: External benchmark sources (Phase 4)
      # lmarena: false
      # livebench: false
      # artificial_analysis: false
      internal_performance: true  # Phase 3: Use council session outcomes

    refresh:
      registry_ttl: 3600
      # benchmark_ttl: 86400  # DEFERRED
      availability_ttl: 300
      performance_ttl: 3600  # Internal performance cache

    selection:
      # COUNCIL REVISION: Tier-specific weights instead of global weights
      tier_weights:
        quick:
          latency: 0.45
          cost: 0.25
          quality: 0.15
          availability: 0.10
          diversity: 0.05
        balanced:
          quality: 0.35
          latency: 0.25
          cost: 0.20
          availability: 0.10
          diversity: 0.10
        high:
          quality: 0.50
          availability: 0.20
          latency: 0.15
          diversity: 0.10
          cost: 0.05
        reasoning:
          quality: 0.60
          availability: 0.20
          diversity: 0.10
          latency: 0.05
          cost: 0.05

      constraints:
        min_providers: 2
        min_availability: 0.95
        max_cost_multiplier: 10  # vs cheapest option

      # COUNCIL ADDITION: Anti-Herding
      anti_herding:
        enabled: true
        traffic_threshold: 0.3  # 30% of recent traffic
        max_penalty: 0.35       # Up to 35% score reduction

    parameters:
      auto_reasoning: true  # Enable reasoning params when appropriate
      reasoning_effort_by_tier:
        quick: minimal
        balanced: low
        high: medium
        reasoning: high

    # COUNCIL ADDITION: Internal Performance Tracker
    performance_tracker:
      enabled: true
      store_path: "${HOME}/.llm-council/performance.jsonl"
      decay_days: 30
      min_samples_preliminary: 10
      min_samples_moderate: 30
      min_samples_high: 100

Risks and Mitigations

Council-Identified Risks (High Priority)

Risk Likelihood Impact Mitigation
Benchmark scraper breakage HIGH HIGH DEFER to Phase 4; use manual snapshots, not scrapers
Traffic herding Medium High Anti-Herding penalty in selection algorithm
Context window violations Medium High Hard constraint filter (not weighted)
Magic number weights N/A Medium Tier-specific weight matrices

Original Risks (Updated)

Risk Likelihood Impact Mitigation
External API unavailability Medium High Static fallback, aggressive caching
~~Benchmark data staleness~~ ~~Medium~~ ~~Medium~~ DEFERRED: Internal Performance Tracker instead
Model identifier changes High Medium Fuzzy matching, alias tracking
Over-optimization Medium Medium Diversity constraints, Anti-Herding logic
Cold start latency Low Medium Pre-warm cache on startup
~~Provider bias in benchmarks~~ ~~Medium~~ ~~Low~~ DEFERRED: Internal metrics not susceptible
Internal metric bias Medium Medium Minimum sample size requirements, decay weighting

Success Metrics

Phase 1 Success Metrics (Model Metadata Layer)

Metric Target Measurement
Registry availability > 99% uptime Track OpenRouter API failures
Context window violations 0 errors Monitor "context exceeded" errors
Static fallback activation < 1% of requests Track fallback usage
Model freshness < 1 hour stale Track registry refresh success

Phase 2 Success Metrics (Reasoning Parameters)

Metric Target Measurement
Parameter utilization 100% for reasoning tier Track reasoning param usage
Budget token efficiency > 80% utilization Compare budget vs actual tokens
Reasoning quality No regression Compare rubric scores before/after

Phase 3 Success Metrics (Internal Performance Tracker)

Metric Target Measurement
Session coverage > 95% tracked Count sessions with metrics
Internal metric correlation > 0.6 with Borda Validate internal scores vs outcomes
Model ranking stability < 10% weekly variance Track rank position changes
Selection improvement > 5% higher Borda Compare dynamic vs static selection

Overall Success Metrics

Metric Target Measurement
~~Benchmark correlation~~ ~~> 0.8~~ DEFERRED: Internal metrics instead
Cost optimization -15% vs static Compare equivalent quality
Tier pool diversity ≥ 2 providers Track provider distribution
Anti-Herding effectiveness No model > 40% traffic Monitor traffic distribution

Implementation Phases

Council Recommendation: Decouple proven value (metadata) from speculative value (benchmark intelligence). Implement in strict phases with validation gates.

Phase 1: Model Metadata Layer (v0.15.x) ✅ IMPLEMENTED

Goal: Dynamic model discovery and capability detection via OpenRouter API.

Status: ✅ COMPLETE (2025-12-23)
GitHub Issues: #93, #94, #95
Tests: 79 TDD tests (cache: 20, client: 20, provider: 24, selection: 35)

  • [x] Implement OpenRouter API client (src/llm_council/metadata/openrouter_client.py)
  • [x] Cache model metadata with TTL (1 hour registry, 5 min availability)
  • src/llm_council/metadata/cache.py: TTLCache, ModelIntelligenceCache
  • [x] Add model capability detection (context window, reasoning support, modalities)
  • src/llm_council/metadata/dynamic_provider.py: DynamicMetadataProvider
  • [x] Add Context Window as hard constraint in tier filtering
  • src/llm_council/metadata/selection.py: _meets_context_requirement()
  • [x] Update get_tier_models() to use registry with static fallback
  • src/llm_council/metadata/selection.py: select_tier_models()
  • [x] Implement Anti-Herding logic with traffic tracking
  • src/llm_council/metadata/selection.py: apply_anti_herding_penalty()
  • [x] Add ModelIntelligenceConfig to unified_config.py
  • [x] Add task_domain parameter to tier_contract.py

Environment Variables:
  • LLM_COUNCIL_MODEL_INTELLIGENCE=true enables dynamic selection
  • LLM_COUNCIL_OFFLINE=true forces static provider (takes precedence)

Validation Gate: ✅ PASSED
  • OpenRouter API client with timeout/error handling
  • Static fallback activates when API unavailable or offline mode enabled
  • All 1206 tests pass

Phase 1 "Hollow" Fix (2025-12-24):

The initial Phase 1 implementation used regex pattern matching (a "hollow" implementation). It was subsequently fixed to use real metadata from providers (Issues #105-#108).

Function Before After
_get_provider_safe() N/A Returns provider or None gracefully
_get_quality_score_from_metadata() Regex patterns Real QualityTier lookup
_get_cost_score_from_metadata() Regex patterns Real pricing data
_meets_context_requirement() Always True Real context window filtering

Quality Tier Scores:
  • FRONTIER: 0.95
  • STANDARD: 0.75
  • ECONOMY: 0.55
  • LOCAL: 0.40

Graceful Degradation: When metadata is unavailable, scoring falls back to heuristic estimates.
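
A sketch of the tier-to-score lookup that replaced the regex heuristics, using the QualityTier enum from the metadata module and the scores listed above; the enum member names, import path, and the 0.5 fallback value are assumptions.

from llm_council.metadata.types import QualityTier  # enum from the metadata module

QUALITY_TIER_SCORES = {
    QualityTier.FRONTIER: 0.95,
    QualityTier.STANDARD: 0.75,
    QualityTier.ECONOMY: 0.55,
    QualityTier.LOCAL: 0.40,
}


def quality_score_from_metadata(model_info) -> float:
    """Real QualityTier lookup; heuristic midpoint when metadata is unavailable."""
    if model_info is None:
        return 0.5  # graceful degradation (assumed fallback value)
    return QUALITY_TIER_SCORES.get(model_info.quality_tier, 0.5)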

Phase 2: Reasoning Parameter Optimization (v0.16.x) ✅ IMPLEMENTED

Goal: Automatic reasoning parameter configuration for capable models.

  • [x] Detect reasoning-capable models from registry metadata
  • [x] Apply reasoning_effort parameter based on tier (quick=minimal, reasoning=high)
  • [x] Calculate budget_tokens per effort level
  • [x] Add task-specific parameter profiles (math→high effort, creative→minimal)
  • [x] Update gateway to pass reasoning parameters to OpenRouter
  • [x] Track reasoning token usage for cost optimization

Implementation Details (2025-12-24):

Implemented using TDD with 80 new tests (1299 total tests pass).

Module Structure: src/llm_council/reasoning/

File Purpose
types.py ReasoningEffort enum, ReasoningConfig frozen dataclass, should_apply_reasoning()
tracker.py ReasoningUsage, AggregatedUsage, extract_reasoning_usage(), aggregate_reasoning_usage()
__init__.py Module exports

Tier-Effort Mapping:
  • quick → MINIMAL (10%)
  • balanced → LOW (20%)
  • high → MEDIUM (50%)
  • reasoning → HIGH (80%)

Domain Overrides: math → HIGH, coding → MEDIUM, creative → MINIMAL

Stage Configuration:
  • stage1: true (primary responses)
  • stage2: false (peer reviews)
  • stage3: true (synthesis)
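
A condensed sketch of how these pieces fit together, reusing the names from the module table (ReasoningEffort, should_apply_reasoning); the enum values, signatures, and helper effort_for() are assumptions about reasoning/types.py, not its actual contents.

from enum import Enum
from typing import Optional


class ReasoningEffort(str, Enum):
    MINIMAL = "minimal"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


TIER_EFFORT = {
    "quick": ReasoningEffort.MINIMAL,    # 10% budget ratio
    "balanced": ReasoningEffort.LOW,     # 20%
    "high": ReasoningEffort.MEDIUM,      # 50%
    "reasoning": ReasoningEffort.HIGH,   # 80%
}
DOMAIN_OVERRIDES = {
    "math": ReasoningEffort.HIGH,
    "coding": ReasoningEffort.MEDIUM,
    "creative": ReasoningEffort.MINIMAL,
}
REASONING_STAGES = {"stage1": True, "stage2": False, "stage3": True}


def effort_for(tier: str, domain: Optional[str] = None) -> ReasoningEffort:
    """Tier sets the baseline effort; a recognized domain overrides it."""
    return DOMAIN_OVERRIDES.get(domain, TIER_EFFORT[tier])


def should_apply_reasoning(model_supports_reasoning: bool, stage: str) -> bool:
    """Apply reasoning params only for capable models on stages 1 and 3."""
    return model_supports_reasoning and REASONING_STAGES.get(stage, False)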

GitHub Issues: #97-#100 (all completed)

Validation Gate: ✅ PASSED
  • Reasoning parameters correctly applied for all reasoning-tier queries
  • Token usage tracking shows expected budget allocation
  • No regressions in non-reasoning tiers (1299 tests pass)

Phase 3: Internal Performance Tracking (v0.17.x) ✅ IMPLEMENTED

Council Recommendation: Instead of scraping external benchmarks (high maintenance risk), implement internal performance tracking based on actual council session outcomes.

  • [x] Track model performance per council session:
  • Borda score received (ModelSessionMetric.borda_score)
  • Response latency (ModelSessionMetric.latency_ms)
  • Parse success rate (ModelSessionMetric.parse_success)
  • Reasoning quality (optional reasoning_tokens_used)
  • [x] Build Internal Performance Index from historical sessions
  • InternalPerformanceTracker with rolling window aggregation
  • ModelPerformanceIndex with mean_borda_score, p50/p95_latency, parse_success_rate
  • [x] Use internal metrics for quality scoring (replaces external benchmarks)
  • get_quality_score() returns 0-100 normalized score
  • Cold start: unknown models get neutral score (50)
  • [x] Implement rolling window decay (recent sessions weighted higher)
  • Exponential decay: weight = exp(-days_ago / decay_days)
  • Default decay_days = 30

Implementation Details:
  • src/llm_council/performance/ module (4 files, ~700 lines)
  • 70 TDD tests in tests/test_performance_*.py
  • JSONL storage pattern (follows bias_persistence.py)
  • Configuration via PerformanceTrackerConfig in unified_config.py

Validation Gate: Phase 3 complete when:
  • 100+ sessions tracked with metrics (tracked via confidence_level=HIGH)
  • Internal quality scores correlate with Borda outcomes (by design)
  • Model selection uses quality_score from tracker

Phase 4: External Benchmark Integration (DEFERRED) ⏸️

Council Warning: External benchmark scraping is HIGH-RISK due to:
  • API instability (LMArena, LiveBench change formats frequently)
  • Maintenance burden (scrapers break silently)
  • Data staleness (monthly updates don't reflect rapid model changes)

Deferred until: Internal Performance Tracking validates the value of quality metrics.

If implemented:
  • [ ] Start with manual JSON snapshots (not automated scrapers)
  • [ ] Implement LMArena Elo as optional quality boost (not required)
  • [ ] LiveBench for contamination-free validation only
  • [ ] Create benchmark staleness alerts (>30 days = warning)


Internal Performance Tracker

Council Recommendation: Build quality metrics from actual council session outcomes rather than external benchmarks.

Performance Metrics Schema

from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class ModelSessionMetric:
    """Performance data from a single council session."""
    session_id: str
    model_id: str
    timestamp: datetime

    # Stage 1 metrics
    response_latency_ms: int
    response_length: int
    parse_success: bool

    # Stage 2 metrics (from peer review)
    borda_score: float              # 0.0 - N (N = council size)
    normalized_rank: float          # 0.0 - 1.0 (1.0 = best)
    rubric_scores: Optional[Dict[str, float]]  # If rubric scoring enabled

    # Stage 3 metrics (from chairman selection)
    selected_for_synthesis: bool    # Was this response referenced?

@dataclass
class ModelPerformanceIndex:
    """Aggregated performance for a model."""
    model_id: str
    sample_size: int
    last_updated: datetime

    # Aggregated metrics
    mean_borda_score: float
    mean_normalized_rank: float
    p50_latency_ms: int
    p95_latency_ms: int
    parse_success_rate: float
    selection_rate: float           # How often selected for synthesis

    # Confidence
    confidence: str  # INSUFFICIENT (<10), PRELIMINARY (10-30), MODERATE (30-100), HIGH (>100)

class InternalPerformanceTracker:
    """Track and aggregate model performance from council sessions."""

    def __init__(self, store_path: Path, decay_days: int = 30):
        self.store_path = store_path
        self.decay_days = decay_days

    def record_session(self, session_metrics: List[ModelSessionMetric]) -> None:
        """Record metrics from a completed council session."""
        # Atomic append to JSONL store
        ...

    def get_model_index(self, model_id: str) -> ModelPerformanceIndex:
        """Get aggregated performance for a model with rolling window."""
        # Apply exponential decay to older sessions
        # Recent sessions weighted higher
        ...

    def get_quality_score(self, model_id: str) -> float:
        """Get normalized quality score (0-100) for model selection."""
        index = self.get_model_index(model_id)
        if index.confidence == "INSUFFICIENT":
            return 50.0  # Default neutral score
        return index.mean_normalized_rank * 100

Integration with Selection Algorithm

def select_tier_models(tier: str, ...) -> List[str]:
    # ... hard constraints ...

    for model in eligible:
        # Use INTERNAL performance tracker instead of external benchmarks
        quality_score = performance_tracker.get_quality_score(model.model_id)
        # ... rest of scoring with tier-specific weights ...

Open Questions (Council Addressed)

Resolved by Council Review

Question Council Answer
Should benchmark scores override tier selection? No. Tiers represent user intent (speed vs quality tradeoff). Benchmarks inform selection within tier.
How to handle new models with no data? Default neutral score (50). Use provider metadata only until internal performance data accumulates.
Balance between performance and cost? Tier-specific. Quick tier: yes, select cheaper. Reasoning tier: never compromise on quality.
Auto-apply reasoning parameters? Yes, by tier. Reasoning tier = high effort, quick tier = minimal effort.
Handle benchmark gaming? Use internal metrics. Council session outcomes are harder to game than public benchmarks.

Remaining Open Questions

  1. What sample size validates Internal Performance Index?
     • Council suggested 100+ sessions for HIGH confidence
     • Is 30+ sessions sufficient for MODERATE confidence?

  2. Should models with LOW internal scores be automatically demoted?
     • Threshold for exclusion from tier pools?
     • Grace period for new models?

  3. How to bootstrap Internal Performance Tracker?
     • Run shadow sessions with all available models?
     • Start with static config and learn incrementally?

Issues Identified in Full Quorum Review

A. Cold Start Problem (Claude, Gemini)

"When a new model appears in OpenRouter, it has zero internal performance data."

Recommended Solutions:
  • Assign temporary "phantom score" equivalent to tier average until 10+ samples
  • Implement Epsilon-Greedy exploration (small % of requests try new models)
  • Minimum sessions required before model enters regular rotation
  • Manual allowlist for high-profile new releases

B. Borda Score Normalization (Claude)

"A 5-model session gives max score of 4; an 8-model session gives max of 7."

Solution: Normalize to percentile rank (0.0-1.0) rather than raw Borda counts:

normalized_rank = (council_size - borda_position) / (council_size - 1)

where borda_position is the model's 1-indexed rank within the session (1 = best), so the top response maps to 1.0 and the bottom to 0.0 regardless of council size.

C. Parse Success Definition (Claude)

Define parse success as ALL of:
  • Valid JSON returned (if JSON expected)
  • Schema-compliant response
  • Extractable vote/rationale for Stage 2

D. Anti-Herding Edge Case (Gemini)

"If only 2 models pass hard constraints, the system might oscillate wildly."

Solution: Disable Anti-Herding when eligible model count < 3.

E. Degradation Behavior (Claude)

"What happens when ALL eligible models for a tier fall below acceptable thresholds?"

Fallback Chain:
  1. Warn user and proceed with best-available
  2. Escalate to adjacent tier (quick→balanced, balanced→high)
  3. Fall back to static config as last resort
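
The escalation path above as a small helper; select_tier_models(), fallback_to_static_config(), and logger refer to the selection sketch earlier, and the whole function is illustrative rather than the shipped behavior.

from typing import List

# Escalation order from the fallback chain above (quick → balanced, balanced → high)
TIER_ESCALATION = {"quick": "balanced", "balanced": "high"}


def select_with_fallback(tier: str, **kwargs) -> List[str]:
    """Degradation chain: best-available, then adjacent tier, then static config."""
    models = select_tier_models(tier, **kwargs)
    if models:
        return models  # step 1: proceed with best-available (caller logs a warning)
    next_tier = TIER_ESCALATION.get(tier)
    if next_tier is not None:
        logger.warning("No eligible models for tier=%s; escalating to %s", tier, next_tier)
        return select_with_fallback(next_tier, **kwargs)  # step 2: adjacent tier
    return fallback_to_static_config(tier)                # step 3: static config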


References

External Sources

Research