ADR-027: Frontier Tier

Status: ACCEPTED (Revised per Council Review 2025-12-24)
Date: 2025-12-24
Decision Makers: Engineering, Architecture
Extends: ADR-022 (Tier System)
Council Review: Reasoning tier (gpt-5.2-pro, claude-opus-4.5, gemini-3-pro-preview, grok-4.1-fast)

Context

The current tier system (ADR-022) defines four confidence tiers:

- quick - Fast responses, low latency priority
- balanced - General use, balanced priorities
- high - Quality deliberation, proven stable models
- reasoning - Deep analysis with extended thinking

Gap Identified: There is no tier for evaluating cutting-edge, preview, or beta models before they are promoted to production use in the high tier.

Problem: New models (e.g., GPT-5.2-pro, Gemini 3 Pro Preview) cannot be safely tested in council deliberations without risking production stability. The high tier explicitly requires proven stable models (30+ days), creating a chicken-and-egg problem.

Decision

Introduce a new confidence tier called frontier for cutting-edge/preview model evaluation.

Tier Definition

| Attribute | `high` | `frontier` |
|---|---|---|
| Purpose | Production deliberation | Max-capability evaluation |
| Stability | Proven (30+ days) | New/beta accepted |
| Preview models | Prohibited | Allowed |
| Rate limits | Standard | May be restricted |
| Pricing | Known/stable | May fluctuate |
| Risk tolerance | Low | High |
| Voting Authority | Full | Advisory only (Shadow Mode) |

Shadow Mode (Council Recommendation)

Critical Design Decision: Frontier models operate in Shadow Mode by default.

from enum import Enum

class VotingAuthority(Enum):
    FULL = "full"           # Vote counts in consensus
    ADVISORY = "advisory"   # Logged/evaluated, vote weight = 0.0
    EXCLUDED = "excluded"   # Not included in deliberation

# Default voting authority by tier
TIER_VOTING_AUTHORITY = {
    "quick": VotingAuthority.FULL,
    "balanced": VotingAuthority.FULL,
    "high": VotingAuthority.FULL,
    "reasoning": VotingAuthority.FULL,
    "frontier": VotingAuthority.ADVISORY,  # Shadow mode by default
}

Rationale: An experimental, hallucinating model could break a tie or poison the context of a production workflow. Shadow Mode ensures frontier models can be evaluated without affecting council decisions.
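To make the zero-weight semantics concrete, the following is a minimal sketch of a tally that records advisory votes but leaves them out of the count. The `Vote` record and the `tally` function are hypothetical names used only for illustration; the enum is restated so the block runs on its own.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class VotingAuthority(Enum):
    FULL = "full"
    ADVISORY = "advisory"
    EXCLUDED = "excluded"


@dataclass(frozen=True)
class Vote:
    """Hypothetical vote record; field names are illustrative."""
    model_id: str
    choice: str
    authority: VotingAuthority


def tally(votes: List[Vote]) -> Tuple[Optional[str], List[Vote]]:
    """Return (winning choice, shadow votes).

    Only FULL votes are counted. ADVISORY votes are returned separately
    so they can be logged and compared against the consensus.
    """
    counted = Counter(v.choice for v in votes if v.authority is VotingAuthority.FULL)
    shadow = [v for v in votes if v.authority is VotingAuthority.ADVISORY]
    winner = counted.most_common(1)[0][0] if counted else None
    return winner, shadow


votes = [
    Vote("model-a", "approve", VotingAuthority.FULL),
    Vote("model-b", "reject", VotingAuthority.FULL),
    Vote("model-c", "approve", VotingAuthority.FULL),
    Vote("preview-x", "reject", VotingAuthority.ADVISORY),
    Vote("preview-y", "reject", VotingAuthority.ADVISORY),
]
winner, shadow = tally(votes)
```

In this example the two advisory "reject" votes would have flipped a raw head count, but because they carry no weight the full-authority majority still decides the outcome.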

Override: Operators may explicitly enable full voting for frontier models via configuration:

council:
  tiers:
    frontier:
      voting_authority: full  # Override shadow mode

Tier Intersection: Reasoning vs Frontier

Conflict Resolution: Models can belong to multiple conceptual categories (e.g., o1-preview is both "reasoning" and "frontier").

Precedence Rule:

1. If user requests frontier, reasoning models ARE included (frontier is capability-focused)
2. If user requests reasoning, preview/beta models ARE excluded unless allow_preview: true
3. frontier acts as an override flag that permits preview models within other tier requests

def resolve_tier_intersection(
    requested_tier: str,
    model_info: ModelInfo,
    allow_preview: bool = False
) -> bool:
    """Determine if model qualifies for requested tier."""
    if requested_tier == "frontier":
        # Frontier membership depends only on quality tier; preview status is
        # not checked, so preview models are admitted
        return model_info.quality_tier == QualityTier.FRONTIER

    if requested_tier == "reasoning":
        # Reasoning excludes previews by default
        if model_info.is_preview and not allow_preview:
            return False
        return model_info.supports_reasoning

    # Other tiers: standard logic
    return _standard_tier_qualification(requested_tier, model_info)
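The precedence rules can be exercised with a small worked example. The sketch below uses stand-in `QualityTier` and `ModelInfo` types that carry only the fields the rule reads, and a condensed `qualifies` helper that restates the frontier and reasoning branches; all three are illustrative, and treating o1-preview as having a frontier quality tier is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class QualityTier(Enum):
    # Stand-in values; the real enum lives elsewhere in the codebase.
    STANDARD = "standard"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class ModelInfo:
    # Stand-in carrying only the fields the precedence rule reads.
    model_id: str
    quality_tier: QualityTier
    is_preview: bool
    supports_reasoning: bool


def qualifies(requested_tier: str, m: ModelInfo, allow_preview: bool = False) -> bool:
    """Condensed restatement of the frontier and reasoning branches."""
    if requested_tier == "frontier":
        return m.quality_tier is QualityTier.FRONTIER
    if requested_tier == "reasoning":
        return m.supports_reasoning and (allow_preview or not m.is_preview)
    raise ValueError(f"tier {requested_tier!r} is outside this sketch")


# Assumed metadata for a model that is both a reasoner and a preview.
preview_reasoner = ModelInfo(
    "o1-preview", QualityTier.FRONTIER, is_preview=True, supports_reasoning=True
)

outcomes = {
    "frontier": qualifies("frontier", preview_reasoner),
    "reasoning": qualifies("reasoning", preview_reasoner),
    "reasoning+allow_preview": qualifies("reasoning", preview_reasoner, allow_preview=True),
}
```

Under these assumptions the model is admitted to a frontier request, rejected from a plain reasoning request, and admitted to a reasoning request once `allow_preview` is set.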

Tier Weights (Revised per Council)

TIER_WEIGHTS = {
    # ... existing tiers ...
    "frontier": {
        "quality": 0.85,      # INCREASED: Intelligence is the primary driver
        "diversity": 0.05,    # DECREASED: Don't rotate for rotation's sake
        "availability": 0.05, # DECREASED: Accept instability in beta
        "latency": 0.00,      # Irrelevant for capability testing
        "cost": 0.05,         # Minor guardrail against extreme pricing
    },
}

Rationale (Council Feedback):

- Quality 85%: When testing the frontier, you want the absolute smartest model available
- Diversity 5%: You often want to test one specific breakthrough model, not load-balance
- Availability 5%: Preview APIs often have aggressive rate limits or outages
- Latency 0%: Willing to wait for cutting-edge responses
- Cost 5%: Minor guardrail to prevent extreme cost surprises
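To see what these weights do in practice, here is a minimal sketch that scores two made-up candidates with a plain weighted sum. It assumes every dimension is already normalized to [0, 1] with higher meaning better (so cost and latency scores are inverted upstream); the actual selection logic may combine the weights differently, and `weighted_score` is a hypothetical helper.

```python
import math
from typing import Dict

FRONTIER_WEIGHTS: Dict[str, float] = {
    "quality": 0.85,
    "diversity": 0.05,
    "availability": 0.05,
    "latency": 0.00,
    "cost": 0.05,
}


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    missing = weights.keys() - metrics.keys()
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    return sum(weights[name] * metrics[name] for name in weights)


# The weights should form a convex combination.
assert math.isclose(sum(FRONTIER_WEIGHTS.values()), 1.0)

# Illustrative numbers only, not measurements.
strong_but_flaky = {"quality": 0.95, "diversity": 0.5, "availability": 0.4, "latency": 0.1, "cost": 0.2}
stable_but_weaker = {"quality": 0.80, "diversity": 0.5, "availability": 1.0, "latency": 0.9, "cost": 0.9}

flaky_score = weighted_score(strong_but_flaky, FRONTIER_WEIGHTS)
stable_score = weighted_score(stable_but_weaker, FRONTIER_WEIGHTS)
```

With these inputs the higher-quality candidate wins despite much worse availability, latency, and cost, which is the behavior the council's rationale asks for.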

Graduation Criteria: Frontier → High

Council Requirement: Explicit metrics for model promotion.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GraduationCriteria:
    """Criteria for promoting model from frontier to high tier."""
    min_age_days: int = 30
    min_completed_sessions: int = 100
    max_error_rate: float = 0.02        # at most 2% errors
    min_quality_percentile: float = 0.75  # >= 75th percentile vs high-tier baseline
    api_stability: bool = True           # No breaking changes in evaluation period
    provider_ga_status: bool = True      # Provider removed "preview/beta" label

def should_graduate(
    model_id: str,
    tracker: PerformanceTracker,
    criteria: GraduationCriteria
) -> Tuple[bool, List[str]]:
    """Check the numeric graduation criteria; returns (passed, failure reasons).

    api_stability and provider_ga_status are not evaluated here.
    """
    stats = tracker.get_model_stats(model_id)
    failures = []

    if stats.days_tracked < criteria.min_age_days:
        failures.append(f"age: {stats.days_tracked} < {criteria.min_age_days} days")

    if stats.completed_sessions < criteria.min_completed_sessions:
        failures.append(f"sessions: {stats.completed_sessions} < {criteria.min_completed_sessions}")

    if stats.error_rate > criteria.max_error_rate:
        failures.append(f"error_rate: {stats.error_rate:.1%} > {criteria.max_error_rate:.1%}")

    if stats.quality_percentile < criteria.min_quality_percentile:
        failures.append(f"quality: {stats.quality_percentile:.0%} < {criteria.min_quality_percentile:.0%}")

    return (len(failures) == 0, failures)
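The function above covers the four numeric thresholds. The two boolean criteria (api_stability and provider_ga_status) need some other source of truth; the sketch below shows one way they might be checked, assuming a hypothetical `ProviderStatus` record that this ADR does not define.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProviderStatus:
    """Hypothetical record; neither field is defined in this ADR."""
    breaking_changes_in_window: int
    is_generally_available: bool


def check_provider_gates(
    status: ProviderStatus,
    require_api_stability: bool = True,
    require_ga: bool = True,
) -> List[str]:
    """Return failure reasons for the non-numeric graduation gates."""
    failures: List[str] = []
    if require_api_stability and status.breaking_changes_in_window > 0:
        failures.append(
            f"api_stability: {status.breaking_changes_in_window} "
            "breaking change(s) in evaluation period"
        )
    if require_ga and not status.is_generally_available:
        failures.append("provider_ga_status: still labeled preview/beta")
    return failures


ready = check_provider_gates(ProviderStatus(0, True))
not_ready = check_provider_gates(ProviderStatus(2, False))
```

The returned strings follow the same "name: detail" shape as the failure messages above, so the two lists could be concatenated before deciding on promotion.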

Cost Ceiling Protection

Council Requirement: Prevent runaway costs from volatile preview pricing.

from typing import Optional, Tuple

# Frontier allows up to 5x the high-tier average cost
FRONTIER_COST_MULTIPLIER = 5.0

def apply_cost_ceiling(
    model_id: str,
    model_cost: float,
    tier: str,
    high_tier_avg_cost: float
) -> Tuple[bool, Optional[str]]:
    """Return (allowed, reason); only the frontier tier has a cost ceiling."""
    if tier != "frontier":
        return (True, None)

    ceiling = high_tier_avg_cost * FRONTIER_COST_MULTIPLIER

    if model_cost > ceiling:
        return (False, f"cost ${model_cost:.4f} exceeds ceiling ${ceiling:.4f}")

    return (True, None)
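As a worked example of the arithmetic: with a high-tier average of $0.02 per request, the 5x multiplier puts the frontier ceiling at $0.10. The sketch below applies that ceiling to a whole candidate pool; `partition_by_ceiling` is a hypothetical helper and the costs are made up.

```python
from typing import Dict, List, Tuple

FRONTIER_COST_MULTIPLIER = 5.0


def partition_by_ceiling(
    costs: Dict[str, float],
    high_tier_avg_cost: float,
    multiplier: float = FRONTIER_COST_MULTIPLIER,
) -> Tuple[List[str], List[str]]:
    """Split model ids into (within ceiling, over ceiling)."""
    ceiling = high_tier_avg_cost * multiplier
    allowed = [model for model, cost in costs.items() if cost <= ceiling]
    rejected = [model for model, cost in costs.items() if cost > ceiling]
    return allowed, rejected


# Made-up per-request costs in USD, for illustration only.
costs = {"preview-a": 0.08, "preview-b": 0.09, "preview-c": 0.25}
allowed, rejected = partition_by_ceiling(costs, high_tier_avg_cost=0.02)
```

Because the ceiling is relative to the high-tier average rather than a fixed dollar amount, it moves with production pricing and does not need manual retuning when that baseline changes.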

Hard Fallback

Council Requirement: Define behavior when frontier model fails.

async def execute_with_fallback(
    query: str,
    frontier_model: str,
    fallback_tier: str = "high"
) -> ModelResponse:
    """Execute frontier model with automatic fallback."""
    try:
        return await query_model(frontier_model, query, timeout=300)
    except (RateLimitError, TimeoutError, APIError) as e:
        logger.warning(f"Frontier model {frontier_model} failed: {e}. Falling back to {fallback_tier}")

        # Automatic degradation to the fallback tier (high by default)
        fallback_models = get_tier_models(fallback_tier)
        if not fallback_models:
            # Surface the original failure rather than an IndexError
            raise
        return await query_model(fallback_models[0], query)
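The fallback path can be demonstrated end to end with stubs. The sketch below is a variation rather than the function above: it walks the whole fallback pool instead of using only the first entry, and `RateLimitError`, `with_fallback`, and `fake_query_model` are stand-ins defined here for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List


class RateLimitError(Exception):
    """Stand-in for the provider client's rate-limit exception."""


async def with_fallback(
    query: str,
    primary: str,
    fallbacks: List[str],
    query_model: Callable[[str, str], Awaitable[str]],
) -> str:
    """Try the primary model, then each fallback in order."""
    try:
        return await query_model(primary, query)
    except (RateLimitError, asyncio.TimeoutError) as first_error:
        last_error: Exception = first_error
    for model in fallbacks:
        try:
            return await query_model(model, query)
        except (RateLimitError, asyncio.TimeoutError) as e:
            last_error = e
    raise RuntimeError(f"primary {primary} and all fallbacks failed") from last_error


async def fake_query_model(model: str, query: str) -> str:
    # Simulates a preview endpoint that is always rate limited.
    if model.startswith("preview/"):
        raise RateLimitError("429")
    return f"{model}: answer to {query!r}"


result = asyncio.run(
    with_fallback("ping", "preview/new-model", ["stable/model-a"], fake_query_model)
)
```

Passing `query_model` in as a parameter keeps the sketch testable without network access; in the council code the real client call would take its place.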

Privacy & Compliance Warning

Council Requirement: Document data handling differences for preview models.

**Privacy Notice:** Preview and beta models may have different data retention
policies than production models. Some providers reserve the right to use beta
API inputs for model training.

**Requirement:** PII must be scrubbed before sending prompts to frontier tier
unless the operator has verified the provider's data handling policy.
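As a rough illustration of the scrubbing step, the sketch below masks email addresses and phone-like digit runs before a prompt leaves the system. The patterns and the `scrub_pii` name are assumptions made for this example; regexes of this kind miss many forms of PII (names, addresses, account numbers) and are not a substitute for a vetted redaction tool.

```python
import re

# Deliberately simple patterns; they will produce false negatives.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional "+", then at least nine characters of digits and common separators,
# starting and ending on a digit.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(prompt: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    prompt = EMAIL.sub("[EMAIL]", prompt)
    return PHONE.sub("[PHONE]", prompt)


scrubbed = scrub_pii(
    "Contact jane.doe@example.com or +1 (555) 010-9999 about ticket 42."
)
```

Short numbers such as the ticket id are left alone because the phone pattern requires a longer run of digits and separators.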

Static Pool (Fallback)

DEFAULT_TIER_MODEL_POOLS = {
    # ... existing tiers ...
    "frontier": [
        "openai/gpt-5.2-pro",
        "anthropic/claude-opus-4.5",
        "google/gemini-3-pro-preview",
        "x-ai/grok-4",
        "deepseek/deepseek-r1",
    ],
}

Configuration

council:
  tiers:
    pools:
      frontier:
        models:
          - openai/gpt-5.2-pro
          - anthropic/claude-opus-4.5
          - google/gemini-3-pro-preview
        timeout_seconds: 300
        allow_preview: true
        allow_beta: true
        voting_authority: advisory  # Shadow mode default
        cost_ceiling_multiplier: 5.0
        fallback_tier: high

    graduation:
      min_age_days: 30
      min_completed_sessions: 100
      max_error_rate: 0.02
      min_quality_percentile: 0.75
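A configuration like this benefits from a sanity check at load time. The sketch below validates an already-parsed frontier pool dict; the function name and the specific rules (positive numbers, a fallback tier that is known and is not frontier itself) are assumptions, while the default values mirror the ones shown above.

```python
from typing import Any, Dict, List

KNOWN_TIERS = {"quick", "balanced", "high", "reasoning", "frontier"}
VOTING_AUTHORITIES = {"full", "advisory", "excluded"}


def validate_frontier_pool(pool: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the pool looks sane."""
    problems: List[str] = []
    if not pool.get("models"):
        problems.append("models: at least one model id is required")
    authority = pool.get("voting_authority", "advisory")
    if authority not in VOTING_AUTHORITIES:
        problems.append(f"voting_authority: unknown value {authority!r}")
    if pool.get("cost_ceiling_multiplier", 5.0) <= 0:
        problems.append("cost_ceiling_multiplier: must be positive")
    if pool.get("timeout_seconds", 300) <= 0:
        problems.append("timeout_seconds: must be positive")
    fallback = pool.get("fallback_tier", "high")
    if fallback not in KNOWN_TIERS or fallback == "frontier":
        problems.append(f"fallback_tier: {fallback!r} is not a usable fallback")
    return problems


frontier_pool = {
    "models": [
        "openai/gpt-5.2-pro",
        "anthropic/claude-opus-4.5",
        "google/gemini-3-pro-preview",
    ],
    "timeout_seconds": 300,
    "allow_preview": True,
    "allow_beta": True,
    "voting_authority": "advisory",
    "cost_ceiling_multiplier": 5.0,
    "fallback_tier": "high",
}
problems = validate_frontier_pool(frontier_pool)
```

Rejecting frontier as its own fallback avoids a degradation path that lands back on the same unstable pool.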

Consequences

Positive

- Safe environment for evaluating new models before production use
- Clear promotion path: frontier → high with explicit criteria
- Enables early adoption of cutting-edge capabilities
- Separates experimentation from production
- Shadow Mode protects council consensus from experimental failures

Negative

- Additional tier to maintain
- Frontier results may be less reliable
- Users must understand tier semantics
- Shadow Mode means frontier responses don't influence final decisions

Risks & Mitigations

| Risk | Mitigation |
|---|---|
| Hallucinating model poisons consensus | Shadow Mode (advisory only) |
| Cost overruns from volatile pricing | Cost ceiling (5x high-tier avg) |
| Preview model deprecation mid-evaluation | Hard fallback to high tier |
| Data privacy with beta APIs | PII scrubbing requirement |
| Reasoning/frontier tier confusion | Explicit precedence rules |

Implementation

Files to Modify

- src/llm_council/config.py - Add frontier to DEFAULT_TIER_MODEL_POOLS
- src/llm_council/metadata/selection.py - Add frontier to TIER_WEIGHTS (revised values)
- src/llm_council/tier_contract.py - Support frontier tier contracts
- src/llm_council/council.py - Implement Shadow Mode voting authority
- src/llm_council/metadata/intersection.py - NEW: Tier intersection logic
- src/llm_council/metadata/types.py - Add is_preview, supports_reasoning fields
- src/llm_council/frontier_fallback.py - Add event emission for fallbacks

Validation

- [x] Tests for select_tier_models(tier="frontier")
- [x] Tests for frontier tier weights
- [x] Tests for frontier tier contract creation
- [x] Tests for Shadow Mode voting (Issue #110, #111)
- [x] Tests for graduation criteria (Issue #112)
- [x] Tests for cost ceiling (Issue #113)
- [x] Tests for hard fallback (Issue #114)
- [x] Document frontier tier in CLAUDE.md

Gap Remediation (Peer Review 2025-12-24)

- [x] Tier intersection logic (Issue #119) - resolve_tier_intersection() in metadata/intersection.py
- [x] Shadow votes integration (Issue #117) - Wired into run_council_with_fallback, events emitted
- [x] Fallback wrapper integration (Issue #118) - Event emission in execute_with_fallback_detailed

Observability

# Metrics to emit
frontier.model.selected{model_id}
frontier.model.shadow_vote{model_id, agreed_with_consensus}
frontier.model.fallback_triggered{model_id, reason}
frontier.model.cost_ceiling_exceeded{model_id}
frontier.graduation.candidate{model_id}
frontier.graduation.promoted{model_id}
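The shadow_vote metric is the one most directly useful for graduation decisions, since it shows how often a frontier model would have agreed with the council. A minimal sketch of aggregating it follows; it assumes events arrive as (model_id, agreed_with_consensus) pairs, and `shadow_agreement_rates` is a hypothetical helper rather than part of the metrics pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def shadow_agreement_rates(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of shadow votes per model that matched the consensus."""
    total: Counter = Counter()
    agreed: Counter = Counter()
    for model_id, agreed_with_consensus in events:
        total[model_id] += 1
        if agreed_with_consensus:
            agreed[model_id] += 1
    return {model_id: agreed[model_id] / total[model_id] for model_id in total}


# Illustrative events only.
events = [
    ("preview-a", True),
    ("preview-a", True),
    ("preview-a", False),
    ("preview-b", False),
]
rates = shadow_agreement_rates(events)
```

A persistently low agreement rate is not necessarily bad (the frontier model may be right where the council is wrong), so this number is better read alongside quality scores than used as a gate on its own.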