Offline-First LLM Orchestration

Your LLM system shouldn't crash when OpenAI is down. Here's how we built a metadata layer that survives network failures.


LLM Council's first council review was harsh: "You've built the core brain of your system on APIs you don't control."

They were right. Our initial design required live API calls to OpenRouter just to start. For a self-hosted project, this was unacceptable.

The Sovereign Orchestrator Philosophy

We adopted a principle we call the "Sovereign Orchestrator":

External services are plugins, not foundations. If the internet is disconnected, the software must still boot and run—even if degraded.

The Problem: Brittle Dependencies

Our first implementation:

# DON'T DO THIS: Fails when external API is unavailable
import httpx

async def get_model_context_window(model_id: str) -> int:
    async with httpx.AsyncClient() as client:
        response = await client.get("https://openrouter.ai/api/v1/models")
        models = response.json()["data"]
        for model in models:
            if model["id"] == model_id:
                return model["context_length"]
    raise ModelNotFoundError(model_id)

Problems:

  • Cold start failure: System can't initialize without network
  • Runtime brittleness: API rate limits or outages break production
  • No fallback: Unknown models cause hard failures

The Solution: Priority Chain Architecture

We replaced direct API calls with a three-tier priority chain:

class StaticRegistryProvider:
    """Offline-safe metadata provider (ADR-026)."""

    def get_context_window(self, model_id: str) -> int:
        # 1. Check bundled registry (always available)
        info = self._registry.get(model_id)
        if info:
            return info.context_window

        # 2. Try LiteLLM library (installed, no network)
        litellm_window = self._litellm_adapter.get_context_window(model_id)
        if litellm_window is not None:
            return litellm_window

        # 3. Safe default (conservative—triggers truncation warnings, not crashes)
        logger.warning(f"Using default context window for unknown model: {model_id}")
        return 4096

The key insight: never crash, always return something usable.
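
A rough usage sketch of that behavior (the no-argument constructor and the model IDs are assumptions for illustration):

# Illustrative only: wiring of the registry and LiteLLM adapter is omitted in the excerpt above.
provider = StaticRegistryProvider()

provider.get_context_window("openai/gpt-4o")          # 128000, straight from the bundled registry
provider.get_context_window("acme/unreleased-model")  # 4096 default, plus a warning in the logs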

Bundled Model Registry

We ship a YAML registry with 31 models from 8 providers:

# src/llm_council/models/registry.yaml - Shipped with package
version: "1.0"
updated: "2025-12-23"
models:
  - id: "openai/gpt-4o"
    context_window: 128000
    pricing:
      prompt: 0.0025
      completion: 0.01
    quality_tier: "frontier"

  - id: "anthropic/claude-opus-4.5"
    context_window: 200000
    pricing:
      prompt: 0.015
      completion: 0.075
    quality_tier: "frontier"

  - id: "ollama/llama3.2"
    context_window: 128000
    pricing:
      prompt: 0
      completion: 0
    quality_tier: "local"

This file is bundled with the package. Even on an air-gapped server, model metadata works.

Staleness management: The registry is updated with each release. Between releases, the dynamic provider (when enabled) fetches fresh data. The static registry is the floor, not the ceiling.
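
One way to load the bundled file without hard-coding a path is importlib.resources, which resolves it wherever the package is installed. A minimal sketch, assuming the package layout from the comment above and a simplified ModelInfo (the real record presumably carries pricing as well):

from dataclasses import dataclass
from importlib import resources
from typing import Dict

import yaml


@dataclass(frozen=True)
class ModelInfo:
    """Simplified registry entry for illustration."""
    id: str
    context_window: int
    quality_tier: str


def load_bundled_registry() -> Dict[str, ModelInfo]:
    """Read registry.yaml shipped inside the installed package; no network involved."""
    text = resources.files("llm_council.models").joinpath("registry.yaml").read_text()
    data = yaml.safe_load(text)
    return {
        entry["id"]: ModelInfo(entry["id"], entry["context_window"], entry["quality_tier"])
        for entry in data["models"]
    }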

Offline Mode

For environments with no external connectivity:

export LLM_COUNCIL_OFFLINE=true

When offline mode is enabled:

  1. Uses StaticRegistryProvider exclusively
  2. Disables all external metadata calls
  3. Logs info about limited/stale metadata
  4. All core council operations succeed

def get_provider() -> MetadataProvider:
    """Factory function for metadata provider.

    Note: All providers implement the same sync interface.
    The DynamicMetadataProvider uses async internally but exposes
    sync methods via run_in_executor for interface consistency.
    """
    if is_offline_mode():
        return StaticRegistryProvider()

    if os.environ.get("LLM_COUNCIL_MODEL_INTELLIGENCE") == "true":
        return DynamicMetadataProvider()

    return StaticRegistryProvider()
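
The is_offline_mode() helper isn't shown here; a plausible sketch is a simple environment-variable check (the exact set of accepted truthy values is an assumption):

import os


def is_offline_mode() -> bool:
    """Sketch: treat LLM_COUNCIL_OFFLINE=true/1/yes as offline."""
    return os.environ.get("LLM_COUNCIL_OFFLINE", "").strip().lower() in {"1", "true", "yes"}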

LiteLLM as Hidden Fallback

LiteLLM bundles metadata for 100+ models. We use it as a second-tier fallback with lazy loading:

from typing import Any, Dict, Optional


class LiteLLMAdapter:
    """Lazy-loaded LiteLLM metadata adapter."""

    def __init__(self):
        self._loaded = False
        self._model_map: Dict[str, Any] = {}

    def _ensure_loaded(self) -> None:
        if self._loaded:
            return
        try:
            import litellm
            self._model_map = getattr(litellm, "model_cost", {})
            self._loaded = True
        except ImportError:
            # LiteLLM not installed—that's fine
            self._loaded = True

    def get_context_window(self, model_id: str) -> Optional[int]:
        self._ensure_loaded()

        # Try full ID first (preserves provider context)
        if model_id in self._model_map:
            info = self._model_map[model_id]
            return info.get("max_input_tokens") or info.get("max_tokens")

        # Fallback: try without provider prefix
        # Only for well-known models where provider doesn't affect limits
        short_id = model_id.split("/")[-1]
        info = self._model_map.get(short_id)
        if info:
            return info.get("max_input_tokens") or info.get("max_tokens")

        return None

Benefits of lazy loading:

  • No startup penalty if LiteLLM isn't needed
  • No crash if LiteLLM isn't installed
  • Graceful degradation with stale versions
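
A short usage sketch of that fallback behavior (the model IDs and the resolved limit are illustrative):

adapter = LiteLLMAdapter()

# With litellm installed, a known model resolves from its bundled model_cost map.
adapter.get_context_window("openai/gpt-4o")       # an integer limit, e.g. 128000

# Unknown models, or a missing litellm install, fall through to None,
# which StaticRegistryProvider converts into the conservative 4096 default.
adapter.get_context_window("acme/mystery-model")  # None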

Dynamic Metadata (Optional Enhancement)

For environments with connectivity, we offer dynamic metadata via OpenRouter:

class DynamicMetadataProvider:
    """Real-time metadata from OpenRouter API."""

    def __init__(self):
        self._cache = TTLCache(ttl_seconds=3600)  # 1 hour
        self._static_fallback = StaticRegistryProvider()

    def get_model_info(self, model_id: str) -> Optional[ModelInfo]:
        # Check cache first
        cached = self._cache.get(model_id)
        if cached is not None:
            return cached

        # Try API (blocking—runs in thread pool for async callers)
        try:
            info = self._fetch_from_api_sync(model_id)
            self._cache.set(model_id, info)
            return info
        except (NetworkError, RateLimitError) as e:
            logger.warning(f"API fetch failed for {model_id}: {e}. Using static fallback.")
            return self._static_fallback.get_model_info(model_id)

The dynamic provider wraps the static provider—never replacing it, only enhancing it.
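
The TTLCache used above is just a small time-based cache. A minimal sketch of the idea, not the project's exact implementation:

import time
from typing import Any, Dict, Optional, Tuple


class TTLCache:
    """Minimal time-to-live cache: entries expire ttl_seconds after they are stored."""

    def __init__(self, ttl_seconds: float):
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self._ttl:
            # An expired entry counts as a miss and is dropped.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (time.monotonic(), value)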

The Tradeoff: Freshness vs. Reliability

Mode      Metadata Source    Freshness                  Reliability
Offline   Bundled YAML       Stale (package version)    100%
Static    YAML + LiteLLM     Days-weeks old             100%
Dynamic   API + Cache        Minutes old                95%+

We default to static mode. Dynamic requires explicit opt-in:

export LLM_COUNCIL_MODEL_INTELLIGENCE=true

What This Enables

With offline-first architecture, LLM Council works in:

  • Air-gapped environments: Government, healthcare, finance
  • Intermittent connectivity: Mobile, edge deployments
  • Development without API keys: Test locally first
  • CI/CD pipelines: No external dependencies in tests (see the sketch after this list)
  • Self-hosted with local models: Ollama, llama.cpp
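
The CI/CD point is easy to make concrete: a test can force offline mode and assert that the factory returns the static provider. A sketch using pytest; the import path here is hypothetical:

import pytest

from llm_council.metadata import StaticRegistryProvider, get_provider  # hypothetical import path


def test_offline_mode_uses_static_provider(monkeypatch: pytest.MonkeyPatch) -> None:
    # Force offline mode for this test only; no external metadata calls are possible.
    monkeypatch.setenv("LLM_COUNCIL_OFFLINE", "true")
    assert isinstance(get_provider(), StaticRegistryProvider)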

Practical Example: New Model Appears

When a new model like openai/o3 is released:

Without offline-first design:

API returns unknown model → Exception → System crash

With priority chain:

Registry: not found
LiteLLM: not found
Default: return 4096 (with warning log)
→ System continues with degraded metadata
→ Update registry.yaml in next release

The system runs with stale data until you update. Stale is better than broken.


This is post 2 of 7. Next: Why Majority Vote Fails for Small Groups


LLM Council is open source: github.com/amiable-dev/llm-council