ADR-012: MCP Server Reliability and Long-Running Operation Handling

Status: Accepted (Implemented via TDD)
Date: 2025-12-13
Decision Makers: Engineering


Context

The LLM Council MCP server performs multi-model deliberation that can take 30-60+ seconds to complete. This creates several problems when used with MCP clients (Claude Code, Claude Desktop):

Observed Issues

  1. Timeout failures: MCP clients have transport-layer timeouts (typically 30-60s) that can expire before the council finishes deliberating across 4+ models
  2. Empty results on timeout: When timeout occurs, the entire operation fails and returns empty results rather than partial data
  3. No visibility during execution: Users see no feedback while the council is working, leading to uncertainty about whether the operation is progressing or hung
  4. No health verification: No way to verify the MCP server is healthy before invoking expensive operations

Current Architecture

┌─────────────────────────────────────────────────────────┐
│  MCP Client (Claude Code/Desktop)                       │
│  - Invokes consult_council tool                         │
│  - Waits synchronously for response                     │
│  - Times out after N seconds (client-controlled)        │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│  MCP Server (llm-council)                               │
│  - Runs full 3-stage council synchronously              │
│  - Stage 1: Query all models (10-20s)                   │
│  - Stage 2: Peer review (15-30s)                        │
│  - Stage 3: Chairman synthesis (5-15s)                  │
│  - Returns only on completion or error                  │
└─────────────────────────────────────────────────────────┘

Total time: 30-65+ seconds for a 4-model council

Critical Constraint (Council Insight)

ctx.report_progress() does NOT extend client timeouts. Progress notifications improve UX but the server is still racing against the client's hard timeout. Internal time budgets must be strictly managed.
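
A minimal sketch of what a strictly managed budget can look like in practice; the TimeBudget helper and the 10-second synthesis reserve are illustrative assumptions, not part of the server:

import time

class TimeBudget:
    """Track a hard deadline so each stage can check the remaining budget."""

    def __init__(self, total_seconds: float):
        self.deadline = time.monotonic() + total_seconds

    def remaining(self) -> float:
        return max(0.0, self.deadline - time.monotonic())

# Usage: reserve time for synthesis before spending the rest on model queries
budget = TimeBudget(total_seconds=40.0)
stage1_timeout = budget.remaining() - 10.0  # keep ~10s in reserve for synthesis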


Decision

Implement a multi-layered reliability strategy:

1. Progress Notifications (Streaming Updates)

Use MCP's built-in progress notification mechanism to send real-time updates during council execution.

Implementation:

@mcp.tool()
async def consult_council(query: str, ctx: Context, include_details: bool = False) -> str:
    # One step per model in Stages 1 and 2, plus synthesis and finalize
    total_steps = len(COUNCIL_MODELS) * 2 + 2
    current_step = 0

    # Stage 1: Individual responses
    for model in COUNCIL_MODELS:
        await ctx.report_progress(current_step, total_steps, f"Querying {model}...")
        # ... query the model ...
        current_step += 1

    # Stage 2: Peer review
    await ctx.report_progress(current_step, total_steps, "Peer review in progress...")
    # ... continue with progress updates through synthesis and finalize

Benefits:

- Keeps the connection alive (prevents some timeout scenarios)
- Provides user visibility into operation progress
- Enables client-side timeout decisions based on stage

2. Partial Results on Failure

When timeout or partial failure occurs, return whatever data has been collected rather than failing entirely.

Tiered Timeout Strategy (Updated 2025-12-17):

Based on observed model response times (complex queries can take 40-60s per model), timeouts have been calibrated to prioritize completeness:

| Confidence Level | Timeout | Use Case |
|------------------|---------|----------|
| quick            | 30s     | Fast responses, may have fewer models |
| balanced         | 75s     | Most models respond |
| high             | 120s    | Full council deliberation (default) |

Note: Original council recommendation was 15s/25s/40s/50s tiered deadlines, but real-world latencies required significant increases.

Implementation:

async def run_full_council_with_fallback(query: str, synthesis_deadline: float = 40.0):
    results = {
        "synthesis": "",
        "model_responses": {},
        "metadata": {
            "status": "complete",  # or "partial", "failed"
            "completed_models": 0,
            "requested_models": len(COUNCIL_MODELS),
            "synthesis_type": "full"  # or "partial", "insufficient"
        }
    }

    try:
        async with asyncio.timeout(synthesis_deadline):  # Python 3.11+
            # Run full council with per-model timeouts
            stage1, stage2, stage3, meta = await run_full_council(query)
            results["synthesis"] = stage3.get("response", "")
            # ... populate model_responses
    except asyncio.TimeoutError:
        results["metadata"]["status"] = "partial"
        # Synthesize from whatever we have
        if len(results["model_responses"]) > 0:
            results["synthesis"] = await quick_synthesis(query, results["model_responses"])
            results["metadata"]["synthesis_type"] = "partial"

    return results

Implementation Note (2025-12-17): The above pseudocode has a subtle bug—when the timeout fires, asyncio cancels the inner coroutine before model_responses has been populated, so the partial data the except branch needs is already gone. The fix uses a shared dict that outlives the cancelled task:

# Create shared dict BEFORE the try block
shared_raw_responses: Dict[str, Any] = {}

async def run_council_pipeline():
    # Pass shared dict to model queries
    responses = await query_models_with_progress(
        models, messages,
        shared_results=shared_raw_responses  # Populated incrementally
    )
    # ... rest of pipeline

try:
    await asyncio.wait_for(run_council_pipeline(), timeout=deadline)
except asyncio.TimeoutError:
    # shared_raw_responses survives cancellation!
    # Build model_responses from it, marking missing models as "timeout"
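
For completeness, a sketch of the incremental-population side (query_models_with_progress is named above, but this body and the use of asyncio.gather are assumptions):

import asyncio
from typing import Any, Dict, List

async def query_models_with_progress(
    models: List[str],
    messages: list,
    shared_results: Dict[str, Any],
) -> Dict[str, Any]:
    async def query_one(model: str) -> None:
        response = await query_model(model, messages)
        # Write into the caller-owned dict as soon as each model finishes,
        # so completed responses survive if the outer wait_for cancels us
        shared_results[model] = response

    await asyncio.gather(*(query_one(m) for m in models))
    return shared_results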

Structured Result Schema (Council Recommendation):

{
  "synthesis": "Based on available responses...",
  "model_responses": {
    "gpt-4": {"status": "ok", "latency_ms": 12340, "response": "..."},
    "claude": {"status": "timeout", "error": "timeout after 25s"},
    "gemini": {"status": "ok", "latency_ms": 8920, "response": "..."},
    "llama": {"status": "rate_limited", "retry_after": 30}
  },
  "metadata": {
    "status": "partial",
    "completed_models": 2,
    "requested_models": 4,
    "synthesis_type": "partial",
    "warning": "This answer is based on 2 of 4 intended models; Claude and Llama did not respond."
  }
}

Failure Taxonomy (Council Addition):

| Failure Type | Handling |
|--------------|----------|
| Timeout | Return partial results + synthesis |
| Rate limiting (429) | Retry with backoff before falling back |
| Auth failure (401/403) | Fail fast, don't waste time on other calls |
| Network partition | Different retry strategy than timeout |
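
A sketch of how this taxonomy could map onto a call site. The httpx exception types are an assumption about the underlying HTTP client; query_model is the helper used elsewhere in this ADR:

import asyncio
import httpx

async def query_with_taxonomy(model: str, messages: list) -> dict:
    try:
        return {"status": "ok", "response": await query_model(model, messages)}
    except asyncio.TimeoutError:
        # Timeout: caller keeps partial results and synthesizes from them
        return {"status": "timeout", "error": "per-model deadline exceeded"}
    except httpx.HTTPStatusError as e:
        if e.response.status_code == 429:
            # Rate limited: retry with backoff before falling back
            return {"status": "rate_limited",
                    "retry_after": int(e.response.headers.get("Retry-After", 30))}
        if e.response.status_code in (401, 403):
            # Auth failure: fail fast rather than burn time on doomed calls
            raise
        return {"status": "error", "error": str(e)}
    except httpx.TransportError as e:
        # Network partition: needs a different retry strategy than timeout
        return {"status": "network_error", "error": str(e)}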

Fallback Synthesis Modes:

| Condition | Fallback |
|-----------|----------|
| Stage 1 complete, Stage 2 timeout | Chairman synthesizes from Stage 1 only (skip peer review) |
| Stage 1 partial (some models responded) | Synthesize from available responses |
| All models timeout | Return error with diagnostic info |
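
A compact selector for these modes might look like the following sketch (chairman_synthesis is a hypothetical name; quick_synthesis appears in the pseudocode above):

async def synthesize_with_fallback(query: str, stage1: dict, stage2: dict) -> tuple:
    if stage1 and stage2:
        # Normal path: chairman synthesizes from peer-reviewed responses
        return await chairman_synthesis(query, stage1, stage2), "full"
    if stage1:
        # Stage 2 timed out or only some models responded: use what we have
        return await quick_synthesis(query, stage1), "partial"
    # All models timed out: surface a diagnostic error instead of guessing
    raise RuntimeError("No model responses available for synthesis")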

3. Health Check Tool

Add a lightweight health check tool that verifies:

- The MCP server is running
- The OpenRouter API key is configured
- At least one model is reachable

Implementation:

import json
import time

@mcp.tool()
async def council_health_check() -> str:
    """
    Check LLM Council health before expensive operations.
    Returns status of API connectivity and estimated response time.
    """
    checks = {
        "api_key_configured": bool(OPENROUTER_API_KEY),
        "models_configured": len(COUNCIL_MODELS),
        "chairman_model": CHAIRMAN_MODEL,
        "estimated_duration_seconds": estimate_duration(len(COUNCIL_MODELS)),
        "api_reachable": False,
    }

    # Quick connectivity test (single cheap model, short prompt)
    if checks["api_key_configured"]:
        try:
            start = time.time()
            response = await query_model(
                "google/gemini-2.0-flash-001",  # Fast, cheap
                [{"role": "user", "content": "ping"}],
                timeout=10.0
            )
            checks["api_reachable"] = response is not None
            checks["latency_ms"] = int((time.time() - start) * 1000)
        except Exception as e:
            checks["error"] = str(e)

    return json.dumps(checks, indent=2)

4. Confidence Levels (Updated 2025-12-17)

Instead of a simple "fast mode" toggle, implement confidence levels that map to different timeout strategies:

@mcp.tool()
async def consult_council(
    query: str,
    ctx: Context,
    confidence: str = "high",  # "quick", "balanced", "high"
    include_details: bool = False
) -> str:
    """
    Args:
        confidence: "quick" (~30s), "balanced" (~75s), "high" (~120s, default)
    """
    configs = {
        "quick": {"models": 2, "timeout": 30},
        "balanced": {"models": 3, "timeout": 75},
        "high": {"models": len(COUNCIL_MODELS), "timeout": 120}
    }
    config = configs.get(confidence, configs["high"])
    # ... proceed with selected configuration

Progress Feedback: During model queries, progress updates show which models have responded:

✓ claude-opus-4.5 (1/4) | waiting: gpt-5.1, gemini-3-pro, grok-4
✓ gemini-3-pro (2/4) | waiting: gpt-5.1, grok-4
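
A trivial formatter can produce these lines; the function name and signature below are assumptions:

from typing import List

def format_progress(done: List[str], all_models: List[str]) -> str:
    """Render '✓ <latest> (n/total) | waiting: <rest>' after each completion."""
    waiting = [m for m in all_models if m not in done]
    return (f"✓ {done[-1]} ({len(done)}/{len(all_models)}) "
            f"| waiting: {', '.join(waiting)}")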

Alternative: Racing Pattern (Council Suggestion)

Query more models than needed, return when sufficient responses arrive:

# Query 5 models, return when 3 complete (first-past-the-post)
async def race_council(query: str, target_responses: int = 3):
    tasks = [asyncio.create_task(query_model(m, query)) for m in COUNCIL_MODELS[:5]]
    completed = []
    for coro in asyncio.as_completed(tasks):
        result = await coro
        if result:
            completed.append(result)
        if len(completed) >= target_responses:
            break
    # Cancel stragglers so abandoned queries stop consuming time and tokens
    for task in tasks:
        task.cancel()
    return completed

5. Tier-Sovereign Timeout Architecture (Added 2025-12-19)

Council Verdict: Move from hardcoded per-model timeouts to a Tier-Sovereign configuration where each tier defines its own time budget.

Problem: Reasoning models (GPT-5.2-pro, o1) require 60-110s per query, far exceeding the original 25s per-model timeout. A global timeout increase would break fast-path tiers.

Solution: 4-tier system with per-tier configurable timeouts:

| Tier | Total Timeout | Per-Model Cap | Target Models | Use Case |
|------|---------------|---------------|---------------|----------|
| quick | 30s | 20s | GPT-4o-mini, Haiku | Fast answers, fewer models |
| balanced | 90s | 45s | Sonnet 3.5, GPT-4o | Most models respond |
| high | 180s | 90s | Full council (non-reasoning) | Complete deliberation |
| reasoning | 600s | 300s | o1, GPT-5.2-pro | Deep reasoning models (doubled 2025-12-22) |

Configuration via Environment Variables:

# Per-tier total timeout (seconds)
LLM_COUNCIL_TIMEOUT_QUICK=30
LLM_COUNCIL_TIMEOUT_BALANCED=90
LLM_COUNCIL_TIMEOUT_HIGH=180
LLM_COUNCIL_TIMEOUT_REASONING=600

# Per-tier per-model timeout (seconds)
LLM_COUNCIL_MODEL_TIMEOUT_QUICK=20
LLM_COUNCIL_MODEL_TIMEOUT_BALANCED=45
LLM_COUNCIL_MODEL_TIMEOUT_HIGH=90
LLM_COUNCIL_MODEL_TIMEOUT_REASONING=300

# Global multiplier (emergency override)
LLM_COUNCIL_TIMEOUT_MULTIPLIER=1.0

Implementation:

# config.py additions
import os

DEFAULT_TIER_TIMEOUTS = {
    "quick": {"total": 30, "per_model": 20},
    "balanced": {"total": 90, "per_model": 45},
    "high": {"total": 180, "per_model": 90},
    "reasoning": {"total": 600, "per_model": 300},
}

def get_tier_timeout(tier: str) -> dict:
    """Get timeout configuration for a tier, with env var overrides."""
    defaults = DEFAULT_TIER_TIMEOUTS.get(tier, DEFAULT_TIER_TIMEOUTS["high"])
    tier_upper = tier.upper()

    total = int(os.getenv(f"LLM_COUNCIL_TIMEOUT_{tier_upper}", defaults["total"]))
    per_model = int(os.getenv(f"LLM_COUNCIL_MODEL_TIMEOUT_{tier_upper}", defaults["per_model"]))

    # Apply global multiplier if set
    multiplier = float(os.getenv("LLM_COUNCIL_TIMEOUT_MULTIPLIER", "1.0"))

    return {
        "total": int(total * multiplier),
        "per_model": int(per_model * multiplier),
    }
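
For example, assuming only the global multiplier is set in the environment:

# With LLM_COUNCIL_TIMEOUT_MULTIPLIER=1.5 and no per-tier overrides:
get_tier_timeout("balanced")   # -> {"total": 135, "per_model": 67}
get_tier_timeout("reasoning")  # -> {"total": 900, "per_model": 450}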

MCP Server Integration:

CONFIDENCE_CONFIGS = {
    "quick": {"models": 2, **get_tier_timeout("quick")},
    "balanced": {"models": 3, **get_tier_timeout("balanced")},
    "high": {"models": None, **get_tier_timeout("high")},
    "reasoning": {"models": None, **get_tier_timeout("reasoning")},
}

Infrastructure Considerations:

| Component | Default | Risk | Mitigation |
|-----------|---------|------|------------|
| AWS ALB idle timeout | 60s | Kills connection during model "thinking" | Increase to 300s or use WebSocket |
| Nginx proxy_read_timeout | 60s | Same as ALB | Set proxy_read_timeout 300s; |
| Client-side timeout | Varies | User assumes crash | Stream progress updates |

Concurrency Safeguard:

# Limit concurrent reasoning-tier requests to prevent resource exhaustion
REASONING_TIER_SEMAPHORE = asyncio.Semaphore(2)

async def consult_council_reasoning(query: str, ...):
    async with REASONING_TIER_SEMAPHORE:
        return await run_council_with_fallback(query, ...)

Model-Tier Compatibility Matrix:

| Model | quick | balanced | high | reasoning |
|-------|-------|----------|------|-----------|
| GPT-4o-mini | ✓ | ✓ | ✓ | ✓ |
| Claude Haiku | ✓ | ✓ | ✓ | ✓ |
| Claude Sonnet 3.5 | | ✓ | ✓ | ✓ |
| GPT-4o | | ✓ | ✓ | ✓ |
| Claude Opus 4.5 | | | ✓ | ✓ |
| Gemini 3 Pro | | | ✓ | ✓ |
| GPT-5.2-pro | | | | ✓ |
| o1 | | | | ✓ |

Automatic Tier Selection (Future):

from typing import List

REASONING_MODELS = {"openai/o1", "openai/gpt-5.2-pro", "openai/o1-preview"}

def infer_tier_from_models(models: List[str]) -> str:
    """Auto-select tier based on slowest model in council."""
    if any(m in REASONING_MODELS for m in models):
        return "reasoning"
    # ... additional logic
    return "high"

6. Job-Based Async Pattern (Deferred)

Council Verdict: Defer indefinitely. The job-based pattern adds significant complexity (persistence, job lifecycle, cleanup, polling UX) that conflicts with MCP's stateless design.

When to reconsider:

- Operations consistently exceed 5 minutes
- Multiple clients need to check the same job
- Resumability across server restarts is required

If implemented later, use in-memory job tracking with TTL (jobs expire after 5 minutes) rather than persistent storage.
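
Should that happen, a minimal in-memory TTL store could look like the sketch below (entirely hypothetical; nothing in the current server implements it):

import time
import uuid
from typing import Any, Dict, Optional

JOB_TTL_SECONDS = 300  # jobs expire after 5 minutes

_jobs: Dict[str, Dict[str, Any]] = {}

def create_job() -> str:
    job_id = str(uuid.uuid4())
    _jobs[job_id] = {"status": "running", "result": None,
                     "expires_at": time.monotonic() + JOB_TTL_SECONDS}
    return job_id

def get_job(job_id: str) -> Optional[Dict[str, Any]]:
    # Evict lazily on lookup instead of running a background reaper task
    job = _jobs.get(job_id)
    if job and time.monotonic() > job["expires_at"]:
        del _jobs[job_id]
        return None
    return job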


Implementation Phases

| Phase | Scope | Effort |
|-------|-------|--------|
| Phase 1 | Progress notifications + tiered timeouts | 1-2 days |
| Phase 2 | Partial results with structured metadata | 2-3 days |
| Phase 3 | Health check tool | 1 day |
| Phase 4 | Confidence levels parameter | 1-2 days |
| Deferred | Job-based async pattern | Not planned |

Alternatives Considered

Alternative 1: Increase Client Timeouts

Rejected: We don't control client-side timeouts. Users configure their MCP clients independently.

Alternative 2: Reduce Council Size

Rejected: Defeats the purpose of multi-model deliberation. Users should be able to use 4+ models. However, confidence levels provide this as an option.

Alternative 3: Pre-compute Common Queries

Partially Adopted: We already have caching (ADR-008), but this only helps for repeated queries.

Alternative 4: Server-Sent Events (SSE) Transport

Considered for Future: MCP supports streamable-http transport which could enable true streaming. However, this requires client support and is more complex to implement.

Alternative 5: Racing Pattern (Council Suggestion)

Adopted as Option: Query more models than needed, return when sufficient responses arrive. Reduces latency by not waiting for slow models.


Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Progress notifications not supported by all clients | Graceful degradation; notifications are advisory |
| Partial results may be lower quality | Clear labeling in metadata: "synthesis_type": "partial" |
| Health check adds latency before main operation | Make health check optional; recommend calling once per session |
| Rate limiting (429) during parallel queries | Distinguish from timeouts; retry with backoff before falling back |
| Memory pressure from concurrent councils | Limit concurrency per request and globally |

Success Metrics

  1. Timeout rate reduction: < 5% of council operations should time out (currently estimated at 20-30%)
  2. User visibility: Progress updates visible in supporting clients
  3. Partial result utility: When partial results are returned, user satisfaction > 70%
  4. Transparency: Users can identify which models contributed to any answer

Council Review Decisions

| Question | Council Verdict |
|----------|-----------------|
| Job-based async vs streaming? | Streaming progress + tiered timeouts. Defer async pattern indefinitely. |
| Optimal deadline? | 40-45s (not 55s). Tiered per-model deadlines preferred over single global. |
| Indicate which models responded? | Yes, explicitly. Structured metadata with per-model status is essential. |
| Fast mode valuable? | Yes, as "confidence levels" (quick/balanced/high) mapping to model count + timeout. |
| Reasoning model timeouts? (2025-12-19) | Tier-Sovereign architecture. 4 tiers with per-tier configurable timeouts. Reasoning tier: 600s total, 300s per-model (doubled 2025-12-22). |
| Per-tier vs global timeout config? (2025-12-19) | Per-tier configuration. Global overrides are dangerous; use per-tier env vars with optional multiplier. |

References