# ADR-012: MCP Server Reliability and Long-Running Operation Handling

- **Status:** Accepted (Implemented via TDD)
- **Date:** 2025-12-13
- **Decision Makers:** Engineering
## Context
The LLM Council MCP server performs multi-model deliberation that can take 30-60+ seconds to complete. This creates several problems when used with MCP clients (Claude Code, Claude Desktop):
### Observed Issues
- Timeout failures: MCP clients have transport-layer timeouts (typically 30-60s) that can expire before the council finishes deliberating across 4+ models
- Empty results on timeout: When timeout occurs, the entire operation fails and returns empty results rather than partial data
- No visibility during execution: Users see no feedback while the council is working, leading to uncertainty about whether the operation is progressing or hung
- No health verification: No way to verify the MCP server is healthy before invoking expensive operations
### Current Architecture
```
┌─────────────────────────────────────────────────────────┐
│ MCP Client (Claude Code/Desktop)                        │
│ - Invokes consult_council tool                          │
│ - Waits synchronously for response                      │
│ - Times out after N seconds (client-controlled)         │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│ MCP Server (llm-council)                                │
│ - Runs full 3-stage council synchronously               │
│   - Stage 1: Query all models (10-20s)                  │
│   - Stage 2: Peer review (15-30s)                       │
│   - Stage 3: Chairman synthesis (5-15s)                 │
│ - Returns only on completion or error                   │
└─────────────────────────────────────────────────────────┘
```

Total time: 30-65+ seconds for a 4-model council.
### Critical Constraint (Council Insight)
`ctx.report_progress()` does **not** extend client timeouts. Progress notifications improve UX, but the server is still racing against the client's hard timeout, so internal time budgets must be strictly managed.
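Concretely, the server has to finish (or fall back) inside a budget strictly below the client's timeout. A minimal sketch of that budgeting, assuming a per-deployment estimate of the client timeout (the constant names are illustrative):

```python
import asyncio

# Assumption: the client's hard timeout is not discoverable over MCP, so a
# conservative server-side budget is configured per deployment.
CLIENT_TIMEOUT_ESTIMATE = 60.0  # illustrative: presumed client timeout
SAFETY_MARGIN = 5.0             # time reserved to serialize partial results

async def run_with_budget(coro):
    """Enforce an internal deadline strictly below the client's timeout."""
    try:
        async with asyncio.timeout(CLIENT_TIMEOUT_ESTIMATE - SAFETY_MARGIN):
            return await coro
    except TimeoutError:
        # Fall back to partial results (Section 2) while the client
        # connection is still alive.
        return None
```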
## Decision
Implement a multi-layered reliability strategy:
### 1. Progress Notifications (Streaming Updates)
Use MCP's built-in progress notification mechanism to send real-time updates during council execution.
Implementation:
```python
@mcp.tool()
async def consult_council(query: str, ctx: Context, include_details: bool = False) -> str:
    # Two steps per model (stage 1 + stage 2) plus synthesis and finalize
    total_steps = len(COUNCIL_MODELS) * 2 + 2
    current_step = 0

    # Stage 1: individual responses
    for model in COUNCIL_MODELS:
        await ctx.report_progress(current_step, total_steps, f"Querying {model}...")
        # ... query model
        current_step += 1

    # Stage 2: peer review
    await ctx.report_progress(current_step, total_steps, "Peer review in progress...")
    # ... continue with progress updates
```
Benefits:

- Keeps the connection alive (prevents some timeout scenarios)
- Provides user visibility into operation progress
- Enables client-side timeout decisions based on stage
### 2. Partial Results on Failure
When timeout or partial failure occurs, return whatever data has been collected rather than failing entirely.
Tiered Timeout Strategy (Updated 2025-12-17):
Based on observed model response times (complex queries can take 40-60s per model), timeouts have been calibrated to prioritize completeness:
| Confidence Level | Timeout | Use Case |
|---|---|---|
| quick | 30s | Fast responses, may use fewer models |
| balanced | 75s | Most models respond |
| high | 120s | Full council deliberation (default) |
Note: The council's original recommendation was 15s/25s/40s/50s tiered deadlines, but real-world latencies required significantly larger values.
Implementation:
```python
async def run_full_council_with_fallback(query: str, synthesis_deadline: float = 40.0):
    results = {
        "synthesis": "",
        "model_responses": {},
        "metadata": {
            "status": "complete",  # or "partial", "failed"
            "completed_models": 0,
            "requested_models": len(COUNCIL_MODELS),
            "synthesis_type": "full",  # or "partial", "insufficient"
        },
    }
    try:
        async with asyncio.timeout(synthesis_deadline):
            # Run the full council with per-model timeouts
            stage1, stage2, stage3, meta = await run_full_council(query)
            results["synthesis"] = stage3.get("response", "")
            # ... populate model_responses
    except asyncio.TimeoutError:
        results["metadata"]["status"] = "partial"
        # Synthesize from whatever we have
        if results["model_responses"]:
            results["synthesis"] = await quick_synthesis(query, results["model_responses"])
            results["metadata"]["synthesis_type"] = "partial"
    return results
```
Implementation Note (2025-12-17): The above pseudocode has a subtle bug: when the deadline expires, the inner council coroutine is cancelled before model_responses is populated, so the fallback path finds it empty. The fix uses a shared dict created outside the cancelled scope:
```python
# Create the shared dict BEFORE the try block
shared_raw_responses: Dict[str, Any] = {}

async def run_council_pipeline():
    # Pass the shared dict to the model queries
    responses = await query_models_with_progress(
        models, messages,
        shared_results=shared_raw_responses,  # populated incrementally
    )
    # ... rest of pipeline

try:
    await asyncio.wait_for(run_council_pipeline(), timeout=deadline)
except asyncio.TimeoutError:
    # shared_raw_responses survives cancellation!
    # Build model_responses from it, marking missing models as "timeout"
    ...
```
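`query_models_with_progress` is elided above; a minimal sketch of the shared-dict pattern it relies on (the `query_one` helper and exact signature are illustrative; `query_model` is the per-model call assumed throughout this ADR):

```python
import asyncio
from typing import Any, Dict, List

async def query_models_with_progress(
    models: List[str],
    messages: List[dict],
    shared_results: Dict[str, Any],
) -> Dict[str, Any]:
    async def query_one(model: str) -> None:
        response = await query_model(model, messages)  # assumed existing helper
        # Incremental write: this entry survives even if the pipeline is
        # cancelled before the remaining models finish.
        shared_results[model] = response

    # gather() propagates cancellation to the per-model tasks, but entries
    # already written to shared_results remain visible to the caller.
    await asyncio.gather(*(query_one(m) for m in models))
    return shared_results
```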
Structured Result Schema (Council Recommendation):
```json
{
  "synthesis": "Based on available responses...",
  "model_responses": {
    "gpt-4": {"status": "ok", "latency_ms": 12340, "response": "..."},
    "claude": {"status": "timeout", "error": "timeout after 25s"},
    "gemini": {"status": "ok", "latency_ms": 8920, "response": "..."},
    "llama": {"status": "rate_limited", "retry_after": 30}
  },
  "metadata": {
    "status": "partial",
    "completed_models": 2,
    "requested_models": 4,
    "synthesis_type": "partial",
    "warning": "This answer is based on 2 of 4 intended models; Claude and Llama did not respond."
  }
}
```
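On the server side, this schema could be modeled with TypedDicts so the partial-result paths stay type-checked; a sketch with illustrative class names:

```python
from typing import Dict, Literal, NotRequired, TypedDict

class ModelResult(TypedDict, total=False):
    # Per-model status mirrors the JSON schema above; fields vary by status.
    status: Literal["ok", "timeout", "rate_limited", "error"]
    latency_ms: int
    response: str
    error: str
    retry_after: int

class CouncilMetadata(TypedDict):
    status: Literal["complete", "partial", "failed"]
    completed_models: int
    requested_models: int
    synthesis_type: Literal["full", "partial", "insufficient"]
    warning: NotRequired[str]  # only present on partial results

class CouncilResult(TypedDict):
    synthesis: str
    model_responses: Dict[str, ModelResult]
    metadata: CouncilMetadata
```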
Failure Taxonomy (Council Addition):

| Failure Type | Handling |
|---|---|
| Timeout | Return partial results + synthesis |
| Rate limiting (429) | Retry with backoff before falling back |
| Auth failure (401/403) | Fail fast, don't waste time on other calls |
| Network partition | Different retry strategy than timeout |
Fallback Synthesis Modes:

| Condition | Fallback |
|---|---|
| Stage 1 complete, Stage 2 timeout | Chairman synthesizes from Stage 1 only (skip peer review) |
| Stage 1 partial (some models responded) | Synthesize from available responses |
| All models timeout | Return error with diagnostic info |
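A sketch of how the taxonomy might map onto per-model exception handling, assuming an httpx-style client underneath `query_model` (the function name `call_model_classified` is illustrative):

```python
import asyncio

import httpx  # assumption: OpenRouter is called via httpx

async def call_model_classified(model: str, messages: list) -> dict:
    try:
        response = await query_model(model, messages)  # assumed existing helper
        return {"status": "ok", "response": response}
    except asyncio.TimeoutError:
        # Timeout: feed the partial-results path above
        return {"status": "timeout", "error": "timeout"}
    except httpx.HTTPStatusError as e:
        code = e.response.status_code
        if code == 429:
            # Rate limiting: retry with backoff before falling back
            return {"status": "rate_limited",
                    "retry_after": int(e.response.headers.get("Retry-After", 30))}
        if code in (401, 403):
            # Auth failure: fail fast, every other call will fail identically
            raise
        return {"status": "error", "error": f"HTTP {code}"}
```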
### 3. Health Check Tool
Add a lightweight health check tool that verifies:

- The MCP server is running
- The OpenRouter API key is configured
- At least one model is reachable
Implementation:
```python
import json
import time

@mcp.tool()
async def council_health_check() -> str:
    """
    Check LLM Council health before expensive operations.
    Returns status of API connectivity and estimated response time.
    """
    checks = {
        "api_key_configured": bool(OPENROUTER_API_KEY),
        "models_configured": len(COUNCIL_MODELS),
        "chairman_model": CHAIRMAN_MODEL,
        "estimated_duration_seconds": estimate_duration(len(COUNCIL_MODELS)),
    }
    # Quick connectivity test (single cheap model, short prompt)
    if checks["api_key_configured"]:
        try:
            start = time.time()
            response = await query_model(
                "google/gemini-2.0-flash-001",  # fast, cheap
                [{"role": "user", "content": "ping"}],
                timeout=10.0,
            )
            checks["api_reachable"] = response is not None
            checks["latency_ms"] = int((time.time() - start) * 1000)
        except Exception as e:
            checks["api_reachable"] = False
            checks["error"] = str(e)
    return json.dumps(checks, indent=2)
```
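`estimate_duration` is referenced above but not defined in this ADR; a minimal sketch, assuming parallel model queries and the per-stage latencies from the Context section (all constants are illustrative):

```python
def estimate_duration(model_count: int) -> int:
    """Rough worst-case estimate for a full council run, in seconds.

    Assumption: models are queried in parallel, so each stage is bounded by
    its slowest model rather than the sum across models.
    """
    stage1 = 20                  # slowest individual response (10-20s)
    stage2 = 30                  # slowest peer review (15-30s)
    synthesis = 15               # chairman synthesis (5-15s)
    overhead = 2 * model_count   # per-model bookkeeping, illustrative
    return stage1 + stage2 + synthesis + overhead
```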
### 4. Confidence Levels (Updated 2025-12-17)
Instead of a simple "fast mode" toggle, implement confidence levels that map to different timeout strategies:
```python
@mcp.tool()
async def consult_council(
    query: str,
    ctx: Context,
    confidence: str = "high",  # "quick", "balanced", "high"
    include_details: bool = False,
) -> str:
    """
    Args:
        confidence: "quick" (~30s), "balanced" (~75s), "high" (~120s, default)
    """
    configs = {
        "quick": {"models": 2, "timeout": 30},
        "balanced": {"models": 3, "timeout": 75},
        "high": {"models": len(COUNCIL_MODELS), "timeout": 120},
    }
    config = configs.get(confidence, configs["high"])
    # ... proceed with the selected configuration
```
Progress Feedback: During model queries, progress updates show which models have responded:
```
✓ claude-opus-4.5 (1/4) | waiting: gpt-5.1, gemini-3-pro, grok-4
✓ gemini-3-pro (2/4) | waiting: gpt-5.1, grok-4
```
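A sketch of the string formatting behind these updates, passed as the message argument to `ctx.report_progress` (`format_progress_line` is an illustrative name, not an existing helper):

```python
from typing import List

def format_progress_line(done_model: str, completed: int, total: int,
                         pending: List[str]) -> str:
    """Render one progress line, e.g. '✓ gemini-3-pro (2/4) | waiting: grok-4'."""
    return f"✓ {done_model} ({completed}/{total}) | waiting: {', '.join(pending)}"
```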
Alternative: Racing Pattern (Council Suggestion)
Query more models than needed, return when sufficient responses arrive:
```python
# Query 5 models, return when 3 have completed (first-past-the-post)
async def race_council(query: str, target_responses: int = 3):
    tasks = [asyncio.create_task(query_model(m, query)) for m in COUNCIL_MODELS[:5]]
    completed = []
    for future in asyncio.as_completed(tasks):
        result = await future
        if result:
            completed.append(result)
        if len(completed) >= target_responses:
            break
    # Cancel the stragglers so they stop consuming API budget
    for task in tasks:
        task.cancel()
    return completed
```
### 5. Tier-Sovereign Timeout Architecture (Added 2025-12-19)
Council Verdict: Move from hardcoded per-model timeouts to a Tier-Sovereign configuration where each tier defines its own time budget.
Problem: Reasoning models (GPT-5.2-pro, o1) require 60-110s per query, far exceeding the original 25s per-model timeout. A global timeout increase would break fast-path tiers.
Solution: 4-tier system with per-tier configurable timeouts:
| Tier | Total Timeout | Per-Model Cap | Target Models | Use Case |
|---|---|---|---|---|
| quick | 30s | 20s | GPT-4o-mini, Haiku | Fast answers, fewer models |
| balanced | 90s | 45s | Sonnet 3.5, GPT-4o | Most models respond |
| high | 180s | 90s | Full council (non-reasoning) | Complete deliberation |
| reasoning | 600s | 300s | o1, GPT-5.2-pro | Deep reasoning models (doubled 2025-12-22) |
Configuration via Environment Variables:
```bash
# Per-tier total timeout (seconds)
LLM_COUNCIL_TIMEOUT_QUICK=30
LLM_COUNCIL_TIMEOUT_BALANCED=90
LLM_COUNCIL_TIMEOUT_HIGH=180
LLM_COUNCIL_TIMEOUT_REASONING=600

# Per-tier per-model timeout (seconds)
LLM_COUNCIL_MODEL_TIMEOUT_QUICK=20
LLM_COUNCIL_MODEL_TIMEOUT_BALANCED=45
LLM_COUNCIL_MODEL_TIMEOUT_HIGH=90
LLM_COUNCIL_MODEL_TIMEOUT_REASONING=300

# Global multiplier (emergency override)
LLM_COUNCIL_TIMEOUT_MULTIPLIER=1.0
```
Implementation:
```python
# config.py additions
import os

DEFAULT_TIER_TIMEOUTS = {
    "quick": {"total": 30, "per_model": 20},
    "balanced": {"total": 90, "per_model": 45},
    "high": {"total": 180, "per_model": 90},
    "reasoning": {"total": 600, "per_model": 300},
}

def get_tier_timeout(tier: str) -> dict:
    """Get timeout configuration for a tier, with env var overrides."""
    defaults = DEFAULT_TIER_TIMEOUTS.get(tier, DEFAULT_TIER_TIMEOUTS["high"])
    tier_upper = tier.upper()
    total = int(os.getenv(f"LLM_COUNCIL_TIMEOUT_{tier_upper}", defaults["total"]))
    per_model = int(os.getenv(f"LLM_COUNCIL_MODEL_TIMEOUT_{tier_upper}", defaults["per_model"]))
    # Apply the global multiplier if set
    multiplier = float(os.getenv("LLM_COUNCIL_TIMEOUT_MULTIPLIER", "1.0"))
    return {
        "total": int(total * multiplier),
        "per_model": int(per_model * multiplier),
    }
```
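For example, applying the emergency multiplier to the defaults above:

```python
import os

os.environ["LLM_COUNCIL_TIMEOUT_MULTIPLIER"] = "1.5"
assert get_tier_timeout("high") == {"total": 270, "per_model": 135}

# Unknown tiers fall back to the "high" defaults
assert get_tier_timeout("unknown") == {"total": 270, "per_model": 135}
```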
MCP Server Integration:
```python
# Note: evaluated once at import time, so env var overrides must be set
# before the server module is loaded.
CONFIDENCE_CONFIGS = {
    "quick": {"models": 2, **get_tier_timeout("quick")},
    "balanced": {"models": 3, **get_tier_timeout("balanced")},
    "high": {"models": None, **get_tier_timeout("high")},
    "reasoning": {"models": None, **get_tier_timeout("reasoning")},
}
```
Infrastructure Considerations:
| Component | Default | Risk | Mitigation |
|---|---|---|---|
| AWS ALB idle timeout | 60s | Kills connection during model "thinking" | Increase to 300s or use WebSocket |
| Nginx proxy_read_timeout | 60s | Same as ALB | Set proxy_read_timeout 300s; |
| Client-side timeout | Varies | User assumes crash | Stream progress updates |
Concurrency Safeguard:
```python
# Limit concurrent reasoning-tier requests to prevent resource exhaustion
REASONING_TIER_SEMAPHORE = asyncio.Semaphore(2)

async def consult_council_reasoning(query: str, ...):
    async with REASONING_TIER_SEMAPHORE:
        return await run_council_with_fallback(query, ...)
```
Model-Tier Compatibility Matrix:
| Model | quick | balanced | high | reasoning |
|---|---|---|---|---|
| GPT-4o-mini | ✓ | ✓ | ✓ | ✓ |
| Claude Haiku | ✓ | ✓ | ✓ | ✓ |
| Claude Sonnet 3.5 | ✗ | ✓ | ✓ | ✓ |
| GPT-4o | ✗ | ✓ | ✓ | ✓ |
| Claude Opus 4.5 | ✗ | ✗ | ✓ | ✓ |
| Gemini 3 Pro | ✗ | ✗ | ✓ | ✓ |
| GPT-5.2-pro | ✗ | ✗ | ✗ | ✓ |
| o1 | ✗ | ✗ | ✗ | ✓ |
Automatic Tier Selection (Future):
```python
from typing import List

REASONING_MODELS = {"openai/o1", "openai/gpt-5.2-pro", "openai/o1-preview"}

def infer_tier_from_models(models: List[str]) -> str:
    """Auto-select tier based on the slowest model in the council."""
    if any(m in REASONING_MODELS for m in models):
        return "reasoning"
    # ... additional logic
    return "high"
```
### 6. Job-Based Async Pattern (Deferred)
Council Verdict: Defer indefinitely. The job-based pattern adds significant complexity (persistence, job lifecycle, cleanup, polling UX) that conflicts with MCP's stateless design.
When to reconsider:

- Operations consistently exceed 5 minutes
- Multiple clients need to check the same job
- Resumability across server restarts is required
If implemented later, use in-memory job tracking with TTL (jobs expire after 5 minutes) rather than persistent storage.
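If that point is ever reached, the in-memory approach might look like the following sketch (JobStore and its fields are illustrative, not part of the current codebase):

```python
import time
import uuid
from typing import Dict, Optional

JOB_TTL_SECONDS = 300  # jobs expire after 5 minutes, per the note above

class JobStore:
    """In-memory job tracking with TTL; deliberately not persistent."""

    def __init__(self) -> None:
        self._jobs: Dict[str, dict] = {}

    def create(self) -> str:
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {"created": time.time(),
                              "status": "running",
                              "result": None}
        return job_id

    def get(self, job_id: str) -> Optional[dict]:
        job = self._jobs.get(job_id)
        if job and time.time() - job["created"] > JOB_TTL_SECONDS:
            del self._jobs[job_id]  # lazy expiry on read
            return None
        return job
```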
## Implementation Phases
| Phase | Scope | Effort |
|---|---|---|
| Phase 1 | Progress notifications + tiered timeouts | 1-2 days |
| Phase 2 | Partial results with structured metadata | 2-3 days |
| Phase 3 | Health check tool | 1 day |
| Phase 4 | Confidence levels parameter | 1-2 days |
| Deferred | Job-based async pattern | Not planned |
## Alternatives Considered

### Alternative 1: Increase Client Timeouts
Rejected: We don't control client-side timeouts. Users configure their MCP clients independently.
### Alternative 2: Reduce Council Size
Rejected: Defeats the purpose of multi-model deliberation. Users should be able to use 4+ models. However, confidence levels provide this as an option.
### Alternative 3: Pre-compute Common Queries
Partially Adopted: We already have caching (ADR-008), but this only helps for repeated queries.
### Alternative 4: Server-Sent Events (SSE) Transport
Considered for Future: MCP supports streamable-http transport which could enable true streaming. However, this requires client support and is more complex to implement.
### Alternative 5: Racing Pattern (Council Suggestion)
Adopted as Option: Query more models than needed, return when sufficient responses arrive. Reduces latency by not waiting for slow models.
## Risks and Mitigations
| Risk | Mitigation |
|---|---|
| Progress notifications not supported by all clients | Graceful degradation - notifications are advisory |
| Partial results may be lower quality | Clear labeling in metadata: "synthesis_type": "partial" |
| Health check adds latency before main operation | Make health check optional; recommend calling once per session |
| Rate limiting (429) during parallel queries | Distinguish from timeouts; retry with backoff before falling back |
| Memory pressure from concurrent councils | Limit concurrency per request and globally |
## Success Metrics
- Timeout rate reduction: < 5% of council operations should timeout (currently estimated 20-30%)
- User visibility: Progress updates visible in supporting clients
- Partial result utility: When partial results returned, user satisfaction > 70%
- Transparency: Users can identify which models contributed to any answer
## Council Review Decisions
| Question | Council Verdict |
|---|---|
| Job-based async vs streaming? | Streaming progress + tiered timeouts. Defer async pattern indefinitely. |
| Optimal deadline? | 40-45s (not 55s). Tiered per-model deadlines preferred over single global. |
| Indicate which models responded? | Yes, explicitly. Structured metadata with per-model status is essential. |
| Fast mode valuable? | Yes, as "confidence levels" (quick/balanced/high) mapping to model count + timeout. |
| Reasoning model timeouts? (2025-12-19) | Tier-Sovereign architecture. 4 tiers with per-tier configurable timeouts. Reasoning tier: 600s total, 300s per-model (doubled 2025-12-22). |
| Per-tier vs global timeout config? (2025-12-19) | Per-tier configuration. Global overrides are dangerous; use per-tier env vars with optional multiplier. |