
The Case for Multi-Model Architecture

When one model hallucinates, four models usually catch it. Here's how we built a peer-review system for LLMs.


LLM Council queries multiple models in parallel, has them critique each other anonymously, and synthesizes a consensus. This post explains the architecture and the tradeoffs.

The Three-Stage Pipeline

Stage 1: Parallel Generation

Query all council models simultaneously:

import asyncio
from llm_council.openrouter import query_models_parallel

council_models = [
    "openai/gpt-4o",
    "anthropic/claude-3-5-sonnet",
    "google/gemini-2.0-pro",
    "x-ai/grok-2"
]

async def main():
    # All models queried in parallel—latency = slowest model
    responses = await query_models_parallel(
        council_models,
        messages=[{"role": "user", "content": "What's the time complexity of Python's list.sort()?"}]
    )
    return responses

responses = asyncio.run(main())

Latency note: Stage 1 latency equals your slowest model, not the sum. If GPT-4o responds in 2s and Grok in 5s, you wait 5s.
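
For intuition, here's a minimal sketch of the fan-out pattern behind a call like query_models_parallel. This is not the library's implementation; query_model below is a hypothetical stand-in for a single OpenRouter request.

import asyncio

async def query_model(model: str, messages: list[dict]) -> dict:
    # Hypothetical stand-in for one OpenRouter chat-completion request.
    await asyncio.sleep(0.1)  # simulate network latency
    return {"content": f"response from {model}"}

async def fan_out(models: list[str], messages: list[dict]) -> dict:
    # gather() awaits all requests concurrently, so wall-clock time
    # tracks the slowest model, not the sum of all models.
    results = await asyncio.gather(*(query_model(m, messages) for m in models))
    return dict(zip(models, results))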

Stage 2: Anonymous Peer Review

Each model evaluates and ranks the responses—but they don't know which model produced which. They see "Response A", "Response B", etc.

# Anonymize responses before peer review (runs inside the same async context
# as Stage 1, since collect_peer_rankings is awaited)
label_to_model = {}
anonymized = []
for i, (model, response) in enumerate(responses.items()):
    label = f"Response {chr(65 + i)}"  # Response A, B, C, D
    label_to_model[label] = {"model": model, "display_index": i}
    anonymized.append({"label": label, "content": response["content"]})

# Each model ranks every response; self-votes are excluded later, during aggregation
rankings = await collect_peer_rankings(anonymized, council_models)

Why anonymize? Models show favoritism. In our testing, GPT-4 consistently ranked other GPT responses higher. Claude preferred Claude-style formatting. Anonymization eliminates this.

Cost note: Stage 2 is expensive. Each reviewer sees all responses concatenated, so token usage scales as O(N * total_response_length). With 4 models producing 500 tokens each, each Stage 2 call processes ~2000 input tokens. Budget accordingly.
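
For a sense of what each reviewer sees, here is a plausible shape for the Stage 2 prompt. The exact wording collect_peer_rankings uses is internal to the library; build_review_prompt below is illustrative.

def build_review_prompt(user_query: str, anonymized: list[dict]) -> str:
    # Concatenate the anonymized answers, then ask for a ranking by label only.
    blocks = "\n\n".join(f'{item["label"]}:\n{item["content"]}' for item in anonymized)
    return (
        f"Question: {user_query}\n\n"
        f"Candidate answers:\n\n{blocks}\n\n"
        "Rank these responses from best to worst on accuracy and completeness. "
        "Reply with labels only, e.g.: Response C, Response A, Response D, Response B"
    )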

Stage 3: Chairman Synthesis

A designated "chairman" model synthesizes the final answer:

final_response = await synthesize_final(
    user_query="What's the time complexity of Python's list.sort()?",
    stage1_responses=responses,
    stage2_rankings=rankings,
    chairman_model="anthropic/claude-3-5-sonnet"
)

The chairman sees the original responses, the peer evaluations, and aggregate rankings. It produces a synthesis incorporating the best elements.
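
For intuition, here is one plausible way to pack those three inputs into the chairman's prompt. This is illustrative only; synthesize_final's actual prompt is internal to the library, and the result shapes below assume the Stage 2 structures used later in this post.

def build_chairman_context(user_query: str, anonymized: list[dict],
                           stage2_results: list[dict],
                           aggregate: list[tuple[str, float]]) -> str:
    # Illustrative only: original answers, per-reviewer rankings, aggregate order.
    answers = "\n\n".join(f'{a["label"]}:\n{a["content"]}' for a in anonymized)
    reviews = "\n".join(
        f'{r["model"]} ranked: {", ".join(r["parsed_ranking"])}' for r in stage2_results
    )
    order = ", ".join(f"{m} (avg {s:.2f})" for m, s in aggregate)
    return (
        f"Question: {user_query}\n\nCandidate answers:\n{answers}\n\n"
        f"Peer reviews:\n{reviews}\n\nAggregate ranking: {order}\n\n"
        "Synthesize one answer that reconciles the candidates. Do not add new claims."
    )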

A Concrete Example

We asked: "What's the time complexity of Python's list.sort()?"

Model    Response                                             Peer rank
GPT-4o   "O(n log n) using Timsort"                           3rd
Claude   "O(n log n) average and worst case, using Timsort"   2nd
Gemini   "O(n log n), but O(n) for already-sorted data"       1st
Grok     "O(n log n) using a modified merge sort"             4th

Gemini ranked highest for including the best-case optimization. The synthesis:

Python's list.sort() uses Timsort, an adaptive algorithm with O(n log n) average and worst-case complexity. For already-sorted or nearly-sorted data, it achieves O(n) due to its run-detection optimization.

No single model produced this complete answer. The council did.

Self-Vote Exclusion

Models can't vote for their own responses. This is implemented in the ranking aggregation:

from collections import defaultdict

def calculate_aggregate_rankings(stage2_results, label_to_model, exclude_self_votes=True):
    scores = defaultdict(list)

    for result in stage2_results:
        reviewer = result["model"]
        for position, label in enumerate(result["parsed_ranking"]):
            candidate = label_to_model[label]["model"]

            # Skip self-votes
            if exclude_self_votes and reviewer == candidate:
                continue

            # Record the 1-indexed rank position (lower = better)
            scores[candidate].append(position + 1)

    # Sort by average position, lowest (best) first
    return sorted(
        [(model, sum(positions) / len(positions)) for model, positions in scores.items()],
        key=lambda x: x[1]
    )
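
To see the aggregation in action, here is a hypothetical run with mock Stage 2 output. Two reviewers are shown for brevity, and the rankings are made up to mirror the example above, not real model output.

# Mock data: labels map to the models from the example above
label_to_model = {
    "Response A": {"model": "openai/gpt-4o"},
    "Response B": {"model": "anthropic/claude-3-5-sonnet"},
    "Response C": {"model": "google/gemini-2.0-pro"},
    "Response D": {"model": "x-ai/grok-2"},
}
stage2_results = [
    {"model": "openai/gpt-4o",
     "parsed_ranking": ["Response C", "Response B", "Response A", "Response D"]},
    {"model": "anthropic/claude-3-5-sonnet",
     "parsed_ranking": ["Response C", "Response B", "Response A", "Response D"]},
]

for model, avg_position in calculate_aggregate_rankings(stage2_results, label_to_model):
    print(f"{model}: {avg_position:.2f}")  # lower average position = preferred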

The Real Tradeoffs

Latency

Total pipeline latency is Stage 1 + Stage 2 + Stage 3, not just Stage 1. For our production setup:

  • Stage 1: ~5s (slowest model)
  • Stage 2: ~8s (processing all responses)
  • Stage 3: ~3s (synthesis)
  • Total: ~16s vs ~3s for single-model

Cost

With 4 models at $0.01/1K tokens:

  • Stage 1: 4 completions (~$0.08)
  • Stage 2: 4 reviews with all responses (~$0.16)
  • Stage 3: 1 synthesis (~$0.02)
  • Total: ~$0.26 vs ~$0.02 for single-model
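
As a back-of-the-envelope check, the figures above correspond roughly to the token counts assumed below. The per-call token counts are assumptions chosen to reproduce the estimates, not measurements.

# Rough cost model; tokens_per_call values are assumptions, not measured usage
PRICE_PER_1K_TOKENS = 0.01

def stage_cost(calls: int, tokens_per_call: int) -> float:
    return calls * tokens_per_call / 1000 * PRICE_PER_1K_TOKENS

stage1 = stage_cost(calls=4, tokens_per_call=2_000)  # 4 completions -> ~$0.08
stage2 = stage_cost(calls=4, tokens_per_call=4_000)  # 4 reviews     -> ~$0.16
stage3 = stage_cost(calls=1, tokens_per_call=2_000)  # 1 synthesis   -> ~$0.02
print(f"council ~${stage1 + stage2 + stage3:.2f} vs single model ~${stage_cost(1, 2_000):.2f}")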

When It's Worth It

  • High-stakes decisions: Legal, medical, financial queries
  • Code generation: Bugs are expensive; peer review catches them
  • Disagreement detection: If models split 2-2, you know the question is ambiguous

When It's Not

  • High-volume, low-stakes queries
  • Real-time chat (16s is too slow)
  • Simple factual lookups

That's why we built tiers: use the full council for complex queries and a fast single model for simple ones.
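
A hypothetical routing shim along those lines, using the calls shown elsewhere in this post. The complexity heuristic and the choice of fast model are illustrative, not the library's actual tiering logic.

from llm_council import run_full_council
from llm_council.openrouter import query_models_parallel

FAST_MODEL = "openai/gpt-4o"  # assumption: any single council member could serve here

async def answer(query: str, force_council: bool = False) -> str:
    # Crude complexity heuristic (illustrative): long queries go to the council.
    if force_council or len(query.split()) > 50:
        result = await run_full_council(query)
        return result["stage3"]["content"]
    # Simple queries: one fast model via the same parallel helper.
    responses = await query_models_parallel(
        [FAST_MODEL], messages=[{"role": "user", "content": query}]
    )
    return responses[FAST_MODEL]["content"]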

Failure Modes

What if models disagree 2-2? We flag this as low-confidence and either escalate to the chairman's judgment or return the disagreement to the user.
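
A minimal sketch of that check, assuming the Stage 2 result shape used earlier in this post (not the library's actual API):

from collections import Counter

def is_split_decision(stage2_results: list[dict], label_to_model: dict) -> bool:
    # Count each reviewer's top pick; a tie for first place (e.g. 2-2) means
    # the council did not converge and the result should be flagged.
    top_picks = Counter(
        label_to_model[r["parsed_ranking"][0]]["model"] for r in stage2_results
    )
    leaders = top_picks.most_common(2)
    return len(leaders) == 2 and leaders[0][1] == leaders[1][1]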

What if the chairman hallucinates? The chairman prompt includes all peer feedback and rankings. It's instructed to reconcile disagreements, not invent. In practice, chairman errors are rare because it's synthesizing, not generating novel claims.

What if one model times out? We continue with the remaining models. A 3-model council is still more reliable than single-model.
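
A sketch of that behavior, reusing the hypothetical query_model stub from the Stage 1 sketch: gather with return_exceptions=True plus a per-model timeout lets the council proceed with whichever models responded.

import asyncio

async def query_with_tolerance(models: list[str], messages: list[dict],
                               timeout_s: float = 30.0) -> dict:
    async def one(model: str) -> dict:
        # query_model: the hypothetical single-request stub from the Stage 1 sketch
        return await asyncio.wait_for(query_model(model, messages), timeout_s)

    results = await asyncio.gather(*(one(m) for m in models), return_exceptions=True)
    # Drop timeouts/errors; a 3-model council still runs the later stages.
    return {m: r for m, r in zip(models, results) if not isinstance(r, Exception)}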

Full API

import asyncio
from llm_council import run_full_council

async def main():
    result = await run_full_council(
        "Explain the CAP theorem and its implications for distributed databases"
    )

    print(f"Stage 1: {len(result['stage1'])} responses")
    print(f"Stage 2: {len(result['stage2'])} evaluations")
    print(f"Final: {result['stage3']['content'][:200]}...")

    # Rankings show which response the council preferred
    for model, score in result['aggregate_rankings']:
        print(f"  {model}: {score:.2f}")

asyncio.run(main())

What's Next

This is post 1 of 7. Coming up:

  • Post 2: Building a Fault-Tolerant LLM Gateway
  • Post 3: Why Majority Vote Fails for Small Groups
  • Post 4: The Latency Tax: Parallel Execution Patterns


LLM Council is open source: github.com/amiable-dev/llm-council. Install with pip install llm-council-core.