# ADR-017: Response Order Randomization

**Status:** Accepted → Partially Implemented (2025-12-17)
**Date:** 2025-12-13
**Decision Makers:** Engineering
**Related:** ADR-010 (Consensus Mechanisms), ADR-015 (Bias Auditing)
## Context
ADR-010 recommended "response order randomization" to "mitigate positional bias." This ADR documents the existing implementation and proposes enhancements for bias tracking.
### The Problem: Position Bias
Research on LLM evaluation shows systematic position bias:
| Bias Type | Description | Typical Effect |
|---|---|---|
| Primacy bias | First response rated higher | +0.3 to +0.5 score points |
| Recency bias | Last response rated higher | +0.2 to +0.4 score points |
| Middle neglect | Middle positions underrated | -0.2 to -0.3 score points |
Without randomization, models presented first (or last) would have an unfair advantage regardless of quality.
## Current Implementation

Response order randomization is already implemented in `council.py`:
```python
async def stage2_collect_rankings(user_query: str, stage1_results: List[Dict]):
    # Randomize response order to prevent position bias
    shuffled_results = stage1_results.copy()
    random.shuffle(shuffled_results)

    # Create anonymized labels for responses (Response A, Response B, etc.)
    labels = [chr(65 + i) for i in range(len(shuffled_results))]  # A, B, C, ...
```
## Decision

### Status: Already Implemented

The core randomization is implemented and working. This ADR formalizes the design and proposes enhancements.
### Current Behavior

- **Pre-shuffle:** Stage 1 responses arrive in a deterministic order (based on the model list)
- **Shuffle:** `random.shuffle()` randomizes the order before labeling
- **Label assignment:** Labels (A, B, C...) are assigned post-shuffle
- **Reviewer sees:** Randomized order with anonymous labels
- **De-anonymization:** The `label_to_model` mapping allows result reconstruction (a legacy-format example follows this list)
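For contrast with the enhanced format shown later, the legacy (pre-v0.3.0) mapping was a flat label-to-model dict; the model names here are illustrative:

```python
# Legacy (pre-v0.3.0) label_to_model shape: label -> model ID (illustrative names)
label_to_model = {
    "Response A": "openai/gpt-4",
    "Response B": "anthropic/claude-3",
    "Response C": "google/gemini-pro",
}
```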
### Proposed Enhancements

#### Enhancement 1: Position Tracking for Bias Auditing

Track which position each response was shown in to enable position bias analysis (ADR-015).
```python
async def stage2_collect_rankings(user_query: str, stage1_results: List[Dict]):
    shuffled_results = stage1_results.copy()
    random.shuffle(shuffled_results)
    labels = [chr(65 + i) for i in range(len(shuffled_results))]

    # Track position for bias auditing
    label_to_model = {}
    label_to_position = {}
    for i, (label, result) in enumerate(zip(labels, shuffled_results)):
        label_to_model[f"Response {label}"] = result['model']
        label_to_position[f"Response {label}"] = i  # 0 = first shown

    # ... rest of implementation ...
    return stage2_results, label_to_model, label_to_position, total_usage
```
#### Enhancement 2: Deterministic Randomization (Optional)

For reproducibility in testing/debugging, allow seeding the randomization:
```python
# config.py
import os

RANDOM_SEED = os.getenv("LLM_COUNCIL_RANDOM_SEED")  # None for true randomness

# council.py
if RANDOM_SEED is not None:
    random.seed(int(RANDOM_SEED))
shuffled_results = stage1_results.copy()
random.shuffle(shuffled_results)
```
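One caveat: `random.seed()` reseeds the module-level RNG, which affects every other use of `random` in the process. If that matters, a minimal variant (assuming the same `RANDOM_SEED` config as above) isolates the seeding in a dedicated generator:

```python
# Sketch: isolate seeding in a dedicated RNG instead of the global random module.
rng = random.Random(int(RANDOM_SEED)) if RANDOM_SEED is not None else random.Random()
shuffled_results = stage1_results.copy()
rng.shuffle(shuffled_results)
```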
#### Enhancement 3: Per-Reviewer Randomization
Currently, all reviewers see the same order. For stronger bias mitigation, randomize per-reviewer:
```python
import hashlib
import random
from typing import Dict, List

async def get_reviewer_perspective(reviewer: str, stage1_results: List[Dict]):
    """Generate a unique randomized order for each reviewer."""
    # Seed based on the reviewer name for reproducibility. Python's built-in
    # hash() is salted per process, so a stable digest is used instead.
    seed = int.from_bytes(hashlib.sha256(reviewer.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    shuffled = stage1_results.copy()
    rng.shuffle(shuffled)
    return shuffled
```
Trade-off: This makes cross-reviewer analysis more complex but provides stronger position bias mitigation.
## Alternatives Considered

### Alternative 1: No Randomization

Present responses in deterministic order (e.g., alphabetical by model).

**Rejected:** Research clearly shows position bias affects LLM evaluations.
### Alternative 2: Balanced Latin Square

Use a Latin square design where each response appears in each position an equal number of times across reviewers.

**Considered for Future:** Requires coordination across reviewers. Overkill for 3-5 reviewers but valuable for large-scale evaluations. A minimal sketch follows.
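For illustration, a cyclic scheme (function name hypothetical) gives each response every position equally often when the reviewer count is a multiple of the response count. Note this is a plain Latin square, not the fully balanced Williams design that also equalizes which responses precede which:

```python
from typing import Dict, List

def latin_square_orders(stage1_results: List[Dict], num_reviewers: int) -> List[List[Dict]]:
    """Cyclic Latin square: reviewer r sees the responses rotated by r positions.

    When num_reviewers is a multiple of len(stage1_results), every response
    occupies every position equally often across reviewers.
    """
    n = len(stage1_results)
    return [
        [stage1_results[(r + i) % n] for i in range(n)]
        for r in range(num_reviewers)
    ]
```

This cyclic rotation is also the simplest instance of the counterbalancing described next.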
### Alternative 3: Counterbalancing

For each reviewer, systematically rotate the order.

**Considered for Future:** Similar to the Latin square, this adds complexity for marginal benefit at small scale.
## Implementation Status

| Feature | Status | Notes |
|---|---|---|
| Basic randomization | ✅ Implemented | `random.shuffle()` in Stage 2 |
| Anonymous labels | ✅ Implemented | Response A, B, C... |
| Label-to-model mapping | ✅ Implemented | Enhanced format with `display_index` |
| Position tracking | ✅ Implemented (v0.3.0) | Via `display_index` in enhanced format |
| Per-reviewer randomization | ❌ Not yet | See "When More Advanced Position Tracking is Needed" |
| Deterministic seed option | ❌ Not yet | See "When More Advanced Position Tracking is Needed" |
## Position Tracking Implementation (v0.3.0)

Position tracking is now implemented via the enhanced `label_to_model` format:
```python
# Enhanced format (v0.3.0+) - includes explicit display_index
label_to_model = {
    "Response A": {"model": "openai/gpt-4", "display_index": 0},
    "Response B": {"model": "anthropic/claude-3", "display_index": 1},
    "Response C": {"model": "google/gemini-pro", "display_index": 2}
}
```
The `derive_position_mapping()` function in `bias_audit.py` extracts position data for ADR-015 bias auditing.

**INVARIANT:** Labels are assigned in lexicographic order corresponding to presentation order (A=0, B=1, etc.). This invariant MUST be maintained by any changes to the anonymization module.
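A regression test for this invariant could be as small as the following sketch (test name and data are hypothetical):

```python
def test_labels_follow_presentation_order():
    """ADR-017 invariant: label letters encode presentation order (A=0, B=1, ...)."""
    label_to_model = {
        "Response A": {"model": "openai/gpt-4", "display_index": 0},
        "Response B": {"model": "anthropic/claude-3", "display_index": 1},
        "Response C": {"model": "google/gemini-pro", "display_index": 2},
    }
    for label, value in label_to_model.items():
        assert value["display_index"] == ord(label.split()[-1]) - ord("A")
```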
## When More Advanced Position Tracking is Needed
Per LLM Council review, the current implementation (single-order randomization with position tracking) is sufficient for MVP. However, separate position tracking mechanisms would be needed for:
### Scenario 1: Per-Reviewer Randomization

If each reviewer sees a different order to further mitigate position bias, the current single `display_index` won't capture reviewer-specific positions.

**Solution:** Add `reviewer_position_mapping: Dict[str, Dict[str, int]]` to track per-reviewer orders, as sketched below.
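A sketch of the proposed structure (reviewer and model names illustrative): each reviewer maps every response's model to the 0-based position that reviewer saw it in.

```python
from typing import Dict

# Hypothetical shape: reviewer -> {response model -> 0-based display position}
reviewer_position_mapping: Dict[str, Dict[str, int]] = {
    "openai/gpt-4": {"anthropic/claude-3": 0, "google/gemini-pro": 1},
    "anthropic/claude-3": {"google/gemini-pro": 0, "openai/gpt-4": 1},
}
```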
### Scenario 2: Client-Side Shuffling

If the frontend shuffles response order for UI reasons (e.g., to prevent "first-token loading bias"), the backend `display_index` won't reflect the actual displayed order.

**Solution:** The frontend must report actual display positions back to the backend, along the lines of the sketch below.
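A sketch of what such a report might contain (field names are assumptions, not an existing API):

```python
# Hypothetical frontend-to-backend report of actual rendered positions.
display_report = {
    "displayed_order": [
        {"label": "Response B", "display_index": 0},
        {"label": "Response A", "display_index": 1},
        {"label": "Response C", "display_index": 2},
    ],
}
```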
### Scenario 3: Dynamic/Interactive Reordering

If users can manually reorder responses, sort by criteria, or collapse/expand sections, static position tracking breaks.

**Solution:** Log position at interaction time, not at generation time.
### Scenario 4: Multi-Round Re-Presentation

If responses are re-shown in subsequent conversation turns with different ordering, initial position data becomes stale.

**Solution:** Track position per round, not just per session, as in the sketch below.
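A per-round record might look like this sketch (fields hypothetical); the same response can then carry a different position in each round:

```python
# Hypothetical per-round position log: one entry per (round, label).
position_log = [
    {"round": 1, "label": "Response A", "model": "openai/gpt-4", "display_index": 2},
    {"round": 2, "label": "Response C", "model": "openai/gpt-4", "display_index": 0},
]
```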
### Scenario 5: Non-Ordinal Labels

If anonymization evolves to use non-alphabetical labels (GUIDs, colors, random strings), the current `display_index` derivation from label letters would break.

**Solution:** Already mitigated by the explicit `display_index` in the enhanced format.
## Questions for Council Review
- Is per-reviewer randomization worth the added complexity?
- Should we implement Latin square balancing for larger councils?
- How important is deterministic seeding for reproducibility?
- Should position tracking be mandatory (for ADR-015) or optional?
## Council Review Feedback

**Reviewed:** 2025-12-17 (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5, Grok 4)

### Verdict: Approved - Position Tracking Essential

The council unanimously approved ADR-017, emphasizing that position tracking is essential for ADR-015 bias auditing to function.

### Key Insights

> "Position bias is one of the most well-documented biases in LLM evaluation. Without position tracking, you cannot measure it, and without measuring it, you cannot prove your randomization is working."
### Approved Enhancements (Priority Order)
| Enhancement | Priority | Rationale |
|---|---|---|
| Position Tracking | P0 - Required | Foundation for ADR-015 bias auditing |
| Deterministic Seeding | P1 - High | Essential for reproducible testing |
| Per-Reviewer Randomization | P2 - Medium | Stronger bias mitigation but adds complexity |
| Latin Square Balancing | P3 - Deferred | Only needed for large-scale evaluations |
### Implementation Guidance

- **Position Tracking Must Return:** Add `label_to_position` to the Stage 2 return signature
- **Store in Metadata:** Position data should be persisted in council result metadata (see the sketch after this list)
- **Seed Configuration:** Add an `LLM_COUNCIL_RANDOM_SEED` environment variable for testing
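As a sketch of the persisted metadata (field names are assumptions, not the actual schema):

```python
# Hypothetical council result metadata carrying ADR-017 position data.
council_result_metadata = {
    "label_to_model": {
        "Response A": {"model": "openai/gpt-4", "display_index": 0},
        "Response B": {"model": "anthropic/claude-3", "display_index": 1},
    },
    "random_seed": None,  # populated when LLM_COUNCIL_RANDOM_SEED is set
}
```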
### Cross-ADR Dependencies

```text
ADR-017 (Position Tracking)
│
├──► ADR-015 (Bias Auditing) - REQUIRES position data
│
└──► ADR-016 (Rubric Scoring) - BENEFITS from position analysis
```
### Code Changes Required

~~Original proposal: Add a separate `label_to_position` return value.~~

**Actual Implementation (v0.3.0):** Position data is embedded in the enhanced `label_to_model` format:
```python
# council.py - Enhanced label_to_model format
label_to_model = {
    f"Response {label}": {"model": result['model'], "display_index": i}
    for i, (label, result) in enumerate(zip(labels, shuffled_results))
}

# bias_audit.py - Extract position mapping
def derive_position_mapping(label_to_model):
    """Supports both enhanced and legacy formats."""
    position_mapping = {}
    for label, value in label_to_model.items():
        if isinstance(value, dict):
            # Enhanced format (v0.3.0+): explicit display_index
            position_mapping[value["model"]] = value["display_index"]
        else:
            # Legacy fallback: derive the position from the label letter (A=0, B=1, ...)
            position_mapping[value] = ord(label.split()[-1]) - ord('A')
    return position_mapping
```
## Status Update

**Status:** Accepted → Partially Implemented (2025-12-17)

- ✅ Position tracking: Implemented via `display_index`
- ✅ ADR-015 integration: `derive_position_mapping()` extracts position data
- ❌ Per-reviewer randomization: Not yet needed
- ❌ Deterministic seeding: Not yet needed
## Success Metrics

- Absolute position-score correlation below 0.1 (no significant position bias); a sketch of the computation follows this list
- Rankings are stable across multiple runs with the same content
- Position bias auditing (ADR-015) shows a balanced position distribution
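As a sketch of how the first metric could be computed from ADR-015 audit data (function and sample data are hypothetical; requires Python 3.10+ for `statistics.correlation`):

```python
from statistics import correlation  # Pearson's r; Python 3.10+

def position_score_correlation(positions: list, scores: list) -> float:
    """Correlation between display position (0 = first) and score received.

    Values near 0 indicate randomization is neutralizing position bias.
    """
    return correlation(positions, scores)

# Illustrative audit data across sessions: positions shown vs. scores received
r = position_score_correlation([0, 1, 2, 0, 1, 2], [8.0, 7.9, 8.1, 8.0, 8.1, 7.9])
assert abs(r) < 0.1  # ADR-017 success criterion
```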
## References

- Position Bias in LLM Evaluation - Zheng et al.
- Judging LLM-as-a-Judge with MT-Bench - shows position bias effects
- Current implementation: `src/llm_council/council.py:574-590`