ADR-041: Verification Telemetry Wiring¶

Status: Accepted 2026-03-07 Date: 2026-03-07 Decision Makers: Chris Joseph, LLM Council Related: ADR-026 (Phase 3), ADR-040 (Phase 2 item 4)

Context¶

Two existing ADRs promise telemetry capabilities that are not functioning in production:

ADR-026 Phase 3: Dead Code¶

ADR-026 Phase 3 is marked IMPLEMENTED and delivered a complete performance/ module (~700 lines, 70 tests) with: - ModelSessionMetric — per-model per-session latency, Borda score, parse success - InternalPerformanceTracker — rolling window aggregation with exponential decay - ModelPerformanceIndex — p50/p95 latency, mean Borda, confidence levels - persist_session_performance_data() — integration entry point - JSONL storage following the bias_persistence.py pattern

The problem: persist_session_performance_data() is never called from any production code path. It is only called from tests. The performance store at ~/.llm-council/performance_metrics.jsonl does not exist on any installation. The entire Phase 3 module is dead code.

This means: - get_quality_score() always returns 50.0 (cold start) for every model - select_tier_models() cannot use internal performance data for selection - The ADR-029 audition mechanism's quality percentile graduation gate (>= 75th percentile) can never fire - ADR-026's validation gate ("100+ sessions tracked") has never been met

ADR-040 Phase 2 Item 4: Missing Observability¶

ADR-040 Phase 2 explicitly requires:

"Add observability metrics (stage durations, timeout frequency, estimated vs actual duration)"

With success criteria:

"P95 verification latency for high tier < 270s (4.5 min)"

And resolved question Q4:

"Add instrumentation to track char_count -> actual_duration for data-driven refinement"

The problem: The verification pipeline (verification/api.py) uses time.monotonic() internally for waterfall budgeting but discards all timing data. No per-stage duration is recorded in transcripts. No total elapsed time appears in result.json. The success criteria cannot be measured.

Additionally, ADR-040's deferred options depend on this data: - Option E (Tiered Stage 2 Optimization): "Deferred until Options A+D provide observability data" - Option F (Early Consensus Termination): Requires stage duration data to tune thresholds

Impact: High-Tier Timeouts Cannot Be Tuned¶

High-tier verifications frequently hit the global timeout (observed across 26 local transcript sessions, where timeout_fired and partial results indicate timeout-related failures). Without timing telemetry, we cannot: 1. Determine which stage is the bottleneck (Stage 1 vs Stage 2 vs Stage 3) 2. Identify which models are consistently slow 3. Validate whether the 1.5x timeout multiplier is appropriate 4. Measure whether waterfall budget ratios (50%/70%/remaining) are well-calibrated 5. Compare char_count to actual duration for input size limit refinement 6. Make data-driven decisions about ADR-040 Option E or F

Current Data Available¶

The only data source is verification transcripts (.council/logs/), which contain: - request.json — tier, paths, rubric_focus (no timing) - stage1.json — model responses (no per-model latency) - stage2.json — rankings (no per-reviewer latency) - stage3.json — synthesis (no duration) - result.json — verdict, confidence, timeout_fired (no elapsed time)

26 transcripts exist locally but none contain timing data.

Decision¶

Wire the existing telemetry infrastructure into the verification pipeline. This is primarily an integration task — the storage, aggregation, and analysis code already exists.

Phase 1: Transcript Timing (Verification Pipeline)¶

Record wall-clock timing in verification transcripts for immediate observability.

1.1 Per-stage timing in _run_verification_pipeline()

Initialize stage_timings at pipeline entry and record per-stage elapsed time. Timing must be captured even on timeout or error — this is the most valuable data for tuning.

# Initialize at pipeline entry
partial_state["stage_timings"] = {}
pipeline_start = time.monotonic()

# Per-stage pattern (repeated for each stage)
stage1_start = time.monotonic()
try:
    # ... stage 1 execution ...
    pass
finally:
    # Always capture timing, even on exception/cancellation
    partial_state["stage_timings"]["stage1"] = int((time.monotonic() - stage1_start) * 1000)

On global timeout (asyncio.CancelledError from wait_for), partial_state["stage_timings"] contains whatever stages completed plus the in-progress stage's partial timing. The finally block ensures the current stage's elapsed time is recorded before cancellation propagates.

1.2 Per-model latency in Stage 1 and Stage 2

Stage 1 already returns model_statuses with latency_ms from the gateway. Persist this in stage1.json.

Stage 2's asyncio.as_completed path (ADR-040) records per-reviewer timing by wrapping each task with a start timestamp and computing delta on completion:

# In stage2 as_completed path
task_start_times: Dict[str, float] = {}
for model in reviewers:
    task_start_times[model] = time.monotonic()
    # ... create task ...

# On completion
elapsed = int((time.monotonic() - task_start_times[model]) * 1000)
reviewer_timings[model] = elapsed

1.3 Summary timing in result.json

Add to transcript result. budget_utilization is defined as total_elapsed_ms / global_deadline_ms (ratio of time used to time available; always in [0, 1] for non-timeout cases, can exceed 1.0 if timeout fires mid-cleanup):

{
  "timing": {
    "total_elapsed_ms": 142000,
    "stage1_elapsed_ms": 45000,
    "stage2_elapsed_ms": 78000,
    "stage3_elapsed_ms": 19000,
    "global_deadline_ms": 270000,
    "timeout_fired": false,
    "budget_utilization": 0.53
  },
  "input_metrics": {
    "content_chars": 32000,
    "tier_max_chars": 50000,
    "num_models": 4,
    "num_reviewers": 4
  }
}

On timeout, partially-completed stages appear with their elapsed time. Missing stages are omitted (not null), so consumers check key existence:

{
  "timing": {
    "total_elapsed_ms": 270000,
    "stage1_elapsed_ms": 45000,
    "stage2_elapsed_ms": 225000,
    "global_deadline_ms": 270000,
    "timeout_fired": true,
    "budget_utilization": 1.0
  }
}

Phase 2: Performance Tracker Integration¶

Wire persist_session_performance_data() into the verification pipeline so the ADR-026 Phase 3 tracker actually accumulates data.

2.1 Call site in run_verification()

The existing function signature (from performance/integration.py) is:

def persist_session_performance_data(
    session_id: str,
    model_statuses: Dict[str, Dict[str, Any]],   # model_id -> {"latency_ms": int, ...}
    aggregate_rankings: Dict[str, Dict[str, Any]], # model_id -> {"borda_score": float, ...}
    stage2_results: Optional[List[Dict[str, Any]]] = None,
) -> int:

The verification pipeline produces these exact shapes: - model_statuses from stage1_collect_responses_with_status() — confirmed compatible - aggregate_rankings from calculate_aggregate_rankings() — confirmed compatible - stage2_results from stage2_collect_rankings() — confirmed compatible

Call in a finally block after pipeline completion, wrapped in try/except to ensure telemetry failures never fail the verification:

# In run_verification(), after pipeline completes (or times out)
try:
    persist_session_performance_data(
        session_id=verification_id,
        model_statuses=partial_state.get("model_statuses", {}),
        aggregate_rankings=partial_state.get("aggregate_rankings", {}),
        stage2_results=partial_state.get("stage2_results"),
    )
except Exception:
    logger.warning("Telemetry persistence failed", exc_info=True)

Reliability constraint: Telemetry persistence must never fail the verification pipeline. A high-tier verification taking minutes should never crash at the last step because a JSONL write failed (disk full, permissions, etc.).

2.2 Call site in consult_council MCP tool (Deferred)

The main council deliberation path in mcp_server.py also never calls the tracker. This is a separate integration with different data shapes and session semantics. Deferred to a follow-up task to keep this ADR focused on the verification pipeline.

Concurrency note: JSONL append writes are atomic on POSIX for lines under PIPE_BUF (4096 bytes on Linux/macOS). ModelSessionMetric.to_jsonl_line() produces ~200 bytes per record, well under this limit. The existing bias_persistence.py uses the same pattern in production. No file locking is required.

Phase 3: Analysis CLI (Deferrable)¶

Extend the existing bias-report CLI pattern to add a timing-report command. This phase is a convenience tool and is explicitly deferrable — raw JSONL and transcript data can be analyzed with jq in the interim.

llm-council timing-report [--days N] [--tier TIER] [--format text|json]

Output: - Per-tier P50/P95/P99 total elapsed time - Per-stage P50/P95 breakdown - Per-model P50/P95 latency - Timeout frequency by tier - char_count vs actual_duration correlation - Budget utilization distribution

What This Enables¶

Once data accumulates (target: 30+ sessions for PRELIMINARY confidence): - Timeout tuning: Adjust VERIFICATION_TIMEOUT_MULTIPLIER based on P95 data - Budget ratio tuning: Adjust 50%/70%/remaining waterfall based on actual stage proportions - Input limit refinement: Adjust TIER_MAX_CHARS based on char_count -> duration correlation - Model selection feedback: select_tier_models() uses real latency/quality data - ADR-040 Option E: Data-driven Stage 2 optimization decisions - ADR-029 graduation: Audition quality percentile gate becomes functional

ADR Updates Required¶

ADR-026 Phase 3: Add "Known Gaps" subsection noting that wiring is incomplete, pointing to ADR-041
ADR-040 Phase 2: Add implementation status for item 4 pointing to ADR-041

Consequences¶

Positive¶

Existing investment pays off: ~700 lines of performance tracking code becomes functional
Data-driven timeout tuning: Replace guesswork with measured P95 latencies
ADR-040 success criteria measurable: Can finally validate "P95 < 270s for high tier"
ADR-029 audition unblocked: Quality percentile graduation gate becomes functional
Low risk: No new abstractions — wiring existing code to existing call sites

Negative¶

Disk usage: JSONL files grow over time (~200 bytes per model per session). Data retention/rotation is deferred — at current usage rates (~30 sessions/month), annual growth is ~300KB, which does not warrant a rotation strategy yet.
Minor latency: File I/O for persistence adds ~1-2ms per session (negligible vs 60-270s sessions)
Migration: Old transcripts lack timing data; analysis tools must handle missing timing field gracefully
Test impact: Integration tests that run real verification will create JSONL files; test fixtures should use temp directories with cleanup

Neutral¶

Additive VerifyResponse schema: Optional timing and input_metrics fields added to VerifyResponse (backward compatible — both default to None)
Backward compatible: transcript result.json gains new timing field (additive)

Compliance / Validation¶

Transcript timing test: Verify result.json contains timing object with total_elapsed_ms, global_deadline_ms, budget_utilization, and at least one stageN_elapsed_ms field
Timeout timing test: Verify that on global timeout, result.json still contains partial timing data for completed and in-progress stages
Performance store test: Verify ~/.llm-council/performance_metrics.jsonl is created and populated after a verification session
Telemetry isolation test: Verify that persist_session_performance_data() failure (simulated via mock raising IOError) does not affect the verification result
Integration test: Run 5 verification sessions, verify timing values are non-negative, stage times sum to approximately total_elapsed_ms (within 100ms measurement tolerance), and budget_utilization is in [0, 1] for non-timeout cases
ADR-040 validation (runtime, not code gate): After 30+ sessions, measure P95 high-tier latency against 270s target

Council Deliberation (2026-03-07, Reasoning Tier)¶

Models consulted: gemini-3.1-pro-preview, claude-opus-4.6, deepseek-v3.2-speciale Consensus level: Moderate (2 approved, 1 needs-review)

Council feedback incorporated: 1. Error/timeout path timing — Added finally blocks for stage timing capture, explicit timeout transcript example, and reliability constraint for persistence (all 3 reviewers) 2. Function signature verification — Added explicit signature from performance/integration.py with compatibility confirmation (Opus, DeepSeek) 3. partial_state["stage_timings"] initialization — Added explicit initialization at pipeline entry (Gemini, DeepSeek) 4. Phase 2.2 scoping — Deferred consult_council wiring to follow-up task (Opus) 5. budget_utilization formula — Defined as total_elapsed_ms / global_deadline_ms with edge case behavior (Opus) 6. Concurrent write safety — Added POSIX atomic append justification with PIPE_BUF analysis (Opus) 7. Phase 3 deferrable — Explicitly marked as deferrable with jq interim workaround (Opus) 8. Telemetry isolation — Added try/except wrapper and compliance test for persistence failure (Gemini, Opus) 9. Data retention — Added explicit deferral with growth estimate (Opus) 10. Test impact — Added note about temp directory fixtures (Opus) 11. Claim substantiation — Reworded "users report" to reference observed transcript data (Opus)