ADR-049: Prompt Caching Across Gateways¶
Status: Implemented 2026-07-04 (v0.34.0, epic #458: D1 #459/PR #464, D2 #460/PR #466, D3 #461/PR #467, D4 #462/PR #468, D5 #463/PR #469; live two-call probe green against the production route. Rev 2 research matrix: vendor docs via adversarial deep-research + empirical probes on our own OpenRouter key)
Date: 2026-07-04
Decision Makers: llm-council maintainers (review requested)
Proposed by: epic-loop project (consumer hand-over — Chris / Claude; companion evidence in epic-loop/docs/assessments/research-2026-07-04-headroom-context-compression.md and council-verify-stats.md)
Relates to: ADR-011 (cost tracking), ADR-023 (multi-router gateway support), ADR-024 (unified routing architecture), ADR-026 (reasoning-params injection precedent), ADR-040/041 (verification telemetry), ADR-042 (verify evidence injection)
Context¶
llm-council makes no use of provider-side prompt caching: no cache_control
(or any provider cache directive) appears anywhere in the source, and the
production telemetry proves the consequence — across ~15 real verification
runs on 2026-07-03, input_metrics.cached_tokens was 0 on every call but
one (a stray 128). Every council call pays full input price on every
token, every round.
Correction from rev 1: the client layer is httpx, not
litellm.acompletion—openrouter.pyand the gateway routers (gateway/openrouter.py,direct.py,requesty.py) build requests via the sharedbuild_openrouter_payload(); LiteLLM is used only on the Ollama path. This matters because the implementation point is our own payload builder, not a LiteLLM pass-through.
Two codebase facts make the current state maximally cache-hostile:
- The verification prompt opens with the snapshot SHA
(
You are reviewing code at commit \`.—verification/api.py::_build_verification_prompt`). Every round verifies a new post-fix commit, so the first bytes of the prompt differ every round, invalidating any exact-prefix cache from byte zero. - The stable content is at the wrong end. The round-invariant instructions/rubric block renders last; the volatile subject content renders early. Even providers with automatic caching can find no shared prefix to reuse.
The consumer-side evidence says this leaves real money on the table. The
epic-loop calibration log (council-verify-stats.md) shows the dominant
usage pattern: multi-round verify() on the same PR (r1→r4 observed on
PR #76), at $0.4–1.1 per call at high tier, 1,170 calls in June 2026.
This repo's own 2026-07-03 session ran ~15 rounds at $0.09–0.15 each
(balanced tier, 25–63K prompt tokens per call).
Correction from rev 1: across rounds the subject content is not byte-identical — each round verifies a new commit (house discipline forbids re-verifying the same SHA). The truly round-invariant content is the instructions + rubric + (usually) ADR-042 evidence; file content is mostly stable but byte-shifts at the first fixed line. Savings estimates must be based on the stable head, not the whole prompt — which is why part 1's ordering and breakpoint placement matter more than any single knob.
Decisive empirical result (2026-07-04, our OpenRouter key, two-call probes, ~$0.20 total):
| Probe (via OpenRouter) | Call 1 | Call 2 (same prompt, +3–4s) |
|---|---|---|
anthropic/claude-haiku-4.5 with cache_control |
cache_write_tokens: 36008, cost $0.045041 |
cached_tokens: 36008, cost $0.0036318 (−92%) |
anthropic/claude-haiku-4.5 without directive |
cost $0.036039, no cache fields | identical — full price |
openai/gpt-4o-mini (automatic expected) |
cached_tokens: 0 |
cached_tokens: 0, same cost |
google/gemini-2.5-flash and -flash-lite |
no cache fields | identical cost |
deepseek/deepseek-chat |
no cache fields | identical cost |
Three conclusions: (a) Anthropic explicit caching through OpenRouter works
end-to-end today — directive forwarded, 1.25× write premium and 0.1×
read visible to the cent in usage.cost (the field our ADR-011 pipeline
already treats as billing ground truth), fields reported in
prompt_tokens_details, on OpenRouter's pooled key (is_byok: false);
(b) the directive is required on that route — nothing is cached without
it; (c) the "it's automatic elsewhere" belief is refuted through
OpenRouter as of today — OpenAI/Gemini/DeepSeek showed zero cache
activity on identical 27–35K-token prompts seconds apart.
Decision (proposed)¶
Adopt provider-aware prompt caching in the gateway layer, in five parts:
1. Prompt-assembly invariant: stable-prefix-first¶
Restructure council prompt assembly into ordered stability segments:
- Static head — role, instructions, rubric, focus. Identical across rounds AND across subjects.
- Evidence (ADR-042) — stable within a round sequence when evidence is re-passed; changes between rounds only if the consumer updates it.
- Subject — file contents; mostly stable across rounds but byte-shifts at the first fixed line. Order multi-file content stable-files-first.
- Volatile tail — snapshot SHA, round number, prior-round dispositions, timestamps. The SHA moves from the first line to here.
Enforce deterministic serialization (sorted keys, no timestamps/UUIDs above the tail). This is the universal precondition for every vendor's caching, explicit or implicit, and is free on providers with none.
Minimum-prefix caveat (verified): Anthropic silently caches nothing below a per-model minimum — 1,024 tokens for Opus 4.8/Sonnet 5, 4,096 for Haiku 4.5, 512 for Fable 5 — with both cache fields reporting 0. The static head alone (~500 tokens) does NOT clear the bar; the head+evidence segment usually does. Breakpoint placement must respect this or hits will silently be zero.
2. Capability modeling — per route, with verified values¶
Extend RouterCapabilities (gateway/base.py) with a caching descriptor:
caching: CachingCapability
# semantics: "explicit" | "explicit_auto" | "router_flag" | "implicit" | "none"
# directive: "anthropic_cache_control" | "requesty_auto_cache" | None
# billing_passthrough: bool | None # None = unverified
# usage_fields: "openrouter_normalized" | "anthropic_native" | "openai_details" | None
Verified per-route values (see matrix):
| Route | semantics | directive | billing pass-through | usage fields |
|---|---|---|---|---|
| OpenRouter → anthropic/* | explicit | cache_control blocks (≤4 breakpoints, ttl:"1h" supported) |
verified (in usage.cost) |
prompt_tokens_details.{cache_write_tokens, cached_tokens} (empirical) |
| OpenRouter → other vendors | none (as tested 2026-07-04) | — | n/a | fields present, always 0 in probes |
| Direct Anthropic | explicit + optional automatic (top-level cache_control, uses 1 of 4 slots) |
cache_control |
n/a (direct) | cache_creation_input_tokens (+ per-TTL sub-object), cache_read_input_tokens |
| Requesty | router_flag | requesty.auto_cache: true (router injects breakpoints for Anthropic/Gemini) |
documented "up to 90%", rates unpublished | undocumented |
| Direct OpenAI | implicit (≥1,024 tokens, 128-token increments) | none | n/a | prompt_tokens_details.cached_tokens |
| Ollama / local | none (KV reuse = latency only) | — | no billing dimension | — |
v1 implementation point: build_openrouter_payload() — the single
shared payload builder — injects cache_control for anthropic/* models
exactly the way ADR-026 injects reasoning params (metadata capability check
→ payload mutation). No new router machinery is required to capture the
verified 92% on the production route; the descriptor generalizes it to
direct/Requesty routes later.
3. Cost-model extension (amends ADR-011)¶
Correction from rev 1: on OpenRouter/Requesty the provider-reported
usage.cost is ADR-011's ground truth and already includes the cache
discount (verified to the cent in the probe) — enabling caching cannot
misreport spend on those routes. The gap is only the registry-estimate
path (Direct APIs): add optional cache_read / cache_write_5m /
cache_write_1h per-1K prices to models/registry.yaml and teach
cost_resolver.py to price the provider cache token fields. Unknown cache
prices default to the prompt price (conservative over-estimate, never
under-reported) — consistent with cost_known semantics.
4. Telemetry (amends ADR-040/041)¶
Correction from rev 1: cache-read telemetry already exists end-to-end —
openrouter.py::_extract_cached_tokens reads
prompt_tokens_details.cached_tokens, ADR-011 aggregation carries
cached_tokens per stage/model/total, and verify input_metrics exposes it
(that is how we know production is at zero). What is missing:
- capture
cache_write_tokens(OpenRouter) /cache_creation_input_tokens(+ per-TTL sub-object) (Anthropic direct); - record the route and
session_idthat served each call; - surface both in verify
input_metricsand the ADR-011 usage block, so a consumer can reconstruct hit-rate per PR from logs alone. A zero cache-read count across rounds is the diagnostic for a broken prefix or lapsed TTL — telemetry is the guard, not code review.
5. TTL / stickiness policy — resolved¶
TTL (policy question B, resolved): Anthropic publishes the break-even —
5-min write (1.25×) pays for itself after one read; 1-hour write (2×)
after two reads; the 5-min cache refreshes free on every use. Observed
verification cadence (this repo, 2026-07-03: gaps of 3–11 minutes between
rounds; epic-loop fix cycles similar) straddles the 5-minute boundary, and a
lapsed 5-min cache pays the write premium again every round.
Recommendation: ttl:"1h" for verification flows (config default for
the verify path), 5-min default for consult_council/interactive use.
Config knob: LLM_COUNCIL_CACHE_TTL (5m | 1h), verify path defaulting
to 1h.
Explicit three-way math (Anthropic's published break-even — 1h pays off after two reads — compares 1h-write against no caching; the operative comparison for verify flows is 1h vs 5m at the observed cadence):
| Round gap | 5-min TTL outcome | 1-hour TTL outcome | Winner |
|---|---|---|---|
| < 5 min | free refresh, 0.1× reads | same reads, paid 2× once | 5m (1h wastes 0.75×) |
| 5–60 min | cache lapsed: every round re-pays 1.25× write, zero reads | 2× once, then 0.1× reads | 1h from the first read |
| sequence ends after r1 | 1.25× sunk | 2× sunk | 5m (bounded 0.75× loss) |
Observed gaps straddle 5 minutes (3–11 min), so 1h on the verify path
wins in expectation; the worst case (single-round sequence) is a bounded
0.75× premium on one cached segment.
Stickiness (policy question A, resolved): OpenRouter itself ships
provider sticky routing keyed on session_id (request body or
x-session-id header), designed precisely to preserve cache warmth across
its upstream pool. Adopt it with a precisely-specified key:
session_id = "verify:{repo}:{subject}" where subject is the stable
sequence identifier (PR number / branch) — explicitly NOT the per-round
snapshot SHA, which changes every round and would defeat the affinity it
exists to provide. Note session_id is routing affinity only, not a
cache key: provider caches match on exact prompt prefix, so two subjects
sharing a session_id cannot collide on content — the worst mis-keying
outcome is lost affinity (a cold cache), never cross-subject reuse. This delegates
stickiness to the router, adds no state to our selection policy, and —
resolving the resilience concern — never overrides circuit state: our
circuit breaker and fallback chains behave exactly as today; a failover
simply lands on a cold cache, costing at most one extra cache-write premium
(bounded, ~25% of one prompt) against availability. Homegrown
route-stickiness in router.py is deferred — the default deployment is
single-gateway, where it is a no-op.
Research matrix — verified 2026-07-04¶
Method: primary vendor documentation via adversarial multi-agent research (claims retained only after 3-vote verification), plus empirical two-call probes through our own OpenRouter key where we hold credentials. Statuses: ✅ VERIFIED (cited/observed) · ❌ REFUTED (as tested) · ◻ UNDOCUMENTED.
| Vendor / route | Status | Finding | Source |
|---|---|---|---|
| Anthropic direct — mechanics | ✅ | 0.1× read / 1.25× 5-min write / 2× 1-h write, multipliers exact across the current model table; TTL refresh-on-use free; ≤4 breakpoints; NEW optional automatic mode (top-level cache_control, consumes one slot); exact-prefix match; org-scoped, workspace-isolated since 2026-02-05 (Claude API/AWS/Foundry) |
platform.claude.com/docs/en/build-with-claude/prompt-caching; …/about-claude/pricing (checked 2026-07-04) |
| Anthropic direct — minimums | ✅ | Per-model minimum cacheable prefix: Fable 5/Mythos 5 512; Opus 4.8/Sonnet 5 1,024; Haiku 4.5 4,096; below-minimum silently uncached (fields report 0). (Reviewer note: these are current July-2026 model names from Anthropic's live pricing table — several are this repo's own configured council members — and will read as unfamiliar to models with earlier knowledge cutoffs.) | same |
| Anthropic direct — usage fields | ✅ | cache_creation_input_tokens (+ cache_creation.{ephemeral_5m,ephemeral_1h}_input_tokens), cache_read_input_tokens; input_tokens = uncached remainder |
same |
| Claude via OpenRouter | ✅ (docs + empirical) | cache_control forwarded (≤4 breakpoints, ttl:"1h"); Anthropic multipliers passed through billing — observed: write $0.045 = 1.25× base $0.036, read −92%; fields prompt_tokens_details.{cache_write_tokens, cached_tokens} (empirical — OpenRouter docs don't name them); works on pooled key (is_byok:false); directive REQUIRED — no directive ⇒ zero caching (empirical) |
openrouter.ai/docs/guides/best-practices/prompt-caching + two-call probe 2026-07-04 |
| OpenRouter sticky routing | ✅ | Provider sticky routing after a cached request; explicit session_id (body or x-session-id, ≤256 chars) controls affinity — the router-native answer to cache-aware stickiness |
openrouter.ai docs (checked 2026-07-04) |
| OpenAI via OpenRouter | ❌ (as tested) | Identical 27.6K-token prompt twice, 4s apart: cached_tokens: 0, identical cost — automatic caching did not apply or was not passed through |
two-call probe 2026-07-04 |
| OpenAI direct | ✅ (docs) | Fully automatic ≥1,024 tokens, hits in 128-token increments, usage.prompt_tokens_details.cached_tokens; no fixed discount ratio published — per-model cached-input prices on the pricing page ("up to 90%" input-cost reduction) |
developers.openai.com/api/docs/guides/prompt-caching (checked 2026-07-04) |
| Gemini via OpenRouter | ❌ (as tested) | 2.5-flash and flash-lite: no discount, no cache fields on identical repeat | two-call probe 2026-07-04 |
| Gemini direct (implicit + explicit context caching) | ◻ | Docs cells did not survive adversarial verification (rates/models/fields unresolved); no direct key held to probe | unresolved 2026-07-04 |
| DeepSeek via OpenRouter | ❌ (as tested) | No cache fields, identical cost (OpenRouter may serve deepseek/* from non-DeepSeek hosts) | two-call probe 2026-07-04 |
| DeepSeek direct | ◻ | Disk-cache hit/miss pricing not verified; no direct key held | unresolved 2026-07-04 |
| Requesty | ✅ (docs) | requesty.auto_cache: true — router injects breakpoints (Anthropic/Gemini); "up to 90% savings"; exact rates & usage fields unpublished; no key held to probe |
docs.requesty.ai/features/auto-caching (checked 2026-07-04) |
| LiteLLM (Ollama path only) | ✅ (docs) | Passes cache_control for 7 named providers; OpenAI stays automatic; nothing documented for Ollama — irrelevant to caching economics (local, no billing) |
docs.litellm.ai/docs/completion/prompt_caching (checked 2026-07-04) |
| Ollama / local | ✅ (nature of the thing) | KV/prefix reuse is a latency win only; no billing dimension; stable-prefix ordering helps for free | — |
Consequences¶
Positive. On the production route (OpenRouter), Anthropic council members get a verified 92% discount on the cached segment of every repeat call within TTL — the largest available cost lever that does not touch council quality (content unchanged; only its price class). The stable-prefix restructure also improves determinism and makes prompt diffs reviewable. Bench matrix runs (same 20 items × several configs, minutes apart) get full-prompt cache hits per model essentially free.
Honest scoping. Only the Anthropic members benefit today via OpenRouter
(typically 1–2 of 4 council seats + the chairman when Anthropic); the
across-round win covers the stable head + evidence + unshifted file bytes,
not the whole prompt. OpenAI/Gemini/DeepSeek routes gain nothing until the
❌/◻ cells change — the descriptor defaults them to none, and re-probing
them is a one-script check (the two-call probe pattern) worth re-running
quarterly.
Negative / cost. Registry grows cache price classes (direct path only); routes gain a capability axis to keep accurate; a byte-stability regression (a stray timestamp above the tail) silently degrades to full price — telemetry (part 4) is the guard. The 1-hour TTL doubles the write premium when a sequence ends after one round (bounded loss: 0.75× of one cached segment).
Neutral. Routes with semantics: none are unaffected; the assembly
invariant is harmless everywhere.
Compliance / Validation¶
- Unit: assembled prompts for rounds r1/r2 of the same subject are byte-identical through the declared segment boundaries (golden-file test); the SHA and all volatile fields appear only in the tail segment.
- Unit: breakpoint placement respects per-model minimum cacheable prefix (no breakpoint marking a segment below the model's minimum) AND the Anthropic limit of 4 cache breakpoints per request: automatic mode's system-managed breakpoint consumes one of the four, so at most 3 explicit breakpoints may be set alongside it (exceeding the limit returns a 400); the builder must enforce this arithmetic.
- Drift guard for the empirically-observed (not vendor-documented)
OpenRouter field names
prompt_tokens_details.{cache_write_tokens, cached_tokens}: the live two-call integration probe is the regression test; extraction must treat a missing field as 0/unknown (degrade to full-price accounting, per ADR-011cost_knownsemantics — never crash, never fabricate). - Integration (per caching-capable route, opt-in/live): two back-to-back
calls; assert call 2 reports cache-read tokens > 0 and
usage.costreflects the discount (the 2026-07-04 probe is the template). - Cost:
cost_resolvergolden tests for hit, miss, write-5m, write-1h, and unknown-cache-price-defaults-to-prompt-price paths (direct route). - Telemetry: verify
input_metricsand the ADR-011 usage block expose cache read + write tokens and the serving route/session; hit-rate per PR reconstructable from logs alone.