# ADR-011: Cost Tracking and Prediction System

- Status: Proposed
- Date: 2024-12-12
- Deciders: LLM Council
- Technical Story: Design comprehensive cost tracking, prediction, and budget controls for the council system
## Context and Problem Statement
The council system uses multiple LLM calls per query (3-10 models × 3 stages), making costs unpredictable. Users need:
- Pre-submission estimates: Know costs before running a query
- Post-submission breakdowns: Understand where money went
- Budget controls: Prevent surprise charges
- Optimization guidance: Reduce costs without sacrificing quality
### Current State

- Token counts tracked per stage (stage1, stage1.5, stage2, stage3)
- OpenRouter provides usage: {prompt_tokens, completion_tokens, total_tokens}
- No cost calculation or prediction
- No budget enforcement
### Key Challenge: Peer Review Cost Growth

Stage 2 (peer review) input size grows as O(N × M) per reviewer, where:

- N = number of models
- M = average Stage 1 response length

Because each of the N reviewers reads all of the Stage 1 responses, total Stage 2 input grows as O(N² × M). With 5 models generating ~500 tokens each, each reviewer sees ~2,500+ input tokens; a rough sketch of this arithmetic follows.
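To make the scaling concrete, here is a minimal illustrative sketch (not council code). It assumes each reviewer re-reads the original query plus every Stage 1 response, and the 200-token query is an arbitrary example value:

```python
def stage2_input_tokens(n_models: int, avg_response_tokens: int, query_tokens: int = 200) -> dict:
    """Rough Stage 2 (peer review) input-token arithmetic; illustration only."""
    # Each reviewer re-reads the original query plus every Stage 1 response.
    per_reviewer = query_tokens + n_models * avg_response_tokens   # O(N * M)
    stage_total = n_models * per_reviewer                          # O(N^2 * M)
    return {"per_reviewer": per_reviewer, "stage_total": stage_total}

# 5 models at ~500 tokens each: ~2,700 input tokens per reviewer, ~13,500 for the stage
print(stage2_input_tokens(5, 500))
```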
## Decision Drivers
- Accuracy: Cost predictions should be within 20% of actuals
- Reliability: Never fail a query due to pricing lookup issues
- User Safety: Prevent runaway costs before they happen
- Transparency: Users understand exactly how costs are calculated
- Extensibility: Support future providers beyond OpenRouter
## Design Questions & Decisions

### 1. Pricing Data Source
Question: How to get and cache model pricing data?
Options Considered:

| Approach | Pros | Cons |
|----------|------|------|
| Hardcoded only | Simple, no dependencies | Stale quickly, maintenance burden |
| Dynamic API only | Always current | API failures break pricing |
| Hybrid (chosen) | Resilient, current when available | Slightly more complex |
Decision: Hybrid approach with cached dynamic fetching + hardcoded fallback
```python
class PricingService:
    def __init__(self):
        self.cache_ttl = 3600  # 1 hour
        self.fallback_prices = {
            # Per million tokens
            "openai/gpt-4o": {"input": 2.50, "output": 10.00},
            "anthropic/claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
            "google/gemini-1.5-pro": {"input": 1.25, "output": 5.00},
        }
        self._cache = {}

    async def get_price(self, model_id: str) -> dict:
        if self._is_cache_valid(model_id):
            return self._cache[model_id]
        try:
            price = await self._fetch_from_openrouter(model_id)
            self._update_cache(model_id, price)
            return price
        except APIError:
            return self.fallback_prices.get(model_id)
```
Rationale:

- OpenRouter prices change frequently (new models, price drops)
- API failures shouldn't break cost tracking
- Cached prices provide sub-millisecond lookups during query execution
- Fallback ensures graceful degradation
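The per-call cost math downstream of the pricing lookup is a simple linear function of token counts and per-million-token prices. A minimal sketch (function name and signature are illustrative, not the library's actual API):

```python
def completion_cost(prompt_tokens: int, completion_tokens: int, price: dict) -> float:
    """Cost in USD, given per-million-token prices like {"input": 2.50, "output": 10.00}."""
    return (prompt_tokens * price["input"] + completion_tokens * price["output"]) / 1_000_000

# Example: 1,200 prompt tokens and 600 completion tokens at $2.50 / $10.00 per million
# tokens costs $0.003 + $0.006 = $0.009.
print(completion_cost(1_200, 600, {"input": 2.50, "output": 10.00}))
```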
### 2. Token Estimation for Cost Prediction
Question: How to estimate completion tokens before a query runs?
Options Considered:

| Approach | Pros | Cons |
|----------|------|------|
| Simple multiplier | Easy | Ignores model/stage variance |
| Trained ML predictor | Accurate | Complex, data-hungry |
| Historical percentiles (chosen) | Accurate, simple | Needs bootstrap data |
Decision: Model-specific historical percentiles with stage multipliers
```python
class CostPredictor:
    def __init__(self, pricing: PricingService):
        self.pricing = pricing
        # Per (model, stage) completion token statistics
        self.completion_stats = {
            "openai/gpt-4o": {
                "initial_response": {"p50": 450, "p75": 680, "p95": 1200},
                "peer_review": {"p50": 580, "p75": 850, "p95": 1400},
                "synthesis": {"p50": 400, "p75": 600, "p95": 1000},
            },
            # ... other models
        }

    async def estimate_query_cost(
        self,
        prompt_tokens: int,
        models: list[str],
        confidence: str = "p75",  # p50, p75, p95
    ) -> CostEstimate:
        """
        Estimate total cost for a council query.

        Returns low/expected/high range based on historical data.
        """
        estimates = []
        for stage in ["initial_response", "peer_review", "synthesis"]:
            stage_models = self._models_for_stage(stage, models)
            stage_prompt = self._estimate_stage_prompt(stage, prompt_tokens)
            for model in stage_models:
                completion = self.completion_stats[model][stage][confidence]
                price = await self.pricing.get_price(model)
                cost = (
                    (stage_prompt * price["input"]) +
                    (completion * price["output"])
                ) / 1_000_000
                estimates.append(StageCostEstimate(stage, model, cost))
        total = sum(e.cost for e in estimates)
        return CostEstimate(
            low=total * 0.6,    # p25 equivalent
            expected=total,     # chosen confidence
            high=total * 1.5,   # p95 buffer
            breakdown=estimates,
        )

    def update_stats(self, model: str, stage: str, actual_tokens: int):
        """Update statistics after each completion (exponential moving average)."""
        alpha = 0.1
        current = self.completion_stats[model][stage]["p50"]
        self.completion_stats[model][stage]["p50"] = (
            alpha * actual_tokens + (1 - alpha) * current
        )
```
Rationale:

- Different models have different verbosity (Claude verbose, GPT concise)
- Different stages have different output patterns (reviews longer than synthesis)
- Percentiles let users choose risk tolerance
- EMA updates improve accuracy over time without complex ML
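As a usage sketch: assuming the private helpers (`_models_for_stage`, `_estimate_stage_prompt`) and the `CostEstimate`/`StageCostEstimate` dataclasses referenced above are implemented, a pre-flight estimate might be obtained like this (the async wrapper is only needed because pricing lookups are async):

```python
import asyncio

async def preflight():
    predictor = CostPredictor(pricing=PricingService())
    estimate = await predictor.estimate_query_cost(
        prompt_tokens=800,
        models=["openai/gpt-4o"],  # only models present in completion_stats
        confidence="p75",
    )
    print(f"expected ${estimate.expected:.4f} "
          f"(low ${estimate.low:.4f}, high ${estimate.high:.4f})")

asyncio.run(preflight())
```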
### 3. Budget Enforcement Strategy
Question: How to handle spending limits?
Options Considered:

| Approach | Pros | Cons |
|----------|------|------|
| Hard reject | Safe | Frustrating UX, estimates uncertain |
| Warn only | User-friendly | Can exceed budget |
| Abort mid-query | Real-time control | Wastes spent tokens |
| Tiered (chosen) | Flexible, safe | More complex |
Decision: Tiered enforcement with configurable strictness
```python
from enum import Enum

class BudgetEnforcer:
    class Mode(Enum):
        STRICT = "strict"          # Reject if even the high estimate exceeds budget
        BALANCED = "balanced"      # Reject if expected exceeds; warn if high exceeds
        PERMISSIVE = "permissive"  # Warn only, never reject upfront

    def pre_query_check(
        self,
        estimate: CostEstimate,
        budget_remaining: float,
        mode: Mode = Mode.BALANCED,
    ) -> tuple[BudgetDecision, str | None]:
        if mode == self.Mode.STRICT:
            if estimate.high > budget_remaining:
                return BudgetDecision.REJECT, self._suggest_cheaper(estimate)
        elif mode == self.Mode.BALANCED:
            if estimate.expected > budget_remaining:
                return BudgetDecision.REJECT, "Likely to exceed budget"
            elif estimate.high > budget_remaining:
                return BudgetDecision.WARN, (
                    f"May exceed (${estimate.expected:.2f} expected, "
                    f"up to ${estimate.high:.2f})"
                )
        elif mode == self.Mode.PERMISSIVE:
            if estimate.expected > budget_remaining:
                return BudgetDecision.WARN, "Likely to exceed budget"
        return BudgetDecision.ALLOW, None

    def mid_query_check(
        self,
        spent_so_far: float,
        remaining_stages: list[str],
        budget_remaining: float,
    ) -> tuple[BudgetDecision, str | None]:
        """Check between stages - abort gracefully if over budget."""
        if spent_so_far > budget_remaining:
            return BudgetDecision.ABORT_GRACEFULLY, "Budget exceeded. Returning partial results."
        return BudgetDecision.CONTINUE, None
```
Critical UX considerations:

- Never abort mid-completion: Wastes tokens, creates broken responses
- Check between stages: Can return partial results gracefully
- Suggest alternatives: "Removing GPT-4 reduces estimate to $0.28"
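A sketch of how the tiered checks might be wired into the query flow. `BudgetDecision` and `BudgetExceededError` follow the snippets in this ADR; the surrounding orchestration, the 0.74 running spend, and the partial-result shape are illustrative assumptions:

```python
async def run_with_budget(estimate, budget: float):
    enforcer = BudgetEnforcer()

    # Pre-flight check against the CostPredictor estimate
    decision, msg = enforcer.pre_query_check(estimate, budget_remaining=budget)
    if decision == BudgetDecision.REJECT:
        raise BudgetExceededError(msg)
    if decision == BudgetDecision.WARN:
        print(f"Budget warning: {msg}")

    # ... run Stage 1 and Stage 2, accumulating spent_so_far from actual usage ...
    spent_so_far = 0.74  # illustrative running total

    # Between stages: abort gracefully rather than mid-completion
    decision, msg = enforcer.mid_query_check(
        spent_so_far, remaining_stages=["synthesis"], budget_remaining=budget
    )
    if decision == BudgetDecision.ABORT_GRACEFULLY:
        return {"partial": True, "reason": msg}
    # ... run synthesis ...
```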
### 4. Cost Attribution in Peer Review
Question: In Stage 2, each model reviews all others. How to attribute costs?
Options Considered:

| Approach | Pros | Cons |
|----------|------|------|
| Split among reviewed | Intuitive | Masks reviewer verbosity |
| Reviewer only (chosen) | Actionable, causal | Less intuitive |
| Both views | Complete | Complex |
Decision: Primary attribution to reviewer, with cross-reference tagging
```python
from dataclasses import dataclass

@dataclass
class CostAttribution:
    stage: str
    model: str
    tokens_in: int
    tokens_out: int
    cost: float
    reviewing_models: list[str] | None = None  # For peer review stage

    @property
    def cost_per_review_target(self) -> float | None:
        """Secondary view: split cost among reviewed responses."""
        if self.reviewing_models:
            return self.cost / len(self.reviewing_models)
        return None
```
Rationale:

- Causality: The reviewer's verbosity determines cost
- Actionability: "Claude's reviews cost 2x GPT's" → adjust reviewer selection
- Simplicity: Primary attribution is straightforward
- Flexibility: Can still compute "cost to be reviewed" analytically
Example output:
```json
{
  "model_costs": {
    "openai/gpt-4o": {
      "direct_cost": 0.0234,
      "cost_as_review_target": 0.0156,
      "breakdown": {
        "initial_response": 0.0089,
        "peer_review": 0.0102,
        "synthesis": 0.0043
      }
    }
  },
  "stage_costs": {
    "initial_response": 0.0312,
    "peer_review": 0.0847,
    "synthesis": 0.0098
  }
}
```
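The per-model and per-stage totals in that output can be produced by folding the `CostAttribution` records. A minimal aggregation sketch (the `summarize` helper name is an assumption, not the library's API):

```python
from collections import defaultdict

def summarize(attributions: list[CostAttribution]) -> dict:
    """Roll attribution records up into per-model and per-stage totals."""
    by_model: dict[str, float] = defaultdict(float)
    by_stage: dict[str, float] = defaultdict(float)
    review_target: dict[str, float] = defaultdict(float)

    for a in attributions:
        by_model[a.model] += a.cost          # primary attribution: the reviewer pays
        by_stage[a.stage] += a.cost
        if a.reviewing_models:               # secondary view: split across reviewed models
            share = a.cost_per_review_target
            for target in a.reviewing_models:
                review_target[target] += share

    return {
        "model_costs": dict(by_model),
        "cost_as_review_target": dict(review_target),
        "stage_costs": dict(by_stage),
    }
```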
### 5. Where Cost Tracking Lives
Question: Core library, optional module, or cloud-only?
Decision: Core library with pluggable interfaces
┌─────────────────────────────────────────────────────────────────┐
│ OPEN SOURCE CORE │
├─────────────────────────────────────────────────────────────────┤
│ llm_council/ │
│ └── cost_tracking/ │
│ ├── __init__.py │
│ ├── types.py # CostEstimate, Attribution DTOs │
│ ├── calculator.py # Pure token→cost math │
│ ├── predictor.py # Estimation algorithms │
│ ├── enforcer.py # Budget enforcement │
│ └── interfaces.py # Abstract PricingProvider │
│ │
│ llm_council/cost_tracking/backends/ │
│ ├── memory_storage.py # In-memory (default) │
│ ├── sqlite_storage.py # Local persistence │
│ └── static_pricing.py # Bundled fallback prices │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PAID CLOUD TIER │
├─────────────────────────────────────────────────────────────────┤
│ • PostgreSQL storage backend │
│ • Real-time OpenRouter pricing sync │
│ • Cross-organization prediction models │
│ • Budget alerts, dashboards, admin controls │
│ • Cost anomaly detection │
│ • Multi-tenant quotas (org/team/user) │
└─────────────────────────────────────────────────────────────────┘
Feature Matrix:

| Feature | Open Source | Cloud |
|---------|-------------|-------|
| Per-query cost calculation | Yes | Yes |
| Cost estimation (local history) | Yes | Yes (global models) |
| Budget warnings | Yes | Yes |
| Budget enforcement | Yes (local) | Yes (org-wide) |
| Historical dashboards | No | Yes |
| Cross-user prediction models | No | Yes |
| Cost anomaly alerts | No | Yes |
Rationale:

- Safety default: OSS users need visibility into spend
- Transparency: Users can see exactly how costs are calculated
- Extensibility: Cloud tier adds value without restricting core functionality
- Community contribution: Open cost logic means community can improve it
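The pluggable boundary is the abstract `PricingProvider` named in `interfaces.py`. This ADR does not pin down its shape, but a minimal sketch of how it might look (method names and the static backend are assumptions) is:

```python
from abc import ABC, abstractmethod

class PricingProvider(ABC):
    """Abstract pricing source; OSS ships static/cached backends, cloud adds live sync."""

    @abstractmethod
    async def get_price(self, model_id: str) -> dict:
        """Return {"input": ..., "output": ...} prices per million tokens."""

class StaticPricingProvider(PricingProvider):
    """Bundled fallback prices (the static_pricing.py backend in the diagram)."""

    def __init__(self, prices: dict[str, dict]):
        self._prices = prices

    async def get_price(self, model_id: str) -> dict:
        return self._prices[model_id]
```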
## Implementation

### API Integration
```python
# In run_full_council()
async def run_full_council(
    user_query: str,
    cost_tracker: CostTracker | None = None,
    budget_limit: float | None = None,
) -> CouncilResult:
    cost_tracker = cost_tracker or DefaultCostTracker()

    # Pre-flight cost estimation
    estimate = cost_tracker.estimate(user_query, COUNCIL_MODELS)

    # Budget check
    if budget_limit:
        decision, msg = cost_tracker.enforcer.pre_query_check(
            estimate, budget_limit
        )
        if decision == BudgetDecision.REJECT:
            raise BudgetExceededError(msg, estimate=estimate)

    # ... run stages, tracking actual costs ...

    return CouncilResult(
        answer=synthesis,
        cost_summary=cost_tracker.get_summary(),
    )
```
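From application code, the budget-aware entry point might be used like this. The $0.50 limit, the query text, and reading `estimate` off the exception (implied by the constructor call above, but not specified) are illustrative assumptions:

```python
async def handle_request(query: str):
    try:
        result = await run_full_council(query, budget_limit=0.50)
        print(f"answer cost ${result.cost_summary.total_cost:.4f}")
        return result
    except BudgetExceededError as exc:
        # Surface the estimate so the caller can drop models or raise the limit
        print(f"Rejected: {exc} (expected ~${exc.estimate.expected:.2f})")
```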
### Configuration

```python
# config.py additions
DEFAULT_BUDGET_MODE = "balanced"   # strict, balanced, permissive
DEFAULT_COST_TRACKING = True
DEFAULT_PRICING_CACHE_TTL = 3600   # seconds
```
### Response Schema

```python
from dataclasses import dataclass

@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    by_stage: dict[str, dict]

@dataclass
class CostSummary:
    total_cost: float
    by_stage: dict[str, float]
    by_model: dict[str, float]
    estimate_accuracy: float  # actual / estimated
    tokens: TokenUsage
    currency: str = "USD"
```
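For illustration only (all values are made up), a populated summary serialized in a response might look like:

```json
{
  "total_cost": 0.1257,
  "currency": "USD",
  "by_stage": {
    "initial_response": 0.0312,
    "peer_review": 0.0847,
    "synthesis": 0.0098
  },
  "by_model": {
    "openai/gpt-4o": 0.0612,
    "anthropic/claude-sonnet-4-20250514": 0.0645
  },
  "estimate_accuracy": 0.92,
  "tokens": {
    "prompt_tokens": 18450,
    "completion_tokens": 6120,
    "total_tokens": 24570,
    "by_stage": {
      "initial_response": {"prompt_tokens": 4200, "completion_tokens": 2300},
      "peer_review": {"prompt_tokens": 12800, "completion_tokens": 2900},
      "synthesis": {"prompt_tokens": 1450, "completion_tokens": 920}
    }
  }
}
```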
## Consequences

### Positive
- Users can predict costs before running queries
- Budget controls prevent surprise charges
- Cost breakdowns enable optimization
- OSS users get full cost visibility
### Negative
- Adds complexity to core library
- Pricing data maintenance required
- Estimates are inherently uncertain
### Risks
- Stale prices: Mitigated by dynamic fetching + short TTL
- Inaccurate estimates: Mitigated by percentile ranges
- Budget too strict: Mitigated by configurable modes
## Migration Path
- Phase 1: Add cost calculation (post-query only)
- Phase 2: Add cost estimation (pre-query)
- Phase 3: Add budget warnings
- Phase 4: Add budget enforcement (opt-in)
- Phase 5: Add cost dashboards in cloud tier