ADR-001: AI-Integrated Incident Response System¶

Status: Draft (Revised per LLM Council Review) Date: 2025-12-12 Decision Makers: Engineering, Platform, Security, Compliance

Context¶

We have comprehensive runbooks (e.g., 2,600-line SQL Server Deadlock Guide) that engineers must manually navigate during incidents. We want to transform these into an interactive, AI-assisted system where:

Alerts automatically trigger triage using codified decision trees
Diagnostic data is collected automatically via deterministic tool execution
LLMs identify patterns via RAG over indexed runbook documentation
Results appear in Slack with interactive remediation options
Risky actions require human approval; safe actions auto-execute

Decision¶

Proposed Architecture¶

Alert → Redaction Gateway → Orchestrator (Temporal) → Agentic Diagnostic Router
    → Deterministic Tool Execution → LLM+RAG Analysis → Safety Policy Check
    → Slack UX (Progressive Disclosure) → Approval Workflow (RBAC) → Action Execution

Hybrid Deterministic/Probabilistic Boundary¶

Component	Approach	Rationale
Severity classification	Strict Deterministic	LLM can advise UP but never downgrade severity
Escalation decisions	Deterministic	Must be auditable and traceable
SLA enforcement	Deterministic	Regulatory requirement
Diagnostic tool selection	Agentic Router (Constrained)	LLM selects from allowlist of tools; never generates raw SQL/queries
Diagnostic execution	Deterministic	Code executes the selected tools
Confidence computation	Deterministic	Computed from RAG scores + heuristics, not LLM output
Pattern identification	LLM + RAG	Grounded in runbook content
Root cause explanation	LLM + RAG	Low-risk, high-value use case
Action parameterization	LLM proposes → Deterministic validates	Schema validation, bounds checking before human review
Action authorization	Deterministic	Risk-tier based approval policies
Conversational Q&A	LLM with guardrails	Treat all log data as untrusted content

Revised Tech Stack¶

Component	Original	Revised	Rationale
Orchestration	n8n	Temporal.io	Durable execution, audit trail, workflow-as-code, git-backed logic
Vector DB	Qdrant	pgvector	Consolidates on existing PostgreSQL; transactional consistency; simpler compliance
LLM	Claude API	Claude API + Self-hosted fallback	Primary/fallback/emergency modes
Chat	Slack Bolt	Slack Bolt	No change
Audit	PostgreSQL	PostgreSQL	No change

Data Sanitization Pipeline¶

Critical for Financial Services: All data must pass through a redaction gateway before reaching external LLM APIs.

Raw Alert/Diagnostics
        ↓
┌───────────────────────────────────────┐
│         REDACTION GATEWAY             │
├───────────────────────────────────────┤
│ 1. Regex/NER → Detect PII/MNPI        │
│ 2. Tokenize Account Numbers           │
│ 3. Redact Customer Names              │
│ 4. Scan for Secrets/Credentials       │
│ 5. Log Redaction Actions              │
└───────────────────────────────────────┘
        ↓
   Sanitized Data → LLM API

Note: Token maps for customer IDs must be session-scoped and non-persistent.

Failure Modes & Mitigations¶

Risk	Description	Mitigation
LLM hallucination	Fabricated patterns or recommendations	RAG grounding, confidence computed deterministically, human review
Feedback loop amplification	Remediation triggers new alert, causing loop	Idempotency keys, cooldown periods per resource, circuit breakers
Alert storms	High-velocity incidents overwhelm system	Deduplication, rate limiting, graceful degradation
RBAC privilege escalation	Bot has admin access; junior user exploits it	RBAC Passthrough: verify triggering user's permissions, not bot's
Context window exhaustion	Large logs/traces exceed LLM context limits	Deterministic "Log Crusher" (summarizer/filter) before LLM
False reassurance	LLM confidently misdiagnoses P1 as transient	Aggressive disclaimers on low-confidence outputs; LLM never downgrades severity
Stale embeddings	Runbook changes not reflected in RAG	Event-driven sync, version-aware retrieval, freshness warnings
Prompt injection	Malicious content in logs/alerts	Treat all log data as untrusted; structured tool calling with ACLs
Multi-incident interference	Concurrent incidents share context	Strict session isolation; incident ID scoping for all state
Confidence miscalibration	92% doesn't mean 92% accuracy	Calibration monitoring; track confidence vs outcome correlation
Approval fatigue	Too many requests → rubber-stamping	Approval budgeting; escalation if approval rate exceeds threshold
Model drift	LLM behavior changes without warning	Pin model versions; regression test prompts; staging validation
API unavailability	LLM provider outage	Self-hosted fallback (Llama 3); deterministic-only emergency mode
Compliance gaps	Missing audit trail	Full chain logging: input → retrieval → reasoning → approval → execution
Change window violations	Actions suggested outside approved windows	Calendar-aware action filtering; deterministic freeze enforcement

Enterprise & Compliance (Financial Services)¶

Regulatory Framework¶

Requirement	Implementation
Model Risk Management (SR 11-7)	Treat system as a "model"; independent validation, performance monitoring, drift detection
SOX	Tamper-evident audit logging; change management trail
Data Residency (GDPR)	Confirm zero-day retention; region-pinned LLM API; self-hosted option
Right to Explanation	Log full reasoning chain; surface evidence to users
Separation of Duties	Same person cannot propose AND approve high-risk changes

Required Documentation¶

Model Card: Claude's role, limitations, failure modes
Data Flow Diagram: Exactly what data reaches external APIs
Rollback Procedure: How to disable AI and operate manually
Validation Report: Evidence that LLM suggestions align with runbooks

Vendor Governance¶

DPAs/BAAs, SOC2/ISO 27001 evidence for LLM provider
Explicit contracts: no training on data, retention policies, subprocessors
Multi-LLM fallback strategy documented

Trust & Adoption: UX Design¶

Design Principles¶

1. Citations Are Mandatory¶

Never show a suggestion without a source link:

🔧 Recommended: Create Index
   [Source: SQL Server Deadlock Guide, Section 4.2]

2. Progressive Disclosure of Confidence¶

Show why, not just the score:

📊 Pattern: Key Lookup Deadlock (92% match)
   ├── Wait type MATCH: LCK_M_X ✓
   ├── Object pattern MATCH: Clustered index + nonclustered seek ✓
   ├── Frequency PARTIAL: 7/15min (typical: 5-20/15min) ⚠️
   └── Source: "SQL Server Deadlock Guide" §4.2.1 [View Context]

3. Show Your Work¶

Collapsible section revealing raw log lines and runbook chunks used.

4. Graceful Degradation¶

When uncertain, say so:

🤔 Low confidence analysis (61%)

Closest patterns:
- Key Lookup Deadlock (61% match) - missing typical wait type
- Lock Escalation (58% match) - frequency doesn't fit

Recommended: Manual runbook review
[Open Deadlock Guide] [Open Lock Escalation Guide]

5. Disagreement Loop¶

[👍 Helpful] [👎 Wrong] [🔄 Partially Right]

If wrong: [Suggest correction...] → triggers review ticket + feeds evaluation dataset

6. Latency Masking¶

Stream status updates in Slack:

[10:00:01] Analyzing alert...
[10:00:03] Fetching logs from Splunk...
[10:00:08] Consulting Runbooks...
[10:00:12] Final Analysis:

Adoption Phases¶

Phase	Duration	Scope	Success Criteria
Shadow	4 weeks	System suggests; humans act independently	<20% disagreement rate
Advisory	8 weeks	Suggestions shown; humans decide	>80% rated helpful
Assisted	Ongoing	Auto-collection; human-approved actions	MTTR improvement measurable
Autonomous	After 6mo+	Pre-approved safe actions auto-execute	Zero incidents caused by automation

Runbook Synchronization Strategy¶

Architecture: Event-Driven Sync¶

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Runbook Repo  │────▶│   Git Webhook    │────▶│  Embedding      │
│   (Markdown)    │     │   or CI Trigger  │     │  Pipeline       │
└─────────────────┘     └──────────────────┘     └────────┬────────┘
                                                          │
                              ┌────────────────────────────┤
                              ▼                            ▼
                    ┌─────────────────┐          ┌─────────────────┐
                    │  Chunk Differ   │          │  Full Reindex   │
                    │  (incremental)  │          │  (nightly)      │
                    └────────┬────────┘          └────────┬────────┘
                              │                            │
                              ▼                            ▼
                    ┌──────────────────────────────────────────────┐
                    │         pgvector (Versioned Namespaces)      │
                    └──────────────────────────────────────────────┘

Implementation¶

Source of Truth: Runbooks in Git (Markdown) with structured headings
Event-Driven: Git push triggers CI → chunk → embed → atomic upsert
Version-Aware: Metadata includes commit SHA; warn if chunk >6 months old
Conflict Detection: Flag contradictory guidance from different runbooks
Nightly Safety Net: Full reindex with atomic namespace swap

Freshness Warning¶

⚠️ Warning: This documentation may be outdated (last update: Jan 2023). Verify manually.

Example Interaction (Revised)¶

🟠 DEADLOCK ALERT - P2 HIGH
Frequency: 7/15min | Impact: 42 failed requests | DB: PROD-SQL-01

📊 Pattern: Key Lookup Deadlock (92% match)
   ├── Wait type: LCK_M_X ✓
   ├── Object: Clustered + nonclustered seek ✓
   └── [Source: SQL Deadlock Guide §4.2]

🔧 Recommended: CREATE INDEX ... INCLUDE (...)
   [Validated: Schema ✓ | Naming ✓ | Size ✓]

[View XML] [Show Evidence] [Create JIRA] [Request Index Creation]

User: "Why now?"
Bot: "Traffic increased 40% at 11:45. Index adequate at normal load
     but bottlenecks under concurrency. [Source: §4.2.3]"

User: [Clicks Request Index Creation]
Bot: "⚠️ Requires DBA approval. Checking permissions..."
Bot: "✓ @alice has DBA-ONCALL role. @alice [Approve] [Deny]"

Architecture Diagram¶

graph TD
    Alert --> A[Redaction Gateway]
    A --> B{Temporal Workflow}
    B --> C[Agentic Diagnostic Router]
    C -->|Select Tools| D[Deterministic Tool Execution]
    D --> B
    B --> E[pgvector Retrieval - Versioned]
    E --> F[LLM Reasoning + Parameterization]
    F --> G[Deterministic Safety Policy Check]
    G --> H[Slack UX - Progressive Disclosure]
    H --> I{User Approval - RBAC Check}
    I -->|Approved| J[Execution]
    J --> K[Audit Log - PostgreSQL]

Interactive Runbooks: Advanced UX Vision¶

Council Review: 2025-12-13

The Concept¶

Beyond Slack-based interactions, the council evaluated a more ambitious UX pattern: Interactive Runbooks combining: - Google Colab: Executable cells, live code, prefilled outputs - NotebookLM: AI that deeply understands document context - Slack: Real-time collaboration, threading, @mentions - Google Cloud Operational Docs: Structured troubleshooting with decision trees

Council Verdict¶

The vision is the correct direction for incident response. The industry is moving from static wikis toward unified, executable surfaces. However, the engineering challenges of state management and trust calibration are the actual hurdles—not LLM capabilities.

Feasibility Assessment¶

Challenge	Difficulty	Solution
Pre-filling context	Easy	API integrations to monitoring stack
Natural language queries	Medium	RAG over runbooks + tool-calling LLM
Multi-player state sync	Hard	CRDTs for text; kernel state requires distributed systems work
Context stuffing	Hard	LLM writes queries (SQL/PromQL), doesn't analyze raw data
Latency budgets	Hard	Sub-5s responses required; async loading with "fade-in" insights

Key Insight: The LLM should never analyze raw logs directly. It should write the SQL/PromQL to query your observability platforms. Feasibility hinges on how well you index existing tools.

Cognitive Load: The "Dashboard of Everything" Trap¶

Risk: Creating a surface so dense with chat, cells, logs, and AI suggestions that it paralyzes responders.

Mitigation: Progressive Disclosure

┌─────────────────────────────────────────────────────────────────┐
│                    INTERACTIVE RUNBOOK                          │
├─────────────────────────────────────────────────────────────────┤
│  LEFT: Navigation / Phases                                      │
│        [Detect] → [Triage] → [Mitigate] → [Verify]             │
├─────────────────────────────────────────────────────────────────┤
│  CENTER: Runbook Steps (The "Spine")                           │
│          - Primary view, always visible                         │
│          - AI insights collapsed by default                     │
├─────────────────────────────────────────────────────────────────┤
│  RIGHT: Chat / Threads / AI                                     │
│         - Pull-based (user asks) not push-based (popups)       │
├─────────────────────────────────────────────────────────────────┤
│  BOTTOM: Facts Panel                                            │
│          - Key metrics snapshot                                 │
│          - Current alerts                                       │
│          - Recent deploys                                       │
└─────────────────────────────────────────────────────────────────┘

Mode Separation: Distinctly separate "Triage Mode" (Is it real? Where's the fire?) from "Forensics Mode" (deep dive analysis).

Expert Agent Delegation: Hub-and-Spoke Architecture¶

┌─────────────────────────────────────────────────────────────────┐
│                     Interactive Runbook UI                      │
├─────────────────────────────────────────────────────────────────┤
│                      Orchestration Layer                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │
│  │   Router    │──│  Context    │──│  Response Synthesizer   │ │
│  │   Agent     │  │  Manager    │  │  (combines agent outputs)│ │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│                      Expert Agent Pool                          │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐          │
│  │ Database │ │ Network  │ │Kubernetes│ │ Service  │   ...    │
│  │ Agent    │ │ Agent    │ │ Agent    │ │ X Agent  │          │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘          │
│       │            │            │            │                  │
├───────┴────────────┴────────────┴────────────┴──────────────────┤
│                    Retrieval & Tool Layer                       │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐          │
│  │ Runbook  │ │ Metrics  │ │   Logs   │ │  Change  │   ...    │
│  │ Corpus   │ │   APIs   │ │   APIs   │ │  History │          │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘          │
└─────────────────────────────────────────────────────────────────┘

Agent Design Principles: - Specialized retrieval: DB agent queries DBA wiki, schemas, past DB incidents - Specialized tools: DB agent can run EXPLAIN ANALYZE; network agent can run traceroute - Calibrated confidence: Each agent knows its boundaries - Plan + Action output: Agents generate text plans AND code blocks for human review

Critical: Agents should not see other agents' interpretations—only facts—to avoid echo chambers.

Trust Calibration: The Friction/Risk Matrix¶

	Low Impact	High Impact
High Confidence	Auto-execute	One-click + confirm
Low Confidence	One-click	Requires explanation + manager approval

Adversarial Onboarding: Training modules where AI deliberately gives wrong answers to teach engineers to verify sources.

Graceful Degradation¶

Principle: The system must be useful even if the "Brain" is lobotomized.

The "Markdown + Terminal" Fallback:

┌─────────────────────────────────────────────────────┐
│  AI Status: ⚠️ Degraded                             │
│  ───────────────────────────────────────────────── │
│  • AI chat: Unavailable                             │
│  • Pre-filled context: Working (cached)             │
│  • Executable cells: Working                        │
│  • Expert agents: Unavailable                       │
│                                                     │
│  [Continue with manual runbook] [Retry AI services] │
└─────────────────────────────────────────────────────┘

The runbook itself—structured steps, documentation, executable cells—must work without AI. AI is an enhancement, not a dependency.

Prior Art & Lessons¶

Tool	Lesson
Fiberplane	Collaborative SRE notebooks work; nailed provider integration
Shoreline/RunOps	Executable notebooks often stuck as post-mortem tools—too slow to set up during fires
NotebookLM	Document-grounded Q&A reduces hallucinations
Incident.io/FireHydrant	Slack-native beats custom UI for adoption

The "Start State" Insight: If an engineer opens the tool and relevant metrics are already queried and visualized in Cell 1, you've won. This removes "blank canvas paralysis."

Simplified 80/20 Alternative: The Smart Launcher (Recommended V1)¶

If the full vision is too complex for V1:

Concept: Alert fires → System generates a static document with: - Pre-filled metrics and relevant runbook sections - Simple "Chat with this doc" sidebar - No executable state management

Value: 70% of the benefit with 20% of the engineering risk.

┌─────────────────────────────────────────────────────────────────┐
│  INCIDENT: Deadlock Alert on PROD-SQL-01                       │
├─────────────────────────────────────────────────────────────────┤
│  📊 METRICS (auto-populated)                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ [Live Grafana Embed: Connection Pool Saturation]        │   │
│  │ [Live Grafana Embed: Query Latency P99]                 │   │
│  └─────────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────────┤
│  📖 RELEVANT RUNBOOK SECTIONS                                   │
│  • SQL Server Deadlock Guide §4.2 - Key Lookup Patterns        │
│  • Index Optimization Strategies §2.1                          │
├─────────────────────────────────────────────────────────────────┤
│  💬 CHAT WITH THIS DOC                      [Collapse ▼]       │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ You: "Does this match the pattern from last week?"      │   │
│  │ AI: "Yes, similar symptoms. Last week resolved with..." │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Implementation Phases for Interactive Runbooks¶

Phase	Focus	Scope
V1: Smart Launcher	Pre-filled context + chat sidebar	Static docs, no execution
V2: Safe Automation	Executable read-only diagnostics	Buttons for queries, ticket creation
V3: Notebooks	Executable cells with RBAC	Per-team tested playbooks
V4: Full Vision	Multi-agent orchestration	Deep telemetry integration

Reconciling Human-in-the-Loop with Automated Remediation¶

Council Review: 2025-12-13

The False Dichotomy¶

Council Verdict: The tension between HITL and automation is a false dichotomy. Design it as a governed spectrum of autonomy, not "Manual vs. Automated."

The Autonomy Ladder¶

Level	Mode	AI Role	Human Role
0	Manual/Assisted	Diagnoses, suggests runbook	Executor: Types commands manually
1	Human-Gated	Prepares exact command, computes impact	Approver: Reviews and clicks "Execute"
2	Human-Supervised	Executes immediately, notifies human	Supervisor: Monitors, can abort
3	Bounded Autonomy	Executes within strict limits	Reviewer: Post-incident audit
4	Fully Autonomous	Self-healing when conditions met	Governor: Reviews aggregate metrics

The Automation Suitability Score¶

For each runbook action, score on 5 dimensions (1-5 scale):

Dimension	Low (1)	High (5)
Reversibility	Data deletion	Stateless restart
Blast Radius	Entire region	Single pod
Determinism	"Model suspects anomaly"	"Disk full"
Time Criticality	Can wait hours	Seconds matter
Regulatory Class	Touches SOX scope	No regulatory concern

Decision Rule: - Score < 15: Human-Executed or Human-Gated - Score 15-20: Candidate for Human-Supervised - Score > 20: Candidate for Fully Automated (if regulatory allows)

The "3am Problem" Solution¶

Principle: Pre-commitment policies made in daylight by stakeholders.

┌─────────────────────────────────────────────────────────────────┐
│                    3AM DECISION TREE                            │
├─────────────────────────────────────────────────────────────────┤
│  1. Attempt page → Try on-call engineering                      │
│                                                                 │
│  2. Check Service Tier:                                         │
│     • Tier 0 (Payment Rails): Default to FAIL-SAFE              │
│       → Accept degradation to preserve data integrity           │
│     • Tier 1+ (Non-critical): Default to FAIL-OPEN              │
│       → Prioritize availability                                 │
│                                                                 │
│  3. Emergency Logic:                                            │
│     IF (Cost_of_Downtime > Risk_Threshold)                      │
│        AND (Action is Reversible)                               │
│     THEN → Auto-Execute                                         │
│                                                                 │
│  Example: Restarting web server = OK                            │
│           Dropping database table = NEVER                       │
└─────────────────────────────────────────────────────────────────┘

The Runbook's Role in an Automated World¶

In an automated world, the Interactive Runbook evolves from checklist to Glass Cockpit:

Role	Description
Glass Break Mechanism	Manual override when automation fails or loops
Context Preservation	Shows evidence for human approval decisions
Audit Artifact	Immutable record for compliance—logs AI and human decisions
Visibility Layer	"I detected X, checked Y, executed Z"

Trust Accumulation: The Graduation Pipeline¶

Automation must be earned, not built.

┌─────────────────────────────────────────────────────────────────┐
│              THE TRUST GRADUATION PIPELINE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PHASE 1: Shadow Mode                                           │
│  └─ AI logs: "I would have executed X"                          │
│  └─ Human executes manually                                     │
│  └─ Requirement: 50 matching incidents                          │
│                           │                                     │
│                           ▼                                     │
│  PHASE 2: Recommendation (Human-Gated)                          │
│  └─ Runbook presents "Fix It" button                            │
│  └─ Human must click                                            │
│  └─ Requirement: 20 consecutive successes, 0 rollbacks          │
│                           │                                     │
│                           ▼                                     │
│  PHASE 3: Supervised Autonomy                                   │
│  └─ "Executing X in 60s unless you veto"                        │
│  └─ Requirement: 3 months stability                             │
│                           │                                     │
│                           ▼                                     │
│  PHASE 4: Full Autonomy                                         │
│  └─ Graduated by governance vote (Engineering + Risk)           │
│                                                                 │
│  ⚠️ DEMOTION TRIGGER: Any Sev-1 incident or rollback            │
│     → Immediate demotion to Phase 2                             │
└─────────────────────────────────────────────────────────────────┘

Regulatory Constraints (Financial Services)¶

The "Never Automate" List (regardless of technical safety):

Category	Examples	Reason
Money Movements	Reversing transactions, ledger changes	Fiduciary duty
Security Controls	Disabling firewalls, IAM changes	Compliance requirement
Model/Risk Parameters	Trading limits, fraud thresholds	Regulatory oversight
Data Deletion	Purging records	Retention policies

Audit Requirement: Logs must distinguish "System Initiated" vs "Human Initiated". For automated actions, the "Identity" is the approved runbook version.

Avoiding the "Autopilot Paradox"¶

Aviation shows partial automation can be dangerous—humans become complacent.

Anti-Complacency Measures:

Strategy	Implementation
Active Engagement	Don't ask "Click OK"; ask "Confirm you checked Graph A and B"
Game Days	Monthly drills with automation disabled
No Hidden Modes	UI clearly shows "AUTO-PILOT ENGAGED" vs "MANUAL CONTROL"
Prediction Exercises	Before approval, ask: "What do you expect will happen?"
Variable Autonomy	Occasionally downgrade to Human-Gated for training

Integrated Decision Flow¶

┌─────────────────────────────────────────────────────────────────┐
│                    AUTOMATION DECISION FLOW                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Incident Detected                                              │
│         │                                                       │
│         ▼                                                       │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ Calculate: Base Level + Context Modifiers + Regulatory  │   │
│  │            Constraints = Effective Automation Level     │   │
│  └─────────────────────────────────────────────────────────┘   │
│         │                                                       │
│    ┌────┴────┬────────────┐                                    │
│    ▼         ▼            ▼                                    │
│  FULL     SUPERVISED   APPROVED    MANUAL                      │
│  AUTO     (notify)     (click)     (human runs)                │
│    │         │            │            │                       │
│    └─────────┴────────────┴────────────┘                       │
│                    │                                            │
│                    ▼                                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ OUTCOME RECORDING                                        │   │
│  │ • Audit log (compliance)                                 │   │
│  │ • Update trust scores                                    │   │
│  │ • Feed graduation/demotion evaluation                    │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Key Principles Summary¶

The dichotomy is false — HITL and automation are a spectrum, not opposites
Level varies by action — Different actions within same incident warrant different levels
Trust is earned — Actions graduate through automation levels based on evidence
Context matters — Same action may be automated in one context, require approval in another
Compliance is non-negotiable — Some actions require human approval regardless of technical safety
Meaningful involvement — Humans must be engaged meaningfully, not rubber-stamping
Skills atrophy — Active measures needed to maintain human capability

Tier-2 Technical Decisions¶

Council Review: 2025-12-13

These decisions are required before implementation can begin. Each represents a design choice that affects multiple components.

Decision Summary Table¶

Area	Decision	Rationale
Runbook Storage	GitOps Hybrid (Markdown + YAML)	Engineers prefer Git; Postgres for runtime queries
Alert Ingestion	Webhook Gateway + CloudEvents	Decouples sources from workflows; standard schema
Diagnostic Tools	Tooling Gateway (Typed Adapters)	Safety, RBAC passthrough, tool abstraction
Prompt Engineering	Git-managed Jinja2 Templates	Prompts are code; version control + evaluation
Action Execution	Temporal Activities + Registry	Durability, Saga pattern for rollback
Approval Workflow	Slack Block Kit + Middleware	Rich UI, identity verification, quorum support
Embedding Pipeline	Event-Driven (CI/CD trigger)	Freshness critical; transactional consistency
Agent Orchestration	Hierarchical Temporal Workflows	Deterministic boundary; isolates specialist failures
Confidence Scoring	Deterministic Ensemble	LLMs can't self-calibrate; explainable metrics
Audit & Compliance	Event Sourcing in Postgres	Structured traces; query-able decision history

1. Runbook Authoring & Storage¶

Decision: Extended Markdown in Git with CI sync to PostgreSQL

┌────────────────────────────────────────────────────────────────┐
│  RUNBOOK LIFECYCLE                                             │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Author (Engineer)                                             │
│       │                                                        │
│       ▼                                                        │
│  ┌─────────────┐                                               │
│  │  Markdown   │  - YAML frontmatter (metadata)                │
│  │  in Git     │  - Executable code blocks                     │
│  └──────┬──────┘                                               │
│         │ PR Review                                            │
│         ▼                                                      │
│  ┌─────────────┐                                               │
│  │  CI/CD      │  - Validate schema                            │
│  │  Pipeline   │  - Generate embeddings                        │
│  └──────┬──────┘  - Upsert to Postgres                         │
│         │                                                      │
│         ▼                                                      │
│  ┌─────────────┐     ┌─────────────┐                          │
│  │  runbooks   │     │  pgvector   │                          │
│  │  table      │     │  embeddings │                          │
│  └─────────────┘     └─────────────┘                          │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Requirements: - YAML frontmatter: id, service_owner, autonomy_level, required_permissions - Executable steps: fenced blocks like ```action:restart_pod ` - Version pinning: Runtime references specific Git SHA - Validation: CI fails on schema violations

2. Alert Ingestion¶

Decision: Webhook Gateway normalizing to CloudEvents standard

Common Alert Format (CAF) Schema:

alert:
  id: string              # UUID
  fingerprint: string     # Deduplication key
  source: string          # e.g., "prometheus", "datadog"
  severity: enum          # critical|high|medium|low|info
  service_id: string      # From service catalog
  title: string
  description: string
  labels: map
  fired_at: timestamp
  raw_payload: object     # Original for debugging

Requirements: - Normalize all sources to CAF within 100ms p99 - Deduplication: 30-minute sliding window on fingerprint - Enrich with service catalog metadata before workflow start - Dead-letter failed ingestion with exponential backoff retry

3. Diagnostic Tool Integration¶

Decision: Tooling Gateway with strongly-typed adapters

┌────────────────────────────────────────────────────────────────┐
│  TOOLING GATEWAY                                               │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Workflow/Agent                                                │
│       │                                                        │
│       │ get_cpu_metrics(service, window)                       │
│       ▼                                                        │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │              TOOLING GATEWAY SERVICE                     │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐   │  │
│  │  │  Auth    │ │  Rate    │ │ Timeout  │ │ Redact   │   │  │
│  │  │  RBAC    │→│  Limit   │→│ Circuit  │→│ Output   │   │  │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘   │  │
│  └─────────────────────────────────────────────────────────┘  │
│       │                                                        │
│       ├───────────────┬───────────────┬───────────────┐       │
│       ▼               ▼               ▼               ▼       │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│  │Prometheus│   │ Splunk   │   │ Datadog  │   │   K8s    │   │
│  │ Adapter  │   │ Adapter  │   │ Adapter  │   │ Adapter  │   │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘   │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Requirements: - Each tool: defined input/output schema (Pydantic) - Circuit breakers: 30s timeout, 1MB max result - Credentials: HashiCorp Vault with dynamic secrets - RBAC passthrough: impersonation token per request

4. LLM Prompt Engineering¶

Decision: Git-managed Jinja2 templates with evaluation suite

Requirements: - Prompts versioned in Git alongside runbooks - CI runs evaluation suite against historical incidents - Supports Chain-of-Thought structure enforcement - Dynamic tool injection based on user RBAC - Rollout via feature flags (percentage-based)

Prompt Structure:

prompt:
  id: incident.diagnose
  version: "2.3.0"
  input_schema: { ... }
  output_schema: { ... }    # JSON schema for structured output
  template: |
    You are an SRE assistant...
    {{alert | format_alert}}
    ...
  evaluation:
    test_cases: [...]
    regression_baseline: "2.2.0"

5. Action Execution Engine¶

Decision: Temporal Activities with Action Registry (Saga pattern)

Action Registry Entry:

action:
  id: k8s.deployment.restart
  risk_level: medium
  permissions_required: [k8s:deployments:update]
  blast_radius:
    max_affected_pods: 100
    requires_approval_above: 50
  rollback_handler: k8s.deployment.rollback
  dry_run_support: true
  idempotent: true

Requirements: - Every action registered with declared capabilities/risks - Dry-run mode for preview before execution - Saga pattern: compensating action for rollback - Blast radius limits enforced - All executions logged with correlation IDs

6. Approval Workflow¶

Decision: Slack Block Kit with middleware identity verification

Flow:

Action Requires Approval
         │
         ▼
┌─────────────────────────────────────────┐
│  Slack Approval Card                    │
│  ┌───────────────────────────────────┐  │
│  │ 🔐 Approval Required              │  │
│  │ Action: Restart deployment        │  │
│  │ Risk: ⚠️ Medium                   │  │
│  │ Impact: 12 pods affected          │  │
│  │ Expires: 30 minutes               │  │
│  │                                   │  │
│  │ [✅ Approve] [❌ Reject] [Details] │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
         │
         │ User clicks
         ▼
┌─────────────────────────────────────────┐
│  Middleware Service                     │
│  - Resolve Slack ID → Corporate ID      │
│  - Verify RBAC permissions              │
│  - Check quorum (if multi-party)        │
│  - Signal Temporal workflow             │
└─────────────────────────────────────────┘

Requirements: - Approvals expire after configurable timeout (default 30min) - Multi-party quorum support (e.g., "2 SREs required") - High-risk actions require modal confirmation (not just button) - All decisions logged with user ID, timestamp, rationale

7. Embedding Pipeline¶

Decision: Event-driven CI/CD pipeline on Git merge

Requirements: - Trigger on git merge to main branch - Semantic chunking by Markdown headers (not token count) - Metadata tags: runbook_version, service_tags - Transactional upsert (delete old + insert new atomically) - Content hash to skip unchanged chunks

8. Agent Orchestration¶

Decision: Hierarchical Temporal Child Workflows

┌────────────────────────────────────────────────────────────────┐
│  AGENT ORCHESTRATION                                           │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│                  ┌─────────────────────┐                       │
│                  │ Coordinator Workflow │                       │
│                  │ (Incident Lifecycle) │                       │
│                  └──────────┬──────────┘                       │
│                             │                                  │
│        ┌────────────────────┼────────────────────┐            │
│        ▼                    ▼                    ▼            │
│  ┌───────────┐       ┌───────────┐       ┌───────────┐       │
│  │ Database  │       │ Network   │       │ K8s       │       │
│  │ Agent     │       │ Agent     │       │ Agent     │       │
│  │ (Child)   │       │ (Child)   │       │ (Child)   │       │
│  └─────┬─────┘       └─────┬─────┘       └─────┬─────┘       │
│        │                   │                   │              │
│        │                   │                   │              │
│        ▼                   ▼                   ▼              │
│   DB Tools only      Network Tools      K8s Tools only       │
│   (scoped access)    only               (scoped access)      │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Requirements: - Shared IncidentContext object (read-only to children) - Scoped tool access per specialist - Time-boxing: 5-minute max before yield to coordinator - Structured output format (not free-form chat)

9. Confidence Scoring¶

Decision: Deterministic ensemble (not LLM self-assessment)

Formula:

Confidence = (w1 × VectorScore)
           + (w2 × PreconditionCheck)
           + (w3 × HistoricalSuccessRate)
           - (Penalty × HedgingWordsDetected)

Requirements: - Explainable: UI shows WHY confidence is low - Configurable thresholds per service tier - Drives autonomy ladder: Score < 0.7 → Human-Gated - Negative signals: "maybe", "unclear", missing data → reduce score

10. Audit & Compliance¶

Decision: Structured Event Sourcing in PostgreSQL

Schema:

CREATE TABLE audit_events (
    trace_id        UUID PRIMARY KEY,
    incident_id     UUID REFERENCES incidents(id),
    event_type      VARCHAR(50),  -- 'llm_call', 'action_executed', 'approval'
    actor           VARCHAR(100), -- user_id or 'system:bot'
    action          TEXT,
    rationale       TEXT,         -- LLM reasoning (redacted)
    prompt_version  VARCHAR(20),
    parameters      JSONB,        -- redacted
    outcome         VARCHAR(20),
    timestamp       TIMESTAMPTZ,

    -- Partitioned by month for performance
) PARTITION BY RANGE (timestamp);

Requirements: - PII redaction BEFORE insertion - Immutable: no UPDATE/DELETE permissions - Hot storage: 90 days in Postgres - Cold archive: S3 (Parquet) for 7 years - Query: "How often did users reject DB restart proposals?"

Questions for the Team¶

What's our current runbook storage? (Git/Confluence/Wiki)
What's our change management window policy?
Do we have existing Model Risk Management processes?
What's our tolerance for external API dependency?
How many concurrent incidents typically occur?

Decision Outcome¶

Approved with conditions: Implementation to proceed with revised architecture pending: - [ ] Security review of redaction gateway - [ ] Compliance sign-off on audit logging approach - [ ] Platform approval of Temporal infrastructure - [ ] Pilot scope definition (which runbooks/alert types first)

References¶

SR 11-7: Model Risk Management Guidance
OWASP LLM Top 10
Temporal.io Documentation
pgvector Extension Documentation

This ADR was revised based on feedback from the LLM Council (GPT-5.1, Gemini 3 Pro, Claude Opus 4.5, Grok 4) on 2025-12-12.