ADR-019: AI-Powered Outlook Email Analysis for ITIL Operations Insights¶

Status: Proposed Date: 2025-12-18 Decision Makers: Engineering, IT Operations Council Review: Full council (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Grok-4)

Context¶

IT Operations teams receive large volumes of unstructured support emails via shared Outlook mailboxes. These emails contain valuable operational intelligence but lack the structure of formal ITSM ticketing systems:

Current Challenges¶

Challenge	Impact
Unstructured data	Cannot query "top issues" or "worst apps"
Reply chain noise	Signatures, disclaimers, "thank you" messages pollute analysis
No ITIL mapping	Emails don't classify as Incident/Problem/Change/Request
Pattern blindness	Recurring issues go undetected until major outages
Tribal knowledge	Solutions exist in email threads but aren't discoverable

Desired Outcomes¶

Most frequent issues ranked by volume and severity
Effective solutions extracted and catalogued
Worst offending applications identified for remediation
Avoidance strategies generated proactively
Trend detection for emerging problems

Decision¶

Implement a Hybrid AI Architecture combining traditional NLP for preprocessing with LLM-based extraction for ITIL insights, using a "Filter-Cluster-Extract" pattern.

Council Consensus¶

The LLM Council unanimously agreed on three critical architectural decisions:

Graph API + Pre-processing Layer (not IMAP/SMTP)
Vector Database + Temporal Aggregation for pattern detection
Structured JSON Output constrained to ITIL taxonomy

Architecture¶

High-Level Data Flow¶

┌─────────────────────────────────────────────────────────────────────────────┐
│                         LAYER 1: INGESTION                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│  Microsoft Graph API                                                         │
│  ├── OAuth 2.0 (Client Credentials Flow)                                    │
│  ├── Delta Queries for incremental sync                                     │
│  ├── Target: Shared mailboxes (support@, helpdesk@)                         │
│  └── Batch: Every 15-60 minutes via orchestrator                            │
└─────────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌─────────────────────────────────────────────────────────────────────────────┐
│                         LAYER 2: PRE-PROCESSING                              │
├─────────────────────────────────────────────────────────────────────────────┤
│  2a. PII Redaction (MUST be first)                                          │
│      └── Microsoft Presidio: names, emails, IPs, passwords → <REDACTED>     │
│                                                                              │
│  2b. Thread Reconstruction                                                   │
│      └── Group by conversationId, extract First (problem) + Last (solution) │
│                                                                              │
│  2c. Noise Removal                                                           │
│      ├── Signatures: regex for "Best regards", "--", phone patterns         │
│      ├── Disclaimers: keyword match "confidential", "do not reply"          │
│      ├── Reply headers: "On [date] wrote:", "From:", "Sent:"                │
│      └── Transactional: discard <50 words or "thank you" sentiment          │
└─────────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌─────────────────────────────────────────────────────────────────────────────┐
│                         LAYER 3: AI PROCESSING                               │
├─────────────────────────────────────────────────────────────────────────────┤
│  3a. Vectorization (Low-cost, high-volume)                                  │
│      ├── Model: text-embedding-3-small or all-MiniLM-L6-v2                  │
│      └── Output: 384-1536 dim vectors per email thread                      │
│                                                                              │
│  3b. Semantic Clustering                                                     │
│      ├── Algorithm: HDBSCAN (density-based, handles noise)                  │
│      └── Result: 1000 emails → ~50 distinct issue clusters                  │
│                                                                              │
│  3c. LLM Extraction (High-cost, cluster centroids only)                     │
│      ├── Model: GPT-4o or Llama-3-70b                                       │
│      ├── Input: Representative emails from each cluster                     │
│      └── Output: Structured JSON (see ITIL Schema below)                    │
└─────────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌─────────────────────────────────────────────────────────────────────────────┐
│                         LAYER 4: STORAGE                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│  Bronze Layer (Raw)                                                          │
│  └── Azure Blob / S3: Full JSON from Graph API (enables replay)             │
│                                                                              │
│  Vector Store                                                                │
│  └── Pinecone / pgvector / FAISS: Embeddings + metadata                     │
│                                                                              │
│  Silver Layer (Structured)                                                   │
│  └── PostgreSQL: Extracted ITIL records for analytics                       │
└─────────────────────────────────────────────────────────────────────────────┘
                                    ↓
┌─────────────────────────────────────────────────────────────────────────────┐
│                         LAYER 5: ANALYTICS                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│  Dashboards: Power BI / Grafana                                              │
│  Reports: Automated weekly PDF via Azure Functions                           │
│  Alerts: Cluster drift detection → Teams/Slack notification                  │
└─────────────────────────────────────────────────────────────────────────────┘

ITIL Classification Schema¶

The LLM must output structured JSON conforming to this schema (enforced via Pydantic or JSON mode):

{
  "thread_id": "AAMkAGI2...",
  "timestamp": "2025-12-18T10:30:00Z",

  "classification": {
    "itil_type": "Incident | Problem | Service Request | Change Request | Noise",
    "category": "Hardware | Software | Network | Identity | Security",
    "subcategory": "VPN | Email | ERP | CRM | Endpoint | Other",
    "severity": "Critical | High | Medium | Low",
    "sentiment": "Frustrated | Neutral | Satisfied"
  },

  "extraction": {
    "issue_summary": "One sentence description of the problem",
    "affected_ci": "Application or system name (from approved CI list)",
    "error_message": "Extracted error code or message if present",
    "root_cause": "Identified cause or 'Unknown'",
    "resolution_summary": "How it was resolved or 'Unresolved'",
    "resolution_type": "Restart | Access Grant | Patch | Workaround | Escalation | User Training | Self-Healed"
  },

  "metadata": {
    "thread_length": 5,
    "time_to_resolution_hours": 2.5,
    "participants": ["support@company.com", "<REDACTED>"]
  }
}

Constrained Generation¶

To prevent hallucination of application names, provide the LLM with an enumerated list of valid Configuration Items:

VALID_CIS = [
    "Microsoft Outlook", "Microsoft Teams", "SAP ERP", "Salesforce CRM",
    "GlobalProtect VPN", "Okta SSO", "ServiceNow", "Jira", "Confluence",
    "Active Directory", "Azure AD", "Office 365", "Other"
]

The LLM must select from this list or output "Other" with a free-text description.

Analytics Outputs¶

1. Frequent Issues Report¶

Rank	Issue Cluster	Volume	Severity	Top Solution
1	VPN Connection Failures	234	High	Token resync via self-service
2	Password Reset Requests	189	Low	AD self-service portal
3	Outlook Freezing	156	Medium	Clear OST cache

2. Worst Offending Applications¶

Scoring: Score = Volume × Severity_Weight × (1 + Negative_Sentiment_Ratio)

Application	Incidents	Avg Severity	Frustration Score	Trend
SAP ERP	89	High	0.72	↑ 23%
GlobalProtect VPN	234	Medium	0.45	↓ 12%
Microsoft Teams	67	Low	0.31	→ Stable

3. Avoidance Matrix¶

                    HIGH COMPLEXITY (Long threads, escalations)
                              │
     PROBLEM MANAGEMENT       │      ROOT CAUSE ANALYSIS
     (Systemic fixes needed)  │      (Deep investigation)
                              │
    ──────────────────────────┼──────────────────────────────
                              │
     AUTOMATION CANDIDATE     │      TRAINING / DOCUMENTATION
     (Chatbot, self-service)  │      (User education)
                              │
                    LOW COMPLEXITY (Quick resolution)

    LOW FREQUENCY ────────────┼──────────────── HIGH FREQUENCY

4. Proactive Recommendations¶

LLM-generated from top clusters:

"Based on 234 VPN connection failures in the past 30 days, recommend: 1. Implement automatic RSA token refresh before expiration 2. Create self-service portal for token resync 3. Add monitoring alert for certificate expiration"

Privacy & Security¶

Concern	Mitigation
PII in emails	Microsoft Presidio redaction BEFORE any AI processing
Credentials in text	TruffleHog scanner for leaked passwords/API keys
Data residency	Same-region Azure OpenAI (GDPR/HIPAA compliance)
Access control	Row-level security by email folder/department
Retention	Auto-delete raw data after 90 days per policy
LLM data usage	Zero Data Retention agreement with API provider

Alternatives Considered¶

Alternative 1: Pure LLM Approach (All emails through GPT-4)¶

Rejected: Cost-prohibitive at scale. Processing 10,000 emails/day at $0.01/email = $3,000/month. Hybrid approach reduces LLM calls by 90%+ via clustering.

Alternative 2: Pure Traditional NLP (No LLM)¶

Rejected: Cannot extract nuanced insights like root causes or generate avoidance recommendations. Keyword-based classification misses semantic similarity ("frozen" vs "hangs" vs "unresponsive").

Alternative 3: Fine-tuned Domain Model¶

Deferred: Requires 5,000+ labeled examples. Start with few-shot prompting; fine-tune later if accuracy <85% on validation set.

Alternative 4: Direct IMAP/SMTP Access¶

Rejected: Security concerns (storing credentials), no OAuth support, limited metadata. Graph API provides conversationId, threading, and enterprise-grade auth.

Implementation Phases¶

Phase 1: Foundation (Weeks 1-4)¶

[ ] Azure AD app registration with Graph API permissions
[ ] Delta query ingestion pipeline (Azure Functions)
[ ] PII redaction with Presidio
[ ] Bronze layer storage (Blob)

Phase 2: Intelligence (Weeks 5-8)¶

[ ] Thread reconstruction and noise filtering
[ ] Embedding generation pipeline
[ ] Vector store setup (pgvector or Pinecone)
[ ] HDBSCAN clustering job

Phase 3: Extraction (Weeks 9-12)¶

[ ] LLM extraction prompts with JSON schema
[ ] ITIL classification validation
[ ] Silver layer database schema
[ ] CI enumeration and constraint logic

Phase 4: Analytics (Weeks 13-16)¶

[ ] Power BI dashboard integration
[ ] Automated weekly reports
[ ] Cluster drift alerting
[ ] Avoidance recommendation generation

Risks and Mitigations¶

Risk	Likelihood	Impact	Mitigation
Thread fragmentation (10 replies = 10 incidents)	High	High	Strict conversationId aggregation
Signature hallucination (LLM reads signature as issue)	Medium	Medium	Aggressive regex truncation before LLM
Topic drift in reply chains (new issue in old thread)	Medium	Medium	LLM prompt: "Is this semantically related?"
"Thank you" pollution	High	Low	Filter messages <50 words + positive sentiment
Cost overruns	Medium	High	Cluster-first approach; LLM only for centroids
Compliance breach (PII to external LLM)	Low	Critical	Presidio redaction as FIRST step, not optional

Success Metrics¶

Metric	Target	Measurement
Issue classification accuracy	>85% F1	Manual validation of 200 samples
Pattern detection coverage	>90% of issues in clusters	Noise cluster <10%
Time to insight	<24 hours	Ingestion to dashboard latency
Cost per email	<$0.001	Total monthly cost / emails processed
Actionable recommendations	5+ per week	Auto-generated avoidance strategies

Technology Stack¶

Layer	Technology	Rationale
Ingestion	MS Graph API + Azure Functions	Native O365 integration, serverless scale
Orchestration	Apache Airflow	Complex DAG support, monitoring
PII Redaction	Microsoft Presidio	Open source, runs locally
Embeddings	text-embedding-3-small	Cost-effective, good quality
Vector Store	pgvector (PostgreSQL)	Single database, familiar SQL
Clustering	HDBSCAN (scikit-learn)	Handles noise, variable density
LLM	Azure OpenAI GPT-4o	Enterprise compliance, JSON mode
Structured Output	Pydantic + Instructor	Schema enforcement
Analytics	Power BI	Enterprise standard, Graph integration
Alerting	Azure Logic Apps	Low-code, Teams/Slack integration

References¶

Council Review Summary¶

Reviewed by: GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Grok-4

Council Rankings: 1. Claude Opus 4.5 (0.667) - Hybrid embeddings + LLM approach 2. GPT-5.1 (0.500) - Feedback loop emphasis 3. Gemini 3 Pro (0.333) - PII-first architecture 4. Grok-4 (0.000) - Cloud-native scaling

Key Council Insights: - PII redaction must be architectural prerequisite, not afterthought - Vector clustering before LLM extraction reduces costs 90%+ - Structured JSON output prevents ITIL taxonomy drift - Time-windowed aggregation distinguishes chronic vs acute issues