ADR-019: AI-Powered Outlook Email Analysis for ITIL Operations Insights¶
Status: Proposed Date: 2025-12-18 Decision Makers: Engineering, IT Operations Council Review: Full council (GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Grok-4)
Context¶
IT Operations teams receive large volumes of unstructured support emails via shared Outlook mailboxes. These emails contain valuable operational intelligence but lack the structure of formal ITSM ticketing systems:
Current Challenges¶
| Challenge | Impact |
|---|---|
| Unstructured data | Cannot query "top issues" or "worst apps" |
| Reply chain noise | Signatures, disclaimers, "thank you" messages pollute analysis |
| No ITIL mapping | Emails don't classify as Incident/Problem/Change/Request |
| Pattern blindness | Recurring issues go undetected until major outages |
| Tribal knowledge | Solutions exist in email threads but aren't discoverable |
Desired Outcomes¶
- Most frequent issues ranked by volume and severity
- Effective solutions extracted and catalogued
- Worst offending applications identified for remediation
- Avoidance strategies generated proactively
- Trend detection for emerging problems
Decision¶
Implement a Hybrid AI Architecture combining traditional NLP for preprocessing with LLM-based extraction for ITIL insights, using a "Filter-Cluster-Extract" pattern.
Council Consensus¶
The LLM Council unanimously agreed on three critical architectural decisions:
- Graph API + Pre-processing Layer (not IMAP/SMTP)
- Vector Database + Temporal Aggregation for pattern detection
- Structured JSON Output constrained to ITIL taxonomy
Architecture¶
High-Level Data Flow¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 1: INGESTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ Microsoft Graph API │
│ ├── OAuth 2.0 (Client Credentials Flow) │
│ ├── Delta Queries for incremental sync │
│ ├── Target: Shared mailboxes (support@, helpdesk@) │
│ └── Batch: Every 15-60 minutes via orchestrator │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 2: PRE-PROCESSING │
├─────────────────────────────────────────────────────────────────────────────┤
│ 2a. PII Redaction (MUST be first) │
│ └── Microsoft Presidio: names, emails, IPs, passwords → <REDACTED> │
│ │
│ 2b. Thread Reconstruction │
│ └── Group by conversationId, extract First (problem) + Last (solution) │
│ │
│ 2c. Noise Removal │
│ ├── Signatures: regex for "Best regards", "--", phone patterns │
│ ├── Disclaimers: keyword match "confidential", "do not reply" │
│ ├── Reply headers: "On [date] wrote:", "From:", "Sent:" │
│ └── Transactional: discard <50 words or "thank you" sentiment │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 3: AI PROCESSING │
├─────────────────────────────────────────────────────────────────────────────┤
│ 3a. Vectorization (Low-cost, high-volume) │
│ ├── Model: text-embedding-3-small or all-MiniLM-L6-v2 │
│ └── Output: 384-1536 dim vectors per email thread │
│ │
│ 3b. Semantic Clustering │
│ ├── Algorithm: HDBSCAN (density-based, handles noise) │
│ └── Result: 1000 emails → ~50 distinct issue clusters │
│ │
│ 3c. LLM Extraction (High-cost, cluster centroids only) │
│ ├── Model: GPT-4o or Llama-3-70b │
│ ├── Input: Representative emails from each cluster │
│ └── Output: Structured JSON (see ITIL Schema below) │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 4: STORAGE │
├─────────────────────────────────────────────────────────────────────────────┤
│ Bronze Layer (Raw) │
│ └── Azure Blob / S3: Full JSON from Graph API (enables replay) │
│ │
│ Vector Store │
│ └── Pinecone / pgvector / FAISS: Embeddings + metadata │
│ │
│ Silver Layer (Structured) │
│ └── PostgreSQL: Extracted ITIL records for analytics │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 5: ANALYTICS │
├─────────────────────────────────────────────────────────────────────────────┤
│ Dashboards: Power BI / Grafana │
│ Reports: Automated weekly PDF via Azure Functions │
│ Alerts: Cluster drift detection → Teams/Slack notification │
└─────────────────────────────────────────────────────────────────────────────┘
ITIL Classification Schema¶
The LLM must output structured JSON conforming to this schema (enforced via Pydantic or JSON mode):
{
"thread_id": "AAMkAGI2...",
"timestamp": "2025-12-18T10:30:00Z",
"classification": {
"itil_type": "Incident | Problem | Service Request | Change Request | Noise",
"category": "Hardware | Software | Network | Identity | Security",
"subcategory": "VPN | Email | ERP | CRM | Endpoint | Other",
"severity": "Critical | High | Medium | Low",
"sentiment": "Frustrated | Neutral | Satisfied"
},
"extraction": {
"issue_summary": "One sentence description of the problem",
"affected_ci": "Application or system name (from approved CI list)",
"error_message": "Extracted error code or message if present",
"root_cause": "Identified cause or 'Unknown'",
"resolution_summary": "How it was resolved or 'Unresolved'",
"resolution_type": "Restart | Access Grant | Patch | Workaround | Escalation | User Training | Self-Healed"
},
"metadata": {
"thread_length": 5,
"time_to_resolution_hours": 2.5,
"participants": ["support@company.com", "<REDACTED>"]
}
}
Constrained Generation¶
To prevent hallucination of application names, provide the LLM with an enumerated list of valid Configuration Items:
VALID_CIS = [
"Microsoft Outlook", "Microsoft Teams", "SAP ERP", "Salesforce CRM",
"GlobalProtect VPN", "Okta SSO", "ServiceNow", "Jira", "Confluence",
"Active Directory", "Azure AD", "Office 365", "Other"
]
The LLM must select from this list or output "Other" with a free-text description.
Analytics Outputs¶
1. Frequent Issues Report¶
| Rank | Issue Cluster | Volume | Severity | Top Solution |
|---|---|---|---|---|
| 1 | VPN Connection Failures | 234 | High | Token resync via self-service |
| 2 | Password Reset Requests | 189 | Low | AD self-service portal |
| 3 | Outlook Freezing | 156 | Medium | Clear OST cache |
2. Worst Offending Applications¶
Scoring: Score = Volume × Severity_Weight × (1 + Negative_Sentiment_Ratio)
| Application | Incidents | Avg Severity | Frustration Score | Trend |
|---|---|---|---|---|
| SAP ERP | 89 | High | 0.72 | ↑ 23% |
| GlobalProtect VPN | 234 | Medium | 0.45 | ↓ 12% |
| Microsoft Teams | 67 | Low | 0.31 | → Stable |
3. Avoidance Matrix¶
HIGH COMPLEXITY (Long threads, escalations)
│
PROBLEM MANAGEMENT │ ROOT CAUSE ANALYSIS
(Systemic fixes needed) │ (Deep investigation)
│
──────────────────────────┼──────────────────────────────
│
AUTOMATION CANDIDATE │ TRAINING / DOCUMENTATION
(Chatbot, self-service) │ (User education)
│
LOW COMPLEXITY (Quick resolution)
LOW FREQUENCY ────────────┼──────────────── HIGH FREQUENCY
4. Proactive Recommendations¶
LLM-generated from top clusters:
"Based on 234 VPN connection failures in the past 30 days, recommend: 1. Implement automatic RSA token refresh before expiration 2. Create self-service portal for token resync 3. Add monitoring alert for certificate expiration"
Privacy & Security¶
| Concern | Mitigation |
|---|---|
| PII in emails | Microsoft Presidio redaction BEFORE any AI processing |
| Credentials in text | TruffleHog scanner for leaked passwords/API keys |
| Data residency | Same-region Azure OpenAI (GDPR/HIPAA compliance) |
| Access control | Row-level security by email folder/department |
| Retention | Auto-delete raw data after 90 days per policy |
| LLM data usage | Zero Data Retention agreement with API provider |
Alternatives Considered¶
Alternative 1: Pure LLM Approach (All emails through GPT-4)¶
Rejected: Cost-prohibitive at scale. Processing 10,000 emails/day at $0.01/email = $3,000/month. Hybrid approach reduces LLM calls by 90%+ via clustering.
Alternative 2: Pure Traditional NLP (No LLM)¶
Rejected: Cannot extract nuanced insights like root causes or generate avoidance recommendations. Keyword-based classification misses semantic similarity ("frozen" vs "hangs" vs "unresponsive").
Alternative 3: Fine-tuned Domain Model¶
Deferred: Requires 5,000+ labeled examples. Start with few-shot prompting; fine-tune later if accuracy <85% on validation set.
Alternative 4: Direct IMAP/SMTP Access¶
Rejected: Security concerns (storing credentials), no OAuth support, limited metadata. Graph API provides conversationId, threading, and enterprise-grade auth.
Implementation Phases¶
Phase 1: Foundation (Weeks 1-4)¶
- [ ] Azure AD app registration with Graph API permissions
- [ ] Delta query ingestion pipeline (Azure Functions)
- [ ] PII redaction with Presidio
- [ ] Bronze layer storage (Blob)
Phase 2: Intelligence (Weeks 5-8)¶
- [ ] Thread reconstruction and noise filtering
- [ ] Embedding generation pipeline
- [ ] Vector store setup (pgvector or Pinecone)
- [ ] HDBSCAN clustering job
Phase 3: Extraction (Weeks 9-12)¶
- [ ] LLM extraction prompts with JSON schema
- [ ] ITIL classification validation
- [ ] Silver layer database schema
- [ ] CI enumeration and constraint logic
Phase 4: Analytics (Weeks 13-16)¶
- [ ] Power BI dashboard integration
- [ ] Automated weekly reports
- [ ] Cluster drift alerting
- [ ] Avoidance recommendation generation
Risks and Mitigations¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Thread fragmentation (10 replies = 10 incidents) | High | High | Strict conversationId aggregation |
| Signature hallucination (LLM reads signature as issue) | Medium | Medium | Aggressive regex truncation before LLM |
| Topic drift in reply chains (new issue in old thread) | Medium | Medium | LLM prompt: "Is this semantically related?" |
| "Thank you" pollution | High | Low | Filter messages <50 words + positive sentiment |
| Cost overruns | Medium | High | Cluster-first approach; LLM only for centroids |
| Compliance breach (PII to external LLM) | Low | Critical | Presidio redaction as FIRST step, not optional |
Success Metrics¶
| Metric | Target | Measurement |
|---|---|---|
| Issue classification accuracy | >85% F1 | Manual validation of 200 samples |
| Pattern detection coverage | >90% of issues in clusters | Noise cluster <10% |
| Time to insight | <24 hours | Ingestion to dashboard latency |
| Cost per email | <$0.001 | Total monthly cost / emails processed |
| Actionable recommendations | 5+ per week | Auto-generated avoidance strategies |
Technology Stack¶
| Layer | Technology | Rationale |
|---|---|---|
| Ingestion | MS Graph API + Azure Functions | Native O365 integration, serverless scale |
| Orchestration | Apache Airflow | Complex DAG support, monitoring |
| PII Redaction | Microsoft Presidio | Open source, runs locally |
| Embeddings | text-embedding-3-small | Cost-effective, good quality |
| Vector Store | pgvector (PostgreSQL) | Single database, familiar SQL |
| Clustering | HDBSCAN (scikit-learn) | Handles noise, variable density |
| LLM | Azure OpenAI GPT-4o | Enterprise compliance, JSON mode |
| Structured Output | Pydantic + Instructor | Schema enforcement |
| Analytics | Power BI | Enterprise standard, Graph integration |
| Alerting | Azure Logic Apps | Low-code, Teams/Slack integration |
References¶
- Microsoft Graph API Mail Documentation
- Microsoft Presidio
- ITIL 4 Foundation
- HDBSCAN Clustering
- OpenAI JSON Mode
Council Review Summary¶
Reviewed by: GPT-5.1, Claude Opus 4.5, Gemini 3 Pro, Grok-4
Council Rankings: 1. Claude Opus 4.5 (0.667) - Hybrid embeddings + LLM approach 2. GPT-5.1 (0.500) - Feedback loop emphasis 3. Gemini 3 Pro (0.333) - PII-first architecture 4. Grok-4 (0.000) - Cloud-native scaling
Key Council Insights: - PII redaction must be architectural prerequisite, not afterthought - Vector clustering before LLM extraction reduces costs 90%+ - Structured JSON output prevents ITIL taxonomy drift - Time-windowed aggregation distinguishes chronic vs acute issues