AI Productionization & Observability at E.ON
Role: AI-native PO | Technical Manager, Observability Platform · 2025
AI productionization for 8M+ energy customers under KRITIS. Six monitoring tools consolidated into one platform. 2M+ monthly LLM interactions at 92% accuracy.

- 2M+ monthly LLM interactions at 92% accuracy
- 45% faster model deployment
- 6 → 1 monitoring tools consolidated
Team
- Tungi Dang, AI-native PO | Technical Manager
Two problems, one engagement
E.ON runs 1.6 million km of energy networks for 47 million customers across 17 countries. When something breaks, people lose power. The engagement covered two connected problems: getting AI into production for 8M+ energy customers, and fixing the observability foundation that AI-assisted operations would depend on.
AI productionization at KRITIS scale
Deploying AI for critical infrastructure isn't a model problem; it's a governance and delivery problem. The work covered three areas. MLOps pipelines with automated testing frameworks cut model deployment time by 45% and production incidents by 60%. An LLM deployment for customer service automation now processes 2M+ monthly interactions at 92% accuracy. Cloud security governance was aligned with BSI C5 controls, NIS2, and KRITIS/IT-SiG 2.0, including DSFA (data protection impact assessment) documentation for AI-driven customer-data processing.
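One way to picture an automated testing gate in such a pipeline is a check that scores a candidate model on a labelled evaluation set and blocks promotion below an accuracy floor. The sketch below is illustrative only: `evaluate_gate`, `toy_predict`, the evaluation set and the threshold are all assumptions made for the example, not the pipeline actually used in the engagement.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class GateResult:
    accuracy: float
    passed: bool


def evaluate_gate(
    predict: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
    min_accuracy: float,
) -> GateResult:
    """Score a candidate model on labelled examples and compare to a floor."""
    if not examples:
        raise ValueError("evaluation set must not be empty")
    correct = sum(1 for text, label in examples if predict(text) == label)
    accuracy = correct / len(examples)
    return GateResult(accuracy=accuracy, passed=accuracy >= min_accuracy)


# Hypothetical stand-in for a real intent classifier.
def toy_predict(text: str) -> str:
    return "billing" if "invoice" in text.lower() else "outage"


# Hypothetical labelled examples; a real set would be far larger.
EVAL_SET = [
    ("Where is my invoice?", "billing"),
    ("Invoice amount looks wrong", "billing"),
    ("No power since this morning", "outage"),
    ("My meter reading is missing", "billing"),
]

# The toy model misses the last example, so a 0.9 floor blocks promotion.
result = evaluate_gate(toy_predict, EVAL_SET, min_accuracy=0.9)
```

In a CI job, a failed gate would typically stop the deployment step, so a regression is caught before it reaches production traffic.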
The agentic layer came next: a pilot of AI observability using New Relic and LLMs connected via MCP (Model Context Protocol) in AWS and Azure, covering automated diagnostics and remediation with human-in-the-loop guardrails. A cross-functional working group (observability, operations, data science) validated that the collected metrics mapped to defined SLOs and delivered measurable value.
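A human-in-the-loop guardrail can be sketched as a small policy: read-only diagnostic actions run automatically, while anything that changes the system waits for a named approver. The risk classes, the `decide` function and the action names below are hypothetical and only illustrate the pattern; they do not describe the pilot's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Risk(Enum):
    READ_ONLY = 1  # e.g. fetch logs, run a query
    MUTATING = 2   # e.g. restart a service, change configuration


@dataclass(frozen=True)
class ProposedAction:
    name: str
    risk: Risk


def decide(action: ProposedAction, approved_by: Optional[str] = None) -> str:
    """Return 'execute' or 'await_approval' for an agent-proposed action."""
    if action.risk is Risk.READ_ONLY:
        return "execute"
    # Mutating actions need an explicit human approver.
    return "execute" if approved_by else "await_approval"


diagnose = ProposedAction("fetch_recent_logs", Risk.READ_ONLY)
remediate = ProposedAction("restart_service", Risk.MUTATING)
```

Keeping the policy separate from the agent means the set of actions that require approval can be reviewed and audited on its own.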
Six tools, zero visibility
Six different monitoring tools across IT, OT, and grid operations had no shared view. Decentralised renewables were making the grid more volatile by the quarter. Alert fatigue was burning out on-call teams. Every incident started with the same question: which tool has the answer?
One backbone instead of six band-aids
OpenTelemetry became the single standard for traces, metrics, and logs — vendor-neutral by design. New Relic provided the platform layer: APM, infrastructure monitoring, Kubernetes auto-discovery via eBPF, and AI-assisted root cause analysis. For the first time, IT signals, OT telemetry, and grid state all flowed into one place.
The semantic convention work mattered as much as the tooling: fixing correlation gaps from inconsistent naming across teams so that AI-assisted diagnostics could actually reason across signals.
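To make the naming problem concrete, here is a minimal sketch of attribute normalisation: different teams emit the same fact under different keys, and a mapping rewrites them to one canonical key so signals can be joined. `service.name` and `host.name` are OpenTelemetry resource attribute names; the legacy aliases and the `normalize` helper are assumptions invented for the example.

```python
from typing import Dict

# Hypothetical legacy attribute names mapped to one canonical key.
ALIASES = {
    "svc": "service.name",
    "serviceName": "service.name",
    "app_name": "service.name",
    "hostname": "host.name",
    "host": "host.name",
}


def normalize(attributes: Dict[str, str]) -> Dict[str, str]:
    """Rename known aliases to canonical keys.

    An explicit canonical key always wins; among aliases, the first seen wins.
    """
    out: Dict[str, str] = {}
    for key, value in attributes.items():
        canonical = ALIASES.get(key, key)
        if canonical in out and key != canonical:
            continue  # an earlier value already claimed this canonical key
        out[canonical] = value
    return out


# Two teams describing the same service and host with different keys.
billing_span = normalize({"svc": "billing-api", "hostname": "node-7"})
gateway_log = normalize({"serviceName": "billing-api", "host": "node-7"})
```

After normalisation both records carry identical keys, so a trace from one team and a log line from another can be correlated on `service.name` and `host.name`.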
When an outage hits, the business impact is immediate
Pathpoint mapped customer journeys, order flows, and backend services to shared KPIs. Grid operators got real-time state estimation and congestion detection. When an incident fired, it came with context: not just "this service is down" but "this is affecting X customers in their billing flow." That changed how teams prioritised.
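The idea of an alert that arrives with business context can be sketched as a lookup from backend service to customer-journey stage. This is not Pathpoint's configuration format; the mapping, the customer counts and the `enrich` function are hypothetical values chosen only to show the shape of the enrichment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JourneyStage:
    journey: str
    stage: str
    active_customers: int  # hypothetical snapshot of customers in this stage


# Hypothetical mapping from backend service to customer-journey stage.
SERVICE_TO_STAGE = {
    "billing-api": JourneyStage("billing", "invoice download", 1200),
    "tariff-svc": JourneyStage("onboarding", "tariff selection", 300),
}


def enrich(service: str) -> str:
    """Turn a bare 'service down' alert into a message with business context."""
    stage = SERVICE_TO_STAGE.get(service)
    if stage is None:
        return f"{service} is down (no business mapping found)"
    return (
        f"{service} is down: affecting about {stage.active_customers} customers "
        f"in the {stage.journey} flow ({stage.stage})"
    )


message = enrich("billing-api")
```

With the impact stated in the alert itself, triage can rank incidents by affected customers rather than by which service paged first.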
Teams stopped waiting for the platform team
Self-service dashboards and templates scaled observability to thousands of users without bottlenecking on a central team. A reusable onboarding playbook standardised collectors, alerts, SLOs, and dashboards across markets. New teams went from zero to production-grade observability in days, not weeks.
What changed
- Production AI serving 8M+ energy customers — 2M+ monthly LLM interactions at 92% accuracy
- 45% faster model deployment, 60% fewer production incidents through MLOps pipelines
- Agentic AI observability with MCP-connected LLMs for automated diagnostics
- Unified telemetry across IT, OT, and grid, replacing six fragmented tools
- Faster incident detection and reduced MTTR through AI-assisted root cause analysis
- Business-aligned observability that ties every alert to customer and revenue impact
- BSI C5, NIS2, and KRITIS-compliant security governance with DSFA documentation
