Energy

AI Productionization & Observability at E.ON

Role: AI-native PO | Technical Manager, Observability Platform · 2025

AI productionization for 8M+ energy customers under KRITIS. Six monitoring tools consolidated into one platform. 2M+ monthly LLM interactions at 92% accuracy.


Monthly LLM interactions at 92% accuracy: 2M+
Model deployment: 45% faster
Monitoring tools consolidated: 6 → 1


Team: Tungi Dang; AI-native PO | Technical Manager

Two problems that were secretly one

E.ON runs 1.6 million km of energy networks for 47 million customers across 17 countries. When something breaks, people lose power. My engagement had two remits: get AI into production for 8M+ energy customers, and fix the observability foundation underneath it. They sound separate. They aren't — AI-assisted operations are only as good as the telemetry they reason over, and E.ON's telemetry was split across six monitoring tools.

"Which tool has the answer?"

Six tools across IT, OT, and grid operations, none with the full picture. Decentralised renewables were making the grid more volatile by the quarter, alert fatigue was burning out on-call teams, and every incident opened with the same ritual: figuring out which of the six screens to trust.

The instinct in every large organisation is to commission tool number seven, the one that will finally replace the others. I'd made a version of that bet earlier in my career and paid for it — a technically better platform that stalled because people had years of scripts, integrations, and muscle memory wrapped around the old ones. So at E.ON we didn't replace anyone's workflow; we consolidated underneath it. OpenTelemetry became the single standard for traces, metrics, and logs — vendor-neutral by design — with New Relic as the platform layer: APM, infrastructure monitoring, Kubernetes auto-discovery via eBPF, AI-assisted root cause analysis.

The part I pushed hardest wasn't tooling. It was semantic conventions — the unglamorous work of making teams name things consistently. Machines can't correlate what humans label inconsistently, and every AI-diagnostics ambition dies on that rock first.
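To make the naming point concrete, here is a minimal Python sketch of a convention check that could run in CI or at the collector boundary. The required keys, the naming pattern, and the `find_naming_issues` helper are illustrative assumptions, not E.ON's actual rules; only `service.name` is taken from the OpenTelemetry resource conventions, and the other keys are placeholders.

```python
import re

# Hypothetical required resource attributes. Only `service.name` comes from
# the OpenTelemetry resource conventions; the others are placeholders.
REQUIRED_KEYS = {"service.name", "team.name", "grid.region"}

# Assumed house rule: lowercase, dot-separated namespaces, snake_case segments.
KEY_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")


def find_naming_issues(attributes: dict) -> dict:
    """Return missing required keys and keys that break the naming pattern."""
    missing = sorted(REQUIRED_KEYS - set(attributes))
    malformed = sorted(key for key in attributes if not KEY_PATTERN.match(key))
    return {"missing": missing, "malformed": malformed}


# Example: one team labelled its owner as `TeamName` instead of `team.name`.
telemetry = {
    "service.name": "billing-api",
    "TeamName": "billing",
    "grid.region": "north",
}
print(find_naming_issues(telemetry))
```

A check like this is deliberately dumb: it only confirms that telemetry from different teams is labelled with the same keys, so that anything correlating it later has a common vocabulary to join on.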

Putting AI into production under KRITIS

Deploying AI for critical infrastructure is not a model problem; it's a governance and delivery problem. The MLOps pipelines we built, with automated testing replacing manual release steps, cut model deployment time by 45% and production incidents by 60%. The customer-service LLM deployment now handles 2M+ monthly interactions at 92% accuracy. And all of it ships inside the regulatory envelope: BSI C5 controls, NIS2, and KRITIS/IT-SiG 2.0, with DSFA (data protection impact assessment) documentation for every AI-driven use of customer data.
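The idea of automated testing replacing a manual release step can be pictured as a promotion gate: a candidate model is scored on a labelled evaluation set and is only promoted if it clears a threshold. The Python sketch below is illustrative only. The `release_gate` function, the 0.92 default (borrowed from the accuracy figure quoted here), and the toy data are assumptions, not the actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class GateResult:
    accuracy: float
    passed: bool


def release_gate(
    predict: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
    min_accuracy: float = 0.92,  # assumed threshold, for illustration only
) -> GateResult:
    """Score `predict` on (input, expected) pairs and compare to the threshold."""
    if not examples:
        raise ValueError("evaluation set must not be empty")
    correct = sum(1 for text, expected in examples if predict(text) == expected)
    accuracy = correct / len(examples)
    return GateResult(accuracy=accuracy, passed=accuracy >= min_accuracy)


# Toy stand-in for a model: routes a customer message to an intent label.
def toy_model(text: str) -> str:
    return "billing" if "invoice" in text.lower() else "outage"


eval_set = [
    ("Where is my invoice?", "billing"),
    ("Power is out on my street", "outage"),
    ("Invoice amount looks wrong", "billing"),
    ("No electricity since 8am", "outage"),
]
result = release_gate(toy_model, eval_set)
print(result)
```

In a pipeline, a failed gate would stop the release job rather than hand the decision to a person, which is where the time saving in a setup like this would come from.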

On top of that baseline we piloted the agentic layer: MCP-connected LLMs against New Relic in AWS and Azure, doing automated diagnostics and remediation with human-in-the-loop guardrails. A cross-functional working group — observability, operations, data science — checked that collected metrics actually mapped to defined SLOs, so the pilot had to prove value in numbers the operators already trusted.
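A human-in-the-loop guardrail of the kind described here can be pictured as a gate between what an agent proposes and what actually runs. The sketch below is a hypothetical Python illustration: the action names, the two allowlists, and the `decide` function are assumptions and do not reflect the pilot's real policy or any MCP or New Relic API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    EXECUTE = "execute"
    NEEDS_APPROVAL = "needs_approval"
    REJECTED = "rejected"


# Hypothetical allowlist: read-only diagnostics the agent may run unattended.
DIAGNOSTIC_ACTIONS = {"fetch_logs", "query_metrics"}
# Hypothetical remediation actions that always need a named human approver.
REMEDIATION_ACTIONS = {"restart_service", "scale_deployment"}


@dataclass(frozen=True)
class Proposal:
    action: str
    target: str
    rationale: str


def decide(proposal: Proposal, approved_by: Optional[str] = None) -> Decision:
    """Gate an agent proposal: diagnostics run freely, remediation needs sign-off."""
    if proposal.action in DIAGNOSTIC_ACTIONS:
        return Decision.EXECUTE
    if proposal.action in REMEDIATION_ACTIONS:
        return Decision.EXECUTE if approved_by else Decision.NEEDS_APPROVAL
    return Decision.REJECTED  # anything not explicitly listed is denied


restart = Proposal("restart_service", "billing-api", "error rate spike after deploy")
print(decide(restart))
print(decide(restart, approved_by="on-call engineer"))
```

The deny-by-default branch is the important design choice in this sketch: an action the policy has never heard of is rejected even if someone approves it.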

Alerts that name the business impact

Pathpoint mapped customer journeys, order flows, and backend services to shared KPIs, and grid operators got real-time state estimation and congestion detection. An incident stopped arriving as "this service is down" and started arriving as "this is affecting X customers in their billing flow." Prioritisation arguments got noticeably shorter.
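The shift from "this service is down" to a business-impact message amounts to a lookup from technical service to customer journey, plus a count of affected customers. A minimal Python sketch follows, with a hypothetical `SERVICE_TO_JOURNEY` mapping and made-up numbers; it does not use Pathpoint's actual configuration format.

```python
# Hypothetical mapping from backend service to the customer journey it supports.
SERVICE_TO_JOURNEY = {
    "invoice-renderer": "billing flow",
    "meter-ingest": "meter reading flow",
}


def describe_impact(service: str, affected_customers: int) -> str:
    """Turn a service-level alert into a business-impact sentence."""
    journey = SERVICE_TO_JOURNEY.get(service)
    if journey is None:
        # Surface the gap instead of guessing which journey is affected.
        return f"{service} is down; no customer journey is mapped to it yet."
    return (
        f"{service} is down: affecting {affected_customers:,} customers "
        f"in their {journey}."
    )


print(describe_impact("invoice-renderer", 12400))
print(describe_impact("legacy-batch", 0))
```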

Scaling without a bottleneck

Self-service dashboards and templates took observability to thousands of users without queueing on a central team. A reusable onboarding playbook — collectors, alerts, SLOs, dashboards — meant new teams reached production-grade observability in days.

What changed

Production AI serving 8M+ energy customers: 2M+ monthly LLM interactions at 92% accuracy
45% faster model deployment and 60% fewer production incidents through MLOps pipelines with automated testing
Six monitoring tools consolidated into one OpenTelemetry platform spanning IT, OT, and grid
Agentic diagnostics piloted with MCP-connected LLMs, gated by human-in-the-loop guardrails
BSI C5, NIS2, and KRITIS-compliant governance with DSFA documentation

These patterns are part of The AI-native Platform Playbook — five principles for productionizing LLMs in KRITIS-regulated environments.

Tags: Energy · Observability · OT/IT Convergence · Business Journeys · Multi-tenant Platform

