Scaling AI at the Energy Edge: Why Pilots Succeed and Deployments Stall

I’ve been involved in distributed and embedded computing for about 30 years now, and one of the things I see again and again in the energy sector is how a successful AI pilot fails to become a successful deployment. The model works. The proof of concept proves what it was designed to do. Leadership approves the rollout. And then somewhere between site five and site fifty, the whole programme runs out of road.

This is a pattern worth being honest about, because the reasons rarely have anything to do with the AI itself.

Why the pilot worked in the first place

A pilot succeeds because it lives in a controlled setting. One site, a known set of assets, and an experienced team on hand to tune things as you go. That’s an environment you can actually manage. The trouble starts when you try to take whatever worked at one substation, one wind site, or one battery installation and roll it out across a fleet. Energy operations are large, distributed, equipment-rich systems, and the assets in them were never designed to work together. Different protocols, different naming conventions, different proprietary communication layers in different combinations at each site. A model that performed well at one location can produce noisy or unreliable output at the next, simply because the upstream data behaves differently there.
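To make the upstream-data point concrete, here is a minimal Python sketch of per-site tag normalisation, where each site's raw tag names and units are translated into one canonical schema before the model sees them. Everything named here is an assumption for illustration: the site IDs, the tag names, the canonical name `transformer_oil_temp_c`, and the `TagMapping` and `normalise` helpers are hypothetical and not taken from any real installation or product.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TagMapping:
    """How one site-specific tag maps onto the fleet-wide schema (hypothetical)."""
    canonical_name: str   # name the model expects at every site
    scale: float = 1.0    # multiply the raw value by this...
    offset: float = 0.0   # ...then add this to reach the canonical unit


# Hypothetical mappings: site_a reports Celsius, site_b reports Fahrenheit.
SITE_MAPPINGS: Dict[str, Dict[str, TagMapping]] = {
    "site_a": {
        "XFMR1_OilTemp_C": TagMapping("transformer_oil_temp_c"),
    },
    "site_b": {
        # (F - 32) * 5/9 expressed as scale and offset
        "T1.oil.tempF": TagMapping(
            "transformer_oil_temp_c", scale=5 / 9, offset=-32 * 5 / 9
        ),
    },
}


def normalise(site: str, raw: Dict[str, float]) -> Tuple[Dict[str, float], List[str]]:
    """Return (canonical readings, unmapped raw tags) for one site's sample."""
    mapping = SITE_MAPPINGS[site]
    canonical: Dict[str, float] = {}
    unmapped: List[str] = []
    for tag, value in raw.items():
        tag_mapping = mapping.get(tag)
        if tag_mapping is None:
            # Surface unknown tags instead of silently dropping them.
            unmapped.append(tag)
            continue
        canonical[tag_mapping.canonical_name] = (
            value * tag_mapping.scale + tag_mapping.offset
        )
    return canonical, unmapped
```

Returning the unmapped tags alongside the readings is a deliberate choice in this sketch: a new site with an unfamiliar naming convention shows up as an explicit gap rather than as quietly degraded model input.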

What I’d call the last-mile problem isn’t really about the model at all. The model is usually fine. The plumbing underneath it is what struggles to keep up.

Where the questions stop being about the model

Once you push past a handful of sites, the conversation stops being about algorithms and starts being about operations. How do you push a model update to hundreds of edge nodes without taking assets offline? Which version is running on which site, and who’s tracking that? How do you spot drift across a fleet when no two sites generate the same data? When a deployment misbehaves, how quickly can you roll back to a known good state? These questions sound less interesting than the AI itself, and that’s part of the problem. They get treated as secondary, and they’re where most rollouts quietly come apart.
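Two of those questions, which version is running where and how fast you can get back to a known good state, come down to bookkeeping. The Python sketch below shows one way that bookkeeping could look, assuming an in-memory registry; a real fleet would need persistent storage and an actual deployment mechanism behind it. The `FleetRegistry` and `NodeRecord` names, the node IDs, and the version strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeRecord:
    """Version state for a single edge node (hypothetical structure)."""
    current: Optional[str] = None      # version running now
    known_good: Optional[str] = None   # last version confirmed healthy
    history: List[str] = field(default_factory=list)  # audit trail


class FleetRegistry:
    """Tracks which model version runs on which node, with rollback."""

    def __init__(self) -> None:
        self._nodes: Dict[str, NodeRecord] = {}

    def deploy(self, node_id: str, version: str) -> None:
        """Record that a version has been deployed to a node."""
        record = self._nodes.setdefault(node_id, NodeRecord())
        record.current = version
        record.history.append(version)

    def mark_good(self, node_id: str) -> None:
        """Promote the node's current version to its known good state."""
        record = self._nodes[node_id]
        record.known_good = record.current

    def rollback(self, node_id: str) -> str:
        """Return the node to its known good version and record the change."""
        record = self._nodes[node_id]
        if record.known_good is None:
            raise RuntimeError(f"{node_id} has no known good version")
        record.current = record.known_good
        record.history.append(record.known_good)
        return record.current

    def versions(self) -> Dict[str, Optional[str]]:
        """Answer 'which version is running on which site?' in one call."""
        return {node_id: rec.current for node_id, rec in self._nodes.items()}
```

A short usage example, under the same assumptions:

```python
registry = FleetRegistry()
registry.deploy("substation-01", "1.0.0")
registry.mark_good("substation-01")
registry.deploy("substation-01", "1.1.0")   # new version misbehaves
registry.rollback("substation-01")          # back to "1.0.0"
```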

What I keep coming back to is that managing AI at scale is mostly a problem of lifecycle management. You need to develop the model, version it, version the data on which it was trained, test it in a lab or simulator, pilot it on a subset of the fleet (what some teams now call canary deployment) and only then push it out more broadly. After that, you need to monitor the model outputs against expected behaviour, watch for drift over time, and have automated processes to roll back to known safe states when something goes wrong. None of this is exotic. It’s just disciplined software lifecycle work applied to an asset class that most OT environments weren’t set up to handle.
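The canary, monitor, and roll back steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production design: `drift_score` is a deliberately simple mean-shift heuristic measured against each site's own baseline (so sites with different data are not compared to each other), and `staged_rollout` takes the deploy, health-check, and rollback actions as caller-supplied functions. All function names, parameters, and site IDs are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List


def drift_score(baseline_mean: float, baseline_std: float,
                recent: List[float]) -> float:
    """How many baseline standard deviations the recent mean has shifted.

    A crude heuristic; the baseline is per site, recorded at deployment time.
    """
    if baseline_std <= 0:
        raise ValueError("baseline_std must be positive")
    return abs(mean(recent) - baseline_mean) / baseline_std


def staged_rollout(
    sites: List[str],
    canary_fraction: float,
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Dict[str, object]:
    """Deploy to a canary subset first; continue only if every canary is healthy."""
    canary_count = max(1, int(len(sites) * canary_fraction))
    canaries, remainder = sites[:canary_count], sites[canary_count:]

    for site in canaries:
        deploy(site)

    failed = [site for site in canaries if not is_healthy(site)]
    if failed:
        # Any unhealthy canary stops the rollout and reverts all canaries.
        for site in canaries:
            rollback(site)
        return {"status": "rolled_back", "deployed": [], "failed": failed}

    for site in remainder:
        deploy(site)
    return {"status": "complete", "deployed": canaries + remainder, "failed": []}
```

In practice `is_healthy` might wrap something like `drift_score(...) < threshold` for the site in question, with the threshold chosen per asset class. Passing the actions in as functions keeps the rollout logic independent of how a given fleet actually ships and reverts models.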

Project thinking vs. platform thinking

That’s where project thinking and platform thinking start to diverge. Most organisations approach their first edge AI deployment as a project. Defined scope, defined site, defined timeline, a team that disbands once the pilot wraps. That works exactly once. The second deployment exposes everything the first one didn’t have to solve, because nothing about it was built to be reused.

A platform approach takes a different view. The edge layer is treated as durable infrastructure, with consistent ways to acquire and normalise data, deploy applications, orchestrate them across the fleet, manage their lifecycle, and observe what’s actually happening across hundreds of nodes. The AI model becomes one component running on that platform, not the whole story. New use cases inherit the foundation rather than rebuilding it.

This switch is harder than it looks. It means committing to open standards over a convenient proprietary integration. It requires investment in orchestration before there is a visible return. It means accepting that the first deployment will run slower because you are building a foundation that the next ten will inherit. The return appears on deployments two, five and fifty, when a new site goes live in days rather than months.

Nobody owns this yet

There’s also an organisational dimension I think gets underestimated. AI safety and AI operations don’t have a clear owner in most organisations. The work falls between data science, IT, OT, and the existing safety and engineering teams, and nobody quite has the brief for it. That has to change. Treating AI as a first-class, safety-relevant component, with shared ownership across those disciplines, is what separates operators who get to fleet scale from those who keep running pilots indefinitely.

The pilot is the easier half of any of this. Running the fleet is where the actual work sits.

Sponsored by IOTech.