You’re a business deploying connected products.
Your top-level business goals might include reducing churn, perhaps by increasing your Net Promoter Score (customer satisfaction), exceeding a certain deployment growth rate, driving up revenue, reducing operating costs, and meeting your Service Level Agreements (SLAs) with your customers.
Meanwhile, at a low level, your device estate is experiencing many different incidents, all of which damage your ability to achieve the above goals to a greater or lesser extent. They range from hardware and software problems to configuration and operational issues, not to mention connectivity issues. Your technical staff can understand, triage and fix these problems, but may not know or care very much about their effect on the high-level business goals.
As a business, you need to take action to resolve these incidents, to increase your business performance. There are a variety of resolution actions you can take, depending on the cause, such as “send a reset message”, “reset the breaker”, “upgrade the firmware”, “provision the SIM”, etc. These actions are the “hammers” your team can use to improve business performance. There are two ways to work out where to use them:
- We can analyse this top down: Why are our business numbers not what we want? Let’s drill down and identify all the detailed reasons.
- Or we can analyse this bottom up: What are all the problems, and how important is each to business performance?
Typically we need to do a bit of both. But either way, the “day job” is to carve off problems one by one and build a process for identifying and fixing each, while prioritising by their effect on the overall business goals.
KPIs to drive actions
To help us track what is going on, and prioritise, we define Key Performance Indicators (KPIs), each of which clearly relates to one or more of our business goals. Then at any moment in time we can say e.g. “we have 34 incidents of type X which are leading to a Y% drop in performance on KPI Z”. Thus we’re joining the dots from a horde of technical incidents all the way to their financial business consequences.
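As a rough illustration of that kind of statement, the sketch below groups open incidents by type and totals their effect on a KPI. The `Incident` fields, the incident type names and the per-incident `impact_pct` figures are all assumptions made for the example; in practice the impact of an incident on a KPI would have to come from your own data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # All field names here are hypothetical, chosen for illustration.
    device_id: str
    incident_type: str   # e.g. "offline_hw"
    kpi: str             # the KPI this incident damages
    impact_pct: float    # assumed percentage drop in that KPI


def summarise(incidents):
    """Group incidents by (type, KPI), totalling count and KPI impact."""
    totals = defaultdict(lambda: {"count": 0, "impact_pct": 0.0})
    for inc in incidents:
        entry = totals[(inc.incident_type, inc.kpi)]
        entry["count"] += 1
        entry["impact_pct"] += inc.impact_pct
    return dict(totals)


def statements(incidents):
    """Render one line per incident type, largest KPI impact first."""
    rows = sorted(
        summarise(incidents).items(),
        key=lambda item: item[1]["impact_pct"],
        reverse=True,
    )
    return [
        f"{v['count']} incidents of type {itype} are leading to "
        f"a {v['impact_pct']:.1f}% drop in {kpi}"
        for (itype, kpi), v in rows
    ]
```

Sorting by total impact rather than by count is a deliberate choice in this sketch: it puts the incident types that hurt the business most at the top, which is the prioritisation described above.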
Armed with a good set of KPIs, we can build live dashboards full of incidents of each type which need addressing. This will show us trends which give a big-picture view of the situation and changes over time, which is great.
But we also need to ensure that this information is actionable, so we can hit the incidents with our “hammers”. We can do this by building lists of actions, with each item on the list being a “problem:solution” pair, i.e. “incident:action”.
If we ensure that each defined incident is actionable then - in many cases - we can go further and automate those actions, closing the loop and driving up quality much more quickly and efficiently than manual processes can achieve.
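One way to picture the “incident:action” pairing and its automation is a simple lookup from incident type to handler. This is only a sketch: the incident type names and the handler functions are hypothetical stand-ins for whatever your device management platform actually provides, and incidents with no automated action (such as a breaker that needs resetting on site) fall through to a manual queue.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical handlers: real ones would call your device management system.
def send_reset_message(device_id: str) -> str:
    return f"reset message sent to {device_id}"


def upgrade_firmware(device_id: str) -> str:
    return f"firmware upgrade scheduled for {device_id}"


def provision_sim(device_id: str) -> str:
    return f"SIM provisioning requested for {device_id}"


# The "incident:action" list. Keys are illustrative incident type names.
ACTIONS: Dict[str, Callable[[str], str]] = {
    "offline_over_7_days": send_reset_message,
    "firmware_out_of_date": upgrade_firmware,
    "sim_not_provisioned": provision_sim,
}


def close_the_loop(incidents: List[Tuple[str, str]]):
    """Run the automated action for each (device_id, incident_type) pair.

    Returns (done, manual): results of automated actions, and the
    incidents that have no automated action and need a person.
    """
    done, manual = [], []
    for device_id, incident_type in incidents:
        action = ACTIONS.get(incident_type)
        if action is None:
            manual.append((device_id, incident_type))
        else:
            done.append(action(device_id))
    return done, manual
```

For example, `close_the_loop([("m1", "offline_over_7_days"), ("m2", "breaker_tripped")])` would send a reset to `m1` and leave `m2` in the manual list.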
Some example KPIs:
- Uptime - this is a very important number, but to make it actionable we probably need to split it up by cause, e.g. “offline due to hardware failure” or “offline for >7 days so needs a reset”, so we know which hammer to use to fix it.
- Performance out of spec - often devices will continue to work, but with operating parameters outside specification, whether that’s a low oil level, or a high input voltage or whatever, which is a warning sign that all is not well and some kind of action needs to be taken.
- Mis-configured - e.g. a meter that is reporting on the wrong day for billing, or reporting with the wrong cadence (e.g. it’s supposed to be half-hourly, but it’s only reporting daily).
- Customer usage challenges - e.g. a prepayment meter that is out of credit.
- Missing fuel - one meter of a pair is online, the other one not - a sign of problems either with that meter, or with the Home Area Network which connects them
- Site availability - for devices deployed in clusters, individual device failures may not immediately damage service availability, so we want to measure that by whole site, not by device.
- Nursery failures - failures during the first N hours of use, when the customer is getting those all-important first impressions.
- Delivery of business value - a great cross-check against any failures which we might not have anticipated is to actually measure the value delivered by our devices. For example, if a smart meter is supposed to send a reading every 30 minutes, do we have all those readings and do they look sensible? This kind of positive indicator can help us quickly detect “black swan” events which we weren’t expecting.
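The half-hourly meter example in the last item can be turned into a simple completeness measure. The sketch below assumes readings arrive as timestamps and that the expected cadence is one reading per 30-minute slot (48 per day); the function name and parameters are illustrative. It only covers the “do we have all those readings” half of the question, not whether the values look sensible.

```python
from datetime import datetime, timedelta


def completeness(timestamps, day_start, interval=timedelta(minutes=30)):
    """Fraction of expected slots in a 24-hour day with at least one reading.

    Duplicate readings in the same slot count once, and readings
    outside the day are ignored.
    """
    one_day = timedelta(days=1)
    expected = int(one_day / interval)  # 48 slots for a 30-minute cadence
    seen = set()
    for ts in timestamps:
        offset = ts - day_start
        if timedelta(0) <= offset < one_day:
            seen.add(offset // interval)  # index of the slot this reading falls in
    return len(seen) / expected
```

A meter scoring well below 1.0 on this measure is failing to deliver value, whatever the underlying cause turns out to be.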
Any KPI can of course be dimensioned by metadata e.g. “uptime by device type” or by customer, by region etc.
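A minimal sketch of that kind of dimensioning, assuming each device is represented as a dictionary holding its metadata and a per-device KPI value (the field names `device_type`, `region` and `uptime_pct` are made up for the example):

```python
from collections import defaultdict


def kpi_by_dimension(devices, dimension, kpi="uptime_pct"):
    """Average a per-device KPI value, grouped by one metadata field."""
    groups = defaultdict(list)
    for device in devices:
        groups[device[dimension]].append(device[kpi])
    return {key: sum(values) / len(values) for key, values in groups.items()}
```

The same device list can then be sliced by `"device_type"`, `"region"` or any other metadata field without changing how the KPI itself is defined.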
See also
- Best Practice SLA Management
- Uptime vs. Availability - a tale of 2 metrics
- The Device Kindergarten
- Who will win the race? Driving-up quality and efficiency in rapid growth Smart Energy.