When we say “connected products” here, we’re generally talking about business-to-business Industrial Internet of Things (IIoT) products: connected products deployed by one business for the benefit of other businesses. There may be consumers as the final end-users, but it’s the B2B nature of the relationship that’s the key aspect. These B2B IIoT devices typically serve commercial, industrial, energy and smart-city applications - for example smart meters, vending machines and electric vehicle charging points.
In this situation, the vendor who is deploying the connected products will usually have some kind of contractual relationship with their customer, and in time this will likely include a Service Level Agreement (SLA).
A history lesson
At my last company AlertMe we built one of the first mass-market Smart Home platforms. In March 2010 we’d just completed a trial deploying a couple of thousand devices via a large “channel partner” customer Centrica, to their end users (UK householders), for what eventually became Hive™. We held a big meeting to review the trial and agree next steps, and they started by outlining their goals to deploy 10k devices by this date, 100k devices by that date … it all sounded great. Then they dropped their bombshell:
“But your stuff doesn’t work!”
Hmm, we thought. Certainly we’d experienced some teething problems during the trial, various bugs and process issues, which we’d ironed-out - but that’s what trials are for, right? According to our data, everything now seemed to be working pretty smoothly.
But it turned out that according to their data, there were still various unresolved issues. Each side had lots of data to support its position, and after waving bits of paper about and some heated discussion, we finally came to an understanding of the fundamental problem between us:
We hadn’t defined what “working” meant
For example, if a user intentionally turns off their internet connection at night, does that count as a product failure? And how good is good-enough? This was when we thrashed out the first SLA between us, as part of an overall contract for the next stage of growth: defining the metrics for “working”, and agreeing what numbers were good-enough.
At DevicePilot, we help our customers through this classic “growing pain” barrier, which is a sign of increasing maturity and customer success.
Connected products are services. High service levels (happy customers, revenue) are achieved by using good tools wisely and creating the right set of business processes. It doesn’t all get sorted out on day one - there’s a journey to maturity. Until technical quality is high, technical quality is rightly what gets the most attention. For example, is poor device uptime being caused primarily by software bugs, or comms problems?
Once technical quality has been made good-enough, companies become increasingly interested in the quality of service delivery to the customer, which is a higher-level, more business-oriented metric which goes right to the heart of every business. Service quality can be affected by business and people issues just as much as by technical issues. Unlike uptime, this requires thinking about more than just single devices in isolation, and recognising that they often get deployed in clusters or sites.
How SLAs are made
SLAs usually need to be:
- Hard, measurable numbers (objective, not subjective)
- Measured over a period of time
  - Typically a “tumbling window” of time, such as a calendar month
  - But sometimes a sliding window, such as “the last 30 days”
- Balanced: there’s a cost to achieving high performance, which must be weighed against the cost of lost revenue and financial penalties
- Under the vendor’s control. SLAs have to be “fair”: if a vendor is to be held responsible for hitting a number, they must have the power to meet that number, and mustn’t be held responsible for circumstances beyond their control.
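To make the two window types concrete, here’s a minimal Python sketch. The hourly sample data, the function names and the 30-day window are all illustrative assumptions, not from any particular product or API:

```python
from datetime import datetime, timedelta

# Hypothetical uptime samples: (timestamp, device_was_up) pairs,
# one per hour for 60 days, with the device down 1 hour in every 10.
samples = [
    (datetime(2024, 1, 1) + timedelta(hours=h), h % 10 != 0)
    for h in range(24 * 60)
]

def tumbling_windows(samples, start, window=timedelta(days=30)):
    """Group samples into fixed, non-overlapping windows (like calendar periods)."""
    windows = {}
    for ts, up in samples:
        bucket = start + window * ((ts - start) // window)
        windows.setdefault(bucket, []).append(up)
    return windows

def sliding_window(samples, now, window=timedelta(days=30)):
    """Keep only the samples from the trailing window, e.g. 'the last 30 days'."""
    return [up for ts, up in samples if now - window <= ts <= now]

start = datetime(2024, 1, 1)
now = samples[-1][0]
for bucket, ups in sorted(tumbling_windows(samples, start).items()):
    print(bucket.date(), f"{sum(ups) / len(ups):.1%}")  # one figure per fixed period
recent = sliding_window(samples, now)
print("last 30 days:", f"{sum(recent) / len(recent):.1%}")  # rolls forward continuously
```

The practical difference: a tumbling window “resets” at each period boundary, while a sliding window means a bad day keeps dragging the figure down until it ages out.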
KPI → SLA
SLAs are built upon the building-block of the KPI (Key Performance Indicator). A KPI is a measurement over a period of time. An example technical KPI is “device uptime”, and we could choose to measure this per day, week etc. To get an SLA, we apply a threshold to the KPI, so that, at any time, we can make a binary decision: is the KPI meeting our SLA, or not?
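As a minimal sketch of that building-block relationship (the names, the 90% threshold and the sample figures are illustrative, not a real API):

```python
def uptime_kpi(up_seconds: float, period_seconds: float) -> float:
    """The KPI: device uptime as a fraction of the measurement period."""
    return up_seconds / period_seconds

SLA_THRESHOLD = 0.90  # assumed SLA: "at least 90% uptime"

def meets_sla(kpi_value: float, threshold: float = SLA_THRESHOLD) -> bool:
    """The binary decision: is the KPI meeting our SLA, or not?"""
    return kpi_value >= threshold

week = 7 * 24 * 3600
# 6.5 days up out of 7 is ~92.9% uptime, which clears the 90% threshold.
print(meets_sla(uptime_kpi(up_seconds=6.5 * 24 * 3600, period_seconds=week)))  # True
```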
A good SLA often needs to take into account more than just how each single device behaves in isolation. For example, electric vehicle chargers are often deployed together in clusters: in one particular car park there might be 4 chargers.
Looking at this from the end-user’s perspective (which we should always try to do):
- Does a user care if one charger is broken? Not really, if there’s another one next to it that is working and available.
- If all the chargers are working, but all are busy being used, then while there is no technical problem (the uptime is 100% right now), there certainly is a business problem - the user can’t charge their car.
(for a deeper dive into this example, see Uptime and Availability - a tale of two metrics)
As another example, imagine that there are 8 ticket barriers at a railway station. If one stops working, that may be no big problem. But when the next one stops working too, suddenly we’re below expected capacity and big queues will form, which could even cause safety issues. And rush-hour starts in 3 hours! So here our criterion would be not “at least one available” per site, but rather “no more than 1 broken”, or perhaps “at least <capacity> available at each site”.
The point is that an SLA often needs to consider the clusters in which a connected product is deployed - in these examples the cluster is a car-park, or a ticket hall. Clusters don’t have to be geographically-localised, but they usually are, and so often we use the generic term “site” for a cluster. As in, “which sites met their SLA last month?”.
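Both of the examples above reduce to the same check with a different minimum count, which a short sketch makes clear. The site data below is invented for illustration:

```python
def at_least_n_available(available_counts: list[int], n: int) -> bool:
    """True if the site had at least n devices available at every sample point."""
    return all(count >= n for count in available_counts)

# Per-site samples of how many devices were available at successive points in time.
car_park = [2, 1, 1, 3]      # 4 chargers installed; at least one always available
ticket_hall = [8, 7, 6, 8]   # 8 barriers; assume rush hour needs at least 7

print(at_least_n_available(car_park, n=1))     # EV-charger criterion: True
print(at_least_n_available(ticket_hall, n=7))  # capacity criterion: False (dipped to 6)
```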
The anatomy of an SLA
Let’s explore the anatomy of a particular SLA, using electric vehicle charging as an example.
“In any calendar month, on any site, there must be at least 1 working charger available, at least 90% of the time, during business hours”.
Perhaps there’s a $10,000 penalty payable for each month in which this SLA is not met.
In the diagram below we can see how this is derived. Reading from the left:
1. A number of mainly technical issues might affect the up-time of a particular device
2. And of course, when a device is being used, it’s not available for others to use
3. Taking 1) & 2) into account, when a device is up and not in-use, then it’s available. We have 4 devices on this particular site, so we can measure availability of “at least one working charger” on that site, over time, and see if it meets the “at least 90% of the time” criterion.
4. We can do the same for each site, to see if this customer’s overall SLA is met
There are plenty of ways in which SLAs can be qualified. We’ve already specified that we only want to measure availability “during business hours”, and of course business hours may vary from site to site. There can be other qualifiers too, for example:
- We only want to measure devices that are supposed to be working - marked as “deployed” - not ones that are waiting to be commissioned, or have been intentionally taken out of service.
- Different types of device may have different “value”, e.g. a rapid EV charger is a more valuable asset than a slow one.
- Some sites may have “VIP” status if e.g. they are trial sites.
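Putting the derivation and the qualifiers together, here’s a minimal sketch. The field names (`deployed`, `business_hours`, etc.) and the sample data are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class ChargerSample:
    """One observation of one device at one point in time."""
    up: bool              # no technical fault
    in_use: bool          # currently charging a vehicle
    deployed: bool        # commissioned and supposed to be working
    business_hours: bool  # observation falls within this site's business hours

def available(s: ChargerSample) -> bool:
    """A device is available when it is up AND not in use."""
    return s.up and not s.in_use

def counts_towards_sla(s: ChargerSample) -> bool:
    """Qualifiers: only deployed devices, only during business hours."""
    return s.deployed and s.business_hours

samples = [
    ChargerSample(up=True,  in_use=False, deployed=True,  business_hours=True),  # available
    ChargerSample(up=True,  in_use=True,  deployed=True,  business_hours=True),  # busy
    ChargerSample(up=False, in_use=False, deployed=True,  business_hours=True),  # broken
    ChargerSample(up=False, in_use=False, deployed=False, business_hours=True),  # awaiting commissioning: excluded
]

measured = [s for s in samples if counts_towards_sla(s)]
availability = sum(available(s) for s in measured) / len(measured)
print(f"{availability:.0%}")  # 1 of the 3 qualifying samples is available -> 33%
```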
How might the SLA not be met?
Imagining the ways in which an SLA might not be met gives us insights into ways to make sure it is met.
Since the criterion is that “at least one charger is available” (rather than requiring e.g. a percentage of chargers to be available), then sites with more chargers are probably less likely to violate the SLA, all other things being equal. Adding an extra charger at small sites is one way to radically improve our measured performance, albeit at quite a cost.
Which raises a key point: the tighter an SLA is, the more it costs to meet. So if the customer wants a tighter SLA, they will need to pay more. Or from a vendor’s perspective, if you’re able to deliver better performance, then maybe you can raise your prices.
Equally, if one particular charger on a small site has a long-term outage (maybe it’s waiting for a part), that is likely to push that site towards violating the SLA. So perhaps we should prioritise long outages, especially at small sites.
Since the SLA is monthly, then we might just get unlucky if a number of problems happen at one site by chance, all within one calendar month. If it’s halfway through the month right now, and a site has had at least one charger available only 80% of the time, then we really need to make sure it has perfect availability for the rest of this month, or we’ll miss our SLA.
And if we’ve already “blown” the SLA for a site this month, then from a contractual standpoint it might make sense even to de-prioritise work on that site (as long as we get it going again before next month).
Sometimes an SLA will be set across the whole estate, in which case one problematic site can blow the whole SLA. But in that case, once the SLA is blown then other sites don’t matter, so sometimes it makes more sense to set the SLA per site, so any penalties are proportional to the number of sites with inadequate service. This is probably a better match to customer satisfaction.
SLAs are measured over time, so to “manage to an SLA” there is usually a window of time during which it’s possible to recover a situation before the SLA is violated. For example, an SLA based on achieving say 90% performance over any calendar month enters each month with a fresh ‘budget’ of 10% performance loss. So even if a catastrophic problem occurs, we have up to 3 days to fix it without SLA penalties - provided we then achieve 0% loss for the rest of the month. This rather extreme example is illustrated below:
Probably a sensible rule of thumb is to attempt not to lose more than the fractional SLA budget remaining, otherwise we’re incurring a “debt” that we’ll need to repay over the remainder of the month by over-performing.
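The budget arithmetic above can be sketched in a few lines. The 90% threshold and 720-hour (30-day) month are the worked assumptions from the text, and `on_track` is one possible reading of the pro-rata rule of thumb:

```python
def loss_budget_remaining(threshold: float, month_hours: float,
                          lost_hours: float) -> float:
    """Hours of additional performance loss we can still afford this month."""
    return (1 - threshold) * month_hours - lost_hours

def on_track(threshold: float, elapsed_hours: float, lost_hours: float) -> bool:
    """Rule of thumb: don't lose more than the pro-rata budget for the time elapsed."""
    return lost_hours <= (1 - threshold) * elapsed_hours

month = 30 * 24.0  # 720 hours
print(round(loss_budget_remaining(0.90, month, lost_hours=0), 1))   # 72.0 hours (~3 days) at month start
print(round(loss_budget_remaining(0.90, month, lost_hours=50), 1))  # 22.0 hours left after 50 hours lost
print(on_track(0.90, elapsed_hours=360, lost_hours=30))             # True: within the pro-rata budget (~36h)
print(on_track(0.90, elapsed_hours=360, lost_hours=40))             # False: incurring a "debt" to repay
```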
Predicting performance against SLA
Any prediction of future performance is based on probabilities. It rarely makes sense to say “I have 100% confidence that we will meet this month’s SLA” (though in the above example, if we’ve had a perfect month so far, then 3 days from the end of the month we do become 100% confident we will meet the SLA).
The better your model, the better your prediction. If we have no “inside knowledge” of a business then historical performance is the best guide, and historical data can be used to build a statistical model to then predict future performance.
But in general we do have inside knowledge which we can use to produce more-accurate predictions. For example, during holiday periods we might anticipate lower availability of repair people, and therefore longer time-to-repair.
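As a minimal illustration of the historical-data approach, assume monthly availability is roughly normally distributed (a strong simplifying assumption, and the history figures below are invented):

```python
import statistics
from math import erf, sqrt

# Invented history of monthly availability for one site.
history = [0.95, 0.92, 0.97, 0.88, 0.94, 0.93, 0.91, 0.96]
mu = statistics.mean(history)
sigma = statistics.stdev(history)

def p_meet_sla(threshold: float) -> float:
    """P(next month's availability >= threshold), under a normal model."""
    z = (threshold - mu) / sigma
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))  # 1 - normal CDF at z

print(f"Estimated chance of meeting a 90% SLA next month: {p_meet_sla(0.90):.0%}")
```

Inside knowledge (holiday staffing, a known parts shortage) would then adjust this baseline estimate up or down.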
Who uses SLAs?
SLAs are at the heart of the success of a business, so it is everyone’s business to make sure they are met. But people in certain roles are likely to pay particular attention to SLAs:
- Operations teams. They probably “own” the SLAs, and need to plan, and prioritise remediation work, based on Value/Cost assessment.
- Financial teams: SLAs are a significant area of business risk (because of potential penalties), and so financial people will want to monitor them closely, and perhaps investigate near-misses as well as violations, to make a financial decision about what investment might be justified to ensure SLAs continue to be hit.
- Sales teams: SLAs indicate you’re delivering something of value, so they are a key part of initial or ongoing pricing negotiations.
- Customers: Customers will use SLAs to hold vendors to account, so as a vendor it’s vital that you’re on top of things, and proactively “managing to” the SLA, rather than just reporting failure retrospectively. And indeed, it’s important to understand what SLA you can deliver before signing any contract.
How is SLA management different from Service Monitoring?
The worlds of Service Monitoring and SLA management are close and do overlap. Typically the differences are:
- SLAs need to be proactively “managed-to”, to ensure that they are achieved. It’s not sufficient to just report how and why you missed them.
- In general, Service Monitoring doesn’t differentiate between customers, whereas SLA management does, because different customers have different SLAs.
DevicePilot is a good fit for both service monitoring and SLA management.
Benchmarks - what does “good” look like?
IoT still has a long way to go to achieve the levels of performance achieved in server farms or telco infrastructure. In our June 2018 talk “IoT - is it always this hard?” we gave some sense of the typical performance levels that IoT projects can achieve, at different scales, from 100 to 100,000 devices deployed and beyond: Watch the video from 10’00” to 15’00”.
SLAs become an inevitable part of business once you start to serve large customers. It is just as important to get the SLA definition right, as it is to get the pricing right. There are many ways to define SLAs, but in all cases they can be “managed to” in a proactive way, so that at any moment in time, a business knows how close it is to not meeting its SLAs, and can take action in a timely manner to ensure that it does.
Further questions about defining or managing IoT SLAs? Please get in touch.