Pilgrim
24th October 2017

The importance of operations

You’ve got over your initial product development hurdle, your connected product is into trials and you’re beginning to think about scaling. Sooner or later you’ll probably come to realise that the biggest long-term challenge of IoT is Operations. By Operations we mean all the people, tools and processes needed to support your connected devices day-to-day, keeping them working and keeping your customers happy. Effective operations is the difference between happy and unhappy customers, between profit and loss, and between success and failure. We’re often surprised by the lack of visibility that many companies have into the state of their connected devices. Whilst the truth may be scary, the first step on the road to improving the quality of your operations is to reveal exactly where you stand today.

How many devices are working right now?

Once you’ve connected your device data-stream to DevicePilot, your first step is to create a filter to teach DevicePilot what “good” looks like. Filters are a powerful concept used throughout DevicePilot. A filter selects a subset of your devices according to some combination of device property values and time. For example, we could define a simple filter called Down as:

installed but not seen for a day

In other words, the filter will select only devices that are installed (so we’d expect to be hearing from them) but aren’t producing regular heartbeats – so something’s gone wrong with either the device or its communications. Such a device is Down.
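
To make the idea concrete, here is a minimal Python sketch of the Down filter. It is not DevicePilot’s filter syntax or API; the `Device` record, its field names and the `is_down` helper are all assumptions made for illustration, as is the choice to treat an installed device that has never been heard from as Down.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Device:
    # Hypothetical record; field names are illustrative only.
    device_id: str
    installed: bool
    last_seen: Optional[datetime]  # None if we have never heard from it


def is_down(device: Device, now: datetime,
            timeout: timedelta = timedelta(days=1)) -> bool:
    """Installed but not seen for `timeout` (one day by default)."""
    if not device.installed:
        return False  # not installed, so silence is expected
    if device.last_seen is None:
        return True   # assumption: installed and never seen counts as Down
    return now - device.last_seen > timeout


now = datetime(2017, 10, 24, 12, 0)
devices = [
    Device("a", True, now - timedelta(hours=2)),   # recent heartbeat
    Device("b", True, now - timedelta(days=3)),    # silent for three days
    Device("c", False, None),                      # not installed yet
    Device("d", True, None),                       # installed, never seen
]

down_ids = [d.device_id for d in devices if is_down(d, now)]
print(down_ids)
```

Only the installed devices that have been silent for longer than the timeout are selected; the uninstalled device is ignored however quiet it is.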

Now you’ve created this filter, DevicePilot can count how many of your devices match that filter – how many are down right now. Is it 1% or 10%? A very useful metric to manage to. You can put this number live on your Dashboard for all to see, and you can plot how it changes over time to see whether things are improving or getting worse.

That’s all very useful and is a great first step, but you wanted more – so more you shall have.

  1. Over time, you’re likely to want to change your definition of Down. Perhaps you want to expand it to include situations where a device is still talking but is reporting an application error, or an exhausted battery. Having changed your definition, you’ll want to recalculate its history to preserve the historical baseline that you compare against.
  2. Few things in life are 100% reliable: At any given moment some of your devices will be down. But clearly there’s a difference between:
    a) A device that’s just gone down after a prolonged period of being up
    b) A device that’s down right now, having been up and down like a yo-yo for the past week (it may even be having a rare moment of working – very misleading if you only look at ‘now’).
    c) A device that’s just occasionally spotty, and we happen to have caught it at a rare bad moment.
    So when we ask about the performance of a device, we probably want to measure it over some period of time. For example over the past month was it down 1% of the time, or 50% of the time? That’s much more helpful than an instantaneous measurement for identifying problematic devices – and for quantifying those which need the most urgent action.
  3. Next you’ll want to identify the root cause(s) of failures – for a specific device and across your whole device estate. Although “correlation does not imply causation”, it’s often a great start: If you’re using multiple communications-providers, or multiple bearers (e.g. WiFi and cellular), then are devices using one more reliable than those using the other? Does failure correlate with specific hardware revisions, or specific software versions? Perhaps devices fail more when they’re hot? As soon as you think of a hypothesis you’ll want to test it.
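
The measurement over a period described in point 2 can be sketched as follows. This is an illustrative Python calculation, not DevicePilot’s implementation: the `downtime_fraction` helper is hypothetical, and it assumes a device counts as up for one timeout interval after each heartbeat and as down from then until the next heartbeat arrives.

```python
from datetime import datetime, timedelta


def downtime_fraction(heartbeats, start, end, timeout):
    """Fraction of the window [start, end] during which the device was down,
    i.e. no heartbeat had been received within the preceding `timeout`."""
    up = timedelta(0)
    covered_until = start  # everything before this is already counted as up
    for h in sorted(heartbeats):
        up_from = max(h, covered_until)   # avoid double-counting overlaps
        up_to = min(h + timeout, end)     # clip to the end of the window
        if up_to > up_from:
            up += up_to - up_from
            covered_until = up_to
    return 1 - up / (end - start)


start = datetime(2017, 10, 1)
end = start + timedelta(days=10)
timeout = timedelta(days=1)

# A device that sends hourly heartbeats for five days, then falls silent.
heartbeats = [start + timedelta(hours=k) for k in range(121)]

fraction = downtime_fraction(heartbeats, start, end, timeout)
print(f"down {fraction:.0%} of the time")
```

In this example the device is treated as up until one day after its last heartbeat (six days in total), so it is down for the remaining four days of the ten-day window.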

Today we’re announcing a powerful new feature of DevicePilot which enables all of the above and more: Cohort Analysis.

Cohort Analysis

DevicePilot Cohort Analysis gives you the power to apply a traditional DevicePilot filter against every device individually, ‘slide’ that filter over time to take into account the effect of every event in that device’s history, and then group the answers from all devices according to another metric.
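
The grouping step can be sketched in a few lines of Python. This is only an illustration of the idea, not DevicePilot’s API: the `group_mean` helper, the field names and the per-device numbers are all made up, and each device’s uptime is assumed to have already been computed by sliding a filter over its history.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-device results: uptime as a fraction of the period.
per_device = [
    {"id": "a", "bearer": "wifi", "uptime": 0.5},
    {"id": "b", "bearer": "wifi", "uptime": 1.0},
    {"id": "c", "bearer": "cellular", "uptime": 0.25},
    {"id": "d", "bearer": "cellular", "uptime": 0.75},
]


def group_mean(rows, key, value):
    """Collate per-device answers into cohorts and average each cohort."""
    cohorts = defaultdict(list)
    for row in rows:
        cohorts[row[key]].append(row[value])
    return {name: mean(values) for name, values in cohorts.items()}


by_bearer = group_mean(per_device, key="bearer", value="uptime")
print(by_bearer)
```

Swapping the `key` for another property (software version, customer, hardware revision) asks a different question of the same per-device answers.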

So for example now you can easily ask vital business questions like:

  • What total percentage up-time am I delivering across all my devices? How is that changing?
  • What’s the shape of the distribution of up-time over the past month? Are there just a few devices letting the side down, or is there a ‘long tail’ of devices?
  • Which devices have the worst uptime?
  • Have devices connected via Telefonica seen better reliability than devices connected via Vodafone, over the past 6 months?
  • How good a predictor of communications problems is signal-strength? Empirically, what’s the minimum signal strength that I should allow during installation, to deliver acceptable performance?
  • Show me my performance in terms of value delivered. For example, if I’m a demand-response energy aggregator then instead of saying “134 devices offline” I want to know “140MW offline” so I can quantify the problem. Or maybe even “$14,312 offline”.
  • Is the new version of software more reliable than the old version of software (given that different devices were upgraded at different times)?
  • Which of my customers is experiencing the worst performance?

There’s some pretty powerful new technology under the hood to enable all this, as neither SQL nor time-series databases can handle the kinds of “stateful” queries required to detect failures such as timeouts. And DevicePilot is able to deliver quick answers despite having to scan potentially terabytes of time-series data.

All of this is bundled into an easy-to-use interface with only three controls:

  1. A filter to ‘slide’ through time (e.g. to define a timeout interval).
  2. A grouping filter, to collate the results.
  3. A scoping filter, to select the set of devices you’re interested in.

At first glance, the scoping filter looks just like the filter you’re used to from the top of the View menu. But scoping is applied to include/exclude every data point through time. So if you’re interested in measuring the up-time of devices running v1.3 software, then set the scoping filter to be “current_firmware == 1.3” and DevicePilot will measure reliability whilst each device was running v1.3 software, not just the reliability of devices currently running v1.3 software. A subtle but very important distinction.
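
The difference between the two readings is easiest to see with a toy example. The Python sketch below is illustrative only and uses made-up data: the `Sample` record and its fields are hypothetical, and samples are assumed to be evenly spaced so that the fraction of “up” samples stands in for the fraction of time up.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    # Hypothetical data point; field names are illustrative only.
    device: str
    t: int          # sample time, in evenly spaced steps
    firmware: str
    up: bool


samples = [
    # Device "a" was flaky on v1.2, then solid after upgrading to v1.3.
    Sample("a", 0, "1.2", False), Sample("a", 1, "1.2", False),
    Sample("a", 2, "1.3", True), Sample("a", 3, "1.3", True),
    # Device "b" ran v1.3 throughout, with one bad sample.
    Sample("b", 0, "1.3", True), Sample("b", 1, "1.3", False),
    Sample("b", 2, "1.3", True), Sample("b", 3, "1.3", True),
]


def uptime(rows):
    return sum(r.up for r in rows) / len(rows)


# Scoping through time: keep only data points recorded while on v1.3.
scoped = [s for s in samples if s.firmware == "1.3"]

# The naive reading: every data point from devices that run v1.3 today.
latest = {}
for s in samples:
    if s.device not in latest or s.t > latest[s.device].t:
        latest[s.device] = s
on_v13_now = {d for d, s in latest.items() if s.firmware == "1.3"}
naive = [s for s in samples if s.device in on_v13_now]

print(f"while running v1.3: {uptime(scoped):.1%}")
print(f"devices now on v1.3: {uptime(naive):.1%}")
```

The naive reading blames v1.3 for the downtime device “a” suffered while it was still on v1.2, which is exactly what scoping through time avoids.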

Of course all this power is available directly from the DevicePilot API too, so you can integrate it with your own systems and UI.

Managing KPIs to hit SLAs … and the Good Stuff

Now whenever you think of some performance metric (also known as a Key Performance Indicator) DevicePilot can measure that metric in seconds, giving you the confidence to then consider what an appropriate target for that metric might be. Then you can see if you can hit that target internally before you expose the metric to your customer. Sooner or later most customers of IoT services will want some kind of performance guarantee written into the contract (a Service Level Agreement) and it’s simply never too early to get a sense of what an appropriate metric might be, and what target value is achievable.

Finally, we recognise that here we’ve talked a lot about measuring how much stuff is not working, since – let’s face it – that’s what engineers and operations people spend most of their time worrying about. But do remember that one of the fantastic benefits of connecting your product is that it empowers you to measure all sorts of good stuff too – metrics that show that your customers are engaging with your product, and that it’s actually delivering the benefits you promise. Are the users interacting with the product? Which features are they using? Cohort Analysis can be a very powerful tool for Product Managers and CMOs too. Take a look at our whitepaper “Measuring Customer Happiness” for ideas.
