Tom
20th July 2018

Originally posted on our Medium blog.

There’s an expression: ‘one nine’ is a good day in IoT. Based on the telco industry’s ‘five nines’ (99.999%) reliability assurance, it’s a joke about how difficult it is to achieve any sort of impressive reliability metric when all your devices are on dodgy connections and being subjected by customers to all sorts of conditions your designers wouldn’t have believed possible.

Being able to spot when a device goes offline is an important step along the way to reliability. Much like trees in a forest, however, when a device goes down it might not always make a sound. So just how do you find out if one of your devices has broken, without waiting for your customer to notice and author a mocking tweet?

After answering this question across a number of forums, I thought it’d be helpful to put out a bit of guidance as to how you can use AWS IoT to identify when your devices connect and disconnect, and the strategies surrounding that.

Heartbeat, why do you skip?

Anyone with a bit of networking up their sleeve is likely to be putting up their hand already: use a heartbeat. At regular intervals, get your device to chirp up and send a message home to let you know everything’s ok. If you don’t receive a heartbeat, then you can start to assume things aren’t ok.

A heartbeat is a tricky thing. Networks are unreliable, so just because you’ve missed one heartbeat doesn’t mean your device is a smouldering heap on the floor. As a rule of thumb, people set the interval to a third of the minimum notification time, so that a couple of beats can go missing before you raise the alarm.
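To make the rule of thumb concrete, here is a minimal sketch of the server side of a heartbeat. The class name, the `missed_allowed` parameter and the device ids are all made up for illustration; the only thing taken from the text is the "interval is a third of the notification time" rule.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat per device and lists devices that have gone quiet."""

    def __init__(self, notification_time_s: float, missed_allowed: int = 2) -> None:
        # Rule of thumb: the heartbeat interval is a third of the notification time.
        self.interval_s = notification_time_s / 3
        # Assumption: tolerate `missed_allowed` lost beats before calling a device offline.
        self.timeout_s = self.interval_s * (missed_allowed + 1)
        self._last_seen: Dict[str, float] = {}

    def beat(self, device_id: str, now: Optional[float] = None) -> None:
        """Record a heartbeat from a device."""
        self._last_seen[device_id] = time.time() if now is None else now

    def offline(self, now: Optional[float] = None) -> List[str]:
        """Return the ids of devices whose last heartbeat is older than the timeout."""
        current = time.time() if now is None else now
        return sorted(
            device_id
            for device_id, seen in self._last_seen.items()
            if current - seen > self.timeout_s
        )
```

With a 90 second notification time this gives a 30 second interval, so a device can drop two beats in a row before it shows up in `offline()`.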

At the scale that IoT can bring, however, frequent heartbeats can become problematic. If you’re network or battery constrained, then it’s a financial and energy overhead. But even if you’re plugged in on WiFi, chances are you don’t want the operational overhead of processing billions of ‘I’m alive’ messages.

Use the broker, Luke.

Luckily, AWS IoT Core gives you a more reliable alternative built right in to their offering. Albeit one that comes with a few caveats…

Whenever anything connects to or disconnects from the IoT message broker, you will receive a message on the $aws/events/presence/connected/+ and $aws/events/presence/disconnected/+ topics (where + is a wildcard for the client id).

Messages look like:

{
  "clientId": "5f2b88024706498e81d93c5842d840b6",
  "timestamp": 1527438389228,
  "eventType": "connected",
  "sessionIdentifier": "331d7422-3322-4686-8b61-8ec74bb41fd7",
  "principalIdentifier": "21f1dfebcbf6fc60f44af783ccdeef808d2d1cbef"
}

And (note the eventType):

{
  "clientId": "5f2b88024706498e81d93c5842d840b6",
  "timestamp": 1527438433731,
  "eventType": "disconnected",
  "clientInitiatedDisconnect": true,
  "sessionIdentifier": "331d7422-3322-4686-8b61-8ec74bb41fd7",
  "principalIdentifier": "21f1dfebcbf6fc60f44af783ccdeef808d2d1cbef"
}

If you subscribe to the presence/(dis)connected topics, AWS will let you know when the broker can no longer reach your device. Under the hood, this just utilises MQTT’s built-in ping/pong. But it’ll save you the duplicated effort of building, monitoring and maintaining your own heartbeat infrastructure.

You should get a disconnected message immediately if you close your connection cleanly. However, if the mobile link is broken, or someone treads on your device, you’ll need to wait for the broker to notice you’re down. When this happens, the clientInitiatedDisconnect flag will be false.

Being able to tell the difference between a clean and dirty disconnect is useful for architectures where the devices don’t maintain a constant connection to the broker. By filtering out the messages where clientInitiatedDisconnect is true, you can ensure you’re only taking action when something unexpected has happened.
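As a rough sketch of that filter, the function below takes the topic and raw JSON payload of a presence message and only returns a client id when the disconnect was not initiated by the client. The function name is my own, and how the message reaches it (a subscribed MQTT client, a rule, a Lambda) is left open; the field names come from the example messages above.

```python
import json
from typing import Optional

DISCONNECTED_PREFIX = "$aws/events/presence/disconnected/"


def unexpected_disconnect(topic: str, payload: str) -> Optional[str]:
    """Return the client id for a dirty disconnect, or None for anything else."""
    if not topic.startswith(DISCONNECTED_PREFIX):
        return None  # connected events and unrelated topics are ignored
    event = json.loads(payload)
    if event.get("eventType") != "disconnected":
        return None
    # Clean, client-initiated disconnects are expected, so filter them out.
    if event.get("clientInitiatedDisconnect", False):
        return None
    return event.get("clientId")
```

Anything this returns is a device the broker gave up on, which is the point at which you might want to raise an alert.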

As an alternative to the presence/(dis)connected topics the IoT broker supports LWT (Last Will and Testament). This allows you to specify a message that should be sent to a specific topic if the client disconnects uncleanly, typically to notify another client. Note that AWS does not allow you to send these messages to a topic beginning with $ (e.g. their special topics).
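Here is a small sketch of how you might prepare a Last Will message before connecting. The `devices/lwt/<clientId>` topic layout, the payload shape and the function name are all hypothetical; the only rule taken from the text is that the topic must not begin with $. The result would then be handed to whatever will option your MQTT client library provides when it connects.

```python
import json
from typing import Any, Dict


def build_last_will(client_id: str, topic_prefix: str = "devices/lwt") -> Dict[str, Any]:
    """Build the topic and payload for a Last Will message (hypothetical layout)."""
    topic = f"{topic_prefix}/{client_id}"
    # AWS does not allow Last Will messages on topics beginning with $.
    if topic.startswith("$"):
        raise ValueError("Last Will topic must not begin with '$'")
    payload = json.dumps({"clientId": client_id, "status": "offline"})
    return {"topic": topic, "payload": payload, "qos": 1}
```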

In both cases you’ll be waiting for the connection to time out on the broker side. From my own experiments, you’ll normally receive the message within a few minutes. This is dependent both on the keep alive interval you negotiate when connecting, and AWS’ own infrastructure.
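For a feel of how the keep alive interval feeds into that wait: the MQTT specification has the broker drop a client once it has heard nothing for one and a half times the keep alive period. The helper below (the name is mine) just applies that factor; treat the result as a lower bound, since it says nothing about any extra delay added by AWS’ own infrastructure.

```python
def broker_timeout_s(keep_alive_s: float) -> float:
    """Seconds of silence after which an MQTT broker treats the client as gone.

    Uses the 1.5 x keep alive grace period from the MQTT specification;
    delivery of the resulting disconnect event may take longer still.
    """
    return 1.5 * keep_alive_s
```

So a 60 second keep alive means at least 90 seconds before a dirty disconnect can be reported, and longer keep alive values push that out proportionally.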

It’s not your fault.

Knowing that a device is connected is not the only reason you might want a heartbeat. You might also use it as a health check. For example, you could require all the systems on your device to participate in the heartbeat update, to differentiate between a device that has merely lost its connection and one that is connected but not functioning.

This is a bit of an edge case, as most preferred patterns involve sending fault codes as part of your telemetry. However, there’s a technique you can use in AWS IoT to cover this case. If your device uses a shadow, then you should be able to see if there is a drift between the reported and requested (desired) states. If this grows significantly while the device is apparently connected, then you should start to become suspicious…
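One way to put a number on that drift is sketched below. It assumes you already have the shadow document as a dict with the usual `state.desired`, `state.reported` and `metadata.desired` sections, and that metadata timestamps are epoch seconds. The function name is made up, and it only compares top-level keys, so nested state would need more work.

```python
from typing import Any, Dict


def shadow_drift(shadow: Dict[str, Any], now_s: float) -> Dict[str, float]:
    """Map each out-of-sync top-level key to the seconds since it was requested."""
    state = shadow.get("state", {})
    desired = state.get("desired") or {}
    reported = state.get("reported") or {}
    desired_meta = shadow.get("metadata", {}).get("desired") or {}

    drift: Dict[str, float] = {}
    for key, wanted in desired.items():
        if reported.get(key) != wanted:
            # Fall back to "just now" if the shadow carries no timestamp for this key.
            requested_at = desired_meta.get(key, {}).get("timestamp", now_s)
            drift[key] = now_s - requested_at
    return drift
```

If a device is showing as connected but a key has been sitting in this result for many minutes, that is your cue to take a closer look.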

Isn’t it about time you skipped that heartbeat?