Pilgrim
19th July 2016

For many connected devices, the user experience is determined as much by the characteristics of the service behind the device as by the device itself. Here we explore how the “shape” of that service affects its performance and cost, and thus your customers’ experience and value for money.

Messages

Connected devices are usually configured to exchange periodic messages with the cloud service that supports them, for purposes which include:

  1. Sending back application data, e.g. a temperature sensor sends back periodic temperature readings. The values are usually stored in the cloud, may have analytics performed on them, and are ultimately delivered to the end-user, usually through some value-adding application.
  2. Receiving instructions, for example to turn off a local switch, or upgrade to a new software version. Sometimes it is necessary for the edge device to “long poll”, i.e. repeatedly ask the server “is there anything for me?” and perhaps wait a long time for an answer. This is needed to penetrate so-called “stateful” firewalls (which allow only connections initiated by the device, prohibiting spontaneous messages in the return direction).
  3. Keep-alive heartbeats. These demonstrate that the device is still functioning and that the communications link is “up”. They may also contain status information such as battery level, heap size etc.

Bandwidth

“Big data begins with little data” as CTO of ARM plc Mike Muller likes to say. The payload of each message is typically tiny – often just a few bytes – but the messages add up, because in most cases they are sent continuously. A device sending one message every 30 seconds (24/7) will send about one million messages over a year. And of course a million such devices will send about a trillion messages over a year. Big numbers.
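The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the function name and the 365-day year are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def messages_per_year(interval_seconds: float, devices: int = 1) -> int:
    """Messages sent in one year by `devices` devices, each sending
    one message every `interval_seconds` seconds, around the clock."""
    per_device = int(SECONDS_PER_YEAR / interval_seconds)
    return per_device * devices


one_device = messages_per_year(30)            # a little over one million
fleet = messages_per_year(30, 1_000_000)      # a little over one trillion
print(f"{one_device:,} messages per device, {fleet:,} for the fleet")
```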

Once embedded in verbose communications and security protocols such as HTTPS, each message may expand to kilobytes in length. Licensed cellular communication is paid for by the byte, and to fit within monthly budgets of typically a few dollars or less, total traffic must be limited to a few gigabytes. Unlicensed spectrum is, ironically, typically even more precious, as the space-bandwidth available is extremely limited.

Software uploads are the obvious big “budget breaker”, though patching schemes (partial uploads) can keep this under control. Intelligent throttling of telemetry can send data only when it has changed. And of course data can be tightly bit-packed if necessary.
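As a rough illustration of the last two ideas, the sketch below sends a reading only when it has moved by more than a chosen threshold, and packs it into a fixed four-byte payload. The class and function names, the 0.5 degree threshold and the payload layout are all assumptions for the example. The packing shown is byte-level, a coarse form of bit-packing.

```python
import struct
from typing import Optional, Tuple


class ChangeThrottle:
    """Lets a reading through only when it differs from the last value
    that was sent by at least `deadband`."""

    def __init__(self, deadband: float) -> None:
        self.deadband = deadband
        self.last_sent: Optional[float] = None

    def should_send(self, value: float) -> bool:
        if self.last_sent is None or abs(value - self.last_sent) >= self.deadband:
            self.last_sent = value
            return True
        return False


def pack_reading(device_id: int, temperature_c: float) -> bytes:
    """Pack into 4 bytes: unsigned 16-bit device id, then signed 16-bit
    temperature in hundredths of a degree (hypothetical layout)."""
    return struct.pack("<Hh", device_id, round(temperature_c * 100))


def unpack_reading(payload: bytes) -> Tuple[int, float]:
    device_id, hundredths = struct.unpack("<Hh", payload)
    return device_id, hundredths / 100


throttle = ChangeThrottle(deadband=0.5)
readings = [20.0, 20.1, 20.2, 20.6, 20.7, 21.2]
payloads = [pack_reading(7, r) for r in readings if throttle.should_send(r)]
print(len(payloads), "of", len(readings), "readings sent")
```

A real device would usually also send an occasional reading even when nothing has changed, so that the service can tell a steady value from a dead device.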

Storage

When the data arrives in the cloud, it will probably at least be persisted (the most-recent value kept). And quite possibly all the readings received over time will also be stored, creating a so-called “time-series”. There will be debates about whether to “de-res” or even discard old data after a certain amount of time.

At my previous company AlertMe we built a connected-home platform, and one of its functions was to store video from in-home cameras. Everyone knows that video takes huge amounts of storage, though we only stored it when something interesting happened in the home. We also stored all the tiny messages from myriad small sensors around the home – temperature, occupancy and so forth, typically once every couple of minutes. The interesting thing was that the absolute amount of storage required for the sensors was over time quite similar to that required for the video. All those little messages do add up.

Storage roughly halves in cost every 14 months. So your costs will rise if your number of live devices increases by more than about x1.7 every year. And if you have an even-faster ramp, such as x5 (which is what we sustained at AlertMe over about 7 years) then the cost of historical data is always fairly negligible. Getting at that data efficiently is quite a different matter, which we’ll cover below. You can see a couple of scenarios explored in the charts below.

[Charts: storage cost scenarios (images not reproduced)]
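The break-even point can be estimated with a small calculation. This sketch assumes a deliberately simple model, which is not taken from the original charts: storage spend is proportional to the volume stored multiplied by the unit price, and the volume stored grows in step with the number of devices. Under that model a 14-month halving period gives a break-even growth of roughly x1.8 per year, a little above the approximate figure quoted above.

```python
def annual_price_factor(halving_months: float) -> float:
    """Fraction of today's unit storage price that applies one year from now."""
    return 0.5 ** (12.0 / halving_months)


def break_even_growth(halving_months: float) -> float:
    """Yearly growth in stored volume at which total storage spend stays flat."""
    return 1.0 / annual_price_factor(halving_months)


def relative_spend(growth_per_year: float, halving_months: float, years: int) -> float:
    """Storage spend after `years`, relative to today (1.0 means unchanged)."""
    return (growth_per_year * annual_price_factor(halving_months)) ** years


print(f"break-even growth per year: x{break_even_growth(14):.2f}")
print(f"spend after 3 years at x5 growth: x{relative_spend(5, 14, 3):.1f}")
```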

Write load

As this IoT data appears in your service, it must be dealt with in a timely fashion. The high sustained write load of connected-device applications makes the shape of an IoT service quite different from classic web server applications, where underlying data rarely changes.

Total service write load is the product of how often each device writes and the number of devices.

All parts of a connected device service must tolerate failure in all other parts, and this especially applies to communications links. So it is typical for devices to buffer data if (when) they temporarily cannot contact the server.
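Device-side buffering might look something like the sketch below. The `BufferedSender` class, its `send` callback (which returns `True` on success) and the policy of dropping the oldest message when the buffer is full are all assumptions for illustration. A real device would choose its own policy and would probably persist the buffer across reboots.

```python
from collections import deque
from typing import Callable, Deque


class BufferedSender:
    """Queues messages while the link is down and drains them in order
    once sending succeeds again."""

    def __init__(self, send: Callable[[bytes], bool], max_buffered: int = 1000) -> None:
        self._send = send
        # A bounded deque silently discards the oldest entry when full.
        self._pending: Deque[bytes] = deque(maxlen=max_buffered)

    def submit(self, message: bytes) -> None:
        self._pending.append(message)
        self.flush()

    def flush(self) -> None:
        while self._pending:
            if not self._send(self._pending[0]):
                return  # link still down: keep the message for next time
            self._pending.popleft()

    @property
    def pending(self) -> int:
        return len(self._pending)


# Simulated link that can be switched on and off.
link = {"up": False}
delivered = []


def fake_send(message: bytes) -> bool:
    if link["up"]:
        delivered.append(message)
    return link["up"]


sender = BufferedSender(fake_send, max_buffered=3)
for i in range(5):
    sender.submit(bytes([i]))   # link is down, so these accumulate
link["up"] = True
sender.flush()                  # the three newest messages are delivered
```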

Some “gotcha” areas worth paying attention to include:

  1. Connection session setup: Setting-up a reliable Internet connection (e.g. TCP) requires a significant amount of message exchange. Establishing a secure connection (e.g. SSL/TLS) adds even greater overhead, in calculation as well as bandwidth. Therefore it is often a good idea to preserve existing connections rather than treat each new message as a new connection.
  2. A classic failure pattern for services with high sustained write load is that, whilst they cope fine with average load, they fail to cope with worst-case load. When the service falls over and has to be restarted from cold (which it will), it immediately experiences the worst case, as all the devices try to connect simultaneously and dump their buffered data, and the service may then fail to start. Because loads grow gradually, this danger area is entered without any warning. It can be important to ensure that devices back off gracefully.
  3. Data can suddenly change in historical records as devices which have been offline for a long time suddenly come online again and provide their buffered data. This can challenge business processes (e.g. billing) which assume that the past is immutable.
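One common way for devices to back off gracefully is exponential backoff with random jitter, sketched below. The function name, the one-second base and the five-minute cap are assumptions for illustration, not values from any particular platform.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The upper bound doubles with each attempt until it reaches `cap_s`.
    The delay is drawn uniformly between zero and that bound, so that a
    fleet of devices does not reconnect in lock-step after a restart.
    """
    source = rng if rng is not None else random
    # Clamp the exponent so very large attempt counts cannot overflow.
    upper = min(cap_s, base_s * (2 ** min(attempt, 32)))
    return source.uniform(0, upper)


rng = random.Random(0)  # seeded only to make the demonstration repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(12)]
print([round(d, 1) for d in delays])
```

The randomness matters as much as the doubling: without it, devices that lost contact at the same moment would all retry at the same moments too.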

Well-designed services typically “scale out” as the write load grows, i.e. adding more servers allows more load to be handled.

Read load

Whilst the write load is typically high, constant, predictable and to some extent controllable (if you’re in charge of the device code too), read load is anything but. It reflects the more classic web-server pattern, e.g. at midnight there may be almost no load, but at other times the load may be very much higher – completely at the whim of your users.

You can’t control the user load directly (except by throttling, which may make your users unhappy). But you can constrain the questions that your users can ask – in essence, your application is a constrained view into the data. Therefore the principal technique to address read load is to ensure that the answers to the questions you know your users will ask are already waiting. Google doesn’t search the entire web each time you type in a query, because it constrains in advance what kinds of query a user can make, allowing it to build data structures ahead of time which can answer them efficiently.

This can be done by pre-calculating the answers as data is written, or in some offline batch process (e.g. once a day), or even on demand, with the results then cached. The pre-calculation may be done within the service’s database (using e.g. indexes, aggregations and materialised views) or by the application itself.

Transactional reliability is another consideration – is it absolutely essential that all device messages are reliably stored? Even through service outages? Whilst this is entirely possible (as in the banking world) it places a much higher burden on your service, as data must be written to persistent storage before being acknowledged to the device. Often it is acceptable for status messages to be dropped on rare occasions, as another one will come along shortly.

Latency is another consideration which can affect architectural design – how long is it before an update written from a device is included in a read from the service? An answer of “seconds” requires a very different architecture from “hours”.

Rollups

I mentioned above the idea of pre-calculating the answers to the questions allowed by your application, and a classic IoT example is aggregation-over-time. For example, say a sensor measures energy consumption every second and sends it for storage in the service, and then the application allows questions like “how much energy did I use last year?”. Naively, this requires fetching 31 million values from a database and summing them, which is unlikely to happen quickly enough for an impatient user (web services aim to have response times of hundreds of milliseconds, as Google achieves). However, whilst writing the data we can aggregate it into buckets reflecting the time-periods that the user will want. So for example, when each 60th per-second reading is received, we roll up the 60 readings from the past minute (which is quite a trivial query) and write a new entry to a “minute” series. And so on with hours, days, weeks etc. So then when we want an answer to our “year” question, we’re just pulling a single value out of the database – super-quick.
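The idea can be sketched in a few lines. Note that this is a variant of the scheme described above: rather than rolling up when the 60th reading arrives, it adds each reading to a running total for every bucket as it is written, which gives the same totals. The `RollupStore` class, its in-memory dictionaries and the choice of bucket sizes are assumptions for illustration. A real service would keep these totals in its database.

```python
from collections import defaultdict
from typing import Dict


class RollupStore:
    """Keeps running per-minute, per-hour and per-day totals as readings arrive."""

    BUCKET_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}

    def __init__(self) -> None:
        # One series per resolution, keyed by the bucket's start time.
        self.totals: Dict[str, Dict[int, float]] = {
            name: defaultdict(float) for name in self.BUCKET_SECONDS
        }

    def write(self, timestamp_s: int, value: float) -> None:
        """Add one reading to the bucket it falls into at each resolution."""
        for name, width in self.BUCKET_SECONDS.items():
            bucket_start = timestamp_s - (timestamp_s % width)
            self.totals[name][bucket_start] += value

    def read(self, resolution: str, bucket_start_s: int) -> float:
        """Answer 'how much in this period?' with a single lookup."""
        return self.totals[resolution].get(bucket_start_s, 0.0)


store = RollupStore()
for second in range(2 * 3600):      # two hours of once-per-second readings
    store.write(second, 1.0)        # each reading is 1.0 unit of energy

print(store.read("hour", 3600))     # total for the second hour
```

The cost is a few extra writes per reading, paid at a time when load is predictable, in exchange for reads that no longer depend on how much raw data lies behind them.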

Cross-series aggregation

The above is an example of “in-series aggregation” – the optimisation is happening entirely within each data series. Things get more complex if we want to perform aggregations across series. To continue the above example, imagine that we now want to ask “how much energy did all users of type X use?”. The most challenging aspect of this is that there may be arbitrary ways to select groups of users, so there is no way to pre-calculate the result. This takes us firmly into the realm of Big Data Analytics, and typically outside the realm of real-time response.

Security

Security is almost always an essential consideration, and will cause big problems later if it is not given enough thought early-on. Security of communications is one thing, but you might also like to give thought to questions such as:

  1. Are you handling “personal data”, which comes with more stringent legal requirements?
  2. How do you control/audit access to customer data from your personnel, to prevent one bad apple from causing huge damage?
  3. Do you need to encrypt your data at rest, in case your service is breached (e.g. personal video data would be a good candidate)?

Availability

As mentioned above, no service is perfect, and perfection like all good qualities comes at a cost. So it is important to understand how good your service needs to be, so you can design something that is just good-enough, and is therefore good value.

It is common in the telco world to talk about “nines”, so for example “five nines availability” means that your service is working 99.999% of the time. This represents very high availability which any service is unlikely to achieve without very significant engineering effort, as unreliability is a function of service software (bugs causing outages), operations management (e.g. disk-space running out) and the performance of the underlying components (e.g. servers, whether virtual or not). From a user’s perspective it of course also includes the performance of the connected device and its connectivity. As the end-to-end chain of functions supporting most IoT applications often includes some fairly unreliable components (e.g. wireless or broadband connections), an availability figure of even “two nines” might prove quite challenging to achieve in practice.

Service availability metrics also need to quote a timescale, because for example a single 5-minute outage might violate a service-level-agreement defined as “per-day”, but not one defined as “per-month”.
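To make the effect of the timescale concrete, the sketch below converts a number of nines into a downtime allowance for a given measurement period. The function name, the 30-day month and the choice of “three nines” for the worked case are assumptions for illustration.

```python
DAY_S = 24 * 60 * 60
MONTH_S = 30 * DAY_S   # assuming a 30-day month
YEAR_S = 365 * DAY_S


def downtime_budget_seconds(nines: int, period_seconds: float) -> float:
    """Downtime allowed within one period at the given number of nines,
    e.g. nines=3 means 99.9% availability."""
    unavailability = 10.0 ** (-nines)
    return period_seconds * unavailability


outage_s = 5 * 60  # a single 5-minute outage

for label, period in (("per-day", DAY_S), ("per-month", MONTH_S)):
    budget = downtime_budget_seconds(3, period)
    verdict = "violated" if outage_s > budget else "met"
    print(f"three nines {label}: budget {budget:.1f}s, SLA {verdict}")

print(f"five nines per-year: {downtime_budget_seconds(5, YEAR_S) / 60:.2f} minutes")
```

At three nines, the same 5-minute outage breaks a per-day agreement (which allows under a minute and a half) but fits comfortably within a per-month one.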

Testing

Like any other kind of technology, until you’ve tested it, your service will not work. This is true at the level of individual unit-tests for code, but it’s also true for your service as a whole. You will need two types of service-level tests:

  1. Continuous performance metrics: is your live service actually delivering the service that you claim? Will you be the first to know if it doesn’t?
  2. Stress-testing at, say, an order of magnitude (10x) above the load you’re currently handling, as your developers prepare for that future with 10x more devices. To achieve this you may well need to build virtual devices in software which can be instantiated for a short while on a cloud service, to drive load against a test instance of your service and prove that it does work at the intended scale.

Conclusion

In this paper I’ve considered some of the major factors which will determine the “shape” of your connected device service, and therefore how much it costs to run, how well it delivers your proposition to your users – and therefore how happy they are!

Of course technology never stays still. One area of rapid change at the moment is the rise of powerful SaaS services (especially databases-as-a-service) able to underpin your service to provide potentially easier scaling and even dynamic elasticity. These often come with their own new dynamics too though, so it’s not a free lunch.

Another interesting, longer-term change is the rise of so-called “edge computing” or “fog computing”, recognising that often a cloud-centric view of the world creates a performance pinch-point and single point of failure which can be actively unhelpful. Peer-to-peer solutions can avoid bottlenecks – though some degree of centralisation is still needed for management. We might expect frameworks to arise to facilitate moving pieces of application logic seamlessly between the edge and the centre, allowing you to change the shape of your service even after you’ve designed and deployed it.

About DevicePilot

DevicePilot is the software of choice for locating, monitoring and managing connected devices at scale. DevicePilot is completely agnostic, allowing the user to connect any device across any platform, with simple and easy integration. The company draws on the significant experience of its founders who successfully scaled their previous connected-device businesses to 1 million+ end-customers in areas as diverse as mobile phones, IPTV set-top-boxes and the connected home. Contact us for further information