What's wrong with existing database solutions?
When we started building DevicePilot, our first thought was to use off-the-shelf database technologies, because everyone knows you shouldn't try to create your own database, right? We started with an open-source time-series database called Kairos (built on Cassandra) on EC2 (unmaintainable, hard to scale, high ops cost), then moved to MongoDB (doesn't like sustained writes!), then to InfluxDB (key issues as below), meanwhile experimenting with the SaaS database offerings from AWS, Azure and Google Cloud, such as BigTable and Aurora. Finally we decided: damn, there's nothing for it, we're going to have to build our own solution, because none of them provides the key features that our use case requires.
Of course, we do use off-the-shelf databases for our housekeeping (e.g. DynamoDB) but the heart of DevicePilot is the ingestion, storage and fast querying of very large, continuous streams of structured, time-stamped data, so that's our primary requirement.
SQL
In principle, any database question can be answered with an SQL database, if you just add sufficient layers of indirection to it. If we were a huge company then perhaps we'd have bought some Oracle database and eventually been able to shape it to our needs.
But we were pretty sure that this would wire in a requirement for full-time data analysts and devops people to keep it all running, making our business expensive and hard to scale. The only advantage a startup has is its agility, which isn't something to throw away lightly.
TSDBs
This is why TSDBs (Time-Series Databases) were invented: they understand time as a fundamental dimension, so are organised internally to make time-based operations such as selecting a day's worth of data much more efficient than SQL.
Classic TSDBs are great at doing "maths on columns" (e.g. average all the temperatures over the past month) but - ironically - are generally very bad at answering time-based questions.
It's all about the questions you need to answer
1. A fundamental question for managing any kind of IoT estate is:
"what is my uptime?"
Generally IoT devices send a heartbeat, so we might define "up" as "I've heard from the device in the last hour" and then ask "what is the uptime of my device estate over the last week?", wanting an answer such as 95%. Classic TSDBs have no way to express this kind of question. DevicePilot does, and indeed we make it easy, because it's such a common use case for IoT.
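To make the uptime question concrete, here is a minimal Python sketch of the definition above: a device counts as "up" for one hour after each heartbeat, and uptime is the fraction of the query window covered by those hours. This is an illustration only, not how DevicePilot implements it; the function names, the in-memory lists of heartbeats and the choice to average equally across devices are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def uptime_fraction(heartbeats, window_start, window_end,
                    timeout=timedelta(hours=1)):
    """Fraction of [window_start, window_end) during which one device was up.

    Assumption: the device is considered up for `timeout` after each heartbeat.
    """
    covered = timedelta(0)
    covered_until = window_start  # end of the up-time already counted
    for hb in sorted(heartbeats):
        # Clip this heartbeat's "up" interval to the window, and skip any
        # part that overlaps an interval we have already counted.
        up_from = max(hb, covered_until)
        up_to = min(hb + timeout, window_end)
        if up_to > up_from:
            covered += up_to - up_from
            covered_until = up_to
    return covered / (window_end - window_start)


def estate_uptime(heartbeats_by_device, window_start, window_end,
                  timeout=timedelta(hours=1)):
    """Mean uptime across all devices (hypothetical: every device weighs equally)."""
    fractions = [
        uptime_fraction(hbs, window_start, window_end, timeout)
        for hbs in heartbeats_by_device.values()
    ]
    if not fractions:
        raise ValueError("no devices in estate")
    return sum(fractions) / len(fractions)
```

Note that the answer depends on the gaps between heartbeats rather than on any value carried in them, which is why it is awkward to phrase as "maths on columns".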
Classic TSDBs also require you to define up-front what properties will be treated as telemetry (fast-changing and continuous) and what properties as metadata (slow-changing and discrete). Why does the difference matter? Because you can only do e.g. GROUP BY operations on metadata, not on telemetry. This is to avoid "cardinality explosion", i.e. trying to keep track of too many groups. Why is this a problem for IoT customers?
2. Following on from the above, another example of a fundamental IoT question is:
"is the signal strength of my wireless devices affecting my uptime?"
Note that this question is correlating two properties/metrics each of which varies continuously over time (signal strength is a telemetry property streaming from each device, and uptime is a streaming metric derived from streaming telemetry as measured above). So by the above definition both are telemetry, so a classic TSDB cannot answer the question.
It is our experience that in IoT, metadata and telemetry are a continuum, not a binary choice, and it's not even obvious to our customers up-front where any particular property might fall on that spectrum. For example, for a device which never moves, lat/lon is clearly metadata, but for a device which moves continuously, lat/lon is clearly telemetry, because you want to be able to e.g. measure how far it has moved by doing maths operations on the individual locations. So what about devices which might move just occasionally?
"Software version" might be considered metadata, but of course it does change over time whenever a device is upgraded. That's very significant in answering vital questions such as "is v3 software more reliable than v2 software?". Each device will be upgrading at different times, so you can't just select a chunk of time for each software version, you have to do your GROUP BY in a streaming sense, not an SQL sense.
3. Most classic TSDBs do not allow insertion of old data, because for scalability it is desirable that once data is written it remains unchanged ever-after ("immutable"). However a common pattern in IoT is that whilst generally data arrives in real-time (i.e. in time-stamp order), it doesn't always do so. If an IoT device or gateway goes offline, then generally it will buffer data until it comes back online, so that data will be back-dated. DevicePilot has gone to a lot of effort to ensure that we can deal with this case.
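As a rough illustration of the back-dated case, the Python sketch below keeps one device's readings in time-stamp order, using a cheap append for data that arrives in order and a binary-search insert for data that arrives late. It is a toy in-memory model, not a description of DevicePilot's storage; the `DeviceSeries` class and its methods are hypothetical names.

```python
import bisect


class DeviceSeries:
    """Time-ordered readings for one device, tolerant of late arrivals."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def insert(self, timestamp, value):
        # Fast path: data arriving in time-stamp order is a plain append.
        if not self._timestamps or timestamp >= self._timestamps[-1]:
            self._timestamps.append(timestamp)
            self._values.append(value)
            return
        # Slow path: back-dated data (e.g. buffered while a gateway was
        # offline) is placed at its correct position in the history.
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def range(self, start, end):
        """Readings with start <= timestamp < end, in time order."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))
```

The cost of the slow path is the price of accepting late data: any result already computed over the affected time range has to be treated as stale once a back-dated reading lands in it.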
Our own costs, scalability and reliability matter to you
Having decided we had to build our own engine, the question then was how to do so whilst still standing on the shoulders of giants.
We chose to leverage modern serverless approaches to the max, principally with a combination of AWS Lambda and S3, to deliver an extremely scalable and reliable service which has very low ops costs (no disks to fill up, no servers to crash).