This article is from the original draft of my speech at ServerlessDays London. I’ll embed the video when it becomes available.
The Ant Attack View of our Serverless Database
It’s Wednesday, and for the third time this week I’m sat on the sofa, at two in the morning, hacking together another desperate attempt to keep our system alive.
A few days later we will lose our first big customer.
Not to spoil anything, but we win them back, and a good few more.
DevicePilot is an application that helps companies with connected things get from being unsure of how many devices they have to pinpointing the root cause of today’s operational incident, and knowing exactly what they need to focus on to reach those fabled five-nines of reliability.
And that’s a hard data problem.
Like good cloud natives, we’d bought, not built. But the evolving demands from our customers were constantly invalidating our assumptions and constraints. Suddenly we were applying all sorts of annotations, transformations and reprocessing jobs just to keep up.
Our database wasn’t coping, and neither were we.
So we ditched it, dumped all the telemetry in a file and the three developers (we’re a small company) set out to answer all our queries from scratch.
To prove the concept, we threw together a lambda to read a JSON array from S3 and solve one of the questions that had stumped the old system: Device up-time, as a percentage, grouped by firmware revision.
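A minimal sketch of what such a lambda could look like follows. The point schema (a `firmware` string and an `up` boolean per point), the event fields `bucket` and `key`, and the definition of up-time as the share of points reporting `up` are all assumptions made for illustration, not the actual DevicePilot format.

```python
import json
from collections import defaultdict


def uptime_by_firmware(points):
    """Return {firmware: up-time percentage} for a list of telemetry points.

    Assumes each point looks like {"firmware": "1.0", "up": True};
    this schema is hypothetical.
    """
    up = defaultdict(int)
    total = defaultdict(int)
    for point in points:
        firmware = point["firmware"]
        total[firmware] += 1
        if point["up"]:
            up[firmware] += 1
    return {fw: 100.0 * up[fw] / total[fw] for fw in total}


def handler(event, context):
    """Lambda entry point: read one JSON array from S3 and aggregate it.

    The event shape ({"bucket": ..., "key": ...}) is an assumption.
    """
    # Imported lazily so the pure function above can be run without AWS.
    import boto3

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=event["bucket"], Key=event["key"])["Body"].read()
    return uptime_by_firmware(json.loads(body))
```

Keeping the aggregation in a plain function, separate from the S3 read, means it can be exercised locally with a hand-written list of points.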
Three hours later we had a fully functional up-time-by-firmware serverless database, running at a not completely awful 5,000 points per second. However, in IoT people tend to have a lot of devices, generating even more data. We had to up our game…
Lambda is wonderfully elastic: from a standing start you can enlist 3000 machines to do a few seconds’ work, and immediately throw them away.
Which meant that, if we split our data between a bunch of lambdas, our prototype could reach a respectable 15 million points per second (3000 lambdas at 5,000 points per second each) without a single piece of optimisation. (For those of you who are offended by the inefficiencies, don’t worry: we would later do some optimisation.)
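The split-and-merge idea can be sketched as below. A thread pool stands in for parallel lambda invocations, and the point schema (`firmware`, `up`) is hypothetical. The one design point worth noting is that each worker returns raw counts rather than percentages, because percentages from different chunks cannot be merged correctly.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_counts(points):
    """What one worker returns: (up, total) point counts per firmware."""
    up, total = Counter(), Counter()
    for point in points:
        total[point["firmware"]] += 1
        if point["up"]:
            up[point["firmware"]] += 1
    return up, total


def merge(partials):
    """Combine the per-worker counts into up-time percentages."""
    up, total = Counter(), Counter()
    for worker_up, worker_total in partials:
        up.update(worker_up)        # Counter.update adds counts together
        total.update(worker_total)
    return {fw: 100.0 * up[fw] / total[fw] for fw in total}


def fan_out(chunks, invoke=partial_counts, max_workers=8):
    """Process every chunk in parallel and merge the results.

    `invoke` is a stand-in for a real lambda invocation.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return merge(pool.map(invoke, chunks))
```

In a real deployment `invoke` would call the worker lambda with a reference to a file rather than the points themselves; that wiring is left out here.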
S3 makes an amazing datastore; however, anyone who has been on the receiving end of ListObjectsV2 will know that it isn’t the best for discoverability. So the next day we set up a time-based file index in DynamoDB. This meant that, given a time range, we could now find which files we had to process.
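As an illustration of how a time-based file index can answer "which files cover this range?", here is an in-memory sketch. The hourly bucket size, the `FileEntry` fields and the `TimeIndex` class are all hypothetical; in a real table the bucket would plausibly be the partition key, with one query per bucket in the requested range.

```python
from collections import defaultdict
from dataclasses import dataclass

BUCKET_SECONDS = 3600  # assumed bucket size: one hour


@dataclass(frozen=True)
class FileEntry:
    key: str    # S3 object key of the point file
    start: int  # epoch seconds of the first point in the file
    end: int    # epoch seconds of the last point in the file


class TimeIndex:
    """In-memory stand-in for a table keyed by time bucket."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, entry):
        # Register the file under every bucket its time span touches.
        first = entry.start // BUCKET_SECONDS
        last = entry.end // BUCKET_SECONDS
        for bucket in range(first, last + 1):
            self._buckets[bucket].append(entry)

    def files_for_range(self, start, end):
        """Return keys of files overlapping [start, end], without duplicates."""
        found = {}
        for bucket in range(start // BUCKET_SECONDS, end // BUCKET_SECONDS + 1):
            for entry in self._buckets.get(bucket, []):
                if entry.start <= end and entry.end >= start:
                    found[entry.key] = None  # dict keeps insertion order
        return list(found)
```

The lookup touches only the buckets inside the requested range, so it never needs to list the whole bucket of files.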
A couple of weeks in a dark room with Kinesis Firehose, S3 lambda triggers and a lot of swearing at documentation will get you an ingestion-to-storage layer that writes point files of X length with metadata generated, all nicely catalogued in DynamoDB.
That’s all you need for your own serverless database.
And while you might not be working with time-series, this’ll work just as well with a transaction log or an events stream.
Of course, the devil is always in the detail. So here are a few lessons learnt for anyone thinking of doing the same:
- Read the fine print: it doesn’t matter how high your concurrency limit is, lambda is locked to a burst of 3000 over a minute.
- That figure is mostly guidance (we regularly find AWS falls short), so write your code assuming invocation failure and be pleasantly surprised when it works.
- Run your database in its own account, especially if you’ve got a serverless architecture, or face bits of your infrastructure randomly failing as your lambda pool starves.
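One way to "assume invocation failure" is to wrap every invocation in a retry with exponential backoff and jitter. The sketch below is a generic pattern rather than anything specific to our system; `invoke` is a hypothetical callable standing in for the real lambda call, and the attempt count and delays are arbitrary.

```python
import random
import time


def invoke_with_retry(invoke, payload, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call invoke(payload), retrying on any exception.

    Waits base_delay * 2**attempt seconds (with +/-50% jitter) between
    tries and re-raises the last error once `attempts` is exhausted.
    `sleep` is injectable so the function can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            return invoke(payload)
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)
```

The jitter matters when thousands of workers are throttled at once: without it, they all retry at the same instant and hit the same limit again.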
But maybe the more interesting question is why should you do it?
If you’re doing something that’s off the beaten track when it comes to data, buying into a service or architecture means up-front trade-offs before you’ve had the chance to really discover what those compromises mean.
Being able to walk into meetings, engage customers and work with them to find out exactly what they want to know, safe in the knowledge that, whatever they ask for, we can easily extend our architecture to deliver it, has been invaluable.
Through this we’ve been rapidly building a composite collection of features, enjoying an extended period of discovery.
And that’s what’s new: lambda gives you the ability to trade efficiency for agility, in a surprisingly cost-effective way.
We’re under no illusions: one day we’ll outgrow our home-spun solution, either because we’ll need to start processing billions of points per second, or because the cost will no longer balance the opportunity gain.
But at that point we’ll be coming into the problem with a phenomenal amount of learning and insights. Enough to ensure that, whatever solution we choose, my 2am sofa groove remains abandoned.