Designing for Scale

By Pilgrim - December 04, 2019

Introduction

While the number of humans on the planet remains relatively constant, the number of connected devices is growing exponentially, at a compound annual growth rate (CAGR) of 25%. In 2008 the number of connected devices surpassed the number of humans, and it has continued to grow roughly ten-fold with every passing decade. A decade ago you had one connected device in your life: your PC. Today you have perhaps ten, including your smartphone, tablet, TV, sound system, home security system, connected car etc. And in ten more years you’ll probably have a hundred, and a decade after that a thousand, as the costs of connecting devices continue to fall and the benefits continue to increase.
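The growth figures above are linked by simple compounding. As a quick sanity check (a sketch, not a forecast), 25% a year compounds to roughly nine-fold over a decade, and an exact ten-fold per decade corresponds to about 25.9% a year:

```python
def growth_factor(cagr: float, years: float) -> float:
    """Total growth multiple after compounding `cagr` (e.g. 0.25) for `years`."""
    return (1 + cagr) ** years


def implied_cagr(multiple: float, years: float) -> float:
    """Annual growth rate needed to reach `multiple` in `years`."""
    return multiple ** (1 / years) - 1


print(f"25% CAGR for 10 years: x{growth_factor(0.25, 10):.1f}")    # x9.3
print(f"10x per decade needs: {implied_cagr(10, 10):.1%} a year")  # 25.9%
```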

At these numbers, it is just not possible for each of us to have the same kind of “hands-on” relationship with these new IoT devices that we had in the days of the Personal Computer, when manual rebooting, upgrading and fault-finding were the norm. We cannot be the IT support for IoT because there are just too many devices, so the devices will largely have to look after themselves.

The User Experience

User Experience (or “UX”) may seem a strange place to start a discussion about designing for scale: surely the problems are technical? But in truth, for most IoT applications there is no “rocket science”, no single deep technical problem to solve. Instead, in practice the problems often arise from a mismatch between the finite knowledge and attention of the users (the end-customers, say, or the operational managers) and the effectively infinite number of machines, with their potential to place demands on those humans.

This mismatch is the fundamental obstacle to the growth of the Internet of Things, so it must be actively addressed to ensure that the IoT delivers increasing benefits rather than increasing problems. Making a connected product simple to use takes extra effort.

Therefore it’s very important to consider early on how humans will interact with the finished design, and ideally to learn where the problems lie before a lot of technical effort is invested in something that will end up being unusable. Processes such as mocking-up the design and rapidly building early working prototypes can help greatly.

Target scale?

The “best” choices of approach and technology for a product that will only ever sell in tens of units are very different from the best choices for a product that will sell in tens of thousands, let alone tens of millions. With every order-of-magnitude increase in scale you can expect your unit costs to fall by, say, 30%, but this improvement doesn’t happen just by magically making more: it takes very significant effort and investment. To avoid the substantial time and cost of “changing gear” by re-engineering your entire product as it scales up, you need some idea at the start of what scale you are likely to achieve, so that you can aim for it.
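The “30% per order of magnitude” rule of thumb can be turned into a rough cost model. The sketch below assumes, purely for illustration, a unit cost of 100 (in any currency) at 10 units, and applies the same fractional saving for every tenfold increase in volume; real cost curves are lumpier and depend on the investment made at each step.

```python
import math


def unit_cost(base_cost: float, base_volume: int, volume: int,
              reduction_per_decade: float = 0.30) -> float:
    """Illustrative unit cost, assuming cost falls by a fixed fraction
    for every tenfold (order-of-magnitude) increase in volume."""
    orders_of_magnitude = math.log10(volume / base_volume)
    return base_cost * (1 - reduction_per_decade) ** orders_of_magnitude


# Hypothetical starting point: 100 per unit when making 10 units.
for v in (10, 100, 10_000, 10_000_000):
    print(f"{v:>12,} units: {unit_cost(100.0, 10, v):6.2f} per unit")
```

Under these assumptions the unit cost at ten million units is roughly 12% of the ten-unit cost, which is why it matters which of these regimes you are designing for.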

Scale affects everything. There’s a tension between Effort, Efficiency and Time-to-market, and generally there’s a trade-off between them: you can have any two but not all three.

  • Efficiency is a measure of how well-tailored your product is to its purpose. Can it run on the lowest-cost devices and infrastructure? Can it consume as little power as possible, so the battery lasts as long as possible and the operational costs of changing it are minimised? Can it use as little network bandwidth as possible? Can it be fully self-managing?
  • Effort is the human effort required to achieve the above. This has real costs, and even with a large budget is constrained by the limits of organisational scaling, especially as connected products often require the co-ordination of people with many different skills.
  • Time-to-market is how long the product takes from conception to launch.

Volume products repay investment in “productisation”: if you know you’re aiming for a high-volume product, it’s worth making significant up-front investment to optimise your product and service to maximise your margins once you get to scale.
Alternatively, if you want to get to market quickly and without huge effort, your product probably won’t be very efficient, and you risk falling on the wrong side of the growth curve (making an operating loss that grows with every device sold). One potential solution is to compose your offering from pre-optimised, off-the-shelf frameworks.

Quality, Speed, Volume

Another set of three relationships to consider is between the Quality of the end product, the Speed with which you can deliver and iterate it, and the Scale at which you make it. Here, rather than being a trade-off, the three are co-dependent.
If you are making something at scale, then you have to achieve quality, or else your support costs will be prohibitive and your reputation damaged; and you have to achieve speed, because you will need to iterate the product quickly to meet the evolving demands of your mass market.

Put another way, scale is the only way to deliver quality and speed at a sane cost. It is very hard to deliver quality and speed in a low-volume process, because you have nothing against which to amortise the development cost. It is possible to make a high-quality object in low volumes, e.g. a satellite, but it will be slow and expensive.

Knock-on effects

Designing a connected product is a long chain of decisions, each having knock-on consequences for other work. Here are some examples:

  • Your target scale affects the choice of hardware: at large scale it may be worth using a system-on-chip processor to maximise your integration and minimise product cost and size. But embedded software design for system-on-chip can be more challenging.
  • Although it is tempting to optimise software to the Nth degree to maximise efficiency, remember that Moore’s Law is currently increasing CPU power by 100% every 18 months, and storage by 100% every 12 months. Increasingly, it is human time that is precious and limited, not machine time.
  • Just because you see a product for sale at a low cost doesn’t mean that you can make it for that cost. For example, the popular Raspberry Pi embedded computer costs approximately £25, but if you wanted similar functionality in your own design, you couldn’t achieve that price without significant design-for-scale investment.
    • To “cheat” your way to success, consider leveraging someone else’s scale, by buying-in a module or service that already exists. Even though you may only need 10,000 units/year, if the manufacturer is selling 1,000,000 units/year then you can ride on the back of that volume, against which all the development cost is amortised.
  • Energy consumption is a big issue at scale. Even with mains-powered devices, their sheer number can mean a significant global impact if they are not efficient – witness recent laws mandating the maximum power consumption of set-top-boxes, broadband hubs etc. Moving to low-power design is hard – your code needs to spend most of its time asleep, and you have to move to a message-oriented, asynchronous style of programming which is fundamentally different from the classic imperative style.
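To see why code that spends most of its time asleep matters, the average current drawn by a duty-cycled device can be estimated from its awake and asleep currents. The figures below (a 2,400 mAh cell, 15 mA awake, 5 µA asleep) are hypothetical, and the model ignores self-discharge, temperature and peak-current effects, which in practice cap battery life well before the idealised figure at very low duty cycles:

```python
def battery_life_days(capacity_mah: float, active_ma: float,
                      sleep_ma: float, duty_cycle: float) -> float:
    """Idealised battery life in days. duty_cycle is the fraction of time
    awake; ignores self-discharge, temperature and peak-current effects."""
    avg_ma = active_ma * duty_cycle + sleep_ma * (1 - duty_cycle)
    return capacity_mah / avg_ma / 24


# Hypothetical sensor: 2,400 mAh cell, 15 mA awake, 0.005 mA (5 uA) asleep.
for duty in (1.0, 0.01, 0.001):
    days = battery_life_days(2400, 15, 0.005, duty)
    print(f"awake {duty:.1%} of the time: {days:,.0f} days")
```

On paper, cutting awake time from 100% to 0.1% stretches battery life from about a week to over a decade; in reality self-discharge would dominate long before then, but the gap between the two styles of code is the point.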

Volume manufacturing

Making products in volume is an art in itself. Time on a manufacturing line is expensive, so it is often worth investing in automated manufacturing, programming and test equipment to reduce expensive human effort, and this will increase quality too, if you have the right processes in place. One thing worth thinking hard about is the extent to which each product needs to be customised. For example, if a unique physical network address (“MAC address”) has to be programmed into every board, how are you going to allocate those addresses, and ensure that you never duplicate them? Do you need to print barcodes on the outside of packaging to allow you to do stock-control (and in the worst case, rework) efficiently?
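One simple way to guarantee that no MAC address is ever issued twice is to allocate addresses sequentially from a single block, with one authoritative counter and a ledger recording which board received which address. The sketch below is a hypothetical in-memory version: the prefix 02:AB:12 is a made-up, locally administered prefix used purely for illustration (production devices would normally use a block assigned by the IEEE Registration Authority), and a real system would keep the counter and ledger in durable, transactional storage shared by every programming station.

```python
class MacAllocator:
    """Hypothetical sequential MAC allocator: a 24-bit prefix plus a 24-bit
    counter, with a serial->MAC ledger so no address is ever reused."""

    def __init__(self, prefix: int, first: int = 0, last: int = 0xFFFFFF):
        if not 0 <= prefix <= 0xFFFFFF:
            raise ValueError("prefix must fit in 24 bits")
        self.prefix = prefix
        self.next = first
        self.last = last
        self.ledger: dict[str, str] = {}

    def allocate(self, serial: str) -> str:
        # Re-programming the same board returns the address it already has.
        if serial in self.ledger:
            return self.ledger[serial]
        if self.next > self.last:
            raise RuntimeError("address block exhausted")
        value = (self.prefix << 24) | self.next
        self.next += 1
        # Format the 48-bit value as six colon-separated hex octets.
        mac = ":".join(f"{(value >> shift) & 0xFF:02X}" for shift in range(40, -8, -8))
        self.ledger[serial] = mac
        return mac


alloc = MacAllocator(prefix=0x02AB12)
print(alloc.allocate("SN-0001"))  # 02:AB:12:00:00:00
print(alloc.allocate("SN-0002"))  # 02:AB:12:00:00:01
```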

How do you minimise the cash-flow challenges of “Work In Progress” (half-built product) and the logistical problems of finished product getting “stale” while it sits in the supply chain? Is your manufacturing facility online or offline? If online, you can monitor quality and make corrections from afar, but you are also more exposed to viruses, and to downtime if the connection goes down.

Minimising Operational Expenditure

Operational Expenditure (OpEx) is the term for all the costs incurred by a product once it has been made, sold, and is being used for its intended purpose. This includes your own operational costs as the provider (e.g. fixing bugs and keeping the service running) and the operational costs of the end-users (e.g. manual maintenance such as changing batteries). In most commercial uses, the customer will (or at least, should) combine the up-front product cost with the ongoing OpEx cost to form a true picture of the cost/benefit. And you need to think like this too, to ensure that you are planning for positive margins.
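The customer’s calculation can be sketched as a simple, undiscounted total cost of ownership: the up-front price plus the yearly cost of manual interventions and service fees over the product’s life. All the figures below are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    unit_price: float            # up-front cost per device
    visits_per_year: float       # manual interventions, e.g. battery changes
    cost_per_visit: float        # human cost of each intervention
    service_fee_per_year: float  # e.g. connectivity and cloud charges

    def annual_opex(self) -> float:
        return self.visits_per_year * self.cost_per_visit + self.service_fee_per_year

    def tco(self, years: float) -> float:
        """Undiscounted total cost of ownership over `years` of service."""
        return self.unit_price + self.annual_opex() * years


# Hypothetical comparison: a cheaper sensor needing a yearly battery change
# versus a pricier, lower-power one needing a visit every five years.
cheap = CostModel(unit_price=20, visits_per_year=1.0, cost_per_visit=15, service_fee_per_year=2)
frugal = CostModel(unit_price=35, visits_per_year=0.2, cost_per_visit=15, service_fee_per_year=2)
for name, model in (("cheap", cheap), ("frugal", frugal)):
    print(f"{name}: {model.tco(5):.2f} over 5 years")
```

In this made-up case the device that costs more up front ends up around 40% cheaper for the customer over five years, because the human cost of battery changes dominates.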

Therefore it is worth thinking hard about all the life-cycle events that your product is likely to experience from cradle to grave and the extent to which these can be automated to avoid incurring human costs for you or your customer. This includes topics such as field-upgrades and day-to-day monitoring of performance levels.

80:20

Experience has taught us that the software code required to deliver the principal function of a connected device can often be small and simple, taking at most a few days of engineering time to develop. So getting to demo stage can be quite quick and painless.

But the real world is a messy place. For every way an IoT device can work, there are a hundred ways it can fail. Access points disappear, wireless interference happens, batteries run down, code has bugs in it, standards change. If there’s a situation that the designer didn’t foresee, then at scale your customers will find it. A rare combination of events which under some circumstances causes a crash is probably not much of an issue for one device (statistically, you may never see it). But once you have a million such devices in the field, it can be a massive headache, as it takes only a small number of human-expensive problems to affect profits and corporate image.
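The arithmetic behind “at scale your customers will find it” is straightforward. If a fault has an independent probability p of striking any given device in a given period, the chance it strikes at least one of N devices is 1 - (1 - p)^N, and the expected number of occurrences is N × p. Assuming, purely for illustration, a one-in-a-million chance per device per day:

```python
import math


def p_at_least_once(p: float, devices: int) -> float:
    """P(event occurs on at least one device), assuming independent devices
    that each see it with probability p per period. Uses log1p/expm1 to stay
    accurate when p is tiny."""
    return -math.expm1(devices * math.log1p(-p))


p = 1e-6  # hypothetical per-device, per-day probability
for n in (1, 10_000, 1_000_000):
    print(f"{n:>9,} devices: P(at least once a day) = {p_at_least_once(p, n):.4f}, "
          f"expected per day = {n * p:g}")
```

With one device you would almost certainly never see this fault; with a million devices it is more likely than not (about 63%) to occur somewhere in the fleet every day.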

Because of the mismatch between the number of devices and the number of humans, it’s increasingly vital that devices can sort out the inevitable problems themselves, automatically, without human intervention. The code to deal with all the so-called “edge-cases” of a product is often complex, and therefore a source of bugs, and over time it typically consumes 80% or more of development and support costs, code space etc.
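As one small example of a device sorting out a common problem by itself, a lost connection can be retried automatically with exponential backoff and random “jitter”, a widely used pattern. The jitter matters at scale: if an access point or server restarts and every device retries on the same fixed schedule, the synchronised retries can overwhelm it. The sketch below is illustrative only; the `connect` callable and the delay parameters are hypothetical placeholders for whatever the device’s network stack provides:

```python
import random
import time
from typing import Callable, Optional


def connect_with_backoff(connect: Callable[[], bool],
                         base_delay: float = 1.0,
                         max_delay: float = 300.0,
                         max_attempts: Optional[int] = None,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `connect` until it returns True and return the number of attempts.

    Between failures, wait a random time between 0 and
    min(max_delay, base_delay * 2 ** (attempt - 1)) seconds ("full jitter").
    Raises TimeoutError once `max_attempts` attempts have failed.
    """
    attempt = 0
    while True:
        attempt += 1
        if connect():
            return attempt
        if max_attempts is not None and attempt >= max_attempts:
            raise TimeoutError(f"gave up after {attempt} attempts")
        cap = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, cap))


# Simulated flaky link that succeeds on the fourth try; sleeping is stubbed out.
outcomes = iter([False, False, False, True])
print(connect_with_backoff(lambda: next(outcomes), sleep=lambda s: None))  # 4
```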

Scaling example

To illustrate some of the above points, we now show how producing a product changes as the target scale increases from “1 off” (a single unit) to a volume of one million. We’ll increase the scale in multiples of 100. This is of course an arbitrary number, but it serves to illustrate that approaches that are practical at one scale become not just impractical at a larger scale, but often downright impossible. When building for a particular scale, it’s often worth bearing the next couple of orders of magnitude in mind, because that will be next year’s challenge!

Because of the significant changes of gear necessary to grow through different scales, it may even make sense to view your early small-scale deployment as “disposable” work, because it may not be cost-effective (or you may not have the knowledge or relationships) to build it initially in a way that can scale very far. So it may be cheaper to replace a few hundred early units in the field once you have scaled into the thousands, than to go on supporting an outdated low-scale product.

If it helps to have a concrete example in mind, consider a battery-powered, cloud-connected wireless temperature-sensor product. Just to be clear, when we say “product” we are referring not only to the hardware but also to its software, the cloud services that enable it, the application and its user experience, and any human support that it requires (e.g. customer services and Ops), all of which go to deliver a complete product offering to the market.

1 off

Making just one of something is a completely bespoke activity. The majority of your costs are development costs, and it is rarely worth optimising the cost of the actual product deliverable because there is no leverage on that time and money. You will do most things by hand. To leverage scale from elsewhere, consider buying modules or even complete products if possible. In terms of communications, use off-the-shelf solutions already at scale (e.g. WiFi or Bluetooth) and live with the limitations.

100 off

You could spend your life soldering modules together, so now it’s worth hiring someone to do this for you. Manufacturing starts to become a process that needs documentation and some kind of QA process (testing). At this volume, the most cost-effective solution to poor “yield” (too many units failing test) may be over-buying: if a unit doesn’t work, chuck it away.
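The over-buying calculation itself is simple: divide the number of good units you need by the expected yield and round up, optionally padding for the yield turning out worse than expected. A minimal sketch, with a hypothetical 90% yield:

```python
import math


def units_to_build(good_units_needed: int, expected_yield: float,
                   safety_margin: float = 0.0) -> int:
    """Units to build so that, at `expected_yield` (fraction passing test),
    at least `good_units_needed` good units come out. `safety_margin`
    (e.g. 0.05 for 5%) pads the build in case yield is worse than expected."""
    if not 0 < expected_yield <= 1:
        raise ValueError("expected_yield must be in (0, 1]")
    return math.ceil(good_units_needed * (1 + safety_margin) / expected_yield)


print(units_to_build(100, 0.90))  # 112
```

This only covers the average case; individual batches vary, which is what the safety margin is for.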

At this scale, it starts to become worth creating simple tools to make your time more efficient, e.g. perhaps a firmware-programming station, some way of managing unique IDs, keys etc. Design for Manufacturing (DFM) becomes important and it will be worth designing a test harness, with bed-of-nails testing etc.
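As an illustration of “some way of managing unique IDs, keys etc.”, a programming station might generate a random per-device secret for each serial number and record it, refusing to provision the same serial twice. The class below is a hypothetical minimal sketch; in a real deployment the ledger would live in a secured, backed-up store (or the keys would be generated inside a hardware security module), never in a plain in-memory dictionary.

```python
import secrets


class ProvisioningLedger:
    """Hypothetical record of which serial number received which device key."""

    def __init__(self) -> None:
        self._keys: dict[str, str] = {}

    def provision(self, serial: str) -> str:
        """Generate and record a fresh 128-bit key for a new serial number."""
        if serial in self._keys:
            # Silently issuing a new key would leave a board already programmed
            # with the old key out of step with this ledger, so flag it instead.
            raise ValueError(f"{serial} has already been provisioned")
        key = secrets.token_hex(16)  # 16 random bytes as 32 hex characters
        self._keys[serial] = key
        return key

    def __len__(self) -> int:
        return len(self._keys)


ledger = ProvisioningLedger()
for n in range(1, 4):
    ledger.provision(f"SN-{n:04d}")
print(len(ledger))  # 3
```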

On the service side, a proper “deployment” flow will enable you to start to improve both time-to-market and quality, so it’s worth investigating CI/CD/TDD processes to automate the testing and deployment cycle.

At this scale, some level of human interaction and supervision may still be tolerable, but any manual management of edge devices and gateways will start to become a significant time burden, so it is worth investigating tools to help you automate the process of setting-up and managing your devices in the field.

10,000 off

Congratulations! You are now a member of a club of people who have done things at scale. You may not be getting much sleep. “Humans don’t scale” is now your mantra: you cannot do anything to this number of devices by hand. Manufacturing is now completely at arm’s length, and the quality of your documentation and QA must be sufficient to reassure you that you will not take delivery of a truckload of non-functional devices. Test jigs and manufacturing may require as much investment as the product itself, and built-in self-test is well worth considering.

Continuous Integration is challenging with hardware at this scale: component lead times can be several months (so you can’t suddenly change a component value), and in general no “tweaking” is allowed, because the downstream consequences for manufacturing and risk are enormous. Yield has become an issue: you can’t throw 10% away. So a continuous-improvement loop is necessary to identify the causes of defects and eliminate them at source before too many devices have been made.
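A common starting point for that continuous-improvement loop is a Pareto analysis of the failure codes recorded at the test station: rank the causes by frequency and work first on the few that account for most failures. A minimal sketch, with made-up failure codes and counts:

```python
from collections import Counter


def pareto(failures: list[str], threshold: float = 0.8) -> list[tuple[str, int]]:
    """Most frequent failure causes that together account for at least
    `threshold` (e.g. 80%) of all recorded failures, most frequent first."""
    ranked = Counter(failures).most_common()
    total = sum(count for _, count in ranked)
    selected, running = [], 0
    for cause, count in ranked:
        selected.append((cause, count))
        running += count
        if running >= threshold * total:
            break
    return selected


# Hypothetical failure log from one production batch.
log = (["solder bridge"] * 42 + ["antenna detuned"] * 25
       + ["battery clip"] * 8 + ["other"] * 5)
print(pareto(log))  # [('solder bridge', 42), ('antenna detuned', 25)]
```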

Bugs that were too infrequent to understand or analyse at lower volumes now become statistically significant, and a visible consumer-satisfaction issue. Understanding failure modes requires a strong understanding of the end user. Return-to-base failures (“bricking”) are intolerable, and it’s worth investing time to analyse common failures. Do you need a mock-up of the end-user environment so that you can replicate problems? Or to add enough “black box” instrumentation that you can diagnose most problems after the fact?
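One lightweight form of “black box” instrumentation is a fixed-size ring buffer of recent events: memory use stays bounded however long the device runs, and after a crash or watchdog reset the buffer can be saved and uploaded for diagnosis. The Python below only sketches the data structure (on a microcontroller it would typically be a static array in C), and the event names are hypothetical:

```python
import time
from collections import deque


class BlackBox:
    """Fixed-size log of the most recent events (oldest dropped first)."""

    def __init__(self, capacity: int = 256) -> None:
        self._events: deque = deque(maxlen=capacity)

    def record(self, event: str, **fields) -> None:
        """Append a timestamped event; evicts the oldest once full."""
        self._events.append((time.time(), event, fields))

    def dump(self) -> list:
        """Snapshot to persist or upload after a crash or reset."""
        return list(self._events)


bb = BlackBox(capacity=3)
for rssi in (-60, -72, -85, -91):
    bb.record("wifi_rssi", dbm=rssi)  # hypothetical signal-strength event
print([fields["dbm"] for _, _, fields in bb.dump()])  # [-72, -85, -91]
```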

Reliance on externalities becomes critical: for example, the smartphone that your app runs on, or the gateways and networks that your device relies upon. Imagine that your device uses standards in a legitimate but slightly unusual way, and a popular Telco suddenly releases a new version of its gateway that doesn’t support this properly.

Compliance becomes a big topic. If you make a single device for a demo then you can get away with arbitrarily assigning MAC addresses etc., since it is unlikely to cause a problem. But at scale, you need to do things “properly” so that your devices fit with the rest of the world. This extends to physical compliance too: it’s no longer a prototype, so it must be CE and FCC marked and UL approved, and have proper RF testing. And you’ll need to qualify your suppliers and have a goods-inwards QA process.

The whole device life-cycle is now your problem. This includes the ability to inventory-track, e.g. using barcodes etc. if every device is personalised. You will also require a Returns process. Stock may become outdated in the supply chain too - batteries have a shelf-life and firmware goes out-of-date. Is updating-on-first-use an acceptable customer experience? If not, you’ll need a rework plan.

All aspects of device “Management” need to be totally automated. This includes in-field firmware upgrades, which can’t be done while devices are being used, are offline or have a dead battery. It may also include Service Level Agreement reporting: are you delivering the device and service quality that your channels and end-customers have paid for? You may need to add extra telemetry throughout the system, and build or buy tools to summarise all the data from it.
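The conditions just mentioned translate naturally into a gate that an update scheduler checks before pushing firmware to each device, deferring the rest to a later attempt. A hypothetical sketch; the field names and the 30% battery threshold are illustrative assumptions, not recommendations:

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    online: bool
    in_use: bool
    battery_pct: float


def ready_for_update(d: DeviceState, min_battery_pct: float = 30.0) -> bool:
    """Start an over-the-air update only when the device is reachable, idle,
    and has enough charge (hypothetical threshold) to finish safely."""
    return d.online and not d.in_use and d.battery_pct >= min_battery_pct


def plan_rollout(fleet: dict[str, DeviceState]) -> tuple[list[str], list[str]]:
    """Split the fleet into devices to update now and devices to retry later."""
    now = [dev for dev, s in fleet.items() if ready_for_update(s)]
    later = [dev for dev, s in fleet.items() if not ready_for_update(s)]
    return now, later


fleet = {
    "sensor-1": DeviceState(online=True, in_use=False, battery_pct=80),
    "sensor-2": DeviceState(online=False, in_use=False, battery_pct=90),
    "sensor-3": DeviceState(online=True, in_use=True, battery_pct=70),
    "sensor-4": DeviceState(online=True, in_use=False, battery_pct=12),
}
print(plan_rollout(fleet))  # (['sensor-1'], ['sensor-2', 'sensor-3', 'sensor-4'])
```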

The underlying service costs are worth examining in detail. Devops (the engineers building and running your service on a daily basis) becomes a noticeable cost-centre, so it’s worth having tools to manage the tools. This is when your hosting decisions start to bite. “Cloud” can seem quite expensive, but can include a lot (redundancy, devops, h/w upgrades etc.) and may allow you to ride on 3rd-party evolution, instead of being side-swiped by it.

Speaking of costs, at volume you’ll start to sweat topics such as cash-flow and getting the wrong side of the curve (at volume, by definition your margins will be tight, so if you’re pricing too low then any small error will mean you’ll start to lose on every sale). This is when you might want to switch to a pure-service business model, partnering with hardware vendors.
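“The wrong side of the curve” can be made concrete with a per-unit lifetime margin: the hardware margin plus (service revenue minus service cost) for each year in service. When that number is negative, every extra device sold deepens the loss. The figures below, including the fixed costs, are hypothetical:

```python
def lifetime_margin_per_unit(price: float, unit_cost: float,
                             service_revenue_per_year: float,
                             service_cost_per_year: float,
                             years: float) -> float:
    """Profit (negative = loss) per device over its service life, undiscounted."""
    hardware_margin = price - unit_cost
    service_margin = (service_revenue_per_year - service_cost_per_year) * years
    return hardware_margin + service_margin


def total_profit(units: int, margin_per_unit: float, fixed_costs: float) -> float:
    return units * margin_per_unit - fixed_costs


# Hypothetical: 10 hardware margin, but 4/year to run the service for 3 years.
margin = lifetime_margin_per_unit(price=40, unit_cost=30,
                                  service_revenue_per_year=0,
                                  service_cost_per_year=4, years=3)
for units in (10_000, 1_000_000):
    print(f"{units:>9,} units: profit {total_profit(units, margin, 50_000):,.0f}")
```

In this model, charging even a small annual subscription (2 per year) turns the per-unit margin positive, so that volume helps rather than hurts.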

Your service may hit a scaling wall if it has not been properly architected. For example, the music-recognition service Shazam originally did everything centrally, but as its user base grew its cloud costs grew in proportion, which was getting in the way of its “free” model. It solved that problem by offloading much of the recognition heavy-lifting to the edge devices (the phones), which by definition scale with the user base. You also need to think about whether you should be “scaling-up” (putting more horsepower into your existing machines) or “scaling-out” (making your architecture inherently more parallel). The former can run into hard scaling bottlenecks, while the latter requires more fundamental design work, so it is best done early!
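Scaling out usually means partitioning the work so that more machines can share it. One simple, commonly used approach is to assign each device to a worker by a stable hash of its ID, as sketched below with hypothetical device IDs and worker count. Note that plain modulo hashing moves most devices to a different worker whenever the number of workers changes; consistent hashing or rendezvous hashing is the usual refinement if workers are added or removed often.

```python
import hashlib
from collections import Counter


def shard_for(device_id: str, num_shards: int) -> int:
    """Map a device ID to a worker index in [0, num_shards).

    Uses SHA-256 rather than Python's built-in hash(), which is randomised
    per process for strings and so would not be stable across restarts.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical fleet of 100,000 sensors spread across 8 ingestion workers.
load = Counter(shard_for(f"sensor-{i}", 8) for i in range(100_000))
for shard, count in sorted(load.items()):
    print(f"worker {shard}: {count} devices")
```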

1,000,000 off

Yield is everything. Rework is incredibly expensive. Manufacturing and QA must be completely automated. Parallel production lines need more manufacturing stations, so making and managing those stations efficiently becomes a task in its own right. You need pre-production test runs etc. to fully qualify the product before pressing the big button, because the consequences of failure are now horrendous.

You need to have well-designed plans for the whole of the life-cycle, including decommissioning and end-of-life (how much landfill tax do you want to pay?). You need to proactively invest in managing externalities, such as component end-of-life, DRAM famines etc.

You are now truly Enterprise scale, and need to have processes to match. Human time is finite and therefore the most precious resource. This theme became noticeable when we had just 100 connected products and continues to grow in importance now that we have more than a million. Software services which allow scaling without huge human intervention are vital, as are software tools which automate the process of deploying and managing connected products at scale.

The quality and usability of those tools become paramount.

Conclusion

The Internet of Things is all about scale, and we hope we’ve given you some insight into the many aspects of design that will be influenced by your target scale, and how these can interact. A key theme is the minimisation of human interaction at all stages of the device life-cycle, as this is likely to have a decisive impact on product margins, and even on product viability.

About DevicePilot

DevicePilot is the software of choice for managing connected devices at scale. DevicePilot is completely agnostic, allowing the user to connect any device across any platform, with simple and easy integration. The company draws on the significant experience of its founders who successfully scaled their previous connected-device businesses to 1 million+ end-customers in areas as diverse as mobile phones, IPTV set-top-boxes and the connected home. Contact us for further information



