Let's dive straight in.
1. Do data reduction as soon as possible in the chain from device to cloud
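As a sketch of what reduction at the edge can mean (the window size, field names, and summary statistics here are all illustrative), a device might condense a burst of raw readings into one summary message rather than shipping every sample:

```python
import statistics

def summarise(samples: list[float]) -> dict:
    """Condense a window of raw readings into one compact message."""
    return {
        "mean": round(statistics.mean(samples), 2),
        "min": min(samples),
        "max": max(samples),
        "n": len(samples),
    }

raw = [21.1, 21.3, 20.9, 21.0, 21.2]  # e.g. one burst of temperature samples
print(summarise(raw))  # one small message instead of five
```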
2. Don't emit unstructured data from your devices (log strings)
Send data in structures such as JSON, because it's far easier for machines to parse, and therefore to select only the fields they're interested in.
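To make the contrast concrete (the field names below are invented, not a standard schema), here's the same information as a log string and as JSON:

```python
import json
import time

# Unstructured: the cloud side needs fragile regexes to pull anything out.
log_line = "2024-05-01 12:00:03 WARN battery low (3.41V) on sensor-17"

# Structured: any consumer can parse it and pick only the fields it wants.
message = {
    "device": "sensor-17",
    "ts": int(time.time()),
    "event": "battery_low",
    "battery_v": 3.41,
}
print(json.dumps(message))
```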
3. Don't repeat yourself - send only values that have changed, whether it's telemetry or metadata
Your resulting database will therefore be "sparse" (not every row has a value in every column).
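A minimal report-by-exception sketch, assuming the device has enough RAM to cache the last-sent values:

```python
# Remember what we last sent; emit only the keys whose values have changed.
last_sent: dict = {}

def changed_fields(reading: dict) -> dict:
    delta = {k: v for k, v in reading.items() if last_sent.get(k) != v}
    last_sent.update(delta)
    return delta

print(changed_fields({"temp": 21.5, "rssi": -70}))  # first time: everything
print(changed_fields({"temp": 21.5, "rssi": -71}))  # then: {'rssi': -71} only
```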
4. Heartbeats are the exception to the above
Heartbeats are regular messages sent by your device to prove that it's actually connected (and to measure connection reliability over time). Note that if you use a reliable transport, it may do the heartbeats for you (such as TCP, where they're called "keepalives"), so you can rely on "connection established/dropped" messages instead.
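A minimal application-level heartbeat sketch; the interval and the transmit function are placeholders for your own transport:

```python
import time

HEARTBEAT_INTERVAL = 60  # seconds; tune to how fast you need to spot drops
last_tx = 0.0

def send(payload: dict) -> None:
    """Stand-in for your real transmit function."""
    global last_tx
    last_tx = time.monotonic()
    print(payload)

def maybe_heartbeat() -> None:
    # Only heartbeat if nothing else has recently proven the link alive.
    if time.monotonic() - last_tx >= HEARTBEAT_INTERVAL:
        send({"event": "heartbeat"})
```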
5. Reliable communications (like TCP!) which have acknowledgements and retries cost more than unreliable comms
Unreliable transports such as UDP are spray-and-pray... you might want to use a mixture of the two, as sketched below.
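As one way to mix them (the host and ports here are hypothetical), frequent loss-tolerant readings could go over UDP while rare, important events go over TCP:

```python
import json
import socket

HOST = "ingest.example.com"  # hypothetical ingest endpoint

def send_telemetry_udp(reading: dict) -> None:
    """Fire-and-forget: fine for frequent readings where one loss is harmless."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(json.dumps(reading).encode(), (HOST, 9000))

def send_event_tcp(event: dict) -> None:
    """Acknowledged, retried delivery: worth it for rare, important events."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect((HOST, 9001))
        s.sendall(json.dumps(event).encode() + b"\n")
```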
6. You might need to send different data at different rates
For example, temperature every five minutes, battery levels once a day, and movement events as detected (see the sketch after this list). These decisions are based on:
- The time-resolution needed for subsequent analysis
- How quickly you need to know when something has changed
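A minimal per-metric scheduler sketch; the intervals and the read/send callbacks are placeholders:

```python
import time

# Per-metric send intervals in seconds (values are illustrative).
INTERVALS = {"temperature": 5 * 60, "battery": 24 * 60 * 60}
next_due = {name: 0.0 for name in INTERVALS}

def poll_and_send(read, send) -> None:
    """Call this frequently; it sends each metric only when its interval is up.
    `read(name)` and `send(msg)` are stand-ins for your sensor and uplink."""
    now = time.monotonic()
    for name, interval in INTERVALS.items():
        if now >= next_due[name]:
            send({name: read(name)})
            next_due[name] = now + interval
    # Movement is event-driven, not polled: call send() from its interrupt handler.
```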
7. Subscribing to what AWS IoT calls a "shadow" and Azure IoT calls a "device twin" can help "de-duplicate" repeated data, since by default subscribers to these brokers only receive changes
In many cases, this can reduce data by an order of magnitude.
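As a sketch using AWS IoT's shadow delta topic (assuming the paho-mqtt client; the thing name and endpoint are placeholders, and a real connection also needs TLS certificates):

```python
import json
import paho.mqtt.client as mqtt

THING = "sensor-17"  # placeholder thing name
DELTA_TOPIC = f"$aws/things/{THING}/shadow/update/delta"

def on_message(client, userdata, msg):
    # AWS publishes here only the fields where desired and reported state
    # differ, so we handle changes rather than full snapshots.
    print("changed:", json.loads(msg.payload).get("state"))

client = mqtt.Client()
client.on_message = on_message
# A real AWS IoT connection needs TLS certificates, e.g.:
# client.tls_set(ca_certs=..., certfile=..., keyfile=...)
client.connect("your-endpoint.iot.us-east-1.amazonaws.com", 8883)
client.subscribe(DELTA_TOPIC)
client.loop_forever()
```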
8. Upgrading firmware generally requires massive amounts of data
A firmware image is typically orders of magnitude larger than routine telemetry, so budget for upgrades separately, and consider compression or delta updates to shrink them.
9. To diagnose difficult problems, you might need to massively increase data rates
For example, rare bugs or complex failures such as mesh network instability. Consider creating a mechanism which does this on just one device or site for 24 hours, giving you a detailed view without hugely increasing your average data rate across all devices.
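A minimal sketch of such a mechanism (the command schema and intervals are invented): a downlink command puts one device into a verbose mode that expires automatically:

```python
import time

verbose_until = 0.0  # epoch seconds; 0 means normal data rates

def handle_command(cmd: dict) -> None:
    """Hypothetical downlink handler: put this one device into verbose
    diagnostics for a bounded period, then fall back automatically."""
    global verbose_until
    if cmd.get("action") == "verbose":
        verbose_until = time.time() + cmd.get("duration_s", 24 * 60 * 60)

def report_interval_s() -> int:
    # 10x the data rate while in the diagnostic window, normal otherwise.
    return 30 if time.time() < verbose_until else 300
```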