Compound Availability: Why More Services Mean More Downtime

When a system depends on multiple independent services each with their own uptime percentage, the combined availability is the product of the individual numbers, not the minimum. Three services at 99.9 percent each give a combined 99.7 percent — roughly 26 hours of downtime per year versus 8.7 for a single service.

Service availability is usually quoted as a percentage of uptime: 99.9 percent ('three nines'), 99.99 percent ('four nines'), and so on. The intuitive trap is to assume that depending on several services that each promise 99.9 percent uptime gives a system that is also roughly 99.9 percent available. It does not. For services whose failures are statistically independent, the combined availability is the product of the individual availabilities. Three independent services at 99.9 percent each give 0.999 × 0.999 × 0.999 ≈ 0.997, or 99.7 percent. In hours per year that is the difference between roughly 8.76 hours of downtime (single 99.9 percent service) and roughly 26.3 hours (three of them chained as hard dependencies). This is sometimes called compound availability or the multiplicative availability rule. It assumes the failures are not correlated; in practice failures often are correlated (shared power, network, cloud region), which can either make things worse, when correlated outages cluster, or better, when one shared event takes down everything in a single incident window rather than several spread out over the year. The operational implication is one of the under-discussed arguments for monolithic data stores and against microservice sprawl. Every additional system in the critical path of a request — a separate cache, a separate search cluster, a separate vector database, a separate queue broker — multiplies its availability number into the product. To keep an end-user-facing service at four nines while depending on four backend systems, each of those backends needs to be at roughly 99.9975 percent, an order of magnitude harder than the user-facing target. The rule does not say specialized systems are wrong, only that each one carries an availability cost that needs to be paid for by either its specific performance benefit or by investment in much higher per-service reliability than the headline target.

Compound Availability: Why More Services Mean More Downtime

Have insights to add?