After leading cloud migrations across InMobi, Koo, Tokopedia, and Bukalapak — platforms ranging from social media to e-commerce to ad tech — I’ve noticed something consistent: teams that achieve zero downtime plan for it from the beginning. Teams that fail to achieve it usually try to bolt it on at the end.
The difference isn’t technical sophistication. It’s framing.
The Goal vs. Constraint Distinction
When zero downtime is a goal, it shapes how you measure success after the fact. “Did we have downtime?” becomes the question you answer in the postmortem.
When zero downtime is a constraint, it shapes design decisions from the first architecture review. “Does this design allow us to migrate without downtime?” becomes a question you answer before you write a single line of Terraform.
This isn’t just semantics. It changes what you build.
What the Constraint Forces You to Do
Dual-running architecture: If downtime is a constraint, you need to be able to run source and target simultaneously. This means your application has to be designed — or adapted — to tolerate dual writes, state synchronization, or read splitting. You discover this requirement early, when you can still design for it.
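As a minimal sketch of the dual-write pattern described above — the class names, the in-memory stores, and the reconciliation queue are all illustrative, not a prescription for any particular database:

```python
class InMemoryStore:
    """Toy key-value store standing in for a real source or target database."""

    def __init__(self, fail=False):
        self.data = {}
        self.fail = fail  # simulate an unavailable target

    def put(self, key, value):
        if self.fail:
            raise RuntimeError("target unavailable")
        self.data[key] = value


class DualWriter:
    """Write to the source (system of record), then best-effort to the target.

    During the dual-run phase the source stays authoritative. A failed
    target write is queued for asynchronous reconciliation instead of
    failing the request, so the migration never takes the write path down.
    """

    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.failed_target_writes = []  # keys to reconcile later

    def put(self, key, value):
        self.source.put(key, value)  # must succeed: source is still the record
        try:
            self.target.put(key, value)
        except Exception:
            # Tolerate target failures; reconcile out of band.
            self.failed_target_writes.append(key)
```

The key design choice is the asymmetry: a source failure fails the request, a target failure only enqueues work. That asymmetry is what lets you dual-run without making the new platform a point of failure before cutover.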
Validated rollback before cutover: You cannot cut over to a new platform without a tested rollback path. Not a documented rollback path — a tested one. I’ve seen “rollback: redeploy old version” in runbooks that nobody had run in months. That’s not a rollback. That’s a hope.
Observable cutovers: A zero-downtime cutover requires knowing — in near real-time — whether the cutover is working. Error rates, latency percentiles, queue depths, database replication lag. If your observability isn’t in place before you cut over, you’re flying blind.
Incremental traffic shifting: Big-bang cutovers are the enemy of zero downtime. Moving 5% of traffic, watching for five minutes, then moving to 20% — this is the pattern. DNS-based, load-balancer-based, or feature-flag-based: the mechanism matters less than the incrementality.
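A feature-flag-style version of that incremental pattern can be sketched with deterministic hash bucketing — the `"new"`/`"old"` labels stand in for real backend targets, and the user-ID keying is an assumption:

```python
import hashlib


def route(user_id: str, target_percent: int) -> str:
    """Deterministically route a stable slice of users to the new platform.

    Hash-based bucketing keeps each user on the same side of the split as
    the percentage ramps up (5 -> 20 -> 50 -> 100), so nobody flaps between
    backends mid-session when you widen the rollout.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "new" if bucket < target_percent else "old"
```

Because the bucket is derived from the user ID rather than a random draw, widening the split from 5% to 20% only moves users from old to new, never back — which is exactly the property incremental shifting needs.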
The Rollback Trigger Problem
One thing teams consistently underinvest in is defining their rollback triggers before the cutover window.
“We’ll roll back if something goes wrong” is not a trigger. It’s a feeling.
A rollback trigger looks like: “If p99 latency exceeds 800ms for 3 consecutive minutes, or error rate exceeds 0.1%, we roll back immediately without debate.”
Defining this in advance removes the most dangerous moment in a cutover: the discussion about whether this anomaly is real or transient while users are being affected.
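The trigger from the example above can be expressed as code that runs during the cutover window — the thresholds mirror the text, while the one-minute sampling interval and class shape are assumptions:

```python
from collections import deque


class RollbackTrigger:
    """Mechanically evaluate pre-agreed rollback criteria.

    Criteria (from the example trigger): roll back if p99 latency exceeds
    800 ms for 3 consecutive one-minute samples, or if the error rate
    exceeds 0.1% on any sample.
    """

    P99_LIMIT_MS = 800
    CONSECUTIVE_BREACHES = 3
    ERROR_RATE_LIMIT = 0.001  # 0.1% expressed as a fraction

    def __init__(self):
        self.recent_p99 = deque(maxlen=self.CONSECUTIVE_BREACHES)

    def record(self, p99_ms: float, error_rate: float) -> bool:
        """Return True the moment a rollback should start — no debate."""
        if error_rate > self.ERROR_RATE_LIMIT:
            return True
        self.recent_p99.append(p99_ms)
        return (len(self.recent_p99) == self.CONSECUTIVE_BREACHES
                and all(p > self.P99_LIMIT_MS for p in self.recent_p99))
```

The point is not this particular code — it’s that the criteria are executable, so the decision during the window is a lookup, not an argument.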
When the Constraint Is Genuinely Impossible
Sometimes zero downtime genuinely isn’t achievable. Shared databases with no replication support, schema migrations that require a table lock, stateful applications with no dual-run path — these are the cases where some downtime is unavoidable.
In those cases, the honest answer is: minimize the window, make it planned, and make sure stakeholders understand the trade-off explicitly. A 10-minute planned maintenance window is not a failure. Pretending you can deliver zero downtime when you can’t, and then having an unplanned outage, is.
Practical Starting Points
If you’re planning a migration, start here:
- List every stateful dependency. These are your highest-risk components. How do you migrate them without a downtime window? If you can’t answer that, your architecture isn’t ready.
- Define rollback criteria before writing runbooks. Not after. The rollback trigger shapes every decision about observable thresholds and monitoring.
- Run a dry-run cutover on a staging environment that mirrors production. Not to validate that it works, but to discover what you don’t know.
Zero downtime is achievable. But only if you treat it as a constraint that shapes your entire approach — not a goal you hope to achieve on execution day.