Overview
Koo was one of India’s largest social media platforms: a Twitter-like microblogging service with millions of active users across India and Brazil. When the business decision was made to migrate from AWS to GCP, the window to execute was four months.
I led the end-to-end migration: infrastructure design, service migration, database migration, and production cutover — while users remained active across time zones throughout.
Context
Koo’s platform was complex: a social graph, real-time feeds, media processing, and an ML inference layer for content recommendations. The stack was deeply AWS-native. Moving it to GCP was not a lift-and-shift — it required redesigning how each component would operate in the target environment.
Problem Statement
Migrate a live, multi-region social media platform from AWS to GCP:
- 40+ application services
- 48 ML services (inference + training)
- 14 RDS databases
- 7 DynamoDB tables
- Active users in India and Brazil — no extended downtime window
Constraints
- Four-month hard deadline (business and commercial constraint)
- Cross-region active users throughout migration
- A shared central database with dependencies across multiple services
- No dedicated migration budget for parallel infrastructure beyond the cutover window
Architecture Design
The GCP Landing Zone was designed for Koo’s specific workload profile:
- GKE for containerized application and ML inference services
- Cloud SQL (Postgres, MySQL) replacing RDS — with replication-based cutover to minimize downtime
- Cloud Bigtable for select high-throughput workloads previously on DynamoDB
- Cloud Storage + Vertex AI for ML model storage and serving
- Cloud CDN for media delivery replacing CloudFront
The migration was sequenced by dependency layer: infrastructure first, stateless services second, stateful databases last.
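The layer-by-layer sequencing can be sketched as an ordered plan where each layer must migrate and validate before the next begins. This is a minimal illustration: the component names below are hypothetical stand-ins, not Koo’s actual inventory, and `migration_plan` is an illustrative helper rather than real tooling.

```python
# Hypothetical layered migration plan. Each layer must fully migrate
# and validate before the next layer starts; stateful databases
# deliberately come last.
LAYERS = [
    ("infrastructure", ["vpc", "gke-cluster", "cloud-storage", "cloud-cdn"]),
    ("stateless services", ["feed-service", "ml-inference", "media-worker"]),
    ("stateful databases", ["central-db", "analytics-db"]),
]

def migration_plan(layers):
    """Flatten the layers into an ordered list of steps, with an
    explicit validation gate closing each layer."""
    plan = []
    for layer_name, components in layers:
        plan.append(f"-- begin {layer_name} --")
        plan.extend(f"migrate {component}" for component in components)
        plan.append(f"-- validate {layer_name} --")
    return plan

for step in migration_plan(LAYERS):
    print(step)
```

The gate between layers is the point of the sketch: a stateless service can be rolled back cheaply, so nothing stateful moves until everything above it has been validated on the target.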
Decision Rationale
Database migration approach: Rather than big-bang database cutovers, we used replication to keep source and target databases in sync, then did a coordinated cutover per database — minimizing the window where a service was offline.
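The per-database cutover gate can be sketched as: freeze writes, wait for replication lag to converge to near zero, then promote the target. This is a sketch of the control flow only; `get_replica_lag`, `stop_writes`, and `promote_target` are hypothetical hooks standing in for the real replication tooling (e.g. querying Cloud SQL external replica status).

```python
import time

# Maximum source->target replication lag (seconds) we will tolerate
# before promoting the target. Purely illustrative threshold.
MAX_LAG_SECONDS = 1.0

def cutover(db_name, get_replica_lag, stop_writes, promote_target,
            timeout=300, poll=5):
    """Stop writes, wait for the replica to catch up, then promote
    the target. Raises if lag never converges within the timeout,
    which is the signal to execute the rollback plan instead."""
    stop_writes(db_name)                   # begin the brief write freeze
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_replica_lag(db_name) <= MAX_LAG_SECONDS:
            promote_target(db_name)        # target becomes the primary
            return True
        time.sleep(poll)
    raise TimeoutError(f"{db_name}: replica never caught up; roll back")
```

Because replication keeps the target continuously in sync, the only offline window is the freeze-and-promote step itself, which is what keeps per-database downtime to seconds rather than hours.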
ML services first: Counter-intuitively, migrating ML services early reduced risk. They were read-heavy and stateless relative to the transactional database layer, making them easier to migrate, validate, and roll back.
Terraform per domain: Infrastructure was organized by domain (platform, ML, data) rather than by service. This allowed teams to work in parallel without stepping on each other.
Challenges
The shared database problem: Multiple services depended on a single central database. We couldn’t migrate the database without migrating all its consumers simultaneously — or accepting temporary inconsistency. The solution was a short coordinated cutover window with a pre-validated rollback plan.
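The all-or-nothing property of the coordinated window can be sketched as: repoint every consumer of the shared database, validate each one, and if any check fails, roll every already-flipped service back. The service names and the `repoint`/`validate`/`rollback` callables are hypothetical stand-ins for the real deployment tooling.

```python
def coordinated_cutover(services, repoint, validate, rollback):
    """Flip every consumer of the shared database to the new target.
    Either all services end up repointed, or every already-flipped
    service is rolled back (in reverse order) and the cutover is
    reported as failed."""
    flipped = []
    for service in services:
        repoint(service)
        flipped.append(service)
        if not validate(service):
            # Pre-validated rollback path: unwind in reverse order.
            for done in reversed(flipped):
                rollback(done)
            return False
    return True
```

The rollback path being pre-validated is what makes a short window acceptable: the worst case is a fast, rehearsed return to the source database, not an improvised recovery.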
Four-month timeline: This was genuinely tight, and the scope could not shrink: a half-migrated platform is not a viable end state, so execution had to be right the first time. Daily standups, dependency tracking, and weekly risk reviews were non-negotiable.
Cross-region users: Active users in Brazil while migrating Indian infrastructure meant cutover windows were narrow. We used geo-based DNS routing to shift traffic by region rather than globally.
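The region-scoped traffic shift can be sketched as a routing table keyed by client geography: India can be repointed during its low-traffic window while Brazil keeps resolving to the old origin. Region codes and endpoint names below are illustrative, and the functions are a toy model of geo-DNS behavior, not a real DNS API.

```python
# Toy model of geolocation-based routing: each region resolves
# independently, so cutover can proceed one geography at a time.
ROUTING = {
    "IN": "aws-mumbai.example.com",
    "BR": "aws-saopaulo.example.com",
}

def shift_region(routing, region, new_origin):
    """Return a new routing table with one region moved to the new
    origin; all other regions are left untouched."""
    updated = dict(routing)
    updated[region] = new_origin
    return updated

def resolve(routing, client_region):
    # Geo-DNS: the answer depends on where the client resolves from.
    return routing[client_region]
```

Shifting one key at a time is the whole trick: each region gets its own narrow cutover window, and a rollback is just restoring that region’s previous entry.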
Outcomes
- 40+ application services successfully migrated
- 48 ML services migrated and validated
- 14 RDS → Cloud SQL, 7 DynamoDB → Bigtable/Cloud SQL
- Zero downtime — no user-visible incidents during migration
- 4-month delivery timeline met
- Improved operational cost efficiency post-migration
Business Impact
The migration aligned Koo’s infrastructure with their cost optimization goals and positioned them on a platform with stronger AI/ML tooling — directly relevant for a recommendation-driven social platform.
Lessons Learned
Dependency mapping is not a one-time activity. Services that looked independent had hidden shared state (caches, queues, shared config buckets). Continuous dependency validation — not just upfront discovery — was essential.
Timeline pressure clarifies priorities. Four months forced ruthless triage. Not everything could be modernized during migration. Understanding what to migrate as-is vs. what to re-architect was as important as the migration execution itself.