Overview
Koo was one of India’s largest social media platforms: a Twitter-like microblogging service with millions of active users across India and Brazil. When the business decision was made to migrate from AWS to GCP, the window to execute was four months.
I led the end-to-end migration: infrastructure design, service migration, database migration, and production cutover — while users remained active across time zones throughout.
Context
Koo’s platform was complex: a social graph, real-time feeds, media processing, and an ML inference layer for content recommendations. The stack was deeply AWS-native. Moving it to GCP was not a lift-and-shift — it required redesigning how each component would operate in the target environment.
Problem Statement
Migrate a live, multi-region social media platform from AWS to GCP:
- 40+ application services
- 48 ML services (inference + training)
- 14 RDS databases
- 7 DynamoDB tables
- Active users in India and Brazil — no extended downtime window
Constraints
- Four-month hard deadline (business and commercial constraint)
- Cross-region active users throughout migration
- A shared central database with dependencies across multiple services
- No dedicated migration budget for parallel infrastructure beyond the cutover window
Architecture Design
The GCP Landing Zone was designed for Koo’s specific workload profile:
- GKE for containerized application and ML inference services
- Cloud SQL (Postgres, MySQL) replacing RDS — with replication-based cutover to minimize downtime
- Cloud Bigtable for select high-throughput workloads previously on DynamoDB
- Cloud Storage + Vertex AI for ML model storage and serving
- Cloud CDN for media delivery replacing CloudFront
The migration was sequenced by dependency layer: infrastructure first, stateless services second, stateful databases last.
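The layer-by-layer sequencing can be sketched as an ordered plan where each layer must migrate and validate before the next begins. This is a minimal illustration: the component names below are hypothetical stand-ins, not Koo’s actual inventory, and `migration_plan` is an illustrative helper rather than real tooling.

```python
# Hypothetical layered migration plan. Each layer must fully migrate
# and validate before the next layer starts; stateful databases
# deliberately come last.
LAYERS = [
    ("infrastructure", ["vpc", "gke-cluster", "cloud-storage", "cloud-cdn"]),
    ("stateless services", ["feed-service", "ml-inference", "media-worker"]),
    ("stateful databases", ["central-db", "analytics-db"]),
]

def migration_plan(layers):
    """Flatten the layers into an ordered list of steps, with an
    explicit validation gate closing each layer."""
    plan = []
    for layer_name, components in layers:
        plan.append(f"-- begin {layer_name} --")
        plan.extend(f"migrate {component}" for component in components)
        plan.append(f"-- validate {layer_name} --")
    return plan

for step in migration_plan(LAYERS):
    print(step)
```

The gate between layers is the point of the sketch: a stateless service can be rolled back cheaply, so nothing stateful moves until everything above it has been validated on the target.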
Decision Rationale
Database migration approach: Rather than big-bang database cutovers, we used replication to keep source and target databases in sync, then did a coordinated cutover per database — minimizing the window where a service was offline.
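The per-database cutover gate can be sketched as: freeze writes, wait for replication lag to converge to near zero, then promote the target. This is a sketch of the control flow only; `get_replica_lag`, `stop_writes`, and `promote_target` are hypothetical hooks standing in for the real replication tooling (e.g. querying Cloud SQL external replica status).

```python
import time

# Maximum source->target replication lag (seconds) we will tolerate
# before promoting the target. Purely illustrative threshold.
MAX_LAG_SECONDS = 1.0

def cutover(db_name, get_replica_lag, stop_writes, promote_target,
            timeout=300, poll=5):
    """Stop writes, wait for the replica to catch up, then promote
    the target. Raises if lag never converges within the timeout,
    which is the signal to execute the rollback plan instead."""
    stop_writes(db_name)                   # begin the brief write freeze
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_replica_lag(db_name) <= MAX_LAG_SECONDS:
            promote_target(db_name)        # target becomes the primary
            return True
        time.sleep(poll)
    raise TimeoutError(f"{db_name}: replica never caught up; roll back")
```

Because replication keeps the target continuously in sync, the only offline window is the freeze-and-promote step itself, which is what keeps per-database downtime to seconds rather than hours.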
ML services first: Counter-intuitively, migrating ML services early reduced risk. They were read-heavy and stateless relative to the transactional database layer, making them easier to migrate, validate, and roll back.
Terraform per domain: Infrastructure was organized by domain (platform, ML, data) rather than by service. This allowed teams to work in parallel without stepping on each other.
Challenges
The shared database problem: Multiple services depended on a single central database. We couldn’t migrate the database without migrating all its consumers simultaneously — or accepting temporary inconsistency. The solution was a short coordinated cutover window with a pre-validated rollback plan.
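The all-or-nothing property of the coordinated window can be sketched as: repoint every consumer of the shared database, validate each one, and if any check fails, roll every already-flipped service back. The service names and the `repoint`/`validate`/`rollback` callables are hypothetical stand-ins for the real deployment tooling.

```python
def coordinated_cutover(services, repoint, validate, rollback):
    """Flip every consumer of the shared database to the new target.
    Either all services end up repointed, or every already-flipped
    service is rolled back (in reverse order) and the cutover is
    reported as failed."""
    flipped = []
    for service in services:
        repoint(service)
        flipped.append(service)
        if not validate(service):
            # Pre-validated rollback path: unwind in reverse order.
            for done in reversed(flipped):
                rollback(done)
            return False
    return True
```

The rollback path being pre-validated is what makes a short window acceptable: the worst case is a fast, rehearsed return to the source database, not an improvised recovery.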
Four-month timeline: This was genuinely tight, and the scope could not shrink: a half-migrated platform is not a viable end state, so execution had to be right the first time. Daily standups, dependency tracking, and weekly risk reviews were non-negotiable.
Cross-region users: Active users in Brazil while migrating Indian infrastructure meant cutover windows were narrow. We used geo-based DNS routing to shift traffic by region rather than globally.
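The region-scoped traffic shift can be sketched as a routing table keyed by client geography: India can be repointed during its low-traffic window while Brazil keeps resolving to the old origin. Region codes and endpoint names below are illustrative, and the functions are a toy model of geo-DNS behavior, not a real DNS API.

```python
# Toy model of geolocation-based routing: each region resolves
# independently, so cutover can proceed one geography at a time.
ROUTING = {
    "IN": "aws-mumbai.example.com",
    "BR": "aws-saopaulo.example.com",
}

def shift_region(routing, region, new_origin):
    """Return a new routing table with one region moved to the new
    origin; all other regions are left untouched."""
    updated = dict(routing)
    updated[region] = new_origin
    return updated

def resolve(routing, client_region):
    # Geo-DNS: the answer depends on where the client resolves from.
    return routing[client_region]
```

Shifting one key at a time is the whole trick: each region gets its own narrow cutover window, and a rollback is just restoring that region’s previous entry.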
Outcomes
- 40+ application services successfully migrated
- 48 ML services migrated and validated
- 14 RDS → Cloud SQL, 7 DynamoDB → Bigtable/Cloud SQL
- Zero downtime — no user-visible incidents during migration
- 4-month delivery timeline met
- Improved operational cost efficiency post-migration
Business Impact
The migration aligned Koo’s infrastructure with their cost optimization goals and positioned them on a platform with stronger AI/ML tooling — directly relevant for a recommendation-driven social platform.
Lessons Learned
Dependency mapping is not a one-time activity. Services that looked independent had hidden shared state (caches, queues, shared config buckets). Continuous dependency validation — not just upfront discovery — was essential.
Timeline pressure clarifies priorities. Four months forced ruthless triage. Not everything could be modernized during migration. Understanding what to migrate as-is vs. what to re-architect was as important as the migration execution itself.