Cloud Migration 2023 · Bombinate Technologies / Koo (via Ollion)

Koo: AWS to GCP in 4 Months, Zero Downtime

Migrated Koo's entire production platform — 40+ app services, 48 ML services, and 21 databases — from AWS to GCP within a strict four-month timeline with zero downtime.

Key Impact

40+ services, 48 ML services, and 21 databases migrated in 4 months with zero downtime

GCP · AWS · Terraform · Jenkins · Kubernetes · RDS · DynamoDB · Cloud SQL

Overview

Koo was one of India’s largest social media platforms — a Twitter-like microblogging service with millions of active users across India and Brazil. When the business made the decision to migrate from AWS to GCP, the team had four months to execute it.

I led the end-to-end migration: infrastructure design, service migration, database migration, and production cutover — while users remained active across time zones throughout.

Context

Koo’s platform was complex: a social graph, real-time feeds, media processing, and an ML inference layer for content recommendations. The stack was deeply AWS-native. Moving it to GCP was not a lift-and-shift — it required redesigning how each component would operate in the target environment.

Problem Statement

Migrate a live, multi-region social media platform from AWS to GCP:

  • 40+ application services
  • 48 ML services (inference + training)
  • 14 RDS databases
  • 7 DynamoDB tables
  • Active users in India and Brazil — no extended downtime window

Constraints

  • Four-month hard deadline (business and commercial constraint)
  • Cross-region active users throughout migration
  • A shared central database with dependencies across multiple services
  • No dedicated migration budget for parallel infrastructure beyond the cutover window

Architecture Design

The GCP Landing Zone was designed for Koo’s specific workload profile:

  • GKE for containerized application and ML inference services
  • Cloud SQL (Postgres, MySQL) replacing RDS — with replication-based cutover to minimize downtime
  • Cloud Bigtable for select high-throughput workloads previously on DynamoDB
  • Cloud Storage + Vertex AI for ML model storage and serving
  • Cloud CDN for media delivery replacing CloudFront

The migration was sequenced by dependency layer: infrastructure first, stateless services second, stateful databases last.
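That layering falls out naturally from a topological sort of the dependency graph. The sketch below (hypothetical service names, not Koo's actual inventory) groups services into "waves" where everything in a wave can migrate in parallel:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: node -> set of things it depends on.
# Infrastructure has no dependencies, stateless services sit on top of it,
# and the stateful database migrates last, after all of its consumers.
deps = {
    "vpc-network": set(),
    "gke-cluster": {"vpc-network"},
    "feed-service": {"gke-cluster"},
    "ml-inference": {"gke-cluster"},
    "central-db": {"feed-service", "ml-inference"},
}

def migration_waves(deps):
    """Group nodes into waves; each wave can be migrated in parallel."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = list(ts.get_ready())  # everything whose deps are done
        waves.append(sorted(ready))
        ts.done(*ready)
    return waves
```

Running `migration_waves(deps)` yields infrastructure first, then the two stateless services together, then the database — the same infrastructure → stateless → stateful ordering described above.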

Decision Rationale

Database migration approach: Rather than big-bang database cutovers, we used replication to keep source and target databases in sync, then did a coordinated cutover per database — minimizing the window where a service was offline.
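A replication-based cutover hinges on one gate: don't flip traffic until source→target lag has stayed low long enough to trust it. A minimal sketch of that gate, assuming a caller-supplied lag probe (the function name and thresholds are illustrative, not Koo's actual tooling):

```python
import time

def ready_for_cutover(get_lag_seconds, max_lag=2.0, stable_checks=5, interval=1.0):
    """Gate a database cutover on sustained low replication lag.

    get_lag_seconds: callable returning the current source->target lag.
    Returns only after `stable_checks` consecutive readings <= `max_lag`,
    so a single lucky sample can't trigger the cutover.
    """
    streak = 0
    while streak < stable_checks:
        lag = get_lag_seconds()
        streak = streak + 1 if lag <= max_lag else 0  # any spike resets
        if streak < stable_checks:
            time.sleep(interval)
    return True
```

Once this returns, the coordinated steps follow: stop writes on the source, wait for the replica to fully catch up, repoint the service, and re-enable writes on the target.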

ML services first: Counter-intuitively, migrating ML services early reduced risk. They were read-heavy and stateless relative to the transactional database layer, making them easier to migrate, validate, and roll back.

Terraform per domain: Infrastructure was organized by domain (platform, ML, data) rather than by service. This allowed teams to work in parallel without stepping on each other.

Challenges

The shared database problem: Multiple services depended on a single central database. We couldn’t migrate the database without migrating all its consumers simultaneously — or accepting temporary inconsistency. The solution was a short coordinated cutover window with a pre-validated rollback plan.
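The go/no-go decision for that window reduces to a single check over all of the database's consumers: every consumer healthy on the target and replication caught up, or fall back to the rollback path. A simplified sketch, with hypothetical health inputs:

```python
def cutover_decision(consumer_health, replica_lag_s, max_lag=2.0):
    """Go/no-go for a shared-database cutover window.

    consumer_health: mapping of service name -> bool (healthy on target).
    Every consumer must be ready before the database flips; otherwise
    the pre-validated rollback plan is triggered instead.
    """
    unhealthy = [svc for svc, ok in consumer_health.items() if not ok]
    if unhealthy or replica_lag_s > max_lag:
        return ("rollback", unhealthy)
    return ("cutover", [])
```

The point of pre-validating the rollback is that this decision can be made mechanically inside the window, under time pressure, rather than debated live.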

Four-month timeline: This was genuinely tight. Scope was fixed — you can’t un-migrate services. Execution had to be right the first time. Daily standups, dependency tracking, and weekly risk reviews were non-negotiable.

Cross-region users: Active users in Brazil while migrating Indian infrastructure meant cutover windows were narrow. We used geo-based DNS routing to shift traffic by region rather than globally.
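The effect of geo-based routing is that each region's traffic can be flipped independently: India moves to GCP while Brazil keeps serving from AWS, keeping each cutover window narrow and local. A toy model of that per-region routing state (region codes and platform labels are illustrative):

```python
# Per-region cutover state: flip one region at a time rather than globally.
# "in" has already cut over to GCP; "br" still serves from AWS.
REGION_TARGET = {"in": "gcp", "br": "aws"}

def resolve_origin(user_region, default="aws"):
    """Pick the serving platform for a user's region (geo-DNS analogue).

    Unknown regions fall back to the not-yet-migrated default, so a
    routing gap never sends traffic to an unvalidated target.
    """
    return REGION_TARGET.get(user_region, default)
```

In practice this logic lives in the DNS provider's geolocation routing policy, not application code; the sketch just shows the state being managed.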

Outcomes

  • 40+ application services successfully migrated
  • 48 ML services migrated and validated
  • 14 RDS → Cloud SQL, 7 DynamoDB → Bigtable/Cloud SQL
  • Zero downtime — no user-visible incidents during migration
  • 4-month delivery timeline met
  • Improved operational cost efficiency post-migration

Business Impact

The migration aligned Koo’s infrastructure with their cost optimization goals and positioned them on a platform with stronger AI/ML tooling — directly relevant for a recommendation-driven social platform.

Lessons Learned

Dependency mapping is not a one-time activity. Services that looked independent had hidden shared state (caches, queues, shared config buckets). Continuous dependency validation — not just upfront discovery — was essential.

Timeline pressure clarifies priorities. Four months forced ruthless triage. Not everything could be modernized during migration. Understanding what to migrate as-is vs. what to re-architect was as important as the migration execution itself.