Cloudythings Blog

Idempotent Migration Trains for Platform Modernization

A battle-tested pattern for orchestrating multi-month platform migrations with GitOps, idempotent runbooks, and progressive guardrails.

February 06, 2024 at 09:29 AM EST
Tags: Platform Engineering, Idempotency, Migration, GitOps, SRE
[Image: Platform migration team working through runbooks pinned to a glass wall. Credit: Leon / Unsplash]

Platform migrations rarely fail because of technology; they fail because humans cannot replay them safely. Cloud region moves, Kubernetes upgrades, and data center exits each involve thousands of changes, often documented as manual steps in Confluence or run ad hoc by “heroes.” The industry has been vocal about this pain. Thought Machine’s KubeCon talks, Shopify’s replatforming series, and Netflix’s “migration trains” blog posts all point to the same solution: make migrations idempotent, observable, and driven by Git.

At Cloudythings we built an “idempotent migration train” blueprint to shepherd complex multi-month migrations. This post explains how it works.

Break migrations into immutable phases

We decompose the migration into phases called train cars. Each car has:

  • A declarative definition (Terraform, Helm, Ansible) representing desired end state.
  • Preconditions and invariants documented in a runbook.
  • Roll-forward and rollback steps tested in staging environments.

Cars advance sequentially, and each can be re-run safely: applying a car twice leaves the platform in the same state as applying it once. If a run fails midway, we reset to the previous known-good state and rerun the car. Git tracks every artifact, giving auditors and SREs a reviewable history of what changed and when.
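
The re-run property can be illustrated with a minimal Python sketch, assuming platform state can be modeled as a flat dictionary. `TrainCar`, `State`, and the key names are hypothetical; in a real train, Terraform, Helm, or Ansible would do the converging.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical stand-in for real platform state


@dataclass
class TrainCar:
    name: str
    desired: State  # declarative end state for this car
    preconditions: List[Callable[[State], bool]] = field(default_factory=list)

    def run(self, current: State) -> State:
        """Converge `current` toward `desired`; safe to call repeatedly."""
        for check in self.preconditions:
            if not check(current):
                raise RuntimeError(f"{self.name}: precondition failed")
        # Only keys that differ from the desired state are changed,
        # so a second run finds no drift and returns an equal state.
        drift = {k: v for k, v in self.desired.items() if current.get(k) != v}
        return {**current, **drift}


car = TrainCar(
    name="networking-car-1",
    desired={"cni": "cilium", "mtu": "1450"},
    preconditions=[lambda s: s.get("cluster") == "staging-a"],
)

first = car.run({"cluster": "staging-a", "cni": "calico"})
second = car.run(first)  # re-running the car changes nothing further
print(first == second)
```

The precondition check runs on every attempt, so a car refuses to start against the wrong target rather than partially applying.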

Use GitOps to orchestrate trains

The migration control plane is a Git repository:

  • Train manifests describe target clusters, services, and dependencies.
  • Automation pipelines (Argo Workflows, Jenkins, or GitHub Actions) execute cars by applying manifests and runbooks.
  • Progress dashboards in Backstage visualize car status, approvals, and evidence.

We borrow heavily from Monzo’s “platform change control” playbook, where Git and pipelines provide the single source of truth.
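
As a rough sketch of how a pipeline might walk a train manifest, the Python below orders cars by their declared dependencies and skips any car already marked done. The manifest layout, `car_order`, `run_train`, and the in-memory `status` dictionary are all assumptions made for illustration; a real manifest would live as a file in the Git repository and the apply step would call the pipeline tooling.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List

# Hypothetical manifest; in Git this would typically be a YAML file.
MANIFEST = {
    "train": "region-move",
    "cars": {
        "network": {"cluster": "prod-eu-1", "depends_on": []},
        "data": {"cluster": "prod-eu-1", "depends_on": ["network"]},
        "services": {"cluster": "prod-eu-1", "depends_on": ["data"]},
    },
}


def car_order(manifest: dict) -> List[str]:
    """Return car names with every dependency ahead of its dependents."""
    graph = {name: set(spec["depends_on"]) for name, spec in manifest["cars"].items()}
    return list(TopologicalSorter(graph).static_order())


def run_train(manifest: dict,
              apply: Callable[[str, dict], bool],
              status: Dict[str, str]) -> Dict[str, str]:
    """Apply cars in order, skip finished ones, stop at the first failure."""
    for car in car_order(manifest):
        if status.get(car) == "done":
            continue  # already applied on an earlier run
        ok = apply(car, manifest["cars"][car])
        status[car] = "done" if ok else "failed"
        if not ok:
            break  # later cars wait until this one succeeds
    return status


status = run_train(MANIFEST, lambda name, spec: True, {})
print(status)
```

Because progress is recorded per car, rerunning the train after a failure resumes at the failed car instead of starting over.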

[Image: Engineers reviewing migration train manifests in a collaborative workspace. Photo by You X Ventures on Unsplash.] Visibility keeps migration anxiety low.

Bake in progressive delivery

Each car follows a progressive rollout:

  1. Dry run in an isolated environment mirroring production.
  2. Canary against a subset (single cluster, AZ, or tenant).
  3. Full rollout once SLO and feature-flag checks pass.

Argo Rollouts or Flagger handle canaries; LaunchDarkly gates risky features. We monitor error-budget burn rates and freeze the train automatically when a car consumes budget faster than its SLO allows, echoing the progressive delivery principles we wrote about in 2022.
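
The freeze logic can be sketched as follows. Burn rate here uses the common SRE definition: observed error ratio divided by the error budget (1 minus the SLO target). The stage names mirror the list above, while `advance`, the 99.9% target, and the `max_burn` threshold of 2.0 are illustrative assumptions; this does not use the Argo Rollouts or Flagger APIs.

```python
from typing import Dict, List, Optional, Tuple

STAGES = ("dry-run", "canary", "full")


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error budget."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def advance(stage_metrics: Dict[str, Tuple[int, int]],
            slo_target: float = 0.999,
            max_burn: float = 2.0) -> Tuple[List[str], Optional[str]]:
    """Walk the stages in order; return (completed stages, stage that froze the train)."""
    completed: List[str] = []
    for stage in STAGES:
        errors, requests = stage_metrics[stage]
        if burn_rate(errors, requests, slo_target) > max_burn:
            return completed, stage  # freeze: do not promote further
        completed.append(stage)
    return completed, None


# (errors, requests) observed during each stage; values are made up.
observed = {"dry-run": (0, 100), "canary": (5, 1000), "full": (0, 0)}
completed, frozen_at = advance(observed)
print(completed, frozen_at)
```

With these sample numbers the canary sees a 0.5% error ratio against a 0.1% budget, so the train freezes before the full rollout.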

Embed observability and evidence

Migrations create uncertainty. We collect:

  • Pre/post SLO snapshots for critical services.
  • Performance baselines (latency, tail distributions) stored alongside manifests.
  • System health dashboards keyed to train cars. If “Networking Car 3” impacts latency, responders know where to look.

Evidence is attached to each Git pull request so auditors can see that controls were followed, a pattern inspired by Keptn’s evidence-based deployments.
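
One way to turn pre/post snapshots into an evidence record is sketched below. It assumes every metric is "lower is better" with a nonzero baseline, and the names `build_evidence` and `max_regression_pct`, the 5% tolerance, and the metric keys are hypothetical. How the JSON reaches the pull request (comment, file, or check) is left to the pipeline.

```python
import json
from typing import Dict


def build_evidence(car: str,
                   pre: Dict[str, float],
                   post: Dict[str, float],
                   max_regression_pct: float = 5.0) -> dict:
    """Compare pre/post snapshots; assumes lower values are better and pre > 0."""
    metrics = {}
    for name, before in pre.items():
        after = post[name]
        change_pct = round((after - before) / before * 100.0, 2)
        metrics[name] = {
            "before": before,
            "after": after,
            "change_pct": change_pct,
            "within_tolerance": change_pct <= max_regression_pct,
        }
    passed = all(m["within_tolerance"] for m in metrics.values())
    return {"car": car, "metrics": metrics, "passed": passed}


evidence = build_evidence(
    "networking-car-3",
    pre={"p99_latency_ms": 200.0, "error_rate_pct": 0.10},
    post={"p99_latency_ms": 204.0, "error_rate_pct": 0.10},
)
# Serialized form that a pipeline could attach to the pull request.
print(json.dumps(evidence, indent=2, sort_keys=True))
```

Keying the record to the car name keeps the evidence traceable to the same unit that dashboards and responders use.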

Automate rollback

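Because every car is idempotent, rollback is deterministic: reapplying the previous desired state converges on the same result each time. In practice: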

  • Each car stores a rollback manifest and runbook.
  • On failure, pipelines revert the Git commit, reapply manifests, and validate via automated smoke tests.
  • Rollbacks are rehearsed in staging weekly so they do not go stale.

Netflix’s migration teams emphasized the comfort this provides—humans trust automation they practice regularly.
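
The revert, reapply, and validate sequence might look like the sketch below. The steps are injected as plain callables so the same function can be rehearsed in staging; `rollback`, its parameters, and the smoke-test names are assumptions for illustration. In a real pipeline the revert step would typically wrap `git revert` and the reapply step would trigger the GitOps sync.

```python
from typing import Callable, Dict, List


def rollback(car: str,
             revert_commit: Callable[[str], None],
             reapply: Callable[[str], None],
             smoke_tests: Dict[str, Callable[[], bool]]) -> dict:
    """Revert the car's commit, reapply manifests, then run smoke tests."""
    steps: List[str] = []
    revert_commit(car)  # restore the previous desired state in Git
    steps.append("reverted")
    reapply(car)  # converge the platform on that state
    steps.append("reapplied")
    failures = [name for name, test in smoke_tests.items() if not test()]
    steps.append("validated" if not failures else "validation-failed")
    return {"car": car, "steps": steps, "failed_smoke_tests": failures}


audit: List[str] = []
result = rollback(
    "networking-car-3",
    revert_commit=lambda car: audit.append(f"revert {car}"),
    reapply=lambda car: audit.append(f"apply {car}"),
    smoke_tests={"dns-resolves": lambda: True, "ingress-200": lambda: True},
)
print(result["steps"])
```

Returning the list of failed smoke tests, rather than a bare boolean, gives responders a starting point when a rollback does not validate cleanly.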

Align humans and process

Technology fails without people:

  • Migration control rooms (virtual war rooms) operate during cutovers with SRE, product, and compliance present.
  • Status broadcasting keeps stakeholders informed via Slack, email, or dashboards.
  • Decision logs document why trains pause or resume.

We also rotate responsibilities to avoid hero culture. Every engineer gets to run a car—after practicing in staging—spreading knowledge and resilience.

Measure success

We track:

  • Mean time per train car (aim: predictable cadence).
  • Unplanned rollback rate (target <5%).
  • Error budget consumption per phase.
  • Stakeholder confidence via surveys (did teams feel informed and in control?).

These metrics inform retrospective improvements for the next migration.
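
The first two metrics are simple to compute from a log of car runs. The sketch below assumes a hypothetical `CarRun` record and a non-empty run list; the sample durations are invented, and only the 5% rollback target comes from the list above.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class CarRun:
    car: str
    hours: float
    rolled_back: bool = False
    planned_rollback: bool = False  # e.g. a scheduled rehearsal


def summarize(runs: List[CarRun], rollback_target: float = 0.05) -> dict:
    """Summarize cadence and unplanned rollbacks; expects at least one run."""
    durations = [r.hours for r in runs]
    unplanned = sum(1 for r in runs if r.rolled_back and not r.planned_rollback)
    rate = unplanned / len(runs)
    return {
        "mean_hours_per_car": mean(durations),
        "hours_stddev": pstdev(durations),  # low spread suggests a predictable cadence
        "unplanned_rollback_rate": rate,
        "meets_rollback_target": rate < rollback_target,
    }


runs = [
    CarRun("network-1", 4.0),
    CarRun("network-2", 6.0, rolled_back=True),
    CarRun("data-1", 5.0),
    CarRun("data-2", 5.0, rolled_back=True, planned_rollback=True),
]
summary = summarize(runs)
print(summary)
```

Separating planned from unplanned rollbacks keeps weekly rehearsals from inflating the failure rate.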

Idempotent migration trains are not theory. They synthesize practices from the SRE, DevOps, and GitOps communities into a repeatable ritual. When runbooks are executable, evidence is automatic, and humans collaborate transparently, platform modernizations stop being “big bang” events—and start feeling like reliable product releases.