Platform migrations rarely fail because of technology; they fail because humans cannot replay them safely. Cloud region moves, Kubernetes upgrades, data center exits—each involves thousands of changes, often scripted in Confluence or run ad hoc by “heroes.” The industry has been vocal about this pain. Thought Machine’s KubeCon talks, Shopify’s replatforming series, and Netflix’s “migration trains” blog posts all point to the same solution: make migrations idempotent, observable, and driven by Git.
At Cloudythings we built an “idempotent migration train” blueprint to shepherd complex multi-month migrations. This post explains how it works.
Break migrations into immutable phases
We decompose the migration into phases called train cars. Each car has:
- A declarative definition (Terraform, Helm, Ansible) representing desired end state.
- Preconditions and invariants documented in a runbook.
- Roll-forward and rollback steps tested in staging environments.
Cars advance sequentially, and each one is idempotent: re-running a car converges on the same end state instead of repeating changes. If a run fails midway, we reset to the previous known-good state and rerun the car. Git tracks every definition and runbook, giving auditors and SREs a reviewable history.
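To make the re-run property concrete, here is a minimal sketch of a car that only touches resources that differ from the desired state. The `Car`, `Environment`, and `apply_car` names are hypothetical, and the dictionary stands in for real infrastructure that Terraform, Helm, or Ansible would manage.

```python
from dataclasses import dataclass, field


@dataclass
class Car:
    """One train car: a named desired end state (hypothetical model)."""
    name: str
    desired: dict  # resource name -> desired configuration


@dataclass
class Environment:
    """Stand-in for real infrastructure; a plain dict for illustration."""
    state: dict = field(default_factory=dict)


def apply_car(car: Car, env: Environment) -> list[str]:
    """Converge env toward car.desired and return the resources changed.

    Only resources that differ from the desired state are written, so a
    second run against a converged environment changes nothing.
    """
    changed = []
    for resource, config in car.desired.items():
        if env.state.get(resource) != config:
            env.state[resource] = dict(config)
            changed.append(resource)
    return changed


car = Car(
    "networking-car-1",
    {"vpc": {"cidr": "10.0.0.0/16"}, "dns": {"zone": "example.internal"}},
)
env = Environment()
first_run = apply_car(car, env)   # both resources are created
second_run = apply_car(car, env)  # nothing left to change
```

The second call returns an empty list, which is the behavior that makes it safe to rerun a car after a partial failure.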
Use GitOps to orchestrate trains
The migration control plane is a Git repository:
- Train manifests describe target clusters, services, and dependencies.
- Automation pipelines (Argo Workflows, Jenkins, or GitHub Actions) execute cars by applying manifests and runbooks.
- Progress dashboards in Backstage visualize car status, approvals, and evidence.
We borrow heavily from Monzo’s “platform change control” playbook, where Git and pipelines provide the single source of truth.
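One way to picture the control plane is a manifest that lists cars and their dependencies, plus a runner that resumes where the last run stopped. This is a sketch under assumptions: the manifest layout, the `run_train` function, and the `execute` callback are hypothetical and do not reflect the actual Cloudythings repository or any specific pipeline tool.

```python
from typing import Callable

# Hypothetical train manifest, as it might be parsed from a file in Git.
TRAIN = {
    "name": "region-move",
    "cars": [
        {"name": "networking", "depends_on": []},
        {"name": "data-replication", "depends_on": ["networking"]},
        {"name": "service-cutover", "depends_on": ["networking", "data-replication"]},
    ],
}


def run_train(train: dict, execute: Callable[[str], bool], completed: set[str]) -> list[str]:
    """Run cars in manifest order and return the names that ran successfully.

    Completed cars are skipped, and the run stops at the first failure so a
    later invocation can resume from the failed car.
    """
    ran = []
    for car in train["cars"]:
        name = car["name"]
        if name in completed:
            continue  # already at its end state
        missing = [dep for dep in car["depends_on"] if dep not in completed]
        if missing:
            raise RuntimeError(f"{name} is blocked by incomplete cars: {missing}")
        if not execute(name):
            break  # leave later cars untouched
        completed.add(name)
        ran.append(name)
    return ran


# Simulate a run where networking is already done and the cutover fails.
completed = {"networking"}
ran = run_train(TRAIN, lambda name: name != "service-cutover", completed)
```

In practice the `execute` callback would trigger a pipeline job, and the `completed` set would be derived from state recorded in Git rather than held in memory.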
Bake in progressive delivery
Each car follows a progressive rollout:
- Dry run in an isolated environment mirroring production.
- Canary against a subset (single cluster, AZ, or tenant).
- Full rollout once SLO and feature-flag checks pass.
Argo Rollouts or Flagger handle canaries; LaunchDarkly gates risky features. We monitor error-budget burn rates and freeze the train automatically when a car is consuming budget too quickly, echoing the progressive delivery principles we wrote about in 2022.
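The freeze decision can be sketched with the usual burn-rate definition: the observed error ratio divided by the error ratio the SLO allows. The function names and the `max_burn_rate` threshold below are illustrative assumptions, not values from our pipelines.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Return how fast the error budget is burning (1.0 means exactly on budget)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if requests == 0:
        return 0.0  # no traffic, so no budget consumed
    allowed_error_ratio = 1.0 - slo_target
    return (errors / requests) / allowed_error_ratio


def should_freeze(errors: int, requests: int, slo_target: float, max_burn_rate: float) -> bool:
    """Freeze the train when the canary burns budget faster than the threshold."""
    return burn_rate(errors, requests, slo_target) > max_burn_rate


# Hypothetical canary windows against a 99.9% availability SLO.
freeze_quiet = should_freeze(errors=5, requests=10_000, slo_target=0.999, max_burn_rate=2.0)
freeze_noisy = should_freeze(errors=50, requests=10_000, slo_target=0.999, max_burn_rate=2.0)
```

With these inputs the quiet window burns at roughly half the allowed rate and passes, while the noisy window burns at roughly five times the allowed rate and triggers a freeze.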
Embed observability and evidence
Migrations create uncertainty. We collect:
- Pre/post SLO snapshots for critical services.
- Performance baselines (latency, tail distributions) stored alongside manifests.
- System health dashboards keyed to train cars. If “Networking Car 3” impacts latency, responders know where to look.
Evidence attaches to each Git PR, enabling auditors to see that controls were followed—a pattern inspired by Keptn’s evidence-based deployments.
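As an illustration of what such evidence might look like, the sketch below compares pre and post snapshots and emits a JSON record that could be attached to a PR. The metric names, the `tolerance` parameter, and the record layout are assumptions, and every metric is treated as "lower is better".

```python
import json


def build_evidence(car: str, pre: dict, post: dict, tolerance: float) -> dict:
    """Compare snapshots taken before and after a car and flag regressions.

    A metric regresses when its post value exceeds the pre value by more than
    the given fractional tolerance. Both snapshots must share the same keys.
    """
    regressions = {}
    for metric, before in pre.items():
        after = post[metric]
        if after > before * (1 + tolerance):
            regressions[metric] = {"before": before, "after": after}
    return {
        "car": car,
        "pre": pre,
        "post": post,
        "regressions": regressions,
        "passed": not regressions,
    }


evidence = build_evidence(
    car="networking-car-3",
    pre={"p99_latency_ms": 200.0, "error_ratio": 0.001},
    post={"p99_latency_ms": 260.0, "error_ratio": 0.001},
    tolerance=0.10,
)
evidence_json = json.dumps(evidence, indent=2, sort_keys=True)
```

Here the p99 latency rose by 30 percent against a 10 percent tolerance, so the record marks the car as failed and names the metric that regressed.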
Automate rollback
Because each car is idempotent, rollback is repeatable and predictable:
- Each car stores a rollback manifest and runbook.
- On failure, pipelines revert the Git commit, reapply manifests, and validate via automated smoke tests.
- We rehearse rollbacks in staging weekly so they keep working as the platform changes.
Netflix’s migration teams emphasized the comfort this provides—humans trust automation they practice regularly.
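The revert, reapply, and validate sequence above can be sketched as a single function. This is a simplified model: the `history` list stands in for Git history (a real `git revert` adds a new commit instead of removing one), and `rollback`, `apply_state`, and the smoke test are hypothetical names.

```python
from typing import Callable


def rollback(
    history: list[dict],
    apply: Callable[[dict], None],
    smoke_tests: dict[str, Callable[[], bool]],
) -> dict:
    """Return to the previous desired state, reapply it, and run smoke tests."""
    if len(history) < 2:
        raise ValueError("no previous known-good state to roll back to")
    history.pop()          # stands in for reverting the Git commit
    target = history[-1]
    apply(target)          # reapply the previous manifests
    failed = [name for name, check in smoke_tests.items() if not check()]
    return {"target": target, "healthy": not failed, "failed_checks": failed}


# Simulated live environment that currently runs the newer state.
live = {"replicas": 5}


def apply_state(state: dict) -> None:
    live.clear()
    live.update(state)


history = [{"replicas": 3}, {"replicas": 5}]
result = rollback(
    history,
    apply_state,
    {"replicas_restored": lambda: live["replicas"] == 3},
)
```

Returning the list of failed checks, and not only a boolean, gives responders a starting point when a rollback itself needs attention.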
Align humans and process
Technology fails without people:
- Migration control rooms (virtual war rooms) operate during cutovers with SRE, product, and compliance present.
- Status broadcasting keeps stakeholders informed via Slack, email, or dashboards.
- Decision logs document why trains pause or resume.
We also rotate responsibilities to avoid hero culture. Every engineer gets to run a car—after practicing in staging—spreading knowledge and resilience.
Measure success
We track:
- Mean time per train car (aim: predictable cadence).
- Unplanned rollback rate (target <5%).
- Error budget consumption per phase.
- Stakeholder confidence via surveys (did teams feel informed and in control?).
These metrics inform retrospective improvements for the next migration.
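The first two metrics are simple to compute from a log of car runs. The sketch below uses made-up sample data and a hypothetical record layout; only the 5% target comes from the list above.

```python
from statistics import mean

# Made-up run log for illustration; field names are hypothetical.
runs = [
    {"car": "networking", "hours": 4.0, "unplanned_rollback": False},
    {"car": "data-replication", "hours": 6.0, "unplanned_rollback": True},
    {"car": "service-cutover", "hours": 5.0, "unplanned_rollback": False},
    {"car": "decommission", "hours": 5.0, "unplanned_rollback": False},
]

ROLLBACK_TARGET = 0.05  # unplanned rollback rate should stay below 5%


def summarize(runs: list[dict]) -> dict:
    """Compute mean time per car and the unplanned rollback rate."""
    if not runs:
        raise ValueError("at least one car run is required")
    rollback_rate = sum(r["unplanned_rollback"] for r in runs) / len(runs)
    return {
        "mean_hours_per_car": mean(r["hours"] for r in runs),
        "unplanned_rollback_rate": rollback_rate,
        "meets_rollback_target": rollback_rate < ROLLBACK_TARGET,
    }


summary = summarize(runs)
```

With this sample, one rollback in four runs gives a 25% rate, which misses the target and would be a topic for the retrospective.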
Idempotent migration trains are not theory. They synthesize practices from the SRE, DevOps, and GitOps communities into a repeatable ritual. When runbooks are executable, evidence is automatic, and humans collaborate transparently, platform modernizations stop being “big bang” events—and start feeling like reliable product releases.