Cloudythings Blog

Closing the Loop on Infrastructure Drift with GitOps

How we pair Terraform, Crossplane, and Argo CD to detect, surface, and remediate drift before it wakes up the on-call engineer.

May 31, 2022 at 09:31 AM EST · 11 min read
GitOps · Infrastructure as Code · Drift Detection · SRE · Automation
Engineer reviewing infrastructure drift reports on a tablet
Image: Brooke Cagle / Unsplash

Infrastructure drift is the silent killer of reliability. A terraform plan shows a clean diff, yet the production environment behaves nothing like the code. An architect flips a cloud console switch “just this once,” or a break-glass script modifies a security group without leaving a trail. Six weeks later an incident hits and nobody remembers the change. Atlassian, HashiCorp, and Verizon have all shared stories on Medium about chasing ghosts caused by drift.

GitOps promises immutable truth, but only if we wire detection and remediation into the operating model. At Cloudythings we built a drift-control blueprint that blends Terraform, Crossplane, and Argo CD with policy, observability, and incident automation. Here is how we keep IaC honest.

Map your sources of truth

Start by cataloging the tools that declare infrastructure:

  • Terraform for cloud primitives (VPC, IAM, load balancers).
  • Crossplane for managed services exposed as Kubernetes Custom Resources.
  • Kubernetes manifests managed by Argo CD or Flux.

Each system owns a slice of the big picture. Document ownership, drift tolerance, and remediation expectations. We use a sources-of-truth.yml manifest stored in Git to track which teams govern which components.
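
For illustration, a trimmed sources-of-truth.yml might look like the sketch below. The schema is our own convention rather than any tool’s format, and the repository names, teams, and tolerances are placeholders.

    # sources-of-truth.yml -- illustrative schema; every field name and value is a placeholder
    - component: networking                 # VPCs, subnets, load balancers
      source: terraform
      repo: github.com/cloudythings/infra-network
      owner: platform-networking
      drift_tolerance: none                 # any drift opens a PR for review
      remediation: pull-request

    - component: managed-databases
      source: crossplane
      repo: github.com/cloudythings/platform-apis
      owner: data-platform
      drift_tolerance: auto-heal            # the controller reconciles silently
      remediation: auto

    - component: workloads
      source: argo-cd
      repo: github.com/cloudythings/k8s-apps
      owner: app-teams
      drift_tolerance: 2h                   # OutOfSync longer than this alerts
      remediation: auto-sync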

Instrument drift detection pipelines

We run three categories of drift checks:

  1. Terraform plan audits: Nightly GitHub Actions run terraform plan -detailed-exitcode using read-only credentials. Plans that detect drift publish summaries to Slack and create GitHub issues annotated with impacted resources (a workflow sketch follows this list).
  2. Argo CD diff alerts: Argo CD reports an application as OutOfSync whenever the live state diverges from Git. We scrape that signal and alert when drift persists longer than agreed (usually 2 hours); see the alert rule sketch below. Teams like Intuit and Adobe rely on the same signals, as described in their KubeCon talks.
  3. Crossplane health checks: Crossplane’s status conditions tell us when the actual cloud resource deviates from the desired spec. We stream these conditions into Prometheus and define SLOs measuring reconciliation time.
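
The nightly Terraform audit is wired up as a scheduled workflow. The sketch below is illustrative only: the working directory, secret names, and the final notification step are placeholders for whatever your pipeline actually uses.

    # .github/workflows/drift-audit.yml -- a minimal sketch, not our exact pipeline
    name: nightly-drift-audit
    on:
      schedule:
        - cron: "0 4 * * *"                  # nightly, after most deploys settle

    jobs:
      terraform-plan-audit:
        runs-on: ubuntu-latest
        permissions:
          contents: read
          issues: write                      # lets a later step open a drift issue
        steps:
          - uses: actions/checkout@v4
          - uses: hashicorp/setup-terraform@v3
            with:
              terraform_wrapper: false       # keep plain exit codes

          - name: Plan with read-only credentials
            id: plan
            working-directory: infra/prod    # placeholder path
            env:
              AWS_ACCESS_KEY_ID: ${{ secrets.READONLY_AWS_ACCESS_KEY_ID }}          # placeholder secrets
              AWS_SECRET_ACCESS_KEY: ${{ secrets.READONLY_AWS_SECRET_ACCESS_KEY }}
            run: |
              terraform init -input=false
              set +e
              terraform plan -detailed-exitcode -input=false -no-color -out=drift.tfplan
              code=$?
              set -e
              # -detailed-exitcode: 0 = clean, 1 = error, 2 = drift detected
              echo "exitcode=$code" >> "$GITHUB_OUTPUT"
              if [ "$code" -eq 1 ]; then exit 1; fi

          - name: Surface drift
            if: steps.plan.outputs.exitcode == '2'
            run: |
              # Replace with your Slack webhook / issue automation of choice.
              echo "Drift detected in infra/prod; plan saved as drift.tfplan"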

Automation is critical. Manual review scales poorly and leaves blind spots.
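
To make the Argo CD and Crossplane checks concrete, here is a minimal Prometheus rule sketch. The argocd_app_info metric and its sync_status label come from Argo CD’s built-in exporter; the crossplane_managed_resource_ready gauge is an assumption standing in for however you export Crossplane conditions (for example via kube-state-metrics custom-resource state).

    # drift-alerts.yml -- Prometheus rules sketch
    groups:
      - name: gitops-drift
        rules:
          - alert: ArgoAppOutOfSyncTooLong
            # Exported by the Argo CD application controller.
            expr: argocd_app_info{sync_status="OutOfSync"} == 1
            for: 2h                          # the agreed drift tolerance
            labels:
              severity: ticket
            annotations:
              summary: "{{ $labels.name }} has been OutOfSync for more than 2 hours"

          - alert: CrossplaneResourceNotReconciled
            # Hypothetical gauge: 1 when the managed resource's Ready condition is True.
            expr: crossplane_managed_resource_ready == 0
            for: 30m
            labels:
              severity: ticket
            annotations:
              summary: "{{ $labels.name }} has not converged to its desired spec"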

Operations team reviewing nightly drift detection reports on large monitors
Photo by Taylor Vick on Unsplash. Drift reports become useful when they feed into visible dashboards.

Create remediation runbooks

Detection is noise without action. We codify remediation pathways:

  • Auto-heal safe resources. For stateless components (e.g., ConfigMaps, IAM policies), Argo CD and Crossplane auto-reconcile (see the Application sketch after this list). If drift reappears, we escalate to human review.
  • Open PRs for risky changes. When Terraform detects drift, automation opens a pull request with the generated plan file attached. Engineers review, approve, and merge to reapply state—no console fiddling.
  • Document context for exceptions. Sometimes drift is intentional (e.g., an emergency firewall block). We record the context in Git as a temporary override with an expiry timestamp (an example follows the runbook notes below), similar to the “time bomb” pattern Slack documented for feature flags.
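
The auto-heal path hinges on Argo CD’s automated sync policy with selfHeal enabled, which reverts out-of-band edits without human involvement. The Application below is a sketch with placeholder names and repository URL.

    # Argo CD Application sketch for the auto-heal path (names are placeholders)
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: payments-config
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/cloudythings/k8s-apps   # placeholder repo
        targetRevision: main
        path: payments/config
      destination:
        server: https://kubernetes.default.svc
        namespace: payments
      syncPolicy:
        automated:
          prune: true        # remove resources that were deleted from Git
          selfHeal: true     # revert manual edits back to the Git state
        syncOptions:
          - CreateNamespace=true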

Runbooks live alongside code. Step-by-step instructions include links to Grafana dashboards, Argo apps, and Terraform Cloud workspaces. Everything is idempotent: reapply the manifest, watch the state converge.
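
Exceptions get the same Git treatment. The override record below sketches our convention only; the schema, Terraform address, incident details, and dates are all placeholders, and a scheduled check starts nagging once the expiry passes.

    # overrides/2022-05-27-edge-ingress-block.yml -- hypothetical override record
    kind: DriftOverride
    resource: aws_security_group.edge_ingress        # placeholder Terraform address
    reason: "Emergency block of a scanning CIDR during an active incident"
    approved_by: on-call-networking
    created_at: "2022-05-27T03:12:00Z"
    expires_at: "2022-06-10T00:00:00Z"               # the time bomb: codify or revert by then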

Integrate policy-as-code

Policy prevents drift at the source:

  • AWS Config rules / Azure Policy detect out-of-band changes and trigger cloud-native remediation.
  • OPA/Conftest checks in CI block PRs that try to modify resources directly in Kubernetes without accompanying Terraform updates (a CI sketch follows this list).
  • Audit webhooks capture manual console actions. We stream AWS CloudTrail, GCP Audit Logs, and Azure Activity Logs into a data lake tagged with Terraform resource IDs. When someone toggles a console switch, we know.
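
As an example of that CI gate, the workflow sketch below runs Conftest against rendered manifests on every pull request. The paths, the policy layout, and the assumption that conftest and kustomize are already available on the runner are ours, not givens.

    # .github/workflows/policy-gate.yml -- illustrative sketch
    name: policy-gate
    on: pull_request

    jobs:
      conftest:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Render manifests
            run: kustomize build k8s/overlays/prod > rendered.yaml   # or helm template, or plain manifests
          - name: Block changes that bypass Terraform
            run: conftest test rendered.yaml --policy policy/        # Rego rules live in policy/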

We took cues from HashiCorp’s Sentinel examples and Google Cloud’s security blueprints. Policy-as-code moves the conversation from detective work to preventive guardrails.

Surface drift with empathy

Drift alerts can demoralize teams if they feel like constant nagging. We present information in context:

  • Backstage plugin: Engineers see their service’s drift posture—pending plans, Argo diffs, Crossplane conditions—in one dashboard.
  • Weekly scorecards: We share how many drifts auto-healed vs. required intervention, similar to the “Ops Review” format Etsy popularized.
  • Mean time to reconciliation (MTTR): We treat drift like an incident metric. Our goal is <24 hours for human-reviewed drifts and <15 minutes for auto-healed ones.

Transparency turns drift from a blame game into an operational KPI.

Close the loop with incident response

When drift triggers an incident, we make the linkage explicit:

  • Incident timeline enrichment: Our PagerDuty automation posts the last drift report touching the impacted resource directly into the incident Slack channel.
  • Post-incident retrospectives: We tag incidents with root causes (e.g., “manual console change,” “incomplete Terraform module”) and feed them into platform OKRs.
  • Game days: Quarterly, we intentionally drift safe resources to verify detection and remediation still work. It is a chaos engineering practice that keeps teams sharp.

The outcomes

Teams that embrace this blueprint report:

  • 60–80% fewer “mystery incidents” caused by unknown infrastructure changes.
  • Confidence during audits because every manual change has a paper trail.
  • Faster delivery velocity—engineers trust GitOps pipelines, knowing drift will not surprise them in the middle of a launch.

The lesson we keep relearning from industry case studies—whether from Medium, Segment, or Thought Machine—is that infrastructure is only as reliable as its feedback loops. GitOps gives you the declarative backbone. Drift detection, policy-as-code, and empathetic surfacing are what keep that backbone healthy. Do that, and your on-call engineers will finally sleep through the night.