Cloudythings Blog

CI/CD for Data Platforms that SREs Can Trust

Adapting GitOps, contract testing, and observability to deliver reliable data pipelines, warehouses, and lakehouse transformations.

May 28, 2024 at 09:33 AM EST · 12 min read
CI/CD · Data Engineering · GitOps · Observability · Testing
Data engineers orchestrating pipelines on shared screens
Image: Kaleidico / Unsplash

Data pipelines used to live outside the CI/CD conversation. Airflow DAGs deployed from laptops, dbt models run ad hoc, lakehouse tables mutated manually. Then incidents happened—broken schemas, missing partitions, compliance breaches. Companies like Netflix, Shopify, and Monzo responded by treating data platforms like software: versioned, tested, observable.

We helped a media company overhaul its data platform using GitOps-inspired CI/CD. Here is the toolkit.

Declarative everything

We store:

  • dbt models and macros in Git with environment-specific configs managed through YAML overlays.
  • Airflow/Prefect DAG definitions as code, synced via GitOps (see the DAG sketch below).
  • Warehouse permissions (Snowflake, BigQuery) expressed as Terraform modules.

Every change flows through pull requests reviewed by data engineers and SREs. Promotions happen via Argo CD or Terraform Cloud.

Engineer reviewing data pipeline deployment dashboards
Photo by Taylor Vick on Unsplash. Data pipelines deserve real deployment tooling.

Shift-left testing

CI pipelines run:

  • Unit tests with dbt’s run-operation and pytest for transformations.
  • Contract tests verifying source schemas using Great Expectations or Soda Core (see the sketch below).
  • Data diffing (Datafold, Elementary) comparing sample outputs between branches.
  • Static analysis to catch expensive queries or missing incremental filters.

Test evidence attaches to PRs. Merges require zero critical test failures and approval from data owners.

Deploy progressively

We adapt progressive delivery:

  • Shadow runs execute new pipelines alongside production, writing to isolated schemas.
  • Gradual promotion copies validated partitions into prod tables after checks pass.
  • Feature flags control data exposure to downstream consumers (Looker, Tableau, ML models).

Deployments monitor row counts, freshness, and quality KPIs. If anomalies appear, promotion halts automatically.

Observability and on-call

We instrument data pipelines with:

  • OpenLineage or Marquez to trace data dependencies.
  • Metric alerts for freshness, volume, schema drift, and cost (Snowflake credits, BigQuery slots).
  • Incident automation tying data quality failures to PagerDuty incidents with runbook links (sketched below).

SREs treat data incidents like application incidents—burn-rate alerts, postmortems, action items.

Govern with policy

OPA/Conftest rules enforce:

  • PII handling policies (masking, tokenization).
  • Access controls (least privilege).
  • Cost limits (query slots, compute hours).

Policies run in CI before merge and in runtime via warehouse-native guards (Row Access Policies, Dynamic Data Masking). Compliance teams review logs through dashboards.

CI/CD for data platforms tightens feedback loops, reduces surprise outages, and aligns SRE with data engineering. Treat data like code, and reliability follows.