Cloudythings Blog

Kubernetes Multitenancy Without the Midnight Pager

Patterns, guardrails, and observability for running multi-tenant clusters that keep compliance, SRE, and platform teams on the same page.

May 03, 2022 at 09:43 AM EST · 12 min read
Kubernetes · Multitenancy · Platform Engineering · Policy as Code · SRE
Platform engineers drawing multi-tenant cluster boundaries on a whiteboard
Image: Leon / Unsplash

Every quarter another platform team posts a Medium retrospective about a multi-tenant Kubernetes outage. The script rarely changes: an overeager namespace saturated cluster resources, a noisy neighbor triggered a cascading failure, or a compliance auditor found that all tenants shared the same RBAC policy. Running shared clusters is tempting—it eliminates the sprawl of one-cluster-per-team—but it courts chaos without discipline.

At Cloudythings we help product organizations adopt multi-tenancy without losing sleep. We learned from pioneers like Spotify (Backstage multi-tenancy), Adobe (fleet-scale clusters), and the maintainers of the CNCF Multi-Tenancy SIG. This article distills six disciplines that keep shared clusters reliable, auditable, and SRE-friendly.

1. Model tenancy profiles up front

Not all tenants need identical guarantees. We define tenancy personas, borrowing from VMware and Red Hat’s multi-tenancy frameworks:

  • Sandbox tenants experiment with minimal guardrails.
  • Product tenants serve customers and demand high availability.
  • Regulated tenants (finance, healthcare) require hardened isolation.

Each persona maps to a Kubernetes profile: namespaces, network policies, runtime classes, and resource quotas. We encode these profiles as Helm charts or Crossplane compositions so developers inherit the right guardrails every time.
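
To make this concrete, here is a minimal sketch of what a persona profile could look like as inputs to a hypothetical tenant-profile Helm chart; the field names, label key, and numbers are illustrative, not a published Cloudythings artifact:

    # values-product.yaml -- illustrative inputs for a hypothetical tenant-profile chart
    tenant: payments
    persona: product
    namespace:
      labels:
        tenancy.example.com/persona: product   # illustrative label key
    quota:
      cpu: "40"
      memory: 160Gi
      pods: "200"
    networkPolicy:
      defaultDeny: true              # zero-trust baseline, see discipline 3
    runtimeClassName: runc           # regulated personas switch to gvisor or kata
    priorityClassName: product-high  # see discipline 2

The chart (or an equivalent Crossplane composition) expands those few fields into the namespace, quota, LimitRange, default-deny NetworkPolicy, and RBAC bindings the persona is entitled to.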

2. Create hard resource boundaries

Resource starvation is the most common failure mode. We combat it with layered limits, sketched in the manifests after this list:

  • Namespace-level quotas covering CPU, memory, pods, PVCs, and load balancers. We use Fairwinds’ Goldilocks and Datadog’s Kubernetes Autoscaling guides as baselines for sizing.
  • LimitRanges enforcing per-pod request and limit defaults and ceilings. Teams cannot deploy an unbounded container even if the quota allows it.
  • PriorityClasses per persona. Critical workloads preempt sandbox workloads during resource crunches, a technique Shopify described when sharing how they prioritize checkout traffic.
  • Vertical Pod Autoscaler (VPA) in recommendation mode. We surface right-sized recommendations in Backstage so teams adjust before SREs page them.
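
As a rough sketch of the first two layers (the namespace and numbers are placeholders, not our production values), a product-persona namespace might carry:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: product-quota
      namespace: payments              # placeholder tenant namespace
    spec:
      hard:
        requests.cpu: "40"
        requests.memory: 160Gi
        limits.cpu: "60"
        limits.memory: 200Gi
        pods: "200"
        persistentvolumeclaims: "50"
        services.loadbalancers: "4"
    ---
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: product-defaults
      namespace: payments
    spec:
      limits:
        - type: Container
          default:                     # applied when a container omits limits
            cpu: 500m
            memory: 512Mi
          defaultRequest:              # applied when a container omits requests
            cpu: 250m
            memory: 256Mi
          max:                         # hard per-container ceiling
            cpu: "4"
            memory: 8Gi

The quota catches aggregate abuse, the LimitRange catches the single runaway container, and PriorityClasses plus VPA recommendations layer on top.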

3. Enforce policy with code, not guidelines

Manual reviews do not scale. We adopt policy-as-code using Gatekeeper and Kyverno; a minimal Kyverno rule is sketched after this list:

  • Admission policies ensure pods run as non-root, use approved runtime classes (gVisor or Firecracker for sensitive tenants), and mount only sanctioned volumes.
  • NetworkPolicies default to zero trust. Teams request egress via GitOps PRs; OPA validates destination lists against an allowlist maintained by security.
  • Image signature checks using Sigstore’s cosigned webhook. Only signed images tied to workload identities may run, aligning with the supply-chain hardening practices Chainguard keeps advocating.
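
For illustration, the non-root check from the first bullet as a Kyverno policy might look like the sketch below; the policy and rule names are ours, and a production bundle would add sibling rules for runtime classes and volume types:

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: require-non-root
    spec:
      validationFailureAction: enforce    # reject non-compliant pods at admission
      rules:
        - name: check-run-as-non-root
          match:
            any:
              - resources:
                  kinds:
                    - Pod
          validate:
            message: "Pods must set spec.securityContext.runAsNonRoot to true."
            pattern:
              spec:
                securityContext:
                  runAsNonRoot: true

The Gatekeeper equivalent expresses the same check as a Rego ConstraintTemplate; either way, the rule lives in Git and goes through the same review as application code.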

Policies ship with regression tests (Conftest, Kyverno CLI). When a platform engineer updates a rule, the test suite replays representative manifests from every tenant to avoid surprises.
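
A regression test for that rule, in the kyverno-test.yaml form the Kyverno CLI replays, stays small; the file and resource names below are illustrative:

    # kyverno-test.yaml -- run in CI with `kyverno test .`
    name: require-non-root-regression
    policies:
      - require-non-root.yaml
    resources:
      - tenant-manifests/payments-api-pod.yaml   # representative tenant manifest
    results:
      - policy: require-non-root
        rule: check-run-as-non-root
        resource: payments-api
        kind: Pod
        result: pass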

4. Layer observability per tenant

SREs need visibility without drowning in cardinality. We:

  • Label everything with tenant, persona, and service. Prometheus and Loki aggregations key off these labels, making it easy to isolate a noisy neighbor (a scrape-config sketch follows this list).
  • Expose “tenant health scorecards” in Grafana or Datadog, inspired by Medium’s observability dashboards. The scorecard tracks SLO compliance, incident volume, resource usage, and policy violations per tenant.
  • Stream audit events (Kubernetes Audit, Gatekeeper, Falco) into a SIEM with tenant tags. When a security team traces risky activity, they know exactly which tenant triggered it.
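
One way to get those labels onto every series, assuming pods already carry tenant, persona, and service labels from the persona chart, is a relabeling step in the Prometheus scrape configuration (the label keys are our convention, not a standard):

    scrape_configs:
      - job_name: tenant-workloads
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Copy the pod's tenancy labels onto every scraped series so
          # dashboards, alerts, and Loki queries can group by tenant.
          - source_labels: [__meta_kubernetes_pod_label_tenant]
            target_label: tenant
          - source_labels: [__meta_kubernetes_pod_label_persona]
            target_label: persona
          - source_labels: [__meta_kubernetes_pod_label_service]
            target_label: service

Grouping any query by tenant or persona then isolates a noisy neighbor in a single expression.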

Tenants also get self-service dashboards via Backstage or Port. Transparency reduces the number of “why was I throttled?” tickets.

SRE examining multi-tenant observability dashboards on a large screen
Photo by Israel Andrade on Unsplash. Per-tenant scorecards keep accountability clear.

5. Harden isolation with the right runtime

Namespaces and policies are necessary but insufficient for regulated workloads. We supplement with:

  • RuntimeClasses pointing to gVisor, Kata Containers, or Firecracker microVMs (via AWS Bottlerocket or Weaveworks Ignite). Shopify and Fly.io have both documented how microVMs protect multi-tenant workloads from kernel exploits. A gVisor RuntimeClass is sketched after this list.
  • Node pools per persona with taints/tolerations. Sensitive tenants schedule only on hardened nodes with SELinux/AppArmor enforced and dedicated patch windows.
  • Distroless base images signed via Cosign. Removing shells and package managers shrinks the attack surface, mirroring the supply-chain defense patterns we wrote about in 2021.
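
As a sketch, a gVisor RuntimeClass that also pins sensitive pods onto the hardened node pool could look like this; the handler assumes containerd is configured with a runsc runtime, and the node label, taint, and workload names are illustrative:

    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: gvisor
    handler: runsc                       # containerd runtime handler for gVisor
    scheduling:
      nodeSelector:
        node-pool: hardened              # label on the hardened node pool
      tolerations:
        - key: dedicated
          operator: Equal
          value: hardened                # matches the taint on hardened nodes
          effect: NoSchedule
    ---
    # A regulated tenant opts in with a single field on the pod spec:
    apiVersion: v1
    kind: Pod
    metadata:
      name: ledger-api
      namespace: ledger
    spec:
      runtimeClassName: gvisor
      containers:
        - name: app
          image: ledger-api:1.4.2        # distroless, Cosign-signed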

We publish isolation guarantees to auditors. Clear documentation makes compliance reviews boring—a win for everyone.

6. Automate incident response drills

Even with guardrails, incidents happen. We prepare by:

  • Simulating noisy neighbors. A chaos job floods the cluster with CPU or network load; we verify quotas and alerts fire correctly. A minimal drill Job is sketched after this list.
  • Running cross-tenant failover drills. If a regulated tenant’s namespace locks up, can we evacuate workloads into a warm standby cluster within SLO?
  • Blameless postmortems per persona. We categorize incidents by guardrail gaps, capacity planning misses, or tenant misconfigurations.
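
A minimal CPU-flood drill, assuming a dedicated sandbox namespace for chaos experiments, can be as simple as a Job that spins busy loops; dedicated tools such as stress-ng or a chaos framework add network and I/O load on top:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: noisy-neighbor-drill
      namespace: sandbox-chaos           # illustrative sandbox tenant namespace
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cpu-burn
              image: alpine:3.19
              # Spin four busy loops for ten minutes, then exit. The drill passes
              # if the namespace quota caps the burn and the per-tenant alerts fire.
              command: ["sh", "-c"]
              args:
                - "for i in 1 2 3 4; do (while true; do :; done) & done; sleep 600"
              resources:
                requests:
                  cpu: "2"
                limits:
                  cpu: "4"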

These drills feed into roadmaps. If the same failure pattern repeats, it becomes a platform OKR. We stole that idea from the Azure reliability blog, which treats repeated outages as signals for platform investment—not human blame.

Governance that scales

Multi-tenancy succeeds when governance is automated and visible. We run a “Tenancy Control Plane” Git repository (laid out roughly as sketched after this list) containing:

  • Persona definitions and quotas.
  • Policy modules (OPA, Kyverno) and their tests.
  • Terraform modules provisioning namespaces, service accounts, and secrets via External Secrets Operator.
  • Runbooks and SLAs per persona.
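
An illustrative layout (the directory names are ours, not a standard) looks like:

    tenancy-control-plane/
    ├── personas/           # persona definitions, quotas, priority classes
    │   ├── sandbox.yaml
    │   ├── product.yaml
    │   └── regulated.yaml
    ├── policies/           # Kyverno and OPA modules plus their Conftest and CLI tests
    ├── infrastructure/     # Terraform modules: namespaces, service accounts, ESO-managed secrets
    └── runbooks/           # per-persona runbooks and SLAs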

Changes flow through PRs with reviewers from SRE, security, and the tenant community. This GitOps approach mirrors the financial regulation-friendly setups Intuit and Swisscom discussed at KubeCon 2021.

The payoff

  • Platform efficiency: Shared clusters run at 65–75% utilization without breaching SLOs.
  • Audit confidence: Every control has evidence—logs, policies, tests—tied to Git commits.
  • Developer happiness: Tenants self-serve namespaces with predictable guardrails instead of filing tickets.
  • SRE breathing room: Incidents drop because guardrails catch risky deployments before they land.

Multi-tenancy is not the easy path, but it can be the smart one. By applying policy-as-code, runtime isolation, and transparent observability, you get the economics of shared infrastructure without the pager fatigue. The success stories are out there; we simply codify them into a disciplined operating model that treats reliability as a first-class requirement.