Cloudythings Blog

MicroVM Ready: Running Firecracker KVM for High-Density Serverless

What it takes to run Firecracker microVMs in production, from image pipelines to multi-tenant isolation for internal platforms.

May 27, 2021 at 09:29 AM EST 11 min read
Firecracker · MicroVM · Serverless · Virtualization · Platform Engineering
Rows of data center servers with blue accent lighting representing microVM density
Image: Taylor Vick / Unsplash

When AWS announced Firecracker at re:Invent 2018, the serverless world buzzed about 125 ms cold-starts and dramatic density gains. Fast-forward to 2021 and the community has gone from curiosity to production reality. Companies like Fly.io, Shopify, and Bloomberg have shared Medium and CNCF blog posts describing how microVMs—virtual machines that boot as quickly as containers—power multi-tenant platforms with strong isolation. At Cloudythings we have spent the past year standing up Firecracker-backed platforms for internal developer portals and ephemeral environments. This article distills those lessons: the architecture, the trade-offs, and the day-two operations.

Why microVMs?

Containers give us density, but they share a kernel. Virtual machines give us isolation, but they are heavy. Firecracker sits in the gap. It runs lightweight VMs with a minimal device model (virtio-net, virtio-block) on top of KVM. MicroVMs boot in <200 ms, consume <5 MB of memory for the virtual machine monitor, and ship with a jailer that confines each VM with cgroups, namespaces, chroot, and dropped privileges, putting Firecracker in the same isolation conversation as gVisor and Kata Containers. Lambda, Fargate, and many internal platforms now rely on Firecracker to isolate untrusted workloads.

Our clients adopt Firecracker for three reasons:

  1. Security isolation: Multi-tenant CI/CD runners or customer workloads gain a dedicated kernel boundary.
  2. Density: MicroVMs can pack dozens of tenants per host without the bloat of traditional VMs.
  3. Determinism: Images are pre-built, signed, and boot reproducibly, aligning with immutable infrastructure goals.

Building the image pipeline

Firecracker boot sources typically include:

  • A kernel binary compiled with the right config (e.g., cgroups v2, overlayfs); on x86_64 Firecracker boots an uncompressed vmlinux ELF image.
  • A root filesystem packaged as an ext4 or SquashFS image.

We treat these artifacts like any other supply-chain component—signed, versioned, and delivered via GitOps.

  1. Define a Packer or Image Builder pipeline that compiles the kernel (often from the Amazon Linux or mainline tree) and assembles the rootfs using Buildroot or Docker-to-rootfs tooling (a minimal sketch of the Docker route follows this list). We prefer Buildroot for deterministic outputs, guided by the Firecracker team’s Medium tutorials.
  2. Inject observability hooks (OpenTelemetry agents, Fluent Bit) into the rootfs. When microVMs have no shell, logs and traces are lifelines.
  3. Sign artifacts with Cosign. We store the kernel and rootfs in S3 or an OCI registry (e.g., ghcr.io/cloudythings/firecracker/rootfs:2021-05-24) with signatures recorded in Rekor.
  4. Automate SBOM generation with Syft. Even minimal rootfs builds should ship with SBOMs for compliance.
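For step 1’s Docker-to-rootfs route, the mechanics boil down to exporting a container’s filesystem into a blank ext4 image. Here is a minimal sketch, assuming a locally built image named cloudythings/runner:latest, root privileges for the loop mount, and an image size that fits your workload (all of those are placeholders, not our actual pipeline):

```python
import subprocess
import tempfile

IMAGE = "cloudythings/runner:latest"  # placeholder image name
ROOTFS = "rootfs.ext4"
SIZE_MB = 1024                        # size to whatever the workload needs

def sh(*cmd):
    """Run a command and fail loudly so CI notices broken builds."""
    subprocess.run(cmd, check=True)

# 1. Create and format an empty ext4 image file.
sh("dd", "if=/dev/zero", f"of={ROOTFS}", "bs=1M", f"count={SIZE_MB}")
sh("mkfs.ext4", "-F", ROOTFS)

# 2. Create (but do not start) a container so we can export its filesystem.
container = subprocess.run(
    ["docker", "create", IMAGE],
    check=True, capture_output=True, text=True,
).stdout.strip()

# 3. Loop-mount the image and unpack the exported filesystem into it.
with tempfile.TemporaryDirectory() as mnt:
    sh("mount", "-o", "loop", ROOTFS, mnt)   # requires root
    try:
        with subprocess.Popen(["docker", "export", container],
                              stdout=subprocess.PIPE) as export:
            subprocess.run(["tar", "-xf", "-", "-C", mnt],
                           stdin=export.stdout, check=True)
            if export.wait() != 0:
                raise RuntimeError("docker export failed")
    finally:
        sh("umount", mnt)
        sh("docker", "rm", container)

print(f"wrote {ROOTFS}; sign and publish it like any other artifact")
```

Buildroot replaces steps 2 and 3 with a fully reproducible build, which is why we prefer it for production images.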
Engineer testing microVM kernels on bare metal servers
Photo by Claudio Schwarz on Unsplash. Firecracker loves bare metal, but cloud instances work too.

Orchestrating microVMs

Firecracker does not ship with an orchestrator. You can use:

  • Kubernetes via Kata Containers or firecracker-containerd runtime classes, which wrap Firecracker microVMs behind familiar pod specs.
  • Nomad driver for Firecracker, used by HashiCorp customers to schedule microVM tasks.
  • Custom orchestrators such as Weave Ignite, Fly.io’s in-house Firecracker tooling (described on their blog), or bespoke controllers.

For internal platforms we prefer firecracker-containerd. It integrates with containerd and uses snapshotters to manage rootfs layers. Our deployment looks like this:

  1. Host provisioning: We allocate dedicated EC2 bare-metal instances (e.g., m5.metal or i3.metal). Standard Nitro instances do not expose nested virtualization, so Firecracker’s KVM requirement effectively means bare metal on AWS, which also keeps performance predictable.
  2. Runtime configuration: Each host runs containerd with the firecracker-containerd runtime shim. The shim manages microVM lifecycles, driving each Firecracker process over its Unix-socket API (see the sketch after this list).
  3. Network setup: We configure the standard CNI plugins (bridge, ptp) and optionally the AWS VPC CNI plugins to map microVMs into our VPC.
  4. Storage: We attach block devices via Firecracker’s drive API, pointing to EBS volumes or snapshotters for ephemeral disks.
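The shim hides this plumbing from developers, but it helps to know what a boot looks like on the wire. The sketch below drives one Firecracker process directly over its Unix-socket API; the socket, kernel, and rootfs paths are placeholders, and required fields vary slightly between Firecracker releases, so treat it as an illustration of the flow rather than a drop-in client:

```python
import http.client
import json
import socket

class FirecrackerAPI(http.client.HTTPConnection):
    """Minimal HTTP-over-Unix-socket client for the Firecracker API."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock

    def call(self, method: str, path: str, body: dict) -> int:
        self.request(method, path, body=json.dumps(body),
                     headers={"Content-Type": "application/json"})
        resp = self.getresponse()
        resp.read()
        self.close()  # one request per connection keeps the sketch simple
        if resp.status >= 300:
            raise RuntimeError(f"{method} {path} returned HTTP {resp.status}")
        return resp.status

# Placeholder socket path; the firecracker (or jailer) process must already
# be running and listening here.
api = FirecrackerAPI("/srv/firecracker/vm-1234/api.sock")

# Sizing; some releases also expect an SMT/ht_enabled field here.
api.call("PUT", "/machine-config", {"vcpu_count": 2, "mem_size_mib": 512})

# Kernel and boot arguments (the boot_args shown are the usual serial-console set).
api.call("PUT", "/boot-source", {
    "kernel_image_path": "/images/vmlinux-5.10",
    "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
})

# Root filesystem as a virtio-block drive.
api.call("PUT", "/drives/rootfs", {
    "drive_id": "rootfs",
    "path_on_host": "/images/rootfs.ext4",
    "is_root_device": True,
    "is_read_only": False,
})

# Boot the microVM.
api.call("PUT", "/actions", {"action_type": "InstanceStart"})
```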

Kubernetes integration offers the richest ecosystem. By defining a RuntimeClass that points to the Firecracker runtime, we let developers deploy pods as usual. Admission controllers enforce which namespaces may use the runtime. Observability agents (Datadog, Prometheus node exporters) run on the host, scraping microVM metrics from Firecracker’s metrics output or custom in-guest agents.
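In Kubernetes terms, the opt-in is a RuntimeClass plus a runtimeClassName on the pod. A hedged sketch below emits both manifests as JSON (which kubectl applies just as happily as YAML); the handler value "firecracker" and the pod image are assumptions and must match the runtime name registered in your containerd configuration:

```python
import json

# RuntimeClass: the name developers reference, mapped to the containerd handler.
runtime_class = {
    "apiVersion": "node.k8s.io/v1",   # use v1beta1 on clusters older than 1.20
    "kind": "RuntimeClass",
    "metadata": {"name": "firecracker"},
    "handler": "firecracker",         # assumption: must equal your containerd runtime name
}

# An ordinary pod that opts into the microVM runtime.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "ci-runner", "namespace": "ci"},
    "spec": {
        "runtimeClassName": "firecracker",
        "containers": [
            {"name": "runner", "image": "ghcr.io/cloudythings/ci-runner:2021-05-24"},  # illustrative
        ],
    },
}

for name, manifest in (("runtimeclass.json", runtime_class), ("pod.json", pod)):
    with open(name, "w") as f:
        json.dump(manifest, f, indent=2)  # then: kubectl apply -f <file>
```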

Multi-tenant isolation story

Isolation lives or dies on defaults:

  • Jailer enforcement: Firecracker’s jailer drops privileges, confines the process with cgroups, namespaces, and chroot, and Firecracker itself installs seccomp filters on top. We ensure every microVM is started through the jailer, never by exec’ing the firecracker binary directly (see the sketch after this list).
  • cgroups & CPU pinning: We assign each microVM CPU shares and memory limits, preventing noisy-neighbor effects. Some clients require cpuset pinning to guarantee performance.
  • Network segmentation: MicroVMs land in dedicated subnets or use Calico/NetworkPolicy to restrict east-west traffic.
  • Image provenance: Rootfs images are immutable and signed; platform teams cannot scp into them. Configuration is done via cloud-init-style metadata or baked settings.
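Concretely, "through the jailer" means the supervisor never execs the firecracker binary itself. The sketch below shows roughly what that invocation looks like; the IDs, uid/gid, and paths are placeholders, and flag names should be checked against the jailer documentation for the release you run:

```python
import subprocess

VM_ID = "vm-1234"                       # unique per microVM
FIRECRACKER_BIN = "/usr/local/bin/firecracker"
CHROOT_BASE = "/srv/jailer"             # jailer builds <base>/firecracker/<id>/root
JAIL_UID = "10001"                      # dedicated unprivileged uid/gid per tenant
JAIL_GID = "10001"

# The jailer (run as root) sets up the chroot, cgroups, and namespaces, drops
# to the unprivileged uid/gid, then execs Firecracker, which installs its own
# seccomp filters. Everything after "--" is passed to the firecracker binary,
# and paths there are resolved inside the chroot.
subprocess.run([
    "jailer",
    "--id", VM_ID,
    "--exec-file", FIRECRACKER_BIN,
    "--uid", JAIL_UID,
    "--gid", JAIL_GID,
    "--chroot-base-dir", CHROOT_BASE,
    "--daemonize",
    "--",
    "--api-sock", "/run/firecracker.sock",
], check=True)
```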

Firecracker’s security whitepaper (published on aws.amazon.com and amplified by numerous community blog posts) stresses audit logging. We log every API call to Firecracker and ship it to CloudWatch or Loki. Combined with Cosign signatures we can prove exactly which image booted for a customer and when.

Managing cold starts and warm pools

MicroVMs boot fast, but they still need to load kernels and mount rootfs. We implement warm pools:

  • Idle microVM pools per workload or tenant. An event-driven scaler (KEDA, custom Lambda) monitors queue depth and starts microVMs ahead of demand.
  • Snapshotting: Firecracker’s snapshot feature saves a VM’s state after boot. Restoring from a snapshot yields <50 ms start times, as documented by the original Firecracker team on the AWS Compute Blog.
  • Prewarming pipelines: Nightly jobs build new snapshots with updated application code. Because rootfs images are immutable, we generate snapshots per release and roll them out via GitOps.

For ephemeral CI runners, we prefer snapshots. A GitHub Actions workflow requests a microVM, the orchestrator clones a snapshot, executes the job, collects artifacts, and tears the VM down. Warm pools ensure developers seldom wait more than a few seconds for isolation-grade compute.
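Mechanically, that clone step is just a couple more calls on the same Unix-socket API used at boot. A sketch, assuming the FirecrackerAPI helper from the orchestration section is in scope and using placeholder paths; the snapshot endpoints were still evolving in 2021, so verify the request fields against your release:

```python
# Assumes the FirecrackerAPI helper from the orchestration sketch is available.

def snapshot_vm(api, snap_dir: str) -> None:
    """Pause a booted microVM and write a full snapshot to disk."""
    api.call("PATCH", "/vm", {"state": "Paused"})      # guest must be paused first
    api.call("PUT", "/snapshot/create", {
        "snapshot_type": "Full",
        "snapshot_path": f"{snap_dir}/vmstate",        # device and VMM state
        "mem_file_path": f"{snap_dir}/memory",         # guest memory contents
    })
    api.call("PATCH", "/vm", {"state": "Resumed"})     # or tear the source VM down

def restore_vm(api, snap_dir: str) -> None:
    """Load a snapshot into a fresh Firecracker process and resume it."""
    api.call("PUT", "/snapshot/load", {
        "snapshot_path": f"{snap_dir}/vmstate",
        "mem_file_path": f"{snap_dir}/memory",
    })
    api.call("PATCH", "/vm", {"state": "Resumed"})     # picks up where it left off
```

Each restored clone still needs its own writable root filesystem so tenants never share mutable state.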

Observability and debugging

MicroVMs lack a full OS, so observability must be externalized:

  • Serial console streaming: We capture stdout/stderr from the guest and forward it to a host log aggregator. Firecracker’s API lets us point its log output at named pipes (FIFOs) that we tap via Fluent Bit (see the sketch after this list).
  • Metrics: Firecracker emits host-level metrics (CPU, memory, IO) via Prometheus exporters. Inside the guest, we integrate OpenTelemetry to push traces to Honeycomb or Tempo.
  • Crash forensics: We enable kdump in the guest kernel so crash dumps are captured and stored in an object store for later analysis.
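Wiring up the log and metrics sinks from the first bullet happens before InstanceStart, over the same API socket. A sketch assuming the FirecrackerAPI helper from earlier (imported here from a hypothetical module); note that field names changed across Firecracker releases (older builds took log_fifo/metrics_fifo on /logger), so check the spec for your version:

```python
import os

# Hypothetical module holding the helper class from the orchestration sketch.
from firecracker_api import FirecrackerAPI

VM_DIR = "/srv/firecracker/vm-1234"            # placeholder per-VM directory
LOG_FIFO = f"{VM_DIR}/logs.fifo"
METRICS_FIFO = f"{VM_DIR}/metrics.fifo"

# Named pipes the host-side shipper (Fluent Bit in our setup) will read. The
# reader must already have them open, or Firecracker blocks when it attaches.
for path in (LOG_FIFO, METRICS_FIFO):
    if not os.path.exists(path):
        os.mkfifo(path, mode=0o640)

api = FirecrackerAPI(f"{VM_DIR}/api.sock")

# Register the sinks before booting the microVM.
api.call("PUT", "/logger", {
    "log_path": LOG_FIFO,
    "level": "Info",
    "show_level": True,
    "show_log_origin": False,
})
api.call("PUT", "/metrics", {"metrics_path": METRICS_FIFO})
```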
Engineer reviewing observability dashboards for Firecracker microVM workloads
Photo by Luke Chesser on Unsplash. Instrument the host and the guest.

When deeper debugging is required we rely on gdbserver or Delve running inside the guest, exposed via secure tunnels. These tools are baked into a special debug rootfs, not production images, maintaining immutability for live traffic.

Integrating with platform products

MicroVMs shine when integrated into internal platforms:

  • Internal developer platforms: Combine Backstage or Port portals with a “Launch workspace” button that provisions a Firecracker-backed environment for experimentation. Self-service virtualization without giving up isolation.
  • Feature environments: Instead of heavy VM snapshots, we pre-build microVM images containing the entire service stack. A GitOps promotion updates the image reference; platform tooling spins up the environment in minutes.
  • Secure data processing: Analytics teams run Python or Spark workloads inside microVMs with VPC endpoints, satisfying data governance requirements. Firecracker’s minimal attack surface keeps security teams comfortable.

We have even paired microVMs with mountebank service virtualization. Test suites boot a microVM containing mocked upstream dependencies, allowing integration tests to run without hitting production APIs—a pattern inspired by Twilio’s and Thoughtworks’ testing blogs.
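A sketch of the mock side of that pattern: the test rootfs bakes in mountebank, and the harness registers an imposter for the upstream dependency before the suite runs. The port, route, and payload below are illustrative:

```python
import json
import urllib.request

MOUNTEBANK_URL = "http://localhost:2525/imposters"  # mountebank's admin API

# One fake upstream: an HTTP imposter on port 4545 that answers a single route.
imposter = {
    "port": 4545,
    "protocol": "http",
    "stubs": [{
        "predicates": [{"equals": {"method": "GET", "path": "/v1/accounts/42"}}],
        "responses": [{"is": {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"id": 42, "status": "active"}),
        }}],
    }],
}

req = urllib.request.Request(
    MOUNTEBANK_URL,
    data=json.dumps(imposter).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print("imposter created:", resp.status)
```

The suite then points its upstream base URL at http://localhost:4545 inside the microVM, and nothing ever leaves the guest.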

Operations checklist

Running Firecracker in production introduces new runbook items:

  • Capacity planning: Track memory fragmentation and CPU headroom. Because microVMs reserve RAM, we target 70% utilization ceilings.
  • Host patching: Apply Kernel Live Patching or roll hosts weekly. We use Bottlerocket OS to reduce drift; its image-based, API-driven update model keeps host fleets consistent.
  • Security scanning: Scan rootfs images regularly. Even minimal builds may inherit vulnerabilities in libc or language runtimes.
  • Lifecycle automation: Ensure old snapshots and images are garbage-collected. We tag artifacts with expiration dates enforced by CI (a sketch follows this list).
  • Compliance evidence: Maintain logs proving that only signed kernels/rootfs booted. Rekor transparency logs and Argo CD histories make audits painless.
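For the lifecycle item above, enforcement can be a scheduled job that walks the artifact bucket and removes anything whose expiry has passed. A sketch with boto3, assuming artifacts live in S3 and CI writes an expires-on object tag (the bucket and tag names are our own conventions, not anything standard):

```python
import datetime

import boto3

BUCKET = "cloudythings-firecracker-artifacts"   # illustrative bucket name
EXPIRY_TAG = "expires-on"                        # ISO date written by the CI pipeline

s3 = boto3.client("s3")
today = datetime.date.today()

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        tags = s3.get_object_tagging(Bucket=BUCKET, Key=key)["TagSet"]
        expiry = next((t["Value"] for t in tags if t["Key"] == EXPIRY_TAG), None)
        if expiry is None:
            continue  # untagged artifacts are left for a human to review
        if datetime.date.fromisoformat(expiry) < today:
            print(f"deleting expired artifact {key} (expired {expiry})")
            s3.delete_object(Bucket=BUCKET, Key=key)
```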

Adoption roadmap

  1. Pilot on non-production workloads. Start with CI runners or ephemeral preview environments where isolation is valuable but risk is lower.
  2. Invest in image pipelines early. Without deterministic builds and signatures, microVM trust collapses.
  3. Educate developers. Document new debugging workflows, cross-train on observability dashboards, and hold brown-bag sessions.
  4. Automate warm pools. Nothing kills enthusiasm like cold start delays. Dynamic scaling is table stakes.
  5. Integrate with policy & GitOps. Use Kyverno or Gatekeeper to ensure only signed microVM images run. Promote image versions through pull requests.

Resources worth bookmarking

  • Firecracker documentation and the team’s Medium posts on production tuning.
  • Weaveworks Ignite as a reference implementation of GitOps for Firecracker.
  • Fly.io’s blog detailing how they built a platform atop microVMs.
  • AWS Security Whitepaper for Firecracker covering threat models and hardening.
  • Thought Machine’s KubeCon talk on microVM-backed banking workloads, a great case study for regulated industries.

MicroVMs are not a silver bullet, but they unlock a compelling middle ground: the security of virtual machines with the speed and density of containers. When paired with signed images, GitOps promotion, and thoughtful developer workflows, Firecracker becomes a foundational building block for modern platform engineering. Whether you are powering internal serverless functions, sandboxed machine learning jobs, or untrusted customer workloads, microVMs let you tighten the isolation screws without slowing down your builders.