Ask an engineering leader which team owns observability and you will hear sighs. Some treat it as a tooling problem, others as an SRE mandate, and many just hope the alerts are quiet tonight. The best companies—from Honeycomb to Monzo to DoorDash—treat observability as a shared product with clear outcomes. Their blogs are filled with “observability scorecards,” dashboards that show how teams are doing against reliability expectations.
At Cloudythings we adopted scorecards to keep our clients honest and inspired. This article explains how we build them, the metrics that matter, and the rituals that make teams care.
Define the behaviors you want
Scorecards should drive decisions, not shame teams. We start by identifying behaviors that correlate with reliability in DORA research and Google’s SRE literature:
- Healthy SLOs with burn-rate alerts.
- Well-tuned alerting (low noise, high actionability).
- Telemetry coverage (logs, metrics, traces for critical journeys).
- On-call sustainability (rotation health, incident load).
- Post-incident follow-through (action item closure rates).
These categories become scorecard pillars. We avoid vanity metrics like “number of dashboards created” because they rarely correlate with outcomes.
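To make the pillars concrete, here is a minimal sketch of how a scorecard could be modeled in code. The `Pillar` and `Indicator` names and the placeholder targets are illustrative, not our production schema.

```python
from dataclasses import dataclass, field


@dataclass
class Indicator:
    """One measurable signal inside a pillar, with a target band."""
    name: str
    target: float                  # the threshold the team aims for
    higher_is_better: bool = True  # flip for load-style metrics


@dataclass
class Pillar:
    """A scorecard pillar groups related indicators."""
    name: str
    indicators: list[Indicator] = field(default_factory=list)


# The five pillars described above, each with one example indicator.
PILLARS = [
    Pillar("SLO health", [Indicator("burn_rate_control", 0.90)]),
    Pillar("Alert quality", [Indicator("alert_actionability", 0.70)]),
    Pillar("Telemetry coverage", [Indicator("trace_coverage", 0.85)]),
    Pillar("On-call sustainability",
           [Indicator("incidents_per_engineer_week", 2.0, higher_is_better=False)]),
    Pillar("Post-incident follow-through", [Indicator("action_item_closure", 0.80)]),
]
```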
Source trustworthy data
We pull data from:
- SLO tooling (Nobl9, Lightstep, or custom Sloth configs) for burn rate, remaining error budget, and SLO compliance.
- Paging platforms (PagerDuty, Incident.io) for alert volume, acknowledgement times, and incident ownership.
- Telemetry platforms (Grafana, Honeycomb) for trace coverage and dashboard usage.
- Ticketing systems (Jira, Linear) for post-incident action item completion.
The data flows into a warehouse (BigQuery or Snowflake) and surfaces through Looker or Grafana. Automation is everything; no spreadsheets.
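For flavor, here is a minimal sketch of one such sync job, assuming a PagerDuty API token and a BigQuery table named `reliability.pagerduty_incidents` (both hypothetical); pagination, retries, and schema management are left out.

```python
"""Nightly sync: pull PagerDuty incidents into the warehouse (sketch only)."""
import os
from datetime import datetime, timedelta, timezone

import requests
from google.cloud import bigquery

PAGERDUTY_TOKEN = os.environ["PAGERDUTY_TOKEN"]
TABLE = "cloudythings-warehouse.reliability.pagerduty_incidents"  # hypothetical name


def fetch_incidents(since: datetime, until: datetime) -> list[dict]:
    """Fetch incidents from the PagerDuty REST API for a time window."""
    resp = requests.get(
        "https://api.pagerduty.com/incidents",
        headers={
            "Authorization": f"Token token={PAGERDUTY_TOKEN}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        params={"since": since.isoformat(), "until": until.isoformat(), "limit": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["incidents"]


def load_to_bigquery(incidents: list[dict]) -> None:
    """Stream a flattened subset of each incident into BigQuery."""
    client = bigquery.Client()
    rows = [
        {
            "incident_id": i["id"],
            "service": i["service"]["summary"],
            "urgency": i["urgency"],
            "created_at": i["created_at"],
        }
        for i in incidents
    ]
    errors = client.insert_rows_json(TABLE, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    load_to_bigquery(fetch_incidents(since=now - timedelta(days=1), until=now))
```

The same pattern repeats for the SLO, telemetry, and ticketing sources: each job is an API pull plus a warehouse write, scheduled by whatever orchestrator you already run.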
Craft meaningful indicators
Each pillar has a few indicators with target bands. Examples we deploy:
- Burn-rate control: % of rolling 28-day windows where high-severity alerts fired before budget exhaustion. Target: >90%. This tells us whether burn-rate alerts are protecting teams.
- Alert actionability: Share of paging alerts that resulted in documented customer impact. Target: >70%. Shopify’s SRE team highlighted this metric when reducing alert fatigue.
- Trace coverage: % of critical services emitting traces sampled into Honeycomb. Target: >85%. Without traces, high-cardinality debugging fails.
- On-call load: Mean incidents per engineer per week. Target: <2. Derived from PagerDuty incident assignments.
- Action item closure: % of Sev1/Sev2 post-incident tasks resolved within 30 days. Target: >80%.
Each indicator includes qualitative context and links to documentation so teams learn how to improve.
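The banding logic itself can stay tiny. The sketch below shows one way to map an indicator value onto a green/amber/red band; the `amber_margin` values and sample numbers are invented for illustration.

```python
def band(value: float, target: float, higher_is_better: bool = True,
         amber_margin: float = 0.05) -> str:
    """Map an indicator value onto a red/amber/green band.

    Green: meets the target. Amber: within `amber_margin` of it. Red: otherwise.
    """
    delta = value - target if higher_is_better else target - value
    if delta >= 0:
        return "green"
    if delta >= -amber_margin:
        return "amber"
    return "red"


# Example: alert actionability = actionable pages / total pages for the month.
actionable_pages, total_pages = 41, 63
actionability = actionable_pages / total_pages            # ~0.65 against a 0.70 target
print(band(actionability, target=0.70))                   # -> "amber"

# Example: on-call load, where lower is better.
incidents_per_engineer_week = 2.6
print(band(incidents_per_engineer_week, target=2.0,
           higher_is_better=False, amber_margin=0.5))      # -> "red"
```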
Review in public
Scorecards lose power if they sit in Confluence. We run monthly “Reliability Review” sessions:
- Platform/SRE presents scorecards across product groups.
- Teams share experiments that improved metrics (e.g., migrating to OpenTelemetry, tuning alerts).
- Leadership commits to removing blockers (staffing, tooling budgets).
We celebrate wins publicly, echoing the positive reinforcement culture Netflix and Atlassian write about.
Tie to OKRs and roadmaps
Scorecards inform roadmaps:
- Low trace coverage? Add an OKR to standardize instrumentation libraries.
- High on-call load? Prioritize toil reduction or shift-left testing.
- Poor action item closure? Tweak incident process or allocate capacity for follow-through.
Because the metrics are objective, teams negotiate priorities with data, not gut feelings. This alignment mirrors what DoorDash shared about their reliability “business reviews.”
Iterate quickly
Scorecards are products. We collect feedback via surveys and adjust indicators. When we introduced “dashboard views” as a metric, teams gamed it by refreshing panels. We replaced it with “mean time from alert to relevant dashboard view,” which actually correlated with MTTR. Continuous improvement keeps the scorecard respected.
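To illustrate the replacement metric, here is a rough sketch of how “mean time from alert to relevant dashboard view” could be computed, assuming you can export alert fire times from the paging platform and dashboard-view events from Grafana analytics; the sample data is invented.

```python
from datetime import datetime
from statistics import mean

# Hypothetical inputs: (service, timestamp) pairs for alert fires and dashboard views.
alerts = [("checkout", datetime(2024, 5, 3, 9, 12)), ("search", datetime(2024, 5, 3, 11, 40))]
views = [("checkout", datetime(2024, 5, 3, 9, 18)), ("search", datetime(2024, 5, 3, 12, 25))]


def mean_alert_to_view_minutes(alerts, views):
    """Mean minutes from an alert firing to the first dashboard view
    for the same service after that alert."""
    gaps = []
    for service, fired_at in alerts:
        candidates = [t for s, t in views if s == service and t >= fired_at]
        if candidates:
            gaps.append((min(candidates) - fired_at).total_seconds() / 60)
    return mean(gaps) if gaps else None


print(mean_alert_to_view_minutes(alerts, views))  # (6 + 45) / 2 = 25.5 minutes
```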
Keep empathy front and center
Numbers cannot capture everything. We pair scorecards with qualitative insights:
- On-call rotations report fatigue levels.
- Teams note upcoming launches that may stress SLOs.
- SREs highlight systemic risks (e.g., missing chaos coverage).
The result is a holistic picture of reliability health. Scorecards act as a compass, not a hammer.
Observability is a team sport. When we treat scorecards as living products, anchored in trusted data and reviewed with empathy, reliability stops being a mystery. Engineers see where to invest, leaders see the ROI, and customers feel the difference.