Cloudythings Blog

AI Copilots for Incident Response

How we integrate LLM copilots into incident management without sacrificing rigor, blamelessness, or human judgment.

March 11, 2025 · 09:51 AM EST · 12 min read
Incident Response · AI/ML · SRE · Automation · Observability
Image: Incident commanders using AI copilots during a war room session (Christopher Gower / Unsplash)

2024 saw a flood of AI copilots for incident management. PagerDuty launched Runbook Automation AI, Atlassian rolled out Jira Service Management copilots, and startups shipped LLM assistants for SREs. Yet many teams worry that hallucinated suggestions will mislead responders or that auto-generated write-ups will erode postmortem quality. At Cloudythings we built an AI-assisted incident workflow that keeps humans in control while using machine intelligence for speed.

Establish trust boundaries

We define what the copilot can and cannot do:

  • Allowed: summarize telemetry, suggest remediation steps, draft status updates, recommend runbooks.
  • Forbidden: execute production changes, close incidents, modify SLO targets.

Policies are codified in the incident platform and reviewed by legal/compliance. AI suggestions always require human acknowledgement.
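To make the boundary concrete, here is a minimal sketch of the kind of authorization check the platform applies before surfacing a suggestion. The action names, the authorize helper, and the requires_human_ack flag are illustrative, not our exact schema.

    # Minimal sketch of a copilot trust boundary check (illustrative names).
    from dataclasses import dataclass

    ALLOWED_ACTIONS = {"summarize_telemetry", "suggest_remediation",
                       "draft_status_update", "recommend_runbook"}
    FORBIDDEN_ACTIONS = {"execute_change", "close_incident", "modify_slo_target"}

    @dataclass
    class CopilotAction:
        name: str
        requires_human_ack: bool = True  # every suggestion needs explicit acknowledgement

    def authorize(action: CopilotAction) -> bool:
        """Allow read/suggest actions only; never anything that mutates production."""
        if action.name in FORBIDDEN_ACTIONS:
            return False
        return action.name in ALLOWED_ACTIONS

    assert authorize(CopilotAction("recommend_runbook"))
    assert not authorize(CopilotAction("close_incident"))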


Feed high-quality context

Copilots thrive on rich data:

  • Structured incident timelines from PagerDuty, Incident.io, or FireHydrant.
  • Observability exports (Grafana snapshots, Honeycomb traces) captured via APIs.
  • Runbook metadata stored in Git with machine-readable summaries.
  • SLO dashboards providing burn-rate numbers.

We sanitize and label data before sending it to the LLM, respecting privacy and compliance requirements.
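As a rough illustration of what "sanitize and label" means in practice, the sketch below redacts obvious PII and tags each source so the copilot can cite it later. The field names and regex patterns are assumptions for this example, not our production schema.

    # Illustrative context assembly with basic PII redaction before the LLM call.
    import re

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

    def sanitize(text: str) -> str:
        """Strip obvious PII before data leaves our trust boundary."""
        return IPV4.sub("[ip]", EMAIL.sub("[email]", text))

    def build_context(timeline: list[str], burn_rate: float, runbook_summary: str) -> dict:
        """Label each source so suggestions can cite where they came from."""
        return {
            "timeline": [sanitize(entry) for entry in timeline],
            "slo": {"burn_rate": burn_rate},
            "runbook": {"summary": sanitize(runbook_summary)},
        }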

Integrate into incident rituals

Copilots offer:

  • Real-time summaries posted in Slack channels every 10 minutes, highlighting the current hypothesis, impact, and mitigation progress.
  • Query generation: prompts that craft Grafana queries or kubectl commands based on natural language.
  • Runbook recommendations ranked by historical success.

Responders accept or reject suggestions; rejections feed a feedback loop we use to refine prompts.
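The feedback loop is simple in spirit. Below is a toy version that just appends accept/reject events to a log we can mine when tuning prompts; the event shape and file-based store are stand-ins for our internal pipeline.

    # Toy accept/reject feedback recorder (event shape and storage are illustrative).
    import json
    import time

    def record_feedback(suggestion_id: str, accepted: bool, responder: str,
                        path: str = "copilot_feedback.jsonl") -> None:
        """Append one feedback event; rejections inform later prompt revisions."""
        event = {"suggestion_id": suggestion_id, "accepted": accepted,
                 "responder": responder, "ts": time.time()}
        with open(path, "a") as fh:
            fh.write(json.dumps(event) + "\n")

    record_feedback("sug-42", accepted=False, responder="on-call-primary")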

Guard against hallucinations

We enforce:

  • Citation requirements—copilot outputs must reference source telemetry or runbooks.
  • Confidence scoring—LLM responses include probability estimates; low confidence triggers human review reminders.
  • Audit logging—every suggestion, acceptance, and rejection is logged for post-incident analysis.

These guardrails align with the reliability principles shared by Google DeepMind and Microsoft's Responsible AI teams.
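For a sense of how the citation and confidence rules are enforced, here is a hedged sketch of the vetting step that runs before a suggestion reaches responders. The response fields (citations, confidence) and the 0.7 threshold are conventions chosen for this example, not a vendor API.

    # Illustrative guardrail: reject uncited output, flag low-confidence output.
    CONFIDENCE_FLOOR = 0.7  # below this, responders see a "verify manually" reminder

    def vet_suggestion(response: dict) -> tuple[bool, str]:
        """Every accepted suggestion must cite telemetry or a runbook."""
        if not response.get("citations"):
            return False, "rejected: no source telemetry or runbook cited"
        if response.get("confidence", 0.0) < CONFIDENCE_FLOOR:
            return True, "accepted with reminder: low confidence, verify before acting"
        return True, "accepted"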

Enhance post-incident learning

After the incident, the copilot helps:

  • Draft retrospectives with timelines, contributing factors, and action item suggestions.
  • Identify recurring themes by clustering incidents.
  • Recommend training content for responders based on gaps observed.

Humans edit the final documents, ensuring accuracy and nuance.
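Theme clustering can be as simple as grouping postmortem summaries by textual similarity. The sketch below uses scikit-learn's TF-IDF and k-means as a stand-in; the summaries and cluster count are made up for illustration.

    # Toy clustering of postmortem summaries to surface recurring themes.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    summaries = [
        "checkout latency spike after deploy, rollback mitigated",
        "database connection pool exhausted during traffic surge",
        "latency regression after canary deploy, rolled back",
    ]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(summaries)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    print(list(labels))  # incidents sharing a label hint at a recurring theme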

AI copilots accelerate incident response when paired with clear boundaries, quality data, and robust review processes. Keep humans accountable, and the copilot becomes a force multiplier rather than a liability.