2024 saw a flood of AI copilots for incident management. PagerDuty launched Runbook Automation AI, Atlassian rolled out Jira Service Management copilots, and startups shipped LLM assistants for SREs. Yet many teams worry about hallucinations or eroding postmortem quality. At Cloudythings we built an AI-assisted incident workflow that keeps humans in control while leveraging machine intelligence for speed.
Establish trust boundaries
We define what the copilot can and cannot do:
- Allowed: summarize telemetry, suggest remediation steps, draft status updates, recommend runbooks.
- Forbidden: execute production changes, close incidents, modify SLO targets.
Policies are codified in the incident platform and reviewed by legal/compliance. AI suggestions always require human acknowledgement.
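As a rough sketch, the boundary can live next to the incident tooling as a small allow/deny check. The action names and `CopilotAction` type below are illustrative, not part of any vendor API:

```python
# A minimal sketch of codifying copilot trust boundaries. The action names
# and CopilotAction type are illustrative assumptions, not a vendor API.
from dataclasses import dataclass

ALLOWED_ACTIONS = {
    "summarize_telemetry",
    "suggest_remediation",
    "draft_status_update",
    "recommend_runbook",
}

FORBIDDEN_ACTIONS = {
    "execute_production_change",
    "close_incident",
    "modify_slo_target",
}

@dataclass
class CopilotAction:
    name: str
    acknowledged_by: str | None = None  # human who accepted the suggestion

def is_permitted(action: CopilotAction) -> bool:
    """Allow only whitelisted actions, and only after human acknowledgement."""
    if action.name in FORBIDDEN_ACTIONS:
        return False
    return action.name in ALLOWED_ACTIONS and action.acknowledged_by is not None
```

Keeping the deny list explicit (rather than relying on the allow list alone) means a new, unclassified action fails closed during review rather than slipping through.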
Feed high-quality context
Copilots thrive on rich data:
- Structured incident timelines from PagerDuty, Incident.io, or FireHydrant.
- Observability exports (Grafana snapshots, Honeycomb traces) captured via APIs.
- Runbook metadata stored in Git with machine-readable summaries.
- SLO dashboards providing burn-rate numbers.
We sanitize and label data before sending it to the LLM, respecting privacy and compliance requirements.
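A minimal sketch of that sanitize-and-label step, assuming a few common redaction patterns and a payload shape of our own design (this is illustrative, not a compliance-complete filter):

```python
# A hedged sketch of the sanitize-and-label step. The redaction patterns and
# payload shape are assumptions for illustration, not a complete PII filter.
import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),        # email addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP_ADDR>"),  # IPv4 addresses
    (re.compile(r"(?i)bearer\s+[a-z0-9._-]+"), "<TOKEN>"),      # bearer tokens
]

def sanitize(text: str) -> str:
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text

def build_context(timeline: list[str], burn_rate: float, runbook_summary: str) -> dict:
    """Assemble a labeled, sanitized payload for the LLM prompt."""
    return {
        "label": "incident-context/v1",  # versioned label aids audit trails
        "timeline": [sanitize(entry) for entry in timeline],
        "slo_burn_rate": burn_rate,      # numeric, nothing to redact
        "runbook_summary": sanitize(runbook_summary),
    }
```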
Integrate into incident rituals
Copilots offer:
- Real-time summaries posted in Slack channels every 10 minutes, highlighting the current hypothesis, impact, and mitigation progress.
- Query generation: prompts that craft Grafana queries or kubectl commands based on natural language.
- Runbook recommendations ranked by historical success.
Responders accept or reject suggestions; rejections feed a feedback loop we use to refine prompts.
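For illustration, the 10-minute summary ritual can be a small loop around the LLM call; `summarize_incident` stands in for that call and the webhook URL is a placeholder, both assumptions:

```python
# A minimal sketch of the 10-minute summary ritual. summarize_incident() stands
# in for the LLM call; SLACK_WEBHOOK_URL is a placeholder, not a real endpoint.
import json
import time
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def post_summary(incident_id: str, summary: str) -> None:
    """Post a copilot summary to the incident Slack channel via webhook."""
    body = json.dumps({"text": f":robot_face: [{incident_id}] {summary}"}).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # drain the response; Slack replies "ok" on success

def summary_loop(incident_id: str, summarize_incident) -> None:
    """Every 10 minutes, regenerate and post the hypothesis/impact/mitigation summary."""
    while True:
        post_summary(incident_id, summarize_incident(incident_id))
        time.sleep(600)
```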
Guard against hallucinations
We enforce:
- Citation requirements—copilot outputs must reference source telemetry or runbooks.
- Confidence scoring—LLM responses include probability estimates; low confidence triggers human review reminders.
- Audit logging—every suggestion, acceptance, and rejection is logged for post-incident analysis.
These guardrails align with the reliability principles shared by Google DeepMind and Microsoft's Responsible AI teams.
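As a sketch, the citation and confidence checks plus the audit log can sit in one vetting function. The response fields assumed below (`text`, `citations`, `confidence`) are our own shape, not a vendor schema:

```python
# A sketch of the guardrail checks, assuming the copilot returns a dict with
# "text", "citations", and "confidence" fields (our shape, not a vendor schema).
import json
import logging
import time

logging.basicConfig(filename="copilot_audit.log", level=logging.INFO)
CONFIDENCE_FLOOR = 0.7  # below this, responders get a human-review reminder

def vet_suggestion(response: dict) -> tuple[bool, str]:
    """Return (needs_review, reason) and write an audit record either way."""
    needs_review, reason = False, "ok"
    if not response.get("citations"):
        needs_review, reason = True, "no citation to source telemetry or runbooks"
    elif response.get("confidence", 0.0) < CONFIDENCE_FLOOR:
        needs_review, reason = True, f"confidence below {CONFIDENCE_FLOOR}"
    # Audit log: every suggestion is recorded for post-incident analysis.
    logging.info(json.dumps({
        "ts": time.time(),
        "suggestion": response.get("text", ""),
        "needs_review": needs_review,
        "reason": reason,
    }))
    return needs_review, reason
```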
Enhance post-incident learning
After the incident, the copilot helps:
- Draft retrospectives with timelines, contributing factors, and action item suggestions.
- Identify recurring themes by clustering incidents.
- Recommend training content for responders based on gaps observed.
Humans edit final documents, ensuring accuracy and nuance.
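For the theme-clustering step, a simple baseline is TF-IDF plus k-means over past incident summaries; the sketch below uses scikit-learn as one option among many (embedding-based clustering works too):

```python
# A hedged sketch of theme clustering over past incident summaries, using
# scikit-learn TF-IDF + KMeans as one simple baseline, not our exact pipeline.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_incidents(summaries: list[str], n_themes: int = 5) -> dict[int, list[str]]:
    """Group incident summaries into recurring themes."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(summaries)
    labels = KMeans(n_clusters=n_themes, n_init="auto").fit_predict(vectors)
    themes: dict[int, list[str]] = {}
    for label, summary in zip(labels, summaries):
        themes.setdefault(int(label), []).append(summary)
    return themes
```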
AI copilots accelerate incident response when paired with clear boundaries, quality data, and robust review processes. Keep humans accountable, and the copilot becomes a force multiplier rather than a liability.