sic/knowledge/runbooks/incident-response-long.md
rikrdo 62728b2200 Initial commit: SIC harness (backend, web, pi-adapter, configs, docs)
2026-06-29 16:20:53 +02:00

---
title: Production Incident Response Runbook (long-form)
tags: [incident, production, sre, on-call, runbook, master]
owner: sre
updated: 2026-06-28
---
# Production Incident Response Runbook (long-form)
> This runbook is designed to exercise the UI: it contains nested headings, lists, tables, code blocks, blockquotes, links, and enough volume to force scroll in the modal. Use it as a reference during drills and to validate the look of the documentation viewer.
## Table of contents
1. [Purpose and scope](#purpose-and-scope)
2. [Severities and SLAs](#severities-and-slas)
3. [Roles and responsibilities](#roles-and-responsibilities)
4. [Response flow](#response-flow)
5. [Initial diagnosis](#initial-diagnosis)
6. [Common incident patterns](#common-incident-patterns)
7. [Useful commands](#useful-commands)
8. [Available webhooks](#available-webhooks)
9. [Escalation](#escalation)
10. [Post-mortem](#post-mortem)
11. [Appendix: glossary](#appendix-glossary)
## Purpose and scope
This runbook defines the standard procedure for responding to production incidents that affect the availability, integrity, or performance of critical services. It applies to every engineering and operations team that maintains services in scope of SIC.
### When to use this runbook
- Partial or total service outages.
- Severe performance degradation (p99 latency > agreed SLA).
- Confirmed or suspected data loss or corruption.
- Security alerts with production impact.
### When NOT to use this runbook
- Failures in dev or staging environments without user impact.
- Change requests or scheduled maintenance.
- HR or administrative process incidents.
## Severities and SLAs
| Severity | Definition | Ack SLA | Mitigation SLA | Communication |
| --- | --- | --- | --- | --- |
| **SEV-1** | Total outage or data loss | 5 minutes | 60 minutes | Every 15 min |
| **SEV-2** | Severe degradation, affects > 30% of users | 10 minutes | 2 hours | Every 30 min |
| **SEV-3** | Partial degradation, affects ≤ 30% of users | 30 minutes | 8 hours | Every 2 hours |
| **SEV-4** | Cosmetic, no functional impact | 1 business day | Next sprint | Async |
> **Important**: severity can go up or down as the incident evolves. Document every change in the incident channel with a timestamp.
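
The clock-based targets in the table can be turned into a small lookup so that a script or chat-ops helper can tell how much time is left. This is a minimal sketch: the function names `sla_minutes` and `minutes_left` are hypothetical, and SEV-4 is left out because its targets ("1 business day", "Next sprint") are not expressed in minutes.

```bash
#!/usr/bin/env bash
# Sketch only: SLA targets from the severity table, in minutes.
# Function names are hypothetical; SEV-4 has no minute-based target.

# sla_minutes <severity> <ack|mitigation|comms>
sla_minutes() {
  case "$1:$2" in
    SEV-1:ack) echo 5 ;;
    SEV-1:mitigation) echo 60 ;;
    SEV-1:comms) echo 15 ;;
    SEV-2:ack) echo 10 ;;
    SEV-2:mitigation) echo 120 ;;
    SEV-2:comms) echo 30 ;;
    SEV-3:ack) echo 30 ;;
    SEV-3:mitigation) echo 480 ;;
    SEV-3:comms) echo 120 ;;
    *) echo "no minute-based target for $1 $2" >&2; return 1 ;;
  esac
}

# minutes_left <severity> <target> <elapsed_minutes>
# Prints the minutes remaining; a negative value means the target was missed.
minutes_left() {
  local target
  target="$(sla_minutes "$1" "$2")" || return 1
  echo $(( target - $3 ))
}
```

For example, `minutes_left SEV-1 mitigation 45` prints `15`, meaning a SEV-1 that started 45 minutes ago has 15 minutes left before the mitigation SLA is missed.
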
## Roles and responsibilities
- **Incident Commander (IC)**: coordinates the response and does not run technical tasks. The only person who can declare the incident resolved.
- **Comms Lead**: handles communication to stakeholders, status page, and customers.
- **Tech Lead**: leads the technical investigation, assigns tasks to the response team.
- **Subject Matter Expert (SME)**: provides system-specific knowledge for the affected service.
- **Scribe**: documents the incident timeline in real time.
## Response flow
1. **Detect**: alert, user report, or proactive monitoring.
2. **Triage**: classify severity and assign an IC in under 5 minutes.
3. **Convene**: open a bridge and the #inc-YYYYMMDD-XX channel.
4. **Mitigate**: apply changes to restore service. The root cause can wait.
5. **Resolve**: confirm the service is stable. Close the incident.
6. **Post-mortem**: within 5 business days, blameless.
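
The channel name used in the Convene step can be generated rather than typed by hand. The sketch below assumes that `XX` is a two-digit sequence number for incidents opened on the same day; the runbook does not define it, so treat that reading and the helper name `inc_channel` as assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: build an incident channel name of the form #inc-YYYYMMDD-XX.
# Assumption: XX is a two-digit, per-day sequence number.

# inc_channel [sequence] [YYYYMMDD]
inc_channel() {
  local seq="${1:-1}"
  local day="${2:-$(date +%Y%m%d)}"
  # 10# forces base 10 so that inputs like "08" are not parsed as octal.
  printf '#inc-%s-%02d\n' "$day" "$((10#$seq))"
}
```

Calling `inc_channel 3 20260628` prints `#inc-20260628-03`; with no arguments it uses today's date and sequence 1.
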
### Flow diagram
```mermaid
graph TD
A[Detect] --> B{Triage}
B -->|SEV-1/2| C[Open bridge]
B -->|SEV-3/4| D[Assign owner]
C --> E[Investigate]
D --> E
E --> F{Mitigation?}
F -->|Yes| G[Apply fix]
F -->|No| H[Escalate]
G --> I[Monitor]
I --> J{Stable?}
J -->|Yes| K[Close]
J -->|No| E
H --> E
K --> L[Post-mortem]
```
## Initial diagnosis
Before going deeper, run the following steps in order:
1. Check the overall service health dashboard.
2. Review the last hour of production changes (`deploy log`).
3. Check active alerts in the monitoring system.
4. Confirm the failure is not user-side (DNS, local network).
### Triage checklist
- [ ] Affected service identified
- [ ] Severity assigned
- [ ] IC identified
- [ ] Bridge open
- [ ] Communication channel created
- [ ] Status page updated
- [ ] Comms lead assigned
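
If the checklist is copied into a file for each incident, the number of items still open can be counted with `grep`. This is a sketch under that assumption; the helper name `open_items` and the idea of a per-incident checklist file are illustrative and not part of SIC.

```bash
#!/usr/bin/env bash
# Sketch only: count unchecked "- [ ]" items in a Markdown checklist file.

# open_items <checklist-file>
open_items() {
  # grep -c prints 0 and exits non-zero when nothing matches,
  # so "|| true" keeps the helper usable under "set -e".
  grep -c '^- \[ ]' "$1" || true
}
```

A result of `0` means every triage item has been ticked (`- [x]`).
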
## Common incident patterns
### Pattern A: latency spike
**Symptoms**: p99 latency rises from 200 ms to > 2 s without proportional traffic increase.
**Typical causes**:
- DB connection pool saturation.
- Massive cache miss (accidental invalidation).
- Long JVM garbage collection.
**Immediate actions**:
1. Check DB metrics (connections, locks, slow queries).
2. Validate cache hit rate.
3. If no cause is identified in 5 min, escalate to the service SME.
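
When the dashboard is unavailable, p99 can be estimated from raw latency samples. The sketch below uses the nearest-rank method on one numeric sample per line read from standard input; how those samples are extracted from logs or traces is left open, and the helper name `p99` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: nearest-rank 99th percentile of numeric samples on stdin.

p99() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Nearest rank: ceil(0.99 * NR), computed with integer-friendly math.
      i = int((NR * 99 + 99) / 100)
      print v[i]
    }'
}
```

With only a handful of samples the result is simply the largest value, so the estimate is meaningful mainly for windows containing hundreds of requests or more.
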
### Pattern B: cascading 5xx errors
**Symptoms**: sudden increase in HTTP 500/502/503 responses on one or more endpoints.
**Typical causes**:
- Upstream service down.
- Invalid configuration deployed.
- External resource (third-party API) unavailable.
**Immediate actions**:
1. Identify the failing upstream service.
2. Review the last deploy touching that path.
3. If the deploy is to blame, consider a rollback.
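
To judge how severe the spike is, the share of 5xx responses can be computed from a stream of status codes. The sketch below expects one HTTP status code per line on standard input; extracting that column depends on the log format, which this runbook does not specify, and the helper name `error_rate_5xx` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: percentage of 5xx responses among status codes read from stdin.

error_rate_5xx() {
  awk '
    $1 ~ /^5[0-9][0-9]$/ { errors++ }
    END {
      if (NR == 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * errors / NR
    }'
}
```

If the last deploy turns out to be the cause and the service runs as a Kubernetes Deployment (as the `deploy/api` examples in this runbook suggest), `kubectl rollout undo deploy/api -n prod` is one way to return to the previous revision.
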
### Pattern C: data loss
**Symptoms**: customers report missing or inconsistent data.
**Typical causes**:
- Cleanup job that deleted more than intended.
- Schema migration executed with a bug.
- Bug in business logic.
**Immediate actions**:
1. **Stop** any job that could make things worse.
2. Evaluate whether a recent and viable backup can be restored.
3. Escalate immediately to the engineering lead.
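
Part of deciding whether a backup is "recent" can be automated by checking the age of the backup file. This is a sketch that assumes backups are plain files whose modification time reflects when they were taken; it says nothing about whether the backup is viable, which still requires a restore test. The helper name `backup_is_recent` and the 24-hour default are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: succeed if a backup file was modified within the last N hours.

# backup_is_recent <file> [max_age_hours]
backup_is_recent() {
  local file="$1" max_hours="${2:-24}"
  # find prints the path only when it is newer than the cutoff.
  [ -n "$(find "$file" -maxdepth 0 -mmin "-$(( max_hours * 60 ))" 2>/dev/null)" ]
}
```

Usage: `backup_is_recent /path/to/backup 6 && echo "backup is under 6 hours old"`, where the path is a placeholder.
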
## Useful commands
### Check connectivity
```bash
# DNS
dig +short example.com
# Basic HTTP (HEAD request, response headers only)
curl -sSI https://api.example.com/health
# TCP to a specific port
nc -zv db.internal 5432
```
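
During an incident it helps to run these checks in one pass and get a clear PASS or FAIL per check. The wrapper below is a minimal sketch; `run_check` is a hypothetical helper that relies only on the exit status of the wrapped command.

```bash
#!/usr/bin/env bash
# Sketch only: run a command quietly and report PASS or FAIL with a label.

# run_check <label> <command> [args...]
run_check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Example calls (not executed here):
#   run_check "http api" curl -sSfI https://api.example.com/health
#   run_check "tcp db"   nc -z db.internal 5432
#   run_check "dns"      sh -c '[ -n "$(dig +short example.com)" ]'
```

The DNS example tests for a non-empty answer because `dig` usually exits 0 even when the name does not resolve, and `curl` gets `-f` so that HTTP error statuses count as failures.
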
### Inspect logs live
```bash
# Last 100 lines and follow
kubectl logs -n prod deploy/api --tail=100 -f
# Logs from the last 5 minutes
kubectl logs -n prod deploy/api --since=5m
# Logs of a specific pod
kubectl logs -n prod api-7d4f8b9c-x2k9n --tail=200
```
### Quick metrics
```bash
# CPU and memory per pod
kubectl top pods -n prod
# CPU and memory per container
kubectl top pods -n prod --containers
# Disk usage of a node
ssh node-01 df -h
```
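
To spot filesystems above the 90% mark that the `disk-cleanup` webhook targets, the output of `df -P` can be filtered. This is a sketch: `over_threshold` is a hypothetical helper, it reads `df -P` output from standard input, and it assumes mount points contain no spaces.

```bash
#!/usr/bin/env bash
# Sketch only: list mount points whose usage exceeds a percentage limit.
# Input: output of "df -P" on stdin (POSIX format, one header line).

# over_threshold [limit_percent]
over_threshold() {
  local limit="${1:-90}"
  awk -v limit="$limit" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # "95%" -> "95"
      if (use + 0 > limit) print $6, use "%"
    }'
}
```

Usage: `ssh node-01 df -P | over_threshold 90` prints one line per filesystem above 90%, and nothing if all are below.
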
## Available webhooks
| Webhook | When to use it | Requires confirmation |
| --- | --- | --- |
| `vpn-diagnostic` | VPN access issues | Yes |
| `service-restart` | Hung or zombie service | Yes |
| `dns-flush` | Broken DNS resolution | No |
| `disk-cleanup` | Disk > 90% | Yes |
| `log-tail` | Need logs in real time | No |
| `cache-purge` | Stale or corrupt cache | Yes |
> Remember: the LLM can only recommend webhooks; it must never execute them directly. A user always triggers the run, and webhooks marked "Yes" above also require that user's explicit confirmation before they execute.
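
The "Requires confirmation" column can be mirrored in a lookup so that tooling fails closed on unknown webhook names. This is only a sketch of the table above; it is not SIC's actual catalog format, and `requires_confirmation` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: mirror of the "Requires confirmation" column in the table.

# requires_confirmation <webhook-id>  -> prints "yes" or "no"
requires_confirmation() {
  case "$1" in
    vpn-diagnostic|service-restart|disk-cleanup|cache-purge) echo yes ;;
    dns-flush|log-tail) echo no ;;
    # Unknown ids are rejected instead of defaulting to "no".
    *) echo "unknown webhook: $1" >&2; return 1 ;;
  esac
}
```

Rejecting unknown ids means a typo or a newly added webhook cannot silently skip the confirmation step.
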
## Escalation
If the incident is not mitigated within the agreed SLA:
1. Notify the area's on-call manager.
2. If the incident remains unmitigated for more than 2 hours, notify the engineering director.
3. If customers are impacted, involve Customer Success.
4. If there is monetary or data loss, notify Legal and the C-level.
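
The escalation rules above can be sketched as a function that lists who to notify. It assumes all four rules apply only once the mitigation SLA has been exceeded, which is one reading of the list; `escalation_targets` and its argument order are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: print who to notify, one target per line.
# Assumption: nothing is escalated while the incident is within its SLA.

# escalation_targets <elapsed_min> <mitigation_sla_min> [customers: yes|no] [loss: yes|no]
escalation_targets() {
  local elapsed="$1" sla="$2" customers="${3:-no}" loss="${4:-no}"
  if (( elapsed <= sla )); then
    return 0
  fi
  echo "on-call manager"
  if (( elapsed > 120 )); then
    echo "engineering director"
  fi
  if [ "$customers" = "yes" ]; then
    echo "Customer Success"
  fi
  if [ "$loss" = "yes" ]; then
    echo "Legal and C-level"
  fi
}
```

For a SEV-1 (60-minute mitigation SLA) that is 90 minutes old with customer impact, `escalation_targets 90 60 yes no` lists the on-call manager and Customer Success.
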
### Emergency contacts
```text
SRE on-call: +54 11 5555-0001
Platform lead: +54 11 5555-0002
Security IR: +54 11 5555-0003
CTO: +54 11 5555-0004
```
## Post-mortem
Within 5 business days after closing the incident:
1. Schedule a meeting with everyone involved.
2. Share the post-mortem document 24 h in advance.
3. During the meeting, review the timeline and identify the root cause.
4. Document an action plan with owners and dates.
5. Share learnings with the rest of the organization.
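
The "5 business days" deadline can be converted into calendar days with a short loop. This sketch counts Monday to Friday as business days and ignores public holidays; `calendar_days_for` is a hypothetical helper, and the weekday uses the ISO numbering printed by `date +%u` (1 = Monday, 7 = Sunday).

```bash
#!/usr/bin/env bash
# Sketch only: calendar days needed to cover N business days (Mon-Fri).
# Public holidays are not taken into account.

# calendar_days_for <iso_weekday_of_close 1-7> <business_days>
calendar_days_for() {
  local dow="$1" remaining="$2" days=0
  while (( remaining > 0 )); do
    days=$(( days + 1 ))
    dow=$(( dow % 7 + 1 ))          # advance to the next weekday, wrapping 7 -> 1
    if (( dow <= 5 )); then
      remaining=$(( remaining - 1 ))
    fi
  done
  echo "$days"
}
```

An incident closed on a Monday gives `calendar_days_for 1 5`, which prints `7`, so the post-mortem is due by the following Monday. On systems with GNU `date`, `date -d "+7 days" +%F` then yields the due date.
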
### Post-mortem template
```markdown
# Post-mortem: <title>
## Summary
<2-3 sentences about what happened and what the impact was>
## Timeline
- HH:MM - <event>
- HH:MM - <event>
## Root cause
<technical description of the cause>
## What went well
- <item>
- <item>
## What went wrong
- <item>
- <item>
## Corrective actions
- [ ] <action> - owner: <person> - due: <date>
- [ ] <action> - owner: <person> - due: <date>
## Lessons learned
<actionable insights for the team and the organization>
```
## Appendix: glossary
- **IC**: Incident Commander.
- **SME**: Subject Matter Expert.
- **SLA**: Service Level Agreement.
- **p99**: 99th-percentile latency, the value below which 99% of requests complete.
- **Blameless**: culture where the post-mortem looks for systemic causes, not blame.
- **Rollback**: reverting a change to the previous version.
- **Mitigation**: action that reduces impact without necessarily addressing the root cause.
- **Resolution**: confirmation that the system is stable.
---
> If you find outdated or missing information in this runbook, edit the file and notify the SRE team. The source of truth is always the repository, not PDFs attached in Confluence.