sic/knowledge/runbooks/incident-response.md
rikrdo 62728b2200 Initial commit: SIC harness (backend, web, pi-adapter, configs, docs)
- pnpm monorepo: apps/api (Fastify + SQLite + SSE), apps/web (React+Vite), packages/shared, packages/pi-adapter
- Local auth (admin/webhook-runner roles) + Keycloak JWT ready
- Multi-session chat with reliable history (user persisted before LLM, assistant persisted after stream)
- Markdown knowledge base with /api/docs/search + /api/docs/:id
- YAML webhook catalog with backend-only execution, retry/backoff, audit (webhook_runs), and per-user rate limit
- Skills config (sre-on-call, blameless-postmortem, security-incident) injected into LLM system prompt
- LLM provider failover chain (config/models.yml fallback + LLM_FALLBACK_CHAIN override)
- Context-aware webhooks panel + backend id-mention safety net
- Per-message stats (time/duration/tokens/model), Markdown+GFM render, code & table copy/download buttons
- Vitest suite, end-to-end smoke test (scripts/smoke.mjs), per-session system prompt override
- /metrics Prometheus endpoint + /api/metrics JSON, request-id correlation
- dotenv with explicit repo-root path; envString/envNumber helpers (handles empty-string env)
- Runbooks + SOPs under knowledge/ in English; README, docs, and INDEX.md in English
2026-06-29 16:20:53 +02:00

---
title: Incident Response Framework
tags: [incident, response, framework, sev, runbook]
owner: sre
updated: 2026-06-20
---
# Incident Response Framework
## Severities
- **SEV1**: total outage. Page on-call. Mitigate first, post-mortem after.
- **SEV2**: significant degradation. Ticket + stakeholder communication.
- **SEV3**: minor impact. Normal ticket.
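
The severity table above can be expressed as a lookup, which may help when wiring alerts or tooling to it. This is a minimal sketch: the names `Severity`, `Action`, `ACTIONS`, `actionsFor`, and `requiresPage` are hypothetical and do not come from the SIC codebase; only the severity-to-action mapping is taken from this runbook.

```typescript
// Illustrative only: these types and helpers are assumptions, not SIC APIs.
type Severity = "SEV1" | "SEV2" | "SEV3";

type Action =
  | "page-on-call"
  | "mitigate-first"
  | "post-mortem"
  | "open-ticket"
  | "notify-stakeholders";

// Mirrors the severity list in this runbook, in the order the actions are named.
const ACTIONS: Record<Severity, readonly Action[]> = {
  SEV1: ["page-on-call", "mitigate-first", "post-mortem"],
  SEV2: ["open-ticket", "notify-stakeholders"],
  SEV3: ["open-ticket"],
};

// Returns the actions this runbook lists for a given severity.
function actionsFor(severity: Severity): readonly Action[] {
  return ACTIONS[severity];
}

// True only when the runbook says to page the on-call engineer.
function requiresPage(severity: Severity): boolean {
  return ACTIONS[severity].indexOf("page-on-call") !== -1;
}
```

Keeping the mapping in one typed record means adding a severity without defining its actions fails at compile time.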
## Steps
1. **Detect**: an automated alert fires or someone reports the problem.
2. **Triage**: identify scope and severity.
3. **Mitigate**: apply the relevant runbook or a workaround before attempting the root-cause fix.
4. **Communicate**: update the status page and notify stakeholders every 30 minutes for SEV1.
5. **Resolve**: apply the root-cause fix.
6. **Post-mortem**: blameless, within 5 business days.
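
The two time-based rules in the steps above (30-minute SEV1 updates and the 5-business-day post-mortem window) can be sketched as date helpers. Several things here are assumptions rather than statements from this runbook: the function names are hypothetical, the post-mortem window is counted from the resolution time, business days are Monday to Friday in UTC, and public holidays are ignored.

```typescript
// Illustrative only: helper names and the counting rules are assumptions.
const SEV1_UPDATE_INTERVAL_MINUTES = 30;
const POST_MORTEM_BUSINESS_DAYS = 5;

// When the next SEV1 status update is due, given the time of the last one.
function nextUpdateDue(lastUpdate: Date): Date {
  return new Date(lastUpdate.getTime() + SEV1_UPDATE_INTERVAL_MINUTES * 60 * 1000);
}

// Adds a number of business days (Mon-Fri, UTC) to a date.
// Assumption: holidays are not considered.
function addBusinessDays(start: Date, days: number): Date {
  const result = new Date(start.getTime());
  let remaining = days;
  while (remaining > 0) {
    result.setUTCDate(result.getUTCDate() + 1);
    const dayOfWeek = result.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dayOfWeek !== 0 && dayOfWeek !== 6) {
      remaining -= 1;
    }
  }
  return result;
}

// Latest date for the post-mortem, assuming the window starts at resolution.
function postMortemDue(resolvedAt: Date): Date {
  return addBusinessDays(resolvedAt, POST_MORTEM_BUSINESS_DAYS);
}
```

If the team counts the window from detection rather than resolution, only the argument passed to `postMortemDue` changes.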
## Roles
- Incident Commander
- Communications Lead
- Subject Matter Expert
## Related webhooks
- service-restart
- dns-flush
- disk-cleanup
- log-tail
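
Because the webhooks above are referenced by id only, a small check can catch ids that drift out of sync with the webhook catalog. This is a sketch under assumptions: `missingFromCatalog` is a hypothetical helper, and the catalog is represented as a plain list of ids, since the actual catalog schema is not described in this runbook.

```typescript
// Ids copied from the "Related webhooks" list in this runbook.
const RELATED_WEBHOOKS: readonly string[] = [
  "service-restart",
  "dns-flush",
  "disk-cleanup",
  "log-tail",
];

// Hypothetical helper: returns the related ids that are absent from the catalog.
// `catalogIds` stands in for whatever the real webhook catalog exposes.
function missingFromCatalog(
  related: readonly string[],
  catalogIds: readonly string[],
): string[] {
  const known = new Set(catalogIds);
  return related.filter((id) => !known.has(id));
}
```

An empty result means every webhook named in the runbook has a catalog entry.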