sic/knowledge/runbooks/incident-response.md

title: Incident Response Framework
tags: incident, response, framework, sev, runbook
owner: sre
updated: 2026-06-20

Incident Response Framework

Severities

  • SEV1: total outage. Page on-call. Mitigate first, post-mortem after.
  • SEV2: significant degradation. Ticket + stakeholder communication.
  • SEV3: minor impact. Normal ticket.
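
The severity levels above could be encoded as a small lookup table so that tooling makes the same decisions as the runbook. This is a minimal sketch, not part of the SIC codebase: the names `Severity`, `SeverityPolicy`, `SEVERITY_POLICIES`, and `requiresPage` are hypothetical, and the field layout is an assumption.

```typescript
// Hypothetical encoding of the runbook's severity table.
type Severity = "SEV1" | "SEV2" | "SEV3";

interface SeverityPolicy {
  impact: string;      // impact description, taken from the runbook
  pageOnCall: boolean; // only SEV1 pages on-call per the runbook
  actions: string[];   // actions listed for this severity
}

const SEVERITY_POLICIES: Record<Severity, SeverityPolicy> = {
  SEV1: {
    impact: "total outage",
    pageOnCall: true,
    actions: ["page on-call", "mitigate first", "post-mortem after"],
  },
  SEV2: {
    impact: "significant degradation",
    pageOnCall: false,
    actions: ["open ticket", "communicate with stakeholders"],
  },
  SEV3: {
    impact: "minor impact",
    pageOnCall: false,
    actions: ["open normal ticket"],
  },
};

// True when the runbook says the on-call engineer must be paged.
function requiresPage(severity: Severity): boolean {
  return SEVERITY_POLICIES[severity].pageOnCall;
}
```

Under these assumptions, `requiresPage("SEV1")` is `true` and `requiresPage("SEV2")` is `false`.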

Steps

  1. Detect: automatic alert or report.
  2. Triage: identify scope and severity.
  3. Mitigate: apply runbook or workaround before the root-cause fix.
  4. Communicate: status page and stakeholders every 30 min for SEV1.
  5. Resolve: apply the root-cause fix.
  6. Post-mortem: blameless, within 5 business days.
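
The two time limits in the steps above (an update every 30 minutes for SEV1, and a post-mortem within 5 business days) can be computed mechanically. The sketch below is illustrative only. It assumes business days are Monday to Friday with no holiday calendar, that the 5-day window starts at resolution time (the runbook does not say), and that dates are handled in UTC. The function names are hypothetical.

```typescript
// SEV1 communication cadence from the runbook: every 30 minutes.
const SEV1_UPDATE_INTERVAL_MIN = 30;

// Returns when the next SEV1 status update is due after the last one.
function nextSev1UpdateDue(lastUpdate: Date): Date {
  return new Date(lastUpdate.getTime() + SEV1_UPDATE_INTERVAL_MIN * 60 * 1000);
}

// Returns the post-mortem deadline, counting business days after resolution.
// Assumption: Saturday and Sunday are skipped; public holidays are ignored.
function postMortemDueDate(resolvedAt: Date, businessDays: number = 5): Date {
  const due = new Date(resolvedAt.getTime());
  let remaining = businessDays;
  while (remaining > 0) {
    due.setUTCDate(due.getUTCDate() + 1); // advance one calendar day
    const day = due.getUTCDay();          // 0 = Sunday, 6 = Saturday
    if (day !== 0 && day !== 6) {
      remaining--;                        // count only weekdays
    }
  }
  return due;
}
```

For example, an incident resolved on Monday 2026-06-01 would have its post-mortem due on Monday 2026-06-08, and one resolved on Friday 2026-06-05 would have it due on Friday 2026-06-12.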

Roles

  • Incident Commander
  • Communications Lead
  • Subject Matter Expert

Related webhooks

  • service-restart
  • dns-flush
  • disk-cleanup
  • log-tail