- pnpm monorepo: apps/api (Fastify + SQLite + SSE), apps/web (React+Vite), packages/shared, packages/pi-adapter
- Local auth (admin/webhook-runner roles) + Keycloak JWT ready
- Multi-session chat with reliable history (user persisted before LLM, assistant persisted after stream)
- Markdown knowledge base with /api/docs/search + /api/docs/:id
- YAML webhook catalog with backend-only execution, retry/backoff, audit (webhook_runs), and per-user rate limit
- Skills config (sre-on-call, blameless-postmortem, security-incident) injected into LLM system prompt
- LLM provider failover chain (config/models.yml fallback + LLM_FALLBACK_CHAIN override)
- Context-aware webhooks panel + backend id-mention safety net
- Per-message stats (time/duration/tokens/model), Markdown+GFM render, code & table copy/download buttons
- Vitest suite, end-to-end smoke test (scripts/smoke.mjs), per-session system prompt override
- /metrics Prometheus endpoint + /api/metrics JSON, request-id correlation
- dotenv with explicit repo-root path; envString/envNumber helpers (handles empty-string env)
- Runbooks + SOPs under knowledge/ in English; README, docs, and INDEX.md in English
| title | tags | owner | updated |
|---|---|---|---|
| Incident Response Framework | | sre | 2026-06-20 |
# Incident Response Framework
## Severities
- SEV1: total outage. Page on-call. Mitigate first, post-mortem after.
- SEV2: significant degradation. Ticket + stakeholder communication.
- SEV3: minor impact. Normal ticket.
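
The severity levels above could be encoded as a small lookup table so tooling can decide whether to page. This is a minimal sketch, not the project's implementation: the names `Severity`, `SeverityPolicy`, `SEVERITY_POLICY`, and `shouldPageOnCall` are hypothetical, and the action strings simply restate the list above.

```typescript
// Hypothetical types and names, used for illustration only.
type Severity = "SEV1" | "SEV2" | "SEV3";

interface SeverityPolicy {
  impact: string;    // impact description, as worded in the list above
  actions: string[]; // required actions, as worded in the list above
}

const SEVERITY_POLICY: Record<Severity, SeverityPolicy> = {
  SEV1: {
    impact: "total outage",
    actions: ["page on-call", "mitigate first", "post-mortem after"],
  },
  SEV2: {
    impact: "significant degradation",
    actions: ["ticket", "stakeholder communication"],
  },
  SEV3: {
    impact: "minor impact",
    actions: ["normal ticket"],
  },
};

// True only when the policy for this severity explicitly lists paging.
function shouldPageOnCall(severity: Severity): boolean {
  return SEVERITY_POLICY[severity].actions.includes("page on-call");
}

// One-line summary suitable for a ticket title or a chat message.
function describeResponse(severity: Severity): string {
  const policy = SEVERITY_POLICY[severity];
  return `${severity} (${policy.impact}): ${policy.actions.join(", ")}`;
}
```

Deriving `shouldPageOnCall` from the action list, rather than storing a separate flag, keeps the table as the single place where the policy is written down.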
## Steps
- Detect: automatic alert or report.
- Triage: identify scope and severity.
- Mitigate: apply runbook or workaround before the root-cause fix.
- Communicate: update the status page and stakeholders every 30 min for SEV1.
- Resolve: apply the root-cause fix.
- Post-mortem: blameless, within 5 business days.
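
The two time limits in the steps above (SEV1 updates every 30 minutes, post-mortem within 5 business days) can be computed as in the sketch below. It rests on assumptions this document does not state: business days are Monday to Friday in UTC, holidays are ignored, and the 5-day window is counted from the time of resolution. All function and constant names are hypothetical.

```typescript
// Values taken from the steps above.
const SEV1_UPDATE_INTERVAL_MINUTES = 30;
const POST_MORTEM_BUSINESS_DAYS = 5;

// Assumption: business days are Monday to Friday in UTC; holidays are ignored.
function addBusinessDays(start: Date, days: number): Date {
  const result = new Date(start.getTime());
  let remaining = days;
  while (remaining > 0) {
    result.setUTCDate(result.getUTCDate() + 1);
    const dayOfWeek = result.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dayOfWeek !== 0 && dayOfWeek !== 6) {
      remaining -= 1;
    }
  }
  return result;
}

// Assumption: the window starts when the incident is resolved.
function postMortemDueDate(resolvedAt: Date): Date {
  return addBusinessDays(resolvedAt, POST_MORTEM_BUSINESS_DAYS);
}

// Time by which the next SEV1 status update should go out.
function nextSev1UpdateAt(lastUpdateAt: Date): Date {
  return new Date(lastUpdateAt.getTime() + SEV1_UPDATE_INTERVAL_MINUTES * 60 * 1000);
}
```

For example, under these assumptions an incident resolved on Friday 2026-06-19 would have its post-mortem due by Friday 2026-06-26, because the intervening weekend is skipped.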
## Roles
- Incident Commander
- Communications Lead
- Subject Matter Expert
## Related webhooks
- service-restart
- dns-flush
- disk-cleanup
- log-tail