Initial commit: SIC harness (backend, web, pi-adapter, configs, docs)

- pnpm monorepo: apps/api (Fastify + SQLite + SSE), apps/web (React+Vite), packages/shared, packages/pi-adapter
- Local auth (admin/webhook-runner roles) + Keycloak JWT ready
- Multi-session chat with reliable history (user persisted before LLM, assistant persisted after stream)
- Markdown knowledge base with /api/docs/search + /api/docs/:id
- YAML webhook catalog with backend-only execution, retry/backoff, audit (webhook_runs), and per-user rate limit
- Skills config (sre-on-call, blameless-postmortem, security-incident) injected into LLM system prompt
- LLM provider failover chain (config/models.yml fallback + LLM_FALLBACK_CHAIN override)
- Context-aware webhooks panel + backend id-mention safety net
- Per-message stats (time/duration/tokens/model), Markdown+GFM render, code & table copy/download buttons
- Vitest suite, end-to-end smoke test (scripts/smoke.mjs), per-session system prompt override
- /metrics Prometheus endpoint + /api/metrics JSON, request-id correlation
- dotenv with explicit repo-root path; envString/envNumber helpers (handles empty-string env)
- Runbooks + SOPs under knowledge/ in English; README, docs, and INDEX.md in English
2026-06-29 16:20:53 +02:00
commit 62728b2200
89 changed files with 11992 additions and 0 deletions
--- a/knowledge/runbooks/disk-cleanup.md
+++ b/knowledge/runbooks/disk-cleanup.md
@@ -0,0 +1,25 @@
+---
+title: Disk Cleanup Runbook
+tags: [disk, cleanup, storage, operations]
+owner: sre
+updated: 2026-06-12
+---
+
+# Disk Cleanup Runbook
+
+## When to use it
+
+- `disk usage > 85%` alert on /tmp or /var.
+- Job failures with `No space left on device`.
+- Before scheduled node maintenance.
+
+## Procedure
+
+1. List candidate files: `find /tmp -type f -mtime +7`.
+2. Confirm none are in use by an active process.
+3. Run the `disk-cleanup` webhook to remove /tmp files older than 7 days.
+4. Re-verify disk usage.
+
+## Related webhooks
+
+- disk-cleanup
--- a/knowledge/runbooks/dns-flush.md
+++ b/knowledge/runbooks/dns-flush.md
@@ -0,0 +1,24 @@
+---
+title: DNS Flush Runbook
+tags: [dns, network, cache, troubleshooting]
+owner: netops
+updated: 2026-06-10
+---
+
+# DNS Flush Runbook
+
+## Symptoms
+
+- DNS resolutions return stale IPs.
+- Users report that a site "works on some machines and not on others".
+- Recent DNS changes are not propagating.
+
+## Diagnosis
+
+1. Check the local cache with `ipconfig /displaydns` (Windows) or `resolvectl statistics` (Linux with systemd-resolved).
+2. Confirm the upstream resolver is responding.
+3. Run the `dns-flush` webhook on the affected machine.
+
+## Related webhooks
+
+- dns-flush
--- a/knowledge/runbooks/incident-response-long.md
+++ b/knowledge/runbooks/incident-response-long.md
@@ -0,0 +1,280 @@
+---
+title: Production Incident Response Runbook (long-form)
+tags: [incident, production, sre, on-call, runbook, master]
+owner: sre
+updated: 2026-06-28
+---
+
+# Production Incident Response Runbook (long-form)
+
+> This runbook is designed to exercise the UI: it contains nested headings, lists, tables, code blocks, blockquotes, links, and enough volume to force scroll in the modal. Use it as a reference during drills and to validate the look of the documentation viewer.
+
+## Table of contents
+
+1. [Purpose and scope](#purpose-and-scope)
+2. [Severities and SLAs](#severities-and-slas)
+3. [Roles and responsibilities](#roles-and-responsibilities)
+4. [Response flow](#response-flow)
+5. [Initial diagnosis](#initial-diagnosis)
+6. [Common incident patterns](#common-incident-patterns)
+7. [Useful commands](#useful-commands)
+8. [Available webhooks](#available-webhooks)
+9. [Escalation](#escalation)
+10. [Post-mortem](#post-mortem)
+11. [Appendix: glossary](#appendix-glossary)
+
+## Purpose and scope
+
+This runbook defines the standard procedure for responding to production incidents that affect the availability, integrity, or performance of critical services. It applies to every engineering and operations team that maintains services in scope of SIC.
+
+### When to use this runbook
+
+- Partial or total service outages.
+- Severe performance degradation (p99 latency > agreed SLA).
+- Confirmed or suspected data loss or corruption.
+- Security alerts with production impact.
+
+### When NOT to use this runbook
+
+- Failures in dev or staging environments without user impact.
+- Change requests or scheduled maintenance.
+- HR or administrative process incidents.
+
+## Severities and SLAs
+
+| Severity | Definition | Ack SLA | Mitigation SLA | Communication |
+| --- | --- | --- | --- | --- |
+| **SEV-1** | Total outage or data loss | 5 minutes | 60 minutes | Every 15 min |
+| **SEV-2** | Severe degradation, affects ≥ 30% of users | 10 minutes | 2 hours | Every 30 min |
+| **SEV-3** | Partial degradation, affects < 30% of users | 30 minutes | 8 hours | Every 2 hours |
+| **SEV-4** | Cosmetic, no functional impact | 1 business day | Next sprint | Async |
+
+> **Important**: severity can go up or down as the incident evolves. Document every change in the incident channel with a timestamp.
+
+## Roles and responsibilities
+
+- **Incident Commander (IC)**: coordinates the response and does not run technical tasks. The IC is the only person who can declare the incident resolved.
+- **Comms Lead**: handles communication to stakeholders, status page, and customers.
+- **Tech Lead**: leads the technical investigation, assigns tasks to the response team.
+- **Subject Matter Expert (SME)**: provides system-specific knowledge for the affected service.
+- **Scribe**: documents the incident timeline in real time.
+
+## Response flow
+
+1. **Detect**: alert, user report, or proactive monitoring.
+2. **Triage**: classify severity and assign an IC in under 5 minutes.
+3. **Convene**: open a bridge and the #inc-YYYYMMDD-XX channel.
+4. **Mitigate**: apply changes to restore service. The root cause can wait.
+5. **Resolve**: confirm the service is stable. Close the incident.
+6. **Post-mortem**: within 5 business days, blameless.
+
+### Flow diagram
+
+```mermaid
+graph TD
+  A[Detect] --> B{Triage}
+  B -->|SEV-1/2| C[Open bridge]
+  B -->|SEV-3/4| D[Assign owner]
+  C --> E[Investigate]
+  D --> E
+  E --> F{Mitigation?}
+  F -->|Yes| G[Apply fix]
+  F -->|No| H[Escalate]
+  G --> I[Monitor]
+  I --> J{Stable?}
+  J -->|Yes| K[Close]
+  J -->|No| E
+  H --> E
+  K --> L[Post-mortem]
+```
+
+## Initial diagnosis
+
+Before going deeper, run the following steps in order:
+
+1. Check the overall service health dashboard.
+2. Review the last hour of production changes (`deploy log`).
+3. Check active alerts in the monitoring system.
+4. Confirm the failure is not user-side (DNS, local network).
+
+### Triage checklist
+
+- [ ] Affected service identified
+- [ ] Severity assigned
+- [ ] IC identified
+- [ ] Bridge open
+- [ ] Communication channel created
+- [ ] Status page updated
+- [ ] Comms lead assigned
+
+## Common incident patterns
+
+### Pattern A: latency spike
+
+**Symptoms**: p99 latency rises from 200 ms to > 2 s without proportional traffic increase.
+
+**Typical causes**:
+- DB connection pool saturation.
+- Mass cache misses (accidental invalidation).
+- Long JVM garbage-collection pauses.
+
+**Immediate actions**:
+1. Check DB metrics (connections, locks, slow queries).
+2. Validate cache hit rate.
+3. If no cause is identified in 5 min, escalate to the service SME.
+
+### Pattern B: cascading 5xx errors
+
+**Symptoms**: sudden increase in HTTP 500/502/503 responses on one or more endpoints.
+
+**Typical causes**:
+- Upstream service down.
+- Invalid configuration deployed.
+- External resource (third-party API) unavailable.
+
+**Immediate actions**:
+1. Identify the failing upstream service.
+2. Review the last deploy touching that path.
+3. If the deploy is to blame, consider a rollback.
+
+### Pattern C: data loss
+
+**Symptoms**: customers report missing or inconsistent data.
+
+**Typical causes**:
+- Cleanup job that deleted more than intended.
+- Schema migration executed with a bug.
+- Bug in business logic.
+
+**Immediate actions**:
+1. **Stop** any job that could make things worse.
+2. Evaluate whether a recent and viable backup can be restored.
+3. Escalate immediately to the engineering lead.
+
+## Useful commands
+
+### Check connectivity
+
+```bash
+# DNS
+dig +short example.com
+
+# Basic HTTP
+curl -sSI https://api.example.com/health
+
+# TCP to a specific port
+nc -zv db.internal 5432
+```
+
+### Inspect logs live
+
+```bash
+# Last 100 lines and follow
+kubectl logs -n prod deploy/api --tail=100 -f
+
+# Logs from the last 5 minutes
+kubectl logs -n prod deploy/api --since=5m
+
+# Logs of a specific pod
+kubectl logs -n prod api-7d4f8b9c-x2k9n --tail=200
+```
+
+### Quick metrics
+
+```bash
+# CPU and memory per pod
+kubectl top pods -n prod
+
+# CPU and memory per container within each pod
+kubectl top pods -n prod --containers
+
+# Disk usage of a node
+ssh node-01 df -h
+```
+
+## Available webhooks
+
+| Webhook | When to use it | Requires confirmation |
+| --- | --- | --- |
+| `vpn-diagnostic` | VPN access issues | Yes |
+| `service-restart` | Hung or zombie service | Yes |
+| `dns-flush` | Broken DNS resolution | No |
+| `disk-cleanup` | Disk > 90% | Yes |
+| `log-tail` | Need logs in real time | No |
+| `cache-purge` | Stale or corrupt cache | Yes |
+
+> Remember: a webhook only runs when a user explicitly triggers it, and those marked **Yes** above also ask for confirmation before executing. The LLM can only recommend webhooks; it must never execute them directly.
+
+## Escalation
+
+If the incident is not mitigated within the agreed SLA:
+
+1. Notify the area's on-call manager.
+2. If it exceeds 2 hours, notify the engineering director.
+3. If customers are impacted, involve Customer Success.
+4. If there is monetary or data loss, notify Legal and the C-level.
+
+### Emergency contacts
+
+```text
+SRE on-call:    +54 11 5555-0001
+Platform lead:  +54 11 5555-0002
+Security IR:    +54 11 5555-0003
+CTO:            +54 11 5555-0004
+```
+
+## Post-mortem
+
+Within 5 business days of closing the incident:
+
+1. Schedule a meeting with everyone involved.
+2. Share the post-mortem document 24 h in advance.
+3. During the meeting, review the timeline and identify the root cause.
+4. Document an action plan with owners and dates.
+5. Share learnings with the rest of the organization.
+
+### Post-mortem template
+
+```markdown
+# Post-mortem: <title>
+
+## Summary
+<2-3 sentences about what happened and what the impact was>
+
+## Timeline
+- HH:MM - <event>
+- HH:MM - <event>
+
+## Root cause
+<technical description of the cause>
+
+## What went well
+- <item>
+- <item>
+
+## What went wrong
+- <item>
+- <item>
+
+## Corrective actions
+- [ ] <action> - owner: <person> - due: <date>
+- [ ] <action> - owner: <person> - due: <date>
+
+## Lessons learned
+<actionable insights for the team and the organization>
+```
+
+## Appendix: glossary
+
+- **IC**: Incident Commander.
+- **SME**: Subject Matter Expert.
+- **SLA**: Service Level Agreement.
+- **p99**: 99th percentile of latency.
+- **Blameless**: culture where the post-mortem looks for systemic causes, not blame.
+- **Rollback**: reverting a change to the previous version.
+- **Mitigation**: action that reduces impact without necessarily addressing the root cause.
+- **Resolution**: confirmation that the system is stable.
+
+---
+
+> If you find outdated or missing information in this runbook, edit the file and notify the SRE team. The source of truth is always the repository, not PDFs attached in Confluence.
--- a/knowledge/runbooks/incident-response.md
+++ b/knowledge/runbooks/incident-response.md
@@ -0,0 +1,36 @@
+---
+title: Incident Response Framework
+tags: [incident, response, framework, sev, runbook]
+owner: sre
+updated: 2026-06-20
+---
+
+# Incident Response Framework
+
+## Severities
+
+- **SEV-1**: total outage. Page on-call. Mitigate first, post-mortem after.
+- **SEV-2**: significant degradation. Ticket + stakeholder communication.
+- **SEV-3**: minor impact. Normal ticket.
+
+## Steps
+
+1. **Detect**: automatic alert or report.
+2. **Triage**: identify scope and severity.
+3. **Mitigate**: apply runbook or workaround before the root-cause fix.
+4. **Communicate**: status page and stakeholders every 30 min for SEV-1.
+5. **Resolve**: apply the root-cause fix.
+6. **Post-mortem**: blameless, within 5 business days.
+
+## Roles
+
+- Incident Commander
+- Communications Lead
+- Subject Matter Expert
+
+## Related webhooks
+
+- service-restart
+- dns-flush
+- disk-cleanup
+- log-tail
--- a/knowledge/runbooks/service-restart.md
+++ b/knowledge/runbooks/service-restart.md
@@ -0,0 +1,32 @@
+---
+title: Service Restart Runbook
+tags: [service, restart, systemd, operations]
+owner: sre
+updated: 2026-06-15
+---
+
+# Service Restart Runbook
+
+## When to use it
+
+- The service is down or not responding to health checks.
+- Sustained performance drop that cannot be explained by load.
+- After a deploy that left the service in an inconsistent state.
+
+## Diagnosis
+
+1. Confirm the current state: `systemctl status <service>` or equivalent.
+2. Review the last 200 lines of the log.
+3. Check dependencies (DB, Redis, network).
+4. If there is no clear cause, restart the service via the `service-restart` webhook.
+
+## Equivalent command
+
+```bash
+systemctl restart <service>
+```
+
+## Related webhooks
+
+- service-restart
+- log-tail
--- a/knowledge/runbooks/vpn.md
+++ b/knowledge/runbooks/vpn.md
@@ -0,0 +1,22 @@
+---
+title: VPN Runbook
+tags: [vpn, network, access]
+owner: sre
+updated: 2026-06-01
+---
+
+# VPN Runbook
+
+## Symptoms
+
+Users cannot connect to the VPN or lose access intermittently.
+
+## Diagnosis
+
+- Check the VPN service status.
+- Review gateway logs.
+- Confirm user-side connectivity.
+
+## Related webhooks
+
+- vpn-diagnostic
--- a/knowledge/sops/log-tail.md
+++ b/knowledge/sops/log-tail.md
@@ -0,0 +1,23 @@
+---
+title: Log Reading SOP
+tags: [logs, sops, troubleshooting, observability]
+owner: sre
+updated: 2026-06-05
+---
+
+# Log Reading SOP
+
+## Goal
+
+Retrieve the last N lines of a service log in under 30 seconds.
+
+## Procedure
+
+1. Identify the service and the log path.
+2. Call the `log-tail` webhook with `service` and `lines`.
+3. Look for error patterns (ERROR, CRITICAL, stack traces).
+4. If there is a matching runbook, follow it.
+
+## Related webhooks
+
+- log-tail