# n8n Docker-log Alerting (ntfy + WhatsApp)
## Context
The user runs a Coolify host at `twala.rahamafresh.com` with ~50 containers across ~15 logically distinct services (tracksolid telemetry pipeline, Coolify itself, n8n stacks, Supabase, Chatwoot, Evolution API, Dekart, Forgejo, Ente, Garage, etc.). They want **n8n to read Docker logs directly, segment by service, apply per-service thresholds, and notify via ntfy and WhatsApp**.
Dozzle is explicitly out of scope as an integration source — it stays as a human-facing log viewer. The integration design must not depend on it.
Why this matters: today, errors in any container are invisible until someone opens Dozzle. Critical issues (panics, OOMs, ingest failures on the tracksolid pipeline) can sit unnoticed for hours. The goal is per-service alerting with severity-aware routing, with thresholds tunable per service so that noisy services don't drown out quiet ones.
## Decisions (locked with the user)
| Choice | Decision |
| --- | --- |
| n8n instance | `n8n-o55elukmxacgp1s2xcwktyam` (queue mode: main + worker + task-runners + Postgres + Redis) |
| Docker log access | New **read-only log-proxy** container — n8n never touches `/var/run/docker.sock` |
| Service grouping | Auto-derive from each container's `COOLIFY_RESOURCE_UUID` env var |
| Channels | Self-hosted ntfy (new Coolify service) **+** existing Evolution API (WhatsApp) |
| Git | Workspace `/Users/kianiadee/Downloads/projects/03_dozzle_n8n` is **not** a git repo yet — user creates separate repo later |
## Architecture
```
                                            n8n queue-mode (o55elukmxacgp1s2xcwktyam)
                                            ┌────────────────────────────────────────┐
Docker Engine         log-proxy             │ Workflow: Poll & Evaluate (per group)  │
┌──────────────┐      (new svc)             │  1. GET /logs/<group>?since=<cursor>   │
│ /var/run/    │ ◄─── RO socket ────►       │  2. regex → severity                   │
│ docker.sock  │       HTTP API             │  3. threshold + cooldown via           │
└──────────────┘    (internal net)        ◄─┤     getWorkflowStaticData()            │
                                            │  4. emit Alert event                   │
                                            │                                        │
                                            │ Workflow: Notify (single, parametric)  │
                                            │  severity=critical → ntfy + WhatsApp   │
                                            │  severity=error    → ntfy              │
                                            │  severity=warn     → ntfy (low prio)   │
                                            └────────────────────┬───────────────────┘
                                          ┌──────────────────────┴──────────────────────┐
                                          ▼                                             ▼
                           ntfy (self-hosted via Coolify)                 Evolution API (api-vc4ok...)
                           POST /<topic>                                  POST /message/sendText/<instance>
```
## Components
### 1. log-proxy (new container)
**Purpose**: the only thing with `docker.sock` access. Dumb pipe — no alerting logic.
**Image**: small Python/FastAPI or Node/Fastify app (~50 lines). Build from source in this repo.
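**Mount**: `/var/run/docker.sock`, mounted read-only (`:ro`). The `:ro` flag does not restrict which Docker API calls can be made over the socket, so the real guarantee is that the proxy code only ever issues `GET` requests.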
**Network**: joined to the n8n stack's Coolify network so n8n can reach it by hostname; **no Traefik route** (not publicly reachable).
**API** (no auth needed — internal only; optional bearer token for defence in depth):
- `GET /services` → `[{ "group": "bo3no...", "name": "tracksolid", "containers": [...] }, ...]`
  - Groups containers by `COOLIFY_RESOURCE_UUID` env var.
  - Filtered to the allow-list in `/config/groups.yml`; UUIDs not listed are skipped entirely.
- `GET /logs/<group>?since=<unix_ts>&until=<unix_ts>&limit=2000`
  - Calls Docker Engine API `GET /containers/<id>/logs?stdout=1&stderr=1&since=...&until=...&timestamps=1` for every container in the group.
  - Returns NDJSON or JSON array of `{ container, ts, stream, line }`.
  - `since` defaults to "now − 60s" if absent; `until` defaults to "now".
- `GET /healthz`
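The `/logs/<group>` contract above implies turning the raw Docker log stream into `{ container, ts, stream, line }` records. A minimal sketch of that step follows. It assumes the container runs without a TTY, in which case the Engine API multiplexes stdout and stderr with an 8-byte frame header, and that `timestamps=1` was requested so each line starts with an RFC 3339 timestamp. The function names (`demux`, `parse_ts`, `to_records`) are illustrative, not part of the plan.

```python
import struct
from datetime import datetime, timezone

# Stream ids used in the frame header of Docker's multiplexed log format.
STREAMS = {0: "stdin", 1: "stdout", 2: "stderr"}


def demux(raw: bytes):
    """Yield (stream, payload) pairs from a multiplexed (non-TTY) log body."""
    pos = 0
    while pos + 8 <= len(raw):
        # Header: 1 byte stream id, 3 padding bytes, 4-byte big-endian size.
        stream_id, size = struct.unpack_from(">BxxxI", raw, pos)
        pos += 8
        yield STREAMS.get(stream_id, "unknown"), raw[pos:pos + size]
        pos += size


def parse_ts(ts_text: str) -> float:
    """Convert a UTC RFC 3339 timestamp (nanosecond precision) to Unix seconds."""
    head, _, frac = ts_text.rstrip("Z").partition(".")
    base = datetime.strptime(head, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return base.timestamp() + (float("0." + frac) if frac else 0.0)


def to_records(container: str, raw: bytes) -> list:
    """Build the { container, ts, stream, line } records for one container."""
    records = []
    for stream, payload in demux(raw):
        for entry in payload.decode("utf-8", errors="replace").splitlines():
            if not entry:
                continue
            ts_text, _, line = entry.partition(" ")
            records.append(
                {"container": container, "ts": parse_ts(ts_text), "stream": stream, "line": line}
            )
    return records
```

Containers started with a TTY return an unframed stream, so a real proxy would need to branch on that before calling `demux`.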
**Why a proxy and not direct socket-into-n8n**: any n8n editor user becomes root-on-host if n8n has the socket. Proxy keeps blast radius small and the API surface inspectable.
### 2. Self-hosted ntfy
Deploy via Coolify's one-click marketplace (or as a Docker Compose service).
- Suggested FQDN (matches your existing pattern): `ntfy.rahamafresh.com`
- Auth: enable `auth-default-access: deny-all`; create per-topic users (one publisher user for n8n, plus client users for each subscriber).
- Topics: one per service group, e.g. `tracksolid-alerts`, `coolify-alerts`, `evolution-api-alerts`. Subscribe on phones via the ntfy mobile app.
### 3. n8n workflows (in `n8n-o55elukmxacgp1s2xcwktyam`)
**A. Poll & Evaluate** (one workflow per service group — easiest to tune independently)
Nodes:
1. **Schedule Trigger** — every 30s (tunable per group).
2. **Static Data Read** — pull `last_cursor` from `$getWorkflowStaticData('global').cursor`.
3. **HTTP Request** → `GET http://log-proxy:8080/logs/<group>?since=<cursor>`.
4. **Function (Pattern Match)** — for each line, run severity regexes (from workflow Variables) and emit `{ severity, pattern, container, ts, line, fingerprint }` where `fingerprint = sha256(group:pattern:container)` (used for cooldown).
5. **Function (Threshold + Cooldown)**:
   - `critical`: emit immediately if not in cooldown.
   - `error`: count rolling matches per fingerprint over `window` minutes; emit when threshold crossed.
   - `warn`: same but larger window / threshold.
   - Cooldown: `staticData.cooldowns[fingerprint] = now + cooldown_minutes`; skip while still hot.
6. **Static Data Write** — update `cursor = max(ts seen)` and `cooldowns`.
7. **Execute Workflow** — call the **Notify** workflow once per emitted Alert.
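The pattern-match and cursor steps (nodes 4 and 6) can be sketched as plain functions. This is a logic illustration in Python under stated assumptions, not the node code itself: the `PATTERNS` subset, the "first matching severity wins" rule, and the helper names are all hypothetical. The fingerprint follows the `sha256(group:pattern:container)` definition given above.

```python
import hashlib
import re

# Hypothetical subset of patterns; the real set would come from the group's configuration.
PATTERNS = {
    "critical": [r"panic", r"FATAL", r"OOMKilled"],
    "error": [r"\bERROR\b", r"Exception"],
    "warn": [r"\bWARN(ING)?\b", r"timeout"],
}
# Assumption: a line is reported once, at the highest severity it matches.
SEVERITY_ORDER = ["critical", "error", "warn"]


def fingerprint(group: str, pattern: str, container: str) -> str:
    """sha256(group:pattern:container), used as the cooldown key."""
    return hashlib.sha256(f"{group}:{pattern}:{container}".encode()).hexdigest()


def classify(group: str, records: list) -> list:
    """Attach severity, pattern and fingerprint to every record that matches."""
    matches = []
    for rec in records:
        for severity in SEVERITY_ORDER:
            hit = next((p for p in PATTERNS[severity] if re.search(p, rec["line"])), None)
            if hit:
                matches.append(
                    {
                        **rec,
                        "severity": severity,
                        "pattern": hit,
                        "fingerprint": fingerprint(group, hit, rec["container"]),
                    }
                )
                break
    return matches


def next_cursor(cursor: float, records: list) -> float:
    """cursor = max(ts seen); unchanged when the poll returned nothing."""
    return max([cursor] + [r["ts"] for r in records])
```

Keeping the cursor unchanged on an empty poll means a quiet interval never moves the window backwards.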
**B. Notify** (single parametric workflow; called by each Poll workflow)
Input: `{ group, severity, pattern, container, ts, line, fingerprint }`
Nodes:
1. **Switch** on `severity`.
2. **critical** branch:
   - **HTTP Request** → ntfy: `POST https://ntfy.rahamafresh.com/<group>-alerts` with priority=5, tags=`rotating_light`.
   - **HTTP Request** → Evolution API: `POST https://<evolution-api-fqdn>/message/sendText/<instance>` with `{ number, text }`. Credentials via n8n credentials store.
3. **error** branch: ntfy only, priority=4.
4. **warn** branch: ntfy only, priority=3 (ntfy's default level; priority=2 is ntfy's `low`, if the "low prio" routing is meant literally).
5. **Append-row** (Postgres node, optional) → `alerts_audit` table for history.
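The Switch branches above amount to building one ntfy request per alert plus a WhatsApp payload for `critical` only. A rough sketch of that mapping is below; it only constructs the requests and sends nothing. The `Title` format, the helper names, and passing credentials as arguments are assumptions made for illustration. In n8n the credentials would come from the credentials store, and the Evolution API auth header is deliberately left out here.

```python
import base64
import urllib.request

NTFY_BASE = "https://ntfy.rahamafresh.com"
# Priorities as listed in the Notify branches above.
NTFY_PRIORITY = {"critical": "5", "error": "4", "warn": "3"}


def build_ntfy_request(alert: dict, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the ntfy publish request for one alert."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        # Hypothetical title format.
        "Title": f"[{alert['severity']}] {alert['group']} / {alert['container']}",
        "Priority": NTFY_PRIORITY[alert["severity"]],
    }
    if alert["severity"] == "critical":
        headers["Tags"] = "rotating_light"
    return urllib.request.Request(
        f"{NTFY_BASE}/{alert['group']}-alerts",
        data=alert["line"].encode("utf-8"),
        headers=headers,
        method="POST",
    )


def build_whatsapp_payload(alert: dict, number: str):
    """Return the { number, text } body for critical alerts, None otherwise."""
    if alert["severity"] != "critical":
        return None
    text = f"CRITICAL {alert['group']} / {alert['container']}: {alert['line']}"
    return {"number": number, "text": text}
```

Because all routing lives in these two builders, adding a channel later means adding one more builder rather than touching each Poll workflow.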
### 4. Defaults (tunable per group via workflow Variables)
| Severity | Default patterns | Threshold | Cooldown | Routing |
| --- | --- | --- | --- | --- |
| critical | `panic`, `FATAL`, `OOMKilled`, `out of memory`, `segmentation fault` | immediate (1 match) | 30 min | ntfy + WhatsApp |
| error | `\bERROR\b`, `Exception`, `Traceback`, `5\d\d ` (HTTP 5xx) | 10 / 5 min | 15 min | ntfy |
| warn | `\bWARN(ING)?\b`, `deadlock`, `timeout` | 50 / 15 min | 30 min | ntfy (low prio) |
These live as a JSON object in each workflow's Variables, so per-group tuning is one edit.
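As an illustration of how the defaults table could drive the Threshold + Cooldown step, here is a small sketch with the table expressed as a dictionary. It assumes that "threshold crossed" means the count reaches the threshold within the window, that the window is measured back from each match's own timestamp, and that `state` has the shape `{"hits": {...}, "cooldowns": {...}}` standing in for workflow static data. All names are hypothetical.

```python
# The defaults table above, expressed as data (minutes for windows and cooldowns).
DEFAULTS = {
    "critical": {"threshold": 1, "window_min": 0, "cooldown_min": 30},
    "error": {"threshold": 10, "window_min": 5, "cooldown_min": 15},
    "warn": {"threshold": 50, "window_min": 15, "cooldown_min": 30},
}


def evaluate(matches: list, state: dict, now: float, rules: dict = DEFAULTS) -> list:
    """Return the alerts to emit for this poll and update state in place."""
    hits = state.setdefault("hits", {})
    cooldowns = state.setdefault("cooldowns", {})
    alerts = []
    for match in sorted(matches, key=lambda m: m["ts"]):
        rule = rules[match["severity"]]
        fp = match["fingerprint"]
        # Skip while the fingerprint is still in cooldown.
        if cooldowns.get(fp, 0) > now:
            continue
        # Keep only earlier hits inside the rolling window, then count this one.
        window_start = match["ts"] - rule["window_min"] * 60
        recent = [t for t in hits.get(fp, []) if t >= window_start] + [match["ts"]]
        hits[fp] = recent
        if len(recent) >= rule["threshold"]:
            alerts.append(match)
            cooldowns[fp] = now + rule["cooldown_min"] * 60
            hits[fp] = []  # start counting afresh after an alert
    return alerts
```

With these defaults, eleven `ERROR` lines for one fingerprint inside a single poll produce one alert: the tenth line crosses the threshold and the eleventh falls into the cooldown.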
### 5. Group naming
Friendly names mapped from Coolify resource UUID — sourced from `groups.yml` mounted into log-proxy. **`groups.yml` is also the allow-list**: only UUIDs listed here are monitored. Anything else the proxy sees on the host is ignored — non-mission-critical apps don't generate noise or burn polling cycles.
```yaml
bo3nov2ija7g8wn9b1g2paxs: tracksolid
o55elukmxacgp1s2xcwktyam: n8n-prod
usoksgg8o40044g0cw08s8wc: n8n-simple
vc4ok84gw4s0kcgwwg8gooco: evolution-api
ks4sc8k4804swk0c0c4kk44c: chatwoot
foo048cw4skg8kswwsowwo0c: forgejo
u7rj0du43d33ncurig2t6ni1: dekart
e11bva63bu7swlq6zyfckxm3: rustfs
now8k08wcs044scwggos0wos: dozzle
# Coolify core, Supabase, shutterdiplomacy → handled as their own groups
#
# Explicitly NOT monitored (non-mission-critical, per user 2026-05-17):
# dy82njm7qgb5f2m573d1u3rh garage
# r77s24tgmfifmpfqe86xyqsp ente
# vw0wk0cg8gkwgwogsg4k0gsg excalidraw
```
Implication on the proxy: `GET /services` returns only allow-listed groups; `GET /logs/<group>` 404s for non-allow-listed UUIDs. To start monitoring a service later, add a single line to `groups.yml` and clone a Poll workflow.
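The grouping and allow-list behaviour described above could look roughly like this inside the proxy. The sketch assumes the input is a list of `docker inspect`-style dictionaries, where `Config.Env` is a list of `KEY=VALUE` strings and `Name` carries a leading slash, and that `groups.yml` has already been loaded into a plain dictionary. The function names are illustrative.

```python
# Stand-in for the parsed groups.yml (two entries shown).
ALLOW_LIST = {
    "bo3nov2ija7g8wn9b1g2paxs": "tracksolid",
    "vc4ok84gw4s0kcgwwg8gooco": "evolution-api",
}


def env_value(inspect: dict, key: str):
    """Read one variable from Config.Env, or None if it is absent."""
    for entry in inspect.get("Config", {}).get("Env") or []:
        name, _, value = entry.partition("=")
        if name == key:
            return value
    return None


def group_containers(inspects: list, allow_list: dict = ALLOW_LIST) -> list:
    """Build the /services response: allow-listed groups only."""
    groups = {}
    for item in inspects:
        uuid = env_value(item, "COOLIFY_RESOURCE_UUID")
        if uuid not in allow_list:
            continue  # not listed in groups.yml, or not Coolify-managed
        entry = groups.setdefault(
            uuid, {"group": uuid, "name": allow_list[uuid], "containers": []}
        )
        entry["containers"].append(item["Name"].lstrip("/"))
    return list(groups.values())


def is_monitored(group: str, allow_list: dict = ALLOW_LIST) -> bool:
    """False means /logs/<group> should answer 404."""
    return group in allow_list
```

Since both endpoints consult the same dictionary, adding one line to `groups.yml` is enough to make a service visible to `/services` and `/logs/<group>` at once.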
## Workspace layout
```
/Users/kianiadee/Downloads/projects/03_dozzle_n8n/   ← no git yet
├── log-proxy/
│   ├── Dockerfile
│   ├── app.py                  (FastAPI: /services, /logs/<group>, /healthz)
│   ├── requirements.txt
│   └── groups.yml              (UUID → friendly-name map)
├── ntfy/
│   └── README.md               (Coolify deploy notes + topic / user setup)
├── n8n/
│   └── workflows/
│       ├── poll-tracksolid.json
│       ├── poll-coolify.json
│       ├── poll-evolution.json
│       ├── poll-<group>.json   ← one per group, derived from a template
│       └── notify.json         ← parametric fan-out
├── coolify/
│   └── log-proxy.compose.yml   (for Coolify "Docker Compose" service)
└── README.md                   (operating runbook: how to add a group, tune thresholds, rotate ntfy creds)
```
## Implementation steps (ordered)
1. **Build log-proxy** locally (`log-proxy/`). Test against the remote docker socket via `docker context` or just deploy and iterate.
2. **Deploy log-proxy via Coolify** as a Docker Compose service. Attach to the same network as `n8n-o55...`. No Traefik route. Verify `GET /services` and `GET /logs/<group>` from inside the n8n container (`docker exec n8n-o55... wget -qO- http://log-proxy:8080/services`).
3. **Deploy self-hosted ntfy via Coolify** at `ntfy.rahamafresh.com`. Configure deny-all default and one publisher user. Subscribe phones to test topic.
4. **Build the parametric Notify workflow** in n8n. Add credentials: `ntfy_publisher` (HTTP basic), `evolution_api` (header auth). Test by manually firing each branch.
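5. **Build the Poll & Evaluate workflow** for **one group first** (suggest `tracksolid`, the highest business value). Validate thresholds with a synthetic log line, e.g. `docker exec ingest_events-bo3no... sh -c 'echo FATAL test > /proc/1/fd/1'` or similar. The redirect matters: output of a plain `docker exec ... echo` goes to the exec session, not to the container log that the proxy reads.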
6. **Clone the Poll workflow per remaining group**. Tune patterns / thresholds in Variables.
7. **Tune & quiet**: run for 24h, capture false positives, adjust regex / thresholds.
8. **Document** in `README.md` how to add a new group when Coolify spins up a new service.
## Critical files
- `log-proxy/app.py` — the only thing with docker.sock access. Treat as security-sensitive; no write endpoints, no shell-out.
- `log-proxy/groups.yml` — single source of truth for UUID → friendly name. Keep in sync as Coolify services are added.
- `n8n/workflows/notify.json` — fan-out logic; any new channel (Slack, email) is added here, not in each poll workflow.
- `n8n/workflows/poll-<group>.json` — per-group thresholds. Variables block at the top is the only thing operators normally edit.
- `coolify/log-proxy.compose.yml` — controls log-proxy deployment + network attachment. Misconfiguring network = n8n can't reach proxy.
## Reused / existing infrastructure
- **n8n queue mode** `n8n-o55elukmxacgp1s2xcwktyam` — runs the workflows; its built-in Postgres + Redis cover persistence and queueing. No new DB needed.
- **Evolution API** `api-vc4ok84gw4s0kcgwwg8gooco` — already deployed; we only consume its REST API.
- **Coolify Sentinel** `coolify-sentinel` — left untouched; could later feed container-down events into the same Notify workflow if desired.
- **Coolify networks + Traefik** — handle internal service discovery and TLS for ntfy.
- **All Coolify-managed containers already carry `COOLIFY_RESOURCE_UUID`** — confirmed via `docker inspect` on the Dozzle container in the previous session. This is what makes auto-grouping possible without a hand-written container list.
## Open items to gather at implementation time
- `ntfy.rahamafresh.com` DNS record (or chosen FQDN).
- Evolution API: instance name, API key, target WhatsApp number(s).
- Confirmation of which Coolify network `n8n-o55...` runs on (read from `docker inspect` at implementation start).
- Optional: bearer token value for log-proxy if defence-in-depth is wanted.
## Verification
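1. **log-proxy unit checks**: from inside the n8n container, `curl http://log-proxy:8080/services` returns all allow-listed groups; `curl "http://log-proxy:8080/logs/tracksolid?since=$(( $(date +%s) - 300 ))"` returns lines from the last five minutes across all tracksolid containers. Shell arithmetic is used here because BusyBox `date`, if that is what the image ships, does not accept GNU-style `-d '5 minutes ago'`.
2. **End-to-end critical alert**: write a synthetic line into the log of an existing tracksolid container, e.g. `docker exec <tracksolid-container> sh -c 'echo "FATAL synthetic test from $(date)" > /proc/1/fd/1'`. A throwaway `docker run --rm alpine ...` container would not carry the group's `COOLIFY_RESOURCE_UUID`, and plain `docker exec` output never reaches the container log. Within 30s, ntfy topic `tracksolid-alerts` receives a high-priority message AND the WhatsApp number receives the same.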
3. **Threshold smoke test**: emit 11 lines containing `ERROR` to a single container over 30s; expect exactly one ntfy notification, not eleven.
4. **Cooldown smoke test**: trigger the same critical alert twice within the cooldown window; expect only one notification.
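5. **Cursor durability**: restart the n8n worker; confirm the cursor stored via `getWorkflowStaticData` persisted in Postgres and no logs were re-processed or skipped. Run this check against the activated workflow, since n8n does not save static data during manual test executions.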
6. **Per-group isolation**: deliberately spam errors in one group; confirm other groups' workflows are unaffected (separate workflow = separate static data, separate schedule).
7. **Read-only safety**: from inside n8n, attempt `POST http://log-proxy:8080/anything` — expect 404/405. Confirm `docker.sock` is not mounted inside n8n.