kianiadee 1bc6e57374 Initial: plan, log-proxy app, README, gitignore

2026-05-17 23:29:55 +03:00

n8n Docker-log Alerting (ntfy + WhatsApp)

Context

The user runs a Coolify host at twala.rahamafresh.com with ~50 containers across ~15 logically distinct services (tracksolid telemetry pipeline, Coolify itself, n8n stacks, Supabase, Chatwoot, Evolution API, Dekart, Forgejo, Ente, Garage, etc.). They want n8n to read Docker logs directly, segment by service, apply per-service thresholds, and notify via ntfy and WhatsApp.

Dozzle is explicitly out of scope as an integration source — it stays as a human-facing log viewer. The integration design must not depend on it.

Why this matters: today, errors in any container are invisible until someone opens Dozzle. Critical issues (panics, OOMs, ingest failures on the tracksolid pipeline) can sit unnoticed for hours. The goal is per-service alerting with severity-aware routing, with thresholds tunable per service so that noisy services don't drown out quiet ones.

Decisions (locked with the user)

Choice	Decision
n8n instance	`n8n-o55elukmxacgp1s2xcwktyam` (queue mode: main + worker + task-runners + Postgres + Redis)
Docker log access	New read-only log-proxy container — n8n never touches `/var/run/docker.sock`
Service grouping	Auto-derive from each container's `COOLIFY_RESOURCE_UUID` env var
Channels	Self-hosted ntfy (new Coolify service) + existing Evolution API (WhatsApp)
Git	Workspace `/Users/kianiadee/Downloads/projects/03_dozzle_n8n` is not a git repo yet — user creates separate repo later

Architecture

                                            n8n queue-mode (o55elukmxacgp1s2xcwktyam)
                                            ┌────────────────────────────────────────┐
   Docker Engine          log-proxy         │  Workflow: Poll & Evaluate (per group) │
   ┌──────────────┐       (new svc)         │    1. GET /logs/<group>?since=<cursor> │
   │ /var/run/    │  ◄─── RO socket ────►   │    2. regex → severity                 │
   │  docker.sock │       HTTP API          │    3. threshold + cooldown via         │
   └──────────────┘       (internal net)  ◄─┤       getWorkflowStaticData()          │
                                            │    4. emit Alert event                 │
                                            │                                        │
                                            │  Workflow: Notify (single, parametric) │
                                            │    severity=critical → ntfy + WhatsApp │
                                            │    severity=error    → ntfy            │
                                            │    severity=warn     → ntfy (low prio) │
                                            └────────────────────┬───────────────────┘
                                                                 │
                                          ┌──────────────────────┴──────────────────────┐
                                          ▼                                             ▼
                                ntfy (self-hosted via Coolify)             Evolution API (api-vc4ok...)
                                POST /<topic>                              POST /message/sendText/<instance>

Components

1. log-proxy (new container)

Purpose: the only thing with docker.sock access. Dumb pipe — no alerting logic.

Image: small Python/FastAPI or Node/Fastify app (~50 lines). Build from source in this repo.

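Mount: /var/run/docker.sock read-only. The read-only flag protects only the socket file; the Docker API stays fully usable through it, so the real safeguard is that log-proxy issues nothing but GET requests.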

Network: joined to the n8n stack's Coolify network so n8n can reach it by hostname; no Traefik route (not publicly reachable).

API (no auth needed — internal only; optional bearer token for defence in depth):

GET /services — [{ "group": "bo3no...", "name": "tracksolid", "containers": [...] }, ...]
- Groups containers by COOLIFY_RESOURCE_UUID env var.
- Filtered to the allow-list in /config/groups.yml — UUIDs not listed are skipped entirely.
GET /logs/<group>?since=<unix_ts>&until=<unix_ts>&limit=2000
- Calls Docker Engine API GET /containers/<id>/logs?stdout=1&stderr=1&since=...&until=...&timestamps=1 for every container in the group.
- Returns NDJSON or JSON array of { container, ts, stream, line }.
- since defaults to "now − 60s" if absent; until defaults to "now".
GET /healthz
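
The /logs endpoint has to turn the Docker Engine response into { container, ts, stream, line } records. For containers started without a TTY, the Engine returns a multiplexed body in which every frame carries an 8-byte header (stream id, three padding bytes, big-endian payload length). The sketch below shows one way to decode that body; the function name `demux_docker_logs` is hypothetical, and it assumes `timestamps=1` was requested so each line starts with an RFC 3339 timestamp. Containers running with a TTY return a raw stream without headers and would need a separate path.

```python
import struct
from typing import Iterator

# Stream ids used in the frame header of a multiplexed Docker log body.
STREAMS = {0: "stdin", 1: "stdout", 2: "stderr"}


def demux_docker_logs(raw: bytes, container: str) -> Iterator[dict]:
    """Yield {container, ts, stream, line} records from a multiplexed log body."""
    offset = 0
    while offset + 8 <= len(raw):
        # Header: 1 byte stream id, 3 padding bytes, 4-byte big-endian payload size.
        stream_id, size = struct.unpack_from(">BxxxI", raw, offset)
        offset += 8
        payload = raw[offset:offset + size].decode("utf-8", errors="replace")
        offset += size
        for entry in payload.splitlines():
            # With timestamps=1 each line is "<RFC 3339 timestamp> <message>".
            ts, _, line = entry.partition(" ")
            yield {
                "container": container,
                "ts": ts,
                "stream": STREAMS.get(stream_id, "unknown"),
                "line": line,
            }
```

Keeping the decoder a pure function over bytes makes it testable without a Docker socket; the HTTP handler would only fetch the body and pass it in.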

Why a proxy and not direct socket-into-n8n: any n8n editor user becomes root-on-host if n8n has the socket. Proxy keeps blast radius small and the API surface inspectable.

2. Self-hosted ntfy

Deploy via Coolify's one-click marketplace (or as a Docker Compose service).

Suggested FQDN (matches your existing pattern): ntfy.rahamafresh.com
Auth: enable auth-default-access: deny-all; create per-topic users (one publisher user for n8n, plus client users for each subscriber).
Topics: one per service group, e.g. tracksolid-alerts, coolify-alerts, evolution-api-alerts. Subscribe on phones via the ntfy mobile app.

3. n8n workflows (in `n8n-o55elukmxacgp1s2xcwktyam`)

A. Poll & Evaluate (one workflow per service group — easiest to tune independently)

Nodes:

Schedule Trigger — every 30s (tunable per group).
Static Data Read — pull last_cursor from $getWorkflowStaticData('global').cursor.
HTTP Request — GET http://log-proxy:8080/logs/<group>?since=<cursor>.
Function (Pattern Match) — for each line, run severity regexes (from workflow Variables) and emit { severity, pattern, container, ts, line, fingerprint } where fingerprint = sha256(group:pattern:container) (used for cooldown).
Function (Threshold + Cooldown):
- critical: emit immediately if not in cooldown.
- error: count rolling matches per fingerprint over window minutes; emit when threshold crossed.
- warn: same but larger window / threshold.
- Cooldown: staticData.cooldowns[fingerprint] = now + cooldown_minutes; skip while still hot.
Static Data Write — update cursor = max(ts seen) and cooldowns.
Execute Workflow — call the Notify workflow once per emitted Alert.
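
The threshold and cooldown step above can be sketched as a small pure function. This is a logic sketch in Python for reasoning and testing only; inside n8n the same logic would live in a JavaScript node, with the `state` dict standing in for the object returned by getWorkflowStaticData. The function name `evaluate`, the `hits` key, and the rule field names (`threshold`, `window_min`, `cooldown_min`) are assumptions, not names fixed by the plan.

```python
def evaluate(matches, rules, state, now):
    """Return the matches that should raise an alert, updating state in place.

    matches: dicts with "severity", "fingerprint" and "ts" (unix seconds)
    rules:   {severity: {"threshold": int, "window_min": int, "cooldown_min": int}}
    state:   {"hits": {fingerprint: [ts, ...]}, "cooldowns": {fingerprint: unix_ts}}
    now:     current unix time in seconds
    """
    alerts = []
    hits = state.setdefault("hits", {})
    cooldowns = state.setdefault("cooldowns", {})
    for match in matches:
        rule = rules[match["severity"]]
        fp = match["fingerprint"]
        # Keep only hits inside the rolling window, then record this one.
        window_start = now - rule.get("window_min", 0) * 60
        recent = [t for t in hits.get(fp, []) if t >= window_start]
        recent.append(match["ts"])
        hits[fp] = recent
        # While the fingerprint is still hot, count the hit but stay silent.
        if cooldowns.get(fp, 0) > now:
            continue
        if len(recent) >= rule["threshold"]:
            alerts.append(match)
            cooldowns[fp] = now + rule["cooldown_min"] * 60
            hits[fp] = []  # start a fresh count after alerting
    return alerts
```

With a threshold of 1, the critical severity fires on the first match and then stays quiet until its cooldown expires, which matches the "immediate" behaviour described in the list.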

B. Notify (single parametric workflow; called by each Poll workflow)

Input: { group, severity, pattern, container, ts, line, fingerprint }

Nodes:

Switch on severity.
critical branch:
- HTTP Request → ntfy: POST https://ntfy.rahamafresh.com/<group>-alerts with priority=5, tags=rotating_light.
- HTTP Request → Evolution API: POST https://<evolution-api-fqdn>/message/sendText/<instance> with { number, text }. Credentials via n8n credentials store.
error branch: ntfy only, priority=4.
warn branch: ntfy only, priority=3 (ntfy's default level; priority=2 is its "low" level if warn should be quieter).
Append-row (Postgres node, optional) → alerts_audit table for history.
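
The routing in the Notify branches can be expressed as a lookup table plus a request builder. The sketch below only constructs the HTTP requests and sends nothing. The names `ROUTES` and `build_requests` and the message format are hypothetical. It assumes ntfy reads `Title`, `Priority` and `Tags` from request headers, and that the Evolution API accepts its key in an `apikey` header; check both against the deployed versions. Publisher authentication for ntfy is left out here.

```python
import json
import urllib.request

NTFY_BASE = "https://ntfy.rahamafresh.com"  # FQDN suggested in the plan

# Severity -> ntfy priority, ntfy tags, and whether WhatsApp is also notified.
ROUTES = {
    "critical": {"priority": "5", "tags": "rotating_light", "whatsapp": True},
    "error": {"priority": "4", "tags": "", "whatsapp": False},
    "warn": {"priority": "3", "tags": "", "whatsapp": False},
}


def build_requests(alert, evolution_url, instance, api_key, number):
    """Return the list of urllib requests one alert should produce."""
    route = ROUTES[alert["severity"]]
    text = (f"[{alert['severity'].upper()}] "
            f"{alert['group']}/{alert['container']}: {alert['line']}")
    headers = {
        "Title": f"{alert['group']} {alert['severity']}",
        "Priority": route["priority"],
    }
    if route["tags"]:
        headers["Tags"] = route["tags"]
    requests = [urllib.request.Request(
        f"{NTFY_BASE}/{alert['group']}-alerts",
        data=text.encode("utf-8"), headers=headers, method="POST")]
    if route["whatsapp"]:
        body = json.dumps({"number": number, "text": text}).encode("utf-8")
        requests.append(urllib.request.Request(
            f"{evolution_url}/message/sendText/{instance}",
            data=body,
            headers={"Content-Type": "application/json", "apikey": api_key},
            method="POST"))
    return requests
```

Adding a channel later means one more key in `ROUTES` and one more request in the builder, which mirrors the intent that new channels are added only in the Notify workflow.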

4. Defaults (tunable per group via workflow Variables)

Severity	Default patterns	Threshold	Cooldown	Routing
critical	`panic`, `FATAL`, `OOMKilled`, `out of memory`, `segmentation fault`	immediate (1 match)	30 min	ntfy + WhatsApp
error	`\bERROR\b`, `Exception`, `Traceback`, `5\d\d` (HTTP 5xx)	10 / 5 min	15 min	ntfy
warn	`\bWARN(ING)?\b`, `deadlock`, `timeout`	50 / 15 min	30 min	ntfy (low prio)

These live as a JSON object in each workflow's Variables, so per-group tuning is one edit.
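
As an illustration of how those defaults could drive the pattern-match step, here is a minimal Python sketch. The dict layout, the `classify` name, and the choice to check severities from most to least severe are assumptions. Matching is case-sensitive, as the patterns in the table are written. In n8n the equivalent would be a JavaScript node reading the same JSON object.

```python
import hashlib
import re

# Defaults from the table above; critical has no window because it fires on one match.
DEFAULTS = {
    "critical": {
        "patterns": [r"panic", r"FATAL", r"OOMKilled", r"out of memory",
                     r"segmentation fault"],
        "threshold": 1, "cooldown_min": 30,
    },
    "error": {
        "patterns": [r"\bERROR\b", r"Exception", r"Traceback", r"5\d\d"],
        "threshold": 10, "window_min": 5, "cooldown_min": 15,
    },
    "warn": {
        "patterns": [r"\bWARN(ING)?\b", r"deadlock", r"timeout"],
        "threshold": 50, "window_min": 15, "cooldown_min": 30,
    },
}
SEVERITY_ORDER = ("critical", "error", "warn")  # most severe wins


def classify(group, entry, rules=DEFAULTS):
    """Return the entry annotated with severity, pattern and fingerprint, or None."""
    for severity in SEVERITY_ORDER:
        for pattern in rules[severity]["patterns"]:
            if re.search(pattern, entry["line"]):
                key = f"{group}:{pattern}:{entry['container']}"
                fingerprint = hashlib.sha256(key.encode("utf-8")).hexdigest()
                return {**entry, "severity": severity, "pattern": pattern,
                        "fingerprint": fingerprint}
    return None
```

Note that `5\d\d` also matches digits inside longer numbers (for example `1500`), which is the kind of false positive the 24h tuning pass is meant to surface.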

5. Group naming

Friendly names mapped from Coolify resource UUID — sourced from groups.yml mounted into log-proxy. groups.yml is also the allow-list: only UUIDs listed here are monitored. Anything else the proxy sees on the host is ignored — non-mission-critical apps don't generate noise or burn polling cycles.

bo3nov2ija7g8wn9b1g2paxs: tracksolid
o55elukmxacgp1s2xcwktyam: n8n-prod
usoksgg8o40044g0cw08s8wc: n8n-simple
vc4ok84gw4s0kcgwwg8gooco: evolution-api
ks4sc8k4804swk0c0c4kk44c: chatwoot
foo048cw4skg8kswwsowwo0c: forgejo
u7rj0du43d33ncurig2t6ni1: dekart
e11bva63bu7swlq6zyfckxm3: rustfs
now8k08wcs044scwggos0wos: dozzle
# Coolify core, Supabase, shutterdiplomacy → handled as their own groups
#
# Explicitly NOT monitored (non-mission-critical, per user 2026-05-17):
#   dy82njm7qgb5f2m573d1u3rh  garage
#   r77s24tgmfifmpfqe86xyqsp  ente
#   vw0wk0cg8gkwgwogsg4k0gsg  excalidraw

Implication on the proxy: GET /services returns only allow-listed groups; GET /logs/<group> 404s for non-allow-listed UUIDs. To start monitoring a service later, add a single line to groups.yml and clone a Poll workflow.
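
The grouping and allow-list behaviour can be sketched as follows. The input shape is an assumption: each container is reduced to its name plus the `KEY=value` list that docker inspect reports under Config.Env, and `allow_list` is the UUID to friendly-name mapping already parsed from groups.yml. The function name `group_containers` is hypothetical.

```python
ENV_KEY = "COOLIFY_RESOURCE_UUID"


def group_containers(containers, allow_list):
    """Group containers by Coolify resource UUID, keeping only allow-listed groups.

    containers: [{"name": str, "env": ["KEY=value", ...]}, ...]
    allow_list: {uuid: friendly_name}
    """
    groups = {}
    for container in containers:
        env = dict(item.split("=", 1) for item in container["env"] if "=" in item)
        uuid = env.get(ENV_KEY)
        # Containers without the variable, or outside the allow-list, are skipped.
        if uuid not in allow_list:
            continue
        group = groups.setdefault(
            uuid, {"group": uuid, "name": allow_list[uuid], "containers": []})
        group["containers"].append(container["name"])
    return list(groups.values())
```

Because the allow-list is the only gate, starting to monitor a new service changes nothing in this function; only the mapping passed in grows by one entry.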

Workspace layout

/Users/kianiadee/Downloads/projects/03_dozzle_n8n/   ← no git yet
├── log-proxy/
│   ├── Dockerfile
│   ├── app.py            (FastAPI: /services, /logs/<group>, /healthz)
│   ├── requirements.txt
│   └── groups.yml        (UUID → friendly-name map; doubles as the allow-list)
├── ntfy/
│   └── README.md         (Coolify deploy notes + topic / user setup)
├── n8n/
│   └── workflows/
│       ├── poll-tracksolid.json
│       ├── poll-coolify.json
│       ├── poll-evolution.json
│       ├── poll-<group>.json   ← one per group, derived from a template
│       └── notify.json         ← parametric fan-out
├── coolify/
│   └── log-proxy.compose.yml   (for Coolify "Docker Compose" service)
└── README.md             (operating runbook: how to add a group, tune thresholds, rotate ntfy creds)

Implementation steps (ordered)

Build log-proxy locally (log-proxy/). Test against the remote docker socket via docker context or just deploy and iterate.
Deploy log-proxy via Coolify as a Docker Compose service. Attach to the same network as n8n-o55.... No Traefik route. Verify GET /services and GET /logs/<group> from inside the n8n container (docker exec n8n-o55... wget -qO- http://log-proxy:8080/services).
Deploy self-hosted ntfy via Coolify at ntfy.rahamafresh.com. Configure deny-all default and one publisher user. Subscribe phones to test topic.
Build the parametric Notify workflow in n8n. Add credentials: ntfy_publisher (HTTP basic), evolution_api (header auth). Test by manually firing each branch.
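Build the Poll & Evaluate workflow for one group first (suggest tracksolid, the group with the highest business value). Validate thresholds with a synthetic log line written to the container's own stdout, e.g. docker exec ingest_events-bo3no... sh -c 'echo FATAL test > /proc/1/fd/1' or similar. A plain echo under docker exec prints to the exec session and never reaches the container's log stream.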
Clone the Poll workflow per remaining group. Tune patterns / thresholds in Variables.
Tune & quiet: run for 24h, capture false positives, adjust regex / thresholds.
Document in README.md how to add a new group when Coolify spins up a new service.

Critical files

log-proxy/app.py — the only thing with docker.sock access. Treat as security-sensitive; no write endpoints, no shell-out.
log-proxy/groups.yml — single source of truth for UUID → friendly name. Keep in sync as Coolify services are added.
n8n/workflows/notify.json — fan-out logic; any new channel (Slack, email) is added here, not in each poll workflow.
n8n/workflows/poll-<group>.json — per-group thresholds. Variables block at the top is the only thing operators normally edit.
coolify/log-proxy.compose.yml — controls log-proxy deployment + network attachment. Misconfiguring network = n8n can't reach proxy.

Reused / existing infrastructure

n8n queue mode n8n-o55elukmxacgp1s2xcwktyam — runs the workflows; its built-in Postgres + Redis cover persistence and queueing. No new DB needed.
Evolution API api-vc4ok84gw4s0kcgwwg8gooco — already deployed; we only consume its REST API.
Coolify Sentinel coolify-sentinel — left untouched; could later feed container-down events into the same Notify workflow if desired.
Coolify networks + Traefik — handle internal service discovery and TLS for ntfy.
All Coolify-managed containers already carry COOLIFY_RESOURCE_UUID — confirmed via docker inspect on the Dozzle container in the previous session. This is what makes auto-grouping possible without a hand-written container list.

Open items to gather at implementation time

ntfy.rahamafresh.com DNS record (or chosen FQDN).
Evolution API: instance name, API key, target WhatsApp number(s).
Confirmation of which Coolify network n8n-o55... runs on (read from docker inspect at implementation start).
Optional: bearer token value for log-proxy if defence-in-depth is wanted.

Verification

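log-proxy unit checks: from inside the n8n container, wget -qO- http://log-proxy:8080/services returns all allow-listed groups; wget -qO- "http://log-proxy:8080/logs/tracksolid?since=$(( $(date +%s) - 300 ))" returns lines from the last five minutes across all tracksolid containers. (Shell arithmetic is used because BusyBox date does not parse relative strings such as '5 minutes ago'.)
End-to-end critical alert: write a synthetic line to the log stream of a running tracksolid container, e.g. docker exec <tracksolid-container> sh -c 'echo "FATAL synthetic test from $(date)" > /proc/1/fd/1'; within 30s, ntfy topic tracksolid-alerts receives a high-priority message AND the WhatsApp number receives the same. A throwaway docker run --rm alpine container would not work here: it carries no COOLIFY_RESOURCE_UUID and so belongs to no monitored group.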
Threshold smoke test: emit 11 lines containing ERROR to a single container over 30s; expect exactly one ntfy notification, not eleven.
Cooldown smoke test: trigger the same critical alert twice within the cooldown window; expect only one notification.
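Cursor durability: restart the n8n worker; confirm the cursor in getWorkflowStaticData persisted in Postgres and no logs were re-processed or skipped. Run this check against the activated workflow, since n8n does not save static data during manual test executions.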
Per-group isolation: deliberately spam errors in one group; confirm other groups' workflows are unaffected (separate workflow = separate static data, separate schedule).
Read-only safety: from inside n8n, attempt POST http://log-proxy:8080/anything — expect 404/405. Confirm docker.sock is not mounted inside n8n.
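
The cursor durability check expects neither duplicates nor gaps, which depends on how the cursor is advanced between polls. Below is a minimal sketch of one approach, assuming the proxy normalises each `ts` to a number (unix seconds, possibly fractional) and may return lines whose timestamp equals the `since` value again. The function name `advance_cursor` is hypothetical.

```python
def advance_cursor(entries, cursor):
    """Return (fresh_entries, new_cursor) for one poll.

    entries: log records with a numeric "ts"
    cursor:  timestamp of the newest line already processed, or None on first run
    """
    # Drop anything at or before the cursor so a re-sent boundary line is not recounted.
    fresh = [e for e in entries if cursor is None or e["ts"] > cursor]
    # An empty poll must leave the cursor where it was.
    new_cursor = max((e["ts"] for e in fresh), default=cursor)
    return fresh, new_cursor
```

The trade-off is that a line sharing its exact timestamp with the cursor but arriving one poll later would be dropped; with Docker's nanosecond timestamps that should be rare, but it is worth watching for during the smoke tests.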
