n8n Docker-log Alerting (ntfy + WhatsApp)
Context
The user runs a Coolify host at twala.rahamafresh.com with ~50 containers across ~15 logically distinct services (tracksolid telemetry pipeline, Coolify itself, n8n stacks, Supabase, Chatwoot, Evolution API, Dekart, Forgejo, Ente, Garage, etc.). They want n8n to read Docker logs directly, segment by service, apply per-service thresholds, and notify via ntfy and WhatsApp.
Dozzle is explicitly out of scope as an integration source — it stays as a human-facing log viewer. The integration design must not depend on it.
Why this matters: today, errors in any container are invisible until someone opens Dozzle. Critical issues (panics, OOMs, ingest failures on the tracksolid pipeline) can sit unnoticed for hours. The goal is per-service alerting with severity-aware routing, with thresholds tunable per service so that noisy services don't drown out quiet ones.
Decisions (locked with the user)
| Choice | Decision |
|---|---|
| n8n instance | n8n-o55elukmxacgp1s2xcwktyam (queue mode: main + worker + task-runners + Postgres + Redis) |
| Docker log access | New read-only log-proxy container — n8n never touches /var/run/docker.sock |
| Service grouping | Auto-derive from each container's COOLIFY_RESOURCE_UUID env var |
| Channels | Self-hosted ntfy (new Coolify service) + existing Evolution API (WhatsApp) |
| Git | Workspace /Users/kianiadee/Downloads/projects/03_dozzle_n8n is not a git repo yet — user creates separate repo later |
Architecture
n8n queue-mode (o55elukmxacgp1s2xcwktyam)
┌────────────────────────────────────────┐
Docker Engine log-proxy │ Workflow: Poll & Evaluate (per group) │
┌──────────────┐ (new svc) │ 1. GET /logs/<group>?since=<cursor> │
│ /var/run/ │ ◄─── RO socket ────► │ 2. regex → severity │
│ docker.sock │ HTTP API │ 3. threshold + cooldown via │
└──────────────┘ (internal net) ◄─┤ getWorkflowStaticData() │
│ 4. emit Alert event │
│ │
│ Workflow: Notify (single, parametric) │
│ severity=critical → ntfy + WhatsApp │
│ severity=error → ntfy │
│ severity=warn → ntfy (low prio) │
└────────────────────┬───────────────────┘
│
┌──────────────────────┴──────────────────────┐
▼ ▼
ntfy (self-hosted via Coolify) Evolution API (api-vc4ok...)
POST /<topic> POST /message/sendText/<instance>
Components
1. log-proxy (new container)
Purpose: the only thing with docker.sock access. Dumb pipe — no alerting logic.
Image: small Python/FastAPI or Node/Fastify app (~50 lines). Build from source in this repo.
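Mount: /var/run/docker.sock read-only. A read-only bind mount does not limit which Docker API calls succeed over the socket, so the proxy code itself must only ever issue GET requests.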
Network: joined to the n8n stack's Coolify network so n8n can reach it by hostname; no Traefik route (not publicly reachable).
API (no auth needed — internal only; optional bearer token for defence in depth):
- GET /services → [{ "group": "bo3no...", "name": "tracksolid", "containers": [...] }, ...]
  - Groups containers by the COOLIFY_RESOURCE_UUID env var.
  - Filtered to the allow-list in /config/groups.yml; UUIDs not listed are skipped entirely.
- GET /logs/<group>?since=<unix_ts>&until=<unix_ts>&limit=2000
  - Calls Docker Engine API GET /containers/<id>/logs?stdout=1&stderr=1&since=...&until=...&timestamps=1 for every container in the group.
  - Returns NDJSON or a JSON array of { container, ts, stream, line }.
  - since defaults to "now − 60s" if absent; until defaults to "now".
- GET /healthz
Why a proxy and not direct socket-into-n8n: any n8n editor user becomes root-on-host if n8n has the socket. Proxy keeps blast radius small and the API surface inspectable.
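The log-parsing step inside GET /logs/<group> can be sketched as below. This is a minimal illustration, not the proxy's actual code: it assumes the containers run without a TTY, in which case the Docker Engine API returns a multiplexed stream where each frame carries an 8-byte header (stream type, three zero bytes, big-endian payload length), and that timestamps=1 prefixes every line with an RFC 3339 timestamp followed by a space. The function name `demux` is hypothetical.

```python
import struct
from typing import Iterator

# Stream type byte in the frame header: 0 = stdin, 1 = stdout, 2 = stderr.
STREAMS = {0: "stdin", 1: "stdout", 2: "stderr"}


def demux(raw: bytes, container: str) -> Iterator[dict]:
    """Split a multiplexed Docker log body into { container, ts, stream, line } records."""
    offset = 0
    while offset + 8 <= len(raw):
        # Header layout: 1 byte stream type, 3 padding bytes, 4-byte big-endian size.
        stream_id, size = struct.unpack_from(">BxxxI", raw, offset)
        offset += 8
        payload = raw[offset:offset + size]
        offset += size
        for text in payload.decode("utf-8", errors="replace").splitlines():
            # With timestamps=1 the timestamp is the first space-delimited token.
            ts, _, line = text.partition(" ")
            yield {
                "container": container,
                "ts": ts,
                "stream": STREAMS.get(stream_id, "unknown"),
                "line": line,
            }
```

Containers started with a TTY return a raw stream without frame headers, so a real implementation would need to check the container's Tty setting before choosing this path.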
2. Self-hosted ntfy
Deploy via Coolify's one-click marketplace (or as a Docker Compose service).
- Suggested FQDN (matches your existing pattern): ntfy.rahamafresh.com
- Auth: enable auth-default-access: deny-all; create per-topic users (one publisher user for n8n, plus client users for each subscriber).
- Topics: one per service group, e.g. tracksolid-alerts, coolify-alerts, evolution-api-alerts. Subscribe on phones via the ntfy mobile app.
3. n8n workflows (in n8n-o55elukmxacgp1s2xcwktyam)
A. Poll & Evaluate (one workflow per service group — easiest to tune independently)
Nodes:
- Schedule Trigger: every 30s (tunable per group).
- Static Data Read: pull last_cursor from $getWorkflowStaticData('global').cursor.
- HTTP Request: GET http://log-proxy:8080/logs/<group>?since=<cursor>.
- Function (Pattern Match): for each line, run severity regexes (from workflow Variables) and emit { severity, pattern, container, ts, line, fingerprint } where fingerprint = sha256(group:pattern:container) (used for cooldown).
- Function (Threshold + Cooldown):
  - critical: emit immediately if not in cooldown.
  - error: count rolling matches per fingerprint over window minutes; emit when threshold crossed.
  - warn: same but larger window / threshold.
  - Cooldown: staticData.cooldowns[fingerprint] = now + cooldown_minutes; skip while still hot.
- Static Data Write: update cursor = max(ts seen) and cooldowns.
- Execute Workflow: call the Notify workflow once per emitted Alert.
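The Pattern Match step can be sketched as follows. It is written in Python for illustration only; in n8n the same logic would live in a Function/Code node. The pattern lists mirror the defaults in this document, while the names `SEVERITY_PATTERNS`, `fingerprint`, and `classify` are hypothetical. Severities are checked from most to least severe, and the first matching pattern wins.

```python
import hashlib
import re
from typing import Optional

# Default patterns per severity, checked in this order (dicts keep insertion order).
SEVERITY_PATTERNS = {
    "critical": ["panic", "FATAL", "OOMKilled", "out of memory", "segmentation fault"],
    "error": [r"\bERROR\b", "Exception", "Traceback", r"5\d\d"],
    "warn": [r"\bWARN(ING)?\b", "deadlock", "timeout"],
}


def fingerprint(group: str, pattern: str, container: str) -> str:
    """sha256(group:pattern:container), used as the cooldown key."""
    return hashlib.sha256(f"{group}:{pattern}:{container}".encode()).hexdigest()


def classify(group: str, record: dict) -> Optional[dict]:
    """Return a match record for the first pattern that hits, or None for a clean line."""
    for severity, patterns in SEVERITY_PATTERNS.items():
        for pattern in patterns:
            if re.search(pattern, record["line"]):
                return {
                    "severity": severity,
                    "pattern": pattern,
                    "container": record["container"],
                    "ts": record["ts"],
                    "line": record["line"],
                    "fingerprint": fingerprint(group, pattern, record["container"]),
                }
    return None
```

Because the fingerprint excludes the line text, repeated hits of the same pattern in the same container share one cooldown, which is what keeps a crash loop from producing one alert per line.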
B. Notify (single parametric workflow; called by each Poll workflow)
Input: { group, severity, pattern, container, ts, line, fingerprint }
Nodes:
- Switch on severity.
- critical branch:
  - HTTP Request → ntfy: POST https://ntfy.rahamafresh.com/<group>-alerts with priority=5, tags=rotating_light.
  - HTTP Request → Evolution API: POST https://<evolution-api-fqdn>/message/sendText/<instance> with { number, text }. Credentials via n8n credentials store.
- error branch: ntfy only, priority=4.
- warn branch: ntfy only, priority=3.
- Append-row (Postgres node, optional) → alerts_audit table for history.
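The severity routing above can be sketched as a function that builds the outgoing HTTP requests without sending them. This is an illustration under assumptions: `build_requests` is a hypothetical name, the message text format is invented for the example, the Evolution API URL and number are passed in because the FQDN and instance are not yet known, and authentication is omitted since it lives in the n8n credentials store. Title, Priority, and Tags are ntfy's publish headers.

```python
import json

NTFY_BASE = "https://ntfy.rahamafresh.com"
NTFY_PRIORITY = {"critical": "5", "error": "4", "warn": "3"}


def build_requests(alert: dict, evolution_url: str, number: str) -> list:
    """Return the list of HTTP requests the Notify workflow would fire for one alert."""
    severity = alert["severity"]
    text = f"[{severity.upper()}] {alert['group']}/{alert['container']}: {alert['line']}"

    ntfy_headers = {
        "Title": f"{alert['group']} {severity}",
        "Priority": NTFY_PRIORITY[severity],
    }
    if severity == "critical":
        ntfy_headers["Tags"] = "rotating_light"

    requests = [{
        "method": "POST",
        "url": f"{NTFY_BASE}/{alert['group']}-alerts",
        "headers": ntfy_headers,
        "body": text,
    }]

    # Only critical alerts fan out to WhatsApp via the Evolution API.
    if severity == "critical":
        requests.append({
            "method": "POST",
            "url": evolution_url,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"number": number, "text": text}),
        })
    return requests
```

Keeping the routing in one place like this matches the intent that a new channel is added once in Notify rather than in every Poll workflow.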
4. Defaults (tunable per group via workflow Variables)
| Severity | Default patterns | Threshold | Cooldown | Routing |
|---|---|---|---|---|
| critical | panic, FATAL, OOMKilled, out of memory, segmentation fault | immediate (1 match) | 30 min | ntfy + WhatsApp |
| error | \bERROR\b, Exception, Traceback, 5\d\d (HTTP 5xx) | 10 / 5 min | 15 min | ntfy |
| warn | \bWARN(ING)?\b, deadlock, timeout | 50 / 15 min | 30 min | ntfy (low prio) |
These live as a JSON object in each workflow's Variables, so per-group tuning is one edit.
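A minimal sketch of how these defaults could drive the Threshold + Cooldown step is shown below, in Python for illustration. The names `Rule`, `DEFAULT_RULES`, and `should_alert` are hypothetical, `state` stands in for the workflow static data, and timestamps are epoch seconds. One assumption to note: matches that arrive during a cooldown are dropped rather than counted toward the next window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    threshold: int   # matches needed inside the window
    window_s: int    # rolling window length in seconds
    cooldown_s: int  # silence period after an alert fires


DEFAULT_RULES = {
    "critical": Rule(threshold=1, window_s=0, cooldown_s=30 * 60),
    "error": Rule(threshold=10, window_s=5 * 60, cooldown_s=15 * 60),
    "warn": Rule(threshold=50, window_s=15 * 60, cooldown_s=30 * 60),
}


def should_alert(state: dict, fingerprint: str, severity: str, now: float,
                 rules: dict = DEFAULT_RULES) -> bool:
    """Record one match and report whether it should produce an alert."""
    rule = rules[severity]
    cooldowns = state.setdefault("cooldowns", {})
    hits = state.setdefault("hits", {})

    # Still hot: skip without counting the match.
    if cooldowns.get(fingerprint, 0) > now:
        return False

    # Keep only matches inside the rolling window, then add this one.
    recent = [t for t in hits.get(fingerprint, []) if t > now - rule.window_s]
    recent.append(now)
    if len(recent) < rule.threshold:
        hits[fingerprint] = recent
        return False

    # Threshold crossed: reset the counter and start the cooldown.
    hits.pop(fingerprint, None)
    cooldowns[fingerprint] = now + rule.cooldown_s
    return True
```

With the error defaults, eleven matches in quick succession yield a single alert: the tenth match crosses the threshold and the eleventh falls inside the cooldown.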
5. Group naming
Friendly names mapped from Coolify resource UUID — sourced from groups.yml mounted into log-proxy. groups.yml is also the allow-list: only UUIDs listed here are monitored. Anything else the proxy sees on the host is ignored — non-mission-critical apps don't generate noise or burn polling cycles.
bo3nov2ija7g8wn9b1g2paxs: tracksolid
o55elukmxacgp1s2xcwktyam: n8n-prod
usoksgg8o40044g0cw08s8wc: n8n-simple
vc4ok84gw4s0kcgwwg8gooco: evolution-api
ks4sc8k4804swk0c0c4kk44c: chatwoot
foo048cw4skg8kswwsowwo0c: forgejo
u7rj0du43d33ncurig2t6ni1: dekart
e11bva63bu7swlq6zyfckxm3: rustfs
now8k08wcs044scwggos0wos: dozzle
# Coolify core, Supabase, shutterdiplomacy → handled as their own groups
#
# Explicitly NOT monitored (non-mission-critical, per user 2026-05-17):
# dy82njm7qgb5f2m573d1u3rh garage
# r77s24tgmfifmpfqe86xyqsp ente
# vw0wk0cg8gkwgwogsg4k0gsg excalidraw
Implication on the proxy: GET /services returns only allow-listed groups; GET /logs/<group> 404s for non-allow-listed UUIDs. To start monitoring a service later, add a single line to groups.yml and clone a Poll workflow.
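The grouping and allow-list behaviour can be sketched as below. This is an illustration, not the proxy's actual code: it assumes each input item has the shape returned by the Docker Engine API's container inspect call (Name with a leading slash, Config.Env as a list of KEY=value strings) and that groups.yml has already been loaded into a plain dict. The function name `group_containers` and the container names in the usage example are hypothetical.

```python
def group_containers(inspected: list, allow_list: dict) -> list:
    """Group inspected containers by COOLIFY_RESOURCE_UUID, keeping allow-listed UUIDs only."""
    groups = {}
    for container in inspected:
        env_list = (container.get("Config") or {}).get("Env") or []
        env = dict(item.split("=", 1) for item in env_list if "=" in item)
        uuid = env.get("COOLIFY_RESOURCE_UUID")
        if uuid not in allow_list:
            # Unlisted or non-Coolify containers are ignored entirely.
            continue
        group = groups.setdefault(
            uuid, {"group": uuid, "name": allow_list[uuid], "containers": []}
        )
        group["containers"].append(container["Name"].lstrip("/"))
    return list(groups.values())


allow_list = {"bo3nov2ija7g8wn9b1g2paxs": "tracksolid"}
inspected = [
    {"Name": "/ingest-a", "Config": {"Env": ["COOLIFY_RESOURCE_UUID=bo3nov2ija7g8wn9b1g2paxs"]}},
    {"Name": "/ingest-b", "Config": {"Env": ["COOLIFY_RESOURCE_UUID=bo3nov2ija7g8wn9b1g2paxs"]}},
    {"Name": "/garage", "Config": {"Env": ["COOLIFY_RESOURCE_UUID=dy82njm7qgb5f2m573d1u3rh"]}},
    {"Name": "/other", "Config": {"Env": None}},
]
services = group_containers(inspected, allow_list)
```

Here `services` holds one entry, the tracksolid group with both ingest containers; the garage container and the container without the env var are skipped.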
Workspace layout
/Users/kianiadee/Downloads/projects/03_dozzle_n8n/   ← no git yet
├── log-proxy/
│   ├── Dockerfile
│   ├── app.py                     (FastAPI: /services, /logs/<group>, /healthz)
│   ├── requirements.txt
│   └── groups.yml                 (UUID → friendly-name map)
├── ntfy/
│   └── README.md                  (Coolify deploy notes + topic / user setup)
├── n8n/
│   └── workflows/
│       ├── poll-tracksolid.json
│       ├── poll-coolify.json
│       ├── poll-evolution.json
│       ├── poll-<group>.json      ← one per group, derived from a template
│       └── notify.json            ← parametric fan-out
├── coolify/
│   └── log-proxy.compose.yml      (for Coolify "Docker Compose" service)
└── README.md                      (operating runbook: how to add a group, tune thresholds, rotate ntfy creds)
Implementation steps (ordered)
- Build log-proxy locally (log-proxy/). Test against the remote docker socket via docker context or just deploy and iterate.
- Deploy log-proxy via Coolify as a Docker Compose service. Attach to the same network as n8n-o55.... No Traefik route. Verify GET /services and GET /logs/<group> from inside the n8n container (docker exec n8n-o55... wget -qO- http://log-proxy:8080/services).
- Deploy self-hosted ntfy via Coolify at ntfy.rahamafresh.com. Configure deny-all default and one publisher user. Subscribe phones to test topic.
- Build the parametric Notify workflow in n8n. Add credentials: ntfy_publisher (HTTP basic), evolution_api (header auth). Test by manually firing each branch.
- Build the Poll & Evaluate workflow for one group first (suggest tracksolid, which has the highest business value). Validate thresholds with a synthetic log line (docker exec ingest_events-bo3no... sh -c 'echo FATAL test > /proc/1/fd/1' or similar). The redirect matters: a plain echo under docker exec prints to the exec session and never reaches the container's log stream.
- Clone the Poll workflow per remaining group. Tune patterns / thresholds in Variables.
- Tune & quiet: run for 24h, capture false positives, adjust regex / thresholds.
- Document in README.md how to add a new group when Coolify spins up a new service.
Critical files
- log-proxy/app.py: the only thing with docker.sock access. Treat as security-sensitive; no write endpoints, no shell-out.
- log-proxy/groups.yml: single source of truth for UUID → friendly name. Keep in sync as Coolify services are added.
- n8n/workflows/notify.json: fan-out logic; any new channel (Slack, email) is added here, not in each poll workflow.
- n8n/workflows/poll-<group>.json: per-group thresholds. Variables block at the top is the only thing operators normally edit.
- coolify/log-proxy.compose.yml: controls log-proxy deployment + network attachment. Misconfiguring network = n8n can't reach proxy.
Reused / existing infrastructure
- n8n queue mode (n8n-o55elukmxacgp1s2xcwktyam): runs the workflows; its built-in Postgres + Redis cover persistence and queueing. No new DB needed.
- Evolution API (api-vc4ok84gw4s0kcgwwg8gooco): already deployed; we only consume its REST API.
- Coolify Sentinel (coolify-sentinel): left untouched; could later feed container-down events into the same Notify workflow if desired.
- Coolify networks + Traefik: handle internal service discovery and TLS for ntfy.
- All Coolify-managed containers already carry COOLIFY_RESOURCE_UUID, confirmed via docker inspect on the Dozzle container in the previous session. This is what makes auto-grouping possible without a hand-written container list.
Open items to gather at implementation time
- ntfy.rahamafresh.com DNS record (or chosen FQDN).
- Evolution API: instance name, API key, target WhatsApp number(s).
- Confirmation of which Coolify network n8n-o55... runs on (read from docker inspect at implementation start).
- Optional: bearer token value for log-proxy if defence-in-depth is wanted.
Verification
- log-proxy unit checks: from inside n8n container, curl http://log-proxy:8080/services returns all allow-listed groups; curl "http://log-proxy:8080/logs/tracksolid?since=$(( $(date +%s) - 300 ))" returns recent lines from all tracksolid containers.
- End-to-end critical alert: run docker exec <tracksolid-container> sh -c 'echo "FATAL synthetic test from $(date)" > /proc/1/fd/1' against an existing tracksolid container (a separate docker run --rm alpine container would not belong to the tracksolid group, so its output would never be polled); within 30s, ntfy topic tracksolid-alerts receives a high-priority message AND WhatsApp number receives the same.
- Threshold smoke test: emit 11 lines containing ERROR to a single container over 30s; expect exactly one ntfy notification, not eleven.
- Cooldown smoke test: trigger the same critical alert twice within the cooldown window; expect only one notification.
- Cursor durability: restart the n8n worker; confirm cursor in getWorkflowStaticData persisted in Postgres and no logs were re-processed or skipped.
- Per-group isolation: deliberately spam errors in one group; confirm other groups' workflows are unaffected (separate workflow = separate static data, separate schedule).
- Read-only safety: from inside n8n, attempt POST http://log-proxy:8080/anything and expect 404/405. Confirm docker.sock is not mounted inside n8n.
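For the cursor durability check, the "no logs re-processed or skipped" property depends on how the cursor is advanced. One possible approach, sketched here under assumptions, is to keep the cursor at nanosecond precision, send only whole seconds as the since value, and filter out records at or before the cursor on the n8n side. The helper names `to_epoch_ns` and `advance` are hypothetical, and the parser assumes Docker's UTC timestamps ending in Z with up to nine fractional digits.

```python
from datetime import datetime, timezone


def to_epoch_ns(ts: str) -> int:
    """Convert an RFC 3339 UTC timestamp such as 2024-01-01T00:00:00.123456789Z to epoch ns."""
    base, _, frac = ts.rstrip("Z").partition(".")
    whole = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    # Pad or trim the fraction to exactly nine digits (nanoseconds).
    return int(whole.timestamp()) * 1_000_000_000 + int(frac.ljust(9, "0")[:9])


def advance(cursor_ns: int, records: list) -> tuple:
    """Drop records at or before the cursor and return (fresh_records, new_cursor_ns)."""
    fresh = [r for r in records if to_epoch_ns(r["ts"]) > cursor_ns]
    new_cursor = max([cursor_ns] + [to_epoch_ns(r["ts"]) for r in fresh])
    return fresh, new_cursor


def since_param(cursor_ns: int) -> int:
    """Whole-second value for ?since=; the boundary second is re-read and filtered by advance()."""
    return cursor_ns // 1_000_000_000
```

The trade-off is that two distinct lines sharing an identical nanosecond timestamp across a poll boundary would be treated as one; with nanosecond resolution this should be rare, but it is an assumption rather than a guarantee.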