# n8n Docker-log Alerting (ntfy + WhatsApp)

## Context

The user runs a Coolify host at `twala.rahamafresh.com` with ~50 containers across ~15 logically distinct services (tracksolid telemetry pipeline, Coolify itself, n8n stacks, Supabase, Chatwoot, Evolution API, Dekart, Forgejo, Ente, Garage, etc.). They want **n8n to read Docker logs directly, segment by service, apply per-service thresholds, and notify via ntfy and WhatsApp**.

Dozzle is explicitly out of scope as an integration source; it stays as a human-facing log viewer. The integration design must not depend on it.

Why this matters: today, errors in any container are invisible until someone opens Dozzle. Critical issues (panics, OOMs, ingest failures on the tracksolid pipeline) can sit unnoticed for hours. The goal is per-service alerting with severity-aware routing, with thresholds tunable per service so that noisy services don't drown out quiet ones.

## Decisions (locked with the user)

| Choice | Decision |
| --- | --- |
| n8n instance | `n8n-o55elukmxacgp1s2xcwktyam` (queue mode: main + worker + task-runners + Postgres + Redis) |
| Docker log access | New **read-only log-proxy** container; n8n never touches `/var/run/docker.sock` |
| Service grouping | Auto-derive from each container's `COOLIFY_RESOURCE_UUID` env var |
| Channels | Self-hosted ntfy (new Coolify service) **+** existing Evolution API (WhatsApp) |
| Git | Workspace `/Users/kianiadee/Downloads/projects/03_dozzle_n8n` is **not** a git repo yet; the user will create a separate repo later |

## Architecture

```
                                       n8n queue-mode (o55elukmxacgp1s2xcwktyam)
                                       ┌────────────────────────────────────────┐
 Docker Engine        log-proxy        │ Workflow: Poll & Evaluate (per group)  │
┌──────────────┐      (new svc)        │  1. GET /logs/<group>?since=<cursor>   │
│ /var/run/    │ ◄─── RO socket ────►  │  2. regex → severity                   │
│ docker.sock  │       HTTP API        │  3. threshold + cooldown via           │
└──────────────┘    (internal net)   ◄─┤     getWorkflowStaticData()            │
                                       │  4. emit Alert event                   │
                                       │                                        │
                                       │ Workflow: Notify (single, parametric)  │
                                       │   severity=critical → ntfy + WhatsApp  │
                                       │   severity=error    → ntfy             │
                                       │   severity=warn     → ntfy (low prio)  │
                                       └────────────────────┬───────────────────┘
                                                            │
                                     ┌──────────────────────┴──────────────────────┐
                                     ▼                                             ▼
                      ntfy (self-hosted via Coolify)                 Evolution API (api-vc4ok...)
                      POST /<topic>                                  POST /message/sendText/<instance>
```

## Components

### 1. log-proxy (new container)

**Purpose**: the only component with `docker.sock` access. A dumb pipe with no alerting logic.

**Image**: small Python/FastAPI or Node/Fastify app (~50 lines). Build from source in this repo.

**Mount**: `/var/run/docker.sock` read-only. A read-only bind mount does not by itself block write calls to the Docker API, so the proxy must expose only the `GET` endpoints listed below.

**Network**: joined to the n8n stack's Coolify network so n8n can reach it by hostname; **no Traefik route** (not publicly reachable).

**API** (no auth needed, internal only; optional bearer token for defence in depth):

- `GET /services` returns `[{ "group": "bo3no...", "name": "tracksolid", "containers": [...] }, ...]`
  - Groups containers by `COOLIFY_RESOURCE_UUID` env var.
  - Filtered to the allow-list in `/config/groups.yml`; UUIDs not listed are skipped entirely.
- `GET /logs/<group>?since=<unix_ts>&until=<unix_ts>&limit=2000`
  - Calls Docker Engine API `GET /containers/<id>/logs?stdout=1&stderr=1&since=...&until=...&timestamps=1` for every container in the group.
  - Returns NDJSON or JSON array of `{ container, ts, stream, line }`.
  - `since` defaults to "now − 60s" if absent; `until` defaults to "now".
- `GET /healthz`

**Why a proxy and not direct socket-into-n8n**: any n8n editor user becomes root-on-host if n8n has the socket. The proxy keeps the blast radius small and the API surface inspectable.
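
As a rough illustration of what the `/logs/<group>` handler has to do with the raw Docker responses, the sketch below demultiplexes Docker's framed log stream, parses the `timestamps=1` prefix, and merges several containers into `{ container, ts, stream, line }` records. It assumes containers run without a TTY (so the 8-byte frame header is present) and that timestamps end in `Z`; the function names `demux`, `parse_line` and `merge_group`, and the choice to keep the oldest `limit` records, are illustrative assumptions rather than part of the plan.

```python
import struct
from datetime import datetime, timezone

STREAMS = {1: "stdout", 2: "stderr"}  # stream ids used in Docker's frame header


def demux(buf: bytes):
    """Yield (stream, payload) pairs from Docker's multiplexed log stream.

    Each frame starts with an 8-byte header: stream id, three zero bytes,
    then the payload length as a big-endian uint32.
    """
    pos = 0
    while pos + 8 <= len(buf):
        stream_id, size = struct.unpack(">BxxxI", buf[pos:pos + 8])
        payload = buf[pos + 8:pos + 8 + size]
        pos += 8 + size
        yield STREAMS.get(stream_id, "stdout"), payload


def parse_line(container: str, stream: str, raw: str) -> dict:
    """Split '<RFC 3339 timestamp> <text>' into one record."""
    stamp, _, text = raw.partition(" ")
    base, _, frac = stamp.rstrip("Z").partition(".")
    dt = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    ts = dt.timestamp() + (float("0." + frac) if frac else 0.0)
    return {"container": container, "ts": ts, "stream": stream, "line": text}


def merge_group(raw_by_container: dict, limit: int = 2000) -> list:
    """Merge every container's frames into one list sorted by timestamp."""
    records = []
    for container, buf in raw_by_container.items():
        for stream, payload in demux(buf):
            for raw in payload.decode("utf-8", errors="replace").splitlines():
                if raw:
                    records.append(parse_line(container, stream, raw))
    records.sort(key=lambda r: r["ts"])
    return records[:limit]  # assumption: keep the oldest lines, the cursor picks up the rest
```

Sorting across containers before truncating means a busy container cannot push a quiet container's lines out of order, at the cost of holding one poll window in memory.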

### 2. Self-hosted ntfy

Deploy via Coolify's one-click marketplace (or as a Docker Compose service).

- Suggested FQDN (matches your existing pattern): `ntfy.rahamafresh.com`
- Auth: enable `auth-default-access: deny-all`; create per-topic users (one publisher user for n8n, plus client users for each subscriber).
- Topics: one per service group, e.g. `tracksolid-alerts`, `coolify-alerts`, `evolution-api-alerts`. Subscribe on phones via the ntfy mobile app.

### 3. n8n workflows (in `n8n-o55elukmxacgp1s2xcwktyam`)

**A. Poll & Evaluate** (one workflow per service group; easiest to tune independently)

Nodes:

1. **Schedule Trigger**: every 30s (tunable per group).
2. **Static Data Read**: pull `last_cursor` from `$getWorkflowStaticData('global').cursor`.
3. **HTTP Request**: `GET http://log-proxy:8080/logs/<group>?since=<cursor>`.
4. **Function (Pattern Match)**: for each line, run severity regexes (from workflow Variables) and emit `{ severity, pattern, container, ts, line, fingerprint }` where `fingerprint = sha256(group:pattern:container)` (used for cooldown).
5. **Function (Threshold + Cooldown)**:
   - `critical`: emit immediately if not in cooldown.
   - `error`: count rolling matches per fingerprint over `window` minutes; emit when threshold crossed.
   - `warn`: same but larger window / threshold.
   - Cooldown: `staticData.cooldowns[fingerprint] = now + cooldown_minutes`; skip while still hot.
6. **Static Data Write**: update `cursor = max(ts seen)` and `cooldowns`.
7. **Execute Workflow**: call the **Notify** workflow once per emitted Alert.
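
The Pattern Match step (node 4) can be sketched as follows. n8n Code nodes are usually written in JavaScript; Python is used here only to illustrate the logic. The pattern lists mirror the defaults in this plan, while `classify`, `fingerprint`, `SEVERITY_ORDER` and the "highest severity wins, first matching pattern wins" rule are assumptions made for the example.

```python
import hashlib
import re

# Default patterns per severity, copied from the Defaults table in this plan.
SEVERITY_PATTERNS = {
    "critical": [r"panic", r"FATAL", r"OOMKilled", r"out of memory", r"segmentation fault"],
    "error": [r"\bERROR\b", r"Exception", r"Traceback", r"5\d\d "],
    "warn": [r"\bWARN(ING)?\b", r"deadlock", r"timeout"],
}
SEVERITY_ORDER = ("critical", "error", "warn")  # assumption: check the worst severity first


def fingerprint(group: str, pattern: str, container: str) -> str:
    """sha256(group:pattern:container), used as the cooldown key."""
    return hashlib.sha256(f"{group}:{pattern}:{container}".encode()).hexdigest()


def classify(group: str, record: dict):
    """Return an alert candidate for one log record, or None if nothing matches."""
    for severity in SEVERITY_ORDER:
        for pattern in SEVERITY_PATTERNS[severity]:
            if re.search(pattern, record["line"]):
                return {
                    "severity": severity,
                    "pattern": pattern,
                    "container": record["container"],
                    "ts": record["ts"],
                    "line": record["line"],
                    "fingerprint": fingerprint(group, pattern, record["container"]),
                }
    return None
```

Because the fingerprint leaves out the line text, repeated occurrences of the same pattern in the same container collapse onto one cooldown key, which is what keeps a crash loop from producing one notification per line.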

**B. Notify** (single parametric workflow; called by each Poll workflow)

Input: `{ group, severity, pattern, container, ts, line, fingerprint }`

Nodes:

1. **Switch** on `severity`.
2. **critical** branch:
   - **HTTP Request** → ntfy: `POST https://ntfy.rahamafresh.com/<group>-alerts` with priority=5, tags=`rotating_light`.
   - **HTTP Request** → Evolution API: `POST https://<evolution-api-fqdn>/message/sendText/<instance>` with `{ number, text }`. Credentials via n8n credentials store.
3. **error** branch: ntfy only, priority=4.
4. **warn** branch: ntfy only, priority=3 (ntfy's default level; priority=2 is its `low` level).
5. **Append-row** (Postgres node, optional) → `alerts_audit` table for history.
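
The Switch branches above reduce to a small routing table. The sketch below builds the outgoing HTTP requests as plain dicts without sending them; `Title`, `Priority` and `Tags` are ntfy publish headers, while `build_requests`, the message format, and the `evolution_base`, `instance` and `number` parameters are placeholders assumed for illustration. Authentication is left to the n8n credentials store, as in the plan.

```python
NTFY_BASE = "https://ntfy.rahamafresh.com"

# Per-severity routing taken from the Notify branches.
ROUTES = {
    "critical": {"priority": "5", "tags": "rotating_light", "whatsapp": True},
    "error": {"priority": "4", "tags": None, "whatsapp": False},
    "warn": {"priority": "3", "tags": None, "whatsapp": False},
}


def build_requests(alert: dict, evolution_base: str, instance: str, number: str) -> list:
    """Return the HTTP requests the Notify workflow would send for one alert."""
    route = ROUTES[alert["severity"]]
    text = f"[{alert['severity'].upper()}] {alert['group']}/{alert['container']}: {alert['line']}"
    headers = {"Title": f"{alert['group']} {alert['severity']}", "Priority": route["priority"]}
    if route["tags"]:
        headers["Tags"] = route["tags"]
    requests = [{
        "method": "POST",
        "url": f"{NTFY_BASE}/{alert['group']}-alerts",
        "headers": headers,
        "body": text,
    }]
    if route["whatsapp"]:
        requests.append({
            "method": "POST",
            "url": f"{evolution_base}/message/sendText/{instance}",
            "json": {"number": number, "text": text},
        })
    return requests
```

Keeping the routing in one table is what lets a new channel be added in the Notify workflow alone, without touching any Poll workflow.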

### 4. Defaults (tunable per group via workflow Variables)

| Severity | Default patterns | Threshold | Cooldown | Routing |
| --- | --- | --- | --- | --- |
| critical | `panic`, `FATAL`, `OOMKilled`, `out of memory`, `segmentation fault` | immediate (1 match) | 30 min | ntfy + WhatsApp |
| error | `\bERROR\b`, `Exception`, `Traceback`, `5\d\d ` (HTTP 5xx) | 10 / 5 min | 15 min | ntfy |
| warn | `\bWARN(ING)?\b`, `deadlock`, `timeout` | 50 / 15 min | 30 min | ntfy (low prio) |

These live as a JSON object in each workflow's Variables, so per-group tuning is one edit.
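
A minimal sketch of how these defaults could drive the Threshold + Cooldown step. The `state` dict stands in for what `getWorkflowStaticData` would persist between runs; `evaluate`, the `hits` bucket, the field names in `DEFAULTS`, and the choice to fire on reaching the threshold (rather than exceeding it) are assumptions made for the example.

```python
# Thresholds from the table above: matches needed, rolling window, cooldown.
DEFAULTS = {
    "critical": {"threshold": 1, "window_min": 0, "cooldown_min": 30},
    "error": {"threshold": 10, "window_min": 5, "cooldown_min": 15},
    "warn": {"threshold": 50, "window_min": 15, "cooldown_min": 30},
}


def evaluate(matches: list, state: dict, now: float, config: dict = DEFAULTS) -> list:
    """Return the matches that should become alerts; update `state` in place."""
    hits = state.setdefault("hits", {})            # fingerprint -> recent match timestamps
    cooldowns = state.setdefault("cooldowns", {})  # fingerprint -> epoch when cooldown ends
    alerts = []
    for match in sorted(matches, key=lambda m: m["ts"]):
        rule = config[match["severity"]]
        fp = match["fingerprint"]
        window_start = match["ts"] - rule["window_min"] * 60
        recent = [t for t in hits.get(fp, []) if t > window_start] + [match["ts"]]
        hits[fp] = recent
        if len(recent) < rule["threshold"]:
            continue                               # not enough matches in the window yet
        if cooldowns.get(fp, 0) > now:
            continue                               # still hot: suppress the repeat
        cooldowns[fp] = now + rule["cooldown_min"] * 60
        hits[fp] = []                              # start counting afresh after alerting
        alerts.append(match)
    return alerts
```

With these numbers, eleven `ERROR` lines from one container inside the window yield a single alert: the tenth line fires it and the eleventh falls inside the 15-minute cooldown.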

### 5. Group naming

Friendly names are mapped from the Coolify resource UUID, sourced from `groups.yml` mounted into log-proxy. **`groups.yml` is also the allow-list**: only UUIDs listed here are monitored. Anything else the proxy sees on the host is ignored, so non-mission-critical apps don't generate noise or burn polling cycles.

```yaml
bo3nov2ija7g8wn9b1g2paxs: tracksolid
o55elukmxacgp1s2xcwktyam: n8n-prod
usoksgg8o40044g0cw08s8wc: n8n-simple
vc4ok84gw4s0kcgwwg8gooco: evolution-api
ks4sc8k4804swk0c0c4kk44c: chatwoot
foo048cw4skg8kswwsowwo0c: forgejo
u7rj0du43d33ncurig2t6ni1: dekart
e11bva63bu7swlq6zyfckxm3: rustfs
now8k08wcs044scwggos0wos: dozzle
# Coolify core, Supabase, shutterdiplomacy → handled as their own groups
#
# Explicitly NOT monitored (non-mission-critical, per user 2026-05-17):
#   dy82njm7qgb5f2m573d1u3rh   garage
#   r77s24tgmfifmpfqe86xyqsp   ente
#   vw0wk0cg8gkwgwogsg4k0gsg   excalidraw
```

Implication on the proxy: `GET /services` returns only allow-listed groups; `GET /logs/<group>` 404s for non-allow-listed UUIDs. To start monitoring a service later, add a single line to `groups.yml` and clone a Poll workflow.
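
The grouping and allow-list behaviour can be sketched as below. The input is assumed to be a list of `docker inspect`-style dicts, where `Config.Env` is a list of `KEY=value` strings. The helper names, the sample container names, and the decision to accept either a UUID or a friendly name in `/logs/<group>` are assumptions for illustration; the plan does not fix them.

```python
def resource_uuid(inspect: dict):
    """Pull COOLIFY_RESOURCE_UUID out of an inspect payload's Config.Env list."""
    for entry in inspect.get("Config", {}).get("Env") or []:
        key, _, value = entry.partition("=")
        if key == "COOLIFY_RESOURCE_UUID":
            return value
    return None


def build_services(inspected: list, allow_list: dict) -> list:
    """Group container names by UUID, keeping only allow-listed groups."""
    groups = {}
    for container in inspected:
        uuid = resource_uuid(container)
        if uuid in allow_list:
            groups.setdefault(uuid, []).append(container["Name"].lstrip("/"))
    return [
        {"group": uuid, "name": allow_list[uuid], "containers": sorted(names)}
        for uuid, names in groups.items()
    ]


def resolve_group(group: str, allow_list: dict):
    """Map a UUID or friendly name to an allow-listed UUID; None means respond 404."""
    if group in allow_list:
        return group
    for uuid, name in allow_list.items():
        if name == group:
            return uuid
    return None
```

Containers without the env var, or with a UUID missing from `groups.yml`, never appear in the output, which matches the "ignored entirely" rule above.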

## Workspace layout

```
/Users/kianiadee/Downloads/projects/03_dozzle_n8n/   ← no git yet
├── log-proxy/
│   ├── Dockerfile
│   ├── app.py                    (FastAPI: /services, /logs/<group>, /healthz)
│   ├── requirements.txt
│   └── groups.yml                (UUID → friendly-name map)
├── ntfy/
│   └── README.md                 (Coolify deploy notes + topic / user setup)
├── n8n/
│   └── workflows/
│       ├── poll-tracksolid.json
│       ├── poll-coolify.json
│       ├── poll-evolution.json
│       ├── poll-<group>.json     ← one per group, derived from a template
│       └── notify.json           ← parametric fan-out
├── coolify/
│   └── log-proxy.compose.yml     (for Coolify "Docker Compose" service)
└── README.md                     (operating runbook: how to add a group, tune thresholds, rotate ntfy creds)
```

## Implementation steps (ordered)

1. **Build log-proxy** locally (`log-proxy/`). Test against the remote docker socket via `docker context` or just deploy and iterate.
2. **Deploy log-proxy via Coolify** as a Docker Compose service. Attach to the same network as `n8n-o55...`. No Traefik route. Verify `GET /services` and `GET /logs/<group>` from inside the n8n container (`docker exec n8n-o55... wget -qO- http://log-proxy:8080/services`).
3. **Deploy self-hosted ntfy via Coolify** at `ntfy.rahamafresh.com`. Configure deny-all default and one publisher user. Subscribe phones to a test topic.
4. **Build the parametric Notify workflow** in n8n. Add credentials: `ntfy_publisher` (HTTP basic), `evolution_api` (header auth). Test by manually firing each branch.
5. **Build the Poll & Evaluate workflow** for **one group first** (suggest `tracksolid`, the highest business value). Validate thresholds with a synthetic log line (`docker exec ingest_events-bo3no... sh -c 'echo FATAL test > /proc/1/fd/1'` or similar). The redirect to `/proc/1/fd/1` matters: a plain `echo` under `docker exec` writes to the exec session, not to the container's log stream.
6. **Clone the Poll workflow per remaining group**. Tune patterns / thresholds in Variables.
7. **Tune & quiet**: run for 24h, capture false positives, adjust regex / thresholds.
8. **Document** in `README.md` how to add a new group when Coolify spins up a new service.

## Critical files

- `log-proxy/app.py`: the only component with docker.sock access. Treat as security-sensitive; no write endpoints, no shell-out.
- `log-proxy/groups.yml`: single source of truth for the UUID → friendly name map and the monitoring allow-list. Keep in sync as Coolify services are added.
- `n8n/workflows/notify.json`: fan-out logic; any new channel (Slack, email) is added here, not in each poll workflow.
- `n8n/workflows/poll-<group>.json`: per-group thresholds. The Variables block at the top is the only thing operators normally edit.
- `coolify/log-proxy.compose.yml`: controls log-proxy deployment and network attachment. A misconfigured network means n8n can't reach the proxy.

## Reused / existing infrastructure

- **n8n queue mode** `n8n-o55elukmxacgp1s2xcwktyam`: runs the workflows; its built-in Postgres + Redis cover persistence and queueing. No new DB needed.
- **Evolution API** `api-vc4ok84gw4s0kcgwwg8gooco`: already deployed; we only consume its REST API.
- **Coolify Sentinel** `coolify-sentinel`: left untouched; could later feed container-down events into the same Notify workflow if desired.
- **Coolify networks + Traefik**: handle internal service discovery and TLS for ntfy.
- **All Coolify-managed containers already carry `COOLIFY_RESOURCE_UUID`**: confirmed via `docker inspect` on the Dozzle container in the previous session. This is what makes auto-grouping possible without a hand-written container list.

## Open items to gather at implementation time

- `ntfy.rahamafresh.com` DNS record (or chosen FQDN).
- Evolution API: instance name, API key, target WhatsApp number(s).
- Confirmation of which Coolify network `n8n-o55...` runs on (read from `docker inspect` at implementation start).
- Optional: bearer token value for log-proxy if defence-in-depth is wanted.

## Verification

1. **log-proxy unit checks**: from inside the n8n container, `wget -qO- http://log-proxy:8080/services` returns all allow-listed groups; `wget -qO- "http://log-proxy:8080/logs/tracksolid?since=$(( $(date +%s) - 300 ))"` returns recent lines from all tracksolid containers.
2. **End-to-end critical alert**: run `docker exec <tracksolid-container> sh -c 'echo "FATAL synthetic test from $(date)" > /proc/1/fd/1'` so the line lands in that container's own log stream (a throwaway `docker run --rm alpine ...` container is not part of the tracksolid group and would never be polled); within 30s, ntfy topic `tracksolid-alerts` receives a high-priority message AND the WhatsApp number receives the same.
3. **Threshold smoke test**: emit 11 lines containing `ERROR` to a single container over 30s; expect exactly one ntfy notification, not eleven.
4. **Cooldown smoke test**: trigger the same critical alert twice within the cooldown window; expect only one notification.
5. **Cursor durability**: restart the n8n worker; confirm the cursor in `getWorkflowStaticData` persisted in Postgres and no logs were re-processed or skipped. Run this against the activated workflow: n8n saves static data only for production executions, not manual test runs.
6. **Per-group isolation**: deliberately spam errors in one group; confirm other groups' workflows are unaffected (separate workflow = separate static data, separate schedule).
7. **Read-only safety**: from inside n8n, attempt `POST http://log-proxy:8080/anything` and expect 404/405. Confirm `docker.sock` is not mounted inside n8n.
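
Check 5 depends on the cursor moving forward without replaying or dropping lines. One possible approach, sketched below, is to remember which records were already handled at exactly the cursor timestamp, so that a poll starting again at `since=<cursor>` can discard boundary duplicates. The function name `advance_cursor` and the `seen` key are assumptions for the example; `state` again stands in for the workflow's static data.

```python
def advance_cursor(records: list, state: dict) -> list:
    """Return records not processed before; update state['cursor'] and state['seen']."""
    cursor = state.get("cursor", 0.0)
    seen = set(state.get("seen", []))  # keys already handled at exactly `cursor`
    fresh = []
    for record in sorted(records, key=lambda r: r["ts"]):
        key = f"{record['container']}|{record['ts']}|{record['line']}"
        if record["ts"] < cursor or (record["ts"] == cursor and key in seen):
            continue                   # older than the cursor, or a boundary duplicate
        fresh.append(record)
        if record["ts"] > cursor:
            cursor, seen = record["ts"], set()  # moved forward: restart the boundary set
        seen.add(key)
    state["cursor"] = cursor
    state["seen"] = sorted(seen)       # stored as a list so it serialises as JSON
    return fresh
```

A limitation of this sketch: two byte-identical lines from the same container with the same timestamp are treated as one.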