infra(db-roles): dedicated non-superuser roles for the six apps on postgres

Six service connections run as the postgres SUPERUSER across two databases on the
shared 100-connection server — the root of the "too many connections" peaks and a
standing least-privilege risk. Superuser sessions ignore per-role CONNECTION LIMIT
and can consume the superuser-reserved slots.

Drafts (apply as postgres; nothing applied here):
- scripts/app_roles_tracksolid_db.sql — webhook_app, ingest_app, worker_app,
  dashboard_app. Capability groups (ts_app_read / ts_app_write), per-app NOSUPERUSER
  login roles with hard CONNECTION LIMIT + bounded GUCs (statement_timeout,
  idle_session_timeout, idle_in_transaction_session_timeout, lock_timeout).
- scripts/app_roles_fleet_platform.sql — gateway_app, cron_app (the apps on the
  separate fleet_platform DB), fp_app_rw group over its schemas.
- scripts/MIGRATE_APPS_OFF_SUPERUSER.md — runbook: discovery (what each app actually
  writes / whether it runs DDL), connection-budget table (sum ≈ 81 < 100), the
  object-ownership step for migration-running apps (reassign app schemas to the
  existing tracksolid_owner — scoped, never REASSIGN OWNED globally), one-at-a-time
  cutover, and instant rollback (DATABASE_URL only).

Grants are best-effort by app function and explicitly call out where to verify before
cutover; all objects are postgres-owned, so row DML works but DDL needs the ownership
step. See the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kiania 2026-06-19 23:51:52 +03:00
parent 0355047fdd
commit e1472adc3a
3 changed files with 334 additions and 0 deletions


@ -0,0 +1,144 @@
# Migrating the stack apps off the `postgres` superuser
## Why
The Postgres server (`timescale_db`) has `max_connections = 100`. Six service
connections run as the **`postgres` superuser**, each with a persistent pool that
sits idle for hours. That's the root of the intermittent `FATAL: sorry, too many
clients already`:
- superuser sessions can use the **`superuser_reserved_connections`** slots, so the
server can fill completely with no admin headroom;
- a per-role **`CONNECTION LIMIT`** is not enforced for superusers, and because all
six share one role, timeouts cannot be tuned per app;
- and it's a standing least-privilege risk (any of these apps can read/write/DROP
anything in any database).
Giving each app a dedicated **NOSUPERUSER** role with a hard `CONNECTION LIMIT` fixes
all three.
## The six connections (confirmed live)
| Service | Database | Current user | New role | Conn limit |
|---|---|---|---|---|
| `webhook_receiver` | tracksolid_db | postgres | `webhook_app` | 10 |
| `ingest_worker` | tracksolid_db | postgres | `ingest_app` | 10 |
| `worker` | tracksolid_db | postgres | `worker_app` (read) | 5 |
| `dashboard_api` (prod backend) | tracksolid_db | postgres | `dashboard_app` (or reuse `dashboard_ro`) | 8 |
| `gateway` | **fleet_platform** | postgres | `gateway_app` | 15 |
| `cron` | **fleet_platform** | postgres | `cron_app` | 5 |
> Note `gateway`/`cron` use a **different database** (`fleet_platform`) on the same
> server — they still count against the shared 100-slot ceiling.
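
The table above can be re-checked against the live server at any time. A minimal sketch, assuming it is run as `postgres` and that each service sets a distinguishable `application_name` or connects from its own address (neither is stated in this runbook); it lists client sessions whose login role is a superuser:

```sql
-- Client sessions running as a superuser role, grouped by database and origin.
-- pg_stat_activity is server-wide, so both tracksolid_db and fleet_platform appear.
SELECT a.datname, a.usename, a.application_name, a.client_addr,
       count(*)                                 AS sessions,
       count(*) FILTER (WHERE a.state = 'idle') AS idle
FROM pg_stat_activity a
JOIN pg_roles r ON r.oid = a.usesysid
WHERE r.rolsuper
  AND a.backend_type = 'client backend'
GROUP BY 1, 2, 3, 4
ORDER BY sessions DESC;
```

After the migration this query should return only interactive admin sessions.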
### Connection budget (keep the sum < ~95, leaving 3 reserved + admin headroom)
```
webhook_app 10 + ingest_app 10 + worker_app 5 + dashboard_app 8 = 33 (tracksolid_db)
gateway_app 15 + cron_app 5 = 20 (fleet_platform)
analytics_ro ~8 + dashboard_ro ~12 + grafana_ro ~5 + reporting_refresher ~3 = ~28 (existing)
TOTAL ≈ 81 ✅
```
Tune the `CONNECTION LIMIT`s in the SQL to your real pool sizes; the point is the sum
is now **bounded and visible**, not open-ended superuser pools.
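
Once the roles exist, the same budget can be read back from the catalog instead of being maintained by hand. A sketch, assuming every pooled client uses a non-superuser login role; roles without a cap (`rolconnlimit = -1`) are listed separately so they do not hide in the sum:

```sql
-- Sum of the hard caps across non-superuser login roles, next to the server ceiling.
SELECT sum(rolconnlimit)                                       AS budgeted_connections,
       current_setting('max_connections')::int                 AS max_connections,
       current_setting('superuser_reserved_connections')::int  AS reserved
FROM pg_roles
WHERE rolcanlogin AND NOT rolsuper AND rolconnlimit >= 0;

-- Login roles that still have no cap and are therefore outside the budget.
SELECT rolname
FROM pg_roles
WHERE rolcanlogin AND NOT rolsuper AND rolconnlimit < 0
ORDER BY rolname;
```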
## Step 1 — Discover what each app actually needs (do NOT skip)
The drafted grants are best-effort (ingestion = write telemetry; gateway/cron = RW
app state; worker/dashboard = read). Confirm before cutover:
```sql
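-- (a) Which tables does each app WRITE? Reset stats with SELECT pg_stat_reset();
--     (this also clears the counters autovacuum relies on), run the app for a bit, re-check.
--     Counters are per table, not per role, so attribute writes by table name.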
SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE n_tup_ins + n_tup_upd + n_tup_del > 0
ORDER BY 1,2;
-- (b) Does the app run DDL/migrations at deploy? Check its code/entrypoint for
-- CREATE/ALTER/DROP or a migrations runner (e.g. run_migrations.py, alembic).
-- If yes → it needs object OWNERSHIP, see Step 3.
```
Or temporarily set `log_statement = 'ddl'` (or `'mod'`) and watch one deploy cycle.
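
One way to toggle that logging without a restart is sketched below. It assumes superuser access and that nothing else manages `log_statement` through `ALTER SYSTEM`, because the reset removes that entry from `postgresql.auto.conf`:

```sql
-- Turn on DDL logging for the observation window (a reload is enough).
ALTER SYSTEM SET log_statement = 'ddl';
SELECT pg_reload_conf();

-- ... run one deploy cycle and read the server log ...

-- Restore the previous behaviour.
ALTER SYSTEM RESET log_statement;
SELECT pg_reload_conf();
```

While all apps still share the `postgres` role, the log cannot tell them apart by user, so match entries by database, timestamp, or `application_name` if the apps set one.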
## Step 2 — Create the roles (no app impact yet)
Generate a password per role (host-only, 0600), then apply the SQL as postgres:
```bash
for r in webhook_app ingest_app worker_app dashboard_app gateway_app cron_app; do
  [ -s ~/.$r.pw ] || ( umask 077; openssl rand -hex 24 > ~/.$r.pw )
done
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)
docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
  -v webhook_pw="$(cat ~/.webhook_app.pw)" -v ingest_pw="$(cat ~/.ingest_app.pw)" \
  -v worker_pw="$(cat ~/.worker_app.pw)" -v dash_pw="$(cat ~/.dashboard_app.pw)" \
  < scripts/app_roles_tracksolid_db.sql
docker exec -i "$DB" psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
  -v gateway_pw="$(cat ~/.gateway_app.pw)" -v cron_pw="$(cat ~/.cron_app.pw)" \
  < scripts/app_roles_fleet_platform.sql
```
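
Before touching any app, it may be worth confirming that the roles landed as intended. A sketch using only catalog views; roles are cluster-wide, so it can be run from either database:

```sql
-- Attributes, connection cap and per-role GUCs of the six new login roles.
SELECT rolname, rolsuper, rolcanlogin, rolconnlimit, rolconfig
FROM pg_roles
WHERE rolname IN ('webhook_app', 'ingest_app', 'worker_app',
                  'dashboard_app', 'gateway_app', 'cron_app')
ORDER BY rolname;

-- Group memberships (which capability group each app role belongs to).
SELECT m.rolname AS member, g.rolname AS granted_role
FROM pg_auth_members am
JOIN pg_roles m ON m.oid = am.member
JOIN pg_roles g ON g.oid = am.roleid
WHERE m.rolname IN ('webhook_app', 'ingest_app', 'worker_app',
                    'dashboard_app', 'gateway_app', 'cron_app')
ORDER BY 1, 2;
```

Every row should show `rolsuper = f` and a non-negative `rolconnlimit`.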
## Step 3 — (Only if an app runs migrations) give its role object ownership
All objects are owned by `postgres`, so a non-superuser role can write **rows** but
not `ALTER`/`DROP` existing tables. If discovery showed an app issues DDL, reassign
the **app schemas** to the existing non-superuser owner role and add the app role to
it. **Scope this to the app schemas — never `REASSIGN OWNED BY postgres` globally**
(that would also try to move TimescaleDB/system objects).
```sql
-- tracksolid_db: make tracksolid_owner own the app objects, then add the ingestors.
DO $$
DECLARE r record;
BEGIN
  FOR r IN
    SELECT n.nspname, c.relname,
           CASE c.relkind WHEN 'v' THEN 'VIEW' WHEN 'm' THEN 'MATERIALIZED VIEW' ELSE 'TABLE' END AS kind
    FROM pg_class c JOIN pg_namespace n ON n.oid=c.relnamespace
    WHERE n.nspname IN ('tracksolid','reporting') AND c.relkind IN ('r','p','v','m')
      -- Only move postgres-owned objects; leaves alone anything another role already
      -- owns (e.g. reporting.v_trips if reporting_refresher owns it).
      AND pg_get_userbyid(c.relowner) = 'postgres'
  LOOP
    EXECUTE format('ALTER %s %I.%I OWNER TO tracksolid_owner', r.kind, r.nspname, r.relname);
  END LOOP;
END $$;
GRANT CREATE ON SCHEMA tracksolid, reporting TO tracksolid_owner;
GRANT tracksolid_owner TO webhook_app, ingest_app; -- they inherit ownership rights
```
(Do the analogous reassignment in `fleet_platform` to a `fleet_platform_owner` role
if `gateway`/`cron` run migrations. Keep `reporting.v_trips` owned by
`reporting_refresher` if that role refreshes it.)
Test one deploy/migration as the new role **before** cutting over all apps.
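
A quick catalog check can precede that test deploy. The sketch below assumes the Step 3 statements were run on `tracksolid_db` and uses `webhook_app` as the example role:

```sql
-- Who owns each app relation now?
SELECT n.nspname, c.relname, pg_get_userbyid(c.relowner) AS owner
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname IN ('tracksolid', 'reporting')
  AND c.relkind IN ('r', 'p', 'v', 'm')
ORDER BY 1, 2;

-- Are the owner role's rights available to the app role without SET ROLE?
SELECT pg_has_role('webhook_app', 'tracksolid_owner', 'USAGE') AS inherits_owner_rights;
```

Tables created by a migration running as `webhook_app` are owned by `webhook_app` unless the migration first runs `SET ROLE tracksolid_owner`, and the `ALTER DEFAULT PRIVILEGES FOR ROLE postgres` entries in the role scripts do not apply to objects created by other roles.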
## Step 4 — Cut over one app at a time
For each service, change its `DATABASE_URL` user/password from `postgres:…` to the new
role (same host/port/dbname), redeploy **just that one**, and watch its logs for
`permission denied` (→ widen the group grant) and the DB for connection count:
```bash
# in the app's env (Coolify secret or compose):
# tracksolid_db: postgresql://webhook_app:<pw>@timescale_db:5432/tracksolid_db
# fleet_platform: postgresql://gateway_app:<pw>@timescale_db:5432/fleet_platform
docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
"SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
```
Order: start with the **lowest-risk reader** (`worker`/`dashboard_api`), then the
ingestors, then `gateway`/`cron`.
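
Many `permission denied` errors can be caught before the redeploy by asking the catalog what the new role is allowed to do. A sketch for `webhook_app` on `tracksolid_db`; the privilege set (SELECT, INSERT, UPDATE) is an assumption about what an ingestor needs, so adjust it per app:

```sql
-- Run as postgres on tracksolid_db.
-- Lists tables in the telemetry schema where webhook_app lacks an assumed-required privilege.
SELECT n.nspname, c.relname,
       has_table_privilege('webhook_app', c.oid, 'SELECT') AS can_select,
       has_table_privilege('webhook_app', c.oid, 'INSERT') AS can_insert,
       has_table_privilege('webhook_app', c.oid, 'UPDATE') AS can_update
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'tracksolid'
  AND c.relkind IN ('r', 'p')
  AND NOT (has_table_privilege('webhook_app', c.oid, 'SELECT')
       AND has_table_privilege('webhook_app', c.oid, 'INSERT')
       AND has_table_privilege('webhook_app', c.oid, 'UPDATE'))
ORDER BY 1, 2;
```

An empty result means the role, including what it inherits from its group, holds all three privileges on every table in the schema.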
## Rollback (instant)
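Each app's only change is its `DATABASE_URL`. If anything misbehaves, set it back to
the `postgres:…` DSN and redeploy that one app; no DB change is required. The roles are
additive. To remove one entirely once nothing uses it, first run `DROP OWNED BY <app>;`
in each database where it holds grants (a role that still has privileges cannot be
dropped; note this also drops any objects the role itself owns), then `DROP ROLE <app>;`.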
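
Removing a role takes more than a bare `DROP ROLE` while it still holds grants, because PostgreSQL refuses to drop a role that has privileges on any object. A sketch for `webhook_app`, assuming `tracksolid_owner` is the right home for anything the role may have created:

```sql
-- In EACH database where the role holds grants (here: tracksolid_db), as postgres:
REASSIGN OWNED BY webhook_app TO tracksolid_owner;  -- keep anything it created
DROP OWNED BY webhook_app;                          -- now only revokes its privileges

-- Then once, from any database:
DROP ROLE webhook_app;
```

Without the `REASSIGN OWNED` step, `DROP OWNED BY` would drop tables that a migration created under the app role.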
## After all six are migrated
- `idle_session_timeout` is already covered by the per-role GUCs in the SQL scripts; nothing more to add.
- Consider **rotating the `postgres` superuser password** and restricting it to admin
use only (it should no longer appear in any app's env).
- Re-check the budget: `SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1;`
— no app should exceed its `CONNECTION LIMIT`, and the total should sit comfortably
under 100. This is also when PgBouncer (separate PR) becomes optional rather than
necessary.
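
The budget re-check reads more easily with each role's cap next to its live count. A sketch using only `pg_roles` and `pg_stat_activity`:

```sql
-- Open connections per login role, compared with its CONNECTION LIMIT (-1 = no cap).
SELECT r.rolname,
       r.rolconnlimit,
       count(a.pid)                                 AS open_connections,
       count(a.pid) FILTER (WHERE a.state = 'idle') AS idle
FROM pg_roles r
LEFT JOIN pg_stat_activity a ON a.usename = r.rolname
WHERE r.rolcanlogin
GROUP BY r.rolname, r.rolconnlimit
ORDER BY open_connections DESC;
```

The `postgres` row should show only admin sessions at this point.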


@ -0,0 +1,71 @@
-- app_roles_fleet_platform.sql — dedicated NON-SUPERUSER login roles for the apps
-- that connect to the fleet_platform database as the `postgres` SUPERUSER.
-- ─────────────────────────────────────────────────────────────────────────────
-- Sibling of app_roles_tracksolid_db.sql, for the OTHER database on the same server.
-- gateway + cron (the fleet_platform Coolify app) connect here as postgres. Same
-- rationale: least privilege + a hard per-role CONNECTION LIMIT so they can't
-- exhaust the server-wide 100-connection ceiling.
--
-- Schemas in fleet_platform: auth, domain, events, geo, ops, serve, slo, state
-- (all owned by postgres). gateway (the API) and cron (scheduled jobs) almost
-- certainly READ+WRITE app state across these, so they get DML; widen/narrow per
-- the discovery step in MIGRATE_APPS_OFF_SUPERUSER.md. As with the sibling file,
-- this does NOT change object ownership, so it does not grant DDL on existing
-- (postgres-owned) objects — see step 3 of the runbook if these apps run migrations.
--
-- Run as the postgres SUPERUSER, on the fleet_platform database:
-- docker exec -i <timescale_db> psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
-- -v gateway_pw="$(cat ~/.gateway_app.pw)" \
-- -v cron_pw="$(cat ~/.cron_app.pw)" \
-- < scripts/app_roles_fleet_platform.sql
\set ON_ERROR_STOP on
-- ── 1. Capability group (read + write across the app schemas) ───────────────────
DO $$ BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='fp_app_rw') THEN CREATE ROLE fp_app_rw NOLOGIN; END IF;
END $$;
DO $grants$
DECLARE s text;
BEGIN
FOREACH s IN ARRAY ARRAY['auth','domain','events','geo','ops','serve','slo','state'] LOOP
EXECUTE format('GRANT USAGE ON SCHEMA %I TO fp_app_rw', s);
EXECUTE format('GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA %I TO fp_app_rw', s);
EXECUTE format('GRANT USAGE, SELECT, UPDATE ON ALL SEQUENCES IN SCHEMA %I TO fp_app_rw', s);
EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO fp_app_rw', s);
EXECUTE format('ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA %I GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO fp_app_rw', s);
EXECUTE format('ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA %I GRANT USAGE, SELECT, UPDATE ON SEQUENCES TO fp_app_rw', s);
EXECUTE format('ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA %I GRANT EXECUTE ON FUNCTIONS TO fp_app_rw', s);
END LOOP;
END $grants$;
-- ── 2. Per-app LOGIN roles ──────────────────────────────────────────────────────
-- gateway — the request-facing API (latency-sensitive: short statement_timeout).
DO $$ BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='gateway_app') THEN
CREATE ROLE gateway_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
END IF; END $$;
ALTER ROLE gateway_app WITH LOGIN PASSWORD :'gateway_pw' CONNECTION LIMIT 15;
GRANT CONNECT ON DATABASE fleet_platform TO gateway_app;
GRANT fp_app_rw TO gateway_app;
ALTER ROLE gateway_app SET statement_timeout = '15s';
ALTER ROLE gateway_app SET idle_in_transaction_session_timeout = '30s';
ALTER ROLE gateway_app SET idle_session_timeout = '5min';
ALTER ROLE gateway_app SET lock_timeout = '3s';
-- cron — scheduled/background jobs (longer queries tolerated).
DO $$ BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='cron_app') THEN
CREATE ROLE cron_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
END IF; END $$;
ALTER ROLE cron_app WITH LOGIN PASSWORD :'cron_pw' CONNECTION LIMIT 5;
GRANT CONNECT ON DATABASE fleet_platform TO cron_app;
GRANT fp_app_rw TO cron_app;
ALTER ROLE cron_app SET statement_timeout = '120s';
ALTER ROLE cron_app SET idle_in_transaction_session_timeout = '120s';
ALTER ROLE cron_app SET idle_session_timeout = '10min';
ALTER ROLE cron_app SET lock_timeout = '5s';
-- ── 3. Verify ───────────────────────────────────────────────────────────────────
-- \du+


@ -0,0 +1,119 @@
-- app_roles_tracksolid_db.sql — dedicated NON-SUPERUSER login roles for the apps
-- that currently connect to tracksolid_db as the `postgres` SUPERUSER.
-- ─────────────────────────────────────────────────────────────────────────────
-- WHY: six stack services connect to this Postgres server as the postgres superuser
-- (webhook_receiver, ingest_worker, worker, the prod dashboard_api backend on
-- tracksolid_db; gateway + cron on fleet_platform — see the sibling file). That is
-- both a least-privilege problem AND the root of the "too many connections" error:
-- superuser sessions ignore per-role connection caps and can exhaust the 100-slot
-- ceiling (incl. the superuser-reserved slots). Dedicated roles let us pin a hard
-- CONNECTION LIMIT and timeouts per app.
--
-- WHAT THIS DOES (run as the postgres SUPERUSER, on tracksolid_db):
-- * creates capability GROUP roles (NOLOGIN) for read vs. read-write,
-- * creates one LOGIN role per app, NOSUPERUSER, with a CONNECTION LIMIT and
-- bounded GUCs, as a member of the group it needs,
-- * grants the groups SELECT / DML on the operational schemas.
--
-- WHAT IT DOES *NOT* DO: change object ownership. All objects here are owned by
-- `postgres`, so a non-superuser role can write ROWS but cannot ALTER/DROP existing
-- tables (i.e. run migrations). If an app runs DDL at deploy, see step 3 in
-- MIGRATE_APPS_OFF_SUPERUSER.md (reassign the app schemas to `tracksolid_owner` and
-- add the app role to it). Roles here INHERIT, so membership grants apply directly.
--
-- Idempotent. Passwords are supplied as psql vars (never stored in the repo):
-- docker exec -i <timescale_db> psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
-- -v webhook_pw="$(cat ~/.webhook_app.pw)" \
-- -v ingest_pw="$(cat ~/.ingest_app.pw)" \
-- -v worker_pw="$(cat ~/.worker_app.pw)" \
-- -v dash_pw="$(cat ~/.dashboard_app.pw)" \
-- < scripts/app_roles_tracksolid_db.sql
\set ON_ERROR_STOP on
-- ── 1. Capability groups (NOLOGIN; apps inherit privileges via membership) ──────
DO $$
BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='ts_app_read') THEN CREATE ROLE ts_app_read NOLOGIN; END IF;
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='ts_app_write') THEN CREATE ROLE ts_app_write NOLOGIN; END IF;
END $$;
-- Read surface: telemetry + curated reporting layer.
GRANT USAGE ON SCHEMA tracksolid, reporting TO ts_app_read;
GRANT SELECT ON ALL TABLES IN SCHEMA tracksolid, reporting TO ts_app_read;
GRANT SELECT ON reporting.v_trips TO ts_app_read; -- matview; explicit for clarity (ALL TABLES already covers views and matviews)
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA reporting TO ts_app_read;
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA tracksolid, reporting GRANT SELECT ON TABLES TO ts_app_read;
-- Write surface for ingestion: row DML on telemetry (NOT DDL — see header).
GRANT ts_app_read TO ts_app_write; -- write implies read
GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA tracksolid TO ts_app_write;
GRANT USAGE, SELECT, UPDATE ON ALL SEQUENCES IN SCHEMA tracksolid TO ts_app_write;
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA tracksolid
GRANT INSERT, UPDATE, DELETE ON TABLES TO ts_app_write;
ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA tracksolid
GRANT USAGE, SELECT, UPDATE ON SEQUENCES TO ts_app_write;
-- ── 2. Per-app LOGIN roles ──────────────────────────────────────────────────────
-- CONNECTION LIMIT is the hard budget cap (sum across all roles must stay < 100).
-- GUCs are belt-and-braces and tunable per app.
-- webhook_receiver — ingests Tracksolid webhooks (writes telemetry; may run migrations).
DO $$ BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='webhook_app') THEN
CREATE ROLE webhook_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
END IF; END $$;
ALTER ROLE webhook_app WITH LOGIN PASSWORD :'webhook_pw' CONNECTION LIMIT 10;
GRANT CONNECT ON DATABASE tracksolid_db TO webhook_app;
GRANT ts_app_write TO webhook_app;
ALTER ROLE webhook_app SET statement_timeout = '120s'; -- bulk inserts
ALTER ROLE webhook_app SET idle_in_transaction_session_timeout = '120s';
ALTER ROLE webhook_app SET idle_session_timeout = '10min';
ALTER ROLE webhook_app SET lock_timeout = '5s';
-- ingest_worker — background ingestion/normalisation (writes telemetry).
DO $$ BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='ingest_app') THEN
CREATE ROLE ingest_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
END IF; END $$;
ALTER ROLE ingest_app WITH LOGIN PASSWORD :'ingest_pw' CONNECTION LIMIT 10;
GRANT CONNECT ON DATABASE tracksolid_db TO ingest_app;
GRANT ts_app_write TO ingest_app;
ALTER ROLE ingest_app SET statement_timeout = '120s';
ALTER ROLE ingest_app SET idle_in_transaction_session_timeout = '120s';
ALTER ROLE ingest_app SET idle_session_timeout = '10min';
ALTER ROLE ingest_app SET lock_timeout = '5s';
-- If ingestion REFRESHes reporting.v_trips, add it to the existing refresher role:
-- GRANT reporting_refresher TO ingest_app; -- (uncomment after confirming)
-- worker — fleet_platform worker that also reads tracksolid_db. Assumed READ-ONLY
-- here; widen to ts_app_write only if it actually writes telemetry.
DO $$ BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='worker_app') THEN
CREATE ROLE worker_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
END IF; END $$;
ALTER ROLE worker_app WITH LOGIN PASSWORD :'worker_pw' CONNECTION LIMIT 5;
GRANT CONNECT ON DATABASE tracksolid_db TO worker_app;
GRANT ts_app_read TO worker_app;
ALTER ROLE worker_app SET statement_timeout = '60s';
ALTER ROLE worker_app SET idle_in_transaction_session_timeout = '60s';
ALTER ROLE worker_app SET idle_session_timeout = '10min';
ALTER ROLE worker_app SET lock_timeout = '5s';
-- dashboard_api (PROD backend, currently postgres). If it only reads, prefer the
-- existing dashboard_ro. This role is for a backend that ALSO writes app state;
-- start read-only and widen per discovery.
DO $$ BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='dashboard_app') THEN
CREATE ROLE dashboard_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
END IF; END $$;
ALTER ROLE dashboard_app WITH LOGIN PASSWORD :'dash_pw' CONNECTION LIMIT 8;
GRANT CONNECT ON DATABASE tracksolid_db TO dashboard_app;
GRANT ts_app_read TO dashboard_app;
ALTER ROLE dashboard_app SET statement_timeout = '30s';
ALTER ROLE dashboard_app SET idle_in_transaction_session_timeout = '60s';
ALTER ROLE dashboard_app SET idle_session_timeout = '5min';
ALTER ROLE dashboard_app SET lock_timeout = '5s';
-- ── 3. Verify ───────────────────────────────────────────────────────────────────
-- \du+ -- inspect roles, CONNECTION LIMIT, and memberships