kiania e1472adc3a infra(db-roles): dedicated non-superuser roles for the six apps on postgres

Six service connections run as the postgres SUPERUSER across two databases on the
shared 100-connection server — the root of the "too many connections" peaks and a
standing least-privilege risk. Superuser sessions ignore per-role CONNECTION LIMIT
and can consume the superuser-reserved slots.

Drafts (apply as postgres; nothing applied here):
- scripts/app_roles_tracksolid_db.sql — webhook_app, ingest_app, worker_app,
  dashboard_app. Capability groups (ts_app_read / ts_app_write), per-app NOSUPERUSER
  login roles with hard CONNECTION LIMIT + bounded GUCs (statement_timeout,
  idle_session_timeout, idle_in_transaction, lock_timeout).
- scripts/app_roles_fleet_platform.sql — gateway_app, cron_app (the apps on the
  separate fleet_platform DB), fp_app_rw group over its schemas.
- scripts/MIGRATE_APPS_OFF_SUPERUSER.md — runbook: discovery (what each app actually
  writes / whether it runs DDL), connection-budget table (sum ≈ 81 < 100), the
  object-ownership step for migration-running apps (reassign app schemas to the
  existing tracksolid_owner — scoped, never REASSIGN OWNED globally), one-at-a-time
  cutover, and instant rollback (DATABASE_URL only).

Grants are best-effort by app function and explicitly call out where to verify before
cutover; all objects are postgres-owned, so row DML works but DDL needs the ownership
step. See the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-19 23:51:52 +03:00


Migrating the stack apps off the `postgres` superuser

Why

The Postgres server (timescale_db) has max_connections = 100. Six service connections run as the postgres superuser, each with a persistent pool that sits idle for hours. That's the root of the intermittent FATAL: sorry, too many clients already:

- superuser sessions can use the superuser_reserved_connections slots, so the server can fill completely with no admin headroom;
- a per-role CONNECTION LIMIT is not enforced against superusers, and timeouts cannot be scoped per app while all six share the one postgres role;
- it's a standing least-privilege risk (any of these apps can read/write/DROP anything in any database).

Giving each app a dedicated NOSUPERUSER role with a hard CONNECTION LIMIT fixes all three.

The six connections (confirmed live)

| Service | Database | Current user | New role | Conn limit |
|---|---|---|---|---|
| `webhook_receiver` | tracksolid_db | postgres | `webhook_app` | 10 |
| `ingest_worker` | tracksolid_db | postgres | `ingest_app` | 10 |
| `worker` | tracksolid_db | postgres | `worker_app` (read) | 5 |
| `dashboard_api` (prod backend) | tracksolid_db | postgres | `dashboard_app` (or reuse `dashboard_ro`) | 8 |
| `gateway` | fleet_platform | postgres | `gateway_app` | 15 |
| `cron` | fleet_platform | postgres | `cron_app` | 5 |

Note gateway/cron use a different database (fleet_platform) on the same server — they still count against the shared 100-slot ceiling.

Connection budget (keep the sum < ~95, leaving 3 reserved + admin headroom)

webhook_app 10 + ingest_app 10 + worker_app 5 + dashboard_app 8       = 33  (tracksolid_db)
gateway_app 15 + cron_app 5                                            = 20  (fleet_platform)
analytics_ro ~8 + dashboard_ro ~12 + grafana_ro ~5 + reporting_refresher ~3 = ~28  (existing)
                                                                  TOTAL ≈ 81  ✅

Tune the CONNECTION LIMITs in the SQL to your real pool sizes; the point is the sum is now bounded and visible, not open-ended superuser pools.
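The budget arithmetic above can be re-checked whenever a limit changes. This is a minimal sketch, not part of the drafted scripts: the helper names budget_total and check_budget are hypothetical, and the role:limit pairs simply mirror the figures above.

```sh
#!/bin/sh
# Sum per-role connection limits given as role:limit pairs.
budget_total() {
  total=0
  for pair in "$@"; do
    limit=${pair#*:}            # text after the first colon
    total=$((total + limit))
  done
  echo "$total"
}

# Compare the summed limits against a ceiling (first argument).
check_budget() {
  ceiling=$1; shift
  total=$(budget_total "$@")
  if [ "$total" -lt "$ceiling" ]; then
    echo "OK: $total < $ceiling"
  else
    echo "OVER: $total >= $ceiling"
    return 1
  fi
}

# Figures from the budget above; replace with your real pool sizes.
check_budget 95 \
  webhook_app:10 ingest_app:10 worker_app:5 dashboard_app:8 \
  gateway_app:15 cron_app:5 \
  analytics_ro:8 dashboard_ro:12 grafana_ro:5 reporting_refresher:3
# prints: OK: 81 < 95
```

The non-zero return on an overrun makes the check usable as a gate in a deploy script.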

Step 1 — Discover what each app actually needs (do NOT skip)

The drafted grants are best-effort (ingestion = write telemetry; gateway/cron = RW app state; worker/dashboard = read). Confirm before cutover:

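-- (a) Which tables does each app WRITE? Reset the counters, run the app for a bit, re-check.
--     Counters are per table, not per role, so isolate one app at a time where you can.
SELECT pg_stat_reset();   -- resets the statistics counters for the current database
SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE n_tup_ins + n_tup_upd + n_tup_del > 0
ORDER BY 1,2;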

-- (b) Does the app run DDL/migrations at deploy? Check its code/entrypoint for
--     CREATE/ALTER/DROP or a migrations runner (e.g. run_migrations.py, alembic).
--     If yes → it needs object OWNERSHIP, see Step 3.

Or temporarily set log_statement = 'ddl' (or 'mod') and watch one deploy cycle.
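For check (b), a quick text search over an app's source tree can give a first hint before reading the code. This is a rough sketch under assumptions: scan_for_ddl is a hypothetical helper, the path in the usage comment is a placeholder, and the patterns only cover the keywords and runner names mentioned above. A match is a lead to inspect, not proof that the app runs DDL at deploy.

```sh
#!/bin/sh
# List files under a source tree that mention DDL or a migrations runner.
# Matches in comments and string literals show up too, so review each hit.
scan_for_ddl() {
  grep -rlEi \
    -e '(CREATE|ALTER|DROP)[[:space:]]+(TABLE|INDEX|SCHEMA|VIEW|MATERIALIZED|SEQUENCE|TYPE|EXTENSION)' \
    -e 'alembic|run_migrations' \
    "$1" 2>/dev/null | sort
}

# Usage (the path is a placeholder for wherever the app's code lives):
#   scan_for_ddl ./path/to/ingest_worker
```

An empty result does not rule out DDL issued by a dependency or an ORM, which is why the log_statement check remains the more reliable signal.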

Step 2 — Create the roles (no app impact yet)

Generate a password per role (host-only, 0600), then apply the SQL as postgres:

for r in webhook_app ingest_app worker_app dashboard_app gateway_app cron_app; do
  [ -s ~/.$r.pw ] || ( umask 077; openssl rand -hex 24 > ~/.$r.pw )
done
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)

docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
  -v webhook_pw="$(cat ~/.webhook_app.pw)" -v ingest_pw="$(cat ~/.ingest_app.pw)" \
  -v worker_pw="$(cat ~/.worker_app.pw)" -v dash_pw="$(cat ~/.dashboard_app.pw)" \
  < scripts/app_roles_tracksolid_db.sql

docker exec -i "$DB" psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
  -v gateway_pw="$(cat ~/.gateway_app.pw)" -v cron_pw="$(cat ~/.cron_app.pw)" \
  < scripts/app_roles_fleet_platform.sql
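After applying the SQL, it may be worth confirming that the new roles came out as intended before touching any app. The sketch below assumes psql's unaligned, tuples-only output (-At), which separates columns with | and prints booleans as t or f; check_role_attrs is a hypothetical helper, not something the drafted scripts provide.

```sh
#!/bin/sh
# Read "rolname|rolsuper|rolconnlimit" lines (psql -At format) on stdin and
# flag any role that is a superuser or has no connection limit (-1).
check_role_attrs() {
  awk -F'|' '
    $2 == "t"  { print "FAIL " $1 ": superuser"; bad = 1 }
    $3 == "-1" { print "FAIL " $1 ": no CONNECTION LIMIT"; bad = 1 }
    $2 == "f" && $3 != "-1" { print "ok   " $1 " (limit " $3 ")" }
    END { exit bad }
  '
}

# Usage against the live server ($DB as set above); pg_roles is cluster-wide,
# so either database works:
#   docker exec -i "$DB" psql -U postgres -d tracksolid_db -At -c \
#     "SELECT rolname, rolsuper, rolconnlimit FROM pg_roles
#      WHERE rolname IN ('webhook_app','ingest_app','worker_app',
#                        'dashboard_app','gateway_app','cron_app')
#      ORDER BY 1" | check_role_attrs
```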

Step 3 — (Only if an app runs migrations) give its role object ownership

All objects are owned by postgres, so a non-superuser role can write rows but not ALTER/DROP existing tables. If discovery showed an app issues DDL, reassign the app schemas to the existing non-superuser owner role and add the app role to it. Scope this to the app schemas — never REASSIGN OWNED BY postgres globally (that would also try to move TimescaleDB/system objects).

-- tracksolid_db: make tracksolid_owner own the app objects, then add the ingestors.
DO $$
DECLARE r record;
BEGIN
  FOR r IN
    SELECT n.nspname, c.relname,
           CASE c.relkind WHEN 'v' THEN 'VIEW' WHEN 'm' THEN 'MATERIALIZED VIEW' ELSE 'TABLE' END AS kind
    FROM pg_class c JOIN pg_namespace n ON n.oid=c.relnamespace
    WHERE n.nspname IN ('tracksolid','reporting') AND c.relkind IN ('r','p','v','m')
  LOOP
    EXECUTE format('ALTER %s %I.%I OWNER TO tracksolid_owner', r.kind, r.nspname, r.relname);
  END LOOP;
END $$;
GRANT CREATE ON SCHEMA tracksolid, reporting TO tracksolid_owner;
GRANT tracksolid_owner TO webhook_app, ingest_app;   -- they inherit ownership rights

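(Do the analogous reassignment in fleet_platform to a fleet_platform_owner role if gateway/cron run migrations. If reporting_refresher refreshes reporting.v_trips, keep that object owned by reporting_refresher: the loop above moves every relation in the reporting schema, so exclude it from the loop or restore its owner afterwards.)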

Test one deploy/migration as the new role before cutting over all apps.

Step 4 — Cut over one app at a time

For each service, change its DATABASE_URL user/password from postgres:… to the new role (same host/port/dbname), redeploy just that one, and watch its logs for permission denied (→ widen the group grant) and the DB for connection count:

# in the app's env (Coolify secret or compose):
#   tracksolid_db: postgresql://webhook_app:<pw>@timescale_db:5432/tracksolid_db
#   fleet_platform: postgresql://gateway_app:<pw>@timescale_db:5432/fleet_platform
docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

Order: start with the lowest-risk reader (worker/dashboard_api), then the ingestors, then gateway/cron.
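Because the cutover changes only the user and password in the DSN, the new value can be derived from the old one rather than typed by hand. This is a small sketch with a hypothetical helper name (swap_dsn_user); it assumes the password needs no URL-encoding, which holds for the hex passwords generated in Step 2 but not for arbitrary ones.

```sh
#!/bin/sh
# Swap the user and password in a postgresql:// URL, keeping host/port/dbname.
swap_dsn_user() {
  dsn=$1; role=$2; pw=$3
  scheme=${dsn%%://*}        # e.g. postgresql
  rest=${dsn#*://}           # user:pw@host:port/db
  hostpart=${rest##*@}       # host:port/db (text after the last @)
  printf '%s://%s:%s@%s\n' "$scheme" "$role" "$pw" "$hostpart"
}

# Usage (reads the password file created in Step 2):
#   swap_dsn_user "$DATABASE_URL" webhook_app "$(cat ~/.webhook_app.pw)"
```

Keeping the old DSN untouched in a variable or secret also keeps the rollback value at hand.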

Rollback (instant)

Each app's only change is its DATABASE_URL. If anything misbehaves, set it back to the postgres:… DSN and redeploy that one app; no DB change is required. The roles are additive. To remove one entirely, run DROP ROLE <app>; once nothing connects as it. DROP ROLE fails while the role still owns objects or holds direct grants, so reassign or revoke those first.

After all six are migrated

- idle_session_timeout needs no separate step: it is already covered by the per-role GUCs applied in Step 2.
- Consider rotating the postgres superuser password and restricting it to admin use only (it should no longer appear in any app's env).
- Re-check the budget: SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1; No app should exceed its CONNECTION LIMIT, and the total should sit comfortably under 100. This is also when PgBouncer (separate PR) becomes optional rather than necessary.
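The budget re-check can also show each role's usage next to its limit. The following is a sketch under assumptions: conn_headroom is a hypothetical helper that reads psql -At output in the form role|limit|used, and the query in the usage comment only covers roles that have a CONNECTION LIMIT set, so its total is not the whole server's connection count.

```sh
#!/bin/sh
# Read "rolname|rolconnlimit|used" lines on stdin, print usage per role,
# and exit non-zero if any role has reached its limit.
conn_headroom() {
  awk -F'|' '
    { total += $3; printf "%-20s %3d / %3d\n", $1, $3, $2 }
    $3 + 0 >= $2 + 0 { print "  ^ at its CONNECTION LIMIT"; full = 1 }
    END { printf "%-20s %3d\n", "total", total; exit full }
  '
}

# Usage ($DB as set in Step 2):
#   docker exec -i "$DB" psql -U postgres -d tracksolid_db -At -c \
#     "SELECT r.rolname, r.rolconnlimit, count(a.pid)
#      FROM pg_roles r LEFT JOIN pg_stat_activity a ON a.usename = r.rolname
#      WHERE r.rolconnlimit >= 0 GROUP BY 1, 2 ORDER BY 1" | conn_headroom
```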
