fleetanalytics_mcp/scripts/MIGRATE_APPS_OFF_SUPERUSER.md
kiania e1472adc3a infra(db-roles): dedicated non-superuser roles for the six apps on postgres
Six service connections run as the postgres SUPERUSER across two databases on the
shared 100-connection server — the root of the "too many connections" peaks and a
standing least-privilege risk. Superuser sessions ignore per-role CONNECTION LIMIT
and can consume the superuser-reserved slots.

Drafts (apply as postgres; nothing applied here):
- scripts/app_roles_tracksolid_db.sql — webhook_app, ingest_app, worker_app,
  dashboard_app. Capability groups (ts_app_read / ts_app_write), per-app NOSUPERUSER
  login roles with hard CONNECTION LIMIT + bounded GUCs (statement_timeout,
  idle_session_timeout, idle_in_transaction, lock_timeout).
- scripts/app_roles_fleet_platform.sql — gateway_app, cron_app (the apps on the
  separate fleet_platform DB), fp_app_rw group over its schemas.
- scripts/MIGRATE_APPS_OFF_SUPERUSER.md — runbook: discovery (what each app actually
  writes / whether it runs DDL), connection-budget table (sum ≈ 81 < 100), the
  object-ownership step for migration-running apps (reassign app schemas to the
  existing tracksolid_owner — scoped, never REASSIGN OWNED globally), one-at-a-time
  cutover, and instant rollback (DATABASE_URL only).

Grants are best-effort by app function and explicitly call out where to verify before
cutover; all objects are postgres-owned, so row DML works but DDL needs the ownership
step. See the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 23:51:52 +03:00


Migrating the stack apps off the postgres superuser

Why

The Postgres server (timescale_db) has max_connections = 100. Six service connections run as the postgres superuser, each with a persistent pool that sits idle for hours. That's the root of the intermittent FATAL: sorry, too many clients already:

  • superuser sessions can use the superuser_reserved_connections slots, so the server can fill completely with no admin headroom;
  • a per-role CONNECTION LIMIT is not enforced for superusers, and any timeout set on the shared postgres role would also hit admin sessions;
  • and it's a standing least-privilege risk (any of these apps can read/write/DROP anything in any database).

Giving each app a dedicated NOSUPERUSER role with a hard CONNECTION LIMIT fixes all three.

The six connections (confirmed live)

| Service | Database | Current user | New role | Conn limit |
|---|---|---|---|---|
| webhook_receiver | tracksolid_db | postgres | webhook_app | 10 |
| ingest_worker | tracksolid_db | postgres | ingest_app | 10 |
| worker | tracksolid_db | postgres | worker_app (read) | 5 |
| dashboard_api (prod backend) | tracksolid_db | postgres | dashboard_app (or reuse dashboard_ro) | 8 |
| gateway | fleet_platform | postgres | gateway_app | 15 |
| cron | fleet_platform | postgres | cron_app | 5 |

Note that gateway and cron use a different database (fleet_platform) on the same server, so they still count against the shared 100-slot ceiling.

Connection budget (keep the sum < ~95, leaving 3 reserved + admin headroom)

webhook_app 10 + ingest_app 10 + worker_app 5 + dashboard_app 8       = 33  (tracksolid_db)
gateway_app 15 + cron_app 5                                            = 20  (fleet_platform)
analytics_ro ~8 + dashboard_ro ~12 + grafana_ro ~5 + reporting_refresher ~3 = ~28  (existing)
                                                                  TOTAL ≈ 81  ✅

Tune the CONNECTION LIMITs in the SQL to your real pool sizes; the point is the sum is now bounded and visible, not open-ended superuser pools.
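The budget arithmetic above can be re-checked mechanically whenever a limit changes. This is a minimal sketch: the helper names budget_total and budget_check are illustrative and not part of the repo, and the "~" figures for the existing roles are taken at face value.

```shell
#!/bin/sh
# Sum per-role connection limits read from stdin.
# Input lines: "<role> <limit>"; blank lines and '#' comment lines are skipped.
budget_total() {
  awk 'NF >= 2 && $1 !~ /^#/ { sum += $2 } END { print sum + 0 }'
}

# Compare the summed budget against a ceiling.
# $1 = ceiling (the runbook suggests about 95 on a 100-slot server)
budget_check() {
  ceiling=$1
  total=$(budget_total)
  if [ "$total" -lt "$ceiling" ]; then
    echo "OK: $total < $ceiling"
  else
    echo "OVER: $total >= $ceiling"
    return 1
  fi
}

budget_check 95 <<'EOF'
# tracksolid_db
webhook_app 10
ingest_app 10
worker_app 5
dashboard_app 8
# fleet_platform
gateway_app 15
cron_app 5
# existing roles (approximate)
analytics_ro 8
dashboard_ro 12
grafana_ro 5
reporting_refresher 3
EOF
# prints: OK: 81 < 95
```

Keeping the limits in one list like this makes it harder for a later pool-size bump to push the sum past the ceiling unnoticed.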

Step 1 — Discover what each app actually needs (do NOT skip)

The drafted grants are best-effort (ingestion = write telemetry; gateway/cron = RW app state; worker/dashboard = read). Confirm before cutover:

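-- (a) Which tables does each app WRITE? Reset stats (SELECT pg_stat_reset();), run the
--     app for a bit, re-check. Counters are per database and not per role, so exercise
--     one app at a time and run this in both tracksolid_db and fleet_platform.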
SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE n_tup_ins + n_tup_upd + n_tup_del > 0
ORDER BY 1,2;

-- (b) Does the app run DDL/migrations at deploy? Check its code/entrypoint for
--     CREATE/ALTER/DROP or a migrations runner (e.g. run_migrations.py, alembic).
--     If yes → it needs object OWNERSHIP, see Step 3.

Or temporarily set log_statement = 'ddl' (or 'mod') and watch one deploy cycle.
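For check (b), a quick text search of each app's checkout gives a first list of files to read. This is only a sketch: find_ddl_hints is an illustrative helper, the pattern list is an assumption about how DDL might appear in the code, and a match is a hint to inspect, not proof that the app runs DDL at deploy.

```shell
#!/bin/sh
# List files under a source tree that mention DDL or a migrations runner.
# $1 = path to the app's source checkout
find_ddl_hints() {
  grep -rlEi \
    -e '(CREATE|ALTER|DROP)[[:space:]]+(TABLE|INDEX|SCHEMA|VIEW|MATERIALIZED|SEQUENCE|EXTENSION)' \
    -e 'alembic|run_migrations' \
    "$1" 2>/dev/null | sort
}

# Usage (the path is a placeholder for wherever the service's code lives):
#   find_ddl_hints /path/to/webhook_receiver
```

An empty result does not clear the app, since DDL can be assembled dynamically or live in a dependency; the log_statement approach above remains the stronger check.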

Step 2 — Create the roles (no app impact yet)

Generate a password per role (host-only, 0600), then apply the SQL as postgres:

for r in webhook_app ingest_app worker_app dashboard_app gateway_app cron_app; do
  [ -s ~/.$r.pw ] || ( umask 077; openssl rand -hex 24 > ~/.$r.pw )
done
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)

docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
  -v webhook_pw="$(cat ~/.webhook_app.pw)" -v ingest_pw="$(cat ~/.ingest_app.pw)" \
  -v worker_pw="$(cat ~/.worker_app.pw)" -v dash_pw="$(cat ~/.dashboard_app.pw)" \
  < scripts/app_roles_tracksolid_db.sql

docker exec -i "$DB" psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
  -v gateway_pw="$(cat ~/.gateway_app.pw)" -v cron_pw="$(cat ~/.cron_app.pw)" \
  < scripts/app_roles_fleet_platform.sql
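Before applying the SQL it can help to confirm that every password file exists, is non-empty, and is readable by the owner only. The sketch below assumes the ~/.<role>.pw naming used in the loop above; check_pw_files is an illustrative helper, and it reads the mode from ls output rather than stat so that it does not depend on GNU-specific flags.

```shell
#!/bin/sh
# Verify that each role's password file is present, non-empty, and mode 0600.
# $1 = directory holding the .<role>.pw files; remaining args = role names
check_pw_files() {
  dir=$1; shift
  rc=0
  for r in "$@"; do
    f="$dir/.$r.pw"
    if [ ! -s "$f" ]; then
      echo "MISSING $r"; rc=1
    elif [ "$(ls -ld "$f" | cut -c1-10)" != "-rw-------" ]; then
      echo "BADMODE $r"; rc=1
    else
      echo "OK $r"
    fi
  done
  return $rc
}

# Usage:
#   check_pw_files "$HOME" webhook_app ingest_app worker_app dashboard_app gateway_app cron_app
```

A non-zero exit status means at least one file needs regenerating or a chmod 600 before the psql step runs.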

Step 3 — (Only if an app runs migrations) give its role object ownership

All objects are owned by postgres, so a non-superuser role can write rows but not ALTER/DROP existing tables. If discovery showed an app issues DDL, reassign the app schemas to the existing non-superuser owner role and add the app role to it. Scope this to the app schemas — never REASSIGN OWNED BY postgres globally (that would also try to move TimescaleDB/system objects).

-- tracksolid_db: make tracksolid_owner own the app objects, then add the ingestors.
DO $$
DECLARE r record;
BEGIN
  FOR r IN
    SELECT n.nspname, c.relname,
           CASE c.relkind WHEN 'v' THEN 'VIEW' WHEN 'm' THEN 'MATERIALIZED VIEW' ELSE 'TABLE' END AS kind
    FROM pg_class c JOIN pg_namespace n ON n.oid=c.relnamespace
    WHERE n.nspname IN ('tracksolid','reporting') AND c.relkind IN ('r','p','v','m')
  LOOP
    EXECUTE format('ALTER %s %I.%I OWNER TO tracksolid_owner', r.kind, r.nspname, r.relname);
  END LOOP;
END $$;
GRANT CREATE ON SCHEMA tracksolid, reporting TO tracksolid_owner;
GRANT tracksolid_owner TO webhook_app, ingest_app;   -- they inherit ownership rights

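(Do the analogous reassignment in fleet_platform to a fleet_platform_owner role if gateway/cron run migrations. The loop above reassigns every relation in reporting, so if reporting_refresher refreshes reporting.v_trips, exclude that relation from the loop or hand it back to reporting_refresher afterwards.)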

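Test one deploy/migration as the new role before cutting over all apps. Note that objects a migration creates are owned by the app role that created them, not by tracksolid_owner, unless the migration runs SET ROLE tracksolid_owner first.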
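For the fleet_platform reassignment mentioned above, the same DO block can be generated for any owner role and schema list instead of being hand-edited. This is a sketch under assumptions: emit_reassign_sql is an illustrative helper, the schema names in the usage line are placeholders (the runbook does not name the fleet_platform schemas), and the role and schema names must be plain identifiers without quotes.

```shell
#!/bin/sh
# Print a DO block that moves tables, views, and materialized views in the
# given schemas to one owner role. Mirrors the tracksolid_db block above.
# $1 = owner role; remaining args = schema names (plain identifiers only)
emit_reassign_sql() {
  owner=$1; shift
  list=""
  for s in "$@"; do
    list="${list:+$list,}'$s'"
  done
  cat <<EOF
DO \$\$
DECLARE r record;
BEGIN
  FOR r IN
    SELECT n.nspname, c.relname,
           CASE c.relkind WHEN 'v' THEN 'VIEW' WHEN 'm' THEN 'MATERIALIZED VIEW' ELSE 'TABLE' END AS kind
    FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname IN ($list) AND c.relkind IN ('r','p','v','m')
  LOOP
    EXECUTE format('ALTER %s %I.%I OWNER TO %I', r.kind, r.nspname, r.relname, '$owner');
  END LOOP;
END \$\$;
EOF
}

# Usage (schema names are placeholders; review the output before piping it to psql):
#   emit_reassign_sql fleet_platform_owner app_schema_1 app_schema_2
```

Printing the SQL rather than executing it keeps a review step between generation and the database.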

Step 4 — Cut over one app at a time

For each service, change the user and password in its DATABASE_URL from postgres:… to the new role (same host, port, and dbname) and redeploy only that service. Then watch its logs for permission denied errors (fix by widening the group grant) and check the connection count on the DB:

# in the app's env (Coolify secret or compose):
#   tracksolid_db: postgresql://webhook_app:<pw>@timescale_db:5432/tracksolid_db
#   fleet_platform: postgresql://gateway_app:<pw>@timescale_db:5432/fleet_platform
docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

Order: start with the lowest-risk reader (worker/dashboard_api), then the ingestors, then gateway/cron.
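Because the cutover changes only the user and password part of each DSN, the edit can be scripted so that host, port, and dbname are never retyped. A minimal sketch: swap_dsn_user is an illustrative helper, and it assumes the password contains no characters special to sed ('#', '&', or backslash), which holds for the hex passwords generated in Step 2.

```shell
#!/bin/sh
# Replace the user:password part of a postgres DSN, leaving the rest untouched.
# $1 = existing DSN, $2 = new role, $3 = new password
swap_dsn_user() {
  printf '%s\n' "$1" | sed -E "s#^(postgres(ql)?://)[^@/]*@#\\1$2:$3@#"
}

# Usage (OLD_DSN is a placeholder for the value currently in the app's env):
#   swap_dsn_user "$OLD_DSN" webhook_app "$(cat ~/.webhook_app.pw)"
```

The same helper covers rollback: pass postgres and the superuser password to produce the original DSN again.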

Rollback (instant)

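Each app's only change is its DATABASE_URL. If anything misbehaves, set it back to the postgres:… DSN and redeploy that one app; no DB change is required. The roles are additive. To remove one entirely, run DROP ROLE <app>; once nothing connects as it. If Postgres refuses because the role still owns objects or holds direct grants, first run REASSIGN OWNED BY <app> TO <owner role>; and then DROP OWNED BY <app>; in each database.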

After all six are migrated

  • idle_session_timeout needs no further action: it is already set by the per-role GUCs in the role SQL from Step 2.
  • Consider rotating the postgres superuser password and restricting it to admin use only (it should no longer appear in any app's env).
  • Re-check the budget: SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1; — no app should exceed its CONNECTION LIMIT, and the total should sit comfortably under 100. This is also when PgBouncer (separate PR) becomes optional rather than necessary.
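The budget re-check in the last bullet can be automated by comparing the per-role counts with the limits from the table. This is a sketch: check_conn_limits is an illustrative helper that expects "role|count" lines, which is the shape psql -At produces for that query. It flags a role that has reached its limit, since at that point further connections for the role are refused.

```shell
#!/bin/sh
# Compare live per-role connection counts with their CONNECTION LIMITs.
# args: role=limit pairs; stdin: "role|count" lines. Roles without a pair are ignored.
check_conn_limits() {
  limits="$*"
  awk -F'|' -v limits="$limits" '
    BEGIN {
      n = split(limits, pairs, " ")
      for (i = 1; i <= n; i++) { split(pairs[i], kv, "="); lim[kv[1]] = kv[2] }
    }
    ($1 in lim) {
      status = ($2 + 0 >= lim[$1] + 0) ? "AT-LIMIT" : "ok"
      if (status == "AT-LIMIT") bad = 1
      printf "%s %d/%d %s\n", $1, $2, lim[$1], status
    }
    END { exit bad }
  '
}

# Usage, reusing $DB from Step 2:
#   docker exec -i "$DB" psql -U postgres -d tracksolid_db -At -c \
#     "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1;" |
#     check_conn_limits webhook_app=10 ingest_app=10 worker_app=5 \
#                       dashboard_app=8 gateway_app=15 cron_app=5
```

A non-zero exit status means at least one role is sitting at its cap, which suggests either a pool sized above the limit or a limit that needs tuning.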