diff --git a/scripts/MIGRATE_APPS_OFF_SUPERUSER.md b/scripts/MIGRATE_APPS_OFF_SUPERUSER.md
new file mode 100644
index 0000000..001bd05
--- /dev/null
+++ b/scripts/MIGRATE_APPS_OFF_SUPERUSER.md
@@ -0,0 +1,128 @@
+# Migrating the stack apps off the `postgres` superuser
+
+## Why
+
+The Postgres server (`timescale_db`) has `max_connections = 100`. Six service
+connections run as the **`postgres` superuser**, each with a persistent pool that
+sits idle for hours. That's the root of the intermittent `FATAL: sorry, too many
+clients already`:
+
+- superuser sessions can use the **`superuser_reserved_connections`** slots, so the
+  server can fill completely with no admin headroom;
+- you can't put a per-role **`CONNECTION LIMIT`** or enforce timeouts on them
+  effectively;
+- and it's a standing least-privilege risk (any of these apps can read/write/DROP
+  anything in any database).
+
+Giving each app a dedicated **NOSUPERUSER** role with a hard `CONNECTION LIMIT` fixes
+all three.
+
+## The six connections (confirmed live 2026-06-20)
+
+| Service | Database | Current user | New role | Conn limit | Notes |
+|---|---|---|---|---|---|
+| `webhook_receiver` | tracksolid_db | postgres | **`tracksolid_owner`** | 30 (shared) | runs migrations |
+| `ingest_worker` | tracksolid_db | postgres | **`tracksolid_owner`** | (shared) | runs migrations |
+| `worker` | tracksolid_db | postgres | **`tracksolid_owner`** | (shared) | = ingest_worker image; runs migrations |
+| `dashboard_api` (prod backend) | tracksolid_db | postgres | `dashboard_app` (read) | 8 | reader |
+| `gateway` | **fleet_platform** | postgres | `gateway_app` | 15 | migration TBD |
+| `cron` | **fleet_platform** | postgres | `cron_app` | 5 | migration TBD |
+
+> **Migrators share `tracksolid_owner`.** `webhook_receiver`, `ingest_worker`, and
+> `worker` all run `run_migrations.py` (DDL) and write telemetry. Because they ALTER
+> objects, they must OWN them — so they connect as the single non-superuser
+> `tracksolid_owner` (the role the repo already intends to own these schemas). One
+> shared role = correct ownership, no app code change, one bounded connection cap.
+> `gateway`/`cron` use a **different database** (`fleet_platform`) on the same server —
+> still counted against the 100-slot ceiling; confirm whether they migrate before
+> cutover (apply the same owner pattern if so).
+
+### Connection budget (keep the sum < ~95, leaving 3 reserved + admin headroom)
+
+```
+tracksolid_owner 30 (shared by 3 migrators) + dashboard_app 8 = 38 (tracksolid_db)
+gateway_app 15 + cron_app 5 = 20 (fleet_platform)
+analytics_ro ~8 + dashboard_ro ~12 + grafana_ro ~5 + reporting_refresher ~3 = ~28 (existing)
+                                                                 TOTAL ≈ 86 ✅
+```
+Tune the `CONNECTION LIMIT`s to your real pool sizes; the point is the sum is now
+**bounded and visible**, not open-ended superuser pools.
+
+## Step 1 — Discovery (DONE 2026-06-20)
+
+Confirmed live: `webhook_receiver`, `ingest_worker`, `worker` all start with
+`python run_migrations.py && …` → they run **DDL** and write telemetry (`worker` is
+the same image as `ingest_worker`). Writes span `tracksolid`, `reporting`, `tickets`.
+`dashboard_api` (prod backend) reads. `gateway`/`cron` are on `fleet_platform` and
+write `state`; their migration behaviour is **not yet confirmed** (opaque
+`entrypoint.sh`) — verify before cutover with:
+
+```sql
+-- re-run after a deploy to see writes; or set log_statement='ddl' on fleet_platform.
+SELECT schemaname, sum(n_tup_ins+n_tup_upd+n_tup_del) FROM pg_stat_user_tables GROUP BY 1;
+```
+
+## Step 2 — Create roles + reassign ownership (no app impact yet)
+
+The ownership reassignment in `app_roles_tracksolid_db.sql` is **safe to run while the
+apps still connect as `postgres`** — superuser bypasses ownership, so nothing breaks
+until you flip a `DATABASE_URL`. It is Timescale-aware (skips linked sequences, uses
+`ALTER MATERIALIZED VIEW` for continuous aggregates, leaves `reporting.v_trips` with
+`reporting_refresher`) and idempotent — validated in a rolled-back transaction against
+the live DB.
+
+```bash
+for r in tracksolid_owner dashboard_app gateway_app cron_app; do
+  [ -s ~/.$r.pw ] || ( umask 077; openssl rand -hex 24 > ~/.$r.pw )
+done
+DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)
+
+# tracksolid_db: owner/migrator role + ownership reassignment + dashboard reader
+docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
+  -v owner_pw="$(cat ~/.tracksolid_owner.pw)" -v dash_pw="$(cat ~/.dashboard_app.pw)" \
+  < scripts/app_roles_tracksolid_db.sql
+
+# fleet_platform: gateway/cron roles (see that file's notes re: migrations)
+docker exec -i "$DB" psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
+  -v gateway_pw="$(cat ~/.gateway_app.pw)" -v cron_pw="$(cat ~/.cron_app.pw)" \
+  < scripts/app_roles_fleet_platform.sql
+```
+
+> If `gateway`/`cron` run migrations, they need the same owner treatment on
+> `fleet_platform` (reassign its schemas to a `fleet_platform_owner` login role) — do
+> that before cutting them over. Until confirmed, leave them on `postgres`.
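
Reviewer sketch (not part of the patch): the connection budget in the runbook can be sanity-checked with a few lines of Python. The limits below are copied from the budget block; the `~` values for the existing roles are treated as exact, which is an assumption, and `superuser_reserved_connections` is assumed to be at the PostgreSQL default of 3. The names `ROLE_LIMITS` and `budget_headroom` are hypothetical and exist only for this illustration.

```python
# Hypothetical helper for checking the runbook's connection budget.
MAX_CONNECTIONS = 100      # from the runbook
SUPERUSER_RESERVED = 3     # assumed: PostgreSQL default, unchanged on this server

# Per-role CONNECTION LIMITs from the budget block.
# The four *_ro / refresher values are approximate ("~") in the runbook.
ROLE_LIMITS = {
    "tracksolid_owner": 30,
    "dashboard_app": 8,
    "gateway_app": 15,
    "cron_app": 5,
    "analytics_ro": 8,
    "dashboard_ro": 12,
    "grafana_ro": 5,
    "reporting_refresher": 3,
}


def budget_headroom(limits, max_connections=MAX_CONNECTIONS,
                    reserved=SUPERUSER_RESERVED):
    """Slots left for ad hoc/admin sessions if every role hits its limit."""
    return max_connections - reserved - sum(limits.values())


if __name__ == "__main__":
    print("sum of limits:", sum(ROLE_LIMITS.values()))
    print("headroom:", budget_headroom(ROLE_LIMITS))
```

A negative headroom would mean the per-role limits alone can exhaust the non-reserved slots, so a limit should be lowered before cutover.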
+
+## Step 3 — Cut over one app at a time
+
+Change each service's `DATABASE_URL` user/password (same host/port/dbname), redeploy
+**just that one**, watch its logs for `permission denied` and the DB for the count:
+
+```
+# the three migrators → the shared owner role:
+postgresql://tracksolid_owner:@timescale_db:5432/tracksolid_db
+# the dashboard backend → the reader:
+postgresql://dashboard_app:@timescale_db:5432/tracksolid_db
+```
+```bash
+docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
+  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
+```
+**Order:** `dashboard_api` (reader, lowest risk) first → confirm → then the migrators
+one at a time (`ingest_worker`, then `worker`, then `webhook_receiver`), watching that
+`run_migrations.py` succeeds and ingestion resumes after each.
+
+## Rollback (instant)
+
+Each app's only change is its `DATABASE_URL`. If anything misbehaves, set it back to
+the `postgres:…` DSN and redeploy that one app — no DB change required. The roles are
+additive; to remove one entirely: `DROP ROLE ;` (after nothing uses it).
+
+## After all six are migrated
+
+- `idle_session_timeout` is already covered by the per-role GUCs set in the role scripts.
+- Consider **rotating the `postgres` superuser password** and restricting it to admin
+  use only (it should no longer appear in any app's env).
+- Re-check the budget: `SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1;`
+  — no app should exceed its `CONNECTION LIMIT`, and the total should sit comfortably
+  under 100. This is also when PgBouncer (separate PR) becomes optional rather than
+  necessary.
diff --git a/scripts/app_roles_fleet_platform.sql b/scripts/app_roles_fleet_platform.sql
new file mode 100644
index 0000000..b156a16
--- /dev/null
+++ b/scripts/app_roles_fleet_platform.sql
@@ -0,0 +1,71 @@
+-- app_roles_fleet_platform.sql — dedicated NON-SUPERUSER login roles for the apps
+-- that connect to the fleet_platform database as the `postgres` SUPERUSER.
+-- ─────────────────────────────────────────────────────────────────────────────
+-- Sibling of app_roles_tracksolid_db.sql, for the OTHER database on the same server.
+-- gateway + cron (the fleet_platform Coolify app) connect here as postgres. Same
+-- rationale: least privilege + a hard per-role CONNECTION LIMIT so they can't
+-- exhaust the server-wide 100-connection ceiling.
+--
+-- Schemas in fleet_platform: auth, domain, events, geo, ops, serve, slo, state
+-- (all owned by postgres). gateway (the API) and cron (scheduled jobs) almost
+-- certainly READ+WRITE app state across these, so they get DML; widen/narrow per
+-- the discovery step in MIGRATE_APPS_OFF_SUPERUSER.md. Unlike the sibling file,
+-- this does NOT change object ownership, so it does not grant DDL on existing
+-- (postgres-owned) objects — see Step 2 of the runbook if these apps run migrations.
+--
+-- Run as the postgres SUPERUSER, on the fleet_platform database:
+--   docker exec -i psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
+--     -v gateway_pw="$(cat ~/.gateway_app.pw)" \
+--     -v cron_pw="$(cat ~/.cron_app.pw)" \
+--     < scripts/app_roles_fleet_platform.sql
+
+\set ON_ERROR_STOP on
+
+-- ── 1. Capability group (read + write across the app schemas) ───────────────────
+DO $$ BEGIN
+  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='fp_app_rw') THEN CREATE ROLE fp_app_rw NOLOGIN; END IF;
+END $$;
+
+DO $grants$
+DECLARE s text;
+BEGIN
+  FOREACH s IN ARRAY ARRAY['auth','domain','events','geo','ops','serve','slo','state'] LOOP
+    EXECUTE format('GRANT USAGE ON SCHEMA %I TO fp_app_rw', s);
+    EXECUTE format('GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA %I TO fp_app_rw', s);
+    EXECUTE format('GRANT USAGE, SELECT, UPDATE ON ALL SEQUENCES IN SCHEMA %I TO fp_app_rw', s);
+    EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO fp_app_rw', s);
+    EXECUTE format('ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA %I GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO fp_app_rw', s);
+    EXECUTE format('ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA %I GRANT USAGE, SELECT, UPDATE ON SEQUENCES TO fp_app_rw', s);
+    EXECUTE format('ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA %I GRANT EXECUTE ON FUNCTIONS TO fp_app_rw', s);
+  END LOOP;
+END $grants$;
+
+-- ── 2. Per-app LOGIN roles ──────────────────────────────────────────────────────
+-- gateway — the request-facing API (latency-sensitive: short statement_timeout).
+DO $$ BEGIN
+  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='gateway_app') THEN
+    CREATE ROLE gateway_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
+  END IF; END $$;
+ALTER ROLE gateway_app WITH LOGIN PASSWORD :'gateway_pw' CONNECTION LIMIT 15;
+GRANT CONNECT ON DATABASE fleet_platform TO gateway_app;
+GRANT fp_app_rw TO gateway_app;
+ALTER ROLE gateway_app SET statement_timeout = '15s';
+ALTER ROLE gateway_app SET idle_in_transaction_session_timeout = '30s';
+ALTER ROLE gateway_app SET idle_session_timeout = '5min';
+ALTER ROLE gateway_app SET lock_timeout = '3s';
+
+-- cron — scheduled/background jobs (longer queries tolerated).
+DO $$ BEGIN
+  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='cron_app') THEN
+    CREATE ROLE cron_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
+  END IF; END $$;
+ALTER ROLE cron_app WITH LOGIN PASSWORD :'cron_pw' CONNECTION LIMIT 5;
+GRANT CONNECT ON DATABASE fleet_platform TO cron_app;
+GRANT fp_app_rw TO cron_app;
+ALTER ROLE cron_app SET statement_timeout = '120s';
+ALTER ROLE cron_app SET idle_in_transaction_session_timeout = '120s';
+ALTER ROLE cron_app SET idle_session_timeout = '10min';
+ALTER ROLE cron_app SET lock_timeout = '5s';
+
+-- ── 3. Verify ───────────────────────────────────────────────────────────────────
+--   \du+
diff --git a/scripts/app_roles_tracksolid_db.sql b/scripts/app_roles_tracksolid_db.sql
new file mode 100644
index 0000000..b1fb426
--- /dev/null
+++ b/scripts/app_roles_tracksolid_db.sql
@@ -0,0 +1,95 @@
+-- app_roles_tracksolid_db.sql — get the tracksolid_db apps off the postgres SUPERUSER.
+-- ─────────────────────────────────────────────────────────────────────────────
+-- DESIGN (validated against the live DB, 2026-06-20):
+--   * webhook_receiver, ingest_worker, worker each run `run_migrations.py` (DDL) and
+--     write telemetry. `worker` is a second copy of the ingest_worker image. Because
+--     they run migrations, they need to OWN the objects they ALTER. They therefore
+--     connect as the shared, NON-SUPERUSER **tracksolid_owner** (the role the repo
+--     already intends to own these schemas — see analytics_ro_role.sql default privs).
+--   * the prod dashboard_api backend only reads → its own read role `dashboard_app`
+--     (or reuse the existing dashboard_ro).
+--
+-- This file is idempotent. Section 2 (ownership reassignment) is Timescale-aware:
+-- it skips table-linked sequences, uses ALTER MATERIALIZED VIEW for continuous
+-- aggregates, and leaves reporting.v_trips with reporting_refresher. Reassigning
+-- while the apps still run as postgres is SAFE — superuser bypasses ownership, so
+-- nothing breaks until you flip each app's DATABASE_URL (see the runbook).
+--
+-- Run as the postgres SUPERUSER, on tracksolid_db:
+--   docker exec -i psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
+--     -v owner_pw="$(cat ~/.tracksolid_owner.pw)" \
+--     -v dash_pw="$(cat ~/.dashboard_app.pw)" \
+--     < scripts/app_roles_tracksolid_db.sql
+
+\set ON_ERROR_STOP on
+
+-- ── 1. tracksolid_owner: the shared owner/migrator login for the ingestion apps ──
+DO $$ BEGIN
+  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='tracksolid_owner') THEN
+    CREATE ROLE tracksolid_owner LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
+  END IF; END $$;
+-- LOGIN + password + a HARD connection cap (the real budget control). No
+-- statement_timeout: migrations (e.g. CREATE INDEX on a hypertable) can run long.
+ALTER ROLE tracksolid_owner WITH LOGIN PASSWORD :'owner_pw' CONNECTION LIMIT 30;
+ALTER ROLE tracksolid_owner SET idle_in_transaction_session_timeout = '5min';
+ALTER ROLE tracksolid_owner SET idle_session_timeout = '10min';
+ALTER ROLE tracksolid_owner SET lock_timeout = '10s';
+GRANT CONNECT ON DATABASE tracksolid_db TO tracksolid_owner;
+GRANT USAGE, CREATE ON SCHEMA tracksolid, reporting, tickets, fuel TO tracksolid_owner;
+
+-- ── 2. Reassign the app objects to tracksolid_owner (Timescale-aware, idempotent) ─
+DO $reassign$
+DECLARE r record; k text;
+BEGIN
+  FOR r IN
+    SELECT n.nspname, c.relname, c.relkind,
+           EXISTS (SELECT 1 FROM timescaledb_information.continuous_aggregates ca
+                   WHERE ca.view_schema=n.nspname AND ca.view_name=c.relname) AS is_cagg
+    FROM pg_class c JOIN pg_namespace n ON n.oid=c.relnamespace
+    WHERE n.nspname IN ('tracksolid','reporting','tickets','fuel')
+      AND c.relkind IN ('r','p','v','m','S')
+      AND pg_get_userbyid(c.relowner) <> 'tracksolid_owner' -- idempotent
+      AND NOT (n.nspname='reporting' AND c.relname='v_trips') -- keep with refresher
+      AND NOT (c.relkind='S' AND EXISTS ( -- skip linked seqs
+            SELECT 1 FROM pg_depend d WHERE d.objid=c.oid AND d.deptype IN ('a','i')))
+  LOOP
+    k := CASE WHEN r.is_cagg OR r.relkind='m' THEN 'MATERIALIZED VIEW'
+              WHEN r.relkind='v' THEN 'VIEW' WHEN r.relkind='S' THEN 'SEQUENCE' ELSE 'TABLE' END;
+    EXECUTE format('ALTER %s %I.%I OWNER TO tracksolid_owner', k, r.nspname, r.relname);
+  END LOOP;
+END $reassign$;
+
+DO $fns$
+DECLARE r record;
+BEGIN
+  FOR r IN SELECT p.oid::regprocedure AS sig
+           FROM pg_proc p JOIN pg_namespace n ON n.oid=p.pronamespace
+           WHERE n.nspname IN ('tracksolid','reporting','tickets','fuel')
+             AND pg_get_userbyid(p.proowner) <> 'tracksolid_owner'
+  LOOP EXECUTE format('ALTER FUNCTION %s OWNER TO tracksolid_owner', r.sig); END LOOP;
+END $fns$;
+
+-- ── 3. dashboard_app: read-only role for the prod dashboard_api backend ──────────
+-- (If that backend turns out to also WRITE app state, widen via a write group like
+-- the fleet_platform file; start read-only.)
+DO $$ BEGIN
+  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='dashboard_app') THEN
+    CREATE ROLE dashboard_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
+  END IF; END $$;
+ALTER ROLE dashboard_app WITH LOGIN PASSWORD :'dash_pw' CONNECTION LIMIT 8;
+GRANT CONNECT ON DATABASE tracksolid_db TO dashboard_app;
+GRANT USAGE ON SCHEMA tracksolid, reporting, tickets, fuel TO dashboard_app;
+GRANT SELECT ON ALL TABLES IN SCHEMA tracksolid, reporting, tickets, fuel TO dashboard_app;
+GRANT SELECT ON reporting.v_trips TO dashboard_app;
+GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA reporting TO dashboard_app;
+ALTER DEFAULT PRIVILEGES FOR ROLE tracksolid_owner IN SCHEMA tracksolid, reporting, tickets, fuel
+  GRANT SELECT ON TABLES TO dashboard_app; -- future objects (now owned by tracksolid_owner)
+ALTER ROLE dashboard_app SET statement_timeout = '30s';
+ALTER ROLE dashboard_app SET idle_in_transaction_session_timeout = '60s';
+ALTER ROLE dashboard_app SET idle_session_timeout = '5min';
+ALTER ROLE dashboard_app SET lock_timeout = '5s';
+
+-- ── 4. Verify ────────────────────────────────────────────────────────────────────
+-- \du+ tracksolid_owner -- LOGIN + CONNECTION LIMIT 30
+-- SELECT pg_get_userbyid(relowner), count(*) FROM pg_class
+-- WHERE relnamespace IN (SELECT oid FROM pg_namespace WHERE nspname='tracksolid') GROUP BY 1;
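
Reviewer sketch (not part of the patch): the cutover in Step 3 of the runbook changes only the user and password inside each `DATABASE_URL`. If the DSNs are edited by script rather than by hand, something like the following could do the swap while leaving host, port and database untouched. The function name `swap_credentials` and the sample password are hypothetical; it uses only the standard-library `urllib.parse` module and assumes the DSN is in URL form, as in the runbook.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def swap_credentials(dsn: str, user: str, password: str) -> str:
    """Return `dsn` with only the user and password replaced."""
    parts = urlsplit(dsn)
    host = parts.hostname or ""
    # urlsplit strips the brackets from IPv6 literals; put them back.
    if ":" in host:
        host = f"[{host}]"
    port = f":{parts.port}" if parts.port is not None else ""
    # Percent-encode so characters such as '@' or '/' in a password stay valid.
    netloc = f"{quote(user, safe='')}:{quote(password, safe='')}@{host}{port}"
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    old = "postgresql://postgres:old-secret@timescale_db:5432/tracksolid_db"
    # "example-pw" is a placeholder, not a real credential.
    print(swap_credentials(old, "dashboard_app", "example-pw"))
```

Rollback is the same call with the original `postgres` credentials, which matches the runbook's point that the DSN is the only thing that changes per app.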