fleetanalytics_mcp/scripts/MIGRATE_APPS_OFF_SUPERUSER.md

# Migrating the stack apps off the `postgres` superuser

## Why

The Postgres server (`timescale_db`) has `max_connections = 100`. Six service
connections run as the **`postgres` superuser**, each with a persistent pool that
sits idle for hours. That's the root of the intermittent `FATAL: sorry, too many
clients already`:

- superuser sessions can use the **`superuser_reserved_connections`** slots, so the
  server can fill completely with no admin headroom;
- a per-role **`CONNECTION LIMIT`** is not enforced for superusers, and timeouts set
  on the shared `postgres` role would also apply to admin sessions;
- and it's a standing least-privilege risk (any of these apps can read/write/DROP
  anything in any database).

Giving each app a dedicated **NOSUPERUSER** role with a hard `CONNECTION LIMIT` fixes
all three.

## The six connections (confirmed live)

| Service | Database | Current user | New role | Conn limit |
|---|---|---|---|---|
| `webhook_receiver` | tracksolid_db | postgres | `webhook_app` | 10 |
| `ingest_worker` | tracksolid_db | postgres | `ingest_app` | 10 |
| `worker` | tracksolid_db | postgres | `worker_app` (read) | 5 |
| `dashboard_api` (prod backend) | tracksolid_db | postgres | `dashboard_app` (or reuse `dashboard_ro`) | 8 |
| `gateway` | **fleet_platform** | postgres | `gateway_app` | 15 |
| `cron` | **fleet_platform** | postgres | `cron_app` | 5 |

> Note `gateway`/`cron` use a **different database** (`fleet_platform`) on the same
> server — they still count against the shared 100-slot ceiling.

### Connection budget (keep the sum < ~95, leaving 3 reserved + admin headroom)

```
webhook_app 10 + ingest_app 10 + worker_app 5 + dashboard_app 8       = 33  (tracksolid_db)
gateway_app 15 + cron_app 5                                            = 20  (fleet_platform)
analytics_ro ~8 + dashboard_ro ~12 + grafana_ro ~5 + reporting_refresher ~3 = ~28  (existing)
                                                                  TOTAL ≈ 81  ✅
```
Tune the `CONNECTION LIMIT`s in the SQL to your real pool sizes; the point is the sum
is now **bounded and visible**, not open-ended superuser pools.
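
The budget arithmetic can be re-checked whenever a limit changes. The following is a minimal sketch, not part of the repo scripts: `budget_total` and `CEILING` are illustrative names, the `~` estimates for the existing roles are taken at face value, and `95` mirrors the target in the heading rather than any server setting.

```bash
#!/usr/bin/env bash
# Sketch: sum per-role connection limits and compare with the target ceiling.
# budget_total and CEILING are hypothetical names used only for illustration.
CEILING=95

# Reads "role limit" pairs on stdin and prints the sum of the limits.
budget_total() {
  awk '{ total += $2 } END { print total + 0 }'
}

# Limits copied from the table and budget block in this runbook.
BUDGET='webhook_app 10
ingest_app 10
worker_app 5
dashboard_app 8
gateway_app 15
cron_app 5
analytics_ro 8
dashboard_ro 12
grafana_ro 5
reporting_refresher 3'

total=$(printf '%s\n' "$BUDGET" | budget_total)
if [ "$total" -lt "$CEILING" ]; then
  echo "budget OK: $total < $CEILING"
else
  echo "budget too high: $total >= $CEILING"
fi
```

With the numbers above the sum is 81, so the check passes; raising any single limit by 14 or more would trip it.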

## Step 1 — Discover what each app actually needs (do NOT skip)

The drafted grants are best-effort (ingestion = write telemetry; gateway/cron = RW
app state; worker/dashboard = read). Confirm before cutover:

```sql
-- (a) Which tables does each app WRITE? Reset stats with SELECT pg_stat_reset();
--     run ONE app for a bit (counters are per table, not per role), then re-check:
SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE n_tup_ins + n_tup_upd + n_tup_del > 0
ORDER BY 1,2;

-- (b) Does the app run DDL/migrations at deploy? Check its code/entrypoint for
--     CREATE/ALTER/DROP or a migrations runner (e.g. run_migrations.py, alembic).
--     If yes → it needs object OWNERSHIP, see Step 3.
```
Or temporarily set `log_statement = 'ddl'` (or `'mod'`) and watch one deploy cycle.

## Step 2 — Create the roles (no app impact yet)

Generate a password per role (host-only, 0600), then apply the SQL as postgres:

```bash
for r in webhook_app ingest_app worker_app dashboard_app gateway_app cron_app; do
  [ -s ~/.$r.pw ] || ( umask 077; openssl rand -hex 24 > ~/.$r.pw )
done
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)

docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
  -v webhook_pw="$(cat ~/.webhook_app.pw)" -v ingest_pw="$(cat ~/.ingest_app.pw)" \
  -v worker_pw="$(cat ~/.worker_app.pw)" -v dash_pw="$(cat ~/.dashboard_app.pw)" \
  < scripts/app_roles_tracksolid_db.sql

docker exec -i "$DB" psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
  -v gateway_pw="$(cat ~/.gateway_app.pw)" -v cron_pw="$(cat ~/.cron_app.pw)" \
  < scripts/app_roles_fleet_platform.sql
```
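
The generation loop above skips any password file that already exists, so a stale file with loose permissions would go unnoticed. A small sketch of a pre-flight check follows; `check_pw_files` and the `PW_DIR` override are hypothetical helpers introduced here for illustration, and the file naming (`~/.<role>.pw`) is taken from the loop above.

```bash
#!/usr/bin/env bash
# Sketch: confirm each role's password file exists, is non-empty, and is mode 0600.
# check_pw_files and PW_DIR are illustrative; PW_DIR defaults to $HOME as in Step 2.
check_pw_files() {
  local dir=${PW_DIR:-$HOME} rc=0 r f
  for r in "$@"; do
    f="$dir/.$r.pw"
    if [ ! -s "$f" ]; then
      echo "MISSING $f"; rc=1
    elif [ -z "$(find "$f" -prune -type f -perm 600)" ]; then
      # find prints the path only when the mode is exactly 0600
      echo "BAD_MODE $f"; rc=1
    else
      echo "ok $f"
    fi
  done
  return "$rc"
}
```

A possible invocation before applying the SQL: `check_pw_files webhook_app ingest_app worker_app dashboard_app gateway_app cron_app`. The function returns non-zero if any file is missing or has the wrong mode.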

## Step 3 — (Only if an app runs migrations) give its role object ownership

All objects are owned by `postgres`, so a non-superuser role can write **rows** but
not `ALTER`/`DROP` existing tables. If discovery showed an app issues DDL, reassign
the objects in the **app schemas** to the existing non-superuser owner role and add
the app role to it. **Scope this to the app schemas; never `REASSIGN OWNED BY postgres`
globally** (that would also try to move TimescaleDB/system objects).

```sql
-- tracksolid_db: make tracksolid_owner own the app objects, then add the ingestors.
DO $$
DECLARE r record;
BEGIN
  FOR r IN
    SELECT n.nspname, c.relname,
           CASE c.relkind WHEN 'v' THEN 'VIEW' WHEN 'm' THEN 'MATERIALIZED VIEW' ELSE 'TABLE' END AS kind
    FROM pg_class c JOIN pg_namespace n ON n.oid=c.relnamespace
    WHERE n.nspname IN ('tracksolid','reporting') AND c.relkind IN ('r','p','v','m')
  LOOP
    EXECUTE format('ALTER %s %I.%I OWNER TO tracksolid_owner', r.kind, r.nspname, r.relname);
  END LOOP;
END $$;
GRANT CREATE ON SCHEMA tracksolid, reporting TO tracksolid_owner;
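GRANT tracksolid_owner TO webhook_app, ingest_app;   -- they inherit ownership rights
-- Note: objects a member creates are owned by that member, not by tracksolid_owner,
-- unless the migration runs SET ROLE tracksolid_owner first.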
```
(Do the analogous reassignment in `fleet_platform` to a `fleet_platform_owner` role
if `gateway`/`cron` run migrations. The loop above also moves `reporting.v_trips`; if
`reporting_refresher` refreshes it, exclude it in the `WHERE` clause or set its owner
back to `reporting_refresher` afterward.)

Test one deploy/migration as the new role **before** cutting over all apps.

## Step 4 — Cut over one app at a time

For each service, change its `DATABASE_URL` user/password from `postgres:…` to the new
role (same host/port/dbname), redeploy **just that one**, and watch its logs for
`permission denied` (→ widen the group grant) and the DB for connection count:

```bash
# in the app's env (Coolify secret or compose):
#   tracksolid_db: postgresql://webhook_app:<pw>@timescale_db:5432/tracksolid_db
#   fleet_platform: postgresql://gateway_app:<pw>@timescale_db:5432/fleet_platform
docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
```
Order: start with the **lowest-risk reader** (`worker`/`dashboard_api`), then the
ingestors, then `gateway`/`cron`.
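
For the cutover itself, the new `DATABASE_URL` can be assembled from the password files created in Step 2. The sketch below assumes a hypothetical `build_dsn` helper and a `PW_DIR` override (defaulting to `$HOME`); the host, port, and database names are the ones shown in the comments above, and how the value reaches the app (Coolify secret or compose) is left to the deployment.

```bash
#!/usr/bin/env bash
# Sketch: assemble a DATABASE_URL for one role from its password file.
# build_dsn and PW_DIR are illustrative names, not part of the repo scripts.
build_dsn() {
  local role=$1 db=$2 host=${3:-timescale_db} port=${4:-5432}
  local pw_file="${PW_DIR:-$HOME}/.${role}.pw"
  [ -s "$pw_file" ] || { echo "missing password file: $pw_file" >&2; return 1; }
  # Passwords from `openssl rand -hex` contain only [0-9a-f],
  # so no URL-encoding is needed here.
  printf 'postgresql://%s:%s@%s:%s/%s\n' \
    "$role" "$(cat "$pw_file")" "$host" "$port" "$db"
}
```

For example, `build_dsn webhook_app tracksolid_db` or `build_dsn gateway_app fleet_platform`. The output contains the password, so it is better piped straight into the secret store than echoed into a shared terminal or log.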

## Rollback (instant)

Each app's only change is its `DATABASE_URL`. If anything misbehaves, set it back to
the `postgres:…` DSN and redeploy that one app; no DB change is required. The roles are
additive; to remove one entirely: `DROP ROLE <app>;` (once nothing connects as it and
it owns no objects and holds no direct grants, otherwise the drop is refused).

## After all six are migrated

- `idle_session_timeout` is already covered by the per-role GUCs set in the role SQL
  scripts, so nothing further is needed there.
- Consider **rotating the `postgres` superuser password** and restricting it to admin
  use only (it should no longer appear in any app's env).
- Re-check the budget: `SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1;`
  — no app should exceed its `CONNECTION LIMIT`, and the total should sit comfortably
  under 100. This is also when PgBouncer (separate PR) becomes optional rather than
  necessary.
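
The budget re-check can also compare live session counts against each role's limit. The following is a sketch under assumptions: `limit_report` is a hypothetical helper, and the commented query (joining `pg_stat_activity` to `pg_roles.rolconnlimit`, where `-1` means no limit) is one way to produce its input, not something the repo scripts provide.

```bash
#!/usr/bin/env bash
# Sketch: flag roles that are at their CONNECTION LIMIT or have no limit at all.
# limit_report is an illustrative name. Input: "role count limit" lines on stdin.
limit_report() {
  awk '{
    status = ($3 < 0) ? "UNBOUNDED" : ($2 >= $3) ? "AT_LIMIT" : "ok"
    print $1, $2 "/" $3, status
  }'
}

# Possible live usage (assumes $DB from Step 2; not run here):
#   docker exec -i "$DB" psql -U postgres -d tracksolid_db -At -F ' ' -c \
#     "SELECT r.rolname, count(a.pid), r.rolconnlimit
#        FROM pg_roles r LEFT JOIN pg_stat_activity a ON a.usename = r.rolname
#       WHERE r.rolcanlogin GROUP BY 1, 3 ORDER BY 1;" | limit_report

# Offline demonstration with made-up counts:
printf 'webhook_app 4 10\ncron_app 5 5\npostgres 2 -1\n' | limit_report
```

An app role reported as `UNBOUNDED` would suggest its `CONNECTION LIMIT` was never applied; `AT_LIMIT` suggests the pool is saturated and the limit or pool size may need tuning.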

infra(db-roles): dedicated non-superuser roles for the six apps on postgres

Six service connections run as the postgres SUPERUSER across two databases on the shared 100-connection server: the root of the "too many connections" peaks and a standing least-privilege risk. Superuser sessions ignore per-role CONNECTION LIMIT and can consume the superuser-reserved slots.

Drafts (apply as postgres; nothing applied here):

- scripts/app_roles_tracksolid_db.sql: webhook_app, ingest_app, worker_app, dashboard_app. Capability groups (ts_app_read / ts_app_write), per-app NOSUPERUSER login roles with hard CONNECTION LIMIT + bounded GUCs (statement_timeout, idle_session_timeout, idle_in_transaction, lock_timeout).
- scripts/app_roles_fleet_platform.sql: gateway_app, cron_app (the apps on the separate fleet_platform DB), fp_app_rw group over its schemas.
- scripts/MIGRATE_APPS_OFF_SUPERUSER.md: runbook covering discovery (what each app actually writes / whether it runs DDL), connection-budget table (sum ≈ 81 < 100), the object-ownership step for migration-running apps (reassign app schemas to the existing tracksolid_owner; scoped, never REASSIGN OWNED globally), one-at-a-time cutover, and instant rollback (DATABASE_URL only).

Grants are best-effort by app function and explicitly call out where to verify before cutover; all objects are postgres-owned, so row DML works but DDL needs the ownership step. See the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 20:51:52 +00:00
