fleetanalytics_mcp/scripts/MIGRATE_APPS_OFF_SUPERUSER.md
kiania e571eeabed infra(db-roles): validated Option A — shared tracksolid_owner for migrators
Discovery (live) corrected the design: webhook_receiver, ingest_worker, and worker
all run run_migrations.py (DDL) and write telemetry — worker is the same image as
ingest_worker, not a reader. Because they ALTER objects they must own them, so all
three connect as the shared non-superuser tracksolid_owner (the role the repo already
intends to own these schemas). dashboard_api backend stays a reader (dashboard_app).

- app_roles_tracksolid_db.sql rewritten: tracksolid_owner LOGIN + CONNECTION LIMIT 30
  + GUCs + USAGE/CREATE; Timescale-aware ownership reassignment (skips table-linked
  sequences, ALTER MATERIALIZED VIEW for continuous aggregates, leaves reporting.v_trips
  with reporting_refresher, reassigns functions); dashboard_app read role.
- Reassignment validated in a rolled-back transaction on the live DB: reassigns the
  31-chunk position_history hypertable + the v_mileage_daily_cagg continuous aggregate,
  and as tracksolid_owner can ALTER the hypertable and create/drop tables.
- Runbook updated: discovery marked done, ownership folded into the apply (safe while
  apps still run as postgres — superuser bypasses ownership), corrected cutover order.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 00:08:52 +03:00

6.4 KiB

Migrating the stack apps off the postgres superuser

Why

The Postgres server (timescale_db) has max_connections = 100. Six service connections run as the postgres superuser, each with a persistent pool that sits idle for hours. That's the root of the intermittent FATAL: sorry, too many clients already:

  • superuser sessions can use the superuser_reserved_connections slots, so the server can fill completely with no admin headroom;
  • you can't put a per-role CONNECTION LIMIT or enforce timeouts on them effectively;
  • and it's a standing least-privilege risk (any of these apps can read/write/DROP anything in any database).

Giving each app a dedicated NOSUPERUSER role with a hard CONNECTION LIMIT fixes all three.

The six connections (confirmed live 2026-06-20)

Service Database Current user New role Conn limit Notes
webhook_receiver tracksolid_db postgres tracksolid_owner 30 (shared) runs migrations
ingest_worker tracksolid_db postgres tracksolid_owner (shared) runs migrations
worker tracksolid_db postgres tracksolid_owner (shared) = ingest_worker image; runs migrations
dashboard_api (prod backend) tracksolid_db postgres dashboard_app (read) 8 reader
gateway fleet_platform postgres gateway_app 15 migration TBD
cron fleet_platform postgres cron_app 5 migration TBD

Migrators share tracksolid_owner. webhook_receiver, ingest_worker, and worker all run run_migrations.py (DDL) and write telemetry. Because they ALTER objects, they must OWN them — so they connect as the single non-superuser tracksolid_owner (the role the repo already intends to own these schemas). One shared role = correct ownership, no app code change, one bounded connection cap. gateway/cron use a different database (fleet_platform) on the same server — still counted against the 100-slot ceiling; confirm whether they migrate before cutover (apply the same owner pattern if so).

Connection budget (keep the sum < ~95, leaving 3 reserved + admin headroom)

tracksolid_owner 30 (shared by 3 migrators) + dashboard_app 8         = 38  (tracksolid_db)
gateway_app 15 + cron_app 5                                            = 20  (fleet_platform)
analytics_ro ~8 + dashboard_ro ~12 + grafana_ro ~5 + reporting_refresher ~3 = ~28  (existing)
                                                                  TOTAL ≈ 86  ✅

Tune the CONNECTION LIMITs to your real pool sizes; the point is the sum is now bounded and visible, not open-ended superuser pools.

Step 1 — Discovery (DONE 2026-06-20)

Confirmed live: webhook_receiver, ingest_worker, worker all start with python run_migrations.py && … → they run DDL and write telemetry (worker is the same image as ingest_worker). Writes span tracksolid, reporting, tickets. dashboard_api (prod backend) reads. gateway/cron are on fleet_platform and write state; their migration behaviour is not yet confirmed (opaque entrypoint.sh) — verify before cutover with:

-- re-run after a deploy to see writes; or set log_statement='ddl' on fleet_platform.
SELECT schemaname, sum(n_tup_ins+n_tup_upd+n_tup_del) FROM pg_stat_user_tables GROUP BY 1;

Step 2 — Create roles + reassign ownership (no app impact yet)

The ownership reassignment in app_roles_tracksolid_db.sql is safe to run while the apps still connect as postgres — superuser bypasses ownership, so nothing breaks until you flip a DATABASE_URL. It is Timescale-aware (skips linked sequences, uses ALTER MATERIALIZED VIEW for continuous aggregates, leaves reporting.v_trips with reporting_refresher) and idempotent — validated in a rolled-back transaction against the live DB.

for r in tracksolid_owner dashboard_app gateway_app cron_app; do
  [ -s ~/.$r.pw ] || ( umask 077; openssl rand -hex 24 > ~/.$r.pw )
done
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)

# tracksolid_db: owner/migrator role + ownership reassignment + dashboard reader
docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
  -v owner_pw="$(cat ~/.tracksolid_owner.pw)" -v dash_pw="$(cat ~/.dashboard_app.pw)" \
  < scripts/app_roles_tracksolid_db.sql

# fleet_platform: gateway/cron roles (see that file's notes re: migrations)
docker exec -i "$DB" psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
  -v gateway_pw="$(cat ~/.gateway_app.pw)" -v cron_pw="$(cat ~/.cron_app.pw)" \
  < scripts/app_roles_fleet_platform.sql

If gateway/cron run migrations, they need the same owner treatment on fleet_platform (reassign its schemas to a fleet_platform_owner login role) — do that before cutting them over. Until confirmed, leave them on postgres.

Step 3 — Cut over one app at a time

Change each service's DATABASE_URL user/password (same host/port/dbname), redeploy just that one, watch its logs for permission denied and the DB for the count:

# the three migrators → the shared owner role:
postgresql://tracksolid_owner:<owner_pw>@timescale_db:5432/tracksolid_db
# the dashboard backend → the reader:
postgresql://dashboard_app:<dash_pw>@timescale_db:5432/tracksolid_db
docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

Order: dashboard_api (reader, lowest risk) first → confirm → then the migrators one at a time (ingest_worker, then worker, then webhook_receiver), watching that run_migrations.py succeeds and ingestion resumes after each.

Rollback (instant)

Each app's only change is its DATABASE_URL. If anything misbehaves, set it back to the postgres:… DSN and redeploy that one app — no DB change required. The roles are additive; to remove one entirely: DROP ROLE <app>; (after nothing uses it).

After all six are migrated

  • Add idle_session_timeout is already covered by the per-role GUCs above.
  • Consider rotating the postgres superuser password and restricting it to admin use only (it should no longer appear in any app's env).
  • Re-check the budget: SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1; — no app should exceed its CONNECTION LIMIT, and the total should sit comfortably under 100. This is also when PgBouncer (separate PR) becomes optional rather than necessary.