kiania e571eeabed infra(db-roles): validated Option A — shared tracksolid_owner for migrators

Discovery (live) corrected the design: webhook_receiver, ingest_worker, and worker
all run run_migrations.py (DDL) and write telemetry — worker is the same image as
ingest_worker, not a reader. Because they ALTER objects they must own them, so all
three connect as the shared non-superuser tracksolid_owner (the role the repo already
intends to own these schemas). dashboard_api backend stays a reader (dashboard_app).

- app_roles_tracksolid_db.sql rewritten: tracksolid_owner LOGIN + CONNECTION LIMIT 30
  + GUCs + USAGE/CREATE; Timescale-aware ownership reassignment (skips table-linked
  sequences, ALTER MATERIALIZED VIEW for continuous aggregates, leaves reporting.v_trips
  with reporting_refresher, reassigns functions); dashboard_app read role.
- Reassignment validated in a rolled-back transaction on the live DB: reassigns the
  31-chunk position_history hypertable + the v_mileage_daily_cagg continuous aggregate,
  and as tracksolid_owner can ALTER the hypertable and create/drop tables.
- Runbook updated: discovery marked done, ownership folded into the apply (safe while
  apps still run as postgres — superuser bypasses ownership), corrected cutover order.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-20 00:08:52 +03:00

6.4 KiB

Raw Blame History

Migrating the stack apps off the `postgres` superuser

Why

The Postgres server (timescale_db) has max_connections = 100. Six service connections run as the postgres superuser, each with a persistent pool that sits idle for hours. That's the root of the intermittent FATAL: sorry, too many clients already:

superuser sessions can use the superuser_reserved_connections slots, so the server can fill completely with no admin headroom;
you can't put a per-role CONNECTION LIMIT or enforce timeouts on them effectively;
and it's a standing least-privilege risk (any of these apps can read/write/DROP anything in any database).

Giving each app a dedicated NOSUPERUSER role with a hard CONNECTION LIMIT fixes all three.

The six connections (confirmed live 2026-06-20)

Service	Database	Current user	New role	Conn limit	Notes
`webhook_receiver`	tracksolid_db	postgres	`tracksolid_owner`	30 (shared)	runs migrations
`ingest_worker`	tracksolid_db	postgres	`tracksolid_owner`	(shared)	runs migrations
`worker`	tracksolid_db	postgres	`tracksolid_owner`	(shared)	= ingest_worker image; runs migrations
`dashboard_api` (prod backend)	tracksolid_db	postgres	`dashboard_app` (read)	8	reader
`gateway`	fleet_platform	postgres	`gateway_app`	15	migration TBD
`cron`	fleet_platform	postgres	`cron_app`	5	migration TBD

Migrators share tracksolid_owner. webhook_receiver, ingest_worker, and worker all run run_migrations.py (DDL) and write telemetry. Because they ALTER objects, they must OWN them — so they connect as the single non-superuser tracksolid_owner (the role the repo already intends to own these schemas). One shared role = correct ownership, no app code change, one bounded connection cap. gateway/cron use a different database (fleet_platform) on the same server — still counted against the 100-slot ceiling; confirm whether they migrate before cutover (apply the same owner pattern if so).

Connection budget (keep the sum < ~95, leaving 3 reserved + admin headroom)

tracksolid_owner 30 (shared by 3 migrators) + dashboard_app 8         = 38  (tracksolid_db)
gateway_app 15 + cron_app 5                                            = 20  (fleet_platform)
analytics_ro ~8 + dashboard_ro ~12 + grafana_ro ~5 + reporting_refresher ~3 = ~28  (existing)
                                                                  TOTAL ≈ 86  ✅

Tune the CONNECTION LIMITs to your real pool sizes; the point is the sum is now bounded and visible, not open-ended superuser pools.

Step 1 — Discovery (DONE 2026-06-20)

Confirmed live: webhook_receiver, ingest_worker, worker all start with python run_migrations.py && … → they run DDL and write telemetry (worker is the same image as ingest_worker). Writes span tracksolid, reporting, tickets. dashboard_api (prod backend) reads. gateway/cron are on fleet_platform and write state; their migration behaviour is not yet confirmed (opaque entrypoint.sh) — verify before cutover with:

-- re-run after a deploy to see writes; or set log_statement='ddl' on fleet_platform.
SELECT schemaname, sum(n_tup_ins+n_tup_upd+n_tup_del) FROM pg_stat_user_tables GROUP BY 1;

Step 2 — Create roles + reassign ownership (no app impact yet)

The ownership reassignment in app_roles_tracksolid_db.sql is safe to run while the apps still connect as postgres — superuser bypasses ownership, so nothing breaks until you flip a DATABASE_URL. It is Timescale-aware (skips linked sequences, uses ALTER MATERIALIZED VIEW for continuous aggregates, leaves reporting.v_trips with reporting_refresher) and idempotent — validated in a rolled-back transaction against the live DB.

for r in tracksolid_owner dashboard_app gateway_app cron_app; do
  [ -s ~/.$r.pw ] || ( umask 077; openssl rand -hex 24 > ~/.$r.pw )
done
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)

# tracksolid_db: owner/migrator role + ownership reassignment + dashboard reader
docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
  -v owner_pw="$(cat ~/.tracksolid_owner.pw)" -v dash_pw="$(cat ~/.dashboard_app.pw)" \
  < scripts/app_roles_tracksolid_db.sql

# fleet_platform: gateway/cron roles (see that file's notes re: migrations)
docker exec -i "$DB" psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
  -v gateway_pw="$(cat ~/.gateway_app.pw)" -v cron_pw="$(cat ~/.cron_app.pw)" \
  < scripts/app_roles_fleet_platform.sql

If gateway/cron run migrations, they need the same owner treatment on fleet_platform (reassign its schemas to a fleet_platform_owner login role) — do that before cutting them over. Until confirmed, leave them on postgres.

Step 3 — Cut over one app at a time

Change each service's DATABASE_URL user/password (same host/port/dbname), redeploy just that one, watch its logs for permission denied and the DB for the count:

# the three migrators → the shared owner role:
postgresql://tracksolid_owner:<owner_pw>@timescale_db:5432/tracksolid_db
# the dashboard backend → the reader:
postgresql://dashboard_app:<dash_pw>@timescale_db:5432/tracksolid_db

docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

Order: dashboard_api (reader, lowest risk) first → confirm → then the migrators one at a time (ingest_worker, then worker, then webhook_receiver), watching that run_migrations.py succeeds and ingestion resumes after each.

Rollback (instant)

Each app's only change is its DATABASE_URL. If anything misbehaves, set it back to the postgres:… DSN and redeploy that one app — no DB change required. The roles are additive; to remove one entirely: DROP ROLE <app>; (after nothing uses it).

After all six are migrated

Add idle_session_timeout is already covered by the per-role GUCs above.
Consider rotating the postgres superuser password and restricting it to admin use only (it should no longer appear in any app's env).
Re-check the budget: SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1; — no app should exceed its CONNECTION LIMIT, and the total should sit comfortably under 100. This is also when PgBouncer (separate PR) becomes optional rather than necessary.

6.4 KiB Raw Blame History

Migrating the stack apps off the postgres superuser

Why