Discovery (live) corrected the design: webhook_receiver, ingest_worker, and worker all run run_migrations.py (DDL) and write telemetry; worker is the same image as ingest_worker, not a reader. Because they ALTER objects they must own them, so all three connect as the shared non-superuser tracksolid_owner (the role the repo already intends to own these schemas). The dashboard_api backend stays a reader (dashboard_app).

- app_roles_tracksolid_db.sql rewritten: tracksolid_owner LOGIN + CONNECTION LIMIT 30 + GUCs + USAGE/CREATE; Timescale-aware ownership reassignment (skips table-linked sequences, ALTER MATERIALIZED VIEW for continuous aggregates, leaves reporting.v_trips with reporting_refresher, reassigns functions); dashboard_app read role.
- Reassignment validated in a rolled-back transaction on the live DB: reassigns the 31-chunk position_history hypertable + the v_mileage_daily_cagg continuous aggregate, and as tracksolid_owner can ALTER the hypertable and create/drop tables.
- Runbook updated: discovery marked done, ownership folded into the apply (safe while apps still run as postgres, since superuser bypasses ownership), corrected cutover order.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Migrating the stack apps off the `postgres` superuser

## Why

The Postgres server (`timescale_db`) has `max_connections = 100`. Six service
connections run as the **`postgres` superuser**, each with a persistent pool that
sits idle for hours. That's the root of the intermittent `FATAL: sorry, too many
clients already`:

- superuser sessions can use the **`superuser_reserved_connections`** slots, so the
  server can fill completely with no admin headroom;
- a per-role **`CONNECTION LIMIT`** is not enforced for superusers, and timeouts
  can't be applied to them effectively;
- and it's a standing least-privilege risk (any of these apps can read/write/DROP
  anything in any database).

Moving the apps onto **NOSUPERUSER** roles, each with a hard `CONNECTION LIMIT`,
fixes all three.

## The six connections (confirmed live 2026-06-20)

| Service | Database | Current user | New role | Conn limit | Notes |
|---|---|---|---|---|---|
| `webhook_receiver` | tracksolid_db | postgres | **`tracksolid_owner`** | 30 (shared) | runs migrations |
| `ingest_worker` | tracksolid_db | postgres | **`tracksolid_owner`** | (shared) | runs migrations |
| `worker` | tracksolid_db | postgres | **`tracksolid_owner`** | (shared) | same image as `ingest_worker`; runs migrations |
| `dashboard_api` (prod backend) | tracksolid_db | postgres | `dashboard_app` (read) | 8 | reader |
| `gateway` | **fleet_platform** | postgres | `gateway_app` | 15 | migration TBD |
| `cron` | **fleet_platform** | postgres | `cron_app` | 5 | migration TBD |

> **Migrators share `tracksolid_owner`.** `webhook_receiver`, `ingest_worker`, and
> `worker` all run `run_migrations.py` (DDL) and write telemetry. Because they ALTER
> objects, they must OWN them, so they connect as the single non-superuser
> `tracksolid_owner` (the role the repo already intends to own these schemas). One
> shared role gives correct ownership, no app code change, and one bounded connection cap.
> `gateway`/`cron` use a **different database** (`fleet_platform`) on the same server,
> so they still count against the 100-slot ceiling; confirm whether they run
> migrations before cutover (apply the same owner pattern if so).

### Connection budget (keep the sum < ~95, leaving 3 reserved + admin headroom)

```
tracksolid_owner 30 (shared by 3 migrators) + dashboard_app 8          = 38  (tracksolid_db)
gateway_app 15 + cron_app 5                                            = 20  (fleet_platform)
analytics_ro ~8 + dashboard_ro ~12 + grafana_ro ~5 + reporting_refresher ~3 = ~28 (existing)
TOTAL ≈ 86 ✅
```

Tune the `CONNECTION LIMIT`s to your real pool sizes; the point is the sum is now
**bounded and visible**, not open-ended superuser pools.

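
The budget arithmetic above can be re-checked mechanically whenever a limit changes. This is a minimal sketch, not something that exists in the repo: the function names `budget_total` and `budget_ok` are hypothetical, and the `~` estimates for the existing read-only roles are taken at face value.

```bash
# Sketch: sum per-role connection caps and compare them with a ceiling.
# Each argument is a role:limit pair; only the number after ':' is used.
budget_total() (
  total=0
  for pair in "$@"; do
    total=$(( total + ${pair##*:} ))
  done
  echo "$total"
)

# usage: budget_ok CEILING role:limit [role:limit ...]
# Succeeds only when the sum is strictly below the ceiling.
budget_ok() (
  ceiling=$1; shift
  [ "$(budget_total "$@")" -lt "$ceiling" ]
)

# Values copied from the budget table (the ~ figures are estimates).
ROLES="tracksolid_owner:30 dashboard_app:8 gateway_app:15 cron_app:5
       analytics_ro:8 dashboard_ro:12 grafana_ro:5 reporting_refresher:3"

echo "total=$(budget_total $ROLES)"
budget_ok 95 $ROLES && echo "within budget" || echo "over budget"
```

With the figures above this prints `total=86` and `within budget`. Keeping the pairs in one place makes it harder for a raised limit to silently push the sum past the ceiling.
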
## Step 1: Discovery (DONE 2026-06-20)

Confirmed live: `webhook_receiver`, `ingest_worker`, and `worker` all start with
`python run_migrations.py && …`, so they run **DDL** and write telemetry (`worker` is
the same image as `ingest_worker`). Writes span `tracksolid`, `reporting`, `tickets`.
`dashboard_api` (prod backend) only reads. `gateway`/`cron` are on `fleet_platform` and
write to `state`; their migration behaviour is **not yet confirmed** (opaque
`entrypoint.sh`). Verify before cutover with:

```sql
-- Run against fleet_platform; re-run after a deploy to see writes, or set log_statement = 'ddl' there.
SELECT schemaname, sum(n_tup_ins + n_tup_upd + n_tup_del) AS writes
FROM pg_stat_user_tables
GROUP BY 1;
```

## Step 2: Create roles + reassign ownership (no app impact yet)

The ownership reassignment in `app_roles_tracksolid_db.sql` is **safe to run while the
apps still connect as `postgres`**: superuser bypasses ownership, so nothing breaks
until you flip a `DATABASE_URL`. It is Timescale-aware (skips linked sequences, uses
`ALTER MATERIALIZED VIEW` for continuous aggregates, leaves `reporting.v_trips` with
`reporting_refresher`) and idempotent, and it was validated in a rolled-back
transaction against the live DB.

```bash
# Generate a password file per role, once, readable only by the current user.
for r in tracksolid_owner dashboard_app gateway_app cron_app; do
  [ -s ~/.$r.pw ] || ( umask 077; openssl rand -hex 24 > ~/.$r.pw )
done
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)

# tracksolid_db: owner/migrator role + ownership reassignment + dashboard reader
docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
  -v owner_pw="$(cat ~/.tracksolid_owner.pw)" -v dash_pw="$(cat ~/.dashboard_app.pw)" \
  < scripts/app_roles_tracksolid_db.sql

# fleet_platform: gateway/cron roles (see that file's notes re: migrations)
docker exec -i "$DB" psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
  -v gateway_pw="$(cat ~/.gateway_app.pw)" -v cron_pw="$(cat ~/.cron_app.pw)" \
  < scripts/app_roles_fleet_platform.sql
```

> If `gateway`/`cron` run migrations, they need the same owner treatment on
> `fleet_platform` (reassign its schemas to a `fleet_platform_owner` login role). Do
> that before cutting them over. Until confirmed, leave them on `postgres`.

## Step 3: Cut over one app at a time

Change each service's `DATABASE_URL` user/password (same host/port/dbname), redeploy
**just that one**, then watch its logs for `permission denied` and the DB for the
per-role session count:

```
# the three migrators → the shared owner role:
postgresql://tracksolid_owner:<owner_pw>@timescale_db:5432/tracksolid_db
# the dashboard backend → the reader:
postgresql://dashboard_app:<dash_pw>@timescale_db:5432/tracksolid_db
```
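
Assembling these DSNs by hand invites copy-paste mistakes, so they can be built from the password files created in Step 2. This is a hedged sketch: `build_dsn` is a hypothetical helper, and it assumes the `~/.<role>.pw` layout and the `timescale_db:5432` host/port shown above. The passwords come from `openssl rand -hex`, so they need no URL-encoding; a password with other characters would.

```bash
# Sketch: print a DATABASE_URL for a role, reading its password from a file.
# usage: build_dsn ROLE DBNAME [PASSWORD_FILE]   (file defaults to ~/.ROLE.pw)
build_dsn() (
  role=$1; db=$2; pw_file=${3:-"$HOME/.$role.pw"}
  # Refuse to emit a DSN with an empty password.
  [ -s "$pw_file" ] || { echo "missing password file: $pw_file" >&2; exit 1; }
  pw=$(cat "$pw_file")
  printf 'postgresql://%s:%s@timescale_db:5432/%s\n' "$role" "$pw" "$db"
)

# Demo with a throwaway password file instead of a real secret.
demo_pw=$(mktemp)
printf 'deadbeef\n' > "$demo_pw"
build_dsn dashboard_app tracksolid_db "$demo_pw"
rm -f "$demo_pw"
```

The demo prints `postgresql://dashboard_app:deadbeef@timescale_db:5432/tracksolid_db`. In real use, omit the third argument so the helper reads `~/.dashboard_app.pw`.
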
```bash
docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
```

**Order:** `dashboard_api` (reader, lowest risk) first → confirm → then the migrators
one at a time (`ingest_worker`, then `worker`, then `webhook_receiver`), watching that
`run_migrations.py` succeeds and ingestion resumes after each.

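
Watching the logs after each cutover can be partly automated. The sketch below is an assumption-laden helper, not part of the repo: `check_logs` is a hypothetical name, and besides `permission denied` it also matches `must be owner of`, which is the error PostgreSQL raises when a non-owner attempts an `ALTER`.

```bash
# Sketch: scan log text on stdin for privilege or ownership errors.
# Prints the matching lines and fails (exit 1) if any are found.
check_logs() (
  if matches=$(grep -iE 'permission denied|must be owner of'); then
    echo "$matches"
    exit 1
  fi
  exit 0
)

# Demo on sample text; in practice pipe recent service logs in instead,
# e.g. `docker logs --since 5m <container> 2>&1 | check_logs`.
printf 'INFO migrations ok\nERROR: must be owner of table position_history\n' \
  | check_logs || echo "ownership problem detected"
```

A clean result only shows that no such error was logged in the window inspected; it does not replace confirming that `run_migrations.py` succeeded and ingestion resumed.
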
## Rollback (instant)

Each app's only change is its `DATABASE_URL`. If anything misbehaves, set it back to
the `postgres:…` DSN and redeploy that one app; no DB change is required, because the
superuser bypasses ownership. The roles are additive; to remove one entirely:
`DROP ROLE <app>;` (after nothing uses it; a role that still owns objects or holds
grants needs `REASSIGN OWNED BY` / `DROP OWNED BY` first).

## After all six are migrated

- `idle_session_timeout` is already covered by the per-role GUCs applied in Step 2;
  nothing further to add.
- Consider **rotating the `postgres` superuser password** and restricting it to admin
  use only (it should no longer appear in any app's env).
- Re-check the budget: `SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1;`.
  No app should exceed its `CONNECTION LIMIT`, and the total should sit comfortably
  under 100. This is also when PgBouncer (separate PR) becomes optional rather than
  necessary.

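
The budget re-check can be scripted against the output of that query. The following is a minimal sketch under stated assumptions: `at_limit` is a hypothetical helper, it expects `usename|count` lines (the default unaligned output of `psql -At`), and the limits are passed in by hand rather than read from the server. Because the server refuses connections beyond a role's cap, the sketch flags roles that have reached their limit rather than exceeded it.

```bash
# Sketch: flag roles whose live session count has reached their cap.
# stdin: "usename|count" lines; arguments: role:limit pairs.
# Prints "role: count/limit" per saturated role and exits 1 if any were found.
at_limit() (
  limits="$*"
  status=0
  while IFS='|' read -r role count; do
    [ -n "$role" ] || continue          # background processes have no usename
    for pair in $limits; do
      if [ "${pair%%:*}" = "$role" ] && [ "$count" -ge "${pair##*:}" ]; then
        echo "$role: $count/${pair##*:}"
        status=1
      fi
    done
  done
  exit "$status"
)

# Demo on sample rows; in practice feed it the psql -At output of the query.
printf 'tracksolid_owner|12\ndashboard_app|8\n|4\n' \
  | at_limit tracksolid_owner:30 dashboard_app:8 \
  || echo "some roles are at their cap"
```

A role sitting at its cap usually means its pool is sized larger than its `CONNECTION LIMIT`, which is the cue to tune one or the other.
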