# Migrating the stack apps off the `postgres` superuser

## Why

The Postgres server (`timescale_db`) has `max_connections = 100`. Six service
connections run as the **`postgres` superuser**, each with a persistent pool that
sits idle for hours. That's the root of the intermittent `FATAL: sorry, too many
clients already`:

- superuser sessions can use the **`superuser_reserved_connections`** slots, so the
  server can fill completely with no admin headroom;
- a per-role **`CONNECTION LIMIT`** is not enforced against superusers, and timeouts
  set on the shared `postgres` role would also hit admin sessions, so neither can be
  applied to these apps effectively;
- and it's a standing least-privilege risk (any of these apps can read/write/DROP
  anything in any database).

Giving each app a dedicated **NOSUPERUSER** role with a hard `CONNECTION LIMIT` fixes
all three.

## The six connections (confirmed live)

| Service | Database | Current user | New role | Conn limit |
|---|---|---|---|---|
| `webhook_receiver` | tracksolid_db | postgres | `webhook_app` | 10 |
| `ingest_worker` | tracksolid_db | postgres | `ingest_app` | 10 |
| `worker` | tracksolid_db | postgres | `worker_app` (read) | 5 |
| `dashboard_api` (prod backend) | tracksolid_db | postgres | `dashboard_app` (or reuse `dashboard_ro`) | 8 |
| `gateway` | **fleet_platform** | postgres | `gateway_app` | 15 |
| `cron` | **fleet_platform** | postgres | `cron_app` | 5 |

> Note: `gateway`/`cron` use a **different database** (`fleet_platform`) on the same
> server, but they still count against the shared 100-slot ceiling.

### Connection budget (keep the sum < ~95, leaving 3 reserved + admin headroom)

```
webhook_app 10 + ingest_app 10 + worker_app 5 + dashboard_app 8                      =  33  (tracksolid_db)
gateway_app 15 + cron_app 5                                                          =  20  (fleet_platform)
analytics_ro ~8 + dashboard_ro ~12 + grafana_ro ~5 + reporting_refresher ~3          = ~28  (existing)
TOTAL                                                                                ≈  81  ✅
```
Tune the `CONNECTION LIMIT`s in the SQL to your real pool sizes; the point is the sum
is now **bounded and visible**, not open-ended superuser pools.
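
The budget arithmetic above can be re-run whenever a limit changes. The following is a minimal sketch, not part of the repo's scripts: it treats the approximate (`~`) figures for the existing roles as exact, and the names `BUDGET`, `budget_total`, `TOTAL` and `HEADROOM` are made up for illustration.

```bash
# Sketch: sum the planned per-role limits and compare with the usable ceiling.
# 100 and 3 are the max_connections and reserved-slot figures quoted in this document.
MAX_CONNECTIONS=100
RESERVED=3

# "role=limit" pairs copied from the budget; the "~" values are taken as exact here.
BUDGET="webhook_app=10 ingest_app=10 worker_app=5 dashboard_app=8
gateway_app=15 cron_app=5
analytics_ro=8 dashboard_ro=12 grafana_ro=5 reporting_refresher=3"

budget_total() {
  # $1 = whitespace-separated "role=limit" pairs; prints the sum of the limits
  total=0
  for pair in $1; do
    total=$(( total + ${pair#*=} ))
  done
  echo "$total"
}

TOTAL=$(budget_total "$BUDGET")
HEADROOM=$(( MAX_CONNECTIONS - RESERVED - TOTAL ))
echo "total=$TOTAL headroom=$HEADROOM"
```

If `HEADROOM` drops toward zero after a limit is raised, that is the signal to shrink a pool elsewhere before applying the change.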
## Step 1 — Discover what each app actually needs (do NOT skip)
The drafted grants are best-effort (ingestion = write telemetry; gateway/cron = RW
app state; worker/dashboard = read). Confirm before cutover:

```sql
-- (a) Which tables does each app WRITE? Reset stats, run the app for a bit, re-check.
--     Reset with: SELECT pg_stat_reset();  (current database only; it also clears
--     the counters autovacuum relies on). The counters are per table, not per role,
--     so exercise one app at a time.
SELECT schemaname, relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE n_tup_ins + n_tup_upd + n_tup_del > 0
ORDER BY 1,2;

-- (b) Does the app run DDL/migrations at deploy? Check its code/entrypoint for
--     CREATE/ALTER/DROP or a migrations runner (e.g. run_migrations.py, alembic).
--     If yes → it needs object OWNERSHIP, see Step 3.
```
Or temporarily set `log_statement = 'ddl'` (or `'mod'`) and watch one deploy cycle.
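
For check (b), a quick text search over an app's source tree is often enough to spot DDL or a migrations runner. This is a rough sketch only: `find_ddl` is a hypothetical helper, the pattern list is illustrative rather than exhaustive, and a hit still needs a human to read the surrounding code.

```bash
# Sketch: list files under an app's source tree that mention DDL statements
# or one of the migration runners named above.
find_ddl() {
  # $1 = path to the app's source checkout (hypothetical, e.g. ./webhook_receiver)
  grep -rlEi '(CREATE|ALTER|DROP) +(TABLE|INDEX|SCHEMA|VIEW)|run_migrations|alembic' "$1" | sort
}
```

An empty result suggests the app only reads and writes rows, but the `log_statement` check remains the more reliable confirmation because it sees what actually reaches the server.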
## Step 2 — Create the roles (no app impact yet)
Generate a password per role (host-only, 0600), then apply the SQL as postgres:

```bash
# One password file per role; existing non-empty files are left untouched.
for r in webhook_app ingest_app worker_app dashboard_app gateway_app cron_app; do
  [ -s "$HOME/.$r.pw" ] || ( umask 077; openssl rand -hex 24 > "$HOME/.$r.pw" )
done
# Resolve the running container name.
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -n 1)

docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
  -v webhook_pw="$(cat ~/.webhook_app.pw)" -v ingest_pw="$(cat ~/.ingest_app.pw)" \
  -v worker_pw="$(cat ~/.worker_app.pw)"  -v dash_pw="$(cat ~/.dashboard_app.pw)" \
  < scripts/app_roles_tracksolid_db.sql

docker exec -i "$DB" psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
  -v gateway_pw="$(cat ~/.gateway_app.pw)" -v cron_pw="$(cat ~/.cron_app.pw)" \
  < scripts/app_roles_fleet_platform.sql
```
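
Before applying the SQL, it can be worth confirming that each password file looks the way the loop above should have left it. The sketch below is an optional sanity check under two assumptions: the file was created by `openssl rand -hex 24` (48 lowercase hex characters on one line) and its mode is exactly 0600. `check_pw_file` is a hypothetical helper name.

```bash
# Sketch: verify one password file has mode 0600 and holds 48 hex characters.
check_pw_file() {
  # $1 = path to a password file such as ~/.webhook_app.pw
  [ -n "$(find "$1" -prune -perm 600 2>/dev/null)" ] \
    || { echo "bad mode or missing: $1"; return 1; }
  grep -Eqx '[0-9a-f]{48}' "$1" \
    || { echo "unexpected content: $1"; return 1; }
  echo "ok: $1"
}
```

Calling it once per role, for example `check_pw_file ~/.webhook_app.pw`, catches an empty or world-readable file before it is passed to `psql`.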
## Step 3 — (Only if an app runs migrations) give its role object ownership
All objects are owned by `postgres`, so a non-superuser role can write **rows** but
not `ALTER`/`DROP` existing tables. If discovery showed an app issues DDL, reassign
the objects in the **app schemas** to the existing non-superuser owner role and add
the app role to it. **Scope this to the app schemas and never run
`REASSIGN OWNED BY postgres` globally** (that would also try to move
TimescaleDB/system objects).

```sql
-- tracksolid_db: make tracksolid_owner own the app objects, then add the ingestors.
DO $$
DECLARE r record;
BEGIN
  FOR r IN
    SELECT n.nspname, c.relname,
           CASE c.relkind WHEN 'v' THEN 'VIEW'
                          WHEN 'm' THEN 'MATERIALIZED VIEW'
                          ELSE 'TABLE' END AS kind
    FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname IN ('tracksolid','reporting') AND c.relkind IN ('r','p','v','m')
  LOOP
    EXECUTE format('ALTER %s %I.%I OWNER TO tracksolid_owner', r.kind, r.nspname, r.relname);
  END LOOP;
END $$;
GRANT CREATE ON SCHEMA tracksolid, reporting TO tracksolid_owner;
GRANT tracksolid_owner TO webhook_app, ingest_app;   -- they inherit ownership rights
```
(Do the analogous reassignment in `fleet_platform` to a `fleet_platform_owner` role
if `gateway`/`cron` run migrations. If `reporting_refresher` refreshes
`reporting.v_trips`, keep that object owned by `reporting_refresher`: exclude it from
the loop above or reassign it back afterwards.)

Test one deploy/migration as the new role **before** cutting over all apps.

## Step 4 — Cut over one app at a time
For each service, change its `DATABASE_URL` user/password from `postgres:…` to the new
role (same host/port/dbname), redeploy **just that one**, and watch its logs for
`permission denied` (→ widen the group grant) and the DB for connection count:

```bash
# in the app's env (Coolify secret or compose):
#   tracksolid_db:   postgresql://webhook_app:<pw>@timescale_db:5432/tracksolid_db
#   fleet_platform:  postgresql://gateway_app:<pw>@timescale_db:5432/fleet_platform
docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
```
Order: start with the **lowest-risk reader** (`worker`/`dashboard_api`), then the
ingestors, then `gateway`/`cron`.
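
Because only the user and password change, the new DSN can be derived from the old one rather than retyped. A minimal sketch, assuming the URL has the shape `scheme://user:password@host:port/dbname` and that nothing after the credentials contains an `@`; `swap_dsn_user` is a hypothetical helper, not something the apps or Coolify provide.

```bash
# Sketch: replace only the user:password part of a DSN, keeping
# scheme, host, port and database name unchanged.
swap_dsn_user() {
  # $1 = current DSN, $2 = new role, $3 = new password
  scheme=${1%%://*}   # text before "://"
  rest=${1##*@}       # text after the last "@": host:port/dbname
  printf '%s://%s:%s@%s\n' "$scheme" "$2" "$3" "$rest"
}
```

For example, `swap_dsn_user "$DATABASE_URL" webhook_app "$(cat ~/.webhook_app.pw)"` prints the value to paste into the app's secret; keeping the old value at hand makes the rollback a straight revert.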
## Rollback (instant)

Each app's only change is its `DATABASE_URL`. If anything misbehaves, set it back to
the `postgres:…` DSN and redeploy that one app; no DB change is required. The roles
are additive. To remove one entirely once nothing uses it, run `DROP OWNED BY <app>;`
in each database where it holds grants, then `DROP ROLE <app>;` (a role that still
holds privileges cannot be dropped).

## After all six are migrated

- No separate `idle_session_timeout` change is needed; it is already covered by the
  per-role GUCs above.
- Consider **rotating the `postgres` superuser password** and restricting it to admin
  use only (it should no longer appear in any app's env).
- Re-check the budget: `SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1;`.
  No app should exceed its `CONNECTION LIMIT`, and the total should sit comfortably
  under 100. This is also when PgBouncer (separate PR) becomes optional rather than
  necessary.
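
The budget re-check in the last bullet can be scripted so it is easy to repeat. The sketch below assumes the limits from the table in this document and input in the `role|count` form that `psql -At` prints; `LIMITS` and `check_limits` are illustrative names.

```bash
# Sketch: compare live per-role session counts with the planned limits.
LIMITS="webhook_app=10 ingest_app=10 worker_app=5 dashboard_app=8 gateway_app=15 cron_app=5"

check_limits() {
  # stdin: lines of "role|count"; returns 1 if any listed role is over its limit
  over=0
  while IFS='|' read -r role count; do
    for pair in $LIMITS; do
      if [ "${pair%%=*}" = "$role" ] && [ "$count" -gt "${pair#*=}" ]; then
        echo "OVER: $role has $count sessions, limit ${pair#*=}"
        over=1
      fi
    done
  done
  return "$over"
}

# Usage against the live server ($DB as resolved in Step 2):
#   docker exec -i "$DB" psql -U postgres -d tracksolid_db -At -c \
#     "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1;" | check_limits
```

Roles that are not listed in `LIMITS` are ignored, so any sessions still running as `postgres` have to be spotted in the raw query output.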