infra(db-roles): dedicated non-superuser roles for the six superuser apps #3
3 changed files with 294 additions and 0 deletions
128
scripts/MIGRATE_APPS_OFF_SUPERUSER.md
Normal file
@@ -0,0 +1,128 @@
# Migrating the stack apps off the `postgres` superuser

## Why

The Postgres server (`timescale_db`) has `max_connections = 100`. Six service
connections run as the **`postgres` superuser**, each with a persistent pool that
sits idle for hours. That's the root of the intermittent `FATAL: sorry, too many
clients already`:

- superuser sessions can use the **`superuser_reserved_connections`** slots, so the
  server can fill completely with no admin headroom;
- you can't put an effective per-role **`CONNECTION LIMIT`** on them (Postgres does
  not enforce connection limits for superusers) or enforce timeouts on them
  effectively;
- and it's a standing least-privilege risk (any of these apps can read/write/DROP
  anything in any database).

Giving each app a dedicated **NOSUPERUSER** role with a hard `CONNECTION LIMIT` fixes
all three.

## The six connections (confirmed live 2026-06-20)

| Service | Database | Current user | New role | Conn limit | Notes |
|---|---|---|---|---|---|
| `webhook_receiver` | tracksolid_db | postgres | **`tracksolid_owner`** | 30 (shared) | runs migrations |
| `ingest_worker` | tracksolid_db | postgres | **`tracksolid_owner`** | (shared) | runs migrations |
| `worker` | tracksolid_db | postgres | **`tracksolid_owner`** | (shared) | = ingest_worker image; runs migrations |
| `dashboard_api` (prod backend) | tracksolid_db | postgres | `dashboard_app` (read) | 8 | reader |
| `gateway` | **fleet_platform** | postgres | `gateway_app` | 15 | migration TBD |
| `cron` | **fleet_platform** | postgres | `cron_app` | 5 | migration TBD |

> **Migrators share `tracksolid_owner`.** `webhook_receiver`, `ingest_worker`, and
> `worker` all run `run_migrations.py` (DDL) and write telemetry. Because they ALTER
> objects, they must OWN them, so they connect as the single non-superuser
> `tracksolid_owner` (the role the repo already intends to own these schemas). One
> shared role = correct ownership, no app code change, one bounded connection cap.
> `gateway`/`cron` use a **different database** (`fleet_platform`) on the same server,
> so they still count against the 100-slot ceiling; confirm whether they run
> migrations before cutover (apply the same owner pattern if so).

### Connection budget (keep the sum < ~95, leaving 3 reserved + admin headroom)

```
tracksolid_owner 30 (shared by 3 migrators) + dashboard_app 8               =  38  (tracksolid_db)
gateway_app 15 + cron_app 5                                                 =  20  (fleet_platform)
analytics_ro ~8 + dashboard_ro ~12 + grafana_ro ~5 + reporting_refresher ~3 = ~28  (existing)
                                                                      TOTAL ≈  86  ✅
```

Tune the `CONNECTION LIMIT`s to your real pool sizes; the point is the sum is now
**bounded and visible**, not open-ended superuser pools.
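
The budget arithmetic above can be re-checked whenever a limit changes. The sketch below is only an illustration: the group names, the `budget_report` helper and the hard `BUDGET_CEILING = 95` threshold (standing in for the "< ~95" guideline) are assumptions, and the figures for the existing roles are the approximate values quoted in the budget.

```python
# Planned CONNECTION LIMITs from the budget, grouped the same way as the table.
# The "existing" figures are the approximate (~) values quoted in the runbook.
PLANNED_LIMITS = {
    "tracksolid_db": {"tracksolid_owner": 30, "dashboard_app": 8},
    "fleet_platform": {"gateway_app": 15, "cron_app": 5},
    "existing": {
        "analytics_ro": 8,
        "dashboard_ro": 12,
        "grafana_ro": 5,
        "reporting_refresher": 3,
    },
}
BUDGET_CEILING = 95  # assumption: a hard number standing in for "< ~95"


def budget_report(limits):
    """Return (per-group subtotals, grand total) for the planned role limits."""
    subtotals = {group: sum(roles.values()) for group, roles in limits.items()}
    return subtotals, sum(subtotals.values())


subtotals, total = budget_report(PLANNED_LIMITS)
within_budget = total < BUDGET_CEILING
print(subtotals, total, within_budget)
```

If a limit is raised later, editing one number here shows immediately whether the sum still leaves the intended headroom.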

## Step 1: Discovery (DONE 2026-06-20)

Confirmed live: `webhook_receiver`, `ingest_worker`, `worker` all start with
`python run_migrations.py && …` → they run **DDL** and write telemetry (`worker` is
the same image as `ingest_worker`). Writes span `tracksolid`, `reporting`, `tickets`.
`dashboard_api` (prod backend) reads. `gateway`/`cron` are on `fleet_platform` and
write `state`; their migration behaviour is **not yet confirmed** (opaque
`entrypoint.sh`). Verify before cutover with:

```sql
-- Cumulative per-schema write counters: run before and after a deploy and compare
-- to see writes; or set log_statement='ddl' on fleet_platform to catch DDL.
SELECT schemaname, sum(n_tup_ins + n_tup_upd + n_tup_del) AS writes
FROM pg_stat_user_tables
GROUP BY 1;
```
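
Because those counters are cumulative, the useful signal is the difference between a snapshot taken before a deploy and one taken after it. A minimal sketch of that comparison follows, assuming each snapshot has been loaded into a dict of schema name to counter value. The `write_deltas` helper and the sample numbers are hypothetical, and the counters only reflect row writes (DML), not DDL.

```python
def write_deltas(before, after):
    """Per-schema increase in the cumulative write counters between two snapshots.

    Each snapshot maps schemaname -> sum(n_tup_ins + n_tup_upd + n_tup_del),
    i.e. the rows returned by the discovery query. Schemas without new writes
    are left out of the result.
    """
    deltas = {}
    for schema, count in after.items():
        delta = count - before.get(schema, 0)
        if delta > 0:
            deltas[schema] = delta
    return deltas


# Illustrative numbers only, not measurements from the live database.
before_deploy = {"state": 1200, "auth": 40, "geo": 0}
after_deploy = {"state": 1350, "auth": 40, "geo": 0, "events": 7}
print(write_deltas(before_deploy, after_deploy))
```

A schema that shows up here for `gateway` or `cron` needs at least DML grants; whether the apps also run DDL still has to be confirmed separately.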

## Step 2: Create roles + reassign ownership (no app impact yet)

The ownership reassignment in `app_roles_tracksolid_db.sql` is **safe to run while the
apps still connect as `postgres`**: superuser bypasses ownership, so nothing breaks
until you flip a `DATABASE_URL`. It is Timescale-aware (skips linked sequences, uses
`ALTER MATERIALIZED VIEW` for continuous aggregates, leaves `reporting.v_trips` with
`reporting_refresher`) and idempotent; it was validated in a rolled-back transaction
against the live DB.

```bash
# One password file per role (mode 0600 via umask); existing files are kept.
for r in tracksolid_owner dashboard_app gateway_app cron_app; do
  [ -s ~/."$r".pw ] || ( umask 077; openssl rand -hex 24 > ~/."$r".pw )
done
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)

# tracksolid_db: owner/migrator role + ownership reassignment + dashboard reader
docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
  -v owner_pw="$(cat ~/.tracksolid_owner.pw)" -v dash_pw="$(cat ~/.dashboard_app.pw)" \
  < scripts/app_roles_tracksolid_db.sql

# fleet_platform: gateway/cron roles (see that file's notes re: migrations)
docker exec -i "$DB" psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
  -v gateway_pw="$(cat ~/.gateway_app.pw)" -v cron_pw="$(cat ~/.cron_app.pw)" \
  < scripts/app_roles_fleet_platform.sql
```

> If `gateway`/`cron` run migrations, they need the same owner treatment on
> `fleet_platform` (reassign its schemas to a `fleet_platform_owner` login role). Do
> that before cutting them over. Until confirmed, leave them on `postgres`.

## Step 3: Cut over one app at a time

Change each service's `DATABASE_URL` user/password (same host/port/dbname), redeploy
**just that one**, then watch its logs for `permission denied` and the DB for the
per-role connection count:

```
# the three migrators → the shared owner role:
postgresql://tracksolid_owner:<owner_pw>@timescale_db:5432/tracksolid_db
# the dashboard backend → the reader:
postgresql://dashboard_app:<dash_pw>@timescale_db:5432/tracksolid_db
```

```bash
docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
```

**Order:** `dashboard_api` (reader, lowest risk) first → confirm → then the migrators
one at a time (`ingest_worker`, then `worker`, then `webhook_receiver`), watching that
`run_migrations.py` succeeds and ingestion resumes after each.
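
The `DATABASE_URL` change in this step touches only the credentials. One way to derive the new DSN from the old one without retyping host, port or database name is sketched below with the standard library. The `swap_credentials` helper is hypothetical (it is not part of the repo), and the passwords shown are placeholders.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def swap_credentials(dsn, user, password):
    """Return dsn with only the user and password replaced.

    Scheme, host, port, database name and any query string are kept as they are.
    The credentials are percent-encoded so special characters stay URL-safe.
    """
    parts = urlsplit(dsn)
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal: urlsplit strips the brackets
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{quote(user, safe='')}:{quote(password, safe='')}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Placeholder values; real passwords come from the files generated in Step 2.
old_dsn = "postgresql://postgres:old-secret@timescale_db:5432/tracksolid_db"
print(swap_credentials(old_dsn, "dashboard_app", "example-password"))
```

Keeping the old DSN next to the new one in the deploy notes also makes the rollback a single paste.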

## Rollback (instant)

Each app's only change is its `DATABASE_URL`. If anything misbehaves, set it back to
the `postgres:…` DSN and redeploy that one app; no DB change is required. The roles are
additive. To remove one entirely, run `DROP ROLE <app>;` once nothing connects as it
and it no longer owns objects or holds grants (`REASSIGN OWNED BY` / `DROP OWNED BY`
in each database clear those first).

## After all six are migrated

- `idle_session_timeout` needs no extra step: the per-role settings applied by the
  role scripts already cover it.
- Consider **rotating the `postgres` superuser password** and restricting it to admin
  use only (it should no longer appear in any app's env).
- Re-check the budget: `SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1;`.
  No app should exceed its `CONNECTION LIMIT`, and the total should sit comfortably
  under 100. This is also when PgBouncer (separate PR) becomes optional rather than
  necessary.
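
The budget re-check can be turned into a quick comparison of observed counts against the configured limits. This is a sketch under assumptions: `check_usage` is a hypothetical helper, the sample counts are invented, and the limits are the ones set by the two role scripts. Since Postgres refuses connections beyond a role's `CONNECTION LIMIT`, the interesting signal is a role sitting *at* its limit, which suggests its pool is saturated.

```python
# CONNECTION LIMITs set by the two role scripts in this PR.
ROLE_LIMITS = {"tracksolid_owner": 30, "dashboard_app": 8, "gateway_app": 15, "cron_app": 5}
MAX_CONNECTIONS = 100


def check_usage(observed, limits=ROLE_LIMITS, max_connections=MAX_CONNECTIONS):
    """Compare observed per-role session counts with the configured limits.

    observed maps usename -> count(*) as returned by the pg_stat_activity query.
    Returns (roles at or over their limit, total sessions, remaining headroom).
    """
    saturated = sorted(
        role for role, n in observed.items() if role in limits and n >= limits[role]
    )
    total = sum(observed.values())
    return saturated, total, max_connections - total


# Made-up sample counts for illustration, not real measurements.
sample = {"tracksolid_owner": 12, "dashboard_app": 8, "gateway_app": 6, "cron_app": 1, "postgres": 2}
print(check_usage(sample))
```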
71
scripts/app_roles_fleet_platform.sql
Normal file
@@ -0,0 +1,71 @@
-- app_roles_fleet_platform.sql: dedicated NON-SUPERUSER login roles for the apps
-- that connect to the fleet_platform database as the `postgres` SUPERUSER.
-- ─────────────────────────────────────────────────────────────────────────────
-- Sibling of app_roles_tracksolid_db.sql, for the OTHER database on the same server.
-- gateway + cron (the fleet_platform Coolify app) connect here as postgres. Same
-- rationale: least privilege + a hard per-role CONNECTION LIMIT so they can't
-- exhaust the server-wide 100-connection ceiling.
--
-- Schemas in fleet_platform: auth, domain, events, geo, ops, serve, slo, state
-- (all owned by postgres). gateway (the API) and cron (scheduled jobs) almost
-- certainly READ+WRITE app state across these, so they get DML; widen/narrow per
-- the discovery step in MIGRATE_APPS_OFF_SUPERUSER.md. Unlike the sibling file,
-- this does NOT change object ownership, so it does not grant DDL on existing
-- (postgres-owned) objects; see Step 2 of the runbook if these apps run migrations.
--
-- Run as the postgres SUPERUSER, on the fleet_platform database:
--   docker exec -i <timescale_db> psql -U postgres -d fleet_platform -v ON_ERROR_STOP=1 \
--     -v gateway_pw="$(cat ~/.gateway_app.pw)" \
--     -v cron_pw="$(cat ~/.cron_app.pw)" \
--     < scripts/app_roles_fleet_platform.sql

\set ON_ERROR_STOP on

-- ── 1. Capability group (read + write across the app schemas) ───────────────────
DO $$ BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='fp_app_rw') THEN CREATE ROLE fp_app_rw NOLOGIN; END IF;
END $$;

DO $grants$
DECLARE s text;
BEGIN
  FOREACH s IN ARRAY ARRAY['auth','domain','events','geo','ops','serve','slo','state'] LOOP
    EXECUTE format('GRANT USAGE ON SCHEMA %I TO fp_app_rw', s);
    EXECUTE format('GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA %I TO fp_app_rw', s);
    EXECUTE format('GRANT USAGE, SELECT, UPDATE ON ALL SEQUENCES IN SCHEMA %I TO fp_app_rw', s);
    EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO fp_app_rw', s);
    EXECUTE format('ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA %I GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO fp_app_rw', s);
    EXECUTE format('ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA %I GRANT USAGE, SELECT, UPDATE ON SEQUENCES TO fp_app_rw', s);
    EXECUTE format('ALTER DEFAULT PRIVILEGES FOR ROLE postgres IN SCHEMA %I GRANT EXECUTE ON FUNCTIONS TO fp_app_rw', s);
  END LOOP;
END $grants$;

-- ── 2. Per-app LOGIN roles ──────────────────────────────────────────────────────
-- gateway: the request-facing API (latency-sensitive: short statement_timeout).
DO $$ BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='gateway_app') THEN
    CREATE ROLE gateway_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
  END IF; END $$;
ALTER ROLE gateway_app WITH LOGIN PASSWORD :'gateway_pw' CONNECTION LIMIT 15;
GRANT CONNECT ON DATABASE fleet_platform TO gateway_app;
GRANT fp_app_rw TO gateway_app;
ALTER ROLE gateway_app SET statement_timeout = '15s';
ALTER ROLE gateway_app SET idle_in_transaction_session_timeout = '30s';
ALTER ROLE gateway_app SET idle_session_timeout = '5min';
ALTER ROLE gateway_app SET lock_timeout = '3s';

-- cron: scheduled/background jobs (longer queries tolerated).
DO $$ BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='cron_app') THEN
    CREATE ROLE cron_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
  END IF; END $$;
ALTER ROLE cron_app WITH LOGIN PASSWORD :'cron_pw' CONNECTION LIMIT 5;
GRANT CONNECT ON DATABASE fleet_platform TO cron_app;
GRANT fp_app_rw TO cron_app;
ALTER ROLE cron_app SET statement_timeout = '120s';
ALTER ROLE cron_app SET idle_in_transaction_session_timeout = '120s';
ALTER ROLE cron_app SET idle_session_timeout = '10min';
ALTER ROLE cron_app SET lock_timeout = '5s';

-- ── 3. Verify ───────────────────────────────────────────────────────────────────
--   \du+
95
scripts/app_roles_tracksolid_db.sql
Normal file
@@ -0,0 +1,95 @@
-- app_roles_tracksolid_db.sql: get the tracksolid_db apps off the postgres SUPERUSER.
-- ─────────────────────────────────────────────────────────────────────────────
-- DESIGN (validated against the live DB, 2026-06-20):
--   * webhook_receiver, ingest_worker, worker each run `run_migrations.py` (DDL) and
--     write telemetry. `worker` is a second copy of the ingest_worker image. Because
--     they run migrations, they need to OWN the objects they ALTER. They therefore
--     connect as the shared, NON-SUPERUSER **tracksolid_owner** (the role the repo
--     already intends to own these schemas; see analytics_ro_role.sql default privs).
--   * the prod dashboard_api backend only reads → its own read role `dashboard_app`
--     (or reuse the existing dashboard_ro).
--
-- This file is idempotent. Section 2 (ownership reassignment) is Timescale-aware:
-- it skips table-linked sequences, uses ALTER MATERIALIZED VIEW for continuous
-- aggregates, and leaves reporting.v_trips with reporting_refresher. Reassigning
-- while the apps still run as postgres is SAFE: superuser bypasses ownership, so
-- nothing breaks until you flip each app's DATABASE_URL (see the runbook).
--
-- Run as the postgres SUPERUSER, on tracksolid_db:
--   docker exec -i <timescale_db> psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
--     -v owner_pw="$(cat ~/.tracksolid_owner.pw)" \
--     -v dash_pw="$(cat ~/.dashboard_app.pw)" \
--     < scripts/app_roles_tracksolid_db.sql

\set ON_ERROR_STOP on

-- ── 1. tracksolid_owner: the shared owner/migrator login for the ingestion apps ──
DO $$ BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='tracksolid_owner') THEN
    CREATE ROLE tracksolid_owner LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
  END IF; END $$;
-- LOGIN + password + a HARD connection cap (the real budget control). No
-- statement_timeout: migrations (e.g. CREATE INDEX on a hypertable) can run long.
ALTER ROLE tracksolid_owner WITH LOGIN PASSWORD :'owner_pw' CONNECTION LIMIT 30;
ALTER ROLE tracksolid_owner SET idle_in_transaction_session_timeout = '5min';
ALTER ROLE tracksolid_owner SET idle_session_timeout = '10min';
ALTER ROLE tracksolid_owner SET lock_timeout = '10s';
GRANT CONNECT ON DATABASE tracksolid_db TO tracksolid_owner;
GRANT USAGE, CREATE ON SCHEMA tracksolid, reporting, tickets, fuel TO tracksolid_owner;

-- ── 2. Reassign the app objects to tracksolid_owner (Timescale-aware, idempotent) ─
DO $reassign$
DECLARE r record; k text;
BEGIN
  FOR r IN
    SELECT n.nspname, c.relname, c.relkind,
           EXISTS (SELECT 1 FROM timescaledb_information.continuous_aggregates ca
                   WHERE ca.view_schema=n.nspname AND ca.view_name=c.relname) AS is_cagg
    FROM pg_class c JOIN pg_namespace n ON n.oid=c.relnamespace
    WHERE n.nspname IN ('tracksolid','reporting','tickets','fuel')
      AND c.relkind IN ('r','p','v','m','S')
      AND pg_get_userbyid(c.relowner) <> 'tracksolid_owner'      -- idempotent
      AND NOT (n.nspname='reporting' AND c.relname='v_trips')    -- keep with refresher
      AND NOT (c.relkind='S' AND EXISTS (                        -- skip linked seqs
            SELECT 1 FROM pg_depend d WHERE d.objid=c.oid AND d.deptype IN ('a','i')))
  LOOP
    k := CASE WHEN r.is_cagg OR r.relkind='m' THEN 'MATERIALIZED VIEW'
              WHEN r.relkind='v' THEN 'VIEW' WHEN r.relkind='S' THEN 'SEQUENCE' ELSE 'TABLE' END;
    EXECUTE format('ALTER %s %I.%I OWNER TO tracksolid_owner', k, r.nspname, r.relname);
  END LOOP;
END $reassign$;

DO $fns$
DECLARE r record;
BEGIN
  -- pg_proc also lists procedures, which ALTER FUNCTION rejects; ALTER ROUTINE
  -- (PostgreSQL 11+) accepts functions and procedures alike.
  FOR r IN SELECT p.oid::regprocedure AS sig
           FROM pg_proc p JOIN pg_namespace n ON n.oid=p.pronamespace
           WHERE n.nspname IN ('tracksolid','reporting','tickets','fuel')
             AND pg_get_userbyid(p.proowner) <> 'tracksolid_owner'
  LOOP EXECUTE format('ALTER ROUTINE %s OWNER TO tracksolid_owner', r.sig); END LOOP;
END $fns$;

-- ── 3. dashboard_app: read-only role for the prod dashboard_api backend ──────────
-- (If that backend turns out to also WRITE app state, widen via a write group like
-- the fleet_platform file; start read-only.)
DO $$ BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname='dashboard_app') THEN
    CREATE ROLE dashboard_app LOGIN INHERIT NOSUPERUSER NOCREATEDB NOCREATEROLE;
  END IF; END $$;
ALTER ROLE dashboard_app WITH LOGIN PASSWORD :'dash_pw' CONNECTION LIMIT 8;
GRANT CONNECT ON DATABASE tracksolid_db TO dashboard_app;
GRANT USAGE ON SCHEMA tracksolid, reporting, tickets, fuel TO dashboard_app;
GRANT SELECT ON ALL TABLES IN SCHEMA tracksolid, reporting, tickets, fuel TO dashboard_app;
GRANT SELECT ON reporting.v_trips TO dashboard_app;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA reporting TO dashboard_app;
ALTER DEFAULT PRIVILEGES FOR ROLE tracksolid_owner IN SCHEMA tracksolid, reporting, tickets, fuel
  GRANT SELECT ON TABLES TO dashboard_app;  -- future objects (now owned by tracksolid_owner)
ALTER ROLE dashboard_app SET statement_timeout = '30s';
ALTER ROLE dashboard_app SET idle_in_transaction_session_timeout = '60s';
ALTER ROLE dashboard_app SET idle_session_timeout = '5min';
ALTER ROLE dashboard_app SET lock_timeout = '5s';

-- ── 4. Verify ────────────────────────────────────────────────────────────────────
--   \du+ tracksolid_owner          -- LOGIN + CONNECTION LIMIT 30
--   SELECT pg_get_userbyid(relowner), count(*) FROM pg_class
--   WHERE relnamespace IN (SELECT oid FROM pg_namespace WHERE nspname='tracksolid') GROUP BY 1;