infra(db-roles): dedicated non-superuser roles for the six superuser apps #3

Open
kianiadee wants to merge 2 commits from infra/app-db-roles into main
Owner

Summary

Dedicated non-superuser Postgres roles for the six service connections that currently run as the postgres superuser, which is the root of the "too many connections" peaks and a standing least-privilege risk.

Superuser sessions can consume the superuser_reserved_connections slots and ignore per-role caps, so the 100-slot ceiling can fill with no admin headroom. Each new role gets a hard CONNECTION LIMIT + bounded timeouts, so the budget becomes bounded and visible.

The six connections (confirmed live)

| Service | Database | New role | Conn limit |
|---|---|---|---|
| webhook_receiver | tracksolid_db | webhook_app (write) | 10 |
| ingest_worker | tracksolid_db | ingest_app (write) | 10 |
| worker | tracksolid_db | worker_app (read) | 5 |
| dashboard_api backend | tracksolid_db | dashboard_app (read) | 8 |
| gateway | fleet_platform | gateway_app (rw) | 15 |
| cron | fleet_platform | cron_app (rw) | 5 |

Budget: new 53 + existing readers ~28 ≈ 81 < 100 (gateway/cron use a separate DB but the same server, so they count too).
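The budget arithmetic can be re-checked with a short script. This is only a sketch: the per-role limits are copied from the table above, the figure of 28 existing reader connections is the approximate value quoted in the budget line, and `connection_budget` is a hypothetical helper name, not something in this PR.

```python
# Per-role CONNECTION LIMIT values, copied from the table in this description.
ROLE_LIMITS = {
    "webhook_app": 10,
    "ingest_app": 10,
    "worker_app": 5,
    "dashboard_app": 8,
    "gateway_app": 15,
    "cron_app": 5,
}

EXISTING_READERS = 28  # approximate figure quoted in the budget line
MAX_CONNECTIONS = 100  # server-wide ceiling cited in the summary


def connection_budget(role_limits, existing, ceiling):
    """Return (new_total, grand_total, headroom) for a set of per-role caps."""
    new_total = sum(role_limits.values())
    grand_total = new_total + existing
    return new_total, grand_total, ceiling - grand_total


new_total, grand_total, headroom = connection_budget(
    ROLE_LIMITS, EXISTING_READERS, MAX_CONNECTIONS
)
print(f"new roles: {new_total}, with existing readers: {grand_total}, headroom: {headroom}")
```

Both databases share one server, so the limits are summed across all six roles rather than per database.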

Files

  • scripts/app_roles_tracksolid_db.sql: ts_app_read / ts_app_write capability groups + the four login roles, NOSUPERUSER, with CONNECTION LIMIT and per-role GUCs (statement_timeout, idle_session_timeout, idle_in_transaction_session_timeout, lock_timeout).
  • scripts/app_roles_fleet_platform.sql: fp_app_rw over the fleet_platform schemas (auth/domain/events/geo/ops/serve/slo/state) + gateway_app / cron_app.
  • scripts/MIGRATE_APPS_OFF_SUPERUSER.md: the runbook, covering discovery (what each app writes / whether it runs DDL), the connection-budget table, the object-ownership step for migration-running apps (reassign the app schemas to the existing tracksolid_owner; scoped, never REASSIGN OWNED BY postgres globally), one-at-a-time cutover order, and instant rollback (revert DATABASE_URL only).
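To make the shape of a per-app role concrete, the sketch below renders the kind of statements the role scripts are described as containing: a NOSUPERUSER login role with a CONNECTION LIMIT, membership in a capability group, and per-role GUCs. It is not the content of the actual scripts. The `render_login_role` helper and every timeout value are assumptions chosen for illustration, and credential setup is deliberately left out.

```python
def render_login_role(name, group, conn_limit, gucs):
    """Render Postgres statements for one capped, non-superuser login role.

    Identifiers are interpolated without quoting, so this is only suitable
    for trusted, hard-coded inputs such as the ones below.
    """
    stmts = [
        f"CREATE ROLE {name} LOGIN NOSUPERUSER CONNECTION LIMIT {conn_limit};",
        f"GRANT {group} TO {name};",
    ]
    # Per-role settings apply to every new session opened by the role.
    for setting, value in gucs.items():
        stmts.append(f"ALTER ROLE {name} SET {setting} = '{value}';")
    return stmts


# Timeout values here are hypothetical placeholders, not taken from the PR.
EXAMPLE_GUCS = {
    "statement_timeout": "30s",
    "idle_session_timeout": "10min",
    "idle_in_transaction_session_timeout": "60s",
    "lock_timeout": "5s",
}

statements = render_login_role("dashboard_app", "ts_app_read", 8, EXAMPLE_GUCS)
print("\n".join(statements))
```

Setting the timeouts with `ALTER ROLE ... SET` rather than in each app's connection string keeps the bounds on the server side, so they still hold if an app's configuration drifts.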

Honest caveats

  • Grants are best-effort by app function (ingestion=write telemetry; gateway/cron=RW app state; worker/dashboard=read). The runbook's discovery step (Step 1) must confirm each before cutover — widen on permission denied.
  • All operational objects are owned by postgres, so these roles can write rows but not run DDL on existing tables. Apps that migrate at deploy need the ownership step (runbook Step 3).
  • Nothing is applied. SQL is drafted and structurally checked; I did not run role DDL against prod (it's gated). Happy to validate it in a rolled-back transaction on request.

Relationship to the other PRs

  • PR #1 — MCP reliability/security/build + footprint.
  • PR #2 — PgBouncer (optional once these roles + limits are in).
  • This PR removes the actual cause (superuser pools) and bounds each app's connections.

🤖 Generated with Claude Code

kianiadee added 1 commit 2026-06-19 20:52:20 +00:00
Six service connections run as the postgres SUPERUSER across two databases on the
shared 100-connection server — the root of the "too many connections" peaks and a
standing least-privilege risk. Superuser sessions ignore per-role CONNECTION LIMIT
and can consume the superuser-reserved slots.

Drafts (apply as postgres; nothing applied here):
- scripts/app_roles_tracksolid_db.sql — webhook_app, ingest_app, worker_app,
  dashboard_app. Capability groups (ts_app_read / ts_app_write), per-app NOSUPERUSER
  login roles with hard CONNECTION LIMIT + bounded GUCs (statement_timeout,
  idle_session_timeout, idle_in_transaction, lock_timeout).
- scripts/app_roles_fleet_platform.sql — gateway_app, cron_app (the apps on the
  separate fleet_platform DB), fp_app_rw group over its schemas.
- scripts/MIGRATE_APPS_OFF_SUPERUSER.md — runbook: discovery (what each app actually
  writes / whether it runs DDL), connection-budget table (sum ≈ 81 < 100), the
  object-ownership step for migration-running apps (reassign app schemas to the
  existing tracksolid_owner — scoped, never REASSIGN OWNED globally), one-at-a-time
  cutover, and instant rollback (DATABASE_URL only).

Grants are best-effort by app function and explicitly call out where to verify before
cutover; all objects are postgres-owned, so row DML works but DDL needs the ownership
step. See the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kianiadee added 1 commit 2026-06-19 21:08:57 +00:00
Discovery (live) corrected the design: webhook_receiver, ingest_worker, and worker
all run run_migrations.py (DDL) and write telemetry — worker is the same image as
ingest_worker, not a reader. Because they ALTER objects they must own them, so all
three connect as the shared non-superuser tracksolid_owner (the role the repo already
intends to own these schemas). dashboard_api backend stays a reader (dashboard_app).

- app_roles_tracksolid_db.sql rewritten: tracksolid_owner LOGIN + CONNECTION LIMIT 30
  + GUCs + USAGE/CREATE; Timescale-aware ownership reassignment (skips table-linked
  sequences, ALTER MATERIALIZED VIEW for continuous aggregates, leaves reporting.v_trips
  with reporting_refresher, reassigns functions); dashboard_app read role.
- Reassignment validated in a rolled-back transaction on the live DB: reassigns the
  31-chunk position_history hypertable + the v_mileage_daily_cagg continuous aggregate,
  and as tracksolid_owner can ALTER the hypertable and create/drop tables.
- Runbook updated: discovery marked done, ownership folded into the apply (safe while
  apps still run as postgres — superuser bypasses ownership), corrected cutover order.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reference: kianiadee/fleetanalytics_mcp#3