tracksolid_timescale_grafan.../260507_pgbouncer_deployment.md
David Kiania e811dd8f34
Some checks are pending
Static Analysis / static (push) Waiting to run
Tests / test (push) Waiting to run
feat(infra): add pgbouncer sidecar to cap tracksolid_db connections
Phase 1 of the pgbouncer + pgAdmin rollout (runbook:
260507_pgbouncer_deployment.md). pgAdmin4 on the maintainer's laptop has
been exhausting tracksolid_db's max_connections, cascading to pgcli and
operations. Adds an internal-only pgbouncer service in transaction mode
with a small backend pool (default 15) so admin-tool sprawl can no
longer starve the ingest pipeline.

No client cutover this round - ingest, Grafana, webhook, and backup all
keep talking to timescale_db:5432 directly. SCRAM passthrough is wired
via a new pgbouncer role + public.user_lookup() function (migration 10).
The role is created with a placeholder password; sync_role_passwords()
in run_migrations.py replaces it from PGBOUNCER_AUTH_PASSWORD on every
container startup, mirroring the existing grafana_ro convention.

Requires PGBOUNCER_AUTH_PASSWORD to be set in .env before deploy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 13:21:35 +03:00

328 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# pgbouncer + pgAdmin4 sidecar deployment
**Date:** 2026-05-07
**Branch:** `quality-program-2026-04-12`
**Status:** Plan approved; implementation pending
---
## Context
**Driver:** pgAdmin4 running on the maintainer's laptop has been exhausting
`tracksolid_db`'s `max_connections`. Each Query Tool tab in pgAdmin holds its
own long-lived backend connection; combined with the existing peak of ~5060
connections from the ingest pipeline, the budget tips over and cascades —
pgcli (and anything else trying to connect) starts failing.
**Goal:** Add pgbouncer in front of `timescale_db` to enforce a connection
budget via transaction-mode pooling, and deploy pgAdmin4 as a Coolify-managed
sidecar that connects through pgbouncer over the Docker network. Net effect:
pgAdmin sprawl is multiplexed onto a small fixed pool of backends, admin
tooling moves on-VM (lower latency, persistent workspace, smaller external
attack surface), and host port 5433 becomes optional/closeable in a follow-up.
**Frozen scope (unchanged this round):**
- DWH bronze pipeline (`dwh/*.sql`, `tracksolid_dwh@31.97.44.246:5888`)
- n8n DWH workflows (`n8n-workflows/dwh_extract*`, `dwh_load_bronze*`)
- Grafana provisioning (`grafana/provisioning/datasources/...`)
- Python ingest containers (`ingest_movement_rev.py`, `ingest_events_rev.py`,
`webhook_receiver_rev.py`) — they keep talking to `timescale_db:5432`
directly. Cutover, if desired, is a separate plan.
- `db_backup` sidecar — `pg_dump` is incompatible with transaction-mode
pooling and stays on `timescale_db:5432`.
---
## Phase 1 — pgbouncer sidecar, no client cutover
Add a new service to `docker-compose.yaml`. Internal Docker network only;
no host port binding.
```yaml
pgbouncer:
image: edoburu/pgbouncer:1.23.1
restart: always
depends_on:
timescale_db:
condition: service_healthy
env_file: .env
environment:
- DB_HOST=timescale_db
- DB_PORT=5432
- DB_USER=${POSTGRES_USER}
- DB_PASSWORD=${POSTGRES_PASSWORD}
- DB_NAME=${POSTGRES_DB}
- POOL_MODE=transaction
- AUTH_TYPE=scram-sha-256
- MAX_CLIENT_CONN=200
- DEFAULT_POOL_SIZE=15
- MIN_POOL_SIZE=2
- RESERVE_POOL_SIZE=5
- SERVER_RESET_QUERY=DISCARD ALL
- SERVER_IDLE_TIMEOUT=600
- ADMIN_USERS=${POSTGRES_USER}
- LISTEN_PORT=6432
- AUTH_USER=pgbouncer
- AUTH_QUERY=SELECT uname, phash FROM public.user_lookup($$1)
healthcheck:
test: ["CMD-SHELL", "pg_isready -h 127.0.0.1 -p 6432 -U ${POSTGRES_USER}"]
interval: 30s
timeout: 5s
retries: 3
```
**Why these values:**
- `POOL_MODE=transaction` — recycles backend on every transaction boundary.
Cuts pgAdmin's per-tab idle conn from 1 backend → ~0 when idle.
- `DEFAULT_POOL_SIZE=15` — total backend slots per (user, db) pair. Sits
comfortably under Postgres `max_connections` (default 100) leaving room for
ingest's existing ~5060.
- `MAX_CLIENT_CONN=200` — pgAdmin can open as many tabs as it wants; they
queue rather than fail.
- `RESERVE_POOL_SIZE=5` — emergency slack when `default_pool_size` saturates.
- `SERVER_RESET_QUERY=DISCARD ALL` — wipes session state between transactions
so leaked `SET`s from one client don't bleed into the next.
### Auth: SCRAM passthrough via `auth_query`
Avoids hand-maintaining `userlist.txt`. pgbouncer authenticates as a
dedicated `pgbouncer` Postgres role and looks up SCRAM hashes for the
requesting user via a SECURITY DEFINER function.
New migration `10_pgbouncer_auth.sql` (08 and 09 are taken by
`08_analytics_config.sql` and `09_trips_enrichment.sql`):
```sql
-- Role created with placeholder password; run_migrations.py:sync_role_passwords
-- replaces it with PGBOUNCER_AUTH_PASSWORD on every container startup.
-- Same convention used today for grafana_ro.
DO $$
BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'pgbouncer') THEN
CREATE ROLE pgbouncer LOGIN PASSWORD 'SET_PASSWORD_IN_ENV';
END IF;
END
$$;
CREATE OR REPLACE FUNCTION public.user_lookup(in_user text,
OUT uname text, OUT phash text) RETURNS record AS $$
BEGIN
SELECT usename, passwd FROM pg_catalog.pg_shadow
WHERE usename = in_user INTO uname, phash;
RETURN;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
REVOKE ALL ON FUNCTION public.user_lookup(text) FROM public;
GRANT EXECUTE ON FUNCTION public.user_lookup(text) TO pgbouncer;
```
Two changes to `run_migrations.py`:
1. Append `"10_pgbouncer_auth.sql"` to `MIGRATIONS`.
2. Extend `sync_role_passwords()` `roles` dict with
`"pgbouncer": os.getenv("PGBOUNCER_AUTH_PASSWORD")`.
The migration is applied by the next ingest container restart and recorded
in `tracksolid.schema_migrations`. `sync_role_passwords` then ALTER ROLEs
the password from the env var so the placeholder is never live.
### New env vars in `.env`
- `PGBOUNCER_AUTH_PASSWORD` — password for the new `pgbouncer` Postgres role
- (existing vars reused: `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`)
### Phase 1 verification
1. Apply migration via ingest container restart; confirm in
`tracksolid.schema_migrations` that `10_pgbouncer_auth.sql` is recorded.
2. `docker compose up -d pgbouncer`.
3. From inside any compose service:
```bash
psql -h pgbouncer -p 6432 -U postgres -d tracksolid_db -c 'SELECT 1'
```
4. From the pgbouncer container's admin console:
```bash
psql -h 127.0.0.1 -p 6432 -U postgres -d pgbouncer -c 'SHOW POOLS;'
```
Confirm pool mode = `transaction`, server connections within
`default_pool_size`.
5. `SHOW STATS;` and `SHOW CLIENTS;` should both respond.
6. Confirm no client has cut over: `tracksolid.ingestion_log` continues
accumulating; Grafana panels keep refreshing.
---
## Phase 2 — pgAdmin4 sidecar pointed at pgbouncer
Coolify UI maps an HTTPS subdomain (e.g. `pgadmin.stage.rahamafresh.com`) to
internal port 80, mirroring the Grafana pattern at `docker-compose.yaml:7880`.
```yaml
pgadmin:
image: dpage/pgadmin4:8.14
restart: always
depends_on:
pgbouncer:
condition: service_healthy
env_file: .env
environment:
- PGADMIN_DEFAULT_EMAIL=${PGADMIN_DEFAULT_EMAIL}
- PGADMIN_DEFAULT_PASSWORD=${PGADMIN_DEFAULT_PASSWORD}
- PGADMIN_CONFIG_SERVER_MODE=True
- PGADMIN_CONFIG_MASTER_PASSWORD_REQUIRED=False
- PGADMIN_DISABLE_POSTFIX=True
volumes:
- pgadmin-data:/var/lib/pgadmin
- ./pgadmin/servers.json:/pgadmin4/servers.json:ro
# COOLIFY DOMAIN LOGIC:
# Set the actual URL in the Coolify UI; service exposes port 80 internally.
```
Add `pgadmin-data` to the `volumes:` block at the bottom of the compose file.
### Pre-registered server (`pgadmin/servers.json`)
```json
{
"Servers": {
"1": {
"Name": "tracksolid_db (via pgbouncer)",
"Group": "Servers",
"Host": "pgbouncer",
"Port": 6432,
"MaintenanceDB": "tracksolid_db",
"Username": "postgres",
"SSLMode": "disable",
"ConnectionParameters": {
"sslmode": "disable",
"connect_timeout": 10
}
}
}
}
```
### New env vars in `.env`
- `PGADMIN_DEFAULT_EMAIL`
- `PGADMIN_DEFAULT_PASSWORD`
### Phase 2 verification
1. In the Coolify UI, point a subdomain at the `pgadmin` service, port 80.
2. Open the URL, log in with `PGADMIN_DEFAULT_EMAIL` /
`PGADMIN_DEFAULT_PASSWORD`.
3. The pre-registered "tracksolid_db (via pgbouncer)" server appears in the
left tree. Connect; provide the `postgres` password when prompted (pgAdmin
stores it in its own keyring after first use).
4. Open a Query Tool, run `SELECT now(), current_user;`.
5. From the `pgbouncer` container admin console:
```sql
SHOW POOLS;
```
`cl_active` should reflect the open pgAdmin tab(s); `sv_active` /
`sv_idle` should sum to ≤ `default_pool_size` (15).
6. **Stress test:** open ~30 Query Tool tabs, run `SELECT pg_sleep(0.1);` in
each. Confirm `tracksolid_db` total connection count stays bounded:
```sql
SELECT count(*) FROM pg_stat_activity;
```
Should be `default_pool_size + reserve_pool_size + (other clients)`,
not the number of pgAdmin tabs.
---
## Files to modify / create
| Path | Change |
|---|---|
| `260507_pgbouncer_deployment.md` | THIS FILE — runbook for the rollout |
| `docker-compose.yaml` | Add `pgbouncer` and `pgadmin` services; add `pgadmin-data` volume |
| `10_pgbouncer_auth.sql` | NEW — creates `pgbouncer` role + `public.user_lookup` SECURITY DEFINER function |
| `pgadmin/servers.json` | NEW — pre-registers `pgbouncer:6432` as the default server |
| `.env` | Add `PGBOUNCER_AUTH_PASSWORD`, `PGADMIN_DEFAULT_EMAIL`, `PGADMIN_DEFAULT_PASSWORD` (do not commit values) |
| `docs/CONNECTIONS.md` | Add a "pgbouncer + pgAdmin" section: pool mode, exposure, who uses it, how to connect for ad-hoc admin |
| `CLAUDE.md` §3 / §4 | Note that admin tooling now goes through `pgbouncer:6432`; ingest/grafana/backup remain direct; reference this runbook |
## Files NOT to modify (frozen scope)
- `grafana/provisioning/datasources/tracksolid_postgres.yaml`
- `n8n-workflows/dwh_extract*.json`, `n8n-workflows/dwh_load_bronze*.json`
- `dwh/*.sql`
- `ingest_movement_rev.py`, `ingest_events_rev.py`, `webhook_receiver_rev.py`,
`ts_shared_rev.py`
- `backup/``pg_dump` keeps using `timescale_db:5432` directly
---
## Reused conventions and utilities
- `run_migrations.py` already applies new `NN_*.sql` files in order against
`tracksolid_db` and tracks them in `tracksolid.schema_migrations`. Phase 1
adds `10_pgbouncer_auth.sql` to this flow — no new tooling needed.
- `env_file: .env` + `depends_on: <svc> condition: service_healthy` mirrors
the existing pattern at `docker-compose.yaml:2831, 3942, 5053, 6770,
8790`.
- Coolify domain-via-UI mirrors the Grafana comment at
`docker-compose.yaml:7880` and the webhook_receiver comment at
`docker-compose.yaml:5455`.
- Container-name resolution rule from CLAUDE.md §3 still applies for any
`docker exec` against the new services:
```bash
docker ps --filter name=pgbouncer --format "{{.Names}}" | head -1
docker ps --filter name=pgadmin --format "{{.Names}}" | head -1
```
---
## Out-of-scope follow-ups (separate plans)
1. **Cut over Python ingest to pgbouncer.** Change `DATABASE_URL` in `.env`
from `timescale_db:5432` to `pgbouncer:6432`. Requires verifying psycopg2
pool + SAVEPOINTs against transaction-mode pgbouncer (low risk per
exploration — no LISTEN/NOTIFY, no advisory locks across statements, no
prepared statements in the codebase).
2. **Close host port 5433** on `timescale_db` once pgAdmin web UI is the
established admin path. Removes the public-IP Postgres exposure entirely.
3. **Rotate `dwh_owner` / `grafana_ro` plaintext passwords** still in
`dwh/260423_dwh_ddl_v1.sql` (pre-existing item from CLAUDE.md §10).
---
## Rollback
If pgbouncer or pgAdmin misbehaves:
1. **Stop the new services without touching the rest of the stack:**
```bash
docker compose stop pgbouncer pgadmin
docker compose rm -f pgbouncer pgadmin
```
Ingest, Grafana, webhook, backup are unaffected — they were never cut
over.
2. **Revert the SQL migration if needed:**
```sql
DROP FUNCTION public.user_lookup(text);
DROP ROLE pgbouncer;
DELETE FROM tracksolid.schema_migrations
WHERE filename = '10_pgbouncer_auth.sql';
```
3. **Revert compose changes** by checking out the prior `docker-compose.yaml`.
---
## End-to-end verification checklist
- [ ] `10_pgbouncer_auth.sql` applied — visible in
`tracksolid.schema_migrations`
- [ ] `pgbouncer` service healthy — `docker compose ps` shows `healthy`
- [ ] `psql -h pgbouncer -p 6432 -U postgres -d tracksolid_db -c 'SELECT 1'`
from inside the network
- [ ] `SHOW POOLS;` in pgbouncer admin shows `transaction` mode
- [ ] `pgadmin` service healthy — Coolify domain reachable over HTTPS
- [ ] Login + query through pgAdmin succeeds
- [ ] `SELECT count(*) FROM pg_stat_activity;` stays bounded under
30-tab stress test
- [ ] Existing pipelines unaffected: `tracksolid.ingestion_log` continues
growing at current rate; Grafana dashboards still render
- [ ] pgcli no longer hits "too many connections" when used alongside pgAdmin