2026-05-07 21:34:10 +00:00
|
|
|
|
# pgbouncer sidecar deployment
|
2026-05-07 10:21:35 +00:00
|
|
|
|
|
|
|
|
|
|
**Date:** 2026-05-07
|
|
|
|
|
|
**Branch:** `quality-program-2026-04-12`
|
2026-05-07 21:34:10 +00:00
|
|
|
|
**Status:** Phase 1 deployed. Phase 2 (pgAdmin4 sidecar) was rolled back on 2026-05-08 — see git history (`bc020cb`, then reverted).
|
2026-05-07 10:21:35 +00:00
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Context
|
|
|
|
|
|
|
|
|
|
|
|
**Driver:** pgAdmin4 running on the maintainer's laptop has been exhausting
|
|
|
|
|
|
`tracksolid_db`'s `max_connections`. Each Query Tool tab in pgAdmin holds its
|
|
|
|
|
|
own long-lived backend connection; combined with the existing peak of ~50–60
|
|
|
|
|
|
connections from the ingest pipeline, the budget tips over and cascades —
|
|
|
|
|
|
pgcli (and anything else trying to connect) starts failing.
|
|
|
|
|
|
|
|
|
|
|
|
**Goal:** Add pgbouncer in front of `timescale_db` to enforce a connection
|
2026-05-07 21:34:10 +00:00
|
|
|
|
budget via transaction-mode pooling. Desktop pgAdmin (or any other admin
|
|
|
|
|
|
client) connects through pgbouncer and is multiplexed onto a small fixed
|
|
|
|
|
|
pool of backends.
|
2026-05-07 10:21:35 +00:00
|
|
|
|
|
|
|
|
|
|
**Frozen scope (unchanged this round):**
|
|
|
|
|
|
- DWH bronze pipeline (`dwh/*.sql`, `tracksolid_dwh@31.97.44.246:5888`)
|
|
|
|
|
|
- n8n DWH workflows (`n8n-workflows/dwh_extract*`, `dwh_load_bronze*`)
|
|
|
|
|
|
- Grafana provisioning (`grafana/provisioning/datasources/...`)
|
|
|
|
|
|
- Python ingest containers (`ingest_movement_rev.py`, `ingest_events_rev.py`,
|
|
|
|
|
|
`webhook_receiver_rev.py`) — they keep talking to `timescale_db:5432`
|
|
|
|
|
|
directly. Cutover, if desired, is a separate plan.
|
|
|
|
|
|
- `db_backup` sidecar — `pg_dump` is incompatible with transaction-mode
|
|
|
|
|
|
pooling and stays on `timescale_db:5432`.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Phase 1 — pgbouncer sidecar, no client cutover
|
|
|
|
|
|
|
|
|
|
|
|
Add a new service to `docker-compose.yaml`. Internal Docker network only;
|
|
|
|
|
|
no host port binding.
|
|
|
|
|
|
|
|
|
|
|
|
```yaml
|
|
|
|
|
|
pgbouncer:
|
2026-05-07 10:48:31 +00:00
|
|
|
|
image: edoburu/pgbouncer
|
2026-05-07 10:21:35 +00:00
|
|
|
|
restart: always
|
|
|
|
|
|
depends_on:
|
|
|
|
|
|
timescale_db:
|
|
|
|
|
|
condition: service_healthy
|
|
|
|
|
|
env_file: .env
|
|
|
|
|
|
environment:
|
|
|
|
|
|
- DB_HOST=timescale_db
|
|
|
|
|
|
- DB_PORT=5432
|
|
|
|
|
|
- DB_USER=${POSTGRES_USER}
|
|
|
|
|
|
- DB_PASSWORD=${POSTGRES_PASSWORD}
|
|
|
|
|
|
- DB_NAME=${POSTGRES_DB}
|
|
|
|
|
|
- POOL_MODE=transaction
|
|
|
|
|
|
- AUTH_TYPE=scram-sha-256
|
|
|
|
|
|
- MAX_CLIENT_CONN=200
|
|
|
|
|
|
- DEFAULT_POOL_SIZE=15
|
|
|
|
|
|
- MIN_POOL_SIZE=2
|
|
|
|
|
|
- RESERVE_POOL_SIZE=5
|
|
|
|
|
|
- SERVER_RESET_QUERY=DISCARD ALL
|
|
|
|
|
|
- SERVER_IDLE_TIMEOUT=600
|
|
|
|
|
|
- ADMIN_USERS=${POSTGRES_USER}
|
|
|
|
|
|
- LISTEN_PORT=6432
|
|
|
|
|
|
- AUTH_USER=pgbouncer
|
|
|
|
|
|
- AUTH_QUERY=SELECT uname, phash FROM public.user_lookup($$1)
|
|
|
|
|
|
healthcheck:
|
|
|
|
|
|
test: ["CMD-SHELL", "pg_isready -h 127.0.0.1 -p 6432 -U ${POSTGRES_USER}"]
|
|
|
|
|
|
interval: 30s
|
|
|
|
|
|
timeout: 5s
|
|
|
|
|
|
retries: 3
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
**Why these values:**
|
|
|
|
|
|
- `POOL_MODE=transaction` — recycles backend on every transaction boundary.
|
|
|
|
|
|
Cuts pgAdmin's per-tab idle conn from 1 backend → ~0 when idle.
|
|
|
|
|
|
- `DEFAULT_POOL_SIZE=15` — total backend slots per (user, db) pair. Sits
|
|
|
|
|
|
comfortably under Postgres `max_connections` (default 100) leaving room for
|
|
|
|
|
|
ingest's existing ~50–60.
|
|
|
|
|
|
- `MAX_CLIENT_CONN=200` — pgAdmin can open as many tabs as it wants; they
|
|
|
|
|
|
queue rather than fail.
|
|
|
|
|
|
- `RESERVE_POOL_SIZE=5` — emergency slack when `default_pool_size` saturates.
|
|
|
|
|
|
- `SERVER_RESET_QUERY=DISCARD ALL` — wipes session state between transactions
|
|
|
|
|
|
so leaked `SET`s from one client don't bleed into the next.
|
|
|
|
|
|
|
|
|
|
|
|
### Auth: SCRAM passthrough via `auth_query`
|
|
|
|
|
|
|
|
|
|
|
|
Avoids hand-maintaining `userlist.txt`. pgbouncer authenticates as a
|
|
|
|
|
|
dedicated `pgbouncer` Postgres role and looks up SCRAM hashes for the
|
|
|
|
|
|
requesting user via a SECURITY DEFINER function.
|
|
|
|
|
|
|
|
|
|
|
|
New migration `10_pgbouncer_auth.sql` (08 and 09 are taken by
|
|
|
|
|
|
`08_analytics_config.sql` and `09_trips_enrichment.sql`):
|
|
|
|
|
|
|
|
|
|
|
|
```sql
|
|
|
|
|
|
-- Role created with placeholder password; run_migrations.py:sync_role_passwords
|
|
|
|
|
|
-- replaces it with PGBOUNCER_AUTH_PASSWORD on every container startup.
|
|
|
|
|
|
-- Same convention used today for grafana_ro.
|
|
|
|
|
|
DO $$
|
|
|
|
|
|
BEGIN
|
|
|
|
|
|
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'pgbouncer') THEN
|
|
|
|
|
|
CREATE ROLE pgbouncer LOGIN PASSWORD 'SET_PASSWORD_IN_ENV';
|
|
|
|
|
|
END IF;
|
|
|
|
|
|
END
|
|
|
|
|
|
$$;
|
|
|
|
|
|
|
|
|
|
|
|
CREATE OR REPLACE FUNCTION public.user_lookup(in_user text,
|
|
|
|
|
|
OUT uname text, OUT phash text) RETURNS record AS $$
|
|
|
|
|
|
BEGIN
|
|
|
|
|
|
SELECT usename, passwd FROM pg_catalog.pg_shadow
|
|
|
|
|
|
WHERE usename = in_user INTO uname, phash;
|
|
|
|
|
|
RETURN;
|
|
|
|
|
|
END;
|
|
|
|
|
|
$$ LANGUAGE plpgsql SECURITY DEFINER;
|
|
|
|
|
|
|
|
|
|
|
|
REVOKE ALL ON FUNCTION public.user_lookup(text) FROM public;
|
|
|
|
|
|
GRANT EXECUTE ON FUNCTION public.user_lookup(text) TO pgbouncer;
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
Two changes to `run_migrations.py`:
|
|
|
|
|
|
1. Append `"10_pgbouncer_auth.sql"` to `MIGRATIONS`.
|
|
|
|
|
|
2. Extend `sync_role_passwords()` `roles` dict with
|
|
|
|
|
|
`"pgbouncer": os.getenv("PGBOUNCER_AUTH_PASSWORD")`.
|
|
|
|
|
|
|
|
|
|
|
|
The migration is applied by the next ingest container restart and recorded
|
|
|
|
|
|
in `tracksolid.schema_migrations`. `sync_role_passwords` then ALTER ROLEs
|
|
|
|
|
|
the password from the env var so the placeholder is never live.
|
|
|
|
|
|
|
|
|
|
|
|
### New env vars in `.env`
|
|
|
|
|
|
|
|
|
|
|
|
- `PGBOUNCER_AUTH_PASSWORD` — password for the new `pgbouncer` Postgres role
|
|
|
|
|
|
- (existing vars reused: `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`)
|
|
|
|
|
|
|
|
|
|
|
|
### Phase 1 verification
|
|
|
|
|
|
|
|
|
|
|
|
1. Apply migration via ingest container restart; confirm in
|
|
|
|
|
|
`tracksolid.schema_migrations` that `10_pgbouncer_auth.sql` is recorded.
|
|
|
|
|
|
2. `docker compose up -d pgbouncer`.
|
|
|
|
|
|
3. From inside any compose service:
|
|
|
|
|
|
```bash
|
|
|
|
|
|
psql -h pgbouncer -p 6432 -U postgres -d tracksolid_db -c 'SELECT 1'
|
|
|
|
|
|
```
|
|
|
|
|
|
4. From the pgbouncer container's admin console:
|
|
|
|
|
|
```bash
|
|
|
|
|
|
psql -h 127.0.0.1 -p 6432 -U postgres -d pgbouncer -c 'SHOW POOLS;'
|
|
|
|
|
|
```
|
|
|
|
|
|
Confirm pool mode = `transaction`, server connections within
|
|
|
|
|
|
`default_pool_size`.
|
|
|
|
|
|
5. `SHOW STATS;` and `SHOW CLIENTS;` should both respond.
|
|
|
|
|
|
6. Confirm no client has cut over: `tracksolid.ingestion_log` continues
|
|
|
|
|
|
accumulating; Grafana panels keep refreshing.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Files to modify / create
|
|
|
|
|
|
|
|
|
|
|
|
| Path | Change |
|
|
|
|
|
|
|---|---|
|
|
|
|
|
|
| `260507_pgbouncer_deployment.md` | THIS FILE — runbook for the rollout |
|
2026-05-07 21:34:10 +00:00
|
|
|
|
| `docker-compose.yaml` | Add `pgbouncer` service |
|
2026-05-07 10:21:35 +00:00
|
|
|
|
| `10_pgbouncer_auth.sql` | NEW — creates `pgbouncer` role + `public.user_lookup` SECURITY DEFINER function |
|
2026-05-07 21:34:10 +00:00
|
|
|
|
| `.env` | Add `PGBOUNCER_AUTH_PASSWORD` (do not commit values) |
|
|
|
|
|
|
| `docs/CONNECTIONS.md` | Add a "pgbouncer" section: pool mode, exposure, who uses it, how to connect for ad-hoc admin |
|
2026-05-07 10:21:35 +00:00
|
|
|
|
| `CLAUDE.md` §3 / §4 | Note that admin tooling now goes through `pgbouncer:6432`; ingest/grafana/backup remain direct; reference this runbook |
|
|
|
|
|
|
|
|
|
|
|
|
## Files NOT to modify (frozen scope)
|
|
|
|
|
|
|
|
|
|
|
|
- `grafana/provisioning/datasources/tracksolid_postgres.yaml`
|
|
|
|
|
|
- `n8n-workflows/dwh_extract*.json`, `n8n-workflows/dwh_load_bronze*.json`
|
|
|
|
|
|
- `dwh/*.sql`
|
|
|
|
|
|
- `ingest_movement_rev.py`, `ingest_events_rev.py`, `webhook_receiver_rev.py`,
|
|
|
|
|
|
`ts_shared_rev.py`
|
|
|
|
|
|
- `backup/` — `pg_dump` keeps using `timescale_db:5432` directly
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Reused conventions and utilities
|
|
|
|
|
|
|
|
|
|
|
|
- `run_migrations.py` already applies new `NN_*.sql` files in order against
|
|
|
|
|
|
`tracksolid_db` and tracks them in `tracksolid.schema_migrations`. Phase 1
|
|
|
|
|
|
adds `10_pgbouncer_auth.sql` to this flow — no new tooling needed.
|
|
|
|
|
|
- `env_file: .env` + `depends_on: <svc> condition: service_healthy` mirrors
|
|
|
|
|
|
the existing pattern at `docker-compose.yaml:28–31, 39–42, 50–53, 67–70,
|
|
|
|
|
|
87–90`.
|
|
|
|
|
|
- Container-name resolution rule from CLAUDE.md §3 still applies for any
|
2026-05-07 21:34:10 +00:00
|
|
|
|
`docker exec` against the new service:
|
2026-05-07 10:21:35 +00:00
|
|
|
|
```bash
|
|
|
|
|
|
docker ps --filter name=pgbouncer --format "{{.Names}}" | head -1
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Out-of-scope follow-ups (separate plans)
|
|
|
|
|
|
|
|
|
|
|
|
1. **Cut over Python ingest to pgbouncer.** Change `DATABASE_URL` in `.env`
|
|
|
|
|
|
from `timescale_db:5432` to `pgbouncer:6432`. Requires verifying psycopg2
|
|
|
|
|
|
pool + SAVEPOINTs against transaction-mode pgbouncer (low risk per
|
|
|
|
|
|
exploration — no LISTEN/NOTIFY, no advisory locks across statements, no
|
|
|
|
|
|
prepared statements in the codebase).
|
2026-05-07 21:34:10 +00:00
|
|
|
|
2. **Rotate `dwh_owner` / `grafana_ro` plaintext passwords** still in
|
2026-05-07 10:21:35 +00:00
|
|
|
|
`dwh/260423_dwh_ddl_v1.sql` (pre-existing item from CLAUDE.md §10).
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Rollback
|
|
|
|
|
|
|
2026-05-07 21:34:10 +00:00
|
|
|
|
If pgbouncer misbehaves:
|
2026-05-07 10:21:35 +00:00
|
|
|
|
|
2026-05-07 21:34:10 +00:00
|
|
|
|
1. **Stop the service without touching the rest of the stack:**
|
2026-05-07 10:21:35 +00:00
|
|
|
|
```bash
|
2026-05-07 21:34:10 +00:00
|
|
|
|
docker compose stop pgbouncer
|
|
|
|
|
|
docker compose rm -f pgbouncer
|
2026-05-07 10:21:35 +00:00
|
|
|
|
```
|
|
|
|
|
|
Ingest, Grafana, webhook, backup are unaffected — they were never cut
|
|
|
|
|
|
over.
|
|
|
|
|
|
2. **Revert the SQL migration if needed:**
|
|
|
|
|
|
```sql
|
|
|
|
|
|
DROP FUNCTION public.user_lookup(text);
|
|
|
|
|
|
DROP ROLE pgbouncer;
|
|
|
|
|
|
DELETE FROM tracksolid.schema_migrations
|
|
|
|
|
|
WHERE filename = '10_pgbouncer_auth.sql';
|
|
|
|
|
|
```
|
|
|
|
|
|
3. **Revert compose changes** by checking out the prior `docker-compose.yaml`.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## End-to-end verification checklist
|
|
|
|
|
|
|
|
|
|
|
|
- [ ] `10_pgbouncer_auth.sql` applied — visible in
|
|
|
|
|
|
`tracksolid.schema_migrations`
|
|
|
|
|
|
- [ ] `pgbouncer` service healthy — `docker compose ps` shows `healthy`
|
|
|
|
|
|
- [ ] `psql -h pgbouncer -p 6432 -U postgres -d tracksolid_db -c 'SELECT 1'`
|
|
|
|
|
|
from inside the network
|
|
|
|
|
|
- [ ] `SHOW POOLS;` in pgbouncer admin shows `transaction` mode
|
|
|
|
|
|
- [ ] Existing pipelines unaffected: `tracksolid.ingestion_log` continues
|
|
|
|
|
|
growing at current rate; Grafana dashboards still render
|
|
|
|
|
|
- [ ] pgcli no longer hits "too many connections" when used alongside pgAdmin
|