tracksolid_timescale_grafan.../260507_pgbouncer_deployment.md
David Kiania 3b79d5a62e
Some checks failed
Static Analysis / static (push) Has been cancelled
Tests / test (push) Has been cancelled
revert(infra): remove pgAdmin4 sidecar and configs
Reverts the Phase 2 pgAdmin web sidecar from bc020cb. pgbouncer (Phase 1)
stays in place. On the instance the pgadmin container has been stopped
and removed and the pgadmin-data volume dropped; Coolify subdomain and
PGADMIN_DEFAULT_* env vars to be removed in the UI separately.

Files:
- docker-compose.yaml: drop pgadmin service block + pgadmin-data volume
- pgadmin/servers.json: delete (directory removed)
- 260507_pgbouncer_deployment.md: strip Phase 2, runbook is pgbouncer-only

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-08 00:34:10 +03:00

235 lines
8.7 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# pgbouncer sidecar deployment
**Date:** 2026-05-07
**Branch:** `quality-program-2026-04-12`
**Status:** Phase 1 deployed. Phase 2 (pgAdmin4 sidecar) was rolled back on 2026-05-08 — see git history (`bc020cb`, then reverted).
---
## Context
**Driver:** pgAdmin4 running on the maintainer's laptop has been exhausting
`tracksolid_db`'s `max_connections`. Each Query Tool tab in pgAdmin holds its
own long-lived backend connection; combined with the existing peak of ~5060
connections from the ingest pipeline, the budget tips over and cascades —
pgcli (and anything else trying to connect) starts failing.
**Goal:** Add pgbouncer in front of `timescale_db` to enforce a connection
budget via transaction-mode pooling. Desktop pgAdmin (or any other admin
client) connects through pgbouncer and is multiplexed onto a small fixed
pool of backends.
**Frozen scope (unchanged this round):**
- DWH bronze pipeline (`dwh/*.sql`, `tracksolid_dwh@31.97.44.246:5888`)
- n8n DWH workflows (`n8n-workflows/dwh_extract*`, `dwh_load_bronze*`)
- Grafana provisioning (`grafana/provisioning/datasources/...`)
- Python ingest containers (`ingest_movement_rev.py`, `ingest_events_rev.py`,
`webhook_receiver_rev.py`) — they keep talking to `timescale_db:5432`
directly. Cutover, if desired, is a separate plan.
- `db_backup` sidecar — `pg_dump` is incompatible with transaction-mode
pooling and stays on `timescale_db:5432`.
---
## Phase 1 — pgbouncer sidecar, no client cutover
Add a new service to `docker-compose.yaml`. Internal Docker network only;
no host port binding.
```yaml
pgbouncer:
image: edoburu/pgbouncer
restart: always
depends_on:
timescale_db:
condition: service_healthy
env_file: .env
environment:
- DB_HOST=timescale_db
- DB_PORT=5432
- DB_USER=${POSTGRES_USER}
- DB_PASSWORD=${POSTGRES_PASSWORD}
- DB_NAME=${POSTGRES_DB}
- POOL_MODE=transaction
- AUTH_TYPE=scram-sha-256
- MAX_CLIENT_CONN=200
- DEFAULT_POOL_SIZE=15
- MIN_POOL_SIZE=2
- RESERVE_POOL_SIZE=5
- SERVER_RESET_QUERY=DISCARD ALL
- SERVER_IDLE_TIMEOUT=600
- ADMIN_USERS=${POSTGRES_USER}
- LISTEN_PORT=6432
- AUTH_USER=pgbouncer
- AUTH_QUERY=SELECT uname, phash FROM public.user_lookup($$1)
healthcheck:
test: ["CMD-SHELL", "pg_isready -h 127.0.0.1 -p 6432 -U ${POSTGRES_USER}"]
interval: 30s
timeout: 5s
retries: 3
```
**Why these values:**
- `POOL_MODE=transaction` — recycles backend on every transaction boundary.
Cuts pgAdmin's per-tab idle conn from 1 backend → ~0 when idle.
- `DEFAULT_POOL_SIZE=15` — total backend slots per (user, db) pair. Sits
comfortably under Postgres `max_connections` (default 100) leaving room for
ingest's existing ~5060.
- `MAX_CLIENT_CONN=200` — pgAdmin can open as many tabs as it wants; they
queue rather than fail.
- `RESERVE_POOL_SIZE=5` — emergency slack when `default_pool_size` saturates.
- `SERVER_RESET_QUERY=DISCARD ALL` — wipes session state between transactions
so leaked `SET`s from one client don't bleed into the next.
### Auth: SCRAM passthrough via `auth_query`
Avoids hand-maintaining `userlist.txt`. pgbouncer authenticates as a
dedicated `pgbouncer` Postgres role and looks up SCRAM hashes for the
requesting user via a SECURITY DEFINER function.
New migration `10_pgbouncer_auth.sql` (08 and 09 are taken by
`08_analytics_config.sql` and `09_trips_enrichment.sql`):
```sql
-- Role created with placeholder password; run_migrations.py:sync_role_passwords
-- replaces it with PGBOUNCER_AUTH_PASSWORD on every container startup.
-- Same convention used today for grafana_ro.
DO $$
BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'pgbouncer') THEN
CREATE ROLE pgbouncer LOGIN PASSWORD 'SET_PASSWORD_IN_ENV';
END IF;
END
$$;
CREATE OR REPLACE FUNCTION public.user_lookup(in_user text,
OUT uname text, OUT phash text) RETURNS record AS $$
BEGIN
SELECT usename, passwd FROM pg_catalog.pg_shadow
WHERE usename = in_user INTO uname, phash;
RETURN;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
REVOKE ALL ON FUNCTION public.user_lookup(text) FROM public;
GRANT EXECUTE ON FUNCTION public.user_lookup(text) TO pgbouncer;
```
Two changes to `run_migrations.py`:
1. Append `"10_pgbouncer_auth.sql"` to `MIGRATIONS`.
2. Extend `sync_role_passwords()` `roles` dict with
`"pgbouncer": os.getenv("PGBOUNCER_AUTH_PASSWORD")`.
The migration is applied by the next ingest container restart and recorded
in `tracksolid.schema_migrations`. `sync_role_passwords` then ALTER ROLEs
the password from the env var so the placeholder is never live.
### New env vars in `.env`
- `PGBOUNCER_AUTH_PASSWORD` — password for the new `pgbouncer` Postgres role
- (existing vars reused: `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`)
### Phase 1 verification
1. Apply migration via ingest container restart; confirm in
`tracksolid.schema_migrations` that `10_pgbouncer_auth.sql` is recorded.
2. `docker compose up -d pgbouncer`.
3. From inside any compose service:
```bash
psql -h pgbouncer -p 6432 -U postgres -d tracksolid_db -c 'SELECT 1'
```
4. From the pgbouncer container's admin console:
```bash
psql -h 127.0.0.1 -p 6432 -U postgres -d pgbouncer -c 'SHOW POOLS;'
```
Confirm pool mode = `transaction`, server connections within
`default_pool_size`.
5. `SHOW STATS;` and `SHOW CLIENTS;` should both respond.
6. Confirm no client has cut over: `tracksolid.ingestion_log` continues
accumulating; Grafana panels keep refreshing.
---
## Files to modify / create
| Path | Change |
|---|---|
| `260507_pgbouncer_deployment.md` | THIS FILE — runbook for the rollout |
| `docker-compose.yaml` | Add `pgbouncer` service |
| `10_pgbouncer_auth.sql` | NEW — creates `pgbouncer` role + `public.user_lookup` SECURITY DEFINER function |
| `.env` | Add `PGBOUNCER_AUTH_PASSWORD` (do not commit values) |
| `docs/CONNECTIONS.md` | Add a "pgbouncer" section: pool mode, exposure, who uses it, how to connect for ad-hoc admin |
| `CLAUDE.md` §3 / §4 | Note that admin tooling now goes through `pgbouncer:6432`; ingest/grafana/backup remain direct; reference this runbook |
## Files NOT to modify (frozen scope)
- `grafana/provisioning/datasources/tracksolid_postgres.yaml`
- `n8n-workflows/dwh_extract*.json`, `n8n-workflows/dwh_load_bronze*.json`
- `dwh/*.sql`
- `ingest_movement_rev.py`, `ingest_events_rev.py`, `webhook_receiver_rev.py`,
`ts_shared_rev.py`
- `backup/``pg_dump` keeps using `timescale_db:5432` directly
---
## Reused conventions and utilities
- `run_migrations.py` already applies new `NN_*.sql` files in order against
`tracksolid_db` and tracks them in `tracksolid.schema_migrations`. Phase 1
adds `10_pgbouncer_auth.sql` to this flow — no new tooling needed.
- `env_file: .env` + `depends_on: <svc> condition: service_healthy` mirrors
the existing pattern at `docker-compose.yaml:2831, 3942, 5053, 6770,
8790`.
- Container-name resolution rule from CLAUDE.md §3 still applies for any
`docker exec` against the new service:
```bash
docker ps --filter name=pgbouncer --format "{{.Names}}" | head -1
```
---
## Out-of-scope follow-ups (separate plans)
1. **Cut over Python ingest to pgbouncer.** Change `DATABASE_URL` in `.env`
from `timescale_db:5432` to `pgbouncer:6432`. Requires verifying psycopg2
pool + SAVEPOINTs against transaction-mode pgbouncer (low risk per
exploration — no LISTEN/NOTIFY, no advisory locks across statements, no
prepared statements in the codebase).
2. **Rotate `dwh_owner` / `grafana_ro` plaintext passwords** still in
`dwh/260423_dwh_ddl_v1.sql` (pre-existing item from CLAUDE.md §10).
---
## Rollback
If pgbouncer misbehaves:
1. **Stop the service without touching the rest of the stack:**
```bash
docker compose stop pgbouncer
docker compose rm -f pgbouncer
```
Ingest, Grafana, webhook, backup are unaffected — they were never cut
over.
2. **Revert the SQL migration if needed:**
```sql
DROP FUNCTION public.user_lookup(text);
DROP ROLE pgbouncer;
DELETE FROM tracksolid.schema_migrations
WHERE filename = '10_pgbouncer_auth.sql';
```
3. **Revert compose changes** by checking out the prior `docker-compose.yaml`.
---
## End-to-end verification checklist
- [ ] `10_pgbouncer_auth.sql` applied — visible in
`tracksolid.schema_migrations`
- [ ] `pgbouncer` service healthy — `docker compose ps` shows `healthy`
- [ ] `psql -h pgbouncer -p 6432 -U postgres -d tracksolid_db -c 'SELECT 1'`
from inside the network
- [ ] `SHOW POOLS;` in pgbouncer admin shows `transaction` mode
- [ ] Existing pipelines unaffected: `tracksolid.ingestion_log` continues
growing at current rate; Grafana dashboards still render
- [ ] pgcli no longer hits "too many connections" when used alongside pgAdmin