infra(pgbouncer): add transaction-pooling front for timescale_db

The DB is at max_connections=100 with ~9 services each holding persistent pools
(several as the postgres superuser, idle for hours), so peaks hit "too many
connections". PgBouncer multiplexes many client connections onto a small fixed
set of backends, bounding DB connections regardless of how many app pools exist.

Adds (stack-wide infra, parked in this repo for now — see README scope note):
- pgbouncer.ini: transaction pooling, auth_query pass-through, bounded pool sizes
- auth_setup.sql: pgbouncer_auth role + SECURITY DEFINER pgbouncer.user_lookup()
  so per-app passwords aren't hand-maintained
- docker-compose.yml: the service (join the existing DB network)
- userlist.txt.example + .gitignore: keep the auth verifier out of git
- README.md: deploy steps, incremental cutover (superuser apps first), and the
  transaction-pooling caveats — including the MCP-specific note (rely on role-level
  GUCs; simplest to leave the minor MCP direct and pool the heavy superuser apps)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kiania 2026-06-19 23:44:30 +03:00
parent 0355047fdd
commit b58e429c1c
6 changed files with 271 additions and 0 deletions

pgbouncer/.gitignore vendored Normal file

@ -0,0 +1,2 @@
# The real userlist holds the pgbouncer_auth verifier — never commit it.
userlist.txt

pgbouncer/README.md Normal file

@ -0,0 +1,111 @@
# PgBouncer for `timescale_db` (connection pooling)
> **Scope note:** this is **stack-wide infrastructure**, shared by every service that
> talks to `timescale_db` — it is only *parked* in the analytics-MCP repo because that
> is where the "too many connections" investigation happened. It arguably belongs in
> the backend/ingestion repo (`tracksolid_timescale_grafana_prod`). Move it there when
> convenient.
## Why
The DB runs at `max_connections = 100`. About nine services each keep a persistent
pool open — and several connect as the **`postgres` superuser**, holding connections
**idle for hours**. When those pools fill under load simultaneously, the sum crosses
~97 and new connections fail with `FATAL: sorry, too many clients already`.
PgBouncer fixes this structurally: clients connect to PgBouncer (cheap, thousands
allowed), and it multiplexes them onto a **small, fixed set** of real backend
connections. The DB's connection count then depends on the pool size you choose, not
on how many app pools exist.
```
9 app pools ──▶ PgBouncer :6432 ──▶ ≤25 real backends ──▶ timescale_db :5432
 (hundreds)     (transaction mode)
```
## Files
| File | Purpose |
|---|---|
| `pgbouncer.ini` | pooling + auth config (transaction mode, `auth_query`) |
| `auth_setup.sql` | creates `pgbouncer_auth` + `pgbouncer.user_lookup()` on the DB |
| `userlist.txt.example` | how to generate the real (gitignored) `userlist.txt` |
| `docker-compose.yml` | the PgBouncer service (join the DB network) |
## Deploy (once)
```bash
# 0) on the host, generate a password for the auth role
( umask 077; openssl rand -hex 24 > ~/.pgbouncer_auth.pw )
# 1) create the auth role + lookup function (as postgres superuser)
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)
docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
-v pgb_pw="$(cat ~/.pgbouncer_auth.pw)" < pgbouncer/auth_setup.sql
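# 2) build userlist.txt from the stored verifier (formats always match this way)
#    Caveat: a SCRAM verifier in auth_file verifies incoming clients, but PgBouncer may
#    be unable to reuse it for its own outgoing login as auth_user. If the auth_query
#    login fails, store the plaintext password from ~/.pgbouncer_auth.pw here instead.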
docker exec -i "$DB" psql -U postgres -d tracksolid_db -tAc \
"SELECT '\"pgbouncer_auth\" \"' || passwd || '\"' \
FROM pg_shadow WHERE usename='pgbouncer_auth'" > pgbouncer/userlist.txt
# 3) set the real DB network name in docker-compose.yml (networks.dbnet.name), then:
docker compose -f pgbouncer/docker-compose.yml up -d
```
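
Step 2 writes a single `"user" "secret"` line. If that file is ever generated from a script rather than from `psql`, the same format can be sketched in Python as below. `userlist_line` is a hypothetical helper name, not part of this repo, and the doubled-quote escaping is an assumption based on PgBouncer's documented userlist syntax.

```python
def userlist_line(username: str, secret: str) -> str:
    """Format one PgBouncer userlist.txt entry: "user" "secret"."""

    def quote(value: str) -> str:
        # Assumption: a literal double quote inside a field is written twice.
        return '"' + value.replace('"', '""') + '"'

    return f"{quote(username)} {quote(secret)}"


if __name__ == "__main__":
    # Placeholder verifier, abbreviated the same way as in userlist.txt.example.
    print(userlist_line("pgbouncer_auth", "SCRAM-SHA-256$4096:....$....:...."))
```

Write the result with a restrictive umask, as in step 0, since the line is a credential.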
## Cut services over (incrementally)
Repoint each app's `DATABASE_URL` host/port from `timescale_db:5432` to
`pgbouncer:6432`, keeping the **same** dbname, user, and password, and redeploy it.
**Migrate the superuser app pools first** (`webhook_receiver`, `ingest_worker`,
`dashboard_api` backend, `worker`/`cron`/`gateway`) — they are the heaviest
consumers. Do them one at a time and watch `SHOW POOLS;` (below).
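
The repointing step only touches host and port. A minimal sketch of that rewrite for URL-style DSNs, using only the standard library, is shown here; `repoint_to_pgbouncer` is a hypothetical helper and the `app:s3cret` credentials in the usage line are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def repoint_to_pgbouncer(dsn: str, host: str = "pgbouncer", port: int = 6432) -> str:
    """Return dsn with only host and port replaced; everything else is kept."""
    parts = urlsplit(dsn)
    # Rebuild the netloc by hand so the original (possibly percent-encoded)
    # user and password are carried over unchanged.
    userinfo, sep, _old_hostport = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{sep}{host}:{port}"
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(repoint_to_pgbouncer("postgresql://app:s3cret@timescale_db:5432/tracksolid_db"))
```

Key/value DSNs (`host=... port=...`) are not handled by this sketch; for those, edit the `host` and `port` keys directly.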
## ⚠️ Transaction-pooling caveats — read before cutting over
`pool_mode = transaction` returns the backend to the pool at every COMMIT/ROLLBACK,
so **session-scoped features don't survive across transactions**:
- **Server-side prepared statements**: either the app must not rely on them, or its
  driver must be told not to cache them (e.g. asyncpg `statement_cache_size=0`; libpq
  simple query / psycopg2 default is fine). PgBouncer ≥1.21 supports protocol-level
  prepared statements in transaction mode when `max_prepared_statements > 0`; enable
  that if an app needs them. SQL-level `PREPARE`/`EXECUTE` is not covered by it.
- **`SET`/`RESET` that must persist between transactions**, session `LISTEN/NOTIFY`,
advisory locks held across transactions, `WITH HOLD` cursors, session temp tables.
- **Per-connection `options` startup GUCs are ignored** (see `ignore_startup_parameters`).
Apps that set GUCs via the `options=` DSN param must instead pin them at the **role**
level: `ALTER ROLE <app> SET statement_timeout = '...';` etc.
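
To move GUCs from an `options=` string to the role level, the string has to be split into `name=value` pairs first. The sketch below does that and prints the matching `ALTER ROLE` statements for review. `role_guc_statements` is a hypothetical helper; it assumes the role name is a trusted, plain identifier and only understands `-c name=value` tokens.

```python
import shlex
from typing import List


def role_guc_statements(role: str, options: str) -> List[str]:
    """Turn a libpq options string ("-c name=value ...") into ALTER ROLE statements."""
    tokens = shlex.split(options)
    statements = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "-c" and i + 1 < len(tokens):
            setting = tokens[i + 1]
            i += 2
        elif token.startswith("-c") and len(token) > 2:
            setting = token[2:]  # "-cname=value" written without a space
            i += 1
        else:
            raise ValueError(f"unsupported option token: {token!r}")
        name, sep, value = setting.partition("=")
        if not sep or not name:
            raise ValueError(f"expected name=value, got {setting!r}")
        escaped = value.replace("'", "''")  # standard SQL string-literal escaping
        # The role name is inserted verbatim: assumed to be a trusted identifier.
        statements.append(f"ALTER ROLE {role} SET {name} = '{escaped}';")
    return statements


if __name__ == "__main__":
    for stmt in role_guc_statements(
        "analytics_ro", "-c default_transaction_read_only=on -c statement_timeout=30000"
    ):
        print(stmt)
```

Review the generated statements before running them as a superuser; role-level settings apply to every new session of that role.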
### The analytics MCP specifically
The MCP sends `options=-c default_transaction_read_only=on -c statement_timeout=30000`
on its DSN and calls `set_session(readonly=True)`. Behind transaction pooling:
- The `options` GUCs are dropped — **but** `analytics_ro` already has
`default_transaction_read_only=on` and `statement_timeout=30s` pinned at the role
level (`scripts/analytics_ro_role.sql`), so read-only enforcement is preserved.
- `set_session(readonly=True)` issues a `SET` that can leak across pooled clients.
Before pointing the MCP at PgBouncer, either drop that call (role default covers it)
or run the **MCP only in `session` pooling** (add a second `[databases]` alias with
`pool_mode=session`). Given the MCP is a *minor* consumer, the simplest path is to
**leave the MCP connecting directly** and pool only the heavy superuser apps.
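
Why a session-level `SET` is unsafe here can be seen in a toy model. The sketch below is not PgBouncer and does not reproduce its backend scheduling; `Backend` and `ToyTransactionPool` are made-up names that only model "one backend per transaction, session state stays on the backend".

```python
from typing import Callable, Dict, List, Optional


class Backend:
    """Stand-in for one real server connection with its own session state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.session_gucs: Dict[str, str] = {}


class ToyTransactionPool:
    """Lends a backend to a client for one transaction, then takes it back."""

    def __init__(self, backends: List[Backend]) -> None:
        self._idle = list(backends)

    def run_transaction(self, work: Callable[[Backend], Optional[str]]) -> Optional[str]:
        backend = self._idle.pop(0)  # borrowed for this transaction only
        try:
            return work(backend)
        finally:
            self._idle.append(backend)  # returned at COMMIT/ROLLBACK, state untouched


def set_read_only(backend: Backend) -> None:
    backend.session_gucs["default_transaction_read_only"] = "on"


def read_read_only(backend: Backend) -> Optional[str]:
    return backend.session_gucs.get("default_transaction_read_only")


def demo() -> Dict[str, Optional[str]]:
    pool = ToyTransactionPool([Backend("b1"), Backend("b2")])
    pool.run_transaction(set_read_only)  # client A: the SET lands on b1
    seen_by_a = pool.run_transaction(read_read_only)  # client A again, now on b2
    seen_by_b = pool.run_transaction(read_read_only)  # client B inherits b1
    return {"client_a_next_txn": seen_by_a, "client_b": seen_by_b}


if __name__ == "__main__":
    print(demo())
```

In this model the client that issued the `SET` loses it on its next transaction, while an unrelated client picks it up. Role-level defaults avoid both effects because every backend session for that role starts with them.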
## Operating
```bash
# admin console
docker exec -it pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_auth pgbouncer
# SHOW POOLS; -- cl_active / sv_active / waiting per pool
# SHOW CLIENTS; -- connected clients
# SHOW STATS; -- throughput
# sanity: confirm the DB now sees a small, stable backend count
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)
docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
"SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
```
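
During a cutover, the signal to watch in `SHOW POOLS;` is clients queued for a backend. A minimal sketch of that check is shown here, assuming the rows have already been fetched into dicts keyed by the console's column names (`database`, `user`, `cl_waiting`). `saturated_pools` is a hypothetical helper and the sample rows are invented.

```python
from typing import Any, Iterable, List, Mapping


def saturated_pools(rows: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return "database/user" for every pool that has clients waiting for a backend."""
    flagged = []
    for row in rows:
        if int(row["cl_waiting"]) > 0:
            flagged.append(f'{row["database"]}/{row["user"]}')
    return flagged


if __name__ == "__main__":
    # Invented sample rows, shaped like SHOW POOLS output.
    sample = [
        {"database": "tracksolid_db", "user": "postgres", "cl_waiting": 4},
        {"database": "tracksolid_db", "user": "analytics_ro", "cl_waiting": 0},
    ]
    print(saturated_pools(sample))
```

A pool that stays in this list under normal load is a hint that its size, or the app's transaction length, needs another look.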
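**Sizing rule:** PgBouncer keeps one pool per **(database, user) pair**, so the most
backends it can open = `(number of database/user pairs) × (default_pool_size +
reserve_pool_size)`. Keep that **well under** `max_connections` (100), leaving headroom
for superuser/admin/background-worker connections that bypass PgBouncer. The shipped
config (20 + 5 reserve, one database) tops out at ~25 backends **per connecting role**;
set `max_db_connections` if the database as a whole needs a hard cap.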
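
The sizing arithmetic can be checked with a few lines of Python. This is a rough sketch that assumes PgBouncer's documented behavior of one pool per (database, user) pair and treats `max_db_connections = 0` as "no cap"; `worst_case_backends` is a hypothetical helper.

```python
def worst_case_backends(
    pool_pairs: int,
    default_pool_size: int,
    reserve_pool_size: int,
    max_db_connections: int = 0,
) -> int:
    """Upper bound on backend connections PgBouncer could open to one database."""
    uncapped = pool_pairs * (default_pool_size + reserve_pool_size)
    if max_db_connections > 0:
        # A per-database cap applies regardless of how many users connect.
        return min(uncapped, max_db_connections)
    return uncapped


if __name__ == "__main__":
    # Shipped values: default_pool_size = 20, reserve_pool_size = 5.
    for pairs in (1, 3):
        print(pairs, worst_case_backends(pairs, 20, 5))
```

With the shipped values, three distinct roles connecting through PgBouncer could reach 75 backends without a cap, which is why the number of connecting roles matters as much as the pool size.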

pgbouncer/auth_setup.sql Normal file

@ -0,0 +1,40 @@
-- auth_setup.sql — create the PgBouncer auth_query plumbing on tracksolid_db.
-- ─────────────────────────────────────────────────────────────────────────────
-- Run ONCE as the postgres SUPERUSER (the SECURITY DEFINER function must be owned by
-- a superuser to read pg_shadow). Apply with:
-- docker exec -i <timescale_db> psql -U postgres -d tracksolid_db \
-- -v ON_ERROR_STOP=1 -v pgb_pw="$(cat ~/.pgbouncer_auth.pw)" < auth_setup.sql
--
-- This lets PgBouncer authenticate ANY app user by looking its stored SCRAM verifier
-- up at connect time — so you never hand-maintain a userlist of every app password.
-- Only the pgbouncer_auth role itself needs an entry in userlist.txt.
\set ON_ERROR_STOP on
-- 1) A minimal LOGIN role PgBouncer uses to run the lookup. No other privileges.
DO $$
BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'pgbouncer_auth') THEN
    CREATE ROLE pgbouncer_auth LOGIN NOSUPERUSER NOCREATEDB NOCREATEROLE;
  END IF;
END $$;
ALTER ROLE pgbouncer_auth WITH LOGIN PASSWORD :'pgb_pw';
-- 2) The lookup function. SECURITY DEFINER (owned by postgres) so it can read
-- pg_shadow; returns exactly (username, scram_verifier) for one user.
CREATE SCHEMA IF NOT EXISTS pgbouncer AUTHORIZATION postgres;
CREATE OR REPLACE FUNCTION pgbouncer.user_lookup(
  IN i_username text, OUT uname text, OUT phash text
) RETURNS record
  LANGUAGE sql
  SECURITY DEFINER
  SET search_path = pg_catalog
AS $$
  SELECT usename, passwd FROM pg_catalog.pg_shadow WHERE usename = i_username;
$$;
-- 3) Lock the function down to ONLY pgbouncer_auth.
REVOKE ALL ON FUNCTION pgbouncer.user_lookup(text) FROM public;
GRANT USAGE ON SCHEMA pgbouncer TO pgbouncer_auth;
GRANT EXECUTE ON FUNCTION pgbouncer.user_lookup(text) TO pgbouncer_auth;

pgbouncer/docker-compose.yml Normal file

@ -0,0 +1,40 @@
# docker-compose.yml — PgBouncer in front of timescale_db.
# ─────────────────────────────────────────────────────────────────────────────
# Deploy:
# 1. Apply auth_setup.sql to the DB as postgres (creates pgbouncer_auth + lookup fn).
# 2. Generate pgbouncer/userlist.txt (see userlist.txt.example).
# 3. Put this stack on the SAME docker network as timescale_db so `timescale_db`
# resolves (the tracksolid stack's network — the one with the 10.0.15.x addrs).
# 4. `docker compose -f pgbouncer/docker-compose.yml up -d`
# 5. Repoint each app's DSN host:port from timescale_db:5432 → pgbouncer:6432
# (same dbname/user/password) and redeploy it. Migrate the SUPERUSER app pools
# first — they are the heaviest consumers.
services:
  pgbouncer:
    image: edoburu/pgbouncer:latest   # pin to a digest/tag in prod
    container_name: pgbouncer
    restart: unless-stopped
    networks: [dbnet]
    ports:
      - "6432:6432"   # drop this if only in-network apps connect
    volumes:
      - ./pgbouncer.ini:/etc/pgbouncer/pgbouncer.ini:ro
      - ./userlist.txt:/etc/pgbouncer/userlist.txt:ro
    logging:
      driver: json-file
      options: { max-size: "10m", max-file: "5" }
    healthcheck:
      # pg_isready needs no credentials (a psql login here would require the
      # pgbouncer_auth password); a reply proves PgBouncer is accepting connections.
      # Assumes pg_isready ships in the image alongside psql.
      test: ["CMD-SHELL", "pg_isready -h 127.0.0.1 -p 6432 -U pgbouncer_auth -d pgbouncer || exit 1"]
      interval: 30s
      timeout: 3s
      retries: 3
      start_period: 10s
networks:
  # Attach to the EXISTING network that can reach timescale_db (external = pre-created
  # by the tracksolid/Coolify stack). Set the real name here, e.g. the network shown by
  #   docker inspect timescale_db --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{"\n"}}{{end}}'
  dbnet:
    external: true
    name: CHANGE_ME_tracksolid_db_network

pgbouncer/pgbouncer.ini Normal file

@ -0,0 +1,64 @@
; pgbouncer.ini — transaction-pooling front for timescale_db (tracksolid_db).
; ─────────────────────────────────────────────────────────────────────────────
; Purpose: the DB runs at max_connections=100 and ~9 stack services each hold a
; persistent pool (several as the postgres superuser, idle for hours), so peaks hit
; "too many connections". PgBouncer multiplexes MANY client connections onto a SMALL
; set of real backend connections, so the DB connection count stays bounded no matter
; how many app pools exist.
;
; Auth uses auth_query (NOT a hand-maintained userlist of every app): PgBouncer logs
; in as `pgbouncer_auth` and looks each user's verifier up via pgbouncer.user_lookup()
; — see auth_setup.sql. Only the pgbouncer_auth verifier lives in userlist.txt.
[databases]
; Apps point their DSN host at pgbouncer:6432 with the SAME dbname/user/password.
; `host` here is the real DB (the timescale_db container hostname on the DB network).
tracksolid_db = host=timescale_db port=5432 dbname=tracksolid_db
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; ── Auth (pass-through via auth_query) ──────────────────────────────────────
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
auth_user = pgbouncer_auth
auth_query = SELECT uname, phash FROM pgbouncer.user_lookup($1)
; ── Pooling ─────────────────────────────────────────────────────────────────
; transaction mode = a server connection is returned to the pool at COMMIT/ROLLBACK,
; so a handful of backends serve hundreds of clients. See README for the feature
; caveats (no session-level prepared statements / SET that persists across txns).
pool_mode = transaction
; Pools are per (database, user) pair, so the most backend connections PgBouncer can
; open to the DB = (number of database/user pairs) × (default_pool_size +
; reserve_pool_size). Keep the SUM across all poolers well under the DB's
; max_connections (100), leaving headroom for superuser/admin/background-worker
; connections that bypass PgBouncer. With one database, 20 + 5 reserve = 25 backends
; max per connecting role; set max_db_connections for a hard per-database cap.
default_pool_size = 20
min_pool_size = 0
reserve_pool_size = 5
reserve_pool_timeout = 3
; Clients can be plentiful (cheap) — only backends are scarce.
max_client_conn = 2000
; Recycle idle/old server connections so none linger for hours.
server_idle_timeout = 300
server_lifetime = 3600
; ── Robustness ──────────────────────────────────────────────────────────────
; Apps set per-connection GUCs via the `options` startup param (e.g. the analytics
; MCP sends `-c default_transaction_read_only=on -c statement_timeout=...`). In
; transaction pooling those can't be honored per-shared-backend, so ignore them and
; rely on ROLE-level settings (ALTER ROLE ... SET ...) instead. See README.
ignore_startup_parameters = extra_float_digits,options,search_path
; Admin/stats console (psql -p 6432 pgbouncer) — restricted to the auth role.
admin_users = pgbouncer_auth
stats_users = pgbouncer_auth
; Quiet by default; flip to 1 temporarily when debugging.
log_connections = 0
log_disconnections = 0

pgbouncer/userlist.txt.example Normal file

@ -0,0 +1,14 @@
# userlist.txt — ONLY the pgbouncer_auth role needs an entry; every other user is
# resolved at connect time by auth_query (pgbouncer.user_lookup). See README.
#
# The real userlist.txt is gitignored (it holds a credential). Generate it from the
# pgbouncer_auth password you set in auth_setup.sql — PgBouncer accepts the verifier
# in SCRAM form. Easiest: copy the stored verifier straight from Postgres so the
# formats always match:
#
# docker exec -i <timescale_db> psql -U postgres -d tracksolid_db -tAc \
# "SELECT '\"pgbouncer_auth\" \"' || passwd || '\"' \
# FROM pg_shadow WHERE usename='pgbouncer_auth'" > pgbouncer/userlist.txt
#
# That yields a line of the form (SCRAM-SHA-256 verifier shown abbreviated):
"pgbouncer_auth" "SCRAM-SHA-256$4096:....$....:...."