fleetanalytics_mcp/pgbouncer/README.md

# PgBouncer for `timescale_db` (connection pooling)

> **Scope note:** this is **stack-wide infrastructure**, shared by every service that
> talks to `timescale_db` — it is only *parked* in the analytics-MCP repo because that
> is where the "too many connections" investigation happened. It arguably belongs in
> the backend/ingestion repo (`tracksolid_timescale_grafana_prod`). Move it there when
> convenient.

## Why

The DB runs at `max_connections = 100`. About nine services each keep a persistent
pool open, and several connect as the **`postgres` superuser**, holding connections
**idle for hours**. When those pools fill under load at the same time, the total crosses
~97 (roughly 100 minus the 3 slots PostgreSQL reserves for superusers by default via
`superuser_reserved_connections`) and new connections start to fail, ending in
`FATAL: sorry, too many clients already`.

PgBouncer fixes this structurally: clients connect to PgBouncer (cheap, thousands
allowed), and it multiplexes them onto a **small, fixed set** of real backend
connections. The DB's connection count then depends on the pool size you choose, not
on how many app pools exist.

```
9 app pools  ──▶  PgBouncer :6432  ──▶  ≤25 real backends  ──▶  timescale_db :5432
 (hundreds)        (transaction mode)
```

## Files

| File | Purpose |
|---|---|
| `pgbouncer.ini` | pooling + auth config (transaction mode, `auth_query`) |
| `auth_setup.sql` | creates `pgbouncer_auth` + `pgbouncer.user_lookup()` on the DB |
| `userlist.txt.example` | how to generate the real (gitignored) `userlist.txt` |
| `docker-compose.yml` | the PgBouncer service (join the DB network) |

## Deploy (once)

```bash
# 0) on the host, generate a password for the auth role
( umask 077; openssl rand -hex 24 > ~/.pgbouncer_auth.pw )

# 1) create the auth role + lookup function (as postgres superuser)
DB=$(docker ps --filter name=timescale_db --format '{{.Names}}' | head -1)
docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1 \
  -v pgb_pw="$(cat ~/.pgbouncer_auth.pw)" < pgbouncer/auth_setup.sql

# 2) build userlist.txt from the stored verifier (formats always match this way)
docker exec -i "$DB" psql -U postgres -d tracksolid_db -tAc \
  "SELECT '\"pgbouncer_auth\" \"' || passwd || '\"' \
     FROM pg_shadow WHERE usename='pgbouncer_auth'" > pgbouncer/userlist.txt

# 3) set the real DB network name in docker-compose.yml (networks.dbnet.name), then:
docker compose -f pgbouncer/docker-compose.yml up -d
```
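
If the auth role ended up without a password, `passwd` is NULL and step 2 writes an empty line that PgBouncer cannot authenticate with. A minimal sanity check, assuming the stored verifier is either SCRAM-SHA-256 or md5 (the function name `check_userlist_line` and the sample verifier are made up for this sketch):

```bash
check_userlist_line() {
  # $1 = one line of userlist.txt; exit status 0 if it looks like: "user" "verifier"
  printf '%s\n' "$1" |
    grep -Eq '^"[^"]+" "(SCRAM-SHA-256\$[^"]+|md5[0-9a-f]{32})"$'
}

# Fake verifier, only to show the shape of a valid line:
sample='"pgbouncer_auth" "SCRAM-SHA-256$4096:c2FsdA==$c3RvcmVk:c2VydmVy"'
if check_userlist_line "$sample"; then echo "userlist line looks valid"; fi
```

Against the real file this could be run as `check_userlist_line "$(head -1 pgbouncer/userlist.txt)" || echo "userlist.txt is malformed" >&2` before starting the container.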

## Cut services over (incrementally)

Repoint each app's `DATABASE_URL` host/port from `timescale_db:5432` to
`pgbouncer:6432` — **same** dbname, user, and password — and redeploy it.
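
As a sketch of that repointing, the helper below swaps only the `host:port` part of a URL and leaves credentials and dbname untouched. The function name `repoint_to_pgbouncer` and the example credentials are hypothetical; how each app actually stores its `DATABASE_URL` is not covered here.

```bash
repoint_to_pgbouncer() {
  # $1 = current DATABASE_URL; prints the same URL pointed at PgBouncer
  printf '%s\n' "$1" |
    sed -E 's#@timescale_db:5432(/|$)#@pgbouncer:6432\1#'
}

# Example with placeholder credentials:
repoint_to_pgbouncer 'postgresql://app_user:secret@timescale_db:5432/tracksolid_db'
```

URLs that already point elsewhere pass through unchanged, so the helper is safe to run twice.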

**Migrate the superuser app pools first** (`webhook_receiver`, `ingest_worker`,
`dashboard_api` backend, `worker`/`cron`/`gateway`) — they are the heaviest
consumers. Do them one at a time and watch `SHOW POOLS;` (below).

## ⚠️ Transaction-pooling caveats — read before cutting over

`pool_mode = transaction` returns the backend to the pool at every COMMIT/ROLLBACK,
so **session-scoped features don't survive across transactions**:

- **Server-side prepared statements** — the app must not rely on them, or set the
  driver to not cache them (e.g. asyncpg `statement_cache_size=0`; libpq simple
  query / psycopg2 default is fine). PgBouncer ≥1.21 supports prepared statements in
  transaction mode if you set `max_prepared_statements > 0` — enable that if an app
  needs them.
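- **`SET`/`RESET` that must persist between transactions**, session-level `LISTEN`
  (plain `NOTIFY` still works), advisory locks held across transactions, `WITH HOLD`
  cursors, and session temp tables all break, because the next transaction may land
  on a different backend.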
- **Per-connection `options` startup GUCs are ignored** (see `ignore_startup_parameters`).
  Apps that set GUCs via the `options=` DSN param must instead pin them at the **role**
  level: `ALTER ROLE <app> SET statement_timeout = '...';` etc.
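
The role-level pinning from the last bullet can be scripted. This is a minimal sketch: the helper name `role_guc_sql` and the role `app_user` are placeholders, and the arguments are interpolated without escaping, so only pass trusted values.

```bash
role_guc_sql() {
  # $1 = role name, $2 = statement_timeout value (e.g. 30s)
  # Prints SQL that pins the GUC on the role and then shows the role's settings.
  cat <<SQL
ALTER ROLE "$1" SET statement_timeout = '$2';
SELECT rolname, rolconfig FROM pg_roles WHERE rolname = '$1';
SQL
}

role_guc_sql app_user 30s
```

The output can be piped into `docker exec -i "$DB" psql -U postgres -d tracksolid_db -v ON_ERROR_STOP=1`. Role settings apply when a backend session starts, so server connections PgBouncer already holds keep their old values until they are recycled.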

### The analytics MCP specifically

The MCP sends `options=-c default_transaction_read_only=on -c statement_timeout=30000`
on its DSN and calls `set_session(readonly=True)`. Behind transaction pooling:

- The `options` GUCs are dropped — **but** `analytics_ro` already has
  `default_transaction_read_only=on` and `statement_timeout=30s` pinned at the role
  level (`scripts/analytics_ro_role.sql`), so read-only enforcement is preserved.
- `set_session(readonly=True)` issues a `SET` that can leak across pooled clients.
  Before pointing the MCP at PgBouncer, either drop that call (role default covers it)
  or run the **MCP only in `session` pooling** (add a second `[databases]` alias with
  `pool_mode=session`). Given the MCP is a *minor* consumer, the simplest path is to
  **leave the MCP connecting directly** and pool only the heavy superuser apps.
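
If the session-pooling route is chosen instead, the extra `[databases]` entry might look like the fragment printed below. The alias name `tracksolid_db_session` and the helper `mcp_session_alias` are assumptions for illustration; the shipped `pgbouncer.ini` may name things differently.

```bash
mcp_session_alias() {
  # Prints an ini fragment to paste under [databases] in pgbouncer.ini.
  cat <<'INI'
; hypothetical alias: same database, but session pooling for the MCP only
tracksolid_db_session = host=timescale_db port=5432 dbname=tracksolid_db pool_mode=session
INI
}

mcp_session_alias
```

The MCP would then connect to `pgbouncer:6432` with `tracksolid_db_session` as its dbname, while every other app keeps using the transaction-mode entry.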

## Operating

```bash
# admin console
docker exec -it pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_auth pgbouncer
#   SHOW POOLS;     -- cl_active / cl_waiting / sv_active per pool
#   SHOW CLIENTS;   -- connected clients
#   SHOW STATS;     -- throughput

# sanity: confirm the DB now sees a small, stable backend count
docker exec -i "$DB" psql -U postgres -d tracksolid_db -c \
  "SELECT usename, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
```
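
During a cutover, the number to watch is `cl_waiting`: clients queued because no backend is free. A small filter, assuming unaligned `psql -A` output with a header row (the function name `pools_with_waiting` and the sample rows are made up):

```bash
pools_with_waiting() {
  # stdin: `psql -A` output of SHOW POOLS ('|' separated, header row first).
  # Prints "database user cl_waiting" for every pool with queued clients.
  awk -F'|' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map column names
    NF > 1 && $col["cl_waiting"] > 0 {
      print $col["database"], $col["user"], $col["cl_waiting"]
    }
  '
}

# Made-up sample input, only to show the expected shape:
printf '%s\n' \
  'database|user|cl_active|cl_waiting|sv_active' \
  'tracksolid_db|postgres|12|3|20' \
  'tracksolid_db|analytics_ro|1|0|1' \
  '(2 rows)' | pools_with_waiting
```

Live usage would be along the lines of `docker exec -i pgbouncer psql -h 127.0.0.1 -p 6432 -U pgbouncer_auth -A -c 'SHOW POOLS;' pgbouncer | pools_with_waiting`. Looking columns up by header name keeps the filter independent of the column order in a given PgBouncer version.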

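**Sizing rule:** PgBouncer keeps one pool per (database, user) pair, and each pool can
open up to `default_pool_size + reserve_pool_size` backends, so the worst case is that
sum multiplied by the number of pools. Keep it **well under** `max_connections` (100),
leaving headroom for superuser/admin/background-worker connections that bypass
PgBouncer. The shipped config (20 + 5 reserve, one database) tops out at ~25 backends
per pool; if apps connect as several different users, make sure `max_db_connections`
caps the total for the database.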
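
The arithmetic is easy to check before changing pool sizes. This sketch assumes PgBouncer's documented behaviour that `default_pool_size` and `reserve_pool_size` apply per (database, user) pool; the helper name `pgb_max_backends` is made up.

```bash
pgb_max_backends() {
  # $1 = number of (database, user) pools
  # $2 = default_pool_size, $3 = reserve_pool_size
  echo $(( $1 * ($2 + $3) ))
}

max_connections=100
worst=$(pgb_max_backends 1 20 5)
echo "worst case: $worst backends, headroom: $(( max_connections - worst ))"
```

With three distinct users on the same database, the same settings would allow up to 75 backends, which is why a per-database cap matters once more than one role is pooled.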

infra(pgbouncer): add transaction-pooling front for timescale_db

The DB is at max_connections=100 with ~9 services each holding persistent pools
(several as the postgres superuser, idle for hours), so peaks hit "too many
connections". PgBouncer multiplexes many client connections onto a small fixed set
of backends, bounding DB connections regardless of how many app pools exist.

Adds (stack-wide infra, parked in this repo for now; see README scope note):

- pgbouncer.ini: transaction pooling, auth_query pass-through, bounded pool sizes
- auth_setup.sql: pgbouncer_auth role + SECURITY DEFINER pgbouncer.user_lookup() so
  per-app passwords aren't hand-maintained
- docker-compose.yml: the service (join the existing DB network)
- userlist.txt.example + .gitignore: keep the auth verifier out of git
- README.md: deploy steps, incremental cutover (superuser apps first), and the
  transaction-pooling caveats, including the MCP-specific note (rely on role-level
  GUCs; simplest to leave the minor MCP direct and pool the heavy superuser apps)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 20:44:30 +00:00
