tracksolid_timescale_grafan.../260507_pgbouncer_deployment.md
David Kiania f3ad612a1c
Some checks are pending
Static Analysis / static (push) Waiting to run
Tests / test (push) Waiting to run
fix(infra): drop pgbouncer image tag — pin 1.23.1 unavailable
The pinned tag failed to pull on Coolify deploy. Switching to the
untagged edoburu/pgbouncer (rolling latest) so the sidecar can come up.
Will revisit pinning to a known-good tag once verified live.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 13:48:31 +03:00

12 KiB
Raw Blame History

pgbouncer + pgAdmin4 sidecar deployment

Date: 2026-05-07 Branch: quality-program-2026-04-12 Status: Plan approved; implementation pending


Context

Driver: pgAdmin4 running on the maintainer's laptop has been exhausting tracksolid_db's max_connections. Each Query Tool tab in pgAdmin holds its own long-lived backend connection; combined with the existing peak of ~5060 connections from the ingest pipeline, the budget tips over and cascades — pgcli (and anything else trying to connect) starts failing.

Goal: Add pgbouncer in front of timescale_db to enforce a connection budget via transaction-mode pooling, and deploy pgAdmin4 as a Coolify-managed sidecar that connects through pgbouncer over the Docker network. Net effect: pgAdmin sprawl is multiplexed onto a small fixed pool of backends, admin tooling moves on-VM (lower latency, persistent workspace, smaller external attack surface), and host port 5433 becomes optional/closeable in a follow-up.

Frozen scope (unchanged this round):

  • DWH bronze pipeline (dwh/*.sql, tracksolid_dwh@31.97.44.246:5888)
  • n8n DWH workflows (n8n-workflows/dwh_extract*, dwh_load_bronze*)
  • Grafana provisioning (grafana/provisioning/datasources/...)
  • Python ingest containers (ingest_movement_rev.py, ingest_events_rev.py, webhook_receiver_rev.py) — they keep talking to timescale_db:5432 directly. Cutover, if desired, is a separate plan.
  • db_backup sidecar — pg_dump is incompatible with transaction-mode pooling and stays on timescale_db:5432.

Phase 1 — pgbouncer sidecar, no client cutover

Add a new service to docker-compose.yaml. Internal Docker network only; no host port binding.

  pgbouncer:
    image: edoburu/pgbouncer
    restart: always
    depends_on:
      timescale_db:
        condition: service_healthy
    env_file: .env
    environment:
      - DB_HOST=timescale_db
      - DB_PORT=5432
      - DB_USER=${POSTGRES_USER}
      - DB_PASSWORD=${POSTGRES_PASSWORD}
      - DB_NAME=${POSTGRES_DB}
      - POOL_MODE=transaction
      - AUTH_TYPE=scram-sha-256
      - MAX_CLIENT_CONN=200
      - DEFAULT_POOL_SIZE=15
      - MIN_POOL_SIZE=2
      - RESERVE_POOL_SIZE=5
      - SERVER_RESET_QUERY=DISCARD ALL
      - SERVER_IDLE_TIMEOUT=600
      - ADMIN_USERS=${POSTGRES_USER}
      - LISTEN_PORT=6432
      - AUTH_USER=pgbouncer
      - AUTH_QUERY=SELECT uname, phash FROM public.user_lookup($$1)
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -h 127.0.0.1 -p 6432 -U ${POSTGRES_USER}"]
      interval: 30s
      timeout: 5s
      retries: 3

Why these values:

  • POOL_MODE=transaction — recycles backend on every transaction boundary. Cuts pgAdmin's per-tab idle conn from 1 backend → ~0 when idle.
  • DEFAULT_POOL_SIZE=15 — total backend slots per (user, db) pair. Sits comfortably under Postgres max_connections (default 100) leaving room for ingest's existing ~5060.
  • MAX_CLIENT_CONN=200 — pgAdmin can open as many tabs as it wants; they queue rather than fail.
  • RESERVE_POOL_SIZE=5 — emergency slack when default_pool_size saturates.
  • SERVER_RESET_QUERY=DISCARD ALL — wipes session state between transactions so leaked SETs from one client don't bleed into the next.

Auth: SCRAM passthrough via auth_query

Avoids hand-maintaining userlist.txt. pgbouncer authenticates as a dedicated pgbouncer Postgres role and looks up SCRAM hashes for the requesting user via a SECURITY DEFINER function.

New migration 10_pgbouncer_auth.sql (08 and 09 are taken by 08_analytics_config.sql and 09_trips_enrichment.sql):

-- Role created with placeholder password; run_migrations.py:sync_role_passwords
-- replaces it with PGBOUNCER_AUTH_PASSWORD on every container startup.
-- Same convention used today for grafana_ro.
DO $$
BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'pgbouncer') THEN
    CREATE ROLE pgbouncer LOGIN PASSWORD 'SET_PASSWORD_IN_ENV';
  END IF;
END
$$;

CREATE OR REPLACE FUNCTION public.user_lookup(in_user text,
    OUT uname text, OUT phash text) RETURNS record AS $$
BEGIN
  SELECT usename, passwd FROM pg_catalog.pg_shadow
   WHERE usename = in_user INTO uname, phash;
  RETURN;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

REVOKE ALL ON FUNCTION public.user_lookup(text) FROM public;
GRANT EXECUTE ON FUNCTION public.user_lookup(text) TO pgbouncer;

Two changes to run_migrations.py:

  1. Append "10_pgbouncer_auth.sql" to MIGRATIONS.
  2. Extend sync_role_passwords() roles dict with "pgbouncer": os.getenv("PGBOUNCER_AUTH_PASSWORD").

The migration is applied by the next ingest container restart and recorded in tracksolid.schema_migrations. sync_role_passwords then ALTER ROLEs the password from the env var so the placeholder is never live.

New env vars in .env

  • PGBOUNCER_AUTH_PASSWORD — password for the new pgbouncer Postgres role
  • (existing vars reused: POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB)

Phase 1 verification

  1. Apply migration via ingest container restart; confirm in tracksolid.schema_migrations that 10_pgbouncer_auth.sql is recorded.
  2. docker compose up -d pgbouncer.
  3. From inside any compose service:
    psql -h pgbouncer -p 6432 -U postgres -d tracksolid_db -c 'SELECT 1'
    
  4. From the pgbouncer container's admin console:
    psql -h 127.0.0.1 -p 6432 -U postgres -d pgbouncer -c 'SHOW POOLS;'
    
    Confirm pool mode = transaction, server connections within default_pool_size.
  5. SHOW STATS; and SHOW CLIENTS; should both respond.
  6. Confirm no client has cut over: tracksolid.ingestion_log continues accumulating; Grafana panels keep refreshing.

Phase 2 — pgAdmin4 sidecar pointed at pgbouncer

Coolify UI maps an HTTPS subdomain (e.g. pgadmin.stage.rahamafresh.com) to internal port 80, mirroring the Grafana pattern at docker-compose.yaml:7880.

  pgadmin:
    image: dpage/pgadmin4:8.14
    restart: always
    depends_on:
      pgbouncer:
        condition: service_healthy
    env_file: .env
    environment:
      - PGADMIN_DEFAULT_EMAIL=${PGADMIN_DEFAULT_EMAIL}
      - PGADMIN_DEFAULT_PASSWORD=${PGADMIN_DEFAULT_PASSWORD}
      - PGADMIN_CONFIG_SERVER_MODE=True
      - PGADMIN_CONFIG_MASTER_PASSWORD_REQUIRED=False
      - PGADMIN_DISABLE_POSTFIX=True
    volumes:
      - pgadmin-data:/var/lib/pgadmin
      - ./pgadmin/servers.json:/pgadmin4/servers.json:ro
    # COOLIFY DOMAIN LOGIC:
    # Set the actual URL in the Coolify UI; service exposes port 80 internally.

Add pgadmin-data to the volumes: block at the bottom of the compose file.

Pre-registered server (pgadmin/servers.json)

{
  "Servers": {
    "1": {
      "Name": "tracksolid_db (via pgbouncer)",
      "Group": "Servers",
      "Host": "pgbouncer",
      "Port": 6432,
      "MaintenanceDB": "tracksolid_db",
      "Username": "postgres",
      "SSLMode": "disable",
      "ConnectionParameters": {
        "sslmode": "disable",
        "connect_timeout": 10
      }
    }
  }
}

New env vars in .env

  • PGADMIN_DEFAULT_EMAIL
  • PGADMIN_DEFAULT_PASSWORD

Phase 2 verification

  1. In the Coolify UI, point a subdomain at the pgadmin service, port 80.
  2. Open the URL, log in with PGADMIN_DEFAULT_EMAIL / PGADMIN_DEFAULT_PASSWORD.
  3. The pre-registered "tracksolid_db (via pgbouncer)" server appears in the left tree. Connect; provide the postgres password when prompted (pgAdmin stores it in its own keyring after first use).
  4. Open a Query Tool, run SELECT now(), current_user;.
  5. From the pgbouncer container admin console:
    SHOW POOLS;
    
    cl_active should reflect the open pgAdmin tab(s); sv_active / sv_idle should sum to ≤ default_pool_size (15).
  6. Stress test: open ~30 Query Tool tabs, run SELECT pg_sleep(0.1); in each. Confirm tracksolid_db total connection count stays bounded:
    SELECT count(*) FROM pg_stat_activity;
    
    Should be default_pool_size + reserve_pool_size + (other clients), not the number of pgAdmin tabs.

Files to modify / create

Path Change
260507_pgbouncer_deployment.md THIS FILE — runbook for the rollout
docker-compose.yaml Add pgbouncer and pgadmin services; add pgadmin-data volume
10_pgbouncer_auth.sql NEW — creates pgbouncer role + public.user_lookup SECURITY DEFINER function
pgadmin/servers.json NEW — pre-registers pgbouncer:6432 as the default server
.env Add PGBOUNCER_AUTH_PASSWORD, PGADMIN_DEFAULT_EMAIL, PGADMIN_DEFAULT_PASSWORD (do not commit values)
docs/CONNECTIONS.md Add a "pgbouncer + pgAdmin" section: pool mode, exposure, who uses it, how to connect for ad-hoc admin
CLAUDE.md §3 / §4 Note that admin tooling now goes through pgbouncer:6432; ingest/grafana/backup remain direct; reference this runbook

Files NOT to modify (frozen scope)

  • grafana/provisioning/datasources/tracksolid_postgres.yaml
  • n8n-workflows/dwh_extract*.json, n8n-workflows/dwh_load_bronze*.json
  • dwh/*.sql
  • ingest_movement_rev.py, ingest_events_rev.py, webhook_receiver_rev.py, ts_shared_rev.py
  • backup/pg_dump keeps using timescale_db:5432 directly

Reused conventions and utilities

  • run_migrations.py already applies new NN_*.sql files in order against tracksolid_db and tracks them in tracksolid.schema_migrations. Phase 1 adds 10_pgbouncer_auth.sql to this flow — no new tooling needed.
  • env_file: .env + depends_on: <svc> condition: service_healthy mirrors the existing pattern at docker-compose.yaml:2831, 3942, 5053, 6770, 8790.
  • Coolify domain-via-UI mirrors the Grafana comment at docker-compose.yaml:7880 and the webhook_receiver comment at docker-compose.yaml:5455.
  • Container-name resolution rule from CLAUDE.md §3 still applies for any docker exec against the new services:
    docker ps --filter name=pgbouncer --format "{{.Names}}" | head -1
    docker ps --filter name=pgadmin   --format "{{.Names}}" | head -1
    

Out-of-scope follow-ups (separate plans)

  1. Cut over Python ingest to pgbouncer. Change DATABASE_URL in .env from timescale_db:5432 to pgbouncer:6432. Requires verifying psycopg2 pool + SAVEPOINTs against transaction-mode pgbouncer (low risk per exploration — no LISTEN/NOTIFY, no advisory locks across statements, no prepared statements in the codebase).
  2. Close host port 5433 on timescale_db once pgAdmin web UI is the established admin path. Removes the public-IP Postgres exposure entirely.
  3. Rotate dwh_owner / grafana_ro plaintext passwords still in dwh/260423_dwh_ddl_v1.sql (pre-existing item from CLAUDE.md §10).

Rollback

If pgbouncer or pgAdmin misbehaves:

  1. Stop the new services without touching the rest of the stack:
    docker compose stop pgbouncer pgadmin
    docker compose rm -f pgbouncer pgadmin
    
    Ingest, Grafana, webhook, backup are unaffected — they were never cut over.
  2. Revert the SQL migration if needed:
    DROP FUNCTION public.user_lookup(text);
    DROP ROLE pgbouncer;
    DELETE FROM tracksolid.schema_migrations
     WHERE filename = '10_pgbouncer_auth.sql';
    
  3. Revert compose changes by checking out the prior docker-compose.yaml.

End-to-end verification checklist

  • 10_pgbouncer_auth.sql applied — visible in tracksolid.schema_migrations
  • pgbouncer service healthy — docker compose ps shows healthy
  • psql -h pgbouncer -p 6432 -U postgres -d tracksolid_db -c 'SELECT 1' from inside the network
  • SHOW POOLS; in pgbouncer admin shows transaction mode
  • pgadmin service healthy — Coolify domain reachable over HTTPS
  • Login + query through pgAdmin succeeds
  • SELECT count(*) FROM pg_stat_activity; stays bounded under 30-tab stress test
  • Existing pipelines unaffected: tracksolid.ingestion_log continues growing at current rate; Grafana dashboards still render
  • pgcli no longer hits "too many connections" when used alongside pgAdmin