kianiadee/tracksolid_timescale_grafana_prod

Fork 0

david kiania d5093f0a1c

Static Analysis / static (push) Waiting to run

Details

Tests / test (push) Waiting to run

Details

Static Analysis / static (pull_request) Has been cancelled

Details

Tests / test (pull_request) Has been cancelled

Details

chore(cleanup): purge n8n, Grafana, and DWH references + dead artifacts

These subsystems are retired and replaced by better alternatives (FleetNow /
FleetOps SPAs via dashboard_api; in-process pooling; reporting.v_ingest_health).
Remove them so the repo reflects the live stack only. Nothing running depends
on the deleted artifacts.

Deleted (dead artifacts):
- n8n-workflows/ (retired webhook exports), grafana/ (provisioning for the
  removed service), dwh/ (migrations for the decommissioned external warehouse)
- runbooks: DWH_PIPELINE.md, DWH_Execution_Manual.md, grafanaDeployment.md,
  grafanaOperationalManual.md

Code/config:
- run_migrations.py: drop sync_role_passwords() (its only entries were the now
  -dead grafana_ro + pgbouncer syncs; the guard already made it inert)
- .env: remove the two unused GRAFANA_* vars
- ingest_movement_rev.py / db_audit / deploy_dashboard_api_staging.sh: reword
  stale Grafana/grafana_ro comments

Docs: scrub n8n/Grafana/DWH from CLAUDE.md, CONNECTIONS, DATA_FLOW,
OPERATIONS_MANUAL, docker_commands, KPI_FRAMEWORK, PLATFORM_OVERVIEW,
STAGING_FLEETOPS, and deprecation-banner the two large SQL libraries
(dwh_gold was already dropped 2026-06-05).

Kept deliberately: the grafana_ro DB role (now an unused read-only login),
applied migration history, dated docs/reports/*, and docs/superpowers/* specs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-10 21:41:27 +03:00

9.4 KiB

Raw Blame History

pgbouncer sidecar deployment

⚠️ SUPERSEDED — pgbouncer REMOVED 2026-06-10. The pooler was deployed but stayed dormant (zero app clients ever pointed at :6432; every service connects directly to timescale_db:5432). In-process pooling (ts_shared_rev ThreadedConnectionPool) is more than sufficient at fleet scale, and transaction-mode pooling is unsafe for the advisory-lock'd v_trips refresher (FIX-D02). The pgbouncer service was dropped from docker-compose.yaml; migration 10_pgbouncer_auth.sql (the pgbouncer role + public.user_lookup() fn) is left applied but inert. This document is retained for history only.

Date: 2026-05-07 Branch: quality-program-2026-04-12 Status: Phase 1 deployed, then removed 2026-06-10 (see banner above). Phase 2 (pgAdmin4 sidecar) was rolled back on 2026-05-08 — see git history (bc020cb, then reverted).

Context

Driver: pgAdmin4 running on the maintainer's laptop has been exhausting tracksolid_db's max_connections. Each Query Tool tab in pgAdmin holds its own long-lived backend connection; combined with the existing peak of ~50–60 connections from the ingest pipeline, the budget tips over and cascades — pgcli (and anything else trying to connect) starts failing.

Goal: Add pgbouncer in front of timescale_db to enforce a connection budget via transaction-mode pooling. Desktop pgAdmin (or any other admin client) connects through pgbouncer and is multiplexed onto a small fixed pool of backends.

Frozen scope (unchanged this round):

DWH bronze pipeline (dwh/*.sql, tracksolid_dwh@31.97.44.246:5888)
n8n DWH workflows (n8n-workflows/dwh_extract*, dwh_load_bronze*)
Grafana provisioning (grafana/provisioning/datasources/...)
Python ingest containers (ingest_movement_rev.py, ingest_events_rev.py, webhook_receiver_rev.py) — they keep talking to timescale_db:5432 directly. Cutover, if desired, is a separate plan.
db_backup sidecar — pg_dump is incompatible with transaction-mode pooling and stays on timescale_db:5432.

Phase 1 — pgbouncer sidecar, no client cutover

Add a new service to docker-compose.yaml. Internal Docker network only; no host port binding.

  pgbouncer:
    image: edoburu/pgbouncer
    restart: always
    depends_on:
      timescale_db:
        condition: service_healthy
    env_file: .env
    environment:
      - DB_HOST=timescale_db
      - DB_PORT=5432
      - DB_USER=${POSTGRES_USER}
      - DB_PASSWORD=${POSTGRES_PASSWORD}
      - DB_NAME=${POSTGRES_DB}
      - POOL_MODE=transaction
      - AUTH_TYPE=scram-sha-256
      - MAX_CLIENT_CONN=200
      - DEFAULT_POOL_SIZE=15
      - MIN_POOL_SIZE=2
      - RESERVE_POOL_SIZE=5
      - SERVER_RESET_QUERY=DISCARD ALL
      - SERVER_IDLE_TIMEOUT=600
      - ADMIN_USERS=${POSTGRES_USER}
      - LISTEN_PORT=6432
      - AUTH_USER=pgbouncer
      - AUTH_QUERY=SELECT uname, phash FROM public.user_lookup($$1)
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -h 127.0.0.1 -p 6432 -U ${POSTGRES_USER}"]
      interval: 30s
      timeout: 5s
      retries: 3

Why these values:

POOL_MODE=transaction — recycles backend on every transaction boundary. Cuts pgAdmin's per-tab idle conn from 1 backend → ~0 when idle.
DEFAULT_POOL_SIZE=15 — total backend slots per (user, db) pair. Sits comfortably under Postgres max_connections (default 100) leaving room for ingest's existing ~50–60.
MAX_CLIENT_CONN=200 — pgAdmin can open as many tabs as it wants; they queue rather than fail.
RESERVE_POOL_SIZE=5 — emergency slack when default_pool_size saturates.
SERVER_RESET_QUERY=DISCARD ALL — wipes session state between transactions so leaked SETs from one client don't bleed into the next.

Auth: SCRAM passthrough via `auth_query`

Avoids hand-maintaining userlist.txt. pgbouncer authenticates as a dedicated pgbouncer Postgres role and looks up SCRAM hashes for the requesting user via a SECURITY DEFINER function.

New migration 10_pgbouncer_auth.sql (08 and 09 are taken by 08_analytics_config.sql and 09_trips_enrichment.sql):

-- Role created with placeholder password; run_migrations.py:sync_role_passwords
-- replaces it with PGBOUNCER_AUTH_PASSWORD on every container startup.
-- Same convention used today for grafana_ro.
DO $$
BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'pgbouncer') THEN
    CREATE ROLE pgbouncer LOGIN PASSWORD 'SET_PASSWORD_IN_ENV';
  END IF;
END
$$;

CREATE OR REPLACE FUNCTION public.user_lookup(in_user text,
    OUT uname text, OUT phash text) RETURNS record AS $$
BEGIN
  SELECT usename, passwd FROM pg_catalog.pg_shadow
   WHERE usename = in_user INTO uname, phash;
  RETURN;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

REVOKE ALL ON FUNCTION public.user_lookup(text) FROM public;
GRANT EXECUTE ON FUNCTION public.user_lookup(text) TO pgbouncer;

Two changes to run_migrations.py:

Append "10_pgbouncer_auth.sql" to MIGRATIONS.
Extend sync_role_passwords() roles dict with "pgbouncer": os.getenv("PGBOUNCER_AUTH_PASSWORD").

The migration is applied by the next ingest container restart and recorded in tracksolid.schema_migrations. sync_role_passwords then ALTER ROLEs the password from the env var so the placeholder is never live.

New env vars in `.env`

PGBOUNCER_AUTH_PASSWORD — password for the new pgbouncer Postgres role
(existing vars reused: POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB)

Phase 1 verification

Apply migration via ingest container restart; confirm in tracksolid.schema_migrations that 10_pgbouncer_auth.sql is recorded.
docker compose up -d pgbouncer.

From inside any compose service:

psql -h pgbouncer -p 6432 -U postgres -d tracksolid_db -c 'SELECT 1'

From the pgbouncer container's admin console:
```
psql -h 127.0.0.1 -p 6432 -U postgres -d pgbouncer -c 'SHOW POOLS;'
```
Confirm pool mode = transaction, server connections within default_pool_size.
SHOW STATS; and SHOW CLIENTS; should both respond.
Confirm no client has cut over: tracksolid.ingestion_log continues accumulating; Grafana panels keep refreshing.

Files to modify / create

Path	Change
`260507_pgbouncer_deployment.md`	THIS FILE — runbook for the rollout
`docker-compose.yaml`	Add `pgbouncer` service
`10_pgbouncer_auth.sql`	NEW — creates `pgbouncer` role + `public.user_lookup` SECURITY DEFINER function
`.env`	Add `PGBOUNCER_AUTH_PASSWORD` (do not commit values)
`docs/CONNECTIONS.md`	Add a "pgbouncer" section: pool mode, exposure, who uses it, how to connect for ad-hoc admin
`CLAUDE.md` §3 / §4	Note that admin tooling now goes through `pgbouncer:6432`; ingest/grafana/backup remain direct; reference this runbook

Files NOT to modify (frozen scope)

grafana/provisioning/datasources/tracksolid_postgres.yaml
n8n-workflows/dwh_extract*.json, n8n-workflows/dwh_load_bronze*.json
dwh/*.sql
ingest_movement_rev.py, ingest_events_rev.py, webhook_receiver_rev.py, ts_shared_rev.py
backup/ — pg_dump keeps using timescale_db:5432 directly

Reused conventions and utilities

run_migrations.py already applies new NN_*.sql files in order against tracksolid_db and tracks them in tracksolid.schema_migrations. Phase 1 adds 10_pgbouncer_auth.sql to this flow — no new tooling needed.
env_file: .env + depends_on: <svc> condition: service_healthy mirrors the existing pattern at docker-compose.yaml:28–31, 39–42, 50–53, 67–70, 87–90.
Container-name resolution rule from CLAUDE.md §3 still applies for any docker exec against the new service:
```
docker ps --filter name=pgbouncer --format "{{.Names}}" | head -1
```

Out-of-scope follow-ups (separate plans)

Cut over Python ingest to pgbouncer. Change DATABASE_URL in .env from timescale_db:5432 to pgbouncer:6432. Requires verifying psycopg2 pool + SAVEPOINTs against transaction-mode pgbouncer (low risk per exploration — no LISTEN/NOTIFY, no advisory locks across statements, no prepared statements in the codebase).
Rotate dwh_owner / grafana_ro plaintext passwords still in dwh/260423_dwh_ddl_v1.sql (pre-existing item from CLAUDE.md §10).

Rollback

If pgbouncer misbehaves:

Stop the service without touching the rest of the stack:
```
docker compose stop pgbouncer
docker compose rm -f pgbouncer
```
Ingest, Grafana, webhook, backup are unaffected — they were never cut over.

Revert the SQL migration if needed:

DROP FUNCTION public.user_lookup(text);
DROP ROLE pgbouncer;
DELETE FROM tracksolid.schema_migrations
 WHERE filename = '10_pgbouncer_auth.sql';

Revert compose changes by checking out the prior docker-compose.yaml.

End-to-end verification checklist

10_pgbouncer_auth.sql applied — visible in tracksolid.schema_migrations
pgbouncer service healthy — docker compose ps shows healthy
psql -h pgbouncer -p 6432 -U postgres -d tracksolid_db -c 'SELECT 1' from inside the network
SHOW POOLS; in pgbouncer admin shows transaction mode
Existing pipelines unaffected: tracksolid.ingestion_log continues growing at current rate; Grafana dashboards still render
pgcli no longer hits "too many connections" when used alongside pgAdmin

9.4 KiB Raw Blame History Unescape Escape

pgbouncer sidecar deployment

Context

Phase 1 — pgbouncer sidecar, no client cutover

Auth: SCRAM passthrough via auth_query

New env vars in .env

Phase 1 verification

Files to modify / create

Files NOT to modify (frozen scope)

Reused conventions and utilities

Out-of-scope follow-ups (separate plans)

Rollback

End-to-end verification checklist

9.4 KiB

Raw Blame History

Auth: SCRAM passthrough via `auth_query`

New env vars in `.env`