fix: harden MCP server reliability, build reproducibility, and auth #1

Merged

kianiadee merged 2 commits from fix/reliability-pool-build-guard into main

2026-06-19 21:40:08 +00:00

kianiadee commented

2026-06-19 20:31:18 +00:00

Owner

Summary

Addresses the intermittent query failures on the live MCP instance (the container itself is healthy — RestartCount=0, no OOM — so the failures are application/query-level), plus build-reproducibility and security hardening.

Reliability (`analytics_mcp.py`)

Discard dead pooled connections instead of recycling them. A broken socket (DB restart, network blip, crash) previously poisoned the pool, so every later query handed that connection failed until the container was recreated. New _is_disconnect() distinguishes a genuine connection loss (class-08 / 57P0x SQLSTATE, or socket-level OperationalError with pgcode=None) from an in-session query error like statement_timeout (QueryCanceled / 57014), which leaves the connection usable.
query() retries once, only on a genuine disconnect, so a recycled-but-stale connection is invisible to the analyst. A real query error (timeout, bad SQL) still surfaces immediately — no double-running of slow queries.
Bound concurrent checkouts with a semaphore (POOL_MAX) so >POOL_MAX concurrent queries queue instead of overflowing the pool and raising PoolError (a 500 to the analyst).
Lazy pool (minconn=0) + retry on init, so a brief DB outage at deploy time no longer crash-loops the worker.

Build reproducibility

Commit uv.lock (was gitignored) and build with uv sync --frozen, so a redeploy can't silently pull a newer, behaviour-changed mcp/starlette.

Security

Constant-time Bearer-token comparison (hmac.compare_digest).
/healthz no longer leaks the analyst/token count.
Dockerfile runs as a non-root user.
deploy.sh: Docker log rotation (bounds disk — logs are flooded by internet bot scans) + Traefik rate-limit middleware.

Correctness / usability

Relax the SQL guard so a forbidden keyword inside a string literal (e.g. WHERE summary ILIKE '%please delete%') no longer rejects a valid read. The blocklist still rejects data-modifying CTEs, multi-statements, and writes (and writes are impossible anyway via the analytics_ro GRANTs + read-only rolled-back txn).
Fix stale docstrings (schemas now reporting, tracksolid, tickets, fuel).

Verification

py_compile + ruff clean (no new warnings vs. main).
Module imports in ~400 ms with a dead DSN → confirms no eager DB connect.
SQL guard test suite 9/9 (literals pass, writes/CTEs/multi-statement blocked).
Disconnect classifier verified (socket loss → retry; statement_timeout/guard errors → not retried).
uv lock --check clean.

Not included (follow-ups)

docs/ANALYTICS_MCP.md embeds an older copy of the pool code as a walkthrough — left as-is to avoid churn; worth a later refresh.
Deploy still manual (Coolify rebuild or deploy.sh). The new Traefik rate-limit label only applies via the deploy.sh path.

🤖 Generated with Claude Code

## Summary Addresses the intermittent query failures on the live MCP instance (the container itself is healthy — `RestartCount=0`, no OOM — so the failures are application/query-level), plus build-reproducibility and security hardening. ## Reliability (`analytics_mcp.py`) - **Discard dead pooled connections** instead of recycling them. A broken socket (DB restart, network blip, crash) previously poisoned the pool, so every later query handed that connection failed until the container was recreated. New `_is_disconnect()` distinguishes a genuine connection loss (class-08 / `57P0x` SQLSTATE, or socket-level `OperationalError` with `pgcode=None`) from an in-session query error like `statement_timeout` (`QueryCanceled` / `57014`), which leaves the connection usable. - **`query()` retries once, only on a genuine disconnect**, so a recycled-but-stale connection is invisible to the analyst. A real query error (timeout, bad SQL) still surfaces immediately — no double-running of slow queries. - **Bound concurrent checkouts with a semaphore (`POOL_MAX`)** so `>POOL_MAX` concurrent queries queue instead of overflowing the pool and raising `PoolError` (a 500 to the analyst). - **Lazy pool (`minconn=0`) + retry on init**, so a brief DB outage at deploy time no longer crash-loops the worker. ## Build reproducibility - **Commit `uv.lock`** (was gitignored) and build with `uv sync --frozen`, so a redeploy can't silently pull a newer, behaviour-changed `mcp`/`starlette`. ## Security - Constant-time Bearer-token comparison (`hmac.compare_digest`). - `/healthz` no longer leaks the analyst/token count. - Dockerfile runs as a **non-root** user. - `deploy.sh`: Docker **log rotation** (bounds disk — logs are flooded by internet bot scans) + Traefik **rate-limit** middleware. ## Correctness / usability - Relax the SQL guard so a forbidden keyword inside a **string literal** (e.g. `WHERE summary ILIKE '%please delete%'`) no longer rejects a valid read. The blocklist still rejects data-modifying CTEs, multi-statements, and writes (and writes are impossible anyway via the `analytics_ro` GRANTs + read-only rolled-back txn). - Fix stale docstrings (schemas now `reporting, tracksolid, tickets, fuel`). ## Verification - `py_compile` + `ruff` clean (no new warnings vs. `main`). - Module imports in ~400 ms with a dead DSN → confirms no eager DB connect. - SQL guard test suite 9/9 (literals pass, writes/CTEs/multi-statement blocked). - Disconnect classifier verified (socket loss → retry; `statement_timeout`/guard errors → not retried). - `uv lock --check` clean. ## Not included (follow-ups) - `docs/ANALYTICS_MCP.md` embeds an older copy of the pool code as a walkthrough — left as-is to avoid churn; worth a later refresh. - Deploy still manual (Coolify rebuild or `deploy.sh`). The new Traefik rate-limit label only applies via the `deploy.sh` path. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

kianiadee added 1 commit 2026-06-19 20:31:18 +00:00

fix: harden MCP server reliability, build reproducibility, and auth 5e3fc3910b

Addresses intermittent query failures on the live instance (container itself
is healthy — failures are application/query-level) plus security hardening.

Reliability (analytics_mcp.py):
- Discard dead pooled connections instead of recycling them. A broken socket
  (DB restart, network blip, crash) previously poisoned the pool and every
  later query handed that connection failed until container recreation. New
  _is_disconnect() classifies real connection loss (class-08 / 57P0x SQLSTATE,
  or socket-level OperationalError with pgcode=None) vs. an in-session query
  error like statement_timeout (QueryCanceled / 57014), which is NOT a
  disconnect and leaves the connection usable.
- query() retries ONCE, only on a genuine disconnect, so a recycled-but-stale
  connection is invisible to the analyst (a real query error still surfaces).
- Bound concurrent checkouts with a semaphore (POOL_MAX) so >POOL_MAX
  concurrent queries QUEUE instead of overflowing the pool and raising
  PoolError (a 500 to the analyst).
- Lazy pool (minconn=0) + retry on init, so a brief DB outage at deploy time
  no longer crash-loops the worker.

Build reproducibility:
- Commit uv.lock (was gitignored) and build with `uv sync --frozen` so
  redeploys can't silently pull a newer, behaviour-changed mcp/starlette.

Security:
- Constant-time Bearer-token comparison (hmac.compare_digest).
- /healthz no longer leaks the analyst/token count.
- Dockerfile runs as a non-root user.
- deploy.sh: Docker log rotation (bound disk) + Traefik rate-limit middleware.

Also: relax the SQL guard so a forbidden keyword inside a string literal (e.g.
WHERE summary ILIKE '%please delete%') no longer rejects a valid read; the
blocklist still rejects data-modifying CTEs (and writes are impossible anyway
via the analytics_ro GRANTs + read-only rolled-back txn). Fix stale docstrings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

kianiadee added 1 commit 2026-06-19 20:38:25 +00:00

fix(connections): shrink MCP DB-connection footprint on a shared 100-conn DB c02c127798

The DB is at max_connections=100 and several stack services hold persistent
pools (several as the postgres superuser, idle for hours), so peaks hit
"too many connections". The MCP is a minor contributor but easy to bound:

- Dockerfile: uvicorn --workers 2 → 1. The MCP's connection budget is
  workers × MCP_POOL_MAX, so this caps it at 8 backends instead of 16. Scale
  via MCP_POOL_MAX, not workers, so the budget stays obvious. (Pairs with the
  minconn=0 lazy pool already on this branch: 0 connections held when idle.)
- analytics_ro_role.sql: add idle_session_timeout=5min so the DB reaps the
  MCP's idle POOLED connections (idle_in_transaction never reaps them — they're
  idle outside a txn) and returns the slots. Safe because the server now
  discards + transparently retries a reaped connection instead of erroring.

Note: the dominant fix is stack-wide (get the superuser app pools onto bounded,
timed roles; consider PgBouncer; or raise max_connections) — out of this repo's
scope but documented in the review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>