Addresses intermittent query failures on the live instance (the container
itself is healthy; the failures are at the application/query level), plus
security hardening.
Reliability (analytics_mcp.py):
- Discard dead pooled connections instead of recycling them. A broken socket
(DB restart, network blip, crash) previously poisoned the pool: every later
query that was handed that connection failed until the container was
recreated. New _is_disconnect() distinguishes real connection loss (class-08
/ 57P0x SQLSTATE, or a socket-level OperationalError with pgcode=None) from
an in-session query error such as statement_timeout (QueryCanceled / 57014),
which is NOT a disconnect and leaves the connection usable.
- query() retries ONCE, only on a genuine disconnect, so a recycled-but-stale
connection is invisible to the analyst (a real query error still surfaces).
- Bound concurrent checkouts with a semaphore (POOL_MAX) so >POOL_MAX
concurrent queries QUEUE instead of overflowing the pool and raising
PoolError (a 500 to the analyst).
- Lazy pool (minconn=0) + retry on init, so a brief DB outage at deploy time
no longer crash-loops the worker.
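
The classification, single retry, and bounded checkout described above can be
sketched roughly as follows. This is an illustration under assumptions, not
the code in analytics_mcp.py: OperationalError here is a stand-in class that
only mimics a driver exception carrying a pgcode attribute, run_query() and
the POOL_MAX value are hypothetical, and the pool is any object exposing
getconn() and putconn(conn, close=...).

```python
import threading
from typing import Any, Callable, Optional

POOL_MAX = 2  # hypothetical value, chosen only for this sketch


class OperationalError(Exception):
    """Stand-in for a driver error that carries a SQLSTATE in .pgcode."""

    def __init__(self, message: str = "", pgcode: Optional[str] = None):
        super().__init__(message)
        self.pgcode = pgcode


def is_disconnect(exc: BaseException) -> bool:
    """Return True only when the connection itself is gone."""
    if not isinstance(exc, OperationalError):
        return False
    code = exc.pgcode
    if code is None:
        return True  # socket-level failure: the server never sent a SQLSTATE
    # Class 08 (connection exception) and 57P0x (shutdown/crash) mean the
    # session is dead. 57014 (query canceled) matches neither prefix.
    return code.startswith("08") or code.startswith("57P0")


# At most POOL_MAX callers hold a connection; the rest block here and queue.
_slots = threading.BoundedSemaphore(POOL_MAX)


def run_query(pool: Any, run: Callable[[Any], Any]) -> Any:
    """Call run(conn); retry exactly once if the connection was dead."""
    with _slots:
        for attempt in (1, 2):
            conn = pool.getconn()
            try:
                result = run(conn)
            except Exception as exc:
                dead = is_disconnect(exc)
                # Close dead connections so they never return to the pool.
                pool.putconn(conn, close=dead)
                if dead and attempt == 1:
                    continue
                raise
            pool.putconn(conn)
            return result
```

Because the retry is gated on is_disconnect(), a statement_timeout still
reaches the caller on the first attempt and its connection goes back to the
pool unclosed.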
Build reproducibility:
- Commit uv.lock (was gitignored) and build with `uv sync --frozen` so
redeploys can't silently pull a newer, behaviour-changed mcp/starlette.
Security:
- Constant-time Bearer-token comparison (hmac.compare_digest).
- /healthz no longer leaks the analyst/token count.
- Dockerfile runs as a non-root user.
- deploy.sh: Docker log rotation (bounds disk usage) + Traefik rate-limit
middleware.
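
For reference, a constant-time Bearer-token check along the lines of the
first Security item might look like the sketch below. The helper names
bearer_token() and token_matches() are hypothetical and not taken from the
codebase; only hmac.compare_digest is the call named in this change.

```python
import hmac
from typing import Optional


def bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Extract the token from an 'Authorization: Bearer <token>' value."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None
    return token.strip()


def token_matches(presented: Optional[str], expected: str) -> bool:
    """Compare tokens without leaking the mismatch position via timing."""
    if presented is None:
        return False
    # Encode to bytes: compare_digest only accepts str when it is pure ASCII.
    return hmac.compare_digest(
        presented.encode("utf-8"), expected.encode("utf-8")
    )
```

A plain == comparison can return as soon as the first differing character is
found, which is the timing signal compare_digest is designed to avoid.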
Also: relax the SQL guard so a forbidden keyword inside a string literal (e.g.
WHERE summary ILIKE '%please delete%') no longer rejects a valid read; the
blocklist still rejects data-modifying CTEs (and writes are impossible anyway
via the analytics_ro GRANTs + read-only rolled-back txn). Fix stale docstrings.
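
One way to get the relaxed guard behaviour is to blank out string literals
before scanning for keywords. The sketch below is a simplified illustration,
not the guard in this repository: the keyword list and the
passes_sql_guard() name are assumptions, and it only understands standard
single-quoted literals (no dollar-quoting, E'' escapes, or comments).

```python
import re

# Standard SQL string literal; a doubled '' inside is an escaped quote.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")

# Hypothetical blocklist for illustration only.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b",
    re.IGNORECASE,
)


def passes_sql_guard(sql: str) -> bool:
    """Return True if no forbidden keyword appears outside a string literal."""
    without_literals = _STRING_LITERAL.sub("''", sql)
    return _FORBIDDEN.search(without_literals) is None
```

With this approach the ILIKE example above passes, while a data-modifying
CTE such as WITH gone AS (DELETE FROM t RETURNING *) is still rejected
because its keyword sits outside any literal.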
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>