The DB is at max_connections=100 and several stack services hold persistent
pools (several as the postgres superuser, idle for hours), so peaks hit
"too many connections". The MCP is a minor contributor but easy to bound:
- Dockerfile: uvicorn --workers 2 → 1. The MCP's connection budget is
workers × MCP_POOL_MAX, so with MCP_POOL_MAX=8 this caps it at 8 backends
instead of 16. Scale via MCP_POOL_MAX, not workers, so the budget stays
obvious. (Pairs with the minconn=0 lazy pool already on this branch: 0
connections held when idle.)
- analytics_ro_role.sql: add idle_session_timeout=5min so the DB reaps the
MCP's idle POOLED connections and returns the slots
(idle_in_transaction_session_timeout never reaps them, because they are idle
outside a txn). Safe because the server now discards a reaped connection and
transparently retries instead of erroring.
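The budget arithmetic above can be sketched as follows. This is only an
illustration: connection_budget is a hypothetical helper, not code from this
repo, and MCP_POOL_MAX=8 is inferred from the 16 → 8 figures above.

```python
def connection_budget(workers: int, pool_max: int) -> int:
    """Worst-case number of Postgres backends the MCP can hold open.

    Each uvicorn worker is a separate process with its own pool, so the
    ceilings add up across workers.
    """
    return workers * pool_max


# Assumed value, inferred from the figures in this commit message.
MCP_POOL_MAX = 8

before = connection_budget(workers=2, pool_max=MCP_POOL_MAX)
after = connection_budget(workers=1, pool_max=MCP_POOL_MAX)
print(f"budget before: {before}, after: {after}")
```

With a single worker the budget equals MCP_POOL_MAX, so that one setting is
the only number to check against max_connections.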
Note: the dominant fix is stack-wide (get the superuser app pools onto bounded,
timed roles; consider PgBouncer; or raise max_connections) — out of this repo's
scope but documented in the review.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Addresses intermittent query failures on the live instance (container itself
is healthy — failures are application/query-level) plus security hardening.
Reliability (analytics_mcp.py):
- Discard dead pooled connections instead of recycling them. A broken socket
(DB restart, network blip, crash) previously poisoned the pool: every later
query that was handed that connection failed until the container was
recreated. New
_is_disconnect() classifies real connection loss (class-08 / 57P0x SQLSTATE,
or socket-level OperationalError with pgcode=None) vs. an in-session query
error like statement_timeout (QueryCanceled / 57014), which is NOT a
disconnect and leaves the connection usable.
- query() retries ONCE, only on a genuine disconnect, so a recycled-but-stale
connection is invisible to the analyst (a real query error still surfaces).
- Bound concurrent checkouts with a semaphore (POOL_MAX) so >POOL_MAX
concurrent queries QUEUE instead of overflowing the pool and raising
PoolError (a 500 to the analyst).
- Lazy pool (minconn=0) + retry on init, so a brief DB outage at deploy time
no longer crash-loops the worker.
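A rough sketch of how the disconnect classification, the single retry and the
semaphore fit together. This is not the code in analytics_mcp.py:
FakeOperationalError stands in for the driver's OperationalError, and
is_disconnect, query, the callback parameters and POOL_MAX=8 are hypothetical
names and values chosen for illustration.

```python
import threading

POOL_MAX = 8  # assumed value; the real limit is configured elsewhere


class FakeOperationalError(Exception):
    """Stand-in for a driver OperationalError; pgcode carries the SQLSTATE."""

    def __init__(self, msg, pgcode=None):
        super().__init__(msg)
        self.pgcode = pgcode


def is_disconnect(exc):
    """Return True only when the connection itself is gone."""
    code = getattr(exc, "pgcode", None)
    if code is None:
        # Socket-level failure: the server never sent a SQLSTATE.
        return isinstance(exc, FakeOperationalError)
    # Class 08 = connection exception; 57P0x = server shutdown or crash.
    # 57014 (query canceled, e.g. statement_timeout) matches neither prefix.
    return code.startswith("08") or code.startswith("57P0")


_slots = threading.BoundedSemaphore(POOL_MAX)


def query(run, get_conn, discard_conn, return_conn):
    """Call run(conn), retrying once if the connection turns out to be dead."""
    with _slots:  # callers beyond POOL_MAX wait here instead of overflowing
        for attempt in (1, 2):
            conn = get_conn()
            try:
                result = run(conn)
            except Exception as exc:
                if is_disconnect(exc):
                    discard_conn(conn)  # never recycle a dead connection
                    if attempt == 1:
                        continue  # one transparent retry on a fresh one
                else:
                    return_conn(conn)  # still usable after a query error
                raise
            return_conn(conn)
            return result
```

The semaphore wraps the whole retry loop, so a retry reuses the slot the
caller already holds rather than competing for a new one.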
Build reproducibility:
- Commit uv.lock (was gitignored) and build with `uv sync --frozen` so
redeploys can't silently pull a newer, behaviour-changed mcp/starlette.
Security:
- Constant-time Bearer-token comparison (hmac.compare_digest).
- /healthz no longer leaks the analyst/token count.
- Dockerfile runs as a non-root user.
- deploy.sh: Docker log rotation (bound disk) + Traefik rate-limit middleware.
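For reference, a minimal sketch of a constant-time Bearer check built on
hmac.compare_digest. The helper names (bearer_token, token_is_valid) and the
header parsing are assumptions for illustration, not the server's actual code.

```python
import hmac


def bearer_token(header):
    """Extract the token from an 'Authorization: Bearer <token>' value."""
    scheme, _, value = header.partition(" ")
    value = value.strip()
    if scheme.lower() == "bearer" and value:
        return value
    return None


def token_is_valid(presented, known_tokens):
    """Compare against every known token without early exit."""
    presented_b = presented.encode("utf-8")
    ok = False
    for token in known_tokens:
        # compare_digest's running time does not depend on where the
        # two values first differ, unlike the == operator.
        if hmac.compare_digest(presented_b, token.encode("utf-8")):
            ok = True  # keep looping so timing does not reveal the match
    return ok
```

Encoding to bytes first keeps compare_digest usable when a token contains
non-ASCII characters, which it rejects for str arguments.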
Also: relax the SQL guard so a forbidden keyword inside a string literal (e.g.
WHERE summary ILIKE '%please delete%') no longer rejects a valid read; the
blocklist still rejects data-modifying CTEs (and writes are impossible anyway
via the analytics_ro GRANTs + read-only rolled-back txn). Fix stale docstrings.
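One way to get this behaviour is to blank out string literals before scanning
for keywords. The sketch below is an assumption about the approach, not the
guard's real implementation: the FORBIDDEN list and forbidden_keyword are
hypothetical, and only standard single-quoted literals are handled (no
dollar-quoting, E'' escapes or comments).

```python
import re

# Assumed blocklist for illustration; the real one may differ.
FORBIDDEN = ("insert", "update", "delete", "drop", "alter", "truncate", "grant")

# A single-quoted literal, where '' inside the literal is an escaped quote.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
_KEYWORD = re.compile(r"\b(" + "|".join(FORBIDDEN) + r")\b", re.IGNORECASE)


def forbidden_keyword(sql):
    """Return the first forbidden keyword outside string literals, or None."""
    stripped = _STRING_LITERAL.sub("''", sql)  # keep the quotes, drop contents
    match = _KEYWORD.search(stripped)
    return match.group(1).lower() if match else None


read = "SELECT * FROM t WHERE summary ILIKE '%please delete%'"
write_cte = "WITH d AS (DELETE FROM t RETURNING *) SELECT * FROM d"
print(forbidden_keyword(read))       # keyword sits inside a literal
print(forbidden_keyword(write_cte))  # keyword is real SQL
```

The word-boundary match also leaves column names such as updated_at alone,
since the keyword must stand as a whole word.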
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflect the live state: readable data-surface table (reporting/tracksolid/
tickets/fuel + owners), the owner-keyed default-privilege gotcha, the
tickets.inc typed-vs-raw column note, the MCP_READABLE_SCHEMAS env knob, code-only redeploy that
reuses tokens, and tickets example prompts. Status flipped to deployed & live.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The analytics_ro role only had USAGE/SELECT on reporting + tracksolid, so
the tickets schema (INC/CRQ, 8 tables + 1 view + 7 fns) and fuel schema
were invisible to the MCP server — queries failed with permission denied.
- analytics_ro_role.sql: GRANT USAGE/SELECT/EXECUTE on tickets + fuel.
Default privileges for these are keyed to postgres (their owner), not
tracksolid_owner, so future objects auto-grant correctly.
- analytics_mcp.py: READABLE_SCHEMAS now includes tickets + fuel and is
overridable via MCP_READABLE_SCHEMAS, so the introspection helpers
(list_tables/describe_table/sample_table) work for them too.
- deploy.sh: reuse existing analyst tokens from the running container when
MCP_AUTH_TOKENS is unset, so a code-only redeploy needs no secret.
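A sketch of how the MCP_READABLE_SCHEMAS override could be read. The
comma-separated format, the readable_schemas helper and the fallback
behaviour are assumptions; only the variable name and the four schema names
come from this change.

```python
import os

# Defaults named in this commit: the original two plus tickets and fuel.
DEFAULT_READABLE_SCHEMAS = ("reporting", "tracksolid", "tickets", "fuel")


def readable_schemas(environ=None):
    """Return the schemas the introspection helpers may expose."""
    environ = os.environ if environ is None else environ
    raw = environ.get("MCP_READABLE_SCHEMAS", "")
    # Assumed format: comma-separated names, surrounding whitespace ignored.
    names = [part.strip() for part in raw.split(",") if part.strip()]
    # An unset or empty variable falls back to the built-in defaults.
    return tuple(names) if names else DEFAULT_READABLE_SCHEMAS
```

Passing a plain dict as environ makes the parsing easy to check without
touching the process environment.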
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The MCP SDK's transport-security DNS-rebinding protection only accepts a
localhost Host header by default and returns 421 behind Traefik (Host =
fleetmcp.*). It targets browser attacks on localhost-bound servers and does
not apply to a public, TLS-terminated, Bearer-authenticated service. It is now
off by default and can be re-enabled via MCP_DNS_REBINDING_PROTECTION=1 +
MCP_ALLOWED_HOSTS.
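The two variables could be parsed roughly like this. The rebinding_settings
helper, the returned dict shape and the comma-separated host list are
assumptions for illustration; how the result is handed to the MCP SDK is not
shown.

```python
import os


def rebinding_settings(environ=None):
    """Read the DNS-rebinding toggle and allowed hosts from the environment."""
    environ = os.environ if environ is None else environ
    # Protection stays off unless the variable is exactly "1".
    enabled = environ.get("MCP_DNS_REBINDING_PROTECTION", "0") == "1"
    raw_hosts = environ.get("MCP_ALLOWED_HOSTS", "")
    # Assumed format: comma-separated Host header values.
    hosts = [h.strip() for h in raw_hosts.split(",") if h.strip()]
    # The host list only matters when protection is on.
    return {"enabled": enabled, "allowed_hosts": hosts if enabled else []}
```

Defaulting to off matches the behaviour described above, where the proxy's
Host header would otherwise be rejected with 421.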
Also: deploy.sh health echo uses python (slim image has no curl).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>