Security:
- .dockerignore + Dockerfile: stop baking .env / the 346MB OSM pbf into image
layers; install pinned from uv.lock (reproducible builds) (SEC-04/05).
- docker-compose: DB port binds ${DB_BIND_ADDR:-127.0.0.1} — loopback-only by
default; remote tooling moves to an SSH tunnel (SEC-01).
- webhook_receiver: CRITICAL startup warning + WEBHOOK_REQUIRE_TOKEN=1 fail-closed
when JIMI_WEBHOOK_TOKEN is empty (SEC-02 / FIX-W01).
Correctness:
- FIX-M22/E07: capture cur.rowcount BEFORE RELEASE SAVEPOINT in poll_alarms/
poll_trips/poll_parking — the RELEASE reported -1, producing "Alarms: -4 new
events inserted" logs and negative ingestion_log.rows_inserted.
- FIX-W02: parse application/json push bodies (were silently dropped).
- FIX-W03: move webhook DB work off the event loop via asyncio.to_thread.
- FIX-M23: poll_trips phased so no txn/connection is held across Tracksolid +
Nominatim (1 req/s) network calls.
- FIX-M24: sync_devices disables devices absent from every target (guarded).
- FIX-W04: reject device-clock-garbage alarm_time (2019 timestamps observed).
- get_token(): don't relabel already-aware timestamptz expiries (BUG-P9).
Observability/lifecycle:
- migration 21: v_ingest_health restricted to active pipeline endpoints so
one-shot tools stop wedging /health/ingest at 'stale' (dry-run verified).
- FIX-M25: daily purge_audit_logs() trims ingestion_log (90d) + refresh_log (180d).
- remove orphaned duplicate migrations/10_driver_clock_views.sql; ruff lint config.
+5 webhook tests (82 pass). Report/plan/work-log in docs/reports/260702_*.
Local only; not deployed. CLAUDE.md fix-history edits left uncommitted (that file
also carries unrelated in-progress edits).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Fix & Refactor Plan — 2026-07-02
Companion to 260702_platform_audit_report.md (findings) and 260702_work_done.md
(execution log). Goal per the brief: faster, more scalable, and above all secure,
without regressing any FIX-* in CLAUDE.md §7.
The plan splits into (A) code/repo changes implemented in this session, and (B) operational actions that touch prod and therefore need explicit operator confirmation (CLAUDE.md working rule 1) — commands are provided, not executed.
Phase A — code & repo changes (implemented locally, tests must stay green)
A1. Security hardening in the repo
| Step | Finding | Change |
|---|---|---|
| A1.1 | SEC-04 | Add .dockerignore excluding .env*, .git, .venv, *.osm.pbf, CSV/data artefacts, docs, tests, caches. |
| A1.2 | SEC-05 | Dockerfile: COPY uv.lock and install from it with an equivalent of uv pip install --system --require-hashes (the uv sync --frozen layout) so builds are pinned and reproducible. Keep the runtime layout identical (/app, non-root user). |
| A1.3 | SEC-02 | webhook_receiver_rev.py: log a CRITICAL warning at startup when JIMI_WEBHOOK_TOKEN is empty, and add WEBHOOK_REQUIRE_TOKEN=1 (opt-in) that refuses to start unauthenticated. Default stays permissive so the next deploy doesn't break live ingestion before the token is configured on the Jimi side; flipping the flag is step B3. |
| A1.4 | SEC-01 | docker-compose.yaml: publish Postgres as ${DB_BIND_ADDR:-127.0.0.1}:5433:5432. Default becomes localhost-only on the next stack redeploy; remote tooling moves to an SSH tunnel (documented). Setting DB_BIND_ADDR=0.0.0.0 in .env restores the old exposure deliberately. |
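
The startup behaviour described in A1.3 can be sketched as follows. This is a minimal illustration, not the code in webhook_receiver_rev.py: the function name `check_webhook_auth` and the logger name are hypothetical, and only the two environment variable names come from the plan.

```python
import logging
import os
from typing import Mapping

log = logging.getLogger("webhook_receiver")  # logger name is an assumption


def check_webhook_auth(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if a push token is configured.

    Hypothetical helper: with WEBHOOK_REQUIRE_TOKEN=1 and no token it
    raises (fail closed); otherwise it logs CRITICAL and returns False.
    """
    token = env.get("JIMI_WEBHOOK_TOKEN", "").strip()
    require = env.get("WEBHOOK_REQUIRE_TOKEN", "") == "1"

    if token:
        return True
    if require:
        # Opt-in strict mode: refuse to start unauthenticated.
        raise RuntimeError(
            "WEBHOOK_REQUIRE_TOKEN=1 but JIMI_WEBHOOK_TOKEN is empty"
        )
    # Permissive default: keep ingesting, but make the exposure loud.
    log.critical("JIMI_WEBHOOK_TOKEN is empty; webhook accepts unauthenticated pushes")
    return False
```

Keeping the permissive branch as the default matches the rollout order in the plan: the flag is only flipped (step B3) once the token exists on the Jimi side.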
A2. Correctness fixes
| Step | Finding | Change |
|---|---|---|
| A2.1 | BUG-P1 | Capture cur.rowcount immediately after the INSERT, before RELEASE SAVEPOINT, in poll_alarms, poll_trips, poll_parking. |
| A2.2 | BUG-P3 | _parse_request: branch on Content-Type — parse application/json bodies ({"token", "data_list"}), keep the observed form-encoded path (msgType/data) as-is. |
| A2.3 | BUG-P2 | Webhook endpoints: keep async request parsing, move each endpoint's DB work into a synchronous _process_*(items) function executed via await asyncio.to_thread(...). Event loop never blocks on psycopg2 again; existing tests (which patch webhook_receiver_rev.get_conn) still pass because module references are unchanged. |
| A2.4 | BUG-P4 | poll_trips: four-phase restructure: (1) fetch API trips per batch with no connection held; (2) one short read transaction for position_history enrichment; (3) reverse-geocode with no connection held; (4) one write transaction for the upserts + log. Nominatim's 1 req/s throttle no longer holds a pool connection or transaction open. |
| A2.5 | BUG-P6 | sync_devices: when every configured target listed successfully and returned a sane device count, set enabled_flag=0 for enabled IMEIs absent from the aggregate (and re-enable ones that return). Guard: skip the disable pass entirely if any target call failed or the aggregate is empty — an API outage must not disable the fleet. |
| A2.6 | BUG-P9 | get_token(): only .replace(tzinfo=utc) when the value is naive; otherwise use it as-is. |
| A2.7 | BUG-P8 | webhook_receiver + ingest_events: sanity-guard inbound event timestamps — reject alarm/event rows whose timestamp is > 2 days in the future or before 2026-01-01 (project epoch), logging a warning. Prevents 2019-clock devices polluting alarms/position_history further. |
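
The ordering issue behind A2.1 can be reproduced in a small sketch. It assumes the stdlib `sqlite3` module as a stand-in for psycopg2 (the real pollers talk to Postgres), and the `alarms` table layout and `insert_batch` helper are hypothetical. The point it shows is general to DB-API cursors: `rowcount` describes the most recent statement, so it has to be read before the `RELEASE SAVEPOINT` is executed on the same cursor.

```python
import sqlite3


def insert_batch(conn, rows):
    """Insert rows inside a savepoint; return (inserted, rowcount_after_release)."""
    cur = conn.cursor()
    cur.execute("SAVEPOINT batch")
    cur.executemany("INSERT INTO alarms (imei, alarm_type) VALUES (?, ?)", rows)
    inserted = cur.rowcount  # capture BEFORE the next statement overwrites it
    cur.execute("RELEASE SAVEPOINT batch")
    # RELEASE is not a DML statement, so rowcount no longer reflects the
    # INSERT (sqlite3 documents -1 here, matching the -1 seen in the logs).
    after_release = cur.rowcount
    return inserted, after_release


# isolation_level=None puts sqlite3 in autocommit mode so the explicit
# SAVEPOINT / RELEASE statements control the transaction themselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE alarms (imei TEXT, alarm_type TEXT)")
inserted, after_release = insert_batch(
    conn, [("860000000000001", "sos"), ("860000000000002", "overspeed")]
)
print(f"captured before RELEASE: {inserted}, read after RELEASE: {after_release}")
```

Summing the late value across batches is what would produce a negative total such as the "-4 new events" log line.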
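
The timestamp guard in A2.7 amounts to a window check. The sketch below assumes the two bounds stated in the table (project epoch 2026-01-01, at most 2 days in the future); the function name, the constants' names, and the choice to treat naive timestamps as UTC are illustrative assumptions rather than the project's actual code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

PROJECT_EPOCH = datetime(2026, 1, 1, tzinfo=timezone.utc)  # bound from A2.7
MAX_FUTURE_SKEW = timedelta(days=2)                        # bound from A2.7


def is_plausible_event_time(ts: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if ts lies between the project epoch and now + 2 days."""
    if now is None:
        now = datetime.now(timezone.utc)
    if ts.tzinfo is None:
        # Assumption: naive device timestamps are UTC. Aware values are
        # compared as-is and never relabelled (same rule as A2.6).
        ts = ts.replace(tzinfo=timezone.utc)
    return PROJECT_EPOCH <= ts <= now + MAX_FUTURE_SKEW


now = datetime(2026, 7, 2, 12, 0, tzinfo=timezone.utc)
print(is_plausible_event_time(datetime(2019, 3, 1, 8, 0), now))          # device-clock garbage
print(is_plausible_event_time(datetime(2026, 7, 1, 8, 0), now))          # normal event
print(is_plausible_event_time(now + timedelta(days=3), now))             # too far ahead
```

A caller would log a warning and skip the row when the function returns False, rather than raising, so one bad device cannot fail a whole push batch.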
A3. Observability & data lifecycle
| Step | Finding | Change |
|---|---|---|
| A3.1 | BUG-P5 | New migration 21_ingest_health_active_only.sql: redefine reporting.v_ingest_health to only include pipeline endpoints (explicit allow-list matching the ingest_worker schedule + webhook endpoints) so one-shot tools can't wedge /health/ingest at "stale". |
| A3.2 | BUG-P7 | New daily housekeeping job in the ingest worker: DELETE FROM tracksolid.ingestion_log WHERE run_at < now() - interval '90 days'; same for reporting.refresh_log (180 days). Registered at 02:30 daily. |
| A3.3 | OPS-03 | Remove the orphaned duplicate migrations/10_driver_clock_views.sql (never registered in run_migrations.py, superseded by the reporting layer). |
| A3.4 | OPS-04 | pyproject.toml: move ruff select under [tool.ruff.lint]. |
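
The housekeeping job in A3.2 could look roughly like the sketch below. It assumes a psycopg2-style DB-API connection (`%s` placeholders) and that reporting.refresh_log has a `run_at` column like tracksolid.ingestion_log; the plan only says "same for", so that column name is an assumption. Scheduling at 02:30 is left to the worker's existing scheduler and is not shown.

```python
# Retention windows from step A3.2 (days per table).
RETENTION_DAYS = {
    "tracksolid.ingestion_log": 90,
    "reporting.refresh_log": 180,  # run_at column assumed for this table
}


def purge_audit_logs(conn) -> dict:
    """Delete expired audit rows; return the number of rows deleted per table."""
    deleted = {}
    cur = conn.cursor()
    try:
        for table, days in RETENTION_DAYS.items():
            # Table names come from the fixed dict above, never from input,
            # so formatting them into the SQL is safe. The day count is
            # passed as a bound parameter.
            cur.execute(
                f"DELETE FROM {table} "
                "WHERE run_at < now() - make_interval(days => %s)",
                (days,),
            )
            deleted[table] = cur.rowcount  # read right after the DELETE
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
    return deleted
```

Both deletes share one transaction here for simplicity; on a large backlog, batching the first purge may be preferable to a single long DELETE.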
A4. Explicit non-goals for this session
- No change to the `reporting.*` SQL functions (healthy; PERF-03 tuning is config-level).
- No pgbouncer reintroduction (removed deliberately 2026-06-10; connection counts are fine).
- No changes to `tickets.*` / fleettickets code (different repo); flagged separately.
- No alarm retention policy (business decision on how much ACC history to keep).
Phase B — operational actions (prod; NOT executed without confirmation)
| # | Finding | Action | Command sketch |
|---|---|---|---|
| B1 | SEC-01 | Rotate the postgres superuser password (it was shared in chat), update .env on twala, redeploy stack. | `ALTER USER postgres WITH PASSWORD '<new>';` then edit .env, Coolify redeploy. |
| B2 | SEC-01 | Close public 5433 (after B1): redeploy with the new compose default, then use ssh -L 5433:localhost:5433 for local tools. Interim: ufw allow-list. | |
| B3 | SEC-02 | Configure a push token in the Jimi/Tracksolid console, set JIMI_WEBHOOK_TOKEN in .env, set WEBHOOK_REQUIRE_TOKEN=1, redeploy webhook_receiver. | |
| B4 | SEC-03 | Move services off the superuser: point DATABASE_URL for ingest_worker/webhook at tracksolid_owner (grants already exist for the tracksolid schema; verify with a staging dry-run), REFRESH_DATABASE_URL at reporting_refresher, db_backup at a dedicated dump role or keep postgres but only via internal network once B2 lands. | |
| B5 | OPS-01 | Redeploy the prod dashboard_api bridge so it matches the repo (restores the 8 missing INC/CRQ/fuel routes): scp dashboard_api_rev.py twala:~/ && ssh twala 'bash ~/deploy_dashboard_api.sh'. | |
| B6 | SEC-07 | Schedule a maintenance window to bump the TimescaleDB image to the latest pg16 minor. | |
| B7 | PERF-05 | In the fleettickets repo: drop the three unused geo indexes (~134 MB), consider trimming/compressing the raw payload column. | |
Suggested order: B1 → B2 (same window), then B5 (fixes user-visible FleetOps prod), then B3, B4, B6 as follow-ups. Phase A ships first since B3/B4/B5 deploy code from this repo.
Verification per phase
- A: `pytest tests/` (77 green before → must stay green, plus new tests for the rowcount fix and JSON parsing), `ruff check .`, `docker build` locally to confirm the image excludes `.env`/pbf (spot-check with `docker run --rm <img> ls -la /app`).
- B: after each deploy: `GET /health`, `GET /health/ingest` (should report `ok` overall once A3.1 lands), `docker logs` of ingest_worker for positive alarm counts, `SELECT * FROM reporting.v_ingest_health` shows no negative `rows_inserted` on new rows.