# Fix & Refactor Plan — 2026-07-02 Companion to `260702_platform_audit_report.md` (findings) and `260702_work_done.md` (execution log). Goal per the brief: **faster, more scalable, and above all secure**, without regressing any FIX-* in CLAUDE.md §7. The plan splits into (A) code/repo changes implemented in this session, and (B) operational actions that touch prod and therefore need explicit operator confirmation (CLAUDE.md working rule 1) — commands are provided, not executed. --- ## Phase A — code & repo changes (implemented locally, tests must stay green) ### A1. Security hardening in the repo | Step | Finding | Change | |---|---|---| | A1.1 | SEC-04 | Add `.dockerignore` excluding `.env*`, `.git`, `.venv`, `*.osm.pbf`, CSV/data artefacts, docs, tests, caches. | | A1.2 | SEC-05 | Dockerfile: `COPY uv.lock` + install with `uv pip install --system --require-hashes`-equivalent (`uv sync --frozen` layout) so builds are pinned and reproducible. Keep the runtime layout identical (`/app`, non-root user). | | A1.3 | SEC-02 | `webhook_receiver_rev.py`: log a **CRITICAL** warning at startup when `JIMI_WEBHOOK_TOKEN` is empty, and add `WEBHOOK_REQUIRE_TOKEN=1` (opt-in) that refuses to start unauthenticated. Default stays permissive so the next deploy doesn't break live ingestion before the token is configured on the Jimi side; flipping the flag is step B3. | | A1.4 | SEC-01 | `docker-compose.yaml`: publish Postgres as `${DB_BIND_ADDR:-127.0.0.1}:5433:5432`. Default becomes localhost-only **on the next stack redeploy**; remote tooling moves to an SSH tunnel (documented). Setting `DB_BIND_ADDR=0.0.0.0` in `.env` restores the old exposure deliberately. | ### A2. Correctness fixes | Step | Finding | Change | |---|---|---| | A2.1 | BUG-P1 | Capture `cur.rowcount` immediately after the INSERT, before `RELEASE SAVEPOINT`, in `poll_alarms`, `poll_trips`, `poll_parking`. | | A2.2 | BUG-P3 | `_parse_request`: branch on Content-Type — parse `application/json` bodies (`{"token", "data_list"}`), keep the observed form-encoded path (`msgType`/`data`) as-is. | | A2.3 | BUG-P2 | Webhook endpoints: keep async request parsing, move each endpoint's DB work into a synchronous `_process_*(items)` function executed via `await asyncio.to_thread(...)`. Event loop never blocks on psycopg2 again; existing tests (which patch `webhook_receiver_rev.get_conn`) still pass because module references are unchanged. | | A2.4 | BUG-P4 | `poll_trips`: three-phase restructure — (1) fetch API trips per batch with **no** connection held; (2) one short read transaction for position_history enrichment; (3) reverse-geocode with no connection held; (4) one write transaction for the upserts + log. Nominatim's 1 req/s throttle no longer holds a pool connection or transaction open. | | A2.5 | BUG-P6 | `sync_devices`: when **every** configured target listed successfully and returned a sane device count, set `enabled_flag=0` for enabled IMEIs absent from the aggregate (and re-enable ones that return). Guard: skip the disable pass entirely if any target call failed or the aggregate is empty — an API outage must not disable the fleet. | | A2.6 | BUG-P9 | `get_token()`: only `.replace(tzinfo=utc)` when the value is naive; otherwise use it as-is. | | A2.7 | BUG-P8 | `webhook_receiver` + `ingest_events`: sanity-guard inbound event timestamps — reject alarm/event rows whose timestamp is > 2 days in the future or before 2026-01-01 (project epoch), logging a warning. Prevents 2019-clock devices polluting `alarms`/`position_history` further. | ### A3. Observability & data lifecycle | Step | Finding | Change | |---|---|---| | A3.1 | BUG-P5 | New migration `21_ingest_health_active_only.sql`: redefine `reporting.v_ingest_health` to only include pipeline endpoints (explicit allow-list matching the ingest_worker schedule + webhook endpoints) so one-shot tools can't wedge `/health/ingest` at "stale". | | A3.2 | BUG-P7 | New daily housekeeping job in the ingest worker: `DELETE FROM tracksolid.ingestion_log WHERE run_at < now() - interval '90 days'`; same for `reporting.refresh_log` (180 days). Registered at 02:30 daily. | | A3.3 | OPS-03 | Remove the orphaned duplicate `migrations/10_driver_clock_views.sql` (never registered in `run_migrations.py`, superseded by the reporting layer). | | A3.4 | OPS-04 | `pyproject.toml`: move ruff `select` under `[tool.ruff.lint]`. | ### A4. Explicit non-goals for this session - No change to the `reporting.*` SQL functions (healthy; PERF-03 tuning is config-level). - No pgbouncer reintroduction (removed deliberately 2026-06-10; connection counts are fine). - No changes to `tickets.*` / fleettickets code (different repo) — flagged separately. - No alarm retention policy (business decision on how much ACC history to keep). --- ## Phase B — operational actions (prod; NOT executed without confirmation) | # | Finding | Action | Command sketch | |---|---|---|---| | B1 | SEC-01 | **Rotate the postgres superuser password** (it was shared in chat), update `.env` on twala, redeploy stack. | `ALTER USER postgres WITH PASSWORD '';` then edit `.env`, Coolify redeploy. | | B2 | SEC-01 | Close public 5433 (after B1): redeploy with the new compose default, then use `ssh -L 5433:localhost:5433` for local tools. Interim: `ufw` allow-list. | | B3 | SEC-02 | Configure a push token in the Jimi/Tracksolid console, set `JIMI_WEBHOOK_TOKEN` in `.env`, set `WEBHOOK_REQUIRE_TOKEN=1`, redeploy webhook_receiver. | | B4 | SEC-03 | Move services off the superuser: point `DATABASE_URL` for ingest_worker/webhook at `tracksolid_owner` (grants already exist for the tracksolid schema; verify with a staging dry-run), `REFRESH_DATABASE_URL` at `reporting_refresher`, db_backup at a dedicated dump role or keep postgres but only via internal network once B2 lands. | | B5 | OPS-01 | Redeploy the prod dashboard_api bridge so it matches the repo (restores the 8 missing INC/CRQ/fuel routes): `scp dashboard_api_rev.py twala:~/ && ssh twala 'bash ~/deploy_dashboard_api.sh'`. | | B6 | SEC-07 | Schedule a maintenance window to bump the TimescaleDB image to the latest pg16 minor. | | B7 | PERF-05 | In the **fleettickets** repo: drop the three unused geo indexes (~134 MB), consider trimming/compressing the `raw` payload column. | Suggested order: B1 → B2 (same window), then B5 (fixes user-visible FleetOps prod), then B3, B4, B6 as follow-ups. Phase A ships first since B3/B4/B5 deploy code from this repo. --- ## Verification per phase - **A:** `pytest tests/` (77 green before → must stay green, plus new tests for the rowcount fix and JSON parsing), `ruff check .`, `docker build` locally to confirm the image excludes `.env`/pbf (spot-check with `docker run --rm ls -la /app`). - **B:** after each deploy: `GET /health`, `GET /health/ingest` (should report `ok` overall once A3.1 lands), `docker logs` of ingest_worker for positive alarm counts, `SELECT * FROM reporting.v_ingest_health` shows no negative `rows_inserted` on new rows.