74 lines
6.9 KiB
Markdown
74 lines
6.9 KiB
Markdown
|
|
# Fix & Refactor Plan — 2026-07-02
|
||
|
|
|
||
|
|
Companion to `260702_platform_audit_report.md` (findings) and `260702_work_done.md`
|
||
|
|
(execution log). Goal per the brief: **faster, more scalable, and above all secure**,
|
||
|
|
without regressing any FIX-* in CLAUDE.md §7.
|
||
|
|
|
||
|
|
The plan splits into (A) code/repo changes implemented in this session, and
|
||
|
|
(B) operational actions that touch prod and therefore need explicit operator
|
||
|
|
confirmation (CLAUDE.md working rule 1) — commands are provided, not executed.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Phase A — code & repo changes (implemented locally, tests must stay green)
|
||
|
|
|
||
|
|
### A1. Security hardening in the repo
|
||
|
|
| Step | Finding | Change |
|
||
|
|
|---|---|---|
|
||
|
|
| A1.1 | SEC-04 | Add `.dockerignore` excluding `.env*`, `.git`, `.venv`, `*.osm.pbf`, CSV/data artefacts, docs, tests, caches. |
|
||
|
|
| A1.2 | SEC-05 | Dockerfile: `COPY uv.lock` + install with `uv pip install --system --require-hashes`-equivalent (`uv sync --frozen` layout) so builds are pinned and reproducible. Keep the runtime layout identical (`/app`, non-root user). |
|
||
|
|
| A1.3 | SEC-02 | `webhook_receiver_rev.py`: log a **CRITICAL** warning at startup when `JIMI_WEBHOOK_TOKEN` is empty, and add `WEBHOOK_REQUIRE_TOKEN=1` (opt-in) that refuses to start unauthenticated. Default stays permissive so the next deploy doesn't break live ingestion before the token is configured on the Jimi side; flipping the flag is step B3. |
|
||
|
|
| A1.4 | SEC-01 | `docker-compose.yaml`: publish Postgres as `${DB_BIND_ADDR:-127.0.0.1}:5433:5432`. Default becomes localhost-only **on the next stack redeploy**; remote tooling moves to an SSH tunnel (documented). Setting `DB_BIND_ADDR=0.0.0.0` in `.env` restores the old exposure deliberately. |
|
||
|
|
|
||
|
|
### A2. Correctness fixes
|
||
|
|
| Step | Finding | Change |
|
||
|
|
|---|---|---|
|
||
|
|
| A2.1 | BUG-P1 | Capture `cur.rowcount` immediately after the INSERT, before `RELEASE SAVEPOINT`, in `poll_alarms`, `poll_trips`, `poll_parking`. |
|
||
|
|
| A2.2 | BUG-P3 | `_parse_request`: branch on Content-Type — parse `application/json` bodies (`{"token", "data_list"}`), keep the observed form-encoded path (`msgType`/`data`) as-is. |
|
||
|
|
| A2.3 | BUG-P2 | Webhook endpoints: keep async request parsing, move each endpoint's DB work into a synchronous `_process_*(items)` function executed via `await asyncio.to_thread(...)`. Event loop never blocks on psycopg2 again; existing tests (which patch `webhook_receiver_rev.get_conn`) still pass because module references are unchanged. |
|
||
|
|
| A2.4 | BUG-P4 | `poll_trips`: three-phase restructure — (1) fetch API trips per batch with **no** connection held; (2) one short read transaction for position_history enrichment; (3) reverse-geocode with no connection held; (4) one write transaction for the upserts + log. Nominatim's 1 req/s throttle no longer holds a pool connection or transaction open. |
|
||
|
|
| A2.5 | BUG-P6 | `sync_devices`: when **every** configured target listed successfully and returned a sane device count, set `enabled_flag=0` for enabled IMEIs absent from the aggregate (and re-enable ones that return). Guard: skip the disable pass entirely if any target call failed or the aggregate is empty — an API outage must not disable the fleet. |
|
||
|
|
| A2.6 | BUG-P9 | `get_token()`: only `.replace(tzinfo=utc)` when the value is naive; otherwise use it as-is. |
|
||
|
|
| A2.7 | BUG-P8 | `webhook_receiver` + `ingest_events`: sanity-guard inbound event timestamps — reject alarm/event rows whose timestamp is > 2 days in the future or before 2026-01-01 (project epoch), logging a warning. Prevents 2019-clock devices polluting `alarms`/`position_history` further. |
|
||
|
|
|
||
|
|
### A3. Observability & data lifecycle
|
||
|
|
| Step | Finding | Change |
|
||
|
|
|---|---|---|
|
||
|
|
| A3.1 | BUG-P5 | New migration `21_ingest_health_active_only.sql`: redefine `reporting.v_ingest_health` to only include pipeline endpoints (explicit allow-list matching the ingest_worker schedule + webhook endpoints) so one-shot tools can't wedge `/health/ingest` at "stale". |
|
||
|
|
| A3.2 | BUG-P7 | New daily housekeeping job in the ingest worker: `DELETE FROM tracksolid.ingestion_log WHERE run_at < now() - interval '90 days'`; same for `reporting.refresh_log` (180 days). Registered at 02:30 daily. |
|
||
|
|
| A3.3 | OPS-03 | Remove the orphaned duplicate `migrations/10_driver_clock_views.sql` (never registered in `run_migrations.py`, superseded by the reporting layer). |
|
||
|
|
| A3.4 | OPS-04 | `pyproject.toml`: move ruff `select` under `[tool.ruff.lint]`. |
|
||
|
|
|
||
|
|
### A4. Explicit non-goals for this session
|
||
|
|
- No change to the `reporting.*` SQL functions (healthy; PERF-03 tuning is config-level).
|
||
|
|
- No pgbouncer reintroduction (removed deliberately 2026-06-10; connection counts are fine).
|
||
|
|
- No changes to `tickets.*` / fleettickets code (different repo) — flagged separately.
|
||
|
|
- No alarm retention policy (business decision on how much ACC history to keep).
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Phase B — operational actions (prod; NOT executed without confirmation)
|
||
|
|
|
||
|
|
| # | Finding | Action | Command sketch |
|
||
|
|
|---|---|---|---|
|
||
|
|
| B1 | SEC-01 | **Rotate the postgres superuser password** (it was shared in chat), update `.env` on twala, redeploy stack. | `ALTER USER postgres WITH PASSWORD '<new>';` then edit `.env`, Coolify redeploy. |
|
||
|
|
| B2 | SEC-01 | Close public 5433 (after B1): redeploy with the new compose default, then use `ssh -L 5433:localhost:5433` for local tools. Interim: `ufw` allow-list. |
|
||
|
|
| B3 | SEC-02 | Configure a push token in the Jimi/Tracksolid console, set `JIMI_WEBHOOK_TOKEN` in `.env`, set `WEBHOOK_REQUIRE_TOKEN=1`, redeploy webhook_receiver. |
|
||
|
|
| B4 | SEC-03 | Move services off the superuser: point `DATABASE_URL` for ingest_worker/webhook at `tracksolid_owner` (grants already exist for the tracksolid schema; verify with a staging dry-run), `REFRESH_DATABASE_URL` at `reporting_refresher`, db_backup at a dedicated dump role or keep postgres but only via internal network once B2 lands. |
|
||
|
|
| B5 | OPS-01 | Redeploy the prod dashboard_api bridge so it matches the repo (restores the 8 missing INC/CRQ/fuel routes): `scp dashboard_api_rev.py twala:~/ && ssh twala 'bash ~/deploy_dashboard_api.sh'`. |
|
||
|
|
| B6 | SEC-07 | Schedule a maintenance window to bump the TimescaleDB image to the latest pg16 minor. |
|
||
|
|
| B7 | PERF-05 | In the **fleettickets** repo: drop the three unused geo indexes (~134 MB), consider trimming/compressing the `raw` payload column. |
|
||
|
|
|
||
|
|
Suggested order: B1 → B2 (same window), then B5 (fixes user-visible FleetOps prod), then
|
||
|
|
B3, B4, B6 as follow-ups. Phase A ships first since B3/B4/B5 deploy code from this repo.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Verification per phase
|
||
|
|
- **A:** `pytest tests/` (77 green before → must stay green, plus new tests for the
|
||
|
|
rowcount fix and JSON parsing), `ruff check .`, `docker build` locally to confirm the
|
||
|
|
image excludes `.env`/pbf (spot-check with `docker run --rm <img> ls -la /app`).
|
||
|
|
- **B:** after each deploy: `GET /health`, `GET /health/ingest` (should report `ok`
|
||
|
|
overall once A3.1 lands), `docker logs` of ingest_worker for positive alarm counts,
|
||
|
|
`SELECT * FROM reporting.v_ingest_health` shows no negative `rows_inserted` on new rows.
|