tracksolid_timescale_grafan.../docs/reports/260702_fix_plan.md
david kiania b11294009b
fix(security,ingest): 260702 audit — secure the stack, correct poller counters
Security:
- .dockerignore + Dockerfile: stop baking .env / the 346MB OSM pbf into image
  layers; install pinned from uv.lock (reproducible builds) (SEC-04/05).
- docker-compose: DB port binds ${DB_BIND_ADDR:-127.0.0.1} — loopback-only by
  default; remote tooling moves to an SSH tunnel (SEC-01).
- webhook_receiver: CRITICAL startup warning + WEBHOOK_REQUIRE_TOKEN=1 fail-closed
  when JIMI_WEBHOOK_TOKEN is empty (SEC-02 / FIX-W01).

Correctness:
- FIX-M22/E07: capture cur.rowcount BEFORE RELEASE SAVEPOINT in poll_alarms/
  poll_trips/poll_parking — the RELEASE reported -1, producing "Alarms: -4 new
  events inserted" logs and negative ingestion_log.rows_inserted.
- FIX-W02: parse application/json push bodies (were silently dropped).
- FIX-W03: move webhook DB work off the event loop via asyncio.to_thread.
- FIX-M23: poll_trips phased so no txn/connection is held across Tracksolid +
  Nominatim (1 req/s) network calls.
- FIX-M24: sync_devices disables devices absent from every target (guarded).
- FIX-W04: reject device-clock-garbage alarm_time (2019 timestamps observed).
- get_token(): don't relabel already-aware timestamptz expiries (BUG-P9).

Observability/lifecycle:
- migration 21: v_ingest_health restricted to active pipeline endpoints so
  one-shot tools stop wedging /health/ingest at 'stale' (dry-run verified).
- FIX-M25: daily purge_audit_logs() trims ingestion_log (90d) + refresh_log (180d).
- remove orphaned duplicate migrations/10_driver_clock_views.sql; ruff lint config.

+5 webhook tests (82 pass). Report/plan/work-log in docs/reports/260702_*.
Local only; not deployed. CLAUDE.md fix-history edits left uncommitted (that file
also carries unrelated in-progress edits).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 09:51:02 +03:00

Fix & Refactor Plan — 2026-07-02

Companion to 260702_platform_audit_report.md (findings) and 260702_work_done.md (execution log). Goal per the brief: faster, more scalable, and above all secure, without regressing any FIX-* in CLAUDE.md §7.

The plan splits into (A) code/repo changes implemented in this session, and (B) operational actions that touch prod and therefore need explicit operator confirmation (CLAUDE.md working rule 1) — commands are provided, not executed.


Phase A — code & repo changes (implemented locally, tests must stay green)

A1. Security hardening in the repo

| Step | Finding | Change |
| --- | --- | --- |
| A1.1 | SEC-04 | Add .dockerignore excluding .env*, .git, .venv, *.osm.pbf, CSV/data artefacts, docs, tests, caches. |
| A1.2 | SEC-05 | Dockerfile: COPY uv.lock and install from it with uv pip install --system, following the uv sync --frozen layout (the equivalent of a hash-pinned install), so builds are pinned and reproducible. Keep the runtime layout identical (/app, non-root user). |
| A1.3 | SEC-02 | webhook_receiver_rev.py: log a CRITICAL warning at startup when JIMI_WEBHOOK_TOKEN is empty, and add WEBHOOK_REQUIRE_TOKEN=1 (opt-in) that refuses to start unauthenticated. Default stays permissive so the next deploy doesn't break live ingestion before the token is configured on the Jimi side; flipping the flag is step B3. |
| A1.4 | SEC-01 | docker-compose.yaml: publish Postgres as ${DB_BIND_ADDR:-127.0.0.1}:5433:5432. Default becomes localhost-only on the next stack redeploy; remote tooling moves to an SSH tunnel (documented). Setting DB_BIND_ADDR=0.0.0.0 in .env restores the old exposure deliberately. |
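
The A1.3 behaviour can be sketched as a small startup check. This is an illustration under stated assumptions, not the code in webhook_receiver_rev.py: the function name `check_webhook_auth` and the logger name are hypothetical; only the `JIMI_WEBHOOK_TOKEN` and `WEBHOOK_REQUIRE_TOKEN` variables come from the plan.

```python
import logging
from collections.abc import Mapping

log = logging.getLogger("webhook_receiver")  # hypothetical logger name


def check_webhook_auth(env: Mapping[str, str]) -> bool:
    """Return True when a push token is configured.

    With an empty token the default is permissive: log CRITICAL and return
    False. With WEBHOOK_REQUIRE_TOKEN=1 the check fails closed by raising.
    """
    token = env.get("JIMI_WEBHOOK_TOKEN", "").strip()
    if token:
        return True
    if env.get("WEBHOOK_REQUIRE_TOKEN", "0") == "1":
        # Fail closed: refuse to start an unauthenticated receiver.
        raise RuntimeError(
            "JIMI_WEBHOOK_TOKEN is empty and WEBHOOK_REQUIRE_TOKEN=1"
        )
    log.critical(
        "JIMI_WEBHOOK_TOKEN is empty: webhook endpoints are unauthenticated"
    )
    return False
```

At startup this would presumably be called once with `os.environ`. Keeping the environment as a parameter makes the permissive and fail-closed paths easy to test without touching the process environment.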

A2. Correctness fixes

| Step | Finding | Change |
| --- | --- | --- |
| A2.1 | BUG-P1 | Capture cur.rowcount immediately after the INSERT, before RELEASE SAVEPOINT, in poll_alarms, poll_trips, poll_parking. |
| A2.2 | BUG-P3 | _parse_request: branch on Content-Type. Parse application/json bodies ({"token", "data_list"}); keep the observed form-encoded path (msgType/data) as-is. |
| A2.3 | BUG-P2 | Webhook endpoints: keep async request parsing, move each endpoint's DB work into a synchronous _process_*(items) function executed via await asyncio.to_thread(...). The event loop never blocks on psycopg2 again; existing tests (which patch webhook_receiver_rev.get_conn) still pass because module references are unchanged. |
| A2.4 | BUG-P4 | poll_trips: restructure into four phases: (1) fetch API trips per batch with no connection held; (2) one short read transaction for position_history enrichment; (3) reverse-geocode with no connection held; (4) one write transaction for the upserts + log. Nominatim's 1 req/s throttle no longer holds a pool connection or transaction open. |
| A2.5 | BUG-P6 | sync_devices: when every configured target listed successfully and returned a sane device count, set enabled_flag=0 for enabled IMEIs absent from the aggregate (and re-enable ones that return). Guard: skip the disable pass entirely if any target call failed or the aggregate is empty, because an API outage must not disable the fleet. |
| A2.6 | BUG-P9 | get_token(): only .replace(tzinfo=utc) when the value is naive; otherwise use it as-is. |
| A2.7 | BUG-P8 | webhook_receiver + ingest_events: sanity-guard inbound event timestamps. Reject alarm/event rows whose timestamp is more than 2 days in the future or before 2026-01-01 (project epoch), logging a warning. This prevents devices with 2019 clocks from polluting alarms/position_history further. |
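
The ordering problem behind A2.1 can be reproduced without a Postgres instance. The sketch below uses the standard-library `sqlite3` module as a stand-in for psycopg2 (an assumption made so the example is self-contained). The `alarms` table and the `insert_batch` helper are hypothetical. The point it illustrates is that `rowcount` describes the most recent statement, so it has to be read before `RELEASE SAVEPOINT` runs on the same cursor.

```python
import sqlite3


def insert_batch(conn: sqlite3.Connection, rows: list[tuple[int, str]]) -> tuple[int, int]:
    """Insert rows inside a savepoint; return (captured, after_release)."""
    cur = conn.cursor()
    cur.execute("SAVEPOINT batch")
    cur.executemany("INSERT INTO alarms (id, kind) VALUES (?, ?)", rows)
    inserted = cur.rowcount        # read while it still refers to the INSERT
    cur.execute("RELEASE SAVEPOINT batch")
    after_release = cur.rowcount   # now describes RELEASE, not the INSERT
    return inserted, after_release


# isolation_level=None disables implicit transactions, so the savepoint
# is the only transaction control in play.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE alarms (id INTEGER PRIMARY KEY, kind TEXT)")
captured, stale = insert_batch(
    conn, [(1, "sos"), (2, "overspeed"), (3, "geofence")]
)
print(f"captured={captured} after_release={stale}")
```

Logging `captured` gives the real insert count; logging the value read after the release is what produced the negative "new events inserted" figures described in the finding.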

A3. Observability & data lifecycle

| Step | Finding | Change |
| --- | --- | --- |
| A3.1 | BUG-P5 | New migration 21_ingest_health_active_only.sql: redefine reporting.v_ingest_health to only include pipeline endpoints (explicit allow-list matching the ingest_worker schedule + webhook endpoints) so one-shot tools can't wedge /health/ingest at "stale". |
| A3.2 | BUG-P7 | New daily housekeeping job in the ingest worker: DELETE FROM tracksolid.ingestion_log WHERE run_at < now() - interval '90 days'; same for reporting.refresh_log (180 days). Registered at 02:30 daily. |
| A3.3 | OPS-03 | Remove the orphaned duplicate migrations/10_driver_clock_views.sql (never registered in run_migrations.py, superseded by the reporting layer). |
| A3.4 | OPS-04 | pyproject.toml: move ruff select under [tool.ruff.lint]. |
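
The retention logic in A3.2 can be sketched as follows. This is a stand-in, not the worker's actual job: it uses `sqlite3` with unqualified table names and epoch-second timestamps instead of Postgres schemas, `timestamptz` and `interval`. The `run_at` column on `refresh_log` is an assumption (the plan only names it for `ingestion_log`). Scheduling at 02:30 is left to the worker's own scheduler.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Retention windows from the plan; table names are fixed constants, so
# interpolating them into the SQL below does not accept user input.
RETENTION = {
    "ingestion_log": timedelta(days=90),
    "refresh_log": timedelta(days=180),
}


def purge_audit_logs(conn: sqlite3.Connection, now: datetime) -> dict[str, int]:
    """Delete rows older than each table's retention; return rows deleted per table."""
    deleted = {}
    for table, keep in RETENTION.items():
        cutoff = (now - keep).timestamp()
        cur = conn.execute(f"DELETE FROM {table} WHERE run_at < ?", (cutoff,))
        deleted[table] = cur.rowcount  # read right after the DELETE
    conn.commit()
    return deleted


# Demo data: one row per age (in days) in each table.
now = datetime(2026, 7, 2, 2, 30, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
for table in RETENTION:
    conn.execute(f"CREATE TABLE {table} (run_at REAL)")
    conn.executemany(
        f"INSERT INTO {table} (run_at) VALUES (?)",
        [((now - timedelta(days=age)).timestamp(),) for age in (1, 89, 91, 181)],
    )
result = purge_audit_logs(conn, now)
print(result)
```

Passing `now` in explicitly keeps the cutoff deterministic in tests; the production statement can keep using `now()` on the database side.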

A4. Explicit non-goals for this session

  • No change to the reporting.* SQL functions (healthy; PERF-03 tuning is config-level).
  • No pgbouncer reintroduction (removed deliberately 2026-06-10; connection counts are fine).
  • No changes to tickets.* / fleettickets code (different repo) — flagged separately.
  • No alarm retention policy (business decision on how much ACC history to keep).

Phase B — operational actions (prod; NOT executed without confirmation)

| # | Finding | Action / command sketch |
| --- | --- | --- |
| B1 | SEC-01 | Rotate the postgres superuser password (it was shared in chat), update .env on twala, redeploy stack. ALTER USER postgres WITH PASSWORD '<new>'; then edit .env, Coolify redeploy. |
| B2 | SEC-01 | Close public 5433 (after B1): redeploy with the new compose default, then use ssh -L 5433:localhost:5433 for local tools. Interim: ufw allow-list. |
| B3 | SEC-02 | Configure a push token in the Jimi/Tracksolid console, set JIMI_WEBHOOK_TOKEN in .env, set WEBHOOK_REQUIRE_TOKEN=1, redeploy webhook_receiver. |
| B4 | SEC-03 | Move services off the superuser: point DATABASE_URL for ingest_worker/webhook at tracksolid_owner (grants already exist for the tracksolid schema; verify with a staging dry-run), REFRESH_DATABASE_URL at reporting_refresher, db_backup at a dedicated dump role or keep postgres but only via internal network once B2 lands. |
| B5 | OPS-01 | Redeploy the prod dashboard_api bridge so it matches the repo (restores the 8 missing INC/CRQ/fuel routes): scp dashboard_api_rev.py twala:~/ && ssh twala 'bash ~/deploy_dashboard_api.sh'. |
| B6 | SEC-07 | Schedule a maintenance window to bump the TimescaleDB image to the latest pg16 minor. |
| B7 | PERF-05 | In the fleettickets repo: drop the three unused geo indexes (~134 MB), consider trimming/compressing the raw payload column. |

Suggested order: B1 → B2 (same window), then B5 (fixes user-visible FleetOps prod), then B3, B4, B6 as follow-ups. Phase A ships first since B3/B4/B5 deploy code from this repo.


Verification per phase

  • A: pytest tests/ (77 green before → must stay green, plus new tests for the rowcount fix and JSON parsing), ruff check ., docker build locally to confirm the image excludes .env/pbf (spot-check with docker run --rm <img> ls -la /app).
  • B: after each deploy, check GET /health and GET /health/ingest (should report ok overall once A3.1 lands), the ingest_worker docker logs for positive alarm counts, and that SELECT * FROM reporting.v_ingest_health shows no negative rows_inserted on new rows.
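
A test for the JSON-parsing fix named in phase A could look roughly like the sketch below. It is not the project's `_parse_request`: `parse_push_body` is a hypothetical pure function that mirrors the Content-Type branching from A2.2, and the payload values are made up for illustration. Only the field names (`token`, `data_list`, `msgType`, `data`) come from the plan.

```python
import json
from urllib.parse import parse_qs


def parse_push_body(content_type: str, body: bytes) -> dict:
    """Parse a push body as JSON or form-encoded, based on Content-Type."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    text = body.decode("utf-8")
    if media_type == "application/json":
        payload = json.loads(text)
        if not isinstance(payload, dict):
            raise ValueError("JSON push body must be an object")
        return payload
    # Form-encoded path: parse_qs returns a list per key; keep the first value.
    return {key: values[0] for key, values in parse_qs(text).items()}


def test_json_body_is_parsed():
    body = b'{"token": "t", "data_list": [{"imei": "123"}]}'
    parsed = parse_push_body("application/json; charset=utf-8", body)
    assert parsed["token"] == "t"
    assert parsed["data_list"] == [{"imei": "123"}]


def test_form_body_is_unchanged():
    body = b"msgType=alarm&data=%5B%5D"
    parsed = parse_push_body("application/x-www-form-urlencoded", body)
    assert parsed == {"msgType": "alarm", "data": "[]"}
```

Both functions follow pytest's `test_*` naming, so they would be collected by `pytest tests/` if placed in a test module. The JSON case is the regression test, since those bodies were previously dropped.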