Access caveat — read this first. The remote instance was unreachable from the review environment:
every probed port (22, 5433, 5432, 443) timed out, so the IP is not whitelisted (or the host was down).
I could not run EXPLAIN, read live row/chunk counts, confirm which indexes actually exist,
or inspect the running images. Everything below is a static review of the source and migration files.
Items needing live confirmation are tagged verify live.
Immediate security note: .env is committed to git (it is listed in
.gitignore, but was tracked before that rule existed, so the rule is a no-op). The live Tracksolid app
secret, the Postgres superuser password, and the Grafana admin password are all in the repo history on Forgejo.
Treat all three as compromised and rotate them.
ingest_movement_rev.py runs every job on one schedule thread
(main(), lines 674–695). Within that, poll_trips() opens a transaction
(with get_conn(), line 343) and then, inside that open transaction, calls
reverse_geocode() twice per trip (lines 392–393).
reverse_geocode enforces a global 1 request/second Nominatim throttle
(ts_shared_rev.py:463, _geocode_throttle).
poll_trips is geocoding, the 60-second live-position
sweep cannot run. The "live" freshness SLA silently degrades to minutes whenever trips/parking work through a
backlog. poll_track_list (30 min) and poll_stale_locations (10 min) share the same
thread and also block each other.poll_trips re-runs the 8-subquery enrichment block (_ENRICH_QUERY,
lines 295–321) for the entire last hour of trips, even though the
ON CONFLICT mostly COALESCEs the result away.UPDATE
in a second pass (or delegate geocoding to a queue / n8n).WHERE start_address IS NULL so already-enriched trips don't re-pay the cost.schedule + time.sleep(1) on one thread is the wrong concurrency model when one job is
latency-critical and others do long network I/O.dwh_gold daily-metrics ETL is non-functionalHighdwh_gold.refresh_daily_metrics() (migration 05, lines 212–264) selects
t.imei AS vehicle_key and inserts into fact_daily_fleet_metrics.vehicle_key, which is
INTEGER REFERENCES dwh_gold.dim_vehicles(vehicle_key) (schema lines 237–243).
But imei is a 12–15-digit TEXT string:
int4 → "integer out of range".dim_vehicles — no code path
inserts into it.So the function cannot succeed as written, and v_utilisation_daily (which joins
fact → dim_vehicles → devices, migration 07, lines 268–286) will always be
empty. CLAUDE.md lists "schedule the nightly ETL" as a LOW open item — but scheduling it today would error on every
run.
Recommendation: redesign the gold layer around imei (drop the surrogate
key, or populate dim_vehicles from devices first and look up the key), and fix the column
type. This is a real bug hiding behind "not scheduled yet."
v_driver_aggregates_daily will not scale, and the safeguard wasn't appliedHighMigration 07 (lines 159–223) builds this view with two 31-day scans of
position_history plus a LAG() window over source='track_list' rows. There is
no index on position_history.source, and the only index on the hypertable is the
(imei, gps_time) primary key.
The view's own header comment says "convert to a continuous aggregate once the hypertable exceeds ~100k rows." At 156 devices writing a row/minute from the poll sweep plus track_list waypoints, you cross 100k in days, not months. verify live current row + chunk count.
Recommendation: build the speeding/harsh aggregates as a TimescaleDB continuous
aggregate (the pattern already exists in v_mileage_daily_cagg), or at minimum add a partial index
supporting the source='track_list' + time filter. As-is, the daily driver dashboard does a growing full
hypertable scan on every load.
docker-compose.yaml adds a pgbouncer sidecar (lines 82–116) "to cap
tracksolid_db connections," but .env sets
DATABASE_URL=...@timescale_db:5432/... — the Python pools connect straight to Postgres, not to
pgbouncer's 6432.
So the connection cap does nothing for the three services. The real ceiling today is the sum of per-process pools:
webhook : uvicorn --workers 2 → 2 procs × ThreadedConnectionPool(max=12) = 24
ingest_movement = 12
ingest_events = 12
total ≈ 48 direct conns
At 80–156 devices this is not a live performance problem — it is wasted/contradictory infrastructure and an
intent-vs-reality gap. You also maintain a SCRAM-passthrough user_lookup() SECURITY DEFINER function
(migration 10) with no consumer.
Recommendation: either point DATABASE_URL at pgbouncer:6432
(transaction-pool mode disallows session features, but the code uses none beyond client_encoding), or
remove the sidecar.
All three services run python run_migrations.py on startup (compose lines 26, 37,
48) and start in parallel once the DB is healthy. run_migrations.py does check-then-act
(already_applied() → run_file(), lines 231–242) with no
advisory lock. On a fresh database, three containers can pass already_applied()==False
simultaneously and run the same file.
CREATE TRIGGER loop (lines 255–267) has no
IF NOT EXISTS — concurrent runs throw, and run_file() treats any ERROR: as
fatal → sys.exit(1) → a service refuses to start.run_file() greps stderr for ERROR: without -v ON_ERROR_STOP=1, and files
02/03 have no BEGIN/COMMIT, so a mid-file failure can leave partial schema that later gets mis-seeded
as "applied."Recommendation: wrap the run in pg_advisory_lock(<const>) /
unlock, and run psql with ON_ERROR_STOP=1. Low effort, removes a class of cold-start flakiness.
10_driver_clock_views.sql is never appliedMediumThe runner's MIGRATIONS list (run_migrations.py:27–37) includes
10_pgbouncer_auth.sql but not 10_driver_clock_views.sql. Two files share the
10_ prefix and the list is hand-maintained, so v_driver_clock_daily/_today (which the n8n
tardiness workflow depends on, per the file header) exist only if someone applied them by hand — they are not
reproducible from a clean deploy.
Recommendation: rename to 11_ and add to the list. Better: switch the
runner from a hardcoded list to globbing NN_*.sql sorted, so this cannot recur.
_validate_token
(webhook_receiver_rev.py:84–87) skips validation entirely when
JIMI_WEBHOOK_TOKEN is empty, and it is not set in .env. The push endpoints are
exposed via Traefik, so anyone who learns the URL can inject arbitrary telemetry/alarms (each /pushgps
accepts up to 5000 rows, no rate limit). Set the token and make an unset token fail closed in production.git rm --cached .env and scrub history.dwh/260423_dwh_ddl_v1.sql plaintext passwords are an existing known item in CLAUDE.md — same class of
problem.v_mileage_daily_cagg is built on a column that's mostly NULL. It computes
MAX(current_mileage) - MIN(current_mileage) (schema lines 293–301), but
current_mileage is only populated by the poll sweep — track_list and /pushgps
inserts leave it NULL, and odometer resets/device swaps produce negative or huge deltas. The aggregate's
dist_km is unreliable. Prefer deriving daily distance from trips.distance_km.ingestion_log has no retention and no index. v_ingestion_health does
DISTINCT ON (endpoint) … ORDER BY endpoint, run_at DESC over the whole table, which grows ~875
rows/day forever. Add (endpoint, run_at DESC) plus a retention/partition policy.alarms_dedup UNIQUE (imei, alarm_type, alarm_time)
(schema line 199) — the poll path inserts alertTypeId as
alarm_type with no NOT-NULL guard, and NULL defeats the unique constraint
(NULL ≠ NULL), so a null-type alarm can duplicate. The webhook path guards this; the poll path
doesn't.live_positions/staleness queries are seq scans (no index on gps_time) — totally
fine at ~156 rows today; just don't carry that pattern into anything that scans position_history._parse_request (webhook lines 90–143): the
JSON-array branch _parse_data_list is never reached (it always falls through to
request.form()); harmless but misleading given the docstring claims it handles both.So this is balanced — the bones are solid:
SAVEPOINT isolation so one bad item can't abort a batch.upsert_live_position helper.execute_values on the high-volume push / track-list paths.COMMENT ON VIEW provenance.sync_devices N+1 already parallelized with a bounded thread pool.The issues above are mostly about coupling, one broken ETL, and scale-ahead-of-indexing — not a bad foundation.
| # | Action | Upside | Effort |
|---|---|---|---|
| 1 | Pull geocoding out of the trips transaction + gate on start_address IS NULL; isolate the 60s sweep on its own thread |
High — restores live freshness, frees connections | M |
| 2 | Fix or redesign refresh_daily_metrics / dim_vehicles (imei vs int key) |
High — unblocks all utilisation reporting | M |
| 3 | Convert v_driver_aggregates_daily to a continuous aggregate (or add source+time index) |
High and growing | M |
| 4 | Set JIMI_WEBHOOK_TOKEN; rotate + untrack .env |
High (security) | S |
| 5 | Advisory-lock the migration runner + ON_ERROR_STOP=1; add 10_driver_clock_views / switch to glob |
Medium (reliability) | S |
| 6 | Decide pgbouncer in-or-out; point DATABASE_URL accordingly |
Medium (clarity) | S |
| 7 | ingestion_log index + retention; fix poll-path alarm null dedup; fix cagg distance source |
Low–medium | S |
Next step for live confirmation: if I can get onto the box (whitelist the review IP for
5433, or an SSH tunnel), I'll confirm the verify live items — actual
position_history row/chunk counts, which indexes really exist, whether refresh_daily_metrics
has ever succeeded, and EXPLAIN ANALYZE on the heavier views — and tighten the priority order with real
numbers.