Engineering Review · Fireside Communications · Tracksolid Fleet Stack

Database & Microservice Assessment — Opportunities & Refactoring

Date: 2026-06-01  ·  Reviewer: Claude (Opus 4.8)  ·  Scope: TimescaleDB/PostGIS schema + migrations, and the three ingestion microservices (ingest_movement_rev.py, ingest_events_rev.py, webhook_receiver_rev.py + shared ts_shared_rev.py)

Findings are ordered by greatest performance upside first, as requested.

Access caveat — read this first. The remote instance was unreachable from the review environment: every probed port (22, 5433, 5432, 443) timed out, so the IP is not whitelisted (or the host was down). I could not run EXPLAIN, read live row/chunk counts, confirm which indexes actually exist, or inspect the running images. Everything below is a static review of the source and migration files. Items needing live confirmation are tagged verify live.

Immediate security note: .env is committed to git (it is listed in .gitignore, but was tracked before that rule existed, so the rule is a no-op). The live Tracksolid app secret, the Postgres superuser password, and the Grafana admin password are all in the repo history on Forgejo. Treat all three as compromised and rotate them.

Findings
  1. Single-threaded scheduler holds a DB transaction open across throttled geocoding High
  2. dwh_gold daily-metrics ETL is non-functional High
  3. v_driver_aggregates_daily will not scale; safeguard not applied High
  4. pgbouncer deployed but bypassed by the application Medium
  5. Migrations race across three containers with no lock Medium
  6. Orphaned migration: 10_driver_clock_views.sql never applied Medium
  7. Security gaps (webhook auth, committed secrets) Security
  8. Smaller DB-design notes Low
  9. What's genuinely good
  10. Suggested order of attack

1Single-threaded scheduler holds a DB transaction open across throttled geocodingHighest upside

ingest_movement_rev.py runs every job on one schedule thread (main(), lines 674–695). Within that, poll_trips() opens a transaction (with get_conn(), line 343) and then, inside that open transaction, calls reverse_geocode() twice per trip (lines 392–393). reverse_geocode enforces a global 1 request/second Nominatim throttle (ts_shared_rev.py:463, _geocode_throttle).

Consequences

Recommendation

2The dwh_gold daily-metrics ETL is non-functionalHigh

dwh_gold.refresh_daily_metrics() (migration 05, lines 212–264) selects t.imei AS vehicle_key and inserts into fact_daily_fleet_metrics.vehicle_key, which is INTEGER REFERENCES dwh_gold.dim_vehicles(vehicle_key) (schema lines 237–243). But imei is a 12–15-digit TEXT string:

So the function cannot succeed as written, and v_utilisation_daily (which joins fact → dim_vehicles → devices, migration 07, lines 268–286) will always be empty. CLAUDE.md lists "schedule the nightly ETL" as a LOW open item — but scheduling it today would error on every run.

Recommendation: redesign the gold layer around imei (drop the surrogate key, or populate dim_vehicles from devices first and look up the key), and fix the column type. This is a real bug hiding behind "not scheduled yet."

3v_driver_aggregates_daily will not scale, and the safeguard wasn't appliedHigh

Migration 07 (lines 159–223) builds this view with two 31-day scans of position_history plus a LAG() window over source='track_list' rows. There is no index on position_history.source, and the only index on the hypertable is the (imei, gps_time) primary key.

The view's own header comment says "convert to a continuous aggregate once the hypertable exceeds ~100k rows." At 156 devices writing a row/minute from the poll sweep plus track_list waypoints, you cross 100k in days, not months. verify live current row + chunk count.

Recommendation: build the speeding/harsh aggregates as a TimescaleDB continuous aggregate (the pattern already exists in v_mileage_daily_cagg), or at minimum add a partial index supporting the source='track_list' + time filter. As-is, the daily driver dashboard does a growing full hypertable scan on every load.

4pgbouncer is deployed but the application bypasses it entirelyMedium

docker-compose.yaml adds a pgbouncer sidecar (lines 82–116) "to cap tracksolid_db connections," but .env sets DATABASE_URL=...@timescale_db:5432/... — the Python pools connect straight to Postgres, not to pgbouncer's 6432.

So the connection cap does nothing for the three services. The real ceiling today is the sum of per-process pools:

webhook   : uvicorn --workers 2  →  2 procs × ThreadedConnectionPool(max=12) = 24
ingest_movement                                                          = 12
ingest_events                                                            = 12
                                                              total ≈ 48 direct conns

At 80–156 devices this is not a live performance problem — it is wasted/contradictory infrastructure and an intent-vs-reality gap. You also maintain a SCRAM-passthrough user_lookup() SECURITY DEFINER function (migration 10) with no consumer.

Recommendation: either point DATABASE_URL at pgbouncer:6432 (transaction-pool mode disallows session features, but the code uses none beyond client_encoding), or remove the sidecar.

5Migrations race across three containers with no lockMedium · reliability

All three services run python run_migrations.py on startup (compose lines 26, 37, 48) and start in parallel once the DB is healthy. run_migrations.py does check-then-act (already_applied()run_file(), lines 231–242) with no advisory lock. On a fresh database, three containers can pass already_applied()==False simultaneously and run the same file.

Recommendation: wrap the run in pg_advisory_lock(<const>) / unlock, and run psql with ON_ERROR_STOP=1. Low effort, removes a class of cold-start flakiness.

6Orphaned migration: 10_driver_clock_views.sql is never appliedMedium

The runner's MIGRATIONS list (run_migrations.py:27–37) includes 10_pgbouncer_auth.sql but not 10_driver_clock_views.sql. Two files share the 10_ prefix and the list is hand-maintained, so v_driver_clock_daily/_today (which the n8n tardiness workflow depends on, per the file header) exist only if someone applied them by hand — they are not reproducible from a clean deploy.

Recommendation: rename to 11_ and add to the list. Better: switch the runner from a hardcoded list to globbing NN_*.sql sorted, so this cannot recur.

7Security gaps worth fixing nowSecurity

8Smaller DB-design notesLow — queue these

What's genuinely good

So this is balanced — the bones are solid:

The issues above are mostly about coupling, one broken ETL, and scale-ahead-of-indexing — not a bad foundation.

»Suggested order of attack (effort vs. upside)

#ActionUpsideEffort
1 Pull geocoding out of the trips transaction + gate on start_address IS NULL; isolate the 60s sweep on its own thread High — restores live freshness, frees connections M
2 Fix or redesign refresh_daily_metrics / dim_vehicles (imei vs int key) High — unblocks all utilisation reporting M
3 Convert v_driver_aggregates_daily to a continuous aggregate (or add source+time index) High and growing M
4 Set JIMI_WEBHOOK_TOKEN; rotate + untrack .env High (security) S
5 Advisory-lock the migration runner + ON_ERROR_STOP=1; add 10_driver_clock_views / switch to glob Medium (reliability) S
6 Decide pgbouncer in-or-out; point DATABASE_URL accordingly Medium (clarity) S
7 ingestion_log index + retention; fix poll-path alarm null dedup; fix cagg distance source Low–medium S

Next step for live confirmation: if I can get onto the box (whitelist the review IP for 5433, or an SSH tunnel), I'll confirm the verify live items — actual position_history row/chunk counts, which indexes really exist, whether refresh_daily_metrics has ever succeeded, and EXPLAIN ANALYZE on the heavier views — and tighten the priority order with real numbers.