chore(repo): reorganize tree into migrations/ data/ legacy/ docs/

Group root-level files (accreted from incremental changes) by purpose
without moving any deployment entrypoint or breaking imports:

- migrations/  : numbered SQL 02-10
- data/        : source CSVs
- legacy/      : superseded pre-_rev scripts + old pipeline notes (not deployed)
- docs/{manuals,reference,reports}/ : loose manuals, references, reports
- strip stray ** / *** prefixes from 5 doc filenames
- delete empty documents.txt / push_webhook.md

Reference updates so nothing breaks:
- run_migrations.py  -> /app/migrations/<file>
- run_migrations.sh  -> $SCRIPT_DIR/migrations
- import_drivers_csv.py -> data/<csv>
- docker-compose.yaml -> runbook path comment
- CLAUDE.md -> codebase map + inline doc references

Deployed Python (3 services + ts_shared_rev + run_migrations) and the
documented ops one-shots stay at root, preserving the flat-import layout
and all documented commands. Verified: py_compile clean across all modules,
every MIGRATIONS entry resolves under migrations/, CI-referenced paths
(tests/, mypy targets, db_audit) and the grafana build context intact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
david kiania 2026-06-01 02:27:30 +03:00
parent 2309464ab8
commit e5b0e192d8
41 changed files with 393 additions and 17 deletions

View file

@ -19,7 +19,7 @@ docker exec $DB psql -U postgres -d tracksolid_db -c "SELECT COUNT(*) FROM track
**Run a migration file:**
```bash
docker exec -i $DB psql -U postgres -d tracksolid_db < 07_your_migration.sql
docker exec -i $DB psql -U postgres -d tracksolid_db < migrations/07_your_migration.sql
```
---
@ -91,17 +91,19 @@ dwh/ # DWH migrations for tracksolid_dwh@31.97.44.246:588
# 261002_bronze_constraints_audit.sql — ON CONFLICT key assertion
# 261003_dwh_roles.sql — role contract assertion
# 261004_dwh_observability_views.sql — freshness/failure views
02_tracksolid_full_schema_rev.sql # Full schema bootstrap
03..06_*.sql # Incremental migrations (06 adds assigned_city, dispatch_log, ops.*)
07_analytics_views.sql # Analytics views migration (applied 2026-04-21)
migrations/ # Numbered SQL migrations 0210, applied in order by run_migrations.py
# 02 full schema · 03 webhook · 04 distance fix · 05 enhancements
# 06 ops/analytics · 07 views · 08 config · 09 trips enrichment
# 10_driver_clock_views.sql · 10_pgbouncer_auth.sql
Dockerfile # Custom image for ingest/webhook containers
pyproject.toml # Python project + uv dependency spec
OPERATIONS_MANUAL.md # Day-to-day ops runbook
backup/ # pg_dump sidecar scripts and config
01_BusinessAnalytics.md # SQL analytics library — read before writing queries
20260414_FS__Logistics - final_fixed.csv # 144-device driver/vehicle source data
tracksolidApiDocumentation.md # API endpoint reference
260412_baseline_report.md # Fleet state snapshot (Apr 2026)
data/ # Source CSVs (FS Logistics 144-device list, FSG vehicles)
legacy/ # Superseded pre-_rev scripts + old pipeline notes (NOT deployed)
docs/manuals/ # OPERATIONS_MANUAL, grafana + DWH manuals, docker commands, DB manual
docs/reference/ # 01_BusinessAnalytics.md (SQL library — read before writing queries),
# tracksolidApiDocumentation.md, 260507_pgbouncer_deployment.md
docs/reports/ # Baseline reports, audit output, improvement reviews
```
---
@ -171,7 +173,7 @@ dwh_control.v_watermark_lag -- Grafana: extract vs. load lag per table
## 6. API Critical Facts
**Always read `tracksolidApiDocumentation.md` before adding a new endpoint call.**
**Always read `docs/reference/tracksolidApiDocumentation.md` before adding a new endpoint call.**
| Fact | Detail |
|---|---|
@ -209,7 +211,7 @@ dwh_control.v_watermark_lag -- Grafana: extract vs. load lag per table
1. **No prod push without explicit user confirmation.** Always state what you are about to push and wait.
2. **Never rewrite a migration that is already applied.** Check `tracksolid.schema_migrations` first. Add a new numbered migration file for any schema change.
3. **Read before writing.** Before suggesting any code change, read the relevant source file. Before writing a query, check `01_BusinessAnalytics.md` for an existing pattern.
3. **Read before writing.** Before suggesting any code change, read the relevant source file. Before writing a query, check `docs/reference/01_BusinessAnalytics.md` for an existing pattern.
4. **Reuse shared utilities.** All DB access via `get_conn()`, all API calls via `api_post()`, all cleaning via `clean()` / `clean_num()` / `clean_int()` / `clean_ts()` in `ts_shared_rev.py`. Do not reinvent these.
5. **Resolve container names dynamically.** Never hardcode the Coolify suffix. Use `docker ps --filter name=<service>`.
6. **SSH only when asked.** Default workflow is local code → commit → push. SSH into the instance only when explicitly asked to test or run something live.
@ -235,7 +237,7 @@ dwh_control.v_watermark_lag -- Grafana: extract vs. load lag per table
| Cities active | Nairobi (primary), Mombasa (deploying), Kampala (4 devices in CSV) |
| Service flags | KDK 829A GP (239,264 km), Belta KCU-647D (235,000 km) |
Latest full snapshot: `260412_baseline_report.md`
Latest full snapshot: `docs/reports/260412_baseline_report.md`
---

View file

@ -81,7 +81,7 @@ services:
pgbouncer:
# Connection pooler in front of timescale_db.
# Runbook: 260507_pgbouncer_deployment.md
# Runbook: docs/reference/260507_pgbouncer_deployment.md
# Internal Docker network only — no host port. SCRAM passthrough via
# auth_query against the public.user_lookup() function (migration 10).
image: edoburu/pgbouncer

View file

@ -0,0 +1,374 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Tracksolid Stack — Engineering Review &amp; Improvement Plan (2026-06-01)</title>
<style>
:root {
--bg: #0f1115;
--panel: #171a21;
--panel-2: #1d212b;
--ink: #e6e9ef;
--ink-dim: #aab2c0;
--line: #2a2f3a;
--accent: #5b9dff;
--hi: #ff5d5d;
--med: #ffb454;
--lo: #5fd0a0;
--good: #5fd0a0;
--mono: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;
--sans: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
}
* { box-sizing: border-box; }
html { scroll-behavior: smooth; }
body {
margin: 0; background: var(--bg); color: var(--ink);
font-family: var(--sans); line-height: 1.6; font-size: 16px;
}
.wrap { max-width: 980px; margin: 0 auto; padding: 48px 28px 96px; }
header.doc {
border-bottom: 1px solid var(--line); padding-bottom: 28px; margin-bottom: 36px;
}
.kicker { color: var(--accent); font-family: var(--mono); font-size: 13px; letter-spacing: .12em; text-transform: uppercase; }
h1 { font-size: 30px; line-height: 1.25; margin: 10px 0 14px; }
.meta { color: var(--ink-dim); font-size: 14px; }
.meta b { color: var(--ink); font-weight: 600; }
h2 { font-size: 22px; margin: 44px 0 8px; padding-top: 10px; border-top: 1px solid var(--line); }
h2 .num { color: var(--accent); font-family: var(--mono); margin-right: 10px; }
h3 { font-size: 17px; margin: 26px 0 6px; }
p { margin: 10px 0; }
a { color: var(--accent); }
code { font-family: var(--mono); font-size: .87em; background: var(--panel-2); padding: 1px 6px; border-radius: 4px; color: #d7e3ff; }
pre { background: #0b0d12; border: 1px solid var(--line); border-radius: 8px; padding: 14px 16px; overflow-x: auto; font-family: var(--mono); font-size: 13px; color: #cdd6e4; }
ul, ol { margin: 10px 0 10px 4px; padding-left: 22px; }
li { margin: 5px 0; }
.lead { color: var(--ink-dim); font-size: 16px; }
.callout { border-left: 3px solid var(--accent); background: var(--panel); border-radius: 0 8px 8px 0; padding: 14px 18px; margin: 18px 0; }
.callout.warn { border-left-color: var(--hi); }
.callout.warn b { color: var(--hi); }
.finding { background: var(--panel); border: 1px solid var(--line); border-radius: 10px; padding: 4px 22px 18px; margin: 22px 0; }
.badge { display: inline-block; font-family: var(--mono); font-size: 11px; font-weight: 700; letter-spacing: .06em; padding: 3px 9px; border-radius: 999px; text-transform: uppercase; vertical-align: middle; margin-left: 8px; }
.b-hi { background: rgba(255,93,93,.15); color: var(--hi); border: 1px solid rgba(255,93,93,.35); }
.b-med { background: rgba(255,180,84,.13); color: var(--med); border: 1px solid rgba(255,180,84,.32); }
.b-lo { background: rgba(95,208,160,.12); color: var(--lo); border: 1px solid rgba(95,208,160,.3); }
.b-sec { background: rgba(91,157,255,.13); color: var(--accent); border: 1px solid rgba(91,157,255,.32); }
.ref { font-family: var(--mono); font-size: 12.5px; color: var(--ink-dim); }
table { width: 100%; border-collapse: collapse; margin: 18px 0; font-size: 14.5px; }
th, td { text-align: left; padding: 10px 12px; border-bottom: 1px solid var(--line); vertical-align: top; }
th { color: var(--ink-dim); font-weight: 600; font-size: 12.5px; text-transform: uppercase; letter-spacing: .05em; }
td.up-h { color: var(--hi); font-weight: 600; }
td.up-m { color: var(--med); font-weight: 600; }
td.up-l { color: var(--lo); font-weight: 600; }
.pill { font-family: var(--mono); font-size: 11px; padding: 2px 7px; border-radius: 4px; background: var(--panel-2); color: var(--ink-dim); }
.good-box { border: 1px solid rgba(95,208,160,.3); background: rgba(95,208,160,.05); border-radius: 10px; padding: 6px 22px 16px; margin: 22px 0; }
.good-box h2 { border-top: none; color: var(--good); }
footer { margin-top: 60px; padding-top: 22px; border-top: 1px solid var(--line); color: var(--ink-dim); font-size: 13px; }
.toc { background: var(--panel); border: 1px solid var(--line); border-radius: 10px; padding: 16px 22px; margin: 8px 0 0; }
.toc ol { margin: 6px 0; }
.toc a { text-decoration: none; }
.toc a:hover { text-decoration: underline; }
</style>
</head>
<body>
<div class="wrap">
<header class="doc">
<div class="kicker">Engineering Review · Fireside Communications · Tracksolid Fleet Stack</div>
<h1>Database &amp; Microservice Assessment — Opportunities &amp; Refactoring</h1>
<p class="meta">
<b>Date:</b> 2026-06-01 &nbsp;·&nbsp;
<b>Reviewer:</b> Claude (Opus 4.8) &nbsp;·&nbsp;
<b>Scope:</b> TimescaleDB/PostGIS schema + migrations, and the three ingestion microservices
(<code>ingest_movement_rev.py</code>, <code>ingest_events_rev.py</code>, <code>webhook_receiver_rev.py</code> + shared <code>ts_shared_rev.py</code>)
</p>
<p class="meta">Findings are ordered by <b>greatest performance upside first</b>, as requested.</p>
</header>
<div class="callout warn">
<p><b>Access caveat — read this first.</b> The remote instance was <b>unreachable from the review environment</b>:
every probed port (22, 5433, 5432, 443) timed out, so the IP is not whitelisted (or the host was down).
I could <b>not</b> run <code>EXPLAIN</code>, read live row/chunk counts, confirm which indexes actually exist,
or inspect the running images. Everything below is a <b>static review</b> of the source and migration files.
Items needing live confirmation are tagged <span class="pill">verify live</span>.</p>
<p style="margin-bottom:0"><b>Immediate security note:</b> <code>.env</code> is <b>committed to git</b> (it is listed in
<code>.gitignore</code>, but was tracked before that rule existed, so the rule is a no-op). The live Tracksolid app
secret, the Postgres superuser password, and the Grafana admin password are all in the repo history on Forgejo.
Treat all three as compromised and rotate them.</p>
</div>
<div class="toc">
<strong>Findings</strong>
<ol>
<li><a href="#f1">Single-threaded scheduler holds a DB transaction open across throttled geocoding</a> <span class="badge b-hi">High</span></li>
<li><a href="#f2">dwh_gold daily-metrics ETL is non-functional</a> <span class="badge b-hi">High</span></li>
<li><a href="#f3">v_driver_aggregates_daily will not scale; safeguard not applied</a> <span class="badge b-hi">High</span></li>
<li><a href="#f4">pgbouncer deployed but bypassed by the application</a> <span class="badge b-med">Medium</span></li>
<li><a href="#f5">Migrations race across three containers with no lock</a> <span class="badge b-med">Medium</span></li>
<li><a href="#f6">Orphaned migration: 10_driver_clock_views.sql never applied</a> <span class="badge b-med">Medium</span></li>
<li><a href="#f7">Security gaps (webhook auth, committed secrets)</a> <span class="badge b-sec">Security</span></li>
<li><a href="#f8">Smaller DB-design notes</a> <span class="badge b-lo">Low</span></li>
<li><a href="#good">What's genuinely good</a></li>
<li><a href="#plan">Suggested order of attack</a></li>
</ol>
</div>
<!-- ====================== FINDING 1 ====================== -->
<h2 id="f1"><span class="num">1</span>Single-threaded scheduler holds a DB transaction open across throttled geocoding<span class="badge b-hi">Highest upside</span></h2>
<div class="finding">
<p><code>ingest_movement_rev.py</code> runs every job on one <code>schedule</code> thread
(<span class="ref">main(), lines 674695</span>). Within that, <code>poll_trips()</code> opens a transaction
(<span class="ref">with get_conn(), line 343</span>) and then, <b>inside that open transaction</b>, calls
<code>reverse_geocode()</code> twice per trip (<span class="ref">lines 392393</span>).
<code>reverse_geocode</code> enforces a global <b>1 request/second</b> Nominatim throttle
(<span class="ref">ts_shared_rev.py:463, _geocode_throttle</span>).</p>
<h3>Consequences</h3>
<ul>
<li>A batch of N new trips can hold a single pooled connection open for <b>N×~2 seconds</b> of network I/O — a
long-running transaction that pins a snapshot (bad for autovacuum's cleanup horizon) and ties up a connection.</li>
<li>Because the scheduler is one thread, while <code>poll_trips</code> is geocoding, the <b>60-second live-position
sweep cannot run</b>. The "live" freshness SLA silently degrades to minutes whenever trips/parking work through a
backlog. <code>poll_track_list</code> (30 min) and <code>poll_stale_locations</code> (10 min) share the same
thread and also block each other.</li>
<li>Every 15 min, <code>poll_trips</code> re-runs the 8-subquery enrichment block (<code>_ENRICH_QUERY</code>,
<span class="ref">lines 295321</span>) for the <b>entire last hour</b> of trips, even though the
<code>ON CONFLICT</code> mostly <code>COALESCE</code>s the result away.</li>
</ul>
<h3>Recommendation</h3>
<ul>
<li>Move geocoding <b>out of the DB transaction</b>: collect trip rows, commit, then geocode + <code>UPDATE</code>
in a second pass (or delegate geocoding to a queue / n8n).</li>
<li>Gate enrichment on <code>WHERE start_address IS NULL</code> so already-enriched trips don't re-pay the cost.</li>
<li>Run the 60s live sweep on its own thread/process so slow reporting jobs cannot starve it.
<code>schedule</code> + <code>time.sleep(1)</code> on one thread is the wrong concurrency model when one job is
latency-critical and others do long network I/O.</li>
</ul>
</div>
<!-- ====================== FINDING 2 ====================== -->
<h2 id="f2"><span class="num">2</span>The <code>dwh_gold</code> daily-metrics ETL is non-functional<span class="badge b-hi">High</span></h2>
<div class="finding">
<p><code>dwh_gold.refresh_daily_metrics()</code> (<span class="ref">migration 05, lines 212264</span>) selects
<code>t.imei AS vehicle_key</code> and inserts into <code>fact_daily_fleet_metrics.vehicle_key</code>, which is
<code>INTEGER REFERENCES dwh_gold.dim_vehicles(vehicle_key)</code> (<span class="ref">schema lines 237243</span>).
But <code>imei</code> is a 1215-digit <b>TEXT</b> string:</p>
<ul>
<li>15-digit IMEIs overflow <code>int4</code><em>"integer out of range"</em>.</li>
<li>Shorter ones violate the FK because <b>nothing ever populates <code>dim_vehicles</code></b> — no code path
inserts into it.</li>
</ul>
<p>So the function cannot succeed as written, and <code>v_utilisation_daily</code> (which joins
<code>fact → dim_vehicles → devices</code>, <span class="ref">migration 07, lines 268286</span>) will always be
empty. CLAUDE.md lists "schedule the nightly ETL" as a LOW open item — but scheduling it today would error on every
run.</p>
<p style="margin-bottom:0"><b>Recommendation:</b> redesign the gold layer around <code>imei</code> (drop the surrogate
key, or populate <code>dim_vehicles</code> from <code>devices</code> first and look up the key), and fix the column
type. This is a real bug hiding behind "not scheduled yet."</p>
</div>
<!-- ====================== FINDING 3 ====================== -->
<h2 id="f3"><span class="num">3</span><code>v_driver_aggregates_daily</code> will not scale, and the safeguard wasn't applied<span class="badge b-hi">High</span></h2>
<div class="finding">
<p>Migration 07 (<span class="ref">lines 159223</span>) builds this view with two 31-day scans of
<code>position_history</code> plus a <code>LAG()</code> window over <code>source='track_list'</code> rows. There is
<b>no index on <code>position_history.source</code></b>, and the only index on the hypertable is the
<code>(imei, gps_time)</code> primary key.</p>
<p>The view's own header comment says <em>"convert to a continuous aggregate once the hypertable exceeds ~100k rows."</em>
At 156 devices writing a row/minute from the poll sweep plus track_list waypoints, you cross 100k in <b>days</b>, not
months. <span class="pill">verify live</span> current row + chunk count.</p>
<p style="margin-bottom:0"><b>Recommendation:</b> build the speeding/harsh aggregates as a TimescaleDB continuous
aggregate (the pattern already exists in <code>v_mileage_daily_cagg</code>), or at minimum add a partial index
supporting the <code>source='track_list'</code> + time filter. As-is, the daily driver dashboard does a growing full
hypertable scan on every load.</p>
</div>
<!-- ====================== FINDING 4 ====================== -->
<h2 id="f4"><span class="num">4</span>pgbouncer is deployed but the application bypasses it entirely<span class="badge b-med">Medium</span></h2>
<div class="finding">
<p><code>docker-compose.yaml</code> adds a pgbouncer sidecar (<span class="ref">lines 82116</span>) "to cap
tracksolid_db connections," but <code>.env</code> sets
<code>DATABASE_URL=...@timescale_db:5432/...</code> — the Python pools connect <b>straight to Postgres</b>, not to
pgbouncer's 6432.</p>
<p>So the connection cap does nothing for the three services. The real ceiling today is the sum of per-process pools:</p>
<pre>webhook : uvicorn --workers 2 → 2 procs × ThreadedConnectionPool(max=12) = 24
ingest_movement = 12
ingest_events = 12
total ≈ 48 direct conns</pre>
<p>At 80156 devices this is not a live performance problem — it is wasted/contradictory infrastructure and an
intent-vs-reality gap. You also maintain a SCRAM-passthrough <code>user_lookup()</code> SECURITY DEFINER function
(<span class="ref">migration 10</span>) with no consumer.</p>
<p style="margin-bottom:0"><b>Recommendation:</b> either point <code>DATABASE_URL</code> at <code>pgbouncer:6432</code>
(transaction-pool mode disallows session features, but the code uses none beyond <code>client_encoding</code>), or
remove the sidecar.</p>
</div>
<!-- ====================== FINDING 5 ====================== -->
<h2 id="f5"><span class="num">5</span>Migrations race across three containers with no lock<span class="badge b-med">Medium · reliability</span></h2>
<div class="finding">
<p>All three services run <code>python run_migrations.py</code> on startup (<span class="ref">compose lines 26, 37,
48</span>) and start in parallel once the DB is healthy. <code>run_migrations.py</code> does check-then-act
(<code>already_applied()</code><code>run_file()</code>, <span class="ref">lines 231242</span>) with <b>no
advisory lock</b>. On a fresh database, three containers can pass <code>already_applied()==False</code>
simultaneously and run the same file.</p>
<ul>
<li>Migration 02's <code>CREATE TRIGGER</code> loop (<span class="ref">lines 255267</span>) has no
<code>IF NOT EXISTS</code> — concurrent runs throw, and <code>run_file()</code> treats any <code>ERROR:</code> as
fatal → <code>sys.exit(1)</code> → a service refuses to start.</li>
<li><code>run_file()</code> greps stderr for <code>ERROR:</code> without <code>-v ON_ERROR_STOP=1</code>, and files
02/03 have no <code>BEGIN/COMMIT</code>, so a mid-file failure can leave partial schema that later gets mis-seeded
as "applied."</li>
</ul>
<p style="margin-bottom:0"><b>Recommendation:</b> wrap the run in <code>pg_advisory_lock(&lt;const&gt;)</code> /
unlock, and run psql with <code>ON_ERROR_STOP=1</code>. Low effort, removes a class of cold-start flakiness.</p>
</div>
<!-- ====================== FINDING 6 ====================== -->
<h2 id="f6"><span class="num">6</span>Orphaned migration: <code>10_driver_clock_views.sql</code> is never applied<span class="badge b-med">Medium</span></h2>
<div class="finding">
<p>The runner's <code>MIGRATIONS</code> list (<span class="ref">run_migrations.py:2737</span>) includes
<code>10_pgbouncer_auth.sql</code> but <b>not</b> <code>10_driver_clock_views.sql</code>. Two files share the
<code>10_</code> prefix and the list is hand-maintained, so <code>v_driver_clock_daily/_today</code> (which the n8n
tardiness workflow depends on, per the file header) exist only if someone applied them by hand — they are not
reproducible from a clean deploy.</p>
<p style="margin-bottom:0"><b>Recommendation:</b> rename to <code>11_</code> and add to the list. Better: switch the
runner from a hardcoded list to globbing <code>NN_*.sql</code> sorted, so this cannot recur.</p>
</div>
<!-- ====================== FINDING 7 ====================== -->
<h2 id="f7"><span class="num">7</span>Security gaps worth fixing now<span class="badge b-sec">Security</span></h2>
<div class="finding">
<ul>
<li><b>Webhook auth is effectively off.</b> <code>_validate_token</code>
(<span class="ref">webhook_receiver_rev.py:8487</span>) skips validation entirely when
<code>JIMI_WEBHOOK_TOKEN</code> is empty, and it is <b>not set in <code>.env</code></b>. The push endpoints are
exposed via Traefik, so anyone who learns the URL can inject arbitrary telemetry/alarms (each <code>/pushgps</code>
accepts up to 5000 rows, no rate limit). Set the token and make an unset token <b>fail closed</b> in production.</li>
<li><b>Committed secrets</b> (see top banner). Rotate the Tracksolid app secret, Postgres password, and Grafana admin
password; <code>git rm --cached .env</code> and scrub history.</li>
<li><code>dwh/260423_dwh_ddl_v1.sql</code> plaintext passwords are an existing known item in CLAUDE.md — same class of
problem.</li>
</ul>
</div>
<!-- ====================== FINDING 8 ====================== -->
<h2 id="f8"><span class="num">8</span>Smaller DB-design notes<span class="badge b-lo">Low — queue these</span></h2>
<div class="finding">
<ul>
<li><b><code>v_mileage_daily_cagg</code> is built on a column that's mostly NULL.</b> It computes
<code>MAX(current_mileage) - MIN(current_mileage)</code> (<span class="ref">schema lines 293301</span>), but
<code>current_mileage</code> is only populated by the poll sweep — <code>track_list</code> and <code>/pushgps</code>
inserts leave it NULL, and odometer resets/device swaps produce negative or huge deltas. The aggregate's
<code>dist_km</code> is unreliable. Prefer deriving daily distance from <code>trips.distance_km</code>.</li>
<li><b><code>ingestion_log</code> has no retention and no index.</b> <code>v_ingestion_health</code> does
<code>DISTINCT ON (endpoint) … ORDER BY endpoint, run_at DESC</code> over the whole table, which grows ~875
rows/day forever. Add <code>(endpoint, run_at DESC)</code> plus a retention/partition policy.</li>
<li><b>Alarm dedup is leaky on the poll path.</b> <code>alarms_dedup UNIQUE (imei, alarm_type, alarm_time)</code>
(<span class="ref">schema line 199</span>) — the poll path inserts <code>alertTypeId</code> as
<code>alarm_type</code> with no NOT-NULL guard, and <code>NULL</code> defeats the unique constraint
(<code>NULL ≠ NULL</code>), so a null-type alarm can duplicate. The webhook path guards this; the poll path
doesn't.</li>
<li><b><code>live_positions</code>/staleness queries are seq scans</b> (no index on <code>gps_time</code>) — totally
fine at ~156 rows today; just don't carry that pattern into anything that scans <code>position_history</code>.</li>
<li><b>Dead/ambiguous code in <code>_parse_request</code></b> (<span class="ref">webhook lines 90143</span>): the
JSON-array branch <code>_parse_data_list</code> is never reached (it always falls through to
<code>request.form()</code>); harmless but misleading given the docstring claims it handles both.</li>
</ul>
</div>
<!-- ====================== GOOD ====================== -->
<div class="good-box">
<h2 id="good" style="border-top:none;"><span class="num" style="color:var(--good)"></span>What's genuinely good</h2>
<p>So this is balanced — the bones are solid:</p>
<ul>
<li>Per-row <code>SAVEPOINT</code> isolation so one bad item can't abort a batch.</li>
<li>Time-guarded upserts via the shared <code>upsert_live_position</code> helper.</li>
<li>Batched <code>execute_values</code> on the high-volume push / track-list paths.</li>
<li>Hypertables with compression + retention policies.</li>
<li>Parameterized SQL throughout — no injection surface.</li>
<li>Clean signal handling and pool teardown.</li>
<li>Idempotent migrations with a tracking table and <code>COMMENT ON VIEW</code> provenance.</li>
<li><code>sync_devices</code> N+1 already parallelized with a bounded thread pool.</li>
</ul>
<p style="margin-bottom:0">The issues above are mostly about <b>coupling</b>, <b>one broken ETL</b>, and
<b>scale-ahead-of-indexing</b> — not a bad foundation.</p>
</div>
<!-- ====================== PLAN ====================== -->
<h2 id="plan"><span class="num">»</span>Suggested order of attack (effort vs. upside)</h2>
<table>
<thead>
<tr><th style="width:38px">#</th><th>Action</th><th style="width:130px">Upside</th><th style="width:80px">Effort</th></tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Pull geocoding out of the trips transaction + gate on <code>start_address IS NULL</code>; isolate the 60s sweep on its own thread</td>
<td class="up-h">High — restores live freshness, frees connections</td>
<td>M</td>
</tr>
<tr>
<td>2</td>
<td>Fix or redesign <code>refresh_daily_metrics</code> / <code>dim_vehicles</code> (imei vs int key)</td>
<td class="up-h">High — unblocks all utilisation reporting</td>
<td>M</td>
</tr>
<tr>
<td>3</td>
<td>Convert <code>v_driver_aggregates_daily</code> to a continuous aggregate (or add <code>source</code>+time index)</td>
<td class="up-h">High and growing</td>
<td>M</td>
</tr>
<tr>
<td>4</td>
<td>Set <code>JIMI_WEBHOOK_TOKEN</code>; rotate + untrack <code>.env</code></td>
<td class="up-h">High (security)</td>
<td>S</td>
</tr>
<tr>
<td>5</td>
<td>Advisory-lock the migration runner + <code>ON_ERROR_STOP=1</code>; add <code>10_driver_clock_views</code> / switch to glob</td>
<td class="up-m">Medium (reliability)</td>
<td>S</td>
</tr>
<tr>
<td>6</td>
<td>Decide pgbouncer in-or-out; point <code>DATABASE_URL</code> accordingly</td>
<td class="up-m">Medium (clarity)</td>
<td>S</td>
</tr>
<tr>
<td>7</td>
<td><code>ingestion_log</code> index + retention; fix poll-path alarm null dedup; fix cagg distance source</td>
<td class="up-l">Lowmedium</td>
<td>S</td>
</tr>
</tbody>
</table>
<div class="callout">
<p style="margin:0"><b>Next step for live confirmation:</b> if I can get onto the box (whitelist the review IP for
5433, or an SSH tunnel), I'll confirm the <span class="pill">verify live</span> items — actual
<code>position_history</code> row/chunk counts, which indexes really exist, whether <code>refresh_daily_metrics</code>
has ever succeeded, and <code>EXPLAIN ANALYZE</code> on the heavier views — and tighten the priority order with real
numbers.</p>
</div>
<footer>
Generated 2026-06-01 by Claude (Opus 4.8) for Fireside Communications · Tracksolid Fleet Intelligence.
Static review only — no live database access was available at review time. File references use
<span class="ref">file:line</span> against the repository state on branch
<code>quality-program-2026-04-12</code>.
</footer>
</div>
</body>
</html>

View file

View file

@ -49,7 +49,7 @@ from ts_shared_rev import clean, clean_num, clean_ts, get_conn, get_logger
log = get_logger("csv_import")
DEFAULT_CSV_PATH = Path(__file__).parent / "20260427_FSG_Vehicles_mitieng.csv"
DEFAULT_CSV_PATH = Path(__file__).parent / "data" / "20260427_FSG_Vehicles_mitieng.csv"
# Columns fetched from DB for diff comparison.
DB_COLS = [

View file

View file

@ -221,7 +221,7 @@ def main():
applied = skipped = 0
for sql_file in MIGRATIONS:
path = os.path.join("/app", sql_file)
path = os.path.join("/app", "migrations", sql_file)
if not os.path.exists(path):
print(f" SKIP {sql_file} (file not found in /app)")

View file

@ -55,10 +55,10 @@ run_sql -c "
" > /dev/null
# ── Find and apply pending migrations ────────────────────────────────────────
MIGRATION_FILES=$(find "$SCRIPT_DIR" -maxdepth 1 -name '[0-9][0-9]_*.sql' | sort)
MIGRATION_FILES=$(find "$SCRIPT_DIR/migrations" -maxdepth 1 -name '[0-9][0-9]_*.sql' | sort)
if [[ -z "$MIGRATION_FILES" ]]; then
echo "No migration files found in $SCRIPT_DIR"
echo "No migration files found in $SCRIPT_DIR/migrations"
exit 0
fi