Compare commits

..

No commits in common. "5703d70aa6fa4ab57f183fe87144b4725f3e034f" and "2309464ab8549a0b50cab79eed9544b4e7e3225a" have entirely different histories.

42 changed files with 17 additions and 656 deletions

View file

@ -19,7 +19,7 @@ docker exec $DB psql -U postgres -d tracksolid_db -c "SELECT COUNT(*) FROM track
**Run a migration file:** **Run a migration file:**
```bash ```bash
docker exec -i $DB psql -U postgres -d tracksolid_db < migrations/07_your_migration.sql docker exec -i $DB psql -U postgres -d tracksolid_db < 07_your_migration.sql
``` ```
--- ---
@ -91,19 +91,17 @@ dwh/ # DWH migrations for tracksolid_dwh@31.97.44.246:588
# 261002_bronze_constraints_audit.sql — ON CONFLICT key assertion # 261002_bronze_constraints_audit.sql — ON CONFLICT key assertion
# 261003_dwh_roles.sql — role contract assertion # 261003_dwh_roles.sql — role contract assertion
# 261004_dwh_observability_views.sql — freshness/failure views # 261004_dwh_observability_views.sql — freshness/failure views
migrations/ # Numbered SQL migrations 0210, applied in order by run_migrations.py 02_tracksolid_full_schema_rev.sql # Full schema bootstrap
# 02 full schema · 03 webhook · 04 distance fix · 05 enhancements 03..06_*.sql # Incremental migrations (06 adds assigned_city, dispatch_log, ops.*)
# 06 ops/analytics · 07 views · 08 config · 09 trips enrichment 07_analytics_views.sql # Analytics views migration (applied 2026-04-21)
# 10_driver_clock_views.sql · 10_pgbouncer_auth.sql
Dockerfile # Custom image for ingest/webhook containers Dockerfile # Custom image for ingest/webhook containers
pyproject.toml # Python project + uv dependency spec pyproject.toml # Python project + uv dependency spec
OPERATIONS_MANUAL.md # Day-to-day ops runbook
backup/ # pg_dump sidecar scripts and config backup/ # pg_dump sidecar scripts and config
data/ # Source CSVs (FS Logistics 144-device list, FSG vehicles) 01_BusinessAnalytics.md # SQL analytics library — read before writing queries
legacy/ # Superseded pre-_rev scripts + old pipeline notes (NOT deployed) 20260414_FS__Logistics - final_fixed.csv # 144-device driver/vehicle source data
docs/manuals/ # OPERATIONS_MANUAL, grafana + DWH manuals, docker commands, DB manual tracksolidApiDocumentation.md # API endpoint reference
docs/reference/ # 01_BusinessAnalytics.md (SQL library — read before writing queries), 260412_baseline_report.md # Fleet state snapshot (Apr 2026)
# tracksolidApiDocumentation.md, 260507_pgbouncer_deployment.md
docs/reports/ # Baseline reports, audit output, improvement reviews
``` ```
--- ---
@ -173,7 +171,7 @@ dwh_control.v_watermark_lag -- Grafana: extract vs. load lag per table
## 6. API Critical Facts ## 6. API Critical Facts
**Always read `docs/reference/tracksolidApiDocumentation.md` before adding a new endpoint call.** **Always read `tracksolidApiDocumentation.md` before adding a new endpoint call.**
| Fact | Detail | | Fact | Detail |
|---|---| |---|---|
@ -211,7 +209,7 @@ dwh_control.v_watermark_lag -- Grafana: extract vs. load lag per table
1. **No prod push without explicit user confirmation.** Always state what you are about to push and wait. 1. **No prod push without explicit user confirmation.** Always state what you are about to push and wait.
2. **Never rewrite a migration that is already applied.** Check `tracksolid.schema_migrations` first. Add a new numbered migration file for any schema change. 2. **Never rewrite a migration that is already applied.** Check `tracksolid.schema_migrations` first. Add a new numbered migration file for any schema change.
3. **Read before writing.** Before suggesting any code change, read the relevant source file. Before writing a query, check `docs/reference/01_BusinessAnalytics.md` for an existing pattern. 3. **Read before writing.** Before suggesting any code change, read the relevant source file. Before writing a query, check `01_BusinessAnalytics.md` for an existing pattern.
4. **Reuse shared utilities.** All DB access via `get_conn()`, all API calls via `api_post()`, all cleaning via `clean()` / `clean_num()` / `clean_int()` / `clean_ts()` in `ts_shared_rev.py`. Do not reinvent these. 4. **Reuse shared utilities.** All DB access via `get_conn()`, all API calls via `api_post()`, all cleaning via `clean()` / `clean_num()` / `clean_int()` / `clean_ts()` in `ts_shared_rev.py`. Do not reinvent these.
5. **Resolve container names dynamically.** Never hardcode the Coolify suffix. Use `docker ps --filter name=<service>`. 5. **Resolve container names dynamically.** Never hardcode the Coolify suffix. Use `docker ps --filter name=<service>`.
6. **SSH only when asked.** Default workflow is local code → commit → push. SSH into the instance only when explicitly asked to test or run something live. 6. **SSH only when asked.** Default workflow is local code → commit → push. SSH into the instance only when explicitly asked to test or run something live.
@ -237,7 +235,7 @@ dwh_control.v_watermark_lag -- Grafana: extract vs. load lag per table
| Cities active | Nairobi (primary), Mombasa (deploying), Kampala (4 devices in CSV) | | Cities active | Nairobi (primary), Mombasa (deploying), Kampala (4 devices in CSV) |
| Service flags | KDK 829A GP (239,264 km), Belta KCU-647D (235,000 km) | | Service flags | KDK 829A GP (239,264 km), Belta KCU-647D (235,000 km) |
Latest full snapshot: `docs/reports/260412_baseline_report.md` Latest full snapshot: `260412_baseline_report.md`
--- ---

View file

@ -1,238 +0,0 @@
"""
dashboard_api_rev.py Fireside Communications · Map Dashboard Read API
Stable replacement for the n8n webhooks that fed the Live Position and Fleet
Trips map dashboards. n8n was acting only as a thin HTTPSQL proxy; this
service does the same job directly against the proven reporting.* functions,
removing n8n's credential-management / reload / version-drift fragility from
the live-data path.
It REUSES the existing stack: ts_shared_rev's psycopg2 pool and DATABASE_URL,
the same Docker image, the same Coolify deploy. The reporting.* functions
(already verified to return correct GeoJSON) are the single source of truth.
Endpoints mirror the original n8n webhook paths so the only frontend change
is the base URL (the `N8N_BASE` constant in each dashboard SPA):
GET /webhook/live-positions?cost_centre=&acc_status=
{ summary, geojson } (reporting.fn_live_positions)
GET /webhook/vehicle-track?vehicle_number=&hours=
GeoJSON Feature (reporting.fn_vehicle_track)
GET /webhook/fleet-dashboard
{ drivers, cost_centres, cities, vehicles } (filter options)
POST /webhook/fleet-dashboard body: {period, vehicle_numbers, driver,
cost_centre, assigned_city,
start_date, end_date}
trips payload (reporting.fn_trips_for_map)
GET /health
"""
from __future__ import annotations
import os
from contextlib import asynccontextmanager
from datetime import date, datetime, timedelta, timezone
import psycopg2.extras
from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from ts_shared_rev import close_pool, get_conn, get_logger
log = get_logger("dashboard_api")
# Comma-separated list of allowed browser origins (the dashboard domains).
_ALLOWED_ORIGINS = [
o.strip()
for o in os.getenv(
"DASHBOARD_CORS_ORIGINS",
"https://liveposition.rahamafresh.com,https://fleetintelligence.rahamafresh.com",
).split(",")
if o.strip()
]
@asynccontextmanager
async def lifespan(app: FastAPI):
log.info("Dashboard API starting (v1.0). Origins=%s", _ALLOWED_ORIGINS)
yield
close_pool()
app = FastAPI(title="Fireside Map Dashboard API", lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=_ALLOWED_ORIGINS,
allow_methods=["GET", "POST", "OPTIONS"],
allow_headers=["*"],
)
_EMPTY_GEOJSON = {"type": "FeatureCollection", "features": []}
def _clean(v):
"""Treat missing / blank / sentinel values as None (= SQL wildcard)."""
if v is None:
return None
s = str(v).strip()
return s if s and s.lower() not in ("null", "undefined") else None
# ── Health ────────────────────────────────────────────────────────────────────
@app.get("/health")
def health():
return {"status": "ok"}
# ── Live positions (#004) ───────────────────────────────────────────────────
@app.get("/webhook/live-positions")
def live_positions(cost_centre: str | None = None, acc_status: str | None = None):
try:
with get_conn() as conn:
with conn.cursor() as cur:
cur.execute(
"SELECT reporting.fn_live_positions(%s, %s)",
(_clean(cost_centre), _clean(acc_status)),
)
payload = cur.fetchone()[0] or {}
return JSONResponse(
{
"summary": payload.get("summary") or {},
"geojson": payload.get("geojson") or _EMPTY_GEOJSON,
}
)
except Exception:
log.exception("live-positions failed")
return JSONResponse(
{
"error": {
"type": "unknown",
"message": "Live-position feed is unavailable. Try again in a few seconds.",
}
}
)
@app.get("/webhook/vehicle-track")
def vehicle_track(vehicle_number: str | None = None, hours: int = 1):
veh = _clean(vehicle_number)
if not veh:
return JSONResponse({"error": "vehicle_number is required"})
hours = max(1, min(24, hours or 1))
try:
with get_conn() as conn:
with conn.cursor() as cur:
cur.execute(
"SELECT reporting.fn_vehicle_track(%s, %s::int)", (veh, hours)
)
feature = cur.fetchone()[0]
return JSONResponse(
feature
or {"type": "Feature", "geometry": {"type": "LineString", "coordinates": []}, "properties": {}}
)
except Exception:
log.exception("vehicle-track failed for %s", veh)
return JSONResponse({"error": "vehicle-track unavailable"})
# ── Fleet trips (#002) ───────────────────────────────────────────────────────
_FILTER_OPTIONS_SQL = """
SELECT
(SELECT array_agg(driver ORDER BY driver) FROM reporting.v_filter_drivers) AS drivers,
(SELECT array_agg(cost_centre ORDER BY cost_centre) FROM reporting.v_filter_cost_centres) AS cost_centres,
(SELECT array_agg(assigned_city ORDER BY assigned_city) FROM reporting.v_filter_cities) AS cities,
(SELECT jsonb_agg(jsonb_build_object(
'vehicle_number', vehicle_number, 'drivers', drivers,
'cost_centre', cost_centre, 'assigned_city', assigned_city)
ORDER BY vehicle_number) FROM reporting.v_filter_vehicles) AS vehicles
"""
@app.get("/webhook/fleet-dashboard")
def fleet_filter_options():
try:
with get_conn() as conn:
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
cur.execute(_FILTER_OPTIONS_SQL)
row = cur.fetchone() or {}
return JSONResponse(
{
"drivers": row.get("drivers") or [],
"cost_centres": row.get("cost_centres") or [],
"cities": row.get("cities") or [],
"vehicles": row.get("vehicles") or [],
}
)
except Exception:
log.exception("fleet-dashboard filter options failed")
return JSONResponse({"drivers": [], "cost_centres": [], "cities": [], "vehicles": []})
def _preset_to_range(period: str | None, start_date, end_date):
"""Mirror of the n8n preset_to_range node."""
today = datetime.now(timezone.utc).date()
p = (period or "").strip()
if p == "today":
return today, today
if p == "30d":
return today - timedelta(days=29), today
if p == "custom":
def _d(v, default):
v = _clean(v)
if not v:
return default
try:
return date.fromisoformat(v)
except ValueError:
return default
return _d(start_date, today), _d(end_date, today)
# default + '7d'
return today - timedelta(days=6), today
@app.post("/webhook/fleet-dashboard")
async def fleet_trips(request: Request):
try:
body = await request.json()
except Exception:
body = {}
if not isinstance(body, dict):
body = {}
body = body.get("body", body) if isinstance(body.get("body"), dict) else body
start, end = _preset_to_range(
body.get("period"), body.get("start_date"), body.get("end_date")
)
veh = _clean(body.get("vehicle_numbers")) # comma-separated string or None
try:
with get_conn() as conn:
with conn.cursor() as cur:
cur.execute(
"""
SELECT reporting.fn_trips_for_map(
CASE WHEN %(veh)s IS NULL THEN NULL
ELSE string_to_array(%(veh)s, ',') END,
%(driver)s, %(cc)s, %(city)s, %(start)s::date, %(end)s::date
)
""",
{
"veh": veh,
"driver": _clean(body.get("driver")),
"cc": _clean(body.get("cost_centre")),
"city": _clean(body.get("assigned_city")),
"start": start,
"end": end,
},
)
payload = cur.fetchone()[0]
return JSONResponse(payload if payload is not None else {})
except Exception:
log.exception("fleet-dashboard trips failed")
return JSONResponse(
{"error": {"type": "unknown", "message": "Fleet feed is unavailable. Try again in a few seconds."}}
)

View file

@ -59,31 +59,6 @@ services:
timeout: 5s timeout: 5s
retries: 3 retries: 3
dashboard_api:
# Stable read-API for the Live Position + Fleet Trips map dashboards.
# Replaces the n8n webhooks (n8n was only a thin HTTP->SQL proxy).
# Calls reporting.fn_live_positions / fn_vehicle_track / fn_trips_for_map.
build:
context: .
dockerfile: Dockerfile
command: sh -c "uvicorn dashboard_api_rev:app --host 0.0.0.0 --port 8890 --workers 2"
restart: always
depends_on:
timescale_db:
condition: service_healthy
env_file: .env
environment:
# Browser origins allowed to call this API (the dashboard domains).
- DASHBOARD_CORS_ORIGINS=${DASHBOARD_CORS_ORIGINS:-https://liveposition.rahamafresh.com,https://fleetintelligence.rahamafresh.com}
# No host port binding — set a domain (e.g. fleetapi.rahamafresh.com) in the
# Coolify UI pointing to this service on port 8890. The dashboards then point
# their N8N_BASE at that domain; paths (/webhook/...) are unchanged.
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8890/health"]
interval: 30s
timeout: 5s
retries: 3
grafana: grafana:
build: build:
context: ./grafana context: ./grafana
@ -106,7 +81,7 @@ services:
pgbouncer: pgbouncer:
# Connection pooler in front of timescale_db. # Connection pooler in front of timescale_db.
# Runbook: docs/reference/260507_pgbouncer_deployment.md # Runbook: 260507_pgbouncer_deployment.md
# Internal Docker network only — no host port. SCRAM passthrough via # Internal Docker network only — no host port. SCRAM passthrough via
# auth_query against the public.user_lookup() function (migration 10). # auth_query against the public.user_lookup() function (migration 10).
image: edoburu/pgbouncer image: edoburu/pgbouncer

View file

@ -1,374 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Tracksolid Stack — Engineering Review &amp; Improvement Plan (2026-06-01)</title>
<style>
:root {
--bg: #0f1115;
--panel: #171a21;
--panel-2: #1d212b;
--ink: #e6e9ef;
--ink-dim: #aab2c0;
--line: #2a2f3a;
--accent: #5b9dff;
--hi: #ff5d5d;
--med: #ffb454;
--lo: #5fd0a0;
--good: #5fd0a0;
--mono: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;
--sans: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
}
* { box-sizing: border-box; }
html { scroll-behavior: smooth; }
body {
margin: 0; background: var(--bg); color: var(--ink);
font-family: var(--sans); line-height: 1.6; font-size: 16px;
}
.wrap { max-width: 980px; margin: 0 auto; padding: 48px 28px 96px; }
header.doc {
border-bottom: 1px solid var(--line); padding-bottom: 28px; margin-bottom: 36px;
}
.kicker { color: var(--accent); font-family: var(--mono); font-size: 13px; letter-spacing: .12em; text-transform: uppercase; }
h1 { font-size: 30px; line-height: 1.25; margin: 10px 0 14px; }
.meta { color: var(--ink-dim); font-size: 14px; }
.meta b { color: var(--ink); font-weight: 600; }
h2 { font-size: 22px; margin: 44px 0 8px; padding-top: 10px; border-top: 1px solid var(--line); }
h2 .num { color: var(--accent); font-family: var(--mono); margin-right: 10px; }
h3 { font-size: 17px; margin: 26px 0 6px; }
p { margin: 10px 0; }
a { color: var(--accent); }
code { font-family: var(--mono); font-size: .87em; background: var(--panel-2); padding: 1px 6px; border-radius: 4px; color: #d7e3ff; }
pre { background: #0b0d12; border: 1px solid var(--line); border-radius: 8px; padding: 14px 16px; overflow-x: auto; font-family: var(--mono); font-size: 13px; color: #cdd6e4; }
ul, ol { margin: 10px 0 10px 4px; padding-left: 22px; }
li { margin: 5px 0; }
.lead { color: var(--ink-dim); font-size: 16px; }
.callout { border-left: 3px solid var(--accent); background: var(--panel); border-radius: 0 8px 8px 0; padding: 14px 18px; margin: 18px 0; }
.callout.warn { border-left-color: var(--hi); }
.callout.warn b { color: var(--hi); }
.finding { background: var(--panel); border: 1px solid var(--line); border-radius: 10px; padding: 4px 22px 18px; margin: 22px 0; }
.badge { display: inline-block; font-family: var(--mono); font-size: 11px; font-weight: 700; letter-spacing: .06em; padding: 3px 9px; border-radius: 999px; text-transform: uppercase; vertical-align: middle; margin-left: 8px; }
.b-hi { background: rgba(255,93,93,.15); color: var(--hi); border: 1px solid rgba(255,93,93,.35); }
.b-med { background: rgba(255,180,84,.13); color: var(--med); border: 1px solid rgba(255,180,84,.32); }
.b-lo { background: rgba(95,208,160,.12); color: var(--lo); border: 1px solid rgba(95,208,160,.3); }
.b-sec { background: rgba(91,157,255,.13); color: var(--accent); border: 1px solid rgba(91,157,255,.32); }
.ref { font-family: var(--mono); font-size: 12.5px; color: var(--ink-dim); }
table { width: 100%; border-collapse: collapse; margin: 18px 0; font-size: 14.5px; }
th, td { text-align: left; padding: 10px 12px; border-bottom: 1px solid var(--line); vertical-align: top; }
th { color: var(--ink-dim); font-weight: 600; font-size: 12.5px; text-transform: uppercase; letter-spacing: .05em; }
td.up-h { color: var(--hi); font-weight: 600; }
td.up-m { color: var(--med); font-weight: 600; }
td.up-l { color: var(--lo); font-weight: 600; }
.pill { font-family: var(--mono); font-size: 11px; padding: 2px 7px; border-radius: 4px; background: var(--panel-2); color: var(--ink-dim); }
.good-box { border: 1px solid rgba(95,208,160,.3); background: rgba(95,208,160,.05); border-radius: 10px; padding: 6px 22px 16px; margin: 22px 0; }
.good-box h2 { border-top: none; color: var(--good); }
footer { margin-top: 60px; padding-top: 22px; border-top: 1px solid var(--line); color: var(--ink-dim); font-size: 13px; }
.toc { background: var(--panel); border: 1px solid var(--line); border-radius: 10px; padding: 16px 22px; margin: 8px 0 0; }
.toc ol { margin: 6px 0; }
.toc a { text-decoration: none; }
.toc a:hover { text-decoration: underline; }
</style>
</head>
<body>
<div class="wrap">
<header class="doc">
<div class="kicker">Engineering Review · Fireside Communications · Tracksolid Fleet Stack</div>
<h1>Database &amp; Microservice Assessment — Opportunities &amp; Refactoring</h1>
<p class="meta">
<b>Date:</b> 2026-06-01 &nbsp;·&nbsp;
<b>Reviewer:</b> Claude (Opus 4.8) &nbsp;·&nbsp;
<b>Scope:</b> TimescaleDB/PostGIS schema + migrations, and the three ingestion microservices
(<code>ingest_movement_rev.py</code>, <code>ingest_events_rev.py</code>, <code>webhook_receiver_rev.py</code> + shared <code>ts_shared_rev.py</code>)
</p>
<p class="meta">Findings are ordered by <b>greatest performance upside first</b>, as requested.</p>
</header>
<div class="callout warn">
<p><b>Access caveat — read this first.</b> The remote instance was <b>unreachable from the review environment</b>:
every probed port (22, 5433, 5432, 443) timed out, so the IP is not whitelisted (or the host was down).
I could <b>not</b> run <code>EXPLAIN</code>, read live row/chunk counts, confirm which indexes actually exist,
or inspect the running images. Everything below is a <b>static review</b> of the source and migration files.
Items needing live confirmation are tagged <span class="pill">verify live</span>.</p>
<p style="margin-bottom:0"><b>Immediate security note:</b> <code>.env</code> is <b>committed to git</b> (it is listed in
<code>.gitignore</code>, but was tracked before that rule existed, so the rule is a no-op). The live Tracksolid app
secret, the Postgres superuser password, and the Grafana admin password are all in the repo history on Forgejo.
Treat all three as compromised and rotate them.</p>
</div>
<div class="toc">
<strong>Findings</strong>
<ol>
<li><a href="#f1">Single-threaded scheduler holds a DB transaction open across throttled geocoding</a> <span class="badge b-hi">High</span></li>
<li><a href="#f2">dwh_gold daily-metrics ETL is non-functional</a> <span class="badge b-hi">High</span></li>
<li><a href="#f3">v_driver_aggregates_daily will not scale; safeguard not applied</a> <span class="badge b-hi">High</span></li>
<li><a href="#f4">pgbouncer deployed but bypassed by the application</a> <span class="badge b-med">Medium</span></li>
<li><a href="#f5">Migrations race across three containers with no lock</a> <span class="badge b-med">Medium</span></li>
<li><a href="#f6">Orphaned migration: 10_driver_clock_views.sql never applied</a> <span class="badge b-med">Medium</span></li>
<li><a href="#f7">Security gaps (webhook auth, committed secrets)</a> <span class="badge b-sec">Security</span></li>
<li><a href="#f8">Smaller DB-design notes</a> <span class="badge b-lo">Low</span></li>
<li><a href="#good">What's genuinely good</a></li>
<li><a href="#plan">Suggested order of attack</a></li>
</ol>
</div>
<!-- ====================== FINDING 1 ====================== -->
<h2 id="f1"><span class="num">1</span>Single-threaded scheduler holds a DB transaction open across throttled geocoding<span class="badge b-hi">Highest upside</span></h2>
<div class="finding">
<p><code>ingest_movement_rev.py</code> runs every job on one <code>schedule</code> thread
(<span class="ref">main(), lines 674695</span>). Within that, <code>poll_trips()</code> opens a transaction
(<span class="ref">with get_conn(), line 343</span>) and then, <b>inside that open transaction</b>, calls
<code>reverse_geocode()</code> twice per trip (<span class="ref">lines 392393</span>).
<code>reverse_geocode</code> enforces a global <b>1 request/second</b> Nominatim throttle
(<span class="ref">ts_shared_rev.py:463, _geocode_throttle</span>).</p>
<h3>Consequences</h3>
<ul>
<li>A batch of N new trips can hold a single pooled connection open for <b>N×~2 seconds</b> of network I/O — a
long-running transaction that pins a snapshot (bad for autovacuum's cleanup horizon) and ties up a connection.</li>
<li>Because the scheduler is one thread, while <code>poll_trips</code> is geocoding, the <b>60-second live-position
sweep cannot run</b>. The "live" freshness SLA silently degrades to minutes whenever trips/parking work through a
backlog. <code>poll_track_list</code> (30 min) and <code>poll_stale_locations</code> (10 min) share the same
thread and also block each other.</li>
<li>Every 15 min, <code>poll_trips</code> re-runs the 8-subquery enrichment block (<code>_ENRICH_QUERY</code>,
<span class="ref">lines 295321</span>) for the <b>entire last hour</b> of trips, even though the
<code>ON CONFLICT</code> mostly <code>COALESCE</code>s the result away.</li>
</ul>
<h3>Recommendation</h3>
<ul>
<li>Move geocoding <b>out of the DB transaction</b>: collect trip rows, commit, then geocode + <code>UPDATE</code>
in a second pass (or delegate geocoding to a queue / n8n).</li>
<li>Gate enrichment on <code>WHERE start_address IS NULL</code> so already-enriched trips don't re-pay the cost.</li>
<li>Run the 60s live sweep on its own thread/process so slow reporting jobs cannot starve it.
<code>schedule</code> + <code>time.sleep(1)</code> on one thread is the wrong concurrency model when one job is
latency-critical and others do long network I/O.</li>
</ul>
</div>
<!-- ====================== FINDING 2 ====================== -->
<h2 id="f2"><span class="num">2</span>The <code>dwh_gold</code> daily-metrics ETL is non-functional<span class="badge b-hi">High</span></h2>
<div class="finding">
<p><code>dwh_gold.refresh_daily_metrics()</code> (<span class="ref">migration 05, lines 212264</span>) selects
<code>t.imei AS vehicle_key</code> and inserts into <code>fact_daily_fleet_metrics.vehicle_key</code>, which is
<code>INTEGER REFERENCES dwh_gold.dim_vehicles(vehicle_key)</code> (<span class="ref">schema lines 237243</span>).
But <code>imei</code> is a 1215-digit <b>TEXT</b> string:</p>
<ul>
<li>15-digit IMEIs overflow <code>int4</code><em>"integer out of range"</em>.</li>
<li>Shorter ones violate the FK because <b>nothing ever populates <code>dim_vehicles</code></b> — no code path
inserts into it.</li>
</ul>
<p>So the function cannot succeed as written, and <code>v_utilisation_daily</code> (which joins
<code>fact → dim_vehicles → devices</code>, <span class="ref">migration 07, lines 268286</span>) will always be
empty. CLAUDE.md lists "schedule the nightly ETL" as a LOW open item — but scheduling it today would error on every
run.</p>
<p style="margin-bottom:0"><b>Recommendation:</b> redesign the gold layer around <code>imei</code> (drop the surrogate
key, or populate <code>dim_vehicles</code> from <code>devices</code> first and look up the key), and fix the column
type. This is a real bug hiding behind "not scheduled yet."</p>
</div>
<!-- ====================== FINDING 3 ====================== -->
<h2 id="f3"><span class="num">3</span><code>v_driver_aggregates_daily</code> will not scale, and the safeguard wasn't applied<span class="badge b-hi">High</span></h2>
<div class="finding">
<p>Migration 07 (<span class="ref">lines 159223</span>) builds this view with two 31-day scans of
<code>position_history</code> plus a <code>LAG()</code> window over <code>source='track_list'</code> rows. There is
<b>no index on <code>position_history.source</code></b>, and the only index on the hypertable is the
<code>(imei, gps_time)</code> primary key.</p>
<p>The view's own header comment says <em>"convert to a continuous aggregate once the hypertable exceeds ~100k rows."</em>
At 156 devices writing a row/minute from the poll sweep plus track_list waypoints, you cross 100k in <b>days</b>, not
months. <span class="pill">verify live</span> current row + chunk count.</p>
<p style="margin-bottom:0"><b>Recommendation:</b> build the speeding/harsh aggregates as a TimescaleDB continuous
aggregate (the pattern already exists in <code>v_mileage_daily_cagg</code>), or at minimum add a partial index
supporting the <code>source='track_list'</code> + time filter. As-is, the daily driver dashboard does a growing full
hypertable scan on every load.</p>
</div>
<!-- ====================== FINDING 4 ====================== -->
<h2 id="f4"><span class="num">4</span>pgbouncer is deployed but the application bypasses it entirely<span class="badge b-med">Medium</span></h2>
<div class="finding">
<p><code>docker-compose.yaml</code> adds a pgbouncer sidecar (<span class="ref">lines 82116</span>) "to cap
tracksolid_db connections," but <code>.env</code> sets
<code>DATABASE_URL=...@timescale_db:5432/...</code> — the Python pools connect <b>straight to Postgres</b>, not to
pgbouncer's 6432.</p>
<p>So the connection cap does nothing for the three services. The real ceiling today is the sum of per-process pools:</p>
<pre>webhook : uvicorn --workers 2 → 2 procs × ThreadedConnectionPool(max=12) = 24
ingest_movement = 12
ingest_events = 12
total ≈ 48 direct conns</pre>
<p>At 80156 devices this is not a live performance problem — it is wasted/contradictory infrastructure and an
intent-vs-reality gap. You also maintain a SCRAM-passthrough <code>user_lookup()</code> SECURITY DEFINER function
(<span class="ref">migration 10</span>) with no consumer.</p>
<p style="margin-bottom:0"><b>Recommendation:</b> either point <code>DATABASE_URL</code> at <code>pgbouncer:6432</code>
(transaction-pool mode disallows session features, but the code uses none beyond <code>client_encoding</code>), or
remove the sidecar.</p>
</div>
<!-- ====================== FINDING 5 ====================== -->
<h2 id="f5"><span class="num">5</span>Migrations race across three containers with no lock<span class="badge b-med">Medium · reliability</span></h2>
<div class="finding">
<p>All three services run <code>python run_migrations.py</code> on startup (<span class="ref">compose lines 26, 37,
48</span>) and start in parallel once the DB is healthy. <code>run_migrations.py</code> does check-then-act
(<code>already_applied()</code><code>run_file()</code>, <span class="ref">lines 231242</span>) with <b>no
advisory lock</b>. On a fresh database, three containers can pass <code>already_applied()==False</code>
simultaneously and run the same file.</p>
<ul>
<li>Migration 02's <code>CREATE TRIGGER</code> loop (<span class="ref">lines 255267</span>) has no
<code>IF NOT EXISTS</code> — concurrent runs throw, and <code>run_file()</code> treats any <code>ERROR:</code> as
fatal → <code>sys.exit(1)</code> → a service refuses to start.</li>
<li><code>run_file()</code> greps stderr for <code>ERROR:</code> without <code>-v ON_ERROR_STOP=1</code>, and files
02/03 have no <code>BEGIN/COMMIT</code>, so a mid-file failure can leave partial schema that later gets mis-seeded
as "applied."</li>
</ul>
<p style="margin-bottom:0"><b>Recommendation:</b> wrap the run in <code>pg_advisory_lock(&lt;const&gt;)</code> /
unlock, and run psql with <code>ON_ERROR_STOP=1</code>. Low effort, removes a class of cold-start flakiness.</p>
</div>
<!-- ====================== FINDING 6 ====================== -->
<h2 id="f6"><span class="num">6</span>Orphaned migration: <code>10_driver_clock_views.sql</code> is never applied<span class="badge b-med">Medium</span></h2>
<div class="finding">
<p>The runner's <code>MIGRATIONS</code> list (<span class="ref">run_migrations.py:2737</span>) includes
<code>10_pgbouncer_auth.sql</code> but <b>not</b> <code>10_driver_clock_views.sql</code>. Two files share the
<code>10_</code> prefix and the list is hand-maintained, so <code>v_driver_clock_daily/_today</code> (which the n8n
tardiness workflow depends on, per the file header) exist only if someone applied them by hand — they are not
reproducible from a clean deploy.</p>
<p style="margin-bottom:0"><b>Recommendation:</b> rename to <code>11_</code> and add to the list. Better: switch the
runner from a hardcoded list to globbing <code>NN_*.sql</code> sorted, so this cannot recur.</p>
</div>
<!-- ====================== FINDING 7 ====================== -->
<h2 id="f7"><span class="num">7</span>Security gaps worth fixing now<span class="badge b-sec">Security</span></h2>
<div class="finding">
<ul>
<li><b>Webhook auth is effectively off.</b> <code>_validate_token</code>
(<span class="ref">webhook_receiver_rev.py:8487</span>) skips validation entirely when
<code>JIMI_WEBHOOK_TOKEN</code> is empty, and it is <b>not set in <code>.env</code></b>. The push endpoints are
exposed via Traefik, so anyone who learns the URL can inject arbitrary telemetry/alarms (each <code>/pushgps</code>
accepts up to 5000 rows, no rate limit). Set the token and make an unset token <b>fail closed</b> in production.</li>
<li><b>Committed secrets</b> (see top banner). Rotate the Tracksolid app secret, Postgres password, and Grafana admin
password; <code>git rm --cached .env</code> and scrub history.</li>
<li><code>dwh/260423_dwh_ddl_v1.sql</code> plaintext passwords are an existing known item in CLAUDE.md — same class of
problem.</li>
</ul>
</div>
<!-- ====================== FINDING 8 ====================== -->
<h2 id="f8"><span class="num">8</span>Smaller DB-design notes<span class="badge b-lo">Low — queue these</span></h2>
<div class="finding">
<ul>
<li><b><code>v_mileage_daily_cagg</code> is built on a column that's mostly NULL.</b> It computes
<code>MAX(current_mileage) - MIN(current_mileage)</code> (<span class="ref">schema lines 293301</span>), but
<code>current_mileage</code> is only populated by the poll sweep — <code>track_list</code> and <code>/pushgps</code>
inserts leave it NULL, and odometer resets/device swaps produce negative or huge deltas. The aggregate's
<code>dist_km</code> is unreliable. Prefer deriving daily distance from <code>trips.distance_km</code>.</li>
<li><b><code>ingestion_log</code> has no retention and no index.</b> <code>v_ingestion_health</code> does
<code>DISTINCT ON (endpoint) … ORDER BY endpoint, run_at DESC</code> over the whole table, which grows ~875
rows/day forever. Add <code>(endpoint, run_at DESC)</code> plus a retention/partition policy.</li>
<li><b>Alarm dedup is leaky on the poll path.</b> <code>alarms_dedup UNIQUE (imei, alarm_type, alarm_time)</code>
(<span class="ref">schema line 199</span>) — the poll path inserts <code>alertTypeId</code> as
<code>alarm_type</code> with no NOT-NULL guard, and <code>NULL</code> defeats the unique constraint
(<code>NULL ≠ NULL</code>), so a null-type alarm can duplicate. The webhook path guards this; the poll path
doesn't.</li>
<li><b><code>live_positions</code>/staleness queries are seq scans</b> (no index on <code>gps_time</code>) — totally
fine at ~156 rows today; just don't carry that pattern into anything that scans <code>position_history</code>.</li>
<li><b>Dead/ambiguous code in <code>_parse_request</code></b> (<span class="ref">webhook lines 90143</span>): the
JSON-array branch <code>_parse_data_list</code> is never reached (it always falls through to
<code>request.form()</code>); harmless but misleading given the docstring claims it handles both.</li>
</ul>
</div>
<!-- ====================== GOOD ====================== -->
<div class="good-box">
<h2 id="good" style="border-top:none;"><span class="num" style="color:var(--good)"></span>What's genuinely good</h2>
<p>So this is balanced — the bones are solid:</p>
<ul>
<li>Per-row <code>SAVEPOINT</code> isolation so one bad item can't abort a batch.</li>
<li>Time-guarded upserts via the shared <code>upsert_live_position</code> helper.</li>
<li>Batched <code>execute_values</code> on the high-volume push / track-list paths.</li>
<li>Hypertables with compression + retention policies.</li>
<li>Parameterized SQL throughout — no injection surface.</li>
<li>Clean signal handling and pool teardown.</li>
<li>Idempotent migrations with a tracking table and <code>COMMENT ON VIEW</code> provenance.</li>
<li><code>sync_devices</code> N+1 already parallelized with a bounded thread pool.</li>
</ul>
<p style="margin-bottom:0">The issues above are mostly about <b>coupling</b>, <b>one broken ETL</b>, and
<b>scale-ahead-of-indexing</b> — not a bad foundation.</p>
</div>
<!-- ====================== PLAN ====================== -->
<h2 id="plan"><span class="num">»</span>Suggested order of attack (effort vs. upside)</h2>
<table>
<thead>
<tr><th style="width:38px">#</th><th>Action</th><th style="width:130px">Upside</th><th style="width:80px">Effort</th></tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Pull geocoding out of the trips transaction + gate on <code>start_address IS NULL</code>; isolate the 60s sweep on its own thread</td>
<td class="up-h">High — restores live freshness, frees connections</td>
<td>M</td>
</tr>
<tr>
<td>2</td>
<td>Fix or redesign <code>refresh_daily_metrics</code> / <code>dim_vehicles</code> (imei vs int key)</td>
<td class="up-h">High — unblocks all utilisation reporting</td>
<td>M</td>
</tr>
<tr>
<td>3</td>
<td>Convert <code>v_driver_aggregates_daily</code> to a continuous aggregate (or add <code>source</code>+time index)</td>
<td class="up-h">High and growing</td>
<td>M</td>
</tr>
<tr>
<td>4</td>
<td>Set <code>JIMI_WEBHOOK_TOKEN</code>; rotate + untrack <code>.env</code></td>
<td class="up-h">High (security)</td>
<td>S</td>
</tr>
<tr>
<td>5</td>
<td>Advisory-lock the migration runner + <code>ON_ERROR_STOP=1</code>; add <code>10_driver_clock_views</code> / switch to glob</td>
<td class="up-m">Medium (reliability)</td>
<td>S</td>
</tr>
<tr>
<td>6</td>
<td>Decide pgbouncer in-or-out; point <code>DATABASE_URL</code> accordingly</td>
<td class="up-m">Medium (clarity)</td>
<td>S</td>
</tr>
<tr>
<td>7</td>
<td><code>ingestion_log</code> index + retention; fix poll-path alarm null dedup; fix cagg distance source</td>
<td class="up-l">Lowmedium</td>
<td>S</td>
</tr>
</tbody>
</table>
<div class="callout">
<p style="margin:0"><b>Next step for live confirmation:</b> if I can get onto the box (whitelist the review IP for
5433, or an SSH tunnel), I'll confirm the <span class="pill">verify live</span> items — actual
<code>position_history</code> row/chunk counts, which indexes really exist, whether <code>refresh_daily_metrics</code>
has ever succeeded, and <code>EXPLAIN ANALYZE</code> on the heavier views — and tighten the priority order with real
numbers.</p>
</div>
<footer>
Generated 2026-06-01 by Claude (Opus 4.8) for Fireside Communications · Tracksolid Fleet Intelligence.
Static review only — no live database access was available at review time. File references use
<span class="ref">file:line</span> against the repository state on branch
<code>quality-program-2026-04-12</code>.
</footer>
</div>
</body>
</html>

0
documents.txt Normal file
View file

View file

@ -49,7 +49,7 @@ from ts_shared_rev import clean, clean_num, clean_ts, get_conn, get_logger
log = get_logger("csv_import") log = get_logger("csv_import")
DEFAULT_CSV_PATH = Path(__file__).parent / "data" / "20260427_FSG_Vehicles_mitieng.csv" DEFAULT_CSV_PATH = Path(__file__).parent / "20260427_FSG_Vehicles_mitieng.csv"
# Columns fetched from DB for diff comparison. # Columns fetched from DB for diff comparison.
DB_COLS = [ DB_COLS = [

0
push_webhook.md Normal file
View file

View file

@ -221,7 +221,7 @@ def main():
applied = skipped = 0 applied = skipped = 0
for sql_file in MIGRATIONS: for sql_file in MIGRATIONS:
path = os.path.join("/app", "migrations", sql_file) path = os.path.join("/app", sql_file)
if not os.path.exists(path): if not os.path.exists(path):
print(f" SKIP {sql_file} (file not found in /app)") print(f" SKIP {sql_file} (file not found in /app)")

View file

@ -55,10 +55,10 @@ run_sql -c "
" > /dev/null " > /dev/null
# ── Find and apply pending migrations ──────────────────────────────────────── # ── Find and apply pending migrations ────────────────────────────────────────
MIGRATION_FILES=$(find "$SCRIPT_DIR/migrations" -maxdepth 1 -name '[0-9][0-9]_*.sql' | sort) MIGRATION_FILES=$(find "$SCRIPT_DIR" -maxdepth 1 -name '[0-9][0-9]_*.sql' | sort)
if [[ -z "$MIGRATION_FILES" ]]; then if [[ -z "$MIGRATION_FILES" ]]; then
echo "No migration files found in $SCRIPT_DIR/migrations" echo "No migration files found in $SCRIPT_DIR"
exit 0 exit 0
fi fi