Compare commits
2 commits
2309464ab8
...
5703d70aa6
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
5703d70aa6 | ||
|
|
e5b0e192d8 |
42 changed files with 656 additions and 17 deletions
26
CLAUDE.md
26
CLAUDE.md
|
|
@ -19,7 +19,7 @@ docker exec $DB psql -U postgres -d tracksolid_db -c "SELECT COUNT(*) FROM track
|
|||
|
||||
**Run a migration file:**
|
||||
```bash
|
||||
docker exec -i $DB psql -U postgres -d tracksolid_db < 07_your_migration.sql
|
||||
docker exec -i $DB psql -U postgres -d tracksolid_db < migrations/07_your_migration.sql
|
||||
```
|
||||
|
||||
---
|
||||
|
|
@ -91,17 +91,19 @@ dwh/ # DWH migrations for tracksolid_dwh@31.97.44.246:588
|
|||
# 261002_bronze_constraints_audit.sql — ON CONFLICT key assertion
|
||||
# 261003_dwh_roles.sql — role contract assertion
|
||||
# 261004_dwh_observability_views.sql — freshness/failure views
|
||||
02_tracksolid_full_schema_rev.sql # Full schema bootstrap
|
||||
03..06_*.sql # Incremental migrations (06 adds assigned_city, dispatch_log, ops.*)
|
||||
07_analytics_views.sql # Analytics views migration (applied 2026-04-21)
|
||||
migrations/ # Numbered SQL migrations 02–10, applied in order by run_migrations.py
|
||||
# 02 full schema · 03 webhook · 04 distance fix · 05 enhancements
|
||||
# 06 ops/analytics · 07 views · 08 config · 09 trips enrichment
|
||||
# 10_driver_clock_views.sql · 10_pgbouncer_auth.sql
|
||||
Dockerfile # Custom image for ingest/webhook containers
|
||||
pyproject.toml # Python project + uv dependency spec
|
||||
OPERATIONS_MANUAL.md # Day-to-day ops runbook
|
||||
backup/ # pg_dump sidecar scripts and config
|
||||
01_BusinessAnalytics.md # SQL analytics library — read before writing queries
|
||||
20260414_FS__Logistics - final_fixed.csv # 144-device driver/vehicle source data
|
||||
tracksolidApiDocumentation.md # API endpoint reference
|
||||
260412_baseline_report.md # Fleet state snapshot (Apr 2026)
|
||||
data/ # Source CSVs (FS Logistics 144-device list, FSG vehicles)
|
||||
legacy/ # Superseded pre-_rev scripts + old pipeline notes (NOT deployed)
|
||||
docs/manuals/ # OPERATIONS_MANUAL, grafana + DWH manuals, docker commands, DB manual
|
||||
docs/reference/ # 01_BusinessAnalytics.md (SQL library — read before writing queries),
|
||||
# tracksolidApiDocumentation.md, 260507_pgbouncer_deployment.md
|
||||
docs/reports/ # Baseline reports, audit output, improvement reviews
|
||||
```
|
||||
|
||||
---
|
||||
|
|
@ -171,7 +173,7 @@ dwh_control.v_watermark_lag -- Grafana: extract vs. load lag per table
|
|||
|
||||
## 6. API Critical Facts
|
||||
|
||||
**Always read `tracksolidApiDocumentation.md` before adding a new endpoint call.**
|
||||
**Always read `docs/reference/tracksolidApiDocumentation.md` before adding a new endpoint call.**
|
||||
|
||||
| Fact | Detail |
|
||||
|---|---|
|
||||
|
|
@ -209,7 +211,7 @@ dwh_control.v_watermark_lag -- Grafana: extract vs. load lag per table
|
|||
|
||||
1. **No prod push without explicit user confirmation.** Always state what you are about to push and wait.
|
||||
2. **Never rewrite a migration that is already applied.** Check `tracksolid.schema_migrations` first. Add a new numbered migration file for any schema change.
|
||||
3. **Read before writing.** Before suggesting any code change, read the relevant source file. Before writing a query, check `01_BusinessAnalytics.md` for an existing pattern.
|
||||
3. **Read before writing.** Before suggesting any code change, read the relevant source file. Before writing a query, check `docs/reference/01_BusinessAnalytics.md` for an existing pattern.
|
||||
4. **Reuse shared utilities.** All DB access via `get_conn()`, all API calls via `api_post()`, all cleaning via `clean()` / `clean_num()` / `clean_int()` / `clean_ts()` in `ts_shared_rev.py`. Do not reinvent these.
|
||||
5. **Resolve container names dynamically.** Never hardcode the Coolify suffix. Use `docker ps --filter name=<service>`.
|
||||
6. **SSH only when asked.** Default workflow is local code → commit → push. SSH into the instance only when explicitly asked to test or run something live.
|
||||
|
|
@ -235,7 +237,7 @@ dwh_control.v_watermark_lag -- Grafana: extract vs. load lag per table
|
|||
| Cities active | Nairobi (primary), Mombasa (deploying), Kampala (4 devices in CSV) |
|
||||
| Service flags | KDK 829A GP (239,264 km), Belta KCU-647D (235,000 km) |
|
||||
|
||||
Latest full snapshot: `260412_baseline_report.md`
|
||||
Latest full snapshot: `docs/reports/260412_baseline_report.md`
|
||||
|
||||
---
|
||||
|
||||
|
|
|
|||
238
dashboard_api_rev.py
Normal file
238
dashboard_api_rev.py
Normal file
|
|
@ -0,0 +1,238 @@
|
|||
"""
|
||||
dashboard_api_rev.py — Fireside Communications · Map Dashboard Read API
|
||||
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
|
||||
Stable replacement for the n8n webhooks that fed the Live Position and Fleet
|
||||
Trips map dashboards. n8n was acting only as a thin HTTP→SQL proxy; this
|
||||
service does the same job directly against the proven reporting.* functions,
|
||||
removing n8n's credential-management / reload / version-drift fragility from
|
||||
the live-data path.
|
||||
|
||||
It REUSES the existing stack: ts_shared_rev's psycopg2 pool and DATABASE_URL,
|
||||
the same Docker image, the same Coolify deploy. The reporting.* functions
|
||||
(already verified to return correct GeoJSON) are the single source of truth.
|
||||
|
||||
Endpoints mirror the original n8n webhook paths so the only frontend change
|
||||
is the base URL (the `N8N_BASE` constant in each dashboard SPA):
|
||||
|
||||
GET /webhook/live-positions?cost_centre=&acc_status=
|
||||
→ { summary, geojson } (reporting.fn_live_positions)
|
||||
GET /webhook/vehicle-track?vehicle_number=&hours=
|
||||
→ GeoJSON Feature (reporting.fn_vehicle_track)
|
||||
GET /webhook/fleet-dashboard
|
||||
→ { drivers, cost_centres, cities, vehicles } (filter options)
|
||||
POST /webhook/fleet-dashboard body: {period, vehicle_numbers, driver,
|
||||
cost_centre, assigned_city,
|
||||
start_date, end_date}
|
||||
→ trips payload (reporting.fn_trips_for_map)
|
||||
GET /health
|
||||
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
from contextlib import asynccontextmanager
|
||||
from datetime import date, datetime, timedelta, timezone
|
||||
|
||||
import psycopg2.extras
|
||||
from fastapi import FastAPI, Request
|
||||
from fastapi.middleware.cors import CORSMiddleware
|
||||
from fastapi.responses import JSONResponse
|
||||
|
||||
from ts_shared_rev import close_pool, get_conn, get_logger
|
||||
|
||||
log = get_logger("dashboard_api")
|
||||
|
||||
# Comma-separated list of allowed browser origins (the dashboard domains).
|
||||
_ALLOWED_ORIGINS = [
|
||||
o.strip()
|
||||
for o in os.getenv(
|
||||
"DASHBOARD_CORS_ORIGINS",
|
||||
"https://liveposition.rahamafresh.com,https://fleetintelligence.rahamafresh.com",
|
||||
).split(",")
|
||||
if o.strip()
|
||||
]
|
||||
|
||||
|
||||
@asynccontextmanager
|
||||
async def lifespan(app: FastAPI):
|
||||
log.info("Dashboard API starting (v1.0). Origins=%s", _ALLOWED_ORIGINS)
|
||||
yield
|
||||
close_pool()
|
||||
|
||||
|
||||
app = FastAPI(title="Fireside Map Dashboard API", lifespan=lifespan)
|
||||
app.add_middleware(
|
||||
CORSMiddleware,
|
||||
allow_origins=_ALLOWED_ORIGINS,
|
||||
allow_methods=["GET", "POST", "OPTIONS"],
|
||||
allow_headers=["*"],
|
||||
)
|
||||
|
||||
_EMPTY_GEOJSON = {"type": "FeatureCollection", "features": []}
|
||||
|
||||
|
||||
def _clean(v):
|
||||
"""Treat missing / blank / sentinel values as None (= SQL wildcard)."""
|
||||
if v is None:
|
||||
return None
|
||||
s = str(v).strip()
|
||||
return s if s and s.lower() not in ("null", "undefined") else None
|
||||
|
||||
|
||||
# ── Health ────────────────────────────────────────────────────────────────────
|
||||
|
||||
@app.get("/health")
|
||||
def health():
|
||||
return {"status": "ok"}
|
||||
|
||||
|
||||
# ── Live positions (#004) ───────────────────────────────────────────────────
|
||||
|
||||
@app.get("/webhook/live-positions")
|
||||
def live_positions(cost_centre: str | None = None, acc_status: str | None = None):
|
||||
try:
|
||||
with get_conn() as conn:
|
||||
with conn.cursor() as cur:
|
||||
cur.execute(
|
||||
"SELECT reporting.fn_live_positions(%s, %s)",
|
||||
(_clean(cost_centre), _clean(acc_status)),
|
||||
)
|
||||
payload = cur.fetchone()[0] or {}
|
||||
return JSONResponse(
|
||||
{
|
||||
"summary": payload.get("summary") or {},
|
||||
"geojson": payload.get("geojson") or _EMPTY_GEOJSON,
|
||||
}
|
||||
)
|
||||
except Exception:
|
||||
log.exception("live-positions failed")
|
||||
return JSONResponse(
|
||||
{
|
||||
"error": {
|
||||
"type": "unknown",
|
||||
"message": "Live-position feed is unavailable. Try again in a few seconds.",
|
||||
}
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
@app.get("/webhook/vehicle-track")
|
||||
def vehicle_track(vehicle_number: str | None = None, hours: int = 1):
|
||||
veh = _clean(vehicle_number)
|
||||
if not veh:
|
||||
return JSONResponse({"error": "vehicle_number is required"})
|
||||
hours = max(1, min(24, hours or 1))
|
||||
try:
|
||||
with get_conn() as conn:
|
||||
with conn.cursor() as cur:
|
||||
cur.execute(
|
||||
"SELECT reporting.fn_vehicle_track(%s, %s::int)", (veh, hours)
|
||||
)
|
||||
feature = cur.fetchone()[0]
|
||||
return JSONResponse(
|
||||
feature
|
||||
or {"type": "Feature", "geometry": {"type": "LineString", "coordinates": []}, "properties": {}}
|
||||
)
|
||||
except Exception:
|
||||
log.exception("vehicle-track failed for %s", veh)
|
||||
return JSONResponse({"error": "vehicle-track unavailable"})
|
||||
|
||||
|
||||
# ── Fleet trips (#002) ───────────────────────────────────────────────────────
|
||||
|
||||
_FILTER_OPTIONS_SQL = """
|
||||
SELECT
|
||||
(SELECT array_agg(driver ORDER BY driver) FROM reporting.v_filter_drivers) AS drivers,
|
||||
(SELECT array_agg(cost_centre ORDER BY cost_centre) FROM reporting.v_filter_cost_centres) AS cost_centres,
|
||||
(SELECT array_agg(assigned_city ORDER BY assigned_city) FROM reporting.v_filter_cities) AS cities,
|
||||
(SELECT jsonb_agg(jsonb_build_object(
|
||||
'vehicle_number', vehicle_number, 'drivers', drivers,
|
||||
'cost_centre', cost_centre, 'assigned_city', assigned_city)
|
||||
ORDER BY vehicle_number) FROM reporting.v_filter_vehicles) AS vehicles
|
||||
"""
|
||||
|
||||
|
||||
@app.get("/webhook/fleet-dashboard")
|
||||
def fleet_filter_options():
|
||||
try:
|
||||
with get_conn() as conn:
|
||||
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
|
||||
cur.execute(_FILTER_OPTIONS_SQL)
|
||||
row = cur.fetchone() or {}
|
||||
return JSONResponse(
|
||||
{
|
||||
"drivers": row.get("drivers") or [],
|
||||
"cost_centres": row.get("cost_centres") or [],
|
||||
"cities": row.get("cities") or [],
|
||||
"vehicles": row.get("vehicles") or [],
|
||||
}
|
||||
)
|
||||
except Exception:
|
||||
log.exception("fleet-dashboard filter options failed")
|
||||
return JSONResponse({"drivers": [], "cost_centres": [], "cities": [], "vehicles": []})
|
||||
|
||||
|
||||
def _preset_to_range(period: str | None, start_date, end_date):
|
||||
"""Mirror of the n8n preset_to_range node."""
|
||||
today = datetime.now(timezone.utc).date()
|
||||
p = (period or "").strip()
|
||||
if p == "today":
|
||||
return today, today
|
||||
if p == "30d":
|
||||
return today - timedelta(days=29), today
|
||||
if p == "custom":
|
||||
def _d(v, default):
|
||||
v = _clean(v)
|
||||
if not v:
|
||||
return default
|
||||
try:
|
||||
return date.fromisoformat(v)
|
||||
except ValueError:
|
||||
return default
|
||||
return _d(start_date, today), _d(end_date, today)
|
||||
# default + '7d'
|
||||
return today - timedelta(days=6), today
|
||||
|
||||
|
||||
@app.post("/webhook/fleet-dashboard")
|
||||
async def fleet_trips(request: Request):
|
||||
try:
|
||||
body = await request.json()
|
||||
except Exception:
|
||||
body = {}
|
||||
if not isinstance(body, dict):
|
||||
body = {}
|
||||
body = body.get("body", body) if isinstance(body.get("body"), dict) else body
|
||||
|
||||
start, end = _preset_to_range(
|
||||
body.get("period"), body.get("start_date"), body.get("end_date")
|
||||
)
|
||||
veh = _clean(body.get("vehicle_numbers")) # comma-separated string or None
|
||||
try:
|
||||
with get_conn() as conn:
|
||||
with conn.cursor() as cur:
|
||||
cur.execute(
|
||||
"""
|
||||
SELECT reporting.fn_trips_for_map(
|
||||
CASE WHEN %(veh)s IS NULL THEN NULL
|
||||
ELSE string_to_array(%(veh)s, ',') END,
|
||||
%(driver)s, %(cc)s, %(city)s, %(start)s::date, %(end)s::date
|
||||
)
|
||||
""",
|
||||
{
|
||||
"veh": veh,
|
||||
"driver": _clean(body.get("driver")),
|
||||
"cc": _clean(body.get("cost_centre")),
|
||||
"city": _clean(body.get("assigned_city")),
|
||||
"start": start,
|
||||
"end": end,
|
||||
},
|
||||
)
|
||||
payload = cur.fetchone()[0]
|
||||
return JSONResponse(payload if payload is not None else {})
|
||||
except Exception:
|
||||
log.exception("fleet-dashboard trips failed")
|
||||
return JSONResponse(
|
||||
{"error": {"type": "unknown", "message": "Fleet feed is unavailable. Try again in a few seconds."}}
|
||||
)
|
||||
|
|
@ -59,6 +59,31 @@ services:
|
|||
timeout: 5s
|
||||
retries: 3
|
||||
|
||||
dashboard_api:
|
||||
# Stable read-API for the Live Position + Fleet Trips map dashboards.
|
||||
# Replaces the n8n webhooks (n8n was only a thin HTTP->SQL proxy).
|
||||
# Calls reporting.fn_live_positions / fn_vehicle_track / fn_trips_for_map.
|
||||
build:
|
||||
context: .
|
||||
dockerfile: Dockerfile
|
||||
command: sh -c "uvicorn dashboard_api_rev:app --host 0.0.0.0 --port 8890 --workers 2"
|
||||
restart: always
|
||||
depends_on:
|
||||
timescale_db:
|
||||
condition: service_healthy
|
||||
env_file: .env
|
||||
environment:
|
||||
# Browser origins allowed to call this API (the dashboard domains).
|
||||
- DASHBOARD_CORS_ORIGINS=${DASHBOARD_CORS_ORIGINS:-https://liveposition.rahamafresh.com,https://fleetintelligence.rahamafresh.com}
|
||||
# No host port binding — set a domain (e.g. fleetapi.rahamafresh.com) in the
|
||||
# Coolify UI pointing to this service on port 8890. The dashboards then point
|
||||
# their N8N_BASE at that domain; paths (/webhook/...) are unchanged.
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:8890/health"]
|
||||
interval: 30s
|
||||
timeout: 5s
|
||||
retries: 3
|
||||
|
||||
grafana:
|
||||
build:
|
||||
context: ./grafana
|
||||
|
|
@ -81,7 +106,7 @@ services:
|
|||
|
||||
pgbouncer:
|
||||
# Connection pooler in front of timescale_db.
|
||||
# Runbook: 260507_pgbouncer_deployment.md
|
||||
# Runbook: docs/reference/260507_pgbouncer_deployment.md
|
||||
# Internal Docker network only — no host port. SCRAM passthrough via
|
||||
# auth_query against the public.user_lookup() function (migration 10).
|
||||
image: edoburu/pgbouncer
|
||||
|
|
|
|||
374
docs/reports/260601_improvement_claude_48.html
Normal file
374
docs/reports/260601_improvement_claude_48.html
Normal file
|
|
@ -0,0 +1,374 @@
|
|||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>Tracksolid Stack — Engineering Review & Improvement Plan (2026-06-01)</title>
|
||||
<style>
|
||||
:root {
|
||||
--bg: #0f1115;
|
||||
--panel: #171a21;
|
||||
--panel-2: #1d212b;
|
||||
--ink: #e6e9ef;
|
||||
--ink-dim: #aab2c0;
|
||||
--line: #2a2f3a;
|
||||
--accent: #5b9dff;
|
||||
--hi: #ff5d5d;
|
||||
--med: #ffb454;
|
||||
--lo: #5fd0a0;
|
||||
--good: #5fd0a0;
|
||||
--mono: ui-monospace, SFMono-Regular, Menlo, Consolas, monospace;
|
||||
--sans: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
|
||||
}
|
||||
* { box-sizing: border-box; }
|
||||
html { scroll-behavior: smooth; }
|
||||
body {
|
||||
margin: 0; background: var(--bg); color: var(--ink);
|
||||
font-family: var(--sans); line-height: 1.6; font-size: 16px;
|
||||
}
|
||||
.wrap { max-width: 980px; margin: 0 auto; padding: 48px 28px 96px; }
|
||||
header.doc {
|
||||
border-bottom: 1px solid var(--line); padding-bottom: 28px; margin-bottom: 36px;
|
||||
}
|
||||
.kicker { color: var(--accent); font-family: var(--mono); font-size: 13px; letter-spacing: .12em; text-transform: uppercase; }
|
||||
h1 { font-size: 30px; line-height: 1.25; margin: 10px 0 14px; }
|
||||
.meta { color: var(--ink-dim); font-size: 14px; }
|
||||
.meta b { color: var(--ink); font-weight: 600; }
|
||||
h2 { font-size: 22px; margin: 44px 0 8px; padding-top: 10px; border-top: 1px solid var(--line); }
|
||||
h2 .num { color: var(--accent); font-family: var(--mono); margin-right: 10px; }
|
||||
h3 { font-size: 17px; margin: 26px 0 6px; }
|
||||
p { margin: 10px 0; }
|
||||
a { color: var(--accent); }
|
||||
code { font-family: var(--mono); font-size: .87em; background: var(--panel-2); padding: 1px 6px; border-radius: 4px; color: #d7e3ff; }
|
||||
pre { background: #0b0d12; border: 1px solid var(--line); border-radius: 8px; padding: 14px 16px; overflow-x: auto; font-family: var(--mono); font-size: 13px; color: #cdd6e4; }
|
||||
ul, ol { margin: 10px 0 10px 4px; padding-left: 22px; }
|
||||
li { margin: 5px 0; }
|
||||
.lead { color: var(--ink-dim); font-size: 16px; }
|
||||
|
||||
.callout { border-left: 3px solid var(--accent); background: var(--panel); border-radius: 0 8px 8px 0; padding: 14px 18px; margin: 18px 0; }
|
||||
.callout.warn { border-left-color: var(--hi); }
|
||||
.callout.warn b { color: var(--hi); }
|
||||
|
||||
.finding { background: var(--panel); border: 1px solid var(--line); border-radius: 10px; padding: 4px 22px 18px; margin: 22px 0; }
|
||||
.badge { display: inline-block; font-family: var(--mono); font-size: 11px; font-weight: 700; letter-spacing: .06em; padding: 3px 9px; border-radius: 999px; text-transform: uppercase; vertical-align: middle; margin-left: 8px; }
|
||||
.b-hi { background: rgba(255,93,93,.15); color: var(--hi); border: 1px solid rgba(255,93,93,.35); }
|
||||
.b-med { background: rgba(255,180,84,.13); color: var(--med); border: 1px solid rgba(255,180,84,.32); }
|
||||
.b-lo { background: rgba(95,208,160,.12); color: var(--lo); border: 1px solid rgba(95,208,160,.3); }
|
||||
.b-sec { background: rgba(91,157,255,.13); color: var(--accent); border: 1px solid rgba(91,157,255,.32); }
|
||||
|
||||
.ref { font-family: var(--mono); font-size: 12.5px; color: var(--ink-dim); }
|
||||
|
||||
table { width: 100%; border-collapse: collapse; margin: 18px 0; font-size: 14.5px; }
|
||||
th, td { text-align: left; padding: 10px 12px; border-bottom: 1px solid var(--line); vertical-align: top; }
|
||||
th { color: var(--ink-dim); font-weight: 600; font-size: 12.5px; text-transform: uppercase; letter-spacing: .05em; }
|
||||
td.up-h { color: var(--hi); font-weight: 600; }
|
||||
td.up-m { color: var(--med); font-weight: 600; }
|
||||
td.up-l { color: var(--lo); font-weight: 600; }
|
||||
|
||||
.pill { font-family: var(--mono); font-size: 11px; padding: 2px 7px; border-radius: 4px; background: var(--panel-2); color: var(--ink-dim); }
|
||||
|
||||
.good-box { border: 1px solid rgba(95,208,160,.3); background: rgba(95,208,160,.05); border-radius: 10px; padding: 6px 22px 16px; margin: 22px 0; }
|
||||
.good-box h2 { border-top: none; color: var(--good); }
|
||||
|
||||
footer { margin-top: 60px; padding-top: 22px; border-top: 1px solid var(--line); color: var(--ink-dim); font-size: 13px; }
|
||||
.toc { background: var(--panel); border: 1px solid var(--line); border-radius: 10px; padding: 16px 22px; margin: 8px 0 0; }
|
||||
.toc ol { margin: 6px 0; }
|
||||
.toc a { text-decoration: none; }
|
||||
.toc a:hover { text-decoration: underline; }
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<div class="wrap">
|
||||
|
||||
<header class="doc">
|
||||
<div class="kicker">Engineering Review · Fireside Communications · Tracksolid Fleet Stack</div>
|
||||
<h1>Database & Microservice Assessment — Opportunities & Refactoring</h1>
|
||||
<p class="meta">
|
||||
<b>Date:</b> 2026-06-01 ·
|
||||
<b>Reviewer:</b> Claude (Opus 4.8) ·
|
||||
<b>Scope:</b> TimescaleDB/PostGIS schema + migrations, and the three ingestion microservices
|
||||
(<code>ingest_movement_rev.py</code>, <code>ingest_events_rev.py</code>, <code>webhook_receiver_rev.py</code> + shared <code>ts_shared_rev.py</code>)
|
||||
</p>
|
||||
<p class="meta">Findings are ordered by <b>greatest performance upside first</b>, as requested.</p>
|
||||
</header>
|
||||
|
||||
<div class="callout warn">
|
||||
<p><b>Access caveat — read this first.</b> The remote instance was <b>unreachable from the review environment</b>:
|
||||
every probed port (22, 5433, 5432, 443) timed out, so the IP is not whitelisted (or the host was down).
|
||||
I could <b>not</b> run <code>EXPLAIN</code>, read live row/chunk counts, confirm which indexes actually exist,
|
||||
or inspect the running images. Everything below is a <b>static review</b> of the source and migration files.
|
||||
Items needing live confirmation are tagged <span class="pill">verify live</span>.</p>
|
||||
<p style="margin-bottom:0"><b>Immediate security note:</b> <code>.env</code> is <b>committed to git</b> (it is listed in
|
||||
<code>.gitignore</code>, but was tracked before that rule existed, so the rule is a no-op). The live Tracksolid app
|
||||
secret, the Postgres superuser password, and the Grafana admin password are all in the repo history on Forgejo.
|
||||
Treat all three as compromised and rotate them.</p>
|
||||
</div>
|
||||
|
||||
<div class="toc">
|
||||
<strong>Findings</strong>
|
||||
<ol>
|
||||
<li><a href="#f1">Single-threaded scheduler holds a DB transaction open across throttled geocoding</a> <span class="badge b-hi">High</span></li>
|
||||
<li><a href="#f2">dwh_gold daily-metrics ETL is non-functional</a> <span class="badge b-hi">High</span></li>
|
||||
<li><a href="#f3">v_driver_aggregates_daily will not scale; safeguard not applied</a> <span class="badge b-hi">High</span></li>
|
||||
<li><a href="#f4">pgbouncer deployed but bypassed by the application</a> <span class="badge b-med">Medium</span></li>
|
||||
<li><a href="#f5">Migrations race across three containers with no lock</a> <span class="badge b-med">Medium</span></li>
|
||||
<li><a href="#f6">Orphaned migration: 10_driver_clock_views.sql never applied</a> <span class="badge b-med">Medium</span></li>
|
||||
<li><a href="#f7">Security gaps (webhook auth, committed secrets)</a> <span class="badge b-sec">Security</span></li>
|
||||
<li><a href="#f8">Smaller DB-design notes</a> <span class="badge b-lo">Low</span></li>
|
||||
<li><a href="#good">What's genuinely good</a></li>
|
||||
<li><a href="#plan">Suggested order of attack</a></li>
|
||||
</ol>
|
||||
</div>
|
||||
|
||||
<!-- ====================== FINDING 1 ====================== -->
|
||||
<h2 id="f1"><span class="num">1</span>Single-threaded scheduler holds a DB transaction open across throttled geocoding<span class="badge b-hi">Highest upside</span></h2>
|
||||
<div class="finding">
|
||||
<p><code>ingest_movement_rev.py</code> runs every job on one <code>schedule</code> thread
|
||||
(<span class="ref">main(), lines 674–695</span>). Within that, <code>poll_trips()</code> opens a transaction
|
||||
(<span class="ref">with get_conn(), line 343</span>) and then, <b>inside that open transaction</b>, calls
|
||||
<code>reverse_geocode()</code> twice per trip (<span class="ref">lines 392–393</span>).
|
||||
<code>reverse_geocode</code> enforces a global <b>1 request/second</b> Nominatim throttle
|
||||
(<span class="ref">ts_shared_rev.py:463, _geocode_throttle</span>).</p>
|
||||
|
||||
<h3>Consequences</h3>
|
||||
<ul>
|
||||
<li>A batch of N new trips can hold a single pooled connection open for <b>N×~2 seconds</b> of network I/O — a
|
||||
long-running transaction that pins a snapshot (bad for autovacuum's cleanup horizon) and ties up a connection.</li>
|
||||
<li>Because the scheduler is one thread, while <code>poll_trips</code> is geocoding, the <b>60-second live-position
|
||||
sweep cannot run</b>. The "live" freshness SLA silently degrades to minutes whenever trips/parking work through a
|
||||
backlog. <code>poll_track_list</code> (30 min) and <code>poll_stale_locations</code> (10 min) share the same
|
||||
thread and also block each other.</li>
|
||||
<li>Every 15 min, <code>poll_trips</code> re-runs the 8-subquery enrichment block (<code>_ENRICH_QUERY</code>,
|
||||
<span class="ref">lines 295–321</span>) for the <b>entire last hour</b> of trips, even though the
|
||||
<code>ON CONFLICT</code> mostly <code>COALESCE</code>s the result away.</li>
|
||||
</ul>
|
||||
|
||||
<h3>Recommendation</h3>
|
||||
<ul>
|
||||
<li>Move geocoding <b>out of the DB transaction</b>: collect trip rows, commit, then geocode + <code>UPDATE</code>
|
||||
in a second pass (or delegate geocoding to a queue / n8n).</li>
|
||||
<li>Gate enrichment on <code>WHERE start_address IS NULL</code> so already-enriched trips don't re-pay the cost.</li>
|
||||
<li>Run the 60s live sweep on its own thread/process so slow reporting jobs cannot starve it.
|
||||
<code>schedule</code> + <code>time.sleep(1)</code> on one thread is the wrong concurrency model when one job is
|
||||
latency-critical and others do long network I/O.</li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<!-- ====================== FINDING 2 ====================== -->
|
||||
<h2 id="f2"><span class="num">2</span>The <code>dwh_gold</code> daily-metrics ETL is non-functional<span class="badge b-hi">High</span></h2>
|
||||
<div class="finding">
|
||||
<p><code>dwh_gold.refresh_daily_metrics()</code> (<span class="ref">migration 05, lines 212–264</span>) selects
|
||||
<code>t.imei AS vehicle_key</code> and inserts into <code>fact_daily_fleet_metrics.vehicle_key</code>, which is
|
||||
<code>INTEGER REFERENCES dwh_gold.dim_vehicles(vehicle_key)</code> (<span class="ref">schema lines 237–243</span>).
|
||||
But <code>imei</code> is a 12–15-digit <b>TEXT</b> string:</p>
|
||||
<ul>
|
||||
<li>15-digit IMEIs overflow <code>int4</code> → <em>"integer out of range"</em>.</li>
|
||||
<li>Shorter ones violate the FK because <b>nothing ever populates <code>dim_vehicles</code></b> — no code path
|
||||
inserts into it.</li>
|
||||
</ul>
|
||||
<p>So the function cannot succeed as written, and <code>v_utilisation_daily</code> (which joins
|
||||
<code>fact → dim_vehicles → devices</code>, <span class="ref">migration 07, lines 268–286</span>) will always be
|
||||
empty. CLAUDE.md lists "schedule the nightly ETL" as a LOW open item — but scheduling it today would error on every
|
||||
run.</p>
|
||||
<p style="margin-bottom:0"><b>Recommendation:</b> redesign the gold layer around <code>imei</code> (drop the surrogate
|
||||
key, or populate <code>dim_vehicles</code> from <code>devices</code> first and look up the key), and fix the column
|
||||
type. This is a real bug hiding behind "not scheduled yet."</p>
|
||||
</div>
|
||||
|
||||
<!-- ====================== FINDING 3 ====================== -->
|
||||
<h2 id="f3"><span class="num">3</span><code>v_driver_aggregates_daily</code> will not scale, and the safeguard wasn't applied<span class="badge b-hi">High</span></h2>
|
||||
<div class="finding">
|
||||
<p>Migration 07 (<span class="ref">lines 159–223</span>) builds this view with two 31-day scans of
|
||||
<code>position_history</code> plus a <code>LAG()</code> window over <code>source='track_list'</code> rows. There is
|
||||
<b>no index on <code>position_history.source</code></b>, and the only index on the hypertable is the
|
||||
<code>(imei, gps_time)</code> primary key.</p>
|
||||
<p>The view's own header comment says <em>"convert to a continuous aggregate once the hypertable exceeds ~100k rows."</em>
|
||||
At 156 devices writing a row/minute from the poll sweep plus track_list waypoints, you cross 100k in <b>days</b>, not
|
||||
months. <span class="pill">verify live</span> current row + chunk count.</p>
|
||||
<p style="margin-bottom:0"><b>Recommendation:</b> build the speeding/harsh aggregates as a TimescaleDB continuous
|
||||
aggregate (the pattern already exists in <code>v_mileage_daily_cagg</code>), or at minimum add a partial index
|
||||
supporting the <code>source='track_list'</code> + time filter. As-is, the daily driver dashboard does a growing full
|
||||
hypertable scan on every load.</p>
|
||||
</div>
|
||||
|
||||
<!-- ====================== FINDING 4 ====================== -->
|
||||
<h2 id="f4"><span class="num">4</span>pgbouncer is deployed but the application bypasses it entirely<span class="badge b-med">Medium</span></h2>
|
||||
<div class="finding">
|
||||
<p><code>docker-compose.yaml</code> adds a pgbouncer sidecar (<span class="ref">lines 82–116</span>) "to cap
|
||||
tracksolid_db connections," but <code>.env</code> sets
|
||||
<code>DATABASE_URL=...@timescale_db:5432/...</code> — the Python pools connect <b>straight to Postgres</b>, not to
|
||||
pgbouncer's 6432.</p>
|
||||
<p>So the connection cap does nothing for the three services. The real ceiling today is the sum of per-process pools:</p>
|
||||
<pre>webhook : uvicorn --workers 2 → 2 procs × ThreadedConnectionPool(max=12) = 24
|
||||
ingest_movement = 12
|
||||
ingest_events = 12
|
||||
total ≈ 48 direct conns</pre>
|
||||
<p>At 80–156 devices this is not a live performance problem — it is wasted/contradictory infrastructure and an
|
||||
intent-vs-reality gap. You also maintain a SCRAM-passthrough <code>user_lookup()</code> SECURITY DEFINER function
|
||||
(<span class="ref">migration 10</span>) with no consumer.</p>
|
||||
<p style="margin-bottom:0"><b>Recommendation:</b> either point <code>DATABASE_URL</code> at <code>pgbouncer:6432</code>
|
||||
(transaction-pool mode disallows session features, but the code uses none beyond <code>client_encoding</code>), or
|
||||
remove the sidecar.</p>
|
||||
</div>
|
||||
|
||||
<!-- ====================== FINDING 5 ====================== -->
|
||||
<h2 id="f5"><span class="num">5</span>Migrations race across three containers with no lock<span class="badge b-med">Medium · reliability</span></h2>
|
||||
<div class="finding">
|
||||
<p>All three services run <code>python run_migrations.py</code> on startup (<span class="ref">compose lines 26, 37,
|
||||
48</span>) and start in parallel once the DB is healthy. <code>run_migrations.py</code> does check-then-act
|
||||
(<code>already_applied()</code> → <code>run_file()</code>, <span class="ref">lines 231–242</span>) with <b>no
|
||||
advisory lock</b>. On a fresh database, three containers can pass <code>already_applied()==False</code>
|
||||
simultaneously and run the same file.</p>
|
||||
<ul>
|
||||
<li>Migration 02's <code>CREATE TRIGGER</code> loop (<span class="ref">lines 255–267</span>) has no
|
||||
<code>IF NOT EXISTS</code> — concurrent runs throw, and <code>run_file()</code> treats any <code>ERROR:</code> as
|
||||
fatal → <code>sys.exit(1)</code> → a service refuses to start.</li>
|
||||
<li><code>run_file()</code> greps stderr for <code>ERROR:</code> without <code>-v ON_ERROR_STOP=1</code>, and files
|
||||
02/03 have no <code>BEGIN/COMMIT</code>, so a mid-file failure can leave partial schema that later gets mis-seeded
|
||||
as "applied."</li>
|
||||
</ul>
|
||||
<p style="margin-bottom:0"><b>Recommendation:</b> wrap the run in <code>pg_advisory_lock(<const>)</code> /
|
||||
unlock, and run psql with <code>ON_ERROR_STOP=1</code>. Low effort, removes a class of cold-start flakiness.</p>
|
||||
</div>
|
||||
|
||||
<!-- ====================== FINDING 6 ====================== -->
|
||||
<h2 id="f6"><span class="num">6</span>Orphaned migration: <code>10_driver_clock_views.sql</code> is never applied<span class="badge b-med">Medium</span></h2>
|
||||
<div class="finding">
|
||||
<p>The runner's <code>MIGRATIONS</code> list (<span class="ref">run_migrations.py:27–37</span>) includes
|
||||
<code>10_pgbouncer_auth.sql</code> but <b>not</b> <code>10_driver_clock_views.sql</code>. Two files share the
|
||||
<code>10_</code> prefix and the list is hand-maintained, so <code>v_driver_clock_daily/_today</code> (which the n8n
|
||||
tardiness workflow depends on, per the file header) exist only if someone applied them by hand — they are not
|
||||
reproducible from a clean deploy.</p>
|
||||
<p style="margin-bottom:0"><b>Recommendation:</b> rename to <code>11_</code> and add to the list. Better: switch the
|
||||
runner from a hardcoded list to globbing <code>NN_*.sql</code> sorted, so this cannot recur.</p>
|
||||
</div>
|
||||
|
||||
<!-- ====================== FINDING 7 ====================== -->
|
||||
<h2 id="f7"><span class="num">7</span>Security gaps worth fixing now<span class="badge b-sec">Security</span></h2>
|
||||
<div class="finding">
|
||||
<ul>
|
||||
<li><b>Webhook auth is effectively off.</b> <code>_validate_token</code>
|
||||
(<span class="ref">webhook_receiver_rev.py:84–87</span>) skips validation entirely when
|
||||
<code>JIMI_WEBHOOK_TOKEN</code> is empty, and it is <b>not set in <code>.env</code></b>. The push endpoints are
|
||||
exposed via Traefik, so anyone who learns the URL can inject arbitrary telemetry/alarms (each <code>/pushgps</code>
|
||||
accepts up to 5000 rows, no rate limit). Set the token and make an unset token <b>fail closed</b> in production.</li>
|
||||
<li><b>Committed secrets</b> (see top banner). Rotate the Tracksolid app secret, Postgres password, and Grafana admin
|
||||
password; <code>git rm --cached .env</code> and scrub history.</li>
|
||||
<li><code>dwh/260423_dwh_ddl_v1.sql</code> plaintext passwords are an existing known item in CLAUDE.md — same class of
|
||||
problem.</li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<!-- ====================== FINDING 8 ====================== -->
|
||||
<h2 id="f8"><span class="num">8</span>Smaller DB-design notes<span class="badge b-lo">Low — queue these</span></h2>
|
||||
<div class="finding">
|
||||
<ul>
|
||||
<li><b><code>v_mileage_daily_cagg</code> is built on a column that's mostly NULL.</b> It computes
|
||||
<code>MAX(current_mileage) - MIN(current_mileage)</code> (<span class="ref">schema lines 293–301</span>), but
|
||||
<code>current_mileage</code> is only populated by the poll sweep — <code>track_list</code> and <code>/pushgps</code>
|
||||
inserts leave it NULL, and odometer resets/device swaps produce negative or huge deltas. The aggregate's
|
||||
<code>dist_km</code> is unreliable. Prefer deriving daily distance from <code>trips.distance_km</code>.</li>
|
||||
<li><b><code>ingestion_log</code> has no retention and no index.</b> <code>v_ingestion_health</code> does
|
||||
<code>DISTINCT ON (endpoint) … ORDER BY endpoint, run_at DESC</code> over the whole table, which grows ~875
|
||||
rows/day forever. Add <code>(endpoint, run_at DESC)</code> plus a retention/partition policy.</li>
|
||||
<li><b>Alarm dedup is leaky on the poll path.</b> <code>alarms_dedup UNIQUE (imei, alarm_type, alarm_time)</code>
|
||||
(<span class="ref">schema line 199</span>) — the poll path inserts <code>alertTypeId</code> as
|
||||
<code>alarm_type</code> with no NOT-NULL guard, and <code>NULL</code> defeats the unique constraint
|
||||
(<code>NULL ≠ NULL</code>), so a null-type alarm can duplicate. The webhook path guards this; the poll path
|
||||
doesn't.</li>
|
||||
<li><b><code>live_positions</code>/staleness queries are seq scans</b> (no index on <code>gps_time</code>) — totally
|
||||
fine at ~156 rows today; just don't carry that pattern into anything that scans <code>position_history</code>.</li>
|
||||
<li><b>Dead/ambiguous code in <code>_parse_request</code></b> (<span class="ref">webhook lines 90–143</span>): the
|
||||
JSON-array branch <code>_parse_data_list</code> is never reached (it always falls through to
|
||||
<code>request.form()</code>); harmless but misleading given the docstring claims it handles both.</li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<!-- ====================== GOOD ====================== -->
|
||||
<div class="good-box">
|
||||
<h2 id="good" style="border-top:none;"><span class="num" style="color:var(--good)">✓</span>What's genuinely good</h2>
|
||||
<p>So this is balanced — the bones are solid:</p>
|
||||
<ul>
|
||||
<li>Per-row <code>SAVEPOINT</code> isolation so one bad item can't abort a batch.</li>
|
||||
<li>Time-guarded upserts via the shared <code>upsert_live_position</code> helper.</li>
|
||||
<li>Batched <code>execute_values</code> on the high-volume push / track-list paths.</li>
|
||||
<li>Hypertables with compression + retention policies.</li>
|
||||
<li>Parameterized SQL throughout — no injection surface.</li>
|
||||
<li>Clean signal handling and pool teardown.</li>
|
||||
<li>Idempotent migrations with a tracking table and <code>COMMENT ON VIEW</code> provenance.</li>
|
||||
<li><code>sync_devices</code> N+1 already parallelized with a bounded thread pool.</li>
|
||||
</ul>
|
||||
<p style="margin-bottom:0">The issues above are mostly about <b>coupling</b>, <b>one broken ETL</b>, and
|
||||
<b>scale-ahead-of-indexing</b> — not a bad foundation.</p>
|
||||
</div>
|
||||
|
||||
<!-- ====================== PLAN ====================== -->
|
||||
<h2 id="plan"><span class="num">»</span>Suggested order of attack (effort vs. upside)</h2>
|
||||
<table>
|
||||
<thead>
|
||||
<tr><th style="width:38px">#</th><th>Action</th><th style="width:130px">Upside</th><th style="width:80px">Effort</th></tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>1</td>
|
||||
<td>Pull geocoding out of the trips transaction + gate on <code>start_address IS NULL</code>; isolate the 60s sweep on its own thread</td>
|
||||
<td class="up-h">High — restores live freshness, frees connections</td>
|
||||
<td>M</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>2</td>
|
||||
<td>Fix or redesign <code>refresh_daily_metrics</code> / <code>dim_vehicles</code> (imei vs int key)</td>
|
||||
<td class="up-h">High — unblocks all utilisation reporting</td>
|
||||
<td>M</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>3</td>
|
||||
<td>Convert <code>v_driver_aggregates_daily</code> to a continuous aggregate (or add <code>source</code>+time index)</td>
|
||||
<td class="up-h">High and growing</td>
|
||||
<td>M</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>4</td>
|
||||
<td>Set <code>JIMI_WEBHOOK_TOKEN</code>; rotate + untrack <code>.env</code></td>
|
||||
<td class="up-h">High (security)</td>
|
||||
<td>S</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>5</td>
|
||||
<td>Advisory-lock the migration runner + <code>ON_ERROR_STOP=1</code>; add <code>10_driver_clock_views</code> / switch to glob</td>
|
||||
<td class="up-m">Medium (reliability)</td>
|
||||
<td>S</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>6</td>
|
||||
<td>Decide pgbouncer in-or-out; point <code>DATABASE_URL</code> accordingly</td>
|
||||
<td class="up-m">Medium (clarity)</td>
|
||||
<td>S</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>7</td>
|
||||
<td><code>ingestion_log</code> index + retention; fix poll-path alarm null dedup; fix cagg distance source</td>
|
||||
<td class="up-l">Low–medium</td>
|
||||
<td>S</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
<div class="callout">
|
||||
<p style="margin:0"><b>Next step for live confirmation:</b> if I can get onto the box (whitelist the review IP for
|
||||
5433, or an SSH tunnel), I'll confirm the <span class="pill">verify live</span> items — actual
|
||||
<code>position_history</code> row/chunk counts, which indexes really exist, whether <code>refresh_daily_metrics</code>
|
||||
has ever succeeded, and <code>EXPLAIN ANALYZE</code> on the heavier views — and tighten the priority order with real
|
||||
numbers.</p>
|
||||
</div>
|
||||
|
||||
<footer>
|
||||
Generated 2026-06-01 by Claude (Opus 4.8) for Fireside Communications · Tracksolid Fleet Intelligence.
|
||||
Static review only — no live database access was available at review time. File references use
|
||||
<span class="ref">file:line</span> against the repository state on branch
|
||||
<code>quality-program-2026-04-12</code>.
|
||||
</footer>
|
||||
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
|
|
@ -49,7 +49,7 @@ from ts_shared_rev import clean, clean_num, clean_ts, get_conn, get_logger
|
|||
|
||||
log = get_logger("csv_import")
|
||||
|
||||
DEFAULT_CSV_PATH = Path(__file__).parent / "20260427_FSG_Vehicles_mitieng.csv"
|
||||
DEFAULT_CSV_PATH = Path(__file__).parent / "data" / "20260427_FSG_Vehicles_mitieng.csv"
|
||||
|
||||
# Columns fetched from DB for diff comparison.
|
||||
DB_COLS = [
|
||||
|
|
|
|||
|
|
@ -221,7 +221,7 @@ def main():
|
|||
|
||||
applied = skipped = 0
|
||||
for sql_file in MIGRATIONS:
|
||||
path = os.path.join("/app", sql_file)
|
||||
path = os.path.join("/app", "migrations", sql_file)
|
||||
|
||||
if not os.path.exists(path):
|
||||
print(f" SKIP {sql_file} (file not found in /app)")
|
||||
|
|
|
|||
|
|
@ -55,10 +55,10 @@ run_sql -c "
|
|||
" > /dev/null
|
||||
|
||||
# ── Find and apply pending migrations ────────────────────────────────────────
|
||||
MIGRATION_FILES=$(find "$SCRIPT_DIR" -maxdepth 1 -name '[0-9][0-9]_*.sql' | sort)
|
||||
MIGRATION_FILES=$(find "$SCRIPT_DIR/migrations" -maxdepth 1 -name '[0-9][0-9]_*.sql' | sort)
|
||||
|
||||
if [[ -z "$MIGRATION_FILES" ]]; then
|
||||
echo "No migration files found in $SCRIPT_DIR"
|
||||
echo "No migration files found in $SCRIPT_DIR/migrations"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
|
|
|
|||
Loading…
Reference in a new issue