# Fleet Platform — Greenfield Architecture
**Date:** 2026-05-22 (rev. b — incorporates engineering review of 2026-05-22)
**Author posture:** Senior systems architect
**Status:** Design document. Not a migration plan. Not a port.
---
## 2. Architectural principles
These are the load-bearing decisions. Everything below is derived from them.
### 2.1 The system is an event log with derived state
Every inbound signal — Jimicloud push, polled API response, manual import, ops entry — is appended to an immutable log (`events.raw`). The raw payload is preserved verbatim with metadata (`received_at`, `source`, `signature`). A separate parser produces typed events in `events.parsed`. Projectors read `events.parsed` and update "current state" tables (`state.live_positions`, `state.trips`, etc.) deterministically.
This is event sourcing kept lightweight: no Kafka, no Debezium, no separate event store. The log lives in Postgres as a hypertable. Projectors are SQL functions that the worker role invokes when new parsed events arrive (§2.8).
What this fixes structurally:
- Race conditions in write paths — projectors are single-writer per state table.
- Contract drift — re-parse from `events.raw` with corrected logic, no upstream re-fetch.
- Time-travel debugging — "what did Jimi send at 03:47 on Tuesday?" is a `SELECT`.
- Replay — rebuild any state table from the log.
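The replay property can be illustrated with a minimal in-memory sketch. It is not the platform's implementation (the real log lives in `events.raw` and projectors are SQL); the payload field names (`imei`, `ts`, `lat`, `lon`) and the `jimi_push` source label are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import json

@dataclass(frozen=True)
class RawEvent:
    received_at: datetime   # when the gateway stored the payload
    source: str             # illustrative label, e.g. "jimi_push"
    payload: str            # verbatim body, never modified after insert

@dataclass(frozen=True)
class PositionFix:
    imei: str
    fix_time: datetime
    lat: float
    lon: float

def parse(raw: RawEvent) -> PositionFix:
    """Turn one raw row into a typed event (stand-in for the events.parsed step)."""
    body = json.loads(raw.payload)
    return PositionFix(
        imei=body["imei"],
        fix_time=datetime.fromtimestamp(body["ts"], tz=timezone.utc),
        lat=float(body["lat"]),
        lon=float(body["lon"]),
    )

def project_live_positions(log: list[RawEvent]) -> dict[str, PositionFix]:
    """Rebuild 'latest fix per device' purely from the log."""
    state: dict[str, PositionFix] = {}
    for raw in log:
        fix = parse(raw)
        current = state.get(fix.imei)
        # Keep the newest fix by device time, so arrival order does not matter.
        if current is None or fix.fix_time > current.fix_time:
            state[fix.imei] = fix
    return state

def _raw(imei: str, ts: int, lat: float, lon: float) -> RawEvent:
    body = json.dumps({"imei": imei, "ts": ts, "lat": lat, "lon": lon})
    return RawEvent(datetime.now(timezone.utc), "jimi_push", body)

LOG = [
    _raw("A", 200, -1.30, 36.80),
    _raw("A", 100, -1.29, 36.79),  # late arrival of an older fix
    _raw("B", 150, 0.31, 32.58),
]

live = project_live_positions(LOG)
```

Running the projector again over the same log, or over the log in a different arrival order, yields the same state. That is the property that makes rebuilding a state table safe.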
### 2.2 Contracts are typed, in code, and continuously verified
Every endpoint (Jimicloud → us, us → dashboard, us → routing) has a Pydantic model in the codebase, version-pinned per upstream API minor revision. A scheduled contract-check job hits a sandbox account daily and asserts the shape still matches. Drift produces an alert, not a `NULL` in production.
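As a rough illustration of what the contract checker asserts, the sketch below compares a response against an expected shape using only the standard library. In the platform this role is played by Pydantic models; the field names here are assumptions and are not taken from the Jimicloud documentation.

```python
from typing import Any

# Expected top-level fields and their Python types for one endpoint.
# Field names are illustrative only.
ALARM_CONTRACT_V1: dict[str, type] = {
    "imei": str,
    "alertTypeId": int,
    "alarmTime": str,
}

def find_drift(payload: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return drift findings; an empty list means the shape still matches."""
    findings = []
    for field, expected in contract.items():
        if field not in payload:
            findings.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            findings.append(
                f"type changed: {field} is {actual}, expected {expected.__name__}"
            )
    # Fields the upstream added or renamed show up as unexpected.
    for field in payload.keys() - contract.keys():
        findings.append(f"unexpected field: {field}")
    return sorted(findings)

ok = find_drift(
    {"imei": "1", "alertTypeId": 3, "alarmTime": "2026-05-22 03:47:00"},
    ALARM_CONTRACT_V1,
)
drifted = find_drift(
    {"imei": "1", "alertId": 3, "alarmTime": "2026-05-22 03:47:00"},
    ALARM_CONTRACT_V1,
)
```

A rename surfaces as a pair of findings (one missing field, one unexpected field), which is the signal that should raise the alert instead of a silent `NULL`.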
### 2.3 The platform is one codebase, one image, three container roles
One FastAPI application, one Git repository, one Docker image. The same image runs in three container roles selected by entrypoint command — `platform-gateway` (push receivers + dashboard API + routing endpoint + auth), `platform-worker` (parser + projectors + geocode worker + matview refresher), `platform-cron` (contract checker + OSM loader + scheduled polls + SLO measurement). Configuration and database connection are shared; process memory is not.
This is the same pattern Sidekiq / Celery / Rails apps use: single codebase, single deployment artefact, multiple runtime roles. It is not microservices — there is no service mesh, no inter-service contracts, no separate repos, no per-service teams. There is one PR review surface and one image to roll back. What it provides is **fate isolation**: a heavy historical query that triggers an OOM in the worker role does not take down the gateway; an OSM loader that pegs CPU in the cron role does not stall webhook receivers. The original "one process" design conflated single-codebase (correct) with single-process (an avoidable failure mode).
Microservices solve organisational scaling problems we don't have. Container-role separation solves the fate-sharing problem we do have. Two developers and one fleet do not need a service mesh; they do need the gateway to keep responding to Jimi pushes while a 90-day report runs.
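One way to select a role by entrypoint command is a small dispatcher like the sketch below. The function bodies are placeholders (a real role would start uvicorn, the listeners, or APScheduler), and the `fleet-platform` program name and argument layout are assumptions.

```python
import argparse

def run_gateway() -> str:
    # Placeholder: a real gateway role would start the HTTP server here.
    return "gateway: serve /push/*, /api/views/*, /api/routes, /api/auth/token"

def run_worker() -> str:
    # Placeholder: a real worker role would open LISTEN connections here.
    return "worker: parse, project, geocode, refresh matviews"

def run_cron() -> str:
    # Placeholder: a real cron role would start the scheduler here.
    return "cron: scheduled polls, contract check, OSM loader, SLO measurement"

ROLES = {
    "platform-gateway": run_gateway,
    "platform-worker": run_worker,
    "platform-cron": run_cron,
}

def main(argv: list[str]) -> str:
    """Pick the runtime role from the container's command arguments."""
    parser = argparse.ArgumentParser(prog="fleet-platform")
    parser.add_argument("role", choices=sorted(ROLES))
    args = parser.parse_args(argv)
    return ROLES[args.role]()
```

Each container then differs only in its command, for example `python -m app platform-worker` (a hypothetical module path), while the image and configuration stay identical.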
### 2.4 Lifecycle state and operational state are orthogonal and explicit
Devices have a **lifecycle state machine** (`provisioned → active → suspended → decommissioned`) stored explicitly in `domain.devices.lifecycle`. Vehicles have an **operational state** (`moving | parked | offline | unknown`) derived from the latest fix — never stored, always computed.
### 2.5 SLOs are first-class, not implicit thresholds
Service-level objectives live in `slo.targets`, one row per metric, with the current threshold value. Dashboards render against them. Grafana monitors against them. Changing a threshold is one `UPDATE`, not a scavenger hunt through SQL + JS.
### 2.6 Deploys are by image tag, not by branch
CI builds an image on push to `main`, tags it `<registry>/fleet-platform:<git-sha>` plus a moving `:latest`. Coolify deploys by tag. "What's deployed?" is `docker inspect`.
### 2.7 Dashboards are thin renderers over a typed read model
Each dashboard polls one endpoint that returns a complete, render-ready payload. The endpoint's shape is the contract. All state logic, palette assignment, plate-tail extraction, EAT formatting, and KPI counting happen server-side. The HTML/JS is a render layer.
### 2.8 Projection is event-driven, not poll-driven
Parsers and projectors do not run on fixed schedules. The gateway, after inserting into `events.raw`, issues a Postgres `NOTIFY events_raw_new` in the same transaction. The worker role holds long-lived `LISTEN` connections and wakes immediately, parses the new row, writes `events.parsed`, issues `NOTIFY events_parsed_new`, and the relevant projector wakes and updates `state.*`. The chain is parse → project in tens of milliseconds, not stacked polling windows.
APScheduler is retained — but only for things that are genuinely time-triggered: the daily contract checker, the monthly OSM loader, the per-minute SLO measurement, the 60s polled-ingest sweep. Anything that should fire on data arrival fires on data arrival.
This structurally avoids the "parser every 10s + projector every 10s = up-to-20s of internal lag before the SLO timer even starts" problem. The freshness budget is spent on Jimi's transport, not on our scheduler.
### 2.9 External enrichment is asynchronous and replayable
Nominatim, Mapbox, weather, traffic — all run as separate projectors reading `events.parsed`, writing to side tables. Non-critical. Can be down for an hour without affecting the platform. Can be re-run without re-fetching upstream.
---
## 3. The system, at a glance
```
                Browser
┌──────────────────┴──────────────────┐
│     live      history      routes   │   3 HTML pages
│                                     │   importing fleet-core.js
└──────────────────┬──────────────────┘
                   │ HTTPS · JWT (mandatory)
          api.rahamafresh.com
    ┌──────────────┴──────────────┐
    │  platform-gateway           │ ← one image,
    │  (FastAPI)                  │   three roles
    │  /push/*                    │
    │  /api/views/*               │
    │  /api/routes                │
    │  /api/auth/token            │
    │                             │
    │  gateway contract:          │
    │  HMAC-verify + INSERT       │
    │  events.raw + NOTIFY        │
    │  + 200 OK. Nothing else.    │
    └──────────────┬──────────────┘
    ┌──────────────┴──────────────┐
    │  platform-worker            │ ← same image,
    │  (FastAPI workers)          │   worker role
    │  LISTEN events_raw_new      │
    │  → parser → events.parsed   │
    │  → projectors → state.*     │
    │  geocode_worker             │
    │  matview_refresh            │
    │  map_match (geo)            │
    └──────────────┬──────────────┘
    ┌──────────────┴──────────────┐
    │  platform-cron              │ ← same image,
    │  (APScheduler)              │   cron role
    │  polls: live/trips/         │
    │  parking/track/devices/     │
    │  stale (60s / 10m)          │
    │  contract_check (daily)     │
    │  osm_loader (monthly)       │
    │  slo_measurement (1m)       │
    │  hr_sync (3h)               │
    └──────────────┬──────────────┘
         pgbouncer (txn mode)
    ┌──────────────┴──────────────┐
    │  TimescaleDB-HA             │
    │  + PostGIS + pgRouting      │
    │                             │
    │  events / state / domain    │
    │  geo / serve / slo / auth   │
    └──────────────┬──────────────┘
         ┌─────────┴─────────┐
         │ db_backup → rustfs│
         │ grafana           │
         └───────────────────┘
```
One codebase. One image. Three container roles. One database. One repo. One branch. One image tag identifies what production is running.
The three roles do not share a Python process. They share configuration, database connection string, and Pydantic models. The gateway can keep returning 200 OK to Jimi while the worker is busy parsing a backlog, and the cron role can run a contract check that consumes CPU without affecting either. A failure in one role does not take the others down.
---
## 4. Tech stack
| Layer | Choice | Why this |
|---|---|---|
| **Language** | Python 3.12 | Team already knows it. No second language to operate. |
| **Web framework** | FastAPI | Typed (Pydantic), async-native, OpenAPI for free, fast enough. |
| **In-process scheduler** | APScheduler | Sturdier successor to the `schedule` library; cron syntax + persistence. |
| **Database** | PostgreSQL 16 + TimescaleDB 2.15 | Hypertables and continuous aggregates are real value. |
| **GIS** | PostGIS 3 | Required. Non-negotiable. |
| **Routing** | pgRouting | In-database A\*. One connection string. Upgrade path to OSRM exists. |
| **Connection pool** | pgbouncer (transaction mode) | Works. |
| **Migrations** | dbmate | Single static binary, forward-only SQL, diff-able `schema.sql`. |
| **Auth** | JWT (FastAPI native) + bcrypt | Short-lived tokens for dashboards. HMAC for webhooks. |
| **Rate limiting** | slowapi | FastAPI-native. |
| **Frontend** | Vanilla JS, ES modules, no bundler | Three HTML pages, one shared `fleet-core.js`. |
| **Map library** | MapLibre GL JS 4.x | Already in current dashboards. Free for OSM/Carto basemaps. |
| **Basemap** | Carto Voyager (default), Mapbox dark-v11 (scaffolded) | No mandatory token. |
| **Reverse geocoding** | Nominatim primary, Mapbox fallback | Asynchronous, queued, never blocks ingest. |
| **Edge / proxy** | Coolify-managed Traefik → nginx:alpine → rustfs / fleet-platform | Continues current pattern. We know it works. |
| **Object store** | rustfs | Already deployed. Hosts static dashboards + DB backups. |
| **Observability** | Grafana + Postgres views + container logs (structlog JSON) | One Grafana instance reads `slo.*` directly. No separate metrics stack. |
| **CI/CD** | GitHub Actions → image registry → Coolify image-tag deploy | Build on push to `main`, tag with git SHA. |
| **OSM data** | Geofabrik Kenya + Uganda extracts, monthly refresh | Reproducible, atomic swap via `ALTER SCHEMA RENAME`. |
| **Secrets** | `.env` (dev) + Coolify env vars (prod) | Nothing in git. `.env.example` documents required keys. |
Explicitly **not** in the stack: message queue (Redis/RabbitMQ/SQS), separate event store (Kafka/Debezium), n8n for service-layer work, frontend framework (React/Vue/Svelte), ORM (we use psycopg + plain SQL instead).
---
## 5. Schema — layered by purpose, not by feature
```
events — immutable log; the source of truth
  events.raw — hypertable, append-only, partition by received_at
  events.parsed — hypertable, derived from events.raw by parser
state — derived projections (rebuildable from events)
  state.devices — current device metadata
  state.live_positions — one row per device, latest fix
  state.position_history — hypertable, full timeline
  state.trips — closed trips
  state.parking_events
  state.alarms — hypertable
  state.obd_readings — hypertable
  state.heartbeats — hypertable
  state.device_events — hypertable (login/logout)
  state.fuel_readings — hypertable
  state.temperature_readings — hypertable
  state.lbs_readings — hypertable
  state.geocoded_positions — async enrichment side table
domain — business entities + lifecycle
  domain.accounts — Jimi sub-accounts (FK target)
  domain.vehicles — vehicle identity (plate, model, depot)
  domain.devices — device <-> vehicle mapping with lifecycle
  domain.drivers
  domain.cost_centres
  domain.assigned_cities
geo — PostGIS + pgRouting + OSM
  geo.ways — pgRouting topology
  geo.ways_vertices_pgr
  geo.segment_observations — hypertable, map-matched fixes
  geo.cagg_segment_speed_band — CAGG (segment_id, dow, hod) → avg speed
  geo.pois — Fireside HQ + future depots
slo — explicit service level objectives
  slo.targets — one row per SLO with current threshold
  slo.measurements — hypertable, computed every minute
  slo.v_current_status — view: is each SLO currently met?
serve — dashboard + API contracts (SQL functions)
  serve.normalize_plate(text)
  serve.fn_live_view(filters jsonb)
  serve.fn_history_view(filters jsonb)
  serve.fn_route(origin, dest, depart_at)
  serve.fn_dispatch_view()
  serve.v_fleet_overview
ops — dispatch / tickets / CRM-ish
  ops.tickets
  ops.dispatch_log
  ops.cost_rates
  ops.service_log
  ops.odometer_readings
auth — service auth
  auth.accounts
  auth.tokens
```
**Read this schema top-down:** events are immutable truth, state is derived, domain owns business identity, geo owns geometry, slo owns commitments, serve owns the dashboard contract, ops owns workflow, auth owns access. Each schema has one reason to exist.
**Where the round-6 dedup logic lives:** in `serve.fn_live_view` and `serve.v_live_dedup`. One function, one place. Changing the rule changes one place.
**Where SLO thresholds live:** `slo.targets`. Not scattered constants.
---
## 6. The ingest pipeline
```
Jimicloud push ─────┐
Jimicloud poll ─────┼─→ events.raw ─NOTIFY→ parser ─→ events.parsed ─NOTIFY→ projectors ─→ state.*
CSV import     ─────┤                                               ↘
Ops action     ─────┘                                                 geocode_worker ─→ state.geocoded_positions
```
**The gateway contract is minimal and inviolable.** The `platform-gateway` role does, per request, exactly: (1) HMAC signature verification, (2) one `INSERT` into `events.raw`, (3) one `NOTIFY events_raw_new`, (4) return `200 OK`. No Pydantic parsing, no PostGIS calculation, no geocoding, no projector work. All of that happens in the worker role, downstream. p95 < 100 ms is achievable with ~10x headroom under this contract; the Jimi-side timeout is never the constraint.
This is what makes Postgres-as-queue safe at this scale: the gateway's per-request work is constant-time and CPU-trivial. The event-loop-starvation risk that a co-tenanted parser/projector would create is eliminated by container-role separation (§2.3), not by introducing a separate buffer.
**`events.raw` is the contract.** Every gateway endpoint writes here first, verbatim. No parsing at the gateway. If parsing is wrong, we re-parse later. The raw row is the immutable record that "we received this" — no in-memory buffer sits between Jimi and durable storage.
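The four gateway steps can be sketched with the standard library alone. HMAC-SHA256 over the raw body with a hex-encoded signature is an assumption (the signing scheme is not specified here), and the two lists are stand-ins for the `INSERT` into `events.raw` and the `NOTIFY events_raw_new`.

```python
import hashlib
import hmac

RAW_LOG: list[dict] = []        # stand-in for events.raw
NOTIFICATIONS: list[str] = []   # stand-in for NOTIFY events_raw_new

def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex signature a sender would attach (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_signature(secret: bytes, body: bytes, presented: str) -> bool:
    """Constant-time comparison, so timing does not leak the expected value."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), presented.encode())

def handle_push(secret: bytes, body: bytes, presented: str, source: str) -> int:
    """Verify, store verbatim, notify, acknowledge. No parsing happens here."""
    if not verify_signature(secret, body, presented):
        return 401
    RAW_LOG.append({"source": source, "signature": presented, "payload": body})
    NOTIFICATIONS.append("events_raw_new")
    return 200

secret = b"shared-secret-for-illustration"
body = b'{"imei": "A", "lat": -1.30, "lon": 36.80}'
accepted = handle_push(secret, body, sign(secret, body), "jimi_push")
rejected = handle_push(secret, body + b" ", sign(secret, body), "jimi_push")
```

The handler never inspects the body beyond hashing it, which is what keeps the per-request cost constant regardless of payload content.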
**The parser is versioned.** `events.raw.parser_version` records which parser handled each row. If `alertTypeId` becomes `alertId`, we bump the parser, re-parse affected rows. `events.raw` is untouched.
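A minimal sketch of versioned parsing, using the `alertTypeId` to `alertId` rename mentioned above. The registry layout, the output field names, and the choice to tag each parsed row with the version are assumptions for illustration.

```python
from typing import Callable

def parse_v1(payload: dict) -> dict:
    """Original contract: the alarm type arrives as alertTypeId."""
    return {"kind": "alarm", "imei": payload["imei"], "alert_type": payload["alertTypeId"]}

def parse_v2(payload: dict) -> dict:
    """After the rename: accept alertId, and still read older rows."""
    if "alertId" in payload:
        alert_type = payload["alertId"]
    else:
        alert_type = payload["alertTypeId"]
    return {"kind": "alarm", "imei": payload["imei"], "alert_type": alert_type}

PARSERS: dict[int, Callable[[dict], dict]] = {1: parse_v1, 2: parse_v2}
CURRENT_VERSION = 2

def reparse(raw_rows: list[dict], version: int = CURRENT_VERSION) -> tuple[list[dict], list[int]]:
    """Re-derive parsed events. Raw payloads are only read, never modified."""
    parser = PARSERS[version]
    parsed, failed_ids = [], []
    for row in raw_rows:
        try:
            event = parser(row["payload"])
        except KeyError:
            failed_ids.append(row["id"])  # left for a later parser version
            continue
        parsed.append({**event, "raw_id": row["id"], "parser_version": version})
    return parsed, failed_ids

rows = [
    {"id": 1, "payload": {"imei": "A", "alertTypeId": 3}},
    {"id": 2, "payload": {"imei": "B", "alertId": 7}},
]
```

With parser version 1 the second row fails and is reported; with version 2 both rows parse, and nothing upstream has to be re-fetched.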
**Projectors are single-writer per state table.** One projector owns `state.live_positions`. It reads `events.parsed` of kind `position_fix`, applies dedup, writes the upsert. No other code writes there. Race conditions between the 60 s sweep, alarm cross-feed, and stale rescue cease to exist because they all produce events the projector orders monotonically.
**Parser and projectors wake on NOTIFY, not on a timer.** Workers hold long-lived `LISTEN events_raw_new` / `LISTEN events_parsed_new` connections. These need to be direct session connections to Postgres rather than connections through pgbouncer, because transaction-mode pooling does not support `LISTEN`. New raw row → parser wakes within milliseconds → writes parsed row → projector wakes within milliseconds → updates state. There is no stacked polling delay between stages. A timer-based fallback (every 5 s) catches the rare case of a missed `NOTIFY` (e.g. connection blip); under normal operation the timer never fires because the listener already drained the row. For workers competing for the same queue, draining uses `SELECT … FOR UPDATE SKIP LOCKED` so multiple worker instances can scale out without double-processing.
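The wake-or-fallback loop can be sketched without a database. Here `threading.Event` stands in for the `LISTEN` connection and a `deque` stands in for unprocessed rows; `SKIP LOCKED` row claiming between competing workers is not modelled. All names are illustrative.

```python
import threading
from collections import deque
from typing import Callable

class Wakeup:
    """Stand-in for a LISTEN connection: notify() plays the role of NOTIFY."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def notify(self) -> None:
        self._event.set()

    def wait(self, timeout: float) -> bool:
        """Return True if woken by a notification, False if the timer elapsed."""
        fired = self._event.wait(timeout)
        self._event.clear()
        return fired

def drain(pending: deque, handle: Callable[[object], None]) -> int:
    """Process everything currently pending and report how many rows were handled."""
    count = 0
    while pending:
        handle(pending.popleft())
        count += 1
    return count

def worker_step(
    wakeup: Wakeup,
    pending: deque,
    handle: Callable[[object], None],
    fallback_seconds: float = 5.0,
) -> tuple[str, int]:
    """One loop iteration: block until notified or until the fallback timer, then drain."""
    woken = wakeup.wait(fallback_seconds)
    return ("notify" if woken else "timer", drain(pending, handle))
```

Draining after either wake-up reason is the important part: a missed notification only delays processing by at most one fallback interval, it never loses the row.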
**The contract checker runs daily.** Calls each Jimi endpoint against a sandbox account, validates response against the current Pydantic model, alerts on drift.
**Polling workers go through `events.raw`.** The cron role's `poll_live_positions` calls `jimi.user.device.location.list`, persists each device's payload to `events.raw`, ACKs. The parser and projector handle the rest. Polls and pushes are indistinguishable downstream of `events.raw`.
**Backfill / replay is trivial.** "Re-run the projector for trips between 2026-01-01 and 2026-03-01" is a SQL statement.
**On not using Redis as an ingest buffer.** A common reflex at this point is to put Redis between the gateway and Postgres to absorb spikes. We are not doing that, for the following reasons:
- The gateway's per-request work under the contract above is single-digit milliseconds against a healthy Postgres. The 100 ms Jimi timeout has ~10x headroom. There is no current latency problem to solve.
- Redis would create a window in which an event has been acknowledged to Jimi but is not yet in `events.raw`. That weakens the "every signal is captured immutably before it is interpreted" invariant from §2.1: the durable record would become "Redis OR Postgres," not "Postgres". On a Redis crash without `appendfsync always`, in-flight events are lost with no trace.
- Replay semantics become more complicated: `events.raw` ceases to be a complete record of inbound traffic until after Redis has drained. Reasoning about "did we receive that?" requires checking two systems.
- Operational surface grows: one more container to monitor, back up, tune memory on, and reason about during incidents.
If the gateway's own workload — not its co-tenants' — ever exceeds what a synchronous Postgres `INSERT` can handle (a roughly 10x-from-today problem, per §10.3 P1 in the PRD), we revisit. Tracked as open question Q7 in §15.
---
## 7. The serving layer
Three endpoints. The two view endpoints share one response shape; the routing endpoint has its own:
```
GET /api/views/live?filters=... → {summary, geojson, slo_status}
GET /api/views/history?filters=... → {summary, geojson, slo_status}
POST /api/routes → {route_geojson, eta_sec, distance_m, observed_basis}
```
Each endpoint maps 1:1 to a SQL function in `serve.fn_*`. The Pydantic response model is the contract.
**Filters are a single `jsonb` parameter** (cost_centre, assigned_city, vehicle_numbers[], date_range). Adding a filter is one SQL change, no API signature break.
**Auth:** JWT mandatory on every endpoint, read and write — including dashboard reads, which are no longer public. The legacy public-read posture is not preserved (see §15 Q1, closed). `/api/auth/token` issues short-lived JWTs from `auth.accounts` credentials. Dashboards request a token on load, cache locally, refresh before expiry. Scopes: `read:fleet` (all dashboard reads), `write:ops` (driver assignments, alarm acks, service log entries, admin actions on operational records), `admin:fleet` (lifecycle transitions, device provisioning, audit access). Push endpoints use HMAC shared secret per source (Jimi push, WhatsApp fuel microservice, HR extract).
**Rationale for closing Q1 in favour of authenticated reads:** by Phase 4 the platform carries driver shift start-locations (home-area information), driver names and phone numbers from the HR extract, customer-site visit patterns through trip endpoints, and plate-to-cost-centre mappings that reveal commercial relationships. This is no longer the same security posture as "anonymous vehicle dots on a public map" — even ignoring the original case for not exposing live fleet positions to competitors, the dataset has grown into one that cannot be public-read responsibly.
**Rate limits:** dashboards 60 req/min/IP, routes 10 req/min/IP, push 1000 req/min.
---
## 8. Frontend — three pages, one renderer
```
fleet-core.js           ES module, ~600 lines, MapLibre as only dependency
  initMap(elementId, opts)
  renderView(payload)            ← the universal renderer
  initFilters(state, onChange)
  poiLayer(map, pois)
  costCentrePalette(name)        → colour
  normalisePlate(s)
  apiFetch(path, params)
  clockEAT(elementId)
  authClient                     (token cache + refresh)
index-live.html         ~100 lines, polls /api/views/live every 15 s
index-history.html      ~100 lines, fetches /api/views/history on form submit
index-routes.html       ~100 lines, click-and-route, calls /api/routes
```
The renderer is dumb: take a `{summary, geojson, slo_status}` payload, populate KPIs, replace the GeoJSON source, render SLO badges. It doesn't know what "OFFLINE" means; the server attaches the right `style_class` to each feature. The renderer paints.
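To make the division of labour concrete, here is a sketch of the server side of that contract: building a `{summary, geojson, slo_status}` payload in which every feature already carries its `style_class`. In the platform this is done by the `serve.fn_*` SQL functions; the Python below, the class names, and the row fields are illustrative assumptions.

```python
# Mapping from operational state to a CSS-like class name (names are assumed).
STYLE_BY_STATE = {
    "moving": "marker-moving",
    "parked": "marker-parked",
    "offline": "marker-offline",
}

def to_feature(row: dict) -> dict:
    """Build one GeoJSON point feature; note GeoJSON order is [lon, lat]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [row["lon"], row["lat"]]},
        "properties": {
            "plate": row["plate"],
            "style_class": STYLE_BY_STATE.get(row["state"], "marker-unknown"),
        },
    }

def build_payload(rows: list[dict]) -> dict:
    """Assemble the render-ready payload: KPI counts plus styled features."""
    summary = {"total": len(rows)}
    for row in rows:
        summary[row["state"]] = summary.get(row["state"], 0) + 1
    return {
        "summary": summary,
        "geojson": {"type": "FeatureCollection", "features": [to_feature(r) for r in rows]},
        "slo_status": [],  # filled from slo.v_current_status in the real endpoint
    }

payload = build_payload([
    {"plate": "KAA 001A", "lat": -1.30, "lon": 36.80, "state": "moving"},
    {"plate": "KAA 002B", "lat": -1.28, "lon": 36.82, "state": "parked"},
    {"plate": "KAA 003C", "lat": -1.25, "lon": 36.85, "state": "mystery"},
])
```

The client needs no knowledge of what the states mean: it copies `summary` into the KPI tiles and styles each marker by `style_class`.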
This structurally fixes the "1,400-line dashboard with embedded business logic" problem.
---
## 9. Routing layer (geo)
OSM Kenya + Uganda extracts loaded by `osm_loader` (monthly cron) into a staging schema. `osm2pgrouting` produces `geo.ways` + `geo.ways_vertices_pgr`. Staging → live is `ALTER SCHEMA RENAME`.
**Edge weights are hybrid:**
1. `map_match` projector reads `events.parsed` of kind `position_fix`, finds nearest `geo.ways` segment within 30 m, writes `(segment_id, observed_at, speed_kmh)` to `geo.segment_observations`.
2. A CAGG rolls it into `geo.cagg_segment_speed_band` keyed by `(segment_id, dow, hod)`.
3. `serve.fn_route` calls `pgr_aStar` with a cost function that reads the CAGG for the departure hour-of-day, falls back to OSM `maxspeed`, finally to a global average.
**Response includes `observed_basis`** — which segments came from observed data, which from tags, which from fallback. Dispatch can see "this route is 80% observed-data based". Trust calibration belongs in the response.
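The fallback chain and the `observed_basis` summary can be sketched in a few lines. The real cost function runs inside `serve.fn_route`; here the lookup tables are plain dictionaries, the global average of 30 km/h is an assumed value, and computing the basis as a share of route distance is one possible definition, not a confirmed one.

```python
GLOBAL_AVG_KMH = 30.0  # assumed last-resort speed

def edge_speed(segment_id: int, dow: int, hod: int,
               cagg: dict, maxspeed: dict) -> tuple[float, str]:
    """Return (speed_kmh, basis): observed band, then OSM maxspeed, then global average."""
    observed = cagg.get((segment_id, dow, hod))
    if observed is not None:
        return observed, "observed"
    tagged = maxspeed.get(segment_id)
    if tagged is not None:
        return tagged, "osm_tag"
    return GLOBAL_AVG_KMH, "fallback"

def route_summary(path: list[tuple[int, float]], dow: int, hod: int,
                  cagg: dict, maxspeed: dict) -> dict:
    """path is a list of (segment_id, length_m) in travel order."""
    eta_sec = 0.0
    distance_m = 0.0
    metres_by_basis = {"observed": 0.0, "osm_tag": 0.0, "fallback": 0.0}
    for segment_id, length_m in path:
        speed_kmh, basis = edge_speed(segment_id, dow, hod, cagg, maxspeed)
        eta_sec += length_m / (speed_kmh / 3.6)  # km/h to m/s
        distance_m += length_m
        metres_by_basis[basis] += length_m
    if distance_m > 0:
        basis_pct = {k: round(100 * v / distance_m, 1) for k, v in metres_by_basis.items()}
    else:
        basis_pct = metres_by_basis
    return {"eta_sec": round(eta_sec), "distance_m": distance_m, "observed_basis": basis_pct}

summary = route_summary(
    path=[(1, 800.0), (2, 150.0), (3, 50.0)],
    dow=1, hod=8,
    cagg={(1, 1, 8): 36.0},   # segment 1 has observed data for Monday 08:00
    maxspeed={2: 54.0},       # segment 2 only has an OSM tag
)
```

For this path, 800 m of the 1,000 m come from observed data, so the response can state that the estimate is 80% observed-data based.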
**Active re-routing is out of scope for v1.** Endpoint is "suggest a route at this departure time".
---
## 10. SLOs and observability
`slo.targets`:

| metric | threshold | window |
|---|---|---|
| `fix_freshness_pct_60s` | 95 | 5 min |
| `trip_lag_p95_sec` | 600 | 1 h |
| `route_p95_ms` | 500 | 5 min |
| `parser_lag_p95_sec` | 30 | 5 min |
| `contract_drift_days` | 1 | 1 d |
`slo.measurements` is a hypertable populated every minute by the cron role's `slo_measurement` job. `slo.v_current_status` exposes the live state. Grafana dashboards and alerts read directly from `slo.*`.
Dashboards display SLO-aware status: "Fleet below freshness SLO: 3 vehicles" instead of "3 vehicles OFFLINE 24h+".
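A sketch of what `slo.v_current_status` computes, using the targets listed in this section. The comparison direction per metric (`higher_is_better`) is an assumption inferred from the metric names; it is not a column shown in the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Target:
    metric: str
    threshold: float
    higher_is_better: bool  # assumed: inferred from the metric name

TARGETS = [
    Target("fix_freshness_pct_60s", 95, True),
    Target("trip_lag_p95_sec", 600, False),
    Target("route_p95_ms", 500, False),
    Target("parser_lag_p95_sec", 30, False),
    Target("contract_drift_days", 1, False),
]

def current_status(measurements: dict[str, float]) -> dict[str, str]:
    """Compare the latest measurement of each metric against its target."""
    status = {}
    for target in TARGETS:
        value = measurements.get(target.metric)
        if value is None:
            status[target.metric] = "no_data"
        elif target.higher_is_better:
            status[target.metric] = "met" if value >= target.threshold else "breached"
        else:
            status[target.metric] = "met" if value <= target.threshold else "breached"
    return status

status = current_status({
    "fix_freshness_pct_60s": 96.2,
    "trip_lag_p95_sec": 720,
    "route_p95_ms": 310,
    "parser_lag_p95_sec": 30,
})
```

Treating a missing measurement as `no_data` rather than `met` keeps a stalled measurement job from looking like a healthy fleet.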
Logs ship to container stdout (Coolify aggregates) as structured JSON (`structlog`).
---
## 11. Expected benefits over current architecture and functionality
This is the section that justifies the rebuild. Each row is a concrete, measurable improvement.
### 11.1 Reliability and correctness
| Pain in current system | What changes | Why |
|---|---|---|
| ~10 `[FIX-MNN]` hot-patches per year for write-path races, contract drift, dedup logic | Each category is structurally impossible | Single-writer projectors + versioned parser + one-place dedup rule |
| Silent data loss when Jimi renames a field (weeks to detect) | Drift caught within 24 h | Daily contract checker against sandbox API |
| `STALE_GPS_MS=10min` / `OFFLINE=24h` / `freshness magic` scattered across 12+ places | One row per threshold in `slo.targets` | SLO-first design |
| "OFFLINE" mixes broken device, parked vehicle, expired subscription, decommissioned | Lifecycle and operational state are separate | `domain.devices.lifecycle` + computed operational state |
| Production runs from a non-`main` branch; cherry-picks needed | One branch. Image-tag deploys | CI builds tagged image. Coolify pulls by tag |
| Nominatim slowdown stalls trip ingest | Geocoding is its own worker, never blocks | Async projector pattern |
| No way to replay a dropped event | Every event is in `events.raw`; re-parse anytime | Event sourcing |
| Heavy historical query / OOM in one component can crash everything (single-process fate sharing) | A heavy query in the worker role does not affect the gateway; a runaway cron does not affect either | Container-role separation (same image, gateway / worker / cron) |
| Internal stage-to-stage lag (parser poll + projector poll = up to 20s before SLO timer starts) | Parser and projector wake on `NOTIFY` within milliseconds of the upstream write | Event-driven projection chain |
| Live dashboard publicly readable (legacy posture) | All endpoints require JWT; scope-gated reads | Mandatory auth from day one |
### 11.2 Operations
| Pain in current system | What changes | Why |
|---|---|---|
| Three Docker images, three rebuild cycles per shared-helper change | One image, one rebuild — same image runs in three container roles | Single FastAPI codebase consolidates ingest + API + workers; runtime roles separated by entrypoint |
| ~7 containers (webhook_receiver, ingest_movement, ingest_events, timescale, grafana, pgbouncer, db_backup) | 8 containers: `db`, `pgbouncer`, `platform-gateway`, `platform-worker`, `platform-cron`, `dashboard-proxy`, `grafana`, `db_backup` | Three runtime roles from one image replace three independent Python services; fate isolation gained, build complexity unchanged |
| Coolify per-service redeploy required for shared-helper changes | One redeploy ships everything | One image |
| "What's deployed?" requires `git log` + Coolify UI + container exec | `docker inspect` returns the image tag, which is the git SHA | Image-tag deploys |
| Rollback is `git revert + rebuild + redeploy` (~5 minutes) | Rollback is `coolify deploy :<prev-tag>` (~30 seconds) | Pre-built images in registry |
| Schema migrations are bespoke Python with no formal "down" path | dbmate handles `up`/`down`, generates `schema.sql` snapshot for PR review | Standard tool |
| n8n workflow JSON holds dashboard contracts; not code-reviewable | Pydantic models in code; OpenAPI generated; PRs review contracts | Contracts in code |
| Grafana queries assemble metrics ad-hoc | `slo.*` schema is the metrics layer; Grafana is a thin renderer | Pre-aggregated SLO measurements |
### 11.3 Development velocity
| Pain in current system | What changes | Why |
|---|---|---|
| Adding a third dashboard means re-implementing palette + POI + EAT clock + map setup | `fleet-core.js` exports those primitives; new dashboard is ~100 lines of HTML | Shared renderer |
| Dedup logic change (round 6) required modifying SQL CTE + JS `vehicleState()` | Dedup change is one SQL function | Server-side state computation |
| Trip enrichment (FIX-M20) required modifying `poll_trips` + adding migration + adjusting webhook handler | Enrichment is a new projector reading existing events | Decoupled enrichment |
| Source code is image-baked in dev; typo fix requires rebuild | Bind-mounted in dev; baked in prod (via `APP_MODE` build arg) | Dev/prod parity without dev pain |
| Test suite must use mock DB because shared module assumes pool exists at import time | Tests use real Postgres (docker-compose), shared module is lazy-init | Cleaner module boundaries |
| Adding multi-account support required retrofitting `TARGETS` env var across all polling code | Multi-account is a NOT NULL FK from commit one | Designed-in, not bolted-on |
### 11.4 New capabilities not present today
| Capability | What it enables |
|---|---|
| **Event replay** | "Re-build the last 90 days of trips with corrected enrichment logic" is a SQL statement |
| **Time-travel debug** | "What payload did Jimi send for IMEI X at 03:47 on Tuesday?" is `SELECT * FROM events.raw WHERE imei = ...` |
| **Routing (A\* with time-banded weights)** | Dispatch can suggest a route at a given departure time, with `observed_basis` showing trust level per segment |
| **SLO-driven alerting** | Grafana alert when fix-freshness falls below 95% during business hours; dashboards render SLO state, not arbitrary thresholds |
| **Lifecycle state machine** | Decommissioned devices don't appear in operational dashboards. Suspended devices show a distinct visual state. No more "OFFLINE 24h+" sweep including retired vehicles |
| **Versioned parsers** | When Jimi changes a field name, bump parser version, backfill `events.parsed`, no data lost |
| **Daily contract check** | Upstream API drift caught next day, not next quarter |
| **`observed_basis` in route responses** | Operator trust calibration: "this ETA is based on 80% observed data" vs "mostly OSM tags" |
| **One-place dedup rule** | Future fleet expansion (e.g. adding a third device class) is a one-line change to `serve.fn_live_view` |
### 11.5 Quantified expectations (best-effort estimates)
These are forecasts based on the architectural changes, not measured. Re-baseline after Phase G.
- **Mean time to detect API contract drift:** ~90 days today → **<24 hours** (contract checker).
- **Mean time to detect data freshness regression:** unbounded today → **5 minutes** (SLO alerting).
- **Rollback time:** ~5 minutes today (rebuild + redeploy) → **~30 seconds** (image-tag swap).
- **Add a new dashboard:** ~2 weeks today (re-implement scaffolding) → **~2 days** (~100 lines of HTML against existing renderer).
- **Add a new ingest source (e.g. a 4th sub-account or a new push type):** ~3-5 days today (touch 3 Python files + migration + n8n) → **~half a day** (new gateway endpoint + new parser + new projector — each a single file).
- **Reproduce a production data issue locally:** hours to days today (re-fetch from Jimi, hope it returns same data) → **minutes** (`pg_dump events.raw`, restore, replay).
- **Cold-start a new dev:** ~1 day today (figure out which container does what) → **<10 minutes** (`git clone && docker compose up`).
### 11.6 What does NOT improve
Honest pushback — not everything gets better.
- **Raw write throughput:** identical. Both systems are nowhere near Postgres limits at ~80 vehicles.
- **Map rendering performance:** identical. MapLibre is the same library.
- **Geocoding latency:** identical. Nominatim is still rate-limited.
- **Jimi API rate limits:** unchanged. Their problem, not ours.
- **Operator UI learning curve:** dashboards look familiar but the SLO terminology is new; expect a brief training window.
---
## 12. Phased rollout
The architectural shift is large; the rollout is incremental. Each phase produces a verifiable artefact.
| Phase | Weeks | Deliverable | DoD |
|---|---|---|---|
| **A. Foundation** | 1-2 | Repo, docker-compose, all schemas, dbmate, FastAPI `/health`. CI builds image, Coolify deploys `:latest` | `curl /health` returns DB connectivity from a tagged image |
| **B. Event log + parser** | 3 | `/push/*` endpoints write `events.raw`. Parser worker drains to `events.parsed`. Pydantic models versioned. Contract checker scheduled | Replayed historical Jimi push lands in both `raw` and `parsed` |
| **C. Projectors** | 4-5 | Each `state.*` table has a projector. Multi-account from commit one. Polling workers write `events.raw` only | 24-h soak; `slo.v_current_status` all green; no duplicate fixes |
| **D. Serve layer** | 6 | `serve.fn_*` functions. Geocode worker. Matview refresh inside scheduler | Every dashboard endpoint returns valid Pydantic JSON |
| **E. Dashboards** | 7-8 | `fleet-core.js` + three HTML pages | Feature parity for a chosen 30-day test window |
| **F. Routing** | 9-11 | OSM loader, `geo.ways`, map-match projector, `cagg_segment_speed_band`, `fn_route`, `index-routes.html` | <500 ms p95 routing endpoint |
| **G. Cutover** | 12 | Push mirror forwards events to both old and new for 7 days. DNS cut. 48 h hot-standby. Old stack decommissioned | 7 days post-cutover with no rollback |
**Realistic: 12 weeks for two devs** to ship feature parity + routing v1 + the architectural invariants.
---
## 13. Deployment
**Containers:** `db` (TimescaleDB-HA + PostGIS + pgRouting), `pgbouncer`, `platform-gateway` (FastAPI, gateway role), `platform-worker` (FastAPI, worker role — parser + projectors + geocoder + matview refresh + map-match), `platform-cron` (FastAPI, cron role — polls + contract check + OSM loader + SLO measurement + HR sync), `dashboard-proxy` (nginx → rustfs), `grafana`, `db_backup`. Eight containers. The three `platform-*` containers run the same image with different entrypoint commands.
**Image strategy:** CI builds on push to `main`, tags `<registry>/fleet-platform:<sha>` and `:latest`. Coolify deploys by tag. `docker inspect` answers "what's running". Rollback is `coolify deploy :<prev-tag>` and is the same operation for all three roles.
**Migrations:** `dbmate up` runs on the `platform-worker` container start (only) before FastAPI boots. Forward-only. `schema.sql` is `dbmate dump`, committed, PR-reviewed. The other two roles wait on a startup probe that confirms migration completion before they start serving traffic.
**Healthchecks:**
- `db`: `pg_isready`
- `pgbouncer`: `pg_isready -p 6432`
- `platform-gateway`: `GET /health/gateway` (DB conn + last successful HMAC verify)
- `platform-worker`: `GET /health/worker` (DB conn + last parser run age + LISTEN connection alive)
- `platform-cron`: `GET /health/cron` (DB conn + last scheduled-job tick age)
- `dashboard-proxy`: nginx `/healthz`
- `grafana`: `/api/health`
- `db_backup`: touchfile updated by cron
**Secrets:** `.env` in dev, Coolify env vars in prod. `.env.example` lists every key.
**Domains:** `api.rahamafresh.com` (fleet-platform), `live.rahamafresh.com`, `fleetintelligence.rahamafresh.com`, `routes.rahamafresh.com`, `grafana.rahamafresh.com`.
**Backups:** rustfs sidecar (existing pattern). Add weekly `--schema=events` slice + monthly `--schema=geo` slice for fast partial restore.
**Local dev:** `git clone && cp .env.example .env && docker compose up`. In dev mode (driven by `APP_MODE=dev`) the source is bind-mounted and `uvicorn --reload` picks up edits. An image build is needed only for prod.
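
The `APP_MODE` switch could be as small as one function that builds the server command. This is a sketch under assumptions: `uvicorn_command` is a hypothetical helper, and the host and port values are placeholders. `--host`, `--port`, and `--reload` are standard uvicorn CLI flags.

```python
import os
from typing import Mapping


def uvicorn_command(target: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Build the uvicorn argv for an ASGI target such as 'package.module:app'.

    Auto-reload is enabled only when APP_MODE=dev, so the prod image
    never watches the filesystem.
    """
    cmd = ["uvicorn", target, "--host", "0.0.0.0", "--port", "8000"]  # placeholder bind
    if env.get("APP_MODE") == "dev":
        cmd.append("--reload")
    return cmd
```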
---
## 14. What we explicitly drop
- **n8n for dashboard contracts.** Service-layer logic is in code.
- **Three independent Python processes with a shared module imported into each.** Replaced by one codebase running in three container roles from the same image. The fate-sharing failure mode goes away; the rebuild-three-images-for-one-change failure mode goes away. Operationally it is three containers, but architecturally it is one service.
- **`reporting` and `tracksolid` schemas as a mental model.** Replaced by `events / state / domain / geo / serve / slo / ops / auth`.
- **`enabled_flag=1` magic.** Replaced by a `state.active_devices` view.
- **Magic-number thresholds scattered through SQL/JS.** Replaced by rows in `slo.targets`.
- **Synchronous Nominatim in the write path.** Replaced by an async projector.
- **Manual branch-to-prod mapping.** Replaced by image tags.
- **Per-feature bolt-on tables.** New domain entities go in `domain` or a new schema with one reason to exist.
- **Public-read dashboards.** Replaced by JWT-required reads from day one.
- **Poll-driven internal stages.** Replaced by `LISTEN/NOTIFY`-driven parse → project; APScheduler retained only for genuinely time-triggered jobs.
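
To illustrate what moving thresholds into `slo.targets` rows buys, the sketch below evaluates measurements against targets loaded as data rather than constants embedded in SQL or JS. The column shape (`name`, `comparator`, `threshold`) and the example metric names are assumptions; the actual `slo.targets` schema is not shown in this section.

```python
import operator
from dataclasses import dataclass

# Comparators a target row is allowed to name.
_COMPARATORS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


@dataclass(frozen=True)
class SloTarget:
    """One row of slo.targets (assumed shape)."""
    name: str
    comparator: str
    threshold: float


def evaluate(targets: list[SloTarget], measurements: dict[str, float]) -> dict[str, bool]:
    """Map each target name to True (green) or False (red).

    A missing measurement counts as red: silence is treated as a failure,
    not as a pass.
    """
    result = {}
    for target in targets:
        value = measurements.get(target.name)
        compare = _COMPARATORS[target.comparator]
        result[target.name] = value is not None and compare(value, target.threshold)
    return result
```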
---
## 15. Open architectural questions
Decisions the team needs to make before or during execution:
1. **Auth posture.** ~~Dashboards public-read (current) or login-gated?~~ **Closed: login-gated.** All endpoints require JWT from day one. Public-read is not preserved; the dataset has grown into one that cannot be public-read responsibly (see §7 rationale). Three scopes: `read:fleet`, `write:ops`, `admin:fleet`.
2. **Routing v1 scope.** Suggest-route only, or proactive deviation alerts? v1 = suggest. v2 = active. *(Note: routing is scope-deferred from the PRD into a companion project; this question is preserved for that project's reference.)*
3. **SLO targets.** Actual numbers? Freshness < 60 s? 90 s? 120 s? Pick before Phase G.
4. **n8n retention.** Drop entirely, or keep for cross-system orchestration (Slack alerts, CRM bridges)? Default = drop unless a concrete workflow needs it.
5. **Image registry.** `ghcr.io` (free, GitHub) or self-hosted (`registry.rahamafresh.com`)? Affects CI complexity.
6. **Analytics layer.** Out of scope here. If longitudinal reporting becomes a need, design as a separate concern reading from `state.position_history` / `state.trips` — not folded into the operational stack.
7. **Redis ingest buffer — re-evaluate trigger.** Not adopted in v1 (see §6 rationale: weakens immutability invariant, adds failure surface, no current latency problem). Re-evaluate when *any* of the following becomes true: (a) gateway p95 latency exceeds 50 ms sustained for a week against the contracted "HMAC + INSERT + 200 OK" workload; (b) Postgres `INSERT` rate against `events.raw` approaches the chunk-write throughput ceiling on the current VPS class; (c) push receiver concurrency exceeds 200 in-flight requests during a normal hour. Until then, container-role separation provides the fate isolation; `LISTEN/NOTIFY` provides the wake-on-arrival pattern.
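
The three Redis re-evaluation triggers in item 7 are concrete enough to check mechanically. The sketch below is one possible reading. It interprets "sustained for a week" as seven consecutive daily p95 values above 50 ms, and "approaches the ceiling" as 80% of a measured ceiling. Both interpretations, and all names, are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestStats:
    daily_gateway_p95_ms: list[float]  # one p95 value per day, most recent last
    insert_rate_per_s: float           # observed INSERT rate on events.raw
    insert_ceiling_per_s: float        # measured chunk-write ceiling for the VPS class
    peak_inflight_normal_hour: int     # max concurrent push requests in a normal hour


def redis_reevaluation_triggers(stats: IngestStats, ceiling_fraction: float = 0.8) -> list[str]:
    """Return the triggers that fired; an empty list means stay on Postgres only."""
    fired = []
    last_week = stats.daily_gateway_p95_ms[-7:]
    # (a) needs a full week of data, every day above 50 ms.
    if len(last_week) == 7 and all(p95 > 50.0 for p95 in last_week):
        fired.append("a: gateway p95 > 50 ms sustained for a week")
    # (b) the fraction is an assumed definition of "approaches".
    if stats.insert_rate_per_s >= ceiling_fraction * stats.insert_ceiling_per_s:
        fired.append("b: INSERT rate approaching write ceiling")
    # (c) concurrency above 200 in-flight requests.
    if stats.peak_inflight_normal_hour > 200:
        fired.append("c: > 200 in-flight push requests")
    return fired
```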
---
## 16. Verification — done when
1. Every dashboard URL has a working equivalent with feature parity for a chosen 30-day test window.
2. `slo.v_current_status` shows all SLOs green for 7 consecutive days post-cutover.
3. `events.raw` can be replayed to rebuild `state.*` from scratch within an hour. (Demonstrate by truncating `state.live_positions` and re-projecting.)
4. The contract checker has caught at least one synthetic API change in staging, then run green for 7 days in production.
5. `git log origin/main` is the source of truth for what Coolify runs.
6. The old three-repo stack is archived; the `webhook_receiver` in the old stack receives no traffic.
7. Routing endpoint returns a valid LineString in <500 ms p95.
8. A new dev clones the repo, runs `docker compose up`, and has a working local stack in <10 minutes.
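
Criterion 7 can be checked from a latency sample and the returned geometries. The sketch below uses the nearest-rank definition of p95 and a structural GeoJSON check (a `LineString` needs at least two positions). The function names are hypothetical, and the routing response is assumed to carry a GeoJSON geometry object.

```python
def p95_nearest_rank(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile; integer arithmetic avoids float rounding."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def is_valid_linestring(geometry: dict) -> bool:
    """Structural check: type tag plus at least two numeric [lon, lat] positions."""
    if geometry.get("type") != "LineString":
        return False
    coords = geometry.get("coordinates")
    if not isinstance(coords, list) or len(coords) < 2:
        return False
    return all(
        isinstance(pos, (list, tuple))
        and len(pos) >= 2
        and all(isinstance(c, (int, float)) and not isinstance(c, bool) for c in pos[:2])
        for pos in coords
    )


def routing_criterion_met(latencies_ms: list[float], geometries: list[dict]) -> bool:
    """True when p95 latency is under 500 ms and every geometry is a valid LineString."""
    return p95_nearest_rank(latencies_ms) < 500.0 and all(
        is_valid_linestring(g) for g in geometries
    )
```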
---