fleet-platform/260522_fleet_platform_architecture_final.md

Fleet Platform — Greenfield Architecture

Date: 2026-05-22 (rev. b; incorporates engineering review of 2026-05-22)

Author posture: Senior systems architect

Status: Design document. Not a migration plan. Not a port.


2. Architectural principles

These are the load-bearing decisions. Everything below is derived from them.

2.1 The system is an event log with derived state

Every inbound signal — Jimicloud push, polled API response, manual import, ops entry — is appended to an immutable log (events.raw). The raw payload is preserved verbatim with metadata (received_at, source, signature). A separate parser produces typed events in events.parsed. Projectors read events.parsed and update "current state" tables (state.live_positions, state.trips, etc.) deterministically.

This is event sourcing kept lightweight: no Kafka, no Debezium, no separate event store. The log lives in Postgres as a hypertable. Projectors are SQL functions invoked by the worker role when new events arrive (§2.8), not on a fixed schedule.

What this fixes structurally:

  • Race conditions in write paths — projectors are single-writer per state table.
  • Contract drift — re-parse from events.raw with corrected logic, no upstream re-fetch.
  • Time-travel debugging — "what did Jimi send at 03:47 on Tuesday?" is a SELECT.
  • Replay — rebuild any state table from the log.
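The properties listed above can be illustrated with a minimal in-memory sketch. It is not the platform's implementation: the list stands in for events.raw, the payload field names (imei, lat, lng) are hypothetical, and the projector ignores the ordering and dedup rules a real one would apply.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RawEvent:
    """One append-only row; an in-memory stand-in for events.raw."""
    received_at: datetime
    source: str
    payload: str  # kept verbatim, never rewritten


def parse(raw: RawEvent) -> dict:
    """Typed view of a raw payload (hypothetical field names)."""
    body = json.loads(raw.payload)
    return {"imei": str(body["imei"]), "lat": float(body["lat"]), "lng": float(body["lng"])}


def project_live_positions(log: list[RawEvent]) -> dict[str, dict]:
    """Fold the log into 'latest fix per device', in log order."""
    state: dict[str, dict] = {}
    for raw in log:
        event = parse(raw)
        state[event["imei"]] = event
    return state


def received_before(log: list[RawEvent], cutoff: datetime) -> list[RawEvent]:
    """Time travel: the log exactly as it stood at `cutoff`."""
    return [raw for raw in log if raw.received_at <= cutoff]


log = [
    RawEvent(datetime(2026, 5, 19, 3, 47, tzinfo=timezone.utc), "jimi_push",
             '{"imei": "A1", "lat": "-1.2921", "lng": "36.8219"}'),
    RawEvent(datetime(2026, 5, 19, 3, 48, tzinfo=timezone.utc), "jimi_poll",
             '{"imei": "A1", "lat": "-1.3000", "lng": "36.8300"}'),
]
live = project_live_positions(log)  # current state, rebuilt from the log
earlier = project_live_positions(received_before(log, log[0].received_at))  # state as of 03:47
```

Because state is a pure function of the log, correcting `parse` and re-running the projector is all a replay needs; nothing is re-fetched upstream.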

2.2 Contracts are typed, in code, and continuously verified

Every endpoint (Jimicloud → us, us → dashboard, us → routing) has a Pydantic model in the codebase, version-pinned per upstream API minor revision. A scheduled contract-check job hits a sandbox account daily and asserts the shape still matches. Drift produces an alert, not a NULL in production.
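A drift check of this kind might look like the sketch below. To stay runnable on the standard library it uses a plain dict of expected types rather than a Pydantic model, and the field names in `EXPECTED_LOCATION` are hypothetical, not the actual Jimicloud contract.

```python
# Hypothetical expected shape of one upstream response item.
EXPECTED_LOCATION = {"imei": str, "lat": float, "lng": float, "gpsTime": str}


def find_drift(payload: dict, expected: dict[str, type]) -> list[str]:
    """Return human-readable differences between a payload and its contract."""
    problems = []
    for name, expected_type in expected.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            problems.append(f"wrong type for {name}: {type(payload[name]).__name__}")
    for name in sorted(payload.keys() - expected.keys()):
        problems.append(f"unexpected field: {name}")
    return problems


# A renamed field shows up as one missing and one unexpected field.
renamed = {"imei": "A1", "lat": -1.29, "lng": 36.82, "gps_time": "2026-05-19 03:47:00"}
drift = find_drift(renamed, EXPECTED_LOCATION)
```

A scheduled job would raise an alert whenever the returned list is non-empty.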

2.3 The platform is one codebase, one image, three container roles

One FastAPI application, one Git repository, one Docker image. The same image runs in three container roles selected by entrypoint command — platform-gateway (push receivers + dashboard API + routing endpoint + auth), platform-worker (parser + projectors + geocode worker + matview refresher), platform-cron (contract checker + OSM loader + scheduled polls + SLO measurement). Configuration and database connection are shared; process memory is not.

This is the same pattern Sidekiq / Celery / Rails apps use: single codebase, single deployment artefact, multiple runtime roles. It is not microservices: there is no service mesh, no inter-service contracts, no separate repos, no per-service teams. There is one PR review surface and one image to roll back. What it provides is fate isolation: a heavy historical query that triggers an OOM in the worker role does not take down the gateway; an OSM loader that pegs CPU in the cron role does not stall webhook receivers. The original "one process" design conflated single-codebase (correct) with single-process (an avoidable failure mode).

Microservices solve organisational scaling problems we don't have. Container-role separation solves the fate-sharing problem we do have. Two developers and one fleet do not need a service mesh; they do need the gateway to keep responding to Jimi pushes while a 90-day report runs.

2.4 Lifecycle state and operational state are orthogonal and explicit

Devices have a lifecycle state machine (provisioned → active → suspended → decommissioned) stored explicitly in domain.devices.lifecycle. Vehicles have an operational state (moving | parked | offline | unknown) derived from the latest fix — never stored, always computed.
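The split between stored lifecycle and computed operational state could be sketched as follows. The transition table goes beyond the text in two places (active to decommissioned, and suspended back to active), and the 10-minute and 3 km/h thresholds are illustrative defaults only; all of these are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# provisioned -> active -> suspended -> decommissioned comes from the text;
# the extra edges (active -> decommissioned, suspended -> active) are assumptions.
ALLOWED_TRANSITIONS = {
    "provisioned": {"active"},
    "active": {"suspended", "decommissioned"},
    "suspended": {"active", "decommissioned"},
    "decommissioned": set(),
}


def transition(current: str, target: str) -> str:
    """Validate a lifecycle change; the result is what gets stored."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal lifecycle transition: {current} -> {target}")
    return target


def operational_state(last_fix_at: Optional[datetime], speed_kmh: float, now: datetime,
                      offline_after: timedelta = timedelta(minutes=10),
                      moving_above_kmh: float = 3.0) -> str:
    """Derive moving | parked | offline | unknown from the latest fix; never stored."""
    if last_fix_at is None:
        return "unknown"
    if now - last_fix_at > offline_after:
        return "offline"
    return "moving" if speed_kmh > moving_above_kmh else "parked"


now = datetime(2026, 5, 19, 4, 0, tzinfo=timezone.utc)
state = operational_state(now - timedelta(seconds=20), 42.0, now)
```

A decommissioned device with no recent fix is then "decommissioned" by lifecycle and simply absent from operational views, rather than being counted as offline.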

2.5 SLOs are first-class, not implicit thresholds

Service-level objectives live in slo.targets, one row per metric, with the current threshold value. Dashboards render against them. Grafana monitors against them. Changing a threshold is one UPDATE, not a scavenger hunt through SQL + JS.

2.6 Deploys are by image tag, not by branch

CI builds an image on push to main, tags it <registry>/fleet-platform:<git-sha> plus a moving :latest. Coolify deploys by tag. "What's deployed?" is docker inspect.

2.7 Dashboards are thin renderers over a typed read model

Each dashboard polls one endpoint that returns a complete, render-ready payload. The endpoint's shape is the contract. All state logic, palette assignment, plate-tail extraction, EAT formatting, and KPI counting happen server-side. The HTML/JS is a render layer.
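As a rough illustration of "render-ready", the server-side assembly might resemble the sketch below. The input record fields, the four-character plate tail, and the `vehicle-<state>` class naming are hypothetical; only the top-level keys summary, geojson and slo_status come from this document.

```python
def build_live_payload(vehicles: list[dict], slo_status: list[dict]) -> dict:
    """Assemble a complete payload so the browser only has to paint it."""
    by_state: dict[str, int] = {}
    features = []
    for v in vehicles:
        by_state[v["state"]] = by_state.get(v["state"], 0) + 1
        features.append({
            "type": "Feature",
            # GeoJSON coordinate order is [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [v["lng"], v["lat"]]},
            "properties": {
                "plate": v["plate"],
                "plate_tail": v["plate"][-4:],           # assumed tail length
                "style_class": "vehicle-" + v["state"],  # assumed naming scheme
            },
        })
    return {
        "summary": {"total": len(vehicles), "by_state": by_state},
        "geojson": {"type": "FeatureCollection", "features": features},
        "slo_status": slo_status,
    }


payload = build_live_payload(
    [{"plate": "KDA 123A", "lat": -1.29, "lng": 36.82, "state": "moving"},
     {"plate": "KDB 456B", "lat": -1.31, "lng": 36.85, "state": "offline"}],
    slo_status=[],
)
```

The client never decides what "offline" looks like; it applies whatever style_class the server attached.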

2.8 Projection is event-driven, not poll-driven

Parsers and projectors do not run on fixed schedules. The gateway, after inserting into events.raw, issues a Postgres NOTIFY events_raw_new in the same transaction. The worker role holds long-lived LISTEN connections and wakes immediately, parses the new row, writes events.parsed, issues NOTIFY events_parsed_new, and the relevant projector wakes and updates state.*. The chain is parse → project in tens of milliseconds, not stacked polling windows.

APScheduler is retained — but only for things that are genuinely time-triggered: the daily contract checker, the monthly OSM loader, the per-minute SLO measurement, the 60s polled-ingest sweep. Anything that should fire on data arrival fires on data arrival.

This structurally avoids the "parser every 10s + projector every 10s = up-to-20s of internal lag before the SLO timer even starts" problem. The freshness budget is spent on Jimi's transport, not on our scheduler.

2.9 External enrichment is asynchronous and replayable

Nominatim, Mapbox, weather, traffic — all run as separate projectors reading events.parsed, writing to side tables. Non-critical. Can be down for an hour without affecting the platform. Can be re-run without re-fetching upstream.


3. The system, at a glance

                              Browser
                                 │
              ┌──────────────────┴──────────────────┐
              │   live      history      routes      │   3 HTML pages
              │                                      │   importing fleet-core.js
              └──────────────────┬──────────────────┘
                                 │ HTTPS · JWT (mandatory)
                          api.rahamafresh.com
                                 │
                  ┌──────────────┴──────────────┐
                  │  platform-gateway           │  ← one image,
                  │  (FastAPI)                  │     three roles
                  │   /push/*                   │
                  │   /api/views/*              │
                  │   /api/routes               │
                  │   /api/auth/token           │
                  │                             │
                  │   gateway contract:         │
                  │   HMAC-verify + INSERT      │
                  │   events.raw + NOTIFY       │
                  │   + 200 OK. Nothing else.   │
                  └──────────────┬──────────────┘
                                 │
                  ┌──────────────┴──────────────┐
                  │  platform-worker            │  ← same image,
                  │  (FastAPI workers)          │     worker role
                  │   LISTEN events_raw_new     │
                  │    → parser → events.parsed │
                  │    → projectors → state.*   │
                  │   geocode_worker            │
                  │   matview_refresh           │
                  │   map_match (geo)           │
                  └──────────────┬──────────────┘
                                 │
                  ┌──────────────┴──────────────┐
                  │  platform-cron              │  ← same image,
                  │  (APScheduler)              │     cron role
                  │   polls: live/trips/        │
                  │    parking/track/devices/   │
                  │    stale (60s / 10m)        │
                  │   contract_check (daily)    │
                  │   osm_loader (monthly)      │
                  │   slo_measurement (1m)      │
                  │   hr_sync (3h)              │
                  └──────────────┬──────────────┘
                                 │
                          pgbouncer (txn mode)
                                 │
                  ┌──────────────┴──────────────┐
                  │  TimescaleDB-HA             │
                  │  + PostGIS + pgRouting      │
                  │                             │
                  │  events / state / domain    │
                  │  geo / serve / slo / auth   │
                  └──────────────┬──────────────┘
                                 │
                       ┌─────────┴─────────┐
                       │ db_backup → rustfs│
                       │ grafana           │
                       └───────────────────┘

One codebase. One image. Three container roles. One database. One repo. One branch. One image tag is what production is running.

The three roles do not share a Python process. They share configuration, database connection string, and Pydantic models. The gateway can keep returning 200 OK to Jimi while the worker is busy parsing a backlog, and the cron role can run a contract check that consumes CPU without affecting either. A failure in one role does not take the others down.


4. Tech stack

| Layer | Choice | Why this |
|---|---|---|
| Language | Python 3.12 | Team already knows it. No second language to operate. |
| Web framework | FastAPI | Typed (Pydantic), async-native, OpenAPI for free, fast enough. |
| In-process scheduler | APScheduler | Sturdier successor to the schedule library; cron syntax + persistence. |
| Database | PostgreSQL 16 + TimescaleDB 2.15 | Hypertables and continuous aggregates are real value. |
| GIS | PostGIS 3 | Required. Non-negotiable. |
| Routing | pgRouting | In-database A*. One connection string. Upgrade path to OSRM exists. |
| Connection pool | pgbouncer (transaction mode) | Works. |
| Migrations | dbmate | Single static binary, forward-only SQL, diff-able schema.sql. |
| Auth | JWT (FastAPI native) + bcrypt | Short-lived tokens for dashboards. HMAC for webhooks. |
| Rate limiting | slowapi | FastAPI-native. |
| Frontend | Vanilla JS, ES modules, no bundler | Three HTML pages, one shared fleet-core.js. |
| Map library | MapLibre GL JS 4.x | Already in current dashboards. Free for OSM/Carto basemaps. |
| Basemap | Carto Voyager (default), Mapbox dark-v11 (scaffolded) | No mandatory token. |
| Reverse geocoding | Nominatim primary, Mapbox fallback | Asynchronous, queued, never blocks ingest. |
| Edge / proxy | Coolify-managed Traefik → nginx:alpine → rustfs / fleet-platform | Continues current pattern. We know it works. |
| Object store | rustfs | Already deployed. Hosts static dashboards + DB backups. |
| Observability | Grafana + Postgres views + container logs (structlog JSON) | One Grafana instance reads slo.* directly. No separate metrics stack. |
| CI/CD | GitHub Actions → image registry → Coolify image-tag deploy | Build on push to main, tag with git SHA. |
| OSM data | Geofabrik Kenya + Uganda extracts, monthly refresh | Reproducible, atomic swap via ALTER SCHEMA RENAME. |
| Secrets | .env (dev) + Coolify env vars (prod) | Nothing in git. .env.example documents required keys. |

Explicitly not in the stack: message queue (Redis/RabbitMQ/SQS), separate event store (Kafka/Debezium), n8n for service-layer work, frontend framework (React/Vue/Svelte), ORM (psycopg + plain SQL instead).


5. Schema — layered by purpose, not by feature

events                — immutable log; the source of truth
  events.raw                  — hypertable, append-only, partition by received_at
  events.parsed               — hypertable, derived from events.raw by parser

state                 — derived projections (rebuildable from events)
  state.devices               — current device metadata
  state.live_positions        — one row per device, latest fix
  state.position_history      — hypertable, full timeline
  state.trips                 — closed trips
  state.parking_events
  state.alarms                — hypertable
  state.obd_readings          — hypertable
  state.heartbeats            — hypertable
  state.device_events         — hypertable (login/logout)
  state.fuel_readings         — hypertable
  state.temperature_readings  — hypertable
  state.lbs_readings          — hypertable
  state.geocoded_positions    — async enrichment side table

domain                — business entities + lifecycle
  domain.accounts             — Jimi sub-accounts (FK target)
  domain.vehicles             — vehicle identity (plate, model, depot)
  domain.devices              — device <-> vehicle mapping with lifecycle
  domain.drivers
  domain.cost_centres
  domain.assigned_cities

geo                   — PostGIS + pgRouting + OSM
  geo.ways                          — pgRouting topology
  geo.ways_vertices_pgr
  geo.segment_observations          — hypertable, map-matched fixes
  geo.cagg_segment_speed_band       — CAGG (segment_id, dow, hod) → avg speed
  geo.pois                          — Fireside HQ + future depots

slo                   — explicit service level objectives
  slo.targets                       — one row per SLO with current threshold
  slo.measurements                  — hypertable, computed every minute
  slo.v_current_status              — view: is each SLO currently met?

serve                 — dashboard + API contracts (SQL functions)
  serve.normalize_plate(text)
  serve.fn_live_view(filters jsonb)
  serve.fn_history_view(filters jsonb)
  serve.fn_route(origin, dest, depart_at)
  serve.fn_dispatch_view()
  serve.v_fleet_overview

ops                   — dispatch / tickets / CRM-ish
  ops.tickets
  ops.dispatch_log
  ops.cost_rates
  ops.service_log
  ops.odometer_readings

auth                  — service auth
  auth.accounts
  auth.tokens

Read this schema top-down: events are immutable truth, state is derived, domain owns business identity, geo owns geometry, slo owns commitments, serve owns the dashboard contract, ops owns workflow, auth owns access. Each schema has one reason to exist.

Where the round-6 dedup logic lives: in serve.fn_live_view and serve.v_live_dedup. One function, one place. Changing the rule changes one place.

Where SLO thresholds live: slo.targets. Not scattered constants.


6. The ingest pipeline

Jimicloud push  ─────┐
Jimicloud poll  ─────┼─→  events.raw  ─NOTIFY→  parser  ─→  events.parsed  ─NOTIFY→  projectors  ─→  state.*
CSV import      ─────┤                                                                        ↘
Ops action      ─────┘                                                                          geocode_worker  ─→  state.geocoded_positions

The gateway contract is minimal and inviolable. The platform-gateway role does, per request, exactly: (1) HMAC signature verification, (2) one INSERT into events.raw, (3) one NOTIFY events_raw_new, (4) return 200 OK. No Pydantic parsing, no PostGIS calculation, no geocoding, no projector work. All of that happens in the worker role, downstream. p95 < 100 ms is achievable with ~10x headroom under this contract; the Jimi-side timeout is never the constraint.
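The four steps can be sketched with the standard library alone. The signature scheme (hex HMAC-SHA256 over the raw body), the 401 response on a bad signature, and the `insert_and_notify` callable standing in for the INSERT plus NOTIFY transaction are all assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Callable


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def handle_push(body: bytes, signature_hex: str, secret: bytes,
                insert_and_notify: Callable[[bytes], None]) -> int:
    """Verify, store verbatim, acknowledge. No parsing happens here."""
    expected = sign(secret, body)
    # compare_digest runs in constant time, which avoids leaking how much matched.
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return 401  # assumed status for a bad signature
    insert_and_notify(body)  # stand-in for INSERT into events.raw + NOTIFY, one transaction
    return 200


stored: list[bytes] = []
secret = b"shared-secret-for-this-source"
body = b'{"imei": "A1", "lat": "-1.2921"}'
status = handle_push(body, sign(secret, body), secret, stored.append)
```

The body is stored as received; nothing in the handler depends on what the bytes contain.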

This is what makes Postgres-as-queue safe at this scale: the gateway's per-request work is constant-time and CPU-trivial. The event-loop-starvation risk that a co-tenanted parser/projector would create is eliminated by container-role separation (§2.3), not by introducing a separate buffer.

events.raw is the contract. Every gateway endpoint writes here first, verbatim. No parsing at the gateway. If parsing is wrong, we re-parse later. The raw row is the immutable record that "we received this" — no in-memory buffer sits between Jimi and durable storage.

The parser is versioned. events.raw.parser_version records which parser handled each row. If alertTypeId becomes alertId, we bump the parser, re-parse affected rows. events.raw is untouched.
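A sketch of the re-parse flow for the alertTypeId to alertId rename follows. For brevity it keeps the parser version next to the parsed output rather than on events.raw.parser_version, and the row layout and function names are hypothetical.

```python
def parse_v1(payload: dict) -> dict:
    return {"imei": payload["imei"], "alert_id": payload["alertTypeId"]}


def parse_v2(payload: dict) -> dict:
    # Accepts the renamed field and still understands the old one.
    alert_id = payload["alertId"] if "alertId" in payload else payload["alertTypeId"]
    return {"imei": payload["imei"], "alert_id": alert_id}


PARSERS = {1: parse_v1, 2: parse_v2}


def run_parser(raw_rows: list[dict], parsed: dict[int, dict], version: int) -> dict[int, dict]:
    """(Re-)parse every raw row not yet handled by `version` or newer."""
    parser = PARSERS[version]
    for row in raw_rows:
        done = parsed.get(row["id"])
        if done is not None and done["parser_version"] >= version:
            continue
        try:
            event = parser(row["payload"])
        except KeyError:
            continue  # this version cannot read the payload; a later one may
        parsed[row["id"]] = {"parser_version": version, "event": event}
    return parsed


raw_rows = [
    {"id": 1, "payload": {"imei": "A1", "alertTypeId": 7}},
    {"id": 2, "payload": {"imei": "A1", "alertId": 9}},  # upstream renamed the field
]
parsed = run_parser(raw_rows, {}, version=1)      # row 2 is silently unparseable
parsed = run_parser(raw_rows, parsed, version=2)  # bump the parser, re-parse, nothing lost
```

The raw rows are only ever read, so a mistake in parse_v2 can be fixed by a parse_v3 and another pass.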

Projectors are single-writer per state table. One projector owns state.live_positions. It reads events.parsed of kind position_fix, applies dedup, writes the upsert. No other code writes there. Race conditions between the 60 s sweep, alarm cross-feed, and stale rescue cease to exist because they all produce events the projector orders monotonically.
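The ordering argument can be made concrete with a small sketch. The class below is an in-memory stand-in for the one writer of state.live_positions; the event fields (kind, imei, fix_time) are hypothetical, and "newest fix_time wins" is a simplified stand-in for the actual dedup rule.

```python
from datetime import datetime, timedelta, timezone


class LivePositionsProjector:
    """Sole writer of an in-memory stand-in for state.live_positions."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def apply(self, event: dict) -> bool:
        """Upsert the fix if it is newer than what is stored; report whether state changed."""
        if event["kind"] != "position_fix":
            return False
        current = self._rows.get(event["imei"])
        if current is not None and event["fix_time"] <= current["fix_time"]:
            return False  # duplicate or out-of-order fix: ignore
        self._rows[event["imei"]] = event
        return True

    def snapshot(self) -> dict[str, dict]:
        return dict(self._rows)


t0 = datetime(2026, 5, 19, 3, 47, tzinfo=timezone.utc)
older = {"kind": "position_fix", "imei": "A1", "fix_time": t0}
newer = {"kind": "position_fix", "imei": "A1", "fix_time": t0 + timedelta(seconds=30)}

projector = LivePositionsProjector()
for event in (newer, older, newer):  # late and duplicate arrivals
    projector.apply(event)
```

Whichever path delivered a fix (push, 60 s sweep, stale rescue), the stored row ends up the same, because only one piece of code decides what "latest" means.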

Parser and projectors wake on NOTIFY, not on a timer. Workers hold long-lived LISTEN events_raw_new / LISTEN events_parsed_new connections. New raw row → parser wakes within milliseconds → writes parsed row → projector wakes within milliseconds → updates state. There is no stacked polling delay between stages. A timer-based fallback (every 5 s) catches the rare case of a missed NOTIFY (e.g. connection blip); under normal operation the timer never fires because the listener already drained the row. For workers competing for the same queue, draining uses SELECT … FOR UPDATE SKIP LOCKED so multiple worker instances can scale out without double-processing.
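The wake-or-timeout pattern is sketched below without a database: a threading.Event plays the role of the LISTEN connection and a list plays the role of unprocessed rows. This is a simulation of the control flow only; it does not model SELECT … FOR UPDATE SKIP LOCKED or competing workers. In a real deployment the listening connection would presumably need to bypass transaction-mode pgbouncer, since LISTEN is session-scoped.

```python
import threading
from typing import Callable


class Waker:
    """Stand-in for a LISTEN connection; notify() plays the role of NOTIFY."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def notify(self) -> None:
        self._event.set()

    def wait(self, timeout: float) -> bool:
        """Block until notified or until the fallback timer expires."""
        fired = self._event.wait(timeout)
        self._event.clear()
        return fired


def run_once(waker: Waker, pending: list, handle: Callable[[object], None],
             fallback_seconds: float = 5.0) -> tuple[str, int]:
    """One loop iteration: wake, then drain everything that is pending."""
    woken_by = "notify" if waker.wait(fallback_seconds) else "timer"
    drained = 0
    while pending:  # drain fully, so a missed notification costs at most one timer period
        handle(pending.pop(0))
        drained += 1
    return woken_by, drained


handled: list = []
waker = Waker()
pending = ["raw-row-1"]
waker.notify()
first = run_once(waker, pending, handled.append, fallback_seconds=0.01)
```

Draining until empty on every wake is what makes the timer a safety net rather than a second code path: both wake-ups do the same work.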

The contract checker runs daily. Calls each Jimi endpoint against a sandbox account, validates response against the current Pydantic model, alerts on drift.

Polling workers go through events.raw. The cron role's poll_live_positions calls jimi.user.device.location.list, persists each device's payload to events.raw, ACKs. The parser and projector handle the rest. Polls and pushes are indistinguishable downstream of events.raw.

Backfill / replay is trivial. "Re-run the projector for trips between 2026-01-01 and 2026-03-01" is a SQL statement.

On not using Redis as an ingest buffer. A common reflex at this point is to put Redis between the gateway and Postgres to absorb spikes. We are not doing that, for the following reasons:

  • The gateway's per-request work under the contract above is single-digit milliseconds against a healthy Postgres. The 100 ms Jimi timeout has ~10x headroom. There is no current latency problem to solve.
  • Redis would create a window in which an event has been acknowledged to Jimi but is not yet in events.raw. That weakens the "every signal is captured immutably before it is interpreted" invariant from §2.1: the durable record would become "Redis OR Postgres," not "Postgres". On a Redis crash without appendfsync always, in-flight events are lost with no trace.
  • Replay semantics become more complicated: events.raw ceases to be a complete record of inbound traffic until after Redis has drained. Reasoning about "did we receive that?" requires checking two systems.
  • Operational surface grows: one more container to monitor, back up, tune memory on, and reason about during incidents.

If the gateway's own workload — not its co-tenants' — ever exceeds what a synchronous Postgres INSERT can handle (a roughly 10x-from-today problem, per §10.3 P1 in the PRD), we revisit. Tracked as open question Q7 in §15.


7. The serving layer

Three dashboard endpoints, one shape:

GET  /api/views/live?filters=...      → {summary, geojson, slo_status}
GET  /api/views/history?filters=...   → {summary, geojson, slo_status}
POST /api/routes                       → {route_geojson, eta_sec, distance_m, observed_basis}

Each endpoint maps 1:1 to a SQL function in serve.fn_*. The Pydantic response model is the contract.

Filters are a single jsonb parameter (cost_centre, assigned_city, vehicle_numbers[], date_range). Adding a filter is one SQL change, no API signature break.

Auth: JWT mandatory on every endpoint, read and write — including dashboard reads, which are no longer public. The legacy public-read posture is not preserved (see §15 Q1, closed). /api/auth/token issues short-lived JWTs from auth.accounts credentials. Dashboards request a token on load, cache locally, refresh before expiry. Scopes: read:fleet (all dashboard reads), write:ops (driver assignments, alarm acks, service log entries, admin actions on operational records), admin:fleet (lifecycle transitions, device provisioning, audit access). Push endpoints use HMAC shared secret per source (Jimi push, WhatsApp fuel microservice, HR extract).

Rationale for closing Q1 in favor of authenticated: by Phase 4 the platform carries driver shift start-locations (home-area information), driver names and phone numbers from the HR extract, customer-site visit patterns through trip endpoints, and plate-to-cost-centre mappings that reveal commercial relationships. This is no longer the same security posture as "anonymous vehicle dots on a public map" — even ignoring the original case for not exposing live fleet positions to competitors, the dataset has grown into one that cannot be public-read responsibly.

Rate limits: dashboards 60 req/min/IP, routes 10 req/min/IP, push 1000 req/min.


8. Frontend — three pages, one renderer

fleet-core.js          ES module, ~600 lines, MapLibre as only dependency
  initMap(elementId, opts)
  renderView(payload)              ← the universal renderer
  initFilters(state, onChange)
  poiLayer(map, pois)
  costCentrePalette(name) → colour
  normalisePlate(s)
  apiFetch(path, params)
  clockEAT(elementId)
  authClient (token cache + refresh)

index-live.html        ~100 lines, polls /api/views/live every 15 s
index-history.html     ~100 lines, fetches /api/views/history on form submit
index-routes.html      ~100 lines, click-and-route, calls /api/routes

The renderer is dumb: take a {summary, geojson, slo_status} payload, populate KPIs, replace the GeoJSON source, render SLO badges. It doesn't know what "OFFLINE" means; the server attaches the right style_class to each feature. The renderer paints.

This structurally fixes the "1,400-line dashboard with embedded business logic" problem.


9. Routing layer (geo)

OSM Kenya + Uganda extracts are loaded by osm_loader (monthly cron) into a staging schema. osm2pgrouting produces geo.ways + geo.ways_vertices_pgr. Staging → live is ALTER SCHEMA RENAME.

Edge weights are hybrid:

  1. map_match projector reads events.parsed of kind position_fix, finds nearest geo.ways segment within 30 m, writes (segment_id, observed_at, speed_kmh) to geo.segment_observations.
  2. A CAGG rolls it into geo.cagg_segment_speed_band keyed by (segment_id, dow, hod).
  3. serve.fn_route calls pgr_aStar with a cost function that reads the CAGG for the departure hour-of-day, falls back to OSM maxspeed, finally to a global average.

Response includes observed_basis — which segments came from observed data, which from tags, which from fallback. Dispatch can see "this route is 80% observed-data based". Trust calibration belongs in the response.
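The fallback chain and the observed_basis summary might be computed along these lines. This is a sketch outside the database, not the serve.fn_route cost function: the dictionaries stand in for the CAGG and the OSM maxspeed tags, the 40 km/h global average is illustrative, and weighting observed_basis by segment length is an assumption (the text does not define the percentage).

```python
def edge_speed(segment_id: int, dow: int, hod: int,
               observed: dict[tuple[int, int, int], float],
               maxspeed: dict[int, float],
               global_avg_kmh: float) -> tuple[float, str]:
    """Pick a speed for one segment: observed band, then OSM tag, then global average."""
    key = (segment_id, dow, hod)
    if key in observed:
        return observed[key], "observed"
    if segment_id in maxspeed:
        return maxspeed[segment_id], "osm_tag"
    return global_avg_kmh, "fallback"


def route_estimate(segments: list[tuple[int, float]], dow: int, hod: int,
                   observed: dict[tuple[int, int, int], float],
                   maxspeed: dict[int, float],
                   global_avg_kmh: float = 40.0) -> dict:
    """segments is a list of (segment_id, length_m) along an already chosen path."""
    if not segments:
        raise ValueError("route has no segments")
    eta_sec = 0.0
    length_by_basis = {"observed": 0.0, "osm_tag": 0.0, "fallback": 0.0}
    for segment_id, length_m in segments:
        speed_kmh, basis = edge_speed(segment_id, dow, hod, observed, maxspeed, global_avg_kmh)
        eta_sec += length_m / (speed_kmh / 3.6)  # km/h -> m/s
        length_by_basis[basis] += length_m
    total_m = sum(length_by_basis.values())
    return {
        "eta_sec": round(eta_sec),
        "distance_m": total_m,
        "observed_basis": {b: round(m / total_m, 2) for b, m in length_by_basis.items()},
    }


estimate = route_estimate(
    segments=[(1, 800.0), (2, 100.0), (3, 100.0)],
    dow=1, hod=8,
    observed={(1, 1, 8): 36.0},  # segment 1 has observed data for this hour band
    maxspeed={2: 50.0},          # segment 2 only has an OSM tag
)
```

Here 800 m of the 1,000 m route rests on observed speeds, so the response can state that the ETA is 80% observed-data based.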

Active re-routing is out of scope for v1. Endpoint is "suggest a route at this departure time".


10. SLOs and observability

slo.targets:
  metric                       | threshold | window
  ─────────────────────────────┼───────────┼────────
  fix_freshness_pct_60s        |  95       | 5 min
  trip_lag_p95_sec             | 600       | 1 h
  route_p95_ms                 | 500       | 5 min
  parser_lag_p95_sec           |  30       | 5 min
  contract_drift_days          |   1       | 1 d

slo.measurements is a hypertable populated every minute by the slo_measurement job in the cron role. slo.v_current_status exposes the live state. Grafana dashboards and alerts read directly from slo.*.
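The comparison behind slo.v_current_status could be sketched as below. Thresholds and windows are copied from the table above; the direction of each comparison (freshness must be at least its threshold, the others at most) is not stated in the table and is an assumption here.

```python
# Thresholds and windows from slo.targets; "direction" is an assumption.
TARGETS = {
    "fix_freshness_pct_60s": {"threshold": 95, "window": "5 min", "direction": "at_least"},
    "trip_lag_p95_sec": {"threshold": 600, "window": "1 h", "direction": "at_most"},
    "route_p95_ms": {"threshold": 500, "window": "5 min", "direction": "at_most"},
    "parser_lag_p95_sec": {"threshold": 30, "window": "5 min", "direction": "at_most"},
    "contract_drift_days": {"threshold": 1, "window": "1 d", "direction": "at_most"},
}


def current_status(latest: dict[str, float]) -> dict[str, str]:
    """Classify each SLO as met, breached, or no_data from its latest measurement."""
    status = {}
    for metric, target in TARGETS.items():
        if metric not in latest:
            status[metric] = "no_data"
            continue
        value = latest[metric]
        if target["direction"] == "at_least":
            met = value >= target["threshold"]
        else:
            met = value <= target["threshold"]
        status[metric] = "met" if met else "breached"
    return status


status = current_status({
    "fix_freshness_pct_60s": 92.5,
    "route_p95_ms": 310,
    "parser_lag_p95_sec": 30,
})
```

Changing a threshold then means editing one entry (one row in slo.targets), with no code elsewhere holding a copy of the number.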

Dashboards display SLO-aware status: "Fleet below freshness SLO: 3 vehicles" instead of "3 vehicles OFFLINE 24h+".

Logs ship to container stdout (Coolify aggregates) as structured JSON (structlog).


11. Expected benefits over current architecture and functionality

This is the section that justifies the rebuild. Each row is a concrete, measurable improvement.

11.1 Reliability and correctness

| Pain in current system | What changes | Why |
|---|---|---|
| ~10 [FIX-MNN] hot-patches per year for write-path races, contract drift, dedup logic | Each category is structurally impossible | Single-writer projectors + versioned parser + one-place dedup rule |
| Silent data loss when Jimi renames a field (weeks to detect) | Drift caught within 24 h | Daily contract checker against sandbox API |
| STALE_GPS_MS=10min / OFFLINE=24h / freshness magic scattered across 12+ places | One row per threshold in slo.targets | SLO-first design |
| "OFFLINE" mixes broken device, parked vehicle, expired subscription, decommissioned | Lifecycle and operational state are separate | domain.devices.lifecycle + computed operational state |
| Production runs from a non-main branch; cherry-picks needed | One branch. Image-tag deploys | CI builds tagged image. Coolify pulls by tag |
| Nominatim slowdown stalls trip ingest | Geocoding is its own worker, never blocks | Async projector pattern |
| No way to replay a dropped event | Every event is in events.raw; re-parse anytime | Event sourcing |
| Heavy historical query / OOM in one component can crash everything (single-process fate sharing) | A heavy query in the worker role does not affect the gateway; a runaway cron does not affect either | Container-role separation (same image, gateway / worker / cron) |
| Internal stage-to-stage lag (parser poll + projector poll = up to 20s before SLO timer starts) | Parser and projector wake on NOTIFY within milliseconds of the upstream write | Event-driven projection chain |
| Live dashboard publicly readable (legacy posture) | All endpoints require JWT; scope-gated reads | Mandatory auth from day one |

11.2 Operations

| Pain in current system | What changes | Why |
|---|---|---|
| Three Docker images, three rebuild cycles per shared-helper change | One image, one rebuild; the same image runs in three container roles | Single FastAPI codebase consolidates ingest + API + workers; runtime roles separated by entrypoint |
| ~7 containers (webhook_receiver, ingest_movement, ingest_events, timescale, grafana, pgbouncer, db_backup) | 8 containers: db, pgbouncer, platform-gateway, platform-worker, platform-cron, dashboard-proxy, grafana, db_backup | Three runtime roles from one image replace three independent Python services; fate isolation gained, build complexity unchanged |
| Coolify per-service redeploy required for shared-helper changes | One redeploy ships everything | One image |
| "What's deployed?" requires git log + Coolify UI + container exec | docker inspect returns the image tag, which is the git SHA | Image-tag deploys |
| Rollback is git revert + rebuild + redeploy (~5 minutes) | Rollback is coolify deploy :<prev-tag> (~30 seconds) | Pre-built images in registry |
| Schema migrations are bespoke Python with no formal "down" path | dbmate handles up/down, generates schema.sql snapshot for PR review | Standard tool |
| n8n workflow JSON holds dashboard contracts; not code-reviewable | Pydantic models in code; OpenAPI generated; PRs review contracts | Contracts in code |
| Grafana queries assemble metrics ad-hoc | slo.* schema is the metrics layer; Grafana is a thin renderer | Pre-aggregated SLO measurements |

11.3 Development velocity

| Pain in current system | What changes | Why |
|---|---|---|
| Adding a third dashboard means re-implementing palette + POI + EAT clock + map setup | fleet-core.js exports those primitives; new dashboard is ~100 lines of HTML | Shared renderer |
| Dedup logic change (round 6) required modifying SQL CTE + JS vehicleState() | Dedup change is one SQL function | Server-side state computation |
| Trip enrichment (FIX-M20) required modifying poll_trips + adding migration + adjusting webhook handler | Enrichment is a new projector reading existing events | Decoupled enrichment |
| Source code is image-baked in dev; typo fix requires rebuild | Bind-mounted in dev; baked in prod (via APP_MODE build arg) | Dev/prod parity without dev pain |
| Test suite must use mock DB because shared module assumes pool exists at import time | Tests use real Postgres (docker-compose), shared module is lazy-init | Cleaner module boundaries |
| Adding multi-account support required retrofitting TARGETS env var across all polling code | Multi-account is a NOT NULL FK from commit one | Designed-in, not bolted-on |

11.4 New capabilities not present today

| Capability | What it enables |
|---|---|
| Event replay | "Re-build the last 90 days of trips with corrected enrichment logic" is a SQL statement |
| Time-travel debug | "What payload did Jimi send for IMEI X at 03:47 on Tuesday?" is SELECT FROM events.raw WHERE imei = ... |
| Routing (A* with time-banded weights) | Dispatch can suggest a route at a given departure time, with observed_basis showing trust level per segment |
| SLO-driven alerting | Grafana alert when fix-freshness falls below 95% during business hours; dashboards render SLO state, not arbitrary thresholds |
| Lifecycle state machine | Decommissioned devices don't appear in operational dashboards. Suspended devices show a distinct visual state. No more "OFFLINE 24h+" sweep including retired vehicles |
| Versioned parsers | When Jimi changes a field name, bump parser version, backfill events.parsed, no data lost |
| Daily contract check | Upstream API drift caught next day, not next quarter |
| observed_basis in route responses | Operator trust calibration: "this ETA is based on 80% observed data" vs "mostly OSM tags" |
| One-place dedup rule | Future fleet expansion (e.g. adding a third device class) is a one-line change to serve.fn_live_view |

11.5 Quantified expectations (best-effort estimates)

These are forecasts based on the architectural changes, not measured. Re-baseline after Phase G.

  • Mean time to detect API contract drift: ~90 days today → <24 hours (contract checker).
  • Mean time to detect data freshness regression: unbounded today → 5 minutes (SLO alerting).
  • Rollback time: ~5 minutes today (rebuild + redeploy) → ~30 seconds (image-tag swap).
  • Add a new dashboard: ~2 weeks today (re-implement scaffolding) → ~2 days (~100 lines of HTML against existing renderer).
  • Add a new ingest source (e.g. a 4th sub-account or a new push type): ~3-5 days today (touch 3 Python files + migration + n8n) → ~half a day (new gateway endpoint + new parser + new projector — each a single file).
  • Reproduce a production data issue locally: hours to days today (re-fetch from Jimi, hope it returns the same data) → minutes (pg_dump events.raw, restore, replay).
  • Cold-start a new dev: ~1 day today (figure out which container does what) → <10 minutes (git clone && docker compose up).

11.6 What does NOT improve

Honest pushback — not everything gets better.

  • Raw write throughput: identical. Both systems are nowhere near Postgres limits at ~80 vehicles.
  • Map rendering performance: identical. MapLibre is the same library.
  • Geocoding latency: identical. Nominatim is still rate-limited.
  • Jimi API rate limits: unchanged. Their problem, not ours.
  • Operator UI learning curve: dashboards look familiar but the SLO terminology is new; expect a brief training window.

12. Phased rollout

The architectural shift is large; the rollout is incremental. Each phase produces a verifiable artefact.

| Phase | Weeks | Deliverable | Definition of done (DoD) |
|---|---|---|---|
| A. Foundation | 1-2 | Repo, docker-compose, all schemas, dbmate, FastAPI /health. CI builds image, Coolify deploys :latest | curl /health returns DB connectivity from a tagged image |
| B. Event log + parser | 3 | /push/* endpoints write events.raw. Parser worker drains to events.parsed. Pydantic models versioned. Contract checker scheduled | Replayed historical Jimi push lands in both raw and parsed |
| C. Projectors | 4-5 | Each state.* table has a projector. Multi-account from commit one. Polling workers write events.raw only | 24-h soak; slo.v_current_status all green; no duplicate fixes |
| D. Serve layer | 6 | serve.fn_* functions. Geocode worker. Matview refresh inside scheduler | Every dashboard endpoint returns valid Pydantic JSON |
| E. Dashboards | 7-8 | fleet-core.js + three HTML pages | Feature parity for a chosen 30-day test window |
| F. Routing | 9-11 | OSM loader, geo.ways, map-match projector, cagg_segment_speed_band, fn_route, index-routes.html | <500 ms p95 routing endpoint |
| G. Cutover | 12 | Push mirror forwards events to both old and new for 7 days. DNS cut. 48 h hot-standby. Old stack decommissioned | 7 days post-cutover with no rollback |

Realistic: 12 weeks for two devs to ship feature parity + routing v1 + the architectural invariants.


13. Deployment

Containers: db (TimescaleDB-HA + PostGIS + pgRouting), pgbouncer, platform-gateway (FastAPI, gateway role), platform-worker (FastAPI, worker role — parser + projectors + geocoder + matview refresh + map-match), platform-cron (FastAPI, cron role — polls + contract check + OSM loader + SLO measurement + HR sync), dashboard-proxy (nginx → rustfs), grafana, db_backup. Eight containers. The three platform-* containers run the same image with different entrypoint commands.

Image strategy: CI builds on push to main, tags <registry>/fleet-platform:<sha> and :latest. Coolify deploys by tag. docker inspect answers "what's running". Rollback is coolify deploy :<prev-tag> and is the same operation for all three roles.

Migrations: dbmate up runs only on platform-worker container start, before FastAPI boots. Migrations are forward-only. schema.sql is the output of dbmate dump; it is committed and PR-reviewed. The other two roles wait on a startup probe that confirms migration completion before they start serving traffic.
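
One way the startup probe for the gateway and cron roles could work is to compare the migration files shipped in the image against the versions already recorded in the database. The sketch below assumes dbmate's usual conventions (a version prefix before the first underscore in each file name, and applied versions recorded in a `schema_migrations` table); the helper names and the injected `fetch_applied` callable are hypothetical.

```python
import time
from typing import Callable, Iterable, Set


def expected_versions(migration_filenames: Iterable[str]) -> Set[str]:
    """Version = the prefix before the first underscore of each .sql file."""
    return {
        name.split("_", 1)[0]
        for name in migration_filenames
        if name.endswith(".sql")
    }


def pending(expected: Set[str], applied: Set[str]) -> Set[str]:
    """Versions shipped in the image but not yet applied to the database."""
    return expected - applied


def wait_for_migrations(
    fetch_applied: Callable[[], Set[str]],  # assumed: SELECT version FROM schema_migrations
    expected: Set[str],
    timeout_s: float = 120.0,
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until nothing is pending; return False if the timeout passes first."""
    deadline = clock() + timeout_s
    while True:
        if not pending(expected, fetch_applied()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Because the probe only reads, it is safe to run in every non-worker container; the worker remains the single place where migrations are applied.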

Healthchecks:

  • db: pg_isready
  • pgbouncer: pg_isready -p 6432
  • platform-gateway: GET /health/gateway (DB conn + last successful HMAC verify)
  • platform-worker: GET /health/worker (DB conn + last parser run age + LISTEN connection alive)
  • platform-cron: GET /health/cron (DB conn + last scheduled-job tick age)
  • dashboard-proxy: nginx /healthz
  • grafana: /api/health
  • db_backup: touchfile updated by cron
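
As an illustration of the composite checks above, the worker healthcheck can be reduced to a pure function over three signals. This is a sketch under assumptions: the `WorkerSignals` type, the `worker_health` function, and the 120-second freshness threshold are hypothetical and would in practice come from the running process and from slo.targets.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class WorkerSignals:
    db_ok: bool                        # a trivial query succeeded
    parser_run_age_s: Optional[float]  # None means the parser has never run
    listen_alive: bool                 # the LISTEN connection is still open


def worker_health(signals: WorkerSignals, max_parser_age_s: float = 120.0) -> Dict[str, object]:
    """Combine the three worker checks into one HTTP-style result."""
    checks = {
        "db": signals.db_ok,
        "parser_fresh": (
            signals.parser_run_age_s is not None
            and signals.parser_run_age_s <= max_parser_age_s
        ),
        "listen": signals.listen_alive,
    }
    healthy = all(checks.values())
    # 503 lets the orchestrator mark the container unhealthy; the per-check
    # map tells an operator which signal failed.
    return {"status_code": 200 if healthy else 503, "checks": checks}
```

Returning the individual checks alongside the status code keeps the endpoint useful for debugging, not only for liveness.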

Secrets: .env in dev, Coolify env vars in prod. .env.example lists every key.

Domains: api.rahamafresh.com (fleet-platform), live.rahamafresh.com, fleetintelligence.rahamafresh.com, routes.rahamafresh.com, grafana.rahamafresh.com.

Backups: rustfs sidecar (existing pattern). Add weekly --schema=events slice + monthly --schema=geo slice for fast partial restore.

Local dev: git clone && cp .env.example .env && docker compose up. In dev mode (APP_MODE=dev) the source is bind-mounted and uvicorn --reload picks up edits; image builds are only needed for prod.


14. What we explicitly drop

  • n8n for dashboard contracts. Service-layer logic is in code.
  • Three independent Python processes with a shared module imported into each. Replaced by one codebase running in three container roles from the same image. The fate-sharing failure mode goes away; the rebuild-three-images-for-one-change failure mode goes away. Operationally it is three containers, but architecturally it is one service.
  • reporting and tracksolid schemas as a mental model. Replaced by events / state / domain / geo / serve / slo / ops / auth.
  • enabled_flag=1 magic. Replaced by a state.active_devices view.
  • Magic-number thresholds scattered through SQL/JS. Replaced by rows in slo.targets.
  • Synchronous Nominatim in the write path. Replaced by an async projector.
  • Manual branch-to-prod mapping. Replaced by image tags.
  • Per-feature bolt-on tables. New domain entities go in domain or a new schema with one reason to exist.
  • Public-read dashboards. Replaced by JWT-required reads from day one.
  • Poll-driven internal stages. Replaced by LISTEN/NOTIFY-driven parse → project; APScheduler retained only for genuinely time-triggered jobs.
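
The LISTEN/NOTIFY-driven parse stage in the last item can be sketched as a wake-on-arrival loop. The sketch is database-free so it can run anywhere: `notifications` stands in for whatever iterable of notifications the driver provides (for example, a psycopg 3 connection's `notifies()` generator after issuing `LISTEN`, which is an assumption about the driver choice), and `make_drain` and `run_parser_loop` are hypothetical names. Each wake-up drains the whole backlog rather than one row, because a notification only signals that something arrived and notifications sent while the worker was not listening are not delivered later.

```python
from typing import Callable, Iterable, List


def make_drain(raw_backlog: List[dict], parsed: List[dict], batch_size: int = 100) -> Callable[[], int]:
    """Build a drain function that moves every unparsed row, batch by batch."""

    def drain() -> int:
        moved = 0
        while raw_backlog:
            batch = raw_backlog[:batch_size]
            del raw_backlog[:batch_size]
            # Stand-in for the real parser: tag each row as parsed.
            parsed.extend({"raw_id": row["id"], "kind": "parsed"} for row in batch)
            moved += len(batch)
        return moved

    return drain


def run_parser_loop(notifications: Iterable[object], drain: Callable[[], int]) -> int:
    """Drain once at startup to catch up, then once per notification."""
    total = drain()  # rows that arrived before we started listening
    for _ in notifications:
        total += drain()  # a wake-up with nothing new simply moves 0 rows
    return total
```

A scheduler is still needed for genuinely time-triggered work, which matches the split described above.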

15. Open architectural questions

Decisions the team needs to make before or during execution:

  1. Auth posture. Dashboards public-read (current) or login-gated? Closed: login-gated. All endpoints require JWT from day one. Public-read is not preserved; the dataset has grown into one that cannot be public-read responsibly (see §7 rationale). Three scopes: read:fleet, write:ops, admin:fleet.
  2. Routing v1 scope. Suggest-route only, or proactive deviation alerts? v1 = suggest. v2 = active. (Note: routing is scope-deferred from the PRD into a companion project; this question is preserved for that project's reference.)
  3. SLO targets. Actual numbers? Freshness < 60 s? 90 s? 120 s? Pick before Phase G.
  4. n8n retention. Drop entirely, or keep for cross-system orchestration (Slack alerts, CRM bridges)? Default = drop unless a concrete workflow needs it.
  5. Image registry. ghcr.io (free, GitHub) or self-hosted (registry.rahamafresh.com)? Affects CI complexity.
  6. Analytics layer. Out of scope here. If longitudinal reporting becomes a need, design as a separate concern reading from state.position_history / state.trips — not folded into the operational stack.
  7. Redis ingest buffer — re-evaluate trigger. Not adopted in v1 (see §6 rationale: weakens immutability invariant, adds failure surface, no current latency problem). Re-evaluate when any of the following becomes true: (a) gateway p95 latency exceeds 50 ms sustained for a week against the contracted "HMAC + INSERT + 200 OK" workload; (b) Postgres INSERT rate against events.raw approaches the chunk-write throughput ceiling on the current VPS class; (c) push receiver concurrency exceeds 200 in-flight requests during a normal hour. Until then, container-role separation provides the fate isolation; LISTEN/NOTIFY provides the wake-on-arrival pattern.
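
The three re-evaluation triggers in item 7 can be written down as an explicit check so the decision is not left to memory. This is a sketch with labeled assumptions: "sustained for a week" is read as seven consecutive daily p95 values above 50 ms, "approaches the ceiling" is read as reaching 80% of it, and the `IngestMetrics` type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IngestMetrics:
    daily_gateway_p95_ms: List[float]  # one p95 value per day, most recent last
    insert_rate_per_s: float           # observed INSERT rate on events.raw
    insert_ceiling_per_s: float        # measured ceiling for the current VPS class
    peak_inflight_normal_hour: int     # peak concurrent push requests


def redis_reevaluation_triggers(m: IngestMetrics, near_ceiling_ratio: float = 0.8) -> List[str]:
    """Return the letters of the triggers (a, b, c) that currently hold."""
    fired: List[str] = []
    last_week = m.daily_gateway_p95_ms[-7:]
    # (a) assumption: "sustained for a week" = all of the last 7 daily p95s > 50 ms
    if len(last_week) == 7 and all(v > 50.0 for v in last_week):
        fired.append("a")
    # (b) assumption: "approaches" = at or above 80% of the measured ceiling
    if m.insert_rate_per_s >= near_ceiling_ratio * m.insert_ceiling_per_s:
        fired.append("b")
    # (c) more than 200 in-flight requests during a normal hour
    if m.peak_inflight_normal_hour > 200:
        fired.append("c")
    return fired
```

An empty result means the v1 position stands; any non-empty result is a prompt to reopen the question, not an automatic decision to adopt Redis.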

16. Verification — done when

  1. Every dashboard URL has a working equivalent with feature parity for a chosen 30-day test window.
  2. slo.v_current_status shows all SLOs green for 7 consecutive days post-cutover.
  3. events.raw can be replayed to rebuild state.* from scratch within an hour. (Demonstrate by truncating state.live_positions and re-projecting.)
  4. The contract checker has caught at least one synthetic API change in staging, then run green for 7 days in production.
  5. git log origin/main is the source of truth for what Coolify runs.
  6. The old three-repo stack is archived; the webhook_receiver in the old stack receives no traffic.
  7. Routing endpoint returns a valid LineString in <500 ms p95.
  8. A new dev clones the repo, runs docker compose up, and has a working local stack in <10 minutes.
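
The replay criterion in item 3 depends on projectors being deterministic functions of the log. A minimal in-memory sketch of that property for a live-positions projection follows; the event field names (`event_id`, `device_id`, `fix_time`, `lat`, `lon`) and the function name are assumptions for illustration, and the real projectors are SQL functions rather than Python.

```python
from typing import Dict, Iterable


def project_live_positions(events: Iterable[dict]) -> Dict[str, dict]:
    """Fold parsed events into the newest fix per device.

    Ordering by (fix_time, event_id) makes the result independent of
    arrival order and of duplicate deliveries, which is what allows a
    truncated table to be rebuilt from the log.
    """
    state: Dict[str, dict] = {}
    for ev in events:
        current = state.get(ev["device_id"])
        key = (ev["fix_time"], ev["event_id"])
        if current is None or key > (current["fix_time"], current["event_id"]):
            state[ev["device_id"]] = {
                "event_id": ev["event_id"],
                "fix_time": ev["fix_time"],
                "lat": ev["lat"],
                "lon": ev["lon"],
            }
    return state


# "Truncate and re-project": start from empty state and fold the log again.
log = [
    {"event_id": 1, "device_id": "d1", "fix_time": 100, "lat": -1.0, "lon": 36.0},
    {"event_id": 2, "device_id": "d2", "fix_time": 105, "lat": -1.2, "lon": 36.8},
    {"event_id": 3, "device_id": "d1", "fix_time": 110, "lat": -1.1, "lon": 36.1},
    {"event_id": 4, "device_id": "d1", "fix_time": 90, "lat": -0.9, "lon": 35.9},
]
original = project_live_positions(log)
rebuilt = project_live_positions(reversed(log + log[:2]))  # reordered, with duplicates
```

If `original` and `rebuilt` ever differ for the same log, the projector has picked up a dependency on arrival order or wall-clock time, and the replay guarantee no longer holds.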