From 004fed7ab9e63a49bafbb05d25664669c24545a7 Mon Sep 17 00:00:00 2001
From: David Kiania
Date: Wed, 8 Apr 2026 17:59:05 +0300
Subject: [PATCH] Add operations manual with verification queries per service

Comprehensive guide covering:
- Service architecture and scheduled tasks
- Per-service verification SQL queries grouped by service
- Health dashboard queries for monitoring
- Polling vs push coexistence and dedup strategy
- Environment variables, data retention, troubleshooting

Co-Authored-By: Claude Opus 4.6
---
 OPERATIONS_MANUAL.md | 593 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 593 insertions(+)
 create mode 100644 OPERATIONS_MANUAL.md

diff --git a/OPERATIONS_MANUAL.md b/OPERATIONS_MANUAL.md
new file mode 100644
index 0000000..2d0d890
--- /dev/null
+++ b/OPERATIONS_MANUAL.md
@@ -0,0 +1,593 @@
+# Fireside Communications — Tracksolid Pro Telemetry Stack
+## Operations Manual & Verification Guide
+
+---
+
+## 1. Service Architecture
+
+```
+JIMI TRACKSOLID PRO API
+        |
+        +-- POLLING (Pull)                  +-- PUSH (Webhook)
+        |   (Fallback / Catch-up)           |   (Real-time)
+        |                                   |
+ ingest_movement     ingest_events   webhook_receiver
+ (60s/15m/daily)     (5m polling)    (FastAPI :8000)
+        |                  |                |
+        +------------------+----------------+
+                           |
+                     timescale_db
+            (PG16 + TimescaleDB + PostGIS)
+                           |
+                        grafana
+                  (Dashboards :3000)
+```
+
+| Service | Purpose | Restart Policy |
+|---------|---------|----------------|
+| `timescale_db` | PostgreSQL 16 + TimescaleDB 2.15 + PostGIS 3 | always |
+| `ingest_movement` | GPS positions, trips, parking, device sync (polling) | always |
+| `ingest_events` | Alarm event polling (catch-up/fallback) | always |
+| `webhook_receiver` | Real-time push data from Jimi (OBD, faults, GPS, alarms, heartbeats, trips) | always |
+| `grafana` | Visualization dashboards (read-only DB access) | always |
+
+---
+
+## 2. Verification Tests — Grouped by Service
+
+### 2.1 timescale_db
+
+**Check database is healthy:**
+```sql
+SELECT PostGIS_Full_Version();
+SELECT * FROM timescaledb_information.hypertables;
+```
+
+**Verify all critical tables exist:**
+```sql
+SELECT table_schema, table_name
+FROM information_schema.tables
+WHERE table_schema IN ('tracksolid', 'dwh_gold')
+ORDER BY table_schema, table_name;
+```
+
+Expected tables:
+- `tracksolid.devices`
+- `tracksolid.api_token_cache`
+- `tracksolid.ingestion_log`
+- `tracksolid.live_positions`
+- `tracksolid.position_history` (hypertable)
+- `tracksolid.trips`
+- `tracksolid.parking_events`
+- `tracksolid.alarms`
+- `tracksolid.obd_readings`
+- `tracksolid.fault_codes`
+- `tracksolid.heartbeats` (hypertable)
+- `dwh_gold.dim_vehicles`
+- `dwh_gold.fact_daily_fleet_metrics`
+
+**Verify hypertables are configured:**
+```sql
+SELECT hypertable_schema, hypertable_name, compression_enabled
+FROM timescaledb_information.hypertables;
+```
+
+Expected:
+| hypertable_schema | hypertable_name | compression_enabled |
+|---|---|---|
+| tracksolid | position_history | true |
+| tracksolid | heartbeats | false |
+
+**Verify retention policies:**
+```sql
+SELECT application_name, schedule_interval, config
+FROM timescaledb_information.jobs
+WHERE application_name LIKE '%retention%' OR application_name LIKE '%compression%';
+```
+
+---
+
+### 2.2 ingest_movement
+
+This service runs four scheduled tasks. Verify each one is producing data.
+
+#### 2.2.1 Device Registry Sync (Daily @ 02:00 UTC)
+
+**API endpoint:** `jimi.user.device.list` + `jimi.track.device.detail`
+**Table:** `tracksolid.devices`
+**Schedule:** Daily at 02:00 UTC + on startup
+
+```sql
+-- Check devices are registered
+SELECT COUNT(*) AS total_devices,
+       COUNT(*) FILTER (WHERE enabled_flag = 1) AS enabled,
+       MIN(last_synced_at) AS oldest_sync,
+       MAX(last_synced_at) AS latest_sync
+FROM tracksolid.devices;
+```
+
+```sql
+-- Sample device records
+SELECT imei, device_name, vehicle_number, driver_name, enabled_flag, last_synced_at
+FROM tracksolid.devices
+ORDER BY last_synced_at DESC
+LIMIT 10;
+```
+
+**Healthy indicator:** `latest_sync` within the last 24 hours, `total_devices > 0`.
+
+#### 2.2.2 Live Positions (Every 60 seconds)
+
+**API endpoint:** `jimi.user.device.location.list`
+**Tables:** `tracksolid.live_positions`, `tracksolid.position_history`
+**Schedule:** Every 60 seconds
+
+```sql
+-- Check live positions are being updated
+SELECT COUNT(*) AS tracked_devices,
+       MIN(updated_at) AS oldest_update,
+       MAX(updated_at) AS latest_update,
+       ROUND(AVG(EXTRACT(EPOCH FROM (NOW() - updated_at)))) AS avg_age_seconds
+FROM tracksolid.live_positions;
+```
+
+```sql
+-- Fleet status overview
+SELECT connectivity_status, COUNT(*) AS device_count
+FROM tracksolid.v_fleet_status
+GROUP BY connectivity_status;
+```
+
+```sql
+-- Position history volume (last 24h)
+SELECT COUNT(*) AS records_24h,
+       COUNT(DISTINCT imei) AS active_devices,
+       MIN(gps_time) AS earliest,
+       MAX(gps_time) AS latest
+FROM tracksolid.position_history
+WHERE gps_time > NOW() - INTERVAL '24 hours';
+```
+
+**Healthy indicator:** `latest_update` within last 2 minutes, `avg_age_seconds < 120`.
+
+#### 2.2.3 Trip Reports (Every 15 minutes)
+
+**API endpoint:** `jimi.device.track.mileage`
+**Table:** `tracksolid.trips`
+**Schedule:** Every 15 minutes (1-hour lookback)
+
+```sql
+-- Trip data freshness
+SELECT COUNT(*) AS trips_24h,
+       COUNT(DISTINCT imei) AS vehicles_with_trips,
+       ROUND((SUM(distance_m) / 1000.0)::numeric, 1) AS total_km,
+       ROUND(AVG(avg_speed_kmh)::numeric, 1) AS fleet_avg_speed,
+       MAX(updated_at) AS latest_trip_update
+FROM tracksolid.trips
+WHERE start_time > NOW() - INTERVAL '24 hours';
+```
+
+```sql
+-- Trips with driving time (runTimeSecond captured)
+SELECT imei, start_time, end_time,
+       ROUND((distance_m / 1000.0)::numeric, 1) AS km,
+       avg_speed_kmh, max_speed_kmh,
+       driving_time_s, source
+FROM tracksolid.trips
+WHERE start_time > NOW() - INTERVAL '24 hours'
+ORDER BY start_time DESC
+LIMIT 10;
+```
+
+**Healthy indicator:** `trips_24h > 0` during business hours, `latest_trip_update` within last 20 minutes.
+
+#### 2.2.4 Parking Events (Every 15 minutes)
+
+**API endpoint:** `jimi.open.platform.report.parking`
+**Table:** `tracksolid.parking_events`
+**Schedule:** Every 15 minutes (1-hour lookback)
+
+```sql
+-- Parking data freshness
+SELECT COUNT(*) AS events_24h,
+       COUNT(DISTINCT imei) AS vehicles_parked,
+       ROUND((AVG(duration_seconds) / 60.0)::numeric, 1) AS avg_park_minutes,
+       MAX(start_time) AS latest_event
+FROM tracksolid.parking_events
+WHERE start_time > NOW() - INTERVAL '24 hours';
+```
+
+**Healthy indicator:** `events_24h > 0` if fleet is active.
+
+#### 2.2.5 Ingestion Log (All ingest_movement tasks)
+
+```sql
+-- Movement pipeline health
+SELECT endpoint, run_at, success, rows_upserted, rows_inserted, duration_ms, error_message
+FROM tracksolid.ingestion_log
+WHERE endpoint IN (
+    'jimi.user.device.list+detail',
+    'jimi.user.device.location.list',
+    'jimi.device.track.mileage',
+    'jimi.open.platform.report.parking'
+)
+ORDER BY run_at DESC
+LIMIT 20;
+```
+
+---
+
+### 2.3 ingest_events
+
+#### 2.3.1 Alarm Polling (Every 5 minutes)
+
+**API endpoint:** `jimi.device.alarm.list`
+**Table:** `tracksolid.alarms`
+**Schedule:** Every 5 minutes (30-minute lookback)
+
+```sql
+-- Alarm data freshness
+SELECT COUNT(*) AS alarms_24h,
+       COUNT(DISTINCT imei) AS devices_with_alarms,
+       COUNT(DISTINCT alarm_type) AS alarm_types,
+       MAX(alarm_time) AS latest_alarm
+FROM tracksolid.alarms
+WHERE alarm_time > NOW() - INTERVAL '24 hours';
+```
+
+```sql
+-- Alarm breakdown by type
+SELECT alarm_type, source, COUNT(*) AS count
+FROM tracksolid.alarms
+WHERE alarm_time > NOW() - INTERVAL '7 days'
+GROUP BY alarm_type, source
+ORDER BY count DESC;
+```
+
+#### 2.3.2 Ingestion Log (ingest_events tasks)
+
+```sql
+SELECT endpoint, run_at, success, rows_inserted, duration_ms, error_message
+FROM tracksolid.ingestion_log
+WHERE endpoint = 'jimi.device.alarm.list'
+ORDER BY run_at DESC
+LIMIT 10;
+```
+
+**Healthy indicator:** Successful runs every ~5 minutes, `duration_ms < 10000`.
+
+---
+
+### 2.4 webhook_receiver
+
+#### 2.4.1 Health Check
+
+```bash
+# From within the Docker network or via Coolify domain
+curl -f https:///health
+# Expected: {"status":"ok"}
+```
+
+#### 2.4.2 OBD Diagnostics (/pushobd) — Priority 1
+
+**Table:** `tracksolid.obd_readings`
+**Note:** Push-only. No data until Jimi platform is configured to send webhooks.
+
+```sql
+-- OBD data volume
+SELECT COUNT(*) AS total_readings,
+       COUNT(DISTINCT imei) AS devices_reporting,
+       MAX(reading_time) AS latest_reading,
+       COUNT(*) FILTER (WHERE obd_data IS NOT NULL) AS with_full_payload
+FROM tracksolid.obd_readings;
+```
+
+```sql
+-- Recent OBD data sample
+SELECT imei, reading_time, car_type, acc_state,
+       obd_data->>'dataID1' AS rpm_or_data1,
+       obd_data->>'dataID2' AS data2
+FROM tracksolid.obd_readings
+ORDER BY reading_time DESC
+LIMIT 10;
+```
+
+#### 2.4.3 DTC Fault Codes (/pushfaultinfo) — Priority 1
+
+**Table:** `tracksolid.fault_codes`
+
+```sql
+-- Fault code summary
+SELECT COUNT(*) AS total_faults,
+       COUNT(DISTINCT imei) AS affected_devices,
+       COUNT(DISTINCT fault_code) AS unique_codes,
+       MAX(reported_at) AS latest_fault
+FROM tracksolid.fault_codes;
+```
+
+```sql
+-- Most common fault codes
+SELECT fault_code, COUNT(*) AS occurrences,
+       COUNT(DISTINCT imei) AS affected_devices
+FROM tracksolid.fault_codes
+GROUP BY fault_code
+ORDER BY occurrences DESC
+LIMIT 20;
+```
+
+#### 2.4.4 Push Alarms (/pushalarm)
+
+**Table:** `tracksolid.alarms` (source = 'push')
+
+```sql
+-- Push vs poll alarm comparison
+SELECT source, COUNT(*) AS count, MAX(alarm_time) AS latest
+FROM tracksolid.alarms
+WHERE alarm_time > NOW() - INTERVAL '7 days'
+GROUP BY source;
+```
+
+#### 2.4.5 Push GPS (/pushgps)
+
+**Table:** `tracksolid.position_history` (source = 'push')
+
+```sql
+-- Push vs poll position comparison
+SELECT source, COUNT(*) AS count, MAX(gps_time) AS latest
+FROM tracksolid.position_history
+WHERE gps_time > NOW() - INTERVAL '24 hours'
+GROUP BY source;
+```
+
+#### 2.4.6 Heartbeats (/pushhb)
+
+**Table:** `tracksolid.heartbeats`
+
+```sql
+-- Heartbeat volume
+SELECT COUNT(*) AS total_heartbeats,
+       COUNT(DISTINCT imei) AS reporting_devices,
+       MAX(gate_time) AS latest_heartbeat
+FROM tracksolid.heartbeats;
+```
+
+```sql
+-- Device health from heartbeats (last 24h)
+SELECT imei,
+       COUNT(*) AS heartbeat_count,
+       ROUND(AVG(power_level)) AS avg_power,
+       ROUND(AVG(gsm_signal)) AS avg_signal,
+       MAX(gate_time) AS last_seen
+FROM tracksolid.heartbeats
+WHERE gate_time > NOW() - INTERVAL '24 hours'
+GROUP BY imei
+ORDER BY heartbeat_count DESC
+LIMIT 20;
+```
+
+#### 2.4.7 Push Trip Reports (/pushtripreport)
+
+**Table:** `tracksolid.trips` (source = 'push')
+
+```sql
+-- Push trips with fuel data
+SELECT imei, start_time, end_time,
+       ROUND((distance_m / 1000.0)::numeric, 1) AS km,
+       fuel_consumed_l, idle_time_s, source
+FROM tracksolid.trips
+WHERE source = 'push'
+ORDER BY start_time DESC
+LIMIT 10;
+```
+
+#### 2.4.8 Ingestion Log (All webhook endpoints)
+
+```sql
+SELECT endpoint, run_at, success, rows_inserted, duration_ms
+FROM tracksolid.ingestion_log
+WHERE endpoint LIKE 'webhook/%'
+ORDER BY run_at DESC
+LIMIT 20;
+```
+
+#### 2.4.9 Test Webhook Manually
+
+Send a test OBD push (with empty token for initial testing):
+```bash
+curl -X POST https:///pushobd \
+  -d 'token=&data_list=[{"deviceImei":"TEST_IMEI","obdJson":{"event_time":"2026-04-08 12:00:00","lat":51.5,"lng":-0.1,"AccState":1,"dataID1":2500}}]'
+# Expected: {"code":0,"msg":"success"}
+```
+
+**Note:** The TEST_IMEI must exist in `tracksolid.devices` (FK constraint) or the insert will be skipped.
+
+---
+
+### 2.5 grafana
+
+```bash
+# Verify Grafana is accessible
+curl -f https:///api/health
+# Expected: {"commit":"...","database":"ok","version":"11.0.0"}
+```
+
+**Configure data source in Grafana UI:**
+- Type: PostgreSQL
+- Host: `timescale_db:5432` (internal Docker network)
+- Database: `tracksolid_db`
+- User: `grafana_ro`
+- SSL Mode: disable (internal network)
+
+---
+
+## 3. Overall Health Dashboard Queries
+
+### 3.1 Ingestion Pipeline Status (All Services)
+
+```sql
+SELECT * FROM tracksolid.v_ingestion_health
+ORDER BY seconds_ago ASC;
+```
+
+| endpoint | run_at | success | seconds_ago | Status |
+|----------|--------|---------|-------------|--------|
+| jimi.user.device.location.list | recent | true | < 120 | OK |
+| jimi.device.alarm.list | recent | true | < 600 | OK |
+| jimi.device.track.mileage | recent | true | < 1200 | OK |
+| webhook/pushobd | recent | true | varies | OK |
+
+**Alert thresholds:**
+- `seconds_ago > 300` for live positions = **WARNING**
+- `seconds_ago > 900` for alarms, `> 1200` for trips = **WARNING**
+- `success = false` for any endpoint = **CRITICAL**
+
+### 3.2 Data Volume Summary (Last 24 Hours)
+
+```sql
+SELECT 'devices' AS metric, COUNT(*)::TEXT AS value FROM tracksolid.devices WHERE enabled_flag = 1
+UNION ALL
+SELECT 'live_positions', COUNT(*)::TEXT FROM tracksolid.live_positions WHERE updated_at > NOW() - INTERVAL '2 minutes'
+UNION ALL
+SELECT 'position_history_24h', COUNT(*)::TEXT FROM tracksolid.position_history WHERE gps_time > NOW() - INTERVAL '24 hours'
+UNION ALL
+SELECT 'trips_24h', COUNT(*)::TEXT FROM tracksolid.trips WHERE start_time > NOW() - INTERVAL '24 hours'
+UNION ALL
+SELECT 'alarms_24h', COUNT(*)::TEXT FROM tracksolid.alarms WHERE alarm_time > NOW() - INTERVAL '24 hours'
+UNION ALL
+SELECT 'parking_24h', COUNT(*)::TEXT FROM tracksolid.parking_events WHERE start_time > NOW() - INTERVAL '24 hours'
+UNION ALL
+SELECT 'obd_total', COUNT(*)::TEXT FROM tracksolid.obd_readings
+UNION ALL
+SELECT 'fault_codes_total', COUNT(*)::TEXT FROM tracksolid.fault_codes
+UNION ALL
+SELECT 'heartbeats_24h', COUNT(*)::TEXT FROM tracksolid.heartbeats WHERE gate_time > NOW() - INTERVAL '24 hours';
+```
+
+### 3.3 Token Health
+
+```sql
+SELECT account, access_token IS NOT NULL AS has_token,
+       expires_at,
+       ROUND(EXTRACT(EPOCH FROM (expires_at - NOW())) / 60) AS minutes_until_expiry,
+       obtained_at
+FROM tracksolid.api_token_cache;
+```
+
+**Healthy indicator:** `minutes_until_expiry > 0`. Token auto-refreshes when < 30 minutes remaining.
+
+### 3.4 Database Size
+
+```sql
+SELECT hypertable_schema || '.' || hypertable_name AS table_name,
+       pg_size_pretty(hypertable_size(format('%I.%I', hypertable_schema, hypertable_name)::regclass)) AS size,
+       num_chunks
+FROM timescaledb_information.hypertables;
+
+SELECT schemaname || '.' || relname AS table_name,
+       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
+FROM pg_stat_user_tables
+WHERE schemaname = 'tracksolid'
+ORDER BY pg_total_relation_size(relid) DESC;
+```
+
+---
+
+## 4. Polling vs Push Coexistence
+
+Both polling and webhook services can write to the same tables. Deduplication is handled via `ON CONFLICT` clauses:
+
+| Data Type | Polling | Webhook | Dedup Strategy |
+|-----------|---------|---------|----------------|
+| GPS | `poll_live_positions` (60s) | `/pushgps` | `ON CONFLICT (imei, gps_time) DO NOTHING` |
+| Alarms | `poll_alarms` (5m) | `/pushalarm` | `ON CONFLICT (imei, alarm_type, alarm_time) DO NOTHING` |
+| Trips | `poll_trips` (15m) | `/pushtripreport` | `ON CONFLICT (imei, start_time) DO UPDATE` |
+| OBD | None (push-only) | `/pushobd` | `ON CONFLICT (imei, reading_time) DO UPDATE` |
+| Fault Codes | None (push-only) | `/pushfaultinfo` | `ON CONFLICT (imei, reported_at, fault_code) DO NOTHING` |
+| Heartbeats | None (push-only) | `/pushhb` | `ON CONFLICT (imei, gate_time) DO NOTHING` |
+| Parking | `poll_parking` (15m) | None | `ON CONFLICT (imei, start_time, event_type) DO NOTHING` |
+
+The `source` column ('poll' or 'push') tracks data origin where applicable.
+
+---
+
+## 5. Environment Variables
+
+| Variable | Required | Used By | Description |
+|----------|----------|---------|-------------|
+| `TRACKSOLID_APP_KEY` | Yes | All Python services | OAuth2 application key |
+| `TRACKSOLID_APP_SECRET` | Yes | All Python services | OAuth2 application secret |
+| `TRACKSOLID_USER_ID` | Yes | All Python services | Tracksolid account user ID |
+| `TRACKSOLID_PWD_MD5` | Yes | All Python services | MD5 hash of user password |
+| `TRACKSOLID_TARGET_ACCOUNT` | No | ingest_movement | Defaults to USER_ID |
+| `TRACKSOLID_API_URL` | No | All Python services | Defaults to `https://eu-open.tracksolidpro.com/route/rest` |
+| `DATABASE_URL` | Yes | All Python services | Full PostgreSQL connection string |
+| `POSTGRES_DB` | Yes | timescale_db | Database name |
+| `POSTGRES_USER` | Yes | timescale_db | Database superuser |
+| `POSTGRES_PASSWORD` | Yes | timescale_db | Database password |
+| `GRAFANA_ADMIN_PASSWORD` | Yes | grafana | Grafana admin UI password |
+| `JIMI_WEBHOOK_TOKEN` | No | webhook_receiver | Webhook auth token (empty = skip validation) |
+| `DB_POOL_MAX` | No | All Python services | Max DB connections (default: 12) |
+
+---
+
+## 6. Scheduled Task Summary
+
+| Service | Task | Interval | API Endpoint | Tables |
+|---------|------|----------|--------------|--------|
+| ingest_movement | `sync_devices` | Daily 02:00 UTC | `jimi.user.device.list` + `jimi.track.device.detail` | devices |
+| ingest_movement | `poll_live_positions` | 60 seconds | `jimi.user.device.location.list` | live_positions, position_history |
+| ingest_movement | `poll_trips` | 15 minutes | `jimi.device.track.mileage` | trips |
+| ingest_movement | `poll_parking` | 15 minutes | `jimi.open.platform.report.parking` | parking_events |
+| ingest_events | `poll_alarms` | 5 minutes | `jimi.device.alarm.list` | alarms |
+
+---
+
+## 7. Data Retention
+
+| Table | Retention | Managed By |
+|-------|-----------|------------|
+| `position_history` | 90 days | TimescaleDB retention policy (auto) |
+| `heartbeats` | 30 days | TimescaleDB retention policy (auto) |
+| `position_history` (compressed) | After 14 days | TimescaleDB compression policy (auto) |
+| All other tables | Indefinite | Manual cleanup if needed |
+
+---
+
+## 8. Troubleshooting
+
+### No data in any table
+```sql
+-- Check ingestion log for errors
+SELECT * FROM tracksolid.ingestion_log
+WHERE success = false
+ORDER BY run_at DESC LIMIT 20;
+```
+
+### Token auth failures
+```sql
+-- Check token status
+SELECT account, expires_at,
+       CASE WHEN expires_at < NOW() THEN 'EXPIRED' ELSE 'VALID' END AS status
+FROM tracksolid.api_token_cache;
+```
+
+If expired, the service auto-refreshes. Persistent failures indicate credential issues in `.env`.
+
+### Webhook not receiving data
+1. Verify the webhook domain is configured in Coolify and routed to `webhook_receiver:8000`
+2. Verify the Jimi Tracksolid Pro platform is configured to push to your webhook URL
+3. Check `/health` endpoint is reachable
+4. Check ingestion log: `SELECT * FROM tracksolid.ingestion_log WHERE endpoint LIKE 'webhook/%' ORDER BY run_at DESC LIMIT 10;`
+
+### High ingestion latency
+```sql
+-- Check slow endpoints
+SELECT endpoint, ROUND(AVG(duration_ms)) AS avg_ms, MAX(duration_ms) AS max_ms
+FROM tracksolid.ingestion_log
+WHERE run_at > NOW() - INTERVAL '1 hour'
+GROUP BY endpoint
+ORDER BY avg_ms DESC;
+```
+
+### Rate limiting
+Look for `Rate limit hit` in container logs. The system auto-backs off (10-30s). Persistent rate limiting may require reducing polling frequency or contacting Jimi support.
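
The back-off mentioned under "Rate limiting" could be sketched roughly as follows. This is an illustration only, not the services' actual code: `RateLimitError`, `call_with_backoff`, and the `max_attempts` value are hypothetical names and numbers chosen for the example. Only the 10-30 second wait window comes from the manual.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical exception a client could raise when the API reports a rate limit."""


def call_with_backoff(request, max_attempts=4, min_wait=10.0, max_wait=30.0,
                      sleep=time.sleep, rng=random.uniform):
    """Call request(); on RateLimitError wait 10-30 s and retry.

    max_attempts is an assumed value. sleep and rng are injectable so the
    loop can be exercised without real delays.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts:
                # Persistent rate limiting: surface the error to the operator.
                raise
            wait = rng(min_wait, max_wait)
            print(f"Rate limit hit, backing off {wait:.0f}s "
                  f"(attempt {attempt}/{max_attempts})")
            sleep(wait)
```

Passing `sleep` and `rng` as parameters keeps the retry loop testable; a poller would call it as `call_with_backoff(lambda: fetch_page())`, where `fetch_page` stands in for whatever function issues the API request.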