From 82761e1e3f585b4c4643bb2dc3f8ec5320b06f8e Mon Sep 17 00:00:00 2001 From: David Kiania Date: Thu, 9 Apr 2026 00:12:48 +0300 Subject: [PATCH] Add Grafana NOC operational manual Covers pre-deployment checklist, post-deploy verification steps for each panel, database verification queries, troubleshooting guide, and day-to-day NOC operations reference. Co-Authored-By: Claude Sonnet 4.6 --- grafanaOperationalManual.md | 647 ++++++++++++++++++++++++++++++++++++ 1 file changed, 647 insertions(+) create mode 100644 grafanaOperationalManual.md diff --git a/grafanaOperationalManual.md b/grafanaOperationalManual.md new file mode 100644 index 0000000..2f5a2fd --- /dev/null +++ b/grafanaOperationalManual.md @@ -0,0 +1,647 @@ +# Grafana NOC Fleet Dashboard — Operational Manual + +## Table of Contents + +1. [Pre-Deployment Checklist](#1-pre-deployment-checklist) +2. [Deploy](#2-deploy) +3. [Post-Deployment Verification](#3-post-deployment-verification) +4. [Dashboard Panel Verification](#4-dashboard-panel-verification) +5. [Database Verification Queries](#5-database-verification-queries) +6. [Troubleshooting](#6-troubleshooting) +7. [Day-to-Day NOC Operations](#7-day-to-day-noc-operations) +8. [Maintenance](#8-maintenance) + +--- + +## 1. Pre-Deployment Checklist + +Run these checks before starting Grafana for the first time. + +### 1.1 Environment Variables + +Confirm `.env` contains all required Grafana variables: + +```bash +grep -E 'GRAFANA_ADMIN_PASSWORD|GRAFANA_DB_RO_PASSWORD' .env +``` + +Expected output — both lines must be present and non-empty: +``` +GRAFANA_ADMIN_PASSWORD= +GRAFANA_DB_RO_PASSWORD= +``` + +If `GRAFANA_DB_RO_PASSWORD` is missing, add it before continuing: +```bash +echo "GRAFANA_DB_RO_PASSWORD=" >> .env +``` + +### 1.2 Verify grafana_ro User Exists in Postgres + +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "\du grafana_ro" +``` + +Expected output: +``` + Role name | Attributes +------------+--------------------------- + grafana_ro | Cannot login, ... +``` + +If the role is missing, create it: +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " + CREATE ROLE grafana_ro WITH LOGIN PASSWORD '' NOSUPERUSER NOCREATEDB NOCREATEROLE; + GRANT USAGE ON SCHEMA tracksolid TO grafana_ro; + GRANT SELECT ON ALL TABLES IN SCHEMA tracksolid TO grafana_ro; + ALTER DEFAULT PRIVILEGES IN SCHEMA tracksolid GRANT SELECT ON TABLES TO grafana_ro; +" +``` + +### 1.3 Verify Provisioning Files Are Present + +```bash +ls -R grafana/provisioning/ +``` + +Expected output: +``` +grafana/provisioning/: +dashboards dashboards-json datasources + +grafana/provisioning/datasources: +tracksolid_postgres.yaml + +grafana/provisioning/dashboards: +noc_fleet.yaml + +grafana/provisioning/dashboards-json: +noc_fleet_dashboard.json +``` + +### 1.4 Verify Database Has Live Data + +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " +SELECT COUNT(*) AS devices, + COUNT(lp.imei) AS with_position +FROM tracksolid.devices d +LEFT JOIN tracksolid.live_positions lp USING (imei) +WHERE d.enabled_flag = 1; +" +``` + +Expected: `devices` = total enabled vehicles, `with_position` > 0 (at least some vehicles have GPS fixes). + +--- + +## 2. Deploy + +### Start Grafana + +```bash +docker compose up -d grafana +``` + +### Confirm Container Is Running + +```bash +docker compose ps grafana +``` + +Expected `STATUS`: `Up` (not `Restarting` or `Exit`). + +### Tail Startup Logs + +```bash +docker compose logs --follow grafana +``` + +Watch for these lines — they confirm provisioning loaded successfully: + +``` +msg="Starting Grafana" +msg="Provisioning datasource" name=TracksolidDB +msg="Finished provisioning data sources" +msg="Inserting/updating dashboard" name="NOC Fleet Operations — Live" +msg="HTTP Server Listen" address=0.0.0.0:3000 +``` + +Press `Ctrl+C` to stop following once you see the server listening. + +--- + +## 3. Post-Deployment Verification + +### 3.1 Check Provisioning Loaded + +```bash +docker compose logs grafana | grep -E "provision|Inserting|dashboard|datasource" +``` + +| What to look for | Means | +|---|---| +| `Provisioning datasource` with `name=TracksolidDB` | Datasource YAML was read | +| `Finished provisioning data sources` | Datasource created successfully | +| `Inserting/updating dashboard` with `noc-fleet-live` | Dashboard JSON was loaded | +| `Failed to provision` or `Error` | See Troubleshooting section | + +### 3.2 Verify Datasource via API + +```bash +curl -s -u admin:${GRAFANA_ADMIN_PASSWORD} \ + http://localhost:3000/api/datasources/name/TracksolidDB \ + | python3 -m json.tool | grep -E '"name"|"uid"|"type"|"url"' +``` + +Expected: +```json +"name": "TracksolidDB", +"type": "postgres", +"uid": "tracksolid_pg", +"url": "timescale_db:5432", +``` + +### 3.3 Test Datasource Connection via API + +```bash +curl -s -u admin:${GRAFANA_ADMIN_PASSWORD} \ + -X POST http://localhost:3000/api/datasources/uid/tracksolid_pg/health \ + | python3 -m json.tool +``` + +Expected: +```json +{ + "message": "Database Connection OK", + "status": "OK" +} +``` + +If status is `ERROR` — see [Troubleshooting: Datasource Connection Fails](#datasource-connection-fails). + +### 3.4 Verify Dashboard Is Registered + +```bash +curl -s -u admin:${GRAFANA_ADMIN_PASSWORD} \ + http://localhost:3000/api/dashboards/uid/noc-fleet-live \ + | python3 -m json.tool | grep -E '"title"|"uid"|"version"' +``` + +Expected: +```json +"title": "NOC Fleet Operations — Live", +"uid": "noc-fleet-live", +"version": 1, +``` + +### 3.5 Open the Dashboard in a Browser + +Navigate to: `http://localhost:3000/d/noc-fleet-live` + +Login with: +- **Username:** `admin` +- **Password:** value of `GRAFANA_ADMIN_PASSWORD` from `.env` + +The NOC dashboard should load as the home page automatically. + +--- + +## 4. Dashboard Panel Verification + +Work through each panel to confirm it renders correctly. + +### 4.1 Stat Panels (Row 1) + +| Panel | Expected | Red Flag | +|---|---|---| +| Total Vehicles | Integer matching enabled device count (should be ~80) | Shows 0 or `-` | +| Online Now | Integer ≤ Total Vehicles | Shows `-` | +| Recent (5-30m) | Integer ≤ Total Vehicles | Shows `-` | +| Offline | Integer ≤ Total Vehicles | Shows `-` | +| Moving Now | Integer ≤ Online Now | Shows `-` | +| Avg Speed (km/h) | Numeric value, green < 80 / amber 80-120 / red > 120 | Shows `No data` | + +**Online + Recent + Offline should sum to Total Vehicles.** + +Check the sum: +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " +SELECT + connectivity_status, + COUNT(*) AS vehicles +FROM tracksolid.v_fleet_status +GROUP BY connectivity_status +ORDER BY connectivity_status; +" +``` + +### 4.2 Geomap Panel + +Work through this checklist visually: + +- [ ] Map loads with dark Carto basemap (not a grey blank tile) +- [ ] Arrow markers appear on the map (not dots or circles) +- [ ] Markers are clustered around East Africa (Nairobi / Mombasa / Kampala area) +- [ ] Arrows point in different directions — not all the same +- [ ] Clicking a marker opens a tooltip showing: Plate, Driver, Speed, Heading, Status, Location +- [ ] `lat` and `lng` fields are NOT visible in the tooltip (hidden by field overrides) +- [ ] `imei` field is NOT visible in the tooltip + +**Verify direction data exists:** +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " +SELECT + MIN(direction) AS min_dir, + MAX(direction) AS max_dir, + COUNT(*) FILTER (WHERE direction IS NOT NULL) AS with_direction, + COUNT(*) AS total +FROM tracksolid.live_positions; +" +``` + +If `with_direction` = 0, all arrows will point North (0°) — this means the GPS devices haven't sent bearing data yet, which is normal for parked vehicles. + +### 4.3 Vehicle Status Table + +- [ ] Table shows all enabled vehicles (row count matches Total Vehicles stat) +- [ ] Rows sort: Online (green) → Recent (amber) → Offline (red) +- [ ] "Status" column has colour-coded background +- [ ] "Speed (km/h)" column shows colour-coded text (green/amber/red) +- [ ] "Last Fix" column shows a readable timestamp (not a raw epoch number) +- [ ] "Min Ago" column shows integers (minutes since last GPS fix) +- [ ] Vehicles with no GPS fix show `null` in speed/location columns but still appear + +**Spot-check a specific vehicle:** +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " +SELECT + d.vehicle_number, d.driver_name, lp.speed, + lp.gps_time, lp.lat, lp.lng +FROM tracksolid.devices d +LEFT JOIN tracksolid.live_positions lp USING (imei) +WHERE d.enabled_flag = 1 +LIMIT 5; +" +``` + +### 4.4 Ingestion Health Panel (Collapsed) + +Expand the panel by clicking its title row. + +- [ ] Rows appear for each polling endpoint (`live_positions`, `trips`, `alarms`, etc.) +- [ ] "Result" column shows `OK` (green) for all active endpoints +- [ ] "Last Run" timestamps are recent (within the last few minutes for live_positions) +- [ ] No `FAIL` (red) entries + +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " +SELECT endpoint, success, seconds_ago, error_message +FROM tracksolid.v_ingestion_health +ORDER BY endpoint; +" +``` + +### 4.5 Auto-Refresh Verification + +1. Note the "Last Fix" timestamp for any vehicle in the table +2. Wait 30 seconds +3. Confirm the page auto-refreshes (watch the Grafana spinner in the top-right) +4. Confirm "Last Fix" values have updated for Online vehicles + +--- + +## 5. Database Verification Queries + +Run these directly against the database to validate the data powering each panel. + +### Fleet Status Summary +```sql +SELECT + connectivity_status, + COUNT(*) AS vehicles, + ROUND(AVG(speed)::numeric, 1) AS avg_speed_kmh +FROM tracksolid.v_fleet_status +GROUP BY connectivity_status +ORDER BY connectivity_status; +``` + +### Live Position Freshness +```sql +SELECT + COUNT(*) AS total_positions, + COUNT(*) FILTER (WHERE gps_time >= NOW() - INTERVAL '5 minutes') AS fresh_5m, + COUNT(*) FILTER (WHERE gps_time >= NOW() - INTERVAL '30 minutes') AS fresh_30m, + MAX(gps_time) AS newest_fix, + MIN(gps_time) AS oldest_fix +FROM tracksolid.live_positions; +``` + +### Moving Vehicles +```sql +SELECT + d.vehicle_number, d.driver_name, + lp.speed, lp.direction, lp.loc_desc, lp.gps_time +FROM tracksolid.devices d +INNER JOIN tracksolid.live_positions lp USING (imei) +WHERE d.enabled_flag = 1 + AND lp.speed > 0 + AND lp.acc_status = '1' +ORDER BY lp.speed DESC; +``` + +### Vehicles With No GPS Fix +```sql +SELECT + d.vehicle_number, d.vehicle_name, d.driver_name, d.city +FROM tracksolid.devices d +LEFT JOIN tracksolid.live_positions lp USING (imei) +WHERE d.enabled_flag = 1 + AND lp.imei IS NULL +ORDER BY d.vehicle_number; +``` + +### Ingestion Pipeline Health +```sql +SELECT + endpoint, + run_at, + success, + imei_count, + rows_upserted, + duration_ms, + seconds_ago, + error_message +FROM tracksolid.v_ingestion_health +ORDER BY endpoint; +``` + +### Stale Positions (no update in > 1 hour) +```sql +SELECT + d.vehicle_number, d.driver_name, d.city, + lp.gps_time, + EXTRACT(EPOCH FROM (NOW() - lp.gps_time))::int / 60 AS minutes_since_fix +FROM tracksolid.devices d +INNER JOIN tracksolid.live_positions lp USING (imei) +WHERE d.enabled_flag = 1 + AND lp.gps_time < NOW() - INTERVAL '1 hour' +ORDER BY lp.gps_time ASC; +``` + +--- + +## 6. Troubleshooting + +### Grafana Container Keeps Restarting + +```bash +docker compose logs grafana --tail=50 +``` + +Common causes: + +| Error in logs | Fix | +|---|---| +| `failed to connect to server` | Database not healthy yet — wait 30s and retry | +| `permission denied` on provisioning path | Check the `./grafana/provisioning` directory exists and is readable | +| `GF_SECURITY_ADMIN_PASSWORD not set` | Add `GRAFANA_ADMIN_PASSWORD` to `.env` | + +### Datasource Connection Fails + +```bash +docker compose logs grafana | grep -i "datasource\|postgres\|connect" +``` + +Check the password is correct: +```bash +docker compose exec timescale_db psql -U grafana_ro -d tracksolid_db -c "SELECT 1;" +``` + +If this fails with `authentication failed`, reset the password: +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c \ + "ALTER ROLE grafana_ro WITH PASSWORD '';" +``` + +Then update `GRAFANA_DB_RO_PASSWORD` in `.env` and restart: +```bash +docker compose restart grafana +``` + +### Dashboard Shows "No Data" + +**Step 1 — confirm the datasource UID matches:** +```bash +curl -s -u admin:${GRAFANA_ADMIN_PASSWORD} \ + http://localhost:3000/api/datasources \ + | python3 -m json.tool | grep uid +``` + +Must show `"uid": "tracksolid_pg"`. If the UID is different, the dashboard JSON references the wrong datasource. + +**Step 2 — test the query directly in Grafana:** +1. Open the dashboard +2. Click the panel title → Edit +3. Switch to the Query tab +4. Click "Run query" +5. If it errors, the SQL or connection is the issue + +**Step 3 — verify data exists:** +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c \ + "SELECT COUNT(*) FROM tracksolid.live_positions;" +``` + +If this returns 0, the ingestion pipeline has not populated data yet. Check ingestion service logs: +```bash +docker compose logs ingest_movement --tail=30 +docker compose logs ingest_events --tail=30 +``` + +### Geomap Shows No Markers + +**Check coordinates are valid:** +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " +SELECT COUNT(*) AS total, + COUNT(*) FILTER (WHERE lat IS NOT NULL AND lng IS NOT NULL) AS with_coords, + COUNT(*) FILTER (WHERE lat BETWEEN -90 AND 90 AND lng BETWEEN -180 AND 180) AS valid_coords +FROM tracksolid.live_positions; +" +``` + +If `with_coords` = 0: GPS data has not arrived yet. Wait for the next ingest cycle (60 seconds). + +If `valid_coords` < `with_coords`: some coordinates are out of range — data quality issue on the device side. + +**Check the basemap is loading:** + +Open browser DevTools (F12) → Network tab → filter for `tile` or `carto`. If Carto tile requests are failing, there may be no internet access from the browser. Try switching the basemap to OpenStreetMap in the Grafana panel editor temporarily. + +### Arrows All Point North (No Direction Data) + +This is expected for parked vehicles — direction is only meaningful when the vehicle is moving. Confirm: + +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " +SELECT COUNT(*) FILTER (WHERE direction > 0) AS with_bearing, + COUNT(*) AS total +FROM tracksolid.live_positions; +" +``` + +If `with_bearing` = 0 even for moving vehicles, the GPS device firmware may not be sending bearing in the position payload. Check with the Tracksolid account settings. + +### Dashboard Not Loading as Home Page + +Confirm the environment variable is set: +```bash +docker compose exec grafana env | grep DASHBOARDS +``` + +Expected: +``` +GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/etc/grafana/provisioning/dashboards-json/noc_fleet_dashboard.json +``` + +If missing, confirm `env_file: .env` is present in the grafana service block of `docker-compose.yaml` and restart: +```bash +docker compose restart grafana +``` + +### Provisioning Changes Not Reflecting + +Grafana polls the provisioning directory every 30 seconds. If changes to the dashboard JSON are not appearing: + +```bash +docker compose restart grafana +docker compose logs grafana | grep -i "provision\|dashboard" +``` + +--- + +## 7. Day-to-Day NOC Operations + +### Morning Health Check (run at shift start) + +```bash +# 1. Confirm all services are up +docker compose ps + +# 2. Check ingestion pipeline ran recently +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c \ + "SELECT endpoint, run_at, success, seconds_ago FROM tracksolid.v_ingestion_health ORDER BY endpoint;" + +# 3. Check live position freshness +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c \ + "SELECT connectivity_status, COUNT(*) FROM tracksolid.v_fleet_status GROUP BY 1 ORDER BY 1;" +``` + +### Investigating a Specific Vehicle + +Replace `KAA 123A` with the actual plate: + +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " +SELECT + d.vehicle_number, d.vehicle_name, d.driver_name, d.driver_phone, + lp.lat, lp.lng, lp.speed, lp.direction, + lp.acc_status, lp.gps_time, lp.loc_desc, + EXTRACT(EPOCH FROM (NOW() - lp.gps_time))::int / 60 AS minutes_ago +FROM tracksolid.devices d +LEFT JOIN tracksolid.live_positions lp USING (imei) +WHERE d.vehicle_number = 'KAA 123A'; +" +``` + +### Checking Recent Trips for a Vehicle + +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " +SELECT + t.start_time, t.end_time, + ROUND(t.distance_m / 1000.0, 2) AS distance_km, + t.avg_speed_kmh, t.max_speed_kmh, + ROUND(t.driving_time_s / 3600.0, 2) AS driving_hours +FROM tracksolid.trips t +INNER JOIN tracksolid.devices d USING (imei) +WHERE d.vehicle_number = 'KAA 123A' + AND t.start_time >= NOW() - INTERVAL '24 hours' +ORDER BY t.start_time DESC; +" +``` + +### Checking Recent Alarms + +```bash +docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c " +SELECT + d.vehicle_number, d.driver_name, + a.alarm_type, a.alarm_name, a.alarm_time, a.speed +FROM tracksolid.alarms a +INNER JOIN tracksolid.devices d USING (imei) +WHERE a.alarm_time >= NOW() - INTERVAL '24 hours' +ORDER BY a.alarm_time DESC +LIMIT 20; +" +``` + +--- + +## 8. Maintenance + +### Restart Grafana Only + +```bash +docker compose restart grafana +``` + +### Restart Full Stack + +```bash +docker compose down && docker compose up -d +``` + +### Update the Dashboard JSON + +1. Edit `grafana/provisioning/dashboards-json/noc_fleet_dashboard.json` +2. Grafana auto-reloads within 30 seconds (no restart needed) +3. Commit the change: `git add grafana/ && git commit -m "Update NOC dashboard"` + +### Check Container Resource Usage + +```bash +docker stats grafana timescale_db --no-stream +``` + +### Grafana Data Volume Size + +```bash +docker system df -v | grep grafana-data +``` + +### View Grafana Version + +```bash +docker compose exec grafana grafana-server -v +``` + +Expected: `Version 11.0.0` + +--- + +## Quick Reference + +| Task | Command | +|---|---| +| Start Grafana | `docker compose up -d grafana` | +| Stop Grafana | `docker compose stop grafana` | +| Restart Grafana | `docker compose restart grafana` | +| View logs | `docker compose logs grafana --tail=50` | +| Check provisioning | `docker compose logs grafana \| grep -i provision` | +| Test datasource | `curl -s -u admin:$GRAFANA_ADMIN_PASSWORD http://localhost:3000/api/datasources/uid/tracksolid_pg/health` | +| Open dashboard | `http://localhost:3000/d/noc-fleet-live` | +| Fleet summary | `SELECT connectivity_status, COUNT(*) FROM tracksolid.v_fleet_status GROUP BY 1;` | +| Ingestion health | `SELECT * FROM tracksolid.v_ingestion_health ORDER BY endpoint;` |