tracksolid_timescale_grafan.../grafanaOperationalManual.md

648 lines
18 KiB
Markdown
Raw Permalink Normal View History

# Grafana NOC Fleet Dashboard — Operational Manual
## Table of Contents
1. [Pre-Deployment Checklist](#1-pre-deployment-checklist)
2. [Deploy](#2-deploy)
3. [Post-Deployment Verification](#3-post-deployment-verification)
4. [Dashboard Panel Verification](#4-dashboard-panel-verification)
5. [Database Verification Queries](#5-database-verification-queries)
6. [Troubleshooting](#6-troubleshooting)
7. [Day-to-Day NOC Operations](#7-day-to-day-noc-operations)
8. [Maintenance](#8-maintenance)
---
## 1. Pre-Deployment Checklist
Run these checks before starting Grafana for the first time.
### 1.1 Environment Variables
Confirm `.env` contains all required Grafana variables:
```bash
grep -E 'GRAFANA_ADMIN_PASSWORD|GRAFANA_DB_RO_PASSWORD' .env
```
Expected output — both lines must be present and non-empty:
```
GRAFANA_ADMIN_PASSWORD=<your admin password>
GRAFANA_DB_RO_PASSWORD=<grafana_ro postgres password>
```
If `GRAFANA_DB_RO_PASSWORD` is missing, add it before continuing:
```bash
echo "GRAFANA_DB_RO_PASSWORD=<password>" >> .env
```
### 1.2 Verify grafana_ro User Exists in Postgres
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "\du grafana_ro"
```
Expected output:
```
Role name | Attributes
------------+---------------------------
grafana_ro | Cannot login, ...
```
If the role is missing, create it:
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
CREATE ROLE grafana_ro WITH LOGIN PASSWORD '<password>' NOSUPERUSER NOCREATEDB NOCREATEROLE;
GRANT USAGE ON SCHEMA tracksolid TO grafana_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA tracksolid TO grafana_ro;
ALTER DEFAULT PRIVILEGES IN SCHEMA tracksolid GRANT SELECT ON TABLES TO grafana_ro;
"
```
### 1.3 Verify Provisioning Files Are Present
```bash
ls -R grafana/provisioning/
```
Expected output:
```
grafana/provisioning/:
dashboards dashboards-json datasources
grafana/provisioning/datasources:
tracksolid_postgres.yaml
grafana/provisioning/dashboards:
noc_fleet.yaml
grafana/provisioning/dashboards-json:
noc_fleet_dashboard.json
```
### 1.4 Verify Database Has Live Data
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
SELECT COUNT(*) AS devices,
COUNT(lp.imei) AS with_position
FROM tracksolid.devices d
LEFT JOIN tracksolid.live_positions lp USING (imei)
WHERE d.enabled_flag = 1;
"
```
Expected: `devices` = total enabled vehicles, `with_position` > 0 (at least some vehicles have GPS fixes).
---
## 2. Deploy
### Start Grafana
```bash
docker compose up -d grafana
```
### Confirm Container Is Running
```bash
docker compose ps grafana
```
Expected `STATUS`: `Up` (not `Restarting` or `Exit`).
### Tail Startup Logs
```bash
docker compose logs --follow grafana
```
Watch for these lines — they confirm provisioning loaded successfully:
```
msg="Starting Grafana"
msg="Provisioning datasource" name=TracksolidDB
msg="Finished provisioning data sources"
msg="Inserting/updating dashboard" name="NOC Fleet Operations — Live"
msg="HTTP Server Listen" address=0.0.0.0:3000
```
Press `Ctrl+C` to stop following once you see the server listening.
---
## 3. Post-Deployment Verification
### 3.1 Check Provisioning Loaded
```bash
docker compose logs grafana | grep -E "provision|Inserting|dashboard|datasource"
```
| What to look for | Means |
|---|---|
| `Provisioning datasource` with `name=TracksolidDB` | Datasource YAML was read |
| `Finished provisioning data sources` | Datasource created successfully |
| `Inserting/updating dashboard` with `noc-fleet-live` | Dashboard JSON was loaded |
| `Failed to provision` or `Error` | See Troubleshooting section |
### 3.2 Verify Datasource via API
```bash
curl -s -u admin:${GRAFANA_ADMIN_PASSWORD} \
http://localhost:3000/api/datasources/name/TracksolidDB \
| python3 -m json.tool | grep -E '"name"|"uid"|"type"|"url"'
```
Expected:
```json
"name": "TracksolidDB",
"type": "postgres",
"uid": "tracksolid_pg",
"url": "timescale_db:5432",
```
### 3.3 Test Datasource Connection via API
```bash
curl -s -u admin:${GRAFANA_ADMIN_PASSWORD} \
-X POST http://localhost:3000/api/datasources/uid/tracksolid_pg/health \
| python3 -m json.tool
```
Expected:
```json
{
"message": "Database Connection OK",
"status": "OK"
}
```
If status is `ERROR` — see [Troubleshooting: Datasource Connection Fails](#datasource-connection-fails).
### 3.4 Verify Dashboard Is Registered
```bash
curl -s -u admin:${GRAFANA_ADMIN_PASSWORD} \
http://localhost:3000/api/dashboards/uid/noc-fleet-live \
| python3 -m json.tool | grep -E '"title"|"uid"|"version"'
```
Expected:
```json
"title": "NOC Fleet Operations — Live",
"uid": "noc-fleet-live",
"version": 1,
```
### 3.5 Open the Dashboard in a Browser
Navigate to: `http://localhost:3000/d/noc-fleet-live`
Login with:
- **Username:** `admin`
- **Password:** value of `GRAFANA_ADMIN_PASSWORD` from `.env`
The NOC dashboard should load as the home page automatically.
---
## 4. Dashboard Panel Verification
Work through each panel to confirm it renders correctly.
### 4.1 Stat Panels (Row 1)
| Panel | Expected | Red Flag |
|---|---|---|
| Total Vehicles | Integer matching enabled device count (should be ~80) | Shows 0 or `-` |
| Online Now | Integer ≤ Total Vehicles | Shows `-` |
| Recent (5-30m) | Integer ≤ Total Vehicles | Shows `-` |
| Offline | Integer ≤ Total Vehicles | Shows `-` |
| Moving Now | Integer ≤ Online Now | Shows `-` |
| Avg Speed (km/h) | Numeric value, green < 80 / amber 80-120 / red > 120 | Shows `No data` |
**Online + Recent + Offline should sum to Total Vehicles.**
Check the sum:
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
SELECT
connectivity_status,
COUNT(*) AS vehicles
FROM tracksolid.v_fleet_status
GROUP BY connectivity_status
ORDER BY connectivity_status;
"
```
### 4.2 Geomap Panel
Work through this checklist visually:
- [ ] Map loads with dark Carto basemap (not a grey blank tile)
- [ ] Arrow markers appear on the map (not dots or circles)
- [ ] Markers are clustered around East Africa (Nairobi / Mombasa / Kampala area)
- [ ] Arrows point in different directions — not all the same
- [ ] Clicking a marker opens a tooltip showing: Plate, Driver, Speed, Heading, Status, Location
- [ ] `lat` and `lng` fields are NOT visible in the tooltip (hidden by field overrides)
- [ ] `imei` field is NOT visible in the tooltip
**Verify direction data exists:**
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
SELECT
MIN(direction) AS min_dir,
MAX(direction) AS max_dir,
COUNT(*) FILTER (WHERE direction IS NOT NULL) AS with_direction,
COUNT(*) AS total
FROM tracksolid.live_positions;
"
```
If `with_direction` = 0, all arrows will point North (0°) — this means the GPS devices haven't sent bearing data yet, which is normal for parked vehicles.
### 4.3 Vehicle Status Table
- [ ] Table shows all enabled vehicles (row count matches Total Vehicles stat)
- [ ] Rows sort: Online (green) → Recent (amber) → Offline (red)
- [ ] "Status" column has colour-coded background
- [ ] "Speed (km/h)" column shows colour-coded text (green/amber/red)
- [ ] "Last Fix" column shows a readable timestamp (not a raw epoch number)
- [ ] "Min Ago" column shows integers (minutes since last GPS fix)
- [ ] Vehicles with no GPS fix show `null` in speed/location columns but still appear
**Spot-check a specific vehicle:**
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
SELECT
d.vehicle_number, d.driver_name, lp.speed,
lp.gps_time, lp.lat, lp.lng
FROM tracksolid.devices d
LEFT JOIN tracksolid.live_positions lp USING (imei)
WHERE d.enabled_flag = 1
LIMIT 5;
"
```
### 4.4 Ingestion Health Panel (Collapsed)
Expand the panel by clicking its title row.
- [ ] Rows appear for each polling endpoint (`live_positions`, `trips`, `alarms`, etc.)
- [ ] "Result" column shows `OK` (green) for all active endpoints
- [ ] "Last Run" timestamps are recent (within the last few minutes for live_positions)
- [ ] No `FAIL` (red) entries
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
SELECT endpoint, success, seconds_ago, error_message
FROM tracksolid.v_ingestion_health
ORDER BY endpoint;
"
```
### 4.5 Auto-Refresh Verification
1. Note the "Last Fix" timestamp for any vehicle in the table
2. Wait 30 seconds
3. Confirm the page auto-refreshes (watch the Grafana spinner in the top-right)
4. Confirm "Last Fix" values have updated for Online vehicles
---
## 5. Database Verification Queries
Run these directly against the database to validate the data powering each panel.
### Fleet Status Summary
```sql
SELECT
connectivity_status,
COUNT(*) AS vehicles,
ROUND(AVG(speed)::numeric, 1) AS avg_speed_kmh
FROM tracksolid.v_fleet_status
GROUP BY connectivity_status
ORDER BY connectivity_status;
```
### Live Position Freshness
```sql
SELECT
COUNT(*) AS total_positions,
COUNT(*) FILTER (WHERE gps_time >= NOW() - INTERVAL '5 minutes') AS fresh_5m,
COUNT(*) FILTER (WHERE gps_time >= NOW() - INTERVAL '30 minutes') AS fresh_30m,
MAX(gps_time) AS newest_fix,
MIN(gps_time) AS oldest_fix
FROM tracksolid.live_positions;
```
### Moving Vehicles
```sql
SELECT
d.vehicle_number, d.driver_name,
lp.speed, lp.direction, lp.loc_desc, lp.gps_time
FROM tracksolid.devices d
INNER JOIN tracksolid.live_positions lp USING (imei)
WHERE d.enabled_flag = 1
AND lp.speed > 0
AND lp.acc_status = '1'
ORDER BY lp.speed DESC;
```
### Vehicles With No GPS Fix
```sql
SELECT
d.vehicle_number, d.vehicle_name, d.driver_name, d.city
FROM tracksolid.devices d
LEFT JOIN tracksolid.live_positions lp USING (imei)
WHERE d.enabled_flag = 1
AND lp.imei IS NULL
ORDER BY d.vehicle_number;
```
### Ingestion Pipeline Health
```sql
SELECT
endpoint,
run_at,
success,
imei_count,
rows_upserted,
duration_ms,
seconds_ago,
error_message
FROM tracksolid.v_ingestion_health
ORDER BY endpoint;
```
### Stale Positions (no update in > 1 hour)
```sql
SELECT
d.vehicle_number, d.driver_name, d.city,
lp.gps_time,
EXTRACT(EPOCH FROM (NOW() - lp.gps_time))::int / 60 AS minutes_since_fix
FROM tracksolid.devices d
INNER JOIN tracksolid.live_positions lp USING (imei)
WHERE d.enabled_flag = 1
AND lp.gps_time < NOW() - INTERVAL '1 hour'
ORDER BY lp.gps_time ASC;
```
---
## 6. Troubleshooting
### Grafana Container Keeps Restarting
```bash
docker compose logs grafana --tail=50
```
Common causes:
| Error in logs | Fix |
|---|---|
| `failed to connect to server` | Database not healthy yet — wait 30s and retry |
| `permission denied` on provisioning path | Check the `./grafana/provisioning` directory exists and is readable |
| `GF_SECURITY_ADMIN_PASSWORD not set` | Add `GRAFANA_ADMIN_PASSWORD` to `.env` |
### Datasource Connection Fails
```bash
docker compose logs grafana | grep -i "datasource\|postgres\|connect"
```
Check the password is correct:
```bash
docker compose exec timescale_db psql -U grafana_ro -d tracksolid_db -c "SELECT 1;"
```
If this fails with `authentication failed`, reset the password:
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c \
"ALTER ROLE grafana_ro WITH PASSWORD '<new_password>';"
```
Then update `GRAFANA_DB_RO_PASSWORD` in `.env` and restart:
```bash
docker compose restart grafana
```
### Dashboard Shows "No Data"
**Step 1 — confirm the datasource UID matches:**
```bash
curl -s -u admin:${GRAFANA_ADMIN_PASSWORD} \
http://localhost:3000/api/datasources \
| python3 -m json.tool | grep uid
```
Must show `"uid": "tracksolid_pg"`. If the UID is different, the dashboard JSON references the wrong datasource.
**Step 2 — test the query directly in Grafana:**
1. Open the dashboard
2. Click the panel title → Edit
3. Switch to the Query tab
4. Click "Run query"
5. If it errors, the SQL or connection is the issue
**Step 3 — verify data exists:**
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c \
"SELECT COUNT(*) FROM tracksolid.live_positions;"
```
If this returns 0, the ingestion pipeline has not populated data yet. Check ingestion service logs:
```bash
docker compose logs ingest_movement --tail=30
docker compose logs ingest_events --tail=30
```
### Geomap Shows No Markers
**Check coordinates are valid:**
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
SELECT COUNT(*) AS total,
COUNT(*) FILTER (WHERE lat IS NOT NULL AND lng IS NOT NULL) AS with_coords,
COUNT(*) FILTER (WHERE lat BETWEEN -90 AND 90 AND lng BETWEEN -180 AND 180) AS valid_coords
FROM tracksolid.live_positions;
"
```
If `with_coords` = 0: GPS data has not arrived yet. Wait for the next ingest cycle (60 seconds).
If `valid_coords` < `with_coords`: some coordinates are out of range data quality issue on the device side.
**Check the basemap is loading:**
Open browser DevTools (F12) → Network tab → filter for `tile` or `carto`. If Carto tile requests are failing, there may be no internet access from the browser. Try switching the basemap to OpenStreetMap in the Grafana panel editor temporarily.
### Arrows All Point North (No Direction Data)
This is expected for parked vehicles — direction is only meaningful when the vehicle is moving. Confirm:
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
SELECT COUNT(*) FILTER (WHERE direction > 0) AS with_bearing,
COUNT(*) AS total
FROM tracksolid.live_positions;
"
```
If `with_bearing` = 0 even for moving vehicles, the GPS device firmware may not be sending bearing in the position payload. Check with the Tracksolid account settings.
### Dashboard Not Loading as Home Page
Confirm the environment variable is set:
```bash
docker compose exec grafana env | grep DASHBOARDS
```
Expected:
```
GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/etc/grafana/provisioning/dashboards-json/noc_fleet_dashboard.json
```
If missing, confirm `env_file: .env` is present in the grafana service block of `docker-compose.yaml` and restart:
```bash
docker compose restart grafana
```
### Provisioning Changes Not Reflecting
Grafana polls the provisioning directory every 30 seconds. If changes to the dashboard JSON are not appearing:
```bash
docker compose restart grafana
docker compose logs grafana | grep -i "provision\|dashboard"
```
---
## 7. Day-to-Day NOC Operations
### Morning Health Check (run at shift start)
```bash
# 1. Confirm all services are up
docker compose ps
# 2. Check ingestion pipeline ran recently
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c \
"SELECT endpoint, run_at, success, seconds_ago FROM tracksolid.v_ingestion_health ORDER BY endpoint;"
# 3. Check live position freshness
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c \
"SELECT connectivity_status, COUNT(*) FROM tracksolid.v_fleet_status GROUP BY 1 ORDER BY 1;"
```
### Investigating a Specific Vehicle
Replace `KAA 123A` with the actual plate:
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
SELECT
d.vehicle_number, d.vehicle_name, d.driver_name, d.driver_phone,
lp.lat, lp.lng, lp.speed, lp.direction,
lp.acc_status, lp.gps_time, lp.loc_desc,
EXTRACT(EPOCH FROM (NOW() - lp.gps_time))::int / 60 AS minutes_ago
FROM tracksolid.devices d
LEFT JOIN tracksolid.live_positions lp USING (imei)
WHERE d.vehicle_number = 'KAA 123A';
"
```
### Checking Recent Trips for a Vehicle
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
SELECT
t.start_time, t.end_time,
ROUND(t.distance_m / 1000.0, 2) AS distance_km,
t.avg_speed_kmh, t.max_speed_kmh,
ROUND(t.driving_time_s / 3600.0, 2) AS driving_hours
FROM tracksolid.trips t
INNER JOIN tracksolid.devices d USING (imei)
WHERE d.vehicle_number = 'KAA 123A'
AND t.start_time >= NOW() - INTERVAL '24 hours'
ORDER BY t.start_time DESC;
"
```
### Checking Recent Alarms
```bash
docker compose exec timescale_db psql -U $POSTGRES_USER -d tracksolid_db -c "
SELECT
d.vehicle_number, d.driver_name,
a.alarm_type, a.alarm_name, a.alarm_time, a.speed
FROM tracksolid.alarms a
INNER JOIN tracksolid.devices d USING (imei)
WHERE a.alarm_time >= NOW() - INTERVAL '24 hours'
ORDER BY a.alarm_time DESC
LIMIT 20;
"
```
---
## 8. Maintenance
### Restart Grafana Only
```bash
docker compose restart grafana
```
### Restart Full Stack
```bash
docker compose down && docker compose up -d
```
### Update the Dashboard JSON
1. Edit `grafana/provisioning/dashboards-json/noc_fleet_dashboard.json`
2. Grafana auto-reloads within 30 seconds (no restart needed)
3. Commit the change: `git add grafana/ && git commit -m "Update NOC dashboard"`
### Check Container Resource Usage
```bash
docker stats grafana timescale_db --no-stream
```
### Grafana Data Volume Size
```bash
docker system df -v | grep grafana-data
```
### View Grafana Version
```bash
docker compose exec grafana grafana-server -v
```
Expected: `Version 11.0.0`
---
## Quick Reference
| Task | Command |
|---|---|
| Start Grafana | `docker compose up -d grafana` |
| Stop Grafana | `docker compose stop grafana` |
| Restart Grafana | `docker compose restart grafana` |
| View logs | `docker compose logs grafana --tail=50` |
| Check provisioning | `docker compose logs grafana \| grep -i provision` |
| Test datasource | `curl -s -u admin:$GRAFANA_ADMIN_PASSWORD http://localhost:3000/api/datasources/uid/tracksolid_pg/health` |
| Open dashboard | `http://localhost:3000/d/noc-fleet-live` |
| Fleet summary | `SELECT connectivity_status, COUNT(*) FROM tracksolid.v_fleet_status GROUP BY 1;` |
| Ingestion health | `SELECT * FROM tracksolid.v_ingestion_health ORDER BY endpoint;` |