# Bug Reduction Quality Program — Design Spec
**Date:** 2026-04-12

**Project:** Fireside Communications Fleet Telemetry Ingestion Platform

**Repo:** `55_ts_coolify_gemini_prod`

**Status:** Approved — Implementation in Progress
## Problem

The platform has been running in production since late 2025, ingesting GPS and telemetry data from ~63 fleet vehicles. All bugs discovered to date (FIX-M11, FIX-M13, FIX-M16, FIX-E06, BUG-01 through BUG-05) were caught manually in production — via data inspection, Grafana anomalies, or customer reports. The project currently has:

- Zero automated tests
- No linting or type-checking configuration
- No CI/CD pipeline
- No programmatic DB health monitoring

Any code change risks silent regressions. Any API field-mapping change risks data silently going to NULL. Any schema change risks data corruption that may go unnoticed for days.
## Goal

A layered quality program that:

1. **Finds existing bugs and data issues** without modifying source code
2. **Prevents future regressions** by locking in known-correct behaviour
3. **Monitors production DB health** on a daily schedule
## Constraints

- Existing source files MUST NOT be modified in Phase 1
- All additions are new files only (config, tests, CI workflows, audit scripts)
- Must run in CI (Forgejo Actions, self-hosted runner) and in production (scheduled DB audit)

---
## Architecture: Three Parallel Workstreams

### Workstream 1 — Static Analysis

**Tools:** `ruff` (linting) + `mypy` (type checking)

**Trigger:** Every push / pull request via Forgejo Actions

**Risk:** Zero — read-only analysis of existing source

Issues it surfaces:

- Undefined names, unused imports (ruff `F` rules)
- Likely bugs: mutable defaults, string-formatting issues (ruff `B` rules)
- Type errors: untyped returns, unhandled `Optional` (mypy)
- Modern-Python upgrade opportunities (ruff `UP` rules)

The first run will be noisy — its output becomes the bug backlog.
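
A minimal `pyproject.toml` fragment for this workstream might look like the sketch below. The rule selections, Python version, and pytest settings are illustrative assumptions, not final configuration:

```toml
[tool.ruff]
target-version = "py311"  # assumption: match the deployed runtime

[tool.ruff.lint]
# F = pyflakes (undefined names, unused imports), B = bugbear (likely bugs),
# UP = pyupgrade (modern-Python rewrites), E = pycodestyle errors
select = ["E", "F", "B", "UP"]

[tool.mypy]
python_version = "3.11"
check_untyped_defs = true
# Start permissive; tighten once the first-run backlog is cleared.
ignore_missing_imports = true

[tool.pytest.ini_options]
testpaths = ["tests"]
asyncio_mode = "auto"  # assumption: pytest-asyncio auto mode for async code
```

Starting permissive and ratcheting strictness per module keeps the first noisy run from blocking CI while the backlog is worked down.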
### Workstream 2 — Test Suite

**Framework:** pytest + pytest-asyncio

**Trigger:** Every push / pull request via Forgejo Actions

**Isolation:** Integration tests use a Docker TimescaleDB service container

**Unit tests** (pure Python, no DB):

- `test_clean_helpers.py` — `clean()`, `clean_num()`, `clean_ts()`, `is_valid_fix()` — these gate all data into the DB
- `test_api_signing.py` — `build_sign()` MD5 signature correctness
- `test_field_mapping.py` — locks in the three most bug-prone field mappings:
  - FIX-E06: poll alarms use `alertTypeId`/`alarmTypeName`/`alertTime` (not `alarmType`)
  - FIX-M16: trip distance arrives in metres, stored as km (÷ 1000)
  - BUG-03: BCD timestamps (`YYMMDDHHmmss`) parsed correctly
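
The FIX-M16 and BUG-03 regression tests could be sketched as below. The helper functions here are hypothetical stand-ins — the real module names and signatures in the repo are not shown in this spec:

```python
from datetime import datetime, timezone

# Hypothetical stand-ins for the production helpers under test; the real
# names and signatures are assumptions, not taken from the repo.
def metres_to_km(metres: float) -> float:
    """FIX-M16: the API reports trip distance in metres; we store km."""
    return metres / 1000

def parse_bcd_timestamp(raw: str) -> datetime:
    """BUG-03: BCD timestamps arrive as YYMMDDHHmmss (assumed UTC here)."""
    return datetime.strptime(raw, "%y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def test_trip_distance_is_stored_in_km():
    assert metres_to_km(12500) == 12.5

def test_bcd_timestamp_fields():
    ts = parse_bcd_timestamp("260412063000")
    assert (ts.year, ts.month, ts.day) == (2026, 4, 12)
    assert (ts.hour, ts.minute, ts.second) == (6, 30, 0)
```

Pinning each historical fix to a named test is what turns the bug log into a regression barrier: if a future refactor reintroduces the metres/km confusion, the test names point straight at the original ticket.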

**Integration tests** (real TimescaleDB):

- `test_movement_pipeline.py` — `poll_live_positions()` full round-trip, UPSERT idempotency
- `test_events_pipeline.py` — `poll_alarms()` field mapping, NULL `alarm_type` rejection
- `test_webhook_endpoints.py` — FastAPI endpoints with mock Jimi payloads, SAVEPOINT isolation
### Workstream 3 — DB Audit

**Runner:** `db_audit/run_audit.py` (Python)

**Trigger:** Daily at 06:00 EAT (03:00 UTC) via a scheduled Forgejo workflow, plus `workflow_dispatch` for manual runs

**Output:** Rows written to the `tracksolid.health_checks` table; queryable from Grafana

Six health checks:

| Check | File | Critical | Warning |
|---|---|---|---|
| Stale devices | `stale_devices.sql` | — | Any enabled device with no GPS fix in >2 h |
| NULL integrity | `null_integrity.sql` | Any NULL `imei` or `gps_time` in telemetry tables | — |
| Distance outliers | `distance_outliers.sql` | — | Any trip >500 km or <0 km in the last 7 days |
| Duplicate positions | `duplicate_positions.sql` | Any `(imei, gps_time)` duplicate in `position_history` | — |
| Data gaps | `data_gaps.sql` | — | Any enabled device with no data in 7 days |
| Enum drift | `enum_drift.sql` | — | Unexpected value in `source`/`severity` columns |

Exit code: `1` on any `critical`, `0` on `ok`/`warning`.
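
The runner's severity and exit-code logic can be sketched as pure Python. The SQL file names and the critical/warning split come from the table above; the function and class names are illustrative, not the actual implementation:

```python
from dataclasses import dataclass

# Severity per check, per the health-check table; file names from the spec.
CHECKS = {
    "stale_devices.sql": "warning",
    "null_integrity.sql": "critical",
    "distance_outliers.sql": "warning",
    "duplicate_positions.sql": "critical",
    "data_gaps.sql": "warning",
    "enum_drift.sql": "warning",
}

@dataclass
class Finding:
    check: str
    rows: int
    severity: str  # "ok" when the query returns no rows

def classify(check: str, rows: int) -> Finding:
    """A check that returns zero rows is healthy; otherwise it takes
    the severity configured for that check."""
    severity = "ok" if rows == 0 else CHECKS[check]
    return Finding(check=check, rows=rows, severity=severity)

def exit_code(findings: list[Finding]) -> int:
    # Exit 1 only on critical findings; warnings still exit 0 so the
    # scheduled workflow stays green while the rows land in health_checks.
    return 1 if any(f.severity == "critical" for f in findings) else 0
```

In the real runner, each `Finding` would also be inserted into `tracksolid.health_checks` with a timestamp so Grafana can trend the results.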
---

## File Layout

```
55_ts_coolify_gemini_prod/
├── pyproject.toml          ← ADD: ruff + mypy + pytest config + dev deps
├── .forgejo/
│   └── workflows/
│       ├── ci-static.yml
│       ├── ci-tests.yml
│       └── scheduled-audit.yml
├── tests/
│   ├── conftest.py
│   ├── fixtures/
│   │   ├── api_responses.py
│   │   └── schema.sql
│   ├── unit/
│   │   ├── test_clean_helpers.py
│   │   ├── test_api_signing.py
│   │   └── test_field_mapping.py
│   └── integration/
│       ├── test_movement_pipeline.py
│       ├── test_events_pipeline.py
│       └── test_webhook_endpoints.py
└── db_audit/
    ├── run_audit.py
    ├── checks/
    │   ├── stale_devices.sql
    │   ├── null_integrity.sql
    │   ├── distance_outliers.sql
    │   ├── duplicate_positions.sql
    │   ├── data_gaps.sql
    │   └── enum_drift.sql
    └── schema/
        └── health_checks_table.sql
```

---

## Forgejo Runner Setup
Before CI can run, a self-hosted runner must be registered on the Coolify server:

1. Forgejo → Settings → Actions → Runners → Register Runner → copy the token
2. On the Coolify server: `docker run -d --name forgejo-runner gitea/act_runner:latest register --instance https://repo.rahamafresh.com --token <TOKEN> --name coolify-runner --labels self-hosted`
3. Verify the runner appears as active in Forgejo

Required Forgejo secrets:

- `DATABASE_URL` — production DB connection string (for the scheduled audit)
- `TEST_DATABASE_URL` — set automatically by the CI service container
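
With the runner registered, the scheduled audit workflow might look like the sketch below. Forgejo Actions uses a GitHub-Actions-compatible syntax; the checkout action version and the exact run step are assumptions:

```yaml
# .forgejo/workflows/scheduled-audit.yml — illustrative sketch
name: scheduled-audit
on:
  schedule:
    - cron: "0 3 * * *"   # 03:00 UTC == 06:00 EAT
  workflow_dispatch: {}    # allow manual runs

jobs:
  audit:
    runs-on: self-hosted   # matches the label given at runner registration
    steps:
      - uses: actions/checkout@v4
      - name: Run DB audit
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
        run: python db_audit/run_audit.py
```

Because the audit exits non-zero only on `critical`, a red run of this workflow signals data corruption rather than routine warnings.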

---
## Verification

| Workstream | Pass Criteria |
|---|---|
| Static Analysis | Push triggers CI-static; ruff + mypy produce an output report; the job exits non-zero on violations |
| Test Suite | Push triggers CI-tests; all unit tests pass; integration tests pass against the service-container DB |
| DB Audit | A manual run populates the `health_checks` table; findings match known issues (44 silent devices, etc.); the scheduled run fires at 06:00 EAT |