docs: add quality program design spec

David Kiania 2026-04-12 21:31:56 +03:00
parent f9834564ab
commit 75d3417a2b


# Bug Reduction Quality Program — Design Spec
**Date:** 2026-04-12
**Project:** Fireside Communications Fleet Telemetry Ingestion Platform
**Repo:** `55_ts_coolify_gemini_prod`
**Status:** Approved — Implementation in Progress
## Problem
The platform has been running in production since late 2025 ingesting GPS and telemetry data from ~63 fleet vehicles. All bugs discovered to date (FIX-M11, FIX-M13, FIX-M16, FIX-E06, BUG-01 through BUG-05) were caught manually in production — via data inspection, Grafana anomalies, or customer reports. There are:
- Zero automated tests
- No linting or type-checking configuration
- No CI/CD pipeline
- No programmatic DB health monitoring
Any code change risks silent regressions. Any API field-mapping change risks data silently arriving as NULL. Any schema change risks data corruption that may go unnoticed for days.
## Goal
A layered quality program that:
1. **Finds existing bugs and data issues** without modifying source code
2. **Prevents future regressions** by locking in known-correct behaviour
3. **Monitors production DB health** on a daily schedule
## Constraints
- Existing source files MUST NOT be modified in Phase 1
- All additions are new files only (config, tests, CI workflows, audit scripts)
- Must run in CI (Forgejo Actions, self-hosted runner) and production (scheduled DB audit)
---
## Architecture: Three Parallel Workstreams
### Workstream 1 — Static Analysis
**Tools:** `ruff` (linting) + `mypy` (type checking)
**Trigger:** Every push / pull request via Forgejo Actions
**Risk:** Zero — read-only analysis of existing source
Surfaces:
- Undefined names, unused imports (ruff/F rules)
- Likely bugs: mutable defaults, string formatting issues (ruff/B rules)
- Type errors: untyped returns, Optional not handled (mypy)
- Modern Python upgrade opportunities (ruff/UP rules)
The first run will be noisy — its output becomes the bug backlog.
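A minimal `pyproject.toml` fragment for this workstream might look like the following (the Python version and mypy strictness flags are illustrative assumptions, not final settings; the rule families match the ones listed above):

```toml
[tool.ruff]
target-version = "py311"   # assumed interpreter version

[tool.ruff.lint]
# F = pyflakes (undefined names, unused imports)
# B = bugbear (likely bugs: mutable defaults, etc.)
# UP = pyupgrade (modern-Python rewrites)
select = ["F", "B", "UP"]

[tool.mypy]
python_version = "3.11"
check_untyped_defs = true
warn_return_any = true
no_implicit_optional = true
```

Starting from a small `select` list and widening it later keeps the initial backlog tractable.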
### Workstream 2 — Test Suite
**Framework:** pytest + pytest-asyncio
**Trigger:** Every push / pull request via Forgejo Actions
**Isolation:** Integration tests use a Docker TimescaleDB service container
**Unit tests** (pure Python, no DB):
- `test_clean_helpers.py` — `clean()`, `clean_num()`, `clean_ts()`, `is_valid_fix()` — these gate all data into the DB
- `test_api_signing.py` — `build_sign()` MD5 signature correctness
- `test_field_mapping.py` — locks in the three most bug-prone field mappings:
- FIX-E06: poll alarms use `alertTypeId`/`alarmTypeName`/`alertTime` (not `alarmType`)
- FIX-M16: trip distance arrives in metres, stored as km (÷ 1000)
- BUG-03: BCD timestamps `YYMMDDHHmmss` parsed correctly
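As an illustration of the lock-in style (helper names and signatures here are stand-ins, since the real module API is not shown in this spec), FIX-M16 and BUG-03 tests might look like:

```python
from datetime import datetime

# Hypothetical stand-ins for the real helpers under test.
def metres_to_km(metres: float) -> float:
    """FIX-M16: the API reports trip distance in metres; we store km."""
    return metres / 1000

def parse_bcd_timestamp(raw: str) -> datetime:
    """BUG-03: timestamps arrive as YYMMDDHHmmss digit strings."""
    return datetime.strptime(raw, "%y%m%d%H%M%S")

def test_trip_distance_metres_to_km():
    assert metres_to_km(12500) == 12.5

def test_bcd_timestamp_parsing():
    assert parse_bcd_timestamp("260412213156") == datetime(2026, 4, 12, 21, 31, 56)
```

The point is that each historical fix becomes an executable assertion, so a regression fails CI instead of silently corrupting data.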
**Integration tests** (real TimescaleDB):
- `test_movement_pipeline.py` — `poll_live_positions()` full round-trip, UPSERT idempotency
- `test_events_pipeline.py` — `poll_alarms()` field mapping, NULL alarm_type rejection
- `test_webhook_endpoints.py` — FastAPI endpoints with mock Jimi payloads, SAVEPOINT isolation
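The UPSERT-idempotency property those integration tests assert can be modelled in plain Python (a sketch of the invariant only, not the real DB code; the sample IMEI and column names are illustrative):

```python
from datetime import datetime

# In-memory model of position_history, keyed the way the UPSERT is:
# replaying the same row must leave exactly one stored row.
PositionKey = tuple[str, datetime]  # (imei, gps_time)

def upsert(table: dict[PositionKey, dict], row: dict) -> None:
    table[(row["imei"], row["gps_time"])] = row

store: dict[PositionKey, dict] = {}
row = {
    "imei": "860000000000001",
    "gps_time": datetime(2026, 4, 12, 21, 31, 56),
    "lat": -1.29,
    "lon": 36.82,
}
upsert(store, row)
upsert(store, row)  # replaying the same poll must not duplicate
assert len(store) == 1
```

The real tests assert the same thing against TimescaleDB: polling twice leaves the row count unchanged.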
### Workstream 3 — DB Audit
**Runner:** `db_audit/run_audit.py` (Python)
**Trigger:** Daily at 06:00 EAT (03:00 UTC) via scheduled Forgejo workflow + `workflow_dispatch` for manual runs
**Output:** Rows written to `tracksolid.health_checks` table; queryable from Grafana
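The trigger above maps onto a workflow file roughly like this (Forgejo Actions follows GitHub Actions syntax; step details are a sketch, not the final workflow):

```yaml
# .forgejo/workflows/scheduled-audit.yml (sketch)
name: scheduled-audit
on:
  schedule:
    - cron: "0 3 * * *"    # 03:00 UTC == 06:00 EAT
  workflow_dispatch: {}     # allow manual runs
jobs:
  audit:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: python db_audit/run_audit.py
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
```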
Six health checks:
| Check | File | Critical | Warning |
|---|---|---|---|
| Stale devices | `stale_devices.sql` | — | Any enabled device with no GPS fix >2h |
| NULL integrity | `null_integrity.sql` | Any NULL imei or gps_time in telemetry tables | — |
| Distance outliers | `distance_outliers.sql` | — | Any trip >500 km or with negative distance in last 7 days |
| Duplicate positions | `duplicate_positions.sql` | Any (imei, gps_time) duplicate in position_history | — |
| Data gaps | `data_gaps.sql` | — | Any enabled device with no data in 7 days |
| Enum drift | `enum_drift.sql` | — | Unexpected value in source/severity columns |
Exit code: `1` on any `critical`, `0` on `ok`/`warning`.
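That exit-code rule reduces to a small aggregation function (a sketch; the real `run_audit.py` also writes each finding to `tracksolid.health_checks`, which is omitted here):

```python
def audit_exit_code(severities: list[str]) -> int:
    """Exit 1 if any check reported critical; ok/warning still exit 0."""
    return 1 if "critical" in severities else 0

# One warning among otherwise-ok checks still exits 0 (visible in
# Grafana, but does not page anyone or fail the scheduled job).
assert audit_exit_code(["ok", "warning", "ok", "ok", "ok", "ok"]) == 0
assert audit_exit_code(["ok", "critical"]) == 1
```

Keeping warnings non-fatal means the scheduled workflow only goes red on conditions that demand same-day action.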
---
## File Layout
```
55_ts_coolify_gemini_prod/
├── pyproject.toml          ← ADD: ruff + mypy + pytest config + dev deps
├── .forgejo/
│   └── workflows/
│       ├── ci-static.yml
│       ├── ci-tests.yml
│       └── scheduled-audit.yml
├── tests/
│   ├── conftest.py
│   ├── fixtures/
│   │   ├── api_responses.py
│   │   └── schema.sql
│   ├── unit/
│   │   ├── test_clean_helpers.py
│   │   ├── test_api_signing.py
│   │   └── test_field_mapping.py
│   └── integration/
│       ├── test_movement_pipeline.py
│       ├── test_events_pipeline.py
│       └── test_webhook_endpoints.py
└── db_audit/
    ├── run_audit.py
    ├── checks/
    │   ├── stale_devices.sql
    │   ├── null_integrity.sql
    │   ├── distance_outliers.sql
    │   ├── duplicate_positions.sql
    │   ├── data_gaps.sql
    │   └── enum_drift.sql
    └── schema/
        └── health_checks_table.sql
```
---
## Forgejo Runner Setup
Before CI can run, a self-hosted runner must be registered on the Coolify server:
1. Forgejo → Settings → Actions → Runners → Register Runner → copy token
2. On Coolify server: `docker run -d --name forgejo-runner gitea/act_runner:latest register --instance https://repo.rahamafresh.com --token <TOKEN> --name coolify-runner --labels self-hosted`
3. Verify runner appears as active in Forgejo
Required Forgejo secrets:
- `DATABASE_URL` — production DB connection string (for scheduled audit)
- `TEST_DATABASE_URL` — set automatically by CI service container
---
## Verification
| Workstream | Pass Criteria |
|---|---|
| Static Analysis | Push triggers CI-static; ruff + mypy produce output report; job exits non-zero on violations |
| Test Suite | Push triggers CI-tests; all unit tests pass; integration tests pass against service container DB |
| DB Audit | Manual run populates `health_checks` table; findings match known issues (44 silent devices, etc.); scheduled run fires at 06:00 EAT |