docs: add quality program design spec

David Kiania 2026-04-12 21:31:56 +03:00
parent f9834564ab
commit 75d3417a2b


# Bug Reduction Quality Program — Design Spec
**Date:** 2026-04-12
**Project:** Fireside Communications Fleet Telemetry Ingestion Platform
**Repo:** `55_ts_coolify_gemini_prod`
**Status:** Approved — Implementation in Progress
## Problem
The platform has been running in production since late 2025 ingesting GPS and telemetry data from ~63 fleet vehicles. All bugs discovered to date (FIX-M11, FIX-M13, FIX-M16, FIX-E06, BUG-01 through BUG-05) were caught manually in production — via data inspection, Grafana anomalies, or customer reports. There are:
- Zero automated tests
- No linting or type-checking configuration
- No CI/CD pipeline
- No programmatic DB health monitoring
Any code change risks silent regressions. Any API field-mapping change risks data silently arriving as NULL. Any schema change risks data corruption that may go unnoticed for days.
## Goal
A layered quality program that:
1. **Finds existing bugs and data issues** without modifying source code
2. **Prevents future regressions** by locking in known-correct behaviour
3. **Monitors production DB health** on a daily schedule
## Constraints
- Existing source files MUST NOT be modified in Phase 1
- All additions are new files only (config, tests, CI workflows, audit scripts)
- Must run in CI (Forgejo Actions, self-hosted runner) and production (scheduled DB audit)
---
## Architecture: Three Parallel Workstreams
### Workstream 1 — Static Analysis
**Tools:** `ruff` (linting) + `mypy` (type checking)
**Trigger:** Every push / pull request via Forgejo Actions
**Risk:** Zero — read-only analysis of existing source
Surfaces:
- Undefined names, unused imports (ruff/F rules)
- Likely bugs: mutable defaults, string formatting issues (ruff/B rules)
- Type errors: untyped returns, Optional not handled (mypy)
- Modern Python upgrade opportunities (ruff/UP rules)
The first run will be noisy — its output becomes the bug backlog.
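A minimal `pyproject.toml` fragment for this workstream might look like the following (the Python version and mypy strictness flags are illustrative assumptions, not final settings; the rule families match the ones listed above):

```toml
[tool.ruff]
target-version = "py311"   # assumed interpreter version

[tool.ruff.lint]
# F = pyflakes (undefined names, unused imports)
# B = bugbear (likely bugs: mutable defaults, etc.)
# UP = pyupgrade (modern-Python rewrites)
select = ["F", "B", "UP"]

[tool.mypy]
python_version = "3.11"
check_untyped_defs = true
warn_return_any = true
no_implicit_optional = true
```

Starting from a small `select` list and widening it later keeps the initial backlog tractable.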
### Workstream 2 — Test Suite
**Framework:** pytest + pytest-asyncio
**Trigger:** Every push / pull request via Forgejo Actions
**Isolation:** Integration tests use a Docker TimescaleDB service container
**Unit tests** (pure Python, no DB):
- `test_clean_helpers.py` — `clean()`, `clean_num()`, `clean_ts()`, `is_valid_fix()` — these gate all data into the DB
- `test_api_signing.py` — `build_sign()` MD5 signature correctness
- `test_field_mapping.py` — locks in the three most bug-prone field mappings:
- FIX-E06: poll alarms use `alertTypeId`/`alarmTypeName`/`alertTime` (not `alarmType`)
- FIX-M16: trip distance arrives in metres, stored as km (÷ 1000)
- BUG-03: BCD timestamps `YYMMDDHHmmss` parsed correctly
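As an illustration of the lock-in style (helper names and signatures here are stand-ins, since the real module API is not shown in this spec), FIX-M16 and BUG-03 tests might look like:

```python
from datetime import datetime

# Hypothetical stand-ins for the real helpers under test.
def metres_to_km(metres: float) -> float:
    """FIX-M16: the API reports trip distance in metres; we store km."""
    return metres / 1000

def parse_bcd_timestamp(raw: str) -> datetime:
    """BUG-03: timestamps arrive as YYMMDDHHmmss digit strings."""
    return datetime.strptime(raw, "%y%m%d%H%M%S")

def test_trip_distance_metres_to_km():
    assert metres_to_km(12500) == 12.5

def test_bcd_timestamp_parsing():
    assert parse_bcd_timestamp("260412213156") == datetime(2026, 4, 12, 21, 31, 56)
```

The point is that each historical fix becomes an executable assertion, so a regression fails CI instead of silently corrupting data.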
**Integration tests** (real TimescaleDB):
- `test_movement_pipeline.py` — `poll_live_positions()` full round-trip, UPSERT idempotency
- `test_events_pipeline.py` — `poll_alarms()` field mapping, NULL alarm_type rejection
- `test_webhook_endpoints.py` — FastAPI endpoints with mock Jimi payloads, SAVEPOINT isolation
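The UPSERT-idempotency property those integration tests assert can be modelled in plain Python (a sketch of the invariant only, not the real DB code; the sample IMEI and column names are illustrative):

```python
from datetime import datetime

# In-memory model of position_history, keyed the way the UPSERT is:
# replaying the same row must leave exactly one stored row.
PositionKey = tuple[str, datetime]  # (imei, gps_time)

def upsert(table: dict[PositionKey, dict], row: dict) -> None:
    table[(row["imei"], row["gps_time"])] = row

store: dict[PositionKey, dict] = {}
row = {
    "imei": "860000000000001",
    "gps_time": datetime(2026, 4, 12, 21, 31, 56),
    "lat": -1.29,
    "lon": 36.82,
}
upsert(store, row)
upsert(store, row)  # replaying the same poll must not duplicate
assert len(store) == 1
```

The real tests assert the same thing against TimescaleDB: polling twice leaves the row count unchanged.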
### Workstream 3 — DB Audit
**Runner:** `db_audit/run_audit.py` (Python)
**Trigger:** Daily at 06:00 EAT (03:00 UTC) via scheduled Forgejo workflow + `workflow_dispatch` for manual runs
**Output:** Rows written to `tracksolid.health_checks` table; queryable from Grafana
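The trigger above maps onto a workflow file roughly like this (Forgejo Actions follows GitHub Actions syntax; step details are a sketch, not the final workflow):

```yaml
# .forgejo/workflows/scheduled-audit.yml (sketch)
name: scheduled-audit
on:
  schedule:
    - cron: "0 3 * * *"    # 03:00 UTC == 06:00 EAT
  workflow_dispatch: {}     # allow manual runs
jobs:
  audit:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: python db_audit/run_audit.py
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
```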
Six health checks:
| Check | File | Critical | Warning |
|---|---|---|---|
| Stale devices | `stale_devices.sql` | — | Any enabled device with no GPS fix >2h |
| NULL integrity | `null_integrity.sql` | Any NULL imei or gps_time in telemetry tables | — |
| Distance outliers | `distance_outliers.sql` | — | Any trip >500 km or with negative distance in last 7 days |
| Duplicate positions | `duplicate_positions.sql` | Any (imei, gps_time) duplicate in position_history | — |
| Data gaps | `data_gaps.sql` | — | Any enabled device with no data in 7 days |
| Enum drift | `enum_drift.sql` | — | Unexpected value in source/severity columns |
Exit code: `1` on any `critical`, `0` on `ok`/`warning`.
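That exit-code rule reduces to a small aggregation function (a sketch; the real `run_audit.py` also writes each finding to `tracksolid.health_checks`, which is omitted here):

```python
def audit_exit_code(severities: list[str]) -> int:
    """Exit 1 if any check reported critical; ok/warning still exit 0."""
    return 1 if "critical" in severities else 0

# One warning among otherwise-ok checks still exits 0 (visible in
# Grafana, but does not page anyone or fail the scheduled job).
assert audit_exit_code(["ok", "warning", "ok", "ok", "ok", "ok"]) == 0
assert audit_exit_code(["ok", "critical"]) == 1
```

Keeping warnings non-fatal means the scheduled workflow only goes red on conditions that demand same-day action.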
---
## File Layout
```
55_ts_coolify_gemini_prod/
├── pyproject.toml          ← ADD: ruff + mypy + pytest config + dev deps
├── .forgejo/
│   └── workflows/
│       ├── ci-static.yml
│       ├── ci-tests.yml
│       └── scheduled-audit.yml
├── tests/
│   ├── conftest.py
│   ├── fixtures/
│   │   ├── api_responses.py
│   │   └── schema.sql
│   ├── unit/
│   │   ├── test_clean_helpers.py
│   │   ├── test_api_signing.py
│   │   └── test_field_mapping.py
│   └── integration/
│       ├── test_movement_pipeline.py
│       ├── test_events_pipeline.py
│       └── test_webhook_endpoints.py
└── db_audit/
    ├── run_audit.py
    ├── checks/
    │   ├── stale_devices.sql
    │   ├── null_integrity.sql
    │   ├── distance_outliers.sql
    │   ├── duplicate_positions.sql
    │   ├── data_gaps.sql
    │   └── enum_drift.sql
    └── schema/
        └── health_checks_table.sql
```
---
## Forgejo Runner Setup
Before CI can run, a self-hosted runner must be registered on the Coolify server:
1. Forgejo → Settings → Actions → Runners → Register Runner → copy token
2. On Coolify server: `docker run -d --name forgejo-runner gitea/act_runner:latest register --instance https://repo.rahamafresh.com --token <TOKEN> --name coolify-runner --labels self-hosted`
3. Verify runner appears as active in Forgejo
Required Forgejo secrets:
- `DATABASE_URL` — production DB connection string (for scheduled audit)
- `TEST_DATABASE_URL` — set automatically by CI service container
---
## Verification
| Workstream | Pass Criteria |
|---|---|
| Static Analysis | Push triggers CI-static; ruff + mypy produce output report; job exits non-zero on violations |
| Test Suite | Push triggers CI-tests; all unit tests pass; integration tests pass against service container DB |
| DB Audit | Manual run populates `health_checks` table; findings match known issues (44 silent devices, etc.); scheduled run fires at 06:00 EAT |