Bug Reduction Quality Program — Design Spec
Date: 2026-04-12
Project: Fireside Communications Fleet Telemetry Ingestion Platform
Repo: 55_ts_coolify_gemini_prod
Status: Approved — Implementation in Progress
Problem
The platform has been running in production since late 2025 ingesting GPS and telemetry data from ~63 fleet vehicles. All bugs discovered to date (FIX-M11, FIX-M13, FIX-M16, FIX-E06, BUG-01 through BUG-05) were caught manually in production — via data inspection, Grafana anomalies, or customer reports. There are:
- Zero automated tests
- No linting or type-checking configuration
- No CI/CD pipeline
- No programmatic DB health monitoring
Any code change risks silent regressions. Any API field mapping change risks data going silently to NULL. Any schema change risks data corruption that may not be noticed for days.
Goal
A layered quality program that:
- Finds existing bugs and data issues without modifying source code
- Prevents future regressions by locking in known-correct behaviour
- Monitors production DB health on a daily schedule
Constraints
- Existing source files MUST NOT be modified in Phase 1
- All additions are new files only (config, tests, CI workflows, audit scripts)
- Must run both in CI (Forgejo Actions, self-hosted runner) and in production (scheduled DB audit)
Architecture: Three Parallel Workstreams
Workstream 1 — Static Analysis
Tools: ruff (linting) + mypy (type checking)
Trigger: Every push / pull request via Forgejo Actions
Risk: Zero — read-only analysis of existing source
Surfaces:
- Undefined names, unused imports (ruff/F rules)
- Likely bugs: mutable defaults, string formatting issues (ruff/B rules)
- Type errors: untyped returns, Optional not handled (mypy)
- Modern Python upgrade opportunities (ruff/UP rules)
The first run will be noisy; its output becomes the bug backlog.
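To make the configuration concrete, a minimal sketch of the pyproject.toml additions follows. The rule families (F, B, UP) come from the list above; the interpreter version, line length, and mypy strictness flags are assumptions to be tuned against that first noisy run.

```toml
# Sketch of the pyproject.toml additions. Rule families match the spec
# (F, B, UP); the interpreter version and strictness flags are assumptions.
[tool.ruff]
target-version = "py311"   # assumed interpreter version
line-length = 100

[tool.ruff.lint]
select = ["F", "B", "UP"]  # undefined names, likely bugs, modernisation

[tool.mypy]
python_version = "3.11"
check_untyped_defs = true    # surface errors inside untyped functions too
no_implicit_optional = true  # force explicit Optional handling

[tool.pytest.ini_options]
testpaths = ["tests"]
asyncio_mode = "auto"        # pytest-asyncio: collect async tests automatically
```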
Workstream 2 — Test Suite
Framework: pytest + pytest-asyncio
Trigger: Every push / pull request via Forgejo Actions
Isolation: Integration tests use a Docker TimescaleDB service container
Unit tests (pure Python, no DB):
- test_clean_helpers.py — clean(), clean_num(), clean_ts(), is_valid_fix(); these gate all data into the DB
- test_api_signing.py — build_sign() MD5 signature correctness
- test_field_mapping.py — locks in the three most bug-prone field mappings (see the sketch after this list):
  - FIX-E06: poll alarms use alertTypeId / alarmTypeName / alertTime (not alarmType)
  - FIX-M16: trip distance arrives in metres, stored as km (÷ 1000)
  - BUG-03: BCD timestamps (YYMMDDHHmmss) parsed correctly
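To show the intended shape of the lock-in tests, here is a hypothetical sketch of test_field_mapping.py. The FIX-E06 key names, the metres-to-km rule, and the BCD format come from this spec; map_poll_alarm() is an inline stand-in for the real mapping code, which the actual tests would import.

```python
# Hypothetical sketch of tests/unit/test_field_mapping.py. map_poll_alarm()
# is a stand-in; the real suite imports the mapping function from the repo.
from datetime import datetime


def map_poll_alarm(raw: dict) -> dict:
    """Stand-in for the real mapping code (FIX-E06 key names from the spec)."""
    return {
        "alarm_type_id": raw["alertTypeId"],   # not raw["alarmType"]
        "alarm_type": raw["alarmTypeName"],
        "alarm_time": raw["alertTime"],
    }


def test_poll_alarm_uses_alert_keys():
    raw = {"alertTypeId": 7, "alarmTypeName": "SOS", "alertTime": "2026-04-12 06:30:00"}
    assert map_poll_alarm(raw)["alarm_type"] == "SOS"


def test_trip_distance_metres_to_km():
    # FIX-M16: the API reports metres; the DB stores km.
    assert 12_500 / 1000 == 12.5


def test_bcd_timestamp_parse():
    # BUG-03: device timestamps are BCD YYMMDDHHmmss.
    assert datetime.strptime("260412063000", "%y%m%d%H%M%S") == datetime(2026, 4, 12, 6, 30)
```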
Integration tests (real TimescaleDB):
- test_movement_pipeline.py — poll_live_positions() full round-trip, UPSERT idempotency
- test_events_pipeline.py — poll_alarms() field mapping, NULL alarm_type rejection
- test_webhook_endpoints.py — FastAPI endpoints with mock Jimi payloads, SAVEPOINT isolation (see the conftest sketch below)
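One way the SAVEPOINT isolation could be wired up in tests/conftest.py is sketched below. The psycopg 3 driver and the fixture names are assumptions; the repo's actual driver may differ.

```python
# Sketch of tests/conftest.py (psycopg 3 assumed). Each test runs inside a
# SAVEPOINT that is rolled back afterwards, so the service-container DB
# stays clean between tests without re-creating the schema.
import os

import psycopg
import pytest


@pytest.fixture(scope="session")
def db_conn():
    # TEST_DATABASE_URL points at the CI TimescaleDB service container.
    conn = psycopg.connect(os.environ["TEST_DATABASE_URL"])
    yield conn
    conn.rollback()
    conn.close()


@pytest.fixture()
def db(db_conn):
    with db_conn.cursor() as cur:
        cur.execute("SAVEPOINT test_case")
        yield cur
        cur.execute("ROLLBACK TO SAVEPOINT test_case")
```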
Workstream 3 — DB Audit
Runner: db_audit/run_audit.py (Python)
Trigger: Daily at 06:00 EAT (03:00 UTC) via scheduled Forgejo workflow + workflow_dispatch for manual runs
Output: Rows written to tracksolid.health_checks table; queryable from Grafana
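A possible shape for db_audit/schema/health_checks_table.sql is sketched below; the table name comes from this spec, while the column set is an assumption that the Grafana queries would be built against.

```sql
-- Sketch of db_audit/schema/health_checks_table.sql. The table name is
-- from the spec; the columns are assumptions.
CREATE TABLE IF NOT EXISTS tracksolid.health_checks (
    id            BIGSERIAL   PRIMARY KEY,
    run_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    check_name    TEXT        NOT NULL,  -- e.g. 'stale_devices'
    status        TEXT        NOT NULL,  -- 'ok' | 'warning' | 'critical'
    finding_count INTEGER     NOT NULL DEFAULT 0,
    details       JSONB                  -- optional sample of offending rows
);
```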
Six health checks:
| Check | File | Critical | Warning |
|---|---|---|---|
| Stale devices | stale_devices.sql | — | Any enabled device with no GPS fix >2 h |
| NULL integrity | null_integrity.sql | Any NULL imei or gps_time in telemetry tables | — |
| Distance outliers | distance_outliers.sql | — | Any trip >500 km or <0 km in the last 7 days |
| Duplicate positions | duplicate_positions.sql | Any (imei, gps_time) duplicate in position_history | — |
| Data gaps | data_gaps.sql | — | Any enabled device with no data in 7 days |
| Enum drift | enum_drift.sql | — | Unexpected value in source/severity columns |
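Each check is a standalone SQL file that returns the offending rows (an empty result means the check passes). As one example, stale_devices.sql might look like the following; the devices table and its enabled column are assumptions about the repo's schema, while position_history and the 2-hour threshold come from this spec.

```sql
-- Sketch of db_audit/checks/stale_devices.sql. devices/enabled are assumed
-- names; position_history and the 2 h threshold are from the spec.
SELECT d.imei,
       max(p.gps_time) AS last_fix
FROM devices d
LEFT JOIN position_history p USING (imei)
WHERE d.enabled
GROUP BY d.imei
HAVING max(p.gps_time) IS NULL
    OR max(p.gps_time) < now() - interval '2 hours';
```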
Exit code: 1 on any critical, 0 on ok/warning.
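A sketch of how run_audit.py could tie this together follows; the psycopg driver and the health_checks column names (matching the schema sketch above) are assumptions.

```python
# Sketch of db_audit/run_audit.py (psycopg 3 assumed). Runs every .sql file
# in checks/, records one health_checks row per check, and exits 1 if any
# critical check found rows.
import os
import sys
from pathlib import Path

import psycopg

# Severity per check, mirroring the table above.
SEVERITY = {
    "null_integrity": "critical",
    "duplicate_positions": "critical",
    "stale_devices": "warning",
    "distance_outliers": "warning",
    "data_gaps": "warning",
    "enum_drift": "warning",
}


def main() -> int:
    checks_dir = Path(__file__).parent / "checks"
    exit_code = 0
    # The with-block commits on success and rolls back on error.
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
        for sql_file in sorted(checks_dir.glob("*.sql")):
            name = sql_file.stem
            rows = conn.execute(sql_file.read_text()).fetchall()
            status = SEVERITY.get(name, "warning") if rows else "ok"
            conn.execute(
                "INSERT INTO tracksolid.health_checks"
                " (check_name, status, finding_count) VALUES (%s, %s, %s)",
                (name, status, len(rows)),
            )
            if status == "critical":
                exit_code = 1
    return exit_code


if __name__ == "__main__":
    sys.exit(main())
```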
File Layout
55_ts_coolify_gemini_prod/
├── pyproject.toml ← ADD: ruff + mypy + pytest config + dev deps
├── .forgejo/
│ └── workflows/
│ ├── ci-static.yml
│ ├── ci-tests.yml
│ └── scheduled-audit.yml
├── tests/
│ ├── conftest.py
│ ├── fixtures/
│ │ ├── api_responses.py
│ │ └── schema.sql
│ ├── unit/
│ │ ├── test_clean_helpers.py
│ │ ├── test_api_signing.py
│ │ └── test_field_mapping.py
│ └── integration/
│ ├── test_movement_pipeline.py
│ ├── test_events_pipeline.py
│ └── test_webhook_endpoints.py
└── db_audit/
├── run_audit.py
├── checks/
│ ├── stale_devices.sql
│ ├── null_integrity.sql
│ ├── distance_outliers.sql
│ ├── duplicate_positions.sql
│ ├── data_gaps.sql
│ └── enum_drift.sql
└── schema/
└── health_checks_table.sql
Forgejo Runner Setup
Before CI can run, a self-hosted runner must be registered on the Coolify server:
- Forgejo → Settings → Actions → Runners → Register Runner → copy token
- On Coolify server:
  docker run -d --name forgejo-runner gitea/act_runner:latest register --instance https://repo.rahamafresh.com --token <TOKEN> --name coolify-runner --labels self-hosted
- Verify the runner appears as active in Forgejo
Required Forgejo secrets:
- DATABASE_URL — production DB connection string (for the scheduled audit)
- TEST_DATABASE_URL — set automatically by the CI service container
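Forgejo Actions uses GitHub-compatible workflow syntax, so scheduled-audit.yml might look like the sketch below; the action versions and the dependency install step are assumptions.

```yaml
# Sketch of .forgejo/workflows/scheduled-audit.yml. 03:00 UTC == 06:00 EAT;
# action versions and the install step are assumptions.
name: scheduled-audit
on:
  schedule:
    - cron: "0 3 * * *"
  workflow_dispatch: {}

jobs:
  audit:
    runs-on: self-hosted   # matches the runner label registered above
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install "psycopg[binary]"
      - run: python db_audit/run_audit.py
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
```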
Verification
| Workstream | Pass Criteria |
|---|---|
| Static Analysis | Push triggers CI-static; ruff + mypy produce output report; job exits non-zero on violations |
| Test Suite | Push triggers CI-tests; all unit tests pass; integration tests pass against service container DB |
| DB Audit | Manual run populates health_checks table; findings match known issues (44 silent devices, etc.); scheduled run fires at 06:00 EAT |