
Bug Reduction Quality Program — Design Spec

Date: 2026-04-12
Project: Fireside Communications Fleet Telemetry Ingestion Platform
Repo: 55_ts_coolify_gemini_prod
Status: Approved — Implementation in Progress

Problem

The platform has been running in production since late 2025, ingesting GPS and telemetry data from ~63 fleet vehicles. All bugs discovered to date (FIX-M11, FIX-M13, FIX-M16, FIX-E06, BUG-01 through BUG-05) were caught manually in production — via data inspection, Grafana anomalies, or customer reports. There are:

  • Zero automated tests
  • No linting or type-checking configuration
  • No CI/CD pipeline
  • No programmatic DB health monitoring

Any code change risks silent regressions. Any API field-mapping change risks data silently arriving as NULL. Any schema change risks data corruption that may go unnoticed for days.

Goal

A layered quality program that:

  1. Finds existing bugs and data issues without modifying source code
  2. Prevents future regressions by locking in known-correct behaviour
  3. Monitors production DB health on a daily schedule

Constraints

  • Existing source files MUST NOT be modified in Phase 1
  • All additions are new files only (config, tests, CI workflows, audit scripts)
  • Must run in CI (Forgejo Actions, self-hosted runner) and production (scheduled DB audit)

Architecture: Three Parallel Workstreams

Workstream 1 — Static Analysis

Tools: ruff (linting) + mypy (type checking)
Trigger: Every push / pull request via Forgejo Actions
Risk: Zero — read-only analysis of existing source

Surfaces:

  • Undefined names, unused imports (ruff/F rules)
  • Likely bugs: mutable defaults, string formatting issues (ruff/B rules)
  • Type errors: untyped returns, Optional not handled (mypy)
  • Modern Python upgrade opportunities (ruff/UP rules)

The first run will be noisy — its output becomes the initial bug backlog.
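
The pyproject.toml additions might look like the following minimal sketch; the Python version, rule selection, and strictness flags here are starting points, not part of the approved spec:

```toml
[tool.ruff]
target-version = "py312"        # assumed interpreter version

[tool.ruff.lint]
# F = pyflakes (undefined names, unused imports), B = bugbear (likely bugs),
# UP = pyupgrade (modernisation); extend after reviewing the first-run backlog.
select = ["F", "B", "UP"]

[tool.mypy]
python_version = "3.12"
# Start permissive on an untyped codebase; ratchet strictness per module.
check_untyped_defs = true
warn_return_any = true

[tool.pytest.ini_options]
testpaths = ["tests"]
asyncio_mode = "auto"
```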

Workstream 2 — Test Suite

Framework: pytest + pytest-asyncio
Trigger: Every push / pull request via Forgejo Actions
Isolation: Integration tests use a Docker TimescaleDB service container
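
For illustration, the service-container wiring in ci-tests.yml could look like this (Forgejo Actions accepts GitHub-Actions-compatible syntax; the image tag, credentials, and action versions are assumptions):

```yaml
name: ci-tests
on: [push, pull_request]

jobs:
  tests:
    runs-on: self-hosted
    services:
      timescaledb:
        image: timescale/timescaledb:latest-pg16   # pin a specific tag in practice
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
    env:
      # Points at the service container by hostname; this is how
      # TEST_DATABASE_URL gets set automatically in CI.
      TEST_DATABASE_URL: postgres://test:test@timescaledb:5432/test
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e ".[dev]"   # assumes a [dev] extra in pyproject.toml
      - run: pytest
```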

Unit tests (pure Python, no DB):

  • test_clean_helpers.py — clean(), clean_num(), clean_ts(), is_valid_fix() — these gate all data into the DB
  • test_api_signing.py — build_sign() MD5 signature correctness
  • test_field_mapping.py — locks in the three most bug-prone field mappings (see the sketch after this list):
    • FIX-E06: poll alarms use alertTypeId/alarmTypeName/alertTime (not alarmType)
    • FIX-M16: trip distance arrives in metres, stored as km (÷ 1000)
    • BUG-03: BCD timestamps YYMMDDHHmmss parsed correctly
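
A sketch of one such lock-in test; the import path and helper names are assumptions about the real module layout:

```python
# tests/unit/test_field_mapping.py (sketch). parse_bcd_timestamp and
# trip_distance_km are assumed helper names; substitute the real ones.
from datetime import datetime

from app.mapping import parse_bcd_timestamp, trip_distance_km  # assumed path


def test_fix_m16_metres_stored_as_km():
    # FIX-M16: the API reports distance in metres; the DB column is km.
    assert trip_distance_km(12500) == 12.5


def test_bug_03_bcd_timestamp_parsed():
    # BUG-03: BCD timestamps arrive as YYMMDDHHmmss strings.
    assert parse_bcd_timestamp("260412213156") == datetime(2026, 4, 12, 21, 31, 56)
```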

Integration tests (real TimescaleDB):

  • test_movement_pipeline.py — poll_live_positions() full round-trip, UPSERT idempotency
  • test_events_pipeline.py — poll_alarms() field mapping, NULL alarm_type rejection
  • test_webhook_endpoints.py — FastAPI endpoints with mock Jimi payloads, SAVEPOINT isolation (see the fixture sketch after this list)
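
The per-test isolation could be implemented with a transaction-scoped fixture along these lines (a sketch assuming asyncpg; the driver and fixture naming are not prescribed by this spec):

```python
# tests/conftest.py (sketch): each test runs inside a transaction that is
# rolled back afterwards, so nothing persists in the service-container DB.
import os

import asyncpg
import pytest_asyncio


@pytest_asyncio.fixture
async def db():
    conn = await asyncpg.connect(os.environ["TEST_DATABASE_URL"])
    tx = conn.transaction()
    await tx.start()
    try:
        yield conn  # tests issue queries on this connection
    finally:
        await tx.rollback()  # discard all test writes
        await conn.close()
```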

Workstream 3 — DB Audit

Runner: db_audit/run_audit.py (Python)
Trigger: Daily at 06:00 EAT (03:00 UTC) via scheduled Forgejo workflow + workflow_dispatch for manual runs
Output: Rows written to tracksolid.health_checks table; queryable from Grafana

Six health checks:

| Check | File | Critical / warning condition |
| --- | --- | --- |
| Stale devices | stale_devices.sql | Any enabled device with no GPS fix >2h |
| NULL integrity | null_integrity.sql | Any NULL imei or gps_time in telemetry tables |
| Distance outliers | distance_outliers.sql | Any trip >500km or <0km in last 7 days |
| Duplicate positions | duplicate_positions.sql | Any (imei, gps_time) duplicate in position_history |
| Data gaps | data_gaps.sql | Any enabled device with no data in 7 days |
| Enum drift | enum_drift.sql | Unexpected value in source/severity columns |
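
As a concrete example, stale_devices.sql might read as follows; position_history, imei, and gps_time come from this spec, while the devices table and its enabled column are assumptions:

```sql
-- checks/stale_devices.sql (sketch): enabled devices with no GPS fix in >2h.
SELECT d.imei, last.last_fix
FROM tracksolid.devices AS d            -- assumed table name
LEFT JOIN LATERAL (
    SELECT max(p.gps_time) AS last_fix
    FROM tracksolid.position_history AS p
    WHERE p.imei = d.imei
) AS last ON true
WHERE d.enabled                          -- assumed column
  AND (last.last_fix IS NULL OR last.last_fix < now() - interval '2 hours');
```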

Exit code: 1 on any critical, 0 on ok/warning.
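
The runner itself can stay small. A sketch of db_audit/run_audit.py, assuming psycopg (v3) and a health_checks table with check_name/status/detail/checked_at columns (the real schema lives in schema/health_checks_table.sql):

```python
# db_audit/run_audit.py (sketch): run every checks/*.sql file, record one
# health_checks row per check, and exit 1 if anything critical turned up.
import os
import pathlib
import sys

import psycopg

CHECKS_DIR = pathlib.Path(__file__).parent / "checks"


def main() -> int:
    exit_code = 0
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
        for sql_file in sorted(CHECKS_DIR.glob("*.sql")):
            rows = conn.execute(sql_file.read_text()).fetchall()
            # Per-check critical vs. warning mapping is simplified here;
            # the real runner must distinguish the two severities.
            status = "critical" if rows else "ok"
            conn.execute(
                "INSERT INTO tracksolid.health_checks"
                " (check_name, status, detail, checked_at)"
                " VALUES (%s, %s, %s, now())",
                (sql_file.stem, status, f"{len(rows)} offending rows"),
            )
            if status == "critical":
                exit_code = 1
    return exit_code


if __name__ == "__main__":
    sys.exit(main())
```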


File Layout

55_ts_coolify_gemini_prod/
├── pyproject.toml              ← ADD: ruff + mypy + pytest config + dev deps
├── .forgejo/
│   └── workflows/
│       ├── ci-static.yml
│       ├── ci-tests.yml
│       └── scheduled-audit.yml
├── tests/
│   ├── conftest.py
│   ├── fixtures/
│   │   ├── api_responses.py
│   │   └── schema.sql
│   ├── unit/
│   │   ├── test_clean_helpers.py
│   │   ├── test_api_signing.py
│   │   └── test_field_mapping.py
│   └── integration/
│       ├── test_movement_pipeline.py
│       ├── test_events_pipeline.py
│       └── test_webhook_endpoints.py
└── db_audit/
    ├── run_audit.py
    ├── checks/
    │   ├── stale_devices.sql
    │   ├── null_integrity.sql
    │   ├── distance_outliers.sql
    │   ├── duplicate_positions.sql
    │   ├── data_gaps.sql
    │   └── enum_drift.sql
    └── schema/
        └── health_checks_table.sql

Forgejo Runner Setup

Before CI can run, a self-hosted runner must be registered on the Coolify server:

  1. Forgejo → Settings → Actions → Runners → Register Runner → copy token
  2. On Coolify server: docker run -d --name forgejo-runner gitea/act_runner:latest register --instance https://repo.rahamafresh.com --token <TOKEN> --name coolify-runner --labels self-hosted
  3. Verify runner appears as active in Forgejo

Required Forgejo secrets:

  • DATABASE_URL — production DB connection string (for scheduled audit)
  • TEST_DATABASE_URL — set automatically by CI service container
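
With those secrets in place, scheduled-audit.yml reduces to a cron trigger plus a manual escape hatch; a sketch, with action versions assumed:

```yaml
name: scheduled-audit
on:
  schedule:
    - cron: "0 3 * * *"   # 03:00 UTC = 06:00 EAT
  workflow_dispatch: {}    # manual runs

jobs:
  audit:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: pip install psycopg   # driver choice is an assumption
      - run: python db_audit/run_audit.py
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
```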

Verification

| Workstream | Pass criteria |
| --- | --- |
| Static Analysis | Push triggers ci-static; ruff + mypy produce an output report; job exits non-zero on violations |
| Test Suite | Push triggers ci-tests; all unit tests pass; integration tests pass against the service-container DB |
| DB Audit | Manual run populates the health_checks table; findings match known issues (44 silent devices, etc.); scheduled run fires at 06:00 EAT |