
Bug Reduction Quality Program — Design Spec

Date: 2026-04-12
Project: Fireside Communications Fleet Telemetry Ingestion Platform
Repo: 55_ts_coolify_gemini_prod
Status: Approved — Implementation in Progress

Problem

The platform has been running in production since late 2025, ingesting GPS and telemetry data from ~63 fleet vehicles. All bugs discovered to date (FIX-M11, FIX-M13, FIX-M16, FIX-E06, BUG-01 through BUG-05) were caught manually in production — via data inspection, Grafana anomalies, or customer reports. There are:

  • Zero automated tests
  • No linting or type-checking configuration
  • No CI/CD pipeline
  • No programmatic DB health monitoring

Any code change risks silent regressions. Any API field-mapping change risks data silently arriving as NULL. Any schema change risks data corruption that may go unnoticed for days.

Goal

A layered quality program that:

  1. Finds existing bugs and data issues without modifying source code
  2. Prevents future regressions by locking in known-correct behaviour
  3. Monitors production DB health on a daily schedule

Constraints

  • Existing source files MUST NOT be modified in Phase 1
  • All additions are new files only (config, tests, CI workflows, audit scripts)
  • Must run in CI (Forgejo Actions, self-hosted runner) and production (scheduled DB audit)

Architecture: Three Parallel Workstreams

Workstream 1 — Static Analysis

Tools: ruff (linting) + mypy (type checking)
Trigger: Every push / pull request via Forgejo Actions
Risk: Zero — read-only analysis of existing source

Surfaces:

  • Undefined names, unused imports (ruff/F rules)
  • Likely bugs: mutable defaults, string formatting issues (ruff/B rules)
  • Type errors: untyped returns, Optional not handled (mypy)
  • Modern Python upgrade opportunities (ruff/UP rules)

The first run will be noisy — its output becomes the initial bug backlog.
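
The pyproject.toml additions might look like the following minimal sketch; the Python version, rule selection, and strictness flags here are starting points, not part of the approved spec:

```toml
[tool.ruff]
target-version = "py312"        # assumed interpreter version

[tool.ruff.lint]
# F = pyflakes (undefined names, unused imports), B = bugbear (likely bugs),
# UP = pyupgrade (modernisation); extend after reviewing the first-run backlog.
select = ["F", "B", "UP"]

[tool.mypy]
python_version = "3.12"
# Start permissive on an untyped codebase; ratchet strictness per module.
check_untyped_defs = true
warn_return_any = true

[tool.pytest.ini_options]
testpaths = ["tests"]
asyncio_mode = "auto"
```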

Workstream 2 — Test Suite

Framework: pytest + pytest-asyncio
Trigger: Every push / pull request via Forgejo Actions
Isolation: Integration tests use a Docker TimescaleDB service container
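
For illustration, the service-container wiring in ci-tests.yml could look like this (Forgejo Actions accepts GitHub-Actions-compatible syntax; the image tag, credentials, and action versions are assumptions):

```yaml
name: ci-tests
on: [push, pull_request]

jobs:
  tests:
    runs-on: self-hosted
    services:
      timescaledb:
        image: timescale/timescaledb:latest-pg16   # pin a specific tag in practice
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
    env:
      # Points at the service container by hostname; this is how
      # TEST_DATABASE_URL gets set automatically in CI.
      TEST_DATABASE_URL: postgres://test:test@timescaledb:5432/test
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e ".[dev]"   # assumes a [dev] extra in pyproject.toml
      - run: pytest
```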

Unit tests (pure Python, no DB):

  • test_clean_helpers.py — clean(), clean_num(), clean_ts(), is_valid_fix() — these gate all data into the DB
  • test_api_signing.py — build_sign() MD5 signature correctness
  • test_field_mapping.py — locks in the three most bug-prone field mappings (see the sketch after this list):
    • FIX-E06: poll alarms use alertTypeId/alarmTypeName/alertTime (not alarmType)
    • FIX-M16: trip distance arrives in metres, stored as km (÷ 1000)
    • BUG-03: BCD timestamps YYMMDDHHmmss parsed correctly
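
A sketch of one such lock-in test; the import path and helper names are assumptions about the real module layout:

```python
# tests/unit/test_field_mapping.py (sketch). parse_bcd_timestamp and
# trip_distance_km are assumed helper names; substitute the real ones.
from datetime import datetime

from app.mapping import parse_bcd_timestamp, trip_distance_km  # assumed path


def test_fix_m16_metres_stored_as_km():
    # FIX-M16: the API reports distance in metres; the DB column is km.
    assert trip_distance_km(12500) == 12.5


def test_bug_03_bcd_timestamp_parsed():
    # BUG-03: BCD timestamps arrive as YYMMDDHHmmss strings.
    assert parse_bcd_timestamp("260412213156") == datetime(2026, 4, 12, 21, 31, 56)
```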

Integration tests (real TimescaleDB):

  • test_movement_pipeline.py — poll_live_positions() full round-trip, UPSERT idempotency
  • test_events_pipeline.py — poll_alarms() field mapping, NULL alarm_type rejection
  • test_webhook_endpoints.py — FastAPI endpoints with mock Jimi payloads, SAVEPOINT isolation (see the fixture sketch after this list)
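
The per-test isolation could be implemented with a transaction-scoped fixture along these lines (a sketch assuming asyncpg; the driver and fixture naming are not prescribed by this spec):

```python
# tests/conftest.py (sketch): each test runs inside a transaction that is
# rolled back afterwards, so nothing persists in the service-container DB.
import os

import asyncpg
import pytest_asyncio


@pytest_asyncio.fixture
async def db():
    conn = await asyncpg.connect(os.environ["TEST_DATABASE_URL"])
    tx = conn.transaction()
    await tx.start()
    try:
        yield conn  # tests issue queries on this connection
    finally:
        await tx.rollback()  # discard all test writes
        await conn.close()
```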

Workstream 3 — DB Audit

Runner: db_audit/run_audit.py (Python)
Trigger: Daily at 06:00 EAT (03:00 UTC) via scheduled Forgejo workflow + workflow_dispatch for manual runs
Output: Rows written to tracksolid.health_checks table; queryable from Grafana

Six health checks:

| Check | File | Critical / warning condition |
| --- | --- | --- |
| Stale devices | stale_devices.sql | Any enabled device with no GPS fix >2h |
| NULL integrity | null_integrity.sql | Any NULL imei or gps_time in telemetry tables |
| Distance outliers | distance_outliers.sql | Any trip >500km or <0km in last 7 days |
| Duplicate positions | duplicate_positions.sql | Any (imei, gps_time) duplicate in position_history |
| Data gaps | data_gaps.sql | Any enabled device with no data in 7 days |
| Enum drift | enum_drift.sql | Unexpected value in source/severity columns |
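
As a concrete example, stale_devices.sql might read as follows; position_history, imei, and gps_time come from this spec, while the devices table and its enabled column are assumptions:

```sql
-- checks/stale_devices.sql (sketch): enabled devices with no GPS fix in >2h.
SELECT d.imei, last.last_fix
FROM tracksolid.devices AS d            -- assumed table name
LEFT JOIN LATERAL (
    SELECT max(p.gps_time) AS last_fix
    FROM tracksolid.position_history AS p
    WHERE p.imei = d.imei
) AS last ON true
WHERE d.enabled                          -- assumed column
  AND (last.last_fix IS NULL OR last.last_fix < now() - interval '2 hours');
```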

Exit code: 1 on any critical, 0 on ok/warning.
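
The runner itself can stay small. A sketch of db_audit/run_audit.py, assuming psycopg (v3) and a health_checks table with check_name/status/detail/checked_at columns (the real schema lives in schema/health_checks_table.sql):

```python
# db_audit/run_audit.py (sketch): run every checks/*.sql file, record one
# health_checks row per check, and exit 1 if anything critical turned up.
import os
import pathlib
import sys

import psycopg

CHECKS_DIR = pathlib.Path(__file__).parent / "checks"


def main() -> int:
    exit_code = 0
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
        for sql_file in sorted(CHECKS_DIR.glob("*.sql")):
            rows = conn.execute(sql_file.read_text()).fetchall()
            # Per-check critical vs. warning mapping is simplified here;
            # the real runner must distinguish the two severities.
            status = "critical" if rows else "ok"
            conn.execute(
                "INSERT INTO tracksolid.health_checks"
                " (check_name, status, detail, checked_at)"
                " VALUES (%s, %s, %s, now())",
                (sql_file.stem, status, f"{len(rows)} offending rows"),
            )
            if status == "critical":
                exit_code = 1
    return exit_code


if __name__ == "__main__":
    sys.exit(main())
```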


File Layout

55_ts_coolify_gemini_prod/
├── pyproject.toml              ← ADD: ruff + mypy + pytest config + dev deps
├── .forgejo/
│   └── workflows/
│       ├── ci-static.yml
│       ├── ci-tests.yml
│       └── scheduled-audit.yml
├── tests/
│   ├── conftest.py
│   ├── fixtures/
│   │   ├── api_responses.py
│   │   └── schema.sql
│   ├── unit/
│   │   ├── test_clean_helpers.py
│   │   ├── test_api_signing.py
│   │   └── test_field_mapping.py
│   └── integration/
│       ├── test_movement_pipeline.py
│       ├── test_events_pipeline.py
│       └── test_webhook_endpoints.py
└── db_audit/
    ├── run_audit.py
    ├── checks/
    │   ├── stale_devices.sql
    │   ├── null_integrity.sql
    │   ├── distance_outliers.sql
    │   ├── duplicate_positions.sql
    │   ├── data_gaps.sql
    │   └── enum_drift.sql
    └── schema/
        └── health_checks_table.sql

Forgejo Runner Setup

Before CI can run, a self-hosted runner must be registered on the Coolify server:

  1. Forgejo → Settings → Actions → Runners → Register Runner → copy token
  2. On Coolify server: docker run -d --name forgejo-runner gitea/act_runner:latest register --instance https://repo.rahamafresh.com --token <TOKEN> --name coolify-runner --labels self-hosted
  3. Verify runner appears as active in Forgejo

Required Forgejo secrets:

  • DATABASE_URL — production DB connection string (for scheduled audit)
  • TEST_DATABASE_URL — set automatically by CI service container
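
With those secrets in place, scheduled-audit.yml reduces to a cron trigger plus a manual escape hatch; a sketch, with action versions assumed:

```yaml
name: scheduled-audit
on:
  schedule:
    - cron: "0 3 * * *"   # 03:00 UTC = 06:00 EAT
  workflow_dispatch: {}    # manual runs

jobs:
  audit:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: pip install psycopg   # driver choice is an assumption
      - run: python db_audit/run_audit.py
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
```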

Verification

| Workstream | Pass criteria |
| --- | --- |
| Static Analysis | Push triggers ci-static; ruff + mypy produce an output report; job exits non-zero on violations |
| Test Suite | Push triggers ci-tests; all unit tests pass; integration tests pass against the service-container DB |
| DB Audit | Manual run populates the health_checks table; findings match known issues (44 silent devices, etc.); scheduled run fires at 06:00 EAT |