fix: BUG-01 ETL (type crash + cartesian explosion), BUG-02 multi-account audit, BUG-03 diagnostic #12

Open
kianiadee wants to merge 2 commits from fix/bugs-01-02-03 into main
Owner

Summary

First tranche of fixes from the recent code audit. CRITICAL + HIGH severity items only.

  • BUG-01a (CRITICAL — type crash): dwh_gold.refresh_daily_metrics() inserted t.imei (TEXT) into fact_daily_fleet_metrics.vehicle_key (INTEGER REFERENCES dim_vehicles), so the nightly ETL would have raised invalid input syntax for type integer on every run.
  • BUG-01b (CRITICAL — cartesian explosion, discovered during PR review): the original function joined trips × alarms in a single SELECT, producing one row per (trip, alarm) pair. Every SUM/COUNT over trip columns was multiplied by the per-IMEI alarm count. Spot-checks on a corrected ad-hoc version of the SELECT showed total_trips identical to alarm_count, drive_hours > 1000/day, and total_distance_km in the tens of thousands per vehicle per day. Without this second fix, the ETL would have stopped crashing but produced garbage metrics.

Migration 08_fix_etl_vehicle_key.sql resolves both:

  1. Backfills dwh_gold.dim_vehicles from tracksolid.devices so every IMEI has a serial vehicle_key.
  2. Rewrites refresh_daily_metrics() to compute trip_agg and alarm_agg in separate CTEs, then joins the aggregates on IMEI and maps IMEI → vehicle_key via dim_vehicles.
  3. Re-syncs dim_vehicles at the top of each call so newly registered devices appear in the warehouse without manual seeding.
  • BUG-02 (HIGH): sync_driver_audit.py only queried TARGET_ACCOUNT, ignoring the Fireside@HQ and Fireside_MSA sub-accounts. The audit now iterates TARGETS (matching FIX-M19 in ingest_movement_rev.sync_devices), dedupes devices by IMEI, and tolerates per-target failures.
  • BUG-03 (HIGH, diagnostic only): the webhook trip handler stores item["miles"] straight into distance_km. The field name is suspicious and FIX-M16 already proved the polling endpoint mis-documents its units. Added db_audit/checks/bug03_webhook_distance_units.sql which compares the distribution of stored_km / great_circle_km for push-source vs poll-source trips over 30 days — the ratio test will tell us whether the push value needs /1.609 (miles), /1000 (metres), or no conversion. The existing calculation is left unchanged until the data confirms; the misleading [FIX-M11] comment is replaced with a [BUG-03] pointer to the diagnostic.

Test plan

  • Deploy. Migration runner picks up 08_fix_etl_vehicle_key.sql on container restart.
  • Verify dwh_gold.dim_vehicles row count equals tracksolid.devices row count.
  • Run SELECT dwh_gold.refresh_daily_metrics(CURRENT_DATE - 1); — should succeed (previously crashed).
  • Confirm dwh_gold.fact_daily_fleet_metrics rows for yesterday show plausible values: total_trips in single/double digits, total_drive_hours ≤ 24, alarm_count decoupled from total_trips.
  • Run python sync_driver_audit.py inside the ingest_movement container — total device count should match the union of all sub-accounts, not just one.
  • Run db_audit/checks/bug03_webhook_distance_units.sql and capture the per-source median ratio; if push median ≈ 1.609× poll median, raise a follow-up to divide the webhook miles value.

Notes

  • BUG-01b was uncovered by running the original SELECT body manually against live data — total_trips and alarm_count came out identical, revealing the cartesian multiplication.
  • No production data change until the migration runs.
  • BUG-03 is intentionally diagnostic-only — calculation is unchanged until the ratio test settles the unit question.

🤖 Generated with Claude Code

## Summary First tranche of fixes from the recent code audit. CRITICAL + HIGH severity items only. - **BUG-01a (CRITICAL — type crash):** `dwh_gold.refresh_daily_metrics()` inserted `t.imei` (TEXT) into `fact_daily_fleet_metrics.vehicle_key` (INTEGER REFERENCES dim_vehicles), so the nightly ETL would have raised `invalid input syntax for type integer` on every run. - **BUG-01b (CRITICAL — cartesian explosion, discovered during PR review):** the original function joined `trips × alarms` in a single SELECT, producing one row per (trip, alarm) pair. Every SUM/COUNT over trip columns was multiplied by the per-IMEI alarm count. Spot-checks on a corrected ad-hoc version of the SELECT showed total_trips identical to alarm_count, drive_hours > 1000/day, and total_distance_km in the tens of thousands per vehicle per day. **Without this second fix, the ETL would have stopped crashing but produced garbage metrics.** Migration `08_fix_etl_vehicle_key.sql` resolves both: 1. Backfills `dwh_gold.dim_vehicles` from `tracksolid.devices` so every IMEI has a serial `vehicle_key`. 2. Rewrites `refresh_daily_metrics()` to compute `trip_agg` and `alarm_agg` in **separate CTEs**, then joins the aggregates on IMEI and maps IMEI → vehicle_key via `dim_vehicles`. 3. Re-syncs `dim_vehicles` at the top of each call so newly registered devices appear in the warehouse without manual seeding. - **BUG-02 (HIGH):** `sync_driver_audit.py` only queried `TARGET_ACCOUNT`, ignoring the `Fireside@HQ` and `Fireside_MSA` sub-accounts. The audit now iterates `TARGETS` (matching FIX-M19 in `ingest_movement_rev.sync_devices`), dedupes devices by IMEI, and tolerates per-target failures. - **BUG-03 (HIGH, diagnostic only):** the webhook trip handler stores `item["miles"]` straight into `distance_km`. The field name is suspicious and FIX-M16 already proved the polling endpoint mis-documents its units. Added `db_audit/checks/bug03_webhook_distance_units.sql` which compares the distribution of `stored_km / great_circle_km` for push-source vs poll-source trips over 30 days — the ratio test will tell us whether the push value needs `/1.609` (miles), `/1000` (metres), or no conversion. The existing calculation is left unchanged until the data confirms; the misleading `[FIX-M11]` comment is replaced with a `[BUG-03]` pointer to the diagnostic. ## Test plan - [ ] Deploy. Migration runner picks up `08_fix_etl_vehicle_key.sql` on container restart. - [ ] Verify `dwh_gold.dim_vehicles` row count equals `tracksolid.devices` row count. - [ ] Run `SELECT dwh_gold.refresh_daily_metrics(CURRENT_DATE - 1);` — should succeed (previously crashed). - [ ] Confirm `dwh_gold.fact_daily_fleet_metrics` rows for yesterday show **plausible** values: `total_trips` in single/double digits, `total_drive_hours ≤ 24`, `alarm_count` decoupled from `total_trips`. - [ ] Run `python sync_driver_audit.py` inside the `ingest_movement` container — total device count should match the union of all sub-accounts, not just one. - [ ] Run `db_audit/checks/bug03_webhook_distance_units.sql` and capture the per-source median ratio; if push median ≈ 1.609× poll median, raise a follow-up to divide the webhook `miles` value. ## Notes - BUG-01b was uncovered by running the original SELECT body manually against live data — total_trips and alarm_count came out identical, revealing the cartesian multiplication. - No production data change until the migration runs. - BUG-03 is intentionally diagnostic-only — calculation is unchanged until the ratio test settles the unit question. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
kianiadee added 1 commit 2026-05-15 12:39:19 +00:00
fix: BUG-01 ETL type crash, BUG-02 multi-account audit, BUG-03 diagnostic
Some checks failed
Static Analysis / static (push) Waiting to run
Tests / test (push) Waiting to run
Static Analysis / static (pull_request) Has been cancelled
Tests / test (pull_request) Has been cancelled
8d386bf27a
BUG-01 (CRITICAL): dwh_gold.refresh_daily_metrics inserted t.imei (TEXT) into
fact_daily_fleet_metrics.vehicle_key (INTEGER REFERENCES dim_vehicles), so the
nightly ETL would have raised "invalid input syntax for type integer" on every
run. Migration 08 backfills dim_vehicles from tracksolid.devices and rewrites
the function to JOIN through dim_vehicles, returning the serial vehicle_key.
The function also re-syncs dim_vehicles at the top of each call so newly
registered devices appear in the warehouse without manual seeding.

BUG-02 (HIGH): sync_driver_audit.py only queried TARGET_ACCOUNT, ignoring the
Fireside@HQ and Fireside_MSA sub-accounts. The audit now iterates TARGETS
(matching FIX-M19 in ingest_movement_rev.sync_devices), dedupes devices by
IMEI, and tolerates per-target failures.

BUG-03 (HIGH, diagnostic only): the webhook trip handler stores item["miles"]
straight into distance_km. The field name is suspicious and FIX-M16 already
proved the polling endpoint mis-documents its units. Added a SQL diagnostic
that compares the distribution of stored-km / great-circle-km for push-source
vs poll-source trips over 30 days — the ratio test will tell us whether the
push value needs a /1.609 (miles), /1000 (metres), or no conversion. The
existing calculation is left unchanged until the data confirms the unit; the
old FIX-M11 comment is replaced with a BUG-03 pointer to the diagnostic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kianiadee added 1 commit 2026-05-15 13:44:48 +00:00
fix(BUG-01b): aggregate trips and alarms in separate CTEs to avoid cartesian explosion
Some checks failed
Static Analysis / static (push) Has been cancelled
Tests / test (push) Has been cancelled
Static Analysis / static (pull_request) Has been cancelled
Tests / test (pull_request) Has been cancelled
7bc0a2ce87
The original refresh_daily_metrics() joined trips × alarms in one SELECT,
producing one row per (trip, alarm) pair. Every SUM/COUNT over trip
columns was multiplied by the per-IMEI alarm count, so spot-checks
showed total_trips identical to alarm_count, drive_hours > 1000/day,
and distance_km in the tens of thousands per vehicle per day.

Migration 08 carried that flawed join forward when fixing the
TEXT→INTEGER vehicle_key crash. Rewriting the function so trip_agg and
alarm_agg are computed in separate CTEs and then joined on imei
restores correct per-vehicle aggregates: total_trips reflects real trip
count, drive_hours ≤ 24, alarms are counted once.

This bug is being fixed in the same migration file (08) before PR #12
merges; no deploy has applied the prior version, so no second migration
is needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kianiadee changed title from fix: BUG-01 ETL type crash, BUG-02 multi-account audit, BUG-03 diagnostic to fix: BUG-01 ETL (type crash + cartesian explosion), BUG-02 multi-account audit, BUG-03 diagnostic 2026-05-15 13:57:41 +00:00
Some checks failed
Static Analysis / static (push) Has been cancelled
Tests / test (push) Has been cancelled
Static Analysis / static (pull_request) Has been cancelled
Tests / test (pull_request) Has been cancelled
This pull request has changes conflicting with the target branch.
  • run_migrations.py
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin fix/bugs-01-02-03:fix/bugs-01-02-03
git checkout fix/bugs-01-02-03

Merge

Merge the changes and update on Forgejo.
git checkout main
git merge --no-ff fix/bugs-01-02-03
git checkout main
git merge --ff-only fix/bugs-01-02-03
git checkout fix/bugs-01-02-03
git rebase main
git checkout main
git merge --no-ff fix/bugs-01-02-03
git checkout main
git merge --squash fix/bugs-01-02-03
git checkout main
git merge --ff-only fix/bugs-01-02-03
git checkout main
git merge fix/bugs-01-02-03
git push origin main
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: kianiadee/tracksolid_timescale_grafana_prod#12
No description provided.