fleettickets/n8n-s3-ticket-exports.md

131 lines
6.2 KiB
Markdown
Raw Normal View History

# n8n S3 Ticket Exports — Incremental (CDC) Stream
Updated on June 23, 2026.
> **History.** This doc previously described a full-snapshot-per-hour model
> ("No delta files … No `changes/` directories"). That is no longer accurate.
> As of the June 22, 2026 re-seed the source switched to an **incremental
> change-data-capture (CDC) stream** under `automations/<dataset>/changes/`.
> The structure below was verified by direct S3 inspection of the `tickets`
> bucket on June 23, 2026. Workflow-internal details (IDs, node behaviour) are
> carried over from the prior version and may be stale — trust the bucket.
## Overview
The FTTH ticket export now writes an **incremental** CSV stream per dataset:
- The **first** file in a stream is a full current-state **baseline**.
- Every **later** file holds **only the rows that changed** since the prior
export — new and updated tickets, keyed by `ticket_id`.
- **Deletions are never emitted** (tickets are closed in place, not removed).
Consumers must ingest **every not-yet-processed file in ascending timestamp
order** (baseline first, then each delta) and **upsert on `ticket_id`**. Taking
only the newest file silently drops the intermediate deltas.
CSV files only. Filenames use the execution time in the `Africa/Nairobi`
timezone (format below). All files share one identical flat-CSV schema (header
+ rows) — the same column set the previous full snapshots used.
## Output Layout
The change stream lives under a `changes/` prefix per dataset in the `tickets`
bucket:
```text
automations/crq/changes/YYYY-MM-DDTHH-mm-ss.csv
automations/inc/changes/YYYY-MM-DDTHH-mm-ss.csv
```
Observed `tickets` bucket layout (June 23, 2026):
```text
automations/inc/
├── changes/ ← ACTIVE incremental stream (baseline + deltas)
│ ├── 2026-06-22T15-50-39.csv (baseline, ~15 MB, 34,849 rows)
│ ├── 2026-06-22T15-53-04.csv (delta, 1 row)
│ ├── 2026-06-22T17-10-41.csv (delta, 22 rows)
│ └── 2026-06-22T17-15-41.csv (delta, 131 rows)
├── processed/ ← our pipeline's archive of consumed files
├── full/ ← present but EMPTY (legacy prefix)
├── latest.csv/ ← present but EMPTY (legacy prefix)
└── latest.json/ ← present but EMPTY (legacy prefix)
```
Notes:
- There are **no longer any `automations/inc/<ts>.csv` files at the root** of
`inc/` — the last full snapshots (through `2026-06-18T17-00-05.csv`) were
archived to `processed/`. New data arrives **only** under `changes/`.
- The `full/`, `latest.csv/`, and `latest.json/` prefixes still appear in
listings but contain **no objects** (leftover/legacy; ignore them). There is
no `latest` pointer and no JSON/metadata envelope.
- **CRQ mirrors INC**: `automations/crq/changes/` carries the same incremental
2026-06-25 20:16:38 +00:00
stream (with matching baseline timestamps and additional newer deltas) and the
identical 32-column schema. As of 2026-06-25 CRQ **is consumed** — by
`crq/import_crq.py` over the shared `pipeline.py` engine (`tickets.crq`), the same
way INC is consumed by `inc/import_inc.py`. CRQ's old root snapshots
(`automations/crq/<ts>.csv`, old `tickets` bucket) are still present because nothing
archives them — they are not consumed (the `changes/` stream is the source of truth).
## CSV Schema
Header (32 columns), identical across baseline and delta files:
```text
ticket_id, source_type, service_type, bucket, raw_status, normalized_status,
created_at_service, scheduled_at, closed_at, last_seen_at, first_seen_at,
week_start, week_end, cluster, region, location_name, latitude, longitude,
department, assigned_team, owner, sla_status, mttr, is_auto_created,
is_auto_closed, is_alarm, is_actionable, source_s3_bucket, source_s3_key,
source_snapshot_id, created_at, updated_at
```
Each row is a complete record (not a partial diff): a delta row carries the
ticket's full current state, so a plain upsert on `ticket_id` is correct. The
baseline still contains `is_alarm=true` rows and may include a leading
`EXPORT STOPPED…` truncation-sentinel row in `ticket_id`; both are filtered by
2026-06-25 20:16:38 +00:00
the consumer (see `pipeline.py` + the `inc/`,`crq/` entrypoints).
## Timestamp Format
```text
YYYY-MM-DDTHH-mm-ss e.g. 2026-06-22T15-50-39
```
Generated once at the start of each execution, formatted in `Africa/Nairobi`
(EAT). Note this is the *execution* time, not a top-of-the-hour schedule — the
incremental files appear whenever a change batch is exported (multiple within
the same hour are normal, e.g. `15-50-39` then `15-53-04`).
2026-06-25 20:16:38 +00:00
## How the loader Consumes It
2026-06-25 20:16:38 +00:00
`python -m inc.import_inc --from-bucket --apply` (and `python -m crq.import_crq
--from-bucket --apply`; see `run_ingest.sh`) — both over the shared `pipeline.py` engine:
2026-06-25 20:16:38 +00:00
1. Lists `automations/<inc|crq>/changes/<ts>.csv`, sorts ascending by timestamp.
2. Skips files at/older than the **watermark**
(`tickets.import_meta.metadata->>'source_max_key'` — the newest file already
applied); on a fresh stream it processes everything present.
3. For each pending file, oldest→newest: drop `is_alarm=true` + sentinel rows,
strip `DROP_FIELDS`, normalize `region`/`raw_status`, then upsert on
`ticket_id`. The row upsert and the watermark advance **commit in one
transaction per file**, after which the file is moved to
`automations/inc/processed/`.
4. A mid-run failure therefore leaves folder state consistent with the
watermark; the next run resumes cleanly from where it stopped.
The first file applied onto an empty watermark is recorded as
`export_type="baseline"` in `tickets.import_meta`; every file after is `"delta"`.
## Workflow Context (carried over — verify before relying on)
The export originates from the FTTH Automation Ticket export workflow, calling
the authenticated Scoreboard export endpoint and uploading CSV(s) to the
`tickets` bucket; a sibling workflow exports `fuel_records/<ts>.csv` to the
`fuel` bucket. Source DB queries are read-only and the workflows do not delete
or update source rows. The previously documented workflow IDs and the claim of
"two files per hour, full snapshots, no `changes/`" predate the June 22 switch
to the incremental stream and should be re-confirmed against the live n8n
configuration before being treated as current.