128 lines
5.9 KiB
Markdown
128 lines
5.9 KiB
Markdown
|
|
# n8n S3 Ticket Exports — Incremental (CDC) Stream
|
||
|
|
|
||
|
|
Updated on June 23, 2026.
|
||
|
|
|
||
|
|
> **History.** This doc previously described a full-snapshot-per-hour model
|
||
|
|
> ("No delta files … No `changes/` directories"). That is no longer accurate.
|
||
|
|
> As of the June 22, 2026 re-seed the source switched to an **incremental
|
||
|
|
> change-data-capture (CDC) stream** under `automations/<dataset>/changes/`.
|
||
|
|
> The structure below was verified by direct S3 inspection of the `tickets`
|
||
|
|
> bucket on June 23, 2026. Workflow-internal details (IDs, node behaviour) are
|
||
|
|
> carried over from the prior version and may be stale — trust the bucket.
|
||
|
|
|
||
|
|
## Overview
|
||
|
|
|
||
|
|
The FTTH ticket export now writes an **incremental** CSV stream per dataset:
|
||
|
|
|
||
|
|
- The **first** file in a stream is a full current-state **baseline**.
|
||
|
|
- Every **later** file holds **only the rows that changed** since the prior
|
||
|
|
export — new and updated tickets, keyed by `ticket_id`.
|
||
|
|
- **Deletions are never emitted** (tickets are closed in place, not removed).
|
||
|
|
|
||
|
|
Consumers must ingest **every not-yet-processed file in ascending timestamp
|
||
|
|
order** (baseline first, then each delta) and **upsert on `ticket_id`**. Taking
|
||
|
|
only the newest file silently drops the intermediate deltas.
|
||
|
|
|
||
|
|
CSV files only. Filenames use the execution time in the `Africa/Nairobi`
|
||
|
|
timezone (format below). All files share one identical flat-CSV schema (header
|
||
|
|
+ rows) — the same column set the previous full snapshots used.
|
||
|
|
|
||
|
|
## Output Layout
|
||
|
|
|
||
|
|
The change stream lives under a `changes/` prefix per dataset in the `tickets`
|
||
|
|
bucket:
|
||
|
|
|
||
|
|
```text
|
||
|
|
automations/crq/changes/YYYY-MM-DDTHH-mm-ss.csv
|
||
|
|
automations/inc/changes/YYYY-MM-DDTHH-mm-ss.csv
|
||
|
|
```
|
||
|
|
|
||
|
|
Observed `tickets` bucket layout (June 23, 2026):
|
||
|
|
|
||
|
|
```text
|
||
|
|
automations/inc/
|
||
|
|
├── changes/ ← ACTIVE incremental stream (baseline + deltas)
|
||
|
|
│ ├── 2026-06-22T15-50-39.csv (baseline, ~15 MB, 34,849 rows)
|
||
|
|
│ ├── 2026-06-22T15-53-04.csv (delta, 1 row)
|
||
|
|
│ ├── 2026-06-22T17-10-41.csv (delta, 22 rows)
|
||
|
|
│ └── 2026-06-22T17-15-41.csv (delta, 131 rows)
|
||
|
|
├── processed/ ← our pipeline's archive of consumed files
|
||
|
|
├── full/ ← present but EMPTY (legacy prefix)
|
||
|
|
├── latest.csv/ ← present but EMPTY (legacy prefix)
|
||
|
|
└── latest.json/ ← present but EMPTY (legacy prefix)
|
||
|
|
```
|
||
|
|
|
||
|
|
Notes:
|
||
|
|
|
||
|
|
- There are **no longer any `automations/inc/<ts>.csv` files at the root** of
|
||
|
|
`inc/` — the last full snapshots (through `2026-06-18T17-00-05.csv`) were
|
||
|
|
archived to `processed/`. New data arrives **only** under `changes/`.
|
||
|
|
- The `full/`, `latest.csv/`, and `latest.json/` prefixes still appear in
|
||
|
|
listings but contain **no objects** (leftover/legacy; ignore them). There is
|
||
|
|
no `latest` pointer and no JSON/metadata envelope.
|
||
|
|
- **CRQ mirrors INC**: `automations/crq/changes/` carries the same incremental
|
||
|
|
stream (with matching baseline timestamps and additional newer deltas). CRQ
|
||
|
|
remains out of scope for `import_tickets.py` (INC-only), but the source-side
|
||
|
|
shape is the same. CRQ's old root snapshots (`automations/crq/<ts>.csv`) are
|
||
|
|
still present because nothing archives them — they are not consumed.
|
||
|
|
|
||
|
|
## CSV Schema
|
||
|
|
|
||
|
|
Header (32 columns), identical across baseline and delta files:
|
||
|
|
|
||
|
|
```text
|
||
|
|
ticket_id, source_type, service_type, bucket, raw_status, normalized_status,
|
||
|
|
created_at_service, scheduled_at, closed_at, last_seen_at, first_seen_at,
|
||
|
|
week_start, week_end, cluster, region, location_name, latitude, longitude,
|
||
|
|
department, assigned_team, owner, sla_status, mttr, is_auto_created,
|
||
|
|
is_auto_closed, is_alarm, is_actionable, source_s3_bucket, source_s3_key,
|
||
|
|
source_snapshot_id, created_at, updated_at
|
||
|
|
```
|
||
|
|
|
||
|
|
Each row is a complete record (not a partial diff): a delta row carries the
|
||
|
|
ticket's full current state, so a plain upsert on `ticket_id` is correct. The
|
||
|
|
baseline still contains `is_alarm=true` rows and may include a leading
|
||
|
|
`EXPORT STOPPED…` truncation-sentinel row in `ticket_id`; both are filtered by
|
||
|
|
the consumer (see `import_tickets.py`).
|
||
|
|
|
||
|
|
## Timestamp Format
|
||
|
|
|
||
|
|
```text
|
||
|
|
YYYY-MM-DDTHH-mm-ss e.g. 2026-06-22T15-50-39
|
||
|
|
```
|
||
|
|
|
||
|
|
Generated once at the start of each execution, formatted in `Africa/Nairobi`
|
||
|
|
(EAT). Note this is the *execution* time, not a top-of-the-hour schedule — the
|
||
|
|
incremental files appear whenever a change batch is exported (multiple within
|
||
|
|
the same hour are normal, e.g. `15-50-39` then `15-53-04`).
|
||
|
|
|
||
|
|
## How `import_tickets.py` Consumes It
|
||
|
|
|
||
|
|
`python import_tickets.py --from-bucket --apply` (see `run_ingest.sh`):
|
||
|
|
|
||
|
|
1. Lists `automations/inc/changes/<ts>.csv`, sorts ascending by timestamp.
|
||
|
|
2. Skips files at/older than the **watermark**
|
||
|
|
(`tickets.import_meta.metadata->>'source_max_key'` — the newest file already
|
||
|
|
applied); on a fresh stream it processes everything present.
|
||
|
|
3. For each pending file, oldest→newest: drop `is_alarm=true` + sentinel rows,
|
||
|
|
strip `DROP_FIELDS`, normalize `region`/`raw_status`, then upsert on
|
||
|
|
`ticket_id`. The row upsert and the watermark advance **commit in one
|
||
|
|
transaction per file**, after which the file is moved to
|
||
|
|
`automations/inc/processed/`.
|
||
|
|
4. A mid-run failure therefore leaves folder state consistent with the
|
||
|
|
watermark; the next run resumes cleanly from where it stopped.
|
||
|
|
|
||
|
|
The first file applied onto an empty watermark is recorded as
|
||
|
|
`export_type="baseline"` in `tickets.import_meta`; every file after is `"delta"`.
|
||
|
|
|
||
|
|
## Workflow Context (carried over — verify before relying on)
|
||
|
|
|
||
|
|
The export originates from the FTTH Automation Ticket export workflow, calling
|
||
|
|
the authenticated Scoreboard export endpoint and uploading CSV(s) to the
|
||
|
|
`tickets` bucket; a sibling workflow exports `fuel_records/<ts>.csv` to the
|
||
|
|
`fuel` bucket. Source DB queries are read-only and the workflows do not delete
|
||
|
|
or update source rows. The previously documented workflow IDs and the claim of
|
||
|
|
"two files per hour, full snapshots, no `changes/`" predate the June 22 switch
|
||
|
|
to the incremental stream and should be re-confirmed against the live n8n
|
||
|
|
configuration before being treated as current.
|