# n8n S3 Ticket Exports — Incremental (CDC) Stream

Updated on June 23, 2026.

> **History.** This doc previously described a full-snapshot-per-hour model
> ("No delta files … No `changes/` directories"). That is no longer accurate.
> As of the June 22, 2026 re-seed the source switched to an **incremental
> change-data-capture (CDC) stream** under `automations/<dataset>/changes/`.
> The structure below was verified by direct S3 inspection of the `tickets`
> bucket on June 23, 2026. Workflow-internal details (IDs, node behaviour) are
> carried over from the prior version and may be stale — trust the bucket.

## Overview

The FTTH ticket export now writes an **incremental** CSV stream per dataset:

- The **first** file in a stream is a full current-state **baseline**.
- Every **later** file holds **only the rows that changed** since the prior
  export — new and updated tickets, keyed by `ticket_id`.
- **Deletions are never emitted** (tickets are closed in place, not removed).

Consumers must ingest **every not-yet-processed file in ascending timestamp
order** (baseline first, then each delta) and **upsert on `ticket_id`**. Taking
only the newest file silently drops the intermediate deltas.
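
The ordering rule above can be illustrated with a minimal in-memory sketch. The function name `replay_stream`, the ticket IDs, and the status values are hypothetical and chosen only for illustration; the filenames reuse timestamps from the observed listing.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def replay_stream(files: Iterable[Tuple[str, List[Row]]]) -> Dict[str, Row]:
    """Apply files in ascending timestamp order, upserting on ticket_id."""
    state: Dict[str, Row] = {}
    # The YYYY-MM-DDTHH-mm-ss filename format sorts correctly as plain text.
    for _key, rows in sorted(files, key=lambda item: item[0]):
        for row in rows:
            state[row["ticket_id"]] = row  # later files overwrite earlier ones
    return state


# Deliberately listed out of order; replay_stream sorts before applying.
files = [
    ("2026-06-22T15-53-04.csv", [{"ticket_id": "T1", "raw_status": "Closed"}]),
    ("2026-06-22T15-50-39.csv", [{"ticket_id": "T1", "raw_status": "Open"},
                                 {"ticket_id": "T2", "raw_status": "Open"}]),
    ("2026-06-22T17-10-41.csv", [{"ticket_id": "T2", "raw_status": "In Progress"}]),
]
state = replay_stream(files)
```

In this example the newest file only mentions `T2`, so a consumer that read it alone would never see `T1` at all, let alone its later status change.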

The stream contains CSV files only. Filenames use the execution time in the
`Africa/Nairobi` timezone (format below). All files share one identical
flat-CSV schema (header + rows), the same column set the previous full
snapshots used.

## Output Layout

The change stream lives under a `changes/` prefix per dataset in the `tickets`
bucket:

```text
automations/crq/changes/YYYY-MM-DDTHH-mm-ss.csv
automations/inc/changes/YYYY-MM-DDTHH-mm-ss.csv
```

Observed `tickets` bucket layout (June 23, 2026):

```text
automations/inc/
├── changes/                 ← ACTIVE incremental stream (baseline + deltas)
│   ├── 2026-06-22T15-50-39.csv   (baseline,  ~15 MB, 34,849 rows)
│   ├── 2026-06-22T15-53-04.csv   (delta,        1 row)
│   ├── 2026-06-22T17-10-41.csv   (delta,       22 rows)
│   └── 2026-06-22T17-15-41.csv   (delta,      131 rows)
├── processed/               ← our pipeline's archive of consumed files
├── full/                    ← present but EMPTY (legacy prefix)
├── latest.csv/              ← present but EMPTY (legacy prefix)
└── latest.json/             ← present but EMPTY (legacy prefix)
```

Notes:

- There are **no longer any `automations/inc/<ts>.csv` files at the root** of
  `inc/` — the last full snapshots (through `2026-06-18T17-00-05.csv`) were
  archived to `processed/`. New data arrives **only** under `changes/`.
- The `full/`, `latest.csv/`, and `latest.json/` prefixes still appear in
  listings but contain **no objects** (leftover/legacy; ignore them). There is
  no `latest` pointer and no JSON/metadata envelope.
- **CRQ mirrors INC**: `automations/crq/changes/` carries the same incremental
  stream (with matching baseline timestamps and additional newer deltas). CRQ
  remains out of scope for `import_tickets.py` (INC-only), but the source-side
  shape is the same. CRQ's old root snapshots (`automations/crq/<ts>.csv`) are
  still present because nothing archives them — they are not consumed.
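
Given the layout above, a consumer needs to pick out only the INC change-stream objects from a bucket listing. The following is a rough sketch of such a filter; the pattern name `CHANGE_KEY_RE`, the helper `select_change_keys`, and the sample listing are assumptions for illustration and may not match what `import_tickets.py` actually does.

```python
import re
from typing import List

# Hypothetical pattern: only timestamped CSVs directly under inc/changes/.
CHANGE_KEY_RE = re.compile(
    r"^automations/inc/changes/(\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2})\.csv$"
)


def select_change_keys(keys: List[str]) -> List[str]:
    """Keep only INC change-stream objects, sorted oldest first."""
    return sorted(k for k in keys if CHANGE_KEY_RE.match(k))


# Illustrative listing mixing active, archived, legacy, and CRQ keys.
listing = [
    "automations/inc/changes/2026-06-22T17-10-41.csv",
    "automations/inc/changes/2026-06-22T15-50-39.csv",
    "automations/inc/processed/2026-06-18T17-00-05.csv",
    "automations/inc/full/",
    "automations/inc/latest.csv/",
    "automations/crq/changes/2026-06-22T15-50-39.csv",
]
change_keys = select_change_keys(listing)
```

Anchoring the pattern at both ends keeps the archived `processed/` files, the empty legacy prefixes, and the CRQ stream out of the result.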

## CSV Schema

Header (32 columns), identical across baseline and delta files:

```text
ticket_id, source_type, service_type, bucket, raw_status, normalized_status,
created_at_service, scheduled_at, closed_at, last_seen_at, first_seen_at,
week_start, week_end, cluster, region, location_name, latitude, longitude,
department, assigned_team, owner, sla_status, mttr, is_auto_created,
is_auto_closed, is_alarm, is_actionable, source_s3_bucket, source_s3_key,
source_snapshot_id, created_at, updated_at
```

Each row is a complete record (not a partial diff): a delta row carries the
ticket's full current state, so a plain upsert on `ticket_id` is correct. The
baseline still contains `is_alarm=true` rows and may begin with a
truncation-sentinel row whose `ticket_id` holds `EXPORT STOPPED…`; the consumer
filters out both (see `import_tickets.py`).
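
As a sketch of the filtering described above, the snippet below checks the 32-column header and skips alarm and sentinel rows. It is not the code from `import_tickets.py`: the names `usable_rows`, `make_csv`, and `SENTINEL_PREFIX`, the case-insensitive `true` comparison, and the sample rows are all assumptions.

```python
import csv
import io
from typing import Dict, Iterator, List

HEADER = [
    "ticket_id", "source_type", "service_type", "bucket", "raw_status",
    "normalized_status", "created_at_service", "scheduled_at", "closed_at",
    "last_seen_at", "first_seen_at", "week_start", "week_end", "cluster",
    "region", "location_name", "latitude", "longitude", "department",
    "assigned_team", "owner", "sla_status", "mttr", "is_auto_created",
    "is_auto_closed", "is_alarm", "is_actionable", "source_s3_bucket",
    "source_s3_key", "source_snapshot_id", "created_at", "updated_at",
]
SENTINEL_PREFIX = "EXPORT STOPPED"  # assumed start of the sentinel ticket_id


def usable_rows(csv_text: str) -> Iterator[Dict[str, str]]:
    """Yield data rows, skipping the sentinel row and is_alarm=true rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != HEADER:
        raise ValueError("unexpected CSV header")
    for row in reader:
        if row["ticket_id"].startswith(SENTINEL_PREFIX):
            continue
        if row["is_alarm"].strip().lower() == "true":
            continue
        yield row


def make_csv(rows: List[Dict[str, str]]) -> str:
    """Build a sample file; columns missing from a row are left empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=HEADER, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sample = make_csv([
    {"ticket_id": "EXPORT STOPPED (illustrative sentinel text)"},
    {"ticket_id": "T1", "is_alarm": "true"},
    {"ticket_id": "T2", "is_alarm": "false"},
])
kept = [row["ticket_id"] for row in usable_rows(sample)]
```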

## Timestamp Format

```text
YYYY-MM-DDTHH-mm-ss          e.g. 2026-06-22T15-50-39
```

The timestamp is generated once at the start of each execution and formatted in
`Africa/Nairobi` (EAT). It is the *execution* time, not a top-of-the-hour
schedule: incremental files appear whenever a change batch is exported, so
several within the same hour are normal (e.g. `15-50-39` then `15-53-04`).
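
For consumers that need real datetimes rather than string ordering, the filename can be parsed as sketched below. The helper name `parse_export_ts` is hypothetical, and the fixed UTC+3 offset is a simplification standing in for the `Africa/Nairobi` zone (EAT observes no daylight saving).

```python
from datetime import datetime, timedelta, timezone

# EAT is UTC+3 year-round; a fixed offset avoids depending on tz data files.
EAT = timezone(timedelta(hours=3), "EAT")
TS_FORMAT = "%Y-%m-%dT%H-%M-%S"


def parse_export_ts(key: str) -> datetime:
    """Parse the timestamp from a key such as .../2026-06-22T15-50-39.csv."""
    stem = key.rsplit("/", 1)[-1]
    if stem.endswith(".csv"):
        stem = stem[: -len(".csv")]
    return datetime.strptime(stem, TS_FORMAT).replace(tzinfo=EAT)


first = parse_export_ts("automations/inc/changes/2026-06-22T15-50-39.csv")
second = parse_export_ts("automations/inc/changes/2026-06-22T15-53-04.csv")
gap = second - first  # two exports within the same hour
```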

## How `import_tickets.py` Consumes It

`python import_tickets.py --from-bucket --apply` (see `run_ingest.sh`):

1. Lists `automations/inc/changes/<ts>.csv` and sorts the keys ascending by
   timestamp.
2. Skips files at or older than the **watermark**
   (`tickets.import_meta.metadata->>'source_max_key'`, the newest file already
   applied); on a fresh stream it processes everything present.
3. For each pending file, oldest→newest: drops `is_alarm=true` and sentinel
   rows, strips `DROP_FIELDS`, normalizes `region`/`raw_status`, then upserts on
   `ticket_id`. The row upsert and the watermark advance **commit in one
   transaction per file**, after which the file is moved to
   `automations/inc/processed/`.
4. A mid-run failure therefore leaves folder state consistent with the
   watermark; the next run resumes cleanly from where it stopped.
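
The watermark and per-file transaction behaviour in the steps above can be sketched as follows. This is an illustration, not the loader's code: it uses an in-memory SQLite database as a stand-in for the real database, a simplified two-column `tickets` table, a plain `source_max_key` column instead of the JSON metadata field, and hypothetical helper names. The move to `processed/` is omitted.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple

TicketRow = Tuple[Optional[str], str]  # (ticket_id, raw_status)


def read_watermark(conn: sqlite3.Connection) -> Optional[str]:
    return conn.execute(
        "SELECT source_max_key FROM import_meta WHERE id = 1"
    ).fetchone()[0]


def pending_files(keys: List[str], watermark: Optional[str]) -> List[str]:
    """Files strictly newer than the watermark, oldest first."""
    return [k for k in sorted(keys) if watermark is None or k > watermark]


def apply_file(conn: sqlite3.Connection, key: str, rows: List[TicketRow]) -> None:
    # "with conn" commits on success and rolls back on any exception, so the
    # row upserts and the watermark advance succeed or fail together.
    with conn:
        conn.executemany(
            "INSERT INTO tickets (ticket_id, raw_status) VALUES (?, ?) "
            "ON CONFLICT(ticket_id) DO UPDATE SET raw_status = excluded.raw_status",
            rows,
        )
        conn.execute(
            "UPDATE import_meta SET source_max_key = ? WHERE id = 1", (key,)
        )


def run(conn: sqlite3.Connection, files: Dict[str, List[TicketRow]]) -> None:
    for key in pending_files(list(files), read_watermark(conn)):
        apply_file(conn, key, files[key])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (ticket_id TEXT PRIMARY KEY NOT NULL, raw_status TEXT)"
)
conn.execute("CREATE TABLE import_meta (id INTEGER PRIMARY KEY, source_max_key TEXT)")
conn.execute("INSERT INTO import_meta (id, source_max_key) VALUES (1, NULL)")
conn.commit()

files: Dict[str, List[TicketRow]] = {
    "2026-06-22T15-50-39.csv": [("T1", "Open"), ("T2", "Open")],
    # The second row violates NOT NULL, simulating a failure mid-file.
    "2026-06-22T15-53-04.csv": [("T1", "Closed"), (None, "Broken")],
}

try:
    run(conn, files)
except sqlite3.IntegrityError:
    pass  # the baseline is committed; the failed delta is rolled back

watermark_after_failure = read_watermark(conn)

# Once the bad file is corrected, a re-run resumes from the failed delta.
files["2026-06-22T15-53-04.csv"] = [("T1", "Closed")]
run(conn, files)
watermark_after_retry = read_watermark(conn)
```

Because the failed delta's partial upsert is rolled back along with the watermark update, the second run sees the same pending file and applies it in full.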

The first file applied onto an empty watermark is recorded as
`export_type="baseline"` in `tickets.import_meta`; every file after is `"delta"`.

## Workflow Context (carried over — verify before relying on)

The export originates from the FTTH Automation Ticket export workflow, calling
the authenticated Scoreboard export endpoint and uploading CSV(s) to the
`tickets` bucket; a sibling workflow exports `fuel_records/<ts>.csv` to the
`fuel` bucket. Source DB queries are read-only and the workflows do not delete
or update source rows. The previously documented workflow IDs and the claim of
"two files per hour, full snapshots, no `changes/`" predate the June 22 switch
to the incremental stream and should be re-confirmed against the live n8n
configuration before being treated as current.