How to incrementally merge daily CSV delta updates into a master file without losing expired records?


I am building a data pipeline to create a cumulative master dataset from daily CSV exports generated by a booking system.

Context & Data Structure

Source: A daily CSV file containing status transitions for bookings.

Key Columns: Booking_ID, Previous_Status, New_Status, and Timestamp (YYYY-MM-DD HH:MM:SS).

Edge Case 1: A single Booking_ID can have multiple status changes per day.

Edge Case 2: Two different transitions for the same Booking_ID can share the exact same Timestamp.

The Challenge

The source system automatically purges expired bookings. To keep the history of those bookings, I cannot simply overwrite the Master file with a full reload of the current system state each day. Instead, I need to process daily Delta files (which contain only new bookings or updated rows/transitions) and merge them into a Master file.

The script should run daily, read only the newly added CSV file in the directory (so that older files are not reprocessed), and append its data to the Master file.
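A minimal sketch of how "read only the new file" could be tracked, assuming the delta files sit in their own directory and that a plain-text manifest of already merged file names is acceptable. The names `find_unprocessed`, `mark_processed`, and the manifest file are hypothetical, not part of any existing setup.

```python
from pathlib import Path


def find_unprocessed(delta_dir: Path, manifest: Path) -> list[Path]:
    """Return delta CSVs in delta_dir whose names are not yet in the manifest."""
    seen = set(manifest.read_text().splitlines()) if manifest.exists() else set()
    # Sorting by name is chronological only if file names embed the date
    # in a sortable form (an assumption, e.g. delta_2024-01-01.csv).
    return sorted(p for p in delta_dir.glob("*.csv") if p.name not in seen)


def mark_processed(manifest: Path, delta: Path) -> None:
    """Record a file name once its rows have been merged successfully."""
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(delta.name + "\n")
```

Calling `mark_processed` only after a successful merge means a crashed run is retried the next day, which is safe as long as the merge step itself skips rows it has already stored.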

What I've Solved So Far

I have already resolved the issue of duplicate timestamps for the same Booking_ID. I implemented a custom sequential index based on the system's ingestion order, combining it with the timestamp to ensure the correct chronological order of status transitions per booking.
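For illustration, such an index might look roughly like the sketch below. It is not my exact code: the `Ingest_Seq` column name and the sample status values are made up for the example, and it assumes the row order inside each CSV reflects the system's ingestion order.

```python
import csv
import io


def read_with_sequence(fh, start_seq=0):
    """Attach an ingestion counter (Ingest_Seq, a hypothetical column) to each row."""
    rows = []
    for offset, row in enumerate(csv.DictReader(fh)):
        row["Ingest_Seq"] = start_seq + offset
        rows.append(row)
    return rows


def chronological(rows):
    """Order by booking, then timestamp, then ingestion order as the tie-breaker."""
    # 'YYYY-MM-DD HH:MM:SS' strings sort chronologically as plain text.
    return sorted(rows, key=lambda r: (r["Booking_ID"], r["Timestamp"], r["Ingest_Seq"]))


sample = io.StringIO(
    "Booking_ID,Previous_Status,New_Status,Timestamp\n"
    "B1,Pending,Confirmed,2024-01-01 10:00:00\n"
    "B1,Confirmed,Paid,2024-01-01 10:00:00\n"
    "B1,New,Pending,2024-01-01 09:00:00\n"
)
ordered = chronological(read_with_sequence(sample))
print([r["New_Status"] for r in ordered])  # ['Pending', 'Confirmed', 'Paid']
```

The two transitions that share `10:00:00` keep their file order because the counter breaks the tie.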

Where I am Stuck

I am struggling with the comparison and deduplication logic during the cumulative merge process. When I read the new daily delta file, how can I efficiently compare it against the master file to append new bookings, insert new status transitions in the correct sequence, and avoid duplicating already processed transitions?
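To make the requirement concrete, here is a minimal sketch of the kind of merge I have in mind, using only the standard library. It assumes that the four key columns together identify a transition, so a transition that legitimately repeats with identical values within the same second would be dropped. `merge_delta` and `transition_key` are hypothetical names.

```python
import csv
from pathlib import Path

FIELDS = ["Booking_ID", "Previous_Status", "New_Status", "Timestamp"]


def transition_key(row):
    """Assumed natural key: all four columns together identify one transition."""
    return tuple(row[f] for f in FIELDS)


def merge_delta(master: Path, delta: Path) -> int:
    """Append delta rows not yet in master; return how many were appended."""
    seen = set()
    if master.exists():
        with master.open(newline="", encoding="utf-8") as fh:
            seen = {transition_key(r) for r in csv.DictReader(fh)}
    write_header = not master.exists() or master.stat().st_size == 0
    appended = 0
    with delta.open(newline="", encoding="utf-8") as src, \
            master.open("a", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        for row in csv.DictReader(src):
            key = transition_key(row)
            if key not in seen:  # skip transitions already in the ledger
                seen.add(key)
                writer.writerow(row)
                appended += 1
    return appended
```

In this sketch rows are only ever appended, never inserted mid-file. If delta files are merged in date order, a row's position in the Master file reflects ingestion order, so the per-booking sequence can be rebuilt at read time by sorting on Booking_ID, Timestamp, and row position. The cost is that every run rereads the whole Master file to build the key set, which is the part I suspect will not scale.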

Are there established patterns, algorithms, or specific library features (e.g., in Python/Pandas or SQL) best suited for this type of incremental ledger-style merging?
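On the SQL side, one pattern I am aware of is to let a uniqueness constraint do the deduplication. Below is a hedged sketch using SQLite through Python's built-in `sqlite3` module; the table and column names are hypothetical, and it assumes the four columns form a usable natural key.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS transitions (
    seq             INTEGER PRIMARY KEY AUTOINCREMENT,  -- ingestion order
    booking_id      TEXT NOT NULL,
    previous_status TEXT NOT NULL,
    new_status      TEXT NOT NULL,
    ts              TEXT NOT NULL,
    UNIQUE (booking_id, previous_status, new_status, ts)
)
"""


def merge_rows(conn, rows):
    """Insert (booking_id, previous_status, new_status, ts) tuples, skipping known ones."""
    before = conn.execute("SELECT COUNT(*) FROM transitions").fetchone()[0]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO transitions "
            "(booking_id, previous_status, new_status, ts) VALUES (?, ?, ?, ?)",
            rows,
        )
    after = conn.execute("SELECT COUNT(*) FROM transitions").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
day1 = [("B1", "New", "Pending", "2024-01-01 09:00:00")]
day2 = day1 + [("B1", "Pending", "Confirmed", "2024-01-01 09:00:00")]
print(merge_rows(conn, day1), merge_rows(conn, day2))  # 1 1
history = conn.execute(
    "SELECT new_status FROM transitions WHERE booking_id = ? ORDER BY ts, seq", ("B1",)
).fetchall()
```

Here the database index replaces the daily full scan of the Master file, and `ORDER BY ts, seq` recovers the per-booking sequence even when timestamps collide. I am unsure whether this is the idiomatic choice or whether a Pandas-based approach would be preferable.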

Any guidance or architectural advice would be highly appreciated.
