I am trying to apply changes from one dataframe (source file is a 7 MB CSV) to a larger dataframe (source file is an approx. 3 GB CSV): rows with matching IDs should be updated, and rows whose ID does not yet exist in the larger dataframe should be added. I believe the correct way to do this is to use the Polars update() method with the "how" strategy set to "full".
Unfortunately, this works fine when testing on my local machine but silently fails in a Cloud Function environment, even with the container configured for 8 GB of RAM.
I am using scan_csv() with infer_schema=False to get LazyFrames (with only string columns) of the two datasets before calling update(). I tried logging intermediate results using describe(): the stats are logged just fine for each of the source datasets, but execution never gets past the update() to log describe() for the resulting dataframe:
    import logging

    import polars as pl

    large_df = pl.scan_csv(large_file_path, infer_schema=False)
    small_df = pl.scan_csv(small_file_path, infer_schema=False)

    logging.info(f'LARGE: {large_df.describe()}')  # Logs are visible for this
    logging.info(f'SMALL: {small_df.describe()}')  # Logs are visible for this

    merged_df = large_df.update(small_df, how='full', on='id')  # results in OOM in the Cloud Function log
    logging.info(f'MERGED: {merged_df.describe()}')  # Never reaches this line

Am I doing anything wrong or inefficient here?
