Polars LazyFrame update() silently failing in a serverless Cloud Function (OOM error)

4 days ago 12


I am trying to apply changes from one dataframe (source: a 7 MB CSV file) to a larger dataframe (source: an approx. 3 GB CSV file), i.e. update existing rows with matching IDs while also adding new rows whose IDs do not yet exist in the larger dataframe. I believe the correct way to do this is to use the Polars update() method with the "how" strategy set to "full".
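To make the intended result concrete, here is a minimal plain-Python sketch of the upsert semantics described above. It is not Polars code and says nothing about how Polars implements update(). The helper name `upsert_rows` and the sample columns `name` and `qty` are hypothetical, and the rule that null values in the smaller dataset leave the existing value untouched reflects my understanding of the update() default (include_nulls=False).

```python
from typing import Optional

# A row is a mapping of column name to string value (None models a null),
# mirroring the all-strings frames produced by infer_schema=False.
Row = dict[str, Optional[str]]


def upsert_rows(large: list[Row], small: list[Row], key: str = "id") -> list[Row]:
    """Hypothetical helper modelling the intended result of update(how='full')."""
    # Index the large dataset by key, copying rows so the inputs stay untouched.
    merged: dict[Optional[str], Row] = {row[key]: dict(row) for row in large}
    for row in small:
        existing = merged.get(row[key])
        if existing is None:
            # ID not present in the large dataset: add it as a new row.
            merged[row[key]] = dict(row)
        else:
            # ID already present: overwrite only with non-null values
            # (assumed default behaviour, see the note above).
            for column, value in row.items():
                if value is not None:
                    existing[column] = value
    return list(merged.values())


large_rows = [
    {"id": "1", "name": "alpha", "qty": "10"},
    {"id": "2", "name": "beta", "qty": "20"},
]
small_rows = [
    {"id": "2", "name": None, "qty": "25"},     # existing ID, partial change
    {"id": "3", "name": "gamma", "qty": "30"},  # new ID
]
result = upsert_rows(large_rows, small_rows)
```

In this toy data, row 1 is left alone, row 2 keeps its name but gets the new qty, and row 3 is appended.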

Unfortunately, this works fine when I test on my local machine, but it silently fails in a Cloud Function environment, even with the container configured for 8 GB of RAM.

I am using scan_csv() with infer_schema=False to get LazyFrames (string columns only) of the two datasets before calling update(). I tried logging intermediate results with describe(): the stats are logged just fine for each of the source datasets, but execution never gets past the update() to log describe() for the resulting dataframe:

    import logging

    import polars as pl

    # Lazily scan both CSVs; infer_schema=False reads every column as a string
    large_df = pl.scan_csv(large_file_path, infer_schema=False)
    small_df = pl.scan_csv(small_file_path, infer_schema=False)

    logging.info(f'LARGE: {large_df.describe()}')  # Logs are visible for this
    logging.info(f'SMALL: {small_df.describe()}')  # Logs are visible for this

    merged_df = large_df.update(small_df, how='full', on='id')  # results in OOM in the Cloud Function log
    logging.info(f'MERGED: {merged_df.describe()}')  # Never reaches this line

Am I doing anything wrong or inefficient here?
