Fastest way to read 10M DB rows in Python?

I’m trying to read about 10 million rows (a single column) from a database table into Python as efficiently as possible, and I’m not sure whether my current approach is reasonable or whether I’m missing some obvious optimizations.

Approach 1: cursor + fetchmany
On average, this takes around 1.2 minutes to read 10 million rows.

sql = f"SELECT {col_id} FROM {table_id}" raw_conn = engine.raw_connection() try: cursor = raw_conn.cursor() cursor.execute(sql) total_rows = 0 while True: rows = cursor.fetchmany(chunk_size) if not rows: break # Direct string conversion - fastest approach values.extend(str(row[0]) for row in rows)

Approach 2: pandas read_sql with chunks
On average, this takes around 2 minutes to read 10 million rows.

sql = f"SELECT {col_id} FROM {table_id} WHERE {col_id} IS NOT NULL" values: List[str] = [] for chunk in pd.read_sql(sql, engine, chunksize=CHUNK_SIZE): # .astype(str) keeps nulls out (already filtered in SQL) values.extend(chunk.iloc[:, 0].astype(str).tolist())

What is the most efficient way to read this many rows from the table into Python?
Are these timings (~1.2–2 minutes for 10 million rows) reasonable, or can this be significantly improved with a different pattern (e.g., driver settings, batching strategy, multiprocessing, or a different library)?
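
For the “different library” part of the question, one option I’ve seen mentioned but haven’t tried is ConnectorX, which reads query results directly into pandas/Arrow and can split the query across parallel connections. A sketch only; the connection URI, the assumption that {col_id} is numeric (required for partition_on), and driver support for my database are all assumptions on my part:

import connectorx as cx

# conn_uri is a plain database URI, e.g. "postgresql://user:pass@host:5432/db"
# (ConnectorX takes a URI string, not a SQLAlchemy engine).
query = f"SELECT {col_id} FROM {table_id} WHERE {col_id} IS NOT NULL"

df = cx.read_sql(
    conn_uri,
    query,
    return_type="pandas",   # "arrow" and "polars" are also supported
    partition_on=col_id,    # must be a numeric column
    partition_num=4,        # read 4 partitions in parallel
)
values = df.iloc[:, 0].astype(str).tolist()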
