My goal is to run a scikit-learn Pipeline over a large HDF5 file that doesn't fit into RAM.
The core data is an irregular multivariate time series (a very long 2D array). It could be split column-wise to fit in memory, but my sklearn pipeline requires a single 2D dataframe.
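For context, the downstream stage is shaped roughly like this (simplified; the steps are placeholders, not my actual pipeline). The point is that the transforms couple all columns, which is why feeding it column blocks separately would change the result:

```python
# Simplified placeholder for the downstream stage (not my real pipeline):
# the steps couple all columns, so a single wide 2D table is required.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=3)),
])
# pipeline.fit_transform(X)  # X = the full 2D table, which doesn't fit in RAM
```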
Dask is said to be capable of on-disk (out-of-core) processing, but in my case it fails before it even loads the data, with a hashing error.
Minimal reproducible example below:
```python
# Python v3.13.9, dask 2025.10.0, h5py 3.15.1
import h5py
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

# stores random data as 'test' dataset via h5py
with h5py.File("test.h5", 'w') as f:
    arr = np.random.rand(10, 10)
    f.create_dataset(name='test', data=arr, dtype=np.float32)

# tries loading the data via h5py
with h5py.File("test.h5", 'r') as f:
    loaded_arr = f['test']

    # creates a dask array (WORKS)
    dask_array = da.from_array(loaded_arr)
    dask_array[3, 7].compute()
    print("OK", type(dask_array))

    # creates a pandas dataframe (WORKS)
    pandas_dataframe = pd.DataFrame(loaded_arr)
    pandas_dataframe.iloc[3, 7]
    print("OK", type(pandas_dataframe))

    # creates a dask dataframe (FAILS)
    dask_dataframe = dd.from_array(loaded_arr)
    dask_dataframe.iloc[3, 7].compute()
    print("OK", type(dask_dataframe))
```

The last section (dd.from_array) always results in this weird hashing error:

```
TokenizationError: Object <HDF5 dataset "test": shape (10, 10), type "<f4"> cannot be deterministically hashed. This likely indicates that the object cannot be serialized deterministically.
```
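I could probably take a detour through dask.array, since da.from_array succeeds above, and then wrap the result with dd.from_dask_array. That feels like a workaround rather than a fix, and I haven't validated it on the real, larger-than-RAM file; a sketch of what I mean:

```python
# Possible detour (sketch only, not validated on the real data): the dask
# array construction works above, so convert that into a dask dataframe.
with h5py.File("test.h5", 'r') as f:
    dset = f['test']
    darr = da.from_array(dset, chunks=(5, 10))   # arbitrary row-wise chunks
    ddf = dd.from_dask_array(darr)               # lazy dask dataframe
    print(ddf.head())                            # materializes only a few rows
```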
The dask docs say that from_array accepts anything array-like that supports indexing:

> Uses getitem syntax to pull slices out of the array. The array need not be a NumPy array but must support slicing syntax.

An h5py.Dataset definitely supports slicing, so I don't understand what's wrong.
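As a sanity check that the dataset really satisfies that contract, plain getitem slicing on it works and returns NumPy arrays:

```python
# Sanity check: h5py.Dataset supports __getitem__ slicing directly,
# so it should satisfy the array-like contract quoted above.
with h5py.File("test.h5", 'r') as f:
    dset = f['test']
    block = dset[0:5, 3:7]              # plain slicing on the dataset
    print(type(block), block.shape)     # <class 'numpy.ndarray'> (5, 4)
```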
There is also a way to read HDF5 directly (via PyTables), but that fails too, because of subtle differences between how PyTables and h5py lay out the file. At least that failure was expected:
dd.read_hdf("test.h5",key="test") TypeError: An error occurred while calling the read_hdf method registered to the pandas backend. Original Message: cannot create a storer if the object is not existing nor a value are passedI can't load everything in memory as an intermediate step (dd.from_array(loaded_arr[...]) or dd.from_pandas(pd.DataFrame(loaded_arr)))
And there is no way I am going to switch file creation from h5py to the PyTables backend anyway, even if that would help.
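For completeness, my understanding is that dd.read_hdf only handles files written through pandas' HDFStore (PyTables) layer, so switching would mean creating the file roughly like this (which, again, is not an option for me):

```python
# For completeness only: what dd.read_hdf apparently expects -- a file
# written through pandas' HDFStore / PyTables layer, not raw h5py.
df = pd.DataFrame(np.random.rand(10, 10),
                  columns=[f"c{i}" for i in range(10)])
df.to_hdf("test_pytables.h5", key="test", format="table")  # 'table' is chunk-readable
ddf = dd.read_hdf("test_pytables.h5", key="test")
print(ddf.head())
```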
I don't really want to use dask either, but its only real alternative, vaex, currently doesn't support Python 3.13 (yes, in 2025). So I'm stuck.
I also don't want to implement all the chunked data processing myself. The simplest option I have is to split the data column-wise, but that breaks the later processing stages.
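To be explicit about what I'm trying to avoid, the hand-rolled chunked processing would look roughly like this (the chunk size and the incremental logic are made up for illustration):

```python
# Sketch of the manual chunked processing I'd rather not maintain myself
# (chunk size and the incremental-fit logic are illustrative only).
CHUNK_ROWS = 1000

with h5py.File("test.h5", 'r') as f:
    dset = f['test']
    for start in range(0, dset.shape[0], CHUNK_ROWS):
        chunk = pd.DataFrame(dset[start:start + CHUNK_ROWS, :])
        # ... feed `chunk` into an incremental estimator, carry state between
        # chunks, stitch the partial results back together, etc.
```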
