My goal is to run a scikit-learn Pipeline over a large HDF5 file that doesn't fit into RAM.
The core data is an irregular multivariate time series (a very long 2D array). It could be split column-wise to fit in memory, but my sklearn pipeline requires a single 2D dataframe.
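For context, the downstream stage is shaped roughly like this (simplified; the steps are placeholders, not my actual pipeline). The point is that the transforms couple all columns, which is why feeding it column blocks separately would change the result:

```python
# Simplified placeholder for the downstream stage (not my real pipeline):
# the steps couple all columns, so a single wide 2D table is required.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=3)),
])
# pipeline.fit_transform(X)  # X = the full 2D table, which doesn't fit in RAM
```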
Dask is said to be capable of on-disk (out-of-core) processing, but in my case it fails before it even loads the data, with a hashing error.
Minimal reproducible example below:
```python
# Python v3.13.9, dask 2025.10.0, h5py 3.15.1
import h5py
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

# stores random data as 'test' dataset via h5py
with h5py.File("test.h5", 'w') as f:
    arr = np.random.rand(10, 10)
    f.create_dataset(name='test', data=arr, dtype=np.float32)

# tries loading the data via h5py
with h5py.File("test.h5", 'r') as f:
    loaded_arr = f['test']

    # creates a dask array (WORKS)
    dask_array = da.from_array(loaded_arr)
    dask_array[3, 7].compute()
    print("OK", type(dask_array))

    # creates a pandas dataframe (WORKS)
    pandas_dataframe = pd.DataFrame(loaded_arr)
    pandas_dataframe.iloc[3, 7]
    print("OK", type(pandas_dataframe))

    # creates a dask dataframe (FAILS)
    dask_dataframe = dd.from_array(loaded_arr)
    dask_dataframe.iloc[3, 7].compute()
    print("OK", type(dask_dataframe))
```

The last section (dd.from_array) always results in this weird hashing error:

```
TokenizationError: Object <HDF5 dataset "test": shape (10, 10), type "<f4"> cannot be deterministically hashed. This likely indicates that the object cannot be serialized deterministically.
```
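I could probably take a detour through dask.array, since da.from_array succeeds above, and then wrap the result with dd.from_dask_array. That feels like a workaround rather than a fix, and I haven't validated it on the real, larger-than-RAM file; a sketch of what I mean:

```python
# Possible detour (sketch only, not validated on the real data): the dask
# array construction works above, so convert that into a dask dataframe.
with h5py.File("test.h5", 'r') as f:
    dset = f['test']
    darr = da.from_array(dset, chunks=(5, 10))   # arbitrary row-wise chunks
    ddf = dd.from_dask_array(darr)               # lazy dask dataframe
    print(ddf.head())                            # materializes only a few rows
```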
The dask docs say that from_array accepts anything array-like that supports indexing:

> Uses getitem syntax to pull slices out of the array. The array need not be a NumPy array but must support slicing syntax.

An h5py.Dataset definitely supports slicing, so I don't understand what's wrong.
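As a sanity check that the dataset really satisfies that contract, plain getitem slicing on it works and returns NumPy arrays:

```python
# Sanity check: h5py.Dataset supports __getitem__ slicing directly,
# so it should satisfy the array-like contract quoted above.
with h5py.File("test.h5", 'r') as f:
    dset = f['test']
    block = dset[0:5, 3:7]              # plain slicing on the dataset
    print(type(block), block.shape)     # <class 'numpy.ndarray'> (5, 4)
```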
There is also a way to read HDF5 directly (via PyTables), but that fails too, because of subtle differences between how PyTables and h5py lay out the file. At least that failure was expected:
dd.read_hdf("test.h5",key="test") TypeError: An error occurred while calling the read_hdf method registered to the pandas backend. Original Message: cannot create a storer if the object is not existing nor a value are passedI can't load everything in memory as an intermediate step (dd.from_array(loaded_arr[...]) or dd.from_pandas(pd.DataFrame(loaded_arr)))
And there is no way I am going to switch file creation from h5py to the PyTables backend anyway, even if that would help.
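For completeness, my understanding is that dd.read_hdf only handles files written through pandas' HDFStore (PyTables) layer, so switching would mean creating the file roughly like this (which, again, is not an option for me):

```python
# For completeness only: what dd.read_hdf apparently expects -- a file
# written through pandas' HDFStore / PyTables layer, not raw h5py.
df = pd.DataFrame(np.random.rand(10, 10),
                  columns=[f"c{i}" for i in range(10)])
df.to_hdf("test_pytables.h5", key="test", format="table")  # 'table' is chunk-readable
ddf = dd.read_hdf("test_pytables.h5", key="test")
print(ddf.head())
```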
I don't really want to use dask either, but its only real alternative, vaex, currently doesn't support Python 3.13 (yes, in 2025). So I'm stuck.
I also don't want to implement all the chunked data processing myself. The simplest option I have is to split the data column-wise, but that breaks the later processing stages.
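To be explicit about what I'm trying to avoid, the hand-rolled chunked processing would look roughly like this (the chunk size and the incremental logic are made up for illustration):

```python
# Sketch of the manual chunked processing I'd rather not maintain myself
# (chunk size and the incremental-fit logic are illustrative only).
CHUNK_ROWS = 1000

with h5py.File("test.h5", 'r') as f:
    dset = f['test']
    for start in range(0, dset.shape[0], CHUNK_ROWS):
        chunk = pd.DataFrame(dset[start:start + CHUNK_ROWS, :])
        # ... feed `chunk` into an incremental estimator, carry state between
        # chunks, stitch the partial results back together, etc.
```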
