I am benchmarking several tabular Generative AI models (including TVAE, TabDDPM, and WGAN-GP) to synthesize sensor data. I need to rigorously evaluate the statistical similarity between my generated synthetic datasets and the empirical baseline dataset.
I want to use the two-sample Kolmogorov-Smirnov test (scipy.stats.ks_2samp). However, my dataset is multivariate (consisting of features such as Efficiency, Load, and Object Type), while the standard KS test is designed for 1D distributions.
My question: What is the standard programmatic approach in Python to compute an aggregated KS score for a multidimensional tabular dataset?
Should I iterate through each continuous column independently, run ks_2samp, and average the p-values/statistics?
Or is there a specific library/method better suited for multivariate empirical cumulative distribution functions (eCDFs) in this context?
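For reference, here is a minimal sketch of the per-column approach I am considering. The helper name, toy column values, and aggregation by averaging the KS statistics are my own assumptions, not an established convention; only Efficiency and Load are tested, since Object Type is categorical and the KS test applies to continuous 1D distributions:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def per_column_ks(real: pd.DataFrame, synth: pd.DataFrame, columns):
    """Run a two-sample KS test on each continuous column.

    Returns a dict of per-column results and the mean KS statistic
    (averaging statistics is my ad-hoc aggregation, not a standard score).
    """
    results = {}
    for col in columns:
        stat, p = ks_2samp(real[col].dropna(), synth[col].dropna())
        results[col] = {"statistic": stat, "p_value": p}
    mean_stat = float(np.mean([r["statistic"] for r in results.values()]))
    return results, mean_stat

# Hypothetical usage with toy data mimicking my schema:
rng = np.random.default_rng(0)
real = pd.DataFrame({"Efficiency": rng.normal(0.80, 0.05, 1000),
                     "Load": rng.uniform(0, 100, 1000)})
synth = pd.DataFrame({"Efficiency": rng.normal(0.79, 0.06, 1000),
                      "Load": rng.uniform(0, 100, 1000)})
per_col, mean_ks = per_column_ks(real, synth, ["Efficiency", "Load"])
```

My concern with this loop is that it ignores dependencies between columns, which is exactly why I am asking whether a multivariate eCDF-based method would be more appropriate.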
Thank you for your insights!
