I am currently working on a project that synthesizes tabular data for an electromagnetic interference detection system. My dataset contains mixed data types: continuous variables (e.g., power efficiency η, load conditions RL) and categorical variables (e.g., spatial classes).
I am using CTGAN from the SDV (Synthetic Data Vault) library and trying to optimize its hyperparameters (like embedding_dim, generator_dim, batch_size) using Optuna.
My current approach uses the macro F1-score of a downstream machine-learning classifier (CatBoost) as the reward signal in Optuna's objective function. However, the tuning process is extremely slow, and the CTGAN model sometimes suffers from mode collapse during certain Optuna trials.
My Questions:
1. Is a downstream classifier's macro F1-score an efficient objective metric for Optuna when tuning CTGAN, or should I switch to a statistical metric (e.g., the per-column Kolmogorov-Smirnov statistic) to speed up the trials?
2. What is the best practice for handling `optuna.TrialPruned` exceptions when the GAN discriminator overpowers the generator early in the training loop?
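On the metric question, the statistical alternative I have in mind is the mean two-sample KS statistic over the continuous columns, which would avoid retraining CatBoost in every trial. A minimal sketch (the column names `eta` and `R_L` are just stand-ins for my η and RL features):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def mean_ks_distance(real, synth, cont_cols):
    """Average two-sample Kolmogorov-Smirnov statistic over the
    continuous columns: 0 = identical marginals, 1 = disjoint.
    Lower is better, so Optuna would minimize this instead of
    maximizing F1."""
    return float(np.mean([ks_2samp(real[c], synth[c]).statistic
                          for c in cont_cols]))

# Toy check with stand-in columns for eta (power efficiency) and R_L:
rng = np.random.default_rng(0)
real = pd.DataFrame({"eta": rng.normal(0.9, 0.02, 1000),
                     "R_L": rng.uniform(10, 100, 1000)})
synth = pd.DataFrame({"eta": rng.normal(0.9, 0.02, 1000),
                      "R_L": rng.uniform(10, 100, 1000)})
print(mean_ks_distance(real, synth, ["eta", "R_L"]))  # small for matching marginals
```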
Any code snippets or architectural advice would be highly appreciated!
