I built a regression pipeline for predicting a continuous target (accident_risk) using XGBoost with a focus on avoiding data leakage and following best practices.
What I've already done

- Used `Pipeline` + `ColumnTransformer` so preprocessing is fit only on training folds during CV (leakage-free)
- Used `OneHotEncoder(handle_unknown='ignore')` instead of `pd.get_dummies`
- Ensured imputers are fit only on training data
- Used `RandomizedSearchCV` instead of `GridSearchCV`
- Used early stopping for both models
- Built a simple weighted ensemble of XGBoost + CatBoost
- Retrained on the full dataset before the final submission
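For concreteness, the weighted ensemble mentioned above boils down to a convex combination of the two models' predictions. This is a minimal sketch; the 0.6/0.4 weights are placeholders, not my tuned values:

```python
import numpy as np

def blend(pred_xgb, pred_cat, w_xgb=0.6, w_cat=0.4):
    """Weighted average of two regressors' predictions.

    Weights are illustrative placeholders and should sum to 1.
    """
    return w_xgb * np.asarray(pred_xgb) + w_cat * np.asarray(pred_cat)
```

In practice the weights can be chosen by grid search on out-of-fold predictions rather than fixed by hand.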
My questions

1. Are there any improvements I can make to:
   - model performance?
   - code structure / design?
2. Is my approach to:
   - handling categorical variables
   - cross-validation
   - final retraining
   correct and optimal?
3. Are there better alternatives to my current ensemble strategy (weighted average)?
Code
Preprocessor
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_preprocessor(num_cols, cat_cols):
    num_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
    ])
    cat_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ])
    return ColumnTransformer(
        transformers=[
            ("num", num_transformer, num_cols),
            ("cat", cat_transformer, cat_cols),
        ]
    )
```

XGBoost Pipeline
```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

xgb_pipeline = Pipeline([
    ("preprocessor", build_preprocessor(num_cols, cat_cols)),
    ("xgb", XGBRegressor(tree_method="hist", eval_metric="rmse")),
])

param_dist = {
    "xgb__n_estimators": [200, 400, 600],
    "xgb__max_depth": [3, 5, 7],
    "xgb__learning_rate": [0.01, 0.05, 0.1],
}

search = RandomizedSearchCV(
    xgb_pipeline,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)
```

Additional context
- Dataset contains both numerical and categorical features
- Evaluation metric: R²
- Using 5-fold CV
I would appreciate any suggestions on improving performance, readability, or best practices.
