I built a regression pipeline for predicting a continuous target (accident_risk) using XGBoost with a focus on avoiding data leakage and following best practices.
What I've already done

- Used `Pipeline` + `ColumnTransformer` so preprocessing is fit only on training folds during CV (leakage-free)
- Used `OneHotEncoder(handle_unknown='ignore')` instead of `pd.get_dummies`
- Ensured imputers are fit only on training data
- Used `RandomizedSearchCV` instead of `GridSearchCV`
- Used early stopping for both models
- Built a simple weighted ensemble of XGBoost + CatBoost
- Retrained on the full dataset before the final submission
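For concreteness, the weighted ensemble mentioned above boils down to a convex combination of the two models' predictions. This is a minimal sketch; the 0.6/0.4 weights are placeholders, not my tuned values:

```python
import numpy as np

def blend(pred_xgb, pred_cat, w_xgb=0.6, w_cat=0.4):
    """Weighted average of two regressors' predictions.

    Weights are illustrative placeholders and should sum to 1.
    """
    return w_xgb * np.asarray(pred_xgb) + w_cat * np.asarray(pred_cat)
```

In practice the weights can be chosen by grid search on out-of-fold predictions rather than fixed by hand.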
My questions

1. Are there any improvements I can make to:
   - model performance?
   - code structure / design?
2. Is my approach to:
   - handling categorical variables
   - cross-validation
   - final retraining
   correct and optimal?
3. Are there better alternatives to my current ensemble strategy (weighted average)?
Code
Preprocessor
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_preprocessor(num_cols, cat_cols):
    num_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
    ])
    cat_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ])
    return ColumnTransformer(
        transformers=[
            ("num", num_transformer, num_cols),
            ("cat", cat_transformer, cat_cols),
        ]
    )
```

XGBoost Pipeline
```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

xgb_pipeline = Pipeline([
    ("preprocessor", build_preprocessor(num_cols, cat_cols)),
    ("xgb", XGBRegressor(tree_method="hist", eval_metric="rmse")),
])

param_dist = {
    "xgb__n_estimators": [200, 400, 600],
    "xgb__max_depth": [3, 5, 7],
    "xgb__learning_rate": [0.01, 0.05, 0.1],
}

search = RandomizedSearchCV(
    xgb_pipeline,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)
```

Additional context
- Dataset contains both numerical and categorical features
- Evaluation metric: R²
- Using 5-fold CV
I would appreciate any suggestions on improving performance, readability, or best practices.
