I am developing a spatiotemporal tree-based ensemble framework (utilizing LightGBM, XGBoost, and CatBoost) to forecast dengue outbreaks based on climate variables (temperature, precipitation, humidity) and lagged historical case counts.
While tree-based algorithms are theoretically invariant to monotonic feature scaling, I am implementing scaling primarily because:
1. I am calculating SHAP (SHapley Additive exPlanations) values for post-hoc model interpretability and global feature importance.
2. I am applying forward aggregation across temporal slices to prevent data leakage, which means the range and variance of each feature shift across training and validation windows.
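To make the second point concrete, here is a minimal sketch of leakage-safe scaling. It assumes "forward aggregation" means an expanding training window, and it uses scikit-learn's `TimeSeriesSplit` as a stand-in for the real slicing logic. The `rainfall` series and the `fold_stats` list are hypothetical names for illustration only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

# Hypothetical weekly rainfall series, ordered oldest to newest
rng = np.random.default_rng(0)
rainfall = rng.normal(150, 30, size=(120, 1))

fold_stats = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(rainfall):
    # Fit the scaler on past data only, then apply it to the later window
    scaler = StandardScaler().fit(rainfall[train_idx])
    val_scaled = scaler.transform(rainfall[val_idx])
    fold_stats.append({
        "n_train": len(train_idx),
        "last_train": int(train_idx.max()),
        "first_val": int(val_idx.min()),
        "train_mean": float(scaler.mean_[0]),
        "train_std": float(scaler.scale_[0]),
        "val_scaled_mean": float(val_scaled.mean()),
    })

for stats in fold_stats:
    print(stats)
```

The fitted mean and standard deviation change from fold to fold, so the same raw rainfall value maps to a different scaled value in each window.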
I am debating between StandardScaler (Z-score normalization) and MinMaxScaler (0-1 normalization). Given the spatiotemporal and epidemiological nature of the data, StandardScaler appears to behave more robustly, but I want to ensure my architectural justification is sound.
Here is a minimal demonstration of how the choice affects an extreme climate outlier (e.g., a massive monsoon rainfall anomaly):
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Simulating a climate feature with a severe anomaly (monsoon spike)
np.random.seed(42)
weekly_rainfall = np.random.normal(loc=150, scale=30, size=100)
weekly_rainfall = np.append(weekly_rainfall, [650])  # Extreme outlier event
df = pd.DataFrame({'Rainfall': weekly_rainfall})

# Applying both scalers (each is fitted on the full series, outlier included)
df['MinMax'] = MinMaxScaler().fit_transform(df[['Rainfall']])
df['Standard'] = StandardScaler().fit_transform(df[['Rainfall']])

# Variance of the 100 non-outlier weeks in each scaled column
print("Variance of normal weeks under MinMax:", df['MinMax'].iloc[:-1].var())
print("Variance of normal weeks under Standard:", df['Standard'].iloc[:-1].var())
```
My Observations & Core Dilemma:
1. The Outlier Compression Issue: Epidemic forecasting relies heavily on anomaly detection (e.g., a sudden spike in humidity or rainfall triggers a surge in vector breeding). With MinMaxScaler, the single extreme outlier (650 mm) compresses the entire "normal" historical distribution into a narrow band close to 0.
2. SHAP Interpretability Impact: Because MinMaxScaler shrinks the absolute spacing and variance of the non-outlier data points, I have noticed that it seems to subtly distort the baseline comparison against the SHAP expected value, making normal variations look uniform to the explainer.
3. StandardScaler Robustness: StandardScaler appears to preserve the variance structure of the normal data because it centers on the mean and scales by the standard deviation, letting the anomaly sit out at $+5\sigma$ or $+6\sigma$ without destroying the internal resolution of lower-valued inputs.
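A quick way to put numbers on these observations is to measure how wide the normal weeks are relative to the outlier's distance under each scaler. This is a minimal sketch on a re-created series similar to the one above. The helper `spread_ratio` is a hypothetical name introduced here, not part of any library.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Re-create a similar series: 100 normal weeks plus one 650 mm outlier
rng = np.random.default_rng(42)
rain = np.append(rng.normal(150, 30, 100), 650.0).reshape(-1, 1)

mm = MinMaxScaler().fit_transform(rain).ravel()
st = StandardScaler().fit_transform(rain).ravel()

def spread_ratio(scaled):
    """Std of the normal weeks divided by the outlier's distance from their mean."""
    normal, outlier = scaled[:-1], scaled[-1]
    return normal.std() / (outlier - normal.mean())

# Both scalers apply an affine map (x - a) / b with b > 0, so this ratio
# measures relative spacing independently of the output units.
ratio_mm = spread_ratio(mm)
ratio_st = spread_ratio(st)
outlier_z = st[-1]                  # where the outlier lands in sigma units
corr = np.corrcoef(mm, st)[0, 1]    # linear agreement between the two columns

print("relative spread (MinMax):  ", ratio_mm)
print("relative spread (Standard):", ratio_st)
print("outlier z-score:", outlier_z)
print("correlation between scaled columns:", corr)
```

Note that the outlier also enters the mean and standard deviation that StandardScaler estimates, so its fitted scale is not that of the normal weeks alone.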
Questions:
1. Is my mathematical rationale sound that StandardScaler is objectively better than MinMaxScaler for tree-based ensemble interpretability (SHAP) when dealing with heavy-tailed epidemiological and climate anomalies?
2. Does the variance compression caused by MinMaxScaler negatively affect gradient-based split finding in LightGBM/XGBoost when handling spatiotemporal forward-aggregated data slices?
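One way to probe the second question empirically is to fit the same boosted model on raw, MinMax-scaled, and Standard-scaled features and compare the predictions. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in for LightGBM/XGBoost, which is an assumption; their histogram-based split finding may differ in detail. The data, the target formula, and the helper `fit_predict` are all synthetic and hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
n = 150
# Synthetic features, rounded to 0.1 like typical station measurements
rainfall = np.round(rng.normal(150, 30, n), 1)
rainfall[-1] = 650.0                       # one extreme monsoon week
humidity = np.round(rng.normal(75, 8, n), 1)
X = np.column_stack([rainfall, humidity])
# Purely illustrative target, not an epidemiological model
y = 0.05 * rainfall + 0.3 * humidity + rng.normal(0, 1, n)

def fit_predict(features):
    """Fit an identically configured model and predict on the training rows."""
    model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
    return model.fit(features, y).predict(features)

pred_raw = fit_predict(X)
pred_minmax = fit_predict(MinMaxScaler().fit_transform(X))
pred_standard = fit_predict(StandardScaler().fit_transform(X))

# Largest absolute disagreement with the unscaled model
print("raw vs MinMax:  ", np.max(np.abs(pred_raw - pred_minmax)))
print("raw vs Standard:", np.max(np.abs(pred_raw - pred_standard)))
```

If the two gaps are effectively zero, the scalers are only relabeling the feature axis for this model, and any difference I see in practice would have to come from somewhere else, such as refitting the scaler on each window.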
