Rationale for StandardScaler over MinMaxScaler in spatiotemporal tree-based ensemble models with SHAP interpretability

1 day ago 2


I am developing a spatiotemporal tree-based ensemble framework (utilizing LightGBM, XGBoost, and CatBoost) to forecast dengue outbreaks based on climate variables (temperature, precipitation, humidity) and lagged historical case counts.

While tree-based algorithms are theoretically invariant to monotonic feature scaling, I am implementing scaling primarily because:

I am calculating SHAP (Shapley Additive Explanations) values for post-hoc model interpretability and global feature importance.

I am applying forward aggregation across temporal slices to prevent data leakage, meaning the range and variance of each feature shift from one training/validation window to the next.
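To make the second point concrete, here is a minimal sketch of per-window scaling, assuming an expanding-window split via scikit-learn's `TimeSeriesSplit`. The synthetic `rainfall` series, its drift, and the number of splits are illustrative assumptions, not my actual pipeline. The scaler is fitted on each training window only and then applied to the following validation window, so no future statistics leak backwards.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical weekly rainfall (mm) with a slow upward drift; illustrative only
rainfall = (150 + 0.5 * np.arange(120) + rng.normal(0, 30, 120)).reshape(-1, 1)

fold_stats = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(rainfall):
    # Fit on the training window only, so validation data never informs the scaler
    scaler = StandardScaler().fit(rainfall[train_idx])
    train_scaled = scaler.transform(rainfall[train_idx])
    val_scaled = scaler.transform(rainfall[val_idx])
    fold_stats.append({
        "n_train": len(train_idx),
        "leak_free": bool(train_idx.max() < val_idx.min()),
        "fitted_mean": float(scaler.mean_[0]),
        "train_scaled_mean": float(train_scaled.mean()),
        "val_scaled_mean": float(val_scaled.mean()),
    })

for s in fold_stats:
    print(s)
```

The fitted mean changes from fold to fold, and the validation window is generally not centred at zero after scaling, which is the window-to-window shift described above.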

I am debating between StandardScaler (Z-score normalization) and MinMaxScaler (0-1 normalization). Given the spatiotemporal and epidemiological nature of the data, StandardScaler appears to behave more robustly, but I want to ensure my architectural justification is sound.

Here is a minimal example of how the choice affects extreme climate outliers (e.g., a massive monsoon rainfall anomaly):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Simulating a climate feature with a severe anomaly (monsoon spike)
np.random.seed(42)
weekly_rainfall = np.random.normal(loc=150, scale=30, size=100)
weekly_rainfall = np.append(weekly_rainfall, [650])  # Extreme outlier event
df = pd.DataFrame({'Rainfall': weekly_rainfall})

# Applying both scalers (each is fitted on the full series, outlier included)
df['MinMax'] = MinMaxScaler().fit_transform(df[['Rainfall']])
df['Standard'] = StandardScaler().fit_transform(df[['Rainfall']])

# Variance of the 100 normal weeks only (the outlier is the last row)
print("Variance of normal weeks under MinMax:", df['MinMax'].iloc[:-1].var())
print("Variance of normal weeks under Standard:", df['Standard'].iloc[:-1].var())

My Observations & Core Dilemma:

The Outlier Compression Issue: Epidemic forecasting relies heavily on anomaly detection (e.g., a sudden spike in humidity/rainfall triggers a vector breeding surge). When using MinMaxScaler, the single extreme outlier (650 mm) compresses the variance of the entire "normal" historical distribution into a tiny, narrow band close to 0.

SHAP Interpretability Impact: Because MinMaxScaler alters the relative spacing and variance of non-outlier data points under compression, I have noticed it subtly distorts the baseline comparison in SHAP expectation values, making normal variations look uniform to the explainer.

StandardScaler Robustness: StandardScaler preserves the variance structure of the normal data because it centers around the mean and scales by standard deviation, allowing the anomaly to exist naturally out at $+5\sigma$ or $+6\sigma$ without destroying the internal resolution of lower-valued inputs.
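One check that may help test these observations: both scalers are affine maps of the raw values (subtract a constant, divide by a positive constant), so they can differ only in units, not in the shape of the distribution. The sketch below uses the same kind of simulated series as above. The helper `relative_spread` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)
# 100 "normal" weeks plus one extreme event, mirroring the simulated setup
rainfall = np.append(rng.normal(150, 30, 100), 650.0).reshape(-1, 1)

minmax = MinMaxScaler().fit_transform(rainfall).ravel()
standard = StandardScaler().fit_transform(rainfall).ravel()

# Z-scoring the MinMax output (population std, as StandardScaler uses)
# should reproduce the StandardScaler output if the two differ only affinely
minmax_restandardized = (minmax - minmax.mean()) / minmax.std()
same_shape = bool(np.allclose(minmax_restandardized, standard))

def relative_spread(x):
    """Hypothetical helper: spread of normal weeks relative to the outlier's distance."""
    return x[:-1].std() / (x[-1] - np.median(x[:-1]))

spread_minmax = relative_spread(minmax)
spread_standard = relative_spread(standard)

print("Same shape after re-standardizing:", same_shape)
print("Relative spread (MinMax, Standard):", spread_minmax, spread_standard)
```

If the two relative spreads match, the smaller variance printed for MinMax reflects a change of units rather than a loss of resolution among the normal weeks.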

Questions:

Is my mathematical rationale sound, i.e., is StandardScaler objectively better than MinMaxScaler for tree-based ensemble interpretability (SHAP) when dealing with heavy-tailed epidemiological and climate anomalies?

Does the variance compression caused by MinMaxScaler negatively impact the gradient splitting efficiency in LightGBM/XGBoost when handling spatiotemporal forward-aggregated data slices?
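A small experiment along these lines could probe the second question empirically. This is only a sketch: it uses scikit-learn's `DecisionTreeRegressor` as a stand-in for a single exact-split tree, not LightGBM or XGBoost, whose histogram binning is not exercised here. The synthetic `cases` target is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
rainfall = np.append(rng.normal(150, 30, 100), 650.0).reshape(-1, 1)
# Hypothetical target: case counts that rise with rainfall, plus noise
cases = 0.2 * rainfall.ravel() + rng.normal(0, 5, rainfall.shape[0])

def fit_predict(X):
    """Fit a shallow tree on X and return its in-sample predictions."""
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    return tree.fit(X, cases).predict(X)

pred_raw = fit_predict(rainfall)
pred_minmax = fit_predict(MinMaxScaler().fit_transform(rainfall))
pred_standard = fit_predict(StandardScaler().fit_transform(rainfall))

print("Raw vs MinMax predictions match:", np.allclose(pred_raw, pred_minmax))
print("Raw vs Standard predictions match:", np.allclose(pred_raw, pred_standard))
```

Matching predictions would indicate that the tree partitions the samples identically under all three representations, since split thresholds depend on the ordering of feature values rather than their scale. Whether the same holds for the binned split finders in the boosting libraries would need to be checked against those libraries directly.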
