I am working on a research project involving Supply Chain Forecast Matching, and I am stuck on the best strategy for handling outliers. I would love some advice from a feature engineering perspective.
1. The Goal: I am building a classification model (XGBoost) to predict if a "New Forecast" matches an "Existing Order" based on historical behavior. The target is binary (is_match 1 or 0).
2. The Data:
Volume: ~21,000 historical update pairs.
Key Features: I am calculating deltas between the old and new orders:
delta_days: Difference in delivery date in days (e.g., +5 for a 5-day delay).
delta_qty_pct: Percentage change in quantity (e.g., +0.10 for a 10% increase).
Nature of Data: The data is heavily skewed (fat-tailed). Most updates are small (0-5 days), but legitimate business events can cause massive shifts (e.g., 60-day delays) that are not data errors.
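For concreteness, this is roughly how I build the two deltas. The column names here are placeholders (my real schema differs a bit), it's just a toy sketch:

```python
import pandas as pd

# Toy example of the delta features -- column names are placeholders.
pairs = pd.DataFrame({
    "old_delivery_date": pd.to_datetime(["2024-03-01", "2024-03-10"]),
    "new_delivery_date": pd.to_datetime(["2024-03-06", "2024-05-09"]),
    "old_qty": [100, 250],
    "new_qty": [110, 250],
})

pairs["delta_days"] = (pairs["new_delivery_date"] - pairs["old_delivery_date"]).dt.days
pairs["delta_qty_pct"] = (pairs["new_qty"] - pairs["old_qty"]) / pairs["old_qty"]
# -> delta_days: [5, 60], delta_qty_pct: [0.10, 0.00]
```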
3. The Conflict (The 50% Problem): I need to clean "Garbage Data" (e.g., typos causing 5,000-day delays) without removing valid business volatility.
The Issue: When I applied the standard IQR Method (1.5 * IQR) globally, it removed ~50% of my dataset (dropping from 21k to 10k rows).
Why: Because most updates cluster around 0 days, the interquartile range is very tight and the resulting fences are narrow (roughly -14 to +10 days in my data). Anything beyond that, e.g., a 20-day delay, gets treated as an outlier, even though 20-day delays are perfectly valid scenarios in my domain.
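For reference, this is essentially the filter I applied. `updates` is a placeholder name for my full ~21k-row table of update pairs:

```python
import pandas as pd

def iqr_filter_global(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where df[col] falls outside the global Tukey fences."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]

# Because most delta_days sit near 0, the fences hug zero, so the filter
# throws out legitimate 20-60 day delays along with the 5,000-day typos.
clean = iqr_filter_global(updates, "delta_days")
print(f"kept {len(clean)} of {len(updates)} rows")  # ~10k of ~21k in my case
```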
4. The Approaches I am Considering (rough sketches of both follow below):
Approach A: Global Winsorization (my current preference). I cap values at the 1st/99th percentiles of the entire dataset instead of removing rows.
Pros: It solves the data loss problem (0% loss): legitimately "Late" updates (60 days) keep their values, and "Crazy" typos (5,000 days) are pulled down to the cap (~the 99th percentile, around 60 days in my data) instead of having their rows deleted.
Cons: It uses a single threshold for all buyers, treating "strict" buyers and "chaotic" buyers the same way during the cleaning phase.
Approach B: Grouped Outlier Removal (suggested by a senior). I group the data by Buyer + Seller + Product and remove outliers per group.
Pros: It respects that "Normal" is different for every buyer.
Cons: Many of my groups are sparse (N < 10). I worry that statistical methods on such small groups will produce near-zero variance estimates and aggressively delete valid data, replicating the "50% loss" issue at a local scale.
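To make the comparison concrete, here is how I read the two options. The 1%/99% caps and the Buyer/Seller/Product keys are the ones described above; the column names and everything else are placeholders, not production code:

```python
import pandas as pd

DELTA_COLS = ["delta_days", "delta_qty_pct"]
GROUP_KEYS = ["buyer_id", "seller_id", "product_id"]  # assumed column names

def winsorize_global(df: pd.DataFrame, cols=DELTA_COLS,
                     lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Approach A: cap each delta at its global 1st/99th percentile, drop nothing."""
    out = df.copy()
    for col in cols:
        lo, hi = out[col].quantile([lower, upper])
        out[col] = out[col].clip(lo, hi)
    return out  # same row count as the input

def iqr_filter_grouped(df: pd.DataFrame, col: str = "delta_days",
                       k: float = 1.5) -> pd.DataFrame:
    """Approach B: drop rows outside the Tukey fences of their own
    Buyer+Seller+Product group."""
    def in_fences(s: pd.Series) -> pd.Series:
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1  # near zero in small, stable groups -> almost everything flagged
        return s.between(q1 - k * iqr, q3 + k * iqr)
    keep = df.groupby(GROUP_KEYS)[col].transform(in_fences)
    return df[keep.astype(bool)]

# capped  = winsorize_global(updates)       # Approach A: ~21k rows in, ~21k out
# trimmed = iqr_filter_grouped(updates)     # Approach B: rows dropped per group
```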
5. My Question: Given that I will later calculate Z-scores as features for the model (which capture the relative "weirdness" of a value per buyer), is it safer to stick with Global Winsorization to preserve data volume, or is Grouped Cleaning standard practice even with sparse groups?
Is there a hybrid approach (e.g., Global Cleaning + Local Feature Engineering) that is preferred in this scenario?
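For context, this is roughly what I picture the hybrid looking like: cap globally, then compute per-buyer Z-scores on the capped values. The sparse-group fallback to global stats is my own guess rather than something I know is standard, and `updates` / `winsorize_global` come from the placeholder sketches above:

```python
import pandas as pd

MIN_GROUP_N = 10  # assumed cutoff, mirroring the "N < 10" sparsity above

def per_buyer_zscore(df: pd.DataFrame, col: str, key: str = "buyer_id") -> pd.Series:
    """Z-score of df[col] relative to the buyer's own history, falling back to
    the global mean/std when the buyer has too few rows or zero variance."""
    grp = df.groupby(key)[col]
    mean, std, n = grp.transform("mean"), grp.transform("std"), grp.transform("size")
    sparse = (n < MIN_GROUP_N) | std.isna() | (std == 0)
    mean = mean.where(~sparse, df[col].mean())
    std = std.where(~sparse, df[col].std())
    return (df[col] - mean) / std

# Step 1: global cleaning -- keep all ~21k rows, just cap the tails.
clean = winsorize_global(updates)
# Step 2: local feature engineering -- "how weird is this update for THIS buyer?"
clean["delta_days_z"] = per_buyer_zscore(clean, "delta_days")
clean["delta_qty_pct_z"] = per_buyer_zscore(clean, "delta_qty_pct")
```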
Thanks for your help!
