I am working on a research project involving Supply Chain Forecast Matching, and I am stuck on the best strategy for handling outliers. I would love some advice from a feature engineering perspective.
1. The Goal: I am building a classification model (XGBoost) to predict if a "New Forecast" matches an "Existing Order" based on historical behavior. The target is binary (is_match 1 or 0).
2. The Data:
Volume: ~21,000 historical update pairs.
Key Features: I am calculating deltas between the old and new orders:
delta_days: Difference in delivery date in days (e.g., +5 for a 5-day delay).
delta_qty_pct: Percentage change in quantity (e.g., +0.10 for a 10% increase).
Nature of Data: The data is heavily skewed (fat-tailed). Most updates are small (0-5 days), but legitimate business events can cause massive shifts (e.g., 60-day delays) that are not data errors.
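For concreteness, this is roughly how I build the two deltas. The column names here are placeholders (my real schema differs a bit), it's just a toy sketch:

```python
import pandas as pd

# Toy example of the delta features -- column names are placeholders.
pairs = pd.DataFrame({
    "old_delivery_date": pd.to_datetime(["2024-03-01", "2024-03-10"]),
    "new_delivery_date": pd.to_datetime(["2024-03-06", "2024-05-09"]),
    "old_qty": [100, 250],
    "new_qty": [110, 250],
})

pairs["delta_days"] = (pairs["new_delivery_date"] - pairs["old_delivery_date"]).dt.days
pairs["delta_qty_pct"] = (pairs["new_qty"] - pairs["old_qty"]) / pairs["old_qty"]
# -> delta_days: [5, 60], delta_qty_pct: [0.10, 0.00]
```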
3. The Conflict (The 50% Problem): I need to clean "Garbage Data" (e.g., typos causing 5,000-day delays) without removing valid business volatility.
The Issue: When I applied the standard IQR Method (1.5 * IQR) globally, it removed ~50% of my dataset (dropping from 21k to 10k rows).
Why: Because most updates cluster around 0 days, the interquartile range is very tight and the resulting fences are narrow (roughly -14 to +10 days in my data). Anything beyond that, e.g., a 20-day delay, gets treated as an outlier, even though 20-day delays are perfectly valid scenarios in my domain.
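For reference, this is essentially the filter I applied. `updates` is a placeholder name for my full ~21k-row table of update pairs:

```python
import pandas as pd

def iqr_filter_global(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where df[col] falls outside the global Tukey fences."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]

# Because most delta_days sit near 0, the fences hug zero, so the filter
# throws out legitimate 20-60 day delays along with the 5,000-day typos.
clean = iqr_filter_global(updates, "delta_days")
print(f"kept {len(clean)} of {len(updates)} rows")  # ~10k of ~21k in my case
```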
4. The Approaches I am Considering (rough sketches of both follow below):
Approach A: Global Winsorization (my current preference). I cap values at the 1st/99th percentiles of the entire dataset instead of removing rows.
Pros: It solves the data loss problem (0% loss): legitimately "Late" updates (60 days) keep their values, and "Crazy" typos (5,000 days) are pulled down to the cap (~the 99th percentile, around 60 days in my data) instead of having their rows deleted.
Cons: It uses a single threshold for all buyers, treating "strict" buyers and "chaotic" buyers the same way during the cleaning phase.
Approach B: Grouped Outlier Removal (suggested by a senior). I group the data by Buyer + Seller + Product and remove outliers per group.
Pros: It respects that "Normal" is different for every buyer.
Cons: Many of my groups are sparse (N < 10). I worry that statistical methods on such small groups will produce near-zero variance estimates and aggressively delete valid data, replicating the "50% loss" issue at a local scale.
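To make the comparison concrete, here is how I read the two options. The 1%/99% caps and the Buyer/Seller/Product keys are the ones described above; the column names and everything else are placeholders, not production code:

```python
import pandas as pd

DELTA_COLS = ["delta_days", "delta_qty_pct"]
GROUP_KEYS = ["buyer_id", "seller_id", "product_id"]  # assumed column names

def winsorize_global(df: pd.DataFrame, cols=DELTA_COLS,
                     lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Approach A: cap each delta at its global 1st/99th percentile, drop nothing."""
    out = df.copy()
    for col in cols:
        lo, hi = out[col].quantile([lower, upper])
        out[col] = out[col].clip(lo, hi)
    return out  # same row count as the input

def iqr_filter_grouped(df: pd.DataFrame, col: str = "delta_days",
                       k: float = 1.5) -> pd.DataFrame:
    """Approach B: drop rows outside the Tukey fences of their own
    Buyer+Seller+Product group."""
    def in_fences(s: pd.Series) -> pd.Series:
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1  # near zero in small, stable groups -> almost everything flagged
        return s.between(q1 - k * iqr, q3 + k * iqr)
    keep = df.groupby(GROUP_KEYS)[col].transform(in_fences)
    return df[keep.astype(bool)]

# capped  = winsorize_global(updates)       # Approach A: ~21k rows in, ~21k out
# trimmed = iqr_filter_grouped(updates)     # Approach B: rows dropped per group
```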
5. My Question: Given that I will later calculate Z-scores as features for the model (which capture the relative "weirdness" of a value per buyer), is it safer to stick with Global Winsorization to preserve data volume, or is Grouped Cleaning standard practice even with sparse groups?
Is there a hybrid approach (e.g., Global Cleaning + Local Feature Engineering) that is preferred in this scenario?
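For context, this is roughly what I picture the hybrid looking like: cap globally, then compute per-buyer Z-scores on the capped values. The sparse-group fallback to global stats is my own guess rather than something I know is standard, and `updates` / `winsorize_global` come from the placeholder sketches above:

```python
import pandas as pd

MIN_GROUP_N = 10  # assumed cutoff, mirroring the "N < 10" sparsity above

def per_buyer_zscore(df: pd.DataFrame, col: str, key: str = "buyer_id") -> pd.Series:
    """Z-score of df[col] relative to the buyer's own history, falling back to
    the global mean/std when the buyer has too few rows or zero variance."""
    grp = df.groupby(key)[col]
    mean, std, n = grp.transform("mean"), grp.transform("std"), grp.transform("size")
    sparse = (n < MIN_GROUP_N) | std.isna() | (std == 0)
    mean = mean.where(~sparse, df[col].mean())
    std = std.where(~sparse, df[col].std())
    return (df[col] - mean) / std

# Step 1: global cleaning -- keep all ~21k rows, just cap the tails.
clean = winsorize_global(updates)
# Step 2: local feature engineering -- "how weird is this update for THIS buyer?"
clean["delta_days_z"] = per_buyer_zscore(clean, "delta_days")
clean["delta_qty_pct_z"] = per_buyer_zscore(clean, "delta_qty_pct")
```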
Thanks for your help!
