This Note explains how to fix imbalanced data using a SMOTE step.
Here is data on airline flight delays. Each flight is categorized as "delayed" if it was delayed by more than 20 minutes.
As you can see, delayed flights are a minority compared to flights that were not delayed. If we build a binary classification model to predict whether a flight will be delayed, for example with Random Forest, the resulting predictions will most likely be biased towards "not delayed".
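As a rough illustration of why this bias is easy to miss, the sketch below uses made-up counts (900 "not delayed" and 100 "delayed" flights, not the numbers from the flight data) and a stand-in model that always predicts the majority class. Its accuracy looks good even though it never catches a delayed flight.

```python
# Hypothetical counts, for illustration only (not taken from the flight data).
n_not_delayed = 900
n_delayed = 100

# True labels: True means "delayed".
true_labels = [False] * n_not_delayed + [True] * n_delayed

# A stand-in "model" that always predicts the majority class, "not delayed".
predictions = [False] * len(true_labels)

# Overall accuracy: share of flights predicted correctly.
correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = correct / len(true_labels)

# Recall on the minority class: share of delayed flights that were caught.
caught = sum(p and t for p, t in zip(predictions, true_labels))
recall_delayed = caught / n_delayed

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.90
print(f"recall (delayed): {recall_delayed:.2f}")  # recall (delayed): 0.00
```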
One way to avoid this problem is to balance the data by synthesizing the minority data, which in this case is "delayed" data.
You can do this by creating a "Fix Imbalanced Data" step, which runs an algorithm called SMOTE (Synthetic Minority Oversampling Technique) to synthesize minority data.
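To give a feel for what SMOTE does, here is a minimal sketch in plain Python. It is a simplified illustration of the general idea (interpolating between a minority row and one of its nearest minority neighbors), not the implementation used by the "Fix Imbalanced Data" step. The function name `smote`, its parameters, and the two-feature sample rows are all hypothetical.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority neighbors."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the starting point.
        base = rng.choice(minority)
        # Find its k nearest minority neighbors (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point at a random position on the segment
        # between the base point and the chosen neighbor.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Hypothetical two-feature minority ("delayed") rows with made-up values.
delayed = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_rows = smote(delayed, n_new=6)
print(len(new_rows))  # 6
```

Because each synthetic row lies between two existing minority rows, the new data stays inside the region already occupied by the minority class rather than simply duplicating existing rows.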
It can be accessed from the following menu.
A dialog like this will show up.
Here is the list of the parameters.
In the output data, the values of the delayed column are more balanced, thanks to the synthesized minority "delayed" data. In this example, the majority data is also undersampled to make the data more balanced.
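The combined effect of oversampling the minority and undersampling the majority can be sketched as follows. The counts are made up for illustration, and the synthetic labels are only a stand-in for what an oversampling step such as SMOTE would add. The target sizes (300 rows per class) are an assumption, not the step's actual defaults.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical label counts (made up for illustration).
majority = ["not delayed"] * 900
minority = ["delayed"] * 100

# Stand-in for the rows an oversampling step such as SMOTE would add.
synthetic = ["delayed"] * 200

# Undersample: keep a random subset of the majority rows.
kept_majority = rng.sample(majority, 300)

# Combine the reduced majority with the original and synthetic minority rows.
balanced = kept_majority + minority + synthetic
counts = Counter(balanced)
print(counts["not delayed"], counts["delayed"])  # 300 300
```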
Some of the rows with the value TRUE in the delayed column are synthesized.
To tell which rows are synthesized, look at the synthesized column.
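If you export the output and want to separate the original rows from the synthesized ones, a filter on that column is enough. The sketch below assumes the rows are available as dictionaries with the delayed and synthesized columns. The sample rows are made up.

```python
# Hypothetical output rows; only the column names "delayed" and
# "synthesized" come from this note, the values are made up.
rows = [
    {"delayed": True, "synthesized": False},
    {"delayed": True, "synthesized": True},
    {"delayed": False, "synthesized": False},
    {"delayed": True, "synthesized": True},
]

# Rows that came from the original data.
original_rows = [r for r in rows if not r["synthesized"]]

# Rows that were generated to balance the data.
synthetic_rows = [r for r in rows if r["synthesized"]]

print(len(original_rows), len(synthetic_rows))  # 2 2
```

One common use for this split is to evaluate a model only on original rows, so that synthetic data does not influence the reported metrics.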