How to Fix Imbalanced Data with a SMOTE Step

This note explains how to fix imbalanced data using a SMOTE step.

Input Data

Here is data on airline flight delays. Each flight is categorized as “delayed” if it was delayed by more than 20 minutes.

As you can see, delayed flights are a minority compared to flights that were not delayed. If we build a binary classification model to predict whether a flight will be delayed, for example with Random Forest, the resulting predictions will most likely be biased towards “not delayed”.
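To make the bias concrete, here is a small Python sketch. The label counts (900 on-time, 100 delayed) are made up for illustration and are not taken from the flight data above. It shows that a model which always predicts “not delayed” looks accurate while never catching a delayed flight.

```python
from collections import Counter

# Hypothetical labels: 900 on-time flights and 100 delayed flights.
labels = [False] * 900 + [True] * 100

counts = Counter(labels)
minority_share = counts[True] / len(labels)

# A degenerate "model" that always predicts the majority class ("not delayed").
predictions = [False] * len(labels)

# Accuracy: share of flights predicted correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class: share of delayed flights that were caught.
true_positives = sum(p and y for p, y in zip(predictions, labels))
recall = true_positives / counts[True]

print(f"minority share: {minority_share:.0%}")
print(f"accuracy: {accuracy:.0%}, recall on delayed: {recall:.0%}")
```

High accuracy with zero recall on the minority class is the symptom that balancing the data is meant to address.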

One way to avoid this problem is to balance the data by synthesizing the minority data, which in this case is “delayed” data.

SMOTE Step

How to Access the Feature

You can do this by creating a “Fix Imbalanced Data” step, which runs an algorithm called SMOTE (Synthetic Minority Oversampling Technique) to synthesize minority data.
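The core idea of SMOTE can be sketched in a few lines of Python. This is a simplified illustration, not the implementation behind the “Fix Imbalanced Data” step: the function name `smote`, its arguments, and the sample points are all hypothetical, and it assumes purely numeric features. Each synthetic point is placed at a random position on the line segment between a minority point and one of its nearest minority neighbors.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority neighbors.

    Simplified sketch: the base point is drawn at random for each new
    sample, and all features are assumed to be numeric.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbors of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical minority points with two numeric features each.
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority_points, n_synthetic=6, k=2, seed=42)
```

Because every synthetic point lies between two existing minority points, the new rows stay inside the region the minority class already occupies rather than being exact duplicates.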

It can be accessed from the following menu.

A dialog like the following will appear.

Parameters

Here is the list of the parameters.

  • Imbalanced Column
  • Target % of Minority Data - The percentage of minority data you want in the resulting data frame. The default is 40%.
  • Target Size of Data - The number of rows you want the resulting data frame to have. The default is 50,000.
  • Maximum % Increase for Minority Size - The upper limit on the ratio of the size of the synthesized minority data to the size of the actual minority data. The default is 200%.
  • Number of Neighbors to Sample for Populating Minority Data - The number of nearest-neighbor data points the SMOTE algorithm uses when synthesizing artificial minority data.
  • Random Seed
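The following Python sketch shows one way the three size-related parameters could interact. It is an assumption made for illustration, not the documented behavior of the step: the helper `plan_resampling` and the input row counts are hypothetical, and the defaults mirror the values listed above.

```python
def plan_resampling(n_minority, n_majority, target_pct=0.40,
                    target_size=50_000, max_increase=2.0):
    """Hypothetical helper: derive row counts from the dialog parameters.

    Returns (synthesized minority rows, total minority rows, majority rows kept).
    """
    # Minority rows needed to reach the target share of the target size.
    wanted_minority = round(target_size * target_pct)
    # Synthesize only the shortfall, capped at max_increase times the
    # number of actual minority rows (200% means at most 2x).
    shortfall = max(wanted_minority - n_minority, 0)
    n_synthetic = min(shortfall, int(max_increase * n_minority))
    final_minority = n_minority + n_synthetic
    # Undersample the majority so the minority share matches target_pct.
    majority_needed = round(final_minority * (1 - target_pct) / target_pct)
    final_majority = min(n_majority, majority_needed)
    return n_synthetic, final_minority, final_majority


# Made-up input: the cap on synthesized rows is the binding limit here.
capped = plan_resampling(n_minority=5_000, n_majority=95_000)

# Made-up input: the target size of 50,000 can be reached exactly.
exact = plan_resampling(n_minority=10_000, n_majority=200_000)
```

Under these assumptions, the first call can only synthesize 10,000 rows (200% of 5,000), so the result has 37,500 rows instead of 50,000 while still holding the 40% minority share. The second call reaches both targets.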

Output Data

In the output data, the values of the delayed column are more balanced because of the synthesized minority “delayed” data. In this example, the majority data is also undersampled to make the data more balanced.

Some of the rows with the value TRUE in the delayed column are synthesized.

To tell which rows are synthesized, look at the synthesized column.
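If you export the output data, the synthesized column can be used to separate actual rows from synthetic ones. The Python sketch below uses made-up rows and assumes the column holds boolean values. The `dep_delay` column is hypothetical and only stands in for a numeric feature.

```python
# Hypothetical rows shaped like the output data; all values are made up.
rows = [
    {"delayed": True, "dep_delay": 35.0, "synthesized": False},
    {"delayed": False, "dep_delay": 2.0, "synthesized": False},
    {"delayed": True, "dep_delay": 41.7, "synthesized": True},
    {"delayed": True, "dep_delay": 28.3, "synthesized": True},
]

# Split on the synthesized flag.
actual_rows = [r for r in rows if not r["synthesized"]]
synthetic_rows = [r for r in rows if r["synthesized"]]

print(f"{len(actual_rows)} actual rows, {len(synthetic_rows)} synthesized rows")
```

Keeping the actual rows separate is useful when you want to evaluate a model only on real observations.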