What is imbalanced data?
Many real-world datasets suffer from imbalance: certain types of samples are overrepresented, while others occur only rarely. Some examples are:
- When classifying credit card transactions as fraudulent or legitimate, the vast majority of transactions will belong to the latter category
- Severe rainfall occurs less often than moderate rainfall, but may cause more damage to humans and infrastructure
- When trying to identify land use, there are more pixels that represent forests and agriculture than urban settlements
In this post, we aim to give an intuitive explanation for why machine learning algorithms struggle with imbalanced data, show you how to quantify your algorithm’s performance using quantile evaluation, and present three different strategies to improve that performance.
Example dataset for regression: California housing
Dataset imbalance is often illustrated for classification problems, where a majority class overshadows a minority class. Here, we focus on regression, where the target is a continuous numerical value. We are going to use the California Housing Dataset that is available with scikit-learn. The dataset contains more than 20,000 samples, with features such as the location, the average numbers of rooms and bedrooms, house age, block population, and median neighbourhood income. The target variable is the median house value, measured in hundreds of thousands of US dollars. In order to see if the dataset is imbalanced, we plot the histogram of the target variable.
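Loading the dataset and inspecting the target distribution can be sketched as follows. This is a minimal example that prints a text histogram with NumPy instead of rendering a plot; in practice you would pass the target to a plotting library such as matplotlib (the bin count of 10 is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.datasets import fetch_california_housing

# Load the dataset (downloaded and cached locally on first call)
data = fetch_california_housing()
X, y = data.data, data.target  # y: median house value in units of $100,000

print(X.shape)           # (20640, 8): samples x features
print(data.feature_names)

# Bin the target values to inspect the imbalance
counts, edges = np.histogram(y, bins=10)
for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    bar = "#" * max(1, count // 200)  # crude text bar, 1 char per ~200 samples
    print(f"{lo:4.2f}-{hi:4.2f}: {count:5d} {bar}")
```

The printed histogram already hints at the imbalance: most samples cluster in the low-to-mid value bins, while the most expensive houses are rare.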