Intro

Imbalanced data typically refers to classification problems where the classes are not represented equally. For example, you may have a two-class (binary) classification problem with 100 instances (rows), of which 80 are labeled Class-1 and the remaining 20 are labeled Class-2. This is an imbalanced dataset, and the ratio of Class-1 to Class-2 instances is 80:20, or more concisely 4:1. A class imbalance can occur in two-class as well as multi-class classification problems, and most techniques can be used on either.

Most classification data sets do not have an exactly equal number of instances in each class, but a small difference often does not matter.

There are problems where a class imbalance is not just common, it is expected. For example, datasets that characterize fraudulent transactions are imbalanced: the vast majority of the transactions fall into the “Not-Fraud” class and only a very small minority into the “Fraud” class. Another example is customer churn datasets, where the vast majority of customers stay with the service (the “No-Churn” class) and a small minority cancel their subscription (the “Churn” class). Even a modest class imbalance like the 4:1 ratio in the example above can cause problems.

The accuracy paradox is the name for exactly this situation: your accuracy measures tell the story that you have excellent accuracy (such as 90%), but the accuracy only reflects the underlying class distribution. It is very common, because classification accuracy is often the first measure we use when evaluating models on classification problems.

This is a short excursion into the SMOTE variations I found (to learn more about SMOTE, see the original 2002 paper titled “SMOTE: Synthetic Minority Over-sampling Technique”), which allow you to manipulate in various ways how the synthetic samples are created.

You can find the code of this exploration here; note that it's written for Python v3.5.


Let’s create some classification data with 200 informative features. PCA is used to turn it into a 2D dataset purely as a projection technique, so that things can be plotted.
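A minimal sketch of this setup, assuming scikit-learn; apart from the 200 informative features, the parameters (sample count, class weights) are illustrative assumptions:

```python
# Sketch of the dataset creation; sample count and class weights are assumed.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Imbalanced binary problem with 200 informative features.
X, y = make_classification(
    n_samples=5000,        # assumed
    n_features=200,
    n_informative=200,
    n_redundant=0,
    weights=[0.9, 0.1],    # assumed 9:1 imbalance
    random_state=42,
)

# PCA here is purely a projection so the data can be plotted in 2D.
X_2d = PCA(n_components=2).fit_transform(X)
```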


In order to measure how the synthetic samples influence the classification, we will use random forest and naive Bayes classifiers.

So, in the default setup, without any SMOTE, we have the following baseline.
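As a sketch (the train/test split and classifier settings are assumptions):

```python
# Baseline without SMOTE: train both classifiers on the raw, imbalanced data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

for clf in (RandomForestClassifier(random_state=42), GaussianNB()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))
```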

Standard SMOTE algorithm

The idea of the algorithm is to take the k nearest neighbors, which define a direction through their barycenter, and to apply a random factor along this direction. The larger the value of k, the more the synthetic samples blur the existing ones. By default, k is 5.
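With imbalanced-learn this looks roughly as follows; note that current releases use `fit_resample`, while the version available at the time of this post exposed `fit_sample`:

```python
# Standard SMOTE from imbalanced-learn; k_neighbors=5 is the default.
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
```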

Let’s first consider how the amount/ratio of synthetic samples influences the accuracy of the predictions.
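A sketch of such a sweep; the grid of ratios is an assumption, and in current imbalanced-learn the target ratio is passed as `sampling_strategy` (older releases called it `ratio`):

```python
# Sweep the target minority/majority ratio and re-evaluate the classifier.
for ratio in (0.2, 0.4, 0.6, 0.8, 1.0):
    X_res, y_res = SMOTE(sampling_strategy=ratio,
                         random_state=42).fit_resample(X_train, y_train)
    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    print(ratio, accuracy_score(y_test, clf.predict(X_test)))
```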


No surprise here: the more synthetic samples we add, the more the accuracy increases. We end up with an almost 1:1 ratio.

Let’s now assume a fixed ratio but increase the number of nearest neighbors.
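As a sketch, with an assumed grid of k values:

```python
# Fix the ratio at 1:1 and vary the number of nearest neighbors k.
for k in (3, 5, 10, 20, 50):
    X_res, y_res = SMOTE(sampling_strategy=1.0, k_neighbors=k,
                         random_state=42).fit_resample(X_train, y_train)
    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    print(k, accuracy_score(y_test, clf.predict(X_test)))
```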


So, increasing the number of neighbors diminishes the accuracy somewhat, but there is no clear effect.

Borderline 1 variation

The core idea of SMOTE is to use nearest neighbors to create new samples. However, if a minority point is close to another class, that point should rather not be considered, since it would pull the synthetic samples toward more noise and a less clear distinction between the classes. So the basic premise of the borderline SMOTE method is to identify points which potentially increase the confusion and to exclude them from the vectors creating new samples.
It’s clear that this method will have no effect if the classes are well separated, and that it is most effective when the class mixture is moderate.
The algorithm is described in this article.
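In current imbalanced-learn this variant lives in its own class (at the time of writing it was selected via a `kind` argument on `SMOTE`); a sketch:

```python
# Borderline-1 SMOTE: only minority points near the class boundary are used.
from imblearn.over_sampling import BorderlineSMOTE

bsmote = BorderlineSMOTE(kind='borderline-1', random_state=42)
X_res, y_res = bsmote.fit_resample(X_train, y_train)
```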


We can see that the accuracy reaches its maximum more directly. This is because our sample does have some noisy overlap between the classes, which is exactly the case borderline SMOTE is designed for.

SVM variation

This approach is similar to the borderline idea, but it uses a support vector machine to detect the boundary points and single them out for the creation of synthetic samples.
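Again as a sketch with the current imbalanced-learn API (formerly `SMOTE(kind='svm')`); the SVM hyperparameters are left at their defaults:

```python
# SVM-SMOTE: an SVM is fitted first and its support vectors steer where
# the synthetic minority samples are generated.
from imblearn.over_sampling import SVMSMOTE

svm_smote = SVMSMOTE(random_state=42)
X_res, y_res = svm_smote.fit_resample(X_train, y_train)
```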


This reaches the maximum accuracy slightly faster (at a ratio of around 1.4, versus 1.6 with the borderline approach), but at the cost of a longer computation time.