## Intro

Most classification datasets do not have an exactly equal number of instances in each class, but a small difference often does not matter.

There are problems where a class imbalance is not just common, it is expected. For example, datasets that characterize fraudulent transactions are imbalanced: the vast majority of the transactions will be in the “Not-Fraud” class and a very small minority will be in the “Fraud” class. Another example is customer churn datasets, where the vast majority of customers stay with the service (the “No-Churn” class) and a small minority cancel their subscription (the “Churn” class). Even a modest class imbalance such as 4:1 can cause problems.

The accuracy paradox is the name for exactly this situation: your accuracy measure reports excellent performance (such as 90%), but that number only reflects the underlying class distribution. It is very common, because classification accuracy is often the first measure we use when evaluating models on classification problems.
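To make the paradox concrete, here is a minimal sketch with made-up labels (the 90/10 split is an assumption chosen to match the 90% figure above, not data from this post). A "model" that always predicts the majority class scores high on accuracy while never finding a single minority instance.

```python
# Hypothetical labels: 90 majority instances (0) and 10 minority instances (1).
y_true = [0] * 90 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of actual minority instances that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(y_true)

print("Accuracy: %.2f" % accuracy)
print("Minority recall: %.2f" % minority_recall)
```

The accuracy equals the share of the majority class, which is why a measure such as the ROC AUC is used in the rest of this post.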

This is a short excursion into the SMOTE variations I found, which allow you to manipulate the creation of synthetic samples in various ways. To learn more about SMOTE, see the original 2002 paper titled “SMOTE: Synthetic Minority Over-sampling Technique”.

You can find the code of this exploration here; note that it is written for Python 3.5.

```python
import sys
import sklearn.datasets
from unbalanced_dataset import SMOTE
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn import decomposition
import time, os
import pandas as pd, numpy as np
from ggplot import *
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_auc_score
%matplotlib inline
%load_ext autoreload
%autoreload 2
import seaborn as sns
import datetime
from sklearn.decomposition import PCA

sns.set()

def plotClassificationData(x, y, title=""):
    palette = sns.color_palette()
    plt.scatter(x[y == 0, 0], x[y == 0, 1], label="Class #0", alpha=0.5,
                facecolor=palette[0], linewidth=0.15)
    plt.scatter(x[y == 1, 0], x[y == 1, 1], label="Class #1", alpha=0.5,
                facecolor=palette[2], linewidth=0.15)
    plt.title(title)
    plt.legend()
    plt.show()

def linePlot(x, title=""):
    palette = sns.color_palette()
    plt.plot(x, alpha=0.5, label=title, linewidth=0.2)
    plt.legend()
    plt.show()

def savePlotClassificationData(x, y):
    palette = sns.color_palette()
    plt.scatter(x[y == 0, 0], x[y == 0, 1], label="Class #0", alpha=0.5,
                facecolor=palette[0], linewidth=0.15)
    plt.scatter(x[y == 1, 0], x[y == 1, 1], label="Class #1", alpha=0.5,
                facecolor=palette[2], linewidth=0.15)
    plt.legend()
    # plt.show()
    filePath = "/Users/Swa/Desktop/" + str(datetime.datetime.now(datetime.timezone.utc).timestamp()) + ".png"
    plt.savefig(filePath)

def plotHistogram(x, bins=10):
    plt.hist(x, bins=bins)
    plt.show()
```

Let’s create some classification data with 200 informative features. PCA is used here purely as a projection technique, turning the data into a 2D dataset so that it can be plotted.

```python
x, y = sklearn.datasets.make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                                            n_informative=200, n_redundant=0, flip_y=0,
                                            n_features=200, n_clusters_per_class=1,
                                            n_samples=5000, random_state=10)
# we'll invert the y to reflect the situation where '1' means classified
y = 1 - y

def count_classifieds(z):
    return sum(z)

def count_unclassifieds(z):
    return len(z) - sum(z)

def imbalance_ratio(z):
    return round(count_classifieds(z)/count_unclassifieds(z),1)

num_classified = count_classifieds(y)
num_unclassified = count_unclassifieds(y)
print("Number of classified clients: %s"%num_classified)
print("Number of unclassified clients: %s"%num_unclassified )
print("Imbalance ratio: %s"%imbalance_ratio(y))

pca = decomposition.PCA(n_components=2)
xv = pca.fit_transform(x)
plotClassificationData(xv,y)
```

In order to measure how the synthetic samples influence the classification, we will use **random forest** and **naive Bayes** classifiers.

```python
def makeForest(X:np.array, Y:np.array, treecount=10):
    # A forest needs at least one tree.
    if treecount < 1:
        raise ValueError("The forest should not have less than one tree.")
    model = RandomForestClassifier(n_estimators=treecount)
    return model.fit(X, Y)

def getForestArea(x, y, treecount=10):
    forest = makeForest(x, y, treecount)
    # Note: the forest is scored on the same data it was trained on.
    predicted = forest.predict(x)
    try:
        area = roc_auc_score(y, predicted)
    except ValueError:
        # roc_auc_score is undefined when y contains a single class
        area = 0
    return round(area,2)
```
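Only the random forest helper is shown above. A naive Bayes counterpart could look like the following sketch. The function name `get_bayes_area`, the choice of `GaussianNB`, the small demo dataset and the held-out test split are all my assumptions, not part of the original exploration. Scoring on a held-out split with predicted probabilities is a deliberate difference from `getForestArea`, which scores hard predictions on the training data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def get_bayes_area(x_train, y_train, x_test, y_test):
    """Fit Gaussian naive Bayes and return the ROC AUC on the test split."""
    model = GaussianNB().fit(x_train, y_train)
    scores = model.predict_proba(x_test)[:, 1]  # predicted probability of class 1
    return round(roc_auc_score(y_test, scores), 2)

# Small hypothetical dataset with roughly 10% minority samples (class 1).
x_demo, y_demo = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                     weights=[0.9, 0.1], random_state=10)

# Stratify so that both splits keep the same class proportions.
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=10)

bayes_area = get_bayes_area(x_tr, y_tr, x_te, y_te)
print("Naive Bayes AUC: %s" % bayes_area)
```

When combining this with oversampling, the synthetic samples would be added to the training split only, so that the test split still reflects the original class distribution.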

## Standard SMOTE algorithm

The idea of the algorithm is to take, for each minority sample, its k nearest minority neighbors; one of these neighbors defines a direction, and a synthetic sample is placed at a random fraction of the way along it. The larger the value of k, the more the synthetic samples blur the existing ones. By default k is 5.
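The interpolation step can be sketched in a few lines of NumPy. This is a minimal illustration assuming the formulation of the original SMOTE paper, where each synthetic point lies at a random position between a minority sample and one of its k nearest minority neighbors. The function `smote_sketch` and the random demo data are hypothetical and are not the implementation used by the library in this post.

```python
import numpy as np

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from the minority samples (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(minority)  # requires k < n

    # Pairwise Euclidean distances between all minority samples.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbors

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        seed_idx = rng.integers(n)                     # pick a random minority sample
        nb_idx = neighbors[seed_idx, rng.integers(k)]  # pick one of its k neighbors
        gap = rng.random()                             # random factor in [0, 1)
        synthetic[i] = minority[seed_idx] + gap * (minority[nb_idx] - minority[seed_idx])
    return synthetic

# Hypothetical minority class: 20 points in 2D.
minority = np.random.default_rng(1).normal(size=(20, 2))
new_points = smote_sketch(minority, n_new=30, k=5)
print(new_points.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples never leave the region already spanned by the minority class. A larger k allows more distant neighbors to be chosen, which is where the blurring comes from.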

Let’s first consider how the oversampling ratio influences the quality of the predictions.

```python
areas = []
division = np.arange(0.2,4,0.1)
for k in division:
    smote = SMOTE(kind="regular", ratio=k)
    sx, sy = smote.fit_transform(x, y)
    areas.append(getForestArea(sx,sy))

plt.plot(division, areas)
plt.xlabel('Ratio vs. area.')
plt.ylabel('Area')
plt.title('AUC')
plt.ylim([0.97,1.005])
plt.show()
print("Ended with %s classified and %s unclassifieds."%(count_classifieds(sy),count_unclassifieds(sy)))
```

Let’s now assume a fixed ratio but increase the number of nearest neighbors.

```python
areas = []
division = np.arange(2,20,1)
for nn in division:
    smote = SMOTE(kind="regular", k=nn, ratio=0.5)
    sx, sy = smote.fit_transform(x, y)
    areas.append(getForestArea(sx,sy))

plt.plot(division, areas)
plt.xlabel('Neighbors vs. area.')
plt.ylabel('Area')
plt.title('AUC')
plt.ylim([0.93,1.005])
plt.show()
print("At each turn we had %s classified and %s unclassifieds."%(count_classifieds(sy),count_unclassifieds(sy)))
```

## Borderline 1 variation

The core idea of SMOTE is to use nearest neighbors to create new samples. However, a minority point that is completely surrounded by the other class is most likely noise, and creating samples from it would add confusion and blur the distinction between classes. So, the basic premise of the borderline SMOTE method is to look at the neighborhood of each minority point: points whose neighbors all belong to the other class are treated as noise and excluded, points deep inside the minority region are considered safe and are also left out, and only the points near the class boundary are used as seeds for new samples.

This method has little effect when the classes are well separated and is most effective when the classes overlap moderately.

The algorithm is described in this article.
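The selection step of borderline SMOTE can be sketched as follows. This assumes the usual formulation of the borderline method, in which each minority point is labeled by how many of its m nearest neighbors (taken from both classes) belong to the majority class. The function `borderline_labels` and the toy coordinates are hypothetical, chosen so that each of the three categories appears.

```python
import numpy as np

def borderline_labels(x, y, minority_label=1, m=5):
    """Label each minority sample as 'noise', 'danger' or 'safe' (illustrative only)."""
    labels = {}
    for i in np.flatnonzero(y == minority_label):
        dist = np.linalg.norm(x - x[i], axis=1)
        dist[i] = np.inf                    # exclude the point itself
        nearest = np.argsort(dist)[:m]      # m nearest neighbors from both classes
        n_majority = int((y[nearest] != minority_label).sum())
        if n_majority == m:
            labels[int(i)] = "noise"        # surrounded by the other class: skipped
        elif n_majority >= m / 2:
            labels[int(i)] = "danger"       # near the boundary: used as a seed
        else:
            labels[int(i)] = "safe"         # deep inside the minority region: skipped
    return labels

# Toy data: six majority points (label 0) followed by six minority points (label 1).
x_toy = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [8, 8.5], [8.5, 8],
                  [0.5, 0.5], [9, 9], [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
y_toy = np.array([0] * 6 + [1] * 6)

toy_labels = borderline_labels(x_toy, y_toy, m=3)
for index, label in sorted(toy_labels.items()):
    print(index, x_toy[index], label)
```

Only the points labeled `danger` would then be passed to the interpolation step, which is why the method generates nothing when no minority point sits near the boundary.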

```python
areas = []
division = np.arange(0.2,4,0.1)
for k in division:
    smote = SMOTE(kind="borderline1", ratio=k)
    sx, sy = smote.fit_transform(x, y)
    areas.append(getForestArea(sx,sy))

plt.plot(division, areas)
plt.xlabel('Ratio vs. area.')
plt.ylabel('Area')
plt.title('AUC')
plt.ylim([0.97,1.005])
plt.show()
print("Ended with %s classified and %s unclassifieds."%(count_classifieds(sy),count_unclassifieds(sy)))
```

We can see that the AUC reaches its maximum more directly. This is because our sample does contain some noisy overlap between the classes, which is the situation borderline SMOTE is designed for.

## SVM variation

This approach is similar to the borderline idea, but it uses a support vector machine to detect the boundary points: the minority-class support vectors of the trained SVM serve as the seeds for creating synthetic samples.
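The detection step can be illustrated with scikit-learn. This is only a sketch of how boundary points might be found, assuming a linear kernel and a small made-up dataset; it does not reproduce how the SMOTE library in this post generates the samples afterwards.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical imbalanced dataset with roughly 10% minority samples (class 1).
x_demo, y_demo = make_classification(n_samples=500, n_features=5, n_informative=3,
                                     n_redundant=0, weights=[0.9, 0.1],
                                     class_sep=1.0, random_state=10)

# Fit an SVM; its support vectors are the samples on or near the decision boundary.
svm = SVC(kernel="linear").fit(x_demo, y_demo)
support_idx = svm.support_  # indices of the support vectors within x_demo

# Keep only the support vectors that belong to the minority class.
minority_seeds = support_idx[y_demo[support_idx] == 1]

print("Minority samples: %s" % int((y_demo == 1).sum()))
print("Minority support vectors used as seeds: %s" % len(minority_seeds))
```

Compared with the neighbor counting of the borderline variation, the boundary here is defined by the trained classifier, so the choice of kernel and regularization affects which points become seeds.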

```python
import warnings
warnings.filterwarnings('ignore')

areas = []
division = np.arange(0.2,4,0.1)
for k in division:
    smote = SMOTE(kind="svm", ratio=k)
    sx, sy = smote.fit_transform(x, y)
    areas.append(getForestArea(sx,sy))

plt.plot(division, areas)
plt.xlabel('Ratio vs. area.')
plt.ylabel('Area')
plt.title('AUC')
plt.ylim([0.97,1.005])
plt.show()
print("Ended with %s classified and %s unclassifieds."%(count_classifieds(sy),count_unclassifieds(sy)))
```