Imbalance in datasets imposes a great penalty on the accurate prediction of minority classes. In practice, this means our model will be almost unable to predict the cases that matter most: those where the cost of misclassification is greatest. This article presents several techniques to address this problem and demonstrates them in a working environment with a real dataset.
A dataset is imbalanced when its classes are not approximately equally represented, that is, when there is a severe skew in the class distribution. How severe can this skew be? Studies report imbalances on the order of 100 to 1 in fraud detection, and up to 100,000 to 1 in other applications. Use cases like these, where one class has very few samples relative to the others, may be seen as needle-in-a-haystack problems.
Let’s dive a little further into this. Imagine a dataset where the classes appear in a 999:1 ratio. The algorithm has mostly seen one type of case, so the classifier will tend to predict every example as belonging to the majority class. By doing so, it reaches 99.9% accuracy, a benchmark that looks hard to beat. However, no matter how high the accuracy is, this approach has bigger problems to address.
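To see this accuracy trap in numbers, here is a minimal sketch (not part of the original workflow) that scores a majority-class-only baseline with scikit-learn's DummyClassifier on a synthetic 999:1 label vector:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with a 999:1 ratio; the features are irrelevant for this baseline.
y = np.array([0] * 999 + [1])
X = np.zeros((len(y), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print(accuracy_score(y, y_pred))  # 0.999 -> looks impressive
print(recall_score(y, y_pred))    # 0.0   -> every minority case is missed
```

Accuracy looks excellent, yet recall on the minority class is zero, which is exactly the failure mode described above.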
First of all, it assumes equal error costs. This means that the error of misclassifying an observation has the same consequences regardless of the class.
In the real world, things are different. Classification often leads to action, and actions have effects. In cases where predicting the minority class matters most, a random-guess approach is simply not tolerable, because misclassifying the minority class may mean allowing a fraudulent transaction, ignoring a malfunctioning part, or failing to detect a disease. For example, a typical mammography dataset might contain 98% normal pixels and 2% abnormal pixels. Applications of this kind require a very high detection rate in the minority class while allowing only a small error rate in the majority class.
The consequences of misclassification may be severe, and performing the wrong action may be quite costly. Only on very rare occasions are the costs of different mistakes equivalent; in fact, it is hard to think of a domain where the classification is indifferent to whether it makes Type 1 (False Positive) or Type 2 (False Negative) errors.
Now that we understand how bias in the dataset can influence machine learning algorithms, let’s talk about how the machine learning community has addressed class imbalance. There are roughly two ways to deal with the problem: one is to assign distinct costs to training examples, and the other is to resample the original dataset, either by oversampling the minority class and/or undersampling the majority class.
In this article, we will focus the discussion on the resampling of the original dataset and will leave the adjustment of training costs for future discussions.
As stated before, we can balance our dataset by increasing the minority class until it matches the majority class, and/or by cutting down the majority class to match the less represented class. Both techniques have advantages and disadvantages, especially since there are multiple ways to oversample the minority class.
One way to oversample the minority class is to randomly select samples from it and add them to the training dataset until the desired class distribution is achieved. Essentially, this randomly duplicates observations. The approach is also referred to as naive resampling because it assumes nothing about the data. Its major advantage is that it is simple to implement and fast to execute; however, in line with previous investigations, it does not have a major impact on model performance.
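As a quick illustration, random oversampling is available in Imbalanced-Learn as RandomOverSampler. The sketch below assumes hypothetical X_train and y_train arrays that already hold your training features and labels:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class rows until the classes are balanced.
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

print("before:", Counter(y_train))
print("after: ", Counter(y_train_ros))
```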
According to previous research, undersampling the majority class is another way to address imbalance in datasets, and it enables better classifiers to be built than random oversampling of the minority class.
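The undersampling counterpart in Imbalanced-Learn is RandomUnderSampler; again, X_train and y_train are assumed to exist:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until the classes are balanced.
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

print("before:", Counter(y_train))
print("after: ", Counter(y_train_rus))
```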
Another alternative is to oversample the minority class by creating “synthetic” examples, rather than randomly oversampling with replacement. Enter the Synthetic Minority Oversampling Technique, or SMOTE.
The Synthetic Minority Oversampling Technique (SMOTE) generates new synthetic training samples by performing operations on real samples. SMOTE augments the minority class by taking each minority class sample and inserting synthetic examples along the line segments joining it to some or all of its k nearest minority class neighbors. Depending on the amount of oversampling required, neighbors from the k nearest neighbors are chosen at random. Let’s slow down a bit and elaborate on how SMOTE works.
This procedure is repeated with different data points until the minority class reaches roughly the same proportion as the majority class.
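In Imbalanced-Learn this is exposed as the SMOTE class. A minimal usage sketch, assuming numeric X_train and y_train arrays, looks like this:

```python
from imblearn.over_sampling import SMOTE

# k_neighbors is the k used to pick minority neighbours for interpolation (5 is the default).
smote = SMOTE(k_neighbors=5, random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
```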
From the explanation above, we highlight two main ideas. First, in its simplest form, SMOTE only works with continuous features: for categorical features, the interpolation used to generate synthetic samples produces values on a continuous spectrum, which breaks the coherence of our dataset (a variant that handles this case is sketched after the second point below).
Second, because of the k-nearest-neighbors criterion, SMOTE creates synthetic observations where the density of minority samples is high, and fewer synthetic samples near the inter-class boundaries. If misclassification often happens near the decision boundary, this approach will not create samples that reflect the reality of our use case and will not help improve the classification score.
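On the first point, Imbalanced-Learn ships a SMOTENC variant designed for datasets that mix continuous and categorical features; it interpolates the continuous columns and takes categorical values from the neighbors. The column indices below are placeholders, not taken from the article's dataset:

```python
from imblearn.over_sampling import SMOTENC

# categorical_features lists the positions of the categorical columns (placeholder indices).
smote_nc = SMOTENC(categorical_features=[0, 3, 4], random_state=42)
X_train_nc, y_train_nc = smote_nc.fit_resample(X_train, y_train)
```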
As you can see, depending on the characteristics of our dataset and our use case, the default behavior of SMOTE may not be the best fit for overcoming the imbalance in our dataset. For this reason, different variants of SMOTE have been developed for cases where the Synthetic Minority Oversampling Technique, in its simplest form, falls short.
K-Means SMOTE combines K-Means clustering with SMOTE oversampling to overcome some of plain SMOTE’s shortcomings. Clustering enables the oversampler to identify target areas of the input space where generating artificial data is most effective. The method aims to eliminate both between-class and within-class imbalance while avoiding the generation of noisy samples. K-Means SMOTE involves three steps: clustering, filtering, and oversampling. In the clustering step, the input space is partitioned into k groups using K-Means, an iterative method for finding naturally occurring groups in data that can be represented in a Euclidean space. The most notable hyperparameter of K-Means is k itself, the number of clusters. Finding an appropriate value for k is essential for the effectiveness of K-Means SMOTE, as it determines how many minority clusters can be identified in the filtering step.
The next step is to filter: choose the clusters to be oversampled and determine how many samples to generate in each one. The idea is to oversample clusters dominated by the minority class, as applying SMOTE inside minority areas is less susceptible to noise generation. The selection of clusters is based on each cluster’s proportion of minority and majority instances, and this imbalance-ratio threshold is itself an adjustable hyperparameter. Finally, in the oversampling step, each selected cluster is oversampled using SMOTE. As you can see, the method relies on an unsupervised approach that enables the discovery of overlapping class regions and may help avoid oversampling in unsafe areas.
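In code, the three steps map directly onto the parameters of Imbalanced-Learn's KMeansSMOTE. The values below are illustrative assumptions, not tuned settings, and X_train and y_train are again assumed to exist:

```python
from sklearn.cluster import KMeans
from imblearn.over_sampling import KMeansSMOTE

kmeans_smote = KMeansSMOTE(
    kmeans_estimator=KMeans(n_clusters=10, random_state=42),  # clustering step: the k of K-Means
    cluster_balance_threshold=0.1,  # filtering step: minority share a cluster needs to be selected
    k_neighbors=5,                  # oversampling step: neighbours used by SMOTE inside each cluster
    random_state=42,
)
X_train_km, y_train_km = kmeans_smote.fit_resample(X_train, y_train)
```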
Of course, when we resample with SMOTE or any of its variants, there is a lot going on under the hood, and explaining in detail the procedure and the math behind each technique is out of the scope of this article. The techniques described above are the result of intensive research by the community and, even though their theory is complex, we don’t need to know the exact mathematics running behind the curtains of SMOTE: they are available as libraries and methods we can use, and that is exactly what we are going to do.
Now that we understand why we need SMOTE and how it works, it’s time to get our hands dirty and show how to implement it using Imbalanced-Learn, an open-source library that builds on Scikit-learn to provide tools for dealing with imbalanced datasets.
The dataset we will be working with is the Stroke Prediction Dataset, which is used to predict whether a patient is likely to have a stroke based on various parameters (link to the dataset: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset).
As with every machine learning workflow, we should perform some EDA. So, let’s take a look at the dataset and, of course, zoom in on the imbalance characteristic.
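A minimal EDA sketch for the imbalance check might look like the following; the file name and the 'stroke' target column follow the Kaggle dataset's description, so adjust them to your local copy:

```python
import pandas as pd

# Load the stroke dataset (file name per the Kaggle page; adjust if yours differs).
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

print(df.shape)
print(df["stroke"].value_counts())                 # absolute class counts
print(df["stroke"].value_counts(normalize=True))   # class proportions
```

The class counts should confirm that stroke cases make up only a small minority of the rows, which is precisely the imbalance we want to correct.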
What we will do now is partition the dataset into train and test sets. From there, we will oversample the training set using different techniques, train our model with the XGBoost algorithm, and evaluate the impact on our metrics by inspecting the confusion matrix of our predictions on the test subsample.
One important note here, and a very important one actually: always perform your oversampling on the training dataset only. Never oversample the test dataset.
Why, you may ask? Because oversampling generates samples synthetically and alters the distribution of your classes. If we apply it to the test dataset, we introduce unneeded noise into the evaluation, and the results will no longer reflect real-world performance. So the best way is to set your test dataset aside first and apply oversampling only to the training dataset.
In our case, this is how we suggest performing the splitting and oversampling.
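A sketch of that split-then-oversample pipeline, assuming X and y are the already-encoded features and target from the stroke dataset, could look like this (SMOTE is used here, but any of the resamplers above can be swapped in):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Split first, so the test set keeps the original, imbalanced class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample only the training portion.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Train on the balanced training data, evaluate on the untouched test data.
model = XGBClassifier(eval_metric="logloss", random_state=42)
model.fit(X_train_res, y_train_res)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```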