Class imbalance is a common problem in real-world classification tasks, where the instances of one class significantly outnumber those of the other class(es). This can lead to poor performance of standard machine learning algorithms. Oversampling techniques like SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling) aim to alleviate this issue by generating synthetic examples of the minority class. This article provides an in-depth mathematical formulation and comparative analysis of the SMOTE and ADASYN algorithms.

  1. Introduction

In a binary classification problem, let the dataset be denoted as $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is a d-dimensional feature vector and $y_i \in \{0, 1\}$ is the corresponding class label. Let the minority and majority class sets be $S_{min}$ and $S_{maj}$ respectively. Class imbalance occurs when $|S_{min}| \ll |S_{maj}|$, which can hinder the learning of classifiers, especially in detecting the minority class. Oversampling methods address this by generating synthetic minority class examples to obtain a more balanced class distribution.
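As a quick illustration of these definitions, the imbalance ratio $|S_{maj}| / |S_{min}|$ can be computed directly from a label vector. The 900/100 split below is an arbitrary toy example, not from any referenced dataset:

```python
import numpy as np

# Toy imbalanced label vector: 900 majority (y=0) vs 100 minority (y=1).
y = np.array([0] * 900 + [1] * 100)

n_maj = np.sum(y == 0)          # |S_maj|
n_min = np.sum(y == 1)          # |S_min|
ratio = n_maj / n_min           # imbalance ratio |S_maj| / |S_min|
print(ratio)                    # 9.0
```

A ratio well above 1 is the situation both SMOTE and ADASYN are designed to correct.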

  2. SMOTE Algorithm

SMOTE [1] generates synthetic examples in the following way:

For each minority instance $x_i \in S_{min}$,

  1. Find its k-nearest neighbors in $S_{min}$, denoted $NN(x_i)$.
  2. For $j = 1$ to $N$ (the number of synthetic examples to generate for $x_i$),
    • Select a random neighbor $\hat{x} \in NN(x_i)$.
    • Generate a synthetic example $x_{new} = x_i + \lambda \cdot (\hat{x} - x_i)$, where $\lambda \in [0, 1]$ is a random number.

The number of synthetic examples generated for each $x_i$ is proportional to the imbalance ratio $|S_{maj}| / |S_{min}|$. SMOTE effectively forces the decision region of the minority class to become more general by generating examples along the line segments joining minority class instances.
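The interpolation step above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function name `smote_sample`, the brute-force distance computation, and the uniform choice of base instance are simplifications made here for clarity:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate n_new synthetic minority examples by SMOTE-style interpolation.

    X_min: (n, d) array of minority-class feature vectors.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Brute-force pairwise Euclidean distances among minority points.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]        # indices of k nearest minority neighbors
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)                  # pick a minority instance x_i
        j = nn[i, rng.integers(k)]           # pick a random neighbor x_hat
        lam = rng.random()                   # lambda drawn uniformly from [0, 1)
        # x_new = x_i + lambda * (x_hat - x_i): a point on the joining segment
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)
```

Because every synthetic point is a convex combination of two minority instances, all generated examples lie on the line segments described above, inside the bounding box of the minority class.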

  3. ADASYN Algorithm

ADASYN [2] is an extension of SMOTE that adaptively generates synthetic examples based on a weight distribution over the minority instances, defined as:

$$\hat{r}_i = \frac{r_i}{\sum_{j=1}^{|S_{min}|} r_j}, \qquad i = 1, \ldots, |S_{min}|,$$

where

$$r_i = \frac{\Delta_i}{k}.$$

Here $\Delta_i$ is the number of examples among the k-nearest neighbors of $x_i$ that belong to the majority class. Therefore, more synthetic examples are generated for minority class instances that are harder to learn (those with many majority-class neighbors) than for those that are easier to learn.

The number of synthetic examples generated for each $x_i$ is $g_i = \hat{r}_i \times G$, where $G = \beta \times (|S_{maj}| - |S_{min}|)$ and $\beta \in [0, 1]$ is a parameter that determines the desired balance level after generation. The synthetic examples themselves are generated in the same way as in SMOTE.
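The adaptive allocation can be sketched as follows. This illustrates the $r_i$, $\hat{r}_i$, and $g_i$ computations above under the assumptions of Euclidean distance and a brute-force neighbor search; the function name `adasyn_allocation` is a label chosen here, not an API from any library:

```python
import numpy as np

def adasyn_allocation(X_min, X_maj, k=5, beta=1.0):
    """Return ADASYN weights r_hat and per-instance synthetic counts g."""
    X_all = np.vstack([X_min, X_maj])            # minority rows first
    is_maj = np.arange(len(X_all)) >= len(X_min)  # True for majority rows
    G = int(beta * (len(X_maj) - len(X_min)))     # total synthetic budget
    # Distances from each minority point to every point in the dataset.
    d = np.linalg.norm(X_min[:, None, :] - X_all[None, :, :], axis=-1)
    for i in range(len(X_min)):
        d[i, i] = np.inf                          # exclude the point itself
    nn = np.argsort(d, axis=1)[:, :k]             # k nearest neighbors in X_all
    delta = is_maj[nn].sum(axis=1)                # Delta_i: majority neighbors
    r = delta / k                                 # r_i = Delta_i / k
    if r.sum() > 0:
        r_hat = r / r.sum()                       # normalize so sum(r_hat) = 1
    else:
        r_hat = np.full(len(X_min), 1.0 / len(X_min))
    g = np.rint(r_hat * G).astype(int)            # g_i = r_hat_i * G
    return r_hat, g
```

A minority point deep inside its own class gets $\Delta_i = 0$ and therefore no synthetic examples, while a point surrounded by majority neighbors absorbs most of the budget $G$.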

  4. Comparative Analysis

Both SMOTE and ADASYN aim to achieve class balance by generating synthetic minority class examples. However, there are some key differences:

  • SMOTE generates an equal number of synthetic examples for each minority instance, while ADASYN generates a different number for each minority instance according to the weight distribution $\hat{r}_i$.
  • ADASYN tends to focus more on minority instances near the class boundary, since they have higher $\hat{r}_i$ values. This adaptively shifts the decision boundary towards the difficult instances.
  • The number of synthetic examples in SMOTE is determined by the imbalance ratio, while in ADASYN it is controlled by the parameter $\beta$. This allows more flexibility in controlling the final balance in ADASYN.
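The first and last of these differences can be made concrete with a toy allocation. The weights below are hypothetical values chosen only to show how the same budget $G$ is split uniformly by SMOTE but proportionally to difficulty by ADASYN:

```python
import numpy as np

# Suppose G = beta * (|S_maj| - |S_min|) = 300 synthetic examples are needed,
# and the minority class has 3 instances with hypothetical ADASYN weights.
G = 300
r_hat = np.array([0.6, 0.3, 0.1])        # hypothetical r_hat values (sum to 1)

smote_alloc = np.full(3, G // 3)          # SMOTE: uniform split per instance
adasyn_alloc = np.rint(r_hat * G)         # ADASYN: g_i = r_hat_i * G
print(smote_alloc)                        # [100 100 100]
print(adasyn_alloc)                       # [180.  90.  30.]
```

Both allocations spend the same total budget; only ADASYN concentrates it on the hardest instance.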

Empirical results have shown that both methods significantly improve classification performance on imbalanced datasets, with ADASYN often obtaining better F-measure and G-mean scores [2][3]. However, by generating noisy synthetic examples, both methods also risk overfitting and reducing classifier performance on the majority class [4]. The optimal oversampling ratio is data-dependent and needs to be tuned carefully.

Citations:

[1] https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/
[2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10789107/
[3] https://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf
[4] https://rikunert.com/smote_explained
[5] https://typeset.io/questions/what-are-the-pros-and-cons-of-using-smote-2pzu32jb92
[6] https://www.diva-portal.org/smash/get/diva2:1519153/FULLTEXT01.pdf
[7] https://towardsdatascience.com/smote-fdce2f605729
[8] https://www.linkedin.com/posts/soledad-galli_no-smote-is-not-the-silver-bullet-for-activity-7094995959624929281-EYmH
[9] https://pub.aimind.so/adasyn-algorithm-for-unbalanced-classification-problems-4e0b08e83bd7?gi=bc6dd39ee030
[10] https://www.turing.com/kb/smote-for-an-imbalanced-dataset
[11] https://datascience.stackexchange.com/questions/106461/why-smote-is-not-used-in-prize-winning-kaggle-solutions
[12] https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.ADASYN.html
[13] https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
[14] https://mindfulmodeler.substack.com/p/dont-fix-your-imbalanced-data
[15] https://towardsdatascience.com/the-mystery-of-adasyn-is-revealed-73bcba57c3fe
[16] https://www.blog.trainindata.com/overcoming-class-imbalance-with-smote/
[17] https://towardsdatascience.com/stop-using-smote-to-handle-all-your-imbalanced-data-34403399d3be
[18] https://ieeexplore.ieee.org/document/4633969
[19] https://domino.ai/blog/smote-oversampling-technique
[20] https://www.researchgate.net/publication/224330873_ADASYN_Adaptive_Synthetic_Sampling_Approach_for_Imbalanced_Learning