In the previous story, we explained how the naive random oversampling and random oversampling examples (ROSE) algorithms work. More importantly, we also defined the class imbalance problem and derived intuitive solutions for it. I highly recommend checking that story to ensure a clear understanding of class imbalance.

In this story, we will continue by considering the SMOTE, SMOTE-NC and SMOTE-N algorithms. But before we do, it’s worth pointing out that the two algorithms we considered in the last story fit the following implementation framework:

- Define how the algorithm takes data belonging to a class *X* that needs *Nx* examples and computes those examples by oversampling
- Given a ratios hyperparameter, compute the number of points that need to be added for each class
- For each class, run the algorithm, then combine all newly added points with the original data to form the final oversampled dataset

For both the random oversampling and ROSE algorithms, it was also true that to generate *Nx* examples for class *X*, the algorithm does the following:

- Choose *Nx* points randomly with replacement from the data belonging to class *X*
- Perform logic on each of the chosen points to generate a new point (e.g., replication, or placing a Gaussian and sampling from it)

The rest of the algorithms we will consider in this story also fit this same framework.
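The framework can be sketched in a few lines (the function names and signatures here are illustrative, not from any particular library):

```python
import numpy as np

def oversample_class(X_class, n_new, generate_point, rng=None):
    """Generic framework: draw n_new seed points with replacement from the
    class data, then apply per-point generation logic to each seed."""
    rng = np.random.default_rng(rng)
    # Choose n_new points randomly with replacement from the class
    seeds = X_class[rng.integers(0, len(X_class), size=n_new)]
    # Apply the per-algorithm logic (replication, Gaussian sampling, SMOTE, ...)
    return np.array([generate_point(x, X_class, rng) for x in seeds])

# Naive random oversampling is the trivial instance: replicate the seed point.
replicate = lambda x, X, rng: x
```

Plugging a different `generate_point` into this skeleton yields each of the algorithms discussed in this story.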

**SMOTE (Synthetic Minority Oversampling Technique)**

Thus, to explain what SMOTE does, we only need to answer one question: what logic is performed on each of the *Nx* examples randomly chosen with replacement from class *X* in order to generate *Nx* new examples?

The answer is as follows:

- Find the k-nearest neighbors of the point (k is a hyperparameter of the algorithm)
- Choose one of them randomly
- Draw a line segment from the point to that randomly chosen neighbor
- Randomly choose a point on that line segment
- Return it as the new point

Mathematically,

- If point *x_i* has nearest neighbors *z_i1, z_i2, …, z_ik*
- And if *j* is a random number in *[1, k]*
- And *r* is a random number in *[0, 1]*

then for each point *x_i* SMOTE generates a new point *x_i’* by simply applying:

*x_i’ = x_i + r (z_ij − x_i)*

This is really all the SMOTE algorithm does: from point *x_i*, walk a fraction *r* of the way along the vector *z_ij − x_i*, then place a new point there.
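The generation logic above can be sketched as follows (a minimal illustration; the function name and signature are hypothetical, and neighbors are searched within the class data passed in):

```python
import numpy as np

def smote_point(x_i, X_class, k=5, rng=None):
    """Generate one synthetic point from seed x_i via SMOTE interpolation."""
    rng = np.random.default_rng(rng)
    # Find the k nearest neighbors of x_i within its class (skip x_i itself)
    dists = np.linalg.norm(X_class - x_i, axis=1)
    neighbors = X_class[np.argsort(dists)[1:k + 1]]
    z_ij = neighbors[rng.integers(len(neighbors))]  # pick a random neighbor
    r = rng.random()                                # random point on the segment
    return x_i + r * (z_ij - x_i)
```

Note that the new point always lies on the segment between the seed and one of its k nearest same-class neighbors.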

A minor side note: there is a small difference between how the algorithm operates here and how it is presented in the paper. In particular, the authors assume that the ratios are integers (and floor them if not). If the ratio for class *X* is an integer *C*, then for each point in the class a random neighbor is chosen with replacement *C* times, and the SMOTE logic we described is applied each time. **In practice**, when SMOTE is implemented it’s typically generalized to work with floating-point ratios as we described, by choosing *Nx* points randomly and then applying SMOTE to each. For an integer ratio such as *C*=2, each point is picked twice on average and we recover the original algorithm. This should make sense, as it is the same transition from oversampling by repetition with integer ratios to random oversampling that was explained in the last story.

This animation shows how the decision regions of an SVM change as the oversampling ratio for the versicolor class changes, over an imbalanced subset of the Iris dataset. The ratio here is relative to the size of the majority class; that is, a ratio of 1.0 would set *Nx* so that the versicolor class has as many examples as the virginica class.

You may be wondering why SMOTE would ever be better than ROSE. After all, SMOTE’s logic for generating points is not justified in the paper; meanwhile, sampling from an estimate of *P(x|y)* as done in ROSE is much more rational and intuitive. One problem is that getting a good estimate of *P(x|y)* requires a lot of data; however, minority classes often have little data. If we don’t have a lot of data, we have two options:

- Choose the bandwidth to be too small, in which case we return to the risk of overfitting, as in random oversampling
- Choose it to be too big, which in the extreme case is equivalent to uniformly adding random points from the feature space (i.e., unrealistic examples)

If you think about it, SMOTE is less prone to this problem. If a hyperplane that perfectly separates the data exists, then that solution still exists after applying SMOTE. In fact, the way SMOTE generates points may cause a nonlinear hypersurface to become more linear, so it seems to be at a much lower risk of causing the model to overfit.

**SMOTE-NC (SMOTE-Nominal Continuous)**

While both ROSE and SMOTE appear to offer a significant improvement over naive random oversampling, they come with the drawback of losing the ability to handle categorical variables, which was not a problem for naive random oversampling. The authors of SMOTE were bright enough to circumvent this by developing an extension of the SMOTE algorithm that handles cases where categorical features are also present.

You may be thinking that encoding the categorical features would get around this altogether; however, that is not quite right, because SMOTE or ROSE would then treat them as continuous and generate invalid values for them. For instance, if a feature is binary, then the chosen point along the line may have a value of 0.57 for that feature, which is neither 0 nor 1. Rounding it is a bad idea because that is equivalent to randomly choosing whether it is 0 or 1.

Recall how SMOTE generates a new point:

- Suppose point *x_i* has nearest neighbors *z_i1, z_i2, …, z_ik*
- Let *j* be a random number in *[1, k]*
- Let *r* be a random number in *[0, 1]*

then for each point *x_i* SMOTE generates a new point *x_i’* by simply applying *x_i’ = x_i + r (z_ij − x_i)*.

It should be obvious that we can’t apply the same method in the presence of categorical features unless we extend it by answering the following two questions:

- How are the k-nearest neighbors found? The Euclidean distance metric only operates on continuous features.
- How is the new point generated? We can’t apply the SMOTE equation to generate the categorical part of *x_i’*.

For the first question, the authors suggested a modification of the Euclidean distance to account for the categorical part. Suppose each of *x_i* and *z_ij* involves *m* continuous features and *n* categorical features. In the modified metric, the continuous features are subtracted and squared as usual, and then a constant penalty is added to the squared sum for each differing pair of categorical features. This penalty is in particular the square of the median of the standard deviations of all continuous features, which can be computed at the start of the algorithm.

As an example, consider measuring the distance between two points *x_1* and *x_2* that have the same values in all categorical features except the last two. If the median of the standard deviations of the continuous features is *Med*, the distance is given by

*d(x_1, x_2) = √( (x_11 − x_21)² + … + (x_1m − x_2m)² + Med² + Med² )*

where the last two terms account for how the last two categorical features differ.

Although the authors provide no justification for this metric, it may make sense after observing that one of the most common ways to measure distance between categorical feature vectors is the Hamming distance, which simply adds 1 for each differing pair of categorical features. A Hamming distance of 6 means that the two points have different values in 6 of the categorical features. In our case, setting the penalty to 1 as in the Hamming distance is not sensible: if the continuous features vary strongly, the 1s become insignificant in the sum, which is equivalent to ignoring the categorical features in the measurement. Using a penalty on the scale of the typical squared difference between two values of a continuous feature gets around this problem, because if the variance of the continuous features is large, the penalty is comparably large and not negligible. The only catch is that the authors used the median of the standard deviations rather than the mean, which may be justified by its robustness to outliers.
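A minimal sketch of the modified metric, assuming the penalty enters the squared sum as *Med²* per differing categorical pair (the function name and signature are illustrative):

```python
import numpy as np

def smote_nc_distance(x1_cont, x1_cat, x2_cont, x2_cat, med):
    """Modified Euclidean distance: squared differences over the continuous
    features, plus a penalty of med**2 for each differing categorical pair,
    where med is the median of the continuous features' standard deviations."""
    d2 = np.sum((np.asarray(x1_cont) - np.asarray(x2_cont)) ** 2)
    d2 += (med ** 2) * sum(a != b for a, b in zip(x1_cat, x2_cat))
    return np.sqrt(d2)
```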

Answering the second question is far simpler. Now that we have found the k-nearest neighbors using the modified metric, we can generate the continuous part of the new point using the SMOTE equation as usual. To generate the categorical part of the new point, it makes sense to simply take the mode of the categorical parts of the k-nearest neighbors; i.e., let the neighbors vote over the values in the categorical part, where the most common values dominate.

It follows that what SMOTE-NC does to generate a new point is:

- Find the k-nearest neighbors of the point (k is a hyperparameter of the algorithm) using the modified Euclidean metric
- Choose one of them randomly
- Draw a line segment from the point to that neighbor in the continuous features space
- Randomly choose a point on that line segment
- Let that be the continuous part of the new point
- For the categorical part of the new point, take the mode of the categorical parts of the k-nearest neighbors.
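The steps above can be sketched as one function (illustrative names; assumes the continuous and categorical parts are passed as separate arrays and that the penalty is *Med²* per differing pair):

```python
import numpy as np
from collections import Counter

def smote_nc_point(i, X_cont, X_cat, med, k=5, rng=None):
    """Generate one synthetic point with SMOTE-NC from seed index i.
    X_cont: (N, m) continuous part; X_cat: (N, n) categorical part."""
    rng = np.random.default_rng(rng)
    # Modified Euclidean distance: med**2 penalty per differing categorical pair
    d2 = np.sum((X_cont - X_cont[i]) ** 2, axis=1)
    d2 += (med ** 2) * np.sum(X_cat != X_cat[i], axis=1)
    nn = np.argsort(d2)[1:k + 1]            # k nearest neighbors (skip self)
    j = nn[rng.integers(len(nn))]           # random neighbor
    r = rng.random()
    new_cont = X_cont[i] + r * (X_cont[j] - X_cont[i])  # SMOTE interpolation
    # Categorical part: mode over the k nearest neighbors, feature by feature
    new_cat = [Counter(col).most_common(1)[0][0] for col in X_cat[nn].T]
    return new_cont, new_cat
```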

**SMOTE-N (SMOTE-Nominal)**

It should be obvious that SMOTE-NC reduces to SMOTE when no categorical features are involved, because then the penalty is zero and the mode step in the generation is skipped. However, if no continuous features are involved, then the algorithm is in a precarious position because the penalty is undefined: there are no continuous features to compute it from. A workaround might be to set it to 1 or some other constant and run the algorithm as normal, but that is not ideal because there will easily be many ties when computing the nearest neighbors. *If the Hamming distance between one point and each of 10 other points is 7, are they all really equally close to that point?* Or do they just share in common that they differ from the point in 7 of the features?

SMOTE-N is another algorithm that the authors present in the paper to deal with data that is purely categorical. It responds negatively to the italicized question above by employing another distance metric over the categorical features. Once the k nearest neighbors are found, a mode computation decides the new point; however, this time the point itself is also involved in the computation.

It hence suffices to explain the distance metric used in SMOTE-N to perform k-NN. The metric is called the modified value difference metric (Cost & Salzberg, 1993), and it operates as follows, given two feature vectors with *q* categorical features and *p_1, p_2, …, p_q* possible values for each of the categorical features respectively:

- Encode each categorical value as a vector *V* of length *K*, where *K* is the number of classes. *V[i]* should be the frequency of that value in the *i*-th class divided by its frequency over all classes.
- Now any categorical feature vector is represented by a tensor of *q* vectors, each of length *K*.
- Compute the distance between any two categorical vectors represented this way by computing the Manhattan distance between each pair of length-*K* vectors and then taking the L2 norm of the result.
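A sketch of the encoding and the distance computation (the helper names are illustrative; `vdm_encode` builds the per-class frequency vectors for one categorical feature from the training data):

```python
import numpy as np

def vdm_encode(values, labels, classes):
    """Encode each categorical value as a vector of per-class conditional
    frequencies: V[i] = count(value in class i) / count(value overall)."""
    values, labels = np.asarray(values), np.asarray(labels)
    enc = {}
    for v in np.unique(values):
        mask = values == v
        enc[v] = np.array([(labels[mask] == c).sum() for c in classes]) / mask.sum()
    return enc

def vdm_distance(x1, x2, encoders, norm=2):
    """Manhattan distance between the encodings of each feature pair,
    then the L2 (or L1) norm of the resulting vector of distances."""
    per_feature = [np.abs(enc[a] - enc[b]).sum()
                   for a, b, enc in zip(x1, x2, encoders)]
    return np.linalg.norm(per_feature, ord=norm)
```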

As an example, suppose we want to find the distance between the following two categorical vectors

then given 3 classes, after encoding them suppose we end up with

After computing the Manhattan distance between each pair of vectors we have

which evaluates to 1.428 after taking the L2 norm.

To be precise, the paper points out that it is possible to use either the L1 or the L2 norm for the magnitude, but it does not decide which to use for the algorithm (here we chose the L2).

You may be asking yourself why this would be any better than using the plain Hamming distance. The honest answer is that the authors have not justified it. However, to build some slight intuition, we argued earlier that the Hamming distance may often result in a lot of ties during the distance computation for k-NN. Suppose we have three categorical vectors

Here the Hamming distance would suggest that *x_2* and *x_3* are equally close to *x_1*, as in both cases the Hamming distance is 1. Meanwhile, the modified value difference metric would look at how each value is distributed over the classes before deciding which is closer. Suppose the frequencies per class for B2 are [0.3, 0.2, 0.5], for B3 are [0.1, 0.9, 0], and for B1 are [0.25, 0.25, 0.5]. In this case, MVDM would suggest that *x_3* is much closer to *x_1*, because B1 is much closer to B2 than it is to B3. From a probabilistic perspective, if we were to collect a new point with an unknown class, then knowing whether the category is B1 or B2 would not help much in predicting the class, and in this sense the two values are similar or interchangeable.

Hence, in summary the SMOTE-N algorithm operates as follows to generate a new point:

- Find the k-nearest neighbors of the point (k is a hyperparameter of the algorithm) using the modified value difference metric
- Return the mode of the categorical values of the neighbors (including the point itself) as the new point
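A sketch of SMOTE-N’s generation step under these assumptions (the `encoders` structure, a list of per-feature dicts mapping each category to its per-class frequency vector, is an illustrative choice):

```python
import numpy as np
from collections import Counter

def smote_n_point(i, X_cat, encoders, k=5, rng=None):
    """Generate one synthetic point with SMOTE-N from seed index i: find the
    k nearest neighbors under the modified value difference metric, then take
    the per-feature mode over the neighbors plus the point itself."""
    rng = np.random.default_rng(rng)
    def dist(a, b):
        # Manhattan distance per encoded feature pair, then L2 norm of the result
        per_feature = [np.abs(enc[u] - enc[v]).sum()
                       for u, v, enc in zip(a, b, encoders)]
        return np.linalg.norm(per_feature)
    d = np.array([dist(X_cat[i], x) for x in X_cat])
    nn = np.argsort(d)[1:k + 1]                 # k nearest neighbors (skip self)
    group = np.vstack([X_cat[[i]], X_cat[nn]])  # include the point itself
    return [Counter(col).most_common(1)[0][0] for col in group.T]
```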

That’s it! It should be now crystal clear to you how each of SMOTE, SMOTE-N and SMOTE-NC work. Till next time, au revoir.

References:

[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.