In machine learning, **dimensionality reduction** refers to reducing the number of input variables in the dataset. The number of input variables refers to the **dimensionality** of the dataset.

Dimensionality reduction techniques are mainly divided into two main categories: *Linear* and *Non-linear (Manifold)*.

Under linear methods, we have discussed *Principal Component Analysis (PCA)*, *Factor Analysis (FA)*, *Linear Discriminant Analysis (LDA)* and *Non-Negative Matrix Factorization (NMF)*.

Under non-linear methods, we have discussed *Autoencoders (AEs)* and *Kernel PCA*.

**t-Distributed Stochastic Neighbor Embedding (t-SNE)** is also a *non-linear *dimensionality reduction method used for visualizing high-dimensional data in a lower-dimensional space to find important clusters or groups in the data.

All dimensionality reduction techniques fall under the category of unsupervised machine learning in which we can reveal hidden patterns and important relationships in the data without requiring labels.

So, dimensionality reduction algorithms deal with unlabeled data. When training such algorithms, the

fit()method only needs the feature matrix,Xas the input and it does not require the label column,y—Source:Non-Negative Matrix Factorization (NMF) for Dimensionality Reduction in Image Data

**What you will learn:**

---------------------------------------------------

01. Advantages of t-SNE over PCA

02. Disadvantages of t-SNE

03. How t-SNE works

04. Python implementation of t-SNE - TSNE() class

05. Important arguments of TSNE() class

06. KL divergence

07. The MNIST data in tabular format

08. Visualizing MNIST data using PCA

09. Visualizing MNIST data using t-SNE

10. PCA before t-SNE (Very special trick)

11. Choosing the right value for perplexity

12. Changing the right number of iterations

13. Randomness of initialization

14. PCA initialization

15. Using a random state in random initialization**Other dimensionality reduction methods**

---------------------------------------------------

1. Principal Component Analysis (PCA)

2. Factor Analysis (FA)

3. Linear Discriminant Analysis (LDA)

4. Non-Negative Matrix Factorization (NMF)

5. Autoencoders (AEs)

6. Kernel PCA

There are mainly two advantages of t-SNE over PCA:

- t-SNE can preserve the spatial relationship between data points after reducing the dimensionality of the data. It means that the nearby data (points with similar characteristics) in the original dimension will still be nearby in the lower dimension! That is why t-SNE is mostly used to find clusters in the data.
- t-SNE can handle non-linear data which is very common in real-world applications.

PCA tries to reduce dimensionality by maximizing variance in the data while t-SNE tries to do the same by keeping similar data points together (and dissimilar data points apart) in both higher and lower dimensions.

Because of these reasons, t-SNE can easily outperform PCA in dimensionality reduction. Today, you will see this in action with an actual implementation of both t-SNE and PCA on the same dataset. You can compare the outputs of both methods and verify the fact!

- The major disadvantage of t-SNE is that it requires a lot of computation resources to execute the algorithm on large datasets. So, it is time-consuming to execute t-SNE when the dimensionality of data is very high. To avoid this, we will discuss a very special trick. You can get almost similar results but with significantly less time!
- Another disadvantage is that you will get unstable (different) results with t-SNE if you use random initialization even with the same hyperparameter values. You will learn more about this at the end.

You will also learn how to tune some of the most important hyperparameters that come with Scikit-learn’s t-SNE algorithm for obtaining even better results!

So, just continue reading!

There are two probability distributions involved in t-SNE calculations. So, the algorithm is stochastic by nature as its name implies.

- In the high-dimensional space, we use
**Gaussian (normal) distribution**to convert pairwise distances between data points into conditional probabilities. - In the low-dimensional space, we use
**Student’s t-distribution**to convert pairwise distances between data points into conditional probabilities.

The KL divergence measures the divergence (difference) between these two probability distributions. It is the cost function that we try to minimize during the training of the algorithm.

So, t-SNE aims to locate the data points in the low-dimensional space in a way that it can minimize the KL divergence between the two probability distributions as much as possible.

t-SNE requires heavy computational resources and too much time to do these probability calculations, especially with large datasets. That’s why t-SNE is extremely slow when compared to PCA. As a solution for this, we will discuss a very special trick.

In Python, t-SNE is implemented by using Scikit-learn’s **TSNE()** class. Scikit-learn is the Python machine-learning library.

First, you need to import and create an instance of the **TSNE()** class.

`# Import`

from sklearn.manifold import TSNE# Create an instance

TSNE_model = TSNE(n_components=2, perplexity=30.0, learning_rate='auto',

n_iter=1000, init='pca', random_state=None)

## Important arguments of TSNE() class

There are many arguments in the **TSNE()** class. If we do not specify them directly, they take their default values when calling the **TSNE()** function. To learn more about those arguments, refer to the Scikit-learn documentation.

The following list contains detailed explanations of the most important arguments in the **TSNE()** class.

**n_components:**An integer that defines the number of dimensions of the embedded space. The default is 2. The most common values are 2 and 3 which are used to visualize data in 2D and 3D spaces respectively as t-SNE is mostly used for data visualization.**perplexity:**The number of nearest neighbors to consider when visualizing data. This is the most important argument in the TSNE() class. The default is 30.0, but trying out a value between 5 and 50 is highly recommended as different values can result in significantly different results. You need to plot the KL divergence for different perplexity values and choose the right value. This is technically called*hyperparameter tuning*. In general, larger datasets require larger perplexity values.**learning_rate:**The TSNE() function uses stochastic gradient descent to minimize the KL divergence cost function. The stochastic gradient descent optimizer requires a proper learning rate value to execute the process. The learning rate determines how fast or slow the optimizer minimizes the loss function. A larger value may fail to converge the model on a solution while a smaller value may take too much time to converge.**n_iter:**The maximum number of iterations you set for gradient descent. The default is 1000.**init:**An initialization method. There are two options:*“random”*or*“pca”*. The default is PCA initialization which is more stable than random initialization and produces the same results across different executions. If we choose random initialization, the algorithm will generate different results at different executions as TSNE has a non-convex cost function and the GD optimizer may stick at a local minimum.**random_state:**An integer to get the same results across different executions when using the random initialization which generates significantly different results each time we run the algorithm. Popular integers are 0, 1, 2 and 42. Learn more here.

## Important methods of TSNE() class

**fit(X):**Learns a TSNE model from the feature matrix,**X**. No transformation is applied here.**fit_transform(X):**Learns a TSNE model from the feature matrix,**X**and returns the TSNE transformed data.

`TSNE_transformed_data = TSNE_model.fit_transform(X)`

## Important attributes of TSNE() class

**kl_divergence_:**Returns the Kullback-Leibler divergence after optimization. The GD optimizer tries to minimize the KL divergence during training. Analyzing this divergence by setting different values for perplexity

This article also includes the use of the PCA algorithm for visualizing MNIST data in a lower-dimensional space. Therefore, it is worth discussing the Python implementation of the PCA algorithm although it is optional here. I have already published in-detail articles for PCA, which can be found in the following links.

PCA articles:3 Easy Steps to Perform Dimensionality Reduction Using Principal Component Analysis (PCA), Principal Component Analysis (PCA) with Scikit-learn, An In-depth Guide to PCA with NumPy.

There are different ways of loading the MNIST dataset which contains 70,000 grayscale images of handwritten digits under 10 categories (0 to 9).

**Using Scikit-learn API:**We get the shape of (70000, 784) which is the required tabular format for the TSNE and PCA algorithms. The dataset will be loaded as a Pandas data frame. There are 70000 observations (images) in the dataset. Each observation has 784 (28 x 28) features (pixel values). The size of an image is 28 x 28.**Using Keras API:**We get the shape of (60000, 28, 28) for the train test and (10000, 28, 28) for the test set. The data will be loaded as three-dimensional NumPy arrays. This format cannot be directly used in the TSNE and PCA algorithms. We need to reshape the MNIST data.

Note:To learn more about the differences between the two APIs, click here.

For now, we will load the MNIST dataset using the Scikit-learn API. To speed up the computation process when using TSNE and PCA, we only load a part (the first 10,000 instances) of the MNIST dataset.

`from sklearn.datasets import fetch_openml`mnist = fetch_openml('mnist_784', version=1)

image_data = mnist['data'][0:10000]

labels = mnist['target'][0:10000]

print("Data shape:", image_data.shape)

print("Data type:", type(image_data))

print()

print("Label shape:", labels.shape)

print("Label type:", type(labels))

We also normalize pixel values to use with PCA and t-SNE.

`# Normalize the pixel values`

image_data = image_data.astype('float32') / 255

As you can see in the above output, the original dimensionality of MNIST data is 784 which cannot be plotted in a 2D plot. So, we need to reduce the number of dimensions to 2 by applying PCA.

Let’s see the output.

`from sklearn.decomposition import PCA`PCA_model = PCA(n_components=2)

PCA_transformed_data = PCA_model.fit_transform(image_data)

print("PCA transformed data shape:", PCA_transformed_data.shape)

print("PCA transformed data type:", type(PCA_transformed_data))

The PCA-transformed MNIST data has the shape of (10000, 2) which can now be plotted in a 2D plot.

It is clear that all data points are in a single cluster and we cannot see different clusters for each class label. This is not the representation we need.

Let’s fix this by applying TSNE to the MNIST data.

Now, we will apply t-SNE on the same dataset. When applying t-SNE, we will use default values for all arguments (hyperparameters) in the **TSNE()** class.

`from sklearn.manifold import TSNE`TSNE_model = TSNE(n_components=2, perplexity=30.0)

TSNE_transformed_data = TSNE_model.fit_transform(image_data)

print("TSNE transformed data shape:", TSNE_transformed_data.shape)

print("TSNE transformed data type:", type(TSNE_transformed_data))

The TSNE-transformed MNIST data has the shape of (10000, 2) which can now be plotted in a 2D plot same as before.

It is clear that data points are separated as clusters according to their class labels. The nearby data points of the same class in the original dimension will still be nearby in the lower dimension!

t-SNE is more effective than PCA for dimensionality reduction as it keeps similar data points together (and dissimilar data points apart) in both higher and lower dimensions. and works well with non-linear data.

PCA can’t maintain the spatial relationship between data points after dimensionality reduction although it executes really fast compared to t-SNE.

On the other hand, t-SNE is really slow with larger datasets, but it can preserve the spatial relationship between data points after dimensionality reduction.

To speed up the computation process of t-SNE, we apply PCA before t-SNE and combine both methods as in the following diagram.

First, we apply PCA to the MNIST data and reduce dimensionality to 100 (keep only 100 dimensions/components/features). Then, we apply t-SNE to the PCA-transformed MNIST data. This time, t-SNE only sees 100 features instead of 784 features and does not want to perform much computation. Now, t-SNE executes really fast but still manages to generate the same or even better results!

By applying PCA before t-SNE, you will get the following benefits.

- PCA removes noise in the data and keeps only the most important features in the data. By feeding PCA-transformed data into t-SNE, you will get an even better output!
- PCA removes multicollinearity between the input features. The PCA-transformed data has uncorrelated variables which are fed into t-SNE.
- As I already said, PCA reduces the number of features significantly. The PCA-transformed data will be fed into t-SNE ad you will get results very fast!

`PCA_model = PCA(n_components=100)`

PCA_transformed_data = PCA_model.fit_transform(image_data)TSNE_model = TSNE(n_components=2, perplexity=30.0)

PCA_TSNE_transformed_data = TSNE_model.fit_transform(PCA_transformed_data)

plt.figure(figsize=[7, 4.9])

plt.scatter(PCA_TSNE_transformed_data[:, 0], PCA_TSNE_transformed_data[:, 1],

c=np.array(labels).astype('int32'), s=5, cmap='tab10')

plt.title('Lower dimensional representation of MNIST data - TSNE after PCA')

plt.xlabel('1st dimension')

plt.ylabel('2nd dimension')

plt.savefig("PCA_TSNE.png")

It took 100 seconds to run TSNE with all 784 features. After applying PCA, it only took 20 seconds to run TSNE with PCA-transformed data which has 100 features.

The PCA-transformed data accurately represents the original MNIST data as the first 100 components capture about 90% variance in the original data. We can confirm it by looking at the following plot. Therefore, it is reasonable to feed the PCA-transformed data to t-SNE in place of the original data.

`pca_all = PCA(n_components=784)`

pca_all.fit(image_data)plt.figure(figsize=[5, 3.5])

plt.grid()

plt.plot(np.cumsum(pca_all.explained_variance_ratio_ * 100))

plt.xlabel('Number of components')

plt.ylabel('Explained variance')

Perplexity determines the number of nearest neighbors to consider when visualizing data. It is the most important hyperparameter in t-SNE. Therefore, you need to tune it properly.

One option is that you can plot the KL divergences for different perplexity values and analyze how Kl divergence behaves when increasing the value of perplexity.

`perplexity_vals = np.arange(10, 220, 10)`

KL_divergences = []for i in perplexity_vals:

TSNE_model = TSNE(n_components=2, perplexity=i, n_iter=500).fit(PCA_transformed_data)

KL_divergences.append(TSNE_model.kl_divergence_)

plt.style.use("ggplot")

plt.figure(figsize=[5, 3.5])

plt.plot(perplexity_vals, KL_divergences, marker='o', color='blue')

plt.xlabel("Perplexity values")

plt.ylabel("KL divergence")

The KL divergence is decreasing continuously when the perplexity value is increased! Therefore, we can’t decide the right value for perplexity by only analyzing this plot.

As the second option, we need to run the TSNE algorithm several times with different perplexity values and visualize the results.