It’s Okay To Not Have Appropriate Data. Just Create It Yourself. | by Avi Chawla | May, 2023

In the context of machine learning, one primary reason you may need dummy data is testing and evaluation.

Feeding data to model (Image by Author)

A major advantage of dummy data is that it can be created and manipulated at will, without having to search for suitable real-world data.

This is especially useful when dealing with sensitive or confidential data that cannot be shared or when real data is not available.

Synthetic data lets data professionals maintain a safe environment for testing without compromising the privacy and security of real data.

Another significant advantage of using dummy data is that it allows data professionals to test their models in different situations and conditions, and identify potential issues or weaknesses.

Multiple variations of data (Image by Author)

Given the immense benefits, let’s look at some popular ways to create dummy datasets in Python. More specifically, we will use Sklearn’s make_classification method.

Towards the end of this blog, there’s also a cool tool for creating dummy datasets by hand.

In this section, let’s understand how we can create a classification dataset with the make_classification method.

But just before we do that, it is important to understand some parameters of this method.

Essentially, the whole idea of the make_classification method revolves around creating a numerical dataset with some specified number of rows (n_samples) and columns (n_features):

The output of make_classification (Image by Author)

The features can be of four types:

Feature types in make_classification (Image by Author)

Let’s understand them one by one:

  1. Informative: These are the features that contribute to the classification decision. Their number is specified using the n_informative parameter.
  2. Redundant: These are linear combinations of the informative features. Their number is specified using the n_redundant parameter.
  3. Repeated: These are duplicates drawn randomly from the informative and redundant features. Their number is specified using the n_repeated parameter.
  4. Random noise: As the name suggests, these features are just random noise. The number of such features is n_features - n_informative - n_redundant - n_repeated.

Lastly, the number of informative, redundant and repeated features combined must not exceed the total number of features. They may add up to exactly n_features, in which case there are no noise features.
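A minimal sketch of this bookkeeping, using made-up feature counts chosen only for illustration (the sample size and random_state are likewise assumptions):

```python
from sklearn.datasets import make_classification

# Hypothetical feature budget chosen for illustration.
n_features, n_informative, n_redundant, n_repeated = 10, 3, 2, 1

X, y = make_classification(
    n_samples=200,
    n_features=n_features,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_repeated=n_repeated,
    random_state=0,  # assumed seed, only for reproducibility
)

# Whatever is left over becomes random-noise features.
n_noise = n_features - n_informative - n_redundant - n_repeated
print(X.shape, n_noise)

# Requesting more informative + redundant + repeated features than
# n_features is rejected with a ValueError.
try:
    make_classification(n_features=2, n_informative=2, n_redundant=1)
    over_budget_rejected = False
except ValueError:
    over_budget_rejected = True
print(over_budget_rejected)
```

Here six of the ten columns are informative, redundant or repeated, so the remaining four are pure noise.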

It’s available in the sklearn.datasets module, as imported below:
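For reference, the import is a single line (this assumes scikit-learn is already installed in your environment):

```python
# make_classification lives in scikit-learn's datasets module.
from sklearn.datasets import make_classification
```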

Next, let’s create a 2-dimensional dataset with 500 samples and two features, both informative.
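One way to write this call is sketched below. The sample and feature counts come from the text; n_clusters_per_class=1 and the random_state value are assumptions made here, not necessarily the settings behind the figure.

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500,
    n_features=2,
    n_informative=2,
    n_redundant=0,           # the default is 2, which would not fit in 2 features
    n_repeated=0,
    n_clusters_per_class=1,  # assumption: one cluster per class
    random_state=0,          # assumed seed, only for reproducibility
)

print(X.shape, y.shape)
```

X holds the two feature columns and y the class labels (0 or 1). To reproduce a scatter plot like the one below, pass X[:, 0], X[:, 1] and c=y to matplotlib's plt.scatter.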

Once we visualize this, we get:

Dummy dataset (Visualisation by Author)

That’s how simple it is to create a dummy dataset with this method.

Number of clusters per class

Let’s change the number of clusters per class to two.
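A sketch of the same kind of call with n_clusters_per_class=2. The other arguments, including the seed, are assumptions carried over for illustration.

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_clusters_per_class=2,  # each class is now built from two clusters
    random_state=0,          # assumed seed, only for reproducibility
)

print(X.shape, sorted(set(y.tolist())))
```

Note that scikit-learn requires n_classes * n_clusters_per_class to be at most 2 ** n_informative, so with two informative features and two classes, two clusters per class is the maximum.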

Once we visualize it, we see that each class in this data has two individual clusters:

Dummy dataset (Visualisation by Author)

Class imbalance

What if we need an imbalanced dataset?

This is controlled by the weights parameter of the make_classification method.

It is a list depicting the proportions of samples assigned to each class and can be used as follows:
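A minimal sketch with a 90/10 split. The proportions, the seed and flip_y=0 are assumptions chosen for illustration; flip_y defaults to 0.01, which randomly reassigns a small fraction of labels and makes the class counts only approximately match the weights.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # assumed split: about 90% class 0, 10% class 1
    flip_y=0,            # assumption: disable random label flipping
    random_state=0,      # assumed seed, only for reproducibility
)

# Count how many samples landed in each class.
counts = np.bincount(y)
print(counts)
```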

Once we plot this dataset, we can clearly see an imbalanced dataset:

Dummy dataset (Visualisation by Author)

Class separation

In all the figures above, the data points of the two classes overlap to some extent.

You can control their separation with the class_sep parameter of the method:
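The sketch below compares the default class_sep of 1.0 with a larger value. The value 3.0, the seed and the helper names make_data and centroid_distance are hypothetical choices for illustration, not the settings used for the figure.

```python
import numpy as np
from sklearn.datasets import make_classification


def make_data(class_sep):
    """Generate the 2-feature dataset with a given class separation."""
    return make_classification(
        n_samples=500,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_repeated=0,
        n_clusters_per_class=1,
        class_sep=class_sep,  # larger values push the classes further apart
        random_state=0,       # assumed seed, only for reproducibility
    )


def centroid_distance(X, y):
    """Euclidean distance between the mean points of class 0 and class 1."""
    return np.linalg.norm(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))


X_default, y_default = make_data(class_sep=1.0)  # library default
X_wide, y_wide = make_data(class_sep=3.0)        # assumed larger value

gap_default = centroid_distance(X_default, y_default)
gap_wide = centroid_distance(X_wide, y_wide)
print(gap_default < gap_wide)
```

The distance between class means is only a rough proxy for what the scatter plot shows visually, but it makes the effect of class_sep easy to check numerically.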

If we plot this data now, we get:

Dummy dataset (Visualisation by Author)

From the visualization, it is pretty evident that now there’s a large gap between the two classes.

Often we want data of some specific shape. Generating it programmatically may be feasible, but it can also get tedious and time-consuming.

Instead, you can use drawdata, which lets you draw any 2D dataset by dragging the mouse in a notebook and then export it.

You can install it from PyPI by running pip install drawdata.

Once installed, open a Jupyter notebook, import the method, and invoke it:

Creating dummy dataset with drawdata (Image by Author)

As demonstrated above, you can create a dataset by simply dragging the mouse.

After creating it, click copy csv and use Pandas’ read_clipboard() method to convert it to a DataFrame:
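A minimal sketch of that last step. The helper load_drawn_points and the sample CSV text (including its column names) are hypothetical, added so the snippet also runs without a clipboard; the actual columns depend on what drawdata exports in your version.

```python
import io

import pandas as pd


def load_drawn_points(csv_text=None):
    """Return the drawn points as a DataFrame.

    With no argument, read the CSV that "copy csv" placed on the system
    clipboard; otherwise parse the given CSV text directly.
    """
    if csv_text is None:
        return pd.read_clipboard(sep=",")
    return pd.read_csv(io.StringIO(csv_text))


# Hypothetical sample standing in for the copied CSV.
sample_csv = "x,y,z\n10.5,20.1,a\n30.2,40.8,b\n"
df = load_drawn_points(sample_csv)
print(df)
```

In a notebook, after clicking copy csv, calling load_drawn_points() with no argument reads straight from the clipboard.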

Besides a scatter plot, drawdata can also create histograms and line plots.
