# Hands on Sampling Techniques and comparison, in Python | by Piero Paialunga | Dec, 2023

## Here’s a step-by-step tutorial on how to sample your dataset efficiently using Python

I was putting the Christmas tree up with my wife. We went to the basement, took the tree, brought it upstairs, and started building it from bottom to top. It’s always a magic moment🎄

Then it was time to hang the balls on the tree. And immediately I thought: there are at least three ways to put the balls on the tree.

• Uniformly: space the balls evenly across the tree
• Randomly: put the balls randomly on the tree, closing your eyes and putting each ball wherever you feel like (I started doing this and my wife went MAD)
• Latin Hypercube: split the tree into N sections and pick a random spot within each one of these sections. It’s very hard to draw without running any code.

I tried to show this to my wife. She smiled and said “Whatever”, so I went to my computer in the hope that your reaction would be something more satisfactory 😤
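For the curious, here is a minimal one-dimensional sketch of the three decorating strategies, treating the height of the tree as the interval from 0 to 1. The function names and the number of balls are made up for illustration, and the Latin Hypercube shown here is only its 1D form: in more dimensions, the sections are also shuffled independently along each dimension.

```python
import random


def uniform_positions(n):
    """n evenly spaced positions on [0, 1], endpoints included."""
    return [i / (n - 1) for i in range(n)]


def random_positions(n, rng):
    """n positions drawn independently and uniformly from [0, 1)."""
    return [rng.random() for _ in range(n)]


def latin_hypercube_positions(n, rng):
    """One random position inside each of n equal sections of [0, 1)."""
    return [(i + rng.random()) / n for i in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_balls = 5  # hypothetical number of balls

uniform = uniform_positions(n_balls)
scattered = random_positions(n_balls, rng)
stratified = latin_hypercube_positions(n_balls, rng)

print("uniform:        ", uniform)
print("random:         ", [round(p, 3) for p in scattered])
print("latin hypercube:", [round(p, 3) for p in stratified])
```

The random draw can leave whole sections of the tree empty, while the stratified draw is guaranteed to place exactly one ball in each section.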

Jokes aside, when dealing with Machine Learning problems there are two different scenarios:

1. You don’t have any control over the dataset. You have a client, or a company, that hands you a dataset. That is what you have to work with until an eventual re-training is scheduled.

For example, in the city of New York, you want to predict the price of a house based on some given features. They just give you the dataset and want you to build a model so that, when a new client arrives, your software can predict the price from the features of the house of interest.

2. You can build your own Design of Experiments. This is when you have a forward model or a real-world experiment that you can always set up to run.

For example, in a laboratory, you want to predict a physical signal given an experimental setup. You can always go to the lab and generate new data.

The considerations that you make in the two cases are completely different.

In the first case you can expect a dataset that is unbalanced in its features, maybe with missing input values and a skewed distribution of the target values. Dealing with these things is the joy and damnation of a data scientist’s job, though. You do data augmentation, data filtering, fill in the missing values, run some ANOVA testing if you can, and so forth.

In the second case, you have complete control over what’s going on in your dataset, especially from the input perspective. This means that if you have a NaN value you can repeat the experiment; if you have several NaN values you can investigate that weird area of your dataset; and if you have a suspiciously large value for some given feature you can just repeat the experiment to make sure it’s not a hallucination of your setup.

Since we have this much control, we want to make sure we cover the input parameter space efficiently. For example, suppose you have 3 parameters and you know their boundaries:

$$x_i^{L} < x_i < x_i^{U}$$

where i goes from 1 to 3 (or from 0 to 2 if you like Python so much 😁). Here, x_i is the i-th variable: it is always larger than its lower boundary x_i^L and always smaller than its upper boundary x_i^U.

We have our 3-dimensional cube.
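As a rough sketch of how such a box could be represented in code, the snippet below stores one lower and one upper boundary per parameter. The numeric bounds and the helper names `inside_box` and `from_unit_cube` are hypothetical, chosen only for illustration. The membership check allows the endpoints, which is an assumption made here because the grid examples in this article include them.

```python
# Hypothetical boundaries for three parameters (illustrative values only).
lower = [0.0, 10.0, -1.0]  # lower boundary of each parameter
upper = [1.0, 20.0, 1.0]   # upper boundary of each parameter


def inside_box(x, lower, upper):
    """True if every coordinate lies within its bounds (endpoints allowed)."""
    return all(lo <= xi <= hi for xi, lo, hi in zip(x, lower, upper))


def from_unit_cube(u, lower, upper):
    """Map a point u in the unit cube [0, 1]^d onto the bounded box."""
    return [lo + ui * (hi - lo) for ui, lo, hi in zip(u, lower, upper)]


# The center of the unit cube maps to the center of the box.
center = from_unit_cube([0.5, 0.5, 0.5], lower, upper)
print(center)
print(inside_box(center, lower, upper))
```

Working in the unit cube and rescaling at the end is a common convenience: any sampling scheme defined on [0, 1] per dimension can then be reused for arbitrary bounds.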

Now, remember that we have complete control of our dataset. How do we sample? In other words, how do we determine the x values? Which points do we want to select so that we can run the forward model (experiment or simulation) and get the target values?

As you can expect there are multiple methods to do so. Each method has its advantages and disadvantages. In this study, we will discuss them, show the theory behind them, and display the code for everyone to use and understand more about the beautiful world of sampling. 🙂

Let’s start with the uniform sampling:

The uniform sampling method is arguably the simplest and best-known one.

It is just about splitting each parameter (or dimension) into evenly spaced steps. Let’s assume that we have 3 steps per dimension, for 2 dimensions. Each dimension goes from 0 to 1 (we will extend this in a minute). This would be the sampling:

• (0,0)
• (0,0.5)
• (0,1)
• (0.5,0)
• (0.5,0.5)
• (0.5,1)
• (1,0)
• (1,0.5)
• (1,1)

This means that we hold one variable fixed while stepping through the values of the other, then move the fixed variable to its next step and repeat. Fairly simple. Let’s code it:

## 1.1 Uniform Sampling Code

How do we do this? Let’s avoid this kind of structure:

• for a in dimension 1
• for b in dimension 2
• ….
• for last letter of the alphabet in dimension number of letters in the alphabet: X.append([a,b,…,last letter of the alphabet])

We don’t want this: it is not very efficient, and you need to define one variable per dimension, which is annoying. Let’s use the magic of numpy instead.
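One possible numpy version is sketched below. It is an illustration and not necessarily the exact code used in this tutorial: the variable names `steps_per_dimension`, `n_dimensions`, `axes`, `grid` and `X` are assumptions, and the approach relies on `np.linspace` and `np.meshgrid` to build the grid for any number of dimensions without writing one loop per dimension.

```python
import numpy as np

steps_per_dimension = 3
n_dimensions = 2

# One evenly spaced axis per dimension, endpoints included: [0, 0.5, 1].
axes = [np.linspace(0.0, 1.0, steps_per_dimension) for _ in range(n_dimensions)]

# indexing="ij" keeps the first dimension as the slowest-varying one,
# so the points come out in the order (0,0), (0,0.5), (0,1), (0.5,0), ...
grid = np.meshgrid(*axes, indexing="ij")

# Stack the coordinate arrays and flatten them into one row per point.
X = np.stack(grid, axis=-1).reshape(-1, n_dimensions)

print(X.shape)
print(X)
```

Changing `n_dimensions` is enough to move to a higher-dimensional grid, but keep in mind that the number of points grows as `steps_per_dimension ** n_dimensions`.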