Using the Monte Carlo method to visualize the behavior of observations with very large numbers of features
Think of a dataset, made of some number of observations, each observation having N features. If you convert all features to a numeric representation, you could say that each observation is a point in an N-dimensional space.
When N is low, the relationships between points are just what you would expect intuitively. But sometimes N grows very large; this can happen, for example, if you create many features via one-hot encoding. For very large values of N, observations behave as if they are sparse, or as if the distances between them are somehow bigger than what you would expect.
The phenomenon is real. As the number of dimensions N grows and everything else stays the same, the space available to your observations expands in a sense (at the very least, there are more degrees of freedom), and the Euclidean distances between observations increase as well. The group of points really does become sparser. This is the geometric basis for the curse of dimensionality, and it influences the behavior of the models and techniques applied to the dataset.
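The growth in distance is easy to check numerically. The sketch below is only an illustration under stated assumptions: points are drawn uniformly from the unit cube, and the helper name `mean_pairwise_distance`, the sample size, and the seed are all made up for this example. For uniform points the expected squared distance is exactly N/6, so the average distance should approach sqrt(N/6) as N grows.

```python
import numpy as np


def mean_pairwise_distance(n_dims, n_points=500, seed=0):
    """Average Euclidean distance between distinct points drawn
    uniformly at random from the unit cube in n_dims dimensions."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n_points, n_dims))
    # Squared distances via |a|^2 + |b|^2 - 2 a.b (avoids an n x n x d array)
    sq = np.sum(pts**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (pts @ pts.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative rounding errors
    upper = np.triu_indices(n_points, k=1)  # each pair once, no self-pairs
    return float(np.sqrt(d2[upper]).mean())


for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(mean_pairwise_distance(n_dims), 3))
```

The number of points stays the same in every run; only the number of dimensions changes, yet the points drift apart.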
Many things can go wrong if the number of features is very large. Having more features than observations is a typical setup for models overfitting in training. Any brute-force search in such a space (e.g. grid search) becomes less efficient, because you need more trials to cover the same intervals along every axis. A more subtle effect hits any model based on distance or vicinity: as the number of dimensions grows very large, then from the point of view of any one observation, all the other points appear to be far away and nearly equidistant. Since these models rely on distance to do their job, the leveling out of differences in distance makes that job much harder. For example, clustering doesn’t work as well if all points appear to be nearly equidistant.
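The near-equidistance effect can be illustrated with a small experiment. This is a rough sketch, assuming uniformly distributed points in the unit cube; `relative_contrast` is a hypothetical helper defined here, and it measures how much farther the farthest neighbor is than the nearest one, relative to the nearest.

```python
import numpy as np


def relative_contrast(n_dims, n_points=2000, seed=0):
    """(farthest - nearest) / nearest distance, measured from one
    reference point to all the other points."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n_points, n_dims))
    ref, others = pts[0], pts[1:]
    dist = np.linalg.norm(others - ref, axis=1)
    return float((dist.max() - dist.min()) / dist.min())


for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

In low dimensions the nearest neighbor is much closer than the farthest one, so the contrast is large. As the number of dimensions grows, the contrast shrinks toward zero, which is exactly what hurts distance-based methods.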
For all these reasons, and more, techniques such as PCA, LDA, etc. have been created — in an effort to move away from the peculiar geometry of spaces with very many dimensions, and to distill the dataset down to a number of dimensions more compatible with the actual information contained in it.
It is hard to perceive intuitively the true magnitude of this phenomenon, and spaces with more than 3 dimensions are extremely challenging to visualize, so let’s do some simple 2D visualizations to help our intuition. There is a geometric basis for the reason why dimensionality can become a problem, and this is what we will visualize here. If you have not seen this before, the results might be surprising — the geometry of high-dimensional spaces is far more complex than the typical intuition is likely to suggest.
Consider a square with side length 1, centered at the origin. In the square, you inscribe a circle, which therefore has radius 0.5.
That is the setup in 2 dimensions. Now think in the general, N-dimensional case. In 3 dimensions, you have a sphere inscribed in a cube. Beyond that, you have an N-sphere inscribed in an N-cube, which is the most general way to put it. For simplicity, we will refer to these objects as “sphere” and “cube”, no matter how many dimensions they have.
The volume of the cube is fixed: it is always 1, no matter how many dimensions it has. The question is: as the number of dimensions N varies, what happens to the volume of the sphere?
Let’s answer the question experimentally, using the Monte Carlo method. We will generate a very large number of points, distributed uniformly at random within the cube. For each point we calculate its distance to the origin. If that distance is at most 0.5 (the radius of the sphere), then the point is inside the sphere.
If we divide the number of points inside the sphere by the total number of points, the result approximates the ratio of the volume of the sphere to the volume of the cube. Since the volume of the cube is 1, that ratio is equal to the volume of the sphere. The approximation improves as the total number of points grows.
In other words, the ratio inside_points / total_points will approximate the volume of the sphere.
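To make the estimator concrete, here is a small-scale sketch in NumPy for the 2D case, where the exact answer is known: a circle of radius 0.5 has area π/4. The function name `sphere_volume_mc`, the number of points, and the seed are assumptions made for this illustration only.

```python
import numpy as np


def sphere_volume_mc(n_dims, n_points=1_000_000, seed=0):
    """Monte Carlo estimate of the volume of the sphere of radius 0.5
    inscribed in the unit cube centered at the origin."""
    rng = np.random.default_rng(seed)
    # Uniform points in the cube [-0.5, 0.5)^n_dims
    pts = rng.random((n_points, n_dims)) - 0.5
    # Comparing squared distances to 0.5**2 avoids the square root
    inside = np.sum(pts**2, axis=1) <= 0.25
    # inside_points / total_points
    return float(inside.mean())


print(sphere_volume_mc(2))  # Monte Carlo estimate
print(np.pi / 4)            # exact area of the circle
```

With one million points, the two printed values should agree to roughly two or three decimal places.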
The code is rather straightforward. Since we need many points, explicit Python loops over the points should be avoided. We could use NumPy, but it’s CPU-only and its element-wise operations are largely single-threaded, so it will be slow. Potential alternatives: CuPy (GPU), JAX (CPU or GPU), PyTorch (CPU or GPU), etc. We will use PyTorch, but the NumPy code would look almost identical.
If you follow the nested torch statements, we generate 100 million random points, calculate their distances to the origin, count the points inside the sphere, and divide the count by the total number of points. The ratio array will end up containing the volume of the sphere in different numbers of dimensions.
The tunable parameters are set for a GPU with 24 GB of memory — adjust them if your hardware is different.
import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# force CPU
# device = 'cpu'
# reduce d_max if too many ratio values are 0.0
d_max = 22
# reduce n if you run out of memory
n = 10**8
ratio = np.zeros(d_max)
for d in tqdm(range(d_max, 0, -1)):
    # combine large tensor statements for better memory allocation
    ratio[d - 1] = (
        torch.sum(
            torch.sqrt(
                torch.sum(torch.pow(torch.rand((n, d), device=device) - 0.5, 2), dim=1)
            )
            <= 0.5
        ).item()
        / n
    )
    # clean up memory
    torch.cuda.empty_cache()
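One way to sanity-check the Monte Carlo estimates is to compare them with the closed-form volume of an N-dimensional ball of radius r, which is π^(N/2) · r^N / Γ(N/2 + 1). The sketch below uses only the standard library; `ball_volume` is a name introduced here for illustration and is not part of the code above.

```python
import math


def ball_volume(n_dims, radius=0.5):
    """Closed-form volume of an n_dims-dimensional ball of the given radius."""
    return math.pi ** (n_dims / 2) * radius**n_dims / math.gamma(n_dims / 2 + 1)


# Compare with the Monte Carlo estimates, e.g. ratio[d - 1] for each d
for n_dims in (1, 2, 3, 10, 22):
    print(n_dims, ball_volume(n_dims))
```

The exact values also give a hint about when the estimate becomes unreliable: once the true volume drops well below 1 / n, only a handful of the n random points (or none at all) are expected to land inside the sphere.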
Let’s visualize the results: