# Navigating the Clustering Landscape | by David Farrugia | May, 2023

## K-Means Clustering

Let’s imagine for a second that we are beekeepers. We have a swarm of bees buzzing around, and our objective is to group them into K distinct hives (i.e., our clusters).

To start, we randomly pick K bees. These bees will act as our cluster centroids.

Like bees drawn to honey, each bee (data point) will gravitate towards the nearest hive (centroid).

After all the bees have found a hive, we’ll determine the new centroid of each hive (update centroids). We’ll keep repeating this process until the bees settle down and stop switching hives.

And voila, that’s the essence of K-means clustering!
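The hive-hopping loop above can be sketched in plain NumPy. This is a minimal illustration of the idea, not scikit-learn's implementation, and the function name `kmeans_sketch` is my own:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points (bees) as initial centroids (hives)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 2: each point gravitates to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if a hive ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the bees settle down and stop switching hives
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

In practice you would reach for a library implementation, but the three numbered steps here are the whole algorithm.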

## Pros

1. Simple and easy to understand
2. Easily scalable
3. Efficient in terms of computational cost

## Cons

1. You need to specify K in advance
2. Sensitive to the initial selection of centroids
3. Assumes clusters are spherical and equally sized (which may not always be the case)
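The first two drawbacks are commonly softened with the elbow method and random restarts. Here is a sketch using scikit-learn's `inertia_` attribute and `n_init` parameter; the toy data is invented for illustration:

```python
from sklearn.cluster import KMeans
import numpy as np

# Toy data for illustration
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Elbow method: fit K-means for several values of K and record the
# inertia (sum of squared distances from each point to its centroid);
# the "elbow" where the curve flattens suggests a reasonable K.
# n_init=10 reruns each fit from 10 random initializations and keeps
# the best, softening the sensitivity to the initial centroids.
inertias = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
    print(k, km.inertia_)
```

Inertia always shrinks as K grows, so you look for the point of diminishing returns rather than the minimum.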

## Perfect Use-Cases

1. Market segmentation
2. Document clustering
3. Image segmentation
4. Anomaly detection

## Python Example

```python
from sklearn.cluster import KMeans
import numpy as np

# Let's assume we have some data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# We initialize KMeans with the number of clusters we want
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# We can get the labels of the data points
print(kmeans.labels_)

# And we can predict the clusters for new data points
print(kmeans.predict([[0, 0], [4, 4]]))

# The cluster centers (the mean of all the points in that cluster)
# can be accessed with
print(kmeans.cluster_centers_)
```

## Hierarchical Clustering

Suppose that you’re attending a large family wedding where the familial connections are unclear.

Your first task is to identify the immediate family members, like siblings or parents and children, and bring them together.

Following this, you hunt for other relations who share a close bond with these established groups and incorporate them.

You continue this process, gradually piecing together the whole tapestry of family and friends until everyone is interconnected.

And voila, that’s the essence of hierarchical clustering!
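The guest-gathering process above is agglomerative (bottom-up) clustering, and the sequence of merges can be traced with SciPy's `linkage` function. A minimal sketch, with invented 2-D "guest" coordinates:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Invented 2-D "guests": two tight families plus one distant relative
guests = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]])

# Agglomerative merging: start with every guest alone, then repeatedly
# join the two closest groups (Ward linkage minimizes within-group variance)
Z = linkage(guests, method="ward")

# Each row of Z records one merge: the two groups joined, the distance
# between them, and the size of the newly formed group
print(Z)
```

Feeding `Z` to `scipy.cluster.hierarchy.dendrogram` draws the merge history as a tree.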

## Pros

1. No need to specify the number of clusters
2. Provides a hierarchy of clusters which can be useful
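To see why the hierarchy itself is useful: the linkage tree is built once and can then be cut at different depths, giving coarser or finer clusterings without refitting. A sketch with invented data, using SciPy's `fcluster`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented 2-D data: two tight groups plus one outlying point
points = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [9, 9]])
Z = linkage(points, method="ward")  # the tree is computed once

# Cut the same tree at different levels for coarser or finer groupings
two = fcluster(Z, t=2, criterion="maxclust")
three = fcluster(Z, t=3, criterion="maxclust")
print(two)
print(three)
```

Because every cut comes from the same tree, the finer clustering is always a refinement of the coarser one.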

## Cons

1. Computationally expensive for large datasets
2. Sensitive to the choice of distance measure

## Perfect Use-Cases

1. Gene sequencing
2. Social network analysis
3. Building taxonomy trees

## Python Example

```python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Let's assume we have some data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# We initialize AgglomerativeClustering with the number of clusters we want
clustering = AgglomerativeClustering(n_clusters=2).fit(X)

# We can get the labels of the data points
print(clustering.labels_)
```