Navigating the Clustering Landscape

by David Farrugia · May 2023


Let’s imagine for a second that we are beekeepers. We have a swarm of bees buzzing around, and our objective is to group them into K distinct hives (i.e., our clusters).

To start, we randomly pick K bees. These bees will act as our cluster centroids.

Like bees drawn to honey, each bee (data point) will gravitate towards the nearest hive (centroid).

After all the bees have found a hive, we’ll determine the new centroid of each hive (update centroids). We’ll keep repeating this process until the bees settle down and stop switching hives.

And voila, that’s the essence of K-means clustering!

K-means Clustering Flowchart. Image by Author.
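If you’d like to see that loop without any libraries, here is a minimal NumPy sketch of the same assign-and-update cycle. The function name kmeans_sketch and its arguments are my own, purely for illustration, and the sketch doesn’t handle edge cases like a cluster ending up empty.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K points as the initial centroids (our "hives")
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: each point gravitates to the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids settle down (the bees stop switching hives)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)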

Pros

  1. Simple and easy to understand
  2. Easily scalable
  3. Efficient in terms of computational cost

Cons

  1. You need to specify K in advance (one common way to pick it, the elbow method, is sketched after this list)
  2. Sensitive to the initial selection of centroids
  3. Assumes clusters are spherical and equally sized (which may not always be the case)
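Since K has to be chosen up front, a common heuristic is the elbow method: fit K-means for a range of K values and look for the point where the inertia (the within-cluster sum of squared distances) stops dropping sharply. A quick sketch on the same toy data, with a range of K values chosen purely for illustration:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Fit K-means for several candidate values of K and record the inertia
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)

# Plot (or eyeball) these values: the "elbow" where the drop flattens out
# is a reasonable choice for K.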

Perfect Use-Cases

  1. Market segmentation
  2. Document clustering
  3. Image segmentation
  4. Anomaly detection

Python Example

from sklearn.cluster import KMeans
import numpy as np

# Let's assume we have some data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# We initialize KMeans with the number of clusters we want
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# We can get the labels of the data points
print(kmeans.labels_)

# And we can predict the clusters for new data points
print(kmeans.predict([[0, 0], [4, 4]]))

# The cluster centers (the mean of all the points in that cluster) can be accessed with
print(kmeans.cluster_centers_)

Suppose that you’re attending a large family wedding where the familial connections are unclear.

Your first task is to identify the immediate family members, like siblings or parents and children, and bring them together.

Following this, you hunt for other relations who share a close bond with these established groups and incorporate them.

You continue this process, gradually piecing together the whole tapestry of family and friends until everyone is interconnected.

And voila, that’s the essence of hierarchical clustering!

Hierarchical Clustering Flowchart. Image by Author.
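To mirror the wedding analogy in code, here is a minimal from-scratch sketch of the agglomerative idea: start with every guest in their own group, then repeatedly merge the two closest groups. For simplicity this toy version uses single linkage (the distance between the closest pair of members), which is just one of several possible linkage rules; the function name agglomerative_sketch is my own.

import numpy as np

def agglomerative_sketch(X, n_clusters):
    # Start with every point in its own cluster (every guest stands alone)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best, best_dist = (0, 1), np.inf
        # Find the two clusters whose closest members are nearest (single linkage)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_dist:
                    best, best_dist = (a, b), d
        # Merge the closest pair and repeat
        a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
print(agglomerative_sketch(X, n_clusters=2))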

Pros

  1. No need to specify the number of clusters
  2. Provides a hierarchy of clusters which can be useful

Cons

  1. Computationally expensive for large datasets
  2. Sensitive to the choice of distance measure

Perfect Use-Cases

  1. Gene sequencing
  2. Social network analysis
  3. Building taxonomy trees

Python Example

from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Let's assume we have some data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# We initialize AgglomerativeClustering with the number of clusters we want
clustering = AgglomerativeClustering(n_clusters=2).fit(X)

# We can get the labels of the data points
print(clustering.labels_)
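AgglomerativeClustering only gives you the flat labels for the number of clusters you asked for. To actually see the hierarchy (one of the main selling points above), SciPy's linkage and dendrogram functions can draw the full merge tree. A short sketch, assuming matplotlib is available for plotting:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Build the merge tree (Ward linkage, the same default sklearn uses)
Z = linkage(X, method='ward')

# Each row of Z records one merge: the two clusters joined and their distance
print(Z)

# Draw the dendrogram to inspect the hierarchy visually
dendrogram(Z)
plt.show()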


