In a previous couple of articles, we explored some basic machine learning algorithms. Thus far we covered some simple regression algorithms, classification algorithms. We used ML.NET implementation and application of these algorithms. Up to this point, we explored algorithms that are using supervised learning. This means that we always had input and expected output data that we used to train our machine learning models. In this type of learning, the training set contains inputs and desired outputs. This way the algorithm can check its calculated output the same as the desired output and take appropriate actions based on that.
Are you afraid that AI might take your job? Make sure you are the one who is building it.
STAY RELEVANT IN THE RISING AI INDUSTRY! 🖖
1. Clustering Intuition
However, in real life, we often don’t have both input and output data, but we only have input data. This means that the algorithm on itself needs to figure connections between input samples. For that, we use unsupervised learning. In unsupervised learning, the training set contains only inputs. Just like we solve regression and classification problems with supervised learning with unsupervised learning we solve clustering problems. This technique attempts to identify similar inputs and to put them into categories, ie. it clusters data. Generally speaking, the goal is to detect the hidden patterns among the data and group them into clusters. This means that the samples which have some shared properties will fall into one group – cluster.
There are many clustering algorithms out there and in this article, and in this article we focus on K-Means Clustering because this algorithm is supported in ML.NET. However, we will explore a bit other types of clustering like Agglomerative Clustering and DBSCAN, but the focus remains on K-Means.
2. Dataset and Prerequisites
Data that we use in this article is from PalmerPenguins Dataset. This dataset has been recently introduced as an alternative to the famous Iris dataset. It is created by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. You can obtain this dataset here, or via Kaggle. This dataset is essentially composed of two datasets, each containing data of 344 penguins. Just like in Iris dataset there are 3 different species of penguins coming from 3 islands in the Palmer Archipelago. Also, these datasets contain culmen dimensions for each species. The culmen is the upper ridge of a bird’s bill. In the simplified penguin’s data, culmen length and depth are renamed as variables culmen_length_mm and culmen_depth_mm.
Data itself is not too complicated. In essence, it is just tabular data:
Note that in this tutorial, we ignore the species feature. This is because we perform unsupervised learning, ie. we don’t need the expected output value of the sample. We want our algorithm to figure that out on its own. Here is how data looks like when we plot it:
The implementations provided here are done in C#, and we use the latest .NET 5. So make sure that you have installed this SDK. If you are using Visual Studio this comes with version 16.8.3. Also, make sure that you have installed the following package:
You can do a similar thing using Visual Studio’s Manage NuGetPackage option:
If you need to catch up with the basics of machine learning with ML.NET check out this article.
3. K-Means Clustering
K-Means is one of the most popular clustering algorithms. It is definitely a go-to option when you start experimenting with your unlabeled data. This algorithm groups n data points into K number of clusters, as the name of the algorithm suggests. This algorithm can be split into several stages:
- In the first stage, we need to set the hyperparameter k. This represents the number of clusters or groups that K-Means Clustering will create once it is done.
- K random vectors are picked in the feature space. These vectors are called centroids. These vectors are changed during the training process and the goal is to put them into the “center” of each cluster.
- Distances from each input sample x to each centroid c is calculated using some metric, like Euclidean distance. The closest centroid is assigned to each sample in the dataset. Basically, we create clusters at this stage.
- For each cluster average feature vector is calculated using samples that are assigned to it. This value is considered as a new centroid of the cluster.
- Steps 2-4 are repeated for a fixed number of iteration or until the centroids don’t change, whichever comes first.
Mathematically speaking each sample x is assigned to a cluster based on:
where cᵢ is a centroid of a cluster i and D is Euclidean distance calculated using the formula:
For finding the new centroid from clustered group of points we use formula:
As we already mentioned hyperparameter k, ie. the number of clusters, has to be tuned manually. This is quite tiresome. Later in this tutorial, we consider one technique for selecting the correct number of clusters. However, let’s explore some other types of clustering that are not yet supported in ML.NET.
3. Other Types of Clustering
3.1 Agglomerative Clustering
As we can see one of the biggest challenges of working with K-Means is that we need to determine the number of clusters beforehand. Another challenge is that K-Means tries to make clusters of the same size. These challenges can be addressed with other algorithms like Hierarchical Clustering. In general, every Hierarchical Clustering method starts by putting all samples into separate single-sample clusters. Then based on some similarity metrics, samples or clusters are merged together until the point when all samples are put into a single cluster. This means that we are building the hierarchy of clusters, hence the name. In this article, we explore Agglomerative Clustering which is one specific type of Hierarchical Clustering. The metric that it uses for merging clusters is the distance, ie. it merges the closest pair of clusters, based on the distance among centroids and repeats this step until only a single cluster is left. For this purpose, the proximity matrix is used. This matrix stores the distances between each point.
Let’s break this into steps:
- Every point is stored in its own cluster
- The proximity Matrix is calculated
- Closest points are detected and merged. They are the cluster and centroid is calculated.
- The proximity matrix is updated using the centroid of the created cluster.
- Steps 3 and 4 are repeated until one cluster is created.
At this moment you might ask yourself, how does this help us with deciding the number of clusters? To do so, we utilize an awesome concept – Dendrogram. This is a tree diagram that records all merges that happened during the training process. So, every time we merge two points or clusters, this is stored in the dendrogram. Here is one example:
What are we looking at? Well, on the x-axis we have all points in the dataset, while on the y-axis we have distances between those points. Every time points or clusters are merged this is represented with a horizontal line. The vertical line represents the distances between merged points/clusters. Longer vertical lines in the dendrogram indicate that the distance between clusters is bigger. In the next step, we need to set a threshold distance and draw a horizontal line in this image. In general, we try to set the threshold in such a way that it cuts the tallest vertical line. In our example, we set it to 15. Here is how that is done:
Unlike K-Means and Hierarchical Clustering, which are centroid-based algorithms, DBSCAN is a density-based algorithm. Effectively, this means that you don’t need to determine how many clusters do you need. We saw this at Hierarchical clustering, but DBSCAN takes it to another level. Instead of defining hyperparameter k, we define two hyperparameters for distance ε and the number of samples per cluster – n. Let’s break it into steps and it will be more clear:
- First, we assign a random sample x to the first cluster.
- We count how many samples have a distance from the sample x that is less or equal to ε. If the number of such samples is greater or equal to n, we add them in the cluster.
- We observe each new member of the cluster and perform step 2 for them too, ie. we calculate the number of samples within the ε area of the sample and if that number is larger then n, we add them to the cluster. We repeat this process recursively until there are no more samples left to put in it.
- Steps from 1 to 3 for a new random unclustered sample.
- The process is repeated like this until all samples are either clustered or marked as outliers.
The main advantage of this approach is that clusters have different and random shapes. Centroid-based algorithms always create clusters that have the shape of a hypersphere. That is why DBSCAN can be specifically useful for some data. Of course, the main problem is choosing optimal values for ε and n. This problem is optimized with a variation of this algorithm called HDBSCAN, ie. high-performance DBSCAN. This algorithm eliminates the use of the ε hyperparameter, however, this algorithm is out of the scope of this tutorial.
4. Implementation with ML.NET
As we mentioned, ML.NET supports only K-Means Clustering. However, we will make our solution in a way that we expect the guys and gals from Microsoft to provide other types of clustering. That is why our solution might look over-engineered, however, we have done it so because of future flexibility.
4.1 High-Level Architecutre
Before we dive into the ML.NET implementation, let’s consider the high-level architecture of this implementation. In general, we want to build a solution that we can easily extend with new clustering algorithms that ML.NET might include in the future. With that in mind we create folder structure of our solution, that looks like this:
The Data folder contains .csv with input data and the MachineLearning folder contains everything that is necessary for our algorithm to work. The architectural overview can be represented like this:
At the core of this solution, we have an abstract TrainerBase class. This class is in the Common folder and its main goal is to standardize the way this whole process is done. It is in this class where we process data and perform feature engineering. This class is also in charge of training machine learning algorithms. The classes that implement this abstract class are located in the Trainers folder. In this particular case, we have only one class that does so. Here we utilize ML.NET K-Means algorithm. Once Microsoft adds new algorithms, we can extend this folder with new classes. These classes define which algorithm should be used. In this particular case, we have only one Predictor located in the Predictor folder.
4.2 Data Models
In order to load data from the dataset and use it with ML.NET algorithms, we need to implement classes that are going to model this data. Two files can be found in Data Folder: PalmerPenguinData and PricePalmerPenguinPredictions. The PalmerPenguinData class models input data and it looks like this:
Note that we skipped loading of the first class that represents the class of the penguin. We do this because we want to perform unsupervised learning. Meaning, we don’t use the species column during the training process.
The PricePalmerPenguinPredictions class models output data:
4.3 TrainerBase and ITrainerBase
As we mentioned, this class is the core of this implementation. In essence, there are two parts to it. The first one is the interface that describes this class and another is the abstract class that needs to be overridden with the concrete implementations, however, it implements interface methods. Here is the ITrainerBase interface:
The TrainerBase class implements this interface. However, it is abstract since we want to inject specific algorithms:
That is one large class. It controls the whole process. Let’s split it up and see what it is all about. First, let’s observe the fields and properties of this class:
The Name property is used by the class that inherits this one to add the name of the algorithm. The ModelPath field is there to define where we will store our model once it is trained. Note that the file name has .mdl extension. Then we have our MlContext so we can use ML.NET functionalities. Don’t forget that this class is a singleton, so there will be only one in our solution. The _dataSplit field contains loaded data. Data is split into train and test datasets within this structure.
The field _model is used by the child classes. These classes define which machine learning algorithm is used in this field. The _trainedModel field is the resulting model that should be evaluated and saved. In essence, the only job of the class that inherits and implements this one is to define the algorithm that should be used, by instantiating an object of the desired algorithm as _model.
Cool, let’s now explore Fit() method:
This method is the blueprint for the training of the algorithms. As an input parameter, it receives the path to the .csv file. After we confirm that the file exists we use the private method LoadAndPrepareData. This method loads data into memory and splits it into two datasets, train and test dataset. We store the returning value into _dataSplit because we need a test dataset for the evaluation phase. Then we call BuildDataProcessingPipeline().
This is the method that performs data pre-processing and feature engineering. For this data, there is no need for some heavy work, we just create features from textual columns and do the normalization because continuous data is on a different scale. Here is the method:
Next is the Evaluate() method:
It is a pretty simple method that creates a Transformer object by using _trainedModel and test Dataset. Then we utilize MlContext to retrieve regression metrics. Finally, let’s check out Save() method:
This is another simple method that just uses MLContext to save the model into the defined path.
Thanks to all the heavy lifting that we have done in the TrainerBase class, the other Trainer classes should be simple and focused only on instantiating the ML.NET algorithm. In this particular case, we only have one class KMeansTrainer. Here it is:
Notice that this algorithm has a hyperparameter numberOfClusters. We use this number to define how much clusters we expect in our dataset.
The Predictor class is here to load the saved model and run some predictions. Usually, this class is not a part of the same microservice as trainers. We usually have one microservice that is performing the training of the model. This model is saved into file, from which the other model loads it and run predictions based on the user input. Here is how this class looks like:
In a nutshell, the model is loaded from a defined file, and predictions are made on the new sample. Note that we need to create PredictionEngine to do so.
4.6 Usage and Results
Ok, let’s put all of this together. Let’s pretend that we don’t know that we have 3 clusters in our dataset. That is why we run K-Means for a different number of clusters.
Note the TrainEvaluatePredict() method. This method does the heavy lifting here. In this method, we can inject an instance of the class that inherits TrainerBase and a new sample that we want to be predicted. Then we call Fit() method to train the algorithm. Then we call Evaluate() method and print out the metrics. Finally, we save the model. Once that is done, we create an instance of Predictor, call Predict() method with a new sample and print out the predictions. In the Main, we create a list of trainer objects, and then we call TrainEvaluatePredict on these objects. Here are the results:
Ok, we got some interesting results. In all cases, our solution assigned the value of cluster 1 to the new sample. That corresponds with the Adelie class, which is good. However, what does this data say to us? If you remember, we pretended we don’t know how many clusters we have in our dataset. So how do we figure that out from these results? Here we use the Elbow Method.
5. The Elbow Method
We know that we have tree classes in our dataset. However, let’s forget about all that for a second and let’s plot out data using the same color for all classes:
How many clusters do you see? We could say 3-ish, but we can not be sure. Plus there is that data in the middle which is super sketchy. As we said one of the most popular methods for determining the number of clusters in the dataset is called the Elbow Method. It can be built on top of two metrics: distortion and inertia. Distortion is calculated as the average of the squared distances (let’s say Euclidean distance) from the cluster centers of the respective clusters. Inertia represents the sum of squared distances of samples to their closest cluster center.
What we can do is run our clustering algorithm with a variable number of clusters and calculate distortion and inertia. Then we can plot the results. There we can look for the “elbow” point. This is the point after which the distortion/inertia starts decreasing in a linear fashion as the number of clusters grows. This point tells us the optimal number of clusters.
That is exactly what we did with our solution, so when we calculate the distortion and inertia for the results above and plot the distortion value along with the number of clusters:
From this image we can conclude that after 3 clusters, distortion decreases in a linear fashion, ie. 3 is the optimal number of clusters. How about inertia:
We can conclude the same thing from this image as well.
In this article, we had a chance to explore how we can utilize unsupervised learning for clustering problems. We observed the K-Means Clustering algorithm and implemented it with ML.NET. We also briefly explored algorithms like Hierarchical Clustering and DBSCAN. We applied K-Means Clustering to the PalmerPenguins dataset and saw some really interesting results. Also, we had a chance to see how powerful unsupervised learning can be.
Thank you for reading!
Nikola M. Zivkovic
CAIO at Rubik's Code
Nikola M. Zivkovic a CAIO at Rubik’s Code and the author of book “Deep Learning for Programmers“. He is loves knowledge sharing, and he is experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.
Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.