In the previous couple of articles, we explored some basic machine learning algorithms and how they fit into the .NET world. Thus far we have covered simple regression and classification algorithms. Apart from that, we learned a bit about unsupervised learning, more specifically clustering. We used ML.NET to implement and apply these algorithms. Then we learned about SVM, an algorithm that can be used for both regression and classification. In the previous article, we explored another such algorithm, Decision Trees. In this one, we go even further and learn about Ensemble Learning and Random Forest.
1. Dataset and Prerequisites
The data that we use in this article comes from the PalmerPenguins dataset. This dataset was recently introduced as an alternative to the famous Iris dataset. It was created by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. You can obtain this dataset here, or via Kaggle. It is essentially composed of two datasets, each containing data for 344 penguins. Just like in the Iris dataset, there are 3 different species, in this case penguins coming from 3 islands in the Palmer Archipelago. These datasets also contain culmen dimensions for each species. The culmen is the upper ridge of a bird’s bill. In the simplified penguins data, culmen length and depth are renamed to the variables culmen_length_mm and culmen_depth_mm.
Data itself is not too complicated. In essence, it is just tabular data:
Note that the species feature is the one we want to predict. Because Random Forest is a supervised learning algorithm, it needs the expected output value of each sample during training. Here is how the data looks when we plot it:
For the regression examples in this article, we use the famous Boston Housing dataset. This dataset is composed of 12 features and contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It is a small dataset with only 506 samples.
The complete dataset looks somewhat like this:
In fact, most of the features in this dataset have almost linear dependency:
The implementations provided here are done in C#, and we use the latest .NET 5, so make sure that you have installed this SDK. If you are using Visual Studio, it comes with version 16.8.3. Also, make sure that you have installed the following package:
Note that this will install the default Microsoft.ML package as well. You can do a similar thing using Visual Studio’s Manage NuGet Packages option:
If you need to catch up with the basics of machine learning with ML.NET check out this article.
2. Ensemble Learning and Random Forest Intuition
An interesting occurrence in machine learning is that we sometimes get better results by using multiple predictors and aggregating their results than by using one specialized algorithm. This technique, in which we use multiple algorithms instead of one, is called Ensemble Learning. Ensemble Learning is based on the law of large numbers, which means that even if the algorithms composing the ensemble are weak learners, the ensemble can be a strong learner.
There are several ways these ensemble learners function. For example, in the technique called hard voting, several classifiers vote for a class, and the class that gets the majority of the votes is the output. This is a bit unintuitive, but if you build an ensemble containing 1,000 classifiers and each of them has an accuracy of 51% on its own, an ensemble based on hard voting can have an accuracy of up to 75%, provided that the classifiers make independent errors, which rarely holds fully in practice. There is also a soft voting technique. In this case, each algorithm outputs class probabilities, and the ensemble predicts the class with the highest probability averaged over all the individual classifiers.
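The hard-voting figure can be checked with a little arithmetic. The following is a minimal sketch, assuming the voters are fully independent and each is correct with the same probability p (with 0 < p < 1); the class and method names are made up for illustration. It sums the binomial probabilities of every outcome in which a strict majority of the voters is correct.

```csharp
using System;

public static class MajorityVote
{
    // Probability that strictly more than half of n independent voters,
    // each correct with probability p (0 < p < 1), give the correct answer.
    public static double MajorityCorrectProbability(int n, double p)
    {
        double logP = Math.Log(p);
        double logQ = Math.Log(1.0 - p);
        double logChoose = 0.0; // log of C(n, 0)
        double total = 0.0;

        for (int k = 0; k <= n; k++)
        {
            // Update log C(n, k) incrementally to avoid overflowing factorials.
            if (k > 0)
            {
                logChoose += Math.Log(n - k + 1) - Math.Log(k);
            }

            // Only outcomes where correct votes form a strict majority count.
            if (2 * k > n)
            {
                total += Math.Exp(logChoose + k * logP + (n - k) * logQ);
            }
        }

        return total;
    }

    public static void Main()
    {
        // 1,000 independent voters, each only slightly better than a coin flip.
        Console.WriteLine(MajorityCorrectProbability(1000, 0.51));
    }
}
```

Under these assumptions the result for 1,000 voters at 51% lands a little above 0.7, far higher than any single voter. Counting a 500-500 tie as a win would push the number slightly higher. Real classifiers trained on the same data are correlated, so the gain in practice is smaller.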
One of the most popular ways to build ensembles is to use the same algorithm multiple times, each time on a different subset of the training dataset. The techniques used for this are called bagging and pasting. The only difference between them is that, while building the subsets, bagging allows training instances to be sampled several times for the same predictor, while pasting does not. Once all algorithms are trained, the ensemble makes a prediction by aggregating the predictions of all algorithms. For classification that is usually the hard-voting process, while for regression the average result is taken.
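The difference between the two sampling schemes is easy to see in code. This is a minimal sketch that only draws row indices, assuming the training data can be addressed by index; the class and method names are illustrative and not part of ML.NET.

```csharp
using System;
using System.Linq;

public static class SubsetSampling
{
    // Bagging: sample indices WITH replacement, so one index may repeat.
    public static int[] Bagging(int datasetSize, int sampleSize, Random rng)
    {
        return Enumerable.Range(0, sampleSize)
            .Select(_ => rng.Next(datasetSize))
            .ToArray();
    }

    // Pasting: sample indices WITHOUT replacement (partial Fisher-Yates shuffle).
    public static int[] Pasting(int datasetSize, int sampleSize, Random rng)
    {
        if (sampleSize > datasetSize)
        {
            throw new ArgumentException("Pasting cannot draw more samples than the dataset holds.");
        }

        int[] indices = Enumerable.Range(0, datasetSize).ToArray();
        for (int i = 0; i < sampleSize; i++)
        {
            // Swap position i with a random position from the not-yet-fixed tail.
            int j = rng.Next(i, datasetSize);
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }

        return indices.Take(sampleSize).ToArray();
    }

    public static void Main()
    {
        var rng = new Random(42);
        Console.WriteLine("Bagging: " + string.Join(",", Bagging(10, 10, rng)));
        Console.WriteLine("Pasting: " + string.Join(",", Pasting(10, 5, rng)));
    }
}
```

Each predictor in the ensemble would be trained on the rows selected by one such draw. With bagging, some rows appear more than once and others not at all, which is what makes the individual predictors differ from one another.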
Random Forest is one of the most powerful algorithms in machine learning. It is an ensemble of Decision Trees. In most cases, we train Random Forest with bagging to get the best results. It also introduces additional randomness when building the trees, which leads to greater tree diversity. This is done by a procedure called feature bagging, which means that each tree is trained on a different subset of features. In turn, this leads to a lower variance of the complete model.
3. ML.NET Supported Random Forest Algorithms
ML.NET supports Random Forest for both classification and regression. At the moment, Random Forest classification is limited to binary classification. We hope that in the future we will get an option to perform multiclass classification as well. The Random Forest algorithm in ML.NET is called Fast Forest, and it is built as an ensemble of Fast Tree models. As a reminder, Fast Tree is an implementation of the so-called MART algorithm, which is known to deliver high prediction accuracy for diverse tasks and is widely used in practice.
Fast Forest is a random forest implementation that consists of an ensemble of such decision trees. Every decision tree in the ensemble outputs a Gaussian distribution as its prediction. The forest then aggregates those outputs and finds the Gaussian distribution that is closest to the combined distribution of all the trees.
4. Classification Implementation with ML.NET
ML.NET currently supports only binary classification with Random Forest. As you are probably aware, binary classification means classifying samples into two classes. In essence, it is used for detecting whether a sample represents some event or not. These are simple true-false predictions, which can be quite useful. That is why we need to modify and pre-process the data from the PalmerPenguins dataset. We keep two features, culmen depth and culmen length, and remove the others. We also modify the species feature, which now indicates whether the sample belongs to the Adelie species or not (1 if the sample represents Adelie; 0 otherwise). Here is how the data looks now:
This is a simplified dataset, and the problem we want to learn is this: does a new sample that comes into our system belong to the Adelie class or not? Here is what that means for our dataset visually:
4.1 High-Level Architecture
Before we dive deeper, let’s consider the high-level architecture of this implementation. In general, we want a solution that is easy to extend with new Random Forest algorithms that ML.NET could include in the future. We certainly hope that multiclass options will become available. Also, other variations of the Decision Tree algorithm could be implemented, and with them new variations of Random Forest. That is why the folder structure of our solution looks like this:
The Data folder contains .csv with input data and the MachineLearning folder contains everything that is necessary for our algorithm to work. The architectural overview can be represented like this:
At the core of this solution, we have an abstract TrainerBase class. This class is in the Common folder and its main goal is to standardize the way the whole process is done. It is in this class that we process data and perform feature engineering. This class is also in charge of training the machine learning algorithm. The classes that implement this abstract class are located in the Trainers folder. Here we can find classes that utilize ML.NET algorithms and define which algorithm should be used. In this particular case, we have only one Predictor, located in the Predictor folder.
4.2 Data Models
In order to load data from the dataset and use it with ML.NET algorithms, we need to implement classes that model this data. Two files can be found in the Data folder: PalmerPenguinBinaryData and PricePalmerPenguinBinaryPredictions. The PalmerPenguinBinaryData class models input data and it looks like this:
The PricePalmerPenguinBinaryPredictions class models output data:
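As an illustration of what such data classes can look like, here is a minimal sketch. The class names come from the text above, but the property names and the column order in the .csv file are assumptions, so adjust the LoadColumn indices to the actual file. It relies on the Microsoft.ML NuGet package.

```csharp
using Microsoft.ML.Data;

// Input sample. The column indices are an assumption about the pre-processed
// .csv layout: label first, then the two culmen measurements.
public class PalmerPenguinBinaryData
{
    [LoadColumn(0)]
    public bool Label { get; set; }          // true if the penguin is an Adelie

    [LoadColumn(1)]
    public float CulmenLength { get; set; }  // hypothetical property name

    [LoadColumn(2)]
    public float CulmenDepth { get; set; }   // hypothetical property name
}

// Output of the model. Binary classification trainers in ML.NET emit
// "PredictedLabel" and "Score" columns, which are mapped here.
public class PricePalmerPenguinBinaryPredictions
{
    [ColumnName("PredictedLabel")]
    public bool PredictedLabel { get; set; }

    public float Score { get; set; }
}
```

The property named Label matches the default label column name that ML.NET trainers look for, which keeps the rest of the pipeline free of extra column mapping.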
4.3 TrainerBase and ITrainerBase
As we mentioned, this class is the core of this implementation. In essence, there are two parts to it. The first is the interface that describes the class. The second is the abstract class, which implements the interface methods but leaves the choice of algorithm to the concrete implementations that inherit from it. Here is the ITrainerBase interface:
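Based on the members this article describes (Name, Fit(), Evaluate() and Save()), such an interface could be sketched as follows. The exact signatures are assumptions, in particular the parameter name and the return type of Evaluate().

```csharp
using Microsoft.ML.Data;

// A sketch of the trainer contract; signatures are assumed, not confirmed.
public interface ITrainerBase
{
    // Human-readable name of the algorithm variation.
    string Name { get; }

    // Loads the .csv file at the given path and trains the model.
    void Fit(string trainingFileName);

    // Scores the held-out test set and returns the classification metrics.
    BinaryClassificationMetrics Evaluate();

    // Persists the trained model to disk.
    void Save();
}
```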
The TrainerBase class implements this interface. However, it is abstract since we want to inject specific algorithms:
That is one large class. It controls the whole process. Let’s split it up and see what it is all about. First, let’s observe the fields and properties of this class:
The Name property is used by the class that inherits this one to add the name of the algorithm. The ModelPath field is there to define where we will store our model once it is trained. Note that the file name has .mdl extension. Then we have our MlContext so we can use ML.NET functionalities. Don’t forget that this class is a singleton, so there will be only one in our solution. The _dataSplit field contains loaded data. Data is split into train and test datasets within this structure.
The field _model is used by the child classes. These classes define which machine learning algorithm is used in this field. The _trainedModel field is the resulting model that should be evaluated and saved. In essence, the only job of the class that inherits and implements this one is to define the algorithm that should be used, by instantiating an object of the desired algorithm as _model.
Cool, let’s now explore the Fit() method:
This method is the blueprint for the training of the algorithms. As an input parameter, it receives the path to the .csv file. After we confirm that the file exists we use the private method LoadAndPrepareData. This method loads data into memory and splits it into two datasets, train and test dataset. We store the returning value into _dataSplit because we need a test dataset for the evaluation phase. Then we call BuildDataProcessingPipeline().
This is the method that performs data pre-processing and feature engineering. This data does not need any heavy work, so we just normalize it. Here is the method:
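A pipeline of this kind can be sketched in a few lines. The text does not say which normalization is applied, so this example assumes min-max normalization; the column names CulmenLength and CulmenDepth are also assumptions and must match the input data class.

```csharp
using Microsoft.ML;

public static class PenguinPipeline
{
    // Builds the pre-processing part of the pipeline (no trainer yet).
    public static IEstimator<ITransformer> BuildDataProcessingPipeline(MLContext mlContext)
    {
        return mlContext.Transforms
            // Merge the two numeric columns into a single "Features" vector.
            .Concatenate("Features", "CulmenLength", "CulmenDepth")
            // Rescale every feature based on the range seen in the training data.
            .Append(mlContext.Transforms.NormalizeMinMax("Features"));
    }
}
```

A trainer can then be appended to the returned estimator, and calling Fit() on the resulting chain with the training split produces the trained model.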
Next is the Evaluate() method:
It is a pretty simple method that uses _trainedModel to transform the test dataset. Then we utilize MlContext to retrieve binary classification metrics. Finally, let’s check out the Save() method:
This is another simple method that just uses MLContext to save the model into the defined path.
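The two steps can be sketched as standalone helpers. This is a sketch rather than the article's own listing: the class and parameter names are illustrative, and it assumes the Fast Forest binary trainer, which produces scores without calibrated probabilities, hence EvaluateNonCalibrated instead of Evaluate.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

public static class ModelEvaluation
{
    // Scores the test set with the trained model and computes metrics.
    public static BinaryClassificationMetrics Evaluate(
        MLContext mlContext, ITransformer trainedModel, IDataView testSet)
    {
        IDataView scored = trainedModel.Transform(testSet);

        // Non-calibrated evaluation: no "Probability" column is required.
        return mlContext.BinaryClassification.EvaluateNonCalibrated(scored);
    }

    // Writes the model, together with its input schema, to the given path.
    public static void Save(
        MLContext mlContext, ITransformer trainedModel, DataViewSchema inputSchema, string modelPath)
    {
        mlContext.Model.Save(trainedModel, inputSchema, modelPath);
    }
}
```

The returned metrics object exposes values such as Accuracy, AreaUnderRocCurve and F1Score. The input schema passed to Save() is typically the Schema property of the training data view.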
4.4 Trainers
Thanks to all the heavy lifting that we have done in the TrainerBase class, the only Trainer class is simple and focused only on instantiating the ML.NET algorithm. Let’s take a look at the RandomForestTrainer class:
As you can see, this class is pretty simple. We override the Name and _model. We use the FastForest class from the BinaryClassification namespace. Notice how we use some of the hyperparameters that this algorithm provides. With these, we can create more experiments. The numberOfLeaves parameter represents the maximum number of leaves in each decision tree, while numberOfTrees represents the number of trees that are going to be trained.
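For reference, creating the trainer with those two hyperparameters can be sketched like this. The factory class is illustrative, and the Label and Features column names are the ML.NET defaults, assumed to match the data pipeline. The FastForest extension method ships in the Microsoft.ML.FastTree NuGet package.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers.FastTree;

public static class RandomForestFactory
{
    // Creates a Fast Forest binary classifier with the given tree settings.
    public static FastForestBinaryTrainer Create(
        MLContext mlContext, int numberOfLeaves, int numberOfTrees)
    {
        return mlContext.BinaryClassification.Trainers.FastForest(
            labelColumnName: "Label",
            featureColumnName: "Features",
            numberOfLeaves: numberOfLeaves,   // maximum leaves per tree
            numberOfTrees: numberOfTrees);    // trees in the ensemble
    }
}
```

Calling Create(mlContext, 2, 5) would give a very small forest, while larger values trade training time for capacity.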
4.5 Predictor
The Predictor class is here to load the saved model and run some predictions. Usually, this class is not a part of the same microservice as the trainers. We usually have one microservice that performs the training of the model. The model is saved to a file, from which the other microservice loads it and runs predictions based on the user input. Here is how this class looks:
In a nutshell, the model is loaded from a defined file, and predictions are made on the new sample. Note that we need to create PredictionEngine to do so.
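The load-and-predict flow can be sketched generically, so that it works with any pair of input and output classes. The helper name and its parameters are illustrative and not taken from the article's Predictor class.

```csharp
using Microsoft.ML;

public static class ModelPrediction
{
    // Loads a saved model and runs a single prediction on one sample.
    public static TOut Predict<TIn, TOut>(MLContext mlContext, string modelPath, TIn sample)
        where TIn : class
        where TOut : class, new()
    {
        // Load returns the model and, via the out parameter, its input schema.
        ITransformer model = mlContext.Model.Load(modelPath, out DataViewSchema inputSchema);

        // PredictionEngine is a convenience wrapper for one-off predictions.
        using var engine = mlContext.Model.CreatePredictionEngine<TIn, TOut>(model);
        return engine.Predict(sample);
    }
}
```

PredictionEngine is not thread-safe, so a service that handles concurrent requests usually creates one engine per request or pools them.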
4.6 Usage and Results
Ok, let’s put all of this together.
Note the TrainEvaluatePredict() method. This method does the heavy lifting here. Into this method, we can inject an instance of a class that inherits TrainerBase and a new sample that we want to be predicted. Then we call the Fit() method to train the algorithm. After that, we call the Evaluate() method and print out the metrics. Finally, we save the model. Once that is done, we create an instance of Predictor, call the Predict() method with the new sample and print out the predictions. In Main, we create a list of trainer objects, and then we call TrainEvaluatePredict on these objects.
In the list of algorithms, we relied on the hyperparameters to create several variations of Random Forest. Here are the results:
Awesome, so we got different predictions from different algorithms, along with different metrics. The first variation, with five trees and only two leaves per tree, gave the wrong answer, since the sample we provided is one of the Adelie instances. The other two variations did a good job. The metrics give us the feeling that there is no big difference between forests with more trees and forests with fewer trees. This just shows the power of ensemble learning.
5. Regression Implementation with ML.NET
As we mentioned, for the regression example, we use the Boston Housing dataset. Here is how it looks:
Most of the features in the dataset have almost linear dependency:
At a high level, the architecture stays the same. There are, of course, some changes in each concrete implementation, but the architecture is intact.
The same goes for the project structure:
5.1 Data Models
Just like in the classification example, we need to create classes for the data. Two classes can be found in the Data folder: BostonHousingData and BostonHousingPricePredictions. The BostonHousingData class models input data and it looks like this:
The BostonHousingPricePredictions class models output data:
5.2 TrainerBase and ITrainerBase
The ITrainerBase interface is the same as in the classification example.
The TrainerBase implementation is at the center of the solution once again. It resembles the implementation we have done for the classification example. However, there are some differences and specifics, since this class is adapted for regression and for this specific data.
The most notable changes are in the BuildDataProcessingPipeline. In this function, we have done some data pre-processing and feature engineering. Namely, we used one-hot encoding on the RiverCoast feature and used log mean normalization on all features.
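These two steps can be sketched as follows. Only the RiverCoast column name comes from the text; the remaining numeric column names are passed in by the caller because they depend on the BostonHousingData class, and the RiverCoastEncoded output column is a name made up for this example. Log-mean-variance is assumed to be the normalization meant above.

```csharp
using System.Linq;
using Microsoft.ML;

public static class HousingPipeline
{
    // numericColumns: names of the plain numeric feature columns (caller-supplied).
    public static IEstimator<ITransformer> BuildDataProcessingPipeline(
        MLContext mlContext, params string[] numericColumns)
    {
        // The encoded column is concatenated together with the numeric ones.
        string[] featureColumns = numericColumns
            .Concat(new[] { "RiverCoastEncoded" })
            .ToArray();

        return mlContext.Transforms.Categorical
            // Turn the RiverCoast category into an indicator vector.
            .OneHotEncoding(outputColumnName: "RiverCoastEncoded", inputColumnName: "RiverCoast")
            // Combine all inputs into a single "Features" vector.
            .Append(mlContext.Transforms.Concatenate("Features", featureColumns))
            // Normalize using the mean and variance of the logarithm of the data.
            .Append(mlContext.Transforms.NormalizeLogMeanVariance("Features"));
    }
}
```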
5.3 Trainers
In the Trainers folder, we can find a class that is almost the same as the one for the classification example. In fact, the only difference is that we use classes from the Regression namespace.
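To illustrate how small that difference is, here is a sketch of the regression counterpart together with its evaluation call. The helper class is illustrative, and the Label and Features column names are assumed defaults. It relies on the Microsoft.ML.FastTree NuGet package.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.FastTree;

public static class RegressionForest
{
    // Same hyperparameters as before, but taken from the Regression catalog.
    public static FastForestRegressionTrainer Create(
        MLContext mlContext, int numberOfLeaves, int numberOfTrees)
    {
        return mlContext.Regression.Trainers.FastForest(
            labelColumnName: "Label",
            featureColumnName: "Features",
            numberOfLeaves: numberOfLeaves,
            numberOfTrees: numberOfTrees);
    }

    // Regression metrics replace the binary classification metrics.
    public static RegressionMetrics Evaluate(
        MLContext mlContext, ITransformer trainedModel, IDataView testSet)
    {
        return mlContext.Regression.Evaluate(trainedModel.Transform(testSet));
    }
}
```

The returned RegressionMetrics object exposes values such as RSquared, MeanAbsoluteError and RootMeanSquaredError.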
5.4 Predictor
The Predictor class is also adapted for this scenario:
5.5 Usage and Results
Let’s see how this works together:
The output looks like this:
In this article, we covered a lot of ground. We learned how Random Forest utilizes the power of ensemble learning with Decision Trees. Also, we had a chance to see how it can be used for classification and for regression. Finally, we implemented it all using ML.NET.
Thank you for reading!
Nikola M. Zivkovic
CAIO at Rubik's Code
Nikola M. Zivkovic is the CAIO at Rubik’s Code and the author of the book “Deep Learning for Programmers“. He loves knowledge sharing and is an experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.