Code that accompanies this article can be downloaded here.
Last month, at their Build event, Microsoft shared with us plans for .NET Core 3. While the accent was on the transformation of desktop applications and support for Windows Forms and WPF, ML.NET, a framework for machine learning, was introduced as well. If you take a look at the picture that has been around the web lately, we may expect this module to be an integral part of .NET Core 3. For now, ML.NET is still in its infancy, and we are able to try its first incarnation. Just a few days ago, version 0.2 of ML.NET was announced, so let’s see what this framework is all about.
ML.NET is an open-source, cross-platform framework that is available as a NuGet package. You can check the code here. It was originally developed in Microsoft Research and it is used across many Microsoft products like Windows, Bing, Azure, etc. One very cool thing about this framework is that it can be extended to add machine learning libraries like TensorFlow, Accord.NET, and CNTK. Before we dive into the details of this framework, let’s have a brief introduction to Machine Learning and the types of problems that it solves.
Machine Learning is a branch of computer science that uses statistical techniques to give computers the ability to learn how to solve certain problems without being explicitly programmed. Even though it is a big buzzword these days and the “life of the party” at every conference, the initial concepts of machine learning trace back to the 1950s. The whole idea is to develop a model which, after being trained on some set of data, is able to make correct predictions on new data.
To put it plainly, the model uses historical data to make a prediction on new data, and this whole process is called predictive modeling. Or, put mathematically, we are trying to approximate a mapping function f from input variables X to output variables y. In machine learning, we are using predictive modeling to solve two types of problems: Regression and Classification.
Regression problems require the prediction of a quantity. Our output is continuous, meaning it is a real value, such as an integer or floating point value. For example, we want to predict our salary based on the data from the past couple of months. Classification problems, on the other hand, try to divide inputs into certain categories, meaning that the output of these tasks is discrete. For example, we are trying to predict whether an e-mail is spam or not.
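To make the difference concrete, here is a minimal sketch in plain Python (not ML.NET). It assumes a made-up salary history and a made-up spam rule; the function names fit_line and classify_email are hypothetical and exist only for this illustration. The regression part returns a continuous number, while the classification part returns one of two discrete labels.

```python
# Regression: predict a continuous quantity (a salary) from past values.
# Classification: predict a discrete category (spam or not spam).
# All data and thresholds below are made up for illustration.


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def classify_email(spam_word_count, threshold=3):
    """Toy rule: the output is one of two discrete labels."""
    return "spam" if spam_word_count >= threshold else "not spam"


months = [1, 2, 3, 4]
salaries = [1000.0, 1100.0, 1200.0, 1300.0]
slope, intercept = fit_line(months, salaries)
predicted_salary = slope * 5 + intercept  # continuous output for month 5

label = classify_email(spam_word_count=5)  # discrete output
```

A real classifier would learn its decision rule from data instead of using a fixed threshold; the point here is only the shape of the output.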
There are a few approaches that one can take when teaching a model: Supervised Learning, Unsupervised Learning and Reinforcement Learning. In supervised learning, input data and expected output are provided to the model, and the model learns which output should be provided for a certain kind of input. This type of learning is the most popular one.
Unsupervised learning, unlike supervised learning, does not have output data. Our model tries to draw conclusions using just the input data. Finally, reinforcement learning uses a system of “rewards” to teach a model. A good example of this type of learning is AlphaGo, the program from Google’s DeepMind that was able to beat the world champion in the game of Go.
Iris Flower Classification Problem
Ok, now that we are up to speed with basic Machine Learning concepts, let’s see what problem we are going to solve using ML.NET. The Iris Data Set is a famous dataset in the world of pattern recognition and it is considered to be the “Hello World” example for machine learning classification problems. It was first introduced by Ronald Fisher, a British statistician and botanist, back in 1936. In his paper The use of multiple measurements in taxonomic problems, he used data collected for three different classes of Iris plant: Iris setosa, Iris virginica, and Iris versicolor.
This dataset contains 50 instances for each class. What is interesting about this whole example is that the first class is linearly separable from the other two, but the latter two are not linearly separable from each other. Each instance has five attributes:
- Sepal length in cm
- Sepal width in cm
- Petal length in cm
- Petal width in cm
- Class (Iris setosa, Iris virginica, Iris versicolor)
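As a rough illustration of how one such instance can be represented in code, here is a small plain-Python sketch. It assumes a comma-separated line with the four measurements followed by the class name; the IrisRecord type and the parse_iris_line helper are hypothetical names used only here, and the exact layout of the article's data files may differ.

```python
from dataclasses import dataclass


@dataclass
class IrisRecord:
    """One instance of the dataset (hypothetical type for illustration)."""

    sepal_length: float  # cm
    sepal_width: float  # cm
    petal_length: float  # cm
    petal_width: float  # cm
    species: str  # class label


def parse_iris_line(line):
    """Parse 'sl,sw,pl,pw,species' into an IrisRecord (assumed layout)."""
    parts = line.strip().split(",")
    if len(parts) != 5:
        raise ValueError(f"expected 5 comma-separated fields, got {len(parts)}")
    return IrisRecord(*(float(p) for p in parts[:4]), parts[4])


record = parse_iris_line("5.1,3.5,1.4,0.2,Iris-setosa")
```

The first four fields become numeric features and the last one is kept as text, which is the same split between features and label that the rest of the article relies on.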
At the moment ML.NET works with .NET Core 2.0, so make sure you have it installed on your computer. Note that it currently must run in a 64-bit process. Keep this in mind while making your console application. Like any other NuGet package, you can install it from the Package Manager Console using the command:
Install-Package Microsoft.ML -version 0.2
Another way to do it is to use the .NET Core CLI. If you are going to use this approach, make sure you have installed the .NET Core SDK. Then run this command from your console application project folder:
dotnet add package Microsoft.ML --version 0.2.0
Alternatively, you can use the Visual Studio GUI to do this as well. All you have to do is right-click on your project and choose the Manage NuGet Packages option:
After that, you need to find the Microsoft.ML package and install it.
Finally, let’s go to the fun stuff and implement a solution for the Iris classification problem. You can find the code for this implementation here. The first thing we need to do is to get the data. We can find the complete dataset, with 150 samples here.
However, the usual practice when building a model is to have one set of data for training and another set of data for testing and evaluating the accuracy of the model. Often, like in this example, we get just one set of data that we need to split into two separate datasets, and then use one for training and the other for testing. The ratio should be around 80% to 20%. That is why I’ve chosen 25 samples from the dataset for testing (about 17% of the 150) and saved the training and test samples in separate files.
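The split described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: it shuffles with a fixed seed and holds out a fixed number of rows, whereas the article does not say how its 25 samples were picked. The function name train_test_split and its parameters are hypothetical.

```python
import random


def train_test_split(rows, test_size, seed=42):
    """Shuffle a copy of rows and hold out test_size of them for testing."""
    if not 0 < test_size < len(rows):
        raise ValueError("test_size must be between 1 and len(rows) - 1")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


# Stand-in for the 150 dataset rows; real rows would be parsed CSV lines.
rows = list(range(150))
train_rows, test_rows = train_test_split(rows, test_size=25)
```

Shuffling before splitting matters for a file that is sorted by class, because otherwise the held-out rows could all come from a single species.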
Here is what one of these files looks like:
Building and Training the Model
Data from these .csv files must be transformed into some kind of objects. That is why, in our Iris folder, we have two classes: IrisFlower and IrisPredict. Information from our dataset files will end up in these objects, and then we will be able to use them for training our model and making predictions. We will see how these classes are used in a minute. For now, take a look at the way they are implemented:
Now, let’s check out the Main method of our solution. You can see that a lot of things are located in other classes, but here we can see the complete workflow. ModelBuilder is fed with training data for the training process and with test data for the evaluation process. After that, results are visualized in the console with the help of IrisCsvReader.
We can see that essentially all the fun stuff is happening in the ModelBuilder class. You can find the whole class here. In a nutshell, this class has two methods: BuildAndTrain and Evaluate. The first method, BuildAndTrain, is used for creating the model and training it. Here is what it looks like:
For model creation, we are using LearningPipeline. This class is used for defining the tasks that our model needs to do. It encapsulates the data loading, data processing/featurization, and learning algorithm. All these steps are added using the Add method of this class.
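The general idea of such a pipeline can be sketched in plain Python. This is a toy stand-in and not ML.NET's LearningPipeline: MiniPipeline, load, featurize and learn are all hypothetical names, the two data rows are samples typed in by hand, and the "learner" is a 1-nearest-neighbour rule chosen only because it is short.

```python
class MiniPipeline:
    """Toy pipeline: an ordered list of steps, each feeding the next."""

    def __init__(self):
        self._steps = []

    def add(self, step):
        self._steps.append(step)
        return self  # allow chained add(...) calls

    def run(self, data=None):
        for step in self._steps:
            data = step(data)
        return data


def load(_):
    """Data loading step: returns raw text lines (hard-coded samples)."""
    return ["5.1,3.5,1.4,0.2,Iris-setosa", "6.3,3.3,6.0,2.5,Iris-virginica"]


def featurize(lines):
    """Processing step: turn each line into (feature list, label)."""
    rows = []
    for line in lines:
        *values, label = line.split(",")
        rows.append(([float(v) for v in values], label))
    return rows


def learn(rows):
    """Learning step: 1-nearest-neighbour simply memorises the rows."""

    def predict(features):
        nearest = min(
            rows,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
        )
        return nearest[1]

    return predict


model = MiniPipeline().add(load).add(featurize).add(learn).run()
prediction = model([5.0, 3.4, 1.5, 0.2])
```

The design point is that nothing runs while the steps are being added; the work happens only when the whole chain is executed, which is when training takes place.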
The training data location is passed through the constructor of ModelBuilder, and is used in the first step, data loading. Here we are using the TextLoader class and the already implemented IrisFlower class. In this line, we are saying that once training is started, we should load data from the training data location and map it to IrisFlower objects. After that, we are adding Dictionarizer to the pipeline. To understand what this step will do, we have to take a look at our data again.
Notice how the last column in the data represents the species of the Iris flower in string format. Machine learning algorithms cannot work with strings. Also, notice that this column is mapped to the Label property of IrisFlower. Under the hood, the Dictionarizer class will encode this text as a number, so our machine learning algorithm knows how to use it. Finally, we are adding ColumnConcatenator to gather all the attributes. With this, we are done with the preparation of data for our training process.
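What these two preparation steps do can be illustrated with a short plain-Python sketch. It is not the ML.NET implementation: build_label_dictionary and concatenate_features are hypothetical helpers, the column names are assumed, and the numbering scheme (labels numbered in order of first appearance) is only one plausible way to encode text as numbers.

```python
def build_label_dictionary(labels):
    """Map each distinct text label to an integer, in order of appearance."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping


def concatenate_features(row, columns):
    """Gather the named numeric columns into a single feature vector."""
    return [row[column] for column in columns]


# Two hand-typed sample rows; column names are assumptions for this sketch.
rows = [
    {"SepalLength": 5.1, "SepalWidth": 3.5, "PetalLength": 1.4,
     "PetalWidth": 0.2, "Label": "Iris-setosa"},
    {"SepalLength": 7.0, "SepalWidth": 3.2, "PetalLength": 4.7,
     "PetalWidth": 1.4, "Label": "Iris-versicolor"},
]
feature_columns = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]

label_ids = build_label_dictionary(row["Label"] for row in rows)
features = [concatenate_features(row, feature_columns) for row in rows]
encoded_labels = [label_ids[row["Label"]] for row in rows]
```

After these steps every row is a numeric vector plus a numeric label, which is the form a learning algorithm can consume.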
The last three steps in the BuildAndTrain method are the crucial ones. We are adding the type of model that we are going to use. For this purpose, we are using Stochastic
The second method of ModelBuilder is used for model evaluation. Here is the code of that method:
It is much simpler than the BuildAndTrain method. Basically, we are loading data using the TextLoader class again, but this time we are using test data. After that, we are using the ClassificationEvaluator class and its Evaluate method on the provided model and test data. This method returns a ClassificationMetrics object that contains metric information about the model. We want to see just the accuracy, so we are returning AccuracyMacro.
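For intuition about the returned number, here is a plain-Python sketch of a macro-averaged accuracy, under the assumption that it means "compute the accuracy separately for each class, then average those values". The macro_accuracy function and the sample labels are made up for this illustration and are not part of ML.NET.

```python
from collections import defaultdict


def macro_accuracy(actual, predicted):
    """Average of per-class accuracies, so every class weighs the same."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(actual, predicted):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return sum(correct[label] / total[label] for label in total) / len(total)


# Made-up, deliberately unbalanced example: 4 setosa and 2 versicolor samples.
actual = ["setosa"] * 4 + ["versicolor"] * 2
predicted = ["setosa"] * 4 + ["versicolor", "setosa"]
score = macro_accuracy(actual, predicted)  # (4/4 + 1/2) / 2
```

On this example 5 of 6 predictions are right overall, but the macro average is lower because the one mistake costs the smaller class half of its accuracy.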
Visualisation of the Output
Just to see how our model is performing, we’ve added one class that reads from a .csv file and creates IrisFlower objects, the IrisCsvReader class. It is pretty straightforward:
This method is called in the Main method of our application. For each object that the GetIrisDataFromCsv method returns, we are calling the Predict method of the model, and then printing out the predicted value along with the actual value. This is how the output of our application looks:
The accuracy of the model is 100%. I would get worried about overfitting if we were using some other data and working on a more complicated problem. However, we were working with simple data, and this result is somewhat expected. Underneath that, you can see that our model is giving correct predictions for every sample from the test data.
Machine learning has been booming for the last couple of years. People in the field usually use Python or R. Until now, it was very hard to manipulate data and build quick solutions in C#. With ML.NET that is slowly changing. While it is still missing some features that Python and R have, this is a big step in the right direction. I am really excited to see where the ML.NET framework will land and how it will be integrated with the rest of the features of the .NET world in .NET Core 3. Machine learning is crossing the early adopter’s chasm and Microsoft is helping it.
Read more posts from the author at Rubik’s Code.