In the previous couple of articles, we explored some basic machine learning algorithms and how they fit into the .NET world. Thus far we have covered some simple regression and classification algorithms. Apart from that, we learned a bit about unsupervised learning, more specifically clustering. We used ML.NET to implement and apply these algorithms. In this article, we explore one of the most popular machine learning algorithms: Support Vector Machine, or SVM for short.


1. Dataset and Prerequisites

The data that we use in this article is from the PalmerPenguins dataset. This dataset was recently introduced as an alternative to the famous Iris dataset. It was created by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. You can obtain this dataset here, or via Kaggle. This dataset is essentially composed of two datasets, each containing data for 344 penguins. Just like in the Iris dataset, there are 3 different species of penguins, coming from 3 islands in the Palmer Archipelago. Also, these datasets contain culmen dimensions for each species. The culmen is the upper ridge of a bird’s bill. In the simplified penguin data, culmen length and depth are renamed as variables culmen_length_mm and culmen_depth_mm.

Data itself is not too complicated. In essence, it is just tabular data:

Note that in this tutorial we keep the species feature. SVM is a supervised learning algorithm, so it needs the expected output value for each sample during training. Here is what the data looks like when we plot it:

The implementations provided here are done in C#, and we use the latest .NET 5. So make sure that you have installed this SDK. If you are using Visual Studio this comes with version 16.8.3. Also, make sure that you have installed the following package:

You can do a similar thing using Visual Studio’s Manage NuGet Packages option:

If you need to catch up with the basics of machine learning with ML.NET, check out this article.

2. SVM Intuition

Before we implement SVM with ML.NET, we need to learn a bit about this algorithm. SVM is one of the most popular machine learning algorithms, and for a good reason. It has proved over and over again to be really good for both classification and regression, and every machine learning engineer should have it in their toolbox. It is also applicable to both linear and non-linear data.

2.1 SVM for Classification

Let’s first explore how this algorithm works for simple binary classification. This means that we will consider only two classes from the PalmerPenguins dataset. Like other machine learning algorithms, SVM observes every feature vector as a point in a high-dimensional space. At its core, SVM puts all feature vectors on an imaginary n-dimensional plot and draws an imaginary hyperplane (the n-dimensional analogue of a line) that separates examples with positive labels from examples with negative labels in the case of classification, or collects as many samples as possible in the case of regression. The hyperplane is defined by the function:
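Assuming the common convention in which the bias enters with a plus sign (some texts write a minus instead), this function can be written as:

```latex
\mathbf{w} \cdot \mathbf{x} + b = 0
```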

where x is the feature vector, w is the feature weights vector with the same size as x, and b is the bias term. This formula should be familiar from our journey through Linear Regression and Logistic Regression. In the case of binary classification, which we consider at the moment, SVM requires that the positive label has a numeric value of 1 and the negative label has a value of -1. This means that the predicted label for some feature vector x can be calculated using the formula:
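Assuming the bias enters with a plus sign, the prediction rule can be written as:

```latex
y = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)
```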

The function sign returns value 1 if the input is a positive number and -1 if the input is a negative value. So, SVM model, which during the training process should optimize w and b, can be described with the formula:

We can break this down and write it as:

This all looks very similar to the Logistic Regression approach. To better understand why we need this algorithm and how it differs from Logistic Regression, let’s consider the constraints under which the algorithm operates:
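For linearly separable data with labels y_i in {-1, +1}, and assuming the bias enters with a plus sign, these constraints are usually written in the following textbook form:

```latex
\mathbf{w} \cdot \mathbf{x}_i + b \ge +1 \quad \text{if } y_i = +1
\qquad
\mathbf{w} \cdot \mathbf{x}_i + b \le -1 \quad \text{if } y_i = -1
```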

On a graph that looks like this:

This means that SVM doesn’t only create a hyperplane; it also constructs additional vectors that define the margin. These vectors are called support vectors, and the distance between the closest examples of the two classes is called the margin. The hyperplane together with its support vectors is often referred to as the street. This means that SVM tries to fit the best street between the samples of different classes. Unlike Logistic Regression, which tries to fit the hyperplane as close as possible to these points, SVM tries to fit the hyperplane that is as far as possible from the samples but still separates the classes successfully.

Note that a large margin leads to better generalization, meaning the model will classify new samples better. However, notice that the margin is determined by the Euclidean norm of w (denoted by ||w|| in the image). This effectively means that the smaller the norm of the weight vector w, the larger the margin, and thus the better the generalization. To sum it up, training the SVM algorithm for classification means finding the values of w and b that make the margin as wide as possible while avoiding misclassification. In other words, the objective is defined as a constrained optimization problem:
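In the standard hard-margin form (a textbook formulation, assumed here with the bias entering with a plus sign and m training samples), the problem reads:

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2
\quad \text{subject to} \quad
t^{(i)} \left( \mathbf{w} \cdot \mathbf{x}^{(i)} + b \right) \ge 1
\quad \text{for } i = 1, \dots, m
```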

where t(i) = –1 for negative samples and t(i) = 1 for positive samples. To be more precise, this optimization problem is a convex quadratic optimization problem with linear constraints. Such problems are known as Quadratic Programming (QP) problems. The solution for this type of problem is outside the scope of this article. You can find more information on Convex Optimization here, and more information on constrained optimization in general here.

There are many methods to find the optimal w and b for the SVM. One of the most popular ones is Sequential Minimal Optimization (SMO), which is used by scikit-learn as well. At its core, the SMO algorithm splits the quadratic programming optimization problem into smaller ones. However, this algorithm is rather complicated, so in this article we rely on a simpler one: the Pegasos algorithm. This algorithm uses stochastic gradient descent and it is defined like this:
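To make the idea more concrete, here is a minimal sketch of Pegasos in plain C#. It is an illustration only, not the ML.NET implementation: the class and method names are hypothetical, the model has no bias term, and labels are assumed to be -1 or +1.

```csharp
using System;

// Hypothetical helper for illustration; not part of ML.NET.
public static class Pegasos
{
    // xs: feature vectors, ys: labels in {-1, +1},
    // lambda: regularization strength, iterations: number of stochastic steps T.
    public static double[] Train(double[][] xs, int[] ys, double lambda, int iterations, int seed = 0)
    {
        var random = new Random(seed);
        var w = new double[xs[0].Length];   // w starts at the zero vector

        for (int t = 1; t <= iterations; t++)
        {
            int i = random.Next(xs.Length);          // pick one sample uniformly at random
            double eta = 1.0 / (lambda * t);         // decaying step size
            double margin = ys[i] * Dot(w, xs[i]);   // computed with the current weights

            // Regularization part of the step: shrink the weights by (1 - eta * lambda).
            for (int j = 0; j < w.Length; j++)
                w[j] *= 1.0 - eta * lambda;

            // If the sample violates the margin, also move w toward y * x.
            if (margin < 1.0)
                for (int j = 0; j < w.Length; j++)
                    w[j] += eta * ys[i] * xs[i][j];
        }

        return w;
    }

    // Predicts +1 or -1 from the sign of the dot product (no bias term in this sketch).
    public static int Predict(double[] w, double[] x) => Dot(w, x) >= 0 ? 1 : -1;

    private static double Dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int j = 0; j < a.Length; j++) sum += a[j] * b[j];
        return sum;
    }
}
```

Calling Pegasos.Train with a handful of samples and a small lambda returns a weight vector that can be passed to Pegasos.Predict.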

2.2 Non-Linear Data

Thus far we observed a pretty nice example, data that is linearly separable. In reality, this is almost never the case. So, let’s consider other classes from the PalmerPenguins dataset and load Adelie and Chinstrap classes.

This time it is not so easy to separate the classes with just a straight line. The data is a bit scrambled, so what should we do in situations when the data is not linearly separable? Here we can apply what is probably the greatest SVM advantage: the kernel trick. This technique gives you the possibility to get the same result as if you were using polynomial features, without actually having to add them. Kernels are just functions that map low-dimensional data that is not linearly separable into high-dimensional data that is. In our case, we map our 2D data, which is not linearly separable, into 3D data that is.

However, we don’t know which mapping works best for our data, so mapping all the vectors into a higher dimension and then applying SVM to them would be very inefficient. That is where the kernel trick comes into play. In essence, it uses kernels to work in higher-dimensional spaces without doing this transformation explicitly. This means that we can use different kernels on our dataset. For example, we could use the polynomial kernel. However, one of the most popular kernel functions is the Gaussian RBF kernel, defined by the formula:
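In its usual form, with x denoting a sample and ℓ a landmark (the symbol names are assumptions), it is written as:

```latex
\phi_{\gamma}(\mathbf{x}, \boldsymbol{\ell}) = \exp\left( -\gamma \, \lVert \mathbf{x} - \boldsymbol{\ell} \rVert^2 \right)
```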

It is a bell-shaped function varying from 0 to 1, and it is often used for adding features using the similarity features method. Notice the parameter gamma. It defines how wide the bell curve is, and it is essentially a hyperparameter of the SVM. In essence, when we use this kernel we create a Gaussian bell curve around the chosen landmark. In this example the curve lives in 3-dimensional space, since our data is in 2-dimensional space.
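As a rough illustration of this similarity, here is a small sketch in plain C#. The class and method names are hypothetical, and it assumes the usual Gaussian RBF form exp(-gamma * squared distance); it is not taken from ML.NET.

```csharp
using System;

// Hypothetical helper for illustration; not part of ML.NET.
public static class RbfKernel
{
    // Gaussian RBF similarity between a sample and a landmark:
    // 1 when they coincide, approaching 0 as they move apart.
    // A larger gamma gives a narrower bell curve.
    public static double Similarity(double[] x, double[] landmark, double gamma)
    {
        double squaredDistance = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double diff = x[i] - landmark[i];
            squaredDistance += diff * diff;
        }

        return Math.Exp(-gamma * squaredDistance);
    }
}
```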

Then all points are mapped from 2-dimensional space to 3-dimensional space, and every point is mapped onto this curve. That is how we ensure that the data is linearly separable in 3-dimensional space. Then SVM is applied. Of course, the kernel trick is applied so we don’t have to do all of these calculations explicitly, which makes the algorithm pretty efficient. Here is how it would look visually if we applied different kernels to our dataset:

2.3 SVM for regression

SVM for regression is not so different from SVM for classification. In essence, all the practices that we learned for classification stand for regression as well, with one major difference. For classification, SVM tries to fit the widest street between the samples of different classes without violating the margins. In regression, SVM tries to fit as many samples as possible on the street, while minimizing the number of samples off the street. The width of the street is controlled by a hyperparameter, epsilon.
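One common way to formalize "on the street" versus "off the street" is the epsilon-insensitive loss. The article does not spell out the loss it has in mind, so treat the following plain C# sketch (with hypothetical names) as an assumption rather than a description of any particular library.

```csharp
using System;

// Hypothetical helper for illustration; not part of ML.NET.
public static class EpsilonInsensitive
{
    // Samples whose prediction error is within epsilon lie on the street and cost nothing;
    // samples farther away are penalized by how far they fall outside it.
    public static double Loss(double actual, double predicted, double epsilon)
        => Math.Max(0.0, Math.Abs(actual - predicted) - epsilon);
}
```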

3. ML.NET supported SVM Algorithms

Unfortunately, ML.NET support for SVM variations is not too big. Additionally, it is limited only to binary classification. This is quite disappointing and we hope that in the future there will be more support for SVM algorithms. It boils down to two SVM variations, both used only for binary classification:

  • Linear SVM – This is the ML.NET implementation of the previously mentioned Pegasos algorithm. Here is how that algorithm works:
    1. Input: S, λ, T
    2. Initialize: Set w_1 = 0
    3. For t = 1..T
      • Choose i_t ∈ {1, …, |S|} uniformly at random
      • Set η_t = 1/(λt)
      • If y_it ⟨w_t, x_it⟩ < 1 then:
        • w_t+1 ← (1 − η_t λ) w_t + η_t y_it x_it
      • Else
        • w_t+1 ← (1 − η_t λ) w_t
    4. Output: w_T+1
  • Local Deep SVM – This implementation is a generalization of Localized Multiple Kernel Learning for non-linear SVM. In essence, non-linear SVMs are great but can be slow, and LD-SVM speeds up this process. As we learned in the previous chapter, non-linear SVM uses kernels, i.e. kernel learning. The objective in kernel learning is to jointly learn both kernel and SVM parameters. In particular, Localized Multiple Kernel Learning, which this implementation generalizes, aims to learn a different kernel, and hence classifier, for each point in feature space. This can be time-consuming, especially for large datasets. LD-SVM reduces the prediction cost by learning a tree-based local feature embedding that is high dimensional and sparse, efficiently encoding non-linearities.

4. Implementation with ML.NET

As we have learned, ML.NET currently supports only binary classification with SVM. Binary classification means performing simple classification on two classes. In essence, it is used for detecting whether some sample represents some event or not, so it yields simple true-false predictions. That is why we had to modify and pre-process data from the PalmerPenguins dataset. We kept two features, culmen depth and culmen length. The other features are removed. We also modified the species feature, which now indicates whether the sample belongs to the Adelie species or not (1 if the sample represents Adelie; 0 otherwise). Here is what the data looks like now:

This is a simplified dataset, and the problem we want to solve is: does a new sample that comes into our system belong to the Adelie class or not? Here is what that means for our dataset visually:


4.1 High-Level Architecture

But first, let’s consider the high-level architecture of this implementation. In general, we want to build a solution that we can easily extend with new SVM algorithms that ML.NET will hopefully include in the future. That is why the folder structure of our solution looks like this:


The Data folder contains .csv with input data and the MachineLearning folder contains everything that is necessary for our algorithm to work. The architectural overview can be represented like this:


At the core of this solution, we have an abstract TrainerBase class. This class is in the Common folder and its main goal is to standardize the way this whole process is done. It is in this class that we process data and perform feature engineering. This class is also in charge of training the machine learning algorithm. The classes that implement this abstract class are located in the Trainers folder. Here we can find multiple classes which utilize ML.NET algorithms. These classes define which algorithm should be used. In this particular case, we have only one Predictor, located in the Predictor folder.

4.2 Data Models

In order to load data from the dataset and use it with ML.NET algorithms, we need to implement classes that are going to model this data. Two files can be found in the Data folder: PalmerPenguinBinaryData and PricePalmerPenguinBinaryPredictions. The PalmerPenguinBinaryData class models input data and it looks like this:


The PricePalmerPenguinBinaryPredictions class models output data:
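A minimal sketch of how these two classes might look is shown below. The class names come from the text, but the property names and the column order in the .csv file are assumptions made for illustration, so adjust the LoadColumn indices to match your file.

```csharp
using Microsoft.ML.Data;

// Input row. Column indices below are assumed, not taken from the original file.
public class PalmerPenguinBinaryData
{
    // True if the sample is an Adelie penguin; ML.NET binary trainers expect a boolean label.
    [LoadColumn(0)]
    [ColumnName("Label")]
    public bool Label { get; set; }

    [LoadColumn(1)]
    public float CulmenLength { get; set; }

    [LoadColumn(2)]
    public float CulmenDepth { get; set; }
}

// Output row produced by a binary classifier.
public class PricePalmerPenguinBinaryPredictions
{
    // Predicted class for the sample.
    [ColumnName("PredictedLabel")]
    public bool PredictedLabel { get; set; }

    // Raw, uncalibrated score of the model.
    public float Score { get; set; }
}
```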

4.3 TrainerBase and ITrainerBase

As we mentioned, this class is the core of this implementation. In essence, there are two parts to it. The first one is the interface that describes this class. The second is the abstract class, which implements the interface methods but still needs to be extended by concrete implementations. Here is the ITrainerBase interface:
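Based on the members discussed in this section (Name, Fit, Evaluate and Save), such an interface might look roughly like the sketch below. The exact signatures are assumptions, in particular the parameter name and the return type of Evaluate.

```csharp
using Microsoft.ML.Data;

// Sketch of the trainer contract; signatures are assumed for illustration.
public interface ITrainerBase
{
    // Human-readable name of the algorithm.
    string Name { get; }

    // Loads data from the given .csv file and trains the model.
    void Fit(string trainingFileName);

    // Evaluates the trained model on the test split.
    BinaryClassificationMetrics Evaluate();

    // Persists the trained model to disk.
    void Save();
}
```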

The TrainerBase class implements this interface. However, it is abstract since we want to inject specific algorithms:


That is one large class. It controls the whole process. Let’s split it up and see what it is all about. First, let’s observe the fields and properties of this class:

The Name property is used by the class that inherits this one to add the name of the algorithm. The ModelPath field is there to define where we will store our model once it is trained. Note that the file name has .mdl extension. Then we have our MlContext so we can use ML.NET functionalities. Don’t forget that this class is a singleton, so there will be only one in our solution. The _dataSplit field contains loaded data. Data is split into train and test datasets within this structure.

The field _model is used by the child classes. These classes define which machine learning algorithm is used in this field. The _trainedModel field is the resulting model that should be evaluated and saved. In essence, the only job of the class that inherits and implements this one is to define the algorithm that should be used, by instantiating an object of the desired algorithm as _model.

Cool, let’s now explore Fit() method:

This method is the blueprint for the training of the algorithms. As an input parameter, it receives the path to the .csv file. After we confirm that the file exists, we use the private method LoadAndPrepareData. This method loads data into memory and splits it into two datasets, a train dataset and a test dataset. We store the return value in _dataSplit because we need the test dataset for the evaluation phase. Then we call BuildDataProcessingPipeline().


This is the method that performs data pre-processing and feature engineering. For this data, there is no need for some heavy work, we just do the normalization. Here is the method:
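A minimal sketch of such a pipeline, assuming the input columns are named CulmenLength and CulmenDepth and that min-max normalization is the one used (the text only says "normalization"), might look like this. The class and method names are hypothetical.

```csharp
using Microsoft.ML;

// Hypothetical helper for illustration; column names are assumed.
public static class PenguinPipeline
{
    public static IEstimator<ITransformer> Build(MLContext mlContext)
    {
        return mlContext.Transforms
            // Combine the two numeric columns into the single "Features" vector trainers expect.
            .Concatenate("Features", "CulmenLength", "CulmenDepth")
            // Rescale the features in place using min-max normalization.
            .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"));
    }
}
```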

Next is the Evaluate() method:
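As a hedged sketch, evaluation of an SVM model in ML.NET could be written as a standalone helper like the one below. The class name is hypothetical. It uses EvaluateNonCalibrated on the assumption that the SVM trainers output raw scores rather than calibrated probabilities.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical helper for illustration.
public static class ModelEvaluation
{
    public static BinaryClassificationMetrics Evaluate(
        MLContext mlContext, ITransformer trainedModel, IDataView testSet)
    {
        // Run the trained model over the held-out test data.
        IDataView predictions = trainedModel.Transform(testSet);

        // Compute accuracy, F1 score and related metrics from the default
        // "Label", "Score" and "PredictedLabel" columns.
        return mlContext.BinaryClassification.EvaluateNonCalibrated(predictions);
    }
}
```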

It is a pretty simple method that creates a Transformer object by using _trainedModel and the test dataset. Then we utilize MlContext to retrieve binary classification metrics. Finally, let’s check out the Save() method:

This is another simple method that just uses MLContext to save the model into the defined path.

4.4 Trainers

Thanks to all the heavy lifting that we have done in the TrainerBase class, the other Trainer classes are pretty simple and focused only on instantiating the ML.NET algorithm. We have two classes that utilize ML.NET’s binary SVM classifiers. Let’s take a look at the LinearSVMTrainer class:

As you can see, we only replace Name and create an instance of ML.NET’s LinearSvm class. The class that implements LD-SVM is a bit different:

Apart from giving the specific name and creating an object of the LdSvm class, in this class we utilize the treeDepth hyperparameter. It is injected through the constructor and used to create an LdSvm instance.
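Stripped of the TrainerBase plumbing, creating the two trainers boils down to something like the following sketch. The helper class is hypothetical, and the column names "Label" and "Features" are the ML.NET defaults, assumed here.

```csharp
using Microsoft.ML;

// Hypothetical factory for illustration; the article wraps these calls in trainer classes.
public static class SvmTrainers
{
    // Linear SVM with default hyperparameters.
    public static IEstimator<ITransformer> CreateLinearSvm(MLContext mlContext) =>
        mlContext.BinaryClassification.Trainers.LinearSvm(
            labelColumnName: "Label",
            featureColumnName: "Features");

    // Local Deep SVM with a configurable tree depth.
    public static IEstimator<ITransformer> CreateLdSvm(MLContext mlContext, int treeDepth) =>
        mlContext.BinaryClassification.Trainers.LdSvm(
            labelColumnName: "Label",
            featureColumnName: "Features",
            treeDepth: treeDepth);
}
```

Either estimator can then be appended to the data processing pipeline before calling Fit.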

4.5 Predictor

The Predictor class is here to load the saved model and run some predictions. Usually, this class is not a part of the same microservice as the trainers. We usually have one microservice that performs the training of the model. This model is saved into a file, from which the other microservice loads it and runs predictions based on the user input. Here is what this class looks like:
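A minimal sketch of such a class is shown below. It is written generically so that it does not depend on the data classes, and the constructor taking a model path is an assumption made for illustration.

```csharp
using System.IO;
using Microsoft.ML;

// Hypothetical generic predictor for illustration.
public class Predictor<TInput, TOutput>
    where TInput : class
    where TOutput : class, new()
{
    private readonly MLContext _mlContext = new MLContext(seed: 0);
    private readonly string _modelPath;

    public Predictor(string modelPath)
    {
        _modelPath = modelPath;
    }

    public TOutput Predict(TInput sample)
    {
        if (!File.Exists(_modelPath))
            throw new FileNotFoundException($"Model file not found: {_modelPath}");

        // Load the trained model; the input schema is not needed here, so it is discarded.
        ITransformer model = _mlContext.Model.Load(_modelPath, out DataViewSchema _);

        // A PredictionEngine wraps the model for single-sample predictions.
        using var engine = _mlContext.Model.CreatePredictionEngine<TInput, TOutput>(model);
        return engine.Predict(sample);
    }
}
```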

In a nutshell, the model is loaded from a defined file, and predictions are made on the new sample. Note that we need to create a PredictionEngine to do so.


4.6 Usage and Results

Ok, let’s put all of this together.
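As a condensed stand-in for the full solution, the sketch below trains, evaluates and predicts in a single method. It deliberately skips the TrainerBase and Predictor classes and the save step, and all type and member names in it (PenguinSample, PenguinPrediction, SvmDemo) are hypothetical. The 30% test fraction is also an assumption.

```csharp
using System;
using Microsoft.ML;

// Hypothetical data classes for this sketch only.
public class PenguinSample
{
    public bool Label { get; set; }
    public float CulmenLength { get; set; }
    public float CulmenDepth { get; set; }
}

public class PenguinPrediction
{
    public bool PredictedLabel { get; set; }
    public float Score { get; set; }
}

public static class SvmDemo
{
    // data: rows shaped like PenguinSample; trainer: e.g. a LinearSvm or LdSvm estimator.
    public static PenguinPrediction TrainEvaluatePredict(
        MLContext mlContext, IEstimator<ITransformer> trainer,
        IDataView data, PenguinSample newSample)
    {
        // Hold out part of the data for evaluation.
        var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.3);

        // Feature engineering followed by the chosen SVM trainer.
        var pipeline = mlContext.Transforms
            .Concatenate("Features", nameof(PenguinSample.CulmenLength), nameof(PenguinSample.CulmenDepth))
            .Append(mlContext.Transforms.NormalizeMinMax("Features"))
            .Append(trainer);

        ITransformer model = pipeline.Fit(split.TrainSet);

        // SVM scores are not calibrated probabilities, hence EvaluateNonCalibrated.
        var metrics = mlContext.BinaryClassification
            .EvaluateNonCalibrated(model.Transform(split.TestSet));
        Console.WriteLine($"Accuracy: {metrics.Accuracy:0.##}, F1 Score: {metrics.F1Score:0.##}");

        // Predict the class of the new sample.
        using var engine = mlContext.Model
            .CreatePredictionEngine<PenguinSample, PenguinPrediction>(model);
        return engine.Predict(newSample);
    }
}
```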

Note the TrainEvaluatePredict() method. This method does the heavy lifting here. Into this method, we can inject an instance of a class that inherits TrainerBase and a new sample that we want to be predicted. Then we call the Fit() method to train the algorithm. Then we call the Evaluate() method and print out the metrics. Finally, we save the model. Once that is done, we create an instance of Predictor, call the Predict() method with the new sample and print out the predictions. In Main, we create a list of trainer objects, and then we call TrainEvaluatePredict on these objects. Here are the results:

Awesome, so we got different predictions from different algorithms, along with different metrics. All versions gave the correct answer, since the sample we provided was one of the Adelie instances. The metrics suggest that the LD-SVM algorithm with a tree depth of 5 performed the best. Of course, this should be taken with a grain of salt, and the model should be tested further.

Conclusion

In this article, we covered the large topic of SVM. We had a chance to see how it can handle both classification and regression. We also saw how we can handle linear data and how we can handle non-linear data using the kernel trick. We used ML.NET to implement it for Palmer Penguin classification. SVM is an amazing algorithm that is an important part of your Machine Learning toolbox.

Thank you for reading!

Nikola M. Zivkovic


CAIO at Rubik's Code

Nikola M. Zivkovic is the CAIO at Rubik’s Code and the author of the book “Deep Learning for Programmers”. He loves knowledge sharing, and he is an experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.

Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.
