The code that accompanies this article can be found here.
In a previous couple of articles, we started exploring some of the basic machine learning algorithms. We covered some simple regression and classification algorithms. We also used other technologies like TensorFlow, Pytorch and SciKit Learn for the implementation and application of these algorithms and we used optimization techniques such as Gradient Descent. In this article, we focus on one very powerful algorithm – Support Vector Machine or SVM. SVM is one of the most popular machine learning algorithms and for a good reason. This algorithm proved over and over again to be really good for both – classification and regression and every machine learning engineer should have it in their toolbox. It is also applicable to linear and non-linear data. Before we dive into details and implementation of this algorithm let’s see datasets and libraries that we use in this article.
Are you afraid that AI might take your job? Make sure you are the one who is building it.
STAY RELEVANT IN THE RISING AI INDUSTRY! 🖖
Dataset and Prerequisites
Data that we use in this article is from PalmerPenguins Dataset. This dataset has been recently introduced as an alternative to the famous Iris dataset. It is created by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. You can obtain this dataset here, or via Kaggle. This dataset is essentially composed of two datasets, each containing data of 344 penguins. Just like in Iris dataset there are 3 different species of penguins coming from 3 islands in the Palmer Archipelago. Also, these datasets contain culmen dimensions for each species. The culmen is the upper ridge of a bird’s bill. In the simplified penguin’s data, culmen length and depth are renamed as variables culmen_length_mm and culmen_depth_mm.
Data that we use for the regression examples we use the famous Boston Housing Dataset. This dataset is composed 14 features and contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. It is a small dataset with only 506 samples.
For the purpose of this article, make sure that you have installed the following Python libraries:
- NumPy – Follow this guide if you need help with installation.
- SciKit Learn – Follow this guide if you need help with installation.
Once installed make sure that you have imported all the necessary modules that are used in this tutorial.
SVM for Classification
Let’s first explore how this algorithm works for simple binary classification in order to understand how it functions. This means that we will consider only two classes form the PalmerPenguins dataset. As other machine learning algorithms SVM observers every feature vector as a point in a high-dimensional space. In its core, SVM puts all feature vectors on an imaginary n-dimensional plot and draws an imaginary n-dimensional line (a hyperplane) that separates examples with positive labels from examples with negative labels in the case of classification, or collects as much as samples as possible in case of regression. The hyperplane is defined by the function:
where x is the feature vector, w is feature weights vector with size same as x, and b is bias term. This is formula should be familiar from our journey through the Linear Regression or Logistic Regression. In the case of binary classification, which we consider at the moment, SVM requires that the positive label has a numeric value of 1, and the negative label has a value of -1. This means that predicted label for some feature vector x can be calculated using formula:
The function sign returns value 1 if the input is a positive number and -1 if the input is a negative value. So, SVM model, which during the training process should optimize w and b, can be described with the formula:
We can break this down and write it as:
This all looks very similar to the Logistic Regression approach. To better understand why this algorithm and what is the difference from Logistic Regression, let’s consider constraints under which algorithm operates:
On a graph that looks like this:
This means that SVM doesn’t only create a hyperplane, but it also constructs additional vectors which are defining margin. These vectors are called support vectors and the distance between the closest examples of two classes is called the margin. Hyperplane with support vectors is often referred to as street. This means that SVM tries to fit the best street between the samples of different classes. Unlike the Logistic regression which tries to fit the hyperplane as close as possible to these points, SVM tries to fit the hyperplane that is as far as possible form the samples but still separates classes successfully.
Note that a large margin leads to a better generalization, meaning the model will better classify new samples. However, notice that the margin is decided by the Euclidean norm of w (denoted by ||w|| in the image). This effectively means the smaller the weight vector w, the larger the margin and thus better generalization. To sum it up, training the SVM algorithm for classification means finding the value of w and b that makes margin as wide as possible while avoiding misclassification. Meaning, objective is defined as a constrained optimization problem:
where t(i) is –1 for negative samples and t(i) = 1 for positive samples. To be more precise, this optimization problem is a convex quadratic optimization problem with linear constraints. Such problems are known as Quadratic Programming (QP) problems. The solution for this type of problem is outside of the scope of this article. More information in Convex Optimization you can find here and more information for constraint optimization, in general, can be found here.
As we mentioned for the classification examples we use PalmerPenguins dataset. However, since we want to do binary classification we need to do some preparations. First, we load the dataset, remove features that we don’t use in this article and remove the one class because we perform binary classification:
We remove all samples that are labeled with class ‘Chinstrap‘ and features that we don’t want to use (we use only culmen_length_mm and culmen_depth_mm).
Then we separate input data and scale it:
After that, we extract output values and mark them with values -1 and 1, since the SVM algorithm requires that.
Another thing that we remove is the 182nd sample form the dataset. Why? Well, this sample was in between classes and it kinda messed with points that we try to make, so we remove it and split data into training and test datasets:
That is it, our dataset is ready and here is what it looks when we plot it:
There are many methods to find the optimal w and b for the SVM. One of the most popular ones is Sequential Minimal Optimization (SMO) which is used by the SciKit Learn as well. In its core, the SMO algorithm splits the quadratic programming optimization problem into smaller ones. However, this algorithm is rather complicated so in this article, we implement a simpler one – The Pegasos algorithm. This algorithm uses stochastic gradient descent and it is defined like this:
So let’s implement this algorithm within the MySVM class:
The implementation is fairly simple. In the constructor, we receive two parameters, number of features and learning rate. We need the number of features in order to initialize w. Note that we add 1 to that size because b is incorporated within w. That is why we need a private __extend_input method. This method is used to extend any input matrix with the column of ones. That is how b is implicitly modeled within the solution. Apart from that this class contains two public methods fit and predict following the blueprint dictated by big libraries like SciKit Learn.
The fit method handles the training process. In it, we first extend the training set and loop for a defined number of epochs. There we perform necessary calculations defined by the Pegasos algorithm. Finally, we update w for each sample in the training set. The predict method just extends the test set and multiplies it with w. If the result is larger than 0 it appends label 1, otherwise, it appends label -1. As you may notice, unlike Logistic Regression, SVM doesn’t output probabilities for each class. It is easy to use this class:
In the end, we get that accuracy of 100%.
Using SciKit Learn
Of course, we can use SciKit Learn implementation of this classifier. Since our data is linearly separable we can use LinearSVC class:
Hyperparameter C is used to control the wideness of the street. Smaller C value leads to a wider street, however, it leads to more margin violations. This means that samples could end up that end up in the middle of the street or even on the wrong side. For loss function, we used hinge loss. This loss is effective for SVM algorithms. It is defined with the formula: max(0, 1 – t). Its derivative is –1 if t < 1 and 0 if t > 1.
Alternatively, we can use the SVC class:
We got the same result, ie. 100% accuracy, but you may notice kernel hyperparameter, which is going to be explained in the next chapter. Anyhow, here are the compared results of all three algorithms within pandas data frame:
If we plot it out these models, here is what we get:
Notice the difference between a linear function that separates two classes between our and SciKit Learn implementation. This is caused by different algorithms that implementations used.
Thus far we observed a pretty nice example, data that is linearly separable. In reality, this is almost never the case. So, let’s consider other classes from the PalmerPenguins dataset and load Adelie and Chinstrap classes.
This time it is not so easy to separate classes with just a straight line. Data is a bit scrambled, so what should we do in these situations when data is not linear? Here we can apply probably the greatest SVM advantage – kernel trick. This technique gives you the possibility to get the same result as if you were using polynomial features without actually having to add them. Kernels are just functions that map low-dimensional non-linearly-separable data into a linearly-separable high-dimensional data. Meaning, in our case, we map our 2D data which is not linearly separable into 3D data that is.
However, we don’t know which mapping works for our data the best, so if we would map all the vectors into a higher dimension and then apply SVM to it that would be very inefficient. That is where the kernel trick comes into play. In essence, it uses kernels to work in higher-dimensional spaces without doing this transformation explicitly. So, let’s use different kernels on our dataset. First, we use the polynomial kernel and here is the result that we get.
As you can see accuracy is around 90%. That is not bad, but can it get better. One of the most popular kernel functions is Gaussian RBF kernel defined by the formula:
It is a bell-shaped function varying from 0 to 1 and it is often used for adding features using similarity features method. Notice the parameter gamma. It is defining how wide the bell-curve is and this essentially the hyperparameter of the SVM. In essence, when we use this kernel we create a gaussian bell-curve in 3-dimensional space, in this example (since our data is in 2-dimensional space), around the chosen landmark.
Then all points are mapped from 2-dimensional space to 3-dimensional space, but every point is mapped to this curve. That is how we ensure that data is linearly separable in 3-dimensional space. Then SVM is applied. Of course kernel trick is applied so we don’t have to do all of these calculations, so the algorithm is pretty efficient. Let’s use RBF:
The accuracy is around 97%. That is much better than the polynomial kernel. If we visualize both approaches, we get something like this:
You can imagine 3D Gaussian bell-curve coming out of your screen in the RBF example if you will 🙂
SVM for regression – SVR
SVM for regression is not so different from the one for the classification. In essence, all the practices that we learned for classification stand for the regression as well, with one major difference. For the classification SVM tried to fit the largest street among the samples of different classes, without violating margins. In regression, SVM tries to fit as many samples as possible on the street, while minimizing number of the samples of the street. The wideness of the street is controlled by hyperparameter – epsilon.
For regression example, we use the Boston housing dataset. This dataset contains 14 features, of which we use 2. Our goal in this example is to predict the value of the property based on the age of the occupant. So, we remove all other features:
Here is how that looks like when we plot it:
As you can see this is the case of non-linear regression. So, let’s try to create a model that will best fit this data.
Using SciKit Learn
We know that regression in this example is non-linear, but for educational purposes, we start with linear SVM regression.
The results are, well, mixed. Sometimes they are not that bad, but sometimes they are far off. It looks like a random guess. When we plot it out, the model makes more sense:
It is interesting. We can almost see how SVM works and how it tried to fit as many points around that line. Of course, we can improve the results by using other kernels. Here is how we use polynomial kernel:
Compared to linear model results are somewhat better. When we plot the model, here is what we get:
Notice the degree parameter in the constructor of the SVR class. Try changing its value of this parameter and the value of epsilon to get better results. Finally, let’s use the RBF kernel:
This model gets the best results. Let’s plot it:
In this article, we covered large topic of SVM. We had a chance to see how it works for classification and for regression. We also saw how we can handle linear data and how we can handle non-linear data using kernel trick. SVM is one amazing algorithm which is important part of your Machine Learning toolbox.
Thank you for reading!
Nikola M. Zivkovic
CAIO at Rubik's Code
Nikola M. Zivkovic a CAIO at Rubik’s Code and the author of book “Deep Learning for Programmers“. He is loves knowledge sharing, and he is experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.
Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.