The **code** that accompanies this article can be found **here.**

In a previous couple of articles, we started exploring some of the basic **machine learning** algorithms. We covered some simple **regression** and **classification** algorithms. We also used other technologies like **TensorFlow**, **Pytorch** and **SciKit Learn** for the implementation and application of these algorithms and we used optimization techniques such as **Gradient Descent**. In this article, we focus on one very powerful algorithm – **Support Vector Machine** or SVM. SVM is one of the most popular machine learning algorithms and for a good reason. This algorithm proved over and over again to be really good for **both** – classification and regression and every machine learning engineer should have it in their toolbox. It is also applicable to linear and non-linear data. Before we dive into details and implementation of this algorithm let’s see **datasets** and libraries that we use in this article.

Are you afraid that AI might take your job? Make sure you are the one who is building it.

STAY RELEVANT IN THE RISING AI INDUSTRY! 🖖

## Dataset and Prerequisites

Data that we use in this article is from **PalmerPenguins** Dataset. This dataset has been recently introduced as an alternative to the famous Iris dataset. It is created by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. You can obtain this dataset **here**, or via Kaggle. This dataset is essentially composed of two datasets, each containing data of 344 penguins. Just like in Iris dataset there are 3 different species of penguins coming from 3 islands in the Palmer Archipelago. Also, these datasets contain **culmen** dimensions for each species. The culmen is the upper ridge of a bird’s bill. In the simplified penguin’s data, culmen length and depth are renamed as variables *culmen_length_mm* and *culmen_depth_mm*.

Data that we use for the regression examples we use the famous Boston Housing Dataset. This dataset is composed 14 features and contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. It is a small dataset with only 506 samples.

For the purpose of this article, make sure that you have installed the following *Python *libraries:

**NumPy**– Follow**this guide**if you need help with installation.**SciKit Learn**– Follow**this guide**if you need help with installation.

Once installed make sure that you have imported all the necessary modules that are used in this tutorial.

Apart from that, it would be good to be at least familiar with the basics of **linear algebra**, **calculus** and **probability**.

## SVM for Classification

Let’s first explore how this algorithm works for simple **binary*** classification* in order to understand how it functions. This means that we will consider only two classes form the *PalmerPenguins* dataset. As other machine learning algorithms SVM observers every feature vector as a point in a high-dimensional space. In its core, SVM puts all **feature** vectors on an imaginary *n*-dimensional plot and draws an imaginary *n*-dimensional line (a **hyperplane**) that separates examples with **positive** labels from examples with **negative** labels in the case of classification, or collects as much as samples as possible in case of regression. The hyperplane is defined by the function:

where *x* is the **feature vector**, *w* is **feature weights vector** with size same as *x*, and *b* is bias term. This is formula should be familiar from our journey through the **Linear Regression** or **Logistic Regression**. In the case of binary classification, which we consider at the moment, SVM **requires** that the positive label has a numeric value of 1, and the negative label has a value of -1. This means that predicted label for some feature vector *x* can be calculated using formula:

The function *sign* returns value 1 if the input is a positive number and -1 if the input is a negative value. So, SVM **model**, which during the training process should optimize *w* and *b*, can be described with the formula:

We can break this down and write it as:

This all looks very similar to the Logistic Regression approach. To better understand why this algorithm and what is the difference from **Logistic Regression**, let’s consider **constraints** under which algorithm operates:

On a graph that looks like this:

This means that SVM doesn’t only create a hyperplane, but it also constructs **additional** vectors which are defining **margin**. These vectors are called s**upport vectors** and the distance between the closest examples of two classes is called the **margin**. Hyperplane with support vectors is often referred to as **street**. This means that SVM tries to **fit** the best street between the samples of different classes. Unlike the Logistic regression which tries to fit the hyperplane as close as possible to these points, SVM tries to fit the hyperplane that is as **far** as possible form the samples but still separates classes successfully.

Note that a large margin leads to a better **generalization**, meaning the model will better classify new samples. However, notice that the margin is decided by the Euclidean norm of *w* (denoted by *||w||* in the image). This effectively means the smaller the weight vector *w*, the larger the margin and thus better generalization. To sum it up, training the SVM algorithm for classification means finding the value of *w* and *b* that makes margin as wide as possible while avoiding misclassification. Meaning, objective is defined as a constrained optimization problem:

where *t(i)* is –1 for negative samples and *t(i) = 1* for positive samples. To be more precise, this optimization problem is a **convex quadratic optimization** problem with linear constraints. Such problems are known as **Quadratic Programming** (QP) problems. The solution for this type of problem is outside of the scope of this article. More information in Convex Optimization you can find **here** and more information for constraint optimization, in general, can be found **here**.

### Preparing Data

As we mentioned for the classification examples we use *PalmerPenguins* dataset. However, since we want to do binary classification we need to do some **preparations**. First, we load the dataset, remove features that we don’t use in this article and remove the one class because we perform binary classification:

We remove all samples that are labeled with class ‘*Chinstrap*‘ and features that we don’t want to use (we use only *culmen_length_mm* and *culmen_depth_mm*).

Then we separate input data and scale it:

After that, we extract output values and mark them with values -1 and 1, since the SVM algorithm requires that.

Another thing that we remove is the 182nd sample form the dataset. Why? Well, this sample was in between classes and it kinda messed with points that we try to make, so we remove it and split data into training and test datasets:

That is it, our dataset is ready and here is what it looks when we **plot** it:

### Python Implementation

There are many methods to find the optimal *w* and *b* for the SVM. One of the most popular ones is **Sequential Minimal Optimization** (SMO) which is used by the *SciKit Learn* as well. In its core, the *SMO* algorithm splits the quadratic programming optimization problem into smaller ones. However, this algorithm is rather complicated so in this article, we implement a simpler one – **The Pegasos algorithm**. This algorithm uses stochastic gradient descent and it is defined like this:

So let’s implement this algorithm within the *MySVM* class:

The implementation is fairly simple. In the constructor, we receive two **parameters**, number of features and learning rate. We need the number of features in order to initialize *w*. Note that we add 1 to that size because *b* is incorporated within *w*. That is why we need a private *__extend_input* method. This method is used to **extend** any input matrix with the column of ones. That is how *b* is implicitly modeled within the solution. Apart from that this class contains two **public** methods *fit* and *predict* following the blueprint dictated by big libraries like *SciKit Learn*.

The fit method handles the training process. In it, we first extend the training set and loop for a defined number of **epochs**. There we perform necessary calculations defined by the *Pegasos* algorithm. Finally, we update *w* for **each** sample in the training set. The predict method just extends the test set and multiplies it with *w*. If the result is larger than 0 it appends label 1, otherwise, it appends label -1. As you may notice, unlike *Logistic Regression*, SVM doesn’t output **probabilities** for each class. It is easy to use this class:

In the end, we get that accuracy of 100%.

### Using SciKit Learn

Of course, we can use *SciKit Learn* implementation of this classifier. Since our data is linearly separable we can use *LinearSVC* class:

Hyperparameter *C* is used to control the wideness of the street. Smaller *C* value leads to a wider street, however, it leads to more **margin violations**. This means that samples could end up that end up in the middle of the street or even on the wrong side. For loss function, we used **hinge** loss. This loss is effective for SVM algorithms. It is defined with the formula: ** max(0, 1 – t)**. Its derivative is –1 if t < 1 and 0 if t > 1.

Alternatively, we can use the SVC class:

We got the same result, ie. 100% accuracy, but you may notice **kernel** hyperparameter, which is going to be explained in the next chapter. Anyhow, here are the compared results of all three algorithms within pandas data frame:

If we plot it out these models, here is what we get:

Notice the difference between a linear function that **separates** two classes between our and *SciKit Learn* implementation. This is caused by different algorithms that implementations used.

## Non-Linear Data

Thus far we observed a pretty nice example, data that is linearly separable. In reality, this is almost never the case. So, let’s consider other classes from the *PalmerPenguins* dataset and load *Adelie* and *Chinstrap* classes.

This time it is not so easy to separate classes with just a straight line. Data is a bit scrambled, so what should we do in these situations when data is not linear? Here we can apply probably the greatest SVM advantage –** kernel trick**. This technique gives you the possibility to get the same result as if you were using polynomial features **without** actually having to add them. Kernels are just functions that **map** low-dimensional non-linearly-separable data into a linearly-separable high-dimensional data. Meaning, in our case, we map our 2D data which is not linearly separable into 3D data that is.

However, we don’t know which **mapping** works for our data the best, so if we would map all the vectors into a higher dimension and then apply SVM to it that would be very inefficient. That is where the kernel trick comes into play. In essence, it uses kernels to work in **higher-dimensional** spaces without doing this transformation explicitly. So, let’s use different kernels on our dataset. First, we use the **polynomial kernel** and here is the result that we get.

As you can see accuracy is around 90%. That is not bad, but can it get better. One of the most popular kernel functions is Gaussian **RBF** kernel defined by the formula:

It is a bell-shaped function varying from 0 to 1 and it is often used for adding features using **similarity features** method. Notice the parameter **gamma**. It is defining how wide the bell-curve is and this essentially the hyperparameter of the SVM. In essence, when we use this kernel we create a gaussian bell-curve in 3-dimensional space, in this example (since our data is in 2-dimensional space), around the chosen landmark.

Then all points are mapped from 2-dimensional space to 3-dimensional space, but every point is **mapped** to this curve. That is how we ensure that data is linearly separable in 3-dimensional space. Then SVM is applied. Of course *kernel trick* is applied so we don’t have to do all of these calculations, so the algorithm is pretty efficient. Let’s use *RBF*:

The accuracy is around 97%. That is much better than the polynomial kernel. If we visualize both approaches, we get something like this:

You can imagine 3D Gaussian bell-curve coming out of your screen in the RBF example if you will 🙂

### SVM for regression – SVR

SVM for regression is not so different from the one for the classification. In essence, all the **practices** that we learned for classification stand for the regression as well, with one major difference. For the classification SVM tried to fit the largest street **among** the samples of different classes, without violating margins. In regression, SVM tries to fit as many samples as possible **on** the street, while minimizing number of the samples of the street. The wideness of the street is controlled by hyperparameter – **epsilon**.

### Preparing Data

For regression example, we use the **Boston housing dataset**. This dataset contains 14 features, of which we use 2. Our goal in this example is to predict the **value** of the property based on the **age** of the occupant. So, we remove all other features:

Here is how that looks like when we **plot** it:

As you can see this is the case of **non-linear** regression. So, let’s try to create a model that will best fit this data.

### Using SciKit Learn

We know that regression in this example is non-linear, but for educational purposes, we start with **linear** SVM regression.

The results are, well, mixed. Sometimes they are not that bad, but sometimes they are far off. It looks like a random guess. When we plot it out, the model makes more sense:

It is interesting. We can almost see how SVM works and how it tried to fit as **many** points around that line. Of course, we can improve the results by using other kernels. Here is how we use polynomial kernel:

Compared to linear model results are somewhat **better**. When we plot the model, here is what we get:

Notice the degree parameter in the **constructor** of the SVR class. Try changing its value of this parameter and the value of epsilon to get better results. Finally, let’s use the **RBF** kernel:

This model gets the best results. Let’s plot it:

## Conclusion

In this article, we covered large topic of SVM. We had a chance to see how it works for classification and for regression. We also saw how we can handle linear data and how we can handle non-linear data using kernel trick. SVM is one amazing algorithm which is important part of your Machine Learning toolbox.

Thank you for reading!

#### Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic a CAIO at **Rubik’s Code** and the author of book “**Deep Learning for Programmers**“. He is loves knowledge sharing, and he is experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.

**Rubik’s Code** is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the **services **we provide.