The **code** that accompanies this article can be found **here.**

In the **previous article**, we started exploring some of the **basic** machine learning algorithms. There we covered *Linear Regression, *it’s variations and we implemented it from scratch with *Python*. We also used other technologies like **TensorFlow** and **Pytorch** for the implementation of this algorithm with **gradient descent**. In this article, we focus on the classification algorithm or to be more precise, the algorithms that are used **primarily** for classification problems. Note that we will not cover all the classification algorithms, for example, *SVM* and *Decisions Trees,* because these algorithms can be used for regression as well, so they will get separated articles just for them.

Are you afraid that AI might take your job? Make sure you are the one who is building it.

STAY RELEVANT IN THE RISING AI INDUSTRY! 🖖

When we are solving classification problems we want to predict the **class** **label** of the observed sample. For example, we want to predict a class of penguins based on their bill length and width. There are several approaches to solving this. As we will see some of the solutions are based on calculating distances, while others are based on creating a probabilistic model. One way or another, the goal is to create function *y = f(X)* that minimizes the error of **misclassification**, where *X* is the set of observations and *y* is the output class label. Before we dive into details of each algorithm let’s check out what we need to install and import.

## Dataset and Prerequisites

Data that we use in this article is from **PalmerPenguins** Dataset. This dataset has been recently introduced as an alternative to the famous Iris dataset. It is created by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. You can obtain this dataset **here**, or via Kaggle. This dataset is essentially composed of two datasets, each containing data of 344 penguins. Just like in Iris dataset there are 3 different species of penguins coming from 3 islands in the Palmer Archipelago. Also, these datasets contain **culmen** dimensions for each species. The culmen is the upper ridge of a bird’s bill. In the simplified penguin’s data, culmen length and depth are renamed as variables *culmen_length_mm* and *culmen_depth_mm*.

For the purpose of this article, make sure that you have installed the following *Python *libraries:

**NumPy**– Follow**this guide**if you need help with installation.**SciKit Learn**– Follow**this guide**if you need help with installation.

Once installed make sure that you have imported all the necessary modules that are used in this tutorial.

Apart from that, it would be good to be at least familiar with the basics of **linear algebra**, **calculus** and **probability**.

## Logistic Regression

The first algorithm that we explore in this article is *Logistic Regression*. The name might be a bit confusing because it comes from statistics and it is due to the similar mathematical formulation for **Linear Regression**. Just so simplify things even more for this first algorithm, we explain it in the case of binary classification, meaning we have only two classes. As we mentioned, this algorithm has a similar formulation as linear regression. What we want to do is to model *yi* as a linear function of *xi*, but that is not as simple now when *yi* can have only two values (two classes remember). So, the *Logistic Regression* model still computes a **weighted sum** of the input features and adds a **bias term**, but instead of outputting the result directly, it does some extra processing.

That is why we assign value 0 to the first class (**negative class**) and value 1 to the second class (**positive class**). That is how the problem is transformed into the problem of finding a **continuous** function whose codomain is (0, 1). This means that we want to estimate the **probability** that an observed sample belongs to a particular class. For that purpose, **sigmoid function** or standard logistic function is used:

So, with *Logistic Regression* we still calculate *wX + b *(or to simplify it even further and put all parameters into the matrix – *θX)* value and put the result in sigmoid function. If the **result** is greater than 0.5 (probability is larger than 50%), then the model predicts that the instance belongs to that class positive class(1), or else it predicts that it does not belong to it (negative class). Mathematically we can put it like this:

It is important to note that we need to modify **loss function** as well in order for it to work on this type of data. For this purpose we use **log loss** function which is defined like this:

Unlike the loss function that we used for the **Linear Regression**, this formula doesn’t have its closed form. We can not use the* Normal Equation*, so we need to use **gradient descent** to optimize it. For that purpose we need to calculate partial derivatives of the cost function with regards to the jth model parameter *θj*:

Ok, that would be rough **theory** behind it, let’s move to the implementation.

### Preparing data for Logistic Regression

Let’s first load and prepare data for *Logistic Regression*. We perform **binary** classification, so we don’t use all data from the *PalmerPenguins* dataset. Here is the complete data when we load it:

We remove all samples that are labeled with class ‘*Chinstrap*‘ and features that we don’t want to use (we use only *culmen_length_mm* and *culmen_depth_mm*).

Ok, now data is ready so we can define input data, output data and **split** it into train and test sets.

Note that we **scale** input data using *StandardScaler*. Just for more transparency here is how to input data and output data looks like when we **plot** it:

### Python Implementation

Let’s implement a class for *Logistic Regression* from scratch using *Python* and *NumPy*. The implementation is in the class *MyLogisticRegression*:

Since we use gradient descent for **optimization**, we need to initialize the learning rate through the **constructor**. Apart from that, there are **four** internal functions:

**__sigmoid**– Performs sigmoid function on the input information.**__loss**– Calculates log loss**__gradient**– Calculates gradient based on the input data, estimated values and actual class label**__extend_input**– Just like for Linear regression we need to extend input data with ones so we can add it automatically as the bias term.

The two public methods **fit** and **predict** are used to train the model and to create new predictions for new samples. In the *fit* method, we first **extend** the input and initialize *theta* – the set of parameters. Then we **calculate** the *theta* based on the input data and use them along with the *sigmoid* function to calculate the predictions. After that, we can calculate the *loss*. Finally, we calculate the gradient and use the learning rate to adjust *theta,* ie. we perform gradient descent. The *predict* method just extends input data and gets the **probability** based on the current parameters.

Finally, let’s use the class on prepared data:

Simple, we first create an **object** of the *MyLogisticRegression* class, then we call the *fit* method with the training data and call the *predict* method on the test data. We call the *SciKit Learn* accuracy score, which just gets a percentage of the times prediction aligns with the real value. We had a 100% score, meaning we correctly predicted all values in the test set. Let’s visualize that:

We can even see our linear model presented as a line which separates two classes. This is exatcly what we wanted.

### Using SciKit Learn

Of course, we don’t need to implement this algorithm on our own. *Logistic regression* is available as a module in *SciKit Learn* library. Here is how you can use it:

Pretty much like our from-scratch implementation. The accuracy is again 100%. The principle we used here can be extended to more classes and features.

## K-Nearest Neighbours (KNN)

Unlike *Logistic Regression*, this algorithm is not calculating probabilities but is based on **distances**. This effectively means that it is non-parametric models, but it also means that it keeps all training data in memory after the training. In fact, **storing** training data is the training process. Basically, once a new previously unseen sample is passed into the algorithm it calculates *k* training examples that are closest to *x* and returns the majority label (or average label, depending on the implementation). The distances can be calculated in various different ways. **Euclidean distance** or **Cosine similarity** are often used in practice, but you can play around with *Manhattan distance* or *Chebychev distance.* In this article we use Euclidean distance which can be described with the formula:

To sum it up, this algorithm is simple and intuitive and it can be broken down into several steps:

**Decide**the number of neighbors that algorithm considers**Store**training data with corresponding labels in memory- Once a new input point comes in,
**calculate**it’s the distance from the training points based on the distance function of your choosing **Sort**the results and**pick***k*points that are closest to the new input sample**Detect**the label of the majority of*k*points and assign this label to new input sample

Let’s implement it in *Python*.

### Preparing Data for KNN

For this algorithm, we don’t consider binary classification, but the full classification problem with **three** classes from *PalmerPenguins* dataset. Effectively this only means that we don’t remove *Chinstrap* class from the dataset, the other preprocessing steps remain the same.

Here is how that looks like when we **plot** it:

### Python Implementation

The whole algorithm is **implemented** within *MyKNearestNeighbors* class:

There are two private methods:**__euclidian_distance** – Calculates Euclidean distance between two samples in the dataset**__get_neighbours** – Calculates the distance between all points in the train set and input set and returns k neighbors for each sample from the input set.

The two public methods *fit* and *predict* are used as an interface to this class. The first one just **stores** training data in the class attribute. The *predict* method puts it all together. Basically, it calls *__get_neighbours* method and utilizes that result to assign labels to every sample of the input set. Usage is the same as for previous algorithms:

The accuracy score is once again 100%. We can confirm it by creating *pandas* data frame with both actual values and predictions:

### Using SciKit Learn

*SciKit Learn* of course, provides a class for KNN called *KNeighborsClassifier:*

The accuracy using this approach is also 100%. This way we have the same results as with our implementation, which is quite cool.

## Naive Bayes

Third and final algorithm we explore today is the **Naive Bayes** algorithm. As we mentioned previously, classification problems can be solved by creating a predictive model. That is what we have done with *Logistic Regression*. Another way to create a predictive model would be to **estimate** the conditional probability of the class label, given the observation. Meaning, we can calculate conditional probability for each class label in and the pick the label with the highest probability as most likley label. In theory, Bayes Theorem can be used for this:

The main problem with this approach is that we need a really large dataset to calculate the **conditional probability** *P(x1, x2, …, xn | yi)*, because this formula assumes that each input variable is dependent upon all other variables. If the number of features is large, the size of the dataset becomes an even bigger problem. To simplify this problem we **assume** that each input variable as being independent of each other. This might sound weird…because it is 🙂 In reality, it is really rare that input features don’t **depend** on each other. However, this approach proved to surprisingly well in the wild. That is why we can rewrite the formula from above as:

To calculate *P(yi)* all we have to do is divide the frequency of class *yi* in the training dataset and divide it with the total number of samples in the training set *(P(yi) = # of samples with yi / total # of examples)*. The second part of the **equation**, the conditional probability, can be derived from data as well. So, let’s implement it.

### Preparing Data for Naive Bayes

Before we dive into the algorithm implementation let’s prepare the *PalmerPenguins* dataset again. It is very simmilar for the preparation we have done for KNN:

The only difference is that we stored columns in the *columns* variable because we need it for the algorithm implementation.

### Python Implementation

The implementation of this algorithm is simple and can be found in the class *MyNaiveBias*. Note that we utilized the *Pandas* library a bit more because it provides easy solutions for **grouping** data:

So, the fit method calculating **statistics** for the training dataset. We store all classes, calculate means and standard deviations for input training data grouped by each class. Finally, we calculate prior probability for each class. The *predict* method is doing all the hard work here and utilizes those calculations. First, for each sample from the input set and for each class we calculate **conditional probability** and multiply it with the **prior probability** for that class. In the end, for each input sample, we pick the class with the **largest** probability. We use this class as the ones before:

### Using SciKit Learn

We can use *SciKit Learn* implementation of Naive Bayes like this:

## Conclusion

In this article, we covered three classification algorithms that are often used. We explored Logistic Regression, KNN and Naive Bayes. We had a chance to see how they function under the hood and implement them with Python. Also, we had a chance to use out-of-the-box solutions form SciKit Learn. They are simple, but they are a great starting point before going to more advanced algorithms.

Thank you for reading!

#### Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic a CAIO at **Rubik’s Code** and the author of book “**Deep Learning for Programmers**“. He is loves knowledge sharing, and he is experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.

**Rubik’s Code** is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the **services **we provide.

## Trackbacks/Pingbacks