**Deep Learning **and **Machine Learning** are no longer a novelty. Many applications are utilizing the power of these technologies for cheap **predictions**, object detection and various other purposes. In this article, we cover the **Linear Regression**. You will learn how Linear Regression functions, what is Multiple Linear Regression, implement both algorithms from scratch and with *ML.NET*. Linear Regression is a well-known algorithm and it is the basics of this vast field. In a way, it is the root of it all.

Are you afraid that AI might take your job? Make sure you are the one who is building it.

STAY RELEVANT IN THE RISING AI INDUSTRY! 🖖

## 1. Prerequisites and Dataset

What we want to do in this article, is to make an algorithm that is able to predict the price of the house based on the provided parameters. This algorithm should learn how to do that using the famous *Boston Housing Dataset.* *This dataset is composed of 12 features and contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. It is a small dataset with only 506 samples.*

The implementations provided here are done in *C#*, and we use the latest .NET 5. So make sure that you have installed this SDK. If you are using *Visual Studio *this comes with version 16.8.3. Also, make sure that you have installed the following packages*:*

You can do a similar thing using Visual Studio’s *Manage NuGetPackage* option:

If you need to catch up with the basics of machine learning with ML.NET check out **this article**. Apart from that, you should be comfortable with the basics of **linear algebra**.

## 2. Simple Linear Regression Theory

Sometimes data that we have is quite **simple**. Sometimes, the output value of the dataset is just the **linear combination** of features in the input example. Let’s simplify it even further and say that we have only **one** feature in the input data. A mathematical model that describes such a **relationship** can be is presented with the formula:

For example, let’s say that this is our data:

In this particular case, the mathematical model that we want to create is just a **linear function** of the input feature, where *b0* and *b1* are the model’s **parameters**. These parameters should be learned during the training process. After that, the model should be able to give correct output **predictions** for new inputs. To sum it up, during training we need to learn *b0* and *b1 *based on the values of *x* and *y*, so our *f(xi)* is able to return correct predictions for the **new** inputs. If we want to **generalize** even further we can say that model makes a prediction by adding a constant (bias term – *b0*) on the precomputed weighted sum (*b1)* of the input features. However, let’s back to our example and clear things up a little bit before we dive into generalization. Here is what the aforementioned data looks like on the **plot**:

Our linear regression model, by calculating optimal *b0* and *b1,* produces a line that will best fit this data. This line should be optimally **distanced** from all points in the graph. It is called the **regression line**. So, how does the algorithm calculates *b0* and *b1* values?

In the formula above, *f(xi)* represents the predicted output value for ith example from the input, and *b0* and *b1* are regression coefficients that represent the **y-intercept** and **slope** of the regression line. We want that value to be as close as possible to the real value – *y*. Thus model needs to learn the values regression coefficients *b0* and *b1*, based on which model will be able to predict the correct output. In order to make these estimates, the algorithm needs to know how bad are his **current** estimations of these coefficients. At the beginning of the training process, we feed samples into the algorithm which calculates output *f(xi)* of the current sample, based on **initial** values of regression coefficients. Then the error is **calculated** and coefficients are corrected. Error for each sample can be calculated like this:

Meaning, we **subtract** estimated output from the real output. Note that this is a training process and we **know** the value of the output in the i-th sample. Because *ei* depends on coefficient values it can be described by the **function**. If we want to minimize *ei* and for that, we need to define a function based on which we will do so. In this article, we use the **Least Squares Technique** and define the function that we want to minimize as:

The function that we want to minimize is called the **objective function** or **loss function**. In order to minimize *ei*, we need to find coefficients *b0* and *b1* for which *J* will hit the global minimum. Without going into mathematical details (you can check out that **here**), here is how we can calculate values for *b0* and *b1:*

Here *SSxy *is the sum of cross-deviations of y and x:

while *SSxx* is the sum of squared deviations of x:

Ok, so much for the theory, let’s implement this algorithm using *C#*.

## 3. Simple Linear Regression C# Implementation

Let’s implement a class that can do simple linear regression with two parameters as we explained in the previous section.

Our *LinearRegressor* class is quite simple. Following the linear regression formula, this class has two fields *_b0* and *_b1* which are set to zero in the constructor. Following the usual notation, there are two **public** methods *Fit()* and *Predict()*. The *Fit()* method is where we perform the **training** process, while *Predict()* method creates **predictions** based on that training process. Here is how we use this class:

Here we defined one array for the input values *X* and an array of the output values *y*. All we need is to create a *LinearRegressor* object and train it using *Fit()* method. Once that is done, we can make predictions. Here we use the same values that we used for training, which is **not** the best approach and should be **avoided**, however since this is just an educational example we will give it a pass. Here is the result that we get when we run this code:

We can see that predictions are close, but not quite there. If we visualize those predictions here is what we get:

Overall it is a nice approximation.

## 4. Multiple Linear Regression Theory

Ok, that was super simple. The usage of this example is very limited since we usually end up with **datasets** with more **features** in them. Let’s take it up a notch and get a little more practical…and mathematical. We observe a set of labeled samples* {(xi, yi)} Ni=1*. The *N* is the size of the set, while *xi* is the *D*-dimensional feature vector and *yi* is the output. Every feature x is the real number. One such dataset is the famous **Boston Housing Dataset**. Here is what it looks like:

In this dataset, the **output** is the *medv* feature, while the rest of the features are **input** features. As you can see there are several features *(x1 – crim, x2-zn,….)* for each sample* i*. Now, we can **generalize** principles of linear regression and use them on a dataset with more features. We can present the model with a formula:

Or to simplify it even further:

We changed the notion there a little bit, but it is essentially the **same** as the previous formula, it is just **vectorized**. The bias *b0* became *b*. The *w* is now a *D*-dimensional vector (because we have a *D* number of features, remember) of parameters. To predict the *y* for a given *x* we use this model.

Obviously, we want to find the optimal values for coefficients (*w, b*) for which the model will output accurate predictions. Unlike simple linear regression which creates a line, multiple linear regression creates a **hyperplane**, since every feature represents one **dimension**. This hyperplane is chosen like that to be as **close** to all sample values as possible. To calculate the optimal coefficient, this time we want to minimize **Mean Squared Error** function:

To quickly find the values of w and b that minimize MSE we use the so-called **Normal Equation**. This equation gives direct results for mentioned coeficients:

Ok, let’s utilize this in the code.

## 5. Multiple Linear Regression C# Implementation

The algorithm we discussed previously is implemented withing **MultipleLinearRegressor** class:

That is a lot of code so let’s explain it in more detail. This class has two fields *_b* and *_w*. They represent **parameters** of this machine learning algorithm that will be changed during the training process. There are two **private** methods *ExtendInputWithOnes* and *SubArray*. Since we want to learn parameters in one shot and parameter *b* from the equation is not **modeled** in the data, we need to **extend** the input matrix with one column with all ones. This is one in *ExtendInputWithOnes* method. The *SubArray* method retrives sub-array from the passed array. Apart from that, we have two public functions *Fit()* and *Predict()*, just like in the previous implementation. In the *Fit()* method we utilize what we have learned from theory and train our algorithm, while we make new predictions with *Predict()* method. Note the use of the *MathNet* library.

Let’s use this class:

We use some dummy data just to demonstrate how this class functions. In the end for the new sample we get this prediction:

## 6. ML.NET Linear Regression Algorithms

In *ML.NET* we don’t have these plain implementations of the *Linear Regression*, but we have some which are more **advanced**. There are two improved variations of Linear Regression that you can use with *ML.NET*:

**Online Gradient Descent**–**Stochastic gradient descent**is one of the most popular machine learning algorithms. It uses a simple yet efficient iterative technique to fit model coefficients using error gradients. With these iterations, it avoids**memory**problems, which we might face if we try to load a large dataset in our vanilla implementations. Online Gradient Descent is a variation of the Stochastic Gradient descent with a choice of loss functions, and an option to update the weight vector using the average of the vectors seen**over time**.**SDCA**–**Stochastic Dual Coordinate Ascent**(SDCA) is another variation on Stochastic Gradient Descent which is suitable for**large**dataset. The algorithm can be scaled because it’s a streaming training algorithm. This algorithm a state-of-the-art optimization technique for convex objective functions. You can find out more about it in this**paper**.

In the next section, we use these algorithms on Boston Housing Dataset.

## 7. ML.NET Implementation

Before we dive into the *ML.NET* implementation, let’s consider the **high-level architecture** of this implementation. In general, we want to build an easily extendable solution that we can easily extend with new linear algorithms that *ML.NET* might include in the future. That is why the folder structure of our solution looks like this:

*The Data* folder contains .csv with input data and the *MachineLearning* folder contains everything that is necessary for our algorithm to work. The architectural overview can be represented like this:

At the **core** of this solution, we have an abstract *TrainerBase* class. This class is in the *Common* folder and its main goal is to **standardize** the way this whole process is done. It is in this class where we **train** our machine learning algorithm. The classes that implement this abstract class are located in the *Trainers* folder. Here we can find two classes *OGDBostonTrainer* and *SdcaRegressionBostonTrainer*. These classes define which **algorithm** should be used and how the data should be** pre-processed**. In this particular case, we have only one *Predictor* located in the *Predictor* folder.

### 7.1 Data Models

In order to load data from the dataset and use it with *ML.NET* algorithms, we need to implement classes that are going to **model** this data. Two files can be found in Data Folder: *BostonHousingData* and *BostonHousingPricePredictions*. The *BostonHousingData* class models input data and it looks like this:

The *BostonHousingPricePredictions* class models output data:

### 7.2 TrainerBase Class

As we mentioned, this class is the core of this implementation. Here is what it looks like:

That is one large class. It controls the whole process. Let’s split it up and see what it is all about. First, let’s observe the fields and properties of this class:

The *Name* property is used by the class that inherits this one to add the name of the algorithm. The *ModelPath* field is there to define where we will **store** our model once it is trained. Note that the file name has *.mdl* extension. Then we have our *MlContext* so we can use *ML.NET* functionalities. Don’t forget that this class is a **singleton**, so there will be only one in our solution. The *_dataSplit* field contains loaded data. Data is **split** into train and test datasets within this structure. The field *_model* is used by the child classes. These classes define which machine learning **algorithm** is used in this field. The *_trainedModel* field is the resulting model that should be evaluated and saved. Cool, let’s now explore *Fit()* method:

This method is the **blueprint** for the training of the algorithms. As an input parameter, it receives the **path** to the *.csv* file. After we confirm that the file exists we use the private method *LoadAndPrepareData*. This method loads data into **memory** and splits it into two datasets, train and test dataset. We store the returning value into *_dataSplit* because we need a test dataset for the **evaluation** phase. Then we call *BuildDataProcessingPipeline()*.

This is an abstract method that needs to be **overridden** by the child class. In essence, the only job of the class that inherits and implements this one is to define the algorithm that should be used, by instantiating an object of the **desired algorithm** as *_model* and overriding *BuildDataProcessingPipeline()* so data is correctly **prepared** for the defined algorithm. We complete the training pipeline by appending *_model* to the prepared data process pipeline. Finally, we can call the *Fit()* method of the training pipeline and store the trained model in the* _trainedModel* field. Next is the *Evaluate()* method:

It is a pretty simple method that creates a *Transformer* object by using *_trainedModel* and test **Dataset**. Then we utilize *MlContext* to retrieve **regression metrics**. Finally, let’s check out *Save()* method:

This is another simple method that just uses *MLContext* to save the model into the defined path.

### 7.3 Trainers

Thanks to all the heavy lifting that we have done in the *TrainerBase *class, the other *Trainer *classes are pretty **simple** and focused only on **preparing data** for the **concrete algorithm**. We have two classes that utilize *ML.NET*‘s Online Gradient Descent and SDCA implementations. First, let’s check out the class ** OGDBostonTrainer **class that implements Online Gradient Descent for Boston housing dataset:

In the constructor of this class, we define the name and instantiate *ML.NET* object of *OnlineGradientDescent* class. In the overridden *BuildDataProcessingPipeline* we have done some Data pre-processing and **feature engineering**. Namely, we used **one-hot encoding** on the *RiverCoast* feature and used log mean **normalization** on all features. The *SdcaRegressionBostonTrainer* is similar:

Basically, the only difference is that the *_model* field is now assigned *Sdca* class instance.

### 7.4 Predictor

The *Predictor* class is here to load the saved model and run some **predictions**. Usually, this class is not a part of the same microservice as trainers. We usually have one microservice that is performing the training of the model. This model is saved into file, from which the other model loads it and run predictions based on the user input. Here is how this class looks like:

In a nutshell, the model is loaded from a defined file, and predictions are made on the new sample. Note that we need to create *PredictionEngine *to do so.

### 7.4 Usage and Results

Ok, let’s put all of this together.

Not the *TrainEvaluatePredict()* method. This method does the heavy lifting here. In this method, we can inject an instance of the class that inherits *TrainerBase* and a new sample that we want to be predicted. Then we call *Fit()* method to **train** the algorithm. Then we call *Evaluate()* method and print out the **metrics**. Finally, we **save** the model. Once that is done, we create an instance of *Predictor*, call *Predict()* method with a new sample and print out the **predictions**. In the *Main*, we create an instance of *OGDBostonTrainer* class and run *TrainEvaluatePredict* with it. Then we do the same for *SdcaRegressionBostonTrainer*. Here are the results:

Observing the metrics we can say that **SDCA** performed better and consider this estimation to be a better one.

## Conclusion

In this article, we covered a lot of things. First, we explored *Linear regression* and *Multiple Linear Regression* in-depth and then we implement both of those algorithms from scratch. Then we used *ML.NET*‘s algorithms like *Online Gradient Descent* and *SDCA*. We created a robust and extendable solution for *ML.NET* and run the predictions on *Boston Housing Dataset*. Pretty neat. What do you think?

Thank you for reading!

#### Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic a CAIO at **Rubik’s Code** and the author of book “**Deep Learning for Programmers**“. He is loves knowledge sharing, and he is experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.

**Rubik’s Code** is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the **services **we provide.

## Trackbacks/Pingbacks