The code that accompanies this article can be found here.

So far in our journey through the Machine Learning universe, we covered several big topics. We investigated some regression algorithms, classification algorithms and algorithms that can be used for both types of problems (SVMDecision Trees and Random Forest). Apart from that, we dipped our toes in unsupervised learning, saw how we can use this type of learning for clustering and learned about several clustering techniques.

We also talked about how to quantify machine learning model performance and how to improve it with regularization. In all these articles, we used Python for “from the scratch” implementations and libraries like TensorFlow, Pytorch and SciKit Learn. The word optimization popped out more than once in these articles, so in this and next article, we focus on optimization techniques which are an important part of the machine learning process.

Are you afraid that AI might take your job? Make sure you are the one who is building it.


In general, every machine learning algorithm is composed of three integral parts:

  1. A loss function.
  2. Optimization criteria based on the loss function, like a cost function.
  3. Optimization technique – this process leverages training data to find a solution for optimization criteria (cost function).

As you were able to see in previous articles, some algorithms were created intuitively and didn’t have optimization criteria in mind. In fact, mathematical explanations of why and how these algorithms work were done later. Some of these algorithms are Decision Trees and kNN. Other algorithms, which were developed later had this thing in mind beforehand. SVM is one example.

During the training, we change the parameters of our machine learning model to try and minimize the loss function. However, the question of how do you change those parameters arises. Also, by how much should we change them during training and when. To answer all these questions we use optimizers. They put all different parts of the machine learning algorithm together. So far we mentioned Gradient Decent as an optimization technique, but we haven’t explored it in more detail. In this article, we focus on that and we cover the grandfather of all optimization techniques and its variation. Note that these techniques are not machine learning algorithms. They are solvers of minimization problems in which the function to minimize has a gradient in most points of its domain.

Dataset & Prerequisites

Data that we use in this article is the famous Boston Housing Dataset. This dataset is composed 14 features and contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. It is a small dataset with only 506 samples.

For the purpose of this article, make sure that you have installed the following Python libraries:

  • NumPy – Follow this guide if you need help with installation.
  • SciKit Learn – Follow this guide if you need help with installation.
  • Pandas – Follow this guide if you need help with installation.

Once installed make sure that you have imported all the necessary modules that are used in this tutorial.

Apart from that, it would be good to be at least familiar with the basics of linear algebra, calculus and probability.

Why do we use Optimizers?

Note that we also use simple Linear Regression in all examples. Due to the fact that we explore optimization techniques, we picked the easiest machine learning algorithm. You can see more details about Linear regression here. As a quick reminder the formula for linear regression goes like this:

where w and b are parameters of the machine learning algorithm. The entire point of the training process is to set the correct values to the w and b, so we get the desired output from the machine learning model. This means that we are trying to make the value of our error vector as small as possible, i.e. to find a global minimum of the cost function.

One way of solving this problem is to use calculus. We could compute derivatives and then use them to find places where is an extrema of the cost function. However, the cost function is not a function of one or a few variables; it is a function of all parameters of a machine learning algorithm, so these calculations will quickly grow into a monster. That is why we use these optimizers.

Gradient Descent

This technique is the most popular optimization techniques. It is also a basis for other techniques as well. There is one useful analogy that describes this process quite well. Imagine that you had a ball inside a rounded valley like in the picture below. If you let the ball roll, it will go from one side of the valley to the other, until it goes still at the bottom.

Essentially, we can look at this behavior like the ball is optimizing its position from left to right, and eventually, it stops at the bottom, i.e. the lowest point of the bowl. The bottom, in this case, is the minimum of our cost function. This is what the gradient descent algorithm is doing. It starts from one position in which by calculating derivatives and second derivatives of cost function it gets the information about where “the ball” should roll. Every time we calculate derivatives we get information about the slope of the side of the function (i.e. bowl) at its current position.

When the slope is negative (downward from left to right), the ball should move to the right, otherwise, it should move to the left. Be aware that the ball is just an analogy, and we are not trying to develop an accurate simulation of the laws of physics. We are trying to get to the minimum of the function using this alternative method since we already realized that using calculus is not optimal.

In a nutshell, the training process of Linear Regression with Gradient Descent can be described like this:

  1. Put the training set in the machine algorithm and get the output for each sample in it,
  2. The output is compared with the desired output and error is calculated using a cost function,
  3. Based on the error value and used cost function, a decision on how the w and b should be changed is made in order to minimize the error value,
  4. The process is repeated until the error is minimal.
Recommendation Systems

Let’s formalize this in more mathematical terms. We don’t know what the optimal values for w and b are in the Linear Regression formula:

Let’s formalize this in more mathematical terms. We don’t know what the optimal values for w and b are in the Linear Regression formula:

where N is the number of samples in the dataset, yi is the real output value and xi is the input vector (where each feature is represented with a separate coordinate). The first step in the Gradient Descent would be to define partial derivates for each parameter. For w, we use chain rule and get the result:

and for b we get:

Once we know how to calculate partial derivatives for all parameters in the machine learning model (in this case w and b) for the defined cost function (in this case MSE), then we can initialize values for tose parameters (w and b). The initialization process is a completely different topic outside of the scope of this tutorial. So, in this article, we will initialize those values to 0. Then we start iteration through the training set examples and update w and b, by utilizing partial derivatives after each sample:

where alpha is the learning rate hyperparameter. This hyperparameter controls how “strong” an update is. When we go through all examples in the training set we call that an epoch. In general, we train our machine learning algorithms for multiple epochs.

Python Implementation

Ok, that is enough theory, let’s implement this with Python. The complete implementation is done within MyGradientDescent class:

It is a pretty simple class. Note that name of this class is maybe not completely accurate. In essence, we created an algorithm that uses Linear regression with Gradient Descent. This is important to say. Here the algorithm is still Linear Regression, but the method that helped us we learn w and b is Gradient Descent. We could switch to any other learning algorithm.

In the constructor of the class, we initialize the value of w and b to zero. Also, we initialize the learning rate hyperparameter. There are two public methods:

  • fit – This is the method that performs the training process. Note that it returns history, where we record loss during training. This is done for future visualizations. Apart from that for updating and b we used formulas defined above.
  • predict – Method that predicts value.

Note that for the loss we use MSE and in the code, we use Sci-Kit learn function to calculate it. Let’s utilize this class on the data:

We can see how our algorithm is diverging because the loss is slowly going down. Once it is done we can plot the history and see how the loss function decreased during training process:

Another thing we can do is to plot the final model:

Also, we can compare real values with predictions (keep in mind that data is scaled):

The biggest problem of the Gradient Descent is that it can converge towards a local minimum and not to a global one. In our case that is not a problem, since MSE is a convex function that has just one minimum – a global one. The other problem is that for big datasets this approach can take some time.

Batch Gradient Descent

However, in practice, it would be inefficient to calculate the partial gradient of the cost function for each small change in the w or b, for each sample. You could see that it is not the fastest approach. That is why simple optimization is done, ie. partial derivative is calculated for a complete set of parameters in one go. For MSE, that is done using the formula:

where θ is the set of all machine learning parameters. In our case, θ0 is b while other θ values come from w. This optimized version is of gradient descent is called batch gradient descent, due to the fact that partial gradient descent is calculated for complete input X (i.e. batch) at each gradient step. This means that w and b can be updated using the formulas:

Python Implementation

The implementation of this algorithm is very similar to the implementation of “vanilla” Gradient Descent. We do so in the class MyBatchGradientDescent:

The only difference is the way we update w and b. When we fit our data in this algorithm, here is what we get:

Even though this approach is faster, we got some different results with it, loss is a bit higher than the one when we used simple gradient descent. Perhaps we should train it more. This difference can also be seen with the model that this approach created:

And in the results that it outputs:

Stohastic Gradient Descent

Stochastic gradient descent (SGD) is an updated version of the Batch Gradient Descent algorithm that speeds up the computation by approximating the gradient using smaller subsets of the training data. These subsets are called mini-batches or just batches. Sometimes in literature, you will find that Stochastic Gradient Descent is a version on Gradient Dataset that picks one random sample from the input dataset and that Mini-Batch Gradient Descent takes a subset of samples from the input dataset. However, those formal lines are a bit blurred in the day to day work. If someone says that they use stochastic gradient descent, it is high chances that they are referring to the one that uses the mini-batches. After all one sample is just a subset with one element. In this article, we are using this context – that Stochastic Gradient Descent uses mini-batches which are the subset of the input dataset.

Decision Tree

Since this algorithm works with considerably less data than Batch Gradient Descent it is faster. Also, this means that this algorithm can be used on big datasets. However, due to its random nature, this process is much less regularized. The loss will not linearly go to it’s minimum, but it will bounce up and down until it stabilizes and diverges. Let’s see how the implementation of this algorithm looks like in Python.

Python Implementation

Ok, the only thing that we need to improve from the previous implementation is to give the user of our class the ability to define the size of the batch. Here is how we do it in the class MySGD:

You can see one additional private function _get_batch. This method retrieves a random subset of input and output data for the training. It is the only difference from MyBatchGradientDescent class. The fit method is modified to utilize this method and generate X_batch and y_batch, which are later on used in the training. Ok, let’s try this on our dataset:

From the output, we can see that loss is going up and down, which we expected, but notice how overall loss is going down. We can observe that when we plot the history:

Decision Tree

Even though this seems a bit odd at first, observe what happens when we plot the predictions and compare it with the results we got from Batch Gradient Descent:

Decision Tree

We got a better approximation of the data! And with that we got more accurate predictions:

Decision Tree

Using Sci-Kit Learn

As it usually goes, everything we write from scratch already exists in the Sci-Kit Learn library, which makes our lives a lot easier. Let’s use SGDRegressor for our problem:

Quite straightforward, nothing more than we seen so far. Note that the alpha parameter is the learning rate and max_iter is the maximal number of epochs. When we plot the result, we get the more or less same thing as our implementation:

Decision Tree

The same goes with the table of results:

Decision Tree


In this article we got a chance to see how Gradient Descent, the most commonly used optimization technique, works. It is the optimization technique from which all started. We had a chance to implement it from scratch using Python and see how we can utilize it with Sci-Kit learn. In the next article, we will cover some other more advanced optimization techniques which are based on a notion of momentum.

Thank you for reading!

Nikola M. Zivkovic

Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic a CAIO at Rubik’s Code and the author of book “Deep Learning for Programmers“. He is loves knowledge sharing, and he is experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.

Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.