The code that accompanies this article can be found here.

So far in our journey through the Machine Learning universe, we have covered several big topics. We investigated some regression algorithms, classification algorithms and algorithms that can be used for both types of problems (SVM, Decision Trees and Random Forest). Apart from that, we dipped our toes into unsupervised learning, saw how we can use this type of learning for clustering and learned about several clustering techniques. In all these articles, we used Python for “from scratch” implementations, along with libraries like TensorFlow, PyTorch and SciKit Learn.

In the previous couple of articles, we focused specifically on performance. We talked about how to quantify machine learning model performance and how to improve it with regularization. Apart from that, we covered the topic of optimization techniques, both basic ones like Gradient Descent and advanced ones like Adam. It is fascinating how entire sub-branches have grown around the concept of model optimization. One of those sub-branches is hyperparameter optimization, or hyperparameter tuning.

Hyperparameters are an integral part of every machine learning and deep learning algorithm. Unlike standard machine learning parameters that are learned by the algorithm itself (like w and b in linear regression, or connection weights in a neural network), hyperparameters are set by the engineer before the training process. They are external factors, fully defined by the engineer, that control the behavior of the learning algorithm. Do you need some examples?

The learning rate is one of the most famous hyperparameters, C in SVM is also a hyperparameter, the maximal depth of a Decision Tree is a hyperparameter, and so on. These can be set manually by the engineer. However, if we want to run multiple tests, this can be tiresome. That is where hyperparameter optimization comes in. The main goal of these techniques is to find the hyperparameters of a given machine learning algorithm that deliver the best performance as measured on a validation set. In this tutorial, we explore several techniques that can give you the best hyperparameters.

Dataset & Prerequisites

The data that we use in this article comes from the PalmerPenguins dataset. This dataset was recently introduced as an alternative to the famous Iris dataset. It was created by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. You can obtain this dataset here, or via Kaggle. It is essentially composed of two datasets, each containing data on 344 penguins. Just like the Iris dataset, it contains 3 different species of penguins, coming from 3 islands in the Palmer Archipelago. These datasets also contain culmen dimensions for each species. The culmen is the upper ridge of a bird’s bill. In the simplified penguin data, culmen length and depth are available as the variables culmen_length_mm and culmen_depth_mm.

Since this dataset is labeled, we will be able to verify the results of our experiments. However, this is often not the case in practice, and validating results without labels is often a hard and complicated process.

For the purpose of this article, make sure that you have installed the following Python libraries:

  • NumPy – Follow this guide if you need help with installation.
  • SciKit Learn – Follow this guide if you need help with installation.
  • SciPy – Follow this guide if you need help with installation.
  • Sci-Kit Optimization – Follow this guide if you need help with installation.

Once installed, make sure that you have imported all the necessary modules used in this tutorial.
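
For reference, here is a rough sketch of the imports used throughout this tutorial (the exact set depends on which sections you follow along with):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from scipy.stats import uniform

    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
    from sklearn.metrics import f1_score

    from skopt import BayesSearchCV
    from skopt.space import Real, Categorical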

Apart from that, it would be good to be at least familiar with the basics of linear algebra, calculus and probability.

Preparing the Data

Let’s load and prepare the PalmerPenguins dataset. First, we load the dataset and remove the features that we don’t use in this article:
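
A minimal sketch of this step, assuming the simplified dataset is stored locally as penguins_size.csv (the file name and path are placeholders) and that we keep only the species label and the two culmen features:

    import pandas as pd

    # Load the simplified PalmerPenguins data (hypothetical local path)
    data = pd.read_csv('./data/penguins_size.csv')

    # Keep only the label and the two culmen features used in this article,
    # and drop rows with missing values
    data = data[['species', 'culmen_length_mm', 'culmen_depth_mm']].dropna()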

Then we separate input data and scale it:
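
For example, assuming the data DataFrame from the previous step, scaling with Sci-Kit Learn’s StandardScaler could look like this:

    from sklearn.preprocessing import StandardScaler

    # Inputs are the two culmen features, labels are the species
    X = data[['culmen_length_mm', 'culmen_depth_mm']].values
    y = data['species'].values

    # Standardize features to zero mean and unit variance
    scaler = StandardScaler()
    X = scaler.fit_transform(X)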

Finally, we split data into training and testing datasets:
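
A sketch of the split, with an assumed 80/20 ratio and a fixed random seed for reproducibility:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=33)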

When we plot the data, here is what it looks like:
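
If you want to reproduce a similar figure, a simple scatter plot of the scaled culmen features, colored by species, is enough (a sketch):

    import numpy as np
    import matplotlib.pyplot as plt

    # One scatter call per species so that the legend shows the class names
    for species in np.unique(y_train):
        mask = y_train == species
        plt.scatter(X_train[mask, 0], X_train[mask, 1], label=species)

    plt.xlabel('culmen_length_mm (scaled)')
    plt.ylabel('culmen_depth_mm (scaled)')
    plt.legend()
    plt.show()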

Grid Search

Manual hyperparameter tuning is slow and tiresome. That is why we explore the first and simplest hyperparameter optimization technique – Grid Search. This technique speeds up the process and is one of the most commonly used hyperparameter optimization techniques. In essence, it automates trial and error. For this technique, we provide a list of candidate values for each hyperparameter, and the algorithm builds a model for every possible combination, evaluates each one, and selects the values that provide the best results. It is a universal technique that can be applied to any model.

In our example, we use the SVM algorithm for classification. There are three hyperparameters that we take into consideration – C, gamma and kernel. To understand them in more detail, check out this article. For C we want to check the following values: 0.1, 1, 100, 1000; for gamma we use the values: 0.0001, 0.001, 0.005, 0.1, 1, 3, 5; and for kernel we use the values: ‘linear’ and ‘rbf’. Here is how that looks in code:
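
Expressed as a Python dictionary, that list of candidate values is:

    hyperparameters = {
        'C': [0.1, 1, 100, 1000],
        'gamma': [0.0001, 0.001, 0.005, 0.1, 1, 3, 5],
        'kernel': ['linear', 'rbf']
    }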

We utilize Sci-Kit Learn and its SVC class, which contains the implementation of SVM for classification. Apart from that, we use the GridSearchCV class, which is used for grid search optimization. Combined, that looks like this:
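
A sketch of that combination is shown below; the cv value and the exact scoring string (f1_micro, which suits our three-class problem) are assumptions rather than values taken from the accompanying code:

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    grid_search = GridSearchCV(
        estimator=SVC(),             # the model whose hyperparameters we tune
        param_grid=hyperparameters,  # the dictionary defined above
        cv=5,                        # 5-fold cross-validation (assumed)
        scoring='f1_micro',          # micro-averaged F1 score (assumed variant)
        n_jobs=-1                    # use all available processors
    )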

This class receives several parameters through the constructor:

  • estimator – the instance of the machine learning algorithm itself. We pass a new instance of the SVC class here.
  • param_grid – the dictionary of hyperparameter values.
  • cv – determines the cross-validation splitting strategy.
  • scoring – the validation metric used to evaluate the predictions. We use the F1 score.
  • n_jobs – the number of jobs to run in parallel. A value of -1 means that all processors are used.

The only thing left to do is to run the training process by utilizing the fit method:
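
Assuming the training split from earlier, that is a single call:

    grid_search.fit(X_train, y_train)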

Once training is complete, we can check the best hyperparameters and the score of those parameters:
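
They are exposed through the best_params_ and best_score_ attributes:

    print(f'Best parameters: {grid_search.best_params_}')
    print(f'Best score: {grid_search.best_score_}')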

Also, we can print out all the results:
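
All evaluated combinations live in the cv_results_ attribute; wrapping it in a DataFrame makes it easier to read:

    import pandas as pd

    results = pd.DataFrame(grid_search.cv_results_)
    print(results[['params', 'mean_test_score', 'rank_test_score']])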

Ok, let’s now build this model and check how well it performs on the test dataset:
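
A sketch of that step, reusing the micro-averaged F1 score (an assumption) as the test metric:

    from sklearn.svm import SVC
    from sklearn.metrics import f1_score

    # Train a fresh model with the best hyperparameters found by Grid Search
    grid_model = SVC(**grid_search.best_params_)
    grid_model.fit(X_train, y_train)

    predictions = grid_model.predict(X_test)
    print(f'F1 score on the test set: {f1_score(y_test, predictions, average="micro")}')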

Cool, our model with the proposed hyperparameters achieved an accuracy of ~97%. Here is what the model looks like when plotted:

Random Search

Grid Search is super simple. However, it is also computationally expensive, especially in the area of deep learning, where training can take a lot of time. Also, it can happen that some of the hyperparameters are more important than others. That is why the idea of Random Search was born and introduced in this paper. In fact, that study shows that random search is more efficient than grid search for hyperparameter optimization in terms of compute cost. This technique also allows more precise discovery of good values for the important hyperparameters.

Just like Grid Search, Random Search creates a grid of hyperparameter values and selects random combinations to train the model. It is possible for this approach to miss the optimal combination; however, it surprisingly often picks very good values, and in a fraction of the time Grid Search would need. Let’s see how that works in the code. Again we utilize Sci-Kit Learn’s SVC class, but this time we use the RandomizedSearchCV class for random search optimization.
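
A sketch of that setup is below; the distribution ranges, the number of sampled combinations (n_iter) and the scoring string are assumptions chosen to be consistent with the results discussed next:

    from scipy.stats import uniform
    from sklearn.svm import SVC
    from sklearn.model_selection import RandomizedSearchCV

    # Continuous distributions for C and gamma, a discrete list for the kernel
    hyperparameters = {
        'C': uniform(loc=0, scale=1000),   # samples C uniformly from [0, 1000]
        'gamma': uniform(loc=0, scale=1),  # samples gamma uniformly from [0, 1]
        'kernel': ['linear', 'rbf']
    }

    random_search = RandomizedSearchCV(
        estimator=SVC(),
        param_distributions=hyperparameters,
        n_iter=100,          # number of random combinations to try (assumed)
        cv=5,
        scoring='f1_micro',
        n_jobs=-1,
        random_state=33
    )

    random_search.fit(X_train, y_train)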

Note that we used a uniform distribution for C and gamma. Again, we can print out the results:
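
As before, the best combination and its score are available after fitting:

    print(f'Best parameters: {random_search.best_params_}')
    print(f'Best score: {random_search.best_score_}')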

Note that we got results that are close to, but different from, those we got with Grid Search. The value of the hyperparameter C was 500 with Grid Search, while with Random Search we got 510.59. From this alone, you can see the benefit of Random Search, since it is unlikely that we would have put that exact value in the grid search list. Similarly, for gamma we got 0.23 with Random Search against 0.1 with Grid Search. What is really surprising is that Random Search picked the linear kernel, not RBF, and that it got a higher F1 score with it. To print all the results, we use the cv_results_ attribute:
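
Just like with Grid Search, a DataFrame view keeps it readable:

    import pandas as pd

    results = pd.DataFrame(random_search.cv_results_)
    print(results[['params', 'mean_test_score', 'rank_test_score']])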

Let’s do the same thing we did for Grid Search: create the model with the proposed hyperparameters, check the score on the test dataset and plot the model.
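
A sketch of the first two steps (the plot follows the same recipe as before):

    from sklearn.svm import SVC
    from sklearn.metrics import f1_score

    # Train a fresh model with the best hyperparameters found by Random Search
    random_model = SVC(**random_search.best_params_)
    random_model.fit(X_train, y_train)

    predictions = random_model.predict(X_test)
    print(f'F1 score on the test set: {f1_score(y_test, predictions, average="micro")}')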

Wow, the F1 score on the test dataset is exactly the same as when we used Grid Search. Check out the model:

Bayesian Optimization

A really cool fact about the previous two algorithms is that all the experiments with various hyperparameter values can be run in parallel. This can save us a lot of time. However, this is also their biggest drawback. Since every experiment is run in isolation, we cannot use information from past experiments in the current one. There is a whole field dedicated to this problem of sequential optimization – sequential model-based optimization (SMBO). Algorithms explored in this field use previous experiments and observations of the loss function, and based on them try to determine the next optimal point. One such algorithm is Bayesian Optimization.

Just like other algorithms from the SMBO group, Bayesian Optimization uses previously evaluated points (in this case hyperparameter values, but we can generalize) to compute the posterior expectation of what the loss function looks like. This algorithm uses two important mathematical concepts – the Gaussian process and the acquisition function. While a Gaussian distribution is defined over random variables, a Gaussian process is its generalization over functions. Just like a Gaussian distribution is described by a mean value and covariance, a Gaussian process is described by a mean function and a covariance function.

The acquisition function is the function we use to evaluate the current loss value. One way to think of it is as a loss function for the loss function. It is a function of the posterior distribution over the loss function, and it describes the utility of all values of the hyperparameters. The most popular acquisition function is the expected improvement.
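
In its standard form, for a loss function that we want to minimize, it can be written as:

    EI(x) = \mathbb{E}\left[ \max\left( f(x') - f(x),\; 0 \right) \right]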

where f is the loss function and x' is the current optimal set of hyperparameters. When we put it all together, Bayesian optimization is done in 3 steps:

  • Using the previously evaluated points of the loss function, the posterior expectation is calculated using the Gaussian process.
  • A new set of points that maximizes the expected improvement is chosen.
  • The loss function at the newly selected points is calculated.

The easiest way to bring this to code is by using the Sci-Kit Optimization library, often called skopt. Following the process that we used in the previous examples, we can do the following:
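
A sketch of that setup; the search space bounds, the priors and n_iter are assumptions rather than the exact values used in the accompanying code:

    from skopt import BayesSearchCV
    from skopt.space import Real, Categorical
    from sklearn.svm import SVC

    # Search space defined with skopt's dimension classes
    hyperparameters = {
        'C': Real(1e-3, 1e+3, prior='log-uniform'),
        'gamma': Real(1e-4, 1e+1, prior='log-uniform'),
        'kernel': Categorical(['linear', 'rbf'])
    }

    bayes_search = BayesSearchCV(
        estimator=SVC(),
        search_spaces=hyperparameters,
        n_iter=30,           # number of optimization steps (assumed)
        cv=5,
        scoring='f1_micro',
        n_jobs=-1,
        random_state=33
    )

    bayes_search.fit(X_train, y_train)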

Again, we define a dictionary for the set of hyperparameters. Note that we use the Real and Categorical classes from the Sci-Kit Optimization library. Then we utilize the BayesSearchCV class in the same way we used GridSearchCV or RandomizedSearchCV. After the training is done, we can print out the best results:
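
Just like with the Sci-Kit Learn search classes:

    print(f'Best parameters: {bayes_search.best_params_}')
    print(f'Best score: {bayes_search.best_score_}')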

Interesting, isn’t it? We got quite different results using this optimization. The loss is a bit higher than when we used Random Search. We can even print out all the results:
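
Again through the cv_results_ attribute:

    import pandas as pd

    results = pd.DataFrame(bayes_search.cv_results_)
    print(results[['params', 'mean_test_score', 'rank_test_score']])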

How does the model with these hyperparameters perform on the test dataset? Let’s find out:
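
Following the same recipe as before:

    from sklearn.svm import SVC
    from sklearn.metrics import f1_score

    # Train a fresh model with the best hyperparameters found by Bayesian Optimization
    bayes_model = SVC(**bayes_search.best_params_)
    bayes_model.fit(X_train, y_train)

    predictions = bayes_model.predict(X_test)
    print(f'F1 score on the test set: {f1_score(y_test, predictions, average="micro")}')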

This is super interesting. We got a better score on the test dataset even though we got worse results on the validation dataset. Here is the model:

Just for fun, let’s put all these models side by side:
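
If you want to reproduce a similar comparison figure, a rough matplotlib sketch could look like this, assuming the three tuned models (grid_model, random_model and bayes_model) from the previous sections are still in scope:

    import numpy as np
    import matplotlib.pyplot as plt

    models = {
        'Grid Search': grid_model,
        'Random Search': random_model,
        'Bayesian Optimization': bayes_model
    }

    # Mesh over the two scaled culmen features
    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
        np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))

    # Map the string class labels to integers for coloring
    class_to_int = {label: index for index, label in enumerate(np.unique(y))}

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    for ax, (title, model) in zip(axes, models.items()):
        # The predicted class at every mesh point defines the decision regions
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = np.vectorize(class_to_int.get)(Z).reshape(xx.shape)
        ax.contourf(xx, yy, Z, alpha=0.3)
        ax.scatter(X_train[:, 0], X_train[:, 1],
                   c=[class_to_int[label] for label in y_train])
        ax.set_title(title)
    plt.show()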

Alternatives

In general, the previously described methods are the most popular and the most frequently used. However, there are several alternatives that you can consider if those are not working out for you. One of them is gradient-based optimization of hyperparameter values. This technique calculates the gradient with respect to the hyperparameters and then optimizes them using the gradient descent algorithm. The problem with this approach is that, for gradient descent to work well, we need a function that is convex and smooth, which is often not the case with hyperparameters. Another approach is the use of evolutionary algorithms for optimization.

Conclusion

In this article, we covered several well-known hyperparameter optimization and tuning algorithms. We learned how we can use Grid Search, Random Search and Bayesian Optimization to get the best values for our hyperparameters. We also saw how we can utilize Sci-Kit Learn classes and methods to do so in code.

Thank you for reading!

Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic is CAIO at Rubik’s Code and the author of the book “Deep Learning for Programmers“. He loves knowledge sharing and is an experienced speaker. You can find him speaking at meetups and conferences, and as a guest lecturer at the University of Novi Sad.

Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.