Code that accompanies this article can be downloaded here.


In the previous article, we looked at the basics of machine learning and were introduced to how the ML.NET framework works. For that purpose, we used the Iris Dataset, which is a very basic classification problem. Let’s take it up a notch and try to solve something a bit more advanced. In this article, we will see how we can apply the same concepts from the previous article to a regression problem. If you remember, regression problems require the prediction of a quantity. The output value is continuous, meaning it is a real value, such as an integer or floating-point value. Let’s check out our regression problem – Bike Sharing Demands.

Bike Sharing Demands Dataset

Bike sharing systems are a modern means of transportation. They work much like a car rental system, but for bikes. The whole process of obtaining a membership, renting a bicycle and returning it is automated via a network of kiosk locations in a city. People can rent a bicycle at one location and return it at a different one. At the moment, more than 500 cities around the world have this kind of system.

One of the benefits of these systems for us data researchers is that they record everything, from the departure and arrival locations to the duration of the trip. This way we can use these systems as a sensor network and study various topics, for example mobility in a city or environmental and health issues. That is exactly what we have on our hands in this Bike Sharing Demands Dataset.

This dataset contains the hourly (hour.csv) and daily (day.csv) counts of rental bikes for the years 2011 and 2012 in the Capital Bikeshare program in Washington, D.C., with the corresponding weather and seasonal information. Here is how that looks:

Both files, hour.csv and day.csv, have the following attributes, except for the hr attribute, which is only available in the hour.csv file:

  • Instant – sample index
  • Dteday – Date when the sample was recorded
  • Season – Season in which sample was recorded
    • Spring – 1
    • Summer – 2
    • Fall – 3
    • Winter – 4
  • Yr – Year in which sample was recorded
    • The year 2011 – 0
    • The year 2012 – 1
  • Mnth – Month in which sample was recorded
  • Hr – Hour in which sample was recorded
  • Holiday – Whether the day is a holiday or not (extracted from [Web Link])
  • Weekday – Day of the week
  • Workingday – 1 if the day is neither a weekend nor a holiday, otherwise 0
  • Weathersit – Kind of weather on the day when the sample was recorded
    • Clear, Few clouds, Partly cloudy – 1
    • Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist – 2
    • Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds – 3
    • Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog – 4
  • Temp – Normalized temperature in Celsius.
  • Atemp – Normalized feeling temperature in Celsius.
  • Hum – Normalized humidity.
  • Windspeed – Normalized wind speed.
  • Casual – Count of casual users
  • Registered – Count of registered users
  • Cnt – Count of total rental bikes including both casual and registered

For the purpose of this article, we will use only the hourly samples and try to create a model that can predict the total count of bicycle rentals.

Feature Engineering

Now, before we jump to the implementation, let’s do some feature analysis of the recorded data. This is an important step in building every machine learning model. At the moment ML.NET does not offer many features for this kind of analysis, so all diagrams were created using Python. Nevertheless, we will learn quite a few things here, so let’s check out what our findings were.

The first thing we checked was missing data, and we detected no empty or improper values in the dataset. After that, we did an outlier analysis. Outliers are samples that lie far away and diverge from the overall pattern. Using Python, we were able to create this image that shows the distribution of the count in this dataset:

Note the dots that the red arrow points to. These are outliers and there are quite a lot of them. These samples introduce non-linearity into our system. We may choose to remove them from the dataset, but for the purpose of this article, we are going to leave them and see where we land. Outlier removal is another feature that ML.NET will hopefully introduce in the future.
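To make the notion of an outlier a bit more concrete, here is a minimal Python sketch. It assumes the usual box plot convention, in which a point counts as an outlier when it lies more than 1.5 interquartile ranges outside the quartiles; the article does not state which rule was used for the image, and the counts below are made up for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up hourly counts for illustration only (not taken from hour.csv).
counts = [5, 12, 30, 45, 60, 80, 95, 110, 130, 150, 900]
outliers = iqr_outliers(counts)
print(outliers)
```

Removing the outliers would then simply mean filtering these values out of the training data before building the model.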

Finally, let’s do some feature correlation analysis. Basically, we are going to check the correlation between our features and see whether there is a connection between them. Once again, using Python we were able to get this image, which shows a matrix with the levels of dependency between some of the features:

What we wanted to see here is the relationship between count and the other features, including some that we didn’t expect to affect it. We can see that the registered feature has a big influence on the final count and that the influence of windspeed is close to zero. This tells us one important thing: we need to remove the registered feature, and with it the casual feature, because they are leakage variables. The total count is simply the sum of casual and registered users, so these two features are only known once the count itself is known. Keeping them would create overly optimistic, if not completely invalid, predictive models, and that is why we have to remove them.
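As a small illustration of why these two columns leak the answer, the following Python sketch computes the Pearson correlation by hand. The numbers are made up and only mimic the structure of the dataset, where cnt is the sum of casual and registered; the pearson helper is written here for the example and is not part of any library used in the article.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up values for illustration only (not taken from hour.csv).
casual = [3, 8, 5, 13, 1, 20]
registered = [13, 32, 27, 93, 10, 150]
cnt = [c + r for c, r in zip(casual, registered)]  # the label is their sum

print(round(pearson(registered, cnt), 3))
print(round(pearson(casual, cnt), 3))
```

A model that is given casual and registered only has to learn to add two numbers, which says nothing about how well it would predict demand from weather and calendar data.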

Implementation

Overall, we are going to follow the same approach as in the previous article, but we want to go one step further. Let’s create a solution with which we can easily try out different regression algorithms from ML.NET on this dataset and evaluate which one works best. The whole code that accompanies this article can be found here. So, let’s dive into the implementation.

Data

The first thing we need to do is separate the data into a training set and a test set. The training set will be used to build and train the model, and the test set will be used to evaluate the performance of that model. A common practice is to split the whole dataset in an 80:20 ratio, and that is exactly what we did:
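One way such a split could be done is sketched below in Python. This is only an assumption about the procedure: the article does not say whether the rows were shuffled before splitting, and the split_rows function, the seed and the placeholder rows are hypothetical.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder rows standing in for the data lines of hour.csv.
rows = [f"row-{i}" for i in range(10)]
train, test = split_rows(rows)
print(len(train), len(test))
```

For hourly data, a chronological cut (the last 20% of the rows as the test set) is a reasonable alternative to shuffling, since it evaluates the model on hours that come after everything it was trained on.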

Another thing we need to do is create classes that will hold data from this dataset. That is why our BikeSharingDemandData folder contains two classes: BikeSharingDemandSample and BikeSharingDemandPrediction. Information from our dataset files will end up in these objects, and then we will be able to use them for training our model and making predictions. Take a look at the way they are implemented:

As you can see, we are not using all of the data from the dataset, i.e. we skipped some of the columns. We removed the dteday attribute since all of that information is available in other attributes. The registered and casual features are not used either. Another thing you should pay attention to is that, in the prediction class BikeSharingDemandPrediction, the output property should be decorated with the ColumnName("Score") attribute.
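The column selection described above can be sketched in Python as follows. This is not the article’s C# class: BikeSample and read_samples are hypothetical names, dropping the instant index is an assumption on our part, and the data row only illustrates the hour.csv column layout. Columns are picked by header name rather than by position, which avoids reading a neighbouring column by mistake.

```python
import csv
import io
from dataclasses import dataclass

# Columns kept as features; dteday, casual and registered are left out
# (instant is also dropped here, which is an assumption of this sketch).
FEATURES = ["season", "yr", "mnth", "hr", "holiday", "weekday",
            "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"]
LABEL = "cnt"


@dataclass
class BikeSample:
    features: dict  # feature name -> numeric value
    count: float    # label: total number of rentals in that hour


def read_samples(csv_file):
    """Read samples from an open CSV file, selecting columns by name."""
    reader = csv.DictReader(csv_file)
    return [
        BikeSample({name: float(row[name]) for name in FEATURES},
                   float(row[LABEL]))
        for row in reader
    ]


# One illustrative row in the hour.csv column layout.
DATA = (
    "instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,"
    "weathersit,temp,atemp,hum,windspeed,casual,registered,cnt\n"
    "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16\n"
)
samples = read_samples(io.StringIO(DATA))
print(samples[0].count)
```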

Building a model

For building the model we use the ModelBuilder class. This class is a bit different from the one in the previous article. Take a look:

In a nutshell, this class receives the algorithm that should be used for building and training the model through its constructor. That way, we are able to reuse this builder for different algorithms. It receives the training data location through the constructor too. Apart from that, this class has the BuildAndTrain method. This method constructs a LearningPipeline – the class that is used for defining the tasks our model needs to do. It encapsulates the data loading, data processing/featurization, and the learning algorithm.

We are adding a few things to our pipeline. We add a TextLoader, which picks up data from the .csv files and loads it into BikeSharingDemandSample objects. Then we add a ColumnCopier, and we gather features of the same type using a ColumnConcatenator. Finally, we add the chosen algorithm to the pipeline.
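The idea of handing the algorithm to the builder through its constructor is independent of ML.NET. As a rough sketch of that design only, here is a small Python analogue. MeanRegressor, the method names and the tiny CSV file are hypothetical stand-ins; they are not ML.NET types and not the classes from the accompanying code.

```python
import csv
import os
import tempfile


class MeanRegressor:
    """Hypothetical stand-in learner: always predicts the mean training label."""

    def fit(self, features, labels):
        self.mean_ = sum(labels) / len(labels)
        return self

    def predict(self, feature_row):
        return self.mean_


class ModelBuilder:
    """Receives the data location and the algorithm through the constructor."""

    def __init__(self, training_data_location, algorithm):
        self.training_data_location = training_data_location
        self.algorithm = algorithm

    def build_and_train(self, feature_names, label_name):
        # Step 1: load the rows from the CSV file.
        with open(self.training_data_location, newline="") as f:
            rows = list(csv.DictReader(f))
        # Step 2: pick the label and gather the features into one vector per row.
        labels = [float(r[label_name]) for r in rows]
        features = [[float(r[n]) for n in feature_names] for r in rows]
        # Step 3: hand everything to whichever algorithm was injected.
        return self.algorithm.fit(features, labels)


# Tiny made-up training file, written to a temporary location for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("hr,temp,cnt\n0,0.24,16\n1,0.22,40\n2,0.22,32\n")
    path = tmp.name

model = ModelBuilder(path, MeanRegressor()).build_and_train(["hr", "temp"], "cnt")
os.remove(path)
print(model.predict([3, 0.24]))
```

Swapping in another learner only requires passing a different object with the same fit and predict methods, which is what makes the builder reusable.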

Evaluation

In the previous article, ModelBuilder had an Evaluate method too. However, I realized that I was violating the Single Responsibility Principle that way, so we are going to use a new class for evaluation – ModelEvaluator. Here is how that looks:

This class has one method – Evaluate. This method receives the model and the test data location and returns RegressionMetrics. This metrics class contains different scores for our regression model. We are going to take a few of them into consideration, as we will see in a bit.

Workflow

The Main method from the Program class still handles the workflow of our application. Now, we want to try out different algorithms and see how each one of them performs on this set of data. That is why the Main method is implemented this way:

To sum it up, we first initialized our training and test data locations and created a ModelEvaluator object. Then we used ModelBuilder to create different types of models, which we later evaluated using that ModelEvaluator object. Finally, we printed the metrics that we got using the PrintMetrics method:

As you can see, we printed a few metrics:

  • R2 Score
  • Absolute loss
  • Squared loss
  • Root mean squared loss
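For readers who want to see what these four numbers measure, here is a short Python sketch using the standard textbook definitions. It assumes the losses are averaged over the test samples; the function name and the sample values are ours and are not taken from ML.NET or from the article’s results.

```python
import math


def regression_metrics(actual, predicted):
    """Compute R2, absolute, squared and root mean squared loss."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    absolute_loss = sum(abs(e) for e in errors) / n   # mean absolute error
    squared_loss = sum(e * e for e in errors) / n     # mean squared error
    rms_loss = math.sqrt(squared_loss)                # root of the squared loss
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    # R2 compares the model's squared error with that of always predicting the mean.
    r2 = 1 - sum(e * e for e in errors) / total_ss
    return {"r2": r2, "absolute": absolute_loss,
            "squared": squared_loss, "rms": rms_loss}


# Made-up values for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

An R2 score close to 1 means the model explains most of the variation in the count, while the loss values are expressed in (squared) numbers of rentals and are better the smaller they are.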

Finally, in the output of our application, we can see how our models performed:

As you can see, FastTreeTweedieRegressor had the best R2 score (0.92) and the smallest absolute loss (40.3). We will consider this our best model and preview its predictions.

Visualising

If you take a look back at the Main method, you will notice a call to the VisualizeTenPredictionsForTheModel method, which we didn’t cover. So, let’s see how this method is implemented and what it does:

This method uses the model and the test data location to print out predictions for the first ten samples in the dataset. For this purpose, it uses the BikeSharingDemandsCsvReader class, or to be more precise, its GetDataFromCsv method. This is how the BikeSharingDemandsCsvReader class looks:

Take a look at the output of this operation:

Conclusion

ML.NET is only at version 0.2, but it shows great features so far. The set of regression learners is quite big, and they cover a variety of algorithms. What we haven’t really tried here is using the feature engineering classes and tools that ML.NET provides, like the CrossValidator class for example. In the next few articles, we will investigate how to do that as well, but for now the accent was on solving a real-world regression problem in the .NET environment. I am really satisfied with how it turned out. What do you think?


Read more posts from the author at Rubik’s Code.


 

5 comments

  1. In BikeSharingDemandSample and BikeSharingDemandsCsvReader classes, you skip the ‘workingday’ column in the class definitions. As a result, when you fetch column 15, you’re actually sampling the ‘registered’ column instead of the ‘cnt’ column (16) in the data sets. Am I missing something? Did you skip ‘workingday’ intentionally and also wanted to predict registered users? This does seem to impact the R2 scores when you change both to column 16.

    1. Hi Tod,

      thank you for reading! And thank you for noticing the issue. I made a mistake and took the wrong column – registered instead of cnt. It is funny how it affected the results just a little bit, because of the high correlation between those features. Essentially, we now get slightly better results with the fast tree Tweedie model. I fixed all of this in the code on the repo and in the article as well.
      Thanks for noticing once again.

      Regards,
      Nikola
