In the previous articles, we explored the possibilities of ML.NET, Microsoft’s machine learning framework, quite a bit. In the first article of the series, we got familiar with machine learning concepts and ran through some ML.NET basics. Then we solved several different real-world scenarios for regression and classification using this framework. In the end, we used the trained machine learning model in an ASP.NET Core application.

The whole idea was to explain machine learning basics to .NET developers using version 0.2 of ML.NET (as I write this, version 0.3 of ML.NET has been released). The official first version of ML.NET is planned to be released with .NET Core 3.0. However, in all these articles we didn’t say much about one integral part of machine learning – data exploration. We only mentioned it in the second and third articles of the series, but we used Python for it because ML.NET is currently limited in this area.

Still, data exploration and data treatment are very important steps in machine learning, especially when we want to improve the results of the trained model. A lot of problems are solved just by visualizing the data and applying proper cleaning and transformation techniques. So, let’s run through the steps of data exploration, see how they are done in different languages, and where we can expect improvements in ML.NET.

Data Exploration Steps

In general, the output quality of our machine learning model depends on the quality of our input data. That is why we should spend a lot of time on data preparation. This is probably the most creative part of data science as well. The initial data that we get can be chaotic and messy. Sometimes it is so hectic that coming up with a hypothesis can be quite challenging. However, once we come up with a hypothesis and decide what our input and output data are, we can follow certain steps:

  • Univariate and Bi-Variate Analysis
  • Missing values treatment
  • Outliers detection and treatment
  • Feature Engineering

Often we have to iterate a few times through these steps before we come up with the best solution. Let’s explore these steps in more detail.

Univariate and Bi-variate Analysis

Features in our dataset can have a different nature. They can be categorical, meaning they represent discrete values, or they can be continuous. Apart from that, each of these features can have its own specifics. By applying univariate analysis, we explore the nature of each individual feature in our dataset.

For continuous features, we need to understand the central tendency and spread of the values. We use statistical metrics and visualization methods for this. Central tendency is obtained by finding the mean, median, maximal and minimal values of the feature. There is no nice way to do this using ML.NET. We have to get the data using some other mechanism, like creating custom CSV readers, and then apply these functions. The situation is similar for the spread of the data, where we want to get the range, variance and standard deviation of our feature.
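As a quick illustration, here is how these statistics can be computed with pandas in Python; the temperature column and its values are a made-up example:


```python
import pandas as pd

# Hypothetical continuous feature
data = pd.DataFrame({'temperature': [12.1, 14.3, 13.8, 15.0, 12.9, 14.1]})

# Central tendency: mean, median, minimum and maximum
mean = data['temperature'].mean()
median = data['temperature'].median()
minimum = data['temperature'].min()
maximum = data['temperature'].max()

# Spread: range, variance and standard deviation
value_range = maximum - minimum
variance = data['temperature'].var()
std_dev = data['temperature'].std()

print(mean, median, value_range, variance, std_dev)
```

Calling data.describe() gives most of these numbers for every column at once, which is often all the univariate analysis we need to start with.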

For categorical features, we need to understand the frequency and count of each value. This way we get a better picture of the distribution of the data, and a better feeling for what kind of results we can expect. Again, we can do this by coming up with our own mechanisms, since ML.NET doesn’t provide many possibilities. Python and R, at the moment, have more suitable solutions for this kind of data analysis.
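In pandas, for instance, getting the count and frequency of each value is a one-liner; the season column below is a made-up example:


```python
import pandas as pd

# Hypothetical categorical feature
data = pd.DataFrame({'season': ['summer', 'winter', 'summer',
                                'spring', 'summer', 'winter']})

# Count of each distinct value
counts = data['season'].value_counts()

# Relative frequency of each distinct value
frequencies = data['season'].value_counts(normalize=True)

print(counts)
print(frequencies)
```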

Apart from analyzing each feature individually, we need to observe how they affect each other. If the correlation between two features is too strong, we might want to remove one of them. We had a situation like that in the Bike Sharing Demand example. There we detected highly correlated features using Python and simply didn’t select them from the dataset. Correlation represents the strength of the relationship between features and varies between -1 (negative linear correlation) and +1 (positive linear correlation).

Using the pandas, numpy, seaborn and pyplot libraries in Python, we can get a nice visual correlation analysis:


import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv', sep=';', header=0)

# Compute the correlation matrix and mask one triangle,
# so every feature pair is shown only once in the heatmap
corrMatt = data.corr()
mask = np.array(corrMatt)
mask[np.tril_indices_from(mask)] = False

fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sn.heatmap(corrMatt, mask=mask, vmax=.8, square=True, annot=True)

This will give us the following output:

There is one more way of removing features from the dataset in ML.NET. That can be done by adding a ColumnDropper to the LearningPipeline. Something like this:


pipeline.Add(new ColumnDropper() { Column = "NameOfTheFeature" });

Missing Values Treatment

Sometimes data is missing from our dataset. Missing data in the training dataset can reduce the strength of a model and can also lead to a biased model. We had that situation when we were investigating the Wine Quality dataset. The way we approach missing data can have a huge effect on the final model. Again, detection of missing data is better handled in Python. For example, we can use the missingno library in combination with pandas for detecting missing data:


import missingno as msno
import pandas as pd

data = pd.read_csv('data.csv', sep=';', header=0)

# Visualize the completeness of every column in the dataset
msno.matrix(data, figsize=(10, 3))

This would give us a nice visualization of missing data:

In ML.NET, missing values are detected by adding the MissingValueIndicator class to the pipeline. This class creates a boolean output column with the same number of slots as the input column, where the output value is true if the value in the input column is missing.


pipeline.Add(new MissingValueIndicator("FeatureName"));

Even though we don’t have the best mechanism for detecting missing data in ML.NET, we have a good way to handle it – the MissingValueSubstitutor class. When we add this class to the pipeline, we replace the missing values of a certain feature with some value. There are several options here, but essentially we would do this:


pipeline.Add(new MissingValueSubstitutor("NameOfTheFeature") { ReplacementKind = NAReplaceTransformReplacementKind.Mean});

As you can see, we can define a different ReplacementKind for each feature; here we used the Mean option from the NAReplaceTransformReplacementKind enumeration. This means that all missing values for that feature will be replaced with the mean value of that feature. Here are the other options that we can select using this enumeration:


public enum NAReplaceTransformReplacementKind
{
    Default = 0,
    DefaultValue = 0,
    Def = 0,
    Mean = 1,
    Min = 2,
    Minimum = 2,
    Max = 3,
    Maximum = 3,
    SpecifiedValue = 4,
    Val = 4,
    Value = 4
}

Sometimes we just don’t want to take missing values into our calculations at all. We can remove rows with missing values by adding a MissingValuesDropper to the pipeline:


pipeline.Add(new MissingValuesDropper("FeatureName"));
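For comparison, both treatments – substituting the mean and dropping rows – are one-liners in pandas; the alcohol column below is a made-up sample, not the actual Wine Quality data:


```python
import numpy as np
import pandas as pd

# Hypothetical feature with two missing values
data = pd.DataFrame({'alcohol': [9.4, np.nan, 10.2, 11.0, np.nan]})

# Replace missing values with the mean of the feature
imputed = data['alcohol'].fillna(data['alcohol'].mean())

# Or drop the rows that contain missing values
dropped = data.dropna()

print(imputed.tolist(), len(dropped))
```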

Outlier Detection and Treatment

Outliers are values in our dataset that diverge from the overall pattern. These values are far away from the majority of values that we have for some feature. For example, if for some continuous feature we determine that most of the values fall in the range from 0 to 11, a record with the value 3300 is an outlier. These values can be just an error or a natural occurrence, and depending on that, we decide what we want to do with them. Outliers can have a drastic impact on a model because they change the statistics of the whole dataset.

In general, in ML.NET we don’t have elegant ways to detect outliers like we have in Python, where we can use a box plot, histogram or scatter plot. In one of the previous examples we used a box plot:

Note the dots that the red arrow points to. These are outliers, and there are quite a lot of them. These samples introduce non-linearity into our system. We may choose to remove them from the dataset or treat them separately. Sometimes we can modify their values to the median or mean value. ML.NET does not yet provide a nice way to detect or handle outliers.
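Until such support arrives, we can detect outliers manually, for example with the classic interquartile range (IQR) rule. This is a generic Python sketch with made-up values, not an ML.NET feature:


```python
import pandas as pd

# Hypothetical feature where 3300 clearly diverges from the pattern
data = pd.DataFrame({'count': [3, 5, 4, 6, 5, 4, 7, 3300]})

# IQR rule: values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are outliers
q1 = data['count'].quantile(0.25)
q3 = data['count'].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data['count'] < lower) | (data['count'] > upper)]
print(outliers)
```

These are exactly the points that a box plot draws as individual dots outside the whiskers.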

Feature Engineering

Feature engineering is the science and art of making the data you have more useful. This step always comes last, and it is the most useful one. The general idea is to extract more information by transforming features and creating new ones. What do we mean by creating new features? We are not inventing new data, but sometimes we just need a piece of information from one feature. For example, we would apply this technique when we need just the day from a feature that contains a full date. We would create a new feature with the information extracted from the date feature.
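In pandas, for example, such extraction is straightforward; the datetime values below are made up:


```python
import pandas as pd

# Hypothetical feature that contains full timestamps
data = pd.DataFrame({'datetime': pd.to_datetime(['2018-07-01 10:00',
                                                 '2018-07-15 18:30'])})

# Create new features with just the pieces of information we need
data['day'] = data['datetime'].dt.day
data['hour'] = data['datetime'].dt.hour

print(data[['day', 'hour']])
```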

Also, sometimes we want to transform our data. This process cannot be random, however. By transformation, we mean the replacement of a value by a function of that value. For instance, we can replace all values in one feature with their squares. We can use a cube root or logarithm as a transformation as well. In other words, a transformation is a process that changes the distribution of a variable or its relationship with others. Let’s see some examples of feature transformation.
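As a sketch, here is how logarithmic and square root transformations look in Python with numpy; the income column is a made-up example of a skewed feature:


```python
import numpy as np
import pandas as pd

# Hypothetical feature whose values span several orders of magnitude
data = pd.DataFrame({'income': [1000, 10000, 100000]})

# Logarithmic transformation compresses a heavily skewed distribution
data['log_income'] = np.log(data['income'])

# Square root is a milder alternative
data['sqrt_income'] = np.sqrt(data['income'])

print(data)
```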

Feature Scaling and Normalization

One of the most common problems that we can have in our dataset is that different features use different scales. A model can misinterpret this and favor one feature over another. That is why we often apply feature scaling. The idea is to bring all features onto the same scale without breaking the data distribution. In Python, we can use the sklearn library and its StandardScaler class:


# Scaling the data
from sklearn.preprocessing import StandardScaler

# X_train and X_test are the training and test feature matrices
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In ML.NET we have a variety of so-called normalizers that can be added to the pipeline. These normalizers live in the Microsoft.ML.Transforms namespace and handle various kinds of transformations. Using them, we can apply standard logarithmic and mean-variance normalization of the data. Also, as we saw in one of the previous articles, it is possible to use BinNormalizer for a binning transformation, which turns a continuous feature into a discrete one. Here are some examples:


// Normalizes the data based on the computed mean and variance of the logarithm of the data.
pipeline.Add(new LogMeanVarianceNormalizer("FeatureName"));
// Normalizes the data based on the computed mean and variance of the data.
pipeline.Add(new MeanVarianceNormalizer("FeatureName") { FixZero = true });
// Normalizes the columns only if needed.
pipeline.Add(new ConditionalNormalizer("FeatureName"));
// The values are assigned into equidensity bins and a value is mapped to its bin_number / number_of_bins.
pipeline.Add(new BinNormalizer("FeatureName") { NumBins = 2 });


One Hot Encoding

Another problem we can encounter when dealing with data is that a categorical feature is represented by continuous values. For example, multiple classes can be represented with the values 0, 1 and 2. Here we can apply so-called one hot encoding. This means that new features will be created – basically, a new feature for each possible value of the categorical feature. In Python, we can use the OneHotEncoder class from sklearn, or np_utils from Keras. Here is how that can easily be done:


import pandas as pd
from keras.utils import np_utils

data = pd.read_csv('data.csv', header=0)

# Split the dataset into features and the categorical target
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Convert the class values into one hot encoded vectors
y_categorical = np_utils.to_categorical(y)

And here is how the data looks before and after this method is applied:
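For completeness, here is a rough sketch of the sklearn variant mentioned above; the class values 0, 1 and 2 are a made-up example:


```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A categorical feature encoded as the continuous values 0, 1 and 2
y = np.array([[0], [1], [2], [1]])

# Each distinct value becomes its own binary column
encoder = OneHotEncoder()
y_encoded = encoder.fit_transform(y).toarray()
print(y_encoded)
```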

In ML.NET, we can add a CategoricalOneHotVectorizer or CategoricalHashOneHotVectorizer to the pipeline for the same effect. Something like this:


pipeline.Add(new CategoricalHashOneHotVectorizer("FeatureName"));
pipeline.Add(new CategoricalOneHotVectorizer("FeatureName"));

These are just some of the transformations that we can apply on the data.

Conclusion

In previous articles, we also emphasized that ML.NET can improve a lot in the area of data exploration. Python has better visualization functions, and its approach seems more user-friendly for data exploration at the moment. ML.NET has a lot of room for improvement in this area, and it will be interesting to see what the folks at Microsoft come up with by version 1.0. However, once we know what we need to handle, ML.NET has a variety of tools that we can make use of. Here Microsoft’s framework is on the same level as the other languages and frameworks.

Thank you for reading!


Read more posts from the author at Rubik’s Code.


 
