Reading deep learning papers is always fun and educational, especially when they cover the area of the project you are currently working on. However, these papers often describe architectures and solutions that are hard to train, particularly if you want to try out some of the winners of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). I remember reading about VGG16 and thinking “That is all cool, but my GPU is going to die”. Thanks to the concept of transfer learning, we can overcome that and still use the most prestigious architectures. In this article, we explore transfer learning with famous Convolutional Neural Networks and show how it can be achieved with ML.NET.
1. Dataset and Prerequisites
In this article, we use the “Cats vs Dogs” dataset. This dataset contains 23,262 images of cats and dogs. It is available on Kaggle.
When we unzip the archive, we find two folders: train1 and test. The train1 folder contains the training images that we use in this tutorial. Notice that image names start with cat or dog. These are essentially our labels, which means that the target is defined using these names.
The implementations provided here are done in C#, and we use the latest .NET 5, so make sure that you have installed this SDK. If you are using Visual Studio, this comes with version 16.8.3. Also, make sure that you have installed the following packages:
You can do the same from the Package Manager Console:
IMPORTANT NOTE: Unfortunately, we couldn’t make this solution work with the latest SciSharp.TensorFlow.Redist version (at the time of writing, that is version 2.4.1). For this tutorial to work, make sure you are using version 2.3.1 of this NuGet package.
You can do a similar thing using Visual Studio’s Manage NuGet Packages option:
If you need to catch up with the basics of machine learning with ML.NET, check out this article.
2. Transfer Learning Intuition
Historically, image classification is the problem that popularized deep neural networks, especially the visual kind: Convolutional Neural Networks (CNNs). We will not go into detail about what CNNs are and how they work. However, we can say that CNNs became popular after they broke a record in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) back in 2012. This competition evaluates algorithms for object detection and image classification at a large scale. The dataset it provides contains 1000 image categories and over 1.2 million images. The goal of an image classification algorithm is to correctly predict which class an object belongs to. Since 2012, every winner of this competition has used CNNs.
Training deep neural networks can be computationally expensive and time-consuming. To get really good results, you need a lot of computing power, which means a lot of GPUs, and this means…well, a lot of money. You could of course train these big architectures and get SOTA results in cloud environments, but this is also quite expensive. For a while, these architectures were out of reach for regular developers. However, the concept of transfer learning changed that, especially for the problem we are solving today: image classification. Today we can use state-of-the-art architectures that won the ImageNet competition thanks to transfer learning and pre-trained models.
At this point one might wonder “What are pre-trained models?”. Essentially, a pre-trained model is a saved network that was previously trained on a large dataset, for example on the ImageNet dataset. There are two ways in which you can use one. You can use it as an out-of-the-box solution, or you can use it with transfer learning. Since large datasets are usually used for some general solution, you can customize a pre-trained model and specialize it for a certain problem. This way you can utilize some of the most famous neural networks without spending too much time and resources on training. Additionally, you can fine-tune these models by modifying the behavior of chosen layers. The whole idea revolves around using the lower layers of a pre-trained CNN model and adding layers on top that customize the architecture for the specific problem.
Essentially, serious image classification solutions are usually composed of two parts. We call them the backbone and the head. The backbone is usually a deep architecture that was pre-trained on the ImageNet dataset, with its top layers removed. The head is the part of the image classification model that is used for the prediction of custom classes. These layers are added on top of the pre-trained model. With these systems, we have two phases: the bottleneck phase and the training phase. During the bottleneck phase, images of the specific dataset are run through the backbone architecture, and the results are stored. During the training phase, the stored output from the backbone is used to train the custom layers.
Over the years, several architectures that won the ImageNet competition became quite popular. Some of them are VGG16, GoogLeNet (Inception), and ResNet. However, these are not the only architectures popular for transfer learning. For example, MobileNet is also often used, because it is easy to train. Let’s dig a little deeper and learn more about each of these architectures.
VGG16 is the first architecture we consider. It is a large convolutional neural network proposed by K. Simonyan and A. Zisserman in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. This network achieves 92.7% top-5 test accuracy on the ImageNet dataset. However, it was trained for weeks. Here is a high-level overview of the model:
GoogLeNet is also called Inception. This is because it utilizes two concepts: the 1×1 convolution and the Inception module. The first concept, the 1×1 convolution, is used as a dimension reduction module. Reducing the number of dimensions also reduces the number of computations, which means that the depth and width of the network can be increased. Instead of using a fixed filter size for each convolution layer, GoogLeNet uses the Inception module:
As you can see, a 1×1 convolution layer, a 3×3 convolution layer, a 5×5 convolution layer, and a 3×3 max pooling layer perform their operations in parallel, and then their results are stacked together again at the output. GoogLeNet has 22 layers in total, and it looks something like this:
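The saving from the 1×1 convolution can be checked with a little arithmetic. The sketch below counts multiplications for a 5×5 convolution applied directly, and for the same convolution preceded by a 1×1 reduction. The layer sizes (a 14×14 feature map, 480 input channels, 16 reduced channels, 48 output channels) are illustrative assumptions and are not taken from this article; stride 1 and "same" padding are also assumed.

```csharp
using System;

public static class ConvCost
{
    // Multiplications needed by a k×k convolution that produces an
    // h×w×outChannels output from an input with inChannels channels
    // (stride 1, "same" padding assumed).
    public static long Multiplications(int h, int w, int k, int inChannels, int outChannels) =>
        (long)h * w * outChannels * k * k * inChannels;

    public static void Main()
    {
        // Illustrative sizes only (assumption, not from the article).
        const int h = 14, w = 14, inCh = 480, reducedCh = 16, outCh = 48;

        // 5×5 convolution applied directly to all 480 channels.
        long direct = Multiplications(h, w, 5, inCh, outCh);

        // 1×1 convolution down to 16 channels, then the 5×5 convolution.
        long reduced = Multiplications(h, w, 1, inCh, reducedCh)
                     + Multiplications(h, w, 5, reducedCh, outCh);

        Console.WriteLine($"Direct 5x5:   {direct}");
        Console.WriteLine($"1x1 then 5x5: {reduced}");
        Console.WriteLine($"Ratio:        {(double)direct / reduced:F1}");
    }
}
```

Under these assumed sizes the direct convolution needs about 112.9 million multiplications, while the reduced version needs about 5.3 million, which is the kind of budget that lets the network grow deeper and wider.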
Residual Networks, or ResNet, are the final architecture we are going to use in this article. The problem with the previous architectures is that they are very deep. They have a lot of layers, and because of that they are hard to train (the vanishing gradient problem). ResNet addresses that problem with the so-called “identity shortcut connection”, or residual blocks:
In essence, ResNet follows VGG’s 3×3 convolutional layer design, where each convolutional layer is followed by a batch normalization layer and a ReLU activation function. The difference, however, is that ResNet adds the block’s input back in before the final ReLU. In one variation, the input value is first passed through a 1×1 convolution layer.
The core idea is that a deeper network should not produce a higher training error than a shallower one. The authors of ResNet hypothesize that if you add layers that don’t do anything (identity mappings) to a network, the error should stay the same. This means that letting the layers fit a residual is easier than letting them directly fit the complete desired mapping. This is accomplished by residual blocks.
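Written as a formula, a residual block can be sketched as follows. Here x is the block input, F(x) is the output of the stacked convolution and batch normalization layers, and W_s is the optional 1×1 convolution on the shortcut; the symbol names follow common usage and are not defined elsewhere in this article.

```latex
% Identity shortcut: the input is added back before the final ReLU.
y = \mathrm{ReLU}\big(F(x) + x\big)

% Variation with a 1x1 convolution W_s on the shortcut,
% used when F(x) and x have different shapes.
y = \mathrm{ReLU}\big(F(x) + W_s\,x\big)
```

If the best thing the extra layers can do is nothing, the block only has to learn F(x) ≈ 0, which is easier than learning an identity mapping through several nonlinear layers.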
This is what the complete ResNet architecture looks like:
MobileNet is a family of general-purpose computer vision neural networks designed with mobile devices in mind, supporting classification, detection, and more. In this article, we use MobileNetV2.
4. ML.NET and TensorFlow
ML.NET doesn’t provide a way to build neural networks, apart from a simple perceptron. Essentially, it is not a tool for that, the way TensorFlow and PyTorch are in Python. If you are looking for a way to build a deep neural network from scratch in .NET, you can check out SciSharp Stack. However, you can use ML.NET in combination with TensorFlow (TensorFlow.NET, to be more precise) to use the pre-trained models that TensorFlow provides.
In a nutshell, we add a reference to the ML.NET NuGet packages in our .NET application, and ML.NET includes and references the native TensorFlow library. This gives us the possibility to utilize TensorFlow’s pre-trained models.
5. Implementation with ML.NET
5.1 High-Level Architecture
Before we dive deeper into the implementation, let’s consider its high-level architecture. Here we build a solution that allows us to quickly change the architecture that we use for image classification. The idea is to separate the different parts of the application based on the work they do. The folder structure of our solution looks like this:
The DataModel folder contains classes that model the data. The DataLoader class loads images from the folder, pre-processes them, and splits them into training and test datasets. The ImageClassifier class wraps the model. It uses data provided by the DataLoader to train the model. Finally, the Predictor class uses the test dataset and the trained model to make predictions. The architectural overview can be represented like this:
5.2 Data Models
In order to load data from the images and use it with ML.NET algorithms, we need to implement classes that model this data. Three files can be found in the DataModel folder: ImageData, ModelInput and ModelOutput. The ImageData class contains basic information about the image:
This class contains the image path and the label. The label information is extracted from the filename.
The ModelInput class models the data that is used as input to the model:
The Image property is a byte array holding the byte representation of the image. The model expects image data to be of this type for training. The LabelAsKey property is the encoded value of the Label.
Finally, the ModelOutput class represents the output of the model:
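The original listings are not reproduced here, so what follows is a minimal sketch of what the three classes could look like. Only ImagePath, Label, Image and LabelAsKey are named in the text above; the remaining properties, and the FromFile helper, are assumptions made for illustration. The helper also assumes file names shaped like cat.0.jpg or dog.42.jpg, with the label before the first dot.

```csharp
using System;
using System.IO;

// Basic information about one image on disk.
public class ImageData
{
    public string ImagePath { get; set; }
    public string Label { get; set; }

    // Hypothetical helper: derives the label from the file name,
    // assuming names such as "cat.0.jpg" or "dog.42.jpg".
    public static ImageData FromFile(string path)
    {
        string fileName = Path.GetFileName(path);
        string label = fileName.Split('.')[0].ToLowerInvariant();
        return new ImageData { ImagePath = path, Label = label };
    }
}

// Row shape fed to the trainer.
public class ModelInput
{
    public byte[] Image { get; set; }      // raw image bytes
    public uint LabelAsKey { get; set; }   // encoded value of Label
    public string ImagePath { get; set; }  // assumed pass-through column
    public string Label { get; set; }      // assumed pass-through column
}

// Row shape produced by the trained model (property set is an assumption).
public class ModelOutput
{
    public string ImagePath { get; set; }
    public string Label { get; set; }
    public string PredictedLabel { get; set; }
}
```

For example, ImageData.FromFile("train1/cat.12.jpg") would yield the label "cat" under the naming assumption above.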
5.3 Data Loader
The DataLoader is the class that loads data from the images folder and pre-processes it. It has three public properties, which expose the different datasets. In essence, all the magic happens within the constructor of the class, with the help of several private methods. The constructor needs only the training folder path to do it all. First we initialize the necessary fields, and then we use methods to build the data processing pipeline and pre-process the data:
The BuildDataProcessingPipeline method creates a pipeline that encodes the label and loads images as bytes from the defined folder. This pipeline is then used in LoadAndPrepareData. Essentially, this method does all the heavy lifting in this class. First it calls the LoadImages method, which creates labels for all images in the training folder. Then it shuffles the data and runs it through the pipeline. Finally, it splits the data into three datasets.
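The steps above can be sketched roughly as follows. This is not the original listing: the method names come from the text, while the property names (TrainSet, ValidationSet, TestSet), the split fractions, the fixed seed and the LabelFromFileName helper are assumptions. It needs the Microsoft.ML and Microsoft.ML.ImageAnalytics packages.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;

public class ImageData
{
    public string ImagePath { get; set; }
    public string Label { get; set; }
}

public class DataLoader
{
    private readonly MLContext _mlContext = new MLContext(seed: 1); // seed is an assumption
    private readonly string _imagesFolder;

    public IDataView TrainSet { get; private set; }
    public IDataView ValidationSet { get; private set; }
    public IDataView TestSet { get; private set; }

    public DataLoader(string imagesFolder)
    {
        _imagesFolder = imagesFolder;
        LoadAndPrepareData(BuildDataProcessingPipeline());
    }

    // Encodes the string label as a key and loads each image as raw bytes.
    private IEstimator<ITransformer> BuildDataProcessingPipeline() =>
        _mlContext.Transforms.Conversion.MapValueToKey(
                inputColumnName: "Label", outputColumnName: "LabelAsKey")
            .Append(_mlContext.Transforms.LoadRawImageBytes(
                outputColumnName: "Image",
                imageFolder: _imagesFolder,
                inputColumnName: "ImagePath"));

    private void LoadAndPrepareData(IEstimator<ITransformer> pipeline)
    {
        IDataView images = _mlContext.Data.LoadFromEnumerable(LoadImages());
        IDataView shuffled = _mlContext.Data.ShuffleRows(images);
        IDataView processed = pipeline.Fit(shuffled).Transform(shuffled);

        // Hypothetical fractions: 70% train, then the rest split into validation and test.
        var trainSplit = _mlContext.Data.TrainTestSplit(processed, testFraction: 0.3);
        var validationTestSplit = _mlContext.Data.TrainTestSplit(trainSplit.TestSet, testFraction: 0.5);

        TrainSet = trainSplit.TrainSet;
        ValidationSet = validationTestSplit.TrainSet;
        TestSet = validationTestSplit.TestSet;
    }

    // Assumes file names such as "cat.0.jpg"; the label is the part before the first dot.
    public static string LabelFromFileName(string fileName) => fileName.Split('.')[0];

    // ImagePath is stored relative to the images folder passed to LoadRawImageBytes.
    private IEnumerable<ImageData> LoadImages() =>
        Directory.GetFiles(_imagesFolder, "*.jpg")
            .Select(file => Path.GetFileName(file))
            .Select(name => new ImageData { ImagePath = name, Label = LabelFromFileName(name) });
}
```

Shuffling before the split matters here because the files are listed by name, so all cats would otherwise come before all dogs.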
5.4 Image Classifier
The ImageClassifier class is the class that trains the model. An instance of the DataLoader class is injected into this class because it contains the necessary datasets. Apart from that, this class receives the type of architecture as an input, expressed with the ImageClassificationTrainer.Architecture enumeration.
The most important method in this class is BuildTrainingPipeline. This method defines the model that is used for training. Notice the Options object that we create. We define everything here, from the number of epochs to the architecture that is used. This method is called within the Fit method, the only public method in the class. The Fit method trains and saves the model.
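A minimal sketch of such a class is shown below. It is not the original listing. To stay self-contained it takes the training and validation sets directly instead of a DataLoader instance, and the column names, the epoch count and the modelPath parameter are assumptions. It needs the Microsoft.ML.Vision package in addition to Microsoft.ML.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Vision;

public class ImageClassifier
{
    private readonly MLContext _mlContext = new MLContext();
    private readonly IDataView _trainSet;
    private readonly IDataView _validationSet;
    private readonly ImageClassificationTrainer.Architecture _architecture;

    public ImageClassifier(
        IDataView trainSet,
        IDataView validationSet,
        ImageClassificationTrainer.Architecture architecture)
    {
        _trainSet = trainSet;
        _validationSet = validationSet;
        _architecture = architecture;
    }

    private IEstimator<ITransformer> BuildTrainingPipeline()
    {
        var options = new ImageClassificationTrainer.Options
        {
            FeatureColumnName = "Image",      // assumed column names
            LabelColumnName = "LabelAsKey",
            ValidationSet = _validationSet,
            Arch = _architecture,             // the backbone to reuse
            Epoch = 10,                       // hypothetical value
            MetricsCallback = metrics => Console.WriteLine(metrics),
            // Cache backbone outputs so the bottleneck phase is not repeated.
            ReuseTrainSetBottleneckCachedValues = true,
            ReuseValidationSetBottleneckCachedValues = true
        };

        // Train the head, then map the predicted key back to its string label.
        return _mlContext.MulticlassClassification.Trainers.ImageClassification(options)
            .Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));
    }

    // Trains the model and saves it to modelPath (hypothetical parameter).
    public ITransformer Fit(string modelPath)
    {
        ITransformer model = BuildTrainingPipeline().Fit(_trainSet);
        _mlContext.Model.Save(model, _trainSet.Schema, modelPath);
        return model;
    }
}
```

Switching the backbone is then a matter of passing a different enumeration value, for example ImageClassificationTrainer.Architecture.MobilenetV2 or ImageClassificationTrainer.Architecture.ResnetV250.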
5.5 Predictor
The Predictor class is here to load the saved model and run some predictions. Usually, this class is not a part of the same microservice as the trainers. We usually have one microservice that performs the training of the model. The model is saved to a file, from which another microservice loads it and runs predictions based on user input. Here is what this class looks like:
In a nutshell, the model is loaded from a defined file, and predictions are run on the 10 images from the Test Dataset.
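As a rough sketch of that flow (not the original listing), the class below loads a saved model and prints predictions for the first ten rows of a test set. The ModelInput and ModelOutput shapes, the RunPredictions method name and its parameters are assumptions, and the two data classes are repeated only so that the sketch compiles on its own. The project must reference Microsoft.ML.Vision so that the saved model can be loaded.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

public class ModelInput
{
    public byte[] Image { get; set; }
    public uint LabelAsKey { get; set; }
    public string ImagePath { get; set; }
    public string Label { get; set; }
}

public class ModelOutput
{
    public string ImagePath { get; set; }
    public string Label { get; set; }
    public string PredictedLabel { get; set; }
}

public class Predictor
{
    private readonly MLContext _mlContext = new MLContext();
    private readonly PredictionEngine<ModelInput, ModelOutput> _engine;

    public Predictor(string modelPath)
    {
        // Load the model that the training side saved to disk.
        ITransformer model = _mlContext.Model.Load(modelPath, out DataViewSchema inputSchema);
        _engine = _mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);
    }

    // Runs predictions on the first `count` rows of the test set.
    public void RunPredictions(IDataView testSet, int count = 10)
    {
        var samples = _mlContext.Data
            .CreateEnumerable<ModelInput>(testSet, reuseRowObject: true)
            .Take(count);

        foreach (ModelInput sample in samples)
        {
            ModelOutput prediction = _engine.Predict(sample);
            Console.WriteLine(
                $"{prediction.ImagePath}: actual = {prediction.Label}, predicted = {prediction.PredictedLabel}");
        }
    }
}
```

Keeping the prediction side dependent only on the model file is what allows it to live in a separate service from the trainer.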
5.6 Usage and Results
Ok, let’s put all of this together.
Here we first create an instance of the DataLoader class, using the path to the image folder. Then we create an instance of the ImageClassifier class, using the MobileNetV2 architecture. After that, we call the Fit method to train this model. Finally, we create a Predictor object and run predictions. Here is what we get as an output:
Notice that all ten images were classified correctly.
In this article, we explored the world of transfer learning. We focused on the image classification problem. We learned about some cool architectures and saw how we can use them with ML.NET.
Thank you for reading!
Nikola M. Zivkovic
CAIO at Rubik's Code
Nikola M. Zivkovic is the CAIO at Rubik’s Code and the author of the book “Deep Learning for Programmers“. He loves knowledge sharing, and he is an experienced speaker. You can find him speaking at meetups, conferences and as a guest lecturer at the University of Novi Sad.
Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.