It is always fun and educational to read deep learning scientific papers, especially if they are in the area of the project you are currently working on. However, these papers often contain architectures and solutions that are hard to train, especially if you want to try out, let’s say, some of the winners of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). I can remember reading about VGG16 and thinking “That is all cool, but my GPU is going to die”. To make our lives easier, TensorFlow 2 provides a number of pre-trained models that you can quickly utilize. In this article, we are going to find out how you can do that with some of the famous Convolutional Neural Network architectures.
At this moment one might wonder “What are pre-trained models?”. Essentially, a pre-trained model is a saved network that was previously trained on some large dataset, for example on the ImageNet dataset. They can be found in the tensorflow.keras.applications module. There are two ways in which you can use them: as an out-of-the-box solution, or with transfer learning. Since large datasets are usually used for some general-purpose solution, you can customize a pre-trained model and specialize it for a certain problem. This way you can utilize some of the most famous neural networks without losing too much time and resources on training. Additionally, you can fine-tune these models by modifying the behavior of chosen layers. This will be covered in future articles.
In this article, we use three pre-trained models to solve a classification example: VGG16, GoogLeNet (Inception) and ResNet. Each of these architectures was among the top performers of the ILSVRC competition: GoogLeNet won the classification task in 2014 with VGG16 as the runner-up, and ResNet won in 2015. These models are part of TensorFlow 2, i.e. the tensorflow.keras.applications module. Let’s dig a little deeper into each of these architectures.
VGG16 is the first architecture we consider. It is a large convolutional neural network proposed by K. Simonyan and A. Zisserman in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. This network achieves 92.7% top-5 test accuracy on the ImageNet dataset. However, it was trained for weeks. Here is a high-level overview of the model:
GoogLeNet is also called Inception. It utilizes two concepts: 1×1 convolution and the Inception module. The first concept, 1×1 convolution, is used as a dimension reduction module. By reducing the number of dimensions, the number of computations also goes down, which means that the depth and width of the network can be increased. Instead of using a fixed filter size for each convolution layer, GoogLeNet uses the Inception module:
As you can see, the 1×1 convolution layer, 3×3 convolution layer, 5×5 convolution layer, and 3×3 max pooling layer perform their operations in parallel, and then their results are stacked together again at the output. GoogLeNet has 22 layers in total, and it looks something like this:
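To put a number on the dimension-reduction argument, here is a small back-of-the-envelope sketch. The feature map size and channel counts are illustrative assumptions, not values taken from this article; the helper simply counts multiplications for a stride-1 convolution that keeps the spatial size.

```python
def conv_multiplications(height, width, in_channels, out_channels, kernel_size):
    """Multiplications for a stride-1 convolution that preserves height and width."""
    return height * width * out_channels * kernel_size * kernel_size * in_channels


# Illustrative numbers only: a 28x28 feature map with 192 channels,
# mapped to 32 output channels by a 5x5 convolution.
direct = conv_multiplications(28, 28, 192, 32, kernel_size=5)

# Same output shape, but a 1x1 convolution first squeezes 192 channels to 16.
reduce_step = conv_multiplications(28, 28, 192, 16, kernel_size=1)
conv_step = conv_multiplications(28, 28, 16, 32, kernel_size=5)
with_reduction = reduce_step + conv_step

print(f"direct 5x5:       {direct:,}")
print(f"1x1 then 5x5:     {with_reduction:,}")
print(f"reduction factor: {direct / with_reduction:.1f}x")
```

With these assumed sizes, the 1×1 bottleneck cuts the work by roughly an order of magnitude, which is the budget that lets the network grow deeper and wider.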
Residual Networks, or ResNet, are the final architecture we are going to use in this article. The problem with the previous architectures is that they are very deep. They have a lot of layers, and because of that they are hard to train (vanishing gradient). ResNet addresses that problem with the so-called “identity shortcut connection”, or residual blocks:
In essence, ResNet follows VGG’s 3×3 convolutional layer design, where each convolutional layer is followed by a batch normalization layer and a ReLU activation function. The difference, however, is that before the final ReLU, ResNet adds the block’s input to the output. In one variation, the input value is first passed through a 1×1 convolution layer.
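The residual block described above can be sketched with a few Keras layers. This is a minimal illustration under our own assumptions: the function name, the filter counts and the demo input shape are hypothetical, and the real ResNet implementations in tensorflow.keras.applications differ in their details.

```python
import tensorflow as tf
from tensorflow.keras import layers


def residual_block(x, filters, use_1x1_conv=False, strides=1):
    """Two 3x3 conv + batch-norm stages; the input is added before the last ReLU."""
    shortcut = x

    y = layers.Conv2D(filters, 3, strides=strides, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)

    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)

    if use_1x1_conv:
        # Variation: pass the input through a 1x1 convolution so that
        # its shape matches y when filters or strides change.
        shortcut = layers.Conv2D(filters, 1, strides=strides)(shortcut)

    y = layers.Add()([y, shortcut])  # the shortcut connection
    return layers.ReLU()(y)


def build_demo_model():
    """Stack two blocks on a hypothetical 32x32x16 feature map."""
    inputs = tf.keras.Input(shape=(32, 32, 16))
    x = residual_block(inputs, filters=16)
    x = residual_block(x, filters=32, use_1x1_conv=True, strides=2)
    return tf.keras.Model(inputs, x)
```

The plain identity shortcut only works when the input and output shapes agree, which is why the 1×1 variation appears wherever the block changes the number of filters or the resolution.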
In this article, we use the “Cats vs Dogs” dataset. This dataset contains 23,262 images of cats and dogs.
You may notice that the images are not normalized and that they have different shapes. The cool thing is that the dataset is available as a part of TensorFlow Datasets. So, make sure that you have installed TensorFlow Datasets in your environment:
pip install tensorflow-datasets
Unlike other datasets from the library, this dataset is not divided into train and test data, so we need to perform the split ourselves. You can find more information about the dataset here.
This implementation is split into several parts. First, we implement a class that is in charge of loading the data and preparing it. Then we import pre-trained models and build a class that will modify their top layers. Finally, we run the training process and the evaluation process. Before everything, of course, we have to import some libraries and define some global constants:
All right, let’s dive into the implementation!
This class is in charge of loading the data and preparing it for processing. Here is what it looks like:
There is a lot going on in this class. It has several methods, of which one is “public”:
- _prepare_data – Internal method used to resize and normalize images from the dataset. Called from the constructor.
- _resize_sample – Internal method used for resizing a single image.
- _prepare_batches – Internal method used to create batches from images. Creates train_batches, validation_batches and test_batches that are used for the training and evaluation process.
- get_random_raw_images – Method used to get a given number of random images from the raw, unprocessed data.
However, the majority of things happen in the constructor of the class. Let’s take a closer look.
First, we define the image and batch size that are injected through parameters. Then, since the dataset is not already split into training and testing data, we split the data using split weights. This is a really cool feature that TensorFlow Datasets introduced, because we stay within the TensorFlow ecosystem and we don’t have to involve other libraries like Pandas or scikit-learn. Once we have performed the data split, we calculate the number of training samples and call the helper function that prepares the data for training. All we need to do after this is to instantiate an object of this class and have fun with the loaded data:
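As a rough reconstruction of the class described above (a sketch, not the exact code behind this article), a loader could look like the following. The method names come from the list above; everything else is our assumption, including the split_slices helper, the 8/1/1 default weights, the [-1, 1] normalization and the shuffle buffer size. The split is expressed with the TensorFlow Datasets slicing strings (for example train[0%:80%]).

```python
import tensorflow as tf


def split_slices(weights):
    """Turn split weights such as (8, 1, 1) into TFDS slicing strings."""
    total = sum(weights)
    bounds = [0]
    for weight in weights:
        bounds.append(bounds[-1] + weight)
    percents = [round(100 * bound / total) for bound in bounds]
    return [f"train[{lo}%:{hi}%]" for lo, hi in zip(percents, percents[1:])]


class DataLoader:
    def __init__(self, image_size, batch_size, split_weights=(8, 1, 1)):
        # Imported here so the helper above works without TFDS installed.
        import tensorflow_datasets as tfds

        self.image_size = image_size
        self.batch_size = batch_size

        # Downloads the dataset on first use.
        (self.train_raw, self.validation_raw, self.test_raw), self.info = tfds.load(
            "cats_vs_dogs",
            split=split_slices(split_weights),
            with_info=True,
            as_supervised=True,
        )
        total = self.info.splits["train"].num_examples
        self.num_train_examples = total * split_weights[0] // sum(split_weights)

        self._prepare_data()
        self._prepare_batches()

    def _resize_sample(self, image, label):
        image = tf.cast(image, tf.float32) / 127.5 - 1.0  # scale pixels to [-1, 1]
        image = tf.image.resize(image, (self.image_size, self.image_size))
        return image, label

    def _prepare_data(self):
        self.train_data = self.train_raw.map(self._resize_sample)
        self.validation_data = self.validation_raw.map(self._resize_sample)
        self.test_data = self.test_raw.map(self._resize_sample)

    def _prepare_batches(self, shuffle_buffer=1000):
        self.train_batches = self.train_data.shuffle(shuffle_buffer).batch(self.batch_size)
        self.validation_batches = self.validation_data.batch(self.batch_size)
        self.test_batches = self.test_data.batch(self.batch_size)

    def get_random_raw_images(self, num_of_images, shuffle_buffer=1000):
        return list(self.train_raw.shuffle(shuffle_buffer).take(num_of_images))
```

Instantiating it is then a one-liner such as data_loader = DataLoader(160, 32), where the image size of 160 and the batch size of 32 are again only illustrative values.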
Here is the output:
Base Models & Wrapper
The next thing on our list is loading the pre-trained models. As we already mentioned, these models are located in tensorflow.keras.applications. Loading them is pretty straightforward:
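A minimal sketch of that loading step is shown here. Some assumptions are ours: the 160-pixel input size is illustrative, and since tensorflow.keras.applications ships later Inception variants rather than the original GoogLeNet, InceptionV3 stands in for it, with ResNet50 standing in for ResNet. Passing weights="imagenet" downloads the pre-trained weights on first use.

```python
import tensorflow as tf

IMG_SIZE = 160  # assumed input resolution, not a value from the article


def build_base_models(image_size=IMG_SIZE, weights="imagenet"):
    """Return the three convolutional bases without their classification heads."""
    common = dict(
        input_shape=(image_size, image_size, 3),
        include_top=False,  # drop the ImageNet classifier, keep the feature extractor
        weights=weights,
    )
    return {
        "vgg16": tf.keras.applications.VGG16(**common),
        "googlenet": tf.keras.applications.InceptionV3(**common),
        "resnet": tf.keras.applications.ResNet50(**common),
    }
```

Calling build_base_models() gives a dictionary with one feature extractor per architecture, each producing a 4D feature map instead of class probabilities.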
That is how we created base models of the three architectures of interest. Notice that for every model the include_top parameter is set to False. This means that these models are used for feature extraction. Once we have them, we need to modify the top layers of these models so they are applicable to our concrete problem. We do that using the Wrapper class. This class accepts an injected pre-trained model and adds one Global Average Pooling layer and one Dense layer. Essentially, the final Dense layer is used for our binary classification (cat or dog). The Wrapper class puts all these things together into one model:
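One way such a wrapper could be written is as a small Keras subclass. This is a sketch under our own assumptions: the attribute names are hypothetical, and the single sigmoid output unit is one common choice for binary classification rather than something this article specifies.

```python
import tensorflow as tf


class Wrapper(tf.keras.Model):
    """Pre-trained base + global average pooling + one Dense unit."""

    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.pooling = tf.keras.layers.GlobalAveragePooling2D()
        # One sigmoid unit: the output is the probability of one of the two classes.
        self.classifier = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, inputs, training=False):
        features = self.base_model(inputs, training=training)
        pooled = self.pooling(features)  # (batch, h, w, c) -> (batch, c)
        return self.classifier(pooled)
```

Because the base model is injected through the constructor, the same class works for any of the three architectures.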
Then we can create the real models for classifying the Cats vs Dogs dataset and compile those models:
Note that we marked the base models as not trainable. This means that during the training process we train only the top layers that we have added, and the weights of the lower layers do not change.
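Freezing and compiling can be sketched as follows. The helper name, the RMSprop optimizer and the learning rate are assumptions on our side, and a Sequential model is used here only to keep the example self-contained.

```python
import tensorflow as tf


def build_frozen_classifier(base_model, learning_rate=1e-4):
    """Freeze the base, add the new top layers and compile for binary classification."""
    base_model.trainable = False  # pre-trained weights stay fixed during training

    model = tf.keras.Sequential([
        base_model,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

After this call, only the kernel and bias of the final Dense layer show up in model.trainable_weights, which is exactly why training is so much cheaper than training the whole network.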
Before we get into the whole training process, let’s reflect on the fact that, in principle, the biggest part of these models is already trained. So, what we can do is perform the evaluation process and see where we land:
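A small helper along these lines would do the job. It is a sketch that assumes a model compiled with a single accuracy metric; the function name, the steps value and the label argument are hypothetical.

```python
import tensorflow as tf


def evaluate_model(model, batches, steps=20, label="Initial"):
    """Evaluate on at most `steps` batches and print loss and accuracy."""
    loss, accuracy = model.evaluate(batches, steps=steps, verbose=0)
    print(f"{label} loss: {loss:.2f}")
    print(f"{label} accuracy: {accuracy:.2f}")
    return loss, accuracy
```

It can be called once per model with the validation batches before training, and again with the test batches afterwards.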
It is interesting to see where we stand before any training of the new top layers. All three models score about 50% accuracy, which is roughly what random guessing gives on a two-class problem:
Initial loss: 5.30
Initial accuracy: 0.51
Initial loss: 7.21
Initial accuracy: 0.51
Initial loss: 6.01
Initial accuracy: 0.51
Starting at roughly 50% accuracy gives us a baseline to improve on. So, let’s run the training process and see whether we get any better. First we train VGG16:
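The training call itself can be as small as the sketch below. The function name and the number of epochs are assumptions, and the same helper would be reused for the other two models.

```python
import tensorflow as tf


def train_model(model, train_batches, validation_batches, epochs=10):
    """Fit a compiled model and return the per-epoch metrics as a dictionary."""
    history = model.fit(
        train_batches,
        epochs=epochs,
        validation_data=validation_batches,
        verbose=2,  # one line of output per epoch
    )
    return history.history
```

The returned dictionary holds one list per metric (for example loss and val_loss), with one entry per epoch, which is what a training history plot is drawn from.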
The history looks something like this:
Then we train GoogLeNet:
History of this training process looks like this:
Finally we train ResNet:
And here is the history of that process:
Training these three models took just a couple of hours, instead of weeks, thanks to the fact that we trained just the top layers and not the whole network.
We saw that in the beginning, without any training, we got around 50% accuracy. Let’s see what the situation is after the training:
Here is the output:
We can see that all three models achieve really good results, with ResNet in front at 97% accuracy.
In this article, we demonstrated how to perform transfer learning with TensorFlow. We created a playground in which we can try out different pre-trained architectures on the data and get good results after just a matter of hours. In our example, we worked with three famous convolutional architectures and quickly modified them for a specific problem. In the next article, we will fine-tune these models and check if we can get even better results.
Thank you for reading!
Read more posts from the author at Rubik’s Code.