In the previous article, we had a chance to see how one can scrape images from the web using Python. In one of the articles before that, we saw how to perform transfer learning with TensorFlow, where we used famous convolutional neural networks on an already prepared TensorFlow dataset. So, technically, we are missing one step between scraping data from the web and training: how can we create a TensorFlow dataset from the images we just scraped? In this article, we do just that: prepare the data and unify it under a TensorFlow dataset.

For the purpose of this article, and to speed up the process, we use one interesting source of images – Images of LEGO Bricks. There you can find 16 different classes of LEGO bricks. Each brick is selected from Mecabricks, and there are 400 different angles for each one of them. So, let’s imagine that we scraped these images from the web and now we want to create a TensorFlow dataset that is going to be used in the learning process of a neural network for classifying LEGO bricks. Watch your step 🙂

Images

Once you download the images from the link above, you will notice that they are split into 16 directories, meaning there are 16 classes of LEGO bricks. If we were scraping these images ourselves, we would have to split them into such folders on our own. This is an important thing to do, since all of the other steps depend on it. Each folder carries the name of one class of bricks; the full list of folder names appears in the CLASSES output below.

Implementation

In general, there are two ways we can achieve the goal. One is using a Keras generator and the other is using pure TensorFlow core functionalities. No matter which approach we choose, we need to import some libraries:

import pandas
import os
import numpy as np
import pathlib
import IPython.display as display
import matplotlib.pyplot as plt
from PIL import Image
import tensorflow as tf

Apart from that, we need to load the path to the images and define classes. For that we use names of the folders in which images are located:

# pathlib.Path picks the correct path flavor for the current OS.
data_directory = pathlib.Path("./LEGO brick images/train")
CLASSES = np.array([item.name for item in data_directory.glob('*') if item.name != "LICENSE.txt"])

Here is the data that ends up in the CLASSES variable:

array(['11214 Bush 3M friction with Cross axle',
       '18651 Cross Axle 2M with Snap friction',
       '2357 Brick corner 1x2x2', '3003 Brick 2x2', '3004 Brick 1x2',
       '3005 Brick 1x1', '3022 Plate 2x2', '3023 Plate 1x2',
       '3024 Plate 1x1', '3040 Roof Tile 1x2x45deg', '3069 Flat Tile 1x2',
       '32123 half Bush', '3673 Peg 2M', '3713 Bush for Cross Axle',
       '3794 Plate 1X2 with 1 Knob', '6632 Technic Lever 3M'],
      dtype='<U38')

So, let’s first check out how we can create a TensorFlow dataset with Keras using this information.

Keras Implementation

Creating a dataset using Keras is pretty straightforward:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_generator = ImageDataGenerator(rescale=1./255)
dataset = image_generator.flow_from_directory(directory=str(data_directory),
                                              batch_size=32,
                                              shuffle=True,
                                              target_size=(300, 500),
                                              classes=list(CLASSES))

We are using the ImageDataGenerator class from the keras.preprocessing.image module. The only parameter we need in the constructor is rescale, which we use to normalize all images. Once this object is created, we call the flow_from_directory method. Here we pass the path to the directory in which the images are located and the list of class names. We also pass the batch size and the size to which all images will be resized.

This way we get 300×500 normalized images in batches of 32.
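Here is a minimal sketch of how a batch of those images could be displayed with matplotlib (the 3×3 grid and figure size are arbitrary choices, using the imports from earlier):

image_batch, label_batch = next(dataset)  # pull one batch from the generator

# Display the first nine images of the batch with their class names.
plt.figure(figsize=(10, 10))
for n in range(9):
    plt.subplot(3, 3, n + 1)
    plt.imshow(image_batch[n])
    plt.title(CLASSES[label_batch[n].argmax()])
    plt.axis('off')
plt.show()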

The next batch can be obtained like this:

image_batch, label_batch = next(dataset)

TensorFlow Implementation

While the Keras implementation is quite easy, its performance can sometimes be poor, meaning it can take quite some time. That is why we can do the same thing with pure TensorFlow. The first thing we need to do is get the list of all image paths. That is done like this:

list_dataset = tf.data.Dataset.list_files(str(data_directory/'*/*'))

That way, the list_dataset variable contains this info:

b'LEGO brick images\\train\\11214 Bush 3M friction with Cross axle\\201706171006-0003.png'
b'LEGO brick images\\train\\6632 Technic Lever 3M\\201706171606-0395.png'
b'LEGO brick images\\train\\3673 Peg 2M\\0362.png'
b'LEGO brick images\\train\\2357 Brick corner 1x2x2\\201706171206-0032.png'
b'LEGO brick images\\train\\3023 Plate 1x2\\0175.png'
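For reference, here is a quick sketch of how those entries can be printed (TensorFlow datasets are iterable in eager mode):

for file_path in list_dataset.take(5):
    print(file_path.numpy())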

Once that is done, we implement the DataSetCreator class for the purpose of preparing the images and the dataset. Here is what that looks like:

class DataSetCreator(object):
    def __init__(self, batch_size, image_height, image_width, dataset):
        self.batch_size = batch_size
        self.image_height = image_height
        self.image_width = image_width
        self.dataset = dataset

    def _get_class(self, path):
        # The class name is the name of the image's parent folder.
        path_splited = tf.strings.split(path, os.path.sep)
        return path_splited[-2] == CLASSES

    def _load_image(self, path):
        image = tf.io.read_file(path)
        # The LEGO brick images are PNG files, so decode them as such.
        image = tf.image.decode_png(image, channels=3)
        # Normalize pixel values to the [0, 1] range.
        image = tf.image.convert_image_dtype(image, tf.float32)
        return tf.image.resize(image, [self.image_height, self.image_width])

    def _load_labeled_data(self, path):
        label = self._get_class(path)
        image = self._load_image(path)
        return image, label

    def load_process(self, shuffle_size=1000):
        self.loaded_dataset = self.dataset.map(self._load_labeled_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        self.loaded_dataset = self.loaded_dataset.cache()

        # Shuffle data and create batches
        self.loaded_dataset = self.loaded_dataset.shuffle(buffer_size=shuffle_size)
        self.loaded_dataset = self.loaded_dataset.repeat()
        self.loaded_dataset = self.loaded_dataset.batch(self.batch_size)

        # Make dataset fetch batches in the background during the training of the model.
        self.loaded_dataset = self.loaded_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

    def get_batch(self):
        return next(iter(self.loaded_dataset))

This class is initialized with the batch size, the image dimensions, and the list of files. There are three private methods:

  • _get_class – Based on the path of the file, it retrieves the class of the image as a one-hot boolean vector (see the sketch after this list).
  • _load_image – Loads the image from the given path.
  • _load_labeled_data – Utilizes the previous two methods and returns the image data together with its class (label).
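The label, then, is not a string but a boolean vector with True only at the position of the matching class. Here is a quick illustrative sketch (the path is a sample from the dataset; '/' is used as the separator for this POSIX-style example, while the class method itself splits on os.path.sep):

# Illustrative: compare the parent folder name against all class names.
path = "LEGO brick images/train/3673 Peg 2M/0362.png"
label = tf.strings.split(path, "/")[-2] == CLASSES
print(label.numpy())  # True only at the '3673 Peg 2M' position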

However, the majority of the important work happens in the load_process method. Let’s take a closer look:

def load_process(self, shuffle_size=1000):
    self.loaded_dataset = self.dataset.map(self._load_labeled_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    self.loaded_dataset = self.loaded_dataset.cache()

    # Shuffle data and create batches
    self.loaded_dataset = self.loaded_dataset.shuffle(buffer_size=shuffle_size)
    self.loaded_dataset = self.loaded_dataset.repeat()
    self.loaded_dataset = self.loaded_dataset.batch(self.batch_size)

    # Make dataset fetch batches in the background during the training of the model.
    self.loaded_dataset = self.loaded_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

In this method, we utilize the map function: for each image file path that we previously loaded, the _load_labeled_data method is called. This in turn loads all images and their labels into self.loaded_dataset. Once this is done, we can cache and shuffle the dataset, and after that we create batches. An additional cool thing we do is call the prefetch method on the dataset. This method lets the dataset work in the background: during the training process, the dataset performs lazy loading of the images from the disk, so it won’t slow down the training process. This is why this implementation is so cool.

Finally, we can create an object of the DataSetCreator class and use the get_batch method to get the data:

dataProcessor = DataSetCreator(32, 300, 500, list_dataset)
dataProcessor.load_process()
image_batch, label_batch = dataProcessor.get_batch()

The result is the same as with the Keras implementation.
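As a quick sanity check, we can inspect the shapes of the returned batch; with the parameters used above we expect 32 images of size 300×500 with 3 channels, and 16-dimensional one-hot labels:

print(image_batch.shape)  # (32, 300, 500, 3)
print(label_batch.shape)  # (32, 16)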

Conclusion

In this article, we created a TensorFlow dataset from downloaded images. This dataset can now be used for training neural networks or other classification algorithms.
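To close the loop, here is a minimal sketch of how this dataset could be fed into a Keras model. The tiny CNN below is purely illustrative and not part of the original pipeline; note that because the dataset repeats indefinitely, steps_per_epoch has to be set explicitly:

# Illustrative only: a tiny CNN consuming the prepared dataset.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(300, 500, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(len(CLASSES), activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# The dataset repeats forever, so steps_per_epoch defines one epoch.
model.fit(dataProcessor.loaded_dataset, steps_per_epoch=100, epochs=5)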

Thank you for reading!

