The process of building machine learning, deep learning, or AI applications has several steps. One of them is analyzing the data and finding which parts of it are usable and which are not. We also need to pick the machine learning algorithms or neural network architectures that we should use in order to solve the problem. We might even choose to use reinforcement learning or transfer learning. However, clients often don't have data that could solve their problem. More often than not, it is our job to get data from the web that will be utilized by a machine learning algorithm or neural network.

This is usually the rule when we work on computer vision tasks. Clients rely on your ability to gather the data that is going to feed your VGG, ResNet, or custom Convolutional Neural Network. So, in this article, we focus on the step that comes before data analysis and all the fancy algorithms: data scraping, or to be more precise, image scraping. We are going to explore two ways to get images from a website using Python.

Technologies

In general, there are multiple ways to download images from a web page, and there are multiple Python packages that can help you with this task. In this article, we explore two of those packages: Beautiful Soup and Scrapy. Both are good libraries for pulling data out of HTML.

To install them run this command for Beautiful Soup:

pip install beautifulsoup4

And this command for Scrapy:

pip install scrapy

Since Scrapy's image pipeline cannot function without Pillow, make sure that this library is installed as well:

pip install Pillow

Both of these libraries are great tools, so let's see what problem we need to solve. In this example, we want to download the featured image from all blog posts on our blog page. If we inspect that page, we can see that the URLs of those images are stored within <img> HTML tags, which are nested within <a> and <div> tags. This is important because we can use CSS classes as identifiers.

Now when we know a little bit more about our task, let’s implement solution first with Beautiful Soup and then with Scrapy.

Beautiful Soup

This library is pretty intuitive to use. However, we need to import a few other libraries in order to finish this task:
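A minimal set of imports for this example might look like this:

```python
# Beautiful Soup for parsing HTML, requests/urllib.request for
# web requests, and shutil for copying response data into files
from bs4 import BeautifulSoup
import requests
import urllib.request
import shutil
```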

These libraries are used to send web requests (requests and urllib.request) and to store data in files (shutil). Now we can send a request to the blog page, get the response, and parse it with Beautiful Soup:
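A sketch of that step, assuming the blog page lives at https://rubikscode.net/:

```python
import requests
from bs4 import BeautifulSoup

# The blog page URL (assumed for this example)
url = "https://rubikscode.net/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Grab every <a> element carrying the featured-image CSS class
aas = soup.find_all("a", class_="entry-featured-image-url")
```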

We extracted all elements from the HTML DOM that have the tag <a> and the class entry-featured-image-url using the find_all method. They are all stored within the aas variable; if you print it out, it will look something like this:

[<a class="entry-featured-image-url" href="https://rubikscode.net/2019/11/25/introduction-to-chatbots-and-their-business-value/"><img alt="Introduction to Chatbots and Their Business Value" class="" height="9999" src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2019/11/featured.png?fit=1080%2C608&ssl=1" width="9999"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2019/11/18/transfer-learning-with-tensorflow-2-model-fine-tuning/"><img alt="Transfer Learning with TensorFlow 2 – Model Fine Tuning" class="" height="9999" src="https://i2.wp.com/rubikscode.net/wp-content/uploads/2019/11/Feature.png?fit=1080%2C608&ssl=1" width="9999"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2019/11/11/transfer-learning-with-tensorflow-2/"><img alt="Transfer Learning with TensorFlow 2" class="" height="9999" src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2019/11/Add-a-heading-2.png?fit=1080%2C608&ssl=1" width="9999"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2019/11/04/using-mongodb-in-python/"><img alt="Using MongoDB in Python" class="" height="9999" src="https://i0.wp.com/rubikscode.net/wp-content/uploads/2019/11/Add-a-heading-1.png?fit=1080%2C608&ssl=1" width="9999"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2019/10/28/business-value-of-artificial-intelligence/"><img alt="Business Value of Artificial Intelligence" class="" height="9999" src="https://i0.wp.com/rubikscode.net/wp-content/uploads/2019/10/make-or-brake.png?fit=3400%2C1480&ssl=1" width="9999"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2019/10/21/5-awesome-new-features-python-3-8/"><img alt="5 Awesome New Features – Python 3.8" class="" height="9999" src="https://i2.wp.com/rubikscode.net/wp-content/uploads/2019/10/featured.png?fit=3400%2C1480&ssl=1" width="9999"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2019/10/14/29-skills-for-being-a-successful-data-scientist/"><img alt="29 Skills for Being a Successful Data Scientist" class="" height="9999" src="https://i2.wp.com/rubikscode.net/wp-content/uploads/2019/10/Linear-Algebra.png?fit=3400%2C1480&ssl=1" width="9999"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2019/10/07/can-you-be-data-scientist-and-software-developer-at-the-same-time/"><img alt="Can you be Data Scientist and Software Developer at the same time?" class="" height="9999" src="https://i0.wp.com/rubikscode.net/wp-content/uploads/2019/10/AutoML.png?fit=3400%2C1480&ssl=1" width="9999"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2019/09/30/transformer-series/"><img alt="Transformer Series" class="" height="9999" src="https://i0.wp.com/rubikscode.net/wp-content/uploads/2019/09/Copy-of-GAN-Series.png?fit=3400%2C1480&ssl=1" width="9999"/></a>]

Now we need to use that information to get the data from each individual link, and we also need to store it in files. Let's first extract the URL and image name for each image from the aas variable.
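To make the extraction step concrete, here is the same findChildren logic run against a minimal HTML sample shaped like the blog page (the sample markup and values are made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up sample mirroring the structure of the blog page
sample = '''
<a class="entry-featured-image-url" href="https://rubikscode.net/some-post/">
  <img alt="Some Post Title" src="https://example.com/featured.png"/>
</a>
'''
aas = BeautifulSoup(sample, "html.parser").find_all(
    "a", class_="entry-featured-image-url")

image_info = []
for a in aas:
    # Each <a> wraps a single <img>; take its src (URL) and alt (title)
    image_tag = a.findChildren("img")
    image_info.append((image_tag[0]["src"], image_tag[0]["alt"]))

print(image_info)
# → [('https://example.com/featured.png', 'Some Post Title')]
```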

We utilize the findChildren function for each element in the aas array and append its attributes to the image_info list. Here is the result:

[('https://i1.wp.com/rubikscode.net/wp-content/uploads/2019/11/featured.png?fit=1080%2C608&ssl=1', 'Introduction to Chatbots and Their Business Value'),
('https://i2.wp.com/rubikscode.net/wp-content/uploads/2019/11/Feature.png?fit=1080%2C608&ssl=1', 'Transfer Learning with TensorFlow 2 – Model Fine Tuning'),
('https://i1.wp.com/rubikscode.net/wp-content/uploads/2019/11/Add-a-heading-2.png?fit=1080%2C608&ssl=1', 'Transfer Learning with TensorFlow 2'),
('https://i0.wp.com/rubikscode.net/wp-content/uploads/2019/11/Add-a-heading-1.png?fit=1080%2C608&ssl=1', 'Using MongoDB in Python'),
('https://i0.wp.com/rubikscode.net/wp-content/uploads/2019/10/make-or-brake.png?fit=3400%2C1480&ssl=1', 'Business Value of Artificial Intelligence'),
('https://i2.wp.com/rubikscode.net/wp-content/uploads/2019/10/featured.png?fit=3400%2C1480&ssl=1', '5 Awesome New Features – Python 3.8'),
('https://i2.wp.com/rubikscode.net/wp-content/uploads/2019/10/Linear-Algebra.png?fit=3400%2C1480&ssl=1', '29 Skills for Being a Successful Data Scientist'),
('https://i0.wp.com/rubikscode.net/wp-content/uploads/2019/10/AutoML.png?fit=3400%2C1480&ssl=1', 'Can you be Data Scientist and Software Developer at the same time?'),
('https://i0.wp.com/rubikscode.net/wp-content/uploads/2019/09/Copy-of-GAN-Series.png?fit=3400%2C1480&ssl=1', 'Transformer Series')]

Great, we have links and image names; all we need to do now is download the data. For that purpose, we build the download_image function:
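A sketch of how download_image could be implemented, assuming each entry is a (url, title) tuple as built above:

```python
import re
import shutil
import requests

def download_image(image):
    # image is a (url, title) tuple; stream the response so the
    # raw bytes can be copied straight into a file
    response = requests.get(image[0], stream=True)
    # Build a file name from the title, dropping spaces and special characters
    file_name = re.sub(r"[^A-Za-z0-9]", "", image[1]) + ".png"
    with open(file_name, "wb") as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
```

Calling it for every entry is then just `for image in image_info: download_image(image)`.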

It is a simple function. First, we send a request to the URL that we extracted from the HTML. Then, based on the title, we create the file name, removing all spaces and special characters in the process. Finally, we create a file with the proper name and copy all data from the response into that file using shutil. In the end, we call the function for each image in the list.

The result looks like this:

Scrapy

The other tool we can use for downloading images is Scrapy. While Beautiful Soup is intuitive and very simple to use, you still need to combine it with other libraries, and things can get messy on a bigger project. Scrapy is great for those situations. Once the library is installed, you can create a new Scrapy project with this command:

scrapy startproject name_of_project

This is going to create a project structure similar to a Django project structure. The files we are interested in are settings.py, items.py, and ImgSpyder.py.

The first thing we need to do is enable a file or image pipeline in settings.py. In this example, we enable the image pipeline. The location where images are stored needs to be defined as well. That is why we add these two lines to the settings:
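The two settings might look like this (the storage folder name is arbitrary):

```python
# settings.py: enable Scrapy's built-in image pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

# Folder where downloaded images are stored (the name is an assumption)
IMAGES_STORE = 'images'
```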

Then we move on to items.py. Here we define the structure of the downloaded items. In this case, we use Scrapy for downloading images; however, it is a powerful tool for downloading other types of data as well. Here is what that looks like:

Here we defined the ImageItem class, which inherits the Item class from Scrapy. We define the two fields that are mandatory when working with the image pipeline, images and image_urls, and we define both as scrapy.Field(). It is important to note that these fields must have exactly these names. Apart from that, note that image_urls needs to be a list and needs to contain absolute URLs, so we have to transform any relative URLs into absolute URLs. Finally, we implement the crawler within ImgSpyder.py:

In this file, we create the class ImgSpyder, which inherits the Spider class from Scrapy. This class is essentially used for crawling and downloading data. Notice that every Spider has a name; this name is used for running the process later on. The field start_urls defines which web pages are crawled. When we start the Spider, it shoots requests to the pages defined in the start_urls array.

The response is processed in the parse method, which we override in the ImgSpyder class. Effectively this means that, when we run this example, a request is sent to http://rubikscode.net and the response is processed in the parse method. In this method, we create an instance of ImageItem. Then we use a CSS selector to extract the image URLs and store them in the image_urls field of the ImageItem object. Note that we don't need to put anything in the images field of the class; that is done by Scrapy. Let's run this crawler with this command:

scrapy crawl img_spyder

We use name defined within the class. The other way to run this crawler is like this:

scrapy runspider ImgSpyder.py

The result:

Conclusion

In this article, we explored two tools for downloading images from the web: Beautiful Soup and Scrapy. We saw two examples of how this task can be performed and how these tools can be utilized with Python.

Thank you for reading!

