The code that accompanies this article can be received after subscribing.


The process of building machine learning, deep learning, or AI applications has several steps. One of them is analyzing the data and determining which parts of it are usable and which are not. We also need to pick the machine learning algorithms or neural network architectures best suited to the problem; we might even choose to use reinforcement learning or transfer learning. However, clients often don't have the data needed to solve their problem. More often than not, it is our job to get the data from the web that will be used by the machine learning algorithm or neural network.


This is usually the case when we work on computer vision tasks. Clients rely on your ability to gather the data that is going to feed your VGG, ResNet, or custom Convolutional Neural Network. So, in this article we focus on the step that comes before data analysis and all the fancy algorithms: data scraping, or to be more precise, image scraping. We are going to show three ways to get images from a website using Python. In this article we cover several topics:

  1. Prerequisites
  2. Scraping images with Beautiful Soup
  3. Scraping images with Scrapy
  4. Scraping images from Google with Selenium

1. Prerequisites

In general, there are multiple ways to download images from a web page, and multiple Python packages and tools that can help you with this task. In this article, we explore three of them: Beautiful Soup, Scrapy, and Selenium. They are all good libraries for pulling data out of HTML.

The first thing we need to do is install them. To install Beautiful Soup, run this command:

pip install beautifulsoup4

To install Scrapy, run this command:

pip install scrapy

Also, make sure that Selenium is installed:

pip install selenium

In order for Selenium to work, you need to install Google Chrome and the corresponding ChromeDriver. To do so, follow these steps:

  • Install Google Chrome.
  • Detect the version of the installed Chrome. You can do so by going to About Google Chrome.
  • Finally, download the ChromeDriver for your version from the official ChromeDriver downloads page.
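
As an alternative to the manual download, the third-party webdriver-manager package can fetch a matching driver for you. A minimal sketch, assuming the package is installed with pip install webdriver-manager:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Downloads a ChromeDriver that matches the installed Chrome and returns its path
wd = webdriver.Chrome(executable_path=ChromeDriverManager().install())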

Scrapy's image pipeline and our Selenium example both rely on Pillow, so make sure that this library is installed as well:

pip install Pillow

Now that the tools are installed, let's see what problem we need to solve. In this example, we want to download the featured image from all blog posts on our blog page. If we inspect that page, we can see that the URLs where those images are located are stored within <img> HTML tags, which are nested within <a> and <div> tags, as the simplified markup below shows. This is important because we can use CSS classes as identifiers.
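
Simplified, the relevant part of the markup looks something like this (abridged; the full HTML appears in the output later in this section):

<div>
  <a class="entry-featured-image-url" href="...post URL...">
    <img src="...image URL..." alt="...post title..."/>
  </a>
</div>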

Now that we know a little bit more about our task, let's implement the solution, first with Beautiful Soup and then with Scrapy. Finally, we will see how we can download images from Google using Selenium.

2. Scraping images with Beautiful Soup

This library is pretty intuitive to use. However, we need to import other libraries in order to finish this task:

from bs4 import BeautifulSoup

import requests
import urllib.request
import shutil

These libraries are used to send web requests (requests and urllib.request) and to store data in files (shutil). Now we can send a request to the blog page, get the response, and parse it with Beautiful Soup:

url = "https://rubikscode.net/"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

aas = soup.find_all("a", class_='entry-featured-image-url')

We extracted all the elements from the HTML DOM that have the tag <a> and the class entry-featured-image-url, using the find_all method. They are all stored in the aas variable; if you print it out, it will look something like this:

[<a class="entry-featured-image-url" href="https://rubikscode.net/2021/05/31/create-deepfakes-in-5-minutes-with-first-order-model-method/"><img alt="Create Deepfakes in 5 Minutes with First Order Model Method" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i2.wp.com/rubikscode.net/wp-content/uploads/2020/05/Deepfakes-Featured-Image.png?resize=400%2C250&ssl=1" srcset="https://i2.wp.com/rubikscode.net/wp-content/uploads/2020/05/Deepfakes-Featured-Image.png?fit=1200%2C675&ssl=1 479w, https://i2.wp.com/rubikscode.net/wp-content/uploads/2020/05/Deepfakes-Featured-Image.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
 <a class="entry-featured-image-url" href="https://rubikscode.net/2021/05/24/test-driven-development-tdd-with-python/"><img alt="Test Driven Development (TDD) with Python" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/05/tddfeatured.png?resize=400%2C250&ssl=1" srcset="https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/05/tddfeatured.png?fit=1200%2C675&ssl=1 479w, https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/05/tddfeatured.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
 <a class="entry-featured-image-url" href="https://rubikscode.net/2021/04/26/machine-learning-with-ml-net-sentiment-analysis/"><img alt="Machine Learning with ML.NET – Sentiment Analysis" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/SentimentAnalysis.png?resize=400%2C250&ssl=1" srcset="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/SentimentAnalysis.png?fit=1200%2C675&ssl=1 479w, https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/SentimentAnalysis.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
 <a class="entry-featured-image-url" href="https://rubikscode.net/2021/04/19/machine-learning-with-ml-net-nlp-with-bert/"><img alt="Machine Learning with ML.NET – NLP with BERT" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/04/FeaturedNLP.png?resize=400%2C250&ssl=1" srcset="https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/04/FeaturedNLP.png?fit=1200%2C675&ssl=1 479w, https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/04/FeaturedNLP.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
 <a class="entry-featured-image-url" href="https://rubikscode.net/2021/04/12/machine-learning-with-ml-net-evaluation-metrics/"><img alt="Machine Learning With ML.NET – Evaluation Metrics" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/Featured-evaluation-metrics.png?resize=400%2C250&ssl=1" srcset="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/Featured-evaluation-metrics.png?fit=1200%2C675&ssl=1 479w, https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/Featured-evaluation-metrics.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
 <a class="entry-featured-image-url" href="https://rubikscode.net/2021/04/05/machine-learning-with-ml-net-object-detection-with-yolo/"><img alt="Machine Learning with ML.NET – Object detection with YOLO" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i2.wp.com/rubikscode.net/wp-content/uploads/2021/04/mlnetyolo.png?resize=400%2C250&ssl=1" srcset="https://i2.wp.com/rubikscode.net/wp-content/uploads/2021/04/mlnetyolo.png?fit=1200%2C675&ssl=1 479w, https://i2.wp.com/rubikscode.net/wp-content/uploads/2021/04/mlnetyolo.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
 <a class="entry-featured-image-url" href="https://rubikscode.net/2021/03/29/the-rising-value-of-big-data-in-application-monitoring/"><img alt="The Rising Value of Big Data in Application Monitoring" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/risingvalue.png?resize=400%2C250&ssl=1" srcset="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/risingvalue.png?fit=1200%2C675&ssl=1 479w, https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/risingvalue.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
 <a class="entry-featured-image-url" href="https://rubikscode.net/2021/03/22/transfer-learning-and-image-classification-with-ml-net/"><img alt="Transfer Learning and Image Classification with ML.NET" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/ImageClassificationFeatured.png?resize=400%2C250&ssl=1" srcset="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/ImageClassificationFeatured.png?fit=1200%2C675&ssl=1 479w, https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/ImageClassificationFeatured.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
 <a class="entry-featured-image-url" href="https://rubikscode.net/2021/03/15/machine-learning-with-ml-net-recommendation-systems/"><img alt="Machine Learning with ML.NET – Recommendation Systems" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/This-Week-in-AI-Issue-5.png?resize=400%2C250&ssl=1" srcset="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/This-Week-in-AI-Issue-5.png?fit=1200%2C675&ssl=1 479w, https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/This-Week-in-AI-Issue-5.png?resize=400%2C250&ssl=1 480w " width="400"/></a>]

Now we need to use that information to get the data from each individual link and store it in files. Let's first extract the URL and the image name for each image from the aas variable.

image_info = []

for a in aas:
    image_tag = a.findChildren("img")
    image_info.append((image_tag[0]["src"], image_tag[0]["alt"]))

We utilize the findChildren function for each element of the aas list and append its attributes to the image_info list. Here is the result:

[('https://i2.wp.com/rubikscode.net/wp-content/uploads/2020/05/Deepfakes-Featured-Image.png?resize=400%2C250&ssl=1',
  'Create Deepfakes in 5 Minutes with First Order Model Method'),
 ('https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/05/tddfeatured.png?resize=400%2C250&ssl=1',
  'Test Driven Development (TDD) with Python'),
 ('https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/SentimentAnalysis.png?resize=400%2C250&ssl=1',
  'Machine Learning with ML.NET – Sentiment Analysis'),
 ('https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/04/FeaturedNLP.png?resize=400%2C250&ssl=1',
  'Machine Learning with ML.NET – NLP with BERT'),
 ('https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/Featured-evaluation-metrics.png?resize=400%2C250&ssl=1',
  'Machine Learning With ML.NET – Evaluation Metrics'),
 ('https://i2.wp.com/rubikscode.net/wp-content/uploads/2021/04/mlnetyolo.png?resize=400%2C250&ssl=1',
  'Machine Learning with ML.NET – Object detection with YOLO'),
 ('https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/risingvalue.png?resize=400%2C250&ssl=1',
  'The Rising Value of Big Data in Application Monitoring'),
 ('https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/ImageClassificationFeatured.png?resize=400%2C250&ssl=1',
  'Transfer Learning and Image Classification with ML.NET'),
 ('https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/This-Week-in-AI-Issue-5.png?resize=400%2C250&ssl=1',
  'Machine Learning with ML.NET – Recommendation Systems')]

Great, we have the links and the image names; all we need to do now is download the data. For that purpose, we build the download_image function:

def download_image(image):
    # image is a (url, title) tuple
    response = requests.get(image[0], stream=True)

    # Build a file name from the title, keeping only alphanumeric characters
    realname = ''.join(e for e in image[1] if e.isalnum())

    with open("./images_bs/{}.jpg".format(realname), 'wb') as file:
        # Make sure the raw stream is decompressed while copying
        response.raw.decode_content = True
        shutil.copyfileobj(response.raw, file)

    del response

It is a simple function. First, we send a request to the URL that we extracted from the HTML. Then, based on the title, we create a file name; during this process we remove all spaces and special characters. Eventually, we create a file with that name and copy all the data from the response into it using shutil. In the end, we call the function for each image in the list:

for i in range(0, len(image_info)):
    download_image(image_info[i])

The result can be found in the defined folder.


Of course, this solution can be further generalized and implemented in the form of a class. Something like this:

from bs4 import BeautifulSoup

import requests
import shutil

class BeautifulScrapper():
    def __init__(self, url:str, classid:str, folder:str):
        self.url = url
        self.classid = classid
        # folder is a format string, e.g. "./images_bs/{}.jpg"
        self.folder = folder
    
    def _get_info(self):
        image_info = []

        response = requests.get(self.url)
        soup = BeautifulSoup(response.text, "html.parser")
        aas = soup.find_all("a", class_= self.classid)

        for a in aas:
            image_tag = a.findChildren("img")
            image_info.append((image_tag[0]["src"], image_tag[0]["alt"]))

        return image_info

    def _download_images(self, image_info):
        response = requests.get(image_info[0], stream=True)
        realname = ''.join(e for e in image_info[1] if e.isalnum())
        
        file = open(self.folder.format(realname), 'wb')
        
        response.raw.decode_content = True
        shutil.copyfileobj(response.raw, file)
        del response

    def scrape_images(self):
        image_info = self._get_info()

        for i in range(0, len(image_info)):
            self._download_images(image_info[i])

It is pretty much the same thing, except that now you can create multiple objects of this class with different URLs and configurations.
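
For example, here is how the class might be used to replicate the previous example (a sketch; note that the folder argument is a format string, since _download_images fills the image name into it):

scraper = BeautifulScrapper(
    url="https://rubikscode.net/",
    classid="entry-featured-image-url",
    folder="./images_bs/{}.jpg"  # {} is replaced with the sanitized image name
)
scraper.scrape_images()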

3. Scraping images with Scrapy

The other tool that we can use for downloading images is Scrapy. While Beautiful Soup is intuitive and very simple to use, you still need to combine it with other libraries, and things can get messy when working on a bigger project. Scrapy is great for those situations. Once the library is installed, you can create a new Scrapy project with this command:

scrapy startproject name_of_project

This is going to create a project structure similar to that of a Django project, roughly like the layout below. The files that interest us are settings.py, items.py, and the spider file ImgSpyder.py.
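
For reference, the generated layout looks roughly like this (the spider file, ImgSpyder.py, is one we create ourselves inside the spiders folder):

name_of_project/
    scrapy.cfg
    name_of_project/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py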

The first thing we need to do is add a file or image pipeline in settings.py. In this example, we add the image pipeline. The location where images are stored also needs to be defined. That is why we add these two lines to the settings:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'C:/images/scrapy'

Then we move on to items.py. Here we define the structure of the downloaded items. In this case we use Scrapy for downloading images; however, it is a powerful tool for downloading other types of data as well. Here is what that looks like:

import scrapy

class ImageItem(scrapy.Item):
    images = scrapy.Field()
    image_urls = scrapy.Field()

Here we defined the ImageItem class, which inherits the Item class from Scrapy. We define the two fields that are mandatory when we work with the Image Pipeline, images and image_urls, and we define them as scrapy.Field(). It is important to notice that these fields must have exactly these names. Apart from that, note that image_urls needs to be a list containing absolute URLs.
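
The src attributes on our page are already absolute, so no extra work is needed here. If a page served relative URLs instead, we would have to transform them first; a minimal sketch of how that could look inside a spider's parse method, using Scrapy's response.urljoin:

for img in response.css(".entry-featured-image-url img::attr(src)").extract():
    img_urls.append(response.urljoin(img))  # resolve a relative URL against the page URL

Finally, we implement the crawler within ImgSpyder.py: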

import scrapy
from ..items import ImageItem

class ImgSpider(scrapy.spiders.Spider):
    name = "img_spider"
    start_urls = ["https://rubikscode.net/"]

    def parse(self, response):
        image = ImageItem()
        img_urls = []

        for img in response.css(".entry-featured-image-url img::attr(src)").extract():
            img_urls.append(img)

        image["image_urls"] = img_urls

        return image

In this file, we create the class ImgSpider, which inherits the Spider class from Scrapy. This class is essentially used for crawling and downloading data. You can see that each spider has a name; this name is used for running the process later on. The field start_urls defines which web pages are crawled. When we initiate the spider, it shoots requests to the pages defined in the start_urls list.


The response is processed in the parse method, which we override in the ImgSpider class. Effectively this means that, when we run this example, the spider sends a request to https://rubikscode.net and then processes the response in the parse method. In this method, we create an instance of ImageItem. Then we use a CSS selector to extract the image URLs and store them in the img_urls list. Finally, we put everything from img_urls into the ImageItem object. Note that we don't need to put anything in the images field of the class; Scrapy populates it for us. Let's run this crawler with this command:

scrapy crawl img_spider

We use the name defined within the class. The other way to run this crawler is like this:

scrapy runspider ImgSpyder.py
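
Either way, Scrapy's ImagesPipeline downloads every URL listed in the image_urls field and stores the files under the IMAGES_STORE location, inside a full/ subfolder, with file names derived from a hash of each URL.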

4. Scraping images from Google with Selenium

In both previous cases, we explored how to download images from a single site. However, if we want to download a large number of images, performing a Google search is probably the best option. This process can be automated as well, and for that automation we can use Selenium.

This tool is used for various types of automation; for example, it is good for test automation. In this tutorial, we use it to perform the necessary Google search and download the images. Note that, for this to work, you have to install Google Chrome and use the corresponding ChromeDriver, as described in the first section of this article.

Selenium is easy to use; however, you first need to import it, together with a few libraries that the scraper below relies on:

import time, io, os, hashlib
import requests

import selenium
from selenium import webdriver
from PIL import Image

Once this is done, you need to define the path to ChromeDriver:

DRIVER_PATH = 'C:\\Users\\n.zivkovic\\Documents\\chromedriver_win32\\chromedriver'
wd = webdriver.Chrome(executable_path=DRIVER_PATH)

Once you do this, Google Chrome opens automatically. From this moment on, we can manipulate it through the wd variable. Let's go to google.com:

wd.get('https://google.com')

Cool! Now, let's implement a class that performs a Google search and downloads images from the first page:

class GoogleScraper():
    '''Downloads images from Google based on the query.
       webdriver - Selenium webdriver
       max_num_of_images - Maximum number of images that we want to download
    '''
    def __init__(self, webdriver:webdriver, max_num_of_images:int):
        self.wd = webdriver
        self.max_num_of_images = max_num_of_images

    def _scroll_to_the_end(self):
        self.wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

    def _build_query(self, query:str):
        return f"https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={query}&oq={query}&gs_l=img"

    def _get_info(self, query: str):
        image_urls = set()

        self.wd.get(self._build_query(query))
        self._scroll_to_the_end()

        # img.Q4LuWd is the Google thumbnail selector
        thumbnails = self.wd.find_elements_by_css_selector("img.Q4LuWd")

        print(f"Found {len(thumbnails)} images...")
        print(f"Getting the links...")

        for img in thumbnails[0:self.max_num_of_images]:
            # We need to click every thumbnail so we can get the full image.
            try:
                img.click()
            except Exception:
                print('ERROR: Cannot click on the image.')
                continue

            # img.n3VNCb is the selector of the full-size image
            images = self.wd.find_elements_by_css_selector('img.n3VNCb')
            time.sleep(0.3)

            for image in images:
                if image.get_attribute('src') and 'http' in image.get_attribute('src'):
                    image_urls.add(image.get_attribute('src'))

        return image_urls

    def _download_image(self, folder_path:str, url:str):
        try:
            image_content = requests.get(url).content
        except Exception as e:
            print(f"ERROR: Could not download {url} - {e}")
            return

        try:
            image_file = io.BytesIO(image_content)
            image = Image.open(image_file).convert('RGB')
            # Name the file after a hash of its content
            file = os.path.join(folder_path, hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')

            with open(file, 'wb') as f:
                image.save(f, "JPEG", quality=85)
            print(f"SUCCESS: saved {url} - as {file}")

        except Exception as e:
            print(f"ERROR: Could not save {url} - {e}")

    def scrape_images(self, query:str, folder_path='./images'):
        # Create a separate folder for this search term
        folder = os.path.join(folder_path, '_'.join(query.lower().split(' ')))

        if not os.path.exists(folder):
            os.makedirs(folder)

        image_info = self._get_info(query)
        print(f"Downloading images...")

        for image in image_info:
            self._download_image(folder, image)

Ok, that is a lot of code. Let's investigate it piece by piece. We start with the constructor of the class:

def __init__(self, webdriver:webdriver, max_num_of_images:int):
    self.wd = webdriver
    self.max_num_of_images = max_num_of_images

Through the constructor of the class, we inject the webdriver. We also define the maximum number of images from the first page that we want to download. Next, let's observe the only public method of this class – scrape_images():

def scrape_images(self, query:str, folder_path='./images'):
    # Create a separate folder for this search term
    folder = os.path.join(folder_path, '_'.join(query.lower().split(' ')))

    if not os.path.exists(folder):
        os.makedirs(folder)

    image_info = self._get_info(query)
    print(f"Downloading images...")

    for image in image_info:
        self._download_image(folder, image)

This method defines the flow and utilizes the other methods of the class. Note that one of its parameters is the query term used for the Google search; the folder where images are stored is passed in as well. The flow goes like this:
  • We create a folder for the specific search
  • We get all the image links using the _get_info method
  • We download the images using the _download_image method

Let’s explore these private methods a little bit more.

def _get_info(self, query: str):
    image_urls = set()

    self.wd.get(self._build_query(query))
    self._scroll_to_the_end()

    # img.Q4LuWd is the Google thumbnail selector
    thumbnails = self.wd.find_elements_by_css_selector("img.Q4LuWd")

    print(f"Found {len(thumbnails)} images...")
    print(f"Getting the links...")

    for img in thumbnails[0:self.max_num_of_images]:
        # We need to click every thumbnail so we can get the full image.
        try:
            img.click()
        except Exception:
            print('ERROR: Cannot click on the image.')
            continue

        # img.n3VNCb is the selector of the full-size image
        images = self.wd.find_elements_by_css_selector('img.n3VNCb')
        time.sleep(0.3)

        for image in images:
            if image.get_attribute('src') and 'http' in image.get_attribute('src'):
                image_urls.add(image.get_attribute('src'))

    return image_urls

The purpose of the _get_info method is to get the necessary image links. Note that after we perform the Google search, we use find_elements_by_css_selector with the appropriate CSS identifier. This is very similar to what we did with Beautiful Soup and Scrapy. Another thing we should pay attention to is that we need to click on each thumbnail in order to get the image in good resolution, and then use find_elements_by_css_selector once again; once that is done, the correct link is obtained.

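A quick side note on versions: find_elements_by_css_selector belongs to the Selenium 3 API that this article uses. In Selenium 4 these helpers are deprecated in favor of find_elements; the equivalent call (an adaptation, not part of the original code) would be:

from selenium.webdriver.common.by import By

thumbnails = self.wd.find_elements(By.CSS_SELECTOR, "img.Q4LuWd")

The second private method is _download_image:
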
def _download_image(self, folder_path:str, url:str):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR: Could not download {url} - {e}")
        return

    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        # Name the file after a hash of its content
        file = os.path.join(folder_path, hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')

        with open(file, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS: saved {url} - as {file}")

    except Exception as e:
        print(f"ERROR: Could not save {url} - {e}")

This method is pretty straightforward: it downloads the image content, opens it with Pillow, and saves it as a JPEG file named after a hash of its content. If we want to use GoogleScraper, here is how we can do so:

DRIVER_PATH = 'path_to_driver'
wd = webdriver.Chrome(executable_path=DRIVER_PATH)
wd.get('https://google.com')

gs = GoogleScraper(wd, 10)
gs.scrape_images('music')

We download 10 images for the search term music.

Conclusion

In this article, we explored three tools for downloading images from the web: Beautiful Soup, Scrapy, and Selenium. We saw three examples of how this task can be performed and how these tools can be utilized with Python.

Thank you for reading!

Nikola M. Zivkovic

CAIO at Rubik's Code

Nikola M. Zivkovic is a CAIO at Rubik's Code and the author of the books Ultimate Guide to Machine Learning and Deep Learning for Programmers. He loves knowledge sharing, and he is an experienced speaker. You can find him speaking at meetups and conferences, and as a guest lecturer at the University of Novi Sad.
