Building a machine learning, deep learning, or AI application involves several steps. One of them is analyzing the data and finding which parts of it are usable and which are not. We also need to pick the machine learning algorithms or neural network architectures best suited to the problem; we might even choose to use reinforcement learning or transfer learning. However, clients often don’t have data that could solve their problem. More often than not, it is our job to get the data from the web that is going to be utilized by the machine learning algorithm or neural network.
This is usually the rule when we work on computer vision tasks. Clients rely on your ability to gather the data that is going to feed your VGG, ResNet, or custom Convolutional Neural Network. So, in this article we focus on the step that comes before data analysis and all the fancy algorithms: data scraping, or to be more precise, image scraping. We are going to show three ways to get images from a website using Python. In this article we cover several topics:
- Prerequisites
- Scraping images with Beautiful Soup
- Scraping images with Scrapy
- Scraping images from Google with Selenium
1. Prerequisites
In general, there are multiple ways that you can download images from a web page. There are even multiple Python packages and tools that can help you with this task. In this article, we explore three of those packages: Beautiful Soup, Scrapy and Selenium. They are all good libraries for pulling data out of HTML.
The first thing we need to do is to install them. To install Beautiful Soup run this command:
pip install beautifulsoup4
To install Scrapy, run this command:
pip install scrapy
Also, make sure that Selenium is installed:
pip install selenium

In order for Selenium to work, you need to install Google Chrome and the corresponding ChromeDriver. To do so, follow these steps:
- Install Google Chrome
- Check which version of Chrome is installed. You can do so by going to About Google Chrome.
- Finally, download the ChromeDriver that matches your version from the official ChromeDriver downloads page.
The Selenium example also relies on Pillow for saving images, so make sure that this library is installed as well:
pip install Pillow
All of these libraries are great tools, so let’s see what problem we need to solve. In this example, we want to download the featured image from every blog post on our blog page. If we inspect that page, we can see that the URLs of those images are stored within <img> HTML tags, which are nested within <a> and <div> tags. This is important because we can use the CSS classes as identifiers.

Now when we know a little bit more about our task, let’s implement solution first with Beautiful Soup and then with Scrapy. Finally, we will see how we can download images from Google, using Selenium.
2. Scraping images with Beautiful Soup
This library is pretty intuitive to use. However, we need to import other libraries in order to finish this task:
from bs4 import BeautifulSoup
import requests
import urllib.request
import shutil
These libraries are used to send web requests (requests and urllib.request) and to store data in files (shutil). Now we can send a request to the blog page, get the response, and parse it with Beautiful Soup:
url = "https://rubikscode.net/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
aas = soup.find_all("a", class_='entry-featured-image-url')
We extracted all elements from the HTML DOM that have the tag <a> and the class entry-featured-image-url using the find_all method. They are all stored within the aas variable; if you print it out, it will look something like this:
[<a class="entry-featured-image-url" href="https://rubikscode.net/2021/05/31/create-deepfakes-in-5-minutes-with-first-order-model-method/"><img alt="Create Deepfakes in 5 Minutes with First Order Model Method" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i2.wp.com/rubikscode.net/wp-content/uploads/2020/05/Deepfakes-Featured-Image.png?resize=400%2C250&ssl=1" srcset="https://i2.wp.com/rubikscode.net/wp-content/uploads/2020/05/Deepfakes-Featured-Image.png?fit=1200%2C675&ssl=1 479w, https://i2.wp.com/rubikscode.net/wp-content/uploads/2020/05/Deepfakes-Featured-Image.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2021/05/24/test-driven-development-tdd-with-python/"><img alt="Test Driven Development (TDD) with Python" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/05/tddfeatured.png?resize=400%2C250&ssl=1" srcset="https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/05/tddfeatured.png?fit=1200%2C675&ssl=1 479w, https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/05/tddfeatured.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2021/04/26/machine-learning-with-ml-net-sentiment-analysis/"><img alt="Machine Learning with ML.NET – Sentiment Analysis" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/SentimentAnalysis.png?resize=400%2C250&ssl=1" srcset="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/SentimentAnalysis.png?fit=1200%2C675&ssl=1 479w, https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/SentimentAnalysis.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2021/04/19/machine-learning-with-ml-net-nlp-with-bert/"><img alt="Machine Learning with ML.NET – NLP with BERT" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/04/FeaturedNLP.png?resize=400%2C250&ssl=1" srcset="https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/04/FeaturedNLP.png?fit=1200%2C675&ssl=1 479w, https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/04/FeaturedNLP.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2021/04/12/machine-learning-with-ml-net-evaluation-metrics/"><img alt="Machine Learning With ML.NET – Evaluation Metrics" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/Featured-evaluation-metrics.png?resize=400%2C250&ssl=1" srcset="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/Featured-evaluation-metrics.png?fit=1200%2C675&ssl=1 479w, https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/Featured-evaluation-metrics.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2021/04/05/machine-learning-with-ml-net-object-detection-with-yolo/"><img alt="Machine Learning with ML.NET – Object detection with YOLO" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i2.wp.com/rubikscode.net/wp-content/uploads/2021/04/mlnetyolo.png?resize=400%2C250&ssl=1" srcset="https://i2.wp.com/rubikscode.net/wp-content/uploads/2021/04/mlnetyolo.png?fit=1200%2C675&ssl=1 479w, https://i2.wp.com/rubikscode.net/wp-content/uploads/2021/04/mlnetyolo.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2021/03/29/the-rising-value-of-big-data-in-application-monitoring/"><img alt="The Rising Value of Big Data in Application Monitoring" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/risingvalue.png?resize=400%2C250&ssl=1" srcset="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/risingvalue.png?fit=1200%2C675&ssl=1 479w, https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/risingvalue.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2021/03/22/transfer-learning-and-image-classification-with-ml-net/"><img alt="Transfer Learning and Image Classification with ML.NET" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/ImageClassificationFeatured.png?resize=400%2C250&ssl=1" srcset="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/ImageClassificationFeatured.png?fit=1200%2C675&ssl=1 479w, https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/ImageClassificationFeatured.png?resize=400%2C250&ssl=1 480w " width="400"/></a>,
<a class="entry-featured-image-url" href="https://rubikscode.net/2021/03/15/machine-learning-with-ml-net-recommendation-systems/"><img alt="Machine Learning with ML.NET – Recommendation Systems" class="" height="250" sizes="(max-width:479px) 479px, 100vw " src="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/This-Week-in-AI-Issue-5.png?resize=400%2C250&ssl=1" srcset="https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/This-Week-in-AI-Issue-5.png?fit=1200%2C675&ssl=1 479w, https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/This-Week-in-AI-Issue-5.png?resize=400%2C250&ssl=1 480w " width="400"/></a>]

Now we need to use that information to get data from each individual link and store it in files. Let’s first extract the URL and image name for each image from the aas variable.
image_info = []
for a in aas:
    image_tag = a.findChildren("img")
    image_info.append((image_tag[0]["src"], image_tag[0]["alt"]))
We utilize the findChildren function for each element in the aas array and append its src and alt attributes to the image_info list. Here is the result:
[('https://i2.wp.com/rubikscode.net/wp-content/uploads/2020/05/Deepfakes-Featured-Image.png?resize=400%2C250&ssl=1',
'Create Deepfakes in 5 Minutes with First Order Model Method'),
('https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/05/tddfeatured.png?resize=400%2C250&ssl=1',
'Test Driven Development (TDD) with Python'),
('https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/SentimentAnalysis.png?resize=400%2C250&ssl=1',
'Machine Learning with ML.NET – Sentiment Analysis'),
('https://i0.wp.com/rubikscode.net/wp-content/uploads/2021/04/FeaturedNLP.png?resize=400%2C250&ssl=1',
'Machine Learning with ML.NET – NLP with BERT'),
('https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/04/Featured-evaluation-metrics.png?resize=400%2C250&ssl=1',
'Machine Learning With ML.NET – Evaluation Metrics'),
('https://i2.wp.com/rubikscode.net/wp-content/uploads/2021/04/mlnetyolo.png?resize=400%2C250&ssl=1',
'Machine Learning with ML.NET – Object detection with YOLO'),
('https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/risingvalue.png?resize=400%2C250&ssl=1',
'The Rising Value of Big Data in Application Monitoring'),
('https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/ImageClassificationFeatured.png?resize=400%2C250&ssl=1',
'Transfer Learning and Image Classification with ML.NET'),
('https://i1.wp.com/rubikscode.net/wp-content/uploads/2021/03/This-Week-in-AI-Issue-5.png?resize=400%2C250&ssl=1',
'Machine Learning with ML.NET – Recommendation Systems')]
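Beautiful Soup does a lot of work for us here. For intuition, roughly the same (src, alt) extraction can be reproduced with nothing but the standard library’s html.parser, on a made-up snippet (the markup and example.com URL below are hypothetical):

```python
from html.parser import HTMLParser

# Collect (src, alt) pairs of <img> tags nested inside
# <a class="entry-featured-image-url"> links, similar to find_all + findChildren.
class FeaturedImageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inside_link = False
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "entry-featured-image-url":
            self.inside_link = True
        elif tag == "img" and self.inside_link:
            self.images.append((attrs.get("src"), attrs.get("alt")))

    def handle_endtag(self, tag):
        if tag == "a":
            self.inside_link = False

parser = FeaturedImageParser()
parser.feed('<a class="entry-featured-image-url" href="#">'
            '<img src="https://example.com/a.png" alt="Post A"/></a>')
print(parser.images)  # [('https://example.com/a.png', 'Post A')]
```

In practice Beautiful Soup is preferable, of course: it tolerates malformed HTML and offers a much richer search API.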
Great, we have the links and image names; all we need to do now is download the data. For that purpose, we build the download_image function:
def download_image(image):
    response = requests.get(image[0], stream=True)
    # Keep only alphanumeric characters of the title for the file name.
    realname = ''.join(e for e in image[1] if e.isalnum())
    response.raw.decode_content = True
    # Note: the images_bs folder must exist beforehand.
    with open("./images_bs/{}.jpg".format(realname), 'wb') as file:
        shutil.copyfileobj(response.raw, file)
    del response
It is a simple function. First we send a request to the URL that we extracted from the HTML. Then, based on the title, we create the file name; during this process we remove all spaces and special characters. Finally, we create a file with that name and copy all data from the response into it using shutil. In the end, we call the function for each image in the list:
for i in range(0, len(image_info)):
    download_image(image_info[i])
The result can be found in the defined folder:

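The title-to-filename step from download_image is worth seeing in isolation; the helper name below is made up for illustration:

```python
def make_filename(title):
    # Keep only alphanumeric characters, exactly like the generator
    # expression in download_image does.
    return ''.join(e for e in title if e.isalnum())

print(make_filename("Test Driven Development (TDD) with Python"))
# TestDrivenDevelopmentTDDwithPython
```

Spaces, parentheses, and any other special characters are simply dropped, so the title becomes a safe filename.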
Of course, this solution can be further generalized and implemented in the form of a class. Something like this:
from bs4 import BeautifulSoup
import requests
import shutil

class BeautifulScrapper():
    def __init__(self, url:str, classid:str, folder:str):
        self.url = url
        self.classid = classid
        self.folder = folder

    def _get_info(self):
        image_info = []
        response = requests.get(self.url)
        soup = BeautifulSoup(response.text, "html.parser")
        aas = soup.find_all("a", class_= self.classid)
        for a in aas:
            image_tag = a.findChildren("img")
            image_info.append((image_tag[0]["src"], image_tag[0]["alt"]))
        return image_info

    def _download_images(self, image_info):
        response = requests.get(image_info[0], stream=True)
        realname = ''.join(e for e in image_info[1] if e.isalnum())
        response.raw.decode_content = True
        with open(self.folder.format(realname), 'wb') as file:
            shutil.copyfileobj(response.raw, file)
        del response

    def scrape_images(self):
        image_info = self._get_info()
        for i in range(0, len(image_info)):
            self._download_images(image_info[i])
It is pretty much the same thing; the difference is that now you can create multiple objects of this class with different URLs and configurations.
3. Scraping images with Scrapy
The other tool that we can use for downloading images is Scrapy. While Beautiful Soup is intuitive and very simple to use, you still need to use other libraries, and things can get messy if we are working on a bigger project. Scrapy is great for those situations. Once this library is installed, you can create a new Scrapy project with this command:
scrapy startproject name_of_project
This is going to create a project structure similar to the Django project structure. The files that interest us are settings.py, items.py, and the spider file, which we name ImgSpyder.py.
The first thing we need to do is add a file or image pipeline in settings.py. In this example, we add the image pipeline. The location where images are stored needs to be set as well. That is why we add these two lines to the settings:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'C:/images/scrapy'
Then we move on to items.py. Here we define the structure of the downloaded items. In this case, we use Scrapy for downloading images; however, it is a powerful tool for downloading other types of data as well. Here is what that looks like:
import scrapy
class ImageItem(scrapy.Item):
    images = scrapy.Field()
    image_urls = scrapy.Field()
Here we define the ImageItem class, which inherits the Item class from Scrapy. We define the two fields that are mandatory when we work with the Image Pipeline, images and image_urls, and we define them as scrapy.Field(). It is important to notice that these fields must have exactly these names. Apart from that, note that image_urls needs to be a list and needs to contain absolute URLs; if a page provides relative URLs, they first have to be converted into absolute ones. Finally, we implement the crawler within ImgSpyder.py:
import scrapy
from ..items import ImageItem

class ImgSpider(scrapy.spiders.Spider):
    name = "img_spider"
    start_urls = ["https://rubikscode.net/"]

    def parse(self, response):
        image = ImageItem()
        img_urls = []
        for img in response.css(".entry-featured-image-url img::attr(src)").extract():
            img_urls.append(img)
        image["image_urls"] = img_urls
        return image
In this file, we create the class ImgSpider, which inherits the Spider class from Scrapy. This class is essentially used for crawling and downloading data. You can see that each spider has a name; this name is used for running the process later on. The field start_urls defines which web pages are crawled. When we initiate the spider, it sends requests to the pages defined in the start_urls array.

The response is processed in the parse method, which we override in the ImgSpider class. Effectively this means that, when we run this example, the spider sends a request to https://rubikscode.net and then processes the response in the parse method. In this method, we create an instance of ImageItem. Then we use a CSS selector to extract the image URLs and store them in the img_urls array. Finally, we put everything from the img_urls array into the ImageItem object. Note that we don’t need to put anything in the images field of the class; that is done by Scrapy. Let’s run this crawler with this command:
scrapy crawl img_spider
We use name defined within the class. The other way to run this crawler is like this:
scrapy runspider ImgSpyder.py
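A quick note on the absolute-URL requirement mentioned earlier: if a page serves relative src values, they can be normalized with urljoin from the standard library (inside a spider, response.urljoin does the same). The paths below are made up for illustration:

```python
from urllib.parse import urljoin

base = "https://rubikscode.net/"

# A relative src joined against the page URL becomes absolute.
print(urljoin(base, "wp-content/uploads/featured.png"))
# https://rubikscode.net/wp-content/uploads/featured.png

# An already absolute URL is left untouched.
print(urljoin(base, "https://i2.wp.com/some-image.png"))
# https://i2.wp.com/some-image.png
```

Our blog page already serves absolute URLs, which is why the spider above can pass the src values straight through.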
4. Scraping images from Google with Selenium
In both previous cases, we explored how to download images from a single site. However, if we want to download a large number of images, performing a Google search is probably the best option. This process can be automated as well, and for that we can use Selenium.
This tool is used for various types of automation; for example, it is good for test automation. In this tutorial, we use it to perform the necessary Google search and download the images. Note that for this to work, you have to install Google Chrome and use the corresponding ChromeDriver, as described in the first section of this article.
Selenium is easy to use; however, we first need to import it, along with the other libraries this example relies on (requests, Pillow, and several standard modules):
import hashlib
import io
import os
import time

import requests
from PIL import Image
from selenium import webdriver
Once this is done, you need to define the path to ChromeDriver:
DRIVER_PATH = 'C:\\Users\\n.zivkovic\\Documents\\chromedriver_win32\\chromedriver'
wd = webdriver.Chrome(executable_path=DRIVER_PATH)
wd.get('https://google.com')

This opens a browser window controlled from the code. Now we can wrap the whole process into a class:
class GoogleScraper():
    '''Downloads images from Google based on the query.
    webdriver - Selenium webdriver
    max_num_of_images - Maximum number of images that we want to download
    '''
    def __init__(self, webdriver:webdriver, max_num_of_images:int):
        self.wd = webdriver
        self.max_num_of_images = max_num_of_images

    def _scroll_to_the_end(self):
        self.wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)

    def _build_query(self, query:str):
        return f"https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={query}&oq={query}&gs_l=img"

    def _get_info(self, query: str):
        image_urls = set()
        self.wd.get(self._build_query(query))
        self._scroll_to_the_end()
        # img.Q4LuWd is the Google thumbnail selector.
        thumbnails = self.wd.find_elements_by_css_selector("img.Q4LuWd")
        print(f"Found {len(thumbnails)} images...")
        print(f"Getting the links...")
        for img in thumbnails[0:self.max_num_of_images]:
            # We need to click every thumbnail so we can get the full image.
            try:
                img.click()
            except Exception:
                print('ERROR: Cannot click on the image.')
                continue
            images = self.wd.find_elements_by_css_selector('img.n3VNCb')
            time.sleep(0.3)
            for image in images:
                if image.get_attribute('src') and 'http' in image.get_attribute('src'):
                    image_urls.add(image.get_attribute('src'))
        return image_urls

    def download_image(self, folder_path:str, url:str):
        try:
            image_content = requests.get(url).content
        except Exception as e:
            print(f"ERROR: Could not download {url} - {e}")
            return
        try:
            image_file = io.BytesIO(image_content)
            image = Image.open(image_file).convert('RGB')
            file = os.path.join(folder_path, hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
            with open(file, 'wb') as f:
                image.save(f, "JPEG", quality=85)
            print(f"SUCCESS: saved {url} - as {file}")
        except Exception as e:
            print(f"ERROR: Could not save {url} - {e}")

    def scrape_images(self, query:str, folder_path='./images'):
        folder = os.path.join(folder_path, '_'.join(query.lower().split(' ')))
        if not os.path.exists(folder):
            os.makedirs(folder)
        image_info = self._get_info(query)
        print(f"Downloading images...")
        for image in image_info:
            self.download_image(folder, image)
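The _build_query helper simply templates the search term into a Google Images URL. Outside the class it looks like this (the URL format is the one Google used at the time of writing and may change):

```python
from urllib.parse import quote_plus

def build_query(query):
    # Same f-string as GoogleScraper._build_query; tbm=isch selects image search.
    return f"https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={query}&oq={query}&gs_l=img"

url = build_query("music")
print(url)

# For multi-word queries, the term should be URL-encoded first.
print(build_query(quote_plus("jazz music")))
```

The class above passes the query through as-is, so single-word queries work out of the box; for multi-word queries, encoding the term with quote_plus as shown is a safe addition.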
The public scrape_images method ties everything together:
def scrape_images(self, query:str, folder_path='./images'):
    folder = os.path.join(folder_path, '_'.join(query.lower().split(' ')))
    if not os.path.exists(folder):
        os.makedirs(folder)
    image_info = self._get_info(query)
    print(f"Downloading images...")
    for image in image_info:
        self.download_image(folder, image)
Here is what happens inside it:
- We create a folder for the specific search
- We get all the links to the images using the _get_info method
- We download the images using the download_image method
Let’s explore these methods a little bit more.
def _get_info(self, query: str):
    image_urls = set()
    self.wd.get(self._build_query(query))
    self._scroll_to_the_end()
    # img.Q4LuWd is the Google thumbnail selector.
    thumbnails = self.wd.find_elements_by_css_selector("img.Q4LuWd")
    print(f"Found {len(thumbnails)} images...")
    print(f"Getting the links...")
    for img in thumbnails[0:self.max_num_of_images]:
        # We need to click every thumbnail so we can get the full image.
        try:
            img.click()
        except Exception:
            print('ERROR: Cannot click on the image.')
            continue
        images = self.wd.find_elements_by_css_selector('img.n3VNCb')
        time.sleep(0.3)
        for image in images:
            if image.get_attribute('src') and 'http' in image.get_attribute('src'):
                image_urls.add(image.get_attribute('src'))
    return image_urls
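The 'http' in src check in _get_info deserves a word: Google inlines thumbnails as base64 data: URIs, while the full-size image carries a regular http(s) URL. A tiny made-up example of that filter:

```python
candidates = [
    "data:image/jpeg;base64,/9j/4AAQSkZJRg...",  # inlined thumbnail
    "https://example.com/full-size.jpg",          # full-size image
    None,                                         # element without a src
]

# Same condition as in _get_info: the src must exist and contain 'http'.
full_urls = [s for s in candidates if s and "http" in s]
print(full_urls)  # ['https://example.com/full-size.jpg']
```

This is why the method only collects the URLs of full-size images, not the low-resolution thumbnails.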
The download_image method then fetches each collected URL, decodes the response with Pillow, and saves it as a JPEG file:
def download_image(self, folder_path:str, url:str):
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR: Could not download {url} - {e}")
        return
    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file = os.path.join(folder_path, hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS: saved {url} - as {file}")
    except Exception as e:
        print(f"ERROR: Could not save {url} - {e}")
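Note the file-naming scheme in download_image: the name is the first 10 hex digits of the SHA-1 of the image bytes, so the same image downloaded twice maps to the same file instead of creating duplicates. In isolation, with fake bytes:

```python
import hashlib

content = b"not really image bytes"  # stand-in for the downloaded image content
name = hashlib.sha1(content).hexdigest()[:10] + ".jpg"
print(name)  # a stable, 10-hex-digit file name
```

Because the name is derived from the content, it is also deterministic: rerunning the scraper overwrites existing files rather than piling up copies.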
Finally, here is how the whole scraper is used:
DRIVER_PATH = 'path_to_driver'
wd = webdriver.Chrome(executable_path=DRIVER_PATH)
wd.get('https://google.com')
gs = GoogleScraper(wd, 10)
gs.scrape_images('music')

Conclusion
In this article, we explored three tools for downloading images from the web: Beautiful Soup, Scrapy, and Selenium. We saw examples of how this task can be performed and how these tools can be utilized with Python.
Thank you for reading!

Nikola M. Zivkovic
CAIO at Rubik's Code
Nikola M. Zivkovic is CAIO at Rubik’s Code and the author of the books Ultimate Guide to Machine Learning and Deep Learning for Programmers. He loves knowledge sharing, and he is an experienced speaker. You can find him speaking at meetups and conferences, and as a guest lecturer at the University of Novi Sad.