As an object-oriented language, Python is one of the easiest to learn, and its classes and objects are notably approachable. Several mature libraries also make building a web scraping tool in Python a breeze, so the steps for web scraping with Python are fairly straightforward. But first: what exactly is web scraping? Follow me!
This bundle of e-books is specially crafted for beginners.
Everything from Python basics to the deployment of Machine Learning algorithms to production in one place.
Become a Machine Learning Superhero TODAY!
1. What is Web Scraping?
Web scraping is defined as the automated extraction of specific data from the internet. It has numerous applications, such as gathering data for a machine learning project, developing a price comparison tool, or any other new idea requiring a massive volume of data.
While it is theoretically possible to extract data manually, the vastness of the internet makes this approach impractical in many circumstances. Knowing how to create a web scraper can be helpful.
2. Addressing The Legality of Web Scraping
While scraping is legal, the data you extract may not be. Make sure that you are not interfering with any of the following:
2.1 Copyrighted content
This type of content is someone’s intellectual property; it is legally protected and cannot simply be reused.
2.2 Personal data
If the information you collect can be used to identify a person, it is deemed personal data and is most likely protected by law in that region. It’s advisable to avoid storing such data unless you have a solid legal basis to do so.
In general, you should always read every website’s terms and conditions before scraping to ensure that you are not violating their policies. If you’re unsure how to continue, contact the site’s owner and request permission.
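One practical check you can automate is reading a site’s robots.txt, which spells out which paths automated clients may fetch. Below is a short sketch using the standard library’s urllib.robotparser; the rules shown are invented for illustration, and with a real site you would call rp.set_url("https://the-site/robots.txt") followed by rp.read() instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt rules, supplied inline so the sketch runs offline.
robots_txt = """User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch(agent, url) tells you whether the rules permit the request.
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```

Note that robots.txt is advisory rather than a legal document, so it complements, not replaces, reading the site’s terms and conditions.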
This Python web scraping article will go over all you need to know to get started with a simple application. You’ll learn how to assess a website before scraping it, extract precise data with Beautiful Soup, handle JavaScript rendering with Selenium when a page requires it, and save everything in a new CSV or JSON file. You will quickly grasp how to accomplish web scraping by following the steps provided below.
3. Web Scraping with Python Steps
This article walks through the steps of web scraping with the help of Beautiful Soup, a Python Web Scraping library.
Web scraping involves the following steps:
- Send an HTTP request to the URL of the webpage you want to access. The server answers by sending back the webpage’s HTML content.
- After retrieving the HTML, we must parse it. Because HTML is hierarchical, we cannot reliably extract data through plain string processing. We need a parser that can construct a nested/tree structure from the HTML. Several HTML parser libraries are available; html5lib is among the most lenient, parsing markup much the way a browser does.
- All that remains is to navigate and search the parse tree we generated, i.e., tree traversal. For this we will use Beautiful Soup, a third-party Python library that extracts data from HTML and XML files.
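The three steps above can be sketched end-to-end. To keep the example runnable offline, the HTML is supplied inline rather than fetched over the network; with a live site you would pass r.content from requests.get(URL) to BeautifulSoup instead.

```python
from bs4 import BeautifulSoup

# Step 1 normally fetches markup over HTTP; here we use a small inline
# document so the sketch runs without a network connection.
html = "<html><body><h1>Quotes</h1><p>First</p><p>Second</p></body></html>"

# Step 2: parse the HTML into a nested tree structure with html5lib.
soup = BeautifulSoup(html, "html5lib")

# Step 3: traverse and search the parse tree.
print(soup.h1.text)                          # Quotes
print([p.text for p in soup.find_all("p")])  # ['First', 'Second']
```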
3.1 Installing the appropriate third-party libraries
Pip is the simplest way to install external libraries in Python. Install them with the following commands:
pip install requests
pip install html5lib
pip install bs4
3.2 Retrieving HTML content/text from a webpage
To begin, import the requests library. After that, enter the webpage URL you wish to scrape. Send an HTTP request to the supplied URL and save the server response in a response object called r. Call print(r.content) to see the webpage’s raw HTML content (of type bytes).
import requests
URL = "https://www.skillproperty.org/what-to-do-before-selling-your-home/"
r = requests.get(URL)
print(r.content)
3.3 Parsing the HTML content
Take this line, for example:
soup = BeautifulSoup(r.content, 'html5lib')
We make a BeautifulSoup object by supplying two parameters:
- r.content: This is the unprocessed HTML content.
- html5lib: Specifying the HTML parser to be used.
Printing soup.prettify() gives you a visual representation of the parse tree generated from the raw HTML content.
#This will not run on online IDE
import requests
from bs4 import BeautifulSoup
URL = "http://www.skillproperty.com/blog"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
# should this return any error, install html5lib or 'pip install html5lib'
print(soup.prettify())
3.4 Searching and navigating through the parse tree
Now we’d like to extract some valuable data from the HTML content. The soup object includes all of the data in the hierarchical structure that may be retrieved programmatically. In this example, we are working on a webpage full of quotes. As a result, we’d like to develop a program to save those quotes.
#program to scrape a website and save quotes
import requests
from bs4 import BeautifulSoup
import csv

URL = "http://www.messagestogirl.com/romantic-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

quotes = []  # a list to store quotes

table = soup.find('div', attrs = {'id':'all_quotes'})

for row in table.findAll('div',
        attrs = {'class':'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['author'] = row.img['alt'].split(" #")[1]
    quotes.append(quote)

filename = 'motivational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme','url','img','lines','author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
Before proceeding, it is advised that you examine the HTML content of the webpage, which we printed using soup.prettify(), and look for a way to navigate to the quotes.
- All of the quotes are contained within a div container with the id ‘all_quotes’. So, we use the find() method to locate that div element (referred to as table in the preceding code):
table = soup.find('div', attrs = {'id':'all_quotes'})
The first parameter is the HTML tag to search for, and the second is a dictionary-type element to describe the additional properties connected with that tag. The find() method returns the first element that matches. You can try printing the table.prettify() to get an idea of what this code does.
- Within the table element, each quote is contained in a div container whose class attribute matches the long value shown in the code. As a result, we cycle through each div container with that class.
In this case, we use the findAll() method, which takes the same arguments as find() but returns a list of all matching elements. A variable called row is then used to iterate over each quote.
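The difference between find() and findAll() can be seen on a small, self-contained example; the markup below is invented purely for illustration and uses the same attrs-dictionary style as the scraper above.

```python
from bs4 import BeautifulSoup

# Invented markup: two divs, only one of which has the id we want.
html = """
<div id="other"><p>elsewhere</p></div>
<div id="all_quotes"><p>first quote</p><p>second quote</p></div>
"""
soup = BeautifulSoup(html, "html5lib")

# find() returns only the first element matching the tag and attributes.
table = soup.find("div", attrs={"id": "all_quotes"})
print(table.p.text)  # first quote

# findAll() (an alias of find_all()) returns a list of every match.
print(len(table.findAll("p")))  # 2
```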
Now consider this piece of code:
for row in table.findAll('div',
        attrs = {'class': 'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['author'] = row.img['alt'].split(" #")[1]
    quotes.append(quote)
To save all information about a quote, we create a dictionary. The hierarchical structure can be accessed with dot notation, and .text retrieves the text inside an HTML element.
quote['theme'] = row.h5.text
Treating the tag as a dictionary gives us the power to add, modify and remove that tag’s attributes.
quote['url'] = row.a['href']
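The dictionary interface on a tag can be sketched in isolation; the anchor tag below is invented for demonstration.

```python
from bs4 import BeautifulSoup

# Invented markup purely to illustrate dictionary-style attribute access.
soup = BeautifulSoup('<a href="/quote/1" class="link">read</a>', "html5lib")
tag = soup.a

print(tag["href"])        # /quote/1
tag["href"] = "/quote/2"  # modify an attribute like a dict entry
del tag["class"]          # or remove one entirely
print(tag["href"])        # /quote/2
```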
Finally, all of the quotes are appended to the list named quotes.
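The introduction mentioned saving results to CSV or JSON; here is a hedged sketch of the JSON variant, reusing the same list-of-dictionaries shape as the quotes list above. The sample entry and its values are invented.

```python
import json

# One sample entry with the same keys the scraper collects; values invented.
quotes = [
    {"theme": "Love", "url": "/quote/1", "img": "/img/1.jpg",
     "lines": "An example quote line", "author": "Unknown"},
]

# json.dump serializes the whole list in one call; indent=2 keeps it readable.
with open("motivational_quotes.json", "w") as f:
    json.dump(quotes, f, indent=2)
```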
So, this is a basic example of how to build a web scraper in Python. From here, you can attempt to scrape any other website you want! Taking the time to learn Python is well worth it, as this is a valuable skill to acquire.
Conclusion
You’re on your own from here on out. Building web scrapers in Python, obtaining data, and deriving conclusions from vast volumes of information is a fascinating and challenging process in and of itself.
Thank you for reading!
Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development.