Web Scraping with Python (Complete Guide)

Web scraping in Python is dominated by three major libraries: BeautifulSoup, Scrapy, and Selenium. Each of these libraries is intended to solve very different use cases, so it’s essential to understand what we’re choosing and why. Let’s go…
In this mega article, we learn how to make inconspicuous web requests that stay undetected, along with various other methods for parsing data at scale. We’ll write Python scripts and crawlers to harvest delicious data from across the web, filling our arsenal with powerful libraries such as BeautifulSoup, Scrapy, and Selenium.
We’ll be using BeautifulSoup. It is more than enough to steal data.
For those who…
- Have a basic understanding of Python
- Have a specific need for a third party’s data
Tools for the Job
- BeautifulSoup is a lightweight, easy-to-learn, and highly effective way to programmatically isolate information on a single webpage at a time. It’s common to use BeautifulSoup in conjunction with the Requests library, where Requests fetches a page and BeautifulSoup extracts the resulting data.
- Scrapy is a tool for building crawlers: monstrosities unleashed upon the web like a swarm, hastily grabbing data. Because Scrapy serves the purpose of mass scraping, it is much easier to get in trouble with.
- Selenium isn’t exclusively a scraping tool as much as an automation tool that can be used to scrape sites. Selenium is the nuclear option for attempting to navigate sites programmatically, and should be treated as such: there are much better options for simple data extraction.
Ready to Go
Install beautifulsoup4 and requests:
pip3 install beautifulsoup4 requests
import requests
from bs4 import BeautifulSoup

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
Note: Sites can still keep us at bay in a variety of ways, but setting headers like these works surprisingly well in most cases.
Let’s get a page and inspect it using BeautifulSoup now:
import requests
from bs4 import BeautifulSoup

...

url = "https://example.com"
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

print(soup.prettify())
We start by sending a request to https://example.com. We then create a BeautifulSoup object that accepts the raw response content via req.content. The second parameter, ‘html.parser’, tells BeautifulSoup that we’re dealing with an HTML document. If you’re interested in parsing things like XML, other parsers are available.
When we create a BeautifulSoup object from the HTML of a page, our object contains the HTML structure of that page, which can now be easily parsed by various methods. First, let’s see how our variable soup looks by printing it with print(soup.prettify()):
<html class="gr__example_com">
 <head>
  <title>Example Domain</title>
  <meta charset="utf-8">
  <meta http-equiv="Content-type" content="text/html; charset=utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <meta property="og:site_name" content="Example dot com">
  <meta property="og:type" content="website">
  <meta property="og:title" content="Example">
  <meta property="og:description" content="An Example website.">
  <meta property="og:image" content="https://example.com/img/image.jpg">
  <meta name="twitter:title" content="Hackers and Slackers">
  <meta name="twitter:description" content="An Example website.">
  <meta name="twitter:url" content="https://example.com/">
  <meta name="twitter:image" content="https://example.com/img/image.jpg">
 </head>
 <body data-gr-c-s-loaded="true">
  <div>
   <h1>Example Domain</h1>
   <p>This domain is established to be used for illustrative examples in documents.</p>
   <p>You may use this domain in examples without prior coordination or asking for permission.</p>
   <p><a href="https://www.iana.org/domains/example">More information...</a></p>
  </div>
 </body>
</html>
Finding the Target Elements
Effective web scraping requires us to recognize patterns in a document’s HTML that we can take advantage of. This is especially the case when dealing with sites that actively try to prevent us from doing just that. Understanding the tools we have at our disposal is the first step to developing a keen eye for what’s possible.
Using find() & find_all()
The simplest way to find information in our soup variable is to use soup.find(...) or soup.find_all(...).
find returns the first HTML element that matches the criteria, whereas find_all returns a list of all matching elements (even if only one element is found, find_all still returns a list with a single item).
We can look for DOM elements in our soup variable by using specific criteria.
If we pass the tag name "a" as a positional argument to find_all, it will return all anchor tags on the page:
soup.find_all("a") # <a href="https://example.com/elsie" class="boy" id="link1">Elsie</a> # <a href="https://example.com/lacie" class="boy" id="link2">Lacie</a> # <a href="https://example.com/tillie" class="girl" id="link3">Tillie</a>
We can also look for <a> tags assigned a certain class:

soup.find_all("a", class_="boy")
# <a href="https://example.com/elsie" class="boy" id="link1">Elsie</a>
# <a href="https://example.com/lacie" class="boy" id="link2">Lacie</a>
Apart from anchor tags, we can get any element with the class name “boy” by using the following code:
soup.find_all(class_="boy")
# <a href="https://example.com/elsie" class="boy" id="link1">Elsie</a>
# <a href="https://example.com/lacie" class="boy" id="link2">Lacie</a>
soup.find("a", id="link1") # <a href="https://example.com/elsie" class="boy" id="link1">Elsie</a>
We can also match on arbitrary attributes via attrs:

soup.find_all(attrs={"data-args": "bologna"})
CSS Selectors
soup.select(".widget.author p")
In this example, we’re looking for any paragraph tags inside the author widget. We could also modify this to get only the second paragraph tag inside the author widget:
soup.select(".widget.author p:nth-of-type(2)")
soup.select("body > div:first-of-type > div > ul li")
Getting Attributes
To pull an attribute value from a matched element, use get(). For example, the href of the first anchor tag, or the src of the element with id "logo":

soup.find('a').get('href')

soup.find(id="logo").get('src')
It’s not always attributes we’re looking for; sometimes we just want the text within a tag:
soup.find('p').get_text()
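Putting attribute and text extraction together, here is a minimal sketch (assuming the soup object from earlier) that collects the href and visible text of every link on the page:

# Collect the href and visible text of every anchor that actually has an href.
links = [
    (a.get('href'), a.get_text(strip=True))
    for a in soup.find_all('a', href=True)
]
for href, text in links:
    print(f"{text}: {href}")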
Pesky Tags to Deal With
In our example of building link previews, the page’s meta tags, notably the og tags explicitly specified to provide exactly the bite-sized information we’re looking for, would be a useful first source of information. Getting hold of these tags is a little more difficult:
soup.find("meta", property="og:description").get('content')
That’s downright revolting. Meta tags are a particularly tricky case; because they’re all named ‘meta’, we need a second identifier (in addition to the tag name) to specify which meta tag we’re interested in. Only then can we get the tag’s actual content.
Realizing Something Will Always Break
If we were to try the above selector on an HTML page that did not contain an og:description, our script would break unforgivingly. This means we always need a plan B, or at the very least a way to handle the tag being missing altogether, as sketched below.
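As a minimal sketch of that plan B (the helper name get_og_description is hypothetical, not part of the script that follows), we can check whether the tag exists before calling get():

def get_og_description(html):
    """Return the og:description content, or None if the tag is missing.

    `html` is a BeautifulSoup object; the function name is illustrative.
    """
    tag = html.find("meta", property="og:description")
    if tag is not None:
        return tag.get('content')
    return None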
import requests
from bs4 import BeautifulSoup
import pprint


def scrape_page_metadata(url):
    """Scrape target URL for metadata."""
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    pp = pprint.PrettyPrinter(indent=4)
    r = requests.get(url, headers=headers)
    html = BeautifulSoup(r.content, 'html.parser')
    metadata = {
        'title': get_title(html),
        'description': get_description(html),
        'image': get_image(html),
        'favicon': get_favicon(html, url),
        'sitename': get_site_name(html, url),
        'color': get_theme_color(html),
        'url': url
    }
    pp.pprint(metadata)
    return metadata
We have a function that attempts to scrape a given URL for us. This function lays the foundation for snatching a page’s metadata. The result we’re looking for is a dictionary named metadata, which contains the data we manage to scrape successfully. Each key in our dictionary has a corresponding function which tries to scrape the corresponding information.
...

def get_title(html):
    """Scrape page title."""
    title = None
    if html.title and html.title.string:
        title = html.title.string
    elif html.find("meta", property="og:title"):
        title = html.find("meta", property="og:title").get('content')
    elif html.find("meta", property="twitter:title"):
        title = html.find("meta", property="twitter:title").get('content')
    elif html.find("h1"):
        title = html.find("h1").string
    return title


def get_description(html):
    """Scrape page description."""
    description = None
    if html.find("meta", property="description"):
        description = html.find("meta", property="description").get('content')
    elif html.find("meta", property="og:description"):
        description = html.find("meta", property="og:description").get('content')
    elif html.find("meta", property="twitter:description"):
        description = html.find("meta", property="twitter:description").get('content')
    elif html.find("p"):
        description = html.find("p").contents
    return description


def get_image(html):
    """Scrape share image."""
    image = None
    if html.find("meta", property="image"):
        image = html.find("meta", property="image").get('content')
    elif html.find("meta", property="og:image"):
        image = html.find("meta", property="og:image").get('content')
    elif html.find("meta", property="twitter:image"):
        image = html.find("meta", property="twitter:image").get('content')
    elif html.find("img", src=True):
        image = html.find("img", src=True).get('src')
    return image
- get_title tries to get the <title> tag, which has a very low chance of failing. If it fails, it falls back to the first <h1> tag on the page (we’re probably scraping a garbage site if we get to this point).
- get_description is nearly identical to our method for scraping page titles.
- get_image looks for the page’s “share” image for social media platforms. Our last resort is to pull the first <img> tag containing a source image.
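With these helpers (plus the remaining ones defined in the full source below), a hypothetical usage example looks like this:

# Hypothetical usage: scrape metadata for a single page and inspect a few fields.
metadata = scrape_page_metadata("https://example.com")
print(metadata['title'])
print(metadata['image'])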
Source Code
Note: The complete source code of the scraping script, which works for most websites, is below.
"""Scrape metadata from target URL.""" import requests from bs4 import BeautifulSoup import pprint def scrape_page_metadata(url): """Scrape target URL for metadata.""" headers = { 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Methods': 'GET', 'Access-Control-Allow-Headers': 'Content-Type', 'Access-Control-Max-Age': '3600', 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0' } pp = pprint.PrettyPrinter(indent=4) r = requests.get(url, headers=headers) html = BeautifulSoup(r.content, 'html.parser') metadata = { 'title': get_title(html), 'description': get_description(html), 'image': get_image(html), 'favicon': get_favicon(html, url), 'sitename': get_site_name(html, url), 'color': get_theme_color(html), 'url': url } pp.pprint(metadata) return metadata def get_title(html): """Scrape page title.""" title = None if html.title.string: title = html.title.string elif html.find("meta", property="og:title"): title = html.find("meta", property="og:title").get('content') elif html.find("meta", property="twitter:title"): title = html.find("meta", property="twitter:title").get('content') elif html.find("h1"): title = html.find("h1").string return title def get_description(html): """Scrape page description.""" description = None if html.find("meta", property="description"): description = html.find("meta", property="description").get('content') elif html.find("meta", property="og:description"): description = html.find("meta", property="og:description").get('content') elif html.find("meta", property="twitter:description"): description = html.find("meta", property="twitter:description").get('content') elif html.find("p"): description = html.find("p").contents return description def get_image(html): """Scrape share image.""" image = None if html.find("meta", property="image"): image = html.find("meta", property="image").get('content') elif html.find("meta", property="og:image"): image = html.find("meta", property="og:image").get('content') elif html.find("meta", property="twitter:image"): image = html.find("meta", property="twitter:image").get('content') elif html.find("img", src=True): image = html.find_all("img").get('src') return image def get_site_name(html, url): """Scrape site name.""" if html.find("meta", property="og:site_name"): site_name = html.find("meta", property="og:site_name").get('content') elif html.find("meta", property='twitter:title'): site_name = html.find("meta", property="twitter:title").get('content') else: site_name = url.split('//')[1] return site_name.split('/')[0].rsplit('.')[1].capitalize() return sitename def get_favicon(html, url): """Scrape favicon.""" if html.find("link", attrs={"rel": "icon"}): favicon = html.find("link", attrs={"rel": "icon"}).get('href') elif html.find("link", attrs={"rel": "shortcut icon"}): favicon = html.find("link", attrs={"rel": "shortcut icon"}).get('href') else: favicon = f'{url.rstrip("/")}/favicon.ico' return favicon def get_theme_color(html): """Scrape brand color.""" if html.find("meta", property="theme-color"): color = html.find("meta", property="theme-color").get('content') return color return None
This tutorial’s source code is available on Github, along with instructions on how to download and execute the script yourself.
Let’s enjoy… https://github.com/farjanul-nayem/Web-Scraping-with-Python