Web Scraping with Python (Completed)

Web scraping in Python is dominated by three major libraries: BeautifulSoup, Scrapy, and Selenium. Each of these libraries is built for a very different use case, so it’s essential to understand what we’re choosing and why. Let’s go…

In this mega article, we learn how to make web requests that are inconspicuous enough to go undetected, along with various methods for parsing data at scale. We’ll write Python scripts and crawlers to harvest delicious data from across the web, and fill our arsenal with powerful libraries such as BeautifulSoup, Scrapy, and Selenium.

We’ll be using BeautifulSoup. It is more than enough to steal data.

For those who…

  • Have a basic understanding of Python
  • Have a specific need for a third party’s data

Tools for the Job

  • BeautifulSoup is a lightweight, easy-to-learn, and highly effective way to programmatically isolate information on a single webpage at a time. It’s common to use BeautifulSoup in conjunction with the Requests library, where Requests fetches a page and BeautifulSoup extracts the data from the result.
  • Scrapy is a tool for building crawlers. These are monstrosities unleashed upon the web like a swarm, hastily grabbing data. Because Scrapy serves the purpose of mass-scraping, it is much easier to get in trouble with.
  • Selenium isn’t exclusively a scraping tool as much as an automation tool that can be used to scrape sites. Selenium is the nuclear option for attempting to navigate sites programmatically, and should be treated as such: there are much better options for simple data extraction.

Ready for Go

We need to set the stage before we steal any data. We’ll begin by installing our two favorite libraries:

Install beautifulsoup and requests

pip3 install beautifulsoup4 requests

As previously said, requests will give us our target’s HTML, which will be parsed by beautifulsoup4.
We must acknowledge that many websites have safeguards in place to prevent scrapers from getting their data. To get around this, we can fake the headers we send along with our requests to make our scraper appear to be a valid browser:

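import requests
from bs4 import BeautifulSoup

# The User-Agent is what makes the request look like it came from a browser.
# The Access-Control-* entries are normally CORS response headers set by servers.
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}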

Note: Sites can still keep us at bay in a variety of ways.

But establishing headers works surprisingly well in most cases.
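To see which headers a request would actually carry, one option is to build the request and inspect it before sending anything. The sketch below assumes a trimmed-down headers dictionary (only the User-Agent from above) and uses requests.Request(...).prepare(), which constructs the request without touching the network:

```python
import requests

# Assumed for illustration: only the User-Agent from the headers above.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
}

# Build the request without sending it, so nothing touches the network.
prepared = requests.Request("GET", "https://example.com", headers=headers).prepare()

# The prepared request exposes the method, URL, and headers that would be sent.
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

This is only a debugging aid; an actual request still goes through requests.get or a Session.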

Let’s fetch a page and inspect it using BeautifulSoup now:

import requests
from bs4 import BeautifulSoup
...
url = "https://example.com"
# Pass headers by keyword: the second positional argument of requests.get is params.
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())


We started by sending a request to https://example.com. We then created a BeautifulSoup object that accepts the raw response content via req.content. The second parameter, ‘html.parser’, tells BeautifulSoup that we’re dealing with an HTML document. If you’re interested in parsing things like XML, there are other parsers available.
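The same steps can be tried without a network connection by handing BeautifulSoup a string. This is a minimal sketch in which html_doc is an assumed stand-in for a downloaded page:

```python
from bs4 import BeautifulSoup

# A small inline document standing in for a downloaded page (assumed sample).
html_doc = """
<html>
  <head><title>Example Domain</title></head>
  <body><h1>Example Domain</h1><p>Illustrative text.</p></body>
</html>
"""

# 'html.parser' is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(html_doc, "html.parser")

page_title = soup.title.string   # text inside <title>
heading = soup.h1.get_text()     # text inside the first <h1>
print(page_title, "|", heading)
```

Other parsers are selected the same way through the second argument. As far as I know, the "xml" parser additionally requires the lxml package to be installed.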

When we create a BeautifulSoup object from the HTML of a page, our object contains the HTML structure of that page, which can now be easily parsed by various methods. First, let’s see how our variable soup looks by printing it with print(soup.prettify()):

<html class="gr__example_com"><head>
    <title>Example Domain</title>
    <meta charset="utf-8">
    <meta http-equiv="Content-type" content="text/html; charset=utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta property="og:site_name" content="Example dot com">
    <meta property="og:type" content="website">
    <meta property="og:title" content="Example">
    <meta property="og:description" content="An Example website.">
    <meta property="og:image" content="https://example.com/img/image.jpg">
    <meta name="twitter:title" content="Hackers and Slackers">
    <meta name="twitter:description" content="An Example website.">
    <meta name="twitter:url" content="https://example.com/">
    <meta name="twitter:image" content="https://example.com/img/image.jpg">
</head>

<body data-gr-c-s-loaded="true">
  <div>
    <h1>Example Domain</h1>
      <p>This domain is established to be used for illustrative examples in documents.</p>
      <p>You may use this domain in examples without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
  </div>
</body>

</html>

Finding the Target Elements

Effective web scraping requires us to recognize patterns in a document’s HTML that we can take advantage of. This is especially the case when dealing with sites that actively try to prevent us from doing just that. Understanding the tools we have at our disposal is the first step to developing a keen eye for what’s possible.

Using find() & find_all()

The simplest way to find information in our soup variable is to use soup.find(…) or soup.find_all(…).
The two work the same way, with one exception: find returns only the first matching HTML element,
whereas find_all returns a list of all elements matching the criteria (even if only one element is found, find_all returns a list with a single item).
We can look for DOM elements in our soup variable by using specific criteria.

Passing a tag name as the positional argument to find_all returns every tag of that type. Passing "a", for example, returns all anchor tags on the page.

Finding all <a> tags:

soup.find_all("a")
# <a href="https://example.com/elsie" class="boy" id="link1">Elsie</a>
# <a href="https://example.com/lacie" class="boy" id="link2">Lacie</a>
# <a href="https://example.com/tillie" class="girl" id="link3">Tillie</a>

We can also find all anchor tags with the class name “boy” by passing the class_ argument. Take notice of the underscore: class is a reserved word in Python.
Find all <a> tags assigned a certain class:

soup.find_all("a", class_="boy")
# <a href="https://example.com/elsie" class="boy" id="link1">Elsie</a>
# <a href="https://example.com/lacie" class="boy" id="link2">Lacie</a>


Apart from anchor tags, we can get any element with the class name “boy” by using the following code:

soup.find_all(class_="boy")
# <a href="https://example.com/elsie" class="boy" id="link1">Elsie</a>
# <a href="https://example.com/lacie" class="boy" id="link2">Lacie</a>

In the same way that we searched for classes, we can search for elements by id.
Because we should only expect a single element with a given id, we should use find here:

soup.find("a", id="link1")
# <a href="https://example.com/elsie" class="boy" id="link1">Elsie</a>
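To try these lookups end to end, here is a small self-contained sketch. The markup in html_doc is reconstructed from the outputs shown above and is an assumption, not a real page:

```python
from bs4 import BeautifulSoup

# Sample markup reconstructed from the outputs shown above (assumed input).
html_doc = """
<p>
  <a href="https://example.com/elsie" class="boy" id="link1">Elsie</a>
  <a href="https://example.com/lacie" class="boy" id="link2">Lacie</a>
  <a href="https://example.com/tillie" class="girl" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

all_links = soup.find_all("a")            # list-like result with every <a>
boys = soup.find_all("a", class_="boy")   # only <a> tags with class "boy"
first_link = soup.find("a")               # first match only, as a single tag
link2 = soup.find("a", id="link2")        # lookup by id
missing = soup.find("a", id="nope")       # no match: find returns None

print(len(all_links), len(boys), first_link["id"], link2.get_text(), missing)
```

Note the last line: when nothing matches, find returns None rather than raising, so the result is worth checking before calling methods on it.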

Occasionally, we’ll come across elements that don’t have consistent class or id values.
Fortunately, we can search for DOM elements by any attribute, including non-standard ones:

soup.find_all(attrs={"data-args": "bologna"})
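A quick sketch of the attrs lookup on made-up markup follows. The data-args values here are invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented markup: three list items carrying a custom data-args attribute.
html_doc = """
<ul>
  <li data-args="bologna">First</li>
  <li data-args="salami">Second</li>
  <li data-args="bologna">Third</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# data-args is not a valid Python keyword name, so it goes in the attrs dict.
matches = soup.find_all(attrs={"data-args": "bologna"})
texts = [tag.get_text() for tag in matches]

# Passing True matches any tag that has the attribute at all.
any_args = soup.find_all(attrs={"data-args": True})
print(texts, len(any_args))
```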

CSS Selectors

One of the most powerful ways to find what you’re looking for in HTML is to use CSS selectors. 
This is especially true for sites that try to make your life difficult. 
We can find and leverage highly-specific patterns in the target’s DOM structure by using CSS selectors. 
This is the most effective way to ensure that we’re getting exactly what we need. 
If you’re not familiar with CSS selectors, I strongly advise you to brush up on them. 
Listed below are a few examples:

soup.select(".widget.author p")


In this example, we’re looking for every paragraph tag inside an element that has both the widget and author classes. We could also modify this to get only the second paragraph tag inside the author widget:

soup.select(".widget.author p:nth-of-type(2)")
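Here is a runnable sketch of both selectors against an invented author widget. The class names match the selectors above, but the markup itself is an assumption:

```python
from bs4 import BeautifulSoup

# Invented markup: one author widget with two paragraphs, plus another widget.
html_doc = """
<div class="widget author">
  <p>Written by Jane</p>
  <p>Jane writes about Python.</p>
</div>
<div class="widget tags"><p>python</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# select returns every match; here, all <p> tags inside the author widget.
paragraphs = soup.select(".widget.author p")

# select_one returns only the first match (or None when nothing matches).
second = soup.select_one(".widget.author p:nth-of-type(2)")

print(len(paragraphs), second.get_text())
```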

Consider a site that has no identifying information on its tags, to prevent people like us from scraping its data. Even if we didn’t have names to go by, we could look at the page’s DOM structure and figure out a unique way to get to the element we wanted:

soup.select("body > div:first-of-type > div > ul li")

A pattern like this is most likely limited to a single set of <li> tags on the page we’re looking at. The disadvantage of this strategy is that we are subject to the site owner’s whims, as their HTML structure may change.
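To illustrate both the selector and its fragility, the sketch below runs it against two invented documents: one with the expected layout, and one where a hypothetical redesign wraps everything in a <main> tag:

```python
from bs4 import BeautifulSoup

selector = "body > div:first-of-type > div > ul li"

# Invented page with the layout the selector expects.
original = BeautifulSoup("""
<body>
  <div><div><ul><li>Alpha</li><li>Beta</li></ul></div></div>
  <div><div><ul><li>Gamma</li></ul></div></div>
</body>
""", "html.parser")

# Same content after a hypothetical redesign that adds a <main> wrapper.
redesigned = BeautifulSoup("""
<body>
  <main><div><div><ul><li>Alpha</li><li>Beta</li></ul></div></div></main>
</body>
""", "html.parser")

items = [li.get_text() for li in original.select(selector)]
after_redesign = redesigned.select(selector)  # no <div> directly under <body> now

print(items, len(after_redesign))
```

An empty result like the second one is a useful signal to log, since it usually means the site changed rather than the data disappeared.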

Getting Attributes

We’ll nearly always want the contents or attributes of a tag, rather than the whole HTML of the tag.
If we’re scraping anchor tags, for example, we’re usually only interested in the href value rather than the complete tag.
To access the value of an attribute on a tag, use the .get method. Because find_all returns a list, we call .get on each tag in turn:

[a.get('href') for a in soup.find_all('a')]

The above collects the destination URLs of all the page’s <a> tags.
Another example would be to grab the source URL of a website’s logo image:

soup.find(id="logo").get('src')


It’s not always attributes we’re looking for; sometimes it’s just the text within a tag:

soup.find('p').get_text()
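A short sketch pulling attributes and text out of an invented snippet. The logo id and the links are assumptions made for the example:

```python
from bs4 import BeautifulSoup

# Invented snippet: a logo image, a paragraph, and two anchors (one without href).
html_doc = """
<img id="logo" src="/img/logo.png">
<p>  Hello <b>world</b>  </p>
<a href="/one">One</a>
<a>No link here</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# .get returns None for a missing attribute instead of raising.
hrefs = [a.get("href") for a in soup.find_all("a")]

# Filtering with href=True skips anchors that have no href at all.
hrefs_only = [a["href"] for a in soup.find_all("a", href=True)]

logo_src = soup.find(id="logo").get("src")

# Join nested text pieces with a space and trim the surrounding whitespace.
text = soup.find("p").get_text(" ", strip=True)

print(hrefs, hrefs_only, logo_src, text)
```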

Pesky Tags to Deal With

In our example of making link previews, the page’s meta tags would definitely be a useful first source of information, notably the og tags that sites specify to explicitly provide the bite-sized information we’re looking for. Getting hold of these tags is a little more difficult:

soup.find("meta", property="og:description").get('content')

That’s downright revolting. Meta tags are a particularly intriguing case: because they’re all named “meta,” we need a second identifier (in addition to the tag name) to distinguish which meta tag we’re interested in. Only then can we read the tag’s actual content from its content attribute.

Realizing Something Will Always Break

If we were to try the above selector on an HTML page that did not contain an og:description, find would return None and the chained .get call would raise an AttributeError, breaking our script unforgivingly. This means we always need to build in a plan B, and at the very least deal with the lack of a tag altogether.
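The smallest possible plan B is to check the result of find before touching it. In the sketch below, meta_content is a hypothetical helper written for this example, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup


def meta_content(soup, prop):
    """Return the content of <meta property=prop>, or None if the tag is absent.

    Hypothetical helper for illustration.
    """
    tag = soup.find("meta", property=prop)
    if tag is None:
        return None
    return tag.get("content")


# Two invented pages: one with an og:description, one without.
with_og = BeautifulSoup(
    '<meta property="og:description" content="An Example website.">', "html.parser"
)
without_og = BeautifulSoup("<title>No metadata here</title>", "html.parser")

found = meta_content(with_og, "og:description")
absent = meta_content(without_og, "og:description")
print(found, absent)
```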

import pprint

import requests
from bs4 import BeautifulSoup

def scrape_page_metadata(url):
    """Scrape target URL for metadata."""
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    pp = pprint.PrettyPrinter(indent=4)
    r = requests.get(url, headers=headers)
    html = BeautifulSoup(r.content, 'html.parser')
    metadata = {
        'title': get_title(html),
        'description': get_description(html),
        'image': get_image(html),
        'favicon': get_favicon(html, url),
        'sitename': get_site_name(html, url),
        'color': get_theme_color(html),
        'url': url
        }
    pp.pprint(metadata)
    return metadata


We have a function that attempts to scrape a given URL for us. This function lays the foundation for snatching a page’s metadata. The result we’re looking for is a dictionary named metadata, which contains the data we manage to scrape successfully. Each key in our dictionary has a corresponding function which tries to scrape the corresponding information.

...

def get_title(html):
    """Scrape page title."""
    title = None
    if html.title and html.title.string:  # html.title is None when there is no <title>
        title = html.title.string
    elif html.find("meta", property="og:title"):
        title = html.find("meta", property="og:title").get('content')
    elif html.find("meta", property="twitter:title"):
        title = html.find("meta", property="twitter:title").get('content')
    elif html.find("h1"):
        title = html.find("h1").string
    return title

def get_description(html):
    """Scrape page description."""
    description = None
    if html.find("meta", attrs={"name": "description"}):  # standard tag uses name=, not property=
        description = html.find("meta", attrs={"name": "description"}).get('content')
    elif html.find("meta", property="og:description"):
        description = html.find("meta", property="og:description").get('content')
    elif html.find("meta", property="twitter:description"):
        description = html.find("meta", property="twitter:description").get('content')
    elif html.find("p"):
        description = html.find("p").get_text()  # plain text rather than a list of child nodes
    return description

def get_image(html):
    """Scrape share image."""
    image = None
    if html.find("meta", property="image"):
        image = html.find("meta", property="image").get('content')
    elif html.find("meta", property="og:image"):
        image = html.find("meta", property="og:image").get('content')
    elif html.find("meta", property="twitter:image"):
        image = html.find("meta", property="twitter:image").get('content')
    elif html.find("img", src=True):
        image = html.find("img", src=True).get('src')  # first <img> that has a src
    return image

  • get_title tries to get the <title> tag, which has a very low chance of failing. If that fails, it tries the og:title and twitter:title meta tags, and finally falls back to pulling the first <h1> tag on the page (we’re probably scraping a garbage site if we reach this point).
  • get_description is nearly identical to our method for scraping page titles.
  • get_image looks for the page’s “share” image for social media platforms. Our last resort is to pull the first <img> tag containing a source image.
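The if/elif chains in these functions all follow the same pattern: try a list of candidates in order and keep the first hit. One way to express that pattern is a table-driven lookup. This is only a sketch; first_meta_content and DESCRIPTION_CANDIDATES are hypothetical names, and the candidate list is an assumption about which tags are worth checking:

```python
from bs4 import BeautifulSoup


def first_meta_content(soup, candidates):
    """Return the first non-empty meta content among (attribute, value) pairs.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    for attr, value in candidates:
        tag = soup.find("meta", attrs={attr: value})
        if tag is not None and tag.get("content"):
            return tag["content"]
    return None


# Candidates are tried in order; the first one present on the page wins.
DESCRIPTION_CANDIDATES = [
    ("name", "description"),
    ("property", "og:description"),
    ("name", "twitter:description"),
]

# Invented page that only carries a Twitter description.
html_doc = """
<head>
  <meta property="og:title" content="Example">
  <meta name="twitter:description" content="An Example website.">
</head>
"""
soup = BeautifulSoup(html_doc, "html.parser")

description = first_meta_content(soup, DESCRIPTION_CANDIDATES)
missing = first_meta_content(soup, [("property", "og:image")])
print(description, missing)
```

Passing the attribute name alongside the value matters because sites are inconsistent about whether they use name or property on their meta tags.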

Source Code

Note: This is the complete source code of a general-purpose scraping script that should work on most websites.

"""Scrape metadata from target URL."""
import requests
from bs4 import BeautifulSoup
import pprint


def scrape_page_metadata(url):
    """Scrape target URL for metadata."""
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    pp = pprint.PrettyPrinter(indent=4)
    r = requests.get(url, headers=headers)
    html = BeautifulSoup(r.content, 'html.parser')
    metadata = {
        'title': get_title(html),
        'description': get_description(html),
        'image': get_image(html),
        'favicon': get_favicon(html, url),
        'sitename': get_site_name(html, url),
        'color': get_theme_color(html),
        'url': url
        }
    pp.pprint(metadata)
    return metadata


def get_title(html):
    """Scrape page title."""
    title = None
    if html.title and html.title.string:  # html.title is None when there is no <title>
        title = html.title.string
    elif html.find("meta", property="og:title"):
        title = html.find("meta", property="og:title").get('content')
    elif html.find("meta", property="twitter:title"):
        title = html.find("meta", property="twitter:title").get('content')
    elif html.find("h1"):
        title = html.find("h1").string
    return title


def get_description(html):
    """Scrape page description."""
    description = None
    if html.find("meta", attrs={"name": "description"}):  # standard tag uses name=, not property=
        description = html.find("meta", attrs={"name": "description"}).get('content')
    elif html.find("meta", property="og:description"):
        description = html.find("meta", property="og:description").get('content')
    elif html.find("meta", property="twitter:description"):
        description = html.find("meta", property="twitter:description").get('content')
    elif html.find("p"):
        description = html.find("p").get_text()  # plain text rather than a list of child nodes
    return description


def get_image(html):
    """Scrape share image."""
    image = None
    if html.find("meta", property="image"):
        image = html.find("meta", property="image").get('content')
    elif html.find("meta", property="og:image"):
        image = html.find("meta", property="og:image").get('content')
    elif html.find("meta", property="twitter:image"):
        image = html.find("meta", property="twitter:image").get('content')
    elif html.find("img", src=True):
        image = html.find("img", src=True).get('src')  # first <img> that has a src
    return image


def get_site_name(html, url):
    """Scrape site name."""
    if html.find("meta", property="og:site_name"):
        site_name = html.find("meta", property="og:site_name").get('content')
    elif html.find("meta", property='twitter:title'):
        site_name = html.find("meta", property="twitter:title").get('content')
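    else:
        # Fall back to the domain label before the final suffix, e.g. "Example" for example.com
        host = url.split('//')[-1].split('/')[0]
        labels = host.split('.')
        site_name = (labels[-2] if len(labels) > 1 else labels[0]).capitalize()
    return site_name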


def get_favicon(html, url):
    """Scrape favicon."""
    if html.find("link", attrs={"rel": "icon"}):
        favicon = html.find("link", attrs={"rel": "icon"}).get('href')
    elif html.find("link", attrs={"rel": "shortcut icon"}):
        favicon = html.find("link", attrs={"rel": "shortcut icon"}).get('href')
    else:
        favicon = f'{url.rstrip("/")}/favicon.ico'
    return favicon


def get_theme_color(html):
    """Scrape brand color."""
    if html.find("meta", attrs={"name": "theme-color"}):  # theme-color is declared with name=
        color = html.find("meta", attrs={"name": "theme-color"}).get('content')
        return color
    return None


This tutorial’s source code is available on GitHub, along with instructions on how to download and execute the script yourself.

Let’s enjoy… https://github.com/farjanul-nayem/Web-Scraping-with-Python
