Understanding Web Scraping with Python Part 1

Tari Yekorogha
10 min read · Aug 2, 2022

Getting data from the internet.

Photo by Louis Reed on Unsplash

What is Web Scraping?

Web scraping, simply put, is the automated gathering of data from the internet: the use of programs (also called robots) to gather data from websites and various other web resources automatically. The term web scraping goes by many other names, such as web harvesting, web crawling, etc. It makes you wonder if the inventor of the concept was a Marvel fan 😂

I prefer the term web scraping, so that's what I'm going to call it throughout this article. As you can tell from the title, this is going to be a series in which we take a deep dive into web scraping. So ready your laptop, your Python interpreter, and your internet connection, and come with me on this journey.

Our First Web Scraper

To cut a long story short, the internet is simply a bunch of requests and responses, i.e., I request data from Person A, then Person A checks their cabinet to see if they have the data. If they do, they respond with my requested data and an HTTP status code 200, signalling that the search went OK, whereas if they didn't find the data, they respond without my data and an HTTP status code 404, signifying "Sorry, we weren't able to find your requested data in our cabinet." This is practically how the internet works, but if you would like to know more, feel free to click here.

Anyways, what our modern browsers do for us is simply beautify the responses to our requests, but underneath the fancy layout is still code. That's why clicking this link 👉🏾 http://pythonscraping.com/pages/page1.html produces a prettier result than running the code below:
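Here's a minimal version of that snippet, using urlopen from Python's standard library:

from urllib.request import urlopen

# Fetch the raw bytes of the page and print them, markup and all
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())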

We need to start thinking of a webpage's address as an address to a file rather than a page, because most modern web pages have plenty of resource files associated with them, image files being one example. When a web browser reaches an HTML tag such as:

<img src="rubikscube.jpg">

the browser knows that it needs to make another request to the server to get the data in the file rubikscube.jpg in order to fully render the page for its user.

Unfortunately, our Python code does not have the logic to go back and request multiple files. Not yet anyways 😼. It can only read the single HTML file that you’ve directly requested.

Yeah, we are gonna be using urllib throughout the entirety of this series. urllib is a standard Python library; it is the URL handling module for Python (although there are third-party modules for URL handling). It contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent. You should definitely check out the documentation for urllib since we are gonna be using it throughout this article series.
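To give you a small taste of the cookie-handling side, here's a quick sketch (the page it fetches is just the example page from above):

from http.cookiejar import CookieJar
from urllib.request import build_opener, HTTPCookieProcessor

# An opener that remembers cookies and resends them on later requests
opener = build_opener(HTTPCookieProcessor(CookieJar()))
html = opener.open('http://www.pythonscraping.com/pages/page1.html')
print(html.getcode())  # 200 if all went well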

BeautifulSoup

Photo by Max Griss on Unsplash

Beautiful Soup, so rich and green,

Waiting in a hot tureen!

Who for such dainties would not stoop?

Soup of the evening, beautiful Soup!

Yeah… Apparently, the BeautifulSoup library was named after a Lewis Carroll poem of the same name in Alice's Adventures in Wonderland. Is it just me, or are the makers of web scraping tools big movie fans 😂

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers, like you and me, hours or days of work.

Beautiful Soup is one of the libraries used in web scraping, and to use it, of course, it needs to be installed on your device. If you are using Windows and you don't have it installed, you can do that by running the following command in CMD or PowerShell (preferably run as Administrator):

pip install beautifulsoup4

If you're using Linux or Mac? Too bad, I'm using Windows 😝 and it's my article, so I can be biased if I want to. Just kidding; in fact, here's a trade secret: if you ever find yourself stuck in your software engineering or coding journey, just Google it.

By now, surely, you have Beautiful Soup installed on your device. If you're not sure, crank up your terminal, type in python, then type in from bs4 import BeautifulSoup like so:
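$ python
>>> from bs4 import BeautifulSoup
>>>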

If you have a screen similar to mine, then you are safe. If you get an ImportError, then try to install Beautiful Soup again. If you keep having issues, DM me on Twitter. We'll tackle the problem together 😤

Trying out Beautiful Soup
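Here's a minimal version of the snippet we'll walk through (it matches the steps described below):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Request the page, then hand its HTML to Beautiful Soup for parsing
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)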

The code snippet above shows a simple use of the BeautifulSoup module. First, the urlopen function is imported from the urllib.request module. Second, the BeautifulSoup function is imported from the bs4 module. Third, a request is made to the server that handles the files located at the address (or URL) 'http://www.pythonscraping.com/pages/page1.html' using the urlopen function, and the response is stored in the variable html. Fourth, the BeautifulSoup function is called with the argument html.read().

Note that the read function is used to get the HTML content of the page, though Beautiful Soup can also use the response returned directly by urlopen, without read being called on it.
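In other words, this also works:

# Beautiful Soup accepts the file-like response object directly
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html, 'html.parser')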


Anyways, as I was writing: the fourth thing the code above does is parse the response stored in the variable html, using the BeautifulSoup function, into something Python can easily work with. To do that, the BeautifulSoup function takes two arguments: the variable html and the parser, html.parser. After which, the BeautifulSoup object created by this operation is stored in the variable bs.

Lastly, using the BeautifulSoup object stored in bs, we get the h1 tag of our HTML response. Cool, right?

The BeautifulSoup object is the most commonly used object in the BeautifulSoup library. You're even gonna see that for yourself. The HTML content we transformed into a BeautifulSoup object has roughly the following form (the tag contents here are the ones on page1.html):
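html → <html><head>...</head><body>...</body></html>
  head → <head><title>A Useful Page</title></head>
    title → <title>A Useful Page</title>
  body → <body><h1>An Interesting Title</h1><div>Lorem ipsum...</div></body>
    h1 → <h1>An Interesting Title</h1>
    div → <div>Lorem ipsum dolor...</div>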

Meaning that a possible way of getting the h1 tag would be to run this code:
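bs.html.body.h1  # drill down through the nesting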

That works since it's nested two layers deep, though there's no need for it, because as we already know from the code written above, we can call the tag directly. We can call it in any of these three ways:
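bs.h1
bs.body.h1
bs.html.h1

All three return the same h1 tag on this page.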

That’s that. The next thing I want to tell you about is the second argument of the BeautifulSoup function, ‘html.parser’.

The second argument specifies the parser that we want Beautiful Soup to use to create our BeautifulSoup object. We used html.parser because it is included with Python 3 and requires no extra installation. There are others. For example, there's lxml, which, like every other external library, can be installed through pip:

pip install lxml

To use lxml, one simply has to substitute it for html.parser when calling the BeautifulSoup function, like so:
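bs = BeautifulSoup(html.read(), 'lxml')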

The Pros of lxml include:

  1. It's generally better than html.parser at parsing “messy” or malformed HTML code.
  2. It is forgiving and fixes problems like unclosed tags, tags that are improperly nested, and missing head or body tags.
  3. It is somewhat faster than html.parser, although in practice the speed of your network is usually the bottleneck.

The Cons of lxml include:

  1. It needs to be installed separately and depends on third-party C libraries to function. This can lead to problems with portability and ease of use, compared to html.parser.

Another HTML parser is html5lib. Its pros include:

  • Like lxml, it is an extremely forgiving parser that takes even more initiative in correcting broken HTML, which is why it is a good choice if you ever work with messy or handwritten HTML sites.

The cons include:

  1. It must be installed separately as an external dependency.
  2. It is slower than both lxml and html.parser.

Like lxml, using it requires installing it and passing the string ‘html5lib’ to the BeautifulSoup function as the second argument:

pip install html5lib
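And in code:

bs = BeautifulSoup(html.read(), 'html5lib')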

That's all I have to tell you about Beautiful Soup for now. More in the next articles.

The Web is buggy

Photo by Ed van duijn on Unsplash

If I had to summarize software engineering in three words I would say:

The web isn't much different. Data is poorly formatted, websites go down, and closing tags go missing. So, to handle and avoid the hell that is bugs, let's learn how to process all the possible exceptions that could arise from our code, as it's the developer's job to do so.

If we take a look at the second line of code in the first scraper you and I built, we can anticipate two major things that can go wrong in this line:
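html = urlopen('http://www.pythonscraping.com/pages/page1.html')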

The first thing that could go wrong is that the page we requested is not found on the server (or there was an error in retrieving it). The second is that the server is not found.

In the first scenario, urllib will return an HTTP error, such as “404 Page Not Found,” “500 Internal Server Error,” or “400 Bad Request.” We can handle any of these exceptions along these lines:
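from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    # the server returned an HTTP error code, so print it
    print(e)
else:
    # the request succeeded; carry on with the program
    print(html.read())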

The code above simply states to the Python interpreter:

Hey, after importing these modules, try to request a response from this URL. If that works, then run the else block; if not, and you have an HTTP error to throw, then store your error in the variable e and print e to the screen.

If you don't get my fun explanation above, then I'm guessing you don't know how a try-except-else block works. That's chill. Just click here to figure it out.

Okay, now suppose the error that occurs is that the server wasn't found at all, for example, if http://www.pythonscraping.com was down, or the URL was mistyped. The urlopen function will then throw a URLError. This can be handled with a try-except block, as shown in the following code:
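from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    # a deliberately unreachable address, just for illustration
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')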

At this point, one would think they have saved the day, but NO. There's still a possibility that the content on the page is not what you are expecting. Therefore, it's considered smart to check that a tag actually exists every time you interact with a Beautiful Soup object. You see, the problem doesn't arise when the tag doesn't exist; Beautiful Soup simply returns None in that case. The problem comes up when one attempts to access a tag on that None object. Here's a live example of what I'm speaking of:
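# the tag names here are made up, purely for illustration
print(bs.nonExistentTag)
# None
print(bs.nonExistentTag.someTag)
# AttributeError: 'NoneType' object has no attribute 'someTag'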

The exception thrown is an AttributeError. But how do we guard against these two situations? An easy way would be to explicitly check for both cases:
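try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    # the outer tag was None, so the lookup blew up
    print('Tag was not found')
else:
    if badContent == None:
        print('Tag was not found')
    else:
        print(badContent)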

Handling every possible error may seem stressful at first, but it's worth it, as it adds organization to our code and makes it easier to both write and read. Below is our first scraper's code, but written in a more robust way:
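from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # the request itself failed
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        # the page didn't have the structure we expected
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title == None:
    print('Title could not be found')
else:
    print(title)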

When writing scrapers, one needs to think about the overall pattern of their code in order to handle exceptions and keep it readable at the same time, because one would most likely want reusable code. Having functions like getTitle (complete with good error handling) makes it easy to quickly and reliably scrape the web.

Thank you for reading the first part of this series. See you next time Champ.

Summary

  1. Introduction to Web Scraping
  2. Introduction to BeautifulSoup
  3. Introduction to Error handling in Web Scraping

Bonus

Most, if not all, of the content in this article series is inspired by the book “Web Scraping with Python” by Ryan Mitchell. Someone once said…

Knowledge isn’t power until it's applied.

So I'm going to apply all that we've learnt so far in a personal project of mine; feel free to join me.

Okay… So Top40Weekly is a website that provides a list of the top 40 songs currently on the US chart. I'm gonna be using all that we've learnt so far, in addition to my Python knowledge, to get the title of the page.

First, I need the URL of the page, which is Top40Weekly.

Second, I need to pray that the developers of this site followed the convention that the title is supposed to be kept under an h1 tag. Hopefully, they followed the rules. Okay, let's write some code:
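Reusing the getTitle function from earlier (I'm assuming the site's homepage URL here):

title = getTitle('https://top40weekly.com/')
if title == None:
    print('Title could not be found')
else:
    print(title)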

The title could not be found, but why? I'm gonna edit the code to find out. And here's what I discovered…
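Calling urlopen directly, without the try-except wrapper, surfaces the underlying error (a sketch):

html = urlopen('https://top40weekly.com/')
# urllib.error.HTTPError: HTTP Error 403: Forbidden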

My request was returning an HTTPError 403: ‘Forbidden’, so I did a quick Google search on how to fix it, and voilà…
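The usual fix is to send a browser-like User-Agent header with the request; a sketch of what that looks like (the header value here is just one common choice):

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Identify as a regular browser so the server doesn't reject us
req = Request('https://top40weekly.com/',
              headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req)
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)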

The code works. Not bad if I do say so myself. 😌

You can find my code on GitHub. Goodbye for real this time.

References

  1. Web Scraping with Python by Ryan Mitchell
  2. w3schools.com
  3. Google.com👍🏾
