an old school gramophone — Photo by Jace & Afsoon on Unsplash

Scraping for Categories, Winners, and Nominees for the 2022 GRAMMYs

6 min read · Apr 9, 2022

Web scraping with Python made fun

I’m one of those people who always claim to be really into music, yet the only thing I know about the 2022 GRAMMYs is that Tyler’s album won 😂. And if you are reading this, you probably don’t know much about the event either. Let’s fix that by scraping the web for the information, so that the next time our friends start a conversation about it we can actually keep up. We’ll know so much that we’ll be the ones starting the conversation, with a “How are you doing?”:

Joey from FRIENDS saying “How you doin’?” to Rachel — Joey from FRIENDS

Description

For this project, I’m going to be scraping the official GRAMMYs website using Python
Libraries used include: requests, BeautifulSoup, re, nltk, pandas

Web Scraping begins

First, let’s load all the libraries that we will be using for this project

Next, let’s download the page with requests

Then we’ll use BeautifulSoup to parse and extract information
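Here’s a minimal sketch of those three steps (import, download, parse). It isn’t necessarily the exact code from my notebook: the helper names `download_page` and `parse_html` are made up for illustration, and the page URL is left as a parameter because it isn’t spelled out in this post. The demo at the bottom runs on a tiny hand-written snippet, so it works offline.

```python
import requests
from bs4 import BeautifulSoup


def download_page(url: str) -> str:
    """Download a page and return its HTML (hypothetical helper)."""
    response = requests.get(url, timeout=30)
    # Fail loudly on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    return response.text


def parse_html(html: str) -> BeautifulSoup:
    """Parse raw HTML with Python's built-in parser (hypothetical helper)."""
    return BeautifulSoup(html, "html.parser")


# Offline demo on a hand-written snippet, NOT the real GRAMMYs page.
sample_html = "<html><head><title>Sample awards page</title></head><body></body></html>"
soup = parse_html(sample_html)
print(soup.title.string)
```

For the real thing you would call `parse_html(download_page(url))` with the address of the 2022 GRAMMYs list page.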

To extract information from the site, we’ll follow these steps:

Step 1: Open the 2022 GRAMMYs page

A pic of the official 2022 grammy list page — Image1 by Blog owner

Step 2: Scroll down to the awards section

An image showing the part of the webpage that has General Field — Image2 by Blog owner

Step 3: Right-click on the “General Field” element and click on inspect

A visual image of the step above — Image3 by Blog owner

Step 4: With the developer tools open, we see that the “General Field” element is an <h1> element nested under a <div> element.

Visual imagery of the step above — Image4 by Blog owner

Step 5: Next we’ll click on the <div> element to see what it contains (nests)

Step 6: Lastly, we scroll down and find that the <div> element with the class prose contains all the information we need, so we will be working mainly with that.

We just found out that all the Grammy details we need are in a div element that belongs to the class prose. So let’s extract the details and start storing the info we need
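As a rough sketch, grabbing that container with BeautifulSoup could look like this. The HTML below is a hand-written stand-in for the real page (I’m assuming the container is a plain `<div class="prose">`, as seen in the inspector), and `prose` is just a name I picked for the variable.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page; only the structure matters here.
sample_html = """
<div class="header"><h1>Site title</h1></div>
<div class="prose">
  <h1>General Field</h1>
  <p><strong>1. Record Of The Year</strong></p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching element, or None if nothing matches.
prose = soup.find("div", class_="prose")
if prose is None:
    raise ValueError("no <div class='prose'> found; the page layout may have changed")

print(prose.h1.get_text(strip=True))
```

Checking for `None` up front gives a clear error if the site ever changes its layout, instead of a confusing `AttributeError` further down.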

First, we get a list of all the categories in the 2022 GRAMMY Awards

From one of the pictures we looked at above, we can see that the “General Field” element is an <h1> element

Visual imagery of the explanation above — Image7 by Blog owner

If we check the page further, we see that every other category heading has the same look as the “General Field” element, and nothing else on the page does

That means that inside the div.prose element only the categories are <h1> tags. Let's get the categories then. First, we'll get all the <h1> tags, then we'll store the categories in a list. For a more detailed explanation of what's happening here
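A minimal sketch of the category step, again on hand-written sample HTML rather than the real page. The key idea is to search for `<h1>` tags inside the prose container only, so headings elsewhere on the page are ignored.

```python
from bs4 import BeautifulSoup

# Sample markup only; the real page has many more categories and awards.
sample_html = """
<h1>Site title (outside prose, so it is ignored)</h1>
<div class="prose">
  <h1>General Field</h1>
  <p><strong>1. Record Of The Year</strong></p>
  <h1>Pop</h1>
  <p><strong>5. Best Pop Solo Performance</strong></p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
prose = soup.find("div", class_="prose")

# Every <h1> inside the container is assumed to be a category heading.
categories = [h1.get_text(strip=True) for h1 in prose.find_all("h1")]
print(categories)
```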

For the awards and winners

Inspection of the page shows that all awards and winners are in bold, that is, they are wrapped in <strong> tags. See:

1. Exhibit A:

Visual imagery of the explanation above — Image9 by Blog owner

2. Exhibit B:

Knowing that, let’s get all the <strong> tags.

Secondly, let’s get each award given and store it in a list
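Here is one way this step could be sketched. Since awards and winners are both bold, we need some rule to tell them apart. I’m assuming here that each award starts with its number followed by a dot (like “1. Record Of The Year”); that rule and the sample HTML are assumptions for illustration, so adjust them to whatever the page really looks like.

```python
import re

from bs4 import BeautifulSoup

# Sample markup only; "Sample Song" is a placeholder, not real data.
sample_html = """
<div class="prose">
  <h1>General Field</h1>
  <p><strong>1. Record Of The Year</strong></p>
  <p><strong>WINNER: Sample Song</strong></p>
  <p><strong>2. Album Of The Year</strong></p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
prose = soup.find("div", class_="prose")

# Text of every bold element inside the container.
strong_texts = [tag.get_text(strip=True) for tag in prose.find_all("strong")]

# Assumed rule: an award looks like "<number>. <name>".
awards = [text for text in strong_texts if re.match(r"\d+\.\s", text)]
print(awards)
```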

Everything looks good so far 😅. But this feels too easy. Let’s check the number of awards stored

https://jovian.ml/kingtroga/scraping-the-web-for-details-on-the-2022-grammys/v/1&cellId=23

UH OH, there are supposed to be 86. I know that much about the 2022 GRAMMYs 😅

On to the webpage for a manual check to see what’s wrong.
So award 44 is not in bold; that is, it’s not enclosed in a <strong> tag.

Visual imagery of explanation above — Image11 by Blog Owner

a picture of bob the builder asking “Can we fix it?” — Bob, Bob the builder.

Yes, we can, Bob. Run along now, don’t deny those children their cartoons. 😤

Now that Bob has gone, to fix it we’ll simply get all the <p> tags and use a regular expression to pick out just the forty-fourth award. To do that in Python, all one needs is the re module, which we have already imported.
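A rough sketch of that fix, under the assumption that the missing paragraph starts with “44.” followed by the award name. The sample awards are placeholders, and sorting by the leading number is just one way I chose to slot the recovered award back into place.

```python
import re

from bs4 import BeautifulSoup

# Placeholder markup: award 44 is deliberately NOT wrapped in <strong>.
sample_html = """
<div class="prose">
  <p><strong>43. Sample Award A</strong></p>
  <p>44. Sample Award B</p>
  <p><strong>45. Sample Award C</strong></p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
prose = soup.find("div", class_="prose")

# What the <strong>-based pass finds: award 44 is missing.
awards = [tag.get_text(strip=True) for tag in prose.find_all("strong")]

# Scan every paragraph for one whose text starts with "44.".
pattern = re.compile(r"^44\.\s*.+$")
for p in prose.find_all("p"):
    text = p.get_text(strip=True)
    if pattern.match(text):
        awards.append(text)
        break

# Put the list back in page order using the leading award number.
awards.sort(key=lambda award: int(award.split(".", 1)[0]))
print(awards)
```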

There you have it, we saved the day 🦸. On to the next task: getting the list of winners

We are basically doing the same thing we did to get the award list, except that the pattern to follow this time is WINNER

Visual imagery for explanation above — Image 12 by Blog owner

Or so we would have thought, if we hadn’t just had to squash an annoying bug while getting the award list. Who would have thought that the official GRAMMY website had bugs? I guess you could say their pitch isn’t perfect 😂
Anyway, a manual check revealed the following:

1. There’s a winner without the word “WINNER”

2. There’s a winner with the word “WINNNER”

3. There’s an award that two people tied for, and both of them are labelled “TIE” instead

4. There’s another tied award, but this time the people involved are labelled “Tie”

With all of that clarified, let’s get the list of winners. With all the work we just did, our names had better be on that list 😤

Above, the sent_tokenize function from nltk was used. For a detailed explanation of how it works, see the nltk documentation. If we completed our task correctly, the number of winners should now be 86
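To make the label quirks concrete, here is a simplified sketch that handles them with a single regular expression. Note that it uses only `re`, not `sent_tokenize`, so it is a swapped-in technique rather than what the notebook does. The exact label format (“WINNER: name”) and the sample lines are assumptions, and the winner with no label at all (case 1) can’t be caught by a pattern like this, so it would still have to be added by hand.

```python
import re

# Placeholder lines standing in for text pulled from the page.
lines = [
    "WINNER: Sample Artist A",
    "Sample Nominee",
    "WINNNER: Sample Artist B",  # the typo with three N's
    "TIE: Sample Artist C",
    "Tie: Sample Artist D",
]

# WINN+ER matches both "WINNER" and "WINNNER"; TIE/Tie covers the ties.
WINNER_LABEL = re.compile(r"^(?:WINN+ER|TIE|Tie)\b[:\s-]*(.+)$")

winners = []
for line in lines:
    match = WINNER_LABEL.match(line.strip())
    if match:
        # Keep only the name, without the label.
        winners.append(match.group(1).strip())

print(winners)
```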

It would be cool if we could get the number of awards in each category in an automated way, but a manual inspection of the website’s HTML shows that there’s no automated way to get that info. That didn’t stop me 😤. Just like my African parents fought lions on their way to school, I checked the entire page and counted the awards in each category manually. Don’t worry, I cross-checked my data on the internet just to be sure I was accurate 😂

Finally, now that we have the data we were scraping for, let’s compile it into data frames and save them as CSV files on our systems
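A minimal sketch of this last step with pandas. The column names, the file name, and the sample rows are all made up for illustration, and the file is written to a temporary folder so the example doesn’t clutter your system.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder data standing in for the scraped lists.
categories = ["General Field", "Pop"]
awards = ["1. Record Of The Year", "2. Album Of The Year"]
winners = ["Sample Winner A", "Sample Winner B"]

# One frame for the categories, one pairing each award with its winner.
categories_df = pd.DataFrame({"Category": categories})
awards_df = pd.DataFrame({"Award": awards, "Winner": winners})

# Hypothetical output location; swap in any folder you like.
out_dir = Path(tempfile.mkdtemp())
awards_path = out_dir / "awards_and_winners.csv"

# index=False keeps the row numbers out of the CSV file.
awards_df.to_csv(awards_path, index=False)

# Read the file back to confirm it was written as expected.
reloaded = pd.read_csv(awards_path)
print(reloaded)
```

Pairing awards and winners in one frame only works if both lists have the same length, which is another reason the count of 86 matters.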

Summary and Conclusion

In this project, using Python, we first imported the needed modules, namely requests, BeautifulSoup, re, nltk, and pandas.
We then downloaded the page contents of the 2022 GRAMMYs List with the requests module
After that, we scraped the page, using the BeautifulSoup module, for the Categories, Awards, and Winners of the 2022 GRAMMYs
Lastly, we stored the information we found in a data frame, then saved it to our system as a CSV file using the pandas module
Note: To get the files, check the references for a link to my GitHub page, and download them from there

Thank you for following along with me on this journey. Your name may not be on the GRAMMY list, but surely you’re a winner. If you need me, I’ll be listening to Tyler’s new song, “Come on, Let’s go”. Bye 🎧

References and Future Work

References:

Future work:

There’s still room for improvement in the data I collected. The data could be cleaned further
Using the ‘no_of_awards’ list one could perform hierarchical indexing on a new data frame that would contain the Categories, Awards, and Winners together.
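To give an idea of what that could look like, here is a small sketch using a pandas `MultiIndex`. The counts, awards, and winners below are sample values, not the real data; the only thing taken from the project is the idea of a `no_of_awards` list holding one count per category.

```python
import pandas as pd

# Sample values only; the real lists are much longer.
categories = ["General Field", "Pop"]
no_of_awards = [2, 1]  # awards per category, in the same order as categories
awards = ["Record Of The Year", "Album Of The Year", "Best Pop Solo Performance"]
winners = ["Sample Winner A", "Sample Winner B", "Sample Winner C"]

# Repeat each category name once for every award it contains.
category_per_award = [
    category
    for category, count in zip(categories, no_of_awards)
    for _ in range(count)
]

# Two-level index: category on the outside, award on the inside.
index = pd.MultiIndex.from_arrays(
    [category_per_award, awards], names=["Category", "Award"]
)
grammys = pd.DataFrame({"Winner": winners}, index=index)

# Selecting by the outer level returns every award in that category.
print(grammys.loc["General Field"])
```

This only lines up if `sum(no_of_awards)` equals the number of awards, so that is worth checking before building the index.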