Scraping for Categories, Winners, and Nominees for the 2022 GRAMMYs
Web scraping with Python made fun
I’m one of those people who always claim to be really into music, yet the only thing I know about the 2022 GRAMMYs is that Tyler’s album won 😂. And if you are reading this, you probably don’t know much about the event either. Let’s fix that by scraping the web for the information, so that the next time our friends start a conversation about it, we can keep up. We’ll know so much that we’ll be the ones starting the conversation with something like “How are you doing?”:

Description
- For this project, I’m going to be scraping the official GRAMMYs website using Python
- Libraries used include: requests, BeautifulSoup, re, nltk, pandas
Web Scraping begins
First, let’s load all the libraries that we would be using for this project
Next, let’s download the page with requests
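Here’s a minimal sketch of the download step, assuming the winners page from the references is the one being scraped. The helper name `download_page` and the 30-second timeout are my own choices for illustration, not something the site requires.

```python
import requests

# URL taken from the references; treat it as an assumption if the page moves.
GRAMMYS_URL = (
    "https://www.grammy.com/news/"
    "2022-grammys-complete-winners-nominees-nominations-list"
)


def download_page(url, timeout=30.0):
    """Return the HTML of `url` as text, raising on a non-2xx response."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    return response.text
```

Calling `download_page(GRAMMYS_URL)` should then return the page’s HTML as one long string, ready for parsing.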
Then we’ll use BeautifulSoup to parse and extract information
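A small sketch of the parsing step. To keep it runnable without a network connection, it parses a tiny made-up HTML string instead of the real page; `parse_html` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def parse_html(html):
    """Turn raw HTML text into a searchable BeautifulSoup tree."""
    # "html.parser" ships with Python, so no extra parser is needed.
    return BeautifulSoup(html, "html.parser")


# Made-up stand-in for the downloaded page.
sample_html = (
    "<html><head><title>2022 GRAMMYs</title></head>"
    "<body><h1>General Field</h1></body></html>"
)
soup = parse_html(sample_html)
print(soup.title.get_text())     # 2022 GRAMMYs
print(soup.find("h1").get_text())  # General Field
```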
To find the information we want on the site, we’ll follow these steps:
- Step 1: Open the 2022 GRAMMYs page
- Step 2: Scroll down to the awards section
- Step 3: Right-click on the “General Field” element and click on inspect
- Step 4: With the developer tools open, we see that the “General Field” element is an <h1> element nested under a <div> element.
- Step 5: Next we’ll click on the <div> element to see what it contains (nests)
- Step 6: Lastly we scroll down to find out that the <div> element of the class prose contains all the information we need. So we would be working mainly with that.
We just found out that all the GRAMMY details we need are in a <div> element that belongs to the class prose. So let’s extract that element and start storing the info we need
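Here’s a hedged sketch of grabbing that element with BeautifulSoup. The HTML below is a made-up miniature of the page, assuming only what the inspection showed: a `<div>` with the class `prose` wrapping the content.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the page; the real markup is much larger.
sample_html = """
<div class="header"><h1>Site title</h1></div>
<div class="prose">
  <h1>General Field</h1>
  <p><strong>1. Example Award One</strong></p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with the underscore) is used because `class` is a Python keyword.
prose = soup.find("div", class_="prose")
if prose is None:
    raise ValueError("No <div class='prose'> found; the page layout may have changed")

print(prose.find("h1").get_text(strip=True))  # General Field
```

Working from `prose` instead of the whole `soup` keeps headings and bold text from the rest of the page out of our results.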
First, we get a list of all the categories in the 2022 GRAMMY Awards
- From one of the pictures we looked at above we can see that the “General Field” element is an <h1> element
- If we check the page further, we see that every other category heading is styled the same way as the “General Field” element, and nothing else on the page is
That means that inside the <div> with the class prose, only the categories are <h1> tags. Let's get the categories then. First, we'll get all the <h1> tags, then we'll store the categories in a list. For a more detailed explanation of what's happening here, see the code in the project repository listed in the references
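A minimal sketch of that step, run against made-up HTML in which, as on the real page, only category names are `<h1>` tags. The category names here are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up sample: two category headings inside the prose div.
sample_html = """
<div class="prose">
  <h1>Example Category A</h1>
  <p><strong>1. Example Award One</strong></p>
  <h1>Example Category B</h1>
  <p><strong>2. Example Award Two</strong></p>
</div>
"""
prose = BeautifulSoup(sample_html, "html.parser").find("div", class_="prose")

# Every <h1> inside the div is a category, so collect their text.
categories = [h1.get_text(strip=True) for h1 in prose.find_all("h1")]
print(categories)  # ['Example Category A', 'Example Category B']
```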
For the awards and winners
Inspection of the page shows that all awards and winners are in bold, that is, they are wrapped in <strong> tags. See:
- Exhibit A:
- Exhibit B:
Knowing that, let’s get all the <strong> tags.
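As a sketch, collecting the bold text looks like this. The sample markup and every name in it are placeholders, and the exact wording of the real page’s bold lines is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up sample: one award heading and one winner line, both in bold.
sample_html = """
<div class="prose">
  <p><strong>1. Example Award One</strong></p>
  <p><strong>Example Song A - Example Artist A - WINNER</strong></p>
  <p>Example Song B - Example Artist B</p>
</div>
"""
prose = BeautifulSoup(sample_html, "html.parser").find("div", class_="prose")

strong_tags = prose.find_all("strong")
strong_texts = [tag.get_text(strip=True) for tag in strong_tags]
print(len(strong_tags))  # 2
```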
Secondly, let’s get each award given and store it in a list
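One way to separate awards from winners, sketched under a loud assumption: that each award’s bold text starts with its number and a period (for example “1. …”). The pattern name `AWARD_PATTERN` and the sample data are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

sample_html = """
<div class="prose">
  <p><strong>1. Example Award One</strong></p>
  <p><strong>Example Song A - Example Artist A - WINNER</strong></p>
  <p><strong>2. Example Award Two</strong></p>
  <p><strong>Example Song B - Example Artist B - WINNER</strong></p>
</div>
"""
prose = BeautifulSoup(sample_html, "html.parser").find("div", class_="prose")

# Assumption: award lines look like "<number>. <award name>".
AWARD_PATTERN = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*$")

awards = []
for tag in prose.find_all("strong"):
    match = AWARD_PATTERN.match(tag.get_text(" ", strip=True))
    if match:
        awards.append(match.group(2))  # keep the name, drop the number

print(awards)  # ['Example Award One', 'Example Award Two']
```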
Everything looks good so far 😅. But this feels too easy, let’s check the number of awards stored
https://jovian.ml/kingtroga/scraping-the-web-for-details-on-the-2022-grammys/v/1&cellId=23
UH OH, there are supposed to be 86. I know that much about the 2022 GRAMMYs 😅
- On to the webpage for a manual check to see what’s wrong
- So award 44 is not in bold, that is, it’s not enclosed in a <strong> tag.

Yes, we can, Bob. Run along now, don’t deny those children their cartoons. 😤
Now that Bob has gone, to fix it we’ll simply get all the <p> tags and use a regular expression to pick out just the forty-fourth award. To do that in Python all one needs is the re module, and we have already imported that.
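Here’s a hedged sketch of that rescue. It assumes the un-bolded award sits in its own `<p>` tag and starts with “44.”; the helper name `find_award_in_paragraphs` and the sample award names are placeholders.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample: award 44 is missing its <strong> tag.
sample_html = """
<div class="prose">
  <p><strong>43. Example Award Forty-Three</strong></p>
  <p>44. Example Award Forty-Four</p>
  <p><strong>45. Example Award Forty-Five</strong></p>
</div>
"""
prose = BeautifulSoup(sample_html, "html.parser").find("div", class_="prose")


def find_award_in_paragraphs(container, number):
    """Return the name of award `number` from the <p> tags, or None."""
    pattern = re.compile(rf"^\s*{number}\.\s*(.+?)\s*$")
    for paragraph in container.find_all("p"):
        match = pattern.match(paragraph.get_text(" ", strip=True))
        if match:
            return match.group(1)
    return None


missing_award = find_award_in_paragraphs(prose, 44)
print(missing_award)  # Example Award Forty-Four
```

With the name in hand, something like `awards.insert(43, missing_award)` would slot it back into position 44 of the list.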
There you have it, we saved the day 🦸. On to the next task: getting the list of winners
We are basically doing the same thing we did to get the award list. Just that the pattern to follow this time is WINNER
- Or so we would have thought, if we hadn’t just had to squash some annoying bugs while getting the award list. Who would have thought that the official GRAMMY website had bugs? I guess you could say their pitch isn’t perfect 😂
- Anyways a manual check revealed the following:
1. There’s a winner without the word “WINNER”
2. There’s a winner with the word “WINNNER”
3. There’s an award that two people tied for, and the two people have the word “TIE” instead
4. There’s an award that is also tied but this time the people involved contain the word ‘Tie’
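The last three quirks can be covered by one forgiving pattern. This is a sketch, not the exact code from the project: the names `WINNER_MARKER`, `is_winner_line`, and `strip_marker` are made up, and so are the sample lines.

```python
import re

# WINN+ER matches both "WINNER" and the "WINNNER" typo;
# IGNORECASE lets "TIE" and "Tie" match the same branch.
WINNER_MARKER = re.compile(r"\b(?:WINN+ER|TIE)\b", re.IGNORECASE)


def is_winner_line(text):
    """True if the line carries any of the winner markers."""
    return WINNER_MARKER.search(text) is not None


def strip_marker(text):
    """Remove the marker and any leftover dashes or spaces around it."""
    return WINNER_MARKER.sub("", text).strip(" -")


lines = [
    "Example Song A - Example Artist A - WINNER",
    "Example Song B - Example Artist B - WINNNER",
    "Example Song C - Example Artist C - TIE",
    "Example Song D - Example Artist D - Tie",
    "Example Song E - Example Artist E",
]
winners = [strip_marker(line) for line in lines if is_winner_line(line)]
print(len(winners))  # 4
```

Two caveats: the winner that has no marker at all can’t be caught by any pattern and has to be added by hand, and the case-insensitive “tie” would also match a song title containing that word, so the matches deserve a quick look.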
With all of that clarified, let’s get the list of winners. With all the work we just did, our names had better be on that list 😤
Above, the sent_tokenize function from nltk was used. For a detailed explanation of how it works, click on nltk. The number of winners should be 86 now if we completed our task correctly
It would be cool if we could get the number of awards in each category in an automated way, but a manual inspection of the website’s HTML shows that there’s no automated way to get that info. That didn’t stop me 😤. Just like my African parents fought lions on their way to school, I checked the entire page and counted the awards in each category manually. Don’t worry, I cross-checked my data on the internet just to be sure I was accurate 😂
Finally, now that we have the data we were scraping for, let’s compile it into data frames and save them as CSV files on our systems
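A minimal sketch of that last step with pandas. The lists are tiny placeholders standing in for the scraped data, and the column names and the file name in the comment are my own assumptions.

```python
import pandas as pd

# Placeholder data standing in for the scraped lists.
categories = ["Example Category A", "Example Category B"]
awards = ["Example Award One", "Example Award Two"]
winners = ["Example Winner A", "Example Winner B"]

categories_df = pd.DataFrame({"Category": categories})
results_df = pd.DataFrame({"Award": awards, "Winner": winners})

# With no path, to_csv returns the CSV as a string, which is handy for a preview.
csv_text = results_df.to_csv(index=False)
print(csv_text)

# To write a file instead (hypothetical file name):
# results_df.to_csv("awards_and_winners.csv", index=False)
```

Passing `index=False` keeps pandas’ row numbers out of the file, so the CSV contains only the columns we built.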
Summary and Conclusion
- In this project, using Python, we first imported the needed modules, namely requests, BeautifulSoup, re, nltk, and pandas.
- We then downloaded the page contents of the 2022 GRAMMYs List with the requests module
- After that, we scraped the page, using the BeautifulSoup module, for the Categories, Awards, and Winners of the 2022 GRAMMYs
- Lastly, we stored our found information in a data frame then saved it to our system as a CSV file using the pandas module
- Note: To get the files, check the references for a link to my GitHub page. Go to the page and download them from there
Thank you for following along on this journey. Your name may not be on the GRAMMY list but surely you’re a winner. If you need me, I would be listening to Tyler’s new song, “Come on, Let’s go”. Bye🎧
References and Future Work
References:
- https://www.grammy.com/news/2022-grammys-complete-winners-nominees-nominations-list
- https://en.wikipedia.org/wiki/List_of_Grammy_Award_categories
- https://github.com/kingtroga/scraping-the-web-for-details-on-the-2022-grammys
Future work:
- There’s still room for improvement in the data I collected. The data could be cleaned more
- Using the ‘no_of_awards’ list one could perform hierarchical indexing on a new data frame that would contain the Categories, Awards, and Winners together.
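To make the second idea concrete, here’s a rough sketch of hierarchical indexing with pandas. It assumes `no_of_awards` holds one count per category, in the same order as the categories; all the values below are placeholders.

```python
import pandas as pd

# Placeholder data; no_of_awards[i] is the number of awards in categories[i].
categories = ["Example Category A", "Example Category B"]
no_of_awards = [2, 1]
awards = ["Example Award One", "Example Award Two", "Example Award Three"]
winners = ["Example Winner A", "Example Winner B", "Example Winner C"]

if sum(no_of_awards) != len(awards):
    raise ValueError("Counts per category must add up to the number of awards")

# Repeat each category once per award it contains.
category_per_award = [
    category
    for category, count in zip(categories, no_of_awards)
    for _ in range(count)
]

index = pd.MultiIndex.from_arrays(
    [category_per_award, awards], names=["Category", "Award"]
)
combined = pd.DataFrame({"Winner": winners}, index=index)
print(combined.loc["Example Category A"])
```

Selecting with `combined.loc["Example Category A"]` then returns just the awards and winners in that category.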