an old school gramophone
Photo by Jace & Afsoon on Unsplash

Scraping for Categories, Winners, and Nominees for the 2022 GRAMMYs

Tari Yekorogha
6 min read · Apr 9, 2022

Web scraping with Python made fun

I’m one of those people who always claim to be really into music, yet the only thing I know about the 2022 GRAMMYs is that Tyler’s album won 😂. And if you are seeing this, you probably don’t know much about the event either. Let’s fix that by scraping the web for the information, so that our friends never start a conversation we can’t keep up with. We’ll already know so much that we’ll be the ones starting the conversation with a “How you doin’?”:

Joey from FRIENDS saying “How you doin’?” to Rachel
Joey from FRIENDS

Description

  • For this project, I’m going to be scraping the official GRAMMYs website using Python
  • Libraries used include: requests, BeautifulSoup, re, nltk, pandas

Web Scraping begins

First, let’s load all the libraries that we’ll be using for this project
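A minimal sketch of what that import cell could look like. The `pd` alias and the exact import style are assumptions; only the library names come from the list above.

```python
import re  # standard library: regular expressions for pattern matching

import requests  # downloads the page
from bs4 import BeautifulSoup  # parses the downloaded HTML
import pandas as pd  # builds data frames and writes CSV files
from nltk.tokenize import sent_tokenize  # splits text into sentences
```

BeautifulSoup is installed as the `beautifulsoup4` package but imported from `bs4`.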

Next, let’s download the page with requests
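Here is a hedged sketch of the download step. The URL is the one listed in the References; the helper name `download_page` and the timeout value are hypothetical choices made for illustration.

```python
import requests

# URL of the 2022 GRAMMYs list, taken from the References section.
GRAMMYS_URL = (
    "https://www.grammy.com/news/"
    "2022-grammys-complete-winners-nominees-nominations-list"
)


def download_page(url, timeout=30.0):
    """Return the HTML text of `url` (hypothetical helper, not from the post)."""
    response = requests.get(url, timeout=timeout)
    # Fail loudly on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html = download_page(GRAMMYS_URL)
```

Calling `raise_for_status()` is a small safety net: without it, a blocked or missing page would quietly hand us the wrong HTML.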

Then we’ll use BeautifulSoup to parse and extract information
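A small sketch of the parsing step. The `html` string below is a made-up stand-in for the downloaded page, so the markup is an assumption and not the real GRAMMYs page.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the text returned by requests.
html = (
    "<html><head><title>2022 GRAMMYs</title></head>"
    "<body><div class='prose'><h1>General Field</h1></div></body></html>"
)

# "html.parser" is Python's built-in parser, so nothing extra to install.
soup = BeautifulSoup(html, "html.parser")

# A quick sanity check that the document was parsed.
page_title = soup.title.string
print(page_title)
```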

To extract information from the site, we’ll follow these steps:

  • Step 1: Open the official 2022 GRAMMYs winners and nominees list page

A pic of the official 2022 grammy list page
Image1 by Blog owner
  • Step 2: Scroll down to the awards section
An image showing the part of the webpage that has General Field
Image2 by Blog owner
  • Step 3: Right-click on the “General Field” element and click on inspect
A visual image of the step above
Image3 by Blog owner
  • Step 4: With the developer tools open, we see that the “General Field” element is an <h1> element nested under a <div> element.
Visual imagery of the step above
Image4 by Blog owner
  • Step 5: Next we’ll click on the <div> element to see what it contains (nests)
Visual Imagery of the step above
Image5 by Blog owner
  • Step 6: Lastly, we scroll down and find that the <div> element with the class prose contains all the information we need, so we’ll be working mainly with that.
Visual imagery of the step above
Image6 by Blog owner

We just found out that all the Grammy details we need are in a div element that belongs to the class prose. So let’s extract the details and start storing the info we need
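As a rough sketch, grabbing that div could look like this. The sample markup is invented for illustration; only the class name prose comes from the inspection steps.

```python
from bs4 import BeautifulSoup

# Invented sample markup: one unrelated div and one div of class "prose".
html = """
<div class="header"><h1>Site header</h1></div>
<div class="prose">
  <h1>General Field</h1>
  <p><strong>1. Example Award</strong></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with the underscore) is used because "class" is a Python keyword.
prose = soup.find("div", class_="prose")
first_heading = prose.find("h1").get_text(strip=True)
print(first_heading)
```

`find` returns `None` when nothing matches, so on the real page it is worth checking `prose is not None` before going further.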

First, we get a list of all the categories in the 2022 GRAMMY Awards

  • From one of the pictures we looked at above, we can see that the “General Field” element is an <h1> element
Visual imagery of the explanation above
Image7 by Blog owner
  • If we check the page further, we see that every other category heading has the same look as the “General Field” element, and nothing else on the page shares that look
Visual Imagery of the explanation above
Image8 by Blog owner

That means that inside the <div> of class prose, only the categories are <h1> tags. Let’s get the categories then. First, we’ll get all the <h1> tags, then we’ll store the categories in a list.
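A minimal sketch of that idea, run on invented markup (the category and award names below are placeholders, not scraped data):

```python
from bs4 import BeautifulSoup

# Invented sample: an <h1> outside the prose div and two inside it.
html = """
<h1>Page title outside the prose div</h1>
<div class="prose">
  <h1>General Field</h1>
  <p><strong>1. Example Award One</strong></p>
  <h1>Pop</h1>
  <p><strong>2. Example Award Two</strong></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
prose = soup.find("div", class_="prose")

# Search inside the prose div only, so unrelated <h1> tags are ignored.
h1_tags = prose.find_all("h1")
categories = [tag.get_text(strip=True) for tag in h1_tags]
print(categories)  # ['General Field', 'Pop']
```

Searching from `prose` instead of `soup` is what keeps the page title out of the list.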

For the awards and winners

Inspection of the page shows that all awards and winners are in bold, that is, they are wrapped in a <strong> tag. See:

1. Exhibit A:

Visual imagery of the explanation above
Image9 by Blog owner

2. Exhibit B:

Visual imagery of the explanation above
Image10 by Blog owner

Knowing that, let’s get all the <strong> tags.
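Sketched on invented markup, collecting the bold elements could look like this. The way the winner line is written here is an assumption about the page, used only to have something to parse.

```python
from bs4 import BeautifulSoup

# Invented sample: an award and a winner in bold, plus a plain nominee.
html = """
<div class="prose">
  <p><strong>1. Example Award</strong></p>
  <p><strong>Example Song -- WINNER</strong><br/>Example Artist</p>
  <p>Another Song<br/>Another Artist</p>
</div>
"""

prose = BeautifulSoup(html, "html.parser").find("div", class_="prose")

# Every bold element inside the prose div, in page order.
strong_tags = prose.find_all("strong")
strong_texts = [tag.get_text(" ", strip=True) for tag in strong_tags]
print(strong_texts)
```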

Secondly, let’s get each award given and store it in a list
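One way to separate the awards from the rest of the bold text is sketched below. It assumes each award title starts with its number followed by a period (for example “1. ...”). That format is a guess based on the awards being numbered, and the strings are placeholders.

```python
import re

# Placeholder strings standing in for the text of the <strong> tags.
strong_texts = [
    "1. Example Award One",
    "Example Song -- WINNER",
    "2. Example Award Two",
    "Example Album -- WINNER",
]

# Assumed format: optional spaces, a number, a period, then the award name.
AWARD_PATTERN = re.compile(r"^\s*\d+\.\s*(.+?)\s*$")

awards = []
for text in strong_texts:
    match = AWARD_PATTERN.match(text)
    if match:
        awards.append(match.group(1))  # keep the name without its number

print(awards)  # ['Example Award One', 'Example Award Two']
```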

Everything looks good so far 😅. But this feels too easy. Let’s check the number of awards stored

https://jovian.ml/kingtroga/scraping-the-web-for-details-on-the-2022-grammys/v/1&cellId=23

UH OH, there are supposed to be 86. I know that much about the 2022 GRAMMYs 😅
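Before eyeballing the whole page, a short check like the one below can narrow down which award number went missing. It is only a sketch: it assumes the awards are numbered “1.”, “2.”, and so on, and `missing_award_numbers` is a hypothetical helper name.

```python
import re


def missing_award_numbers(strong_texts, expected_total):
    """Return the award numbers in 1..expected_total that never show up."""
    found = set()
    for text in strong_texts:
        match = re.match(r"\s*(\d+)\.", text)
        if match:
            found.add(int(match.group(1)))
    return [n for n in range(1, expected_total + 1) if n not in found]


# Placeholder data: award 2 is missing from the bold text.
sample = ["1. Example Award One", "Example Song -- WINNER", "3. Example Award Three"]
missing = missing_award_numbers(sample, expected_total=3)
print(missing)  # [2]
```

On the real data, `expected_total` would be 86.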

  • Onto the webpage for a manual check to see what’s wrong
  • So award 44 is not in bold, that is, it’s not enclosed in a <strong> tag.
Visual imagery of explanation above
Image11 by Blog Owner
a picture of bob the builder asking “Can we fix it?”
Bob, Bob the builder.

Yes, we can, Bob. Run along now, don’t deny those children their cartoons. 😤

Now that Bob has gone, to fix it we’ll simply get all the <p> tags and use a regular expression to pick out just the forty-fourth award. To do that in Python, all one needs is the re module, and we have already imported that.
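A hedged sketch of that fix on invented markup. To keep the sample short, the award left out of bold here is number 2 instead of 44. The helper name `find_award_by_number` and the “number, period, name” format are assumptions.

```python
import re
from bs4 import BeautifulSoup

# Invented sample: award 2 is plain text, so the <strong> search misses it.
html = """
<div class="prose">
  <p><strong>1. Example Award One</strong></p>
  <p>2. Example Award Two<br/>Some description text.</p>
  <p><strong>3. Example Award Three</strong></p>
</div>
"""
prose = BeautifulSoup(html, "html.parser").find("div", class_="prose")


def find_award_by_number(container, number):
    """Return the name of the award numbered `number`, or None if not found."""
    pattern = re.compile(rf"^\s*{number}\.\s*(.+)")
    for p_tag in container.find_all("p"):
        # Only the first line of each paragraph can hold the award title.
        first_line = p_tag.get_text("\n", strip=True).split("\n")[0]
        match = pattern.match(first_line)
        if match:
            return match.group(1).strip()
    return None


awards = ["Example Award One", "Example Award Three"]  # what <strong> gave us
missing_number = 2  # this would be 44 on the real page
missing_award = find_award_by_number(prose, missing_number)
if missing_award is not None:
    # Put the award back at its proper position (lists are 0-indexed).
    awards.insert(missing_number - 1, missing_award)
print(awards)
```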

There you have it, we saved the day 🦸. Onto the next task: getting the list of winners

We are basically doing the same thing we did to get the award list, except that the pattern to look for this time is the word “WINNER”

Visual imagery for explanation above
Image 12 by Blog owner
  • Or so we would have thought, if we hadn’t just had to squash some annoying bugs while we were getting the award list. Who would have thought that the official GRAMMY website had bugs? I guess you could say their pitch isn’t perfect 😂
  • Anyway, a manual check revealed the following:

1. There’s a winner without the word “WINNER”

Visual imagery for explanation above
Image 13 by Blog owner

2. There’s a winner with the word “WINNNER”

Visual imagery for explanation above
Image 14 by Blog owner

3. There’s an award that two people tied for, and the two people have the word “TIE” instead

Visual imagery for explanation above
Image 15 by Blog owner

4. There’s another award that is tied, but this time the winners are marked with the word “Tie”

Visual imagery for explanation above
Image 16 by Blog owner

With all of that clarified, let’s get the list of winners. With all the work we just did, our names had better be on that list 😤
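Here is one way the four quirks above could be handled with a single pattern. This is a sketch on placeholder strings: the exact punctuation around the markers on the real page is an assumption, and `extract_winner` is a hypothetical helper.

```python
import re

# Matches WINNER, the misspelled WINNNER, and both spellings of the tie marker.
WINNER_MARK = re.compile(r"\b(?:WINN+ER|TIE|Tie)\b")


def extract_winner(text):
    """Return the text without its marker, or None if no marker is present."""
    if not WINNER_MARK.search(text):
        return None
    # Drop the marker, then trim leftover spaces, dashes and brackets.
    return WINNER_MARK.sub("", text).strip(" -()")


# Placeholder strings standing in for the bold text on the page.
strong_texts = [
    "1. Example Award One",
    "Example Song A -- WINNER",
    "Example Song B -- WINNNER",
    "Example Song C (TIE)",
    "Example Song D - Tie",
]

winners = [w for w in map(extract_winner, strong_texts) if w is not None]
print(winners)
```

The winner that carries no marker at all cannot be caught by any pattern like this, so that one entry still has to be added by hand.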

The sent_tokenize function from nltk was used above. For a detailed explanation of how it works, see the nltk documentation. The number of winners should be 86 now if we completed our task correctly
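For readers who haven’t met it, here is a tiny illustration of what sent_tokenize does. How it was applied to the winners is not spelled out here, so trimming text down to its first sentence is only one plausible use. The regex fallback is an addition for machines where the Punkt tokenizer data has not been downloaded.

```python
import re
from nltk.tokenize import sent_tokenize


def first_sentence(text):
    """Return the first sentence of `text` (hypothetical helper)."""
    try:
        sentences = sent_tokenize(text)
    except LookupError:
        # nltk's Punkt data is not installed: fall back to a crude split
        # after ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return sentences[0] if sentences else ""


sample = "This is the first sentence. This is the second one."
result = first_sentence(sample)
print(result)
```

The tokenizer data is normally fetched once through `nltk.download`, which needs an internet connection.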

It would be cool if we could get the number of awards in each category in an automated way, but a manual inspection of the website’s HTML shows that there’s no automated way to get that info. That didn’t stop me 😤. Just like my African parents fought lions on their way to school, I went through the entire page and counted the awards in each category manually. Don’t worry, I cross-checked my data on the internet just to be sure I was accurate 😂

Finally, now that we have the data we were scraping for, let’s compile it into data frames and save them as CSV files on our systems
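A minimal sketch of that last step with placeholder data. The column names and the file name in the comment are assumptions, not the ones used for the real files.

```python
import pandas as pd

# Placeholder data standing in for the scraped lists.
awards = ["Example Award One", "Example Award Two"]
winners = ["Example Song A", "Example Album B"]

# Both lists must have the same length, one winner per award.
awards_df = pd.DataFrame({"Award": awards, "Winner": winners})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = awards_df.to_csv(index=False)
print(csv_text)

# To write to disk instead (hypothetical file name):
# awards_df.to_csv("grammys_2022_awards.csv", index=False)
```

Passing `index=False` keeps the 0, 1, 2... row labels out of the file.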

Summary and Conclusion

  • In this project, using Python, we first imported the needed modules: requests, BeautifulSoup, re, nltk, and pandas.
  • We then downloaded the page contents of the 2022 GRAMMYs list with the requests module
  • After that, we scraped the page, using the BeautifulSoup module, for the Categories, Awards, and Winners of the 2022 GRAMMYs
  • Lastly, we stored the information we found in data frames, then saved them to our system as CSV files using the pandas module
  • Note: To get the files, check the references for a link to my GitHub page. Go to the page and download them from there

Thank you for following along with me on this journey. Your name may not be on the GRAMMY list, but surely you’re a winner. If you need me, I’ll be listening to Tyler’s new song, “Come on, Let’s go”. Bye 🎧

References and Future Work

References:

  1. https://www.grammy.com/news/2022-grammys-complete-winners-nominees-nominations-list
  2. https://en.wikipedia.org/wiki/List_of_Grammy_Award_categories
  3. https://github.com/kingtroga/scraping-the-web-for-details-on-the-2022-grammys

Future work:

  1. There’s still room for improvement in the data I collected. The data could be cleaned more
  2. Using the ‘no_of_awards’ list one could perform hierarchical indexing on a new data frame that would contain the Categories, Awards, and Winners together.
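As a starting point for the second idea, here is a sketch of how a list like ‘no_of_awards’ could drive a hierarchical index. All names and counts below are placeholders, and the layout (Category and Award as index levels, Winner as a column) is just one possible design.

```python
import pandas as pd

# Placeholder data: two categories holding 2 and 1 awards respectively.
categories = ["Field A", "Field B"]
no_of_awards = [2, 1]
awards = ["Award 1", "Award 2", "Award 3"]
winners = ["Winner 1", "Winner 2", "Winner 3"]

# Repeat each category once per award it contains, so it lines up with awards.
category_per_award = [
    category
    for category, count in zip(categories, no_of_awards)
    for _ in range(count)
]

index = pd.MultiIndex.from_arrays(
    [category_per_award, awards], names=["Category", "Award"]
)
combined = pd.DataFrame({"Winner": winners}, index=index)

# Selecting one category returns just its awards and winners.
field_b = combined.loc["Field B"]
print(combined)
```

This only works if the counts in `no_of_awards` add up to the number of awards, which makes for a handy consistency check on the manual counting.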

Tari Yekorogha

A 19-year-old Christian boy with a laptop and a dream to break into the tech world.