Introduction

We’ve all seen our fair share of superhero movies. They’re so popular nowadays that it’s hard to go on social media without seeing a trailer for the next one coming to theaters. Whether you’re team Marvel or team DC, there’s no denying that superhero movies have been, are, and will continue to be a huge hit. But what makes these movies so good? My data collection, exploration, and analysis focus on the traits shared by the top 200 superhero movies, including runtime, genre, rating, and more.

Complete code and data can be found at this GitHub repo.

Data Collection

Tools

I used Python, specifically the Requests and Beautiful Soup packages, to scrape this collection of superhero movies. Because I’m using the data only to build my data analysis skills, not for monetary gain or praise, I consider this an ethical use of it. I followed good scraping practices by recording the information exactly as it appears on the website when creating the dataframe, without changing or ignoring anything. I’d also like to thank the creators of the website for providing this information. Below is how I scraped the site and created the movies dataframe.

Step 1: Create Beautiful Soup Object

After choosing the website you want to scrape, store its URL in a variable.

url = "https://www.imdb.com/list/ls074940992/"

Creating a Beautiful Soup (bs) object from the response text allows you to search for and extract the information you want.

import requests
from bs4 import BeautifulSoup

r0 = requests.get(url)
bs = BeautifulSoup(r0.text, "html.parser")
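If you want this step to fail loudly when the page doesn't load, here is a slightly more defensive sketch. The function names `fetch_html` and `make_soup` and the User-Agent string are my own placeholders for illustration, not part of the original notebook. The demo at the bottom parses a tiny inline page, so it runs without touching the network.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    # Hypothetical User-Agent string; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0 (learning project)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return response.text


def make_soup(html):
    """Parse HTML with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Offline demonstration with a tiny inline page (no network needed).
demo = make_soup("<html><body><h3><a>Example Movie</a></h3></body></html>")
demo_title = demo.find("h3").find("a").text
```

With the real page, you would call `make_soup(fetch_html(url))` in place of the two lines above.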

Step 2: Extract Wanted Tags

Knowing which HTML code contains the information you want can be tricky. To find it, right click on the webpage you’re scraping and click ‘inspect.’ Enable cursor inspection and explore the parts of the webpage you are interested in scraping. The information you want is likely inside a tag, most likely a nested tag. I like to think of these tags as Russian nesting dolls. In the example below, the first tag we’re interested in is a div tag. Next, we navigate to an h3 tag. Finally, the title of the movie is found in an a tag. The title sits in a tag within a tag within a tag, just like smaller dolls are nested within bigger dolls. If needed, narrow the search to a specific class, as seen below.

# extract the div tag 
div = bs.find_all("div", {"class": "lister-item-content"})
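To see the nesting dolls in action without hitting the website, here is a small sketch on made-up markup. The HTML string is a simplified stand-in I wrote to mimic the div > h3 > a structure described above, and the `sample_` names are hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the div > h3 > a nesting.
sample_html = """
<div class="lister-item-content">
  <h3><span>1.</span> <a href="/title/x/">Movie A</a></h3>
</div>
<div class="lister-item-content">
  <h3><span>2.</span> <a href="/title/y/">Movie B</a></h3>
</div>
"""

sample_bs = BeautifulSoup(sample_html, "html.parser")
sample_div = sample_bs.find_all("div", {"class": "lister-item-content"})

# Open the dolls one at a time: div -> h3 -> a
first_h3 = sample_div[0].find("h3")
first_a = first_h3.find("a")

# The same navigation applied to every div at once
sample_titles = [d.find("h3").find("a").text for d in sample_div]
```

Each `find` call opens one doll, and `find_all` hands back every outer doll on the page so you can repeat the same steps for each movie.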

Method 1

Use list comprehensions to loop through the div tags and pull the information you want from each one.

# extract title
titles = [d.find('h3').find('a').text for d in div]

# extract year
year = [d.find('h3').find('span',{"class":"lister-item-year text-muted unbold"}).text.strip('()') for d in div]

# extract genre
genre = [d.find('p').find('span',{"class":"genre"}).text for d in div]

Method 2

Sometimes the above method of extracting the information will throw an error. This is most likely because some movies are missing a value: when a tag isn’t there, find returns None, and calling .text on None raises an AttributeError. In this case, create a function that assigns None to those missing values, then call it to perform the extraction.

# create function to extract rating
def pull_rating(d):
  try:
    r = d.find('p').find('span',{"class":"certificate"}).text
  except AttributeError:  # find returned None because the tag is missing
    r = None
  return r
  
# call function to extract rating 
ratings_pulled = [pull_rating(d) for d in div]

# create function to extract runtime
def pull_runtime(d):
  try:
    r = d.find('p').find('span',{"class":"runtime"}).text
  except AttributeError:  # find returned None because the tag is missing
    r = None
  return r
  
# call function to extract runtime
runtimes_pulled = [pull_runtime(d) for d in div]
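Since the rating and runtime functions differ only in the class name, one possible refactor is a single helper that takes the class as an argument. This is a sketch, not the code I originally ran: `pull_text` is a hypothetical name, and the sample markup is made up so the example runs on its own.

```python
from bs4 import BeautifulSoup


def pull_text(d, class_name):
    """Return the text of the matching span inside the first <p>, or None."""
    p = d.find("p")
    if p is None:
        return None
    span = p.find("span", {"class": class_name})
    return span.text if span is not None else None


# Made-up markup: the second movie has no certificate span.
sample_html = """
<div class="lister-item-content">
  <p><span class="certificate">PG-13</span><span class="runtime">126 min</span></p>
</div>
<div class="lister-item-content">
  <p><span class="runtime">143 min</span></p>
</div>
"""
sample_div = BeautifulSoup(sample_html, "html.parser").find_all(
    "div", {"class": "lister-item-content"}
)

sample_ratings = [pull_text(d, "certificate") for d in sample_div]
sample_runtimes = [pull_text(d, "runtime") for d in sample_div]
```

Checking for None explicitly does the same job as the try/except version, and it makes it obvious which step can come back empty.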

Method 3

Below is another way to extract the information you want, this time using a for loop. When extracting the gross for each movie, I had to index to the tag I wanted because several tags shared the same name and the same class. I created an empty list, looped through all the div tags to pull the gross, and appended each value to the list.

# for loop to extract gross
gross = []
for d in div:
  try:
    gross.append(d.find_all('p',{"class":"text-muted text-small"})[2].find_all('span',{'name':'nv'})[1]['data-value'])
  except (IndexError, KeyError):  # expected tag or attribute is missing
    gross.append(None)
gross
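Here is the same indexing idea as a sketch on made-up markup, so you can see what each index picks out. The HTML is a simplified stand-in that I wrote for illustration (the real page has more going on), and `pull_gross` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup: three <p> tags share one class, and two <span> tags
# share name="nv". The second movie has no votes/gross paragraph at all.
sample_html = """
<div class="lister-item-content">
  <p class="text-muted text-small">PG-13 | 126 min</p>
  <p class="text-muted text-small">Director: Someone</p>
  <p class="text-muted text-small">
    <span name="nv" data-value="900000">900,000</span>
    <span name="nv" data-value="318,412,101">$318.41M</span>
  </p>
</div>
<div class="lister-item-content">
  <p class="text-muted text-small">PG | 100 min</p>
  <p class="text-muted text-small">Director: Someone Else</p>
</div>
"""
sample_div = BeautifulSoup(sample_html, "html.parser").find_all(
    "div", {"class": "lister-item-content"}
)


def pull_gross(d):
    """Return the gross from the third <p> and second 'nv' span, or None."""
    try:
        details = d.find_all("p", {"class": "text-muted text-small"})[2]
        # 'name' goes in the attrs dict because find_all's own 'name'
        # argument already means the tag name.
        return details.find_all("span", {"name": "nv"})[1]["data-value"]
    except (IndexError, KeyError):
        return None


sample_gross = [pull_gross(d) for d in sample_div]
```

The `[2]` picks the third paragraph and the `[1]` picks the second span inside it. When either one is missing, the IndexError is caught and the movie gets None.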

Step 3: Create a Dataframe

Once you have all the information you want, it’s time to put it into a dataframe. Doing this allows you to explore the data. Create a dataframe, then add however many columns you want with their corresponding values.

import pandas as pd

# create dataframe of movies 1-100
df = pd.DataFrame(titles, columns=['Title'])
df['Year'] = year
df['Rating'] = ratings_pulled
df['Runtime'] = runtimes_pulled
df['Genre'] = genre
df['Gross'] = gross

You now have an easy-to-read dataframe that allows you to have fun with exploring the data!
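Everything scraped this way arrives as text, so before doing any math you will probably want numeric columns. Below is a minimal sketch of one way to do that. It assumes the runtime looks like "126 min" and the gross uses commas as thousands separators. The rows are made up, and the new column names (`Runtime_min`, `Gross_num`, `Year_num`) are hypothetical.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped columns.
sample_df = pd.DataFrame(
    {
        "Title": ["Movie A", "Movie B"],
        "Year": ["2008", "2012"],
        "Runtime": ["126 min", None],
        "Gross": ["318,412,101", None],
    }
)

# Pull the digits out of strings such as "126 min"; missing values stay missing.
sample_df["Runtime_min"] = pd.to_numeric(
    sample_df["Runtime"].str.extract(r"(\d+)", expand=False), errors="coerce"
)

# Strip thousands separators before converting the gross to a number.
sample_df["Gross_num"] = pd.to_numeric(
    sample_df["Gross"].str.replace(",", "", regex=False), errors="coerce"
)

# Years arrive as strings too; anything that isn't a plain number becomes NaN.
sample_df["Year_num"] = pd.to_numeric(sample_df["Year"], errors="coerce")
```

Using `errors="coerce"` turns anything unparseable into NaN instead of raising, which fits a scraped dataset where some movies are missing values.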

Conclusion

In this post, I explained how I used web scraping to compile a dataframe of 200 superhero movies. Web scraping can be tricky, but as long as you know where to look, you’ll be able to access the information you need. I plan to use exploratory data analysis to find any trends or similarities among my dataset of movies. What makes a great superhero movie? Let’s find out!

If you have any questions, comments, or concerns, please leave them in the comment section below. I’d also love to know: what’s your favorite superhero movie?