The data set I've picked contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. In this report, I'll be wrangling and analyzing the TMDb data set.
The data set has two columns each for budget and revenue. The second column of each pair (budget_adj and revenue_adj) holds the value adjusted for inflation, expressed in 2010 dollars.
The data set also has three columns that pack several values into one field, separated by the "|" delimiter.
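As a rough illustration of how a "|"-delimited column can be unpacked, the sketch below splits the genres column and gives each value its own row. The two sample rows are made up for the example and are not taken from the data set; the same pattern should apply to the other delimited columns.

```python
import pandas as pd

# Made-up sample rows standing in for the real data (illustrative only).
sample = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Thriller", "Drama"],
})

# Split each "|"-delimited string into a list, then expand the list
# so that every genre sits on its own row.
genres_long = (
    sample.assign(genres=sample["genres"].str.split("|"))
          .explode("genres")
          .reset_index(drop=True)
)

print(genres_long)
```

In this long form, each movie appears once per genre, which makes per-genre grouping straightforward.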
- numpy
- pandas
- matplotlib.pyplot
- seaborn
The modules listed above can be installed with Anaconda (the recommended setup for running the ipynb files) using conda install module_name
or with the conventional pip install module_name (note that matplotlib.pyplot ships as part of the matplotlib package).
The recommended way to run the ipynb files is by setting up a virtual environment with conda and running the files in a jupyter notebook. Click here to learn how to set up and manage virtual environments with conda.
The HTML files that contain all the necessary code and findings are also available in the main branch.
The files in this repo currently have no bugs.
- id: This column won't be useful in the analysis as it's too specific to a movie and isn't a numerical data type that can be averaged out.
- imdb_id: This column also won't be useful in the analysis as it's too specific to a movie and isn't a numerical data type that can be averaged out.
- popularity: This column will be very useful in the analysis as it can be considered an independent variable for the revenue (dependent variable) as the revenue can depend on the popularity.
- budget: The budget here isn't adjusted for inflation, so I'll be using the adjusted budget (budget_adj) instead.
- revenue: The revenue here isn't adjusted for inflation, so I'll be using the adjusted revenue (revenue_adj) instead.
- original_title: This column also won't be useful in the analysis as it's too specific to a movie and isn't a numerical data type that can be averaged out.
- cast: This column will probably be useful in the analysis.
- homepage: This column also won't be useful in the analysis as it's too specific to a movie and isn't a numerical data type that can be averaged out.
- director: This column will probably be useful in the analysis.
- tagline: This column also won't be useful in the analysis as it's too specific to a movie and isn't a numerical data type that can be averaged out.
- keywords: This column also won't be useful in the analysis as it's too specific to a movie and isn't a numerical data type that can be averaged out.
- overview: This column also won't be useful in the analysis as it's too specific to a movie and isn't a numerical data type that can be averaged out.
- runtime: This column will be very useful in the analysis as it can be considered an independent variable for the revenue (dependent variable) as the revenue can depend on the runtime.
- genres: This column will probably be useful for the analysis.
- production_companies: This column will probably be useful for the analysis.
- release_date: This column also won't be useful in the analysis as it's too specific to a movie and isn't a numerical data type that can be averaged out.
- vote_count: This column won't be useful, as the average vote already gives us the required vote information and we don't need to know the number of people that voted.
- vote_average: This column will be very useful in the analysis as it can be considered an independent variable for the revenue (dependent variable) as the revenue can depend on the vote average.
- release_year: This will be useful in the analysis to see the most-liked movies from year to year.
- budget_adj: This is very important to the analysis as it can be considered an independent variable that the revenue generated can depend on.
- revenue_adj: This is probably the most important column here as most of this analysis will be based on it.
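The column selection described above could be applied along these lines. This is a minimal sketch: the helper name drop_unused_columns and the one-row sample frame are hypothetical, and the list of dropped columns simply mirrors the ones marked as not useful in the list.

```python
import pandas as pd

# Columns the list above marks as not useful for the analysis.
UNUSED_COLUMNS = [
    "id", "imdb_id", "budget", "revenue", "original_title", "homepage",
    "tagline", "keywords", "overview", "release_date", "vote_count",
]

def drop_unused_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the unused columns (hypothetical helper)."""
    # errors="ignore" keeps the call safe if a column is already missing.
    return df.drop(columns=UNUSED_COLUMNS, errors="ignore")

# Made-up one-row frame with a subset of the columns (illustrative only).
sample = pd.DataFrame({
    "id": [1],
    "popularity": [1.5],
    "budget": [100],
    "budget_adj": [120.0],
    "revenue_adj": [300.0],
    "vote_count": [10],
})

trimmed = drop_unused_columns(sample)
print(trimmed.columns.tolist())
```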
In this analysis, I'll be answering the following questions:
- Which genres are most popular from year to year?
- What properties are associated with movies that have high revenues?
- What level of popularity generates the highest revenue?
- How does runtime affect the revenue generated?
- How does vote average affect the revenue generated?
- How does budget affect the revenue generated?
- How does the genre affect the revenue generated?
My analysis will start with data wrangling, in which I'll be assessing the dataset and cleaning the errant columns and rows. After that, I'll do some exploratory data analysis, using scatter plots and bar charts to look for trends in the data.
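One possible way to approach the genre-by-year question during the exploratory step is sketched below. It assumes "most popular" means the highest mean popularity per genre within a year, which is an assumption rather than the notebook's confirmed method, and the four sample rows are made up.

```python
import pandas as pd

# Made-up sample rows (illustrative only, not from the data set).
sample = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001],
    "genres": ["Action|Thriller", "Thriller", "Drama|War", "Drama"],
    "popularity": [2.0, 4.0, 3.0, 1.0],
})

# One row per (movie, genre) pair.
long = sample.assign(genres=sample["genres"].str.split("|")).explode("genres")

# Mean popularity of each genre within each year.
means = long.groupby(["release_year", "genres"], as_index=False)["popularity"].mean()

# Keep the genre with the highest mean popularity for each year.
top = means.loc[means.groupby("release_year")["popularity"].idxmax()]
top = top.reset_index(drop=True)

print(top)
```

The resulting frame has one row per year and could be passed to a bar chart.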
- Thriller, Western, and War are the most popular genres year to year
- Properties associated with movies that have high revenues include:
  - High level of popularity
  - Long runtimes
  - High vote averages
  - High budget