TMDb Movie Data Analysis

The data set I've picked contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. In this report, I'll be wrangling and analyzing the TMDb data set.

The data set has two columns each for budget and revenue. The second column of each pair (budget_adj and revenue_adj) holds the value adjusted for inflation and expressed in 2010 dollars.

The data set also has three columns whose entries hold several values separated by the "|" delimiter.
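One possible way to handle the "|"-delimited columns is sketched below. It is not necessarily the approach used in the notebook: the two-row sample frame is made up for illustration, and genres is used only as an example of such a column.

```python
import pandas as pd

# Hypothetical two-row sample; the real data lives in tmdb-movies.csv.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Thriller", "Drama"],
})

# Split each "|"-delimited string into a list, then give every value its own row.
genres_long = (
    df.assign(genres=df["genres"].str.split("|"))
      .explode("genres")
      .reset_index(drop=True)
)
print(genres_long)
```

With one value per row, the column can be grouped and counted like any other categorical column.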

Required Modules

numpy
pandas
matplotlib.pyplot
seaborn

Installations

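The modules listed in the section above can be installed in an Anaconda environment (the recommended way to run the ipynb files) using conda install module_name, or with the conventional pip install module_name. Note that matplotlib.pyplot is part of the matplotlib package, so the package to install is matplotlib.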
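To check that the installation worked, a small script like the one below could be run inside the environment. It is only a convenience sketch: the helper name missing_modules is made up here and is not part of the repo.

```python
from importlib.util import find_spec

# Modules listed under "Required Modules".
REQUIRED = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    missing = []
    for name in names:
        try:
            found = find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing

# An empty list means every required module can be imported.
print(missing_modules(REQUIRED))
```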

Setup

The recommended way to run the ipynb files is to set up a virtual environment with conda and run the files in a Jupyter notebook. See the conda documentation to learn how to set up and manage virtual environments.

The html files, which contain all the necessary code and findings, are also available in the main branch.

Known Bugs

The files in this repo currently have no known bugs.

Dataset Columns

id: This column won't be useful in the analysis. It is an identifier unique to each movie, not a quantity that can be averaged.
imdb_id: This column also won't be useful in the analysis, as it is another identifier unique to each movie.
popularity: This column will be very useful in the analysis. It can be treated as an independent variable, with revenue as the dependent variable, since revenue can depend on popularity.
budget: The budget here isn't adjusted for inflation, so I'll be using the adjusted budget (budget_adj).
revenue: The revenue here isn't adjusted for inflation, so I'll be using the adjusted revenue (revenue_adj).
original_title: This column won't be useful in the analysis as it's too specific to a movie and isn't numerical data that can be averaged.
cast: This column will probably be useful in the analysis.
homepage: This column won't be useful in the analysis as it's too specific to a movie and isn't numerical data that can be averaged.
director: This column will probably be useful in the analysis.
tagline: This column won't be useful in the analysis as it's too specific to a movie and isn't numerical data that can be averaged.
keywords: This column won't be useful in the analysis as it's too specific to a movie and isn't numerical data that can be averaged.
overview: This column won't be useful in the analysis as it's too specific to a movie and isn't numerical data that can be averaged.
runtime: This column will be very useful in the analysis. It can be treated as an independent variable, with revenue as the dependent variable, since revenue can depend on runtime.
genres: This column will probably be useful in the analysis.
production_companies: This column will probably be useful in the analysis.
release_date: This column won't be useful in the analysis as it's too specific to a movie to aggregate; the analysis uses release_year instead.
vote_count: This column won't be useful, as the average vote gives the required vote information and the number of people who voted isn't needed.
vote_average: This column will be very useful in the analysis. It can be treated as an independent variable, with revenue as the dependent variable, since revenue can depend on the vote average.
release_year: This column will be useful in the analysis to see the most-liked movies from year to year.
budget_adj: This column is very important to the analysis, as it can be treated as an independent variable on which the revenue generated can depend.
revenue_adj: This is probably the most important column here, as most of this analysis will be based on it.
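The column choices above could be applied with a single drop call, as in the sketch below. The UNUSED list is read off the descriptions in this section, and the one-row sample frame and the helper name drop_unused are made up for illustration; the notebook may trim the data differently.

```python
import pandas as pd

# Columns described above as not useful; budget and revenue are
# superseded by their inflation-adjusted versions.
UNUSED = [
    "id", "imdb_id", "budget", "revenue", "original_title", "homepage",
    "tagline", "keywords", "overview", "release_date", "vote_count",
]

def drop_unused(df):
    # errors="ignore" lets the sketch run on frames that lack some columns.
    return df.drop(columns=UNUSED, errors="ignore")

# Hypothetical one-row sample standing in for tmdb-movies.csv.
sample = pd.DataFrame({
    "id": [1],
    "popularity": [2.5],
    "budget": [100],
    "budget_adj": [110.0],
    "revenue_adj": [300.0],
})
trimmed = drop_unused(sample)
print(list(trimmed.columns))
```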

Questions for Analysis

In this analysis, I'll be answering the following questions:

Which genres are most popular from year to year?
Which properties are associated with movies that have high revenues?

What level of popularity generates the highest revenue?
How does runtime affect the revenue generated?
How does vote average affect the revenue generated?
How does budget affect the revenue generated?
How does the genre affect the revenue generated?
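The first question could be approached roughly as follows. This is a sketch under an assumption: "most popular" is taken here to mean the highest mean popularity score per genre within a year, which may differ from the definition used in the notebook. The sample rows are invented.

```python
import pandas as pd

# Hypothetical sample rows; the real data lives in tmdb-movies.csv.
df = pd.DataFrame({
    "release_year": [2010, 2010, 2011, 2011],
    "genres": ["Action|Thriller", "Thriller", "Drama|War", "Drama"],
    "popularity": [4.0, 6.0, 3.0, 1.0],
})

# One row per (movie, genre) pair.
long = df.assign(genres=df["genres"].str.split("|")).explode("genres")

# Mean popularity of each genre within each year.
mean_pop = long.groupby(["release_year", "genres"], as_index=False)["popularity"].mean()

# For each year, keep the row whose mean popularity is highest.
top = mean_pop.loc[mean_pop.groupby("release_year")["popularity"].idxmax()]
print(top)
```

The remaining questions relate a single numeric or categorical column to revenue_adj, so they lend themselves to scatter plots or grouped bar charts of the same long-format data.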

Method of Analysis

My analysis will start with data wrangling, in which I'll be loading the dataset and cleaning the errant columns and rows. After that, I'll do some exploratory data analysis, using scatter plots and bar charts to look at the trends in the data.
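A minimal sketch of these two stages is shown below. The cleaning rules (dropping exact duplicates and rows with missing values in the analysed columns) are assumptions, not necessarily the ones applied in the notebook, and the sample frame and the helper name wrangle are made up. Correlations are computed here in place of plots so the example stays short.

```python
import pandas as pd

# Numeric columns the analysis relates to revenue.
COLS = ["popularity", "runtime", "vote_average", "budget_adj", "revenue_adj"]

def wrangle(df):
    """Drop exact duplicate rows and rows missing any analysed value."""
    return df.drop_duplicates().dropna(subset=COLS)

# Hypothetical sample with one duplicate row and one missing value.
sample = pd.DataFrame({
    "popularity":   [1.0, 1.0, 2.0, 3.0, None],
    "runtime":      [90, 90, 100, 120, 95],
    "vote_average": [5.0, 5.0, 6.0, 7.5, 6.0],
    "budget_adj":   [1e6, 1e6, 2e6, 5e6, 1e6],
    "revenue_adj":  [2e6, 2e6, 5e6, 9e6, 3e6],
})

clean = wrangle(sample)

# Pearson correlation of each property with adjusted revenue.
corr = clean[COLS].corr()["revenue_adj"].drop("revenue_adj")
print(corr)
```

A scatter plot of each column against revenue_adj would show the same relationships visually.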

Summary of Findings

Thriller, Western, and War are the most popular genres from year to year.
Properties associated with movies that have high revenues include:

High level of popularity
Long runtimes
High vote averages
High budget
