Develop an advanced predictive model to forecast a film's box office revenue with precision and confidence. Utilizing a myriad of parameters, including budget, cast, genre, and past performance, our task is to leverage the power of machine learning to unravel the intricacies of box office dynamics and provide actionable insights for studios and filmmakers.
Numerous recommendation systems have been built on the extensive TMDB_5000 dataset from Kaggle, yet much of the dataset's potential remains untapped. This project uses it to predict a film's expected revenue from a range of parameters and engineered features, so that studios and filmmakers can make better-informed decisions.
This section contains detailed information about the approach, experimentation results, and inferences derived from the project. I have created a blog explaining the approach and execution. Please visit my blog:
Frontend | Backend | ML Library | MLOps Tools | Deployment | Version Control |
---|---|---|---|---|---|
- Flattened the dataset's complex nested structure into simple, trainable data.
- Assigned scores to high-cardinality categorical features such as crew, hero, and heroine, based on the cumulative popularity and weighted rating of each person's previous work, to capture their impact on revenue/footfall numerically.
- Used one-hot encoding for ordinary categorical features with fewer unique values.
- Applied a log transformation to reduce skew and the influence of outliers.
- Standardized the data with StandardScaler.
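The scoring step for cast and crew can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `add_cast_scores`, the field names, the sample films, and the way popularity and rating are combined (cumulative popularity times mean weighted rating) are all assumptions. The one point it tries to show is that a person's score for a film should draw only on their earlier films.

```python
from collections import defaultdict

def add_cast_scores(films):
    """Attach a lead-actor score to each film using only that actor's earlier films."""
    history = defaultdict(lambda: {"popularity": 0.0, "rating_sum": 0.0, "n": 0})
    scored = []
    for film in sorted(films, key=lambda f: f["year"]):
        past = history[film["lead"]]
        mean_rating = past["rating_sum"] / past["n"] if past["n"] else 0.0
        # Hypothetical combination: cumulative popularity scaled by mean weighted rating.
        score = past["popularity"] * mean_rating
        scored.append({**film, "lead_score": score})
        # Update the history only after scoring, so a film never informs its own score.
        past["popularity"] += film["popularity"]
        past["rating_sum"] += film["weighted_rating"]
        past["n"] += 1
    return scored

# Made-up films for illustration only.
films = [
    {"title": "A", "year": 2001, "lead": "Actor X", "popularity": 10.0, "weighted_rating": 6.0},
    {"title": "B", "year": 2003, "lead": "Actor X", "popularity": 20.0, "weighted_rating": 8.0},
    {"title": "C", "year": 2005, "lead": "Actor X", "popularity": 5.0, "weighted_rating": 7.0},
    {"title": "D", "year": 2004, "lead": "Actor Y", "popularity": 3.0, "weighted_rating": 5.0},
]

scores_by_title = {f["title"]: f["lead_score"] for f in add_cast_scores(films)}
print(scores_by_title)
```

A first-time lead gets a score of 0 here; a real pipeline would probably need a more considered default for people with no history.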
To predict expected revenue, we introduced a novel approach by considering footfall (number of tickets sold) as a target metric. While revenue is subject to various external factors such as ticket prices and distribution deals, footfall provides a more consistent and direct measure of a movie's popularity and audience engagement.
expected revenue = predicted footfall * current avg_ticket_price
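A small worked example of the formula above. It assumes that historical footfall is derived by dividing a film's revenue by the average ticket price in its release year, which the source implies but does not spell out. The price table holds placeholder values, not real data, and the function names are hypothetical.

```python
# Placeholder average ticket prices by year (illustrative values, not real data).
AVG_TICKET_PRICE = {2000: 5.0, 2010: 8.0, 2024: 10.0}

def revenue_to_footfall(revenue, release_year, prices=AVG_TICKET_PRICE):
    """Convert historical revenue into tickets sold (assumed training target)."""
    return revenue / prices[release_year]

def expected_revenue(predicted_footfall, current_year, prices=AVG_TICKET_PRICE):
    """Convert a predicted footfall back into revenue at current prices."""
    return predicted_footfall * prices[current_year]

footfall = revenue_to_footfall(50_000_000, 2000)
revenue_today = expected_revenue(footfall, 2024)
print(footfall, revenue_today)
```

With these placeholder prices, a film that earned 50 million in 2000 corresponds to 10 million tickets, which maps to 100 million at the 2024 price. This is how footfall separates audience size from ticket-price inflation.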
Model | Best Model |
---|---|
RandomForestRegressor | |
DecisionTreeRegressor | |
GradientBoostingRegressor | |
LinearRegression | |
XGBRegressor | XGBRegressor |
CatBoostRegressor | |
AdaBoostRegressor | |
Metric | Value |
---|---|
RMSE | 0.012 |
neg_mean_squared_error | -0.00024 |
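For readers comparing the two rows: scikit-learn's `neg_mean_squared_error` scorer is the mean squared error with its sign flipped, so that higher is better. The sketch below converts such a score to an RMSE. Note that the square root of 0.00024 is not 0.012, so the two reported values were presumably computed on different data (for example cross-validation folds versus a held-out set); the source does not say which.

```python
import math

def rmse_from_neg_mse(neg_mse):
    """Convert a negated MSE score into a root mean squared error."""
    return math.sqrt(-neg_mse)

rmse_equivalent = rmse_from_neg_mse(-0.00024)
print(round(rmse_equivalent, 4))  # 0.0155
```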
Parameter | Value |
---|---|
colsample_bytree | 0.3 |
learning_rate | 0.11 |
max_depth | 4 |
n_estimators | 444 |
- Method: RandomizedSearchCV
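The idea behind a randomized search is to draw a fixed number of parameter combinations at random and keep the best-scoring one. The sketch below shows that idea using only the standard library; it is not scikit-learn's `RandomizedSearchCV` implementation. The search ranges are hypothetical, and `toy_score` is an arbitrary stand-in for a cross-validated model score.

```python
import random

# Hypothetical search space; the ranges are illustrative, not the project's.
SEARCH_SPACE = {
    "colsample_bytree": [round(0.1 * i, 1) for i in range(1, 11)],
    "learning_rate": [round(0.01 * i, 2) for i in range(1, 31)],
    "max_depth": list(range(2, 11)),
    "n_estimators": list(range(100, 1001)),
}

def sample_params(space, rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in space.items()}

def random_search(score_fn, space, n_iter, seed=0):
    """Evaluate n_iter random candidates and return the best (higher is better)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(space, rng)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Stand-in for a cross-validated score such as neg_mean_squared_error.
    return -((params["max_depth"] - 5) ** 2) - (params["learning_rate"] - 0.1) ** 2

best_params, best_score = random_search(toy_score, SEARCH_SPACE, n_iter=50)
print(best_params, best_score)
```

In the real pipeline the scoring function would train and cross-validate the model for each candidate, which is why sampling a limited number of candidates is cheaper than an exhaustive grid.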
All the experiment results and models are logged in MLflow for a clearer understanding and detailed inference: View here
Home Page | Form Page | Result |
git clone https://github.com/uvaishnav/BoxOfficePrediction.git
conda create -n boxoffice python=3.9 -y
conda activate boxoffice
pip install -r requirements.txt
python app.py
Open your local host and port in a browser.
git clone https://github.com/uvaishnav/BoxOfficePrediction.git
conda create -n boxoffice python=3.9 -y
conda activate boxoffice
pip install -r requirements.txt
4. Create a Kaggle account, download the kaggle.json file, and store it in the .kaggle folder on your system (required for the data_ingestion pipeline).
For the model evaluation pipeline:
- Connect the repository to DagsHub.
- Get the MLflow tracking URI and credentials.
- Update the config.yaml file with your MLflow URI.
- Add these variables (credentials from DagsHub) to your environment:
export MLFLOW_TRACKING_URI="your mlflow uri"
export MLFLOW_TRACKING_USERNAME="your username"
export MLFLOW_TRACKING_PASSWORD="your password"
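Before running the pipeline, it can help to confirm that the three variables are actually visible to Python. The following is an optional sanity check, not part of the repository; the helper name `missing_mlflow_vars` is hypothetical.

```python
import os

REQUIRED_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)

def missing_mlflow_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_mlflow_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All MLflow tracking variables are set.")
```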
dvc init
dvc repro
Update the Dockerfile as needed and build the Docker image. You need to install Docker Desktop first.
docker build -t boxoffice .
- Create a Heroku account and create an app.
- In your GitHub repository, navigate to Settings -> Secrets and Variables -> Actions, and add the secret keys that your main.yaml workflow file expects:
HEROKU_API_KEY
HEROKU_APP_NAME
HEROKU_EMAIL
The build runs and a new version of your project is deployed every time you push changes to GitHub.
Our current model predicts expected revenue based on factors like budget, cast, release month, and genres.
We can enhance its utility by optimizing cast selection and release timing. By analyzing historical data, we can identify optimal combinations of actors and crew members that synergize well, thereby maximizing revenue potential. Additionally, refining our model to recommend the best release windows can help avoid high competition periods and leverage seasonal trends, further boosting a film's success.
- TMDB_5000 dataset from Kaggle
- 247wallst.com for preparing ticket prices dataset
This project is licensed under the GPL-3.0 License - see the LICENSE file for details.