Predict the success of an anime using data science and machine learning (regression + classification)

ztjhz/SC1015-Project

SC1015 DSAI Project: AniFame

Our Motivation:

  • Anime is an outlet for relaxation and escape for people of all ages. However, although viewers love watching anime, studios struggle to turn a profit on many of the titles they produce because of high production costs.
  • According to Eric (2015), an average 13-episode anime season costs around US$2 million, and many titles cannot recoup this expense. To become profitable, studios rely on advertising, events, and merchandise, all of which depend on the anime's popularity with viewers.
  • Hence, it is important to know whether an anime in production is likely to be profitable, so that studios can maximise their profits and ensure their survival in the industry.

Project Goal:

  • This project aims to help studios maximise profits on the anime they produce by estimating an anime's 'mean' rating and predicting its 'success' probability before production, giving studios the chance to fine-tune a title before production begins.

Dataset Used:

We used the MyAnimeList (MAL) API to scrape anime data from 2000 to 2021, then cleaned and processed it for Exploratory Data Analysis and Machine Learning.

Note: Some datasets were scraped but are not included in the final project (e.g. the various ranking datasets)

Jupyter Notebooks:

Note: Some Jupyter Notebooks were used but are not included in the final project (e.g. anomaly detection, helpers, scraper)

Slide Deck:



Overview of Data Science Pipeline:

  • Recursively scraping data on thousands of anime from 2000 to 2021 through the MAL API
  • Removing useless features and handling missing values
  • JSON conversion and manipulation
  • Feature engineering and generation
  • Creating 'genres' time series data
  • Exporting data as CSV
  • One-hot encoding
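
The cleaning steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual notebook code: the sample records, the `flatten` and `one_hot_genres` helpers, and the exact field layout are assumptions made for the example.

```python
import json

# Hypothetical records shaped loosely like a nested API response.
RAW = """
[
  {"id": 1, "title": "A", "mean": 8.1,
   "genres": [{"id": 1, "name": "Action"}, {"id": 2, "name": "Drama"}],
   "start_season": {"year": 2010, "season": "spring"}},
  {"id": 2, "title": "B", "mean": 6.4,
   "genres": [{"id": 2, "name": "Drama"}],
   "start_season": {"year": 2015, "season": "fall"}},
  {"id": 3, "title": "C", "mean": null,
   "genres": [{"id": 3, "name": "Comedy"}],
   "start_season": {"year": 2020, "season": "winter"}}
]
"""


def flatten(record):
    """Flatten nested dicts into 'parent_child' keys; reduce genres to names."""
    flat = {}
    for key, value in record.items():
        if key == "genres":
            flat["genres"] = [genre["name"] for genre in value]
        elif isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def one_hot_genres(rows):
    """Replace the 'genres' list with one 0/1 column per genre seen in rows."""
    all_genres = sorted({genre for row in rows for genre in row["genres"]})
    encoded = []
    for row in rows:
        out = {k: v for k, v in row.items() if k != "genres"}
        for genre in all_genres:
            out[f"genre_{genre.lower()}"] = int(genre in row["genres"])
        encoded.append(out)
    return encoded


records = [flatten(r) for r in json.loads(RAW)]
# One simple way to handle missing values: drop rows with no 'mean' target.
records = [r for r in records if r["mean"] is not None]
table = one_hot_genres(records)
```

Because an anime can carry several genres at once, the encoding here is multi-hot: each row may have more than one genre column set to 1.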

Explored, visualized, and generated insights for the following:

  • 'genres' + 'genres' time series
  • 'studios'
  • 'mean' rating vs 'source', 'media_type', 'nsfw', 'rating', 'genre', and 'studios'
  • Relationship between 'mean', 'rank', 'popularity', 'positive_viewership_fraction', and 'negative_viewership_fraction'
  • 'num_episodes' and 'average_episode_duration' overall trends
  • 'start_season_season'
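
One way to build a 'genres' time series like the one explored above is to count how many titles carry each genre per year. The sketch below assumes each anime has already been reduced to a (year, genres) pair; the sample data and the `genre_counts_by_year` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (start year, genre list) pairs, one per anime.
ANIME = [
    (2000, ["Action", "Comedy"]),
    (2000, ["Action"]),
    (2001, ["Drama"]),
    (2001, ["Action", "Drama"]),
]


def genre_counts_by_year(rows):
    """Return {year: {genre: number of titles}} with years in ascending order."""
    counts = defaultdict(Counter)
    for year, genres in rows:
        counts[year].update(genres)
    return {year: dict(counter) for year, counter in sorted(counts.items())}


series = genre_counts_by_year(ANIME)
for year, genre_counts in series.items():
    print(year, genre_counts)
```

Plotting each genre's count against the year then shows how its share of releases changes over time.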

Regression Models:

  • Linear Regression
  • Lasso Regression
  • Ridge Regression (Best)
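
To illustrate what distinguishes Ridge from plain Linear Regression, the sketch below fits a single-feature model in closed form. It is not the project's model: the data and the `fit_ridge_1d` helper are hypothetical, and the intercept is assumed to be unpenalized. The L2 penalty `alpha` is added to the denominator of the slope, which shrinks the coefficient toward zero.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Minimize sum((y - b0 - b1*x)^2) + alpha * b1^2 for a single feature.

    With the intercept left unpenalized, the solution is
    b1 = Sxy / (Sxx + alpha) and b0 = mean(y) - b1 * mean(x).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical toy data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

ols_slope, ols_intercept = fit_ridge_1d(xs, ys, alpha=0.0)      # ordinary least squares
ridge_slope, ridge_intercept = fit_ridge_1d(xs, ys, alpha=5.0)  # shrunken slope
```

With `alpha=0.0` the fit reduces to ordinary least squares; larger `alpha` values trade some bias for lower variance, which tends to help when many one-hot features are correlated.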

Metrics:

  • Explained Variance (R^2)
  • Mean Squared Error, Root Mean Squared Error
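
The regression metrics above can be computed directly from predictions. The following is a minimal sketch with made-up ratings; the function names are illustrative. R^2 here is the coefficient of determination, which matches the explained variance when the prediction errors average to zero.

```python
import math


def mse(y_true, y_pred):
    """Mean of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical 'mean' ratings and model predictions.
y_true = [7.0, 8.0, 6.0, 9.0]
y_pred = [7.5, 7.5, 6.5, 8.5]

print(mse(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```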

Classification Models:

  • LinearSVC
  • Decision Tree
  • Random Forest (Best - 4th version)

Metrics:

  • TPR, TNR, Confusion Matrix
  • Precision, Recall (TPR), F-score
  • Out-of-bag score
  • ROC AUC score
  • K-fold cross validation standard deviation
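
Most of the classification metrics listed above derive from the four cells of a binary confusion matrix. The sketch below uses hypothetical labels (1 = 'success', 0 = not) and illustrative helper names, and assumes both classes appear in the labels and predictions so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def summarize(y_true, y_pred):
    """Compute TPR (recall), TNR, precision, and F1 from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn)        # recall: share of actual positives found
    tnr = tn / (tn + fp)        # share of actual negatives found
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"tpr": tpr, "tnr": tnr, "precision": precision, "f1": f1}


# Hypothetical labels: 4 successful titles, 6 unsuccessful ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = summarize(y_true, y_pred)
```

Reporting TPR and TNR together matters when the classes are imbalanced, since accuracy alone can look high for a model that rarely predicts the minority class.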

Key Insights & Recommendations:

Studios should:

  • Focus on the quality rather than the quantity of the anime they produce
  • Broadcast anime regardless of season
  • Not focus on producing anime that rely on fan service to generate more positive views
  • Produce anime movie franchises

Important features that determine the success of an anime:

  • ‘average_episode_duration’
  • ‘num_episodes’
  • ‘source_manga’
  • ‘media_type_movie’
  • ‘rating_pg_13’


What we learnt from this project:

Data collection:

  • Scraping data using API calls

Data cleaning and preprocessing:

  • Feature Engineering & Feature generation
  • JSON manipulation techniques
  • Generating time-series data

EDA & Visualization:

  • Visualizing plots with a large number of data points:
    • By reducing the data point size,
    • By reducing the opacity of data points, or
    • By introducing random sampling
  • ‘genres’ time-series EDA
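
The random-sampling approach to overplotting mentioned above can be sketched as follows. The `downsample` helper and the synthetic points are assumptions for illustration; a fixed seed keeps the sampled plot reproducible.

```python
import random


def downsample(points, max_points, seed=0):
    """Return at most max_points points, chosen uniformly without replacement."""
    if len(points) <= max_points:
        return list(points)
    return random.Random(seed).sample(points, max_points)


# Hypothetical scatter data: 50,000 (x, y) pairs.
rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(50_000)]

subset = downsample(points, max_points=1_000)
# The other two options map to plot parameters, e.g. a smaller marker size
# and a lower opacity (`s` and `alpha` in matplotlib's scatter).
```

Sampling changes which points are drawn, while marker size and opacity keep every point visible, so the three options can be combined.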

Machine Learning:

  • Machine Learning Models:
    • Ridge Regression, Lasso Regression, Random Forest, LinearSVC
  • Classification Performance Metrics:
    • F-score (Precision & Recall), out-of-bag (OOB) score, ROC AUC score


Contributions:

Data Collection: Jing Qiang and Jing Hua
Data cleaning and preprocessing: Jing Qiang, Jing Hua, and YinFeng
EDA and visualization: Jing Qiang and Jing Hua
Regression: Jing Hua
Classification: Jing Qiang
Presentation Script: Jing Qiang
Presentation Voice Over + Editing: Jing Hua
Slide Deck: Jing Qiang, Jing Hua, YinFeng
GitHub ReadMe: Jing Qiang

Completed but not included in the final product:

  • Ranking dataset EDA: YinFeng
  • Anomaly Detection: Jing Qiang, YinFeng

References:
