
SFU CMPT 353 (Computational Data Science) - Data Analysis on Predicting Adult Wage with Machine Learning (pandas, sklearn, etc)


j-shim/CMPT353GroupProject


The Data Hunters

This project is a data analysis on predicting adult wages (with partial work on other datasets) by Sean Nam and June Shim, completed as a group project for CMPT 353: Computational Data Science (Summer 2019) at Simon Fraser University, Burnaby, British Columbia, Canada.

In this project, we applied machine learning techniques using several Python libraries, such as pandas and sklearn, to answer the prediction questions described below.

Detailed instructions for running each analysis are in the Usage section below.

Setup

You will need Git, Python 3.5+, and Jupyter Notebook installed on your machine.

Cross-platform Install with Anaconda (Windows, macOS, Linux) - Recommended

  • Download Anaconda and select the Python 3.5+ installer
  • Download and install Git

Debian/Ubuntu based Linux with APT

Open up a Terminal and type:

sudo apt update
sudo apt install python3 python3-dev python3-pip git
pip3 install scipy matplotlib bokeh pandas statsmodels scikit-learn scikit-image numexpr jupyter

macOS with Homebrew Package Manager

Open up a Terminal and type:

brew update
brew install python3 git
pip3 install scipy matplotlib bokeh pandas statsmodels scikit-learn scikit-image numexpr jupyter

To install additional packages, use pip3:

pip3 install <package-to-install>
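
Before opening the notebooks, it can help to confirm that the packages installed above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this repository. Note that some packages are imported under a different name than the one given to pip3 (scikit-learn as `sklearn`, scikit-image as `skimage`), and `jupyter_core` is checked here on the assumption that it is installed as a dependency of `jupyter`.

```python
import importlib.util

# Import names for the packages installed with pip3 above.
# scikit-learn is imported as "sklearn" and scikit-image as "skimage".
# "jupyter_core" is assumed to be pulled in as a dependency of jupyter.
REQUIRED = [
    "scipy", "matplotlib", "bokeh", "pandas", "statsmodels",
    "sklearn", "skimage", "numexpr", "jupyter_core",
]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any name reported as missing can then be installed with pip3 as shown above.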

Cloning this repository onto your local machine

Open up a Terminal, cd to your preferred directory and type:

git clone [email protected]:jys2/the-data-hunters.git

Note: If git clone fails, confirm that your SSH key is set up and registered properly.

Usage

adult-wage/

This is the main focus of our data analysis. The goal is to predict whether an adult's yearly income is greater than $50,000 USD, based on a range of features describing the person.

Open by typing in Terminal:

jupyter-notebook adult-wage-prediction.ipynb

and run the cells step by step.
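
To make the prediction target concrete: the task is a binary classification, so a useful reference point is the accuracy of always predicting the more common class. The sketch below uses made-up income values and only the standard library. The function names and sample data are illustrative assumptions, and the notebook itself may prepare its labels differently.

```python
from collections import Counter

THRESHOLD_USD = 50000  # income threshold stated in the question above


def to_label(yearly_income):
    """Map a yearly income in USD to the binary target (True if above the threshold)."""
    return yearly_income > THRESHOLD_USD


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up incomes, for illustration only (not taken from the project's dataset).
incomes = [32000, 48000, 51000, 75000, 29000, 50000, 120000, 41000]
labels = [to_label(x) for x in incomes]

print(labels.count(True), "of", len(labels), "above threshold")
print("Majority-class baseline accuracy:", majority_baseline_accuracy(labels))
```

A trained classifier is worth comparing against this baseline, since a model that scores close to it has learned little beyond the class balance.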

movie-wikidata/

This was our first attempt at data analysis; however, we decided to change our datasets and questions because the prediction scores were low and it was difficult to extract useful insights from the data.

The goal here is to predict movie review scores from features such as cast, directors, and plots.

predict_review_by_plots_*.py

Step 1: Extract/Clean data and save to gzipped json

# Running the program without arguments displays the usage message:
$ python3 predict_review_by_plots_step1_join_tables.py
Usage: python3 program.py <input_directory> <output_json_gz>
  e.g. python3 program.py data plots.json.gz
# The command below produces plots.json.gz, the result of joining
# pandas DataFrames and dropping unnecessary columns.
# Note: the first argument must be 'data' (the input folder), and
# the second argument must have a .json.gz extension.
$ python3 predict_review_by_plots_step1_join_tables.py data plots.json.gz
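
The join-and-save step can be pictured with a small standard-library sketch. The actual script works with pandas DataFrames; this stand-in uses plain dictionaries instead, and the table contents, the key `movie_id`, and all column and function names are invented for illustration.

```python
import gzip
import json
import os
import tempfile


def join_and_save(left_rows, right_rows, key, keep_columns, output_path):
    """Inner-join two lists of dicts on `key`, keep only `keep_columns`,
    and write one JSON object per line to a gzipped file."""
    if not output_path.endswith(".json.gz"):
        raise ValueError("output path must end with .json.gz")
    # Assumes `key` is unique in right_rows; later duplicates would win.
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is None:
            continue  # inner join: skip rows without a partner
        merged = dict(row)
        merged.update(match)
        joined.append({col: merged[col] for col in keep_columns})
    with gzip.open(output_path, "wt", encoding="utf-8") as f:
        for record in joined:
            f.write(json.dumps(record) + "\n")
    return joined


def load_json_gz(path):
    """Read a gzipped, line-delimited JSON file back as a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# Hypothetical tables; the column names are not taken from the repository.
plots = [
    {"movie_id": 1, "plot": "A heist goes wrong.", "internal_note": "x"},
    {"movie_id": 2, "plot": "Two friends travel.", "internal_note": "y"},
]
ratings = [{"movie_id": 1, "score": 7.4}, {"movie_id": 3, "score": 5.1}]

out_path = os.path.join(tempfile.mkdtemp(), "plots.json.gz")
join_and_save(plots, ratings, "movie_id", ["movie_id", "plot", "score"], out_path)
print(load_json_gz(out_path))
```

Only the movie present in both tables survives the join, and the `internal_note` column is dropped because it is not listed in `keep_columns`.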

Step 2a: Analyze data by rounding review scores

  • Running the program without arguments displays a usage message similar to the one above:
# Assuming the output of Step 1 is plots.json.gz,
python3 predict_review_by_plots_step2a_rounding.py plots.json.gz
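
One reading of this step is that rounding turns the continuous review score into a small set of classes, so the problem can be treated as classification. A minimal sketch of that idea follows. The scores are invented, and rounding half up is an assumption about the rounding rule (Python's built-in `round` rounds halves to the nearest even integer, so the two can differ).

```python
import math
from collections import Counter


def round_half_up(score):
    """Round to the nearest integer, with .5 always rounding up."""
    return int(math.floor(score + 0.5))


# Invented review scores, for illustration only.
scores = [6.5, 7.4, 7.5, 8.49, 2.5, 9.9]
classes = [round_half_up(s) for s in scores]

print(classes)           # one class label per score
print(Counter(classes))  # how many scores fall into each class
```

Counting the resulting classes is a quick way to see how unbalanced the labels become after rounding.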

Step 2b: Analyze data with regression

  • Run the program the same way as in Step 2a:
# Assuming the output of Step 1 is plots.json.gz,
python3 predict_review_by_plots_step2b_regression.py plots.json.gz
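
This README does not say how the regression is scored. A common choice is the coefficient of determination (R^2), which is also what scikit-learn regressors return from their `score` method. The sketch below computes it by hand on invented numbers; the function and variable names are illustrative and not taken from the script.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented review scores and predictions, for illustration only.
actual = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]
mean_only = [sum(actual) / len(actual)] * len(actual)

print("model R^2:", r2_score(actual, predicted))
print("mean-only R^2:", r2_score(actual, mean_only))
```

Always predicting the mean score gives an R^2 of zero, so a value near zero means the model explains almost none of the variation in review scores.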

RT_casts_and_directors.ipynb

Open by typing in Terminal:

jupyter-notebook RT_casts_and_directors.ipynb

and run the cells step by step.

unused-data-analysis/ (credit card fraud detection)

The goal here is to predict whether a credit card transaction is fraudulent, based on information about the transaction.

  • Note that this dataset was not suitable for our project because its features were already processed with PCA. The data is archived in this folder.

Open by typing in Terminal:

jupyter-notebook creditcard.ipynb

and run the cells step by step.

Authors

Acknowledgments
