This is a data analysis project on predicting adult wages (with partial work on other datasets) by Sean Nam and June Shim, done as a group project for CMPT 353: Computational Data Science (Summer 2019) at Simon Fraser University, Burnaby, British Columbia, Canada.
In this project, we applied machine learning techniques using several Python libraries, such as pandas and sklearn, to answer the prediction questions described below.
Detailed instructions for running each analysis are in the Usage section below.
You will need Git, Python 3.5+, and Jupyter Notebook installed on your machine.
Cross-platform Install with Anaconda (Windows, macOS, Linux) - Recommended
- Select Python 3.5+ Installer
- Download Git here
On a Debian-based Linux distribution (the commands below use apt), open up a Terminal and type:
sudo apt update
sudo apt install python3 python3-dev python3-pip git
pip3 install scipy matplotlib bokeh pandas statsmodels scikit-learn scikit-image numexpr jupyter
macOS with Homebrew Package Manager
Open up a Terminal and type:
brew update
brew install python3 git
pip3 install scipy matplotlib bokeh pandas statsmodels scikit-learn scikit-image numexpr jupyter
If you need to install additional packages, install with pip3:
pip3 install <package-to-install>
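To confirm that the packages listed above are visible to your Python interpreter, a small check like the following may help. It is a minimal sketch, not part of the project: the `PACKAGES` mapping and the `missing_packages` helper are illustrative names, and mapping `jupyter` to the `notebook` module is an assumption about what the install pulls in. Note that some pip names differ from their import names (`scikit-learn` imports as `sklearn`, `scikit-image` as `skimage`).

```python
from importlib.util import find_spec

# pip package name -> top-level module name used in "import ..."
# (illustrative mapping; "jupyter" -> "notebook" is an assumption)
PACKAGES = {
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "bokeh": "bokeh",
    "pandas": "pandas",
    "statsmodels": "statsmodels",
    "scikit-learn": "sklearn",
    "scikit-image": "skimage",
    "numexpr": "numexpr",
    "jupyter": "notebook",
}


def missing_packages(mapping):
    """Return the pip names whose module cannot be found by the interpreter."""
    return sorted(pip_name for pip_name, module in mapping.items()
                  if find_spec(module) is None)


missing = missing_packages(PACKAGES)
if missing:
    print("Missing packages; try: pip3 install " + " ".join(missing))
else:
    print("All listed packages were found.")
```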
Open up a Terminal, cd to your preferred directory, and type:
git clone [email protected]:jys2/the-data-hunters.git
Note: If git clone fails, confirm that your SSH key is set up and registered properly.
This is the main focus of our data analysis. The question here is whether we can predict if an adult's yearly income is greater than $50,000 USD, based on a number of features describing the person.
Open by typing in Terminal:
jupyter-notebook adult-wage-prediction.ipynb
and run the cells step by step.
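Before reading the classifier scores in the notebook, it can help to know what a trivial model would achieve. The sketch below is not taken from the notebook: it uses a hypothetical handful of labels (the UCI Adult data marks the target as `<=50K` or `>50K`), turns them into a binary target, and computes the accuracy of always predicting the most common class. The function names are illustrative.

```python
from collections import Counter

# Hypothetical sample of income labels in the UCI Adult format.
labels = ["<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"]


def to_binary(label):
    """Map an income label to 1 (above $50K) or 0 (at or below $50K)."""
    return 1 if label.strip().startswith(">") else 0


def majority_baseline_accuracy(targets):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(targets)
    return counts.most_common(1)[0][1] / len(targets)


y = [to_binary(label) for label in labels]
baseline = majority_baseline_accuracy(y)
print(y, round(baseline, 3))
```

A model is only informative if its accuracy clearly exceeds this baseline on held-out data.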
This was our first attempt at data analysis. We later changed our datasets and questions because the prediction scores were low and it was difficult to extract useful insights from the data.
The question here is whether we can predict movie review scores from features such as cast, directors, and plots.
Step 1: Extract/Clean data and save to gzipped json
# Running the program without arguments will display usage:
$ python3 predict_review_by_plots_step1_join_tables.py
Usage: python3 program.py <input_directory> <output_json_gz>
e.g. python3 program.py data plots.json.gz
# Running the command below produces plots.json.gz, which is the result of
# joining pandas DataFrames and dropping unnecessary columns.
# Note: The first argument must be 'data' (the input folder), and
# the second argument must have a .json.gz extension.
python3 predict_review_by_plots_step1_join_tables.py data plots.json.gz
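The command-line contract described above (two arguments, output ending in `.json.gz`) and the gzipped JSON output can be sketched with the standard library alone. This is an illustration under assumptions, not the project's script: `check_args`, `write_json_gz`, and `read_json_gz` are hypothetical helpers, the records are made up, and the one-JSON-object-per-line layout is an assumed file format.

```python
import gzip
import json
import os
import tempfile

USAGE = "Usage: python3 program.py <input_directory> <output_json_gz>"


def check_args(argv):
    """Return (input_dir, output_path), or raise ValueError on bad arguments."""
    if len(argv) != 3:
        raise ValueError(USAGE)
    input_dir, output_path = argv[1], argv[2]
    if not output_path.endswith(".json.gz"):
        raise ValueError("output file must end in .json.gz")
    return input_dir, output_path


def write_json_gz(records, path):
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_json_gz(path):
    """Read the records back from a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# Hypothetical records standing in for the joined and cleaned table.
records = [{"title": "Example A", "score": 7.5},
           {"title": "Example B", "score": 4.0}]
out_dir = tempfile.mkdtemp()
_, out_path = check_args(["program.py", "data",
                          os.path.join(out_dir, "plots.json.gz")])
write_json_gz(records, out_path)
round_trip = read_json_gz(out_path)
print(round_trip == records)
```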
Step 2a: Analyze data by rounding review scores
- Running the program without arguments will display a usage message similar to the one above:
# Assuming the output of Step 1 is plots.json.gz,
python3 predict_review_by_plots_step2a_rounding.py plots.json.gz
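Rounding review scores turns a continuous target into a small set of classes, so a classifier can be used instead of a regressor. The sketch below illustrates that idea on hypothetical scores. It is not the script's actual code, and the half-up rounding rule is an assumption (Python's built-in `round` rounds halves to the nearest even integer, which may or may not be what the script does).

```python
import math
from collections import Counter

# Hypothetical review scores on a 0-10 scale.
scores = [6.4, 7.5, 8.2, 3.9, 6.6]


def round_half_up(score):
    """Round a non-negative score to the nearest integer, halves going up."""
    return int(math.floor(score + 0.5))


# Each rounded score becomes a class label for a classifier.
classes = [round_half_up(s) for s in scores]
class_counts = Counter(classes)
print(classes, dict(class_counts))
```

Checking the class counts is worthwhile because rounding often leaves a few classes with very few examples.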
Step 2b: Analyze data with regression
- Run the program the same way as in Step 2a:
# Assuming the output of Step 1 is plots.json.gz,
python3 predict_review_by_plots_step2b_regression.py plots.json.gz
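In contrast to Step 2a, regression keeps the review score continuous and is usually judged by how much variance it explains. The following sketch fits a one-feature least-squares line by hand and reports R². The data and the single numeric feature are hypothetical, and the project's script may use a different model and different features.

```python
# Hypothetical single numeric feature and review scores.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(xs, ys))
    ss_tot = sum((b - mean_y) ** 2 for b in ys)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(x, y)
score = r_squared(x, y, slope, intercept)
print(round(slope, 3), round(intercept, 3), round(score, 3))
```

The R² here is computed on the training points only; a fair comparison with Step 2a would evaluate on held-out data.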
Open by typing in Terminal:
jupyter-notebook RT_casts_and_directors.ipynb
and run the cells step by step.
The question here is whether we can predict if a credit card transaction is fraudulent, based on information about the transaction.
- Note that this data is not suitable for our project because its features have already been transformed with PCA. The data is archived in this folder.
Open by typing in Terminal:
jupyter-notebook creditcard.ipynb
and run the cells step by step.
- Sean Nam - [email protected] / GitHub
- June Shim - [email protected] / GitHub
- README template adapted from https://gist.github.com/PurpleBooth/109311bb0361f32d87a2
- movie-wikidata data source - Gregory Baker (Instructor for the course)
- adult-wage data source - https://archive.ics.uci.edu/ml/datasets/Adult
- unused-data-analysis (creditcard.csv) - https://www.kaggle.com/mlg-ulb/creditcardfraud