STAT-405-605-Group-Project

Intro

This course introduces students to the statistical programming language R and how to use it in statistical and data science problems. The course traces the data science pipeline: importing data into R, exploring and visualizing it, applying a variety of statistical methods, and communicating results. Important computational tools for data science (e.g. databases, web scraping, and big data) and good programming practice are integrated throughout the course.

Team

Undergraduate: Jian Ruan, Zach Yu, Carrie Hashimoto, Sophia Lyu
Graduate: Patrick Yee

Datasets

Primary Dataset

New York City Open Data - Motor Vehicle Collisions (2013 - 2022)

Secondary Dataset

New York City Weather Data
New York Times Articles

Version

1.0

What: Explore the primary dataset (>1M rows) and the secondary dataset (weather).
Why: To understand the basic characteristics of the datasets.
How: Generate basic plots (pie chart, bar graph, regression, residual plot, normal Q-Q plot, cumulative linear plot, etc.).
Who:
- Graph Generation: All team members (x6 basic plots)
- Report LaTeX Writing + Submission: Patrick

2.0

What: Analyze trends and patterns in the data.
Why: To identify significant changes in those patterns.
How: Generate more sophisticated ggplots (GIS map, density plot, ribbon plot, bar graph, pie chart, dot plot, etc.).
Who:
- Graph Generation: All team members (x6 ggplots)
- Report LaTeX Writing + Submission: Patrick

3.0

What: Identify the effect of COVID on car crash occurrences.
Why: To understand changes in behavior in city-level transportation.
How: Compare and contrast GIS distributions before and after COVID, add a secondary dataset (NYT articles) for linear regression analysis, and identify contributing factors across time.
Who:
- Graph Generation: All team members (x4 COVID plots)
- Report LaTeX Writing + Submission: Jian

4.0 (Current Draft)

What: Integrate the data into a SQL database for storing and accessing large volumes of data.
Why: To enable scalability and easier data manipulation.
How: Use dplyr with SQL queries to significantly reduce the size of the data frames used to generate the plots in 3.0.
Who:
- Graph Generation: All team members (x5 SQL plots)
- Report LaTeX Writing + Submission: Jian
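
The idea behind 4.0 is to push aggregation into the database so that only a small summary table reaches the plotting code. The project itself does this in R with dplyr; the sketch below uses Python's built-in `sqlite3` module purely for illustration. The table name `collisions`, the columns `crash_date` and `borough`, and the sample rows are all hypothetical and are not taken from the actual dataset.

```python
import sqlite3

# Hypothetical in-memory database; the real collision data has >1M rows
# and many more columns than the two assumed here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collisions (crash_date TEXT, borough TEXT)")

# Made-up sample rows for illustration only.
rows = [
    ("2019-03-01", "BROOKLYN"),
    ("2019-07-15", "QUEENS"),
    ("2020-04-02", "BROOKLYN"),
    ("2020-04-20", "BROOKLYN"),
    ("2020-05-11", "QUEENS"),
]
conn.executemany("INSERT INTO collisions VALUES (?, ?)", rows)


def crashes_per_year(conn):
    """Count crashes per year inside the database.

    Only one row per year is returned, instead of one row per crash.
    """
    query = """
        SELECT substr(crash_date, 1, 4) AS year, COUNT(*) AS n_crashes
        FROM collisions
        GROUP BY year
        ORDER BY year
    """
    return conn.execute(query).fetchall()


summary = crashes_per_year(conn)
print(summary)
```

With the full dataset, the same pattern would return one row per year rather than millions of raw records, which is what keeps the data frames passed to the plotting step small.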

5.0

What: Produce and optimize the killer plot, possibly using machine learning tools.
Why: To set the main theme of the final presentation.
How: Combine and synthesize key insights mined from the primary and secondary datasets in creative ways.
Who:
- Graph Generation: All team members
- Report LaTeX Writing + Submission: TBD

Rehearsal

What: Each team member prepares their part of the presentation and gets ready for possible Q&A.
Why: Practice makes perfect.
How: Write down presentation scripts and do timed practice runs as a group.
Who:
- All team members

Final Presentation

What: 10 min presentation + 2 min Q&A
Why: To share results with a general audience and explain why the project question is worth studying.
How: Each team member presents for around 2 min.
- quality of exploration/analysis (15%)
- format/quality of plots/tables (20%)
- killer plot (20%)
- format of presentation (introduction, methods, analysis, conclusion, etc.) (25%)
- overall delivery of the slides/presentation (how well you are able to express your research) (20%)
Who:
- Patrick
- Jian
- Zach
- Carrie
- Sophia

Aphanmiz/STAT-405-605-Group-Project