Stroke Prediction Analysis

This project analyzes a dataset of 11 clinical features for patients, each labeled by whether or not the patient has had a stroke. The project has three goals: to derive insight into the characteristics and statistics of these patients, to create a machine learning model that can determine whether a patient is at high risk of having a stroke, and to determine which factors influence whether a patient has had a stroke.

Communication

The team held an initial meeting via Zoom to discuss project requirements and assign individual responsibilities. Team meetings take place via Zoom twice per week during our scheduled class time. The team communicates via Slack as needed to update members on progress and to ask for assistance.

Group Members

Caitlin Bishop, Alex Borden, Andrew Carlson, Brandon Castro

Resources

Data Source: healthcare-dataset-stroke-data.csv from the Kaggle website. Credit goes to the author of the dataset, fedesoriano.

Tools: Jupyter Notebook, Visual Studio Code, Python, Pandas, NumPy, Seaborn, Matplotlib, scikit-learn (supervised machine learning binary classification model), PostgreSQL, and Tableau.

Technologies Used

Data Cleaning and Analysis

Jupyter Notebook, along with Python's Pandas, NumPy, and Seaborn libraries, will be used to clean the data and perform an exploratory/statistical analysis.
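
As a minimal sketch of the kind of cleaning step described above, the snippet below builds a tiny made-up DataFrame and fills in missing BMI values. The column names, the sample rows, and the choice of median imputation are assumptions for illustration and are not taken from the team's notebooks.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real data comes from healthcare-dataset-stroke-data.csv.
raw = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "age": [67.0, 45.0, 80.0, 33.0],
        "bmi": [36.6, np.nan, 32.5, 24.0],
        "smoking_status": ["formerly smoked", "Unknown", "never smoked", "smokes"],
        "stroke": [1, 0, 1, 0],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy with duplicate rows dropped and BMI gaps filled."""
    out = df.drop_duplicates().copy()
    # Median imputation is an assumption; dropping incomplete rows is another option.
    out["bmi"] = out["bmi"].fillna(out["bmi"].median())
    return out


cleaned = clean(raw)
print(cleaned.isna().sum())  # missing-value count per column
print(cleaned.describe())    # summary statistics for the numeric columns
```

Working on a copy keeps the raw data untouched, which makes it easier to compare statistics before and after cleaning.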

Database Storage

Data will be stored in, and queried from, a PostgreSQL database.

Machine Learning

Python's scikit-learn will be used to create a supervised machine learning binary classification model from the stroke patient data CSV file. The goal is a model that can determine whether or not a patient is at high risk of having a stroke based on various characteristics of the patient.

Dashboard

Our dashboard will be hosted on Tableau Public, which we will use to create a fully functioning, interactive dashboard and story to visualize and present our data and findings.

Stroke Prediction Analysis Dashboard Link

Roles:

Caitlin Bishop: GitHub/Data Cleaning/Exploratory Analysis/Presentation
Alex Borden: Technology/Dashboard
Andrew Carlson: Machine Learning Model
Brandon Castro: SQL-based Database

Segment 1

Presentation

Link to Presentation on Google Slides

Selected topic
- Stroke Prediction Analysis
Reason for selecting the topic
- Stroke prediction was chosen because of our common background and interest in the healthcare field.
Description of the source of data
- The data contains 11 clinical features for medical patients: patient ID, gender, age, hypertension status, heart disease status, marital status, employment type, residence type, average glucose level, body mass index (BMI), and smoking status. A target vector states whether or not a given patient has had a stroke.
Questions we hope to answer with the data
- Can the classification model determine whether or not a patient could have a stroke?
- Which factors most influence whether or not a stroke occurs?
- Through our analysis, can we find who is most susceptible to having a stroke?

GitHub Repository

README.md
Description of the communication protocols

Database Integration

Plan for storing data in a PostgreSQL database:

Create a table in pgAdmin4 into which the CSV file will be uploaded.
Create two other tables from the main table: one for the biological characteristics of patients and one for their demographic characteristics.
Perform queries to gather statistical insight on the data.
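
The plan above can be sketched in Python. This example uses the standard-library `sqlite3` module with an in-memory database purely as a stand-in for PostgreSQL. The table names, column names, sample rows, and the assignment of columns to the biological and demographic tables are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Step 1: main table that would hold the uploaded CSV (hypothetical columns).
cur.execute(
    """CREATE TABLE stroke_data (
           id INTEGER PRIMARY KEY, gender TEXT, age REAL,
           hypertension INTEGER, ever_married TEXT, stroke INTEGER)"""
)
rows = [
    (1, "Male", 67.0, 0, "Yes", 1),
    (2, "Female", 45.0, 0, "Yes", 0),
    (3, "Female", 80.0, 1, "No", 1),
    (4, "Male", 33.0, 0, "No", 0),
]
cur.executemany("INSERT INTO stroke_data VALUES (?, ?, ?, ?, ?, ?)", rows)

# Step 2: derive a biological table and a demographic table from the main one.
cur.execute(
    "CREATE TABLE biological AS "
    "SELECT id, gender, age, hypertension, stroke FROM stroke_data"
)
cur.execute(
    "CREATE TABLE demographic AS "
    "SELECT id, ever_married, stroke FROM stroke_data"
)

# Step 3: query for a simple statistic, here the average age by stroke outcome.
avg_age = dict(
    cur.execute("SELECT stroke, AVG(age) FROM biological GROUP BY stroke")
)
print(avg_age)
conn.close()
```

Keeping the patient `id` in both derived tables means they can be joined back together when a query needs biological and demographic features at once.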

Machine Learning Model

The CSV dataset will be read into a Pandas DataFrame and used for the machine learning model. The model's output will be a prediction of whether or not the patient had a stroke. As mentioned above, the goal is to create a model that can determine whether or not a patient is at high risk of having a stroke based on the patient's features in the dataset. If the output for a patient states that they had a stroke, then the patient may be at high risk of having a stroke according to their features.

Segment 2

Machine Learning Model

See the gradient_boosting_model.ipynb file in the machine_learning folder for a description of data preprocessing, feature engineering, feature selection, data splitting for training/testing of model, and gradient boosting model creation.
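
For readers who want a feel for the workflow before opening the notebook, here is a minimal sketch of splitting data and fitting a gradient boosting classifier with scikit-learn. It runs on synthetic data, and the feature count, split ratio, and default hyperparameters are assumptions; the notebook remains the reference for what the team actually did.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 "patients" with 3 numeric features (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic label that depends mostly on the first feature.
y = (X[:, 0] + 0.25 * rng.normal(size=200) > 0.5).astype(int)

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # 0 = no stroke, 1 = stroke
y_prob = model.predict_proba(X_test)[:, 1]  # estimated probability of class 1
print("test accuracy:", accuracy_score(y_test, y_pred))
```

The predicted probability from `predict_proba` is often more useful than the hard label when the aim is to flag patients at high risk rather than to issue a yes/no answer.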

Database

The CSV file was imported into a PostgreSQL database as a table using the following SQL query:

Below is a portion of the resulting table:

Story & Dashboard

We will be utilizing Tableau Public to create a story-based dashboard in combination with an interactive dashboard.

Stroke Prediction Analysis Story Link

Here is a sneak peek of the story points we will be using inside Tableau.

BMI Calculator
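
For reference, BMI is conventionally computed as weight in kilograms divided by the square of height in metres. The function below is a small illustration of that formula. Its name and inputs are hypothetical, and it is not the Tableau calculation itself.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


print(round(bmi(70.0, 1.75), 1))  # 22.9
```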

Age vs Stroke

Stroke Percentage of Marital Status
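
A stroke percentage per marital status group can be computed with a Pandas group-by, as in the sketch below. The column names and the eight sample rows are made up for illustration, so the printed percentages say nothing about the real dataset.

```python
import pandas as pd

# Made-up sample; not drawn from the actual dataset.
df = pd.DataFrame(
    {
        "ever_married": ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No"],
        "stroke": [1, 0, 0, 0, 0, 1, 1, 0],
    }
)

# The mean of a 0/1 column is the fraction of positives; scale it to a percentage.
stroke_pct = df.groupby("ever_married")["stroke"].mean().mul(100).round(1)
print(stroke_pct)
```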

Interactive Dashboard

Our interactive dashboard, created in Tableau, includes the following 8 correlations for identifying trends in the stroke dataset:

Averages
Age & Stroke
Gender & Work Type
Heart Disease and Hypertension
Impact of Marriage
Impact of Residence Type
Impact of Smoking Status
BMI & Glucose Calculators

The dashboard is fully functional and includes a Gender & Work Type bar chart that identifies trends in stroke predictions for males and females.

Stroke Prediction Analysis Dashboard Link

Segment 3

Machine Learning Model

See the gradient_boosting_model.ipynb file in the machine_learning folder for optimization methods used for the model, results after optimization, determination of feature importances, and conclusion of the machine learning analysis.
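
As a rough illustration of how feature importances can be read from a fitted gradient boosting model, the sketch below uses scikit-learn's `feature_importances_` attribute. The feature names are a hypothetical subset, and the synthetic target is deliberately built from the first column only, so the resulting ranking is an artifact of the example and not a finding of this project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["age", "avg_glucose_level", "bmi"]  # hypothetical subset

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# Synthetic target driven by the first column only (by construction).
y = (X[:, 0] > 0).astype(int)

model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Importances are normalized to sum to 1; a higher score means more influence.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```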

Story & Dashboard

Finalized the Dashboard/Story and added Exploratory Analysis and Machine Learning to show importance within the Tableau story.

Exploratory Analysis

Machine Learning

Finalized Dashboard

Database

SQL script for creating the tables: ERD_DB_creation.sql

Below are portions of the resulting tables.

Biological Features Table:

Demographic Features Table:

Segment 4

Final Presentation

Link to Presentation on Google Slides
