Ready to be discharged: predicting hospital readmissions

Overview

This project addresses the challenge of hospital readmissions among diabetic patients, a significant issue with a large impact on healthcare costs. It was developed for the "Machine Learning" course of the MSc in Data Science and Advanced Analytics at NOVA IMS.
The project involved developing two models:

  • Binary Classification: build a model that accurately predicts whether a patient will be readmitted to the hospital within 30 days of being discharged.
  • Multiclass Classification: build a classifier that predicts the timeframe of a patient's readmission, with the classes "No", "<30 days", and ">30 days". The binary target can be derived from the multiclass one, as sketched below.
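
A minimal sketch of that derivation, assuming the multiclass label lives in a column named readmitted_multiclass (the actual column names in the dataset may differ):

```python
import pandas as pd

# Hypothetical column name for illustration; the dataset's target
# columns may be named differently.
df = pd.DataFrame({"readmitted_multiclass": ["No", "<30 days", ">30 days"]})

# Binary task: 1 if readmitted within 30 days, 0 otherwise.
df["readmitted_binary"] = (df["readmitted_multiclass"] == "<30 days").astype(int)
print(df)
```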

Python Packages Used

  • Data Manipulation: pandas, numpy, math, scipy
  • Data Visualization: matplotlib, missingno, seaborn
  • Machine Learning: sklearn, boruta, imblearn
  • General Purpose: warnings

(Note: only models from scikit-learn could be used for this project.)

Dataset

The dataset consists of 29 features and 2 targets, one for the binary task and one for the multiclass task. It includes crucial features such as patient identifiers, demographic information, health-related details, historical healthcare utilization, admission specifics, vital signs, discharge information, length of hospital stay, lab tests, procedures, medications, and diagnostic codes.
For a more detailed explanation of the attributes, click here.
The Datasets folder contains two datasets: the train set, used for model training and validation, and the test set, used to assess the performance of the trained models on unseen data.

Preprocessing

Steps taken (a minimal sketch in code follows the list):

  • Data Reading and Transformation: Utilized Pandas to read the train file, defining 'encounter_id' as the DataFrame index. Simplified variable names for clarity and categorized variables into metric and non-metric features.
  • Duplicated Values: Checked for duplicated rows to ensure data integrity.
  • Data Visualization: Plotted feature distributions using histograms and box plots to identify outliers. Applied transformations to stabilize distributions and normalize data.
  • Outlier Handling: Used visual inspection to identify and remove outliers based on box-plot criteria.
  • Missing Values: Handled missing values by dropping features with more than 50% missing data. Replaced missing values in specific features based on metadata insights and assumptions. Imputed missing values for remaining categorical variables using mode.
  • Feature Selection: Dropped irrelevant features such as 'country' and 'patient_id' to streamline the dataset.
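
A minimal sketch of these steps, assuming the train file is Datasets/train.csv (the exact file name and thresholds in the notebooks may differ):

```python
import pandas as pd

# Read the training data, using the encounter identifier as the index.
train = pd.read_csv("Datasets/train.csv", index_col="encounter_id")

# Data integrity check: count fully duplicated rows.
print("duplicated rows:", train.duplicated().sum())

# Drop features with more than 50% missing values.
too_sparse = train.columns[train.isna().mean() > 0.5]
train = train.drop(columns=too_sparse)

# Impute remaining missing values in categorical features with the mode.
for col in train.select_dtypes(include="object"):
    train[col] = train[col].fillna(train[col].mode()[0])

# Drop features judged irrelevant for modelling.
train = train.drop(columns=["country", "patient_id"], errors="ignore")
```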

Feature Engineering

Steps taken (sketched in code after the list):

  • Mapping Categorical Features: Categorized and mapped variables like 'age', diagnosis features, 'discharge_disposition', and 'admission_source' to simplify and extract relevant information.
  • Handling Missing Values: Handled the missing values that remained after mapping. New features such as 'recurrency', 'patient_severity', 'medication_change_ratio', 'number_prior_visits', 'lab_tests_to_medications_ratio', and 'age_times_medications' were created based on insights from the data.
  • Data Exploration: Plotted histograms to analyze distributions of new features and check for transformations. Also, assessed correlations between new and existing features.
  • Encoding Categorical Variables: Encoded categorical features using Label Encoder.
  • Data Scaling: Utilized robust scaler to scale the data, maintaining distribution and handling outliers effectively.
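
Continuing from the preprocessing sketch above, a rough illustration of the encoding and scaling steps, with one engineered ratio as an example (num_lab_procedures, num_medications, and the target names are assumptions for illustration):

```python
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Hypothetical engineered feature: ratio of lab tests to medications,
# guarding against division by zero (column names are assumptions).
train["lab_tests_to_medications_ratio"] = (
    train["num_lab_procedures"] / train["num_medications"].replace(0, 1)
)

# Encode the remaining categorical variables with a label encoder,
# leaving the (hypothetically named) target columns untouched.
targets = ["readmitted_binary", "readmitted_multiclass"]
for col in train.select_dtypes(include="object").columns.difference(targets):
    train[col] = LabelEncoder().fit_transform(train[col])

# Scale the features with a robust scaler, which centres on the median
# and scales by the IQR, so remaining outliers have limited influence.
feature_cols = train.columns.difference(targets)
train[feature_cols] = RobustScaler().fit_transform(train[feature_cols])
```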

Binary Classification

Feature Selection

  • Techniques Used: Employed various feature selection techniques, including Mutual Information, Boruta, Recursive Feature Elimination (RFE), Ridge Regression, Select K Best, LASSO Regression, and Chi-Square.
  • Optimal Features Determination: Selected 14 features based on Select K Best and RFE with cross-validation, improving model performance and interpretability.
  • Feature Sets: Derived two feature sets from the consensus of multiple feature selection methods, for robustness (a sketch of the two main techniques follows this list).
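
A sketch of the Select K Best and cross-validated RFE steps, continuing with the hypothetical target names from earlier (the estimators and scoring used in the project may differ):

```python
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X = train.drop(columns=["readmitted_binary", "readmitted_multiclass"])
y = train["readmitted_binary"]

# Select K Best with mutual information as the scoring function.
kbest = SelectKBest(score_func=mutual_info_classif, k=14).fit(X, y)

# Recursive Feature Elimination with cross-validation on a linear model.
rfecv = RFECV(LogisticRegression(max_iter=1000), scoring="f1", cv=5).fit(X, y)

# Features that both methods agree on.
selected = set(X.columns[kbest.get_support()]) & set(X.columns[rfecv.support_])
print(sorted(selected))
```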

Modelling Approach

  • Model Exploration: Tested multiple models including Logistic Regression, Decision Trees, Random Forest, Hist Gradient Boosting, K-Nearest Neighbors, Naive Bayes, Support Vector Machines, and Neural Networks to identify promising candidates.
  • Handling Class Imbalance: Utilized Random Over Sampler and Random Under Sampler to address class imbalance.
  • Scaling: Explored Min-Max, Z-score, and Robust Scaler for feature normalization.
  • Model Selection: Narrowed the field down to Random Forest, Hist Gradient Boosting, and Stacking models for further evaluation (see the sketch after this list).
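
One way to combine resampling with cross-validated model comparison, using imblearn's pipeline so that resampling is applied only to the training folds (the samplers, folds, and hyperparameters here are illustrative):

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "hist_gradient_boosting": HistGradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    # The imblearn Pipeline resamples inside each training fold only,
    # so the validation folds keep their original class distribution.
    pipe = Pipeline([("sampler", RandomOverSampler(random_state=0)), ("model", model)])
    scores = cross_val_score(pipe, X, y, scoring="f1", cv=5)
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```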

Performance Assessment

Evaluation Metrics: Used the F1 score as the primary metric for model evaluation, since it balances precision and recall and penalizes misclassified cases.

| Model | F1 | Mean Accuracy | Train Score | Test Score |
| --- | --- | --- | --- | --- |
| Random Forest | 0.70 | 0.760 | 0.789 | 0.695 |
| Hist Gradient Boosting | 0.66 | 0.685 | 0.693 | 0.658 |
| Stacking (NN + ET) | 0.64 | 0.684 | 0.690 | 0.644 |

Conclusion: For the final model we opted for a Stacking model comprising a Neural Network classifier and an Extra Trees classifier, due to its balanced performance across metrics. A sketch of such a stacking setup is shown below.
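
A minimal sketch of a stacking setup along these lines; the meta-learner and hyperparameters below are assumptions, not the project's exact configuration:

```python
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

stack = StackingClassifier(
    estimators=[
        ("nn", MLPClassifier(max_iter=500, random_state=0)),
        ("et", ExtraTreesClassifier(random_state=0)),
    ],
    # Logistic regression as a simple meta-learner (an assumption here).
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X, y)
```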

Multiclass Classification

Feature Selection

  • Techniques Used: Employed Mutual Information, Boruta, RFE, Select K Best, LASSO Regression, and Ridge Regression for feature selection.
  • Optimal Features Determination: Selected 19 features based on cross-validation scores and feature importance rankings from multiple techniques (Boruta is sketched below).
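
As an example of one of these techniques, a rough sketch of Boruta with a random forest, assuming the hypothetical target names from earlier:

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

X_multi = train.drop(columns=["readmitted_binary", "readmitted_multiclass"])
y_multi = train["readmitted_multiclass"]

# Boruta compares real features against shuffled "shadow" copies and
# keeps only those that beat the shadows consistently.
forest = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta = BorutaPy(forest, n_estimators="auto", random_state=0)
boruta.fit(X_multi.values, y_multi.values)  # BorutaPy expects numpy arrays

print(X_multi.columns[boruta.support_].tolist())
```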

Modelling Approach

  • Model Selection: Tested models including Decision Tree, Random Forest, AdaBoost, Neural Networks, and Naïve Bayes.
  • Handling Class Imbalance: Utilized Random Under Sampler to address class imbalance and improve computational efficiency.
  • Consideration for Model Selection: Chose models based on their interpretability, ability to handle complex relationships, and computational efficiency (a sketch of the resampling and boosting setup follows this list).
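
A minimal sketch of undersampling followed by AdaBoost, continuing from the Boruta sketch above (hyperparameters are defaults, not the tuned values):

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import AdaBoostClassifier

# Undersampling the majority classes both balances the target and
# shrinks the training set, which speeds up model fitting.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X_multi, y_multi)

ada = AdaBoostClassifier(random_state=0).fit(X_res, y_res)
```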

Performance Assessment

Evaluation Metrics: Used macro-averaged scores so that every class carries equal weight in the evaluation, regardless of its frequency; a small example follows.
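
For instance, with scikit-learn the macro-averaged F1 computes a per-class F1 and takes the unweighted mean, so the rare "<30 days" class counts as much as the majority "No" class:

```python
from sklearn.metrics import f1_score

y_true = ["No", "<30 days", ">30 days", "No", "No", ">30 days"]
y_pred = ["No", "No", ">30 days", "No", "No", "<30 days"]

# average="macro": F1 per class, then the unweighted mean over classes.
print(f1_score(y_true, y_pred, average="macro"))
```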

| Model | F1 | Mean Accuracy | Train Score | Test Score |
| --- | --- | --- | --- | --- |
| AdaBoost | 0.60 | 0.536 | 0.597 | 0.544 |
| Random Forest | 0.59 | 0.530 | 0.672 | 0.590 |

Conclusion: Adaptive Boosting slightly outperformed Random Forest, scoring 0.5970 against 0.5906.

Conclusion

Based on the analysis conducted for both the binary and multiclass classification tasks, we reached some significant conclusions. Firstly, our chosen models demonstrated promising predictive capabilities, with probabilities of patient readmission calculated at 8% for multiclass classification and 40% for binary classification, derived from a careful assessment of true positives and false positives within the relevant classes.

Moreover, the features selected for model inclusion align well with our initial research objectives, emphasizing factors such as hospital length of stay, medication, and associated diagnoses. Notably, the 'recurrency' feature emerged as particularly influential, reflecting the heightened risk associated with repeated patient admissions.

Despite these promising outcomes, our study encountered certain limitations. We are almost certain that we could achieve better scores by engineering better features from the existing dataset. We also faced challenges in picking the right tools and techniques, and the process of analyzing the data was time-consuming. Despite these challenges, our models performed similarly well.

TODO in future:

  • Improve feature engineering by adding more features
  • Improve final model scores
  • Add requirements.yml file to readme
  • Add installation and usage part to readme