Ready to be discharged: predicting hospital readmissions

Overview

This project addresses the challenge of hospital readmissions among diabetic patients, a significant issue with a large impact on healthcare costs. It was developed for the "Machine Learning" course of the MSc in Data Science and Advanced Analytics at NOVA IMS.
The project involved developing two models:

  • Binary Classification: build a model that accurately predicts whether a patient will be readmitted to the hospital within 30 days of being discharged.
  • Multiclass Classification: build a classifier that predicts the timeframe of a patient's readmission, with the classes "No", "<30 days", and ">30 days". The binary target can be derived from the multiclass one, as sketched below.
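
A minimal sketch of that derivation, assuming the multiclass label lives in a column named readmitted_multiclass (the actual column names in the dataset may differ):

```python
import pandas as pd

# Hypothetical column name for illustration; the dataset's target
# columns may be named differently.
df = pd.DataFrame({"readmitted_multiclass": ["No", "<30 days", ">30 days"]})

# Binary task: 1 if readmitted within 30 days, 0 otherwise.
df["readmitted_binary"] = (df["readmitted_multiclass"] == "<30 days").astype(int)
print(df)
```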

Python Packages Used

  • Data Manipulation: pandas, numpy, math, scipy
  • Data Visualization: matplotlib, missingno, seaborn
  • Machine Learning: sklearn, boruta, imblearn
  • General Purpose: warnings

(Note: only models from scikit-learn could be used for this project.)

Dataset

The dataset consists of 29 features and 2 targets, one for the binary task and one for the multiclass task. It includes crucial features such as patient identifiers, demographic information, health-related details, historical healthcare utilization, admission specifics, vital signs, discharge information, length of hospital stay, lab tests, procedures, medications, and diagnostic codes.
For a more detailed explanation of the attributes, click here.
The Datasets folder contains two datasets: the train set, used for model training and validation, and the test set, used to assess the performance of the trained models on unseen data.

Preprocessing

Steps taken (a minimal sketch in code follows the list):

  • Data Reading and Transformation: Utilized Pandas to read the train file, defining 'encounter_id' as the DataFrame index. Simplified variable names for clarity and categorized variables into metric and non-metric features.
  • Duplicated Values: Checked for duplicated rows to ensure data integrity.
  • Data Visualization: Plotted feature distributions using histograms and box plots to identify outliers. Applied transformations to stabilize distributions and normalize data.
  • Outlier Handling: Used visual inspection to identify and remove outliers based on box-plot criteria.
  • Missing Values: Handled missing values by dropping features with more than 50% missing data. Replaced missing values in specific features based on metadata insights and assumptions. Imputed missing values for remaining categorical variables using mode.
  • Feature Selection: Dropped irrelevant features such as 'country' and 'patient_id' to streamline the dataset.
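
A minimal sketch of these steps, assuming the train file is Datasets/train.csv (the exact file name and thresholds in the notebooks may differ):

```python
import pandas as pd

# Read the training data, using the encounter identifier as the index.
train = pd.read_csv("Datasets/train.csv", index_col="encounter_id")

# Data integrity check: count fully duplicated rows.
print("duplicated rows:", train.duplicated().sum())

# Drop features with more than 50% missing values.
too_sparse = train.columns[train.isna().mean() > 0.5]
train = train.drop(columns=too_sparse)

# Impute remaining missing values in categorical features with the mode.
for col in train.select_dtypes(include="object"):
    train[col] = train[col].fillna(train[col].mode()[0])

# Drop features judged irrelevant for modelling.
train = train.drop(columns=["country", "patient_id"], errors="ignore")
```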

Feature Engineering

Steps taken (sketched in code after the list):

  • Mapping Categorical Features: Categorized and mapped variables like 'age', diagnosis features, 'discharge_disposition', and 'admission_source' to simplify and extract relevant information.
  • Handling Missing Values: Handled the missing values that remained after mapping. New features such as 'recurrency', 'patient_severity', 'medication_change_ratio', 'number_prior_visits', 'lab_tests_to_medications_ratio', and 'age_times_medications' were created based on insights from the data.
  • Data Exploration: Plotted histograms to analyze distributions of new features and check for transformations. Also, assessed correlations between new and existing features.
  • Encoding Categorical Variables: Encoded categorical features using Label Encoder.
  • Data Scaling: Utilized robust scaler to scale the data, maintaining distribution and handling outliers effectively.
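
Continuing from the preprocessing sketch above, a rough illustration of the encoding and scaling steps, with one engineered ratio as an example (num_lab_procedures, num_medications, and the target names are assumptions for illustration):

```python
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Hypothetical engineered feature: ratio of lab tests to medications,
# guarding against division by zero (column names are assumptions).
train["lab_tests_to_medications_ratio"] = (
    train["num_lab_procedures"] / train["num_medications"].replace(0, 1)
)

# Encode the remaining categorical variables with a label encoder,
# leaving the (hypothetically named) target columns untouched.
targets = ["readmitted_binary", "readmitted_multiclass"]
for col in train.select_dtypes(include="object").columns.difference(targets):
    train[col] = LabelEncoder().fit_transform(train[col])

# Scale the features with a robust scaler, which centres on the median
# and scales by the IQR, so remaining outliers have limited influence.
feature_cols = train.columns.difference(targets)
train[feature_cols] = RobustScaler().fit_transform(train[feature_cols])
```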

Binary Classification

Feature Selection

  • Techniques Used: Employed various feature selection techniques, including Mutual Information, Boruta, Recursive Feature Elimination (RFE), Ridge Regression, Select K Best, LASSO Regression, and Chi-Square.
  • Optimal Features Determination: Selected 14 features based on Select K Best and RFE with cross-validation, improving model performance and interpretability.
  • Feature Sets: Derived two feature sets from the consensus of multiple feature selection methods, for robustness (a sketch of the two main techniques follows this list).
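
A sketch of the Select K Best and cross-validated RFE steps, continuing with the hypothetical target names from earlier (the estimators and scoring used in the project may differ):

```python
from sklearn.feature_selection import RFECV, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X = train.drop(columns=["readmitted_binary", "readmitted_multiclass"])
y = train["readmitted_binary"]

# Select K Best with mutual information as the scoring function.
kbest = SelectKBest(score_func=mutual_info_classif, k=14).fit(X, y)

# Recursive Feature Elimination with cross-validation on a linear model.
rfecv = RFECV(LogisticRegression(max_iter=1000), scoring="f1", cv=5).fit(X, y)

# Features that both methods agree on.
selected = set(X.columns[kbest.get_support()]) & set(X.columns[rfecv.support_])
print(sorted(selected))
```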

Modelling Approach

  • Model Exploration: Tested multiple models including Logistic Regression, Decision Trees, Random Forest, Hist Gradient Boosting, K-Nearest Neighbors, Naive Bayes, Support Vector Machines, and Neural Networks to identify promising candidates.
  • Handling Class Imbalance: Utilized Random Over Sampler and Random Under Sampler to address class imbalance.
  • Scaling: Explored Min-Max, Z-score, and Robust Scaler for feature normalization.
  • Model Selection: Narrowed the field down to Random Forest, Hist Gradient Boosting, and Stacking models for further evaluation (see the sketch after this list).
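
One way to combine resampling with cross-validated model comparison, using imblearn's pipeline so that resampling is applied only to the training folds (the samplers, folds, and hyperparameters here are illustrative):

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "hist_gradient_boosting": HistGradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    # The imblearn Pipeline resamples inside each training fold only,
    # so the validation folds keep their original class distribution.
    pipe = Pipeline([("sampler", RandomOverSampler(random_state=0)), ("model", model)])
    scores = cross_val_score(pipe, X, y, scoring="f1", cv=5)
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```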

Performance Assessment

Evaluation Metrics: Used the F1 score as the primary metric for model evaluation, since it balances precision and recall and penalizes misclassified cases.

| Model | F1 | Mean Accuracy | Train Score | Test Score |
| --- | --- | --- | --- | --- |
| Random Forest | 0.70 | 0.760 | 0.789 | 0.695 |
| Hist Gradient Boosting | 0.66 | 0.685 | 0.693 | 0.658 |
| Stacking (NN + ET) | 0.64 | 0.684 | 0.690 | 0.644 |

Conclusion: For the final model we opted for a Stacking model comprising a Neural Network classifier and an Extra Trees classifier, due to its balanced performance across metrics. A sketch of such a stacking setup is shown below.
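
A minimal sketch of a stacking setup along these lines; the meta-learner and hyperparameters below are assumptions, not the project's exact configuration:

```python
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

stack = StackingClassifier(
    estimators=[
        ("nn", MLPClassifier(max_iter=500, random_state=0)),
        ("et", ExtraTreesClassifier(random_state=0)),
    ],
    # Logistic regression as a simple meta-learner (an assumption here).
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X, y)
```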

Multiclass Classification

Feature Selection

  • Techniques Used: Employed Mutual Information, Boruta, RFE, Select K Best, LASSO Regression, and Ridge Regression for feature selection.
  • Optimal Features Determination: Selected 19 features based on cross-validation scores and feature importance rankings from multiple techniques (Boruta is sketched below).
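
As an example of one of these techniques, a rough sketch of Boruta with a random forest, assuming the hypothetical target names from earlier:

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

X_multi = train.drop(columns=["readmitted_binary", "readmitted_multiclass"])
y_multi = train["readmitted_multiclass"]

# Boruta compares real features against shuffled "shadow" copies and
# keeps only those that beat the shadows consistently.
forest = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta = BorutaPy(forest, n_estimators="auto", random_state=0)
boruta.fit(X_multi.values, y_multi.values)  # BorutaPy expects numpy arrays

print(X_multi.columns[boruta.support_].tolist())
```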

Modelling Approach

  • Model Selection: Tested models including Decision Tree, Random Forest, AdaBoost, Neural Networks, and Naïve Bayes.
  • Handling Class Imbalance: Utilized Random Under Sampler to address class imbalance and improve computational efficiency.
  • Consideration for Model Selection: Chose models based on their interpretability, ability to handle complex relationships, and computational efficiency (a sketch of the resampling and boosting setup follows this list).
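
A minimal sketch of undersampling followed by AdaBoost, continuing from the Boruta sketch above (hyperparameters are defaults, not the tuned values):

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import AdaBoostClassifier

# Undersampling the majority classes both balances the target and
# shrinks the training set, which speeds up model fitting.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X_multi, y_multi)

ada = AdaBoostClassifier(random_state=0).fit(X_res, y_res)
```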

Performance Assessment

Evaluation Metrics: Used macro-averaged scores so that every class carries equal weight in the evaluation, regardless of its frequency; a small example follows.
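
For instance, with scikit-learn the macro-averaged F1 computes a per-class F1 and takes the unweighted mean, so the rare "<30 days" class counts as much as the majority "No" class:

```python
from sklearn.metrics import f1_score

y_true = ["No", "<30 days", ">30 days", "No", "No", ">30 days"]
y_pred = ["No", "No", ">30 days", "No", "No", "<30 days"]

# average="macro": F1 per class, then the unweighted mean over classes.
print(f1_score(y_true, y_pred, average="macro"))
```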

| Model | F1 | Mean Accuracy | Train Score | Test Score |
| --- | --- | --- | --- | --- |
| AdaBoost | 0.60 | 0.536 | 0.597 | 0.544 |
| Random Forest | 0.59 | 0.530 | 0.672 | 0.590 |

Conclusion: Adaptive Boosting slightly outperformed Random Forest, scoring 0.5970 against 0.5906.

Conclusion

Based on the analysis conducted for both the binary and multiclass classification tasks, we reached some significant conclusions. Firstly, our chosen models demonstrated promising predictive capabilities, with probabilities of patient readmission calculated at 8% for multiclass classification and 40% for binary classification, derived from a careful assessment of true positives and false positives within the relevant classes.

Moreover, the features selected for model inclusion align well with our initial research objectives, emphasizing factors such as hospital length of stay, medication, and associated diagnoses. Notably, the 'recurrency' feature emerged as particularly influential, reflecting the heightened risk associated with repeated patient admissions.

Despite these promising outcomes, our study encountered certain limitations. We are almost certain that we could achieve better scores by engineering better features from the existing dataset. We also faced challenges in picking the right tools and techniques, and the process of analyzing the data was time-consuming. Despite these challenges, our models performed similarly well.

TODO in future:

  • Improve feature engineering by adding more features
  • Improve final model scores
  • Add requirements.yml file to readme
  • Add installation and usage part to readme