Skip to content

eeliuqin/Osteoporosis-Analysis-and-Prediction-on-NHANES-Data

Repository files navigation

TABLE OF CONTENTS

Introduction

Osteoporosis, the most common bone disease, occurs when bone mineral density and bone mass decrease, or there are changes in bone structure and strength. In 2010, an estimated 10.2 million people in the United States aged 50 and over had osteoporosis, and an estimated 43.3 million others had low bone mass [1]. However, it is a silent disease that most people with osteoporosis do not know they have it until they break a bone. Therefore, accurate prediction of osteoporosis is of great public health benefit.

To my best of knowledge, most osteoporosis prediction studies either focus on specific laboratory examination [2], or on specific patient groups such as postmenopausal women [3]. A study using more general data and targeting a wider population is needed.

Updated: I created a Osteoporosis Prediction Web App based on methods selected in this project. Code can be found here.

App Demo

Objective

Design a method to predict whether someone has osteoporosis based on age, gender, race, BMI, smoking, alcohol consumption, sleep duration, arthritis, liver condition, and whether a parent has osteoporosis.

Note: Age, gender, race, BMI, smoking, alcohol consumption were inspired by a study of hypertension prediction using NHANES dataset [4]. Sleep duration, arthritis, liver condition, parental osteoporosis were inspired by [5], [6], [7], and [8], respectively.

Language and Tools

Task Technique Tools/Packages
Data pre-processing merging data, handling missing values and outliers, oversampling pandas, numpy, imblearn
Data visualization boxplots, histgrams, bar graphs matplotlib, seaborn
Data modeling Logistic Regression, SVM, Random Forest, Neural Networks sklearn, tensorflow
Language & Environment Python; Jupyter Notebook

Data Source

The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. Datasets of this program are prepared and published through the Centers for Disease Control and Prevention (CDC) and available to the public.

This study focuses on NHANES data for the years 2013-2014, and 2017-March 2020 Pre-Pandemic, including the following:

  • Demographics Data: Age, Gender, Race
  • Examination Data: Body Measures (BMI)
  • Questionnaire Data: Osteoporosis, Cigarette Use, Alcohol Use, Sleep Disorders, Medical Conditions (Arthritis, Liver Condition)

Note: The target sample for Osteoporosis Questionnaire in 2013-2014 and 2017-2020 were participants aged 40+ and 50+, respectively. Osteoporosis assessment in NHANES for 2015-2016 was not completed, so it's not included in this study.

Data Processing

Cleaning and Merging the Data

  • Renamed variables based on the data documents. For example, renamed RIDAGEYR to Age, SLD010H to Sleep Duration.
  • Converted code values to corresponding text values. For example, "1" should be converted to "Mexican American" for variable Race and "Male" for variable Gender.
  • Full joined datasets by the respondent sequence number (SEQN, renamed to ID).

After merging, the data has 25735 rows x 10 columns, and the percentage of missing values is shown in the table:

Variable Missing (%)
BMI 13.8
Sleep Duration (Hours) 35.7
Smoking 38.6
Arthritis 42.0
Liver Condition 42.0
Heavy Drinking 53.5
Osteoporosis 65.9
Parental Osteoporosis 67.9

Handling Missing Values

Reasons of missing values:

  1. The Osteoporosis Questionnaire focused on 8802 people (aged 40+ in 2013-2014, aged 50+ in 2017-2020), while other data such as Demographic, focused on 25735 people aged 0-80.
  2. Some people didn't answer all questions.

For 1), just ignore the missing values from Osteoporosis to focus on aged 40+, because osteoporosis is rare in young people and more common among people over age 50 [8, 9].

For 2), this is the distribution of all data vs missing data:

Missing Data Distribution

They have similar distributions, removing missing data should not cause too much bias, besides, data imputation is not considered in this study as it leads to inaccracy and uncertainty.

Therefore, the study analyzed complete data only, with a dimension of 6144 rows x 11 columns.

EDA

The dataset is imbalanced with a 9:1 ratio of people with and without osteoporosis:

Overall Data

The prevalence of osteoporosis is associated differently with variables, such as:

Female vs. Male     Age
 
Race     BMI

Feature Engineering

Binning the following variables for higher mutual information scores:

  • BMI Group: Underweight (BMI < 18.5), Healthy Weight (18.5 <= BMI < 25), Overweight (25.0 <= BMI < 30), Obesity (30.0 or higher)
  • Sleep Duration Group: Less than 7 Hours, 7-9 Hours, More than 9 Hours

Handling Imbalanced Data

For such an imbalanced data (with osteoporosis: 9.9%, without osteoporosis: 90.1%), models probably have much poor predictive performance for the minority class (with osteoporosis) than the majority (without), however, correct prediction of the minority class is more important.

There are 3 options for addressing imbalanced data: Undersampling, Oversampling, and Combination of undersampling and oversampling. The main disadvantage of undersampling is that it will discard potentially useful data, so it will not be considered in this project. Oversampling does not cause any loss of information, and in some cases, may perform better than undersampling. However, oversampling often involves duplicating a small number of events, which leads to overfitting. To balance these concerns, some scenarios may require a combination of undersampling and oversampling to obtain the most realistic dataset and accurate results.

This project compared 2 oversampling methods (Adaptive Synthetic Sampling Approach (ADASYN)[10], Synthetic Minority Oversampling Technique (SMOTE) [11])and 1 combination method (SMOTETomek [12]), here is the performance metrics of Logistic Regression with original data only, after ADASYN, after SMOTE, and after SMOTETomek:

Oversampling Method Accuracy Precision Recall F1 Score AUC
SMOTE 0.781 0.263 0.783 0.394 0.852
SMOTETomek 0.777 0.261 0.792 0.393 0.851
ADASYN 0.765 0.250 0.792 0.380 0.851
Original Data 0.908 0.484 0.142 0.219 0.853

All 3 resampling methods sigfinicantly improved recall and F1 score, with SMOTE performing best. Therefore it's applied to the training dataset.

Model Selection

Benchmarking 4 popular classification algorithms:

Model Pros Cons
Logistic Regression easier to set up and train than other machine learning applications; very efficient when the dataset has features that are linearly separable fails to capture complex relationships; overfits on high dimensional data
Support Vector Machines (SVM) works well with a clear margin of separation; effective in high-dimensional spaces doesn't perform well when the dataset is large or has more noise
Random Forest works well with non-linear data; lower risk of overfitting not suitable for dataset with a lot of sparse features
Neural Networks works well with non-linear data with large number of inputs; fast predictions once trained works like a black box and not interpretable; computation is expensive and time consuming

Results

Predicted osteoporosis based on age, gender, race, BMI, smoking, alcohol, arthritis, liver condition, and parental osteoporosis with above models, Neural Networks performed best, and after optimization, it achieved sensitivity (recall) 72.6%, f1 score 44.5% and AUC 0.85.

ROC Curves

ROC Curves of Various Models

Performance Metrics

Model Accuracy Precision Recall F1 Score AUC
Neural Networks 0.835 0.321 0.726 0.445 0.850
SVM 0.813 0.289 0.726 0.414 0.840
Logistic Regression 0.781 0.263 0.783 0.394 0.852
Random Forest 0.883 0.350 0.340 0.344 0.816

Conclusions

This study focused on predicting osteoporosis based on age, gender, race, BMI, smoking, alcohol, arthritis, liver condition, and parental osteoporosis. The analyzed results showed that women had a higher risk of osteoporosis than men, and it increased with age. Additionally, osteoporosis was associated with underweight, arthritis, and parental osteoporosis. The predictive model with Neural Networks algorithm can be implemented in applications to assist professionals in identifying people with a high risk of developing osteoporosis.

References

[1] Wright NC, Looker AC, Saag KG, Curtis JR, Delzell ES, Randall S, Dawson-Hughes B. The recent prevalence of osteoporosis and low bone mass in the United States based on bone mineral density at the femoral neck or lumbar spine. J Bone Miner Res 29(11):2520–6. 2014.
[2] Theodoros Iliou, Christos-Nikolaos Anagnostopoulos, Ioannis M. Stephanakis, George Anastassopoulos, A novel data preprocessing method for boosting neural network performance: A case study in osteoporosis prediction, Information Sciences, Volume 380, 2017.
[3] S. K. Kim, T. K. Yoo, E. Oh and D. W. Kim, "Osteoporosis risk prediction using machine learning and conventional methods," 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan, 2013, pp. 188-191, doi: 10.1109/EMBC.2013.6609469.
[4] López-Martínez, Fernando, et al. "An artificial neural network approach for predicting hypertension using NHANES data." Scientific Reports 10.1 (2020): 1-14.
[5] Ochs-Balcom HM, Hovey KM, Andrews C, Cauley JA, Hale L, Li W, Bea JW, Sarto GE, Stefanick ML, Stone KL, Watts NB, Zaslavsky O, Wactawski-Wende J. Short Sleep Is Associated With Low Bone Mineral Density and Osteoporosis in the Women's Health Initiative. J Bone Miner Res. 2020 Feb;35(2):261-268. doi: 10.1002/jbmr.3879. Epub 2019 Nov 6. PMID: 31692127; PMCID: PMC8223077.
[6] What People With Rheumatoid Arthritis Need To Know About Osteoporosis
[7] Handzlik-Orlik G, Holecki M, Wilczyński K, Duława J. Osteoporosis in liver disease: pathogenesis and management. Ther Adv Endocrinol Metab. 2016 Jun;7(3):128-35. doi: 10.1177/2042018816641351. Epub 2016 Apr 6. PMID: 27293541; PMCID: PMC4892399.
[8] Does Osteoporosis Run in Your Family
[9] Juvenile Osteoporosis
[10] Haibo He, Yang Bai, E. A. Garcia and Shutao Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 2008, pp. 1322-1328, doi: 10.1109/IJCNN.2008.4633969.
[11] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
[12] G. Batista, B. Bazzan, M. Monard, “Balancing Training Data for Automated Annotation of Keywords: a Case Study,” In WOB, 10-18, 2003.

About

Machine Learning Web App - Osteoporosis Prediction

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published