This library is under development. Please contact [email protected] or any of the contributors for more information.
- License: MIT
- Development Status: Pre-Alpha
- Homepage: https://github.com/MLBazaar/Cardea
- Documentation: https://MLBazaar.github.io/Cardea
Cardea is a machine learning library built on top of schemas that support electronic health records (EHR). The library uses a number of AutoML tools developed under the Human Data Interaction Project at the Data to AI Lab at MIT.
Our goal is to provide an easy-to-use library for developing machine learning models from electronic health records. Typical usage involves interacting with our API to develop prediction models.
Building a machine learning model involves a sequence of steps, each triggered through the Cardea API:
- Loading data using the automatic data assembler, which captures data from its raw format into an entityset representation.
- Data labeling, where we create label times that provide (1) the time index indicating the timespan over which features are computed and (2) the encoded labels of the prediction task. This is essential for the feature engineering phase.
- Featurization, where we automatically engineer features from the data to generate a feature matrix.
- Modeling, where we build, train, and tune the machine learning model using the modeling component.
To learn more about how we structure our machine learning process and our data structures, read our documentation at https://MLBazaar.github.io/Cardea.
If you want to be part of the Cardea community to receive announcements of the latest releases, ask questions, or suggest new features, please join our Slack Workspace!
The easiest and recommended way to install Cardea is using pip:
pip install cardea
This will pull and install the latest stable release from PyPI.
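To confirm that the installation succeeded, one option is to query the installed distribution's version from Python. This is a minimal sketch using only the standard library; the helper name installed_version is made up for illustration and is not part of Cardea.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "cardea" is the distribution name used in the pip command above.
version = installed_version("cardea")
print("cardea version:", version if version else "not installed")
```

If the script reports "not installed", check that pip and the Python interpreter belong to the same environment.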
In this short tutorial, we will guide you through a series of steps to help you get started with Cardea.
First, we download the dataset we will be working with. In this example, we load a pre-processed version of the Kaggle dataset Medical Appointment No Shows.
We can use a helper function to download the data.
from cardea.data import download
data_path = download('kaggle')
Alternatively, you can download the dataset directly from the S3 bucket.
curl -O https://dai-cardea.s3.amazonaws.com/kaggle.zip && unzip -d kaggle kaggle.zip
Then, we create a Cardea instance by supplying the data_path to the initializer and choosing the format of the data.
from cardea import Cardea
cardea = Cardea(data_path=data_path, fhir=True)
To verify that the data has been loaded, inspect cardea.entityset, which should output the following:
Entityset: kaggle
Entities:
Address [Rows: 81, Columns: 2]
Appointment_Participant [Rows: 6100, Columns: 2]
Appointment [Rows: 110527, Columns: 5]
CodeableConcept [Rows: 4, Columns: 2]
Coding [Rows: 3, Columns: 2]
Identifier [Rows: 227151, Columns: 1]
Observation [Rows: 110527, Columns: 3]
Patient [Rows: 6100, Columns: 4]
Reference [Rows: 6100, Columns: 1]
Relationships:
Appointment_Participant.actor -> Reference.identifier
Appointment.participant -> Appointment_Participant.object_id
CodeableConcept.coding -> Coding.object_id
Observation.code -> CodeableConcept.object_id
Observation.subject -> Reference.identifier
Patient.address -> Address.object_id
The output shows the entityset data structure: cardea.entityset is composed of entities and relationships. You can read more about entitysets in the documentation.
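To make the idea of entities and relationships concrete, here is a minimal sketch that uses plain Python dicts and lists. It is not Cardea's actual data structure; the rows are invented, and only the entity and column names are borrowed from the output above. Each relationship links a child column to a parent column, as in "Observation.subject -> Reference.identifier".

```python
# Illustrative only: plain dicts stand in for an entityset.
entities = {
    "Reference": [
        {"identifier": 1},
        {"identifier": 2},
    ],
    "Observation": [
        {"object_id": 10, "subject": 1},
        {"object_id": 11, "subject": 1},
        {"object_id": 12, "subject": 2},
    ],
}

# (child entity, child column, parent entity, parent column)
relationships = [
    ("Observation", "subject", "Reference", "identifier"),
]


def children_of(parent_entity, parent_key, relationships, entities):
    """Return, per child entity, the child rows that point at the given parent key."""
    matches = {}
    for child, child_col, parent, _parent_col in relationships:
        if parent != parent_entity:
            continue
        matches[child] = [row for row in entities[child] if row[child_col] == parent_key]
    return matches


observations = children_of("Reference", 1, relationships, entities)["Observation"]
print([row["object_id"] for row in observations])  # prints [10, 11]
```

Following relationships in this way is what allows features to be aggregated across entities, for example counting the observations attached to one reference.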
From there, you can select the prediction problem you aim to solve by specifying the name of the function, which in turn gives us the label_times of the problem.
from cardea.data_labeling import appointment_no_show
label_times = cardea.label(appointment_no_show, subset=1000) # labeling only a subset of the data
label_times summarizes, for each instance in the dataset, (1) its corresponding label and (2) the time index that bounds the timespan allowed for calculating features for that instance.
identifier time label
0 5030230 2015-11-10 07:13:56 True
1 5122866 2015-12-03 08:17:28 False
2 5134197 2015-12-07 10:40:59 False
3 5134220 2015-12-07 10:42:42 True
4 5134223 2015-12-07 10:43:01 True
You can read more about label_times in the documentation.
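The role of the time index can be illustrated with a small standard-library sketch: when features are computed for an instance, only data recorded up to that instance's time is used, so that information from after the prediction point does not leak in. The event log below is hypothetical, and treating the cutoff as inclusive is an assumption made for illustration; it is not a description of how Cardea implements this step.

```python
from datetime import datetime

# Hypothetical event log: (identifier, timestamp, value). Not taken from the dataset.
events = [
    (5030230, datetime(2015, 11, 1, 9, 0), 1.0),
    (5030230, datetime(2015, 11, 9, 12, 0), 2.0),
    (5030230, datetime(2015, 11, 12, 8, 0), 5.0),  # after the cutoff, must be ignored
]

# One label_times row, matching the first row of the table above.
label_row = {"identifier": 5030230, "time": datetime(2015, 11, 10, 7, 13, 56), "label": True}


def events_before_cutoff(events, identifier, cutoff):
    """Keep only events for this instance recorded at or before the cutoff time."""
    return [e for e in events if e[0] == identifier and e[1] <= cutoff]


usable = events_before_cutoff(events, label_row["identifier"], label_row["time"])
feature_count = len(usable)
feature_sum = sum(value for _, _, value in usable)
print(feature_count, feature_sum)  # prints 2 3.0
```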
Then, you can perform the AutoML steps and take advantage of Cardea.
Cardea extracts features through automated feature engineering when you supply the label_times for the problem you aim to solve:
feature_matrix = cardea.featurize(label_times)
⚠️ Featurizing the data might take a while depending on the size of the data.
Once we have the features, we can split the data into training and testing sets:
y = feature_matrix.pop('label').values
X = feature_matrix.values
X_train, X_test, y_train, y_test = cardea.train_test_split(
    X, y, test_size=0.2, shuffle=True)
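For readers unfamiliar with the test_size and shuffle arguments, the following standard-library sketch shows what such a split typically does: it optionally shuffles the row indices, then holds out the requested fraction for testing. The function shuffled_split and its seed parameter are hypothetical names for illustration, not Cardea's implementation.

```python
import random


def shuffled_split(X, y, test_size=0.2, shuffle=True, seed=0):
    """Split paired sequences into train and test parts (illustrative only)."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    indices = list(range(len(X)))
    if shuffle:
        # A seeded generator keeps the example reproducible.
        random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: ten rows whose label is True when the first feature is even.
X_demo = [[i, i * 2] for i in range(10)]
y_demo = [i % 2 == 0 for i in range(10)]
X_tr, X_te, y_tr, y_te = shuffled_split(X_demo, y_demo)
print(len(X_tr), len(X_te))  # prints 8 2
```

Shuffling matters when rows are ordered, for example by time or by label, because an unshuffled split could otherwise put a single kind of row in the test set.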
Now that the feature matrix is properly divided, we can use it to train our machine learning pipeline with the Modeling component, which optimizes hyperparameters and finds the best model:
cardea.set_pipeline('Random Forest')
cardea.fit(X_train, y_train)
y_pred = cardea.predict(X_test)
Finally, you can evaluate the performance of the model
cardea.evaluate(X_test, y_test, shuffle=True)
which returns the scoring metrics, depending on the type of problem:
Accuracy 0.75
F1 Macro 0.5098
Precision 0.5183
Recall 0.5123
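Accuracy can sit well above the macro-averaged scores when the classes are imbalanced, because macro averaging weights every class equally regardless of its size. The sketch below computes these four metrics from scratch on made-up labels; it illustrates the standard definitions and is not Cardea's evaluation code.

```python
def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "Accuracy": accuracy,
        "F1 Macro": sum(f1s) / n,
        "Precision": sum(precisions) / n,
        "Recall": sum(recalls) / n,
    }


# Made-up, imbalanced labels: 8 negatives, 2 positives, and a model that always predicts False.
y_true_demo = [False] * 8 + [True] * 2
y_pred_demo = [False] * 10
scores = macro_scores(y_true_demo, y_pred_demo)
for name, value in scores.items():
    print(name, round(value, 4))
```

In this toy case the accuracy is 0.8 while the macro F1 is roughly 0.44, since the positive class is never predicted correctly.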
If you use Cardea for your research, please consider citing the following paper:
Sarah Alnegheimish; Najat Alrashed; Faisal Aleissa; Shahad Althobaiti; Dongyu Liu; Mansour Alsaleh; Kalyan Veeramachaneni. Cardea: An Open Automated Machine Learning Framework for Electronic Health Records. IEEE DSAA 2020.
@inproceedings{alnegheimish2020cardea,
title={Cardea: An Open Automated Machine Learning Framework for Electronic Health Records},
author={Alnegheimish, Sarah and Alrashed, Najat and Aleissa, Faisal and Althobaiti, Shahad and Liu, Dongyu and Alsaleh, Mansour and Veeramachaneni, Kalyan},
booktitle={2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA)},
pages={536--545},
year={2020},
organization={IEEE}
}