instructions to load dataset to Amazon Athena #1268

Status: Open. Wants to merge 3 commits into base: main.
68 changes: 68 additions & 0 deletions mimic-iv/buildmimic/athena/README.md
@@ -0,0 +1,68 @@
# Load MIMIC-IV into Athena

## Access Data
Obtain a username and password for accessing MIMIC-IV from https://mimic.physionet.org/.

## Load CSV files to S3

Run the following commands from your desktop, Cloud9 or an EC2 instance.

The commands below assume that the machine running them has:
* The AWS CLI installed.
* An IAM role with the `s3:CreateBucket` and `s3:PutObject` permissions.
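
The prerequisites above can be checked before starting. The following is a minimal sketch, not part of the original instructions; `require_cmd` is a hypothetical helper name, and the check only confirms that the tools are on the `PATH`, not that the IAM permissions are in place.

```bash
#!/usr/bin/env bash
# require_cmd is a hypothetical helper: it succeeds if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# The instructions below use both the AWS CLI and wget.
for cmd in aws wget; do
  if require_cmd "$cmd"; then
    echo "found: $cmd"
  else
    echo "missing: $cmd" >&2
  fi
done
```

If the AWS CLI is present, `aws sts get-caller-identity` reports which IAM identity the later commands will run as.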

In the lines below, replace:
* `<region>` with the AWS region where your resources will be located (e.g. us-west-2).
* `<username>` with your PhysioNet username. Because of `--ask-password`, `wget` prompts for your password instead of taking it on the command line.
* `<bucket>` with the name of the S3 bucket that will contain the data.

```bash
wget -r -N -c -np --user <username> --ask-password https://physionet.org/files/mimiciv/1.0/

export MIMICIV_BUCKET=<bucket>
aws configure set default.region <region>

aws s3 mb s3://$MIMICIV_BUCKET

cd physionet.org/files/mimiciv/1.0/

aws s3 cp core/admissions.csv.gz s3://$MIMICIV_BUCKET/csv/core/admissions/
aws s3 cp core/patients.csv.gz s3://$MIMICIV_BUCKET/csv/core/patients/
aws s3 cp core/transfers.csv.gz s3://$MIMICIV_BUCKET/csv/core/transfers/

aws s3 cp icu/chartevents.csv.gz s3://$MIMICIV_BUCKET/csv/icu/chartevents/
aws s3 cp icu/d_items.csv.gz s3://$MIMICIV_BUCKET/csv/icu/d_items/
aws s3 cp icu/datetimeevents.csv.gz s3://$MIMICIV_BUCKET/csv/icu/datetimeevents/
aws s3 cp icu/icustays.csv.gz s3://$MIMICIV_BUCKET/csv/icu/icustays/
aws s3 cp icu/inputevents.csv.gz s3://$MIMICIV_BUCKET/csv/icu/inputevents/
aws s3 cp icu/outputevents.csv.gz s3://$MIMICIV_BUCKET/csv/icu/outputevents/
aws s3 cp icu/procedureevents.csv.gz s3://$MIMICIV_BUCKET/csv/icu/procedureevents/

aws s3 cp hosp/d_hcpcs.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/d_hcpcs/
aws s3 cp hosp/d_icd_diagnoses.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/d_icd_diagnoses/
aws s3 cp hosp/d_icd_procedures.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/d_icd_procedures/
aws s3 cp hosp/d_labitems.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/d_labitems/
aws s3 cp hosp/diagnoses_icd.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/diagnoses_icd/
aws s3 cp hosp/drgcodes.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/drgcodes/
aws s3 cp hosp/emar.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/emar/
aws s3 cp hosp/emar_detail.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/emar_detail/
aws s3 cp hosp/hcpcsevents.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/hcpcsevents/
aws s3 cp hosp/labevents.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/labevents/
aws s3 cp hosp/microbiologyevents.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/microbiologyevents/
aws s3 cp hosp/pharmacy.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/pharmacy/
aws s3 cp hosp/poe.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/poe/
aws s3 cp hosp/poe_detail.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/poe_detail/
aws s3 cp hosp/prescriptions.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/prescriptions/
aws s3 cp hosp/procedures_icd.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/procedures_icd/
aws s3 cp hosp/services.csv.gz s3://$MIMICIV_BUCKET/csv/hosp/services/
```
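
Each upload above follows the same pattern: `<module>/<table>.csv.gz` goes to `csv/<module>/<table>/` in the bucket. As a possible alternative to listing every file, the destination could be derived from the local path. This is only a sketch: `s3_dest` is a hypothetical helper, and the glob picks up every `.csv.gz` file in the three folders, which may include files beyond those listed above.

```bash
#!/usr/bin/env bash
# s3_dest is a hypothetical helper, not part of the AWS CLI.
# It maps <module>/<table>.csv.gz to s3://<bucket>/csv/<module>/<table>/.
s3_dest() {
  local bucket="$1" path="$2"
  local module table
  module="$(dirname "$path")"           # e.g. core
  table="$(basename "$path" .csv.gz)"   # e.g. admissions
  printf 's3://%s/csv/%s/%s/\n' "$bucket" "$module" "$table"
}

# Run from physionet.org/files/mimiciv/1.0/ with MIMICIV_BUCKET exported.
# This only prints the commands; remove "echo" to perform the uploads.
for f in core/*.csv.gz icu/*.csv.gz hosp/*.csv.gz; do
  [ -e "$f" ] || continue   # skip a glob that matched nothing
  echo aws s3 cp "$f" "$(s3_dest "$MIMICIV_BUCKET" "$f")"
done
```

Printing the commands first makes it easy to compare the generated destinations against the explicit list before anything is copied.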

## Map schema

Use Athena to run the commands in `schema.sql`, replacing `MIMICIV_BUCKET` with your bucket name.


## Create parquet data

Use Athena to run the commands in `parquet.sql`, replacing `MIMICIV_BUCKET` with your bucket name.
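
The `MIMICIV_BUCKET` placeholder in the SQL files can be replaced by hand or with `sed`. The following is a minimal sketch; `substitute_bucket` and the `*.local.sql` output names are hypothetical and not part of this repository.

```bash
#!/usr/bin/env bash
# substitute_bucket is a hypothetical helper: it reads SQL on stdin and writes it
# to stdout with every MIMICIV_BUCKET placeholder replaced by the given bucket name.
# S3 bucket names cannot contain "/", so "/" is safe as the sed delimiter.
substitute_bucket() {
  sed "s/MIMICIV_BUCKET/$1/g"
}

# Example usage (assumes MIMICIV_BUCKET is exported as in the upload step):
#   substitute_bucket "$MIMICIV_BUCKET" < schema.sql  > schema.local.sql
#   substitute_bucket "$MIMICIV_BUCKET" < parquet.sql > parquet.local.sql
```

The resulting files can then be pasted into the Athena query editor.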

31 changes: 31 additions & 0 deletions mimic-iv/buildmimic/athena/parquet.sql
@@ -0,0 +1,31 @@
CREATE DATABASE IF NOT EXISTS `mimiciv_parquet`;

CREATE TABLE "mimiciv_parquet".admissions WITH (external_location = 's3://MIMICIV_BUCKET/parquet/admissions/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".admissions;
CREATE TABLE "mimiciv_parquet".patients WITH (external_location = 's3://MIMICIV_BUCKET/parquet/patients/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".patients;
CREATE TABLE "mimiciv_parquet".transfers WITH (external_location = 's3://MIMICIV_BUCKET/parquet/transfers/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".transfers;

CREATE TABLE "mimiciv_parquet".chartevents WITH (external_location = 's3://MIMICIV_BUCKET/parquet/chartevents/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".chartevents;
CREATE TABLE "mimiciv_parquet".d_items WITH (external_location = 's3://MIMICIV_BUCKET/parquet/d_items/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".d_items;
CREATE TABLE "mimiciv_parquet".datetimeevents WITH (external_location = 's3://MIMICIV_BUCKET/parquet/datetimeevents/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".datetimeevents;
CREATE TABLE "mimiciv_parquet".icustays WITH (external_location = 's3://MIMICIV_BUCKET/parquet/icustays/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".icustays;
CREATE TABLE "mimiciv_parquet".inputevents WITH (external_location = 's3://MIMICIV_BUCKET/parquet/inputevents/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".inputevents;
CREATE TABLE "mimiciv_parquet".outputevents WITH (external_location = 's3://MIMICIV_BUCKET/parquet/outputevents/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".outputevents;
CREATE TABLE "mimiciv_parquet".procedureevents WITH (external_location = 's3://MIMICIV_BUCKET/parquet/procedureevents/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".procedureevents;

CREATE TABLE "mimiciv_parquet".d_hcpcs WITH (external_location = 's3://MIMICIV_BUCKET/parquet/d_hcpcs/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".d_hcpcs;
CREATE TABLE "mimiciv_parquet".d_icd_diagnoses WITH (external_location = 's3://MIMICIV_BUCKET/parquet/d_icd_diagnoses/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".d_icd_diagnoses;
CREATE TABLE "mimiciv_parquet".d_icd_procedures WITH (external_location = 's3://MIMICIV_BUCKET/parquet/d_icd_procedures/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".d_icd_procedures;
CREATE TABLE "mimiciv_parquet".d_labitems WITH (external_location = 's3://MIMICIV_BUCKET/parquet/d_labitems/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".d_labitems;
CREATE TABLE "mimiciv_parquet".diagnoses_icd WITH (external_location = 's3://MIMICIV_BUCKET/parquet/diagnoses_icd/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".diagnoses_icd;
CREATE TABLE "mimiciv_parquet".drgcodes WITH (external_location = 's3://MIMICIV_BUCKET/parquet/drgcodes/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".drgcodes;
CREATE TABLE "mimiciv_parquet".emar WITH (external_location = 's3://MIMICIV_BUCKET/parquet/emar/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".emar;
CREATE TABLE "mimiciv_parquet".emar_detail WITH (external_location = 's3://MIMICIV_BUCKET/parquet/emar_detail/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".emar_detail;
CREATE TABLE "mimiciv_parquet".hcpcsevents WITH (external_location = 's3://MIMICIV_BUCKET/parquet/hcpcsevents/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".hcpcsevents;
CREATE TABLE "mimiciv_parquet".labevents WITH (external_location = 's3://MIMICIV_BUCKET/parquet/labevents/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".labevents;
CREATE TABLE "mimiciv_parquet".microbiologyevents WITH (external_location = 's3://MIMICIV_BUCKET/parquet/microbiologyevents/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".microbiologyevents;
CREATE TABLE "mimiciv_parquet".pharmacy WITH (external_location = 's3://MIMICIV_BUCKET/parquet/pharmacy/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".pharmacy;
CREATE TABLE "mimiciv_parquet".poe WITH (external_location = 's3://MIMICIV_BUCKET/parquet/poe/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".poe;
CREATE TABLE "mimiciv_parquet".poe_detail WITH (external_location = 's3://MIMICIV_BUCKET/parquet/poe_detail/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".poe_detail;
CREATE TABLE "mimiciv_parquet".prescriptions WITH (external_location = 's3://MIMICIV_BUCKET/parquet/prescriptions/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".prescriptions;
CREATE TABLE "mimiciv_parquet".procedures_icd WITH (external_location = 's3://MIMICIV_BUCKET/parquet/procedures_icd/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".procedures_icd;
CREATE TABLE "mimiciv_parquet".services WITH (external_location = 's3://MIMICIV_BUCKET/parquet/services/',format = 'Parquet') AS SELECT * FROM "mimiciv_csv".services;