
Merge pull request #9 from yutanakamura-tky/KART-10_documentation
KART-10: documentation
yutanakamura-tky authored Jul 11, 2022
2 parents 3c4fa63 + 9dc9b1d commit f0752d9
Showing 3 changed files with 798 additions and 20 deletions.
87 changes: 67 additions & 20 deletions README.md
@@ -1,71 +1,118 @@
# KART: Parameterization of Privacy Leakage Scenarios from Pre-trained Language Models
This is an implementation of our paper "[KART: Parameterization of Privacy Leakage Scenarios from Pre-trained Language Models](https://arxiv.org/abs/2101.00036)."

## Usage
### 0. Requirements

- Python 3.6.4
- Make sure that `$HOME` is included in the environment variable `$PYTHONPATH`.
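
For example, `$HOME` can be prepended to `$PYTHONPATH` in a shell profile such as `~/.bashrc` (a minimal sketch; adjust the file to your shell):

```sh
# Make packages under $HOME (such as ~/kart) importable by Python
export PYTHONPATH="$HOME${PYTHONPATH:+:$PYTHONPATH}"
```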

### 1. How to make MIMIC-III-dummy-PHI

<p align="center">
<img src="img/mimic_iii_dummy_phi.png">
</p>

We simulate privacy leakage from clinical records using **MIMIC-III-dummy-PHI**.

MIMIC-III-dummy-PHI is made by embedding pieces of dummy protected health information (PHI) in the [MIMIC-III](https://www.nature.com/articles/sdata201635) corpus.
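
The idea can be sketched as follows. This is a hypothetical simplification, not the repository's actual code: the placeholder pattern, the surrogate lists, and `embed_dummy_phi` are illustrative assumptions.

```python
import random
import re

# Hypothetical simplification: MIMIC-III notes carry de-identification
# placeholders, and dummy (known-fake) PHI surrogates are substituted for
# them so that leakage of the dummy PHI can be measured later.
DUMMY_SURROGATES = {
    "NAME": ["John Smith", "Mary Johnson"],
    "HOSPITAL": ["Greenfield Medical Center"],
}


def embed_dummy_phi(note: str, rng: random.Random) -> str:
    """Replace [**TYPE**]-style placeholders with dummy PHI surrogates."""

    def substitute(match):
        candidates = DUMMY_SURROGATES.get(match.group(1))
        return rng.choice(candidates) if candidates else match.group(0)

    return re.sub(r"\[\*\*(\w+)\*\*\]", substitute, note)


print(embed_dummy_phi("Pt [**NAME**] was admitted to [**HOSPITAL**].",
                      random.Random(0)))
```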

#### 1-1. Install dependencies

To install using the `venv` module, use the following commands:

```sh
# Clone repository
cd ~
git clone [email protected]:yutanakamura-tky/kart.git
cd ~/kart

# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

To install using Poetry, use the following commands:

```sh
# Install Poetry
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py > ~/get-poetry.py
cd ~
python get-poetry.py --version 1.1.4
poetry config virtualenvs.in-project true

# Clone Repository
cd ~
git clone [email protected]:yutanakamura-tky/kart.git
cd ~/kart

# Activate virtual environment & install dependencies
poetry shell
poetry install
```

#### 1-2. Get necessary files

This repository requires two datasets to create MIMIC-III-dummy-PHI:

- MIMIC-III version 1.4 noteevents (`NOTEEVENTS.csv.gz`) ([here](https://physionet.org/content/mimiciii/1.4/))
- n2c2 2006 De-identification challenge training dataset "Data Set 1B: De-identification Training Set" (`deid_surrogate_train_all_version2.zip`) ([here](https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp))

Note that registration is necessary to download these datasets.

After downloading the datasets, extract them into `~/kart/corpus`:

```
mv /path/to/NOTEEVENTS.csv.gz ~/kart/corpus
cd ~/kart/corpus
gunzip NOTEEVENTS.csv.gz
mv /path/to/deid_surrogate_train_all_version2.zip ~/kart/corpus
unzip deid_surrogate_train_all_version2.zip
```

#### 1-3. Make MIMIC-III-dummy-PHI

Run `make_mimic_iii_dummy_phi.sh`. Make sure that you are in the virtual environment:

```
cd ~/kart/src
bash make_mimic_iii_dummy_phi.sh
```

### 2. How to pre-train BERT model
#### 2-1. Convert MIMIC-III to BERT pre-training data (tfrecords format)
```
cd ~/kart/src
bash make_pretraining_data.sh
```
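
The conversion presumably wraps Google's `create_pretraining_data.py` from the BERT repository. The sketch below shows the kind of invocation involved; the file names and flag values are illustrative assumptions, not the settings actually used by `make_pretraining_data.sh`:

```sh
# Illustrative only: typical flags of BERT's create_pretraining_data.py.
# The actual paths and hyperparameters in this repository may differ.
python create_pretraining_data.py \
  --input_file=mimic_iii_dummy_phi.txt \
  --output_file=pretraining_data.tfrecord \
  --vocab_file=vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --dupe_factor=5
```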

#### 2-2. Pre-train BERT model
To pre-train a BERT model from scratch, use this command:
```
cd ~/kart/src
bash pretrain_bert_from_scratch.sh
```

To pre-train a BERT model from the BERT-base-uncased model, use this command:
```
cd ~/kart/src
# Download BERT-base-uncased model by Google Research
bash get_google_bert_model.sh
bash pretrain_bert_from_bert_base_uncased.sh
```

## Citation
Please cite our arXiv preprint:

```
@misc{kart,
Author = {Yuta Nakamura and Shouhei Hanaoka and Yukihiro Nomura and Naoto Hayashi and Osamu Abe and Shuntaro Yada and Shoko Wakamiya and Eiji Aramaki},
Title = {KART: Parameterization of Privacy Leakage Scenarios from Pre-trained Language Models},
Year = {2020},
Eprint = {arXiv:2101.00036},
}
Binary file added img/mimic_iii_dummy_phi.png
