Automatic Assignment of ICD codes

Introduction

This repo contains code for assigning ICD codes to medical/clinical text. The data used here is the MIMIC-III dataset. Several models have been tried, ranging from linear machine learning models to the pretrained state-of-the-art NLP model BERT.

Structure of the project

At the root of the project, you will have:

  • main.py: used for training and testing different models
  • requirements.txt: contains the minimum dependencies for running the project
  • w2vmodel.model: gensim word2vec model trained on MIMIC-III discharge summaries (see the loading sketch after this list)
  • src: a folder that contains:
    • bert: contains utilities and files for the pretrained BERT model
    • cnn: contains utilities and files for the CNN model
    • hybrid: contains utilities and files for the hybrid (LSTM+CNN) model
    • rnn: contains utilities and files for the LSTM and GRU models
    • ovr: contains utilities and files for classical machine learning models (e.g. Logistic Regression, SVM, Naive Bayes)
    • fit.py: training code shared by the LSTM and CNN models
    • test_results.py: inference code for trained models, used by both the LSTM and CNN models
    • utils.py: general utility code used by all the models
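
w2vmodel.model at the repository root is described above as a gensim word2vec model. Assuming it was saved with gensim's standard Word2Vec.save, a minimal loading sketch looks like this (the query token "fever" is only illustrative):

from gensim.models import Word2Vec

# Load the word2vec model shipped at the repository root
w2v = Word2Vec.load("w2vmodel.model")

# Look up the embedding for a token and its nearest neighbours
vector = w2v.wv["fever"]
print(w2v.wv.most_similar("fever", topn=5))

The non-BERT models presumably consume these embeddings; check the code under src for how they are actually wired in.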

Dependencies

The dependencies are listed in the requirements.txt file. They can be installed with:

pip install -r requirements.txt
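
Optionally, install them inside a virtual environment; the environment name venv below is arbitrary:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt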

How to use the code

Launch main.py with the following arguments (a sketch of how they might be parsed appears after the list):

  • train_path: path to the training data
  • test_path: path to the test data
  • model_name: one of the six implemented models ['bert', 'hybrid', 'lstm', 'gru', 'cnn', 'ovr']; defaults to 'bert'
  • icd_type: type of ICD label to train on, one of ['icd9cat', 'icd9code', 'icd10cat', 'icd10code']; defaults to 'icd9cat'
  • epochs: number of training epochs
  • batch_size: batch size; defaults to 16 (for the bert model)
  • val_split: validation split of the training data; defaults to 2/7 (train:val:test = 5:2:3)
  • learning_rate: learning rate; defaults to 2e-5 (for the bert model)
  • w2vmodel: path to the pretrained gensim word2vec model
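
The actual argument parsing in main.py may differ, but going by the flags listed above, a minimal argparse sketch of this interface could look like the following (the required/optional status of each flag and the missing epochs default are assumptions):

import argparse

# Hypothetical reconstruction of main.py's CLI from the documented flags;
# the real script may name or type these differently.
parser = argparse.ArgumentParser(description="Train/test ICD coding models")
parser.add_argument("--train_path", required=True, help="path to the training data")
parser.add_argument("--test_path", required=True, help="path to the test data")
parser.add_argument("--model_name", default="bert",
                    choices=["bert", "hybrid", "lstm", "gru", "cnn", "ovr"])
parser.add_argument("--icd_type", default="icd9cat",
                    choices=["icd9cat", "icd9code", "icd10cat", "icd10code"])
parser.add_argument("--epochs", type=int, help="number of training epochs")
parser.add_argument("--batch_size", type=int, default=16)
parser.add_argument("--val_split", type=float, default=2 / 7)
parser.add_argument("--learning_rate", type=float, default=2e-5)
parser.add_argument("--w2vmodel", help="path to the pretrained gensim word2vec model")
args = parser.parse_args()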

Example

python main.py --train_path train.csv --test_path test.csv --model_name cnn
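
The other documented flags combine in the same way; for example, training the LSTM model on full ICD-9 codes with the bundled word2vec embeddings (the epoch and batch-size values here are illustrative):

python main.py --train_path train.csv --test_path test.csv --model_name lstm --icd_type icd9code --epochs 10 --batch_size 32 --w2vmodel w2vmodel.model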

Data

The data used for training can be downloaded from:
