Skip to content

Clinical textual similarity analysis. Also implemented in Spark and HPC

Notifications You must be signed in to change notification settings

yunxueh/NLP-EHR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NLP-EHR

Analysing clinical semantic textual similarity on big data.

Clinical Semantic Textual Similarity (ClinicalSTS) measures the degree of similarity between two sentences. The key challenges of ClinicalSTS are how to extract medical concepts and identify templates. This project has leveraged three deep learning models, they are fully connected neural network, Attention-based CNN (ABCNN) and LSTM. The comparison between the three models has illustrated that even their accuracy is quite identical, each of them has its own strength and weakness. The comparison between embedding methods has shown the influence of embedding methods. After partially removed templates, the performance of ABCNN is improved, it proves the template is a distraction for deep learning models.

Current three models are transplanted into a distributed system to cope with the big data challenge. This project has brought up two types of mechanism, one is data parallelism, and another is model parallelism. Data parallelism is deployed on Spark, model parallelism is on HPC (Spartan).

===The dataset is confidential, it is not allowed to share with public===

本项目为分析医疗系统中语义文本相似性,采用了全连接深度学习网络,ABCNN和LSTM。三种模型精度大致分布于80%~85%。由于样本数量级偏小,模型的选择不够明朗,最终采用集成学习(ensemble)的方式,将三种网络的输出整合在一起。

ABCNN.py -> 文件预处理,tokenizer,三种词嵌入,搭建ABCNN regression模型

BiLSTM.py -> 按照Siamese孪生卷积网络搭建双向LSTM模型

FuzzyMatch.py -> 用模糊算法大致处理原文本中的术语模版。

HPC文件中涵盖了在Spartan上放置模型的脚本,用.slurm文件安排不同模型布置在不同节点上,以达到效率最大化。

数据集由BioCreative/OHNLP所提供,不开放给公众。

About

Clinical textual similarity analysis. Also implemented in Spark and HPC

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published