CN109065174B - Medical record theme acquisition method and device considering similarity constraint - Google Patents
- Publication number
- CN109065174B (application number CN201810843072.0A)
- Authority
- CN
- China
- Prior art keywords
- medical record
- similarity
- topic
- distribution
- documents
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention provides a medical record topic acquisition method and device that take similarity constraints into account. The method comprises: calculating the similarity between any two medical record documents in the initial medical records to obtain a similarity-constrained medical record set formed by the medical record documents whose similarity is greater than or equal to a similarity threshold; and sequentially inputting the medical record documents in the similarity-constrained medical record set into a preset LDA model, and deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model. In this way, the thinking process by which a doctor writes medical record text during diagnosis and treatment can be simulated well, which helps improve the accuracy of the acquired topics.
Description
Technical Field
The invention relates to the technical field of data mining, and in particular to a medical record topic acquisition method and device considering similarity constraints.
Background
At present, topic models are mostly applied to the evolution analysis of online public opinion topics in the field of online social media. This helps monitor changes in online public opinion effectively according to the distribution of topics in different time periods, and even actively guide the direction in which public opinion develops. Topic models have also seen limited use in clinical diagnosis and treatment, where the aim is to analyze the diagnosis and treatment patterns between diseases and medication and between diseases and symptoms in medical record documents. The analysis process is as follows: each medical record document is input into the model as an independent sample, and the final topic analysis result is obtained through extensive training.
However, in the process of implementing the scheme of the invention, the inventor found the following. On the one hand, because disease progression is similar between two patients with the same disease, the diagnosis and treatment plan a physician makes is influenced by previous plans for similar patients. On the other hand, two patients differ individually, for example in constitution, gender, age and disease stage, so doctors give different diagnosis and treatment plans to different patients. In actual diagnosis and treatment, there may be two patients with similar physical conditions and diseases, and their diagnosis and treatment plans will then share similar parts. For example, a diabetic patient may have several diabetic complications at the same time, but the diagnosis and treatment plans and disease progression for the same complication should be similar.
Disclosure of Invention
In view of the defects in the prior art, the invention provides a medical record topic acquisition method and device considering similarity constraints, which are used to solve the technical problems in the related art.
In a first aspect, an embodiment of the present invention provides a medical record topic acquisition method considering similarity constraints, where the method includes:
calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
and sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deducing document-subject distribution and subject-word distribution of the medical record documents through the preset LDA model.
Optionally, calculating the similarity between any two medical record documents in the initial medical record comprises:
acquiring a plurality of similarity calculation factors of the medical record and weight values of the similarity calculation factors;
respectively calculating the numerical values of any two medical record documents about each similarity calculation factor;
and calculating the similarity of any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
Optionally, the similarity calculation factor comprises: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.
Optionally, deriving the document-topic distribution and the topic-word distribution of each medical record document through the preset LDA model includes:
randomly assigning a theme number z to each word in each medical record document in the similarity constraint medical record set;
rescanning the similarity-constrained medical record set and resampling the topic of each word until the newly assigned topics satisfy Gibbs sampling convergence;
and counting the topic-word co-occurrence frequency matrix in the corpus to obtain document-topic distribution and topic-word distribution.
Optionally, the preset LDA model includes:
the similarity constraint between any two medical record documents is expressed by the topic distribution distance $dis(\theta r_m, \theta r_n)$, given by the following formula:
where $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$ indicates that each medical record document includes $L_m$ course records; $\theta_{m,L_m}$ denotes the topic distribution of the $L_m$-th course record; and $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two course records;
the preset LDA model further comprises a Gibbs-EM iterative function which is as follows:
In a second aspect, an embodiment of the present invention provides an apparatus for acquiring a medical record topic in consideration of similarity constraints, where the apparatus includes:
the medical record set acquisition module is used for calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
and the theme distribution derivation module is used for sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model and deriving the document-theme distribution and the theme-word distribution of the medical record documents through the preset LDA model.
Optionally, the medical record collection acquiring module includes:
a weight value acquisition unit, configured to acquire a plurality of similarity calculation factors of the medical records and the weight values of the similarity calculation factors;
the factor data calculation unit is used for calculating the numerical values of any two medical record documents about each similarity calculation factor;
and the similarity calculation unit is used for calculating the similarity of any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
Optionally, the similarity calculation factor comprises: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.
Optionally, the topic distribution derivation module includes:
a topic numbering unit, configured to randomly assign a topic number z to each word in each medical record document in the similarity constraint medical record set;
a topic iteration unit, configured to rescan the similarity-constrained medical record set and resample the topic of each word until the newly assigned topics satisfy Gibbs sampling convergence;
and the theme distribution calculating unit is used for counting the theme-word co-occurrence frequency matrix in the corpus to obtain document-theme distribution and theme-word distribution.
Optionally, the preset LDA model includes:
the similarity constraint between any two medical record documents is expressed by the topic distribution distance $dis(\theta r_m, \theta r_n)$, given by the following formula:
where $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$ indicates that each medical record document includes $L_m$ course records; $\theta_{m,L_m}$ denotes the topic distribution of the $L_m$-th course record; and $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two course records;
the preset LDA model further comprises a Gibbs-EM iterative function which is as follows:
According to the above technical scheme, by calculating the similarity of every two medical record documents, the medical record documents whose similarity is greater than or equal to the similarity threshold can be screened from the initial medical records, and the similarity-constrained medical record set they form is used as the document set for subsequent topic analysis. In this way, the thinking process by which a doctor writes medical record text during diagnosis and treatment can be simulated well, which helps improve the accuracy of the acquired topics.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flowchart of a medical record topic acquisition method considering similarity constraints according to an embodiment of the present invention;
FIG. 2 shows a course record in a medical record document;
FIG. 3 is a graph of the number of diabetic complications in male patients;
FIG. 4 is a graph of the number of diabetic complications in female patients;
FIG. 5 is a diagram of the relationship between the number of topics and the similarity constraint index SIM with similarity thresholds of 0.5 and 0.6, respectively;
FIG. 6 is a diagram of the relationship between the number of topics and the similarity constraint index SIM with similarity thresholds of 0.7 and 0.8, respectively;
FIG. 7 is a diagram of the relationship between the number of topics and the mutual information;
FIGS. 8 to 10 are block diagrams of medical record topic acquisition apparatuses considering similarity constraints according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a medical record topic acquisition method considering similarity constraints according to an embodiment of the present invention. Referring to fig. 1, a medical record topic acquisition method considering similarity constraints includes:
101, calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
and 102, sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deducing document-theme distribution and theme-word distribution of the medical record documents through the preset LDA model.
The steps of the medical record topic acquisition method considering similarity constraints are described in detail below with reference to the accompanying drawings and embodiments.
First, step 101 is described: calculating the similarity between any two medical record documents in the initial medical records to obtain a similarity-constrained medical record set formed by the medical record documents whose similarity is greater than or equal to a similarity threshold.
During the hospitalization, the patient will generate various detection records, such as admission record, discharge record, course record, consultation record, etc. If the similarity between the detection records is directly calculated, the calculation amount is greatly increased. For convenience of explanation, in this embodiment, the detection record before processing is referred to as an initial medical record.
To reduce the amount of calculation, only the similarity of the hospitalized diagnosis portion in the initial medical record is considered in this embodiment. In one embodiment, the similarity is calculated as the distance between any two initial medical records, and the medical record similarity constraint construction can be understood as collecting a medical record set with the distance between every two medical records being smaller than a certain threshold.
In practical applications, the initial medical record may also include various complications of a certain disease, for example, diabetes may cause various complications, as shown in table 1.
Table 1 diabetic complications example
As can be seen from Table 1, patients of different ages differ in the characteristics of diabetes and its complications. In addition, patients of different ages tolerate drugs differently, so the clinical diagnosis and treatment process differs in aspects such as characterization and medication. Therefore, the basic information of the patient needs to be considered when calculating the similarity of medical record documents, and in this embodiment the patient's gender and age are included among the similarity calculation factors.
In one embodiment, the distance of the gender attribute is set to 1 for two patients of the same gender and to 0 for two patients of different genders, as shown in the following equation:
$$d(sex_i, sex_j) = \begin{cases} 1, & sex_i = sex_j \\ 0, & sex_i \neq sex_j \end{cases}$$
where $sex_i$ and $sex_j$ denote the genders of two different individuals.
In one embodiment, ages are divided into 4 age groups according to the international population age structure: juvenile, 0-17 years old, denoted 1; young, 18-45 years old, denoted 2; middle-aged, 46-59 years old, denoted 3; and older, over 59 years old, denoted 4. The distance between the age groups of two patients can then be calculated, as represented by the following formula:
where $age_i$ and $age_j$ denote the ages of two different persons, and $flag_i$ and $flag_j$ denote the segments to which those ages belong. The closer the two segments, the smaller the distance; the farther apart the two segments, the larger the distance.
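The segment assignment described above can be sketched as follows. The age-distance formula itself is not reproduced in this text, so the sketch only maps ages to segment flags and reports how many segments apart two ages are; the function names and the 46-59 boundary for the middle-aged group (inferred from the neighbouring groups) are assumptions made for illustration.

```python
def age_segment(age: int) -> int:
    """Map an age in years to the segment flag 1..4 described in the text."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 17:
        return 1  # juvenile, 0-17
    if age <= 45:
        return 2  # young, 18-45
    if age <= 59:
        return 3  # middle-aged, 46-59 (assumed boundary)
    return 4      # older than 59


def segment_gap(age_i: int, age_j: int) -> int:
    """Number of segments separating two ages (0 means the same segment).

    This gap is an illustrative quantity, not the patent's distance formula.
    """
    return abs(age_segment(age_i) - age_segment(age_j))
```

A gap of 0 means both patients fall in the same age group; the largest possible gap is 3 (juvenile versus older).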
Considering that the diagnoses in the initial medical records are discrete textual descriptions, this embodiment calculates the distance between the diagnosis results of different initial medical records using the Jaccard distance, as shown in the following formula:
$$d(dia_i, dia_j) = \frac{|dia_i \cap dia_j|}{|dia_i \cup dia_j|}$$
where $dia_i$ and $dia_j$ denote the discharge diagnosis Boolean vector spaces of medical record i and medical record j; this document mainly considers diabetic complications.
For example: $dia_i = \{1, 2, 3\}$, $dia_j = \{2, 3, 4\}$, $dia_i \cap dia_j = \{2, 3\}$, and $dia_i \cup dia_j = \{1, 2, 3, 4\}$, so $d(dia_i, dia_j) = 2/4 = 0.5$.
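The worked example can be reproduced with a few lines of Python. This is a minimal sketch that treats each diagnosis result as a set of complication codes; the function name and the handling of two empty sets are assumptions, not part of the source.

```python
def diagnosis_overlap(dia_i: set, dia_j: set) -> float:
    """Ratio of shared diagnoses to all diagnoses, as in the worked example."""
    union = dia_i | dia_j
    if not union:
        return 0.0  # assumption: two empty diagnosis sets score 0
    return len(dia_i & dia_j) / len(union)


# The example from the text: {1, 2, 3} and {2, 3, 4} share 2 of 4 codes.
print(diagnosis_overlap({1, 2, 3}, {2, 3, 4}))  # 0.5
```

Note that the value grows with overlap, so identical diagnosis sets score 1 and disjoint sets score 0.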
It should be noted that this embodiment considers only the following similarity calculation factors: the distance of the gender attribute, the distance of the segment to which the age belongs, and the distance of the diagnosis result. When the application scenario of the text topic acquisition method changes, the specific composition of the similarity calculation factors can be adjusted accordingly, and the adjusted scheme also falls within the protection scope of the application.
After the similarity calculation factors are determined, weight adjustment parameters $\mu_1$, $\mu_2$, $\mu_3$ are set, and the similarity between any two initial medical records is calculated as shown in the following formulas:
$$sim(T_i, T_j) = \mu_1 \, d(sex_i, sex_j) + \mu_2 \, d(age_i, age_j) + \mu_3 \, d(dia_i, dia_j) \tag{3}$$
$$\mu_1 + \mu_2 + \mu_3 = 1 \tag{4}$$
$$0 \le \mu_1, \mu_2, \mu_3 \le 1 \tag{5}$$
Finally, the similarity is compared with a similarity threshold $\tau$, the initial medical records whose similarity is greater than or equal to the similarity threshold are screened out, and the similarity-constrained medical record set formed by these initial medical records is obtained and recorded as $D = \{(T_i, T_j) \mid i, j \in [1, M]\}$.
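Formulas (3) to (5) and the threshold screening can be sketched as below. This is an illustration under stated assumptions rather than the patent's implementation: the record layout (`sex`, `age_flag`, `dia`), the default weights, and all function names are hypothetical, and because the age formula is not reproduced in this text, `age_term` is an assumed form scaled so that, like the other two terms, a value of 1 means identical.

```python
from itertools import combinations

MAX_SEGMENT_GAP = 3  # age segments are numbered 1..4


def gender_term(sex_i, sex_j):
    """1 for the same gender, 0 otherwise, as stated in the text."""
    return 1.0 if sex_i == sex_j else 0.0


def age_term(flag_i, flag_j):
    """Assumed form: 1 for the same segment, falling to 0 at the largest gap."""
    return 1.0 - abs(flag_i - flag_j) / MAX_SEGMENT_GAP


def diagnosis_term(dia_i, dia_j):
    """Shared diagnoses divided by all diagnoses of the two records."""
    union = dia_i | dia_j
    return len(dia_i & dia_j) / len(union) if union else 0.0


def similarity(rec_i, rec_j, mu=(0.2, 0.3, 0.5)):
    """Weighted sum of the three terms, following formula (3)."""
    # Formulas (4) and (5): weights lie in [0, 1] and sum to 1.
    if any(m < 0 or m > 1 for m in mu) or abs(sum(mu) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    mu1, mu2, mu3 = mu
    return (mu1 * gender_term(rec_i["sex"], rec_j["sex"])
            + mu2 * age_term(rec_i["age_flag"], rec_j["age_flag"])
            + mu3 * diagnosis_term(rec_i["dia"], rec_j["dia"]))


def constrained_pairs(records, tau, mu=(0.2, 0.3, 0.5)):
    """Index pairs (i, j) whose similarity is at least the threshold tau."""
    return [(i, j)
            for (i, rec_i), (j, rec_j) in combinations(enumerate(records), 2)
            if similarity(rec_i, rec_j, mu) >= tau]
```

With three toy records where only the first two share gender, age segment and most diagnoses, `constrained_pairs(records, 0.7)` would keep just that one pair.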
Next, step 102 is described: the medical record documents in the similarity-constrained medical record set are sequentially input into a preset LDA model, and the document-topic distribution and topic-word distribution of each medical record document are derived through the preset LDA model.
In this embodiment, the preset LDA model is obtained by improving the existing LDA model. To help those skilled in the art better understand the preset LDA model, the basic principle of the LDA model is described first:
Latent Dirichlet Allocation (LDA) is a topic model that aims to find the topics of a document. It comprises three layers: document, topic and word. Each document has a probability distribution over topics, and the words in the document are sampled from the word distributions of different topics, as shown in formula (6):
$$p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \times p(\text{topic} \mid \text{document}) \tag{6}$$
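As a small numeric illustration of formula (6), the sketch below mixes per-topic word distributions by a document's topic proportions. The two-topic, three-word numbers and the function name are made up for illustration.

```python
def word_given_document(topic_word, doc_topic):
    """Apply formula (6): sum over topics of p(word|topic) * p(topic|document).

    topic_word[k][w] is p(word w | topic k); doc_topic[k] is p(topic k | document).
    """
    vocab_size = len(topic_word[0])
    return [sum(doc_topic[k] * topic_word[k][w] for k in range(len(doc_topic)))
            for w in range(vocab_size)]


# Hypothetical example: 2 topics over a 3-word vocabulary.
topic_word = [[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]]
doc_topic = [0.25, 0.75]
probs = word_given_document(topic_word, doc_topic)
```

Because each row of `topic_word` and the vector `doc_topic` sum to 1, the resulting word probabilities also sum to 1.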
The medical record documents are modeled with the LDA model. Let the total number of medical record documents be M; the m-th medical record document contains $N_m$ clinical description words, each denoted $\omega_{m,n}$. Following the existing bag-of-words model, documents and words are represented as a document-topic distribution and a topic-word distribution. The topics in medical record texts can be understood as general categories of clinical care such as medication, observation, symptoms and operation. Each medical record text is a multinomial distribution over several topics, that is, each medical record text is a combination of several steps in the clinical care process.
In the related art, the steps of the LDA model generating the medical record text are shown in table 2.
It can be understood that each topic is a multinomial distribution over words, just as each clinical care step comprises several practical clinical operations, and that the document-topic distribution and the topic-word distribution follow Dirichlet priors with parameters α and β, respectively. The LDA model can therefore simulate well the thinking process of a doctor writing medical record text during diagnosis and treatment.
Based on the above analysis, the aim of LDA model inference is to calculate the unknown parameters of the LDA model from the current test document set and, from these parameters, to calculate the topic-word distribution and the document-topic distribution. In fact, the topic-word distribution and the document-topic distribution can be inferred directly during the calculation without computing the parameters themselves.
In practical applications, the parameter inference algorithms for the LDA model include Gibbs sampling and variational EM. The two methods are described below.
First, the core idea of Gibbs sampling is the Markov chain Monte Carlo (MCMC) method: in each iteration only the value of one dimension of the parameters is changed, until convergence, at which point the parameter values to be estimated are output. According to Dirichlet parameter estimation, inference yields:
where the terms of the formula represent the document-topic distribution, the topic-word distribution, and the probability that word i is assigned to topic k; i is a data pair (m, n) denoting the n-th word in the m-th document.
Since there are K topics in total, K iterations are required; the training steps are shown in Table 3:
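Table 3 is not reproduced in this text. As a rough illustration of the general technique only, the following sketch implements collapsed Gibbs sampling for standard LDA, not the patent's modified model. The function name, the hyperparameter defaults, and the use of a fixed number of sweeps in place of a convergence test are all assumptions.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              sweeps=200, seed=0):
    """Collapsed Gibbs sampling for standard LDA.

    docs is a list of documents, each a list of integer word ids.
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # words per topic
    z = []                                                # topic of each word

    # Step 1: randomly assign a topic number to every word.
    for m, doc in enumerate(docs):
        z_m = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_m.append(k)
            n_dk[m][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_m)

    # Step 2: rescan the corpus, resampling each word's topic in turn.
    for _ in range(sweeps):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[m][n]
                n_dk[m][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [(n_dk[m][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + vocab_size * beta)
                           for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[m][n] = k
                n_dk[m][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Step 3: turn the co-occurrence counts into the two distributions.
    theta = [[(n_dk[m][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for m, doc in enumerate(docs)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(num_topics)]
    return theta, phi
```

A production setting would monitor a convergence criterion instead of running a fixed number of sweeps; the fixed count keeps the sketch short.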
Second, the variational EM algorithm seeks suitable parameters that maximize the observed topic-word distribution probability in the text set, similar to a maximum likelihood estimation problem. It is divided into two iteration steps:
The variational E-step: because the posterior probability in the original model, which involves p(w | α, β), is difficult to derive, variational parameters (γ, φ) are introduced and an approximate posterior distribution q(θ, z | γ, φ) is used in its place.
The variational M-step: the approximation function L(γ, φ; α, β) is maximized. Here the prior Dirichlet distribution parameters (α, β) determine the topic-word distribution and the document-topic distribution θ, w represents words, and z represents topics.
Because the iteration goal of the LDA model is only to maximize the occurrence probability p(Z, W | α, β) of the words, it cannot effectively match the data characteristics of diabetes course records: the topic distributions of similar medical records may differ greatly, so the medical records cannot be effectively analyzed statistically according to their topic distributions.
In order to establish a topic model satisfying the medical record similarity constraint, this embodiment achieves this goal by changing the Gibbs sampling convergence condition policy.
Considering that each medical record contains several chronologically ordered course records, the similarity calculation for medical record documents should take into account the similarity between the course record sets of the documents; that is, the similarity constraint requires the document-topic distributions of the course record sets of the medical record documents in set D to be as similar as possible.
Let $T_m$ be the medical record numbered m, which includes $L_m$ course records whose topic set is denoted $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$. Given the course record topic sets $\theta r_m$ and $\theta r_n$ of two medical record documents, the medical record similarity constraint can be calculated as the mean of the pairwise topic distribution distances, as follows:
where $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two course records; a larger $dis(\theta r_m, \theta r_n)$ indicates lower similarity.
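The formula itself is not reproduced in this text. One reading of "the mean of the pairwise topic distribution distances" is sketched below: the Euclidean distance is averaged over every pair made of one course record from each medical record. Averaging over all $L_m \times L_n$ pairs is an assumption, as are the function names.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two topic vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("topic vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def topic_set_distance(theta_m, theta_n):
    """Mean pairwise distance between two sets of course-record topic vectors.

    theta_m and theta_n are lists of topic vectors, one per course record.
    A larger value indicates lower similarity between the two medical records.
    """
    if not theta_m or not theta_n:
        raise ValueError("each medical record needs at least one course record")
    total = sum(euclidean(u, v) for u in theta_m for v in theta_n)
    return total / (len(theta_m) * len(theta_n))
```

Two medical records whose course records all share the same topic vector would have a distance of 0 under this reading.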
The maximum objective function may be modified as:
In this embodiment, a Gibbs-EM iteration method is used for LDA model derivation, and the document-topic distribution parameter $\alpha_m$ is replaced by a normal distribution $\mu_m$ to obtain the preset LDA model:
where $\mu_{mk}$ represents the probability that medical record document m belongs to topic k. Since $\mu_m$ is assumed to follow a standard normal distribution, the improved maximum objective function is expressed as follows:
In addition, in this embodiment the document-topic distribution $\alpha_m$ is fixed in advance during sampling, so the Gibbs-EM iterative function is expressed as:
where the count term represents the number of occurrences of word i assigned to topic k in the similarity-constrained medical record set. Since the original α is replaced by a normal distribution, formula (14) can be solved by stochastic gradient descent, and the model training process is shown in Table 4:
Then, the medical record documents in the similarity-constrained medical record set are sequentially input into the preset LDA model, and the document-topic distribution and topic-word distribution of each medical record document are derived through the preset LDA model.
Therefore, based on an analysis of the influence of text mining on medical diagnosis and of the modeling process and inference method of the latent Dirichlet allocation topic model, the embodiment of the invention designs a preset LDA model based on medical record similarity constraints. The preset LDA model not only considers the similarity constraints among different medical record documents but also defines the medical text topic modeling objective, the inference process and the model's evaluation metrics, so that it can clearly reflect the focus of each diagnosis and treatment stage and the course of disease evolution, which helps improve the scientific soundness, effectiveness and accuracy of medical record topic mining.
A comparative experiment is performed between the LDA model and the preset LDA model of the present application (hereinafter referred to as Medical Record Similarity Latent Dirichlet Allocation, MRS-LDA) to illustrate the effectiveness and superiority of the medical record topic acquisition method considering similarity constraints provided by the embodiment of the present invention.
The initial medical records are those of patients in the endocrinology department of the First Affiliated Hospital of Anhui Medical University, comprising the admission records of 1294 patients in total from 2015 to 2017. Each medical record document mainly comprises admission records, course records (shown in FIG. 2), consultation records, discharge records and the like. The numbers of medical record documents for male and female patients are approximately equal, at a ratio of 648:646.
Referring to FIG. 3 and FIG. 4, among the diabetic patients admitted to the First Affiliated Hospital of Anhui Medical University, the hospitalization diagnoses show that patients of different ages and genders differ significantly in the number of complications they have at the same time. Older patients have markedly more diabetic complications than other age groups; middle-aged patients more often have 3 to 5 complications at once; young patients have diabetes but few complications; and few children have diabetes.
In the embodiment, the sex, age and admission diagnosis of the patient in the admission record are selected as the basis of medical record similarity constraint calculation data, and the disease course record of the doctor during the patient hospitalization period is utilized to perform relevant topic analysis. In the experimental process, the following treatment can be carried out, including:
(1) Using a Python crawler method, the 1294 patient medical record documents in HTML format are divided into the text records of each stage, such as admission records, discharge records and disease course records, and the required patient information, diagnosis results and disease course record texts are separated out.
(2) Constructing a dictionary and a stop word bank. The medical record texts studied in the present invention contain a large number of words that are irrelevant to the topics; after counting the frequency of each word appearing in the medical records, 12599 words are manually extracted as stop words and added to the stop word bank. Meanwhile, the disease names of the Chinese version of ICD-10 are added to the dictionary as supplementary features.
(3) Using jieba in Python as the word segmentation tool, word segmentation and stop word removal are performed with the dictionary and the stop word bank.
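The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration rather than the code used in the experiment: the names `TextExtractor`, `html_to_text`, `remove_stop_words` and `preprocess` are hypothetical, the sample HTML is invented, and the tokenizer is passed in as a parameter so that the sketch runs with the standard library only (in the described setup it would be a jieba-based segmenter using the custom dictionary).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML medical record document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html):
    """Step (1): strip the HTML markup and keep only the text content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def remove_stop_words(tokens, stop_words):
    """Step (3): drop every token that appears in the stop word bank."""
    return [token for token in tokens if token not in stop_words]


def preprocess(html, tokenize, stop_words):
    """Run text extraction, word segmentation and stop word removal.

    `tokenize` is any callable mapping a string to a list of tokens.
    Assumption: for Chinese records this would be a jieba segmenter
    loaded with the custom dictionary from step (2).
    """
    text = html_to_text(html)
    return remove_stop_words(tokenize(text), stop_words)


if __name__ == "__main__":
    sample = "<html><body><h2>Admission record</h2><p>patient has diabetes</p></body></html>"
    # str.split stands in for the real word segmentation tool.
    print(preprocess(sample, str.split, {"has"}))
```

Passing the tokenizer as an argument keeps the stop word logic independent of the segmentation library, which makes it easy to test on small inputs.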
Considering that, in medical record topic mining, the number of topics affects topic modeling and different similarity thresholds yield different numbers of similar medical records, this embodiment takes the similarity threshold and the number of topics as tuning parameters. The medical record similarity threshold τ ranges from 0.5 to 0.8, and the number of topics K takes the values 7, 10, 13, 15, 20 and 30. The PMI-Score and the medical record similarity constraint index of each model are calculated under each of these parameter settings.
Referring to fig. 5 and fig. 6, the similarity constraint results of the MRS-LDA model and the LDA model are compared under different numbers of topics and different similarity thresholds, where the abscissa is the number of topics K and the ordinate is the similarity constraint index SIM. The comparison shows that the MRS-LDA model has a clear advantage in the medical record similarity constraint. At a fixed similarity threshold, the similarity constraint index decreases only slightly as the number of topics increases, and the MRS-LDA model still holds a larger advantage over the LDA model on this index.
Referring to fig. 7, the pointwise mutual information (PMI-Score) results of the MRS-LDA model and the LDA model are compared under different numbers of topics and different similarity thresholds, where the abscissa is the number of topics K and the ordinate is the PMI-Score. When the number of topics K is 15, the MRS-LDA model is superior to the LDA model on the PMI-Score index, and it also outperforms the LDA model when the medical record similarity threshold is 0.5.
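The text does not give the exact formula of the PMI-Score. As a hedged illustration, the sketch below uses one common definition of PMI-based topic coherence: the average, over all pairs of a topic's top words, of the logarithm of the ratio between their joint document frequency and the product of their individual document frequencies. The function name `pmi_score`, the smoothing constant `eps` and the toy corpus are assumptions made for this example.

```python
import math
from itertools import combinations


def pmi_score(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words.

    Probabilities are estimated as document frequencies in `documents`
    (a list of token lists). `eps` is an assumed smoothing constant that
    avoids log(0) for word pairs that never co-occur.
    """
    if len(top_words) < 2:
        raise ValueError("at least two top words are required")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        hits = sum(1 for doc in doc_sets if all(w in doc for w in words))
        return hits / n_docs

    pairs = list(combinations(top_words, 2))
    total = 0.0
    for w1, w2 in pairs:
        joint = prob(w1, w2)
        total += math.log((joint + eps) / (prob(w1) * prob(w2) + eps))
    return total / len(pairs)


if __name__ == "__main__":
    corpus = [["a", "b"], ["a", "b"], ["c"], ["d"]]
    print(pmi_score(["a", "b"], corpus))  # words that always co-occur
    print(pmi_score(["a", "c"], corpus))  # words that never co-occur
```

Under this definition a higher score means that the top words of a topic tend to appear in the same documents, which is why it is commonly read as a measure of topic coherence.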
The comparative experiments show that the MRS-LDA model performs well on the similarity constraint index: for the same medical record similarity threshold and the same number of topics, the distance between the topic distributions of similar medical records obtained by the MRS-LDA model is smaller, so the associations between similar medical records are better described. In other words, because the medical record similarity constraint is added when the objective function is constructed, the topic distributions of similar medical records are relatively close, which makes the method suitable for medical record topic mining with relatively high accuracy.
In a second aspect, an embodiment of the present invention provides an apparatus for acquiring a medical record topic in consideration of similarity constraints, and referring to fig. 8, the apparatus includes:
a medical record set obtaining module 801, configured to calculate similarity between any two medical record documents in an initial medical record, and obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
and a topic distribution derivation module 802, configured to sequentially input each medical record document in the similarity-constrained medical record set into a preset LDA model, and derive document-topic distribution and topic-word distribution of each medical record document through the preset LDA model.
Optionally, referring to fig. 9, the medical record set obtaining module 801 includes:
a weight value obtaining unit 901, configured to obtain a plurality of similarity calculation factors of a medical record and a weight value of each similarity calculation factor;
a factor data calculating unit 902, configured to calculate a numerical value of each similarity calculation factor of any two medical record documents respectively;
and the similarity calculation unit 903 is configured to calculate the similarity between any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
Optionally, the similarity calculation factor comprises: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.
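The weighted similarity calculation and the threshold-based selection described above can be illustrated with the following sketch. The concrete distance functions, the age segments in `AGE_BINS`, the weight values and the way the factors are combined (a weighted sum of one minus each distance) are all assumptions chosen for illustration; this section does not specify them.

```python
from itertools import combinations

# Hypothetical age segments (inclusive bounds), assumed for this example.
AGE_BINS = [(0, 17), (18, 44), (45, 59), (60, 200)]


def age_segment(age):
    """Return the index of the age segment that contains `age`."""
    for index, (low, high) in enumerate(AGE_BINS):
        if low <= age <= high:
            return index
    raise ValueError("age outside the supported range")


def gender_distance(a, b):
    """0.0 when the gender attributes match, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def age_distance(a, b):
    """Distance between age segments, scaled to [0, 1]."""
    return abs(age_segment(a) - age_segment(b)) / (len(AGE_BINS) - 1)


def diagnosis_distance(a, b):
    """Jaccard distance between two sets of diagnosis results (assumed)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def similarity(r1, r2, weights=(0.2, 0.3, 0.5)):
    """Weighted similarity of two records; the weights are assumed values."""
    distances = (
        gender_distance(r1["gender"], r2["gender"]),
        age_distance(r1["age"], r2["age"]),
        diagnosis_distance(r1["diagnoses"], r2["diagnoses"]),
    )
    return sum(w * (1.0 - d) for w, d in zip(weights, distances))


def similar_pairs(records, threshold):
    """Index pairs whose similarity is greater than or equal to the threshold."""
    return [
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(records), 2)
        if similarity(r1, r2) >= threshold
    ]


if __name__ == "__main__":
    records = [
        {"gender": "M", "age": 65, "diagnoses": {"type 2 diabetes", "hypertension"}},
        {"gender": "M", "age": 70, "diagnoses": {"type 2 diabetes", "hypertension"}},
        {"gender": "F", "age": 10, "diagnoses": {"type 1 diabetes"}},
    ]
    print(similar_pairs(records, threshold=0.5))
```

Because every distance is scaled to the range 0 to 1 and the assumed weights sum to 1, the resulting similarity also lies between 0 and 1 and can be compared directly with the threshold τ.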
Optionally, referring to fig. 10, the topic distribution derivation module 802 includes:
a topic numbering unit 1001, configured to randomly assign a topic number z to each word in each medical record document in the similarity constraint medical record set;
a topic iteration unit 1002, configured to rescan the similarity-constrained medical record set and resample the topic of each word according to the probability that the word is assigned to topic k, so that the new topics satisfy Gibbs sampling convergence;
and a topic distribution calculation unit 1003, configured to count a topic-word co-occurrence frequency matrix in the corpus to obtain a document-topic distribution and a topic-word distribution.
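The three units above follow the usual structure of a Gibbs sampler for topic models: random initialization, repeated resampling, and reading the distributions off the count matrices. The sketch below shows that structure for a standard collapsed Gibbs sampler of plain LDA. It is an assumption-laden illustration: it does not include the similarity constraint or the Gibbs-EM update of the preset LDA model, it runs a fixed number of sweeps instead of testing convergence, and the function name `gibbs_lda` and the hyperparameter defaults are hypothetical.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA on documents of word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # topic-word counts
    n_k = [0] * n_topics                                 # words per topic
    z = []

    # Step 1: randomly assign a topic number z to every word.
    for m, doc in enumerate(docs):
        z_m = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_m.append(k)
            n_dk[m][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_m)

    # Step 2: rescan the corpus and resample the topic of each word.
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[m][i]
                # Remove the current assignment from the counts.
                n_dk[m][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_dk[m][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[m][i] = k
                n_dk[m][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Step 3: derive both distributions from the co-occurrence counts.
    theta = [
        [(n_dk[m][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for m, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta) for w in range(vocab_size)]
        for k in range(n_topics)
    ]
    return theta, phi


if __name__ == "__main__":
    toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3]]
    theta, phi = gibbs_lda(toy_docs, n_topics=2, vocab_size=4, n_iter=50)
    print(theta)
    print(phi)
```

Here `theta` plays the role of the document-topic distribution and `phi` the topic-word distribution; each row of either matrix sums to 1.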
Optionally, the preset LDA model includes:
the similarity constraint of any two medical record documents is expressed by the topic distribution distance $dis(\theta r_m, \theta r_n)$, and the formula is:
where $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$ indicates that medical record document m includes $L_m$ disease course records; $\theta_{m,L_m}$ denotes the topic vector of the $L_m$-th disease course record; and $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two disease course records;
the preset LDA model further comprises a Gibbs-EM iterative function which is as follows:
It should be noted that the medical record topic acquisition device considering similarity constraints provided in the embodiment of the present invention corresponds one-to-one to the above method; the implementation details of the method also apply to the device, so the device is not described in further detail here.
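As an illustration of the topic distribution distance dis(θr_m, θr_n) used by the preset LDA model above, the following sketch aggregates the Euclidean distances d between the topic vectors of the disease course records of two medical record documents. The aggregation used here, the mean over all pairs of course records, is an assumption made for this example, since the formula itself is not reproduced in this text; the function names are likewise hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance d between two topic vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("topic vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def topic_distribution_distance(theta_rm, theta_rn):
    """Distance between two medical record documents.

    Each argument is the list of topic vectors of one document, one vector
    per disease course record. Assumption: the distance is the mean of the
    Euclidean distances over all pairs of course records.
    """
    if not theta_rm or not theta_rn:
        raise ValueError("each document needs at least one course record")
    total = sum(euclidean(u, v) for u in theta_rm for v in theta_rn)
    return total / (len(theta_rm) * len(theta_rn))


if __name__ == "__main__":
    doc_m = [[0.9, 0.1], [0.8, 0.2]]  # two course records, two topics
    doc_n = [[0.85, 0.15]]            # one course record
    print(topic_distribution_distance(doc_m, doc_n))
```

A small value indicates that the course records of the two documents have close topic vectors, which is the property the similarity constraint is meant to encourage for similar medical records.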
In the description of the present invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.
Claims (6)
1. A medical record topic acquisition method considering similarity constraint is characterized by comprising the following steps:
calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deriving document-topic distribution and topic-word distribution of the medical record documents through the preset LDA model;
deriving document-topic distribution and topic-word distribution of each medical record document through the preset LDA model comprises:
randomly assigning a topic number z to each word in each medical record document in the similarity constraint medical record set;
rescanning the similarity constraint medical record set and resampling the topic of each word according to its sampling probability, so that the new topics satisfy Gibbs sampling convergence; wherein the sampling probability represents the probability that the word is assigned to topic k;
counting a topic-word co-occurrence frequency matrix in a corpus to obtain document-topic distribution and topic-word distribution;
the preset LDA model comprises:
the similarity constraint of any two medical record documents is expressed by the topic distribution distance $dis(\theta r_m, \theta r_n)$, and the formula is:
where $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$ indicates that medical record document m includes $L_m$ disease course records; $\theta_{m,L_m}$ denotes the topic vector of the $L_m$-th disease course record; $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two disease course records; $\theta r_n = \{\theta_{n,1}, \theta_{n,2}, \ldots, \theta_{n,L_n}\}$ indicates that medical record document n includes $L_n$ disease course records; $\theta_{n,L_n}$ denotes the topic vector of the $L_n$-th disease course record;
the preset LDA model further comprises a Gibbs-EM iterative function which is as follows:
where the word-topic count represents the number of occurrences of word i assigned to topic k in the similarity constraint medical record set; (α, β) are the prior Dirichlet distribution parameters; V is the total number of words in the medical record set; and $\mu_{mk}$ of the preset LDA model represents the probability that medical record document m belongs to topic k.
2. The medical record topic acquisition method of claim 1, wherein calculating the similarity between any two medical record documents in the initial medical record comprises:
acquiring a plurality of similarity calculation factors of the medical record and weight values of the similarity calculation factors;
respectively calculating the numerical values of any two medical record documents about each similarity calculation factor;
and calculating the similarity of any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
3. The medical record topic acquisition method as recited in claim 2, wherein the similarity calculation factors comprise: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.
4. A medical record topic acquisition device considering similarity constraint, characterized in that the device comprises:
the medical record set acquisition module is used for calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
the topic distribution derivation module is used for sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model and deriving document-topic distribution and topic-word distribution of the medical record documents through the preset LDA model;
the topic distribution derivation module comprises:
a topic numbering unit, configured to randomly assign a topic number z to each word in each medical record document in the similarity constraint medical record set;
a topic iteration unit, configured to rescan the similarity constraint medical record set and resample the topic of each word according to its sampling probability, so that the new topics satisfy Gibbs sampling convergence; wherein the sampling probability represents the probability that the word is assigned to topic k;
the topic distribution calculating unit is used for counting topic-word co-occurrence frequency matrixes in the corpus to obtain document-topic distribution and topic-word distribution;
the preset LDA model comprises:
the similarity constraint of any two medical record documents is expressed by the topic distribution distance $dis(\theta r_m, \theta r_n)$, and the formula is:
where $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$ indicates that medical record document m includes $L_m$ disease course records; $\theta_{m,L_m}$ denotes the topic vector of the $L_m$-th disease course record; $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two disease course records; $\theta r_n = \{\theta_{n,1}, \theta_{n,2}, \ldots, \theta_{n,L_n}\}$ indicates that medical record document n includes $L_n$ disease course records; $\theta_{n,L_n}$ denotes the topic vector of the $L_n$-th disease course record;
the preset LDA model further comprises a Gibbs-EM iterative function which is as follows:
where the word-topic count represents the number of occurrences of word i assigned to topic k in the similarity constraint medical record set; (α, β) are the prior Dirichlet distribution parameters; V is the total number of words in the medical record set; and $\mu_{mk}$ of the preset LDA model represents the probability that medical record document m belongs to topic k.
5. The medical record topic acquisition device of claim 4, wherein the medical record collection acquisition module comprises:
a weight value obtaining unit, configured to obtain a plurality of similarity calculation factors of the medical record and the weight value of each similarity calculation factor;
the factor data calculation unit is used for calculating the numerical values of any two medical record documents about each similarity calculation factor;
and the similarity calculation unit is used for calculating the similarity of any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
6. The medical record topic acquisition device as claimed in claim 5, wherein the similarity calculation factors comprise: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810843072.0A CN109065174B (en) | 2018-07-27 | 2018-07-27 | Medical record theme acquisition method and device considering similarity constraint |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109065174A (en) | 2018-12-21 |
CN109065174B (en) | 2022-02-18 |
Family
ID=64836831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810843072.0A Active CN109065174B (en) | 2018-07-27 | 2018-07-27 | Medical record theme acquisition method and device considering similarity constraint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109065174B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046339A (en) * | 2018-12-24 | 2019-07-23 | 北京字节跳动网络技术有限公司 | Determine method, apparatus, storage medium and the electronic equipment of document subject matter |
CN109871434B (en) * | 2019-02-25 | 2019-12-10 | 内蒙古工业大学 | Public opinion evolution tracking method based on dynamic incremental probability graph model |
CN110517789B (en) * | 2019-08-30 | 2023-06-16 | 深圳市汇健医疗工程有限公司 | Digital composite operating room with multiple image devices |
CN111370086A (en) * | 2020-02-27 | 2020-07-03 | 平安国际智慧城市科技股份有限公司 | Electronic case detection method, electronic case detection device, computer equipment and storage medium |
CN111430037B (en) * | 2020-03-30 | 2024-04-09 | 讯飞医疗科技股份有限公司 | Similar medical record searching method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102317786A (en) * | 2007-04-18 | 2012-01-11 | 特提斯生物科学公司 | Diabetes correlativity biological marker and method of application thereof |
CN103365978A (en) * | 2013-07-01 | 2013-10-23 | 浙江大学 | Traditional Chinese medicine data mining method based on LDA (Latent Dirichlet Allocation) topic model |
CN106156272A (en) * | 2016-06-21 | 2016-11-23 | 北京工业大学 | A kind of information retrieval method based on multi-source semantic analysis |
CN107613520A (en) * | 2017-08-29 | 2018-01-19 | 重庆邮电大学 | A kind of telecommunication user similarity based on LDA topic models finds method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1949105A4 (en) * | 2005-10-11 | 2009-06-17 | Tethys Bioscience Inc | Diabetes-associated markers and methods of use thereof |
Also Published As
Publication number | Publication date |
---|---|
CN109065174A (en) | 2018-12-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||