CN117648923B - Chinese spelling error correction method suitable for medical context - Google Patents


Info

Publication number
CN117648923B
CN117648923B (application CN202410120343.5A)
Authority
CN
China
Prior art keywords
sentence
corrected
chinese
chinese characters
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410120343.5A
Other languages
Chinese (zh)
Other versions
CN117648923A (en)
Inventor
高敏
陈恩红
刘昌春
蒋浚哲
张凯
王慕秋
李京秀
宋雪莉
丁蓓蓓
张梦云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Provincial Hospital First Affiliated Hospital Of Ustc
Original Assignee
Anhui Provincial Hospital First Affiliated Hospital Of Ustc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Provincial Hospital First Affiliated Hospital Of Ustc
Priority to CN202410120343.5A
Publication of CN117648923A
Application granted
Publication of CN117648923B
Legal status: Active
Anticipated expiration
Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/166: Editing, e.g. inserting or deleting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to the field of artificial intelligence, in particular to a Chinese spelling error correction method suitable for medical contexts, which comprises the steps of: converting a sentence into a Chinese character label sequence and inputting it into a BERT pre-trained Chinese language model to obtain context information features, then linearly transforming the context information features so that their dimension is aligned with the vocabulary; calculating the normalized confidence of the top k candidates at each position, obtaining the confidence of each position's top k candidates; calculating the visual and phonetic similarity between each of the top k candidate characters at each position and the input character, and weighting the two to obtain the similarity; fusing similarity and confidence to calculate the comprehensive weight of the top k candidates at each position; and taking the Chinese character with the highest comprehensive weight at each position as the corrected character. The invention solves the problem of similar-character errors by modeling the visual similarity and the phonetic similarity of Chinese characters.

Description

Chinese spelling error correction method suitable for medical context
Technical Field
The invention relates to the field of artificial intelligence, in particular to a Chinese spelling error correction method suitable for medical contexts.
Background
With China's population growth and aging, the burden on medical staff has increased greatly: doctors must devote more time to seeing patients and cannot concentrate on other work, such as writing medical records and issuing prescriptions. This heavy workload greatly raises the probability that doctors make spelling errors at work, causing deviations in information transmission and even accidents. For example, a misspelled drug name may lead a patient to receive the wrong drug; a careless error in a disease name may cause misdiagnosis; errors in recording surgical procedures can also severely affect the therapeutic outcome. An automatic spelling-correction system can find wrong characters in text and propose modifications, helping medical staff reduce spelling errors, improving the accuracy of medical records, and saving doctors' time so that more energy can be devoted to treatment.
Some researchers have attempted spelling correction in medical settings with deep-learning methods. These methods mainly construct neural network models that learn and understand complex patterns of language (context, grammatical structure, semantic meaning, and so on) from large amounts of training data. An encoder encodes the input misspelled text into a fixed-length vector that captures the important information in the text, and a decoder then generates the correctly spelled text from this vector. During training, the neural network model continuously adjusts its internal parameters by comparing the generated text with the ground-truth correct text, making the generated text increasingly similar to the correct text.
Such neural network models rely on local context for prediction, because their design makes it difficult to handle long-range dependencies; the models may not adequately understand phrases or sentences whose meaning is enriched by the wider context, and thus cannot correct context-dependent errors. For example, if "hyperthyroidism" in a condition description is miswritten as "hypothyroidism", the correct disease name could be inferred from the specific symptoms in the description, but the model cannot correct the spelling error because it does not sufficiently understand the relevance and semantics of the context.
On the other hand, consider spelling errors involving similar Chinese characters, for example "sinus rhythm" (窦性心律) miswritten as "sinus heart rate" (窦性心率): 律 and 率 share the same pronunciation, and both 心率 (heart rate) and 心律 (heart rhythm) have valid meanings of their own. Existing deep-learning neural network models have difficulty handling such complex nonlinear relationships, so they may fail to predict correctly when faced with visually similar or phonetically similar spelling errors.
Disclosure of Invention
In order to solve the problems, the invention provides a Chinese spelling error correction method suitable for medical contexts.
The method comprises the following steps:
Step one: split the sentence to be corrected X into units of single Chinese characters, the i-th character being x_i, 1 ≤ i ≤ n, where n is the number of characters; map each of the n characters through a vocabulary to obtain a numeric sequence, add the [CLS] tag before the sequence and the [SEP] tag after it, and obtain the Chinese character label sequence T of the sentence to be corrected.
Step two: input the Chinese character label sequence T into the BERT pre-trained Chinese language model to obtain the context information feature H; convert the dimension of the context information feature H to the vocabulary size to obtain the confidence prediction P.
Step three: define the part of the confidence prediction P corresponding to the character x_i as the Chinese character confidence prediction P_i; sort all values of P_i from large to small and select the top k values as the candidate Chinese character probability set at the i-th position of the sentence to be corrected, then normalize this set, the normalized confidence of the j-th candidate character at the i-th position being c_{i,j}.
Step four: based on the edit distance algorithm, calculate the phonetic similarity SP_{i,j} between the i-th character x_i of the sentence to be corrected and the j-th candidate character at the i-th position.
Step five: based on the edit distance algorithm, calculate the visual similarity SV_{i,j} between the i-th character x_i of the sentence to be corrected and the j-th candidate character at the i-th position.
Step six: from the phonetic similarity SP_{i,j} and the visual similarity SV_{i,j}, calculate the similarity S_{i,j} between x_i and the j-th candidate character at the i-th position; from the similarity S_{i,j} and the normalized confidence c_{i,j}, calculate the comprehensive weight W_{i,j} of the j-th candidate character at the i-th position; and from the comprehensive weights W_{i,j}, determine the corrected character y_i at the i-th position of the sentence to be corrected.
Further, inputting the Chinese character label sequence T into the BERT pre-trained Chinese language model in step two to obtain the context information feature H specifically refers to computing
H = BERT(T),
where BERT(·) represents the feature-extraction operation of the BERT pre-trained Chinese language model.
Further, converting the dimension of the context information feature H in step two to obtain the confidence prediction P specifically refers to performing the dimension conversion
P = Linear(H),
where Linear(·) represents a linear transformation; the last dimension of the confidence prediction P equals the vocabulary size, 21128.
Further, the normalized confidence c_{i,j} in step three is calculated as
c_{i,j} = p_{i,j} / Σ_{m=1}^{k} p_{i,m},
where p_{i,m} represents the m-th largest value when the vector of the Chinese character confidence prediction P_i is sorted from large to small.
Further, step four specifically comprises: the pinyin sequence of each Chinese character is composed of its pinyin and a tone code; define the pinyin sequence of the i-th character x_i of the sentence to be corrected as py(x_i); based on the edit distance algorithm, the phonetic similarity between x_i and the j-th candidate character at the i-th position is calculated as
SP_{i,j} = 1 - ED(py(w_{i,j}), py(x_i)) / max(|py(w_{i,j})|, |py(x_i)|),
where idx_{i,j} represents the vocabulary index of the j-th candidate character at the i-th position, decode(·) represents the decoding function that converts a vocabulary index into the corresponding Chinese character, w_{i,j} = decode(idx_{i,j}) represents the j-th candidate character at the i-th position, py(w_{i,j}) represents its pinyin sequence, ED(·, ·) represents the edit-distance function, |·| represents the length of a sequence, and max(·, ·) represents the maximizing function.
Further, step five specifically comprises: define the ideographic description sequence of the i-th character x_i of the sentence to be corrected as ids(x_i); based on the edit distance algorithm, the visual similarity between x_i and the j-th candidate character at the i-th position is calculated as
SV_{i,j} = 1 - ED(ids(w_{i,j}), ids(x_i)) / max(|ids(w_{i,j})|, |ids(x_i)|),
where ids(w_{i,j}) represents the ideographic description sequence of the j-th candidate character w_{i,j} = decode(idx_{i,j}) at the i-th position, idx_{i,j} represents the vocabulary index of that candidate, decode(·) represents the decoding function that converts a vocabulary index into the corresponding Chinese character, ED(·, ·) represents the edit-distance function, |·| represents the length of a sequence, and max(·, ·) represents the maximizing function.
Further, the ideographic description sequence specifically refers to the following:
splitting each Chinese character into units of single characters to obtain its internal character-forming components, and, for Chinese characters that cannot be completely split into single characters, combining the remaining strokes with the nearest single character as one internal character-forming component;
continuing to split each internal character-forming component in the stroke order of the Chinese writing rules until individual strokes are obtained;
constructing, according to the splitting order, a tree-structured ideographic description tree of the Chinese character, whose root node is the structural-information code describing the relative positions of the internal character-forming components obtained by the first split, whose leaf nodes are the stroke codes of single strokes, and whose intermediate nodes are structural-information codes describing the relative positions of internal character-forming components or strokes;
the ideographic description sequence of the Chinese character is the sequence obtained by traversing the ideographic description tree.
Further, traversing the ideographic description tree specifically refers to traversing it in pre-order.
Further, step six specifically comprises: calculating the similarity between the i-th character x_i of the sentence to be corrected and the j-th candidate character at the i-th position as
S_{i,j} = λ · SP_{i,j} + (1 - λ) · SV_{i,j},
where λ is an adjustment factor balancing the phonetic similarity and the visual similarity;
combining the similarity and the normalized confidence to obtain the comprehensive weight W_{i,j} of the j-th candidate character at the i-th position;
the corrected character at the i-th position of the sentence to be corrected is then
y_i = decode(index(max_j W_{i,j})),
where max_j(·) represents the function selecting the maximum value among the comprehensive weights of the candidates, index(·) represents the function converting a comprehensive weight into its vocabulary index, and decode(·) represents the decoding function converting a vocabulary index into the corresponding Chinese character.
One or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:
The spelling-correction method provided by the invention is based on context confidence and Chinese character similarity. Introducing the BERT pre-trained Chinese language model brings in the background knowledge acquired during pre-training, and the input sentence is encoded on that basis, so the current context features are integrated and the problem that characters which are valid in isolation but unsuitable in context are hard to identify is solved. Meanwhile, modeling the character structure (visual similarity) and the pronunciation (phonetic similarity) of Chinese characters helps the model recognize similar wrongly written characters, solving the problem of similar-character errors.
Drawings
FIG. 1 is a schematic diagram of two internal character-forming components in a left-right relationship according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of two internal character-forming components in a top-bottom relationship according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of three internal character-forming components in a left-to-right relationship according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of three internal character-forming components in a top-to-bottom relationship according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of two internal character-forming components in an outside-inside (full surround) relationship according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of two internal character-forming components in a three-sided surround open at the bottom according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of two internal character-forming components in a three-sided surround open at the top according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of two internal character-forming components in a three-sided surround open at the right according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of two internal character-forming components in a two-sided surround from upper-left according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of two internal character-forming components in a two-sided surround from upper-right according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of two internal character-forming components in a two-sided surround from lower-left according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of two internal character-forming components in a partial-overlap relationship according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of the ideographic description tree of the Chinese character 由 according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the drawings and specific embodiments. Before the technical solutions of the embodiments are described in detail, the terms involved are explained; in this specification, components with the same names or the same reference numerals represent similar or identical structures and are limited for illustrative purposes only.
The invention corrects sentences input by the user and outputs the most probable correction. Specifically, the input sentence is first segmented into single Chinese characters; special identifiers are then added to obtain the Chinese character label sequence of the sentence to be corrected. This sequence is input into the BERT pre-trained Chinese language model to obtain the context information features given by the model, and the features are linearly transformed so that their dimension aligns with the vocabulary. The normalized confidence of the top k candidates at each position is calculated, giving the confidence of each position's top k candidates; the visual and phonetic similarity between each of the top k candidate characters and the input character is calculated, and the two are weighted to obtain the similarity of each of the top k candidates; similarity and confidence are fused to compute the comprehensive weight of the top k candidates at each position; and the Chinese character with the highest comprehensive weight at each position is taken as the corrected character.
The method provided by the invention specifically comprises the following steps:
1. Word segmentation of sentences
For the BERT pre-trained Chinese language model to process sentences composed of characters, the sentences must first be segmented.
Split the sentence to be corrected X in units of Chinese characters, and map each of the n characters obtained through the vocabulary to get the sequence (t_1, ..., t_n), where n represents the number of Chinese characters in the sentence to be corrected, x_i represents the i-th character, and t_i represents the numeric label obtained by mapping x_i. The vocabulary is a table that maps Chinese characters to numbers; for the vocabulary of the BERT pre-trained Chinese language model, the numbers range from 0 to 21127.
Based on the input-format requirement of the BERT pre-trained Chinese language model, the [CLS] tag is added before the sequence (t_1, ..., t_n) and the [SEP] tag after it, giving the Chinese character label sequence of the sentence to be corrected, T = ([CLS], t_1, ..., t_n, [SEP]).
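The segmentation and tagging step can be sketched as follows. This is a minimal illustration: the four-entry vocabulary and the [CLS]/[SEP]/[UNK] ID values are toy stand-ins for BERT's 21128-entry Chinese vocabulary and its special-token IDs.

```python
# Sketch of step one: split a sentence into characters, map them through a
# vocabulary, and wrap the result with [CLS]/[SEP] tags.
# toy_vocab and the special-token IDs are illustrative stand-ins.

def build_label_sequence(sentence, vocab, cls_id=101, sep_id=102, unk_id=100):
    """Return the Chinese character label sequence for a sentence."""
    chars = list(sentence)                  # character-level segmentation
    token_ids = [vocab.get(ch, unk_id) for ch in chars]
    return [cls_id] + token_ids + [sep_id]  # prepend [CLS], append [SEP]

toy_vocab = {"心": 2552, "率": 3837, "正": 2972, "常": 2382}
seq = build_label_sequence("心率正常", toy_vocab)
```

In a real system the vocabulary and the special-token IDs would come from the published BERT Chinese vocabulary file rather than a hand-written dictionary.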
2. Acquiring features containing contextual information
The BERT pre-trained Chinese language model can model contextual semantic information. Inputting the Chinese character label sequence T into the BERT pre-trained Chinese language model yields the context information feature
H = BERT(T),
where BERT(·) represents the feature-extraction operation of the BERT pre-trained Chinese language model; each row of the context information feature H is a vector of dimension d, where d is the output dimension of the BERT pre-trained Chinese language model.
To calculate in subsequent steps the probability of the possible Chinese characters at each position of the sentence to be corrected, the context information feature H must undergo dimension conversion, giving the confidence prediction
P = Linear(H),
where Linear(·) represents a linear transformation; the last dimension of the confidence prediction P equals the vocabulary size, 21128.
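The dimension conversion from the feature dimension d to the vocabulary size can be illustrated with NumPy. The BERT encoder is stood in for here by random features, so only the shapes, not the values, are meaningful.

```python
import numpy as np

# Sketch of step two's dimension conversion: a feature matrix H of shape
# (n+2) x d (n characters plus [CLS]/[SEP]) is mapped by a linear layer
# (weights W, bias b) to a confidence prediction P of shape (n+2) x |V|.
# H stands in for BERT(T); the random values are placeholders.

rng = np.random.default_rng(0)
n, d, vocab_size = 4, 768, 21128

H = rng.standard_normal((n + 2, d))            # stand-in for BERT output
W = rng.standard_normal((d, vocab_size)) * 0.01
b = np.zeros(vocab_size)

P = H @ W + b                                   # P = Linear(H)
```

Each row of P now has one score per vocabulary entry, which is what the top-k candidate selection in the next step operates on.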
3. Calculating confidence
Define the part of the confidence prediction P corresponding to the i-th character x_i of the sentence to be corrected as the Chinese character confidence prediction P_i. P_i is a vector of length 21128; each value in the vector represents the probability that the Chinese character at the corresponding index of the vocabulary of the BERT pre-trained Chinese language model is the i-th character of the sentence.
Sort all values of the Chinese character confidence prediction P_i from large to small and select the top k values to form the candidate Chinese character probability set at the i-th position of the sentence to be corrected. The characters corresponding to this set are the k characters that the BERT pre-trained Chinese language model predicts as most likely for the i-th position, and the character corresponding to the j-th value of the set is the j-th candidate character at the i-th position. Normalizing the set, the normalized confidence of the j-th candidate character at the i-th position is
c_{i,j} = p_{i,j} / Σ_{m=1}^{k} p_{i,m},
where p_{i,m} represents the m-th largest value when the vector P_i is sorted from large to small.
When calculating the normalized confidence, the invention considers only the k largest values of the Chinese character confidence prediction P_i as candidates, not all values over the vocabulary. The purpose of this design is to widen the confidence gaps: the values of the leading candidates are relatively close to one another, and if normalization were performed over all values in the vocabulary, the normalized confidences computed for the candidates would be too close.
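The top-k normalization can be sketched as follows, with a toy five-entry "vocabulary" standing in for the real 21128-entry score vector of one position:

```python
import numpy as np

def normalized_confidence(scores, k):
    """Top-k normalization: c_{i,j} = p_{i,j} / sum of the k largest values.

    scores: one position's confidence prediction over the vocabulary.
    Returns (vocabulary indices of the top-k candidates, normalized confidences).
    """
    idx = np.argsort(scores)[::-1][:k]   # indices of the k largest values
    top = scores[idx]
    return idx, top / top.sum()

scores = np.array([0.05, 0.40, 0.30, 0.20, 0.05])
idx, conf = normalized_confidence(scores, k=3)
```

Note how the three leading scores (0.40, 0.30, 0.20) are renormalized among themselves; dividing instead by the full vocabulary sum would leave their confidences nearly indistinguishable, which is exactly the gap-widening effect described above.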
4. Calculating phonetic similarity
The pronunciation of a Chinese character can be directly represented by its pinyin and tone. The invention judges the phonetic similarity of Chinese characters by converting them into their pinyin sequences, where the pinyin sequence of each character is defined as its pinyin followed by a tone code; in this embodiment the tone code is a digit. For example, the pinyin sequence of the Chinese character 医 is "yi1", where "1" encodes the first tone. Define the pinyin sequence of the i-th character x_i of the sentence to be corrected as py(x_i).
Based on the edit distance algorithm, the phonetic similarity between x_i and the j-th candidate character at the i-th position is calculated as
SP_{i,j} = 1 - ED(py(w_{i,j}), py(x_i)) / max(|py(w_{i,j})|, |py(x_i)|),
where idx_{i,j} represents the vocabulary index of the j-th candidate character at the i-th position, decode(·) represents the decoding function that converts a vocabulary index into the corresponding Chinese character, w_{i,j} = decode(idx_{i,j}) represents the j-th candidate character at the i-th position, py(w_{i,j}) represents its pinyin sequence, ED(·, ·) represents the edit-distance function, |·| represents the length of a sequence, and max(·, ·) represents the maximizing function.
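A sketch of the edit-distance-based phonetic similarity follows. The Levenshtein implementation is a plain dynamic program; the pinyin strings are supplied by hand (using the common "v"-for-ü convention), whereas a real system would look them up from a pinyin table.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (single-row DP)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution/match
            prev = cur
    return dp[n]

def phonetic_similarity(py_a, py_b):
    """SP = 1 - ED(py_a, py_b) / max(|py_a|, |py_b|)."""
    return 1 - edit_distance(py_a, py_b) / max(len(py_a), len(py_b))

# 率 (lv4) vs 律 (lv4): identical pinyin sequences, maximal similarity
sim_same = phonetic_similarity("lv4", "lv4")
# lv4 vs de5: completely different sequences of the same length
sim_diff = phonetic_similarity("lv4", "de5")
```

The similarity is 1 for identical pronunciations and falls toward 0 as the pinyin sequences diverge, matching the formula above.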
5. Calculating visual similarity
The invention adopts the ideographic description sequence to represent the visual information of a Chinese character. An ideographic description sequence contains structural-information codes describing the relative positions of the character-forming components inside the character, together with stroke codes, with the strokes ordered according to the Chinese writing rules. Because it captures both structural information and stroke information, each ideographic description sequence corresponds one-to-one with the Chinese character it describes.
The invention describes the relative positions of character-forming components with twelve structural-information codes, thereby accurately representing the structure of a character. FIG. 1 shows two internal character-forming components in a left-right relationship; FIG. 2 shows two components in a top-bottom relationship; FIG. 3 shows three components in a left-to-right relationship; FIG. 4 shows three components in a top-to-bottom relationship; FIG. 5 shows two components in an outside-inside (full surround) relationship; FIG. 6 shows two components in a three-sided surround open at the bottom; FIG. 7, a three-sided surround open at the top; FIG. 8, a three-sided surround open at the right; FIG. 9 shows two components in a two-sided surround from upper-left; FIG. 10, a two-sided surround from upper-right; FIG. 11, a two-sided surround from lower-left; and FIG. 12 shows two components in a partial-overlap relationship.
For Chinese characters that cannot be completely split into single characters, the remaining strokes are combined with the nearest single character (the single character closest to the remaining strokes in writing position) as one internal character-forming component. Each internal character-forming component is then split further in the stroke order of the Chinese writing rules until individual strokes are obtained. According to the splitting order, a tree-structured ideographic description tree of the character is constructed: its root node is the structural-information code describing the relative positions of the components obtained by the first split, its leaf nodes are the stroke codes of single strokes, and its intermediate nodes are structural-information codes describing the relative positions of components or strokes. The ideographic description sequence of the character is the sequence obtained by traversing this tree.
In this embodiment, the ideographic description tree is traversed in pre-order. Fig. 13 shows the ideographic description tree of the Chinese character 由. Splitting 由 for the first time yields a first internal character-forming component 冂 and a second internal character-forming component 土, which stand in the partial-overlap relationship of FIG. 12, so the root node of the ideographic description tree of 由 is the structural-information code for the partial-overlap relationship of FIG. 12. Splitting the first component 冂 yields two single strokes in the left-right relationship of FIG. 1; splitting the second component 土 yields a third internal character-forming component 十 and the stroke 一, which stand in the top-bottom relationship of FIG. 2; splitting the third component 十 yields the strokes 一 and 丨, which stand in the partial-overlap relationship of FIG. 12. At this point 由 is completely split into individual strokes, giving the ideographic description tree shown in fig. 13; the upper part of fig. 13 shows the sequence obtained by pre-order traversal of that tree.
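The construction and pre-order traversal of the ideographic description tree of 由 can be sketched as follows. The structure-code names (OVERLAP, LEFT_RIGHT, TOP_BOTTOM) and the stroke labels are illustrative stand-ins for the patent's twelve structural-information codes and its stroke encodings.

```python
# Sketch of an ideographic description tree and its pre-order traversal.
# Labels are illustrative, not the patent's actual encodings.

class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def preorder(node):
    """Pre-order traversal: visit the root first, then children left to right."""
    seq = [node.label]
    for child in node.children:
        seq.extend(preorder(child))
    return seq

# 由 splits into 冂 and 土 (overlap); 冂 into two strokes (left-right);
# 土 into 十 and 一 (top-bottom); 十 into 一 and 丨 (overlap).
tree = Node("OVERLAP", [
    Node("LEFT_RIGHT", [Node("丨"), Node("𠃌")]),
    Node("TOP_BOTTOM", [Node("OVERLAP", [Node("一"), Node("丨")]), Node("一")]),
])
ids_sequence = preorder(tree)
```

Traversing the root's structure code first and each subtree in order reproduces the flat ideographic description sequence shown in the upper part of fig. 13.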
Define the ideographic description sequence of the i-th Chinese character x_i in the sentence to be corrected as ids_i. Based on the edit distance algorithm, the visual similarity vs_{i,j} between the i-th Chinese character x_i in the sentence to be corrected and the j-th candidate Chinese character at the i-th position in the sentence to be corrected is calculated as:

vs_{i,j} = 1 − ED(ids_i, ids_{i,j}) / max(|ids_i|, |ids_{i,j}|)

wherein ids_{i,j} represents the ideographic description sequence of the j-th candidate Chinese character at the i-th position in the sentence to be corrected, ED(·, ·) represents the edit distance calculation function, |·| represents the sequence length, and max(·, ·) represents the maximizing function.
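A minimal sketch of this edit-distance-based visual similarity follows. The Levenshtein distance is computed by standard dynamic programming and normalized by the longer sequence length; the IDS strings used in the example are illustrative, not the patent's actual encoding.

```python
# Edit-distance-based visual similarity between two ideographic description
# sequences: vs = 1 - ED(a, b) / max(|a|, |b|).

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over prefixes."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                          # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def visual_similarity(ids_a, ids_b):
    """1 minus the edit distance normalized by the longer sequence."""
    return 1.0 - edit_distance(ids_a, ids_b) / max(len(ids_a), len(ids_b))

print(visual_similarity("⿻⿰丨𠃌⿱⿻一丨一", "⿻⿰丨𠃌⿱⿻一丨一"))  # identical -> 1.0
```

Identical sequences score 1.0, and sequences sharing no symbols score 0.0, so characters with similar stroke structure receive high visual similarity.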
6. Sentence correction
The similarity sim_{i,j} between the i-th Chinese character x_i in the sentence to be corrected and the j-th candidate Chinese character at the i-th position in the sentence to be corrected is calculated as:

sim_{i,j} = λ · ps_{i,j} + (1 − λ) · vs_{i,j}

wherein λ is an adjustment factor balancing the speech similarity and the visual similarity.

Combining the similarity sim_{i,j} and the normalized confidence p̂_{i,j} yields the comprehensive weight w_{i,j} of the j-th candidate Chinese character at the i-th position in the sentence to be corrected.

The corrected Chinese character y_i at the i-th position in the sentence to be corrected is then:

y_i = D(argmax_j w_{i,j})

wherein argmax_j represents the function that selects, from the comprehensive weights w_{i,j}, the vocabulary index of the one with the largest value, and D(·) represents the decoding function that converts a vocabulary index into the corresponding Chinese character.
The corrected Chinese character y_i is the character at the i-th position of the corrected sentence. It may be the same as or different from the original Chinese character x_i: if y_i equals x_i, the i-th position of the corrected sentence is unmodified; if y_i differs from x_i, the i-th position of the sentence to be corrected has been modified.
Concatenating the corrected Chinese characters y_i in order yields the corrected sentence.
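The selection step above can be sketched for a single position as follows. The product sim · p̂ used for the comprehensive weight is an assumption for illustration (the patent states only that the similarity and the normalized confidence are combined), and the candidate characters and scores are invented.

```python
# Sketch of final selection at one position: combine similarity and
# normalized confidence into a comprehensive weight, then take the argmax.
# The product combination and all numbers below are illustrative assumptions.

candidates = [            # (character, similarity sim, normalized confidence)
    ("药", 0.90, 0.55),
    ("乐", 0.40, 0.30),
    ("月", 0.20, 0.15),
]

weights = [sim * conf for _, sim, conf in candidates]      # comprehensive weights
best = max(range(len(candidates)), key=lambda j: weights[j])  # argmax over j
corrected_char = candidates[best][0]                       # decode index -> char
print(corrected_char)  # "药"
```

Because the weight multiplies model confidence by phonetic/visual plausibility, a candidate must both fit the context and resemble the original character to win, which is what makes the scheme robust for near-homophone and near-homograph medical-term errors.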
The above embodiments merely illustrate preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements made to the technical solution of the present invention by those skilled in the art, without departing from its design spirit, shall fall within the protection scope defined by the claims of the present invention.

Claims (9)

1. A method of Chinese spelling error correction for medical contexts, comprising the steps of:
step one: dividing the sentence to be corrected into units of single Chinese characters to obtain m Chinese characters, the i-th Chinese character being x_i, i = 1, 2, …, m; mapping each of the m Chinese characters through a word list to obtain an index sequence, adding a sentence-start tag before the sequence and a sentence-end tag after the sequence, and obtaining the Chinese character label sequence T of the sentence to be corrected;
step two: inputting the Chinese character label sequence T into a BERT pre-trained Chinese language model to obtain a context information feature H, and performing dimension conversion on the context information feature H to obtain a confidence prediction P;
step three: defining the confidence prediction in P corresponding to the Chinese character x_i as the Chinese character confidence prediction P_i; selecting from P_i the n largest values as the candidate Chinese character probability set at the i-th position in the sentence to be corrected, and normalizing the set, wherein the normalized confidence of the j-th candidate Chinese character at the i-th position in the sentence to be corrected is p̂_{i,j};
step four: calculating, based on the edit distance algorithm, the speech similarity ps_{i,j} between the i-th Chinese character x_i in the sentence to be corrected and the j-th candidate Chinese character at the i-th position in the sentence to be corrected;
step five: calculating, based on the edit distance algorithm, the visual similarity vs_{i,j} between the i-th Chinese character x_i in the sentence to be corrected and the j-th candidate Chinese character at the i-th position in the sentence to be corrected;
step six: calculating, based on the speech similarity ps_{i,j} and the visual similarity vs_{i,j}, the similarity sim_{i,j} between the Chinese character x_i and the j-th candidate Chinese character at the i-th position in the sentence to be corrected; calculating, based on the similarity sim_{i,j} and the normalized confidence p̂_{i,j}, the comprehensive weight w_{i,j} of the j-th candidate Chinese character at the i-th position in the sentence to be corrected; and calculating, according to the comprehensive weight w_{i,j}, the corrected Chinese character y_i at the i-th position in the sentence to be corrected.
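Step one of the claim above can be sketched as follows. The tiny word list and the tag names "[CLS]" and "[SEP]" are assumptions for illustration — the patent does not name its start/end tags, though these are the conventional BERT tokens.

```python
# Sketch of step one: split the sentence into single characters, map each
# through a word list to a vocabulary index, and wrap the sequence with
# start/end tags. The vocabulary and tag names below are illustrative.

vocab = {"[CLS]": 0, "[SEP]": 1, "头": 2, "痛": 3, "疼": 4}

def label_sequence(sentence):
    chars = list(sentence)                  # unit: single Chinese character
    ids = [vocab[c] for c in chars]         # word-list mapping
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

print(label_sequence("头痛"))  # [0, 2, 3, 1]
```

The resulting integer sequence is what step two feeds to the BERT pre-trained Chinese language model.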
2. The method of Chinese spelling error correction for medical contexts of claim 1, wherein in step two, inputting the Chinese character label sequence T into the BERT pre-trained Chinese language model to obtain the context information feature H specifically means calculating:

H = BERT(T)

wherein BERT(·) represents the feature extraction operation of the BERT pre-trained Chinese language model.
3. The method of Chinese spelling error correction for medical contexts of claim 1, wherein in step two, performing dimension conversion on the context information feature H to obtain the confidence prediction P specifically means calculating:

P = Linear(H)

wherein Linear(·) represents a linear transformation operation, and the confidence prediction P has, at each character position, the dimension of the vocabulary.
4. The method of Chinese spelling error correction for medical contexts of claim 1, wherein the normalized confidence p̂_{i,j} in step three is calculated as:

p̂_{i,j} = p_{i,j} / Σ_{k=1}^{n} p_{i,k}

wherein p_{i,j} represents the j-th largest value when the vector of the Chinese character confidence prediction P_i is sorted from large to small.
5. The method of Chinese spelling error correction for medical contexts of claim 1, wherein step four specifically comprises: the pinyin sequence of each Chinese character is formed from the pinyin and tone codes of that Chinese character; the pinyin sequence of the i-th Chinese character x_i in the sentence to be corrected is defined as py_i; the speech similarity ps_{i,j} between x_i and the j-th candidate Chinese character at the i-th position in the sentence to be corrected is calculated based on the edit distance algorithm as:

ps_{i,j} = 1 − ED(py_i, py_{i,j}) / max(|py_i|, |py_{i,j}|)

wherein idx_{i,j} represents the vocabulary index of the j-th candidate Chinese character at the i-th position in the sentence to be corrected, D(·) represents the decoding function that converts a vocabulary index into the corresponding Chinese character, c_{i,j} = D(idx_{i,j}) represents the j-th candidate Chinese character at the i-th position, py_{i,j} represents the pinyin sequence of c_{i,j}, ED(·, ·) represents the edit distance calculation function, |·| represents the sequence length, and max(·, ·) represents the maximizing function.
6. The method of Chinese spelling error correction for medical contexts of claim 1, wherein step five specifically comprises: the ideographic description sequence of the i-th Chinese character x_i in the sentence to be corrected is defined as ids_i; the visual similarity vs_{i,j} between x_i and the j-th candidate Chinese character at the i-th position in the sentence to be corrected is calculated based on the edit distance algorithm as:

vs_{i,j} = 1 − ED(ids_i, ids_{i,j}) / max(|ids_i|, |ids_{i,j}|)

wherein idx_{i,j} represents the vocabulary index of the j-th candidate Chinese character at the i-th position in the sentence to be corrected, D(·) represents the decoding function that converts a vocabulary index into the corresponding Chinese character, c_{i,j} = D(idx_{i,j}) represents the j-th candidate Chinese character at the i-th position, ids_{i,j} represents the ideographic description sequence of c_{i,j}, ED(·, ·) represents the edit distance calculation function, |·| represents the sequence length, and max(·, ·) represents the maximizing function.
7. The method of Chinese spelling error correction for medical contexts of claim 6, wherein the ideographic description sequence is constructed as follows:

splitting each Chinese character in units of single characters to obtain internal character-forming components; for a Chinese character that cannot be completely split into single characters, combining the remaining strokes with the nearest single character into one internal character-forming component;

continuing to split each internal character-forming component in the order of the Chinese character writing rules until individual strokes are obtained;

constructing, according to the splitting order, a tree-structured ideographic description tree of the Chinese character, wherein the root node of the ideographic description tree is the structural information code describing the relative positions of the internal character-forming components obtained by the first split, the leaf nodes are the stroke codes of individual strokes, and the intermediate nodes are the structural information codes describing the relative positions of internal character-forming components or strokes;

the ideographic description sequence of the Chinese character is the sequence obtained by traversing the ideographic description tree.
8. The method of Chinese spelling error correction for medical contexts of claim 7, wherein traversing the ideographic description tree specifically means traversing the ideographic description tree in preorder.
9. The method of Chinese spelling error correction for medical contexts of claim 1, wherein step six comprises: calculating the similarity sim_{i,j} between the i-th Chinese character x_i in the sentence to be corrected and the j-th candidate Chinese character at the i-th position in the sentence to be corrected:

sim_{i,j} = λ · ps_{i,j} + (1 − λ) · vs_{i,j}

wherein λ is an adjustment factor for adjusting the speech similarity and the visual similarity;

combining the similarity sim_{i,j} and the normalized confidence p̂_{i,j} to obtain the comprehensive weight w_{i,j} of the j-th candidate Chinese character at the i-th position in the sentence to be corrected;

the corrected Chinese character y_i at the i-th position in the sentence to be corrected is then:

y_i = D(argmax_j w_{i,j})

wherein argmax_j represents the function that selects the vocabulary index of the maximum comprehensive weight, and D(·) represents the decoding function that converts the vocabulary index into the corresponding Chinese character.
CN202410120343.5A 2024-01-29 2024-01-29 Chinese spelling error correction method suitable for medical context Active CN117648923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410120343.5A CN117648923B (en) 2024-01-29 2024-01-29 Chinese spelling error correction method suitable for medical context


Publications (2)

Publication Number Publication Date
CN117648923A (en) 2024-03-05
CN117648923B (en) 2024-05-10

Family

ID=90045479


Country Status (1)

Country Link
CN (1) CN117648923B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310443A (en) * 2020-02-12 2020-06-19 新华智云科技有限公司 Text error correction method and system
CN112530597A (en) * 2020-11-26 2021-03-19 山东健康医疗大数据有限公司 Data table classification method, device and medium based on Bert character model
CN113657098A (en) * 2021-08-24 2021-11-16 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN113673228A (en) * 2021-09-01 2021-11-19 阿里巴巴达摩院(杭州)科技有限公司 Text error correction method, text error correction device, computer storage medium and computer program product
CN113935317A (en) * 2021-09-26 2022-01-14 平安科技(深圳)有限公司 Text error correction method and device, electronic equipment and storage medium
CN114881006A (en) * 2022-03-30 2022-08-09 医渡云(北京)技术有限公司 Medical text error correction method and device, storage medium and electronic equipment
CN115081430A (en) * 2022-05-24 2022-09-20 中国科学院自动化研究所 Chinese spelling error detection and correction method and device, electronic equipment and storage medium
CN115114919A (en) * 2021-03-19 2022-09-27 富士通株式会社 Method and device for presenting prompt information and storage medium
CN115862040A (en) * 2022-12-12 2023-03-28 杭州恒生聚源信息技术有限公司 Text error correction method and device, computer equipment and readable storage medium
CN116522905A (en) * 2023-07-03 2023-08-01 腾讯科技(深圳)有限公司 Text error correction method, apparatus, device, readable storage medium, and program product

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11593560B2 (en) * 2020-10-21 2023-02-28 Beijing Wodong Tianjun Information Technology Co., Ltd. System and method for relation extraction with adaptive thresholding and localized context pooling



Similar Documents

Publication Publication Date Title
CN110489760A (en) Based on deep neural network text auto-collation and device
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN113779972B (en) Speech recognition error correction method, system, device and storage medium
CN106776548A (en) A kind of method and apparatus of the Similarity Measure of text
CN111985234B (en) Voice text error correction method
CN113283236A (en) Entity disambiguation method in complex Chinese text
CN114386399A (en) Text error correction method and device
CN116910272B (en) Academic knowledge graph completion method based on pre-training model T5
CN111428104A (en) Epilepsy auxiliary medical intelligent question-answering method based on viewpoint type reading understanding
JP2020030367A (en) Voice recognition result formatted model learning device and its program
CN117648923B (en) Chinese spelling error correction method suitable for medical context
CN114372140A (en) Layered conference abstract generation model training method, generation method and device
CN114863948A CTC-Attention architecture-based reference text related pronunciation error detection model
CN114548053A (en) Text comparison learning error correction system, method and device based on editing method
CN114511084A (en) Answer extraction method and system for automatic question-answering system for enhancing question-answering interaction information
US20240346950A1 (en) Speaking practice system with redundant pronunciation correction
CN111046663B (en) Intelligent correction method for Chinese form
CN106339367B (en) A kind of Mongolian auto-correction method
US11817079B1 (en) GAN-based speech synthesis model and training method
CN117852528A (en) Error correction method and system of large language model fusing rich semantic information
BE1022627B1 (en) Method and device for automatically generating feedback
CN111274826A (en) Semantic information fusion-based low-frequency word translation method
CN116956944A (en) Endangered language translation model method integrating syntactic information
CN113486160B (en) Dialogue method and system based on cross-language knowledge
CN110955768B (en) Question-answering system answer generation method based on syntactic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant