Background technology
A Chinese whole-sentence input method (hereinafter "whole-sentence input method") enters Chinese characters one sentence at a time, so that the user can enter a whole sentence without having to select each individual character while typing. A whole-sentence input method usually uses a Chinese language model (hereinafter "language model") to predict the characters most likely intended in the sentence. That is, the language model decodes the user's input string into the most probable Chinese character string. This process is also called the "decoding process" or "search process". The user's input string may represent the pronunciation of the characters (such as pinyin) or the way they are written (such as strokes).
In a Chinese whole-sentence input method, two kinds of errors commonly occur: 1. errors the user makes while typing (such as an incorrect insertion, deletion, substitution, or transposition), called user errors for short; 2. errors the whole-sentence input method makes when decoding the input string (such as segmenting it incorrectly or choosing a character the user did not intend), called system errors or decoding errors for short.
1. User errors: user errors fall into the following two classes, described in turn below.
(1) Fuzzy-sound errors: these occur mostly among users with a strong accent or dialect, particularly people from southern China. Typical fuzzy-sound errors confuse similar-sounding initials or finals, for example "z" with "zh" or "an" with "ang", and the like.
Existing input methods generally handle fuzzy-sound errors through an option. The Ziguang input method, for example, provides a fuzzy-sound input option; when the user enables it, the input method performs fuzzy matching on the pinyin string the user types. For instance, the user can type "zong" to enter "in" (中, properly spelled "zhong") and "zhong" to enter "always" (总, properly spelled "zong"). Fuzzy-matched candidates are listed alongside the exactly matched candidates.
On devices with limited computing and storage resources, such as a PDA (Personal Digital Assistant), this approach presents too many candidate characters while the user types, which reduces input accuracy and speed.
(2) Input errors: an input error here means specifically a wrong letter or stroke entered while typing. These include:
Entering an extra letter or stroke. For example, typing "he" as "hei", or the stroke sequence "a Pie Dian" as "a Pie Dian Dian";
Omitting a letter or stroke. For example, typing "hei" as "he", or "a Pie Dian Dian" as "a Pie Dian";
Entering a wrong letter or stroke. For example, typing "nai" as "bai", or "a Shu Pie Dian" as "Pie Dian one by one".
Of course, more than one such error may occur while entering a whole sentence (for example typing "bin" as "ping", or "Shu one one by one" as "Shu Shu Shu one by one").
Existing input methods, such as the Ziguang input method, let the user edit the typed pinyin string at any position. They display the whole pinyin string the user has typed in one go in the input box and let the user move the cursor to the desired position with the arrow keys. At the cursor position the user can correct the input error, for example by adding an omitted letter or deleting a wrong or extra one.
For devices with low display resolution, such as a PDA, however, displaying the whole pinyin string typed in one go is impractical. Moreover, on a touch-screen device, tapping a position with the stylus is much easier than moving to it with the arrow keys.
2. System errors: system errors comprise system segmentation errors and system decoding errors, described in turn below.
(1) Segmentation errors: segmentation errors depend closely on the choice of segmentation algorithm. For example, without user intervention a maximum-matching algorithm cannot decode "Henan" from "henan", and a minimum-matching algorithm cannot decode "safety" from "pingan". In general, the segmentation errors produced by a whole-sentence input method fall into the following classes:
One word is split into two. For example, "elder generation" (xian) is split into "Xi'an" (xi'an);
Two words are merged into one. For example, "Xi'an" (xi'an) is merged into "elder generation" (xian);
Two words are mis-split into two different words. For example, "Henan" (he'nan) is mis-split into "very peace" (hen'an).
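The ambiguity behind these three classes can be illustrated with a minimal sketch that enumerates every way to split an unseparated pinyin string into syllables. The syllable set `SYLLABLES` and the function name `segmentations` are assumptions made for illustration; an actual input method would use the full pinyin syllable table.

```python
from typing import List

# Toy syllable inventory (an assumption); a real system uses the full pinyin table.
SYLLABLES = {"xi", "an", "xian", "he", "hen", "nan", "ping", "pin", "gan"}

def segmentations(s: str) -> List[List[str]]:
    """Return every way to split s into syllables drawn from SYLLABLES."""
    if not s:
        return [[]]
    results = []
    for end in range(1, len(s) + 1):
        head = s[:end]
        if head in SYLLABLES:
            # Recurse on the remainder and prepend the syllable just matched.
            for tail in segmentations(s[end:]):
                results.append([head] + tail)
    return results

print(segmentations("xian"))   # [['xi', 'an'], ['xian']]
print(segmentations("henan"))  # [['he', 'nan'], ['hen', 'an']]
```

Each string has more than one valid split, so a fixed strategy such as maximum or minimum matching will pick the wrong one for some inputs.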
Taking the Ziguang input method on the PC as an example of handling segmentation errors: it uses "'" as the pinyin separator and lets the user edit the input string manually to fix a wrong segmentation. The user must, however, first delete the incorrect pinyin separator and then insert a separator at the correct position. As noted above, this method is likewise unsuited to small-screen devices such as PDAs.
Microsoft's whole-sentence input method handles segmentation errors and input errors in a different way. It lets the user move the cursor directly onto the wrong character in the sentence and edit that character, rather than edit the pinyin string. At the cursor position the user must first delete the wrong character and then retype the correct pinyin. The drawback of this method is that correcting a small input error (such as an omitted letter "g") takes about as long as correcting a large one (such as changing "bin" to "ping").
(2) Word-selection errors: because a language model is generally never 100% accurate, and because polyphonic characters and homophones are common in Chinese, the sentence that a language-model-based whole-sentence input method produces when decoding does not always consist entirely of the characters the user intended. We call this kind of decoding error a word-selection error. For example, the user types the pinyin of "I will buy machine", and the result may be "I will sell and".
For word-selection errors, existing whole-sentence input methods usually offer the following procedure: 1) the user selects the wrong character in the sentence; 2) the input method displays the other possible candidate characters; 3) the user selects the intended character from them, and the input method updates the selected character. When several characters are wrong, the user must repeat these steps until every character is correct. In general, the more wrong characters a sentence contains, the longer manual correction takes.
Some whole-sentence input methods also correct other wrong characters automatically on the basis of one manual correction made by the user. For example, in the Microsoft Pinyin input method, when the user changes the character "reaching" in the word "and" to "machine", the input method automatically changes the second character "its" to "device". This approach, however, only ever modifies the characters inside a single word automatically; it cannot affect or automatically correct characters outside that word.
Summary of the invention
The object of the invention is to propose a Chinese whole-sentence input method that overcomes the cumbersome error correction of existing Chinese whole-sentence input methods and makes Chinese input faster and easier.
The Chinese whole-sentence input method proposed by the invention comprises the following steps:
(1) decoding the pinyin string or stroke string entered by the user with a Chinese language model to obtain a text string;
(2) having the user confirm the decoded text string: if the user confirms the decoding result, it is the correct Chinese sentence; if not, the user modifies any one error in the text string, and a local correction is then made according to the user's modification;
(3) decoding the whole-sentence pinyin string or stroke string after the above modification and correction, such that the characters the user has modified remain unchanged in the decoded text string, the decoding method comprising the following steps:
(a) generating a lexical tree according to the language model, setting up one empty search path and storing it in a path array, with the path's lexical-tree pointer pointing to the root of the lexical tree;
(b) searching the corrected whole-sentence pinyin string or stroke string in left-to-right order;
(c) taking one search path out of the path array: if the user has fixed a character for the current pinyin or strokes, replacing the path taken out with one new path; if the user has not fixed a character, replacing the path taken out with one or more new paths according to the lexical tree; and at the same time updating the lexical-tree pointer and search information in each new path;
(d) examining each new path obtained by the above replacement: if it has reached a leaf of the lexical tree, computing, according to the above language model, the cumulative log probability of all the words in the path by the following formulas:
$$\text{cumulative log probability} = \text{previous cumulative log probability} + \text{current log probability}$$
$$\text{log probability} = \ln P(w_n \mid w_{n-2}, w_{n-1})$$
and then pointing the lexical-tree pointer back to the root of the lexical tree,
where $w_n$ denotes the word in the lexical-tree leaf that the path points to, corresponding to the current word in the decoded text; $w_{n-1}$ and $w_{n-2}$ denote the two words preceding the current word in the decoded text; and $P(w_n \mid w_{n-2}, w_{n-1})$ denotes the probability of occurrence of the word string $(w_{n-2}, w_{n-1}, w_n)$, that is, the probability of $w_n$ given the two preceding words;
(e) repeating steps (c) and (d) until every path in the path array has been replaced;
(f) sorting all the new paths from high to low by the computed cumulative log probability and, according to the capacity of the path array, keeping the paths with the highest cumulative log probability in the path array;
(g) repeating steps (b) to (f) until the whole-sentence pinyin string or stroke string has been processed;
(4) repeating steps (2) and (3) until the Chinese sentence confirmed by the user is obtained.
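The decoding of step (3) can be sketched as a small beam search over a syllable trie. This is a minimal illustration under stated assumptions, not the claimed implementation: the lexicon `LEXICON`, its probabilities, and the names `build_tree`, `log_prob` and `decode` are all hypothetical, and `log_prob` ignores the word history where the method itself uses $P(w_n \mid w_{n-2}, w_{n-1})$. The argument `locked` maps a syllable index to the character the user has fixed there.

```python
import math

# Toy lexicon (an assumption): word -> (syllables, illustrative probability).
LEXICON = {
    "中": (("zhong",), 0.05), "忠": (("zhong",), 0.01),
    "国": (("guo",), 0.02), "过": (("guo",), 0.04),
    "中国": (("zhong", "guo"), 0.03),
    "人": (("ren",), 0.05), "民": (("min",), 0.01),
    "人民": (("ren", "min"), 0.03),
}

def build_tree(lexicon):
    """Build a syllable trie; each node lists the words whose pinyin ends there."""
    root = {"children": {}, "words": []}
    for word, (syllables, _) in lexicon.items():
        node = root
        for s in syllables:
            node = node["children"].setdefault(s, {"children": {}, "words": []})
        node["words"].append(word)
    return root

def log_prob(word, history):
    """Stand-in for ln P(w_n | w_{n-2}, w_{n-1}); this sketch ignores history."""
    return math.log(LEXICON[word][1])

def decode(syllables, locked=None, beam=5):
    """Decode syllables into words, honoring characters fixed by the user."""
    locked = locked or {}
    root = build_tree(LEXICON)
    paths = [((), root, 0.0)]  # (words so far, tree pointer, cumulative log prob)
    for i, syl in enumerate(syllables):
        new_paths = []
        for words, node, score in paths:
            child = node["children"].get(syl)
            if child is None:
                continue  # this path cannot consume the syllable
            if child["children"]:
                new_paths.append((words, child, score))  # stay inside a longer word
            for w in child["words"]:
                start = i - len(w) + 1  # one character per syllable
                if any(locked.get(start + k, ch) != ch for k, ch in enumerate(w)):
                    continue  # conflicts with a character the user fixed
                # Leaf reached: add the log probability and reset the pointer to the root.
                new_paths.append((words + (w,), root, score + log_prob(w, words[-2:])))
        new_paths.sort(key=lambda p: p[2], reverse=True)
        paths = new_paths[:beam]  # keep only the highest-scoring paths
    complete = [p for p in paths if p[1] is root]
    return list(complete[0][0]) if complete else None

print(decode(["zhong", "guo", "ren", "min"]))                   # ['中国', '人民']
print(decode(["zhong", "guo", "ren", "min"], locked={1: "过"}))  # ['中', '过', '人民']
```

Fixing one character changes which paths survive, so the rest of the sentence is re-decoded around the user's choice rather than corrected character by character.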
The errors in the text string in the above method are user errors or system errors; the user errors are fuzzy-sound errors or input errors, and the system errors are segmentation errors or word-selection errors.
The user can modify an error in the above text string and have it locally corrected in four ways.
The first is modification and local correction of a fuzzy-sound error, comprising the following steps:
(1) the user selects the wrong character, and the pinyin corresponding to that character is displayed;
(2) the user selects the correct pinyin from the fuzzy-sound menu corresponding to the pinyin of the wrong character.
The second is modification and local correction of an input error, comprising the following steps:
(1) the user selects the wrong character, and the pinyin or strokes corresponding to that character are displayed;
(2) the user edits the pinyin or strokes so that they become the pinyin or strokes of the correct character.
The third is modification and local correction of a segmentation error, comprising the following steps:
(1) the user selects two or more adjacent wrong characters;
(2) the position of the separator is moved within the continuous pinyin string of the wrong characters the user selected, so that it becomes the continuous pinyin string of the correct characters.
The fourth is modification and local correction of a word-selection error, comprising the following steps:
(1) the user selects the wrong character, and all other characters with the same pinyin or strokes as that character are displayed;
(2) the user selects the correct character from among them.
The advantages of the Chinese whole-sentence input method proposed by the invention are:
1. for users who cannot spell pinyin accurately because of accent or dialect habits, the invention provides a fast, automatic way to reselect a fuzzy sound;
2. the invention provides a convenient way for the user to fix input errors made while typing;
3. the invention provides a convenient, automatic re-segmentation correction for segmentation errors caused by system decoding;
4. the invention provides a correction method of quickly selecting a candidate for word-selection errors caused by system decoding;
5. the invention makes full use of the information contained in each user modification and re-decodes automatically, so that other likely errors in the sentence are quickly corrected as well, improving the efficiency and accuracy of error correction.
Embodiment
The flow of the Chinese whole-sentence input method proposed by the invention is shown in Fig. 1. First, the pinyin string or stroke string entered by the user is decoded with a Chinese language model to obtain a text string. The user then confirms the decoded text string: if the user confirms the decoding result, it is the correct Chinese sentence; if not, the user modifies any one error in the text string, and a local correction is made according to the user's modification. The whole-sentence pinyin string or stroke string after modification and correction is then decoded again, with the characters the user has modified kept unchanged in the decoded text string. This process is repeated until the Chinese sentence confirmed by the user is obtained.
The errors in the text string in the above method are user errors or system errors; the user errors are fuzzy-sound errors or input errors, and the system errors are segmentation errors or word-selection errors.
The user can modify an error in the above text string and have it locally corrected in four ways.
The first is modification and local correction of a fuzzy-sound error, comprising the following steps:
(1) the user selects the wrong character, and the pinyin corresponding to that character is displayed;
(2) the user selects the correct pinyin from the fuzzy-sound menu corresponding to the pinyin of the wrong character.
The second is modification and local correction of an input error, comprising the following steps:
(1) the user selects the wrong character, and the pinyin or strokes corresponding to that character are displayed;
(2) the user edits the pinyin or strokes so that they become the pinyin or strokes of the correct character.
The third is modification and local correction of a segmentation error, comprising the following steps:
(1) the user selects two or more adjacent wrong characters;
(2) the position of the separator is moved within the continuous pinyin string of the wrong characters the user selected, so that it becomes the continuous pinyin string of the correct characters.
The fourth is modification and local correction of a word-selection error, comprising the following steps:
(1) the user selects the wrong character, and all other characters with the same pinyin or strokes as that character are displayed;
(2) the user selects the correct character from among them.
In the above method, the flow for decoding the whole-sentence pinyin string or stroke string after modification and correction is shown in Fig. 6. A lexical tree is generated according to the language model; one empty search path is set up and stored in the path array, with its lexical-tree pointer pointing to the root of the lexical tree. The corrected whole-sentence pinyin string or stroke string is searched in left-to-right order. A search path is taken out of the path array: if the user has fixed a character for the current pinyin or strokes, the path taken out is replaced with one new path; if the user has not fixed a character, the path taken out is replaced with one or more new paths according to the lexical tree; at the same time the lexical-tree pointer and search information in each new path are updated. Each new path obtained by replacement is examined: if it has reached a leaf of the lexical tree, the cumulative log probability of all the words in the path is computed, according to the above language model, by the following formulas:
$$\text{cumulative log probability} = \text{previous cumulative log probability} + \text{current log probability}$$
$$\text{log probability} = \ln P(w_n \mid w_{n-2}, w_{n-1})$$
after which the lexical-tree pointer is pointed back to the root of the lexical tree. In the formula, $w_n$ denotes the word in the lexical-tree leaf that the path points to, corresponding to the current word in the decoded text; $w_{n-1}$ and $w_{n-2}$ denote the two words preceding the current word in the decoded text; and $P(w_n \mid w_{n-2}, w_{n-1})$ denotes the probability of occurrence of the word string $(w_{n-2}, w_{n-1}, w_n)$, that is, the probability of $w_n$ given the two preceding words, estimated by existing methods.
The above is repeated until every path in the path array has been replaced. All the new paths are then sorted from high to low by the computed cumulative log probability and, according to the capacity of the path array, the paths with the highest cumulative log probability are kept in the path array. The whole process is repeated until the whole-sentence pinyin string or stroke string has been processed.
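Unrolling the recurrence makes the ranking criterion explicit. Assuming the cumulative log probability of the empty path is 0, and writing $S_N$ for its value after $N$ words have been decoded (a symbol introduced here only for illustration, with the first two words assumed to be conditioned on sentence-start padding):

$$S_N = \sum_{n=1}^{N} \ln P(w_n \mid w_{n-2}, w_{n-1}) = \ln \prod_{n=1}^{N} P(w_n \mid w_{n-2}, w_{n-1})$$

Because $\ln$ is monotonically increasing, keeping the paths with the highest $S_N$ is equivalent to keeping the word sequences with the highest product of trigram probabilities, while summing logarithms avoids multiplying many small numbers.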
The decoding of the modified and corrected whole-sentence pinyin string or stroke string in the method of the invention is illustrated below with a pinyin-based Chinese whole-sentence input method. An example lexical tree is shown in Fig. 7. It is organized by syllable, where Ф denotes the empty symbol, which requires no pinyin to be matched. Following the arrows in the figure from the root to a leaf yields a pinyin string; this pinyin string corresponds to the word stored in that leaf and is matched against the pinyin string typed by the user. The words contained in all the leaves make up the vocabulary of the language model.
The search procedure is illustrated with the user input "zhong guo ren min", in which the user has fixed "guo" as "state".
(1) When the search begins there is only one empty path, whose lexical-tree pointer points to the root of the lexical tree, and the program variable recording the cumulative log probability is cleared to 0.
(2) The corrected whole-sentence pinyin string or stroke string is searched in left-to-right order according to steps (3) and (4).
(3) A search path is taken out of the path array: if the user has fixed a character for the current pinyin or strokes, the path taken out is replaced with one new path; if the user has not fixed a character, the path taken out is replaced with one or more new paths according to the lexical tree; at the same time the lexical-tree pointer and search information in each new path are updated. For example: (a) if the lexical-tree pointer in the current path points to the root and the first pinyin "zhong" is being matched, the branches under the root are examined. Only one branch matches the user's input "zhong"; following that branch, there are in addition two branches that reach a leaf node without matching any syllable (Ф). The path is therefore replaced with three new paths: the lexical-tree pointer of the first points to the node labeled "zhong", that of the second to the leaf node "in", and that of the third to the leaf node "loyalty". (b) If the lexical-tree pointer in the current path points to the node labeled "zhong" and the input now being matched is the syllable the user has fixed as "state", the leaves under the successors of the "zhong" node are checked for a match with "state". Of the five leaves "China", "center", "loyalty", "in" and "loyalty", only one can match "state", so the path is replaced with a single new path whose lexical-tree pointer moves to the node labeled "guo"; since "guo" is followed by a Ф, the pointer is moved directly to the leaf node "China".
(4) Each new path obtained by the above replacement is examined: if it has reached a leaf of the lexical tree, the cumulative log probability of all the words in the path is computed, according to the above language model, by the following formulas:
$$\text{cumulative log probability} = \text{previous cumulative log probability} + \text{current log probability}$$
$$\text{log probability} = \ln P(w_n \mid w_{n-2}, w_{n-1})$$
after which the lexical-tree pointer is pointed back to the root of the lexical tree. In the formula, $w_n$ denotes the word in the lexical-tree leaf that the path points to, corresponding to the current word in the decoded text; $w_{n-1}$ and $w_{n-2}$ denote the two words preceding the current word in the decoded text; and $P(w_n \mid w_{n-2}, w_{n-1})$ denotes the probability of occurrence of the word string $(w_{n-2}, w_{n-1}, w_n)$, that is, the probability of $w_n$ given the two preceding words, estimated by existing methods. For example, suppose a path has been processed up to the second syllable, its lexical-tree pointer points to the leaf node "state", the preceding word kept by the path is "in", and its cumulative log probability is -2.99. If the probability of the word string "in, state" is P(state | in) = 0.1, then the cumulative log probability = -2.99 + ln 0.1 = (-2.99) + (-2.30) = -5.29.
(5) Steps (3) and (4) are repeated until every path in the path array has been replaced.
(6) All the new paths are sorted from high to low by the computed cumulative log probability and, according to the capacity of the path array, the paths with the highest cumulative log probability are kept in the path array. For example, when the search reaches the pinyin "ren", suppose the following paths and corresponding cumulative log probabilities have been obtained:
Path (a): "in, state, people", -8.34
Path (b): "in, state, ren", -5.29
Path (c): "loyalty, state, people", -10.56
Path (d): "loyalty, state, ren", -8.60
Path (e): "China, people", -5.10
Path (f): "China, ren", -3.78
Path (g): "Chinese", -4.90
If the capacity of the array is 5, paths (f), (g), (e), (b) and (a) are kept in the path array, since -8.34 for path (a) ranks above -8.60 for path (d).
(7) Steps (2) to (6) are repeated until the whole-sentence pinyin string has been processed, finally yielding "Chinese people".
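The arithmetic of steps (4) and (6) can be checked with a short sketch. The scores are the ones listed for paths (a) to (g) in the example; the function names `accumulate` and `prune` are hypothetical and chosen only for illustration.

```python
import math

# Cumulative log probabilities of paths (a) to (g) from the worked example.
paths = {
    "a": -8.34, "b": -5.29, "c": -10.56, "d": -8.60,
    "e": -5.10, "f": -3.78, "g": -4.90,
}

def accumulate(previous, probability):
    """Cumulative log probability after adding a word with the given probability."""
    return previous + math.log(probability)

def prune(scored, capacity):
    """Return the labels of the `capacity` highest-scoring paths, best first."""
    ranked = sorted(scored, key=scored.get, reverse=True)
    return ranked[:capacity]

print(round(accumulate(-2.99, 0.1), 2))  # -5.29
print(prune(paths, 5))                   # ['f', 'g', 'e', 'b', 'a']
```

By these scores the fifth slot goes to path (a) at -8.34, which ranks above path (d) at -8.60.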
On mobile devices such as PDAs and mobile phones, the four methods of modification and local correction described above require a user interface with a display area at least 6 characters wide (the width of the longest pinyin syllable, such as "zhuang"), as shown in Fig. 2.
Embodiments of the invention are described below with reference to the accompanying drawings.
Fig. 3 shows an embodiment of modifying a fuzzy-sound error and performing local correction:
The user selects the wrong character, and the pinyin corresponding to that character is displayed; the user then selects the correct pinyin from the fuzzy-sound menu corresponding to the pinyin of the wrong character.
For example, the user mistakenly types "yizangpiao", and the system decodes it as "hundred million hide ticket". The user then selects the character "hide", and "zang" is displayed in the pinyin area. The user next presses and holds the pinyin area, whereupon a floating menu pops up with the three options "zan", "zhan" and "zhang". After the user selects "zhang", the system updates the original pinyin string to "yizhangpiao". In a system with automatic re-decoding of the pinyin string, the system decodes the updated pinyin string again and obtains the Chinese character string "ticket".
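One way such a fuzzy-sound menu could be generated is sketched below. The fuzzy pairs in `FUZZY_INITIALS` and `FUZZY_FINALS` and the function names are assumptions made for illustration; the set of confusable sounds a real system offers may differ, and a real system would also discard variants that are not valid syllables.

```python
from itertools import product

# Assumed fuzzy pairs; illustrative, not an exhaustive or authoritative list.
FUZZY_INITIALS = [("z", "zh"), ("c", "ch"), ("s", "sh")]
FUZZY_FINALS = [("an", "ang"), ("en", "eng"), ("in", "ing")]

def split_syllable(syllable):
    """Split a syllable into (initial, final) for the initials handled here."""
    for initial in ("zh", "ch", "sh", "z", "c", "s"):
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable

def alternatives(part, pairs):
    """Return the part itself plus its fuzzy counterpart, if it has one."""
    for a, b in pairs:
        if part == a:
            return [a, b]
        if part == b:
            return [b, a]
    return [part]

def fuzzy_menu(syllable):
    """List the fuzzy variants of a syllable, excluding the syllable itself."""
    initial, final = split_syllable(syllable)
    variants = {i + f for i, f in product(alternatives(initial, FUZZY_INITIALS),
                                          alternatives(final, FUZZY_FINALS))}
    variants.discard(syllable)
    return sorted(variants)

print(fuzzy_menu("zang"))  # ['zan', 'zhan', 'zhang']
print(fuzzy_menu("sui"))   # ['shui']
```

For "zang" this yields the same three options as the floating menu in the example.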
Fig. 4 shows an embodiment of modifying an input error and performing local correction:
The user selects the wrong character, and the pinyin or strokes corresponding to that character are displayed; the user edits the pinyin or strokes so that they become the pinyin or strokes of the correct character.
For example, the user mistakenly types "nirushuo", and the system decodes it as "you are as saying". The user selects the character "you", and the system displays "ni" in the pinyin area. The user then taps this pinyin string to make it editable, changes "ni" to "bi", and taps the pinyin string again to complete the modification. Because "bi" is a valid pinyin string, the system updates the original pinyin string to "birushuo". In a system with automatic re-decoding of the pinyin string, the system decodes the updated pinyin string again and obtains the Chinese character string "such as".
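The validity check and string update in this example can be sketched as follows. The syllable set `VALID_SYLLABLES` is a toy subset and the function name `correct_syllable` is hypothetical; both are assumptions made for illustration.

```python
# Toy subset of the pinyin syllable table (an assumption).
VALID_SYLLABLES = {"ni", "bi", "ru", "shuo"}

def correct_syllable(syllables, index, new_syllable):
    """Return the updated pinyin string, or None if the edit is not valid pinyin."""
    if new_syllable not in VALID_SYLLABLES:
        return None  # reject the edit; the original string stays in effect
    updated = list(syllables)
    updated[index] = new_syllable
    return "".join(updated)

print(correct_syllable(["ni", "ru", "shuo"], 0, "bi"))  # birushuo
print(correct_syllable(["ni", "ru", "shuo"], 0, "bk"))  # None
```

Only the syllable under the selected character is edited, so the cost of the fix is proportional to the size of the typing error.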
Fig. 5 shows an embodiment of modifying a segmentation error and performing local correction:
The user selects two or more adjacent wrong characters; the position of the separator is moved within the continuous pinyin string of the wrong characters, so that it becomes the continuous pinyin string of the correct characters.
For example, the user's input string is "henansheng"; the system mis-segments it as "hen'an'sheng" and outputs the Chinese character string "very peaceful". The user selects "very peace" with the stylus. The system obtains the pinyin string of the left character, "hen", and that of the right character, "an", and joins the two with a pinyin separator in between to form one string, "hen'an". The system uses a sub-algorithm to determine whether this string can be segmented in another way. The sub-algorithm moves the pinyin separator one position to the left of its current position, so that the pinyin string becomes "he'nan", and then decodes this string with the local decoding algorithm. The decoding is now correct, and the decoded result is returned to the system. The system changes the original pinyin string to "he'nan'sheng". In a system with automatic re-decoding of the pinyin string, the input string is decoded again to obtain "Henan Province".
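The separator-moving sub-algorithm can be sketched as below. The text only describes a move of one position to the left; trying one position to the right as well is an assumption of this sketch, as are the toy syllable set `SYLLABLES` and the function name `alternative_splits`.

```python
# Toy syllable inventory (an assumption); a real system uses the full pinyin table.
SYLLABLES = {"he", "hen", "an", "nan", "sheng"}

def alternative_splits(left, right):
    """Move the separator one position left or right and keep the valid splits."""
    joined = left + right
    current = len(left)  # separator currently sits after `left`
    found = []
    for shift in (-1, 1):  # the rightward move is an assumption of this sketch
        cut = current + shift
        if 0 < cut < len(joined):
            a, b = joined[:cut], joined[cut:]
            if a in SYLLABLES and b in SYLLABLES:
                found.append((a, b))
    return found

print(alternative_splits("hen", "an"))  # [('he', 'nan')]
```

Each split returned would then be passed to the local decoding algorithm to see whether it yields the intended word.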
A comprehensive example of modifying several kinds of errors with local correction
For example, after the user types the pinyin string "suijienishuijiaoshichishuijiao", the system's erroneous decoding result is "a year muddy water teacher sleeping and eating are felt".
The user selects the first character of the sentence, "year", and asks for fuzzy-sound selection; the system offers "sui" and "shui", and the user selects "shui". The system locks this modification and automatically re-decodes, giving "hydrolysis muddy water teacher sleeping and eating feel". The user then selects the first character; the system lists all characters with the same pinyin "shui", from which the user fixes the character "who". The system locks this modification and automatically re-decodes, giving "whose muddy water teacher sleeping and eating feel".
After the user corrects the pinyin of the second character to "jiao", the system locks the modification and automatically re-decodes, giving "who makes you sleeping and eating in bed feel".
The user then changes the last character of the sentence from "feel" to "dumpling"; the system locks the modification and re-decodes the whole pinyin string, and the output is the correct Chinese character string "who makes you eat boiled dumplings in bed".
Thus, although the sentence contained 8 wrong characters in all, the user needs only four modifications to obtain the correct sentence. If there are no fuzzy-sound errors or input errors, two modifications suffice to correct the 8 errors, which greatly improves efficiency.