CN102543071A

CN102543071A - Voice recognition system and method used for mobile equipment

Info

Publication number: CN102543071A
Application number: CN2011104241817A
Authority: CN
Inventors: 王海坤; 何婷婷; 王智国; 胡国平; 胡郁; 刘庆峰
Original assignee: iFlytek Co Ltd
Current assignee: Jilin Kexun Information Technology Co ltd
Priority date: 2011-12-16
Filing date: 2011-12-16
Publication date: 2012-07-04
Anticipated expiration: 2031-12-16
Also published as: CN102543071B

Abstract

The invention provides a speech recognition system and a speech recognition method for mobile devices. The method includes: acquiring a user's speech input and recognizing it on the basis of a limited grammar recognition network to obtain a first recognition result; in response to the first recognition result not satisfying a recognition acceptability condition, recognizing the speech input on the basis of a large vocabulary continuous speech recognition network to obtain a second recognition result; and selecting the preferred one of the first and second recognition results as the final decoding result for the speech input. Embodiments of the invention provide a new speech recognition method and device that support, under a unified system interface, both intelligent response to continuous speech input and quick response to short voice commands.

Description

Speech recognition system and method for mobile devices

Technical field

The present invention relates generally to the field of speech signal processing, and in particular to a method and apparatus for recognizing a user's speech input on a mobile device.

Background art

Achieving humanized, intelligent, and effective human-machine interaction, and building an efficient and natural human-machine communication environment, has become a pressing need in the application and development of information technology. In particular, with the spread of wireless communication networks in recent years, intelligent portable mobile devices play an increasingly important role in people's lives, and a growing share of human-machine interaction calls for a new, more efficient and natural means of interaction suited to small-screen devices. Speech, as a natural and human-friendly means of interaction, is playing an increasingly important role. For example, when it is inconvenient to place a call by hand, such as while driving, a user may wish to control the mobile device through a speech input such as "call Wang Zhiguo"; or, when text must be entered, such as when editing a short message, the user may wish to enter it directly through speech input and speech recognition.

Various speech recognition technologies have been proposed. For example, S. J. Young et al., "Token Passing: A Simple Conceptual Model for Connected Speech Recognition Systems", Technical Report CUED/F-INFENG/TR38, Cambridge University Engineering Dept., 1989, discloses a speech recognition system based on a limited grammar network. Such a system can recognize short voice commands accurately and efficiently, but it often fails in the common case where the user speaks freely.

As another example, X. Aubert et al., "Large Vocabulary Continuous Speech Recognition of Wall Street Journal Corpus", Proc. ICASSP '94, Adelaide, Australia, Vol. II, pp. 129-132, 1994, discloses a speech recognition system based on a large vocabulary continuous speech recognition network. A drawback of such a system is that it must search for the optimal path in a huge search space formed by a large-scale acoustic model and a general language model, so the fast and accurate response that short voice commands require often cannot be guaranteed.

Therefore, there is a need for a new speech recognition method and system for mobile devices that balances the accuracy and efficiency of speech recognition, providing fast and accurate response to short voice commands while also supporting recognition of free-form speech.

Summary of the invention

To achieve the above objectives, embodiments of the present invention propose a new speech recognition method and apparatus that support both intelligent response to continuous speech input and quick response to short voice commands.

According to one aspect of the present invention, a speech recognition method for a mobile device is provided, comprising: acquiring a user's speech input; recognizing the speech input based on a limited grammar recognition network to obtain a first recognition result; in response to the first recognition result not satisfying a recognition acceptability condition, recognizing the speech input based on a large vocabulary continuous speech recognition network, either locally or by transmitting the speech signal to a server, to obtain a second recognition result; and selecting the preferred one of the first and second recognition results as the final decoding result of the speech input.

According to another aspect of the present invention, a speech recognition system for a mobile device is provided, comprising: an acquiring means for acquiring a user's speech input; a first recognition means for recognizing the speech input based on a limited grammar recognition network to obtain a first recognition result; a second recognition means for recognizing the speech input based on a large vocabulary continuous speech recognition network to obtain a second recognition result, in response to the first recognition result not satisfying a recognition acceptability condition; and a decoding determination means for selecting the preferred one of the first and second recognition results as the final decoding result of the speech input.

The scheme according to the present invention has the following features:

the user can have all types of speech input commands recognized under a unified system interface;

free-form speech from the user can be recognized;

short voice commands can be recognized quickly and accurately;

specific information associated with the local mobile device can be recognized accurately.

Brief description of the drawings

The above and other features of the present invention will become more apparent from the following detailed description of embodiments, taken in conjunction with the accompanying drawings. In the drawings:

Fig. 1 schematically shows a flowchart of a method for speech recognition on a mobile device according to an embodiment of the present invention;

Fig. 2 shows an example limited grammar recognition network according to an embodiment of the present invention;

Fig. 3 shows a flowchart for determining whether the recognition result of a speech input satisfies the recognition acceptability condition, according to an embodiment of the present invention;

Fig. 4 shows a flowchart of an improved Viterbi search method for continuous speech recognition based on a large vocabulary continuous speech recognition network, according to a preferred embodiment of the present invention;

Fig. 5 schematically shows a flowchart for evaluating the recognition results together to determine the final decoding result of a speech input, according to an embodiment of the present invention;

Fig. 6 shows a block diagram of a speech recognition system for a mobile device according to an embodiment of the present invention;

Fig. 7 shows a schematic block diagram of a mobile device in which embodiments of the present invention can be implemented.

In the drawings, identical or corresponding reference numerals denote identical or corresponding parts.

Detailed description of the embodiments

Hereinafter, the speech recognition method and apparatus for a mobile device of the present invention are described in detail through embodiments with reference to the accompanying drawings. It should be understood that these embodiments are provided only to enable those skilled in the art to better understand and implement the present invention, and do not limit the scope of the present invention in any way.

The present invention is described below mainly using a personal mobile phone as an example, but it can be used in any device that supports a speech input function and is not limited to mobile phones. For example, the present invention can also be used in a personal digital assistant (PDA), a multimedia music player, a tablet computer, and so on.

As mobile devices increasingly take on the role of a personal assistant, there are many situations in which a user needs to interact with the device by voice. In some cases, the user may wish to control the operation of the mobile device through short voice commands. For example, voice commands can start or stop various applications on the mobile device. The user may, for instance, wish to place a call to Zhang San with the voice command "call Zhang San", where Zhang San is one of the contacts in the address book on the mobile device. In other cases, the user may wish to interact with the device more naturally, by speaking freely. For example, with the speech input "tell Zhang San that the company is meeting in the 3rd-floor meeting room at 7 o'clock tonight", the user may wish to have the device send the contact Zhang San a short message with the content "the company is meeting in the 3rd-floor meeting room at 7 o'clock tonight". Clearly, the primary condition for correctly executing the user's various speech input commands is to recognize their spoken content correctly.

A speech recognition system based on a limited grammar recognition network can usually handle only short voice commands and copes poorly with free-form speech. Conversely, a speech recognition system based on a large vocabulary continuous speech recognition network is not suited to responding quickly to short voice commands. Current voice applications are usually tied to a specific application program: the user first selects and enters a given program, and the system then selects the corresponding recognition system according to the application environment. For example, to place a call by voice, the user typically first enters a command-and-control program, and the system then uses a speech recognition system based on a limited grammar recognition network to respond to the user's short dialing command, such as "call Zhang San" or "make a phone call to Zhang San". As another example, when free-form speech must be transcribed, such as when entering text for a short message, the user first selects and enters the short message application, and the system then selects, according to the application environment, a speech recognition system based on a large vocabulary continuous speech recognition network to respond to the user's continuous free-form input. This mode of human-machine interaction, in which a specific application must be selected before the voice function can be used, is neither natural nor user-friendly.

In view of the above, embodiments of the present invention propose a new speech recognition method and apparatus that adopt a hybrid recognition network, namely a recognition network based on a limited grammar together with a large vocabulary continuous speech recognition network that supports free-form speech. Under a unified system interface, this achieves both accurate and efficient recognition of short voice commands and transcription of continuous speech input. Embodiments of the present invention thereby make speech recognition in personal assistant tools on mobile devices more convenient to use.

Fig. 1 schematically shows a flowchart of a method 100 for speech recognition on a mobile device according to an embodiment of the present invention.

In step S110, the user's speech input is acquired. Under a unified system interface, various forms of user speech input can be acquired, including short voice commands and arbitrary free-form utterances. Any known or future speech signal capture technique can be used to acquire the user's speech input. The continuous speech signal can be digitally sampled to obtain a digitized form of the speech input.

Optionally, the speech input can be preprocessed. In a preferred embodiment, to improve the robustness of the system, front-end noise reduction is applied to the captured raw speech signal. For example, the continuous speech signal is first divided into separate speech segments and non-speech segments by analyzing its short-time energy and short-time zero-crossing rate. The speech segments are then enhanced with techniques such as Wiener filtering to further remove noise from the speech signal and improve the ability of subsequent stages to process the signal.
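Purely as a non-limiting illustration, the division into speech and non-speech segments by short-time energy and zero-crossing rate might be sketched as follows. The function names, the non-overlapping framing, the three thresholds, and the decision rule are all assumptions made for this sketch; the description above does not specify them.

```python
from typing import List, Sequence


def short_time_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def label_frames(
    samples: Sequence[float],
    frame_len: int,          # samples per frame, assumed >= 2
    high_energy_thr: float,  # hypothetical threshold
    low_energy_thr: float,   # hypothetical threshold
    zcr_thr: float,          # hypothetical threshold
) -> List[bool]:
    """Label each non-overlapping frame as speech (True) or non-speech (False).

    Assumed rule: a frame is speech if its energy is high, or if its energy is
    moderate and its zero-crossing rate is high (typical of unvoiced sounds).
    """
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = short_time_energy(frame)
        zcr = zero_crossing_rate(frame)
        labels.append(
            energy > high_energy_thr
            or (energy > low_energy_thr and zcr > zcr_thr)
        )
    return labels
```

In practice the thresholds would be tuned on recorded data, and adjacent speech frames would be merged into segments before enhancement.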

Optionally, acoustic features can also be extracted from the speech input. The speech signal after noise reduction still contains a large amount of redundant information that is irrelevant to speech recognition; recognizing it directly would increase the amount of computation and reduce recognition accuracy. For this reason, features that are effective for recognition can be extracted from the speech signal and stored in a feature buffer to characterize the user's speech input. In a preferred embodiment, MFCC features of the speech are extracted. For example, short-time analysis is performed on each frame of speech data, with a window length of 25 ms and a frame shift of 10 ms, to obtain the MFCC parameters and their first- and second-order differences, 39 dimensions in total. A segment of speech input can thus be quantized into a sequence O of 39-dimensional feature vectors. In other embodiments, PLP (perceptual linear predictive) features, TANDEM features, or the like can also be used to characterize the speech input. To avoid obscuring the main points of the present invention, known speech signal capture, preprocessing, and feature extraction techniques are not described in further detail here.
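As a hedged illustration of the framing arithmetic and of how static coefficients and their first- and second-order differences can add up to 39 dimensions, consider the sketch below. It assumes 13 static coefficients per frame and a 16 kHz sampling rate in the test values, neither of which is stated above, and it uses a simple two-point central difference; the MFCC computation itself is omitted.

```python
from typing import List


def num_frames(num_samples: int, sample_rate: int,
               win_ms: int = 25, shift_ms: int = 10) -> int:
    """Number of full analysis frames for a 25 ms window and 10 ms shift."""
    win = sample_rate * win_ms // 1000
    shift = sample_rate * shift_ms // 1000
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift


def deltas(seq: List[List[float]]) -> List[List[float]]:
    """Central difference over time; the first and last frames are replicated."""
    n = len(seq)
    out = []
    for t in range(n):
        prev = seq[max(t - 1, 0)]
        nxt = seq[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_deltas(static: List[List[float]]) -> List[List[float]]:
    """Concatenate static, first-order and second-order differences per frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With 13 static coefficients per frame (an assumption), each output vector has 13 + 13 + 13 = 39 dimensions, matching the feature sequence O described above.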

In addition, it should be understood that the raw or preprocessed user speech input, or its feature representation, can be stored in a memory, and is not limited to any particular storage format.

In step S120, the speech input is recognized based on the limited grammar recognition network to obtain a first recognition result for the speech input.

The limited grammar recognition network can be defined in advance and stored in the device. It is mainly used to support short voice commands, and the grammar it supports is relatively simple, comprising phrases built around a limited set of command words, such as "send a text message to ***" and "call ***". Preferably, the limited grammar recognition network defines personalized information related to the mobile device, for example information related to the applications supported on the device, the address book, and so on. Fig. 2 shows an example limited grammar recognition network according to an embodiment of the present invention. Based on this limited grammar recognition network, short voice commands such as "send a text message to Zhang Fei" and "call Wang Zhiguo" can be recognized quickly and accurately.
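By way of example only, the way command templates and address book entries combine into the set of sentences a limited grammar accepts might be sketched as follows. The `{name}` slot syntax, the action labels, and the flat dictionary are assumptions of this sketch; an actual recognition network would typically share nodes between commands rather than enumerate every sentence.

```python
from typing import Dict, Iterable, Tuple


def build_command_grammar(
    templates: Dict[str, str], contacts: Iterable[str]
) -> Dict[str, Tuple[str, str]]:
    """Expand the {name} slot of every template with every contact name.

    Returns a mapping from each accepted sentence to (action, contact).
    """
    names = list(contacts)
    grammar = {}
    for action, pattern in templates.items():
        for name in names:
            grammar[pattern.format(name=name)] = (action, name)
    return grammar


# Hypothetical templates and address book entries.
TEMPLATES = {
    "call": "call {name}",
    "sms": "send a text message to {name}",
}
CONTACTS = ["Zhang Fei", "Wang Zhiguo"]
GRAMMAR = build_command_grammar(TEMPLATES, CONTACTS)
```

Because the contact names come from the device's own address book, the accepted sentences stay few and device-specific, which is what keeps the first recognition pass fast.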

The recognition result for the speech input can be searched for in the limited grammar recognition network through the following steps. Step one: load system parameters such as the acoustic model and the limited grammar recognition network. Optionally, these system parameters can be loaded when method 100 starts (for example, at initialization) or at any time before the actual recognition in step S120. The limited grammar recognition network reflects the simple voice commands supported by the speech recognition system of the present invention, for example as shown in Fig. 2. The acoustic model is used to model the pronunciation characteristics of characters; this embodiment adopts the hidden Markov model (HMM) based on transition probabilities and emission probabilities that is commonly used in the field of speech recognition. It should be understood that the present invention can also use other acoustic models, such as neural network models. Step two: generate a search network based on the acoustic model according to the limited grammar recognition network. Step three: in the search space defined by the search network, search for the optimal path corresponding to the speech input acquired in step S110. For example, speech frames can be extracted from the speech input. Using Viterbi search, for each extracted speech frame, the optimal historical path probability for all currently active nodes is computed. The search proceeds in time order using dynamic programming; when the last speech frame has been processed, backtracking from the final state yields the optimal decoded state sequence and the corresponding historical path probability. For details of the Viterbi algorithm, see for example A. J. Viterbi, "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm", IEEE Transactions on Information Theory, Vol. IT-13, pp. 260-269, April 1967; they are not repeated here. Other search methods, now known or developed in the future, are also feasible, and the scope of the present invention is not limited to search methods that use the Viterbi algorithm.
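For readers unfamiliar with the search, a minimal textbook Viterbi sketch in the log domain is given below. It operates on a small fully connected HMM and is not the decoder of the embodiment: the argument layout (`log_init`, `log_trans`, `log_emit`) and the toy two-state model are assumptions, and a practical decoder would restrict the search to active nodes of the search network.

```python
import math
from typing import List, Tuple


def viterbi(
    log_init: List[float],         # log_init[s]: log P(start in state s)
    log_trans: List[List[float]],  # log_trans[p][s]: log P(s | p)
    log_emit: List[List[float]],   # log_emit[t][s]: log P(frame t | s)
) -> Tuple[List[int], float]:
    """Return the best state sequence and its log score."""
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t - 1][s]: best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(
                score[best_prev] + log_trans[best_prev][s] + log_emit[t][s]
            )
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Backtrack from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Toy example: two states, three frames, uniform transitions.
lg = math.log
INIT = [lg(0.9), lg(0.1)]
TRANS = [[lg(0.5), lg(0.5)], [lg(0.5), lg(0.5)]]
EMIT = [[lg(0.9), lg(0.1)], [lg(0.1), lg(0.9)], [lg(0.9), lg(0.1)]]
BEST_PATH, BEST_SCORE = viterbi(INIT, TRANS, EMIT)
```

The per-frame best scores computed inside the loop correspond to the historical path scores that the description refers to.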

It should be understood that, in the search space defined by the limited grammar recognition network, a good matching path for the user's speech input may be found (for example, when the speech input is a short voice command that conforms to the limited grammar), yielding the first recognition result of the speech input. Alternatively, the best matching path found may be unreasonable (for example, when the user speaks freely, the path score of a result decoded with the limited grammar is often very low), in which case no valid recognition result is obtained for the speech input.

In a simplified embodiment, if the matching path found for the current user speech input is reasonable, that is, a first recognition result has been obtained in the limited grammar network, that recognition result is taken as the decoding result of the user's speech input and method 100 ends. Otherwise, that is, if no reasonable matching path is found, method 100 proceeds to step S140 and switches to recognition based on the large vocabulary continuous speech recognition network to obtain a second recognition result for the user's speech input.

In a preferred embodiment, method 100 further comprises step S130, in which it is determined whether the first recognition result obtained in step S120 by searching the limited grammar recognition network satisfies a recognition acceptability condition. If the first recognition result satisfies the acceptability condition, it is accepted directly as the decoding result of the user's speech input, and method 100 ends. This saves recognition time and improves overall recognition efficiency. If the first recognition result does not satisfy the acceptability condition, method 100 proceeds to step S140 and switches to recognition based on the large vocabulary continuous speech recognition network.

Fig. 3 shows a flow for determining whether the first recognition result obtained by searching the limited grammar recognition network satisfies the recognition acceptability condition, according to an embodiment of the present invention.

In step S310, the average per-frame likelihood of the recognition result for the user's speech input is computed.

In step S320, it is determined whether this frame average is greater than a threshold preset by the system. If not, the current recognition result is not credible and the flow goes to step S360; otherwise it goes to step S330.

In step S330, the probability score corresponding to each recognized character of the user's speech input is computed.

In step S340, it is determined whether the probability score of each character is greater than its corresponding threshold. If so, the current recognition result is credible and the flow goes to step S350; otherwise it goes to step S360.

In step S350, the current recognition result is determined to satisfy the acceptability condition.

In step S360, the current recognition result is determined not to satisfy the acceptability condition.

That is, the current decoding is judged acceptable only when both probability scores are greater than their thresholds.

The threshold for the frame average and/or the thresholds for the characters can be tuned in advance by the recognition system on a large amount of training data.
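The two-stage check of steps S310 to S360 can be sketched, under stated assumptions, as a single function. The argument names and the use of log-likelihoods for the per-frame values are illustrative choices, not taken from the description; the thresholds are hypothetical inputs.

```python
from typing import Sequence


def is_acceptable(
    frame_log_likelihoods: Sequence[float],  # one value per decoded frame
    char_scores: Sequence[float],            # one score per recognized character
    frame_threshold: float,                  # hypothetical, tuned offline
    char_thresholds: Sequence[float],        # hypothetical, one per character
) -> bool:
    """Return True only if both the frame average and every character pass."""
    if not frame_log_likelihoods:
        return False
    if len(char_scores) != len(char_thresholds):
        raise ValueError("one threshold is required per character score")
    # S310/S320: average per-frame likelihood against the preset threshold.
    frame_avg = sum(frame_log_likelihoods) / len(frame_log_likelihoods)
    if not frame_avg > frame_threshold:
        return False  # S360
    # S330/S340: every character score must exceed its own threshold.
    return all(s > t for s, t in zip(char_scores, char_thresholds))
```

Checking the cheap frame average first means the per-character scores only need to be computed for results that already look plausible overall.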

It should be understood that the use of likelihood in the embodiment shown in Fig. 3 to determine whether a recognition result satisfies the recognition acceptability condition is for illustration only and is not a limitation. The present invention can also use other parameters, such as confidence measures, as the decision criterion; see L. E. Baum, T. Petrie, G. Soules and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains", Ann. Math. Stat., Vol. 41, No. 1, pp. 164-171, 1970.

Returning to Fig. 1, in step S140 the speech input is recognized again based on the large vocabulary continuous speech recognition network to obtain a second recognition result for the speech input.

Speech recognition based on the large vocabulary continuous speech recognition network uses a large-scale acoustic model and language model and is not constrained by a grammar, so it can model arbitrary free-form speech input. Its decoding process is as follows. Step one: load system parameters such as the predefined large-scale acoustic model and language model. Optionally, this loading can be performed when method 100 starts (for example, at initialization) or at any time before the actual recognition in step S140. As before, in this embodiment the acoustic model is the hidden Markov model (HMM) based on transition probabilities and emission probabilities that is commonly used in the field of speech recognition, and it models the pronunciation characteristics of characters. It should be understood that the present invention can also use other acoustic models, such as neural network models. Step two: expand the language model network, which carries word-frequency probabilities, into a search network based on the acoustic model for the subsequent path search. Step three: in the search space defined by the search network, search for the optimal path corresponding to the speech input. For example, Viterbi search can be used to find, in the search network, the optimal word sequence corresponding to the extracted speech frame sequence, thereby obtaining the recognition result.

Preferably, the speech recognition based on the large vocabulary continuous speech recognition network makes use of the path scores on the optimal decoding path found in step S120 during the limited grammar network search, so that a recognition result can be returned as early as possible. A preferred implementation of step S140 is described in detail below with reference to Fig. 4.

Fig. 4 shows a flowchart of an improved Viterbi search method 400 for continuous speech recognition based on the large vocabulary continuous speech recognition network, according to a preferred embodiment of the present invention.

In step S410, initialization is performed and the current speech frame is set to i = 1.

In step S420, the optimal historical paths of the current speech frame for all active nodes are computed, and the maximum current historical path score Si is recorded.

In step S430, the difference between Si and Si', the historical path score of the current speech frame on the optimal decoding path of the limited grammar network, is computed.

In step S440, it is determined whether this difference is greater than a system-predefined threshold S. If so, the flow goes to step S450; otherwise it goes to step S470.

In step S450, the speech frame under consideration is advanced to the next frame (i++).

In step S460, it is determined whether the index of the speech frame under consideration is greater than the total number of speech frames T. If so, the flow goes to step S470; otherwise it goes to step S420, and the maximum current historical path score Si is computed for the frame now under consideration. Here, T is the total number of frames of the current speech input, as determined during decoding with the limited grammar network.

In step S470, the current recognition result is returned. Preferably, the historical path score, the historical path, the total number of decoded frames, and so on can be returned.

Method 400 makes use of the search results already obtained with the limited grammar recognition network, so the recognition process based on the large vocabulary continuous speech recognition network can end early without decoding all speech frames. In this preferred embodiment, for the current speech frame, when the difference between its optimal historical path score in the large vocabulary continuous speech recognition network search and its path score on the optimal decoding path of the limited grammar network search is less than the predefined threshold, the search based on the large vocabulary continuous speech recognition network can end early, and the recognition result based on the limited grammar recognition network is returned directly as the recognition result of the speech input. If recognition based on the large vocabulary continuous speech recognition network is completed for all frames, the second recognition result, based on the large vocabulary continuous speech recognition network, is returned in step S470. Whether the decoding process based on the large vocabulary continuous speech recognition network in method 400 ended early (that is, was not completed) can be indicated, for example, by the returned total number of decoded frames: if it equals the predetermined total number of speech frames T, decoding based on the large vocabulary continuous speech recognition network was completed; otherwise it ended early. Alternatively, early termination can be indicated by setting another flag (for example, a binary bit with a true/false value).
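The control flow of steps S410 to S470 might be sketched as the loop below. This is a simplified illustration only: `lvcsr_step` is a hypothetical callable standing in for one frame of the large vocabulary search (returning the best historical path score Si for frame i), and the per-frame scores of the limited grammar pass are assumed to be available as a list.

```python
from typing import Callable, Sequence, Tuple


def decode_with_early_stop(
    lvcsr_step: Callable[[int], float],  # hypothetical: frame index (1-based) -> Si
    grammar_scores: Sequence[float],     # Si' for frames 1..T from the first pass
    threshold: float,                    # the system-predefined threshold S
) -> Tuple[int, bool]:
    """Return (number of frames decoded, whether decoding ran to completion)."""
    total_frames = len(grammar_scores)   # T, fixed by the limited grammar pass
    decoded = 0
    i = 1                                # S410
    while i <= total_frames:             # S460
        s_i = lvcsr_step(i)              # S420
        decoded = i
        # S430/S440: continue only while Si exceeds Si' by more than the threshold.
        if not (s_i - grammar_scores[i - 1] > threshold):
            return decoded, False        # S470, ended early
        i += 1                           # S450
    return decoded, True                 # S470, all frames decoded
```

Returning an explicit completion flag alongside the frame count removes the ambiguity that would otherwise arise if the stop condition fired on the very last frame.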

Returning to Fig. 1, when the recognition based on the large vocabulary continuous speech recognition network in step S140 ends, method 100 proceeds to step S150. In step S150, the recognition result based on the limited grammar recognition network and the recognition result based on the large vocabulary continuous speech recognition network are evaluated together to determine the final decoding result of the speech input. If the recognition based on the large vocabulary continuous speech recognition network was not completed (that is, it ended early), the recognition result based on the limited grammar recognition network is taken as the final decoding result of the user's speech input. If the recognition based on the large vocabulary continuous speech recognition network was completed but the score of its recognition result is lower than the score of the recognition result based on the limited grammar recognition network, the recognition result based on the limited grammar recognition network is still taken as the final decoding result; otherwise, the recognition result based on the large vocabulary continuous speech recognition network is taken as the final decoding result.

A concrete implementation of step S150 is shown in Fig. 5.

In step S510, it is determined whether the decoding process based on the large vocabulary continuous speech recognition network was completed, that is, whether the last frame was decoded. If so, the flow goes to step S520; otherwise it goes to step S540.

In step S520, it is determined whether the system's optimal path score in the recognition based on the limited grammar recognition network is less than the system's optimal path score in the recognition based on the large vocabulary continuous speech recognition network. If so, the flow goes to step S530; otherwise it goes to step S540. Optionally, the frame average can be used as the criterion instead of, or in addition to, the optimal path score.

In step S530, the continuous speech recognition result based on the large vocabulary continuous speech recognition network is output as the final decoding result.

In step S540, the recognition result based on the limited grammar recognition network is output as the final decoding result.
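The decision of steps S510 to S540 reduces to a short selection function, sketched here under the assumption that both passes report a comparable optimal path score; the parameter names are illustrative.

```python
def select_final_result(
    grammar_result: str,
    grammar_score: float,
    lvcsr_result: str,
    lvcsr_score: float,
    lvcsr_completed: bool,
) -> str:
    """Choose the final decoding result from the two recognition passes."""
    # S510: an unfinished large vocabulary pass never wins.
    # S520: a finished pass wins only with a strictly higher path score.
    if lvcsr_completed and grammar_score < lvcsr_score:
        return lvcsr_result    # S530
    return grammar_result      # S540
```

On a tie the sketch keeps the limited grammar result, following the "otherwise" branch of step S520.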

Typically, method 100 ends after the final decoding result of the speech input is determined in step S150.

Preferably, the final decoding result of the speech input obtained by method 100 is used to trigger the corresponding application on the mobile device, for example a calling application or a short message application.

The method for recognizing arbitrary forms of user speech input under a unified interface has been described in detail above with reference to the accompanying drawings. It should be noted that although the operations of the method are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that order, or that all of the illustrated operations must be performed, to achieve the desired result. On the contrary, the steps depicted in the flowcharts can be executed in a different order. Additionally or alternatively, some steps can be omitted, multiple steps can be combined into one, one step can be split into several, and other steps can be added.

In addition, the method admits various concrete implementations, including implementation entirely on the mobile device and implementation on the mobile device in combination with a server.

In one embodiment, method 100 can be implemented entirely on the mobile device. In this scheme, the limited grammar recognition network and the large vocabulary continuous speech recognition network are both stored in the memory of the mobile device.

Alternatively, in another embodiment, the user's speech input is acquired on the mobile device (step S110). The mobile device then sends the acquired speech input, raw or processed, to a server. The processed speech input can be the digitized form of the speech input or the extracted feature sequence. The server receives the speech input and then performs recognition on it, comprising: first recognition based on the limited grammar recognition network (S120); determining the decoding validity of the first recognition result (S130); when the first recognition result is invalid, second recognition based on the large vocabulary continuous speech recognition network (S140); and evaluating the first and second recognition results together to determine the final decoding result of the speech input (S150). The server then sends the final decoding result to the mobile device.

In this embodiment, the server maintains the large vocabulary continuous speech recognition network. In addition, the server maintains a personalized information base for each mobile device or user, for example a limited grammar recognition network, which is used to improve the recognition of voice commands that contain personalized information. For example, "call Wang Zhiguo" is recognized with the contact's own name rather than with the rendering of "Wang Zhiguo" that the general-purpose language model would produce.

In yet another embodiment, the user voice input is acquired at the mobile device (S110), the first recognition based on the limited grammar recognition network is performed there (S120), and the decoding validity of the first recognition result is judged (S130). When the first decoding is invalid, the mobile device sends the acquired user voice input to the server, either as the voice signal or as the extracted feature sequence.

The server uses its greater decoding capacity and its very large model base (for example, the large vocabulary continuous speech recognition network) to perform continuous speech decoding on the user voice input and obtain the second recognition result (S140). Preferably, to improve the decoding efficiency of the server, the mobile device transmits the local decoding result (i.e. the first recognition result), including the decoding path score of each frame, together with the speech feature sequence.

The mobile device then receives the second recognition result from the server.

Next, the mobile device combines the recognition result based on the limited grammar recognition network with the recognition result based on the large vocabulary continuous speech recognition network to determine the final decoding result of the voice input (S150).

In this embodiment, the mobile device stores its own limited grammar recognition network, while the very large large vocabulary continuous speech recognition network is stored on the server.
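
In this split arrangement, what the mobile device uploads after an invalid first decoding could be organized as sketched below. The sketch is an assumption for illustration only: the type names FirstPassInfo and SecondPassRequest, the field names, and the function build_second_pass_request are hypothetical, and no wire format or transport is implied.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class FirstPassInfo:
    text: str                       # first recognition result from the device
    frame_path_scores: List[float]  # one decoding path score per speech frame


@dataclass
class SecondPassRequest:
    features: List[List[float]]                 # feature sequence, one vector per frame
    first_pass: Optional[FirstPassInfo] = None  # optional hint for the server


def build_second_pass_request(
    features: Sequence[Sequence[float]],
    first_text: str,
    frame_path_scores: Sequence[float],
    include_first_pass: bool = True,
) -> SecondPassRequest:
    """Hypothetical helper: package what the device sends to the server."""
    if include_first_pass and len(frame_path_scores) != len(features):
        raise ValueError("expected one path score per speech frame")
    first = (
        FirstPassInfo(first_text, list(frame_path_scores))
        if include_first_pass
        else None
    )
    return SecondPassRequest([list(f) for f in features], first)
```

Including the per-frame path scores is the optional, preferred case described above; a request without them still carries everything the server needs for the second recognition.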

It should be understood that the method of the invention is not limited to the specific examples and variations shown. Those skilled in the art may conceive of other modifications, substitutions and variations without departing from the spirit and scope of the invention.

Fig. 6 shows a speech recognition system 600 for a mobile device according to a preferred embodiment of the invention. The system 600 may be used to carry out the method 100 described above. For example, the system 600 may be installed on a mobile device, or distributed across the mobile device and a server.

The system 600 comprises an acquisition device 610, a first recognition device 620, a second recognition device 640 and a decoding determination device 650.

According to one embodiment of the invention, the acquisition device 610 is used to acquire the user voice input. Preferably, the acquisition device 610 extracts speech frames from the user voice input, so that the voice input is represented as a series of speech frames. The acquisition device 610 may use any voice signal capture technique, known now or developed in the future, to acquire the user voice input, and may digitally sample the continuous voice signal to obtain a digitized form of the voice input. Preferably, the acquisition device 610 may include a preprocessing device for preprocessing the voice input to enhance the speech and remove noise from it. Preferably, the acquisition device 610 may also include an acoustic feature extraction device for extracting acoustic features from the voice signal (in particular the preprocessed voice signal) to characterize the voice input.

The first recognition device 620 is used to recognize the voice input based on the limited grammar recognition network. The first recognition device 620 uses a preloaded acoustic model and the limited grammar network to recognize the voice input and obtain the first recognition result.

The second recognition device 640 is used to recognize the voice input based on the large vocabulary continuous speech recognition network, in response to the first recognition result not satisfying the recognition acceptability condition, so as to obtain a second recognition result. The second recognition device 640 uses a preloaded acoustic model and the large vocabulary continuous speech recognition network to recognize the voice input and obtain the second recognition result.

The decoding determination device 650 is used to determine the final decoding result of the voice input by combining the recognition result based on the limited grammar recognition network with the recognition result based on the large vocabulary continuous speech recognition network. If the score of the recognition result obtained by the second recognition device 640 with the large vocabulary continuous speech recognition network is greater than the score of the recognition result obtained by the first recognition device 620 with the limited grammar recognition network, the decoding determination device 650 determines that the recognition result obtained by the second recognition device 640 is the final decoding result of the user voice input; otherwise, it determines that the recognition result obtained by the first recognition device 620 is the final decoding result of the user voice input.
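
The selection rule just described can be written as a short Python sketch. The names Scored and select_final are hypothetical, and the sketch assumes that the two scores are expressed on a comparable scale, which the text does not spell out.

```python
from typing import NamedTuple


class Scored(NamedTuple):
    text: str
    score: float


def select_final(first: Scored, second: Scored) -> Scored:
    """Pick the second result only if its score is strictly greater."""
    # "Otherwise" in the text covers equal scores, so ties go to the first result.
    return second if second.score > first.score else first
```

Because the comparison is strict, the limited grammar result is kept whenever the large vocabulary result does not score strictly higher.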

According to a preferred embodiment of the invention, the system 600 further includes a decoding validity judging device 630 for judging the decoding validity of the recognition result obtained with the limited grammar recognition network. After the first recognition device 620 obtains the first recognition result, if the decoding validity judging device 630 judges that the first recognition result satisfies the acceptability condition, it causes the decoding determination device 650 to determine that the first recognition result is the final decoding result of the user voice input.
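
One possible form of the acceptability test is sketched below, using the mean per-frame likelihood that the claims name as one basis for the condition. The function name is_acceptable, the use of log-likelihoods, and the threshold parameter min_mean are assumptions made for illustration; the text does not fix a particular formula or threshold value.

```python
from typing import Sequence


def is_acceptable(frame_log_likelihoods: Sequence[float], min_mean: float) -> bool:
    """Hypothetical validity check on the first recognition result.

    Accepts the result when the mean per-frame log-likelihood along the
    decoded path reaches the (assumed) threshold min_mean.
    """
    if not frame_log_likelihoods:
        return False  # nothing was decoded, so there is nothing to accept
    mean = sum(frame_log_likelihoods) / len(frame_log_likelihoods)
    return mean >= min_mean
```

A confidence score per recognized character could be tested in the same way; only the input sequence and the threshold would change.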

According to a preferred embodiment of the invention, the second recognition device 640 uses the path score of the optimal decoding path found by the first recognition device 620 in its search of the limited grammar network to decide whether to end its own recognition process early. When, for the current speech frame, the best historical path score in the search of the large vocabulary continuous speech recognition network is not greater than the path score of the optimal decoding path in the limited grammar network search, or when the difference between the two is less than a predetermined threshold, the recognition based on the large vocabulary continuous speech recognition network may be ended early. When that recognition is ended early, the second recognition device 640 may output a signal that causes the decoding determination device 650 to determine that the first recognition result obtained by the first recognition device 620 is the final decoding result of the user voice input.
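
The early termination test can be illustrated with the following Python sketch. The function names and the idea of supplying precomputed per-frame best scores are assumptions for illustration; in a real decoder the check would run inside the frame-by-frame search loop rather than over stored lists.

```python
from typing import Optional, Sequence


def should_stop_early(s_lvcsr: float, s_grammar: float, threshold: float) -> bool:
    """Early termination test for one speech frame.

    s_lvcsr:   best historical path score in the large vocabulary search
    s_grammar: score of the optimal limited grammar path at the same frame
    """
    return s_lvcsr <= s_grammar or (s_lvcsr - s_grammar) < threshold


def first_stop_frame(
    lvcsr_scores: Sequence[float],
    grammar_scores: Sequence[float],
    threshold: float,
) -> Optional[int]:
    """Return the index of the first frame at which the search would stop,
    or None if the large vocabulary search runs to the end."""
    for i, (s, s_prime) in enumerate(zip(lvcsr_scores, grammar_scores)):
        if should_stop_early(s, s_prime, threshold):
            return i
    return None
```

When a frame index is returned, the large vocabulary search is abandoned at that frame and the first recognition result is kept as the final decoding result.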

For clarity, the sub-devices included in each device are not shown in Fig. 6. It should be understood, however, that each device recited in the system 600 corresponds to a step of the method 100 described with reference to Fig. 1. The operations and features described above with respect to Fig. 1 therefore apply equally to the system 600 and to the devices and sub-devices it contains, and are not repeated here.

Should be appreciated that although in above-detailed, mentioned the some devices or the sub-device of system, this division only is not enforceable.In fact, according to the embodiment of the present invention, the characteristic of above-described two or more devices and function can be specialized in a device.Otherwise the characteristic of an above-described device and function can further be divided into by multiple arrangement to be specialized.

In addition, the system shown in Fig. 6 is merely exemplary and not restrictive. The system 600 may have various variations.

In one embodiment, system 600 is installed on the mobile device.

In another embodiment, the system 600 is installed on a server. In this case, the server also includes a communication device (not shown) for communicating with the mobile device, which is used to transfer the user voice input and the recognition result between the server and the mobile device.

In yet another embodiment, the system 600 is distributed across both the mobile device and the server. In this embodiment, the mobile device comprises: an acquisition device for acquiring the user voice input; a first recognition device for recognizing the voice input based on the limited grammar recognition network to obtain a first recognition result; a transceiver device for, in response to the first recognition result not satisfying the recognition acceptability condition, sending the user voice input to the server and receiving from the server a second recognition result obtained by recognizing the voice input based on the large vocabulary continuous speech recognition network; and a decoding determination device for selecting the preferred one of the first and second recognition results as the final decoding result of the voice input. The server comprises: a receiving device for receiving the user voice input from the mobile device; a second recognition device for recognizing the voice input based on the large vocabulary continuous speech recognition network to obtain the second recognition result; and a sending device for sending the second recognition result to the mobile device.

In addition, the system 600 may include other devices, for example a volatile or non-volatile storage device for storing the acquired voice input and/or its recognition result. The system 600 may also include a trigger device for triggering the corresponding application on the device, such as a phone call application or a short message application, according to the final decoding result of the voice input.

Moreover, the system 600 and its components may be implemented in various ways. For example, in some embodiments the system 600 may be implemented using software and/or firmware modules. The system 600 may also be implemented using hardware modules; for example, it may be implemented as an integrated circuit (IC) chip or an application-specific integrated circuit (ASIC), or as a system on chip (SOC). Other ways, known now or developed in the future, are also feasible, and the scope of the invention is not limited in this respect.

Fig. 7 shows an example of a mobile phone 700 suitable for implementing embodiments of the invention. It should be understood, however, that the scope of the invention is not limited to the particular type of mobile phone described.

The mobile phone 700 may be any portable terminal that requires voice interaction. The mobile phone 700 may comprise a housing 30 for containing and protecting the device. The mobile phone 700 may further comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention, the display may use any display technology suitable for displaying images or text. The mobile phone 700 may further comprise a keypad 34. In other embodiments of the invention, any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system forming part of a touch-sensitive display. The mobile phone may comprise a microphone 36 or any suitable audio input, which may be a digital or analog signal input. The mobile phone 700 may further comprise an audio output device, which in embodiments of the invention may be any one of: an earpiece 38, a loudspeaker, or an analog audio or digital audio output connection. The mobile phone 700 may also comprise a battery 40 (or, in other embodiments of the invention, the device may be powered by any suitable mobile energy device, such as a solar cell, fuel cell or clockwork generator). The mobile phone may further comprise an infrared port 42 for short-range line-of-sight communication with other devices. In other embodiments, the mobile phone 700 may further comprise any suitable short-range communication solution, such as a Bluetooth wireless connection or a USB/FireWire wired connection.

The mobile phone 700 may comprise a controller 56 or processor for controlling the mobile phone 700. The controller 56 may be connected to a memory 58, which in embodiments of the invention may store the preset acoustic model, the limited grammar recognition network, the large vocabulary recognition network and the like, and/or may also store instructions for implementation on the controller 56. The controller 56 may further be connected to a codec circuit 54 suitable for carrying out, or assisting the controller 56 in carrying out, the coding and decoding of audio and/or video data, including speech recognition according to an embodiment of the invention.

The mobile phone 700 may further comprise a card reader 48 and a smart card 46, for example a UICC and a UICC reader, for providing user information and suitable for providing authentication information for authenticating and authorizing the user at a network.

The mobile phone 700 may comprise a radio interface circuit 52 connected to the controller and suitable for generating wireless communication signals, for example for communication with a cellular communication network, a wireless communication system or a wireless local area network. The mobile phone 700 may further comprise an antenna 44 connected to the radio interface circuit 52 for transmitting and receiving the radio frequency signals generated at the radio interface circuit 52.

The speech recognition system 600 according to the invention may be implemented as hardware included in the mobile phone 700. Besides a hardware implementation, the system 600 according to the invention may in particular be implemented in the form of a computer program product. For example, the method 100 described with reference to Fig. 1 may be implemented by a computer program product. The computer program product may be stored, for example, in the memory 58 shown in Fig. 7, or downloaded to the mobile phone 700 from a suitable location over a network. The computer program product may comprise computer code portions that include program instructions executable by a suitable processing device (for example, the controller 56 and/or the codec circuit 54 shown in Fig. 7). The program instructions may comprise at least: instructions for acquiring the user voice input; instructions for recognizing the voice input based on the limited grammar recognition network; instructions for recognizing the voice input based on the large vocabulary continuous speech recognition network; and instructions for determining the final decoding result of the voice input by combining the recognition result based on the limited grammar recognition network with the recognition result based on the large vocabulary continuous speech recognition network. Preferably, the program instructions also comprise instructions for judging the decoding validity of the recognition result obtained with the limited grammar recognition network. Preferably, the program instructions also comprise instructions for using the path score of the optimal decoding path in the limited grammar network recognition to end the recognition process based on the large vocabulary continuous speech recognition network early.

The spirit and principles of the invention have been explained above in connection with embodiments. Embodiments of the invention provide a new speech recognition system and method that offer the user a unified system interface for simple and efficient interaction with the system, and that realize control of the mobile device through various types of voice commands. By adopting a hybrid network that combines the limited grammar recognition network with a large vocabulary continuous speech recognition network supporting free-form speech, both accurate and efficient recognition of short voice commands and transcription of continuous speech input are achieved. Embodiments of the invention do not require the user to first enter a specified program and then select the corresponding recognition system according to the current application environment. For example, suppose that "Wang Zhiguo" is a contact in the address book of the device. When the user provides the voice input "call Wang Zhiguo", embodiments of the invention quickly output the recognition result based on the limited grammar recognition network, and based on this recognition result the information for Wang Zhiguo in the address book can be retrieved to place a call to him. When the user provides, in free-form speech, the voice input "company meeting at 7 o'clock tonight in the meeting room on the 3rd floor", embodiments of the invention quickly output the recognition result "company meeting at 7 o'clock tonight in the meeting room on the 3rd floor" based on the large vocabulary continuous speech recognition network, realizing fast speech-to-text conversion. The speech recognition method and system of the invention are more accurate and efficient, and provide a more intelligent and convenient mode of human-machine interaction.

The terms "recognition" and "decoding" used in the specification have similar meanings in the field of speech recognition. Which of the two is used is merely a choice of context; both denote converting an audio voice signal into the corresponding text characters.

Although the invention has been described with reference to several embodiments, it should be understood that the invention is not limited to the disclosed embodiments. The invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims. The scope of the appended claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims

1. A speech recognition method performed on a mobile device or a server, comprising:

acquiring a user voice input;

recognizing the voice input based on a limited grammar recognition network to obtain a first recognition result;

in response to the first recognition result not satisfying a recognition acceptability condition, recognizing the voice input based on a large vocabulary continuous speech recognition network to obtain a second recognition result; and

selecting the preferred one of the first and second recognition results as the final decoding result of the voice input.

2. The speech recognition method according to claim 1, wherein:

in response to the first recognition result satisfying the recognition acceptability condition, the first recognition result is taken directly as the final decoding result of the voice input.

3. The speech recognition method according to claim 2, wherein:

the recognition acceptability condition is based on at least one of the following: the mean likelihood of the speech frames of the voice input, and the probability score or confidence corresponding to each recognized character of the voice input.

4. The speech recognition method according to claim 1, wherein recognizing the voice input based on the large vocabulary continuous speech recognition network to obtain the second recognition result comprises:

extracting the voice input into speech frames, and performing the recognition based on the large vocabulary continuous speech recognition network by searching, frame by frame, for the optimal path in the search space defined by the large vocabulary continuous speech recognition network.

5. The speech recognition method according to claim 4, wherein the recognition based on the large vocabulary continuous speech recognition network further terminates the search process early according to the real-time decoding state to improve decoding efficiency, comprising:

calculating, for the current speech frame, the best historical paths corresponding to all active nodes, and determining the current maximum historical path score Si,

calculating the difference between Si and the historical path score Si' of the current speech frame on the optimal decoding path of the limited grammar network, and

in response to the difference being less than a preset threshold, terminating the search process.

6. A speech recognition method for a mobile device, comprising:

acquiring a user voice input;

in response to the first recognition result not satisfying the recognition acceptability condition, sending the user voice input to a server, and receiving from the server a second recognition result obtained by recognizing the voice input based on a large vocabulary continuous speech recognition network; and

7. A speech recognition system, comprising:

an acquisition device for acquiring a user voice input,

a first recognition device for recognizing the voice input based on a limited grammar recognition network to obtain a first recognition result,

a second recognition device for, in response to the first recognition result not satisfying a recognition acceptability condition, recognizing the voice input based on a large vocabulary continuous speech recognition network to obtain a second recognition result, and

a decoding determination device for selecting the preferred one of the first and second recognition results as the final decoding result of the voice input.

8. The speech recognition system according to claim 7, wherein the decoding determination device further, in response to the first recognition result satisfying the recognition acceptability condition, takes the first recognition result directly as the final decoding result of the voice input.

9. speech recognition system according to claim 8, wherein:

10. The speech recognition system according to claim 7, wherein the acquisition device further extracts the voice input into speech frames, and the second recognition device performs the recognition based on the large vocabulary continuous speech recognition network by searching, frame by frame, for the optimal path in the search space defined by the large vocabulary continuous speech recognition network.

11. The speech recognition system according to claim 10, wherein the second recognition device further comprises:

a first calculation device for calculating, for the current speech frame, the best historical paths corresponding to all active nodes, and determining the current maximum historical path score Si,

a second calculation device for calculating the difference between Si and the historical path score Si' of the current speech frame on the optimal decoding path of the limited grammar network, and

a judgment device for terminating the search process in response to the difference being less than a preset threshold.

12. A mobile device or server, comprising the speech recognition system according to any one of claims 6-9.

13. A mobile device, comprising:

an acquisition device for acquiring a user voice input,

a transceiver device for, in response to the first recognition result not satisfying the recognition acceptability condition, sending the user voice input to a server, and receiving from the server a second recognition result obtained by recognizing the voice input based on a large vocabulary continuous speech recognition network, and