CN103474075A

CN103474075A - Method and system for sending voice signals, and method and system for receiving voice signals

Info

Publication number: CN103474075A
Application number: CN2013103620247A
Authority: CN
Inventors: 江源; 周明; 凌震华; 何婷婷; 胡国平; 胡郁; 刘庆峰
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2013-08-19
Filing date: 2013-08-19
Publication date: 2013-12-25
Anticipated expiration: 2033-08-19
Also published as: CN103474075B

Abstract

The invention discloses a method and system for sending voice signals. The sending method comprises: determining the text content corresponding to a continuous voice signal to be sent; determining, according to the text content, the speech synthesis parameter model of each synthesis unit; splicing the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence; determining the sequence number string corresponding to the speech synthesis parameter model sequence; and sending the sequence number string to a receiving end so that the receiving end can restore the continuous voice signal according to the sequence number string. The invention also discloses a method and system for receiving voice signals. The methods and systems provided by the invention enable signal transmission at an extremely low bit rate while minimizing the loss of sound quality in the restored voice.

Description

Voice signal sending method and system, and voice signal receiving method and system

Technical field

The present invention relates to the field of signal transmission technology, and in particular to a voice signal sending method and system and a voice signal receiving method and system.

Background art

With the spread of the Internet and of portable devices, a variety of chat applications for handheld devices have emerged. Voice interaction is more natural and user-friendly than any other means of interaction, particularly on small-screen handheld devices, which are poorly suited to handwriting or keypad input. For this reason many products support a voice interaction function that transmits the voice signal received at one terminal to a destination; for example, WeChat, released by Tencent, supports voice messages. However, the volume of data in a directly transmitted voice signal is often very large, which imposes a considerable financial burden on users of channels that charge by traffic, such as the Internet or communication networks. Compressing the transmitted data as far as possible without affecting voice quality is therefore a precondition for improving the practical value of voice signal transmission.

To address the problem of voice signal transmission, researchers have tried a variety of speech coding methods that digitally quantize and compress the voice signal, with the aim of lowering the coding bit rate and raising transmission efficiency while improving the quality of the restored voice. Commonly used speech compression methods currently include waveform coding and parametric coding:

Waveform coding samples, quantizes, and encodes the time-domain analog signal waveform to form a digital signal. This approach has the advantages of strong adaptability and high speech quality. However, because the waveform shape of the original voice signal must be preserved for restoration, it requires a relatively high bit rate: good sound quality is obtained only above 16 kb/s.

Parametric coding extracts from the original voice signal the parameters that characterize its pronunciation and encodes these feature parameters. The goal is to preserve the meaning of the original speech and ensure intelligibility. Its advantage is a lower bit rate, but the restored sound quality is degraded more.

In the traditional era of voice communication, charging was usually time-based, and coding methods mainly considered algorithmic delay and communication quality. In the mobile Internet era, voice is one kind of data signal and is usually charged by traffic, so the bit rate of the coded voice directly affects the cost to the user. In addition, traditional telephone-channel voice uses only an 8 kHz sampling rate and is narrowband speech, so its sound quality is degraded and has an upper limit. If traditional coding methods continue to be used for wideband or ultra-wideband speech, the bit rate must be increased and traffic consumption multiplies.

Summary of the invention

In one aspect, embodiments of the present invention provide a voice signal sending method and system that achieve signal transmission at an extremely low bit rate while minimizing the loss of sound quality in the restored voice.

In another aspect, embodiments of the present invention provide a voice signal receiving method and system that reduce the loss of sound quality in the restored voice.

To this end, the present invention provides the following technical solutions:

A voice signal sending method, comprising:

determining the text content corresponding to a continuous voice signal to be sent;

determining the speech synthesis parameter model of each synthesis unit according to the text content;

splicing the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;

determining the sequence number string corresponding to the speech synthesis parameter model sequence; and

sending the sequence number string to a receiving end, so that the receiving end restores the continuous voice signal according to the sequence number string.

A voice signal sending system, comprising:

a text acquisition module, configured to determine the text content corresponding to a continuous voice signal to be sent;

a parameter model determination module, configured to determine the speech synthesis parameter model of each synthesis unit according to the text content;

a splicing module, configured to splice the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;

a sequence number string determination module, configured to determine the sequence number string corresponding to the speech synthesis parameter model sequence; and

a sending module, configured to send the sequence number string to a receiving end, so that the receiving end restores the continuous voice signal according to the sequence number string.

The voice signal sending method and system provided by the embodiments of the present invention use statistical analysis model coding, whose processing is independent of the speech sampling rate. They greatly reduce the transmission bit rate while minimizing the loss of sound quality in the restored voice, reduce traffic consumption, solve the problem that traditional speech coding methods cannot balance sound quality and traffic, and improve the user's communication experience in the mobile network era.

Correspondingly, in the voice signal receiving method and system provided by the embodiments of the present invention, the receiver obtains the speech synthesis parameter model sequence from a codebook according to the received sequence number string corresponding to that sequence, and uses the sequence to obtain the voice signal by speech synthesis. This greatly reduces the loss of sound quality in the restored voice and achieves both very high compression of the voice signal and minimal signal loss.

Brief description of the drawings

To describe the technical solutions of the embodiments of the present application or of the prior art more clearly, the accompanying drawings used in the embodiments are briefly introduced below. The drawings described below are evidently only some of the embodiments recorded in the present invention; those of ordinary skill in the art can derive other drawings from them.

Fig. 1 is a flowchart of the voice signal sending method according to an embodiment of the present invention;

Fig. 2 is a flowchart of one way of determining the speech synthesis parameter model of each synthesis unit in an embodiment of the present invention;

Fig. 3 is a flowchart of building a binary decision tree in an embodiment of the present invention;

Fig. 4 is a schematic diagram of a binary decision tree in an embodiment of the present invention;

Fig. 5 is a flowchart of another way of determining the speech synthesis parameter model of each synthesis unit in an embodiment of the present invention;

Fig. 6 is a flowchart of the voice signal receiving method according to an embodiment of the present invention;

Fig. 7 is a structural block diagram of the voice signal sending system in an embodiment of the present invention;

Fig. 8 is a structural block diagram of the parameter model determination module in an embodiment of the present invention;

Fig. 9 is a structural block diagram of the binary decision tree building module in an embodiment of the present invention;

Fig. 10 is a structural block diagram of one form of the fundamental frequency model determination unit of the voice signal sending system in an embodiment of the present invention;

Fig. 11 is a structural block diagram of one form of the spectral model determination unit of the voice signal sending system in an embodiment of the present invention;

Fig. 12 is a structural block diagram of another form of the fundamental frequency model determination unit of the voice signal sending system in an embodiment of the present invention;

Fig. 13 is a structural block diagram of another form of the spectral model determination unit of the voice signal sending system in an embodiment of the present invention;

Fig. 14 is a structural block diagram of the voice signal receiving system according to an embodiment of the present invention.

Detailed description of the embodiments

To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.

Traditional coding methods must increase the bit rate to handle wideband or ultra-wideband speech, which leads to high traffic consumption. To address this problem, embodiments of the present invention provide a voice signal sending method and system and a voice signal receiving method and system that are suitable for coding various types of speech (such as ultra-wideband speech at a 16 kHz sampling rate and narrowband speech at an 8 kHz sampling rate) and that achieve signal transmission at an extremely low bit rate while minimizing the loss of sound quality in the restored voice.

Fig. 1 is a flowchart of the voice signal sending method according to an embodiment of the present invention. The method comprises the following steps:

Step 101: determine the text content corresponding to the continuous voice signal to be sent.

Specifically, the text content may be obtained automatically by a speech recognition algorithm, or by manual annotation. In addition, to further ensure that the text content obtained by speech recognition is correct, it may also be corrected by manual editing.

Step 102: determine the speech synthesis parameter model of each synthesis unit according to the text content.

A synthesis unit is a predefined minimum synthesis object, such as a syllable unit, a phoneme unit, or even a state unit of a phoneme HMM.

To minimize the loss of sound quality at the receiving end, which restores the continuous voice signal by speech synthesis, the speech synthesis parameter models that the sending end obtains from the original voice signal should match the characteristics of the original voice signal as closely as possible, so as to reduce the loss caused by signal compression and restoration.

Specifically, the continuous voice signal may be segmented according to the text content to obtain the speech segment corresponding to each synthesis unit, from which the duration, fundamental frequency model, and spectral model of each synthesis unit are then obtained. The detailed process is described later.

Step 103: splice the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence.

Step 104: determine the sequence number string corresponding to the speech synthesis parameter model sequence.

Step 105: send the sequence number string to the receiving end, so that the receiving end restores the continuous voice signal according to the sequence number string.

The voice signal sending method of this embodiment uses statistical analysis model coding, whose processing is independent of the speech sampling rate; coding 16 kHz ultra-wideband speech incurs no additional bit rate cost, the sound quality is good, and the coding traffic is low. Take a typical Chinese speech segment as an example: its effective speech lasts 10 s and contains 80 initials and finals (phonemes); each phoneme has 5 fundamental frequency states, 5 spectral states, and 1 duration state; and each state is coded with 1 byte (8 bits). Its bit rate m is then m = [80 × (5 + 5 + 1)] × 8 bit / 10 s = 704 b/s. This is below 1 kb/s, which makes it an extremely low bit rate coding method: the bit rate is well below that of the coding standards in the current mainstream voice communication field, and network traffic is greatly reduced. Compared with the mainstream speech coding methods in the communication field, the coding method of the invention can handle ultra-wideband speech (16 kHz sampling rate) with higher sound quality, and has a lower bit rate (below 1 kb/s), which effectively reduces network traffic.
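The bit rate arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `bit_rate_bps` and its parameter names are assumptions introduced here, while the numbers are the ones given in the example above.

```python
def bit_rate_bps(num_phonemes, f0_states, spectral_states, duration_states,
                 bits_per_state, speech_seconds):
    """Average bit rate in bit/s when every state is sent as one fixed-size code."""
    codes_per_phoneme = f0_states + spectral_states + duration_states
    total_bits = num_phonemes * codes_per_phoneme * bits_per_state
    return total_bits / speech_seconds


# Figures from the example: 80 phonemes in 10 s, 5 + 5 + 1 states, 1 byte per state.
rate = bit_rate_bps(80, 5, 5, 1, 8, 10)
print(rate)  # 704.0
```

Under these assumptions the rate depends only on the number of states per second of speech, not on the sampling rate of the audio, which is the property the paragraph above relies on.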

Fig. 2 is a flowchart of one way of determining the speech synthesis parameter model of each synthesis unit in an embodiment of the present invention. It comprises the following steps:

Step 201: segment the continuous voice signal according to the text content to obtain the speech segment corresponding to each synthesis unit.

Specifically, the continuous voice signal may be force-aligned with the acoustic model sequence corresponding to the synthesis units in the text content; that is, the speech recognition decoding of the voice signal against that acoustic model sequence is computed, which yields the speech segment corresponding to each synthesis unit.

It should be noted that synthesis units of different sizes may be chosen for different application requirements. In general, if the requirement on bit rate is more stringent, larger speech units are chosen, such as syllable units or phoneme units; conversely, if the requirement on sound quality is higher, smaller speech units may be chosen, such as model state units or feature stream units.

When HMM-based (Hidden Markov Model) acoustic models are used, each state of the HMM may further be chosen as a synthesis unit, and the corresponding state-level speech segments obtained. The fundamental frequency model and spectral model of each state are then determined from the fundamental frequency binary decision tree and the spectral binary decision tree corresponding to that state. In this way the resulting speech synthesis parameter models describe the characteristics of the voice signal in finer detail.

Step 202: obtain the current synthesis unit to be examined.

Step 203: measure the duration of the speech segment corresponding to the current synthesis unit.

Step 204: determine the fundamental frequency model of the current synthesis unit.

Specifically, first obtain the fundamental frequency binary decision tree corresponding to the current synthesis unit. Perform text analysis on the synthesis unit to obtain its context information, such as the phoneme unit, tone, part of speech, and prosodic level. Then make path decisions in the fundamental frequency binary decision tree according to the context information to reach the corresponding leaf node, and take the fundamental frequency model of that leaf node as the fundamental frequency model of the synthesis unit.

The path decision proceeds as follows:

According to the context information of the synthesis unit, answer the splitting question of each node in turn, starting from the root node of the fundamental frequency binary decision tree; the answers define a top-down matching path, and the leaf node is the one reached along that path.

Step 205: determine the spectral model of the current synthesis unit.

Specifically, first obtain the spectral binary decision tree corresponding to the current synthesis unit. Perform text analysis on the synthesis unit to obtain its context information, such as the phoneme unit, tone, part of speech, and prosodic level. Then make path decisions in the spectral binary decision tree according to the context information to reach the corresponding leaf node, and take the spectral model of that leaf node as the spectral model of the synthesis unit.

The path decision proceeds as follows:

According to the context information of the synthesis unit, answer the splitting question of each node in turn, starting from the root node of the spectral binary decision tree; the answers define a top-down matching path, and the leaf node is the one reached along that path.

Step 206: judge whether the current synthesis unit is the last synthesis unit. If so, go to step 207; otherwise, return to step 202.

Step 207: output the speech segment duration, fundamental frequency model, and spectral model corresponding to each synthesis unit.
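The path decision used in steps 204 and 205 amounts to walking a binary tree from the root while answering yes/no questions about the context. The sketch below illustrates this under stated assumptions: the `Node` class, the `find_leaf` function, the `model_id` field, the context keys, and the choice of which branch means "yes" are all hypothetical and are not specified by the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    # A non-leaf node holds a yes/no question about the context; a leaf holds a model.
    question: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    model_id: Optional[int] = None  # hypothetical sequence number of the leaf's model


def find_leaf(root, context):
    """Answer each node's question from the root down; return the leaf and the path."""
    node, path = root, []
    while node.question is not None:
        answer = node.question(context)
        path.append(answer)
        node = node.yes if answer else node.no
    return node, path


# Toy tree with two invented questions about the neighbouring phonemes.
leaf_a, leaf_b, leaf_c = Node(model_id=0), Node(model_id=1), Node(model_id=2)
left = Node(question=lambda c: c["left_is_voiced_consonant"], yes=leaf_a, no=leaf_b)
root = Node(question=lambda c: c["right_is_nasal"], yes=left, no=leaf_c)

context = {"right_is_nasal": True, "left_is_voiced_consonant": False}
leaf, path = find_leaf(root, context)
print(leaf.model_id, path)  # 1 [True, False]
```

The same traversal serves both the fundamental frequency tree and the spectral tree; only the tree passed in and the models stored at the leaves differ.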

The quality of the speech synthesis parameter model of a synthesis unit depends directly on how the binary decision trees (the fundamental frequency binary decision tree and the spectral binary decision tree) are built. In the embodiments of the present invention, the binary decision trees are built by a bottom-up clustering method.

Fig. 3 is a flowchart of building a binary decision tree in an embodiment of the present invention. It comprises the following steps:

Step 301: obtain training data.

Specifically, a large amount of speech training data may be collected and annotated with text. The speech is then segmented into basic speech units, or further into synthesis units (such as the state units of the basic speech unit models), according to the annotated text content. This yields the set of speech segments corresponding to each synthesis unit, and the speech segments in that set serve as the training data for the synthesis unit.

Step 302: extract, from the training data, the synthesis parameters of the speech segment set corresponding to the synthesis unit.

The synthesis parameters include fundamental frequency features, spectral features, and so on.

Step 303: initialize the binary decision tree corresponding to the synthesis unit according to the extracted synthesis parameters, and set the root node as the current node.

Initializing the binary decision tree means building a binary decision tree that has only a root node.

Step 304: judge whether the current node needs to be split. If so, go to step 305; otherwise, go to step 306.

A remaining question in the preset question set is selected and used to tentatively split the data of the current node into child nodes. Remaining questions are those that have not yet been asked.

Specifically, the sample concentration degree of the current node, which describes how dispersed the samples in its speech segment set are, may be calculated first. In general, the greater the dispersion, the more likely the node is to be split, and vice versa. The sample concentration degree of a node may be measured by the sample variance, that is, the mean of the distances (or squared distances) from all samples under the node to the class center. The sample concentration degree of the child nodes after splitting is then calculated, and the question that yields the largest decrease in sample concentration degree is selected as the preferred question.

A tentative split is then made according to the preferred question to obtain child nodes. If the decrease in concentration degree from splitting on the preferred question is smaller than a set threshold, or if the amount of training data in a child node after splitting is below a set threshold, the current node is not split further.

Step 305: split the current node, and obtain the child nodes and the training data corresponding to each child node. Then go to step 307.

Specifically, the current node may be split according to the preferred question.

Step 306: mark the current node as a leaf node.

Step 307: judge whether the binary decision tree still contains non-leaf nodes that have not been examined. If so, go to step 308; otherwise, go to step 309.

Step 308: take the next unexamined non-leaf node as the current node. Then return to step 304.

Step 309: output the binary decision tree.

It should be noted that, in the embodiments of the present invention, both the fundamental frequency binary decision tree and the spectral binary decision tree may be built according to the flow shown in Fig. 3.
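The flow of steps 303 to 309 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: it uses one-dimensional features, and the function names, the dictionary-based node layout, the size-weighted combination of the child variances, the per-path bookkeeping of remaining questions, and the toy data are all assumptions made for the example.

```python
from collections import deque
from statistics import fmean


def dispersion(values):
    """Mean squared distance of the samples to their center (sample variance)."""
    center = fmean(values)
    return fmean([(v - center) ** 2 for v in values])


def preferred_question(samples, remaining, min_drop, min_samples):
    """Return the name of the question to split on, or None to stop (step 304)."""
    parent = dispersion([v for _, v in samples])
    best = None  # (drop, question name, size of the smaller child)
    for name, ask in remaining.items():
        yes = [v for ctx, v in samples if ask(ctx)]
        no = [v for ctx, v in samples if not ask(ctx)]
        if not yes or not no:
            continue  # the question does not separate this node's data
        # Assumption: child dispersions are combined as a size-weighted mean.
        after = (len(yes) * dispersion(yes) + len(no) * dispersion(no)) / len(samples)
        drop = parent - after
        if best is None or drop > best[0]:
            best = (drop, name, min(len(yes), len(no)))
    if best is None or best[0] < min_drop or best[2] < min_samples:
        return None
    return best[1]


def build_tree(samples, questions, min_drop=1.0, min_samples=1):
    """Grow a binary decision tree following steps 303 to 309."""
    root = {"samples": samples, "remaining": dict(questions)}  # step 303
    pending = deque([root])  # nodes that have not been examined yet
    while pending:  # steps 307 and 308
        node = pending.popleft()
        name = preferred_question(node["samples"], node["remaining"], min_drop, min_samples)
        if name is None:
            node["leaf"] = True  # step 306
            continue
        ask = node["remaining"][name]
        rest = {k: q for k, q in node["remaining"].items() if k != name}
        node.update(leaf=False, question=name)  # step 305
        node["yes"] = {"samples": [s for s in node["samples"] if ask(s[0])], "remaining": rest}
        node["no"] = {"samples": [s for s in node["samples"] if not ask(s[0])], "remaining": rest}
        pending.extend([node["yes"], node["no"]])
    return root  # step 309


# Toy data: (context, fundamental frequency value in Hz); all values are invented.
samples = [
    ({"right_nasal": True, "left_voiced": True}, 100.0),
    ({"right_nasal": True, "left_voiced": False}, 102.0),
    ({"right_nasal": False, "left_voiced": True}, 200.0),
    ({"right_nasal": False, "left_voiced": False}, 198.0),
]
questions = {
    "right_nasal": lambda ctx: ctx["right_nasal"],
    "left_voiced": lambda ctx: ctx["left_voiced"],
}
tree = build_tree(samples, questions, min_drop=10.0, min_samples=1)
print(tree["question"])  # right_nasal
```

In this toy run the root splits on `right_nasal`, because that question separates the low values from the high ones, and both children become leaves because a further split would reduce the variance by less than `min_drop`.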

Fig. 4 is a schematic diagram of a binary decision tree in an embodiment of the present invention.

Fig. 4 shows how the binary decision tree for the third state of the phoneme "*-aa+" is built. As shown in Fig. 4, when the root node is split, its training data is partitioned according to the answers to the preset question "Is the right-adjacent phoneme a nasal?". When the next layer is split, for example the left node, that node's training data is further partitioned according to the answers to the preset question "Is the left-adjacent phoneme a voiced consonant?". Finally, when a node cannot be split further, it is set as a leaf node, and a mathematical statistical model, such as a Gaussian model, is trained on its training data; this statistical model serves as the synthesis parameter model of the leaf node.
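Training a Gaussian model on the data of a leaf node reduces, in the simplest one-dimensional case, to estimating a mean and a variance. The sketch below assumes scalar features; the function name `fit_leaf_gaussian` and the variance floor are illustrative additions, not part of the source.

```python
from statistics import fmean, pvariance


def fit_leaf_gaussian(values, variance_floor=1e-6):
    """Estimate (mean, variance) of a 1-D Gaussian from a leaf's training values."""
    mean = fmean(values)
    # The floor (an assumption) keeps the variance positive for single-sample leaves.
    variance = max(pvariance(values, mu=mean), variance_floor)
    return mean, variance


leaf_model = fit_leaf_gaussian([100.0, 102.0])
print(leaf_model)  # (101.0, 1.0)
```

Real fundamental frequency and spectral features are usually vectors, in which case the same estimate would be made per dimension or with a full covariance matrix.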

Evidently, in the embodiment shown in Fig. 2, the choice of speech synthesis parameter model depends mainly on a binary decision tree driven by text analysis, for example on the context phoneme classes of the current synthesis unit and the pronunciation type of the current phoneme. Choosing the model in this way is quick and convenient, but for a particular input voice signal this general-purpose method cannot capture the specific pronunciation characteristics well.

For this reason, Fig. 5 shows a flowchart of another way of determining the speech synthesis parameter model of each synthesis unit in an embodiment of the present invention. It comprises the following steps:

Step 501: segment the continuous voice signal according to the text content to obtain the speech segment corresponding to each synthesis unit.

Specifically, the continuous voice signal may be force-aligned with the acoustic models corresponding to the preset synthesis units; that is, the speech recognition decoding of the voice signal against that acoustic model sequence is computed, which yields the speech segment corresponding to each synthesis unit.

Step 502: determine the duration of the speech segment corresponding to each synthesis unit, and the fundamental frequency feature sequence and spectral feature sequence corresponding to the continuous voice signal.

Step 503: determine the fundamental frequency model of the synthesis unit according to the fundamental frequency feature sequence and the fundamental frequency model set corresponding to the synthesis unit.

Specifically, determine the fundamental frequency feature sequence corresponding to the synthesis unit, and obtain the fundamental frequency model set corresponding to the synthesis unit, that is, the fundamental frequency models of all leaf nodes of the fundamental frequency binary decision tree of the synthesis unit. Then compute the likelihood of the fundamental frequency feature sequence under each fundamental frequency model in the set, and select the model with the maximum likelihood as the fundamental frequency model of the synthesis unit.

Step 504: determine the spectral model of each synthesis unit according to the spectral feature sequence and the spectral model set corresponding to the synthesis unit.

Specifically, determine the spectral feature sequence corresponding to the synthesis unit, and obtain the spectral model set corresponding to the synthesis unit, that is, the spectral models of all leaf nodes of the spectral binary decision tree of the synthesis unit. Then compute the likelihood of the spectral feature sequence under each spectral model in the set, and select the model with the maximum likelihood as the spectral model of the synthesis unit.
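The maximum-likelihood selection in steps 503 and 504 can be illustrated as follows. The sketch assumes that each leaf model is a one-dimensional Gaussian given as (mean, variance) and that frames are scored independently; the function names, the sequence numbers used as dictionary keys, and the feature values are invented for the example.

```python
import math


def log_likelihood(frames, mean, variance):
    """Log-likelihood of 1-D feature frames under one Gaussian, frames independent."""
    return sum(
        -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)
        for x in frames
    )


def select_model(frames, candidate_models):
    """Return the sequence number of the candidate with the highest likelihood."""
    return max(
        candidate_models,
        key=lambda number: log_likelihood(frames, *candidate_models[number]),
    )


# Hypothetical leaf models of one fundamental frequency tree: number -> (mean, variance).
leaf_models = {7: (120.0, 25.0), 8: (180.0, 25.0), 9: (240.0, 25.0)}
best = select_model([176.0, 182.0, 185.0], leaf_models)
print(best)  # 8
```

Unlike the tree traversal of Fig. 2, this selection looks at the observed features of the actual utterance, which is why it can follow a speaker's particular pronunciation more closely.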

It can be seen that the voice signal sending method of the embodiments of the present invention greatly reduces the transmission bit rate while minimizing the loss of sound quality in the restored voice, reduces traffic consumption, solves the problem that traditional speech coding methods cannot balance sound quality and traffic, and improves the user's communication experience in the mobile network era.

Correspondingly, an embodiment of the present invention also provides a voice signal receiving method. Fig. 6 is a flowchart of the method, which comprises the following steps:

Step 601: receive the sequence number string corresponding to a speech synthesis parameter model sequence.

Step 602: obtain the speech synthesis parameter model sequence from a codebook according to the sequence number string.

Each speech synthesis parameter model has a unique sequence number, and the sender and the receiver both store the same codebook, which contains all the speech synthesis parameter models. The receiver can therefore look up in the codebook the speech synthesis parameter model corresponding to each sequence number in the received string, and splice these models to obtain the speech synthesis parameter model sequence.
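Because both ends hold the same codebook, only the sequence numbers need to travel over the channel. The round trip can be sketched as below, assuming a codebook of at most 256 models so that each sequence number fits in one byte; the `encode` and `decode` functions and the codebook contents are hypothetical.

```python
def encode(model_sequence, codebook):
    """Sender side: map each model to its sequence number, one byte each."""
    number_of = {model: number for number, model in enumerate(codebook)}
    return bytes(number_of[model] for model in model_sequence)


def decode(sequence_number_string, codebook):
    """Receiver side: look up each sequence number and splice the models."""
    return [codebook[number] for number in sequence_number_string]


# Hypothetical codebook shared by both ends: (stream, mean, variance) Gaussians.
codebook = [
    ("f0", 120.0, 25.0),
    ("f0", 180.0, 25.0),
    ("spectrum", 0.30, 0.01),
    ("spectrum", 0.55, 0.01),
]
sent_models = [codebook[1], codebook[3], codebook[0]]
payload = encode(sent_models, codebook)
print(list(payload))  # [1, 3, 0]
restored = decode(payload, codebook)
```

Here three models are transmitted as three bytes, and `restored` equals `sent_models` because the receiver indexes the same table.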

Step 603: determine the speech synthesis parameter sequence according to the speech synthesis parameter model sequence.

Specifically, the speech synthesis parameters may be determined from the speech synthesis parameter model sequence and the duration sequence of the synthesis units, producing the speech synthesis parameter sequence.

For example, the speech synthesis parameter sequence may be obtained according to the following formula:

$$O_{\max} = \arg\max_{O} P(O \mid \lambda, T)$$

where O is a parameter sequence, λ is the given speech synthesis parameter model sequence, and T is the duration sequence of the synthesis units.

$O_{\max}$ is the finally generated fundamental frequency parameter sequence or spectral parameter sequence. That is, within the span given by the unit duration sequence T, the parameter sequence $O_{\max}$ that has the maximum likelihood under the given speech synthesis parameter model sequence λ is sought, which yields the parameter sequence used for speech synthesis.
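A simplified view of this maximization: if every model is a Gaussian over static features only and frames are treated as independent, the likelihood is maximized by emitting each model's mean for every frame of its duration. The sketch below shows only this simplified case; the function name and the numbers are assumptions. Practical HMM-based synthesizers typically also model dynamic (delta) features, which makes the generated trajectory smooth, and the source does not state which variant is used.

```python
def generate_parameters(model_sequence, durations):
    """Maximum-likelihood trajectory for static-only Gaussian models.

    model_sequence: list of (mean, variance) pairs, one per synthesis unit.
    durations: number of frames for each unit (the duration sequence T).
    """
    if len(model_sequence) != len(durations):
        raise ValueError("one duration is required per model")
    trajectory = []
    for (mean, _variance), frames in zip(model_sequence, durations):
        trajectory.extend([mean] * frames)  # the mean maximizes a Gaussian density
    return trajectory


f0_track = generate_parameters([(120.0, 25.0), (180.0, 25.0)], [2, 3])
print(f0_track)  # [120.0, 120.0, 180.0, 180.0, 180.0]
```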

Step 604: restore the voice signal according to the speech synthesis parameter sequence.

The speech synthesis parameter sequence $O_{\max}$ obtained in the previous step is fed into a speech synthesizer to obtain the corresponding speech. A speech synthesizer is an analysis and restoration tool for voice signals that can restore parameterized speech data (such as fundamental frequency parameters and spectral parameters) into a high-quality speech waveform.

It can be seen that the voice signal sending method and receiving method of the embodiments of the present invention, by extracting the speech synthesis parameter models corresponding to the continuous voice signal and synthesizing the signal from them, achieve very high compression of the voice signal with minimal signal loss, and effectively reduce signal distortion.

Correspondingly, an embodiment of the present invention also provides a voice signal sending system. Fig. 7 is a structural block diagram of this system.

In this embodiment, the voice signal sending system comprises:

a text acquisition module 701, configured to determine the text content corresponding to a continuous voice signal to be sent;

a parameter model determination module 702, configured to determine the speech synthesis parameter model of each synthesis unit according to the text content;

a splicing module 703, configured to splice the speech synthesis parameter models of the synthesis units to obtain a speech synthesis parameter model sequence;

a sequence number string determination module 704, configured to determine the sequence number string corresponding to the speech synthesis parameter model sequence; and

a sending module 705, configured to send the sequence number string to a receiving end, so that the receiving end restores the continuous voice signal according to the sequence number string.

In actual applications, above-mentioned text acquisition module 701 can pass through the described content of text of speech recognition algorithm automatic acquisition, can certainly obtain described content of text by the mode of artificial mark.For this reason, voice recognition unit and/or markup information acquiring unit can be set in text acquisition module 701, in order to can make the user select different modes to obtain content of text corresponding to continuous speech signal to be sent.Wherein, described voice recognition unit, for determining content of text corresponding to continuous speech signal to be sent by speech recognition algorithm; Described markup information acquiring unit is for obtaining content of text corresponding to continuous speech signal to be sent by the mode of artificial mark.

To minimize the loss of sound quality when the receiving end recovers the continuous speech signal by speech synthesis, the speech synthesis parameter models that the parameter model determination module 702 obtains from the original speech signal should match the characteristics of the original signal as closely as possible, so as to reduce the loss introduced by compression and recovery. Specifically, the continuous speech signal may be cut into speech segments according to the text content to obtain the speech segment corresponding to each synthesis unit, and the duration, fundamental frequency model and spectral model corresponding to each synthesis unit are then obtained.

The voice signal sending system of the embodiments of the present invention encodes with statistical analysis models, and its processing is independent of the speech sampling rate: encoding 16 kHz ultra-wideband speech incurs no additional bit rate cost, so sound quality is good and the coding bit rate is low. Compared with the mainstream speech coding systems currently used in the communications field, the coding scheme of the present system can handle ultra-wideband speech (16 kHz sampling rate) with higher sound quality, and it has a lower bit rate (below 1 kb/s), effectively reducing network traffic.
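The independence from the sampling rate can be illustrated with simple arithmetic. In the sketch below, the unit rate and the index widths are hypothetical figures chosen only for illustration (they are not given in the specification), and uncompressed 16-bit PCM is used as the point of comparison.

```python
def index_stream_bitrate(units_per_second, bits_per_unit):
    """Bit rate (bit/s) of a stream of per-unit indices.

    Note that no sampling-rate term appears in this formula.
    """
    return units_per_second * bits_per_unit


def pcm_bitrate(sample_rate_hz, bits_per_sample):
    """Bit rate (bit/s) of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample


# Hypothetical figures: 15 synthesis units per second, each described by
# a 6-bit duration, a 10-bit F0 model index and a 12-bit spectral model index.
UNITS_PER_SECOND = 15
BITS_PER_UNIT = 6 + 10 + 12

coded = index_stream_bitrate(UNITS_PER_SECOND, BITS_PER_UNIT)
wideband_pcm = pcm_bitrate(16000, 16)
narrowband_pcm = pcm_bitrate(8000, 16)
print(coded, wideband_pcm, narrowband_pcm)
```

Under these assumptions the index stream stays the same whether the receiver synthesizes at 8 kHz or 16 kHz, while the PCM rate doubles. Smaller synthesis units (for example HMM states) would raise the unit rate and therefore the coded bit rate.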

Figure 8 is a structural block diagram of the parameter model determination module in an embodiment of the present invention.

The parameter model determination module comprises:

A cutting unit 801, configured to cut the continuous speech signal into speech segments according to the text content, obtaining the speech segment corresponding to each synthesis unit.

Specifically, the continuous speech signal may be force-aligned with the acoustic model sequence corresponding to the synthesis units in the text content, that is, speech recognition decoding of the speech signal is computed against that acoustic model sequence, thereby obtaining the speech segment corresponding to each synthesis unit.

It should be noted that synthesis units of different sizes may be chosen according to the application. In general, if a low bit rate is the priority, larger speech units such as syllable units or phoneme units are selected; if instead sound quality is the priority, smaller speech units such as model state units or feature stream units may be selected. When acoustic models based on HMMs (Hidden Markov Models) are used, each state of the HMM may further be chosen as a synthesis unit, yielding state-level speech segments. The fundamental frequency model and spectral model of each state are then determined from the fundamental frequency binary decision tree and the spectrum binary decision tree corresponding to that state. In this way the resulting speech synthesis parameter models describe the characteristics of the speech signal in finer detail.

A duration determining unit 802, configured to determine in turn the duration of the speech segment corresponding to each synthesis unit.

A fundamental frequency model determining unit 803, configured to determine in turn the fundamental frequency model of the speech segment corresponding to each synthesis unit.

A spectral model determining unit 804, configured to determine in turn the spectral model of the speech segment corresponding to each synthesis unit.
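The work of the cutting unit 801 and the duration determining unit 802 can be sketched as follows, assuming the forced alignment has already produced sample boundaries for each synthesis unit. The alignment tuple format, the unit labels and the 80-sample frame shift are hypothetical choices for illustration.

```python
def cut_segments(samples, alignment, frame_shift=80):
    """Slice a signal into per-unit segments and derive durations in frames.

    samples:   sequence of audio samples
    alignment: list of (unit_label, start_sample, end_sample), end exclusive,
               assumed to come from a forced-alignment step
    frame_shift: samples per analysis frame (hypothetical default)
    """
    segments = []
    for unit, start, end in alignment:
        if not 0 <= start < end <= len(samples):
            raise ValueError(f"bad boundaries for unit {unit!r}: {start}, {end}")
        segment = samples[start:end]
        # Every unit lasts at least one frame so it is never dropped.
        duration = max(1, round(len(segment) / frame_shift))
        segments.append({"unit": unit, "samples": segment, "duration": duration})
    return segments


signal = list(range(800))
alignment = [("a_state1", 0, 240), ("a_state2", 240, 800)]
segments = cut_segments(signal, alignment)
```

Each resulting segment is what units 803 and 804 would then analyze to pick a fundamental frequency model and a spectral model.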

In practice, the fundamental frequency model determining unit 803 and the spectral model determining unit 804 may be implemented in several ways. For example, the fundamental frequency model and the spectral model may be obtained from binary decision trees; to this end, in another embodiment of the voice signal sending system of the present invention, the system further comprises a binary decision tree building module, configured to build the fundamental frequency binary decision tree and the spectrum binary decision tree. Alternatively, the fundamental frequency model determining unit 803 and the spectral model determining unit 804 may obtain the fundamental frequency model and the spectral model by optimization based on signal features, as described in detail later.

Figure 9 is a structural block diagram of the binary decision tree building module in the voice signal sending system of an embodiment of the present invention.

The binary decision tree building module comprises:

A training data acquiring unit 901, configured to obtain training data;

A parameter extraction unit 902, configured to extract from the training data the synthesis parameters of the set of speech segments corresponding to the synthesis unit, the synthesis parameters comprising fundamental frequency features and spectral features;

An initialization unit 903, configured to initialize the binary decision tree corresponding to the synthesis unit according to the synthesis parameters, building a binary decision tree that has only a root node;

A node examination unit 904, configured to examine each non-leaf node in turn, starting from the root node of the binary decision tree; if the node under examination needs to be split, to split it and obtain the child nodes and the training data corresponding to each child node; otherwise, to mark the node under examination as a leaf node;

A binary decision tree output unit 905, configured to output the binary decision tree of the synthesis unit after the node examination unit has examined all non-leaf nodes.

In this embodiment, the training data acquiring unit 901 may collect a large amount of speech training data and annotate it with text, then cut it into speech segments of basic speech units or of synthesis units (such as the state units of basic speech unit models) according to the annotated text content, obtaining the set of speech segments corresponding to each synthesis unit, and use the speech segments in that set as the training data for that synthesis unit.

When judging whether the node under examination needs to be split, the node examination unit 904 may, based on the sample concentration of that node, select the question that gives the largest decrease in the sample concentration measure as the preferred question and make a trial split with it to obtain child nodes. If the decrease obtained by splitting on the preferred question is less than a set threshold, or the training data in a child node after the split falls below a set threshold, the node under examination is not split further.
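The splitting rule above can be sketched as follows. The sketch assumes, purely for illustration, that the sample concentration measure is the sum of squared deviations of a scalar feature from its mean, that questions are yes/no predicates over a context dictionary, and that recursion replaces the node-by-node examination; none of these choices is prescribed by the specification.

```python
def scatter(values):
    """Sum of squared deviations from the mean (assumed concentration measure)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(samples, questions, min_gain=1.0, min_samples=2):
    """Build a binary decision tree.

    samples:   list of (context_dict, feature_value)
    questions: dict mapping question name -> predicate(context_dict) -> bool
    """
    values = [value for _, value in samples]
    best = None
    for name, predicate in questions.items():
        yes = [s for s in samples if predicate(s[0])]
        no = [s for s in samples if not predicate(s[0])]
        gain = (scatter(values)
                - scatter([v for _, v in yes])
                - scatter([v for _, v in no]))
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)

    # Stop if the best trial split gains too little or leaves a child
    # with too little training data; the node becomes a leaf.
    if (best is None or best[0] < min_gain
            or min(len(best[2]), len(best[3])) < min_samples):
        return {"leaf": True, "mean": sum(values) / len(values), "count": len(values)}

    _, name, yes, no = best
    return {
        "leaf": False,
        "question": name,
        "yes": build_tree(yes, questions, min_gain, min_samples),
        "no": build_tree(no, questions, min_gain, min_samples),
    }


questions = {
    "tone_is_1": lambda ctx: ctx["tone"] == 1,
    "is_vowel": lambda ctx: ctx["vowel"],
}
training = [
    ({"tone": 1, "vowel": True}, 100.0),
    ({"tone": 1, "vowel": False}, 102.0),
    ({"tone": 4, "vowel": True}, 200.0),
    ({"tone": 4, "vowel": False}, 198.0),
]
tree = build_tree(training, questions)
```

In this toy data the tone question separates the two clusters of feature values, so it is chosen at the root, and both children become leaves because any further split would leave a child with a single sample.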

For the examination and splitting process, refer to the description in the voice signal sending method embodiment above; it is not repeated here.

It should be noted that, in the embodiments of the present invention, both the fundamental frequency binary decision tree and the spectrum binary decision tree may be built by this binary decision tree building module. Their construction is similar and is not described separately here.

Based on the fundamental frequency binary decision tree and the spectrum binary decision tree described above, the implementation of the fundamental frequency model determining unit and the spectral model determining unit in the embodiments of the present invention is further described below.

Figure 10 is a structural block diagram of the fundamental frequency model determining unit in the voice signal sending system of an embodiment of the present invention.

In this embodiment, the fundamental frequency model determining unit comprises:

A first acquiring unit 161, configured to obtain the fundamental frequency binary decision tree corresponding to the synthesis unit.

A first parsing unit 162, configured to perform text analysis on the synthesis unit and obtain its contextual information, for example the phoneme unit, tone, part of speech and prosodic level.

A first decision unit 163, configured to make a path decision in the fundamental frequency binary decision tree according to the contextual information, obtaining the corresponding leaf node.

Specifically, the path decision proceeds as follows: according to the contextual information of the synthesis unit, the split question at each node is answered in turn, starting from the root node of the fundamental frequency binary decision tree; a top-down matching path is obtained from the answers; and the leaf node is obtained from the matching path.

A first output unit 164, configured to take the fundamental frequency model corresponding to the leaf node as the fundamental frequency model of the synthesis unit.
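The path decision performed by units 161 to 164 can be sketched as a walk from the root to a leaf. The nested-dictionary tree, the question names and the model_id field below are hypothetical structures used only to make the traversal concrete.

```python
# Hypothetical split questions over the contextual information of a unit.
QUESTIONS = {
    "is_tone_1": lambda ctx: ctx["tone"] == 1,
    "is_noun": lambda ctx: ctx["pos"] == "noun",
}

# Hypothetical fundamental frequency decision tree; leaves carry a model index.
F0_TREE = {
    "question": "is_tone_1",
    "yes": {"model_id": 0},
    "no": {
        "question": "is_noun",
        "yes": {"model_id": 1},
        "no": {"model_id": 2},
    },
}


def decide_path(tree, context, questions):
    """Answer each node's split question in turn until a leaf is reached.

    Returns the leaf's model index and the top-down matching path.
    """
    node = tree
    path = []
    while "question" in node:
        answer = questions[node["question"]](context)
        path.append((node["question"], answer))
        node = node["yes"] if answer else node["no"]
    return node["model_id"], path


model_id, path = decide_path(F0_TREE, {"tone": 4, "pos": "noun"}, QUESTIONS)
```

The same traversal applies unchanged to a spectrum binary decision tree; only the tree and the models at its leaves differ.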

Similarly to the fundamental frequency model determining unit above, Figure 11 is a structural block diagram of the spectral model determining unit in the voice signal sending system of an embodiment of the present invention.

In this embodiment, the spectral model determining unit comprises:

A second acquiring unit 171, configured to obtain the spectrum binary decision tree corresponding to the synthesis unit.

A second parsing unit 172, configured to perform text analysis on the synthesis unit and obtain its contextual information, for example the phoneme unit, tone, part of speech and prosodic level.

A second decision unit 173, configured to make a path decision in the spectrum binary decision tree according to the contextual information of the synthesis unit, obtaining the corresponding leaf node.

Specifically, the path decision proceeds as follows: according to the contextual information of the synthesis unit, the split question at each node is answered in turn, starting from the root node of the spectrum binary decision tree; a top-down matching path is obtained from the answers; and the leaf node is obtained from the matching path.

A second output unit 174, configured to take the spectral model corresponding to the leaf node as the spectral model of the synthesis unit.

It should be noted that, in practice, the fundamental frequency model determining unit shown in Figure 10 and the spectral model determining unit shown in Figure 11 may each be implemented by a separate physical unit, or may both be implemented by a single physical unit. When a fundamental frequency model is to be generated, the fundamental frequency binary decision tree corresponding to the synthesis unit is obtained, and the synthesis unit is analyzed and a decision is made accordingly, giving the fundamental frequency model of the synthesis unit. When a spectral model is to be generated, the spectrum binary decision tree corresponding to the synthesis unit is obtained, and the synthesis unit is analyzed and a decision is made accordingly, giving the spectral model of the synthesis unit.

Figure 12 is another structural block diagram of the fundamental frequency model determining unit in the voice signal sending system of an embodiment of the present invention.

In this embodiment, the fundamental frequency model determining unit comprises:

A first determining unit 181, configured to determine the fundamental frequency feature sequence corresponding to the synthesis unit.

A first set acquiring unit 182, configured to obtain the fundamental frequency model set corresponding to the synthesis unit, that is, the fundamental frequency models corresponding to all leaf nodes of the fundamental frequency binary decision tree of the synthesis unit.

A first computing unit 183, configured to calculate the likelihood of the fundamental frequency feature sequence with respect to each fundamental frequency model in the fundamental frequency model set.

A first selecting unit 184, configured to select the fundamental frequency model with the maximum likelihood as the fundamental frequency model of the synthesis unit.
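The maximum-likelihood selection performed by units 181 to 184 can be sketched as follows. The sketch assumes each leaf model is a one-dimensional Gaussian over a scalar feature such as log F0, treats frames as independent, and ignores unvoiced frames; the specification does not state the form of the models, so these are illustrative assumptions.

```python
import math


def log_likelihood(features, mean, variance):
    """Log-likelihood of a feature sequence under a 1-D Gaussian (assumed model form)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)
        for x in features
    )


def select_model(features, models):
    """Return the index of the model that maximizes the likelihood.

    models: dict mapping model index -> (mean, variance), i.e. the models
            at all leaf nodes of the unit's decision tree.
    """
    return max(models, key=lambda idx: log_likelihood(features, *models[idx]))


# Hypothetical leaf models over log F0 and an observed feature sequence.
f0_models = {0: (4.8, 0.01), 1: (5.3, 0.01), 2: (5.0, 0.04)}
observed = [5.28, 5.31, 5.33]
chosen = select_model(observed, f0_models)
```

Unlike the path decision, this selection uses the actual signal features of the segment rather than its textual context, which is why it can track the original speech more closely. The spectral case follows the same pattern with vector-valued features.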

Similarly to the fundamental frequency model determining unit above, Figure 13 is another structural block diagram of the spectral model determining unit in the voice signal sending system of an embodiment of the present invention.

In this embodiment, the spectral model determining unit comprises:

A second determining unit 191, configured to determine the spectral feature sequence corresponding to the synthesis unit.

A second set acquiring unit 192, configured to obtain the spectral model set corresponding to the synthesis unit, that is, the spectral models corresponding to all leaf nodes of the spectrum binary decision tree of the synthesis unit.

A second computing unit 193, configured to calculate the likelihood of the spectral feature sequence with respect to each spectral model in the spectral model set.

A second selecting unit 194, configured to select the spectral model with the maximum likelihood as the spectral model of the synthesis unit.

It should be noted that, in practice, the fundamental frequency model determining unit shown in Figure 12 and the spectral model determining unit shown in Figure 13 may each be implemented by a separate physical unit, or may both be implemented by a single physical unit. When a fundamental frequency model is to be generated, the fundamental frequency binary decision tree corresponding to the synthesis unit is obtained, and the synthesis unit is analyzed and a decision is made accordingly, giving the fundamental frequency model of the synthesis unit. When a spectral model is to be generated, the spectrum binary decision tree corresponding to the synthesis unit is obtained, and the synthesis unit is analyzed and a decision is made accordingly, giving the spectral model of the synthesis unit.

It can be seen that the voice signal sending system of the embodiments of the present invention greatly reduces the transmission bit rate while keeping the loss of sound quality in speech recovery to a minimum, reduces traffic consumption, solves the problem that traditional speech coding methods cannot balance sound quality against traffic, and improves the user communication experience in the mobile network era.

Correspondingly, an embodiment of the present invention also provides a voice signal receiving system; Figure 14 is a structural block diagram of this system.

In this embodiment, the voice signal receiving system comprises:

A receiving module 141, configured to receive the sequence number string corresponding to a speech synthesis parameter model sequence;

An extraction module 142, configured to obtain the speech synthesis parameter model sequence from a codebook according to the sequence number string;

A determination module 143, configured to determine a speech synthesis parameter sequence according to the speech synthesis parameter model sequence;

A signal recovery module 144, configured to recover the voice signal according to the speech synthesis parameter sequence.

The determination module 143 may determine the speech synthesis parameters according to the speech synthesis parameter model sequence and the duration of each model in the sequence, generating the speech synthesis parameter sequence. For the specific implementation, refer to the description in the voice signal receiving method embodiment above; it is not repeated here.
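The work of the extraction module 142 and the determination module 143 can be sketched in a deliberately simplified form: each index is looked up in a codebook and the model mean is held for the model's duration. The codebook contents and the tuple layout are hypothetical, and holding the mean constant is a simplification adopted here for brevity; it is not the parameter generation procedure of the method embodiment.

```python
def restore_parameters(sequence, f0_codebook, spectrum_codebook):
    """Expand a received index sequence into frame-level synthesis parameters.

    sequence: list of (duration_frames, f0_index, spectrum_index)
    f0_codebook: dict index -> mean F0 value (hypothetical codebook)
    spectrum_codebook: dict index -> tuple of mean spectral coefficients
    """
    f0_track = []
    spectrum_track = []
    for duration, f0_index, spectrum_index in sequence:
        f0_mean = f0_codebook[f0_index]
        spectrum_mean = spectrum_codebook[spectrum_index]
        # Simplification: repeat the model mean for every frame of the unit.
        f0_track.extend([f0_mean] * duration)
        spectrum_track.extend(list(spectrum_mean) for _ in range(duration))
    return f0_track, spectrum_track


f0_codebook = {5: 120.0, 17: 180.0}
spectrum_codebook = {300: (0.1, 0.2), 42: (0.3, 0.4)}
received = [(2, 5, 300), (3, 17, 42)]
f0_track, spectrum_track = restore_parameters(received, f0_codebook, spectrum_codebook)
```

The frame-level tracks produced this way are what a synthesizer (the signal recovery module 144) would consume; the sender and receiver must hold identical codebooks for the indices to be meaningful.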

Because the recovery of the voice signal in the voice signal receiving system of the embodiments of the present invention is independent of the speech sampling rate, signal transmission at a very low bit rate can be achieved while keeping the loss of sound quality in speech recovery to a minimum. This largely resolves the sound quality and traffic problems of traditional speech coding methods, improves the user communication experience in the mobile network era, and saves network charges.

The voice signal sending and receiving schemes of the embodiments of the present invention are applicable to the coding of various types of speech (such as ultra-wideband speech at a 16 kHz sampling rate or narrowband speech at an 8 kHz sampling rate) and can achieve good sound quality.

The embodiments in this specification are described progressively; for parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts may be found in the description of the method embodiments. The system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

The embodiments of the present invention have been described in detail above, and specific implementations have been used herein to explain the invention; the description of the above embodiments is only intended to help in understanding the method and apparatus of the present invention. At the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims

1. A voice signal sending method, characterized by comprising:

Determining the text content corresponding to a continuous speech signal to be sent;

2. The method according to claim 1, characterized in that said determining the text content corresponding to the continuous speech signal to be sent comprises:

Determining the text content corresponding to the continuous speech signal to be sent by a speech recognition algorithm; or

Obtaining the text content corresponding to the continuous speech signal to be sent through manual annotation.

3. The method according to claim 1, characterized in that said determining the speech synthesis parameter model of each synthesis unit according to the text content comprises:

Cutting the continuous speech signal into speech segments according to the text content, obtaining the speech segment corresponding to each synthesis unit;

Determining in turn the duration, fundamental frequency model and spectral model of the speech segment corresponding to each synthesis unit.

4. The method according to claim 3, characterized in that said determining the fundamental frequency model corresponding to the synthesis unit comprises:

Obtaining the fundamental frequency binary decision tree corresponding to the synthesis unit;

Performing text analysis on the synthesis unit to obtain the contextual information of the synthesis unit;

Making a path decision in the fundamental frequency binary decision tree according to the contextual information, obtaining the corresponding leaf node;

Taking the fundamental frequency model corresponding to the leaf node as the fundamental frequency model of the synthesis unit.

5. The method according to claim 3, characterized in that said determining the spectral model corresponding to the synthesis unit comprises:

Obtaining the spectrum binary decision tree corresponding to the synthesis unit;

Performing text analysis on the synthesis unit to obtain its contextual information, such as the phoneme unit, tone, part of speech and prosodic level;

Making a path decision in the spectrum binary decision tree according to the contextual information of the synthesis unit, obtaining the corresponding leaf node;

Taking the spectral model corresponding to the leaf node as the spectral model of the synthesis unit.

6. The method according to claim 4 or 5, characterized in that the method further comprises building the binary decision tree corresponding to the synthesis unit in the following manner:

Obtaining training data;

Extracting from the training data the synthesis parameters of the set of speech segments corresponding to the synthesis unit, the synthesis parameters comprising fundamental frequency features and spectral features;

Initializing the binary decision tree corresponding to the synthesis unit according to the synthesis parameters;

Examining each non-leaf node in turn, starting from the root node of the binary decision tree;

If the node under examination needs to be split, splitting it and obtaining the child nodes and the training data corresponding to each child node; otherwise, marking the node under examination as a leaf node;

After all non-leaf nodes have been examined, obtaining the binary decision tree of the synthesis unit.

7. The method according to claim 3, characterized in that said determining the fundamental frequency model corresponding to the synthesis unit comprises:

Determining the fundamental frequency feature sequence corresponding to the synthesis unit;

Obtaining the fundamental frequency model set corresponding to the synthesis unit;

Calculating the likelihood of the fundamental frequency feature sequence with respect to each fundamental frequency model in the fundamental frequency model set;

Selecting the fundamental frequency model with the maximum likelihood as the fundamental frequency model of the synthesis unit.

8. The method according to claim 3, characterized in that said determining the spectral model corresponding to the synthesis unit comprises:

Determining the spectral feature sequence corresponding to the synthesis unit;

Obtaining the spectral model set corresponding to the synthesis unit;

Calculating the likelihood of the spectral feature sequence with respect to each spectral model in the spectral model set;

Selecting the spectral model with the maximum likelihood as the spectral model of the synthesis unit.

9. A voice signal receiving method, characterized by comprising:

Receiving the sequence number string corresponding to a speech synthesis parameter model sequence;

Obtaining the speech synthesis parameter model sequence from a codebook according to the sequence number string;

Determining a speech synthesis parameter sequence according to the speech synthesis parameter model sequence;

Recovering the voice signal according to the speech synthesis parameter sequence.

10. The method according to claim 9, characterized in that said determining the speech synthesis parameter sequence according to the speech synthesis parameter model sequence comprises:

Determining the speech synthesis parameters according to the speech synthesis parameter model sequence and the duration of each model in the sequence, generating the speech synthesis parameter sequence.

11. A voice signal sending system, characterized by comprising:

12. The system according to claim 11, characterized in that the text acquisition module comprises:

A speech recognition unit, configured to determine the text content corresponding to the continuous speech signal to be sent by a speech recognition algorithm; or

An annotation information acquisition unit, configured to obtain the text content corresponding to the continuous speech signal to be sent through manual annotation.

13. The system according to claim 11, characterized in that the parameter model determination module comprises:

A cutting unit, configured to cut the continuous speech signal into speech segments according to the text content, obtaining the speech segment corresponding to each synthesis unit;

A duration determining unit, configured to determine in turn the duration of the speech segment corresponding to each synthesis unit;

A fundamental frequency model determining unit, configured to determine in turn the fundamental frequency model of the speech segment corresponding to each synthesis unit;

A spectral model determining unit, configured to determine in turn the spectral model of the speech segment corresponding to each synthesis unit.

14. The system according to claim 13, characterized in that the fundamental frequency model determining unit comprises:

A first acquiring unit, configured to obtain the fundamental frequency binary decision tree corresponding to the synthesis unit;

A first parsing unit, configured to perform text analysis on the synthesis unit and obtain the contextual information of the synthesis unit;

A first decision unit, configured to make a path decision in the fundamental frequency binary decision tree according to the contextual information, obtaining the corresponding leaf node;

A first output unit, configured to take the fundamental frequency model corresponding to the leaf node as the fundamental frequency model of the synthesis unit.

15. The system according to claim 13, characterized in that the spectral model determining unit comprises:

A second acquiring unit, configured to obtain the spectrum binary decision tree corresponding to the synthesis unit;

A second parsing unit, configured to perform text analysis on the synthesis unit and obtain its contextual information, such as the phoneme unit, tone, part of speech and prosodic level;

A second decision unit, configured to make a path decision in the spectrum binary decision tree according to the contextual information of the synthesis unit, obtaining the corresponding leaf node;

A second output unit, configured to take the spectral model corresponding to the leaf node as the spectral model of the synthesis unit.

16. The system according to claim 14 or 15, characterized in that the system further comprises a binary decision tree building module, the binary decision tree building module comprising:

A training data acquiring unit, configured to obtain training data;

A parameter extraction unit, configured to extract from the training data the synthesis parameters of the set of speech segments corresponding to the synthesis unit, the synthesis parameters comprising fundamental frequency features and spectral features;

An initialization unit, configured to initialize the binary decision tree corresponding to the synthesis unit according to the synthesis parameters;

A node examination unit, configured to examine each non-leaf node in turn, starting from the root node of the binary decision tree; if the node under examination needs to be split, to split it and obtain the child nodes and the training data corresponding to each child node; otherwise, to mark the node under examination as a leaf node;

A binary decision tree output unit, configured to output the binary decision tree of the synthesis unit after the node examination unit has examined all non-leaf nodes.

17. The system according to claim 13, characterized in that the fundamental frequency model determining unit comprises:

A first determining unit, configured to determine the fundamental frequency feature sequence corresponding to the synthesis unit;

A first set acquiring unit, configured to obtain the fundamental frequency model set corresponding to the synthesis unit;

A first computing unit, configured to calculate the likelihood of the fundamental frequency feature sequence with respect to each fundamental frequency model in the fundamental frequency model set;

A first selecting unit, configured to select the fundamental frequency model with the maximum likelihood as the fundamental frequency model of the synthesis unit.

18. The system according to claim 13, characterized in that the spectral model determining unit comprises:

A second determining unit, configured to determine the spectral feature sequence corresponding to the synthesis unit;

A second set acquiring unit, configured to obtain the spectral model set corresponding to the synthesis unit;

A second computing unit, configured to calculate the likelihood of the spectral feature sequence with respect to each spectral model in the spectral model set;

A second selecting unit, configured to select the spectral model with the maximum likelihood as the spectral model of the synthesis unit.

19. A voice signal receiving system, characterized by comprising:

A receiving module, configured to receive the sequence number string corresponding to a speech synthesis parameter model sequence;

An extraction module, configured to obtain the speech synthesis parameter model sequence from a codebook according to the sequence number string;

A determination module, configured to determine a speech synthesis parameter sequence according to the speech synthesis parameter model sequence;

A signal recovery module, configured to recover the voice signal according to the speech synthesis parameter sequence.

20. The system according to claim 19, characterized in that:

The determination module is specifically configured to determine the speech synthesis parameters according to the speech synthesis parameter model sequence and the duration of each model in the sequence, generating the speech synthesis parameter sequence.