CN114203150A - Voice data processing method and device - Google Patents

Voice data processing method and device

Info

Publication number
CN114203150A
CN114203150A (application CN202111420017.9A)
Authority
CN
China
Prior art keywords
sound source
source data
instruction
original sound
played
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111420017.9A
Other languages
Chinese (zh)
Inventor
吴少铎
戴治波
王瑞
吴晨捷
丁进飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xingyun Digital Technology Co Ltd
Original Assignee
Nanjing Xingyun Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xingyun Digital Technology Co Ltd filed Critical Nanjing Xingyun Digital Technology Co Ltd
Priority to CN202111420017.9A priority Critical patent/CN114203150A/en
Publication of CN114203150A publication Critical patent/CN114203150A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a voice data processing method and device. The method comprises: acquiring original sound source data converted from text in advance; receiving and parsing configuration information to obtain modification information matched with the original sound source data; and processing the original sound source data according to the modification information to obtain the voice data to be played. Adjusting the original sound source data converted from the text with the modification information amounts to a secondary creation on the original sound source data, yielding customized playback speech with richer emotional color, so that the listener enjoys a more anthropomorphic and more entertaining listening experience.

Description

Voice data processing method and device
Technical Field
The invention relates to the field of computer data processing, in particular to a voice data processing method and device.
Background
Language is a great art. Human language is closely tied to the course of civilization, reflects the social and cultural character of its time, and carries rich emotional color in communication. When text is the carrier, a speaker must combine many skills with context to express that emotional color accurately. When text is converted to speech, the converting machine cannot understand the emotional color: artificial-intelligence techniques in the prior art can convert text to speech, but the emotional color is hard to simulate. In playback, emotional color is expressed mainly through reading skills including accent, pause, tone, speed, and intonation.
To address this, machine deep learning is usually used to process the voice data, but such methods are limited to text conversion in certain domains, and the playback effect of the converted speech remains far from the effect of a human reading the text aloud.
Disclosure of Invention
The invention aims to provide a voice data processing method and device that enrich the emotional color of speech converted from text, so that the result approaches the effect of a human reading the text.
The technical scheme of the invention is as follows: in a first aspect, the present invention provides a method for processing voice data, the method comprising:
acquiring original sound source data converted from a text in advance;
receiving and analyzing configuration information to obtain modification information matched with the original sound source data;
and processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played.
In a preferred embodiment, the modification information includes a control instruction matched with the original sound source data, and the control instruction includes at least one of: a pause instruction, an accent instruction, a speech rate adjustment instruction, an intonation adjustment instruction, and an add-verbal-tic instruction.
In a preferred embodiment, when the control instruction includes the add-verbal-tic instruction and/or the intonation adjustment instruction, the modification information further includes auxiliary sound source data matched with the control instruction.
In a preferred embodiment, when the control instruction includes a pause instruction, processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played includes:
marking a pause at the corresponding position in the original sound source data based on the pause instruction and the preset modification rule, so as to obtain the voice data to be played.
In a preferred embodiment, when the control instruction includes an accent instruction, processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played includes:
adjusting the amplitude at the corresponding position of the original sound source data based on the accent instruction and the preset modification rule, or adjusting both the amplitude and the frequency at the corresponding position of the original sound source data based on the accent instruction and the preset modification rule, so as to obtain the voice data to be played.
In a preferred embodiment, when the control instruction includes a speech rate adjustment instruction, processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played includes:
adjusting the playback frame rate at the corresponding position of the original sound source data based on the speech rate adjustment instruction and the preset modification rule, so as to obtain the voice data to be played.
In a preferred embodiment, when the control instruction includes an intonation adjustment instruction, processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played includes:
acquiring auxiliary sound source data associated with the intonation adjustment instruction; and
cutting audio frames from the auxiliary sound source data based on the intonation adjustment instruction and replacing the corresponding audio frames in the original sound source data with the cut frames, so as to obtain the voice data to be played.
In a preferred embodiment, when the control instruction includes an add-verbal-tic instruction, processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played includes:
performing secondary synthesis with the original sound source data as a template, based on the add-verbal-tic instruction, to obtain the voice data to be played.
In a preferred embodiment, performing secondary synthesis with the original sound source data as a template based on the add-verbal-tic instruction to obtain the voice data to be played includes:
acquiring a corresponding first audio frame from the auxiliary sound source data based on the add-verbal-tic instruction; and
inserting the first audio frame at a designated position in the original sound source data based on the add-verbal-tic instruction, and shifting the audio frames located after the first audio frame in the original sound source data to eliminate the time difference caused by the insertion, so as to obtain the voice data to be played; or:
acquiring a corresponding first audio frame from the auxiliary sound source data and locating a corresponding second audio frame in the original sound source data based on the add-verbal-tic instruction; and
replacing the second audio frame with the first audio frame based on the add-verbal-tic instruction, and shifting the audio frames located after the first audio frame in the original sound source data to eliminate the time difference caused by the replacement, so as to obtain the voice data to be played.
In a second aspect, the present invention provides a speech data processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring original sound source data converted from texts in advance;
the receiving and parsing module is used for receiving and parsing the configuration information to obtain modification information matched with the original sound source data;
and the processing module is used for processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played.
Compared with the prior art, the invention has the following advantages. A voice data processing method and device are provided, the method comprising: acquiring original sound source data converted from text in advance; receiving and parsing configuration information to obtain modification information matched with the original sound source data; and processing the original sound source data according to the modification information to obtain the voice data to be played. Adjusting the original sound source data converted from the text with the modification information amounts to a secondary creation on the original sound source data, yielding customized playback speech with richer emotional color, so that the listener enjoys a more anthropomorphic and more entertaining listening experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a voice data processing method according to an embodiment of the present invention;
fig. 2 is a block diagram of a voice data processing apparatus according to a second embodiment of the present invention.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As described in the background, after current text is converted into speech, every sentence is pronounced mechanically and playback lacks the emotional color of human reading; alternatively, because a program has difficulty recognizing context such as irony or subtext in the text, the emotional color the program infers is biased. Either way, the listener's experience is poor.
To solve the above problems, the present invention provides a voice data processing method and device. A playback system obtains original sound source data converted from text together with pre-configured modification information matched with that data, uses the modification information to supplement the original sound source data, and, by abstracting a control model, supplements as many reading-control factors as possible to enrich the original sound source data, so that TTS (Text To Speech, the part of human-machine dialogue that lets a machine speak) can express emotional elements.
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The first embodiment is as follows: the present embodiment provides a voice data processing method, as shown in fig. 1, the method includes:
and S110, acquiring original sound source data converted from the text in advance.
Specifically, after the text is acquired, all text strings in it are recognized; the prosodic information of each text string is recognized using a tree structure containing prosody-matching templates (the tree structure is based on accent patterns, so each node provides an accent level tied to a syllable portion of the text string); and the text string is converted into audible speech using that prosodic information. Prosody refers to the rhythm and intonation of speech. Alternatively, a text-to-speech (TTS) converter is used to convert the received text data into an audio speech signal.
Of course, other methods may also be used to convert the text into the speech to obtain the original sound source data, which is not limited in this embodiment.
And S120, receiving and analyzing the configuration information to obtain modification information matched with the original sound source data.
Specifically, modification information that is configured in advance and matched with the original sound source data, so as to enrich its emotional color, is received. Because emotional color in speech is embodied mainly through reading skills such as accent, pause, tone, speed, and intonation, the modification information preferably includes a control instruction matched with the original sound source data, the control instruction including at least one of: a pause instruction, an accent instruction, a speech rate adjustment instruction, an intonation adjustment instruction, and an add-verbal-tic instruction. That is, the control instruction may be any single one of these five instructions, or any combination of two, three, four, or all five of them.
More specifically, the pause instruction further includes pause-node position information and pause-duration information; the accent instruction includes accent-point position information and accent-level information; the speech rate adjustment instruction includes speech-rate-adjustment node position information and speed information; the intonation adjustment instruction includes intonation-adjustment node position information and target-intonation information; and the add-verbal-tic instruction includes insertion-node position information and tic-content information.
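As an illustration, such modification information could be carried in a structured payload. The sketch below is hypothetical: the patent does not prescribe a serialization format, and all field names are illustrative.

```python
# Hypothetical modification-information payload; field names and structure are
# illustrative only -- the patent does not define a serialization format.
modification_info = {
    "pause": [
        {"position_s": 3.0, "duration_s": 1.0},          # pause node + duration
    ],
    "accent": [
        {"position_s": 60.0, "level": 2},                # accent point + level
    ],
    "speech_rate": [
        {"start_s": 675.0, "end_s": 680.0, "rate": 0.8},  # span + target speed
    ],
    "intonation": [
        {"start_s": 675.0, "end_s": 680.0, "target": "rising"},
    ],
    "verbal_tic": [
        {"position": "sentence_start", "content": "aiyo"},
    ],
}
```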
In a preferred embodiment, when the control instruction includes an add-verbal-tic instruction and/or an intonation adjustment instruction, the modification information further includes auxiliary sound source data matched with the control instruction.
Specifically, when the control instruction includes an intonation adjustment instruction, an add-verbal-tic instruction, or both, auxiliary sound source data is needed: intonation adjustment of the original sound source data requires inserting audio frames with a different intonation, and adding a verbal tic requires inserting the audio frames of the tic content. Therefore, when the control instruction includes at least one of these two instructions, auxiliary sound source data matched with the control instruction must be configured in the modification information. That is, when the control instruction includes an intonation adjustment instruction, an intonation sound source that matches the instruction and differs from the original sound source is configured in the modification information; when the control instruction includes an add-verbal-tic instruction, a tic sound source matched with that instruction is configured. This facilitates the corresponding processing of the original sound source data.
And S130, processing the original sound source data according to the modification information and the preset modification rule to obtain the voice data to be played.
Specifically, the playback system parses the received original sound source data to obtain the original audio data, modifies the corresponding data frames (audio frames) in the original audio data according to the corresponding control instruction and the preset modification rule, and outputs the modified audio data to the final player for playback. The decoding of the original sound source data by the decoder and the playback of the modified audio data by the final player are existing functions of the decoder and the player, and are not repeated in this embodiment.
In one embodiment, when the control instruction includes a pause instruction, processing the original sound source data according to the modification information and the preset modification rule to obtain the voice data to be played includes:
S130-1, marking pauses at the corresponding positions in the original sound source data based on the pause instruction and the preset modification rule, to obtain the voice data to be played.
Pauses simulate physiological breathing, mark syntactic structure, signal emphasis, and express emotion; they also give the listener room to think, understand, and accept, helping deepen impressions, and leave gaps for reactions such as exclamations so that the following content is not missed. A pause control frame is inserted at a specified time position with a specified pause duration; when silence is inserted, the original waveform is shifted to the right.
After receiving the pause instruction, the playback system marks pauses at the corresponding positions in the original sound source data according to the preset modification rule and the pause-node position and pause-duration information in the instruction. For example, suppose the pause instruction contains a first pause node at 3 s with a duration of 1 s and a second pause node at 1 min 20 s with a duration of 2 s. The playback system then inserts a first pause control frame at 3 s on the playback time axis of the original audio data and a second pause control frame at 1 min 20 s; when the player reaches the first control frame it pauses for 1 s, and at the second control frame it pauses for 2 s.
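A minimal sketch of this pause step, assuming the original sound source data is mono PCM held in a NumPy array; the function name and the 16 kHz sample rate are assumptions, not taken from the patent:

```python
import numpy as np

def insert_pause(audio: np.ndarray, sample_rate: int,
                 position_s: float, duration_s: float) -> np.ndarray:
    """Insert silence at position_s; all later samples shift right by
    duration_s, matching the 'waveform shifted to the right' step above."""
    idx = int(position_s * sample_rate)
    silence = np.zeros(int(duration_s * sample_rate), dtype=audio.dtype)
    return np.concatenate([audio[:idx], silence, audio[idx:]])

# The example above: a 1 s pause at 3 s, then a 2 s pause at 1 min 20 s.
sr = 16000
audio = np.random.uniform(-0.1, 0.1, 90 * sr).astype(np.float32)  # dummy audio
audio = insert_pause(audio, sr, 3.0, 1.0)
audio = insert_pause(audio, sr, 80.0 + 1.0, 2.0)  # +1.0 offsets the first insert
```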
In one embodiment, when the control instruction includes an accent instruction, processing the original sound source data according to the modification information and the preset modification rule to obtain the voice data to be played includes:
S130-2, adjusting the amplitude at the corresponding position of the original sound source data based on the accent instruction and the preset modification rule, or adjusting both the amplitude and the frequency at the corresponding position based on the accent instruction, to obtain the voice data to be played.
Specifically, stressing key words and sentences lets the listener grasp the intended emphasis. Without combining context, a flat reading of the text struggles to convey the original author's intent, and this is currently hard to achieve under practical resource and time constraints: existing systems typically segment the text into words but do not "think about" the central idea of the passage as a whole.
Accent control reinforces the tone of a designated text span or phoneme. In general, it suffices to increase the amplitude at the corresponding position of the sound source; in particular cases a supplementary frequency is also needed. In other words, the output energy at that position is increased.
After receiving the accent instruction, the playback system adjusts the amplitude of the audio frame at the designated position in the original sound source data, or both the amplitude and the frequency, according to the accent-point position information and accent-level information in the instruction, so as to obtain the voice data to be played.
For example, the playback system receives an accent instruction containing a first accent point at 1 min with a first accent level and a second accent point at 1 min 40 s with a second accent level. The preset modification rule specifies that level-one accent adds a first preset amplitude, while level-two accent adds a second preset amplitude and supplements a preset frequency. The playback system therefore adds the first preset amplitude to the audio frame at 1 min of the original sound source data, and adds the second preset amplitude and the preset frequency supplement to the audio frame at 1 min 40 s, so that the processed audio plays with a level-one accent at the first accent node and a level-two accent at the second.
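A sketch of the amplitude branch of this accent step, under the same assumed array representation; the gain values stand in for the unspecified "preset amplitudes":

```python
import numpy as np

def apply_accent(audio: np.ndarray, sample_rate: int,
                 position_s: float, span_s: float, gain: float) -> np.ndarray:
    """Raise the output energy of the accented span by scaling its amplitude."""
    out = audio.copy()
    a = int(position_s * sample_rate)
    b = int((position_s + span_s) * sample_rate)
    out[a:b] = np.clip(out[a:b] * gain, -1.0, 1.0)  # scale, stay in PCM range
    return out

# Level-one accent at 1 min, level-two at 1 min 40 s (gains are placeholders).
sr = 16000
audio = np.random.uniform(-0.1, 0.1, 120 * sr).astype(np.float32)
audio = apply_accent(audio, sr, 60.0, 0.5, gain=1.3)   # first preset amplitude
audio = apply_accent(audio, sr, 100.0, 0.5, gain=1.6)  # second preset amplitude
```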
In one embodiment, when the control instruction includes a speech rate adjustment instruction, processing the original sound source data according to the modification information and the preset modification rule to obtain the voice data to be played includes:
S130-3, adjusting the playback frame rate at the corresponding position of the original sound source data based on the speech rate adjustment instruction and the preset modification rule, to obtain the voice data to be played.
Specifically, speech rate expresses the reader's emotion. Adjusting it controls the length and density of syllables: fast speech expresses agitation, joy, excitement, and tension; slow speech expresses calm, solemnity, sadness, heaviness, and recollection; medium speed suits general narration, explanation, and argument. The listener may also change the speech rate temporarily to suit comprehension and habit.
Since speech-rate control generally requires specifying the start and end positions in the text along with the speed, the speech rate adjustment instruction includes speech-rate-adjustment node position information and speed information. Assume 1.0 is normal speed, values between 0 and 1 are slower, and values above 1.0 are faster.
For example, the playback system receives a speech rate adjustment instruction containing a start time node of 11 min 15 s, an end time node of 11 min 20 s, and a speed of 0.8x; the playback speed of the audio frames between 11 min 15 s and 11 min 20 s of the original audio data is then adjusted to 0.8 times normal.
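A sketch of this speech-rate step under the same assumptions. Plain linear resampling changes duration (and, as a side effect, pitch); a production system might use a pitch-preserving time-stretch instead, which the patent does not specify:

```python
import numpy as np

def change_speed(segment: np.ndarray, rate: float) -> np.ndarray:
    """rate > 1.0 plays faster (fewer output samples), rate < 1.0 slower,
    following the 1.0-is-normal convention described above."""
    n_out = int(len(segment) / rate)
    x_old = np.linspace(0.0, 1.0, num=len(segment))
    x_new = np.linspace(0.0, 1.0, num=n_out)
    return np.interp(x_new, x_old, segment).astype(segment.dtype)

# The example above: slow 11 min 15 s to 11 min 20 s down to 0.8x.
sr = 16000
audio = np.random.uniform(-0.1, 0.1, 700 * sr).astype(np.float32)
a, b = 675 * sr, 680 * sr
audio = np.concatenate([audio[:a], change_speed(audio[a:b], 0.8), audio[b:]])
```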
Sentence intonation is the rise and fall of pitch over a whole sentence; it conveys tone and attitude and is divided into rising, falling, level, and curved intonation, used to express irony, aversion, rhetorical questions, surprise, laughter, complaint, and the like.
In one embodiment, when the control instruction includes an intonation adjustment instruction, processing the original sound source data according to the modification information and the preset modification rule to obtain the voice data to be played includes:
S130-4A, acquiring the auxiliary sound source data associated with the intonation adjustment instruction.
The auxiliary sound source data associated with the intonation adjustment instruction is converted from the same text as the original sound source data, but its intonation differs from that of the original. There may be two or more such auxiliary sound sources, each with a different intonation, so that the original sound source data can be modified toward different intonations. For example, the original sound source data uses level intonation, the auxiliary sound source associated with a first intonation adjustment instruction uses rising intonation, the one associated with a second uses falling intonation, and the one associated with a third uses curved intonation. Audio frames at the designated positions are subsequently taken from these auxiliary sound sources, according to the intonation adjustment instructions, to modify the original sound source data.
S130-4B, cutting audio frames from the auxiliary sound source data associated with the intonation adjustment instruction, based on the instruction and the preset modification rule, and replacing the corresponding audio frames in the original sound source data with them, to obtain the voice data to be played.
In this embodiment the corresponding portion of the associated auxiliary sound source data is spliced in directly rather than produced by a synthesis algorithm, which is simple and less error-prone. During replacement, the duration and position are determined, the original audio frames are deleted, the new audio frames are spliced in, and the subsequent audio frames are phase-shifted by the time difference between the old and new segments.
For example, the playback system receives an intonation adjustment instruction containing a first adjustment from 11 min 15 s to 11 min 20 s with target intonation 1, a second adjustment from 11 min 40 s to 11 min 55 s with target intonation 2, and a third adjustment from 12 min 40 s to 12 min 55 s with target intonation 3. The modification rule maps intonation 1 to rising, intonation 2 to falling, and intonation 3 to curved, and the auxiliary sound sources associated with the first, second, and third instructions use rising, falling, and curved intonation throughout, respectively. The system therefore cuts the frames at 11 min 15 s to 11 min 20 s on the playback time axis of the first auxiliary sound source and substitutes them for the corresponding frames of the original sound source data; cuts the frames at 11 min 40 s to 11 min 55 s from the second auxiliary sound source and substitutes them likewise; and cuts the frames at 12 min 40 s to 12 min 55 s from the third auxiliary sound source and substitutes them as well. After this replacement, the original sound source data plays with rising intonation from 11 min 15 s to 11 min 20 s, falling intonation from 11 min 40 s to 11 min 55 s, and curved intonation from 12 min 40 s to 12 min 55 s.
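A sketch of this splice, assuming the auxiliary sound source is time-aligned with the original (both were converted from the same text); the names are illustrative:

```python
import numpy as np

def replace_span(original: np.ndarray, auxiliary: np.ndarray,
                 sample_rate: int, start_s: float, end_s: float) -> np.ndarray:
    """Cut [start_s, end_s) from the auxiliary source (same text, different
    intonation) and splice it over the same span of the original."""
    a, b = int(start_s * sample_rate), int(end_s * sample_rate)
    patch = auxiliary[a:b]
    # If the two renditions differed in length, the frames after the patch
    # would shift by the difference, as the embodiment describes.
    return np.concatenate([original[:a], patch, original[b:]])

# The example above: rising intonation over 11 min 15 s to 11 min 20 s.
sr = 16000
original = np.random.uniform(-0.1, 0.1, 800 * sr).astype(np.float32)
rising = np.random.uniform(-0.1, 0.1, 800 * sr).astype(np.float32)
original = replace_span(original, rising, sr, 675.0, 680.0)
```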
In one embodiment, when the control instruction includes an add-verbal-tic instruction, processing the original sound source data according to the modification information and the preset modification rule to obtain the voice data to be played includes:
S130-5, performing secondary synthesis with the original sound source data as a template, based on the add-verbal-tic instruction, to obtain the voice data to be played.
Specifically, a verbal tic is a signature trait of an individual's style, such as a retroflex (erhua) ending, a special pronunciation of a particular word, a pet phrase at the start of a sentence, or a sentence-final particle at the end of a sentence.
Preferably, S130-5 specifically includes:
S130-5A, acquiring a corresponding first audio frame from the auxiliary sound source data based on the add-verbal-tic instruction.
The first audio frame is the audio frame corresponding to the tic-content information contained in the add-verbal-tic instruction. Specifically, when the tic content is a retroflex ending, the first audio frame is the frame whose content is that ending; when it is a special pronunciation of a particular word, the frame whose content is that pronunciation; when it is a pet phrase, the frame whose content is that phrase; and when it is a sentence-final particle, the frame whose content is that particle. The corresponding first audio frame is acquired according to the specific tic content in the instruction.
S130-5B, inserting the first audio frame at the designated position in the original sound source data based on the add-verbal-tic instruction, and shifting the audio frames located after it in the original sound source data to eliminate the time difference caused by the insertion, so as to obtain the voice data to be played.
Specifically, when the tic content is a retroflex ending, the designated insertion position in the original sound source data is determined from the insertion-node position information in the instruction, the first audio frame containing the ending is inserted there, a small part of the phrase-final phoneme is deleted, and the subsequent audio is phase-shifted to the right. Determining the insertion position may involve searching the original sound source data for a preset specific phrase; once the phrase is located, a first audio frame with content "er" (neutral tone) is inserted after it, the phrase-final phoneme is trimmed slightly, and the subsequent audio is shifted right by len(inserted audio) minus offset (the length of the deleted phoneme).
When the tic content in the instruction is a pet phrase, the first audio frame is the frame whose content is that phrase, and the insertion-node position information designates the start of the target sentence. The start position of the target sentence in the original sound source data is determined from the node position information, the first audio frame containing the pet phrase is inserted there, and the subsequent audio is phase-shifted to the right. For example, the control instruction received by the playback system includes an add-verbal-tic instruction whose tic content is "aiyo" and whose node position is "the start of every sentence"; the playback system obtains from the auxiliary sound source data a first audio frame whose content is "aiyo", traverses the original sound source data to locate the start of each sentence, and inserts that frame at each start position. Of course, the target sentence could instead be the second sentence, the eighth sentence, or any other sentence position.
When the tic content in the instruction is a sentence-final particle, the first audio frame is the frame whose content is that particle, and the insertion-node position information designates the target sentence ends. The sentence-end positions in the original sound source data are determined from the node position information, the first audio frame containing the particle is inserted at each, and the subsequent audio is phase-shifted to the right. For example, the control instruction received by the playback system includes an add-verbal-tic instruction with a first sentence-final particle "ma" for the ends of the first and third sentences and a second sentence-final particle "bei" for the ends of the second and fourth sentences. The playback system obtains from the auxiliary sound source data a first audio frame A whose content is "ma" and a first audio frame B whose content is "bei", traverses the original sound source data to locate the sentence ends, inserts frame A at the ends of the first and third sentences and frame B at the ends of the second and fourth, and phase-shifts the subsequent audio to the right.
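A sketch of the insertion step shared by the pet-phrase and particle cases, under the same assumed representation; the clip stands for the first audio frame obtained from the auxiliary sound source:

```python
import numpy as np

def insert_tic(original: np.ndarray, clip: np.ndarray,
               sample_rate: int, position_s: float) -> np.ndarray:
    """Insert the tic clip at position_s; everything after it shifts right by
    len(clip), which is the phase shift the embodiment describes."""
    idx = int(position_s * sample_rate)
    return np.concatenate([original[:idx], clip, original[idx:]])

sr = 16000
original = np.random.uniform(-0.1, 0.1, 30 * sr).astype(np.float32)
tic = np.random.uniform(-0.1, 0.1, int(0.4 * sr)).astype(np.float32)  # ~0.4 s clip
original = insert_tic(original, tic, sr, 0.0)  # at the start of the sentence
```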
An add-verbal-tic instruction whose tic content is a special pronunciation of a particular word is handled similarly to intonation adjustment, because synthesizing such a sound source is error-prone; the processing of this instruction therefore comprises:
S130-5a, acquiring a corresponding first audio frame from the auxiliary sound source data and locating a corresponding second audio frame in the original sound source data, based on the add-verbal-tic instruction.
Specifically, when the tic content contained in the instruction received by the playback system is a special pronunciation of a particular word, the first audio frame acquired from the auxiliary sound source data is the frame whose content is that special pronunciation.
S130-5b, replacing the second audio frame with the first audio frame based on the add-verbal-tic instruction, and shifting the audio frames located after the first audio frame in the original sound source data to eliminate the time difference caused by the replacement, so as to obtain the voice data to be played.
Specifically, the positions of all occurrences of the particular word are found by traversing the original sound source data sentence by sentence, or the modification information contains preset position information for the word, from which the designated frame position is obtained. The audio frame of the original sound source data in which the word occurs is the second audio frame; it is replaced entirely by the first audio frame containing the special pronunciation, and the subsequent audio is phase-shifted by the time difference between the two frames. In the case just described, where every occurrence is replaced, the insertion-node position information in the instruction is "every position of the word"; in this embodiment it may instead designate particular occurrences, and the embodiment does not limit the choice of word. In a first example, the instruction's tic content is the word "I" with a special rising-tone pronunciation and its node position information is "the second occurrence": the audio frame containing the second occurrence of "I" in the original sound source data is replaced with the first audio frame containing that pronunciation, and the audio after it is phase-shifted by the before-and-after time difference.
The playback system performs these dynamic adjustments during the decode-and-play process; this is the secondary synthesis.
The voice data processing method and device provided by this embodiment comprise: acquiring original sound source data converted from text in advance; receiving modification information matched with the original sound source data; and processing the original sound source data according to the modification information to obtain the voice data to be played. Adjusting the original sound source data converted from the text with the modification information amounts to a secondary creation on the original sound source data, yielding customized playback speech with richer emotional color, so that the listener enjoys a more anthropomorphic and more entertaining listening experience.
Furthermore, the control instruction includes at least one of a pause instruction, an accent instruction, a speech rate adjustment instruction, an intonation adjustment instruction, and an add-verbal-tic instruction, and the original sound source data is correspondingly processed with pauses, accents, speech-rate adjustments, intonation adjustments, and/or verbal tics according to the control instruction. The processed sound source therefore conveys emotional color through the reading skills of pause, accent, speech rate, intonation, and verbal tics during playback, making the speech more vivid and human, and giving the listener a better auditory experience.
Example two: the present embodiment provides a voice data processing apparatus, including:
an obtaining module 210, configured to obtain original sound source data converted from a text in advance;
a receiving and analyzing module 220, configured to receive and analyze the configuration information to obtain modification information matched with the original sound source data;
and the processing module 230 is configured to process the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played.
In a preferred embodiment, the modification information includes a control instruction matched with the original sound source data, the control instruction including at least one of: a pause instruction, an accent instruction, a speech rate adjustment instruction, an intonation adjustment instruction, and an add-verbal-tic instruction.
When the control instruction includes a pause instruction and/or an add-verbal-tic instruction and/or an intonation adjustment instruction, the modification information further includes auxiliary sound source data matched with the control instruction.
When the control instruction includes a pause instruction, the auxiliary sound source data includes a silent audio frame, and the processing module 230 is specifically configured to:
insert the silent audio frame into the original sound source data at the designated position and for the designated duration, based on the pause instruction and the preset modification rule.
When the control instruction includes an accent instruction, the processing module 230 is specifically configured to:
adjust the amplitude at the corresponding position of the original sound source data based on the accent instruction and the preset modification rule, or adjust both the amplitude and the frequency at that position based on the accent instruction and the preset modification rule.
When the control instruction includes a speech rate adjustment instruction, the processing module 230 is specifically configured to:
adjust the playback frame rate at the corresponding position of the original sound source data based on the speech rate adjustment instruction and the preset modification rule.
When the control instruction includes an intonation adjustment instruction, the processing module 230 includes:
a first acquisition unit 231, configured to acquire the auxiliary sound source data associated with the intonation adjustment instruction and corresponding to the original sound source data; and
a cut-and-replace unit 232, configured to cut audio frames from the auxiliary sound source data associated with the intonation adjustment instruction, based on that instruction, and replace the corresponding audio frames in the original sound source data with the cut frames.
When the control instruction includes an add-verbal-tic instruction, the processing module 230 is specifically configured to:
perform secondary synthesis with the original sound source data as a template, based on the add-verbal-tic instruction.
More preferably, for the secondary synthesis with the original sound source data as a template based on the add-verbal-tic instruction, the processing module 230 includes:
a second acquisition unit 233, configured to acquire a corresponding first audio frame from the auxiliary sound source data based on the add-verbal-tic instruction; and
an insertion processing unit 234, configured to insert the first audio frame at the designated position in the original sound source data based on the add-verbal-tic instruction and shift the audio frames located after it to eliminate the time difference caused by the insertion; or includes:
an acquisition and locating unit 235, configured to acquire a corresponding first audio frame from the auxiliary sound source data and locate a corresponding second audio frame in the original sound source data based on the add-verbal-tic instruction; and
a replacement processing unit 236, configured to replace the second audio frame with the first audio frame based on the add-verbal-tic instruction and shift the audio frames located after the first audio frame in the original sound source data to eliminate the time difference caused by the replacement.
The voice data processing apparatus provided in this embodiment is used to implement the voice data processing method provided in the first embodiment, and its beneficial effects are the same as those of the voice data processing method provided in the first embodiment, and are not described herein again.
It should be noted that: in the voice data processing apparatus provided in the foregoing embodiment, when the voice data processing method is executed, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the voice data processing apparatus and the voice data processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
It should be further noted that: the terms "first" and "second" in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
It should be understood that the above-mentioned embodiments are only illustrative of the technical concepts and features of the present invention, and are intended to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the scope of the present invention. All modifications made according to the spirit of the main technical scheme of the invention are covered in the protection scope of the invention.

Claims (10)

1. A method of processing speech data, the method comprising:
acquiring original sound source data converted from a text in advance;
receiving and analyzing configuration information to obtain modification information matched with the original sound source data;
and processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played.
2. The voice data processing method according to claim 1, wherein the modification information includes a control instruction matched with the original sound source data, the control instruction including at least one of: a pause instruction, an accent instruction, a speech rate adjustment instruction, an intonation adjustment instruction, and an add-verbal-tic instruction.
3. The voice data processing method according to claim 2, wherein when the control instruction includes the add-verbal-tic instruction and/or the intonation adjustment instruction, the modification information further includes auxiliary sound source data matched with the control instruction.
4. The voice data processing method according to claim 3, wherein when the control instruction includes a pause instruction, processing the original sound source data according to the modification information and the preset modification rule to obtain the voice data to be played includes:
marking a pause at the corresponding position in the original sound source data based on the pause instruction and the preset modification rule, so as to obtain the voice data to be played.
5. The voice data processing method according to claim 3, wherein when the control instruction includes an accent instruction, processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played includes:
adjusting the amplitude at the corresponding position of the original sound source data based on the accent instruction and the preset modification rule, or adjusting both the amplitude and the frequency at that position based on the accent instruction and the preset modification rule, to obtain the voice data to be played.
6. The voice data processing method according to claim 3, wherein when the control instruction includes a speech rate adjustment instruction, processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played includes:
adjusting the playback frame rate at the corresponding position of the original sound source data based on the speech rate adjustment instruction and the preset modification rule, to obtain the voice data to be played.
7. The voice data processing method according to claim 3, wherein when the control instruction includes an intonation adjustment instruction, processing the original sound source data according to the modification information and the preset modification rule to obtain the voice data to be played includes:
acquiring auxiliary sound source data associated with the intonation adjustment instruction; and
cutting audio frames from the auxiliary sound source data based on the intonation adjustment instruction and replacing the corresponding audio frames in the original sound source data with the cut frames, to obtain the voice data to be played.
8. The voice data processing method according to claim 3, wherein when the control instruction includes an add-verbal-tic instruction, processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played comprises:
performing secondary synthesis with the original sound source data as a template, based on the add-verbal-tic instruction, to obtain the voice data to be played.
9. The voice data processing method according to claim 8, wherein performing secondary synthesis with the original sound source data as a template based on the add-verbal-tic instruction to obtain the voice data to be played comprises:
acquiring a corresponding first audio frame from the auxiliary sound source data based on the add-verbal-tic instruction; and
inserting the first audio frame at a designated position in the original sound source data based on the add-verbal-tic instruction, and shifting the audio frames located after the first audio frame in the original sound source data to eliminate the time difference caused by the insertion, so as to obtain the voice data to be played; or:
acquiring a corresponding first audio frame from the auxiliary sound source data and locating a corresponding second audio frame in the original sound source data based on the add-verbal-tic instruction; and
replacing the second audio frame with the first audio frame based on the add-verbal-tic instruction, and shifting the audio frames located after the first audio frame in the original sound source data to eliminate the time difference caused by the replacement, so as to obtain the voice data to be played.
10. A speech data processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring original sound source data converted from texts in advance;
the receiving and analyzing module is used for receiving and analyzing the configuration information to obtain the modification information matched with the original sound source data;
and the processing module is used for processing the original sound source data according to the modification information and a preset modification rule to obtain the voice data to be played.
CN202111420017.9A 2021-11-26 2021-11-26 Voice data processing method and device Pending CN114203150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111420017.9A CN114203150A (en) 2021-11-26 2021-11-26 Voice data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111420017.9A CN114203150A (en) 2021-11-26 2021-11-26 Voice data processing method and device

Publications (1)

Publication Number Publication Date
CN114203150A (en) 2022-03-18

Family

ID=80649114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111420017.9A Pending CN114203150A (en) 2021-11-26 2021-11-26 Voice data processing method and device

Country Status (1)

Country Link
CN (1) CN114203150A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1758330A (en) * 2004-10-01 2006-04-12 美国电报电话公司 Method and apparatus for preventing speech comprehension by interactive voice response systems
CN105280179A (en) * 2015-11-02 2016-01-27 小天才科技有限公司 Text-to-speech processing method and system
CN108711423A (en) * 2018-03-30 2018-10-26 百度在线网络技术(北京)有限公司 Intelligent sound interacts implementation method, device, computer equipment and storage medium
CN109618223A (en) * 2019-01-28 2019-04-12 北京易捷胜科技有限公司 A kind of sound replacement method
CN111128116A (en) * 2019-12-20 2020-05-08 珠海格力电器股份有限公司 Voice processing method and device, computing equipment and storage medium
CN111199724A (en) * 2019-12-31 2020-05-26 出门问问信息科技有限公司 Information processing method and device and computer readable storage medium
CN111489752A (en) * 2020-03-16 2020-08-04 咪咕互动娱乐有限公司 Voice output method, device, electronic equipment and computer readable storage medium
CN112185338A (en) * 2020-09-30 2021-01-05 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112967725A (en) * 2021-02-26 2021-06-15 平安科技(深圳)有限公司 Voice conversation data processing method and device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
JP4363590B2 (en) Speech synthesis
CN108962217B (en) Speech synthesis method and related equipment
Cahn The generation of affect in synthesized speech
CN103503015A (en) System for creating musical content using a client terminal
CN106971703A (en) A kind of song synthetic method and device based on HMM
JP2003084800A (en) Method and apparatus for synthesizing emotion conveyed on sound
JPH11202884A (en) Method and device for editing and generating synthesized speech message and recording medium where same method is recorded
CN108922505B (en) Information processing method and device
Aaron et al. Conversational computers
CN117854478A (en) Speech synthesis method, device and system based on controllable text
CN114203150A (en) Voice data processing method and device
CN110782866A (en) Singing sound converter
Sečujski et al. Learning prosodic stress from data in neural network based text-to-speech synthesis
CN114822490A (en) Voice splicing method and voice splicing device
JPH08335096A (en) Text voice synthesizer
JP4409279B2 (en) Speech synthesis apparatus and speech synthesis program
CA2343071A1 (en) Device and method for digital voice processing
CN114203174A (en) Text-to-speech processing method and computer readable storage medium
JP3292218B2 (en) Voice message composer
JPH1063287A (en) Pronunciation trainer
JP6911398B2 (en) Voice dialogue methods, voice dialogue devices and programs
JP2573586B2 (en) Rule-based speech synthesizer
KR20240080690A (en) Apparatus And Method For Studying Foreign Language Using Artificial Intelligence Voice Conversion
JPH08171394A (en) Speech synthesizer
CN118553229A (en) Speech synthesis method, apparatus, device, medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination