WO2024135901A1 - Interactive emotional voice synthesis method based on counterpart voice and conversational speech information - Google Patents
- Publication number
- WO2024135901A1 (PCT/KR2022/021192)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- information
- speech
- features
- user
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Definitions
- the present invention relates to voice synthesis technology and, more specifically, to a voice synthesis method for responsive emotional conversation that reflects the conversation partner's intention and voice characteristics.
- the present invention was conceived to solve the above problems.
- the purpose of the present invention is to provide a voice synthesis method that improves the quality of a conversation system's synthesized speech by using the conversation partner's voice information and conversational speech information as reference information.
- a voice synthesis method for achieving the above object includes extracting voice features from a user's voice information; generating reference information from the extracted voice features; and generating the system's synthesized speech from the system's conversational speech information and the reference information.
- the speech synthesis method further includes extracting text features from the user's conversation utterance information, and the reference information generating step may generate reference information from the extracted speech features and text features.
- Reference information may be referenced to generate a voice synthesis sound by reflecting the user's intention and emotion.
- the prosody of the system's synthesized voice may vary depending on the reference information.
- the voice synthesis method according to the present invention further includes converting the user's voice information into embedding information, and the voice feature extraction step may be extracting voice features from voice information converted into embedding information.
- the speech synthesis method further includes converting the user's conversation speech information into embedding information, and the text feature extraction step may be extracting text features from the conversation speech information converted into embedding information.
- the speech synthesis sound generation step of the system includes extracting text features from dialogue speech information of the system; and generating a speech synthesis sound of the system from the extracted text features and reference information.
- the voice synthesis method according to the present invention may further include outputting the voice synthesis sound of the generated system.
- a user may include multiple users.
- according to another aspect of the present invention, a voice synthesis system is provided that includes a processor that extracts voice features from the user's voice information, generates reference information from the extracted voice features, and generates the system's synthesized speech from the system's conversational speech information and the reference information; and a storage unit that provides the storage space the processor requires.
- according to another aspect, a voice synthesis method is provided that includes receiving a user's voice information; extracting voice features from the input voice information; generating reference information from the extracted voice features; generating the system's synthesized speech from the system's conversational speech information and the reference information; and outputting the generated synthesized speech.
- according to yet another aspect, a voice synthesis system is provided that includes a microphone that receives the user's voice information; a processor that extracts voice features from the voice information input through the microphone, generates reference information from the extracted voice features, and generates the system's synthesized speech from the system's conversational speech information and the reference information; and a speaker that outputs the synthesized speech generated by the processor.
- by using the conversation partner's voice information and conversational speech information as reference information when generating the conversation system's synthesized speech, the quality of synthesized speech that previously had to be generated in a complicated or uniform manner can be improved.
- in particular, the tone or prosody of the synthesized speech can be varied according to the conversation content or emotional state of the conversation partner, so that services using interactive voice interface technology, such as virtual assistants, can offer improved synthesized speech and a greater sense of immersion.
- Figure 1 is a diagram showing a prosody-based voice synthesis method
- Figure 2 is a diagram provided to explain an emotional voice synthesis system according to an embodiment of the present invention.
- Figure 3 is a flowchart provided to explain an emotional voice synthesis method according to another embodiment of the present invention.
- Figure 4 is a diagram showing the configuration of a conversation system according to another embodiment of the present invention.
- Figure 1 is a diagram illustrating a prosody-based voice synthesis method. Such a method either takes prosody information to be imitated as input, in the expectation of producing a synthesized sound with similar prosody, or takes detailed prosody parameters as input to produce a synthesized sound with adjusted prosody.
- an embodiment of the present invention presents an interactive emotional voice synthesis method based on the other person's voice and conversational speech information.
- in generating interactive synthesized speech, the technique reflects reactive emotion by taking the other person's conversational intention and voice characteristics into account.
- the method according to the embodiment of the present invention can produce various types of synthesized sounds depending on the other person's speech or voice.
- Figure 2 is a diagram provided to explain an emotional voice synthesis system according to an embodiment of the present invention.
- the emotional voice synthesis system includes a voice encoder 110, a text encoder 120, a reference encoder 130, a TTS encoder 140, and a TTS decoder 150.
- the voice encoder 110 extracts voice features from the user's voice information converted into embedding information.
- the front end of the voice encoder 110 may include a pretrained speech model that converts the user's voice information into embedding information.
- the text encoder 120 extracts text features from the user's conversation utterance information converted into embedding information.
- the front end of the text encoder 120 may include a pretrained language model that converts the user's conversation utterance information into embedding information.
- the reference encoder 130 generates reference information by combining the voice features extracted from the voice encoder 110 and the text features extracted from the text encoder 120.
- Reference information is information that is referenced to generate a voice synthesis sound by reflecting the user's intention and emotion.
- the voice synthesis sound finally output from the emotional voice synthesis system according to an embodiment of the present invention has different styles and prosody depending on this reference information.
- the TTS encoder 140 extracts text features from conversation information to be uttered by the system.
- the TTS decoder 150 generates a speech synthesis sound of the system from the text features extracted by the TTS encoder 140 and the reference information generated by the reference encoder 130.
- the rear end of the TTS decoder 150 may include output means for outputting the voice synthesis sound of the system generated by the TTS decoder 150.
- the emotional voice synthesis system generates speech with different prosody depending on the other person's voice and conversational speech information, even when generating the same utterance.
- the number of conversation partners is not limited to one person, and conversation is possible even when there are multiple conversation partners.
- the emotional voice synthesis system is not limited to taking raw voice and conversational speech data as input; it is also applicable when the voice synthesis system is implemented with features that can be obtained from the voice and conversational speech.
- Figure 3 is a flowchart provided to explain an emotional voice synthesis method according to another embodiment of the present invention.
- the pretrained speech model first converts the user's voice information into embedding information (S210), and the voice encoder 110 extracts voice features from the voice information converted into embedding information in step S210 (S220).
- the pretrained language model converts the user's conversation utterance information into embedding information (S230), and the text encoder 120 extracts text features from the conversation utterance information converted into embedding information in step S230 (S240).
- the reference encoder 130 generates reference information by combining the voice features extracted in step S220 and the text features extracted in step S240 (S250).
- the TTS encoder 140 extracts text features from the conversation information to be uttered by the system (S260). The TTS decoder 150 then generates the system's synthesized speech from the text features extracted in step S260 and the reference information generated in step S250 (S270). Afterwards, the system's synthesized speech generated in step S270 is output.
- FIG. 4 is a diagram showing the configuration of a conversation system according to another embodiment of the present invention.
- the conversation system according to an embodiment of the present invention includes a microphone 310, a processor 320, a speaker 330, and a storage unit 340.
- the microphone 310 is a voice input means for receiving the user's voice utterance
- the speaker 330 is a voice output means for outputting the voice synthesis sound of the system.
- the processor 320 performs the functions of the system shown in FIG. 2 described above or performs the method shown in FIG. 3.
- the storage unit 340 provides storage space necessary for the processor 320 to operate and function.
- the quality of synthesized speech that previously had to be generated in a complicated or uniform manner can be improved, and the prosody of the generated synthesized speech can be varied by adjusting the tone or delivery of the voice.
- a computer-readable recording medium can be any data storage device that can be read by a computer and store data.
- computer-readable recording media can be ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, hard disk drive, etc.
- computer-readable codes or programs stored on a computer-readable recording medium may be transmitted through a network connected between computers.
Abstract
An interactive voice synthesis method based on a counterpart's voice and conversational speech information is provided. The voice synthesis method according to an embodiment of the present invention extracts a voice feature from a user's voice information, extracts a text feature from the user's conversational speech information, generates reference information from the extracted voice feature and text feature, and generates a system's synthesized speech from the system's conversational speech information and the reference information. The voice information and conversational speech information of the conversation counterpart are thus used as reference information for generating the conversation system's synthesized speech, so the quality of synthesized speech, which previously had to be generated in a complicated or uniform manner, can be improved.
Description
The present invention relates to voice synthesis technology and, more specifically, to a voice synthesis method for responsive emotional conversation that reflects the conversation partner's intention and voice characteristics.
With recent research on fully end-to-end TTS (text-to-speech) training techniques, voice synthesis technology has advanced to a level at which it is difficult to distinguish synthesized speech from human speech.
However, such technology is suitable only for domains such as general announcements and audiobook reading; in emotion-laden 1:1 and N:1 conversations it still fails to produce utterances that fit the conversational atmosphere, which undermines the sense of immersion in the voice interface.
To address this, research has recently been active on creating prosody in synthesized speech by inserting a limited set of emotional information or by adjusting phoneme duration, pitch, and loudness.
However, individually predicting such information is of limited practical use when a responsive voice must be generated in real time.
The present invention was conceived to solve the above problems. Its purpose is to provide a voice synthesis method that improves the quality of a conversation system's synthesized speech by using the conversation partner's voice information and conversational speech information as reference information.
A voice synthesis method according to an embodiment of the present invention for achieving the above object includes extracting voice features from a user's voice information; generating reference information from the extracted voice features; and generating the system's synthesized speech from the system's conversational speech information and the reference information.
The voice synthesis method according to the present invention may further include extracting text features from the user's conversational speech information, and the reference-information generation step may generate the reference information from the extracted voice features and text features.
The reference information may be referenced to generate synthesized speech that reflects the user's intention and emotion.
The prosody of the system's synthesized speech may vary depending on the reference information.
The voice synthesis method according to the present invention may further include converting the user's voice information into embedding information, and the voice feature extraction step may extract voice features from the voice information converted into embedding information.
The voice synthesis method according to the present invention may further include converting the user's conversational speech information into embedding information, and the text feature extraction step may extract text features from the conversational speech information converted into embedding information.
The step of generating the system's synthesized speech may include extracting text features from the system's conversational speech information; and generating the system's synthesized speech from the extracted text features and the reference information.
The voice synthesis method according to the present invention may further include outputting the generated synthesized speech of the system.
The user may comprise multiple users.
According to another aspect of the present invention, a voice synthesis system is provided that includes a processor that extracts voice features from a user's voice information, generates reference information from the extracted voice features, and generates the system's synthesized speech from the system's conversational speech information and the reference information; and a storage unit that provides the storage space the processor requires.
According to another aspect of the present invention, a voice synthesis method is provided that includes receiving a user's voice information; extracting voice features from the input voice information; generating reference information from the extracted voice features; generating the system's synthesized speech from the system's conversational speech information and the reference information; and outputting the generated synthesized speech.
According to yet another aspect of the present invention, a voice synthesis system is provided that includes a microphone that receives a user's voice information; a processor that extracts voice features from the voice information input through the microphone, generates reference information from the extracted voice features, and generates the system's synthesized speech from the system's conversational speech information and the reference information; and a speaker that outputs the synthesized speech generated by the processor.
As described above, according to embodiments of the present invention, the conversation partner's voice information and conversational speech information are used as reference information to generate the conversation system's synthesized speech, so that the quality of synthesized speech that previously had to be generated in a complicated or uniform manner can be improved.
In particular, according to embodiments of the present invention, the tone or prosody of the synthesized speech can be varied according to the conversation content or emotional state of the conversation partner, so that services using interactive voice interface technology, such as virtual assistants, can offer improved synthesized speech and a greater sense of immersion.
Figure 1 is a diagram showing a prosody-based voice synthesis method;
Figure 2 is a diagram provided to explain an emotional voice synthesis system according to an embodiment of the present invention;
Figure 3 is a flowchart provided to explain an emotional voice synthesis method according to another embodiment of the present invention;
Figure 4 is a diagram showing the configuration of a conversation system according to yet another embodiment of the present invention.
Hereinafter, the present invention is described in more detail with reference to the drawings.
Figure 1 is a diagram illustrating a prosody-based voice synthesis method. Such a method either takes prosody information to be imitated as input, in the expectation of producing a synthesized sound with similar prosody, or takes detailed prosody parameters as input to produce a synthesized sound with adjusted prosody.
With this approach, additional work is required, such as finding a sample that matches the prosody to be generated at that moment, or analyzing detailed prosody information and supplying it as input. In other words, further research is needed before it can serve as the technique behind an interactive speech synthesis model.
Accordingly, an embodiment of the present invention presents an interactive emotional voice synthesis method based on the counterpart's voice and conversational speech information. In generating interactive synthesized speech, the technique reflects reactive emotion by taking the counterpart's conversational intention and voice characteristics into account.
Compared even with speech synthesis that relies on phoneme duration, pitch, and loudness, or that accepts only a predefined set of emotions (happy, angry, sad, and so on) as input, the method according to an embodiment of the present invention can produce diverse kinds of synthesized speech depending on the counterpart's utterance or voice.
In particular, to relieve the one-to-many problem that prosody research in speech synthesis has not resolved, the method admits a wide range of input conditions, so that diverse synthesized outputs can be generated even when the same utterance is spoken in different tones.
Figure 2 is a diagram provided to explain an emotional voice synthesis system according to an embodiment of the present invention. As shown, the emotional voice synthesis system according to an embodiment of the present invention comprises a voice encoder 110, a text encoder 120, a reference encoder 130, a TTS encoder 140, and a TTS decoder 150.
The voice encoder 110 extracts voice features from the user's voice information converted into embedding information. A pretrained speech model that converts the user's voice information into embedding information may precede the voice encoder 110.
The text encoder 120 extracts text features from the user's conversational speech information converted into embedding information. A pretrained language model that converts the user's conversational speech information into embedding information may precede the text encoder 120.
The reference encoder 130 generates reference information by combining the voice features extracted by the voice encoder 110 with the text features extracted by the text encoder 120. The reference information is referenced to generate synthesized speech that reflects the user's intention and emotion. The style and prosody of the synthesized speech finally output by the emotional voice synthesis system according to an embodiment of the present invention vary with this reference information.
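The combination step above can be pictured with a minimal sketch. The sketch below is an illustrative assumption only: the module structure, feature dimensions, and the use of simple concatenation followed by a linear projection are not specified by the patent, which only states that voice features and text features are combined into reference information.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Illustrative reference encoder: fuses counterpart voice features and
    text features into a single reference embedding (dimensions assumed)."""

    def __init__(self, voice_dim: int = 256, text_dim: int = 256, ref_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(voice_dim + text_dim, ref_dim)

    def forward(self, voice_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # voice_feat: (batch, voice_dim), text_feat: (batch, text_dim)
        fused = torch.cat([voice_feat, text_feat], dim=-1)
        return torch.tanh(self.proj(fused))  # reference information (embedding)
```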
The TTS encoder 140 extracts text features from the conversational speech information to be uttered by the system.
The TTS decoder 150 generates the system's synthesized speech from the text features extracted by the TTS encoder 140 and the reference information generated by the reference encoder 130.
An output means that outputs the system's synthesized speech generated by the TTS decoder 150 may follow the TTS decoder 150.
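One hedged way to picture how the reference information could condition the TTS decoder 150 is to broadcast the reference embedding across the encoded text frames, as in the sketch below; the recurrent decoder, the mel-spectrogram output, and all dimensions are illustrative assumptions rather than the architecture prescribed by the patent.

```python
import torch
import torch.nn as nn

class ConditionedTTSDecoder(nn.Module):
    """Illustrative TTS decoder: generates acoustic frames from the system's
    text features, conditioned on the reference embedding (all sizes assumed)."""

    def __init__(self, text_dim: int = 256, ref_dim: int = 128, mel_dim: int = 80):
        super().__init__()
        self.rnn = nn.GRU(text_dim + ref_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, mel_dim)

    def forward(self, text_feats: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, time, text_dim), ref_emb: (batch, ref_dim)
        ref = ref_emb.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        hidden, _ = self.rnn(torch.cat([text_feats, ref], dim=-1))
        return self.to_mel(hidden)  # (batch, time, mel_dim)
```

Because the reference embedding changes with the counterpart's voice and utterance, the same system text can yield differently styled output, which is the behavior described next.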
As a result, the emotional voice synthesis system according to an embodiment of the present invention produces speech with different prosody, depending on the counterpart's voice and conversational speech information, even when generating the same utterance.
Furthermore, the number of conversation partners (users) is not limited to one; conversation is possible even with multiple partners.
The emotional voice synthesis system according to an embodiment of the present invention is not limited to taking raw voice and conversational speech data as input; it is equally applicable when the voice synthesis system is implemented with features that can be obtained from the voice and conversational speech.
Figure 3 is a flowchart provided to explain an emotional voice synthesis method according to another embodiment of the present invention.
For emotional voice synthesis, the pretrained speech model first converts the user's voice information into embedding information (S210), and the voice encoder 110 extracts voice features from the voice information converted into embedding information in step S210 (S220).
Meanwhile, the pretrained language model converts the user's conversational speech information into embedding information (S230), and the text encoder 120 extracts text features from the conversational speech information converted into embedding information in step S230 (S240).
The reference encoder 130 then generates reference information by combining the voice features extracted in step S220 with the text features extracted in step S240 (S250).
Next, the TTS encoder 140 extracts text features from the conversational speech information to be uttered by the system (S260). The TTS decoder 150 then generates the system's synthesized speech from the text features extracted in step S260 and the reference information generated in step S250 (S270). The system's synthesized speech generated in step S270 is then output.
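Read as code, the flow of steps S210 through S270 might be wired together as in the following sketch; the `models` container and its attribute names are hypothetical placeholders for the components described above, not identifiers taken from the patent.

```python
def synthesize_response(user_audio, user_text, system_text, models):
    """Illustrative wiring of steps S210-S270 (all interfaces are assumptions)."""
    voice_emb = models.speech_pretrained(user_audio)            # S210: user audio -> embedding
    voice_feat = models.voice_encoder(voice_emb)                # S220: voice features
    text_emb = models.language_pretrained(user_text)            # S230: user text -> embedding
    text_feat = models.text_encoder(text_emb)                   # S240: text features
    ref_info = models.reference_encoder(voice_feat, text_feat)  # S250: reference information
    sys_text_feat = models.tts_encoder(system_text)             # S260: system text features
    return models.tts_decoder(sys_text_feat, ref_info)          # S270: synthesized output
```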
Figure 4 is a diagram showing the configuration of a conversation system according to yet another embodiment of the present invention. The conversation system according to an embodiment of the present invention comprises a microphone 310, a processor 320, a speaker 330, and a storage unit 340.
The microphone 310 is a voice input means for receiving the user's spoken utterances, and the speaker 330 is a voice output means for outputting the system's synthesized speech.
The processor 320 performs the functions of the system shown in Figure 2 or carries out the method shown in Figure 3. The storage unit 340 provides the storage space the processor 320 needs to operate and function.
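At the device level, one way the conversation system of Figure 4 could be driven is with a simple per-turn loop like the sketch below, reusing the `synthesize_response` sketch above; `capture_audio`, `transcribe`, `decide_reply`, and `play_audio` are hypothetical helpers standing in for the microphone, dialogue logic, and speaker, and a vocoder stage would be needed if the decoder emits acoustic features rather than a waveform.

```python
class ConversationSystem:
    """Illustrative runtime loop for the microphone / processor / speaker setup."""

    def __init__(self, models, capture_audio, transcribe, decide_reply, play_audio):
        self.models = models
        self.capture_audio = capture_audio   # microphone input (hypothetical helper)
        self.transcribe = transcribe         # user speech -> text (hypothetical helper)
        self.decide_reply = decide_reply     # dialogue manager choosing the system's text
        self.play_audio = play_audio         # speaker output (hypothetical helper)

    def turn(self):
        user_audio = self.capture_audio()
        user_text = self.transcribe(user_audio)
        system_text = self.decide_reply(user_text)
        output = synthesize_response(user_audio, user_text, system_text, self.models)
        self.play_audio(output)              # a vocoder step would precede playback if needed
```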
So far, the interactive emotional voice synthesis method based on the counterpart's voice and conversational speech information has been described in detail with reference to preferred embodiments.
In an embodiment of the present invention, by using the counterpart's utterance and voice as input to an existing interactive voice synthesis model, the quality of synthesized speech that previously had to be generated in a complicated or uniform manner can be improved, and the prosody of the generated synthesized speech can be varied by adjusting the tone or delivery of the voice.
This makes it possible to raise the sense of immersion in services that use interactive voice interface technology, such as virtual assistants, by providing improved synthesized speech, and it can also be applied to research on voice synthesis technology that adapts to voice variation for the same utterance (text).
Meanwhile, the technical idea of the present invention can of course also be applied to a computer-readable recording medium containing a computer program that performs the functions of the apparatus and method according to the present embodiment. In addition, the technical ideas according to various embodiments of the present invention may be implemented in the form of computer-readable code recorded on a computer-readable recording medium. The computer-readable recording medium can be any data storage device that can be read by a computer and can store data. For example, the computer-readable recording medium may be a ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, hard disk drive, and so on. Computer-readable code or programs stored on a computer-readable recording medium may also be transmitted over a network connecting computers.
In addition, although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described; various modifications may of course be made by those of ordinary skill in the art without departing from the gist of the present invention as claimed in the claims, and such modifications should not be understood in isolation from the technical idea or outlook of the present invention.
Claims (12)
- A voice synthesis method comprising: extracting voice features from a user's voice information; generating reference information from the extracted voice features; and generating a system's synthesized speech from the system's conversational speech information and the reference information.
- The voice synthesis method of claim 1, further comprising extracting text features from the user's conversational speech information, wherein the reference-information generation step generates the reference information from the extracted voice features and text features.
- The voice synthesis method of claim 2, wherein the reference information is referenced to generate synthesized speech that reflects the user's intention and emotion.
- The voice synthesis method of claim 3, wherein the prosody of the system's synthesized speech varies depending on the reference information.
- The voice synthesis method of claim 2, further comprising converting the user's voice information into embedding information, wherein the voice feature extraction step extracts the voice features from the voice information converted into embedding information.
- The voice synthesis method of claim 2, further comprising converting the user's conversational speech information into embedding information, wherein the text feature extraction step extracts the text features from the conversational speech information converted into embedding information.
- The voice synthesis method of claim 1, wherein the step of generating the system's synthesized speech comprises: extracting text features from the system's conversational speech information; and generating the system's synthesized speech from the extracted text features and the reference information.
- The voice synthesis method of claim 1, further comprising outputting the generated synthesized speech of the system.
- The voice synthesis method of claim 1, wherein the user comprises multiple users.
- A voice synthesis system comprising: a processor that extracts voice features from a user's voice information, generates reference information from the extracted voice features, and generates the system's synthesized speech from the system's conversational speech information and the reference information; and a storage unit that provides the storage space the processor requires.
- A voice synthesis method comprising: receiving a user's voice information; extracting voice features from the input voice information; generating reference information from the extracted voice features; generating a system's synthesized speech from the system's conversational speech information and the reference information; and outputting the generated synthesized speech.
- A voice synthesis system comprising: a microphone that receives a user's voice information; a processor that extracts voice features from the voice information input through the microphone, generates reference information from the extracted voice features, and generates the system's synthesized speech from the system's conversational speech information and the reference information; and a speaker that outputs the synthesized speech generated by the processor.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2022-0182952 | 2022-12-23 | ||
KR1020220182952A KR20240100869A (en) | 2022-12-23 | 2022-12-23 | Method for synthesizing interactive emotional voice based on the other party's voice and conversational utterance information |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024135901A1 true WO2024135901A1 (en) | 2024-06-27 |
Family
ID=91588846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2022/021192 WO2024135901A1 (en) | 2022-12-23 | 2022-12-23 | Interactive emotional voice synthesis method based on counterpart voice and conversational speech information |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20240100869A (en) |
WO (1) | WO2024135901A1 (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050261905A1 (en) * | 2004-05-21 | 2005-11-24 | Samsung Electronics Co., Ltd. | Method and apparatus for generating dialog prosody structure, and speech synthesis method and system employing the same |
KR20140062656A (en) * | 2012-11-14 | 2014-05-26 | 한국전자통신연구원 | Spoken dialog management system based on dual dialog management using hierarchical dialog task library |
KR20200015418A (en) * | 2018-08-02 | 2020-02-12 | 네오사피엔스 주식회사 | Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature |
KR20210106657A (en) * | 2020-02-21 | 2021-08-31 | 주식회사 케이티 | Device, method and computer program for providing conversation service based on emotion of user |
KR20220070979A (en) * | 2020-11-23 | 2022-05-31 | 서울대학교산학협력단 | Style speech synthesis apparatus and speech synthesis method using style encoding network |
KR20220116660A (en) * | 2021-02-15 | 2022-08-23 | 임상현 | Tumbler device with artificial intelligence speaker function |
Also Published As
Publication number | Publication date |
---|---|
KR20240100869A (en) | 2024-07-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22969335 Country of ref document: EP Kind code of ref document: A1 |