JPH10333699A

JPH10333699A - Voice recognition and voice synthesizer

Info

Publication number: JPH10333699A
Application number: JP9147607A
Authority: JP
Inventors: Hitoshi Iwamida; 均岩見田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1997-06-05
Filing date: 1997-06-05
Publication date: 1998-12-18

Abstract

PROBLEM TO BE SOLVED: To enable speech recognition and speech synthesis to be processed in common, by providing a speech waveform node for input/output of a speech waveform and a language symbol node for input/output of a character string, and by inputting the speech waveform directly to a neural circuit. SOLUTION: The device consists of a speech waveform node 1, a language symbol node 2, and intermediate nodes 3-6. In speech recognition, the speech waveform node 1 receives the input speech as a speech waveform and outputs it to the individual nodes; in speech synthesis, it outputs the waveform of the synthesized speech according to the outputs from the individual nodes. In speech recognition, the language symbol node 2 outputs the recognition result for the input speech waveform according to the outputs from the individual nodes; in speech synthesis, it receives as a character string the synthesized speech to be output and outputs it to the individual nodes. For the intermediate node 3, the necessary processing and data are converted into, and defined as, the functions of the nodes and the weights between the nodes. The speech waveform is input directly to a neural circuit, and the entire processing of speech recognition and speech synthesis is performed in the common neural circuit.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition and speech synthesis device that combines speech recognition processing, which outputs a recognition result when speech is input, with speech synthesis processing, which outputs the corresponding synthesized speech when a character string is input.

[0002]

[Description of the Related Art] Speech recognition and speech synthesis are generally applied in systems such as voice-dialogue sightseeing guide systems and electronic secretary systems. In conventional speech processing, however, the speech recognition unit and the speech synthesis unit are in many cases implemented as separate, independent units, as disclosed for example in the magazine "FUJITSU", May 1995 issue, Vol. 46, No. 3, pp. 319-324.

[0003] For this reason, a conventional device must include separate devices to implement speech recognition and speech synthesis, which increases the required resources (memory, storage devices, and so on).

[0004]

[Problem to Be Solved by the Invention] To reduce the required resources, schemes that share the processing and data used by speech recognition and speech synthesis (phoneme data, knowledge, and whatever else both need) have been studied, but only a small part of either the processing or the data has actually been shared.

[0005] At present, speech recognition and speech synthesis are not only implemented independently of each other; each device is itself built from multiple means. As a result, the means used for recognition do not correspond to those used for synthesis, and the data that each means takes as input and produces as output naturally differ as well, which hinders the sharing of processing and data.

[0006] An object of the present invention is to realize speech recognition and speech synthesis with a common neural circuit rather than as separate, independent devices.

[0007]

[Means for Solving the Problem] To realize speech recognition and speech synthesis with a common neural circuit, the present invention focuses on the fact that the input speech can be processed as a raw speech waveform, instead of being analyzed and passed to a device that compares feature data, as is done at present.

[0008] Accordingly, a common speech waveform node is provided that receives the speech waveform in speech recognition processing and outputs the synthesized speech waveform in speech synthesis processing. A common language symbol node is also provided that outputs the recognized speech as a character string in speech recognition and receives the character string specifying the synthesized speech to be output in speech synthesis. Together these make it possible to perform the entire processing of speech recognition and speech synthesis in a common neural circuit.

[0009] That is, the present invention is a speech recognition and speech synthesis device that recognizes speech from an input speech signal and outputs synthesized speech from an input character string, characterized in that it has a speech waveform node that inputs and outputs a speech waveform and a language symbol node that inputs and outputs a character string, and in that the speech waveform is input directly to a neural circuit so that all processing for speech recognition and speech synthesis is performed by a common neural circuit.

[0010]

[Embodiments of the Invention] FIG. 1 shows the basic configuration of the present invention. As the figure shows, the speech recognition/synthesis device as a whole consists of a speech waveform node 1, a language symbol node 2, and intermediate nodes 3 to 6. In speech recognition, the speech waveform node 1 receives the input speech as a speech waveform and outputs it to the individual nodes; in speech synthesis, it outputs the waveform of the synthesized speech according to the outputs from the individual nodes. In speech recognition, the language symbol node 2 outputs the recognition result for the input speech waveform according to the outputs from the individual nodes; in speech synthesis, it receives as a character string the synthesized speech to be output and outputs it to the individual nodes. For the intermediate node 3, the processing and data needed for speech recognition and speech synthesis are converted into, and defined as, the functions of the individual nodes and the weights between nodes.

[0011] A specific embodiment is described with reference to FIG. 2. This embodiment is assumed to perform speech recognition and speech synthesis for the five vowels of Japanese speech, namely "あ", "い", "う", "え", and "お". In the following, these vowels are written "A", "I", "U", "E", and "O", respectively. As in FIG. 1, the neural circuit that performs speech recognition and synthesis consists of a speech waveform node 11, language symbol nodes 21 to 25, and intermediate nodes 31 to 36. Here, one of the language symbol nodes 21 to 25 is provided for each vowel.

[0012] The nodes 11, 21 to 25, and 31 to 36 are connected to one another with weights. The output of each node is determined by a function whose inputs are the outputs of all nodes at the immediately preceding time and the weights of the connections from those nodes to the node itself. That is, the output X_j(t) of node j at time t is determined as follows.

[0013]

[Equation 1]

[0014] Here, W_jk is the weight of the connection from node k to node j, and D_j(t) is the externally supplied input to node j at time t. f is a suitable nonlinear function. Input/output interfaces for speech recognition and speech synthesis are also connected to this neural circuit.
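As a rough illustration of the node update, the following Python sketch computes one time step in the form used by the procedures in this description, X_j(t) = f(Σ_k W_jk X_k(t-1)). The choice of tanh for f, the three-node network, and its weight values are assumptions made only for the example; the description says only that f is a suitable nonlinear function, and the external input D_j(t) is left out here.

```python
import math


def step(prev_outputs, weights, f=math.tanh):
    """Return X(t) from X(t-1), where X_j(t) = f(sum_k W[j][k] * X_k(t-1)).

    weights[j][k] is the weight of the connection from node k to node j.
    f defaults to tanh, which is an assumption; any nonlinear function fits.
    """
    return [
        f(sum(w_jk * x_k for w_jk, x_k in zip(row, prev_outputs)))
        for row in weights
    ]


# Hypothetical 3-node network with hand-picked weights.
W = [
    [0.0, 0.5, -0.5],  # connections into node 0
    [1.0, 0.0, 0.0],   # connections into node 1
    [0.0, 0.0, 0.0],   # connections into node 2
]
x_prev = [1.0, 0.0, 0.0]   # outputs X_k(t-1)
x_next = step(x_prev, W)   # outputs X_j(t)
```

Every node reads only the previous time step, so all outputs for time t can be computed before any of them is overwritten.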

[0015] Specifically, the following are connected: a microphone 71 for inputting speech; an A/D converter 72 that converts the input speech from analog data to digital data; a buffer 73 that temporarily stores the digitized speech data; a buffer 74 that temporarily stores the digital data of the synthesized speech output by speech synthesis processing; a D/A converter 75 that converts the synthesized speech from digital data to analog data; a speaker 76 that outputs the synthesized speech converted to analog data; a text input unit 77 for specifying the speech to be synthesized; a time-series/text conversion unit 78 that, in speech synthesis, converts the text specified at the text input unit 77 into time-series data that the language symbol nodes can handle and, in speech recognition, converts the time-series data output from the language symbol nodes 21 to 25 into text; and a text display unit 79 that displays the text output from the time-series/text conversion unit 78.

[0016] The specific operation of this embodiment is described below. First, speech recognition is described. Speech input from the microphone 71 is converted, one word at a time, by the A/D converter 72 into a digital speech signal S(1), S(2), S(3), ... and stored temporarily in the buffer 73. Using this signal, speech recognition is performed by the following procedure.

[0017]
(1) Set a suitable initial value X_j(0) for the output of each node j.
(2) For each time t, perform (3) to (4) below.
(3) For each node j, perform (4) below.
(4) Compute X_j(t) = f(Σ_k W_jk X_k(t-1)). However, if node j is the speech waveform node 11, set X_j(t) = S(t).

[0018]
(5) In the time-series/text conversion unit 78, find, among the output time series X_a(t), X_i(t), X_u(t), X_e(t), and X_o(t) of the language symbol nodes 21 to 25, the language symbol node that shows the maximum value for at least a fixed period of time.
(6) Display on the text display unit 79 the text ("A", "I", "U", "E", or "O") corresponding to the language symbol node that gave the maximum value for at least the fixed period in (5).
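Recognition steps (1) to (6) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the tanh nonlinearity, the zero initial values, the `min_run` threshold standing in for the "fixed period", and the hand-picked weights of the toy three-node network are all hypothetical. The description does not say how the weights are obtained, so none of that is modeled here.

```python
import math


def run_recognition(signal, weights, waveform_node, symbol_nodes, f=math.tanh):
    """Steps (1)-(4): return the output time series of each language symbol node.

    symbol_nodes maps a label such as "A" to the index of its node.
    """
    n = len(weights)
    x = [0.0] * n                                   # (1) initial values X_j(0), assumed 0
    series = {label: [] for label in symbol_nodes}
    for s_t in signal:                              # (2) for each time t
        new_x = []
        for j in range(n):                          # (3) for each node j
            if j == waveform_node:
                new_x.append(s_t)                   # waveform node: X_j(t) = S(t)
            else:                                   # (4) X_j(t) = f(sum_k W_jk X_k(t-1))
                new_x.append(f(sum(weights[j][k] * x[k] for k in range(n))))
        x = new_x
        for label, j in symbol_nodes.items():
            series[label].append(x[j])
    return series


def decode(series, min_run):
    """Steps (5)-(6): emit the label of each node that stays maximal for min_run steps."""
    labels = list(series)
    length = len(next(iter(series.values())))
    text, current, run, emitted = [], None, 0, False
    for t in range(length):
        winner = max(labels, key=lambda label: series[label][t])
        if winner == current:
            run += 1
        else:
            current, run, emitted = winner, 1, False
        if run >= min_run and not emitted:
            text.append(current)
            emitted = True
    return "".join(text)


# Toy network: node 0 is the waveform node, nodes 1 and 2 stand for "A" and "I".
W = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],    # "A" responds to positive samples (hand-picked weight)
    [-2.0, 0.0, 0.0],   # "I" responds to negative samples (hand-picked weight)
]
S = [0.5] * 4 + [-0.5] * 4
series = run_recognition(S, W, waveform_node=0, symbol_nodes={"A": 1, "I": 2})
text = decode(series, min_run=3)
```

With this toy signal, the "A" node holds the maximum over the first stretch and the "I" node over the last three steps, so `text` comes out as "AI". A run-length threshold of this kind cannot separate two consecutive identical symbols; the description does not address that case.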

[0019] FIG. 3 shows an example of the output time series X_a(t), X_i(t), X_u(t), X_e(t), and X_o(t) produced at the language symbol nodes when a certain word is recognized by the above procedure. Here, the series that holds the maximum value for the fixed period is first X_a(t) and then X_i(t). From these output time series of the language symbol nodes 21 to 25, the recognition result "AI" is displayed on the text display unit 79 via the time-series/text conversion unit 78.

[0020] Next, speech synthesis is described. Text specified at the text input unit 77, for example "AI", is converted by the time-series/text conversion unit 78 into the time series D_a(t), D_i(t), D_u(t), D_e(t), and D_o(t) that serve as the external inputs to the language symbol nodes 21 to 25. FIG. 4 shows an example in which the text "AI" is converted into time series.
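One way the text-to-time-series conversion could look is sketched below in Python. FIG. 4 is not reproduced in this text, so the encoding is an assumption: each character holds its own node at a fixed "on" level for a fixed number of time steps while the other nodes stay at an "off" level. The names `text_to_series`, `steps_per_symbol`, `on`, and `off` are hypothetical.

```python
VOWELS = ("A", "I", "U", "E", "O")


def text_to_series(text, steps_per_symbol, labels=VOWELS, on=1.0, off=0.0):
    """Convert text into one external-input time series per language symbol node.

    Assumed encoding: the node for the current character is held at `on`
    for `steps_per_symbol` time steps and all other nodes are held at `off`.
    """
    series = {label: [] for label in labels}
    for ch in text:
        if ch not in series:
            raise ValueError(f"no language symbol node for {ch!r}")
        for label in labels:
            level = on if label == ch else off
            series[label].extend([level] * steps_per_symbol)
    return series


d = text_to_series("AI", steps_per_symbol=3)
# d["A"] is on for the first three steps, d["I"] for the last three.
```

The resulting dictionary plays the role of D_a(t) to D_o(t): one list per language symbol node, all of the same length.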

[0021] Based on this time-series data, speech synthesis is performed by the following procedure.
(1) Set a suitable initial value X_j(0) for the output of each node j.
(2) For each time t, perform (3) to (4) below.
(3) For each node j, perform (4) below.
(4) Compute X_j(t) = f(Σ_k W_jk X_k(t-1)).

[0022] However, if node j is one of the language symbol nodes 21 to 25, X_j(t) takes the value of the corresponding one of the aforementioned D_a(t) to D_o(t).
(5) Store the output time series X_w(t) of the speech waveform node 11 temporarily in the buffer 74.
(6) When the computation for all times is finished, convert the time-series data in the buffer 74 into an analog speech signal with the D/A converter 75 and output it from the speaker 76.
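The synthesis procedure mirrors recognition with the roles of the nodes swapped: the language symbol nodes are driven externally and the waveform node is read out. A minimal Python sketch, under the same kind of assumptions as before (tanh for f, zero initial values, a hypothetical three-node network with hand-picked weights, and a plain list standing in for buffer 74):

```python
import math


def synthesize(d_series, weights, waveform_node, symbol_nodes, f=math.tanh):
    """Run synthesis steps (1)-(5) and return the buffered waveform samples.

    d_series maps a label to its external-input time series D_label(t);
    symbol_nodes maps the same labels to node indices.
    """
    n = len(weights)
    n_steps = len(next(iter(d_series.values())))
    driven = {j: d_series[label] for label, j in symbol_nodes.items()}
    x = [0.0] * n                                   # (1) initial values X_j(0), assumed 0
    buffer = []                                     # stands in for buffer 74
    for t in range(n_steps):                        # (2) for each time t
        new_x = []
        for j in range(n):                          # (3) for each node j
            if j in driven:
                new_x.append(driven[j][t])          # symbol node: X_j(t) = D_j(t)
            else:                                   # (4) X_j(t) = f(sum_k W_jk X_k(t-1))
                new_x.append(f(sum(weights[j][k] * x[k] for k in range(n))))
        x = new_x
        buffer.append(x[waveform_node])             # (5) store X_w(t)
    return buffer


# Toy network: node 0 is the waveform node, nodes 1 and 2 stand for "A" and "I".
W = [
    [0.0, 1.0, -1.0],   # hand-picked weights into the waveform node
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
D = {"A": [1.0, 1.0, 0.0, 0.0], "I": [0.0, 0.0, 1.0, 1.0]}
samples = synthesize(D, W, waveform_node=0, symbol_nodes={"A": 1, "I": 2})
```

Step (6), the D/A conversion and playback, is hardware-side and is not modeled; `samples` is simply the list that would be handed to the D/A converter.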

[0023] In the example of FIG. 4, the synthesized speech "AI" is output through the speaker 76 based on the time-series data input from the text input unit 77.

<Other Embodiments>

The embodiment above assumes that the weight between two nodes differs depending on the direction, but it need not; the same weight may be applied in both directions between a given pair of nodes.
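The symmetric-weight variant can be sketched in Python by storing one value per unordered pair of nodes and expanding it into a full matrix with W_jk = W_kj. The helper name `symmetric_weights` and the example values are hypothetical.

```python
def symmetric_weights(pair_weights, n):
    """Build an n x n weight matrix in which W[j][k] == W[k][j].

    pair_weights maps an unordered node pair (j, k) to its single shared weight;
    pairs that are not listed get weight 0.0.
    """
    w = [[0.0] * n for _ in range(n)]
    for (j, k), value in pair_weights.items():
        w[j][k] = value
        w[k][j] = value
    return w


# Hypothetical 3-node example with two weighted connections.
W_sym = symmetric_weights({(0, 1): 0.5, (1, 2): -0.3}, n=3)
```

Only one number per pair has to be stored in this form, which roughly halves the weight storage compared with direction-dependent weights.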

[0024] Furthermore, the output from a node that serves as an input, specifically the output from the speech waveform node in speech recognition and the output from the language symbol nodes in speech synthesis, need not be weighted; the input value may be output as is. As for the language symbol nodes, one is provided for each of the five vowels, in other words for each phoneme, but they may instead be consolidated into a single language symbol node. Alternatively, rather than one per phoneme, nodes may be provided according to the number of syllables, that is, the number of phoneme combinations such as "か" (ka) and "な" (na), or according to the number of words to be recognized, such as "あい" (ai) and "かき" (kaki). When multiple language symbol nodes are provided, then as in the embodiment above, the phoneme, syllable, or word corresponding to the node that shows the maximum value for a fixed period is judged to have been recognized, or to have been specified as the synthesized speech to be output. When they are consolidated into a single language symbol node, the recognized speech or the synthesized speech to be output is determined from the band of the signal input to or output from that node.

[0025] Likewise, although digitized speech data is input to and output from the speech waveform node, analog data can be input and output as is by building functions equivalent to the A/D converter and the D/A converter into the neural circuit. In addition, as shown in FIG. 5, an attribute control node may be added to the embodiment above. For attributes such as voice quality (for example, a male or female voice) and pitch (the height of the voice), this node outputs the recognition result in speech recognition and receives instructions specifying the characteristics of the synthesized speech in speech synthesis. With it, speech recognition can recognize not only phonemes but also attribute information obtained from the speech data, and speech synthesis can control the characteristics of the speech being synthesized.

[0026]

[Effects of the Invention] As described above, providing a speech waveform node that directly inputs and outputs the speech waveform makes it possible to share processing between speech recognition and speech synthesis. Consequently, a device that needs both speech recognition and speech synthesis does not have to include separate speech recognition and speech synthesis devices; it only needs a speech recognition and speech synthesis device realized with a common neural circuit.

[Brief description of the drawings]

[FIG. 1] A diagram showing the basic configuration of the present invention.

[FIG. 2] A diagram showing the configuration of an embodiment of the present invention.

[FIG. 3] A diagram showing an example of the output time series from the language symbol nodes during speech recognition in the embodiment of the present invention.

[FIG. 4] A diagram showing an example of the input time series to the language symbol nodes during speech synthesis in the embodiment of the present invention.

[FIG. 5] A diagram showing the configuration of another embodiment of the present invention.

Claims

[Claims]

1. A speech recognition and speech synthesis device that recognizes speech from an input speech signal and outputs synthesized speech from an input character string, comprising: a speech waveform node that inputs and outputs a speech waveform; and a language symbol node that inputs and outputs a character string, wherein the speech waveform is input directly to a neural circuit so that all processing for speech recognition and speech synthesis is performed by a common neural circuit.