JP2012128440A

JP2012128440A - Voice interactive device

Info

Publication number: JP2012128440A
Application number: JP2012022981A
Authority: JP
Inventors: Hiroshige Asada; 博重浅田
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2012-02-06
Filing date: 2012-02-06
Publication date: 2012-07-05

Abstract

PROBLEM TO BE SOLVED: To enable smooth dialogue suited to the speaker's sensibilities in a voice interaction device. SOLUTION: A response control part 7 causes a voice synthesis part 2 to output, through a loudspeaker 8, a response voice corresponding to the utterance content of the speaker recognized by a voice recognition part 4. The response time, from when the voice recognition part 4 detects the end of the speaker's utterance until the response voice starts to be output, is normally varied according to the utterance speed detected by an utterance speed detection part 5.

Description

The present invention relates to a voice interaction device that provides, as synthesized speech, a response voice corresponding to the result of recognizing a speaker's utterance.

For example, in in-vehicle systems such as car navigation systems and hands-free telephone systems, it has become common to recognize a voice command uttered by the user by comparing the user's speech with standard speech patterns of a recognition vocabulary stored in advance, and to perform control processing according to that voice command. With this kind of speech recognition technology, analyzing a single utterance from the user is often not enough to identify the utterance content uniquely, and if the content is identified incorrectly, unnecessary control processing is performed. As a countermeasure, voice interaction devices have been proposed in which the in-vehicle system outputs a question (response voice) to pin down the user's utterance content and asks for a reply, thereby understanding the utterance content interactively.

In a typical voice interaction device, the synthesis speed of the response voice and the response time (the time until the response starts) are fixed for all utterances of the speaker (user). However, the speaking rate differs from one speaker to another depending on individual characteristics. As a result, the speaker may feel kept waiting unnecessarily until the response voice finishes, or feel rushed to answer a question, so that smooth dialogue suited to the speaker's sensibilities (dialogue that rarely irritates the speaker) tends to become difficult. To address this problem, a voice interaction device has been proposed, as seen for example in Patent Document 1, that measures the speaker's speaking rate and changes the output speed of the response voice according to the measured rate.

Patent Document 1: Japanese Examined Patent Publication No. 7-21759

The conventional voice interaction device described above merely changes the output speed of the response voice according to the speaker's speaking rate, so the speed of the response voice is held constant until output is complete. Consequently, even if an event occurs partway through the response that makes the speaker want it to finish early (for example, an incoming telephone call), the speaker may be kept waiting unnecessarily until the response voice ends. This can increase the speaker's irritation and make smooth dialogue suited to the speaker's sensibilities difficult. Moreover, in a voice interaction device the so-called "pause" in conversation is a very important element of smooth dialogue that matches the speaker's sensibilities. Conventionally, however, this pause has been constant regardless of the speaker's speaking rate, which has made such dialogue difficult.

The present invention has been made in view of the above circumstances, and its object is to provide a voice interaction device capable of smooth dialogue suited to the speaker's sensibilities.

According to the means of claim 1, when the speaker's utterance content is recognized by the voice recognition means, the response control means provides a response voice corresponding to the recognition result as synthesized speech. Speech rate detection means for detecting the speaker's speaking rate is provided, and the response control means varies the response time from when the voice recognition means detects the end of the speaker's utterance until the response voice is provided, that is, the so-called "pause" in conversation, according to the speaking rate detected by the speech rate detection means. Because the pause, which is a very important element of smooth dialogue that matches the speaker's sensibilities, is controlled to a length corresponding to the speaker's speaking rate, smooth dialogue suited to the speaker's sensibilities can be achieved.

According to the means of claim 2, when a detection signal taken in from an environmental condition detection sensor satisfies a predetermined environmental condition, the response control means enters a standby state in which provision of the response voice is withheld until that environmental condition is cleared. Therefore, under environmental conditions in which it is considered better for the speaker not to react to a response voice (when the voice interaction device is installed as an in-vehicle system, examples include a steering operation in progress for a right or left turn, or a deceleration operation in progress due to sudden braking), provision of the response voice is withheld, and the reliability of the dialogue improves.

According to the means of claim 3, when the recognition result of the voice recognition means includes an operation start command or an operation stop command for a control target device, the response control means adjusts the execution timing of that command to the optimal timing for the environmental condition indicated by the detection signal from the environmental condition detection sensor. Therefore, when a control operation of the control target device based on an operation start or stop command from the speaker is about to be executed under environmental conditions in which immediate execution is considered undesirable (when the control target device is a vehicle device, examples include a period in which a steering operation for a right or left turn is in progress, or a period in which the vehicle is reversing), the start of the control operation is withheld, and the reliability of control of the control target device improves.

FIG. 1 is a functional block diagram showing the basic configuration of one embodiment of the present invention.
FIG. 2 is a sequence diagram for explaining an example of the control performed by the response control unit.

An embodiment of the present invention is described below with reference to the drawings.
FIG. 1 schematically shows, as a combination of functional blocks, a basic configuration example of a voice interaction device 1 and related parts. In FIG. 1, the voice interaction device 1 is mounted on a vehicle and configured to perform voice response processing with a speaker (the vehicle driver) and control processing according to voice commands from that speaker. Although not specifically shown, it is configured to exchange data with a car navigation device, an engine control ECU, a mobile communication ECU, and the like, for example over an in-vehicle LAN.

The voice synthesis unit 2, speech rate control unit 3 (corresponding to the speech rate control means), voice recognition unit 4 (corresponding to the voice recognition means), speech rate detection unit 5 (corresponding to the speech rate detection means), dialogue agent unit 6, and response control unit 7 (corresponding to the response control means) that make up the voice interaction device 1 are in practice implemented by the program of an ECU for the voice interaction device. Each has the functions described below.

The voice synthesis unit 2 generates a voice signal in real time, by well-known speech synthesis processing, based on an instruction from the response control unit 7, and outputs that voice signal as a response voice from an in-vehicle loudspeaker 8. The loudspeaker 8 may be a dedicated one, but the loudspeaker of the car audio device can also serve this purpose.

The speech rate control unit 3 has a function of changing, in real time and continuously, the speaking rate of the response voice output from the voice synthesis unit 2 so that it matches the instruction speed given by the response control unit 7. As its rate-changing algorithm, the speech rate control unit 3 uses, for example, the well-known TDHS (Time Domain Harmonic Scaling) method, which compresses or expands the time axis of the response voice, so the speaking rate changes while the pitch of the response voice stays constant.
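To illustrate what changing the time axis while leaving the waveform inside each frame untouched can look like, here is a minimal sketch using plain overlap-add. This is not TDHS itself, which works pitch-synchronously; it is a cruder stand-in chosen for brevity. The function names, the frame length, and the Hann window are assumptions made for the illustration and do not come from the patent.

```python
import math


def hann_window(length):
    """Periodic Hann window; copies placed at 50% overlap sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / length) for i in range(length)]


def ola_time_stretch(samples, rate, frame=256):
    """Return `samples` replayed `rate` times faster using plain overlap-add.

    Frames are read every `hop_in` samples and written every `hop_out`
    samples, so the duration changes while each frame keeps its own
    waveform (and therefore roughly its pitch).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    hop_out = frame // 2                      # fixed 50% overlap on output
    hop_in = max(1, round(hop_out * rate))    # larger input hop = faster speech
    window = hann_window(frame)
    n_frames = max(1, (len(samples) - frame) // hop_in + 1)
    out = [0.0] * ((n_frames - 1) * hop_out + frame)
    for k in range(n_frames):
        src, dst = k * hop_in, k * hop_out
        for i in range(frame):
            if src + i < len(samples):
                out[dst + i] += samples[src + i] * window[i]
    return out
```

With `rate` above 1 the output is shorter than the input, and with `rate` below 1 it is longer. Plain overlap-add can produce audible artifacts at frame boundaries, which is one reason pitch-synchronous methods such as TDHS are used in practice.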

The voice recognition unit 4 takes in the speaker's voice input from an in-vehicle microphone 9 and analyzes that voice (that is, the speaker's utterance content) by a speech recognition method using, for example, keyword spotting, thereby recognizing the vocabulary needed for the dialogue.
The speech rate detection unit 5 detects (estimates) the speaker's speaking rate based on the vocabulary recognized by the voice recognition unit 4 and the time taken to utter it.
The dialogue agent unit 6 is provided to manage the dialogue. It selects the commands contained in the vocabulary recognized by the voice recognition unit 4 and passes them to the response control unit 7.
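The speech rate detection described above can be sketched as a simple ratio of recognized speech units to utterance time. The patent does not specify the unit or the formula, so the mora count, the `RecognizedWord` record, and the `speech_rate` function below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    """One recognized vocabulary item with its timing (hypothetical record)."""
    text: str
    mora_count: int   # assumed unit of speech length
    start_s: float    # utterance start time in seconds
    end_s: float      # utterance end time in seconds


def speech_rate(words):
    """Estimate the speaking rate in morae per second.

    Only the time actually spent uttering recognized words is counted,
    so silences between keywords do not lower the estimate.
    Returns None when no usable timing is available.
    """
    total_morae = sum(w.mora_count for w in words)
    total_time = sum(w.end_s - w.start_s for w in words)
    if total_time <= 0:
        return None
    return total_morae / total_time
```

For example, two three-mora words spoken in 0.5 s and 0.4 s give a rate of about 6.7 morae per second.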

In addition to controlling the voice synthesis unit 2 and the speech rate control unit 3 as described above, the response control unit 7 has a function of controlling the operation of a control target device group 10, which includes mobile communication devices mounted on the vehicle (a mobile phone, a car phone, a data communication module (DCM), and the like), a car navigation device, and a car audio device. It receives various detection signals from an in-vehicle sensor group 11 (corresponding to the environmental condition detection sensor). The in-vehicle sensor group 11 includes a vehicle speed sensor, an acceleration sensor, and a steering angle sensor for detecting the running state and operating state of the vehicle, as well as an adapter or the like for detecting the operating state of the mobile communication devices. It also includes an interface unit for outputting driver assistance information, such as route guidance information and traffic congestion information, from the car navigation device, and means for monitoring the operating status of specific in-vehicle devices. Where necessary, it further includes a gaze recognition device for detecting the driver's line of sight and a microphone for monitoring noise and conversation in the cabin.
The detection signals of the in-vehicle sensor group 11 may instead be input from the engine control ECU, the mobile communication ECU, or the like over the in-vehicle LAN.

The following describes, together with the related operation, those parts of the control performed by the response control unit 7 that relate to the gist of the present invention.
The response control unit 7 is normally in an inactive state. It is switched to an active state, and the dialogue with the speaker starts, when, for example, a dialogue start switch provided within reach of the speaker (vehicle driver) is turned on, or when a specific voice command requesting that the voice interaction device 1 start operating is given through the voice recognition unit 4 and the dialogue agent unit 6 (that is, when the speaker inputs the specific voice command through the microphone 9).
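The two activation paths described above (the dialogue start switch and the specific voice command) can be sketched as a two-state machine. The class name, method names, and the wake phrase are hypothetical; the patent does not name the command or describe how the unit returns to the inactive state, so that transition is left out.

```python
from enum import Enum


class DialogueState(Enum):
    INACTIVE = "inactive"
    ACTIVE = "active"


class DialogueActivation:
    """Tracks whether the response controller is accepting dialogue."""

    def __init__(self, wake_command="start dialogue"):
        # The wake phrase is an assumed placeholder, not from the patent.
        self.wake_command = wake_command
        self.state = DialogueState.INACTIVE

    def on_start_switch(self, is_on):
        """Dialogue start switch operated by the driver."""
        if is_on:
            self.state = DialogueState.ACTIVE

    def on_voice_command(self, command):
        """Command forwarded by the recognizer and the dialogue agent."""
        if command == self.wake_command:
            self.state = DialogueState.ACTIVE
```

Any other voice command leaves the state unchanged, which matches the description that only the specific start command wakes the device.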

Once the dialogue with the speaker has started, the speaker's utterance content input through the microphone 9 is analyzed by the voice recognition unit 4, which recognizes the vocabulary needed for the dialogue, and the commands contained in that vocabulary are selected by the dialogue agent unit 6 and passed to the response control unit 7.

On receiving such a command, the response control unit 7 performs the following controls (1) and (2) in succession.
(1) Control that determines a pause time whose length depends on the speaker's speaking rate detected by the speech rate detection unit 5 and, when that pause time has elapsed, instructs the voice synthesis unit 2 to create a voice signal for responding to the given command.

(2) Control that determines the generation speed of the voice signal in the voice synthesis unit 2, that is, the speaking rate of the response voice output through the loudspeaker 8, as an instruction speed that depends on the speaker's speaking rate detected by the speech rate detection unit 5, and gives that instruction speed to the speech rate control unit 3.
In this embodiment, the pause time is controlled to become proportionally shorter the faster the speaker's speaking rate. In the normal state, in which none of the events described later has occurred, the instruction speed is controlled to become proportionally higher the faster the speaker's speaking rate.
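Controls (1) and (2) can be sketched as a mapping from the detected speaking rate to a pause time and an instruction speed. The patent only says the relationship is proportional, so the reading below (pause inversely proportional to the rate, speed directly proportional to it), the reference rate, the base values, and the clamping limits are all assumptions for illustration.

```python
def clamp(value, low, high):
    """Keep `value` within [low, high]."""
    return max(low, min(high, value))


def plan_response(rate, ref_rate=7.0, base_pause_s=0.8, base_speed=1.0,
                  pause_limits=(0.2, 2.0), speed_limits=(0.5, 2.0)):
    """Return (pause_seconds, speed_factor) for a detected speaking rate.

    `ref_rate` is the assumed speaking rate at which the base pause and
    base speed apply. A faster speaker gets a shorter pause and a faster
    response voice; a slower speaker gets the opposite.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    ratio = rate / ref_rate
    pause = clamp(base_pause_s / ratio, *pause_limits)
    speed = clamp(base_speed * ratio, *speed_limits)
    return pause, speed
```

Under these assumed defaults, a speaker at the reference rate gets a 0.8 s pause at normal speed, while a speaker twice as fast gets a 0.4 s pause and a response at double speed. The clamps keep the result within a usable range for very fast or very slow speakers.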

The response control unit 7 has a function of monitoring the detection signals from the in-vehicle sensor group 11 to watch for the occurrence of predetermined events, for example an event that makes the speaker want the response voice to finish early (such as an incoming telephone call). When such an event occurs partway through output of the response voice, the response control unit 7 performs the following control (3).

(3) Control that continuously raises the generation speed of the voice signal in the voice synthesis unit 2, that is, the speaking rate of the response voice, above its previous value (the pitch of the response voice does not change).
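Control (3) raises the speaking rate continuously rather than in one jump. A minimal sketch of such a ramp is shown below, stepping the instruction speed toward a target once per control tick. The step size, the target speed, and the function names are assumptions; the patent does not say how fast the ramp is or where it ends.

```python
def ramp_toward(current, target, step):
    """Move `current` one step toward `target` without overshooting."""
    if current < target:
        return min(target, current + step)
    return max(target, current - step)


def speed_profile(start, target, step, ticks):
    """Instruction speed on each control tick after an event occurs."""
    speeds = []
    speed = start
    for _ in range(ticks):
        speed = ramp_toward(speed, target, step)
        speeds.append(speed)
    return speeds
```

Because `ramp_toward` works in both directions, the same helper also covers the variant mentioned later in the description, where the response voice is slowed continuously for events that call for careful listening.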
In addition, when a detection signal taken in from the in-vehicle sensor group 11 satisfies a predetermined environmental condition (for example, a steering operation in progress for a right or left turn, or sudden braking of the vehicle in progress), the response control unit 7 performs the following controls (4) and (5).

(4) Control that, only until the environmental condition is cleared, prohibits speech synthesis by the voice synthesis unit 2, that is, enters a standby state in which provision of the response voice from the loudspeaker 8 is withheld.
(5) Control that, when the environmental condition becomes satisfied partway through output of the response voice, for example stops the output of the response voice and then, once the environmental condition is cleared, outputs that response voice again from the beginning.
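Controls (4) and (5) can be sketched as an output stage that is gated by the environmental condition and rewinds to the start of the response whenever the condition interrupts it. Emitting one word per tick stands in for real audio output; the class and method names are hypothetical.

```python
class GatedResponseOutput:
    """Word-by-word response output that waits out environmental conditions."""

    def __init__(self):
        self.words = []        # words of the response being output
        self.position = 0      # index of the next word to emit
        self.suspended = False

    def request(self, text):
        """Queue a new response, replacing any previous one."""
        self.words = text.split()
        self.position = 0

    def set_condition(self, active):
        """Report whether the environmental condition currently holds."""
        if active and not self.suspended:
            # Control (5): an interrupted response restarts from the beginning.
            self.position = 0
        self.suspended = active

    def tick(self):
        """Emit the next word, or None while suspended or when finished."""
        if self.suspended or self.position >= len(self.words):
            return None
        word = self.words[self.position]
        self.position += 1
        return word
```

If the condition arises before output has started, the rewind has no visible effect and the response simply begins once the condition clears, which corresponds to control (4).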

Furthermore, when the recognition result of the voice recognition unit 4 includes an operation start command or an operation stop command for the control target device group 10, the response control unit 7 performs the following control (6).
(6) Control that adjusts the execution timing of the operation start or stop command to the optimal timing for the environmental condition indicated by the detection signal taken in from the in-vehicle sensor group 11 (for example, a steering operation in progress for a right or left turn, or sudden braking of the vehicle in progress). Specifically, if for example the operation start command is a command to place a hands-free telephone call through a mobile communication device, execution of that call command is withheld while the environmental condition is satisfied, and the call command is executed only after the environmental condition has been cleared.
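Control (6) amounts to deferring device commands while an environmental condition holds and releasing them when it clears. The sketch below shows one way to do that with a queue. The scheduler class, its method names, and the callback are hypothetical, and executing deferred commands in arrival order is an assumption the patent does not state.

```python
from collections import deque


class DeferredCommandScheduler:
    """Holds device commands back while an environmental condition is active."""

    def __init__(self, execute):
        self.execute = execute   # callback that actually drives the device
        self.blocked = False
        self.pending = deque()

    def submit(self, command):
        """Run the command now, or queue it if a condition is active."""
        if self.blocked:
            self.pending.append(command)
        else:
            self.execute(command)

    def set_condition(self, active):
        """Update the condition; release queued commands when it clears."""
        self.blocked = active
        if not active:
            while self.pending:
                self.execute(self.pending.popleft())
```

In the hands-free example, a call command submitted during sudden braking would sit in the queue and be sent to the mobile communication device only after straight travel resumes.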

The sequence diagram of FIG. 2 shows a concrete example of the controls (1), (2), and (4) to (6) above, which is described below. The example of FIG. 2 corresponds to a situation in which the vehicle driver, as the speaker, places a hands-free telephone call to "Mr. A". It schematically shows how the operation of the in-vehicle sensor group 11, the speaker, the voice recognition unit 4, the voice synthesis unit 2, the speech rate detection unit 5, the speech rate control unit 3, the response control unit 7, the dialogue agent unit 6, and the control target device group 10 changes as the driving situation goes from straight travel to a right (or left) turn, back to straight travel, then to sudden braking, and finally to resumed straight travel.

S1: While the vehicle is traveling straight, the speaker says "I want to make a call".
S2: The voice recognition unit 4 recognizes the utterance "I want to make a call", the speech rate detection unit 5 detects the speaker's speaking rate, and the dialogue agent unit 6 selects the command corresponding to the recognition result of the voice recognition unit 4 (a command requesting a telephone call) and outputs it to the response control unit 7.

S3: The response control unit 7 determines the pause time and instruction speed and performs voice output processing. In this processing it determines a pause time that becomes proportionally shorter the faster the speaker's speaking rate and an instruction speed that becomes proportionally higher the faster the speaker's speaking rate. After the pause time has elapsed, it gives the instruction speed to the speech rate control unit 3 and instructs the voice synthesis unit 2 to generate a predetermined voice signal (for example, "Yes, who would you like to call?", which both acknowledges the speaker's utterance and asks for the call destination).

S4: The voice synthesis unit 2 generates the instructed voice signal ("Yes, who would you like to call?") and outputs it as a response voice from the loudspeaker 8, while the speech rate control unit 3 controls the speaking rate of the response voice to match the instruction speed from the response control unit 7.
S5: The speaker says "Mr. A's mobile phone" (the vehicle is still traveling straight).
S6: The voice recognition unit 4 recognizes the utterance "Mr. A's mobile phone", the speech rate detection unit 5 detects the speaker's speaking rate, and the dialogue agent unit 6 selects the command corresponding to the recognition result of the voice recognition unit 4 (a command specifying the call destination) and outputs it to the response control unit 7.

S7: The response control unit 7 determines the pause time and instruction speed and performs voice output processing. In this processing it determines the pause time and the instruction speed, and after the pause time has elapsed it gives the instruction speed to the speech rate control unit 3 and instructs the voice synthesis unit 2 to generate a predetermined voice signal (for example, "Calling Mr. A's mobile phone", which announces the call destination).

S8: The speaker performs a steering operation to turn the vehicle right (or left).
S9: The in-vehicle sensor group 11 (in particular the steering angle sensor) detects the steering operation for the right (or left) turn. If the in-vehicle sensor group 11 includes a gaze recognition device for detecting the driver's line of sight, its detection output may also be used to determine that a right or left turn is being made.

S10: The response control unit 7 stops speech synthesis (generation of the voice signal) by the voice synthesis unit 2.
S11: The speaker performs a steering operation to return the vehicle to straight travel.
S12: The in-vehicle sensor group 11 (in particular the steering angle sensor) detects the steering operation that returns the vehicle to straight travel.
S13: The response control unit 7 starts speech synthesis (generation of the voice signal) by the voice synthesis unit 2. If the right (or left) turn takes place before the pause time has elapsed, speech synthesis naturally starts from the beginning. Even if the turn takes place after the pause time has elapsed, while the response voice is being output, speech synthesis is likewise restarted from the beginning.

S14: The voice synthesis unit 2 generates the instructed voice signal ("Calling Mr. A's mobile phone") and outputs it as a response voice from the loudspeaker 8, while the speech rate control unit 3 controls the speaking rate of the response voice to match the instruction speed from the response control unit 7.
S15: The speaker brakes the vehicle hard for a certain period.
S16: The in-vehicle sensor group 11 detects the sudden braking operation. This detection can be based on the output of the vehicle speed sensor or the acceleration sensor in the in-vehicle sensor group 11.

S17: The in-vehicle sensor group 11 (in particular the vehicle speed sensor) detects that straight travel has resumed following release of the sudden braking.
S18: The response control unit 7 sends a command to call Mr. A to the control target device group 10 (in particular the mobile communication device).

S19: The control target device group 10 (in particular the mobile communication device) places the call to Mr. A. The telephone number needed for the call is obtained, for example, from a telephone number database built in advance in the voice interaction device 1 or on the mobile communication device side. When the call is placed, the dialing tone is sounded by, for example, the mobile communication device, but the voice synthesis unit 2 may instead generate the dialing tone and output it from the loudspeaker 8.

In short, with the configuration of the embodiment described above, when the speaker's utterance content is recognized by the voice recognition unit 4, the response control unit 7 provides a response voice corresponding to the recognition result as speech synthesized by the voice synthesis unit 2. The speech rate detection unit 5, which detects the speaker's speaking rate, and the speech rate control unit 3, which adjusts the speaking rate of the response voice, are provided, and in the normal state the speaking rate of the response voice is controlled, without changing its pitch, to become proportionally higher the faster the speaker's speaking rate. This enables smooth dialogue suited to the speaker's sensibilities.

The response voice is not output immediately when the speaker's utterance content is recognized by the voice recognition unit 4. It is output only after a pause time has elapsed, and that pause time is controlled to become proportionally shorter the faster the speaker's speaking rate. In other words, the response time from when the voice recognition unit 4 recognizes the speaker's utterance content until the response voice is provided, that is, the so-called "pause" in conversation, is varied according to the speaker's actual speaking rate. Because the pause, which is a very important element of smooth dialogue that matches the speaker's sensibilities, is controlled to a length corresponding to the speaker's speaking rate, smooth dialogue suited to the speaker's sensibilities can be achieved.

On the other hand, when an event occurs during output of the response voice that makes the speaker want it to finish early (such as an incoming telephone call), the speaking rate of the response voice is controlled to rise continuously from its previous value without any change in pitch. When an event of this kind occurs, the speaker is therefore not kept waiting unnecessarily until the response voice ends, the risk of increasing the speaker's irritation is removed, and smooth dialogue suited to the speaker's sensibilities becomes possible. In addition, because the speaking rate of the response voice changes in response to the event, the speaker can indirectly notice that an event has occurred from the change in speaking rate.

Note that, for example, when an event occurs that makes the speaker want to be sure of catching the content of the response voice, the response voice may instead be controlled so that it slows down continuously without any change in pitch. With such a configuration the speaker can reliably catch the content of the response voice, which enables a smooth dialogue that meets the speaker's needs.
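The continuous speed-up or slow-down triggered by an event can be pictured as a ramp on the playback-rate factor that is handed to the time-scale modification stage. The sketch below only computes that factor and does not implement the pitch-preserving time-axis processing itself. The linear ramp, the function name `ramped_rate_factor`, and the default values are assumptions, since the text only says that the speed changes continuously.

```python
def ramped_rate_factor(elapsed_s, start_factor=1.0, target_factor=1.5,
                       ramp_s=2.0):
    """Return the speaking-rate factor to apply elapsed_s seconds after an event.

    start_factor  : rate factor in effect when the event occurred
    target_factor : rate factor to reach (>1 speeds up, <1 slows down)
    ramp_s        : assumed duration of the transition in seconds
    The linear shape and all defaults are hypothetical.
    """
    if ramp_s <= 0 or elapsed_s >= ramp_s:
        return target_factor
    if elapsed_s <= 0:
        return start_factor
    # Linear interpolation keeps the change gradual rather than abrupt.
    progress = elapsed_s / ramp_s
    return start_factor + (target_factor - start_factor) * progress


if __name__ == "__main__":
    # Speed-up after an incoming call, then a slow-down case.
    print([ramped_rate_factor(t) for t in (0.0, 1.0, 2.0)])
    print([ramped_rate_factor(t, 1.0, 0.5) for t in (0.0, 1.0, 2.0)])
```

The same function covers both cases: a target above the starting factor models the early-finish event, and a target below it models the case where the speaker wants to catch the content reliably.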

When the detection signal captured from the in-vehicle sensor group 11 satisfies a predetermined environmental condition, the response control unit 7 enters a standby state in which the provision of the response voice is withheld only until that environmental condition is cancelled. Consequently, under environmental conditions in which it is considered better for the speaker not to react to a response voice (for example, while a steering operation for a right or left turn is in progress, or while the vehicle is braking hard, as described in the embodiment), the response voice is withheld, and the reliability of the dialogue improves.
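A minimal sketch of this standby behaviour is shown below, assuming the in-vehicle sensor group delivers a steering angle and a deceleration value. The `SensorSnapshot` type, both helper functions, and the threshold values are hypothetical; the patent names the situations (turning, hard braking) but gives no concrete signals or limits.

```python
from dataclasses import dataclass


@dataclass
class SensorSnapshot:
    """One reading from the in-vehicle sensors (field names are assumptions)."""
    steering_angle_deg: float
    deceleration_mps2: float


def hold_condition_active(snapshot, steering_limit_deg=30.0,
                          decel_limit_mps2=4.0):
    """True while the response voice should be withheld (thresholds hypothetical)."""
    turning = abs(snapshot.steering_angle_deg) >= steering_limit_deg
    hard_braking = snapshot.deceleration_mps2 >= decel_limit_mps2
    return turning or hard_braking


def first_release_index(snapshots):
    """Index of the first reading at which a pending response may be spoken.

    Returns None if the hold condition never clears within the readings.
    """
    for index, snapshot in enumerate(snapshots):
        if not hold_condition_active(snapshot):
            return index
    return None


if __name__ == "__main__":
    readings = [SensorSnapshot(45.0, 0.0), SensorSnapshot(0.0, 6.0),
                SensorSnapshot(2.0, 0.5)]
    print(first_release_index(readings))
```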

In addition, when the recognition result from the voice recognition unit 4 includes a command relating to the operation of the control target device group 10, the response control unit 7 adjusts the execution timing of that command to the optimal timing for the environmental condition indicated by the detection signals from the in-vehicle sensor group 11. Consequently, when a control operation of the control target device group 10 based on a command from the speaker is about to be executed under environmental conditions in which it is considered better not to execute it immediately (for example, while the vehicle is braking hard or while it is reversing, as described in the embodiment), the start of the control operation is postponed, and the reliability of the control of the control target device group 10 improves.
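One way to picture the postponed command execution is a small queue that releases recognized commands only when no blocking condition is present. The sketch below assumes the two blocking conditions mentioned in the text (hard braking, reversing) arrive as booleans. The class name `CommandScheduler`, its methods, and the command strings are hypothetical, and "optimal timing" is simplified here to "the first tick on which nothing blocks".

```python
from collections import deque


class CommandScheduler:
    """Holds recognized device commands until the driving situation allows them."""

    def __init__(self):
        self._pending = deque()

    def submit(self, command):
        """Queue a command taken from the recognition result."""
        self._pending.append(command)

    def tick(self, hard_braking, reversing):
        """Return the commands cleared to run now, in submission order.

        While the vehicle is braking hard or reversing, nothing is released
        and the queued commands are kept for a later tick.
        """
        if hard_braking or reversing:
            return []
        ready = list(self._pending)
        self._pending.clear()
        return ready


if __name__ == "__main__":
    scheduler = CommandScheduler()
    scheduler.submit("audio_on")
    print(scheduler.tick(hard_braking=True, reversing=False))
    print(scheduler.tick(hard_braking=False, reversing=False))
```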

(Other embodiments)
The present invention is not limited to the embodiment described above; for example, the following modifications or extensions are possible.
Although the embodiment describes the voice interactive device 1 mounted on a vehicle, the invention may also be applied to, for example, a voice interactive device for controlling the operation of so-called digital home appliances.
The speech speed control unit 3 changes the speaking speed of the response voice by the TDHS method, but another time-axis compression/expansion algorithm for speech may be adopted, or the speaking speed of the response voice may be changed by changing the speed at which the voice signal is generated.

The response control unit 7 determines the speaking speed of the response voice so that it becomes an instructed speed corresponding to the speaker's speaking speed immediately before the response. Alternatively, the speaker's speaking speeds over the series of utterances since the dialogue began may be averaged successively, and the instructed speed determined from that average speaking speed. With this configuration, even if the speaker's speaking speed changes because of a temporary factor, the speaking speed of the response voice eventually converges to a state suited to that speaker's sensibility, which is beneficial for a smooth dialogue.
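The successive averaging described here can be sketched with an incremental mean, which avoids storing every past utterance. The class name `RunningRate` and the choice of a plain cumulative mean (rather than, say, a weighted average) are assumptions; the text only says that the speeds are averaged in sequence.

```python
class RunningRate:
    """Cumulative mean of the speaker's speaking speed within one dialogue."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, rate):
        """Fold one newly detected speaking speed into the mean and return it."""
        self.count += 1
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.mean += (rate - self.mean) / self.count
        return self.mean


if __name__ == "__main__":
    tracker = RunningRate()
    # The third utterance is a temporary outlier; the mean moves only partly.
    for detected in (6.0, 6.0, 12.0, 6.0):
        print(tracker.update(detected))
```

Because each new utterance contributes only 1/n of its deviation, a single unusually fast or slow utterance shifts the instructed speed less and less as the dialogue continues.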

When a plurality of speakers use the voice interactive device 1, the default value of the instructed speed corresponding to the speaking speed detected by the speech speed detection unit 5 may be changed for each speaker who actually uses the voice interactive device 1. In that case, the average speaking speed of each speaker is stored separately, and the default value of the instructed speed is changed according to that average.
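Per-speaker defaults could be kept in a small lookup keyed by speaker, as in the sketch below. How the device identifies the current speaker is not stated in the text, so the `speaker_id` key, the class name `SpeakerProfiles`, and the fallback value are assumptions made for illustration.

```python
class SpeakerProfiles:
    """Stores each speaker's average speaking speed for use as a default."""

    def __init__(self, fallback_rate=6.0):
        # Default used for a speaker with no history (hypothetical value).
        self.fallback_rate = fallback_rate
        self._stats = {}  # speaker_id -> (utterance count, mean rate)

    def record(self, speaker_id, rate):
        """Fold one detected speaking speed into that speaker's average."""
        count, mean = self._stats.get(speaker_id, (0, 0.0))
        count += 1
        mean += (rate - mean) / count
        self._stats[speaker_id] = (count, mean)

    def default_rate(self, speaker_id):
        """Return the stored average, or the fallback for an unknown speaker."""
        if speaker_id in self._stats:
            return self._stats[speaker_id][1]
        return self.fallback_rate


if __name__ == "__main__":
    profiles = SpeakerProfiles()
    profiles.record("driver_a", 5.0)
    profiles.record("driver_a", 7.0)
    print(profiles.default_rate("driver_a"), profiles.default_rate("driver_b"))
```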

If the speaker makes the next utterance before output of the response voice to the previous utterance has finished, the response voice to that new utterance may be controlled with a shorter pause time and a faster speaking speed. This configuration makes it possible to accommodate, for example, a speaker who has a reason to hurry the dialogue, or a so-called impatient speaker.

Claims 1 and 2 are effective not only in a voice interactive device but also in systems that provide information by speech synthesis, such as an ETC system or the voice guidance of a car navigation system.
In this voice interactive device, a gaze recognition device that detects the vehicle driver's gaze direction may recognize that the driver has looked at the microphone 9, its vicinity, or an object symbolizing it, and voice recognition may be enabled from that point.
In claim 2, the provision of the response voice may be not only placed in the standby state but also cancelled when a preset condition is met, for example when the premise for exchanging information no longer exists, such as when the driver stops driving (the vehicle is stopped).

Reference numeral 1 denotes the voice interactive device, 2 the speech synthesis unit, 3 the speech speed control unit (speech speed control means), 4 the voice recognition unit (voice recognition means), 5 the speech speed detection unit (speech speed detection means), 6 the dialogue agent unit, 7 the response control unit (response control means), 10 the control target device group, and 11 the in-vehicle sensor group (environmental condition detection sensor).

Claims

A voice interactive device comprising voice recognition means for recognizing the utterance content of a speaker, and response control means for providing, by synthesized voice, a response voice corresponding to the recognition result,
the device further comprising speech speed detection means for detecting the speaking speed of the speaker,
wherein the response control means changes a response time, from the time when the voice recognition means detects the end of the speaker's utterance to the start of providing the response voice, according to the speaking speed detected by the speech speed detection means.

The voice interactive device according to claim 1, wherein the response control means is configured to capture a detection signal from an environmental condition detection sensor and, when the detection signal satisfies a predetermined environmental condition, enters a standby state in which the provision of the response voice is postponed only for the period until the environmental condition is cancelled.

The voice interactive device according to claim 1 or 2, wherein the response control means is configured to be able to control the operation of a control target device and to capture a detection signal from an environmental condition detection sensor and, when the recognition result by the voice recognition means includes an operation start command or an operation stop command for the control target device, adjusts the execution timing of the command so as to be an optimal timing according to the environmental condition indicated by the detection signal from the environmental condition detection sensor.