JP2007264132A

JP2007264132A - Voice detection device and its method

Info

Publication number: JP2007264132A
Application number: JP2006086607A
Authority: JP
Inventors: Hiroshi Kanazawa; 博史金澤; Hideki Hirakawa; 秀樹平川
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2006-03-27
Filing date: 2006-03-27
Publication date: 2007-10-11

Abstract

PROBLEM TO BE SOLVED: To provide a voice detection device that reliably detects the content of an utterance.

SOLUTION: When the outputs of both an air conduction microphone 1 and a bone conduction microphone 3 exceed a predetermined level in a unit time, a correlation value between them is calculated. When the correlation is determined to be high, time information is sent to a voice section detection unit 7, so that a section that can reliably be assumed to be a voice section is available for voice detection.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a voice detection device and method using a plurality of microphones.

The technique of Patent Document 1 has been proposed as a method of improving section detection accuracy for speech recognition by using a bone conduction microphone together with an ordinary air conduction microphone. This method performs separate voice section detection on each microphone output, decides which detection result to prioritize according to the noise level, and, based on that priority, adopts one of the two voice section detection results as the final result.

Patent Document 2 proposes a technique that obtains the degree of correlation between a sound signal received by a unidirectional microphone and a sound signal received by an omnidirectional microphone, and detects the detection target sound based on the correlation result.
Cited patent documents: JP-A-4-276799; JP 2005-227511 A

In the technique of Patent Document 1, the voice section detection result of the air conduction microphone is given priority when the background noise level is low. However, even when the background noise level is low, if non-stationary noise such as a neighboring person's speech occurs immediately before or after the target speech, a section that includes the noise may be erroneously detected as the voice section. In that case the error cannot be corrected, the voice signal of the wrong section is input to the speech recognition device, and a recognition error results.

In addition, which voice section is to be supplied to the speech recognition device is fixed only after voice section detection has finished for the signals from both the bone conduction microphone and the air conduction microphone. Speech recognition therefore cannot be performed in real time.

Furthermore, in the technique of Patent Document 2, the utterance is received by an omnidirectional microphone, so that, for example, a neighboring person's speech is also detected as the detection target sound and input to speech recognition processing, which prevents accurate recognition.

The present invention therefore provides a voice detection device and method capable of reliably detecting only the target sound that should be detected.

The present invention is a voice detection device comprising: a plurality of microphones having mutually different characteristics; correlation value calculation means for obtaining a correlation value between the input signals input from the respective microphones; determination means for determining from the correlation value whether the input signals are correlated, and for outputting time information of the time at which they are determined to be correlated; and voice section detection means for outputting, as a voice signal, one of the input signals in the section determined to be correlated on the basis of the time information.

According to the present invention, the content of an utterance can be reliably detected.

A voice detection device 100 according to an embodiment of the present invention will be described with reference to FIGS. 1 to 5.

(1) Configuration of voice detection device 100
FIG. 1 is a block diagram showing the configuration of the voice detection device 100 of this embodiment.

This embodiment describes the voice detection device 100 for the case where two microphones are used: an air conduction microphone 1 and a bone conduction microphone 3. The person whose voice is to be recognized (hereinafter, the microphone wearer) wears the bone conduction microphone 3 on the body, in particular near the face, and the air conduction microphone 1 is positioned so that it can pick up the wearer's voice.

The voice detection device 100 includes the air conduction microphone 1, a level determination unit 2, the bone conduction microphone 3, a level determination unit 4, a correlation value calculation unit 5, a determination unit 6, a voice section detection unit 7, a noise feature estimation unit 8, and a noise cancellation unit 9. In this embodiment, a speech recognition device is assumed to be connected to the voice detection device 100, and the description assumes that the output of the voice detection device 100 is used for speech recognition processing.

(2) Air conduction microphone 1 and level determination unit 2
An acoustic signal traveling through the air is input to the air conduction microphone 1. The acoustic signal is converted into an electric signal, AD-converted into a digital signal (hereinafter, the input acoustic signal), and sent to the level determination unit 2.

The level determination unit 2 calculates the power of the input acoustic signal, for example every 10 ms, as the sum of squares of the waveform amplitude. The power is compared with a preset power threshold, and if it exceeds the threshold, the 10 ms of input acoustic signal is passed to the correlation value calculation unit 5.

(3) Bone conduction microphone 3 and level determination unit 4
In time synchronization with the above, vibration transmitted through the body is input to the bone conduction microphone 3. The vibration is converted into an electric signal, AD-converted into a digital signal (hereinafter, the input vibration signal), and sent to the level determination unit 4.

The level determination unit 4 defines, for example, 10 ms of the input vibration signal as one frame and calculates the power of each frame as the sum of squares of the waveform amplitude. The power is compared with a preset power threshold, and if it exceeds the threshold, the 10 ms of input vibration signal is passed to the correlation value calculation unit 5.

The thresholds of the level determination units 2 and 4 are set to separate values because the levels of the input acoustic signal and the input vibration signal differ.
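The level determination described in sections (2) and (3) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the sample rate argument, and the use of `None` to mark a blocked frame are assumptions made for the example. Only the 10 ms frame length, the sum-of-squares power, and the per-microphone threshold come from the text.

```python
from typing import List, Optional, Sequence

FRAME_MS = 10  # frame length given as an example in the text


def frame_power(frame: Sequence[float]) -> float:
    """Power of one frame as the sum of squared amplitudes."""
    return sum(x * x for x in frame)


def split_frames(signal: Sequence[float], sample_rate: int) -> List[Sequence[float]]:
    """Cut a signal into consecutive 10 ms frames (a trailing partial frame is dropped)."""
    n = sample_rate * FRAME_MS // 1000
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def level_gate(frames: List[Sequence[float]],
               threshold: float) -> List[Optional[Sequence[float]]]:
    """Pass a frame through only if its power exceeds the threshold; otherwise None."""
    return [f if frame_power(f) > threshold else None for f in frames]
```

In such a sketch, `level_gate` would be called once per microphone with its own threshold, reflecting the note that the two thresholds are set separately.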

(4) Correlation value calculation unit 5 and determination unit 6
The correlation value calculation unit 5 receives the input acoustic signal from the air conduction microphone 1 and the input vibration signal from the bone conduction microphone 3 that have passed level determination. A correlation value is calculated when both signals are input.

As the correlation measure, for example, the Euclidean distance between the frequency patterns obtained by frequency analysis of each signal may be used, or a correlation coefficient may be computed.
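Both measures mentioned above can be illustrated with a short sketch. The text does not specify how the frequency pattern is computed, so the naive DFT magnitude spectrum and its unit-length normalisation below are assumptions, as are all function names. Note that a Euclidean distance is small when the signals are similar, so it would need to be inverted or compared against an upper bound before being used as a "correlation value" that must exceed a threshold.

```python
import cmath
import math
from typing import List, Sequence


def correlation_coefficient(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length frames (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Magnitudes of a naive DFT for bins 0..N/2, normalised to unit length (assumed)."""
    n = len(frame)
    mags = [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]
    norm = math.sqrt(sum(m * m for m in mags)) or 1.0
    return [m / norm for m in mags]


def spectral_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between the normalised frequency patterns of two frames."""
    return math.dist(magnitude_spectrum(a), magnitude_spectrum(b))
```

Comparing magnitude spectra rather than raw waveforms makes the measure insensitive to level and phase differences between the two microphones, which is one plausible reason for offering it as an alternative.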

When the sound is actually an utterance by the microphone wearer, the input acoustic signal obtained from the level determination unit 2 is that voice transmitted through the air, and the input vibration signal obtained from the level determination unit 4 is the vibration of the bone caused by the same utterance. Because both signals come from the same source, their correlation value is expected to be high.

Consider, as shown in FIG. 2, the case where a person other than the microphone wearer speaks. The input acoustic signal obtained from the level determination unit 2 is that voice transmitted through the air, but there is no input to the bone conduction microphone 3, so nothing is output from the level determination unit 4.

Consider, as shown in FIG. 3, the case where the microphone wearer does not speak and an input vibration signal reaches only the bone conduction microphone 3, for example because of friction against the body. The air conduction microphone 1 receives either no input or only a very low-level signal, so there is no output from the level determination unit 2.

Consider, as shown in FIG. 4, the case where speech by a person other than the microphone wearer and vibration generated from the wearer's body overlap in time. The level determination units 2 and 4 output the input acoustic signal of the other person's voice and the input vibration signal, respectively. Because their signal sources differ, the correlation value is expected to be very small.

When the correlation value calculated by the correlation value calculation unit 5 exceeds a threshold preset in the determination unit 6, the signals input to the two microphones 1 and 3 are determined to be correlated, and the time information for that moment is output to the voice section detection unit 7 and the noise feature estimation unit 8.

The above describes the situations that can arise when there is no background noise. In every case, signals other than the wearer's utterance are prevented from being input to the voice section detection unit 7, and the time information of sections in which the wearer can reliably be assumed to have spoken is reported to the voice section detection unit 7. Conversely, the wearer may be regarded as speaking at the reported times.

However, in speech uttered by the wearer, some sounds, in particular word-initial consonants, cause almost no vibration of the bones and are difficult for the bone conduction microphone 3 to pick up, so the speech may actually have started before the time at which a high correlation value was obtained. The same phenomenon can occur not only at the beginning of a word but also in the middle or at the end of a word. These problems are handled by the voice section detection unit 7 and by the speech recognition device connected to the voice detection device 100.

When there is background noise, on the other hand, the input acoustic signal from the air conduction microphone 1 is expected to be at a high level because of the noise even when the wearer is not speaking, so it passes the level determination unit 2 and is sent to the correlation value calculation unit 5. The bone conduction microphone 3, however, is not affected by background noise, so in sections that contain only background noise and no utterance by the wearer there is no output from the level determination unit 4. The influence of background noise can therefore be eliminated. In this way, with or without background noise, performing the level determination and the correlation-based determination makes it possible to output to the voice section detection unit 7 the times at which the wearer has reliably spoken, without being affected by noise.
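The combined behaviour of the level determination units and the determination unit 6 across the cases of FIGS. 2 to 4 can be sketched as a single loop. This is an illustrative sketch under assumptions: `None` marks a frame blocked by a level determination unit, the correlation function is passed in by the caller, and time information is represented as the frame start time in milliseconds. None of these representations is specified in the text.

```python
from typing import Callable, List, Optional, Sequence

Frame = Optional[Sequence[float]]  # None means the level gate blocked the frame


def correlated_frame_times(air: List[Frame],
                           bone: List[Frame],
                           correlate: Callable[[Sequence[float], Sequence[float]], float],
                           corr_threshold: float,
                           frame_ms: int = 10) -> List[int]:
    """Return start times (ms) of frames in which both microphones are judged correlated."""
    times = []
    for idx, (a, b) in enumerate(zip(air, bone)):
        # FIG. 2 / FIG. 3 cases, or background noise only: one gate produced no output.
        if a is None or b is None:
            continue
        # FIG. 4 case fails here: both gates pass, but the sources differ.
        if correlate(a, b) > corr_threshold:
            times.append(idx * frame_ms)
    return times
```

Because each frame is decided as soon as it arrives, a loop of this shape is compatible with the real-time processing that the background section identifies as missing from the prior art.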

(5) Voice section detection unit 7
The processing in the voice section detection unit 7 is not particularly limited; one example is described with reference to FIG. 5.

In addition to the time information of the assumed voice section described above, the voice section detection unit 7 receives the input acoustic signal output from the air conduction microphone 1 via the noise feature estimation unit 8.

It also receives the signal that has passed through the noise feature estimation unit 8 and undergone noise cancellation in the noise cancellation unit 9.

Using these three kinds of information and signals, the voice section detection unit 7 detects the correct voice section of the microphone wearer.

As noted above, the time information output from the determination unit 6 may not cover sections such as a word-initial consonant, so control is applied such as starting voice section detection from a point, for example, 30 frames (= 300 ms) before that time. For this purpose the voice section detection unit 7 contains a storage area holding, for example, 30 frames of the input acoustic signal. Frame by frame from the start frame, the voice section detection unit 7 uses both the input acoustic signal before noise cancellation and the signal after noise cancellation to determine, for example from the power and frequency pattern of each frame, whether the frame corresponds to speech uttered by the wearer.
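The 30-frame storage area can be pictured as a fixed-length ring buffer. The sketch below is an assumption-laden illustration: the stream format (a frame paired with a "correlated" flag), the function name, and the choice to keep yielding every frame after the start are all hypothetical, and the end-of-speech decision and the per-frame power and frequency-pattern test are deliberately left out.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

LOOKBACK_FRAMES = 30  # 30 frames x 10 ms = 300 ms, the example value in the text


def frames_with_lookback(stream: Iterable[Tuple[List[float], bool]],
                         lookback: int = LOOKBACK_FRAMES) -> Iterator[List[float]]:
    """Yield frames starting up to `lookback` frames before the first correlated frame."""
    history: Deque[List[float]] = deque(maxlen=lookback)
    started = False
    for frame, correlated in stream:
        if started:
            yield frame
        elif correlated:
            started = True
            yield from history  # buffered frames that may hold a word-initial consonant
            yield frame
        else:
            history.append(frame)  # oldest frame is discarded once the buffer is full
```

A bounded `deque` keeps memory constant regardless of how long the wearer stays silent, which matches the idea of a small built-in storage area.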

Frames determined to be speech are sent to the speech recognition device sequentially, from the speech start frame to the speech end frame, and feature extraction and speech recognition processing are performed on them.

(6) Noise feature estimation unit 8
Until the determination unit 6 determines that a frame belongs to a voice section, the noise feature estimation unit 8 treats the frame as belonging to a noise section and uses its signal for noise feature estimation.

Specifically, when noise frames continue in succession, the noise feature signal N(n-1) for the frames preceding the current frame is updated with the frequency feature signal F(n) of the current frame, using an update coefficient α, to give the noise feature signal N(n) up to the current frame. If the next frame is also determined to be a noise frame, the update is repeated in the same way. The update formula is as follows.

N(n) = (1 - α) · N(n-1) + α · F(n)

This noise feature signal is used by the noise cancellation unit 9.
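The update formula is a first-order recursive average, which can be written in a few lines. In this sketch the noise feature and the frame feature are assumed to be lists with one value per frequency bin, and the function name and the example value of α are hypothetical. The text does not state how N is initialised, so starting from zeros here is only for illustration.

```python
from typing import List, Sequence


def update_noise_feature(prev_noise: Sequence[float],
                         frame_feature: Sequence[float],
                         alpha: float) -> List[float]:
    """N(n) = (1 - alpha) * N(n-1) + alpha * F(n), applied per frequency bin."""
    return [(1.0 - alpha) * n + alpha * f
            for n, f in zip(prev_noise, frame_feature)]


# Two consecutive noise frames with the same spectrum pull the estimate toward it.
noise = [0.0, 0.0]
for feature in ([1.0, 2.0], [1.0, 2.0]):
    noise = update_noise_feature(noise, feature, alpha=0.5)
```

A small α makes the estimate change slowly and smooths out momentary fluctuations, while a large α tracks changing noise more quickly.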

(7) Noise cancellation unit 9
Based on the estimated noise feature signal sent from the noise feature estimation unit 8 and on the determination of whether the current frame is noise or speech, the noise cancellation unit 9 extracts only the speech component from the input signal of the air conduction microphone 1 when the frame is a speech frame.

For example, using a method such as spectral subtraction, the noise feature signal is subtracted from the frequency feature signal of the frame, which cancels the noise component and leaves only the speech component.
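A bare-bones form of the subtraction step might look like the following. The text only says that the noise feature signal is subtracted from the frame's frequency feature signal; clamping negative results to a floor value is a common practice in spectral subtraction and is an assumption here, as are the function and parameter names.

```python
from typing import List, Sequence


def spectral_subtract(frame_feature: Sequence[float],
                      noise_feature: Sequence[float],
                      floor: float = 0.0) -> List[float]:
    """Subtract the estimated noise spectrum per bin, clamping at `floor` (assumed)."""
    return [max(f - n, floor) for f, n in zip(frame_feature, noise_feature)]
```

For instance, with a frame feature of `[5.0, 1.0, 3.0]` and a noise feature of `[1.0, 2.0, 3.0]`, the first bin keeps 4.0 of speech energy while the other two bins are clamped to the floor.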

This speech component signal is used for voice section detection and is also sent, via the voice section detection unit 7, to the speech recognition device for use in speech recognition processing.

(8) Effects
The voice detection device 100 of this embodiment enables accurate voice section detection without being affected by stationary noise such as ambient background noise or by non-stationary noise such as other people's speech, telephone rings, and car horns, and can thereby improve speech recognition performance.

(9) Modifications
The present invention is not limited to the above embodiment and can be modified in various ways without departing from its gist.

For example, the above embodiment uses two microphones, the air conduction microphone 1 and the bone conduction microphone 3, but three microphones may be used instead: an omnidirectional air conduction microphone, a directional air conduction microphone, and a bone conduction microphone. In this case, the input is determined to be speech when at least a majority of the pairwise combinations of input signals are correlated, and to be noise when a majority of the combinations are uncorrelated.
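The majority rule for three or more microphones can be sketched as a vote over all pairwise combinations. The function name, the dictionary of named signals, and the caller-supplied correlation function are assumptions for illustration. With three microphones there are three pairs, so a tie cannot occur; the "undecided" result for an even number of pairs is an assumption, since the text does not cover that case.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence


def classify_by_majority(signals: Dict[str, Sequence[float]],
                         correlate: Callable[[Sequence[float], Sequence[float]], float],
                         corr_threshold: float) -> str:
    """Classify one frame as 'voice' or 'noise' by voting over all microphone pairs."""
    pairs = list(combinations(signals, 2))
    correlated = sum(1 for a, b in pairs
                     if correlate(signals[a], signals[b]) > corr_threshold)
    if 2 * correlated > len(pairs):
        return "voice"
    if 2 * (len(pairs) - correlated) > len(pairs):
        return "noise"
    return "undecided"  # tie; only possible with an even number of pairs (assumed handling)
```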

FIG. 1 is a block diagram of a voice detection device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example in which sound other than the microphone wearer's utterance is input to the air conduction microphone.
FIG. 3 is a diagram showing an example in which vibration other than the microphone wearer's utterance is input to the bone conduction microphone.
FIG. 4 is a diagram showing an example in which sound and vibration other than the microphone wearer's utterance are input simultaneously to both the air conduction microphone and the bone conduction microphone.
FIG. 5 is a diagram showing an example of voice section detection processing.

Explanation of symbols

1 Air conduction microphone
2 Level determination unit
3 Bone conduction microphone
4 Level determination unit
5 Correlation value calculation unit
6 Determination unit
7 Voice section detection unit
8 Noise feature estimation unit
9 Noise cancellation unit

Claims

1. A voice detection device comprising:
a plurality of microphones having mutually different characteristics;
correlation value calculation means for obtaining a correlation value between the input signals input from the respective microphones;
determination means for determining from the correlation value whether the input signals are correlated, and for outputting time information of the time at which they are determined to be correlated; and
voice section detection means for outputting, as a voice signal, one of the input signals in the section determined to be correlated on the basis of the time information.

2. The voice detection device according to claim 1, wherein at least two of the plurality of microphones are an air conduction microphone and a bone conduction microphone.

3. The voice detection device according to claim 2, wherein the input signal output as the voice signal by the voice section detection means is the input signal of the air conduction microphone.

4. The voice detection device according to claim 3, further comprising noise cancellation means for removing noise from the input signal of the air conduction microphone and outputting the noise-removed input signal to the voice section detection means.

5. The voice detection device according to claim 4, further comprising noise feature estimation means for estimating a noise feature signal from the input signal of the air conduction microphone in a section determined by the determination means to have no correlation,
wherein the noise cancellation means removes noise from the input signal of the air conduction microphone based on the noise feature signal.

6. A voice detection method comprising:
obtaining a correlation value between input signals input from a plurality of microphones having mutually different characteristics;
determining from the correlation value whether the input signals are correlated, and outputting time information of the time at which they are determined to be correlated; and
outputting, as a voice signal, one of the input signals in the section determined to be correlated on the basis of the time information.