JP4476355B2

JP4476355B2 - Echo and noise cancellation

Info

Publication number: JP4476355B2
Application number: JP2009509908A
Authority: JP
Inventors: Xiadong Mao
Original assignee: Sony Computer Entertainment Inc
Current assignee: Sony Interactive Entertainment Inc
Priority date: 2006-05-04
Filing date: 2007-03-30
Publication date: 2010-06-09
Anticipated expiration: 2027-03-30
Also published as: JP4833343B2; EP2012725A2; EP2014132A4; EP2012725A4; WO2007130765A3; JP4866958B2; WO2007130766A3; JP2009535997A; WO2007130765A2; WO2007130766A2; JP2009535996A; EP2014132A2; JP2010171985A

Description

[Priority claim]
This application claims the benefit of each of Patent Documents 1 to 10 listed below, all of which are commonly assigned with, and co-pending with, the present application, and the entire disclosures of which are incorporated herein by reference.
Patent Document 1: US Patent Application No. 11/381,728, Xiadong Mao, "ECHO AND NOISE CANCELATION", filed May 4, 2006 (Attorney Docket No. SCEA05064US00)
Patent Document 2: US Patent Application No. 11/381,729, Xiadong Mao, "ULTRA SMALL MICROPHONE ARRAY", filed May 4, 2006 (Attorney Docket No. SCEA05062US00)
Patent Document 3: US Patent Application No. 11/381,725, Xiadong Mao, "METHODS AND APPARATUS FOR TARGETED SOUND DETECTION", filed May 4, 2006 (Attorney Docket No. SCEA05072US00)
Patent Document 4: US Patent Application No. 11/381,727, Xiadong Mao, "NOISE REMOVAL FOR ELECTRONIC DEVICE WITH FAR FIELD MICROPHONE ON CONSOLE", filed May 4, 2006 (Attorney Docket No. SCEA05073US00)
Patent Document 5: US Patent Application No. 11/381,724, Xiadong Mao, "METHODS AND APPARATUS FOR TARGETED SOUND DETECTION AND CHARACTERIZATION", filed May 4, 2006 (Attorney Docket No. SCEA05079US00)
Patent Document 6: US Patent Application No. 11/381,721, Xiadong Mao, "SELECTIVE SOUND SOURCE LISTENING IN CONJUNCTION WITH COMPUTER INTERACTIVE PROCESSING", filed May 4, 2006 (Attorney Docket No. SCEA04005 JUMBOUS)
Patent Document 7: PCT Application No. PCT/US06/17483, Xiadong Mao, "SELECTIVE SOUND SOURCE LISTENING IN CONJUNCTION WITH COMPUTER INTERACTIVE PROCESSING", filed May 4, 2006 (Attorney Docket No. SCEA04005 JUMBOPCT)
Patent Document 8: US Patent Application No. 11/418,988, Xiadong Mao, "METHODS AND APPARATUSES FOR ADJUSTING A LISTENING AREA FOR CAPTURING SOUNDS", filed May 4, 2006 (Attorney Docket No. SCEA-00300)
Patent Document 9: US Patent Application No. 11/418,989, Xiadong Mao, "METHODS AND APPARATUSES FOR CAPTURING AN AUDIO SIGNAL BASED ON VISUAL IMAGE", filed May 4, 2006 (Attorney Docket No. SCEA-00400)
Patent Document 10: US Patent Application No. 11/429,047, Xiadong Mao, "METHODS AND APPARATUSES FOR CAPTURING AN AUDIO SIGNAL BASED ON A LOCATION OF THE SIGNAL", filed May 4, 2006 (Attorney Docket No. SCEA-00500)

[Technical Field of the Invention]
The present invention relates to acoustic signal processing, and more particularly to echo and noise cancellation in acoustic signal processing.

Many portable electronic devices, such as interactive video game controllers, can handle two-way audio signals. Such a device typically includes a microphone that receives a local speech signal s(t) from the user of the device and a speaker that emits a speaker signal x(t) audible to the user. To make a video game controller more compact, it is desirable to place the microphone and the speaker relatively close together (for example, within 20 cm). The user, by contrast, may be much farther from the microphone (for example, 3 to 5 meters away). The microphone produces a signal d(t) that contains both the local speech signal s(t) and a speaker echo signal x₁(t). The microphone may also pick up background noise n(t), so the overall microphone signal is d(t) = s(t) + x₁(t) + n(t). Because the microphone is relatively close to the speaker, the microphone signal d(t) may be dominated by the speaker echo signal x₁(t).

Speaker echo is a common phenomenon in telecommunications applications, and echo suppression and echo cancellation are relatively mature techniques. An echo suppressor works by detecting a voice signal traveling in one direction on a circuit and then inserting a large loss in the other direction. Usually the echo suppressor at the far end of the circuit adds this loss when it detects voice coming from the near end of the circuit. The added loss prevents the speaker signal x(t) from being retransmitted in the microphone signal d(t).

Although echo suppression is effective, it often leads to several problems. For example, the local speech signal s(t) and the remote speaker signal x(t) frequently occur at the same time, at least briefly. This situation is called double talk. The situation in which only the remote speaker signal is present is called remote single talk. Because each echo suppressor detects voice energy coming from the far end of the circuit, the usual result is that loss is inserted in both directions at once, blocking both parties. To prevent this, an echo suppressor can be set to detect only voice activity from the near-end talker, so that no loss (or only a smaller loss) is inserted when the near-end and far-end talkers speak at the same time. Unfortunately, this temporarily defeats the primary effect of the echo suppressor.

In addition, because an echo suppressor alternately inserts and removes loss, there is often a small delay when a new talker begins speaking, which clips the first sounds of that talker's speech. Furthermore, when the far-end party is in a noisy environment, the near-end talker hears that background sound while the far-end talker is speaking, but the echo suppressor suppresses it as soon as the near-end talker starts to speak. The sudden absence of the background sound gives the near-end user the impression that the line has gone dead.

Echo cancellation techniques were developed to address these problems. Echo cancellation uses analog or digital filters to remove unwanted noise or echo from an input signal and produce a filtered signal e(t). It relies on a complex algorithmic procedure to compute a speech model. The procedure includes feeding the microphone signal d(t) and a portion of the remote signal x(t) to an echo cancellation processor, estimating the speaker echo signal x₁(t), and subtracting that estimate from the microphone signal d(t). The form of the echo estimate must be learned by the echo cancellation processor in a process known as adaptation.

The effectiveness of such techniques is measured by the echo suppression ratio (ESR), which is simply the ratio, typically expressed in decibels, of the true echo energy received at the microphone to the residual echo energy remaining in the filtered signal e(t). According to standards set by the International Telecommunication Union (ITU), echo must be attenuated by at least 45 dB during remote single talk. During double talk (or in strong background noise) this attenuation may drop to 30 dB. These recommendations, however, were developed for systems in which the user producing the local speech signal is closer to the microphone, so that the recorded signal-to-noise ratio (the ratio of target voice energy to echo noise energy) is usually better than 5 dB. They do not apply to applications such as video game controllers, where the user may be 3 to 5 meters away and a loudspeaker less than 0.5 meters from an open microphone produces a loud echo. In such applications the signal-to-noise ratio may range from -15 dB to below -30 dB. An ESR of 60 dB or more may be required for remote single talk, and 35 dB or more for double talk. Existing echo cancellation techniques cannot achieve such high ESR levels.
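As a rough illustration of the ratio just defined, the following Python sketch computes an ESR in decibels from sampled signals. The function name `echo_suppression_ratio_db` and the use of the sum of squared samples as the energy measure are assumptions made for this example only.

```python
import math

def echo_suppression_ratio_db(true_echo, residual_echo):
    """Return 10*log10(true echo energy / residual echo energy).

    Energy is taken as the sum of squared samples (an assumption of
    this sketch, not something the text specifies).
    """
    echo_energy = sum(v * v for v in true_echo)
    residual_energy = sum(v * v for v in residual_echo)
    if residual_energy == 0.0:
        return math.inf       # nothing left of the echo
    if echo_energy == 0.0:
        return -math.inf      # no echo was present to begin with
    return 10.0 * math.log10(echo_energy / residual_energy)

# A residual whose amplitude is 1/1000 of the echo amplitude has
# 1e-6 of its energy, which corresponds to roughly 60 dB.
echo = [1.0, -1.0, 0.5, -0.5]
residual = [v / 1000.0 for v in echo]
esr = echo_suppression_ratio_db(echo, residual)
```

In this toy case the result lands at about the 60 dB level that the text gives as a possible requirement for remote single talk.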

Accordingly, there is a need in the art for an echo cancellation system and method that overcomes the above disadvantages.

[Summary of Invention]
To overcome the above disadvantages, embodiments of the present invention are directed to methods and apparatus for echo cancellation in a system having a speaker and a microphone. The speaker receives a speaker signal x(t). The microphone receives a microphone signal d(t) containing a local signal s(t) and an echo signal x₁(t), where the echo signal x₁(t) depends on the speaker signal x(t). The microphone signal d(t) is filtered in parallel by a first adaptive filter and a second adaptive filter that have complementary echo cancellation properties. A minimum echo output e₃(t) is determined from the output e₁(t) of the first adaptive filter and the output e₂(t) of the second adaptive filter. The minimum echo output is the one with the smaller energy and the smaller correlation with the speaker signal x(t). A microphone output is then generated using the minimum echo output e₃(t). Optionally, residual echo cancellation and/or noise cancellation may be applied to the minimum echo output.

A schematic diagram of an echo cancellation apparatus according to an embodiment of the present invention.
A schematic diagram of a voice activity detection adaptive filter that may be used in the echo cancellation apparatus of FIG. 1A.
A schematic diagram of an adaptive filter with cross-correlation analysis that may be used in the echo cancellation apparatus of FIG. 1A.
A flowchart illustrating an echo cancellation method according to an embodiment of the present invention.
A flowchart illustrating another method for echo cancellation according to an embodiment of the present invention.
A schematic diagram of an echo cancellation apparatus according to another embodiment of the present invention.

[Description of Specific Embodiments]
Although the following detailed description contains many specific details for purposes of illustration, anyone of ordinary skill in the art will appreciate that many variations and alterations to these details are within the scope of the invention. Accordingly, the embodiments of the invention described below are set forth without any loss of generality to, and without imposing limitations upon, the claimed invention.

Embodiments of the present invention propose a new configuration for an integrated echo and noise canceller that has two functionally identical filters with orthogonal controls and representations. In this configuration the two orthogonal filters complement each other so as to increase the robustness of the overall system in noisy hands-free voice communication.

In particular, the integrated echo and noise canceller uses two separately controlled subsystems in parallel, each with an orthogonal control mechanism. The echo and noise canceller includes a front echo canceller and a backup echo canceller. The front echo canceller uses double-talk detection and takes a conservative adaptation approach to ensure that it is robust against local voice; as a result it provides less echo suppression and adapts slowly to changes in speech and echo. The backup echo canceller uses cross-correlation to measure the similarity between the error signal and the echo signal, and takes an aggressive strategy so that its filter is updated quickly. It provides large echo suppression but, because it may over-adapt, is unstable in the presence of local voice or noise. The outputs of the two echo cancellers are integrated on the basis of a cross-correlation analysis that measures which echo canceller's output differs more from the echo signal. The integration also checks the filter stability of both echo cancellers. If one filter is overestimated or underestimated, it is complemented by the other filter. The system is designed so that one filter is working correctly at any given time.
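One way to picture the integration step is as a per-frame choice between the two canceller outputs. The Python sketch below is only an illustration under stated assumptions: the helper names, the use of zero-lag correlation with the loudspeaker frame, and the use of energy as a tie-breaker are all hypothetical choices, not details given in the text.

```python
def energy(frame):
    """Sum of squared samples of one frame."""
    return sum(v * v for v in frame)

def abs_correlation(a, b):
    """Magnitude of the zero-lag correlation of two equal-length frames."""
    return abs(sum(p * q for p, q in zip(a, b)))

def select_min_echo_output(e1, e2, x):
    """Return whichever canceller output appears to carry less echo.

    e1, e2: output frames of the two echo cancellers.
    x: the loudspeaker frame.
    The output less correlated with x is preferred; energy breaks ties
    (this ordering is an assumption of the sketch).
    """
    key1 = (abs_correlation(e1, x), energy(e1))
    key2 = (abs_correlation(e2, x), energy(e2))
    return e1 if key1 <= key2 else e2

# Toy frames: both outputs contain the same local voice s, but e1 still
# holds half of the loudspeaker signal while e2 holds only 1% of it.
x = [1.0, -1.0, 1.0, -1.0]
s = [0.2, 0.1, -0.1, 0.3]
e1 = [sv + 0.5 * xv for sv, xv in zip(s, x)]
e2 = [sv + 0.01 * xv for sv, xv in zip(s, x)]
chosen = select_min_echo_output(e1, e2, x)
```

Here `chosen` is `e2`, the output with the weaker remaining echo.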

The system may optionally include a residual echo noise estimator that takes a similar approach, using two independent sub-estimators with orthogonal controls in parallel. The first estimator is based on echo-distance mismatch and relies on a robust double-talk detector. It is relatively accurate, but can be unstable because of double-talk detection errors. The second estimator is based on cross-spectrum analysis. Its estimate is biased but stable and consistent, and does not depend on local voice detection. The two residual echo estimates are integrated with a minimum/maximum approach: the minimum for far-end talk only, and the maximum for double talk.
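The minimum/maximum integration can be sketched in a few lines of Python. The function name `combine_residual_estimates`, the representation of each estimate as a list of per-bin values, and the boolean `double_talk` flag are assumptions made for illustration.

```python
def combine_residual_estimates(est1, est2, double_talk):
    """Combine two residual-echo estimates element by element.

    Far-end talk only -> take the minimum of the two estimates.
    Double talk       -> take the maximum of the two estimates.
    """
    pick = max if double_talk else min
    return [pick(a, b) for a, b in zip(est1, est2)]

far_end_only = combine_residual_estimates([0.1, 0.4], [0.3, 0.2], double_talk=False)
during_double_talk = combine_residual_estimates([0.1, 0.4], [0.3, 0.2], double_talk=True)
```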

FIG. 1A shows an audio system 99 that uses an echo cancellation apparatus 100 according to an embodiment of the present invention. The operation of the apparatus 100 may be understood with reference to the flowchart of method 200 shown in FIG. 2A and method 220 shown in FIG. 2B. The audio system 99 generally includes a speaker 102, which receives a remote signal x(t), and a microphone 104. A local source 101 produces a local speech signal s(t). The microphone 104 receives both the local speech signal s(t) and an echo signal x₁(t) that is related to the speaker signal x(t). The microphone 104 also receives noise n(t) from the environment in which it is located. The microphone 104 thus generates a microphone signal d(t), which may be given by d(t) = s(t) + x₁(t) + n(t).

The echo cancellation apparatus 100 generally includes a first adaptive echo cancellation filter EC(1) and a second adaptive echo cancellation filter EC(2). Each adaptive filter receives the microphone signal d(t) and the speaker signal x(t). As shown in FIGs. 2A-2B, filter EC(1) adaptively filters the microphone signal d(t) as indicated at step 202, and filter EC(2) adaptively filters the microphone signal d(t) in parallel with the first filter EC(1) as indicated at step 204. As used here, filters "operate in parallel" when they receive substantially the same input d(t). Parallel operation is distinguished from serial operation, in which the output of one filter serves as the input to the other. Depending on the states of the two filters EC(1) and EC(2), one filter serves as the main "front" filter and the other as a "backup" filter. One filter takes a cautious approach to echo cancellation, while the other takes a more aggressive approach.

The states of the filters EC(1) and EC(2) may be understood in terms of the following signal model, in which * denotes convolution:
y(t) = x(t) * h(n)
d(t) = y₀(t) + s(t)
e(t) = d(t) - y(t)
where:
y(t) is the echo synthesized by the echo canceller filter;
x(t) is the echo played through the loudspeaker;
h(n) is the adaptive filter function of the echo canceller filter;
d(t) is the noisy signal received by the microphone;
y₀(t) is the true echo appearing at the microphone;
s(t) is the local voice; and
e(t) is the echo-cancelled residual signal produced by the echo canceller filter.
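The signal model can be exercised numerically. The following Python sketch assumes a short causal FIR echo path and sample-indexed lists; the function names `convolve` and `residual` and the toy values are illustrative only.

```python
def convolve(x, h):
    """Causal FIR filtering: y[t] = sum over n of h[n] * x[t - n]."""
    return [
        sum(h[n] * x[t - n] for n in range(len(h)) if t - n >= 0)
        for t in range(len(x))
    ]

def residual(d, x, h):
    """e(t) = d(t) - y(t), where y(t) is x(t) filtered by h(n)."""
    y = convolve(x, h)
    return [dt - yt for dt, yt in zip(d, y)]

x = [1.0, 0.0, -1.0, 0.5]          # loudspeaker signal
true_path = [0.5, 0.25]            # hypothetical true echo path
s = [0.1, -0.2, 0.3, 0.0]          # local voice
y0 = convolve(x, true_path)        # true echo at the microphone
d = [a + b for a, b in zip(y0, s)] # d(t) = y0(t) + s(t)

# When h(n) matches the true echo path, the residual is the local voice.
e = residual(d, x, true_path)
```

With a perfectly matched filter the residual `e` equals the local voice `s` up to rounding, which is the ideal the adaptive filters are trying to reach.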

The two filters EC(1) and EC(2) have complementary echo cancellation properties. As used here, two adaptive filters that receive the same input "have complementary echo cancellation properties" if, whenever one filter is not well adapted to the input, the other filter is well adapted to it. In the context of this application, a filter function h(n) is "well adapted" when it is stable, has converged to the true echo-path filter, and is neither overestimated nor underestimated.

When h(n) has converged to the true echo-path filter, y(t) ≈ y₀(t); that is, the estimated echo is approximately equal to the true echo. The state of the echo canceller filters EC(1), EC(2) may be quantified with a coherence function α that is related to the cross-correlation between y(t) and e(t):
$$\alpha = \frac{E[\,e(t) \star y(t)\,]}{E[\,y(t) \star y(t)\,]} \qquad \text{(Formula 1)}$$
where E denotes statistical expectation and the operator
$$\star \qquad \text{(Formula 2)}$$
denotes the cross-correlation operation. For discrete functions f_i and g_i, the cross-correlation is defined by
$$(f \star g)_i = \sum_j f_j^{*}\, g_{i+j} \qquad \text{(Formula 3)}$$
where the sum is taken over appropriate values of the integer j and the asterisk denotes complex conjugation. For continuous functions f(x) and g(x), the cross-correlation is defined by
$$(f \star g)(x) = \int f^{*}(t)\, g(x+t)\, dt \qquad \text{(Formula 4)}$$
where the integral is taken over appropriate values of t.

In the coherence function α, the numerator is the cross-correlation of e(t) and y(t), and the denominator is the autocorrelation of y(t), which serves as a normalization term.

Ideally, if h(n) has converged, α should be close to 0, since the residual signal e(t) then contains no y(t). If h(n) has not converged, α should be close to 1, since e(t) contains a strong echo of y(t). If h(n) is misbehaving or has diverged, α should be negative, since filter divergence leaves e(t) containing a strong echo that is phase-shifted by 180 degrees.
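The three regimes described above can be seen in a small numerical sketch. It assumes real-valued signals, evaluates the correlations at zero lag only, and replaces the statistical expectation with a sum over one frame; the function name `coherence` is hypothetical.

```python
def coherence(e, y):
    """Zero-lag, single-frame estimate of alpha = E[e*y] / E[y*y]."""
    num = sum(a * b for a, b in zip(e, y))
    den = sum(b * b for b in y)
    return num / den if den > 0.0 else 0.0

y = [1.0, -1.0, 2.0, -2.0]                       # synthesized echo

alpha_converged = coherence([1.0, 1.0, 1.0, 1.0], y)   # residual unrelated to y
alpha_unconverged = coherence(y, y)                    # residual is all echo
alpha_diverged = coherence([-v for v in y], y)         # echo with inverted phase
```

The three values come out as 0, 1 and -1, matching the ideal cases of a converged, unconverged and diverged filter.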

Thus, by way of example and without limitation, the value of the coherence function α may be used to define four possible states for the filters EC(1) and EC(2):
(1) 0 <= α <= 0.1 when the filter h(n) is stable, has converged, and is neither overestimated nor underestimated;
(2) α > 0.2 when the filter h(n) is stable but underestimated (not yet converged);
(3) α < -0.1 when the filter h(n) is overestimated;
(4) α < -0.25 when the filter h(n) has diverged.
Those skilled in the art will recognize that different ranges of α values may be chosen for these states.
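The four example ranges translate directly into a small classifier. In the Python sketch below, the state labels and the "indeterminate" label for values that fall between the listed ranges are assumptions; because ranges (3) and (4) overlap, the sketch tests the divergence threshold first.

```python
def filter_state(alpha):
    """Map a coherence value to one of the example filter states."""
    if alpha < -0.25:
        return "diverged"          # state (4), checked before state (3)
    if alpha < -0.1:
        return "over-estimated"    # state (3)
    if 0.0 <= alpha <= 0.1:
        return "converged"         # state (1)
    if alpha > 0.2:
        return "under-estimated"   # state (2)
    return "indeterminate"         # gaps between the listed ranges

states = [filter_state(a) for a in (0.05, 0.5, -0.15, -0.3, 0.15)]
```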

If a filter is in a good state (for example, state (1)), its settings may be saved so that they can be restored if the filter later diverges. If a filter has diverged, or is underestimated or overestimated, the front and backup echo cancellers exchange roles: the front filter becomes the backup, and the backup filter takes over as the front filter. Because one filter adapts cautiously and the other aggressively, this exchange ultimately lets both filters converge faster and remain more dynamically stable.

In addition, when a filter is underestimated or overestimated, the adaptation step size may be increased or decreased by a small delta to speed up adaptation for faster convergence or slow it down for more stable tracking. A larger step size generally gives faster convergence, but at the expense of good tracking of fine detail, so the echo suppression ratio ESR is lower. A smaller step size converges more slowly; it is more stable and can track subtle changes, but it is not suited to quickly tracking echo dislocation.
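A minimal Python sketch of such a step-size adjustment follows. The state labels, the default delta, and the clamping bounds `mu_min` and `mu_max` are hypothetical values chosen for illustration; the text only says that the step is nudged by a small delta.

```python
def adjust_step_size(mu, state, delta=0.01, mu_min=0.001, mu_max=1.0):
    """Nudge the adaptation step size according to the filter state.

    Underestimated filter -> larger step (faster convergence).
    Overestimated filter  -> smaller step (more stable tracking).
    Any other state leaves the step unchanged.
    """
    if state == "under-estimated":
        mu += delta
    elif state == "over-estimated":
        mu -= delta
    return min(mu_max, max(mu_min, mu))   # keep mu within assumed bounds

mu_faster = adjust_step_size(0.5, "under-estimated")
mu_slower = adjust_step_size(0.5, "over-estimated")
mu_same = adjust_step_size(0.5, "converged")
```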

Combining a dynamic step size with front/backup filter exchange gives the overall system a better balance between fast tracking and detailed tracking, and between stability and convergence. These two trade-offs are the critical twin issues in adaptive system design.

If one filter has diverged while the other is in a good state, the settings of the good filter may be copied to re-initialize the diverged filter. Alternatively, the diverged filter may be recovered using previously saved settings from a good state.
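The two recovery options can be sketched as a single helper. In this Python illustration the function name `recover`, the order in which the options are tried, and the fall-back to all-zero coefficients are assumptions; the text presents the two options as alternatives without ranking them.

```python
def recover(diverged_h, other_h, other_is_good, saved_h):
    """Return replacement coefficients for a diverged filter.

    diverged_h:    coefficients of the filter that diverged.
    other_h:       coefficients of the other filter.
    other_is_good: True if the other filter is currently in a good state.
    saved_h:       previously saved good coefficients, or None.
    """
    if other_is_good:
        return list(other_h)             # copy settings of the good filter
    if saved_h is not None:
        return list(saved_h)             # restore the last saved good state
    return [0.0] * len(diverged_h)       # cold start (assumed fall-back)

from_other = recover([9.0, 9.0], [0.5, 0.25], True, [0.4, 0.2])
from_saved = recover([9.0, 9.0], [7.0, 7.0], False, [0.4, 0.2])
cold_start = recover([9.0, 9.0], [7.0, 7.0], False, None)
```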

By way of example and without limitation, the echo-cancelling adaptive filters EC(1) and EC(2) may be based on frequency-domain normalized least mean squares adaptive filters. Each filter may be implemented in hardware, in software, or in a combination of hardware and software.

FIGS. 1B and 1C show examples of suitable complementary adaptive filters. Specifically, FIG. 1B shows an adaptive echo cancellation filter 120 with voice activity detection. The filter 120 can be used as the first adaptive filter EC(1). The filter 120 includes a variable filter 122, which is a finite impulse response (FIR) filter characterized by filter coefficients w_t. The variable filter 122 receives the microphone signal d(t) and filters it according to the values of the filter coefficients w_t to produce a filtered signal d′(t). The variable filter 122 predicts the desired signal by convolving its input signal with the impulse response determined by the coefficients w_t1. Each filter coefficient w_t1 is updated at regular intervals by an amount Δw_t according to the update algorithm 124.
As an example, the filter coefficients w_t may be chosen so that the filtered signal d′(t) attempts to predict the speaker echo signal x₁(t) as the desired signal. A difference unit 126 subtracts the filtered signal d′(t) from the microphone signal d(t) to provide a prediction signal e₁(t), which predicts the local speech signal s(t). The filtered signal d′(t) may be subtracted from the remote signal x(t) to produce an error signal e(t), which the update algorithm 124 uses to adjust the filter coefficients w_t. The update algorithm 124 generates a correction factor based on the remote signal x(t) and the error signal. Examples of coefficient update algorithms include least mean squares (LMS) and recursive least squares (RLS). In the LMS update algorithm, for example, the filter coefficients are updated according to w_(t1+1) = w_t1 + μ e(t) x(t), where μ is the step size. Initially, w_t1 = 0 for all coefficients. Note that in this example the quantity μ e(t) x(t) is the amount Δw_t. As described above, the step size μ may be adjusted dynamically according to the state of the adaptive filter. Specifically, if the filter is under-predicted, μ may be increased by a small delta to speed up adaptation so that the filter converges quickly. If instead the filter is over-predicted, μ may be decreased by a small delta to slow adaptation so that tracking is more stable.
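The LMS recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes the conventional textbook arrangement in which the FIR filter is driven by the far-end signal x(t) and the error is the microphone sample minus the echo estimate, which may differ in detail from the signal routing described for FIG. 1B. The names `lms_echo_cancel`, `num_taps`, and `mu` are hypothetical.

```python
def lms_echo_cancel(x, d, num_taps=4, mu=0.05):
    """Textbook LMS echo canceller (assumed arrangement, not taken from FIG. 1B).

    x: far-end (speaker) samples, d: microphone samples.
    Returns (e, w): the echo-cancelled output and the final FIR coefficients.
    """
    w = [0.0] * num_taps          # initially every coefficient is zero
    history = [0.0] * num_taps    # most recent far-end samples, newest first
    e = []
    for x_t, d_t in zip(x, d):
        history = [x_t] + history[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, history))
        err = d_t - echo_estimate
        # LMS update: delta_w = mu * e(t) * x(t), applied tap by tap
        w = [wi + mu * err * xi for wi, xi in zip(w, history)]
        e.append(err)
    return e, w
```

With a noise-free echo and a sufficiently small `mu`, the coefficients `w` approach the echo path and the output `e` decays toward the local signal (here, zero).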

The time-domain expression e(t)x(t) is a multiplication. This computation may be implemented in the frequency domain as follows. First, e(t), x(t), and h(n) may be transformed from the time domain to the frequency domain, for example by fast Fourier transform (FFT):
E(k) = fft(e(t))
X(k) = fft(x(t))
H(k) = fft(h(n))

The LMS update algorithm in the frequency domain is then:
H(k) = H(k) + (μ * conj(X(k)) .* E(k)) / (Δ + X(k) * conj(X(k)))
where μ is the filter adaptation step size, which is dynamic;
conj(a) denotes the complex conjugate of a complex number a;
* denotes complex multiplication; and
Δ is a regulator that keeps the denominator from becoming numerically unstable.

In the above equation, "conj(X(k)) .* E(k)" performs the "e(t)x(t)" task. In the denominator, "X(k) * conj(X(k))" serves as a normalization that improves stability.
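As a rough illustration of the frequency-domain update, the sketch below applies the equation to each bin k. It assumes the spectra X(k), E(k), and H(k) have already been computed (the FFT step is omitted) and are given as lists of Python complex numbers. The function name `fd_nlms_update` and the default values of `mu` and `delta` are assumptions made for illustration.

```python
def fd_nlms_update(H, X, E, mu=0.5, delta=1e-6):
    """One per-bin update: H(k) += mu * conj(X(k)) * E(k) / (delta + |X(k)|^2)."""
    return [
        # x * conj(x) is real-valued; .real drops the zero imaginary part
        h + (mu * x.conjugate() * e) / (delta + (x * x.conjugate()).real)
        for h, x, e in zip(H, X, E)
    ]
```

With `mu = 1` and `delta = 0`, a single update from H(k) = 0 recovers the true response in every bin where X(k) is nonzero, which shows how the denominator normalizes the step by the input power.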

The voice activity detection VAD adjusts the update algorithm 124 so that the variable filter 122 adaptively filters the microphone signal d(t) only when the remote signal x(t) is present (e.g., at or above a predetermined threshold). An adaptive filter that uses voice activity detection (sometimes called double-talk detection) of the type shown in FIG. 1B adapts relatively slowly. However, it is also very accurate in that it produces few false positives. A complementary adaptive filter for the filter 120 may be, for example, one that adapts relatively quickly but tends to produce frequent false positives.
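One simple way to picture the gating is to freeze adaptation whenever the far-end frame is too quiet. The sketch below is an assumption-laden illustration: the frame-averaged power measure, the threshold value, and the names `far_end_active` and `gated_step_size` are not from the source.

```python
def far_end_active(x_frame, threshold=1e-3):
    """Return True when the mean power of the far-end frame reaches the threshold."""
    if not x_frame:
        return False
    power = sum(s * s for s in x_frame) / len(x_frame)
    return power >= threshold


def gated_step_size(x_frame, mu, threshold=1e-3):
    """Step size for this frame: mu while x(t) is present, else 0 (adaptation frozen)."""
    return mu if far_end_active(x_frame, threshold) else 0.0
```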

As an example, FIG. 1C shows an adaptive filter 130 that is complementary to the filter 120 of FIG. 1B. The adaptive filter 130 includes a variable filter 132 characterized by filter coefficients w_t2 and an update algorithm 134 (e.g., the LMS update algorithm described above). The variable filter 132 attempts to predict the speaker echo signal x₁(t) as the desired signal. A difference unit 136 subtracts the filtered signal d′(t) from the microphone signal d(t) to provide a prediction signal e₂(t) that predicts the local speech signal s(t). The filtered signal d′(t) may be subtracted from the remote signal x(t) to produce an error signal e(t), which the update algorithm 134 uses to adjust the filter coefficients w_t2. In the filter 130, a cross-correlation analysis CCA adjusts the update algorithm 134 so that the variable filter 132 tries to reduce the cross-correlation between the prediction signal e₂(t) and the speaker echo signal x(t).

When e₂(t) and x(t) are very strongly correlated, the filtering is said to be under-predicted, and the update algorithm 134 is adjusted to increase Δw_t2. When the cross-correlation between e₂(t) and x(t) is below a threshold, the filtering is said to be over-predicted, and the update algorithm 134 is adjusted to decrease Δw_t2.
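The correlation-driven adjustment can be sketched as a step-size controller. This is a hedged illustration only: it scales the update through the step size μ, uses a zero-lag normalized cross-correlation as the correlation measure, and picks a threshold, delta, and clamping range arbitrarily. None of these specifics, nor the names `normalized_xcorr` and `adjust_step`, come from the source.

```python
import math


def normalized_xcorr(a, b):
    """Zero-lag normalized cross-correlation magnitude of two equal-length frames."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return abs(num) / den if den > 0.0 else 0.0


def adjust_step(mu, e2_frame, x_frame, threshold=0.2, delta=0.01,
                mu_min=0.0, mu_max=1.0):
    """Raise mu while e2 still correlates with x (under-predicted), else lower it."""
    if normalized_xcorr(e2_frame, x_frame) >= threshold:
        mu += delta
    else:
        mu -= delta
    # keep the step size inside the assumed stable range
    return min(mu_max, max(mu_min, mu))
```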

An adaptive filter that uses cross-correlation analysis (also called cross-spectrum analysis) of the type shown in FIG. 1C adapts relatively quickly. However, it is also unstable in that it frequently produces false positives. The filters 120 and 130 are therefore examples of complementary filters.

Referring again to FIG. 1A, an integrator 106 is connected to the first adaptive filter EC(1) and the second adaptive filter EC(2). The integrator 106 is configured to determine a minimum echo output e₃(t) from the respective outputs e₁(t) and e₂(t) of the first and second adaptive filters. The minimum echo output e₃(t) is whichever of e₁(t) and e₂(t) has the smaller energy and the smaller correlation with the speaker signal x(t). If one of e₁(t) and e₂(t) has the smaller energy but the other has the smaller correlation with x(t), the one with the smaller correlation is used as the minimum echo output e₃(t). For example, when one of the filters is over-predicted (i.e., its output energy is small because it tends to cancel the target speech), the output with the smaller correlation is preferred regardless of energy. The minimum energy may be determined by finding the minimum of E{e₁(t)} and E{e₂(t)}, where E{} denotes the operation of taking the expected value of the quantity in braces. Referring again to FIGS. 2A-2B, in step 206 a cross-correlation analysis may be performed on e₁(t) and e₂(t) to determine which of them has the smaller cross-correlation with the speaker signal x₁(t). The cross-correlation analysis may include determining the minimum of Formula 5 and Formula 6 below.
(Formula 5)

(Formula 6)

Here, the operator of Formula 7
(Formula 7)

denotes, for example, the operation of taking the cross-correlation between the quantities on either side of the operator, as defined above. The minimum echo output e₃(t) may be used as the filtered output of the microphone 104.
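The selection rule of the integrator can be sketched as follows. Because the cross-correlation formulas are not reproduced in this text, the sketch substitutes a zero-lag, frame-averaged product for them, which is an assumption. Since correlation overrides energy whenever the two criteria disagree, the sketch treats correlation as the primary key and energy as the tie-breaker. All function names are hypothetical.

```python
def energy(frame):
    """Frame-averaged energy, standing in for the expectation E{.}."""
    return sum(s * s for s in frame) / len(frame)


def xcorr(a, b):
    """Zero-lag, frame-averaged cross-correlation magnitude (assumed measure)."""
    return abs(sum(p * q for p, q in zip(a, b))) / len(a)


def select_min_echo(e1, e2, x):
    """Return (label, frame) for the candidate chosen as e3."""
    c1, c2 = xcorr(e1, x), xcorr(e2, x)
    if c1 != c2:
        # the smaller correlation wins regardless of energy
        return ("e1", e1) if c1 < c2 else ("e2", e2)
    # equal correlation: fall back to the smaller energy
    return ("e1", e1) if energy(e1) <= energy(e2) else ("e2", e2)
```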

In some situations, one of the filters EC(1), EC(2) may filter the local signal excessively. A filter in that condition is said to have "diverged". This is a real possibility particularly when EC(2) is a cross-correlation filter, e.g., of the type shown in FIG. 1C. To address this possibility, step 208 determines whether EC(2) has diverged. As an example, the integrator 106 may be configured to determine whether the second adaptive echo cancellation filter has removed the local signal s(t) through excessive filtering. This can be done by examining the expected value of the cross-correlation between e₂(t) and the speaker echo signal x₁(t), given by Formula 8.
(Formula 8)

Typically, Formula 9 holds.
(Formula 9)

However, when Formula 10 is below some threshold (e.g., about 0.2), EC(2) has filtered excessively and removed the local signal s(t).
(Formula 10)

In this situation, the integrator 106 may select e₁(t) as the minimum echo output e₃(t). To stabilize the adaptive filtering, in step 212 the filter coefficients w_t2 of EC(2) may be set to the filter coefficients w_t1 of EC(1). Then, in step 215, EC(2) may be re-initialized to zero or to a previous state known to be well adapted. For example, the filter coefficients may be saved at regular intervals (e.g., about every 10 to 20 seconds) and used to re-initialize EC(2) when it diverges.
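The divergence handling might be sketched as below. The correlation value is passed in as a parameter because its formula is not reproduced in this text. As a simplification, the sketch restarts EC(2) from a saved snapshot when one exists and otherwise from a copy of EC(1)'s coefficients; re-initializing to zero, which the text also allows, is omitted. The name `recover_ec2` and the handling order are assumptions.

```python
def recover_ec2(w1, w2, corr_e2_x, snapshot=None, threshold=0.2):
    """Return (new_w2, diverged) after the divergence check on EC(2).

    corr_e2_x: expected cross-correlation between e2(t) and the echo signal.
    snapshot:  optional coefficient set saved while EC(2) was well adapted.
    """
    if corr_e2_x >= threshold:
        return list(w2), False      # EC(2) has not diverged; keep its state
    if snapshot is not None:
        return list(snapshot), True  # restart from the last known good state
    return list(w1), True            # otherwise take over EC(1)'s coefficients
```

When the second return value is True, the caller would also select e₁(t) as the minimum echo output for that frame.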

In general, a cross-correlation filter that has not diverged is said to be well adapted. Because EC(2) and EC(1) have complementary filtering characteristics, EC(1) is under-predicted when EC(2) is well adapted. To stabilize the adaptive filtering, the filter coefficients w_t1 of the first adaptive filter EC(1) are exchanged with the filter coefficients w_t2 of the second adaptive filter EC(2), as shown in step 214. When the filters are implemented in software, the coefficients w_t1 and w_t2 may be stored at memory locations identified by pointers, and may be exchanged, for example, by switching the pointers to w_t1 and w_t2.
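In a language with references rather than raw pointers, the same constant-time exchange can be sketched by swapping the references to the two coefficient buffers instead of copying their contents. The `AdaptiveFilter` class here is a hypothetical stand-in used only for illustration.

```python
class AdaptiveFilter:
    """Minimal holder for a coefficient buffer (hypothetical type for illustration)."""

    def __init__(self, coeffs):
        self.coeffs = coeffs


def swap_coefficients(ec1, ec2):
    """Exchange the two coefficient sets by swapping references, not by copying."""
    ec1.coeffs, ec2.coeffs = ec2.coeffs, ec1.coeffs
```

The cost of the swap does not depend on the filter length, which matters when the exchange happens inside a real-time audio loop.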

The minimum echo output e₃(t) may contain some residual echo xe(t) from the speaker signal x(t). The apparatus 100 may optionally include first and second echo residual prediction units ER(1) and ER(2) connected to the integrator 106, and a residual echo cancellation module 108 connected to the echo residual prediction units ER(1) and ER(2).

The first echo residual prediction unit ER(1) may be configured to generate a first residual echo prediction ER₁(t) that includes a cross-correlation analysis between the minimum echo output e₃(t) and the speaker signal x(t). As shown in step 222 of FIG. 2B, the first residual echo prediction ER₁(t) may be determined from the cross-correlation analysis between e₃(t) and x(t), for example by determining the value of Formula 11.
(Formula 11)

Here, the value of Formula 11 is true when e₃(t) minimizes the expected value of the cross-correlation of Formula 12.
(Formula 12)

This minimization problem is, in essence, solved by adaptation. For example, assume that the echo residual prediction unit ER(1) is initially a unit filter (all values "1"). In every frame, the first residual echo prediction ER₁(t) may be incremented toward the tangent direction of the search surface. This may be implemented with a Newton solver algorithm. The second echo residual prediction unit ER(2) may be configured to determine a second residual echo prediction ER₂(t) that includes an echo-distance mismatch between the minimum echo output e₃(t) and the speaker signal x(t). As shown in step 224 of FIG. 2B, the second residual echo prediction ER₂(t) may be determined from the echo-distance mismatch between e₃(t) and x(t), for example by determining argmin(E{(e₃(t))² / (x(t))²}). Here, argmin(E{(e₃(t))² / (x(t))²}) is true when e₃(t) minimizes the expected value of the quotient (e₃(t))² / (x(t))². Again, the minimization may be implemented with a Newton solver algorithm.

The residual echo cancellation module 108 may determine a minimum residual echo prediction ER₃(t) from the two residual echo predictions ER₁(t) and ER₂(t), and may adjust the filtered signal e₃(t) according to that minimum ER₃(t). As an example, the minimum residual echo prediction ER₃(t) may be whichever of ER₁(t) and ER₂(t) has the least energy and the least correlation with x(t). For example, ER₃(t) is set to the minimum of ER₁(t) and ER₂(t), as shown in step 226 of FIG. 2B, and the resulting ER₃(t) is subtracted from e₃(t), as shown in step 228, to produce a residual-echo-cancelled filtered signal e₃′(t). If ER₃(t) equals ER₁(t), the residual echo xe(t) is removed minimally, when the strength of the local speech signal s(t) is not zero. If ER₃(t) equals ER₂(t), the residual echo xe(t) is removed maximally, when only the far-end echo x(t) is present (a period of far-end speech only).

As an example, second-order norms N(1) and N(2) may be computed for the two echo residual prediction units ER(1) and ER(2), respectively:
N(1) = ‖ER(1)‖
N(2) = ‖ER(2)‖

In a double-talk situation, the echo residual prediction unit with the smaller norm may be applied to e₃(t) to remove echo residual noise. In a single-talk situation, the echo residual prediction unit with the larger norm may be applied to e₃(t) to remove echo residual noise.
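The norm-based choice can be sketched as follows, reading "second-order norm" as the Euclidean (2-)norm of each prediction unit's coefficient vector. That reading, the function names, and the boolean `double_talk` flag (assumed to come from a separate detector) are assumptions for illustration.

```python
import math


def l2_norm(coeffs):
    """Euclidean norm of a coefficient vector."""
    return math.sqrt(sum(c * c for c in coeffs))


def pick_residual_estimator(er1, er2, double_talk):
    """Smaller-norm unit during double talk, larger-norm unit during single talk."""
    n1, n2 = l2_norm(er1), l2_norm(er2)
    if double_talk:
        return er1 if n1 <= n2 else er2
    return er1 if n1 >= n2 else er2
```

Choosing the smaller norm during double talk removes less, which protects the local speech; choosing the larger norm during single talk removes more, when only echo is present.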

In echo cancellation, noise n(t) may be removed from the filtered signal e₃(t) or from the residual-echo-cancelled filtered signal e₃′(t). Such noise cancellation may be undesirable, however, because a remote recipient of the signal e₃(t) or e₃′(t) may interpret the absence of noise as a sign that all communication from the microphone 104 has been lost. To address this, the apparatus 100 may optionally include a noise canceller unit 110. The noise canceller unit 110 may be configured to compute a predicted noise signal n′(t) from the microphone signal d(t), for example as shown in step 217 of FIGS. 2A-2B. The predicted noise signal n′(t) may be attenuated by an attenuation factor α to form a reduced noise signal n″(t) = αn′(t). The reduced noise signal n″(t) may be incorporated into the microphone output signal s′(t) by adding it to e₃(t), as shown in step 218 of FIG. 2A, or to e₃′(t), as shown in step 230 of FIG. 2B.
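The noise re-insertion step reduces to a scaled addition, sketched below. How the predicted noise n′(t) is obtained is not shown, and the default value of `alpha` and the function name `add_comfort_noise` are assumptions.

```python
def add_comfort_noise(e3, noise_estimate, alpha=0.1):
    """Return s'(t) = e3(t) + alpha * n'(t), sample by sample."""
    return [e + alpha * n for e, n in zip(e3, noise_estimate)]
```

For instance, `add_comfort_noise([1.0, 2.0], [10.0, -10.0], alpha=0.5)` adds half of each noise sample to the corresponding output sample.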

In embodiments of the invention, the apparatus described with respect to FIGS. 1A-1C and the methods described with respect to FIGS. 2A-2C may be implemented as software on a system having a programmable processor and a memory.

According to embodiments of the present invention, a signal processing method of the type described with respect to FIG. 1 and FIGS. 2A-2B, operating as described above, may be implemented as part of a signal processing apparatus 300, as shown in FIG. 3. The system 300 may include a processor 301 and a memory 302 (e.g., RAM, DRAM, ROM, and the like). The signal processing apparatus 300 may also have multiple processors 301 if parallel processing is to be implemented. The memory 302 includes data and code configured as described above. Specifically, the memory 302 may store program code 304 and signal data 306. The code 304 may implement the echo-canceling adaptive filters EC(1) and EC(2), the integrator 106, the echo residual prediction units ER(1) and ER(2), the residual echo cancellation module 108, and the noise canceller 110 described above. The signal data 306 may include digital representations of the microphone signal d(t) and/or the speaker signal x(t).

The apparatus 300 may also include well-known support functions 310, such as input/output (I/O) elements 311, a power supply (P/S) 312, a clock (CLK) 313, and a cache memory 314. The apparatus 300 may optionally include a mass storage device 315, such as a disk drive, CD-ROM drive, or tape drive, for storing programs and/or data. The apparatus 300 may also optionally include a display unit 316 and a user interface unit 318 to facilitate interaction between the apparatus 300 and a user. The display unit 316 may be a cathode ray tube (CRT) or a flat panel screen that displays text, numerals, graphical symbols, or images. The user interface 318 may include a keyboard, mouse, joystick, light pen, or other device. In addition, a speaker 322 and a microphone 324 may be connected to the processor 301 via the I/O elements 311. The processor 301, the memory 302, and the other components of the system 300 may exchange signals (e.g., code instructions and data) with one another via a system bus 320, as shown in FIG. 3.

As used herein, the term I/O generally refers to any program, operation, or device that transfers data to or from the system 300 and to or from a peripheral device. Every data transfer may be regarded as an output from one device and an input into another. Peripheral devices include input-only devices such as keyboards and mice, output-only devices such as printers, and devices that can act as both input and output devices, such as a writable CD-ROM. The term "peripheral device" includes external devices such as a mouse, keyboard, printer, monitor, microphone, game controller, camera, external Zip drive, or scanner; internal devices such as a CD-ROM drive, CD-R drive, or internal modem; and other peripherals such as a flash memory reader/writer or hard drive.

The processor 301 performs digital signal processing on the signal data 306 in response to the signal data 306 and to program code instructions of a program 304 that are stored in and retrieved from the memory 302 and executed by the processor module 301. Portions of the code of the program 304 may be written in any of a number of different programming languages, such as Assembly, C++, Java, or many others. The processor module 301 forms a general-purpose computer that becomes a special-purpose computer when executing a program such as the program code 304. Although the program code 304 is described here as being implemented in software and executed on a general-purpose computer, those skilled in the art will recognize that the task management method could alternatively be implemented in hardware, such as an application-specific integrated circuit (ASIC). As such, it will be understood that embodiments of the invention may be implemented, in whole or in part, in software, hardware, or some combination of the two.

In one embodiment, among others, the program code 304 may include a set of processor-readable instructions that implement a method having features in common with the method 200 of FIG. 2A or the method 220 of FIG. 2B. The program code 304 may generally include instructions that direct the processor 301 to filter the microphone signal d(t) in parallel with first and second adaptive filters having complementary echo cancellation characteristics to produce echo-cancelled outputs e₁(t) and e₂(t); instructions to determine the minimum echo output e₃(t) from e₁(t) and e₂(t); and instructions to generate a microphone output using the minimum echo output.

Embodiments of the present invention allow more robust yet accurate echo cancellation than is possible with cross-correlation analysis alone or with voice activity detection (double-talk detection) alone. Such improved echo cancellation makes it possible to extract the local speech s(t) from a microphone signal d(t) that is dominated by the speaker echo x(t).

Embodiments of the present invention may be used as presented herein or in combination with other user input mechanisms, such as mechanisms that track or measure the azimuthal direction or volume of sound, mechanisms that actively or passively track the position of an object, mechanisms using machine vision, and combinations of these. The tracked object may include auxiliary controls or buttons that manipulate feedback to the system. Such feedback may include, but is not limited to, light emission from light sources, sound distortion means, or other suitable transmitters and modulators, as well as controls, buttons, pressure pads, and the like. These may influence the transmission or modulation of the same encoded state and/or transmit commands to or from a device tracked by the system. Such devices may be part of, interact with, or influence a system used in connection with embodiments of the present invention.

While the above is a complete description of the preferred embodiments of the present invention, various alternatives, modifications, and equivalents may be used. The scope of the invention should therefore be determined not by the above description but by the following claims, along with their full scope of equivalents. Any feature described herein, whether preferred or not, may be combined with any other feature described herein, whether preferred or not. In the claims that follow, the quantity of each element is one or more, except where expressly stated otherwise. The appended claims are not to be interpreted as including means-plus-function limitations unless such a limitation is explicitly recited in a given claim using the phrase "means for".

Claims

An echo cancellation method in a system having a speaker that receives a speaker signal x(t) and a microphone that receives a microphone signal d(t) including a local signal s(t) and an echo signal x₁(t), the echo signal x₁(t) depending on the speaker signal x(t), the method comprising:
filtering the microphone signal d(t) in parallel with a first adaptive filter and a second adaptive filter, the first adaptive filter having cancellation characteristics complementary to those of the second adaptive filter, wherein
the filtering characteristics of the first and second adaptive filters are such that when one of the first and second adaptive filters is not well adapted to the input, the other is well adapted to the input,
one of the first and second adaptive filters is well adapted when its filter function h(n) is stable, converges to the true echo path filter, and is neither over-predicted nor under-predicted,
the first adaptive filter is a voice activity detection filter, and
the second adaptive filter is a cross-correlation analysis filter;
determining, from the output e₁(t) of the first adaptive filter and the output e₂(t) of the second adaptive filter, a minimum echo output e₃(t) that has the smaller correlation with the speaker signal x(t); and
generating a microphone output using the minimum echo output e₃(t).

The method of claim 1, wherein filtering the microphone signal d(t) in parallel with the first adaptive filter and the second adaptive filter includes:
adapting a set of filter coefficients of the first adaptive filter when the intensity of x(t) exceeds a threshold; and
analyzing, with the second adaptive filter, a cross-correlation between e₂(t) and x(t).

The method, wherein determining the minimum echo output e₃(t) includes determining whether the second adaptive filter has excessively removed the local signal s(t) by filtering, and
wherein the output of the first adaptive filter is used as the minimum echo output when the second adaptive filter has excessively removed the local signal s(t) by filtering.

The method, wherein determining whether the second adaptive filter has excessively removed the local signal s(t) by filtering includes:
taking a cross-correlation between the output e₂(t) of the second adaptive filter and the speaker signal x(t);
determining whether an expected value of the cross-correlation between the output e₂(t) of the second adaptive filter and the speaker signal x(t) is less than a predetermined threshold; and
determining that the second adaptive filter has excessively removed the local signal s(t) by filtering when the expected value of the cross-correlation is less than the threshold.

The method of claim 3, further comprising replacing the set of filter coefficients of the second adaptive filter with the set of filter coefficients of the first adaptive filter when the second adaptive filter has excessively removed the local signal s(t) by filtering.

The method of claim 1, wherein generating the microphone output using the minimum echo output e₃(t) includes determining, in parallel, a first residual prediction ER₁(t) that includes an analysis of the cross-correlation between the minimum echo output e₃(t) and the speaker signal x(t), and a second residual prediction ER₂(t) that includes an echo-distance mismatch between the minimum echo output e₃(t) and the speaker signal x(t).

The method of claim 6, wherein the cross-correlation analysis includes calculating an expected value of cross-correlation between e ₃ (t) and x (t).

The method of claim 6, wherein the echo distance mismatch includes calculating an expected value of {e ₃ ² (t) / x ² (t)}.
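
The two expected values named in these claims can be sketched as frame averages. The averaging window and the small `eps` guard against division by zero are assumptions of the sketch, and the function names are hypothetical.

```python
def cross_correlation_statistic(e3, x):
    """Sample-mean estimate of E[e3(t) * x(t)]."""
    return sum(a * b for a, b in zip(e3, x)) / len(x)

def echo_distance_mismatch(e3, x, eps=1e-12):
    """Sample-mean estimate of E[e3(t)^2 / x(t)^2]; eps avoids division by zero."""
    return sum((a * a) / (b * b + eps) for a, b in zip(e3, x)) / len(x)

e3_frame = [1.0, 2.0]
x_frame = [2.0, 4.0]
corr = cross_correlation_statistic(e3_frame, x_frame)   # (2 + 8) / 2
mismatch = echo_distance_mismatch(e3_frame, x_frame)    # (1/4 + 4/16) / 2
```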

The method of claim 6, wherein determining the first residual prediction ER ₁ (t) includes calculating a second-order norm N (1) of the first residual prediction unit ER (1);
Determining the second residual prediction ER ₂ (t) includes calculating a second-order norm N (2) of the second residual prediction unit ER (2); and
The method further includes applying to e ₃ (t), during double talk, whichever of the echo residual predictors ER (1) and ER (2) has the smaller corresponding norm N (1) or N (2).

The method of claim 6, wherein determining the first residual prediction ER ₁ (t) includes calculating a second-order norm N (1) of the first residual prediction unit ER (1);
Determining the second residual prediction ER ₂ (t) includes calculating a second-order norm N (2) of the second residual prediction unit ER (2); and
The method further includes applying to e ₃ (t), during single talk, whichever of the echo residual predictors ER (1) and ER (2) has the larger corresponding norm N (1) or N (2).
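
The norm-based choice between the two predictors can be sketched as below. The sketch assumes each predictor is represented by a vector of values and that the second-order norm is the Euclidean (L2) norm of that vector. The names `l2_norm` and `select_residual_predictor` are hypothetical.

```python
import math

def l2_norm(values):
    """Second-order (Euclidean) norm of a vector."""
    return math.sqrt(sum(v * v for v in values))

def select_residual_predictor(er1, er2, double_talk):
    """Pick the smaller-norm predictor in double talk, the larger in single talk."""
    n1, n2 = l2_norm(er1), l2_norm(er2)
    if double_talk:
        return er1 if n1 <= n2 else er2
    return er1 if n1 >= n2 else er2

er1 = [3.0, 4.0]   # norm 5
er2 = [1.0, 0.0]   # norm 1
chosen_double = select_residual_predictor(er1, er2, double_talk=True)
chosen_single = select_residual_predictor(er1, er2, double_talk=False)
```

Choosing the smaller norm during double talk is the more conservative option, since it limits how much of the local signal s (t) could be removed along with the residual echo.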

The method described, wherein using the minimum echo output e ₃ (t) to generate the microphone output further includes determining a minimum residual echo prediction ER ₃ (t); and
The minimum residual echo prediction ER ₃ (t) is whichever of ER ₁ (t) and ER ₂ (t) has the minimum energy and the minimum correlation with x (t).

The method of claim 11, further comprising the step of selectively removing ER ₃ (t) from the minimum echo output e ₃ (t).

The method of claim 12, wherein the step of selectively removing ER ₃ (t) from the microphone output includes removing a maximum amount of ER ₃ (t) from the microphone output during periods of far-end speech only.

The method of claim 12, wherein the step of selectively removing ER ₃ (t) from the microphone output includes removing a minimum amount of ER ₃ (t) from the microphone output when the intensity of the local signal s (t) is not zero.
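
Selective removal can be sketched as a subtraction scaled by a gain that depends on whether the local signal is active. The time-domain subtraction, the two gain values and the name `remove_residual` are all assumptions made for illustration; the claims do not give numeric amounts.

```python
def remove_residual(e3, er3, local_active, min_gain=0.1, max_gain=1.0):
    """Subtract a scaled residual echo prediction from the echo-reduced signal.

    local_active=False models far-end speech only (maximum removal);
    local_active=True models a non-zero local signal s(t) (minimum removal).
    The gain values are hypothetical.
    """
    gain = min_gain if local_active else max_gain
    return [a - gain * b for a, b in zip(e3, er3)]

e3_frame = [1.0, 2.0]
er3_frame = [0.5, 1.0]
out_far_end_only = remove_residual(e3_frame, er3_frame, local_active=False)
out_local_active = remove_residual(e3_frame, er3_frame, local_active=True)
```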

The method of claim 1, further comprising:
Calculating a predicted noise signal n ′ (t) from the microphone signal d (t);
Reducing the level of the predicted noise signal n ′ (t) to form a reduced noise signal n ″ (t); and
Incorporating the reduced noise signal n ″ (t) into the microphone output signal.
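
The three noise-handling steps can be sketched under strong simplifying assumptions: the predicted noise n ′ (t) is taken to be d (t) on samples flagged as noise only and zero elsewhere, and the level reduction is a fixed scale factor. The flag list, the `reduction` factor and the function name are hypothetical; the claim does not say how n ′ (t) is calculated.

```python
def incorporate_reduced_noise(processed, d, noise_only, reduction=0.5):
    """Add a level-reduced noise estimate back into the processed output.

    n_pred   : predicted noise n'(t), here d(t) on noise-only samples (assumption)
    n_reduced: reduced noise n''(t) = reduction * n'(t)
    """
    n_pred = [dn if flag else 0.0 for dn, flag in zip(d, noise_only)]
    n_reduced = [reduction * n for n in n_pred]
    return [p + n for p, n in zip(processed, n_reduced)]

d_frame = [0.2, 1.0, -0.4]
noise_flags = [True, False, True]
processed_frame = [0.0, 0.5, 0.0]
mic_output = incorporate_reduced_noise(processed_frame, d_frame, noise_flags)
```

Mixing a reduced noise floor back in, rather than removing noise completely, keeps the output from sounding unnaturally silent between speech segments.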

An echo cancellation device used in a system having a speaker and a microphone, the speaker being adapted to receive a speaker signal x (t), and the microphone being adapted to receive a microphone signal d (t) comprising a local signal s (t) and an echo signal x ₁ (t), the echo signal x ₁ (t) being dependent on the speaker signal x (t),
The device comprising:
A first adaptive filter connected to the speaker and the microphone;
A second adaptive filter connected to the speaker and the microphone in parallel with the first adaptive filter;
The second adaptive filter has an echo cancellation characteristic complementary to the first adaptive filter;
The filtering characteristics of the first and second adaptive filters are such that when one of the first and second adaptive filters is not well adapted to its input, the other is well adapted;
One of the first and second adaptive filters is well adapted when its filter function h (n) is stable, converges to a true echo path filter, and is neither over-predicted nor under-predicted;
The first adaptive filter is a voice activity detection filter;
The second adaptive filter is a cross-correlation analysis filter;
The device further includes
An integrator connected to the first adaptive filter and the second adaptive filter;
The integrator is configured to determine a minimum echo output e ₃ (t) from the output e ₁ (t) from the first adaptive filter and the output e ₂ (t) from the second adaptive filter; and
The minimum echo output e ₃ (t) is whichever of e ₁ (t) and e ₂ (t) has the smaller correlation with the speaker signal x (t).

The apparatus according to claim 16, wherein the integrator is configured to determine whether the second adaptive filter is excessively removing the local signal s (t) by its filtering process, and to select the output e ₁ (t) as the minimum echo output e ₃ (t) when the second adaptive filter is excessively removing the local signal s (t).

The apparatus according to claim 16, further comprising:
A first echo residual prediction unit ER (1) connected to the integrator; and
A second echo residual prediction unit ER (2) connected to the integrator.

The apparatus of claim 18, wherein the first echo residual prediction unit ER (1) is configured to generate a first residual prediction ER ₁ (t) that includes an analysis of the cross-correlation between the minimum echo output e ₃ (t) and the speaker signal x (t); and
The second echo residual prediction unit ER (2) is configured to determine a second residual prediction ER ₂ (t) that includes an echo distance mismatch between the minimum echo output e ₃ (t) and the speaker signal x (t).

The apparatus of claim 19, wherein the cross-correlation analysis includes calculating an expected value of cross-correlation between e ₃ (t) and x (t).

The apparatus of claim 19, wherein the echo distance mismatch includes a calculation of an expected value of {e ₃ ² (t) / x ² (t)}.

The apparatus of claim 19, further comprising a residual echo cancellation module connected to the first and second echo residual prediction units,
Wherein the residual echo cancellation module is configured to:
Calculate a second-order norm N (1) of the first echo residual prediction unit ER (1) and a second-order norm N (2) of the second echo residual prediction unit ER (2); and
Apply to e ₃ (t) whichever of the echo residual prediction units ER (1) and ER (2) has the smaller corresponding norm N (1) or N (2).

The apparatus of claim 19, further comprising a residual echo cancellation module connected to the first and second echo residual prediction units,
Wherein the residual echo cancellation module is configured to:
Calculate a second-order norm N (1) of the first echo residual prediction unit ER (1) and a second-order norm N (2) of the second echo residual prediction unit ER (2); and
Apply to e ₃ (t), during single talk, whichever of the echo residual prediction units ER (1) and ER (2) has the larger corresponding norm N (1) or N (2).

The apparatus of claim 19, further comprising a residual echo cancellation module connected to the first and second echo residual prediction units,
Wherein the residual echo cancellation module is configured to determine a minimum residual echo prediction ER ₃ (t); and
The minimum residual echo prediction ER ₃ (t) is whichever of ER ₁ (t) and ER ₂ (t) has the minimum energy and the minimum correlation with x (t).

The apparatus of claim 24, wherein the residual echo cancellation module is configured to selectively remove ER ₃ (t) from the minimum echo output e ₃ (t).

The apparatus of claim 25, wherein the residual echo cancellation module is configured to remove a maximum amount of ER ₃ (t) from the microphone output during periods of far-end speech only.

The apparatus of claim 25, wherein the residual echo cancellation module is configured to remove a minimum amount of ER ₃ (t) from the microphone output when the intensity of the local signal s (t) is not zero.

The apparatus of claim 16, further comprising a noise cancellation module connected to the microphone, wherein the noise cancellation module is configured to:
Calculate a predicted noise signal n ′ (t) from the microphone signal d (t);
Reduce the level of the predicted noise signal n ′ (t) to form a reduced noise signal n ″ (t); and
Incorporate the reduced noise signal n ″ (t) into the microphone output signal.

An acoustic signal processing system comprising:
A microphone;
A speaker;
A processor connected to the microphone and the speaker; and
A memory connected to the processor,
Wherein the memory stores a set of processor readable instructions for implementing an echo cancellation method in a system having a speaker that receives a speaker signal x (t) and a microphone that receives a microphone signal d (t) including a local signal s (t) and an echo signal x ₁ (t) that depends on the speaker signal x (t);
The processor readable instructions are:
Instructions for filtering the microphone signal d (t) in parallel with a first adaptive filter and a second adaptive filter, the first adaptive filter having an echo cancellation characteristic complementary to that of the second adaptive filter;
The filtering characteristics of the first and second adaptive filters are such that when one of the first and second adaptive filters is not well adapted to its input, the other is well adapted;
One of the first and second adaptive filters is well adapted when its filter function h (n) is stable, converges to a true echo path filter, and is neither over-predicted nor under-predicted;
The first adaptive filter is a voice activity detection filter;
The second adaptive filter is a cross-correlation analysis filter;
The processor readable instructions further include:
Instructions for determining, from the output e ₁ (t) from the first adaptive filter and the output e ₂ (t) from the second adaptive filter, a minimum echo output e ₃ (t) having the smaller correlation with the speaker signal x (t); and
Instructions for generating a microphone output using the minimum echo output e ₃ (t).

A processor readable medium comprising a memory connected to a processor,
Wherein the memory stores a set of processor readable instructions for implementing an echo cancellation method in a system having a speaker that receives a speaker signal x (t) and a microphone that receives a microphone signal d (t) including a local signal s (t) and an echo signal x ₁ (t) that depends on the speaker signal x (t);
The processor readable instructions are:
Instructions for filtering the microphone signal d (t) in parallel with a first adaptive filter and a second adaptive filter, the first adaptive filter having an echo cancellation characteristic complementary to that of the second adaptive filter;
The filtering characteristics of the first and second adaptive filters are such that when one of the first and second adaptive filters is not well adapted to its input, the other is well adapted;
One of the first and second adaptive filters is well adapted when its filter function h (n) is stable, converges to a true echo path filter, and is neither over-predicted nor under-predicted;
The first adaptive filter is a voice activity detection filter;
The second adaptive filter is a cross-correlation analysis filter;
The processor readable instructions further include:
Instructions for determining, from the output e ₁ (t) from the first adaptive filter and the output e ₂ (t) from the second adaptive filter, a minimum echo output e ₃ (t) having the smaller correlation with the speaker signal x (t); and
Instructions for generating a microphone output using the minimum echo output e ₃ (t).