JPS63234300A

JPS63234300A - Voice recognition

Info

Publication number: JPS63234300A
Application number: JP62068436A
Authority: JP
Inventors: 泰助渡辺
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1987-03-23
Filing date: 1987-03-23
Publication date: 1988-09-29

Abstract

(57) [Summary] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention relates to a speech recognition method for having a machine recognize the human voice.

Prior Art

Speech recognition technology has been actively developed and commercialized in recent years, but most such products are speaker-dependent: they recognize only the people who have registered their voices.

A speaker-dependent device requires the user to register the words to be recognized in advance, which is a heavy burden unless the device is then used continuously for a long time. For this reason, research on speaker-independent recognition technology, which requires no voice registration and is easy to use, has recently become very active.

Stated generally, speech recognition performs pattern matching between the input speech and the standard speech stored in a dictionary (both in parameterized form) and outputs the dictionary entry with the highest similarity as the recognition result. If the input speech and the dictionary speech were physically identical there would be no problem, but in general even the same word is never exactly the same, because the speaker or the manner of speaking differs.

Physically, differences between speakers and between manners of speaking appear as differences in spectral features and differences in temporal features.

The shape of the articulatory organs (mouth, tongue, throat, and so on) differs from person to person, so the spectral shape of the same word differs between speakers.

The temporal features also differ depending on whether the word is spoken quickly or slowly.

Speaker-independent recognition therefore has to normalize the spectrum and its temporal variation before comparing the input with the standard patterns.

As an effective method for speaker-independent speech recognition, the present applicant and others have already proposed a method that combines the time-series information of the parameters with a statistical distance measure (Niyada et al., "A simple speaker-independent speech recognition method," Proceedings of the Acoustical Society of Japan, 1-1-4 (March 1986)). That method is described below.

This method uses pattern matching to spot speech within noise, so that it recognizes the speech and detects the speech interval at the same time.

First, the distance measure (a statistical distance measure) used for the pattern matching is described.

The input word speech is linearly expanded or contracted to J frames. Letting $x_j$ be the parameter vector of frame j, the input vector $X$ is

$$X = (x_1, x_2, \ldots, x_J)$$

where each $x_j$ is a p-dimensional vector.

Let the standard pattern of word $\omega_k$ ($k = 1, 2, \ldots, K$) consist of a mean vector $\mu_k$ and a covariance matrix $W_k$. The recognition result is then the word that maximizes the posterior probability $P(\omega_k \mid X)$.

By Bayes' theorem,

$$P(\omega_k \mid X) = P(\omega_k)\, P(X \mid \omega_k) / P(X) \tag{1}$$

The first factor on the right-hand side, $P(\omega_k)$, can be regarded as a constant. Assuming a normal distribution, the second factor is

$$P(X \mid \omega_k) = (2\pi)^{-d/2}\, |W_k|^{-1/2} \exp\!\left(-\tfrac{1}{2}\,(X-\mu_k)^{T} W_k^{-1} (X-\mu_k)\right) \tag{2}$$

where $d$ is the dimension of $X$. The denominator $P(X)$ can be regarded as a constant when the input parameters are the same, but it is not constant when different inputs are compared with one another. Here $P(X)$ is assumed to follow a normal distribution with mean $\mu_x$ and covariance matrix $W_x$.

$$P(X) = (2\pi)^{-d/2}\, |W_x|^{-1/2} \exp\!\left(-\tfrac{1}{2}\,(X-\mu_x)^{T} W_x^{-1} (X-\mu_x)\right) \tag{3}$$

Taking the logarithm of (1), omitting the constant terms (including the common factor of $-\tfrac{1}{2}$, so that a smaller value indicates a better match), and denoting the result by $L_k$ gives

$$L_k = (X-\mu_k)^{T} W_k^{-1} (X-\mu_k) - (X-\mu_x)^{T} W_x^{-1} (X-\mu_x) + \log|W_k| - \log|W_x| \tag{4}$$

Here $W_k$ and $W_x$ are all replaced by a common matrix $W$.

That is, with

$$W = (W_1 + W_2 + \cdots + W_K + W_x)/(K+1) \tag{5}$$

expanding equation (4) gives

$$L_k = B_k - A_k^{T} X \tag{6}$$

where

$$A_k = 2\,(W^{-1}\mu_k - W^{-1}\mu_x) \tag{7}$$

$$B_k = \mu_k^{T} W^{-1} \mu_k - \mu_x^{T} W^{-1} \mu_x \tag{8}$$

Equation (6) is a linear discriminant function that requires little computation. Equation (6) is now rewritten as follows.

Writing $A_k = (a_1^{(k)}, a_2^{(k)}, \ldots, a_J^{(k)})$,

$$L_k = B_k - \sum_{j=1}^{J} d^{(k)}(j), \qquad d^{(k)}(j) = a_j^{(k)T} x_j \tag{9}$$

That is, $L_k$ is obtained by J additions of the per-frame partial similarities $d^{(k)}(j)$ and one subtraction.
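The reduction from the two quadratic forms to the linear discriminant $L_k = B_k - A_k^{T} X$ can be checked numerically. The following is a minimal sketch in plain Python, not the patent's implementation: the matrix `W_INV`, the vectors `MU_K`, `MU_X`, `X`, and all function names are illustrative assumptions, and `W_INV` is assumed symmetric, as the inverse of a covariance matrix is.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in m]


def quadratic_form(x, mu, w_inv):
    """(x - mu)^T W^-1 (x - mu)."""
    diff = [a - b for a, b in zip(x, mu)]
    return dot(diff, mat_vec(w_inv, diff))


def linear_discriminant(mu_k, mu_x, w_inv):
    """Return (A_k, B_k) for the common (symmetric) inverse covariance w_inv."""
    wk = mat_vec(w_inv, mu_k)
    wx = mat_vec(w_inv, mu_x)
    a_k = [2.0 * (p - q) for p, q in zip(wk, wx)]
    b_k = dot(mu_k, wk) - dot(mu_x, wx)
    return a_k, b_k


def distance(x, a_k, b_k, frame_dim):
    """L_k = B_k - sum of per-frame partial similarities a_j . x_j."""
    partial = [dot(a_k[s:s + frame_dim], x[s:s + frame_dim])
               for s in range(0, len(x), frame_dim)]
    return b_k - sum(partial), partial


# Hypothetical toy data: J = 2 frames of 2 parameters each.
W_INV = [[2.0, 0.5, 0.0, 0.0],
         [0.5, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.5, 0.25],
         [0.0, 0.0, 0.25, 3.0]]
MU_K = [1.0, 2.0, 0.5, -1.0]
MU_X = [0.0, 1.0, 1.0, 0.0]
X = [0.5, 1.5, 2.0, -0.5]

A_K, B_K = linear_discriminant(MU_K, MU_X, W_INV)
L_K, PARTIAL = distance(X, A_K, B_K, frame_dim=2)
DIRECT = quadratic_form(X, MU_K, W_INV) - quadratic_form(X, MU_X, W_INV)
```

With a common covariance matrix the log-determinant terms cancel, so `L_K` and `DIRECT` agree up to rounding, and `PARTIAL` holds one partial similarity per frame.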

Next, the method of spotting and recognizing speech within noise using the above distance measure, and the method of reducing the amount of computation, are described.

A sufficiently long interval that is certain to contain the speech is taken, various subintervals are set within it, the similarity to each word is computed by equation (9), and the word whose similarity is highest over all subintervals is taken as the recognition result. Executed naively, this similarity computation is enormous, but it can be reduced greatly by limiting the subinterval length according to the possible duration of each word and by sharing the partial similarities $d^{(k)}(j)$ among the computations. FIG. 3 illustrates this method.

FIG. 3 shows how, when the input is matched against word k, a subinterval of length $n$ ($n_s^{(k)} \le n \le n_e^{(k)}$) is linearly expanded or contracted to the standard pattern length J and the similarity is computed frame by frame with the end point fixed. The similarity is computed by equation (9) along a route that starts at a point T on QR and ends at P. All similarity computations for one frame are therefore performed within triangle PQR. Since $x_j$ in equation (9) is the j-th frame component after the interval of length $n$ has been expanded or contracted, a corresponding input frame $i'$ exists.

Using the input vectors, $d^{(k)}(j)$ can therefore be expressed as

$$d^{(k)}(i', j) = a_j^{(k)T} x_{i'} \tag{10}$$

where

$$i' = i - r_n(j) + 1 \tag{11}$$

Here $r_n(j)$ is the function that relates word length $n$ to $J$ under linear expansion and contraction. Therefore, if the partial similarities between each input frame and each $a_j^{(k)}$ are computed in advance, equation (9) can be evaluated simply by selecting and adding the partial similarities that satisfy the relationship for $i'$. Since triangle PQR moves one frame to the right at every frame, the partial similarities between $a_j^{(k)}$ and $x_i$ are computed on PS, stored in memory for the region corresponding to triangle PQS, and shifted at every frame. All the required similarities are then in memory, so most of the partial-similarity computations can be omitted and the amount of computation becomes very small.
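The reuse of stored partial similarities can be sketched as follows. This is a rough illustration under stated assumptions, not the patent's implementation: the source does not give the form of the warping function, so `warp_offset` assumes a rounded linear map with $r_n(J) = 1$ (the newest frame) and $r_n(1) = n$ (the oldest frame of the subinterval), and the class and method names are hypothetical.

```python
from collections import deque


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def warp_offset(n, j, J):
    """Assumed r_n(j): how many frames back (1 = newest) pattern frame j maps
    when a subinterval of length n is stretched linearly to J frames (J >= 2)."""
    return (2 * (J - j) * (n - 1) + (J - 1)) // (2 * (J - 1)) + 1


class SimilarityBuffer:
    """Keeps the partial similarities of the last n_max input frames."""

    def __init__(self, weights, n_max):
        self.weights = weights            # a_1 .. a_J for one word
        self.rows = deque(maxlen=n_max)   # oldest row is dropped automatically

    def push(self, x):
        """Compute a_j . x once per pattern frame j for the new input frame."""
        self.rows.append([dot(a_j, x) for a_j in self.weights])

    def distance(self, b_k, n):
        """B_k minus the sum of stored partial similarities for length n,
        with the subinterval ending at the newest frame."""
        if n > len(self.rows):
            return None                   # not enough frames buffered yet
        J = len(self.weights)
        total = 0.0
        for j in range(1, J + 1):
            back = warp_offset(n, j, J)
            total += self.rows[-back][j - 1]
        return b_k - total


# Hypothetical toy data: J = 3 pattern frames, one parameter per frame.
buf = SimilarityBuffer([[1.0], [2.0], [3.0]], n_max=3)
for value in (1.0, 2.0, 3.0, 4.0):
    buf.push([value])

d3 = buf.distance(0.0, 3)   # uses input frames 2, 3, 4
d2 = buf.distance(0.0, 2)   # uses input frames 3, 3, 4
d4 = buf.distance(0.0, 4)   # longer than the buffer
```

Each input frame costs J inner products per word no matter how many subinterval lengths are tried afterwards, since every candidate length only selects and adds values already in the buffer.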

FIG. 4 is a functional block diagram explaining how the conventional example is implemented. The unknown input speech signal is sampled at 8 kHz by the A/D converter 10 and converted into a 12-bit digital signal. The acoustic analysis unit 11 performs LPC analysis of the input signal every 10 msec (one frame) and obtains the 10th-order linear prediction coefficients and the residual power. The feature parameter extraction unit 12 uses the linear prediction coefficients and the residual power to obtain the LPC cepstrum coefficients $C_1$ to $C_5$ and the power term $C_0$ as feature parameters. The feature vector $x$ of each frame is therefore

$$x^{T} = (C_0, C_1, \ldots, C_5) \tag{12}$$

LPC analysis and the extraction of LPC cepstrum coefficients are described in detail in, for example, J. D. Markel and A. H. Gray, "Linear Prediction of Speech" (translated by Hisaki Suzuki), and are not repeated here.

The frame synchronization signal generator 13 generates a timing signal (frame signal) every 10 msec, and the recognition processing is performed in synchronization with this frame signal.

Within one frame period, the standard pattern selection unit 18 selects the word numbers k = 1, 2, ..., K stored in the standard pattern storage unit 17 one after another. The partial similarity calculation unit 21 computes the partial similarity $d^{(k)}(i, j)$ between the selected standard pattern $a_j^{(k)}$ and the feature vector $x_i$ of the i-th frame:

$$d^{(k)}(i, j) = a_j^{(k)T} x_i \quad (j = 1, 2, \ldots, J) \tag{13}$$

The computed partial similarities are sent to the similarity buffer 22 and stored there. The similarity buffer 22 is organized so that the oldest information is discarded whenever a new input arrives.

For each selected word number, the section candidate setting unit 15 sets the minimum length $n_s^{(k)}$ and the maximum length $n_e^{(k)}$ of that word. The time expansion/contraction table 24 stores the relationship of equation (11) in table form, so that specifying a word length $n$ and a frame $j$ yields the corresponding $i'$. For each word length $n$ in the range $n_s^{(k)} \le n \le n_e^{(k)}$, $i'$ is read out and the corresponding partial similarities $d^{(k)}(i', j)$, j = 1, 2, ..., J, are read from the similarity buffer 22. The similarity addition unit 23 sums them and obtains $L_k$ by equation (9). The similarity comparison unit 20 compares the obtained $L_k$ with the contents of the temporary memory 19 and records the one with the higher similarity (smaller distance) in the temporary memory 19.

In this way, starting from frame $i = i_0$, the maximum similarity $L_1^{i_0}(\max)$ is found for standard pattern k = 1 over the range $n_s^{(1)} \le n \le n_e^{(1)}$. Next, for k = 2, the value $L_2^{i_0}$ found over the range $n_s^{(2)} \le n \le n_e^{(2)}$ is compared with $L_1^{i_0}(\max)$ to obtain the maximum similarity so far. The same procedure is repeated up to k = K, and the maximum similarity $L^{i_0}(\max)$ and the corresponding word number $k'$ are stored in the temporary memory 19. The same procedure is then repeated with $i = i_0 + \Delta i$, and when the final frame $i = I$ is reached, the word number $k = k_m$ remaining in the temporary memory is the recognition result. If the frame number $i = i_m$ and the word length $n = n_m$ at which the maximum similarity was obtained are also stored in the temporary memory 19 and updated, the speech interval is obtained together with the recognition result. The speech interval runs from $i_m - n_m$ to $i_m$.
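The search over end frames, words, and word lengths described above can be sketched as a triple loop. This is only a sketch under assumptions: the score is treated as a distance (smaller is better), the frame step is fixed at 1, and `distance(k, i, n)` is a hypothetical callable standing in for the evaluation of the discriminant for word k on the subinterval of length n ending at frame i.

```python
def spot(distance, lengths, first_frame, last_frame):
    """Exhaustive word spotting.

    lengths maps word number k to (n_min, n_max).
    distance(k, i, n) returns a float, or None if the subinterval is invalid.
    Returns ((value, k_m, i_m, n_m), per_word), where per_word[k] is the
    smallest distance seen for word k over all end frames and lengths.
    """
    best = None
    per_word = {}
    for i in range(first_frame, last_frame + 1):
        for k, (n_min, n_max) in lengths.items():
            for n in range(n_min, n_max + 1):
                value = distance(k, i, n)
                if value is None:
                    continue
                if k not in per_word or value < per_word[k]:
                    per_word[k] = value
                if best is None or value < best[0]:
                    best = (value, k, i, n)   # keep end frame and length too
    return best, per_word


def toy_distance(k, i, n):
    """Hypothetical score: word 1 fits best with length 5 ending at frame 10."""
    target_len = 5 if k == 1 else 8
    penalty = 0 if k == 1 else 1
    return abs(i - 10) + abs(n - target_len) + penalty


best, per_word = spot(toy_distance, {1: (3, 6), 2: (6, 9)}, 0, 20)
```

Keeping the end frame and length alongside the best value is what lets the speech interval be reported together with the recognized word; the per-word minima are the quantities that a later correction step can adjust.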

Problems to be Solved by the Invention

The problem with this method is that pattern matching is performed against every possible speech interval within a sufficiently long interval that is certain to contain the speech. In digit recognition, for example, even when "ichi" (one) is uttered, the "chi" portion may be recognized as "kyuu" (nine) or the "i" portion as "ni" (two). In other words, a part of a long uttered word is very likely to be recognized as a shorter word.

The object of the present invention is to solve the above problem by providing a speech recognition method with a high recognition rate, in which the maximum similarity or minimum distance for each standard pattern is corrected when the speech intervals corresponding to those maximum similarities or minimum distances differ abnormally from one another.

Means for Solving the Problems

To achieve the above object, the present invention takes the K maximum similarities (minimum distances) between the feature parameters and the K standard patterns, finds the speech whose similarity (distance) is the maximum (minimum), adds the K values of a previously prepared bias table corresponding to that speech to the K similarities (distances), and takes as the recognition result the speech corresponding to the standard pattern whose corrected similarity (distance) is the maximum (minimum). Letting the maximum similarities for the K standard patterns be $L_k$ ($1 \le k \le K$), the index $l$ is found from

$$L_l = \max(L_1, L_2, \ldots, L_K) \tag{14}$$

and, letting the previously prepared bias table of K × K values be $B(i, j)$ ($1 \le i \le K$, $1 \le j \le K$),

$$L'_{k_m} = \max\bigl(L_1 + B(l, 1),\ L_2 + B(l, 2),\ \ldots,\ L_K + B(l, K)\bigr) \tag{15}$$

is computed; $k_m$ is the recognition result.
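The two-stage decision (provisional winner, then bias-corrected winner) is small enough to sketch directly. The version below works on similarities (larger is better) and uses 0-based indices. The source does not say how the bias values are obtained, so the table and scores in the example are invented purely for illustration.

```python
def recognize_with_bias(max_sims, bias):
    """Return (l, k_m, corrected).

    max_sims[k] is the best similarity found for word k.
    bias[l][k] is the value added to word k when word l wins provisionally.
    """
    indices = range(len(max_sims))
    l = max(indices, key=lambda k: max_sims[k])           # provisional winner
    corrected = [s + b for s, b in zip(max_sims, bias[l])]
    k_m = max(indices, key=lambda k: corrected[k])        # final result
    return l, k_m, corrected


WORDS = ["osaka", "shin-osaka"]

# Hypothetical table: if "osaka" wins provisionally, favor "shin-osaka"
# slightly, because the longer word tends to lose to its own tail.
BIAS = [[0.0, 0.5],
        [0.0, 0.0]]

# "shin-osaka" was spoken: both words score well, the short one slightly better.
l1, k1, c1 = recognize_with_bias([5.0, 4.6], BIAS)

# "osaka" was spoken: the long word scores poorly, so the bias changes nothing.
l2, k2, c2 = recognize_with_bias([5.0, 2.0], BIAS)
```

In the first case the provisional winner is "osaka" but the corrected winner is "shin-osaka"; in the second the bias is too small to overturn a clear margin, which is the asymmetry the table is meant to encode.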

Operation

The present invention applies to a speaker-independent recognition method based on word spotting, in which the speech interval is not determined explicitly, and corrects the maximum similarity values of the K standard patterns. This reduces misrecognition when the speech intervals corresponding to the maximum similarity values matched against the individual standard patterns differ abnormally, and improves the overall recognition rate.

Embodiment

An embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a functional block diagram showing an implementation of the speech recognition method according to one embodiment of the present invention.

First, the basic recognition approach of this embodiment is almost the same as the method given as the conventional example. The unknown input speech signal is sampled at 8 kHz by the A/D converter 110 and converted into a 12-bit digital signal. The acoustic analysis unit 111 performs LPC analysis of the input signal every 10 msec (one frame) and obtains the 10th-order linear prediction coefficients and the residual power. The feature parameter extraction unit 112 uses the linear prediction coefficients and the residual power to obtain the LPC cepstrum coefficients $C_1$ to $C_9$ and the power term $C_0$ as feature parameters. The feature vector $x$ of each frame is therefore

$$x^{T} = (C_0, C_1, \ldots, C_9) \tag{16}$$

LPC analysis and the extraction of LPC cepstrum coefficients are described in detail in, for example, J. D. Markel and A. H. Gray, "Linear Prediction of Speech" (translated by Hisaki Suzuki), and are not repeated here.

The frame synchronization signal generator 113 generates a timing signal (frame signal) every 10 msec, and the recognition processing is performed in synchronization with this frame signal.

Within one frame period, the standard pattern selection unit 116 selects the word numbers k = 1, 2, ..., K stored in the standard pattern storage unit 115 one after another. The partial similarity calculation unit 114 computes the partial similarity $d^{(k)}(i, j)$ between the selected standard pattern $a_j^{(k)}$ and the feature vector $x_i$ of the i-th frame:

$$d^{(k)}(i, j) = a_j^{(k)T} x_i \quad (j = 1, 2, \ldots, J) \tag{17}$$

The computed partial similarities are sent to the similarity buffer 119 and stored there. The similarity buffer 119 is organized so that the oldest information is discarded whenever a new input arrives.

For each selected word number, the section candidate setting unit 117 sets the minimum length $n_s^{(k)}$ and the maximum length $n_e^{(k)}$ of that word. The time expansion/contraction table 118 stores the relationship of equation (11) in table form, so that specifying a word length $n$ ($n_s^{(k)} \le n \le n_e^{(k)}$) and a frame $j$ yields the corresponding $i'$. For each word length $n$ in this range, $i'$ is read out and the corresponding partial similarities $d^{(k)}(i', j)$, j = 1, 2, ..., J, are read from the similarity buffer 119. The similarity addition unit 120 computes $\sum_{j=1}^{J} d^{(k)}(i', j)$ and obtains the K values $L_k$ by equation (9). For each k, the similarity comparison unit 121 compares the obtained $L_k$ with the contents of the temporary memory 122 and stores the one with the higher similarity (smaller distance), so that the temporary memory 122 holds the K values $\max L_k$. At the same time it finds and stores the largest of these K stored similarities, $L_{k_m}$.

This operation starts at frame $i = i_0$ and is repeated at every frame. When the final frame $i = I$ is reached, the similarity comparison unit 121 sends the index $k_m$ of the computed maximum similarity $L_{k_m}$ (that is, $k_m$ denotes the $k_m$-th standard pattern, the one that produced the maximum similarity) to the bias table 128.

The bias table 128 consists of K × K previously determined values $B(i, j)$. When $k_m$ is input, it sends the K values $B(k_m, j)$ ($j = 1, \ldots, K$) to the similarity comparison unit 121. The similarity comparison unit 121 receives the maximum similarities $\max L_k$ of the K standard patterns from the temporary memory 122 and computes

$$L'_{k'_m} = \max\bigl(\max L_1 + B(k_m, 1),\ \max L_2 + B(k_m, 2),\ \ldots,\ \max L_K + B(k_m, K)\bigr) \tag{18}$$

That is, the index $k'_m$ of the largest term on the right-hand side of equation (18) is found, and $k'_m$ is taken as the recognition result.

Next, the method of determining the scanning section from $i_0$ to $I$ in the above description is explained.

FIG. 2 shows the relationship between the scan start frame $i_0$ (the point at which the similarity addition unit 120 and the subsequent units start), the recognition completion frame $I$, and the speech.

The start of the scanning section is determined from power information, and its end is determined from power information together with similarity information. In FIG. 1, the power calculation unit 123 computes the average power (logarithmic value) $P_i$ of each frame ($i$ is the frame number) and sends it to the power comparison unit 125. Using the output $P_N$ of the noise level learning unit, the power comparison unit compares $P_i$ with $P_N + \theta_N$ ($\theta_N$ is a threshold level) and outputs 1 to the scanning section setting unit if $P_i$ is larger and 0 otherwise. When $P_i < P_N + \theta_N$, $P_i$ is also sent to the noise level learning unit 124, which uses it to recompute $P_N$ as

$$P_N = \frac{1}{m}\,(P_{ALL} + P_i) \tag{19}$$

where $m$ is the number of power values sent and $P_{ALL}$ is the sum of the power values sent up to the previous frame.
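The interaction between the power comparison and the noise level learning can be sketched as a small running-mean tracker. This is an illustrative sketch only: the class name is hypothetical, and the initial noise level is an assumption, since the source does not say how $P_N$ is initialized.

```python
class NoiseLevelTracker:
    """Running mean of the frame powers classified as noise."""

    def __init__(self, initial_level, threshold):
        self.level = initial_level   # P_N (the initial value is an assumption)
        self.threshold = threshold   # theta_N
        self.total = 0.0             # P_ALL: sum of the powers sent so far
        self.count = 0               # m: number of powers sent so far

    def step(self, power):
        """Return 1 if the frame power exceeds P_N + theta_N, else 0.

        Powers strictly below P_N + theta_N also update the noise level.
        """
        limit = self.level + self.threshold
        if power > limit:
            return 1
        if power < limit:
            self.count += 1
            self.total += power
            self.level = self.total / self.count
        return 0


tracker = NoiseLevelTracker(initial_level=10.0, threshold=3.0)
flags = [tracker.step(p) for p in (10.0, 12.0, 20.0, 11.0)]
```

Because only below-threshold frames are averaged, a loud utterance does not raise the noise estimate, while a slow drift in background level is followed.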

The scanning section setting unit starts the scan when the power comparison unit 125 first outputs 1, at which point the similarity addition unit 120 and the subsequent units begin operating. The conditions that determine the end of the scan are:

1) a frame with output 0 exists after $\theta_1$ frames have elapsed since the output of the power comparison unit 125 last changed from 0 to 1;

2) frames with output 0 continue for $\theta_2$ frames after the output last changed from 1 to 0;

3) the maximum similarity $L_{k_m}$ in the similarity comparison unit is greater than $\theta_3$.

Recognition ends when all three conditions are satisfied.
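One possible reading of the three end-of-scan conditions is sketched below. The exact frame counting is not fully specified in the source, so the conventions here are assumptions: condition 1 is checked as "at least theta1 frames since the last 0 to 1 transition, and the current frame is 0", and condition 2 as "the current run of 0 frames is at least theta2 long". The function and argument names are hypothetical.

```python
def scan_end_frame(flags, best_similarity, theta1, theta2, theta3):
    """Return the first frame index at which all three conditions hold, or None.

    flags[i] is the power-comparison output (0 or 1) for frame i.
    best_similarity[i] is the running maximum similarity up to frame i.
    """
    last_rise = None   # frame of the most recent 0 -> 1 transition
    last_fall = None   # frame of the most recent 1 -> 0 transition
    prev = 0
    for i, flag in enumerate(flags):
        if prev == 0 and flag == 1:
            last_rise = i
        if prev == 1 and flag == 0:
            last_fall = i
        prev = flag
        if flag == 1 or last_rise is None:
            continue                                   # still in speech, or none seen yet
        cond1 = i - last_rise >= theta1                # long enough since onset
        cond2 = i - last_fall + 1 >= theta2            # long enough run of 0 frames
        cond3 = best_similarity[i] > theta3            # a confident match exists
        if cond1 and cond2 and cond3:
            return i
    return None


FLAGS = [0, 0, 1, 1, 1, 0, 0, 0, 0]
SIMS = [0.0, 0.0, 1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0]

end = scan_end_frame(FLAGS, SIMS, theta1=4, theta2=3, theta3=2.5)
no_end = scan_end_frame(FLAGS, SIMS, theta1=4, theta2=3, theta3=5.0)
```

Requiring the similarity condition as well as the two power conditions means that a pause after a noise burst does not end the scan unless some standard pattern has actually matched well.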

The method described in the prior art, which does not determine the speech interval but searches all conceivable speech intervals around a speech-like region for the maximum similarity, is generally more robust when the noise level is high or non-stationary noise is mixed in than methods that determine the speech interval from power information and then match it against the standard patterns. On the other hand, its recognition rate degrades badly when the vocabulary contains a short word that closely resembles a part of a longer word.

An example is a vocabulary that contains both "Shin-Osaka" and "Osaka". The present embodiment exploits the fact that "Shin-Osaka" is easily recognized as "Osaka" whereas "Osaka" is rarely recognized as "Shin-Osaka": the values of a previously prepared bias table are added to the similarities or distances once obtained, and the maximum or minimum of the results is taken as the recognition result. This prevents partial matching as far as possible and is a very effective means.

Effects of the Invention

In short, the present invention finds the speech whose similarity or distance is the maximum or minimum, adds the values of a previously prepared bias table corresponding to that speech to the similarities or distances, and takes as the recognition result the speech corresponding to the standard pattern whose resulting similarity or distance is the maximum or minimum. It has the advantage of preventing misrecognition of similarly pronounced words and improving the overall recognition rate.

[Brief Description of the Drawings]

FIG. 1 is a functional block diagram embodying the speech recognition method of one embodiment of the present invention; FIG. 2 shows the relationship between the speech and the start and end times of matching against the standard patterns in this embodiment; FIG. 3 is a conceptual diagram explaining the pattern matching method with the standard patterns; and FIG. 4 is a functional block diagram explaining the conventional method.

110: A/D converter; 111: acoustic analysis unit; 112: feature parameter extraction unit; 113: frame synchronization signal generator; 114: partial similarity calculation unit; 115: standard pattern storage unit; 116: standard pattern selection unit; 117: section candidate setting unit; 118: time expansion/contraction table; 119: similarity buffer; 120: similarity addition unit; 121: similarity comparison unit; 122: temporary memory; 123: power calculation unit; 124: noise level learning unit; 125: power comparison unit; 127: scanning section setting unit; 128: bias table.

Claims

[Claims]

(1) A speech recognition method characterized by: detecting the presence of speech, using power information, in an unknown input signal that contains speech and the noise preceding and following it, and taking the time of detection as a reference point; linearly expanding or contracting the unknown input signal in the section between the reference point and a point N (N_1≦N≦N_2) away from the reference point to a section length L; extracting the feature parameters of the expanded or contracted section; computing the similarity or distance between these feature parameters and the standard patterns of the K speech items to be recognized; performing these operations while changing N from N_1 to N_2; further performing the same operations while shifting the reference point by a unit section at a time, successively obtaining and comparing the similarities or distances; computing, when the reference point reaches a processing end point determined by combining the speech duration obtained from the movement of the power information with the temporal change in similarity, the K maximum similarities or minimum distances with respect to the K standard patterns over all reference points and all time expansions and contractions; finding the l-th (1≦l≦K) standard pattern that gives the maximum or minimum of these K similarities or distances; adding the corresponding l-th set of K values of a previously prepared K x K bias table to the K similarities or distances; and outputting as the recognition result the speech corresponding to the standard pattern that gives the maximum or minimum of the sums.

(2) The voice recognition method according to claim 1, wherein the presence/absence of voice is detected using a ratio of a voice signal to noise.

(3) The speech recognition method according to claim 1, characterized in that the degree of similarity or distance between the characteristic parameters of the unknown input signal and the standard pattern of each speech is calculated using a statistical distance measure.

(4) The speech recognition method according to claim 3, characterized in that the statistical distance measure is any one of a measure based on posterior probability, a linear discriminant function, a quadratic discriminant function, the Mahalanobis distance, Bayes decision, and a measure based on composite similarity.