JP2007114355A

JP2007114355A - Voice synthesis method and apparatus

Info

Publication number: JP2007114355A
Application number: JP2005304082A
Authority: JP
Inventors: Shigeki Sagayama; 茂樹嵯峨山; Takeya Kai; 武也槐; Shinji Sako; 慎司酒向; Kyosuke Matsumoto; 恭輔松本; Takuya Nishimoto; 卓也西本
Original assignee: University of Tokyo NUC
Current assignee: University of Tokyo NUC
Priority date: 2005-10-19
Filing date: 2005-10-19
Publication date: 2007-05-10

Abstract

PROBLEM TO BE SOLVED: To provide a high-quality synthesized voice, and to provide a voice synthesis method that is superior in processability.

SOLUTION: A speech spectrum is expressed with few parameters, by approximating a speech spectral envelope with a mixed Gaussian distribution function, and analysis parameters are obtained. Superposition of Gabor functions which are the inverse Fourier transform of the mixed Gaussian distribution function is made into fundamental waveform, and voiced sound is synthesized, by arranging it for each pitch period. Voiceless sound is also synthesized, when the pitch period is rendered random.

COPYRIGHT: (C)2007, JPO&INPIT

Description

本発明は、音声合成技術に関するものである。 The present invention relates to a speech synthesis technique.

コンピュータにおける音声情報処理が進展するに伴い、音声合成においては、テキストをただ読み上げるだけにとどまらず、対話調の合成音声など様々な要求に適用可能な、高品質かつ多様なスタイルの音声を生成できる音声合成が待望されている。 As speech information processing on computers advances, there is strong demand for speech synthesis that goes beyond simply reading text aloud and can generate high-quality speech in diverse styles, applicable to a variety of needs such as conversational synthesized speech.

PSOLA方式や波形接続型（非特許文献１）の音声合成手法は、十分なバラエティの音声素片のデータがあれば高品質な合成音声が期待できるが、データベースに含まれない条件の音声を合成したり、話者適応するような音声の特徴を操作する加工性は高くない。合成したい音声のスタイルに応じた音声データを補うことによって対処するとしても、様々な発話スタイルに対応したデータを収集することは困難が予想される。このため、感情音声や対話音声などを生成するには効率が悪いと考えられる。 PSOLA and waveform-connected speech synthesis methods (Non-Patent Document 1) can be expected to produce high-quality synthesized speech when a sufficient variety of speech segment data is available, but their processability is low: it is difficult to synthesize speech under conditions not covered by the database or to manipulate speech features, for example for speaker adaptation. Even if this is addressed by supplementing speech data for the desired style, collecting data for many different speaking styles is expected to be difficult. These methods are therefore considered inefficient for generating emotional speech, conversational speech, and the like.

これに対して、パラメトリックな音声合成手法の代表例であるフィルタ型の音声合成では、スペクトル包絡と微細構造を(近似的に)分離して扱う。そのため、F₀は任意に変化させられ、フィルタ特性を比較的少数のパラメータで制御して音声スペクトルを生成するため、加工性が高いと期待されている。フィルタ特性を与えるパラメータとしてLPC（非特許文献２）、PARCOR（非特許文献３）、LSP（非特許文献４）やケプストラム（非特許文献５）などが提案されており、それぞれ比較的品質が高い音声分析合成方式が確立されている。しかし、これらの方法ではフィルタパラメータと音声の声質やスタイルの間の関係が一意には定まらないため、音声の性質を自在に制御することは容易ではない。 In contrast, filter-type speech synthesis, a representative parametric speech synthesis method, treats the spectral envelope and the fine structure as (approximately) separate. F₀ can therefore be changed arbitrarily, and because the speech spectrum is generated by controlling the filter characteristics with a relatively small number of parameters, high processability is expected. LPC (Non-Patent Document 2), PARCOR (Non-Patent Document 3), LSP (Non-Patent Document 4), the cepstrum (Non-Patent Document 5), and others have been proposed as parameters specifying the filter characteristics, and a relatively high-quality speech analysis-synthesis method has been established for each. However, in these methods the relationship between the filter parameters and the voice quality or style of the speech is not uniquely determined, so it is not easy to control the properties of the speech freely.

フィルタ型の音声合成は、音声分析合成系として使われる場合はかなり高い品質を示す。しかし、分析時とは異なるF₀で駆動した場合など、一般に波形接続型音声合成に比べ音声品質が低い。その一因として、次に述べるフィルタの利得特性と時間特性に注目することができる。 Filter-type speech synthesis shows considerably high quality when used as a speech analysis-synthesis system. However, its speech quality is generally lower than that of waveform-connected speech synthesis, for example when the filter is driven at an F₀ different from that at analysis time. One contributing factor is the gain and time characteristics of the filter, described next.

全極型フィルタによる音声分析合成方式(LPC系)における有声音の分析合成について考察する。一般に音声スペクトル包絡の山と谷の間には数十dBに達する大きなレベル差(スペクトルダイナミックレンジ)があることが多く、これを少数のパラメータを用いたモデルで表現するために、十数次のような比較的次数が低い全極型フィルタを用いる。全極型フィルタは多重共振系であるが、このような理由によっておのおのの極の共振特性のQ値は、実際の声道の特性よりも大きな値をとる傾向がある。 Consider the analysis and synthesis of voiced sounds in speech analysis-synthesis methods based on all-pole filters (LPC-family methods). In general, there is often a large level difference (spectral dynamic range) of several tens of dB between the peaks and valleys of the speech spectral envelope, and to express this with a model using a small number of parameters, an all-pole filter of relatively low order, on the order of ten-odd, is used. An all-pole filter is a multiple-resonance system, and for this reason the Q value of each pole's resonance tends to be larger than that of the actual vocal tract.

このような周波数特性のフィルタの時間特性は、共振周波数の信号成分に対してQ値にほぼ比例した利得が生じるとともに、Q値にほぼ比例した時定数で出力振幅が立ち上り、減衰する。アクセント(ピッチ)を制御して音声を合成するような場合を考えると、分析時と異なるF₀で全極型フィルタを駆動し、たまたま駆動音源信号の倍音成分が高Q値の共振周波数に一致した場合などには、出力振幅の立ち上りにも減衰(立ち下がり)にも時間がかかり、その結果として合成音声の時間制御特性が悪くなる。そして、このような音が後続の音声に重畳することで、エコーが掛かっているような印象の「歯切れの悪い」音になる一因となっている可能性がある。 In the time response of a filter with such frequency characteristics, a signal component at the resonance frequency receives a gain roughly proportional to the Q value, and the output amplitude rises and decays with a time constant roughly proportional to the Q value. Consider synthesizing speech while controlling the accent (pitch): if the all-pole filter is driven at an F₀ different from that at analysis time and a harmonic of the excitation signal happens to coincide with a high-Q resonance frequency, the output amplitude takes a long time both to rise and to decay, and as a result the temporal controllability of the synthesized speech deteriorates. Such sound is superimposed on the following speech, which may be one cause of an indistinct ("poorly articulated") sound that gives the impression of an echo.
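The proportionality between Q value and time constant can be made concrete for a single resonance. The following is a textbook sketch, not taken from the original: the symbols f_c (center frequency), B (3 dB bandwidth), and τ (envelope time constant) are introduced here for illustration only.

```latex
Q = \frac{f_c}{B}, \qquad
h(t) \propto e^{-\pi B t}\,\sin(2\pi f_c t), \qquad
\tau = \frac{1}{\pi B} = \frac{Q}{\pi f_c}
```

For example, with f_c = 500 Hz, a bandwidth of B = 20 Hz gives Q = 25 and τ of roughly 16 ms, whereas B = 100 Hz gives Q = 5 and τ of roughly 3 ms.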

図1は、ある音声データにおいて、音素/o/に該当する区間をLPC分析して得た全極型フィルタに、１フレーム分の長さ(30msec)のインパルス列(有声音駆動に相当)を入力したときの出力波形である。入力に対して出力振幅は増大を続ける(定常状態に達するまでに時間が掛かる)とともに、入力が終了した後も数十msecにわたり出力が持続している。また、フィルタでは出力信号の利得がQ値に比例するため、その利得は駆動音原信号のピッチ周波数によって大きく変動する。このような現象のため、フィルタ型音声合成では合成音声のパワーを制御しにくい。 FIG. 1 shows the output waveform when an impulse train one frame long (30 msec), corresponding to voiced excitation, is input to an all-pole filter obtained by LPC analysis of a segment corresponding to the phoneme /o/ in some speech data. The output amplitude keeps increasing while the input is applied (it takes time to reach a steady state), and the output persists for several tens of msec after the input ends. Furthermore, because the gain of the output signal is proportional to the Q value, the gain varies greatly with the pitch frequency of the excitation signal. Because of this phenomenon, the power of the synthesized speech is difficult to control in filter-type speech synthesis.

これらの問題は決して特殊な状況ではなく、LPC系においてはしばしば起こりうる。実験的にそれを示すために、ある程度長い(1分程度)音声を用意し、LPC系で分析合成を行った。まず、時間制御特性を調べるための実験を行った。ピッチ周期を0.8倍から1.2倍まで0.02刻みで変更し、分析したフィルタに30msec間入力した。その後入力をせずに合成を続け、各フレーム、ピッチ周期で減衰時間を調べた。ただし、減衰時間は入力停止から合成音声のパワーが30dB低下するまでの時間と定義する。また、速い変化に追従するためパワーを10msec間の振幅の二乗和として定義した。図1においては、55msecが減衰時間である。そして、図2に減衰時間を5ms単位のヒストグラムで示した。分布が右に偏るほど、減衰時間が長くなりやすいと言える。さらに、利得特性を調べるためにピッチ周波数を同様に変化させて音声全体の合成を行い、有声区間の各フレームのパワーを調べた。同一のフレームで、駆動音源のピッチ周波数を変えることでパワーが変化するが、その最大になる場合と最小になる場合のパワーの差を図3にヒストグラムで示した。やはり分布が右へ偏るほど、利得の変化が大きいと言える。これらの結果より、LPCフィルタにおいて、時間特性の問題や利得が大きく変化する現象が確認できる。 These problems are by no means special cases and can occur frequently in LPC-family methods. To show this experimentally, a fairly long speech sample (about one minute) was prepared and analyzed and synthesized with an LPC-family method. First, an experiment was conducted to examine the temporal control characteristics. The pitch period was varied from 0.8 to 1.2 times the original in steps of 0.02, and the resulting excitation was input to the analyzed filter for 30 msec. Synthesis was then continued with no input, and the decay time was measured for each frame and pitch period. Here the decay time is defined as the time from when the input stops until the power of the synthesized speech falls by 30 dB. To follow fast changes, power was defined as the sum of squared amplitudes over 10 msec. In FIG. 1, the decay time is 55 msec. FIG. 2 shows the decay times as a histogram with 5 ms bins. The further the distribution is skewed to the right, the longer the decay time tends to be. Next, to examine the gain characteristics, the entire speech sample was synthesized while varying the pitch frequency in the same way, and the power of each frame in the voiced segments was measured. Within the same frame, the power changes as the pitch frequency of the excitation is changed; FIG. 3 shows a histogram of the difference in power between the maximum and minimum cases. Again, the further the distribution is skewed to the right, the larger the variation in gain. These results confirm that the LPC filter exhibits both the time-characteristic problem and large variations in gain.
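The decay-time measurement described above can be sketched as follows. This is a minimal illustration, not the experiment itself: it replaces the LPC-derived filter with a single two-pole resonator, and the function names, the resonator parameters, and the choice of the last driven 10 msec window as the reference power are all assumptions made here.

```python
import math

def drive_resonator(fc_hz, bw_hz, fs, pitch_hz, drive_ms, total_ms):
    """Feed an impulse train into a two-pole resonator for drive_ms, then no input."""
    r = math.exp(-math.pi * bw_hz / fs)              # pole radius set by the bandwidth
    a1 = -2.0 * r * math.cos(2.0 * math.pi * fc_hz / fs)
    a2 = r * r
    n_total = int(fs * total_ms / 1000)
    n_stop = int(fs * drive_ms / 1000)               # sample index where input stops
    period = int(round(fs / pitch_hz))
    y = [0.0] * n_total
    for n in range(n_total):
        x = 1.0 if (n < n_stop and n % period == 0) else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = x - a1 * y1 - a2 * y2                 # recursive (all-pole) difference equation
    return y, n_stop

def decay_time_ms(y, n_stop, fs, window_ms=10.0, drop_db=30.0):
    """Time from input stop until the 10 ms windowed power has fallen by drop_db."""
    w = int(fs * window_ms / 1000)

    def power(start):
        # Power as the sum of squared amplitudes over one window.
        return sum(v * v for v in y[start:start + w])

    # Assumption: reference power is taken over the last window that was still driven.
    threshold = power(n_stop - w) * 10.0 ** (-drop_db / 10.0)
    for n in range(n_stop, len(y) - w):
        if power(n) <= threshold:
            return 1000.0 * (n - n_stop) / fs
    return None  # power did not fall far enough within the simulated span

fs = 16000
# The 4th harmonic of a 125 Hz excitation coincides with the 500 Hz resonance.
narrow, stop = drive_resonator(500.0, 20.0, fs, 125.0, 30.0, 200.0)   # high Q
wide, _ = drive_resonator(500.0, 200.0, fs, 125.0, 30.0, 200.0)       # low Q
t_narrow = decay_time_ms(narrow, stop, fs)
t_wide = decay_time_ms(wide, stop, fs)
```

Comparing `t_narrow` with `t_wide` shows the trend discussed in the text: the high-Q resonance keeps ringing for tens of milliseconds after the input stops, while the low-Q one dies away almost at once.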

以上の理由から、LPC系の分析合成では、原音声のピッチ周波数を用いれば比較的高い品質の分析合成音が得られるが、原音と異なるピッチ周波数で駆動すると品質が劣化する現象が見られると考えられる。この問題は、有極型フィルタ(巡回型ディジタルフィルタ)の本質に根ざす問題で解消は難しい。仮にそれを改善するためにQ値を下げると、包絡の山と谷のレベル差が形成できず明瞭性の低いbuzzyな印象の音が生成されてしまう。 For these reasons, in LPC-family analysis-synthesis, relatively high-quality speech is obtained when the pitch frequency of the original speech is used, but quality is considered to degrade when the filter is driven at a pitch frequency different from that of the original. This problem is rooted in the nature of filters with poles (recursive digital filters) and is difficult to eliminate. If the Q value is lowered to mitigate it, the level difference between the peaks and valleys of the envelope cannot be formed, and a buzzy sound with low clarity is produced.

CSM法（非特許文献６）では、線スペクトルモデルに基づく音声分析法であるCSM音声分析によって、フォルマント周波数にほぼ対応する複数個の正弦波周波数(CSM 周波数)を得る。そして、それらの周波数の正弦波の和を基本波形として、位相を基本周期ごとに0にリセットすることで音声を合成する。線スペクトルを広げる目的で振幅に指数関数減衰を乗じることも行われた。これは、巡回型フィルタを用いずにパラメトリックに音声合成が行える方式なので、振幅の制御は極めて容易であるため「歯切れのよい」音声合成が期待できる。しかし、CSM法は音声スペクトルを図4のように線スペクトルで近似することに相当するため、スペクトルの再現方法としては検討の余地が残っていた。 In the CSM method (Non-Patent Document 6), CSM speech analysis, a speech analysis method based on a line spectrum model, yields several sinusoidal frequencies (CSM frequencies) that roughly correspond to the formant frequencies. Speech is then synthesized by taking the sum of sinusoids at those frequencies as the basic waveform and resetting the phase to 0 at every fundamental period. The amplitude was also multiplied by an exponential decay in order to broaden the line spectrum. Because this method performs parametric speech synthesis without a recursive filter, amplitude control is extremely easy, and "crisp" speech synthesis can be expected. However, since the CSM method amounts to approximating the speech spectrum with a line spectrum as shown in FIG. 4, there remained room for improvement in how the spectrum is reproduced.

Patent Document 1: JP-B 61-13600
Non-Patent Document 1: Nick Campbell, Alan Black: “CHATR: Natural Speech Waveform-Connected Arbitrary Speech Synthesis System,” Signal Processing Society of Japan Technical Report, vol. 96, no. 39, pp. 45-52, 1996.
Non-Patent Document 2: F. Itakura and S. Saito: “Analysis Synthesis Telephony Based on the Maximum Likelihood Method,” Proc. 6th Int. Congress on Acoustics, 1968.
Non-Patent Document 3: Nobuhiko Kitawaki, Fumitada Itakura, Shuzo Saito: “Optimum Code Construction in PARCOR Type Speech Analysis and Synthesis System,” IEICE Transactions, J61-A, pp. 119-126, 1978.
Non-Patent Document 4: Noboru Sugamura, Fumitada Itakura: “Speech Information Compression by the Line Spectrum Pair (LSP) Speech Analysis-Synthesis Method,” IEICE Transactions, J64-A, pp. 599-606, 1981.
Non-Patent Document 5: Satoshi Imai, Tadashi Kitamura, Hiroyuki Takeya: “Speech Analysis Using Two-Dimensional Cepstrum,” IEICE Transactions, J59-A, pp. 1096-1103, 1976.
Non-Patent Document 6: Shigeki Sagayama, Fumitada Itakura: “Speech Synthesis Using Composite Sinusoids,” Speech Research Group Materials, S79-39, pp. 293-300, 1979.
Non-Patent Document 7: Parham Zolfaghari, Tony Robinson: “Formant Analysis Using Mixture of Gaussians,” Proc. ICSLP 96, vol. 2, pp. 1229-1232, 1996.
Non-Patent Document 8: Shigeki Sagayama, Sadaoki Furui: “A Method of Pitch Extraction Using Lag Windows,” Proceedings of the IEICE National Convention, 1235, Vol. 5, p. 263, 1978.
Non-Patent Document 9: Hirokazu Kameoka, Takuya Nishimoto, Shigeki Sagayama: “Simultaneous Estimation of Acoustic Features of Music Using Harmonic-Temporal Structured Clustering (HTC),” IPSJ SIG Technical Report, 2005-MUS-61-12, pp. 71-78, 2005.
Non-Patent Document 10: Hirokazu Kameoka, Nobutaka Ono, Shigeki Sagayama: “Speech Analysis Using a Composite Function Model of Spectral Envelope and Harmonic Structure,” Proceedings of the 2005 Autumn Meeting of the Acoustical Society of Japan, 2-6-4, 2005.

本発明は、高品質の合成音声を提供すると共に、加工性に優れた音声合成手法を提供することを目的とする。 It is an object of the present invention to provide a high-quality synthesized speech and a speech synthesis method with excellent processability.

本発明が採用した音声合成方法は、フレーム毎の音声のスペクトル包絡を所定数の単峰性関数の混合分布で近似する音声分析により得られた音声スペクトル特徴量に基づいて、時間領域において前記単峰性関数に対応する関数を所定数重畳させてなる複合関数を基本波形とし、前記基本波形を所定の駆動時点に配置することを特徴とする。有声音の音声はほぼ周期波形で、周期的にある波形が繰り返される。本明細書では、その周期的に繰り返される波形を基本波形という。本発明では、基本波形は、前記音声スペクトル特徴量に基づいて求めることができる。加工（変化）された音声スペクトル特徴量に基づいて、基本波形を求めてもよい。 The speech synthesis method adopted by the present invention is characterized in that, based on speech spectrum feature amounts obtained by speech analysis that approximates the spectral envelope of speech in each frame with a mixture distribution of a predetermined number of unimodal functions, a composite function formed by superimposing, in the time domain, a predetermined number of functions corresponding to the unimodal functions is used as the basic waveform, and the basic waveform is placed at predetermined driving time points. Voiced speech is nearly periodic: a certain waveform repeats periodically. In this specification, that periodically repeated waveform is called the basic waveform. In the present invention, the basic waveform can be obtained from the speech spectrum feature amounts. It may also be obtained from processed (modified) speech spectrum feature amounts.

１つの好ましい態様では、前記音声スペクトル特徴量は、単峰性関数の混合分布のモデルパラメータの取得である。混合分布のパラメータから、目的とする基本波形を生成することができる。典型的には、前記モデルパラメータは、各単峰性関数の平均、分散、重みを含む。尚、単峰性関数の混合数をパラメータに含めて扱ってもよい。１つの好ましい態様では、前記モデルパラメータは、ＥＭアルゴリズムを用いて取得される。本明細書において、「EMアルゴリズム」は、非特許文献９で用いられているような実質的にEMアルゴリズムと等価であるアルゴリズムも含む意味で用いる。 In one preferred aspect, the speech spectrum feature amounts are the model parameters of the mixture distribution of unimodal functions. The desired basic waveform can be generated from the parameters of the mixture distribution. Typically, the model parameters include the mean, variance, and weight of each unimodal function. The number of mixture components may also be treated as a parameter. In one preferred aspect, the model parameters are obtained using the EM algorithm. In this specification, "EM algorithm" is used in a sense that also covers algorithms substantially equivalent to the EM algorithm, such as the one used in Non-Patent Document 9.

時間領域における基本波形は、周波数領域における混合分布に対応すると考えられ、前記基本波形は、前記混合分布を逆フーリエ変換したものに相当する。本発明において、逆フーリエ変換は必須ではなく、時間領域におけるある関数と周波数領域におけるある関数との対応関係が既知であれば、周波数領域の関数のパラメータ（音声スペクトル特徴量）を用いて直接基本波形を計算することができる。１つの好ましい態様では、周波数領域におけるガウス分布関数と時間領域におけるガボール関数を対応させる。したがって、この場合、前記混合分布は、所定数のガウス分布関数からなる混合ガウス分布であり、前記基本波形は、所定数のガボール関数を重畳してなる複合ガボール関数である。 The basic waveform in the time domain is considered to correspond to the mixture distribution in the frequency domain; the basic waveform corresponds to the inverse Fourier transform of the mixture distribution. In the present invention, computing an inverse Fourier transform is not essential: if the correspondence between a function in the time domain and a function in the frequency domain is known, the basic waveform can be calculated directly from the parameters of the frequency-domain function (the speech spectrum feature amounts). In one preferred aspect, a Gaussian distribution function in the frequency domain is paired with a Gabor function in the time domain. In this case, the mixture distribution is a Gaussian mixture consisting of a predetermined number of Gaussian distribution functions, and the basic waveform is a composite Gabor function formed by superimposing a predetermined number of Gabor functions.
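Computing the basic waveform directly from the mixture parameters, without an explicit inverse Fourier transform, might look like the sketch below. The function name `composite_gabor`, the example parameter values, and the reading of each weight as the area under its Gaussian (so that the Gabor term peaks at that weight) are assumptions made for illustration.

```python
import math

def composite_gabor(means_hz, stds_hz, weights, fs, half_len):
    """Basic waveform as a sum of Gabor functions, one per Gaussian spectral component.

    A Gaussian in frequency with mean mu and standard deviation sigma (both in Hz)
    corresponds in time to a cosine at mu under a Gaussian envelope
    exp(-2 * (pi * sigma * t)**2).
    """
    wave = []
    for n in range(-half_len, half_len + 1):
        t = n / fs
        v = 0.0
        for mu, sigma, w in zip(means_hz, stds_hz, weights):
            envelope = math.exp(-2.0 * (math.pi * sigma * t) ** 2)
            v += w * envelope * math.cos(2.0 * math.pi * mu * t)
        wave.append(v)
    return wave

fs = 16000
# Hypothetical three-component mixture standing in for analyzed formants.
basic = composite_gabor([500.0, 1500.0, 2500.0],
                        [100.0, 150.0, 200.0],
                        [1.0, 0.5, 0.25],
                        fs, 160)
```

The result is symmetric about its center sample and dies away within a few milliseconds, since wider spectral peaks map to shorter time-domain envelopes.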

１つの態様では、前記音声合成方法は、フレーム毎の音声のスペクトル包絡を、所定数の単峰性関数の混合分布で近似し、音声スペクトル特徴量を求める音声分析ステップを含む。スペクトル包絡の取得は必須ではなく、予め取得され格納されているスペクトル包絡を分析してもよい。１つの態様では、前記スペクトル包絡は、ラグ窓を用いた音声スペクトルの平滑化により取得される。 In one aspect, the speech synthesis method includes a speech analysis step of approximating a spectrum envelope of speech for each frame with a mixed distribution of a predetermined number of unimodal functions to obtain a speech spectrum feature amount. The acquisition of the spectrum envelope is not essential, and the spectrum envelope acquired and stored in advance may be analyzed. In one aspect, the spectral envelope is obtained by smoothing a speech spectrum using a lag window.
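Lag-window smoothing can be sketched as follows: the autocorrelation of a frame is tapered by a window over lag, and the tapered sequence is transformed back to frequency, which is equivalent to convolving the power spectrum with a smoothing kernel. The Gaussian window shape, the function name, and the test signal are assumptions here; the text does not specify which lag window is used.

```python
import math

def lag_window_envelope(frame, n_freq, lag_std):
    """Smoothed power spectrum on n_freq points from 0 to the Nyquist frequency."""
    n = len(frame)
    # Autocorrelation for lags 0 .. n-1.
    r = [sum(frame[i] * frame[i + k] for i in range(n - k)) for k in range(n)]
    # Taper the autocorrelation with a Gaussian lag window (assumed shape).
    tapered = [r[k] * math.exp(-0.5 * (k / lag_std) ** 2) for k in range(n)]
    env = []
    for j in range(n_freq):
        omega = math.pi * j / (n_freq - 1)
        # Cosine transform of the symmetric, tapered autocorrelation.
        s = tapered[0] + 2.0 * sum(tapered[k] * math.cos(omega * k) for k in range(1, n))
        env.append(s)
    return env

# Impulse train with a period of 20 samples: its raw spectrum is a comb of harmonics.
frame = [1.0 if i % 20 == 0 else 0.0 for i in range(200)]
raw = lag_window_envelope(frame, 65, 1e9)      # effectively no lag window
smooth = lag_window_envelope(frame, 65, 5.0)   # window much shorter than the period
```

With a lag window much shorter than the pitch period, the harmonic comb in `raw` is flattened into the nearly constant envelope in `smooth`, which is the fine-structure removal the smoothing is meant to achieve.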

１つの態様では、前記音声合成方法は、ピッチ抽出を含む。また、１つの態様では、前記音声合成方法は、有声音/無声音の判定を含む。 In one aspect, the speech synthesis method includes pitch extraction. In one aspect, the speech synthesis method includes determination of voiced / unvoiced sound.

１つの態様では、前記音声が有声音であり、前記駆動時点は、ピッチ周期ごとに設定される。すなわち、基本波形を、ピッチ周期で配置することで、有声音を合成する。１つの態様では、駆動時点を設定するにあたり、複合波形の各成分ごとに重畳時点をずらしてピッチ周期で重畳する。有声音の場合も駆動時点の配置は、周期ごとには限定されない。１つの態様では、前記駆動時点は、ピッチ周期内に複数ある。LPCにおけるマルチパルス方式に倣って、大小の駆動時点を適切に配置して、そこに「複合Gabor関数」を配置しても良い。これにより合成音声品質の向上が見込まれる。 In one aspect, the speech is a voiced sound, and the driving time points are set at every pitch period. That is, a voiced sound is synthesized by placing the basic waveform at the pitch period. In one aspect, when setting the driving time points, the superposition time is shifted for each component of the composite waveform before the components are superimposed at the pitch period. Even for voiced sounds, the placement of driving time points is not limited to once per period. In one aspect, there are multiple driving time points within a pitch period. Following the multi-pulse method in LPC, driving time points of larger and smaller amplitude may be placed appropriately, and a "composite Gabor function" placed at each. This is expected to improve synthesized speech quality.
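Placing the basic waveform at pitch-period driving time points amounts to overlap-add, as in the sketch below. The helper names, the single-component Gabor waveform, and the 100 Hz pitch are hypothetical; the optional `gains` argument is one way the multi-pulse variant with larger and smaller driving points could be expressed.

```python
import math

def gabor(fs, freq_hz, std_hz, half_len):
    """Single Gabor function centered in a window of 2 * half_len + 1 samples."""
    return [math.exp(-2.0 * (math.pi * std_hz * n / fs) ** 2)
            * math.cos(2.0 * math.pi * freq_hz * n / fs)
            for n in range(-half_len, half_len + 1)]

def place_waveform(basic, drive_points, out_len, gains=None):
    """Overlap-add the basic waveform, centered on each driving time point."""
    out = [0.0] * out_len
    half = len(basic) // 2
    for idx, p in enumerate(drive_points):
        g = 1.0 if gains is None else gains[idx]
        for k, v in enumerate(basic):
            n = p - half + k
            if 0 <= n < out_len:
                out[n] += g * v
    return out

fs, f0 = 8000, 100.0
period = int(round(fs / f0))                 # pitch period in samples
basic = gabor(fs, 700.0, 120.0, 80)
drive = list(range(0, 800, period))          # one driving time point per pitch period
voiced = place_waveform(basic, drive, 800)
```

Because no recursive filter is involved, changing `f0` only changes the spacing of the driving points; the basic waveform, and hence the spectral envelope, is left untouched.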

１つの態様では、前記音声が無声音であり、前記駆動時点は、ランダム間隔に設定される。例えば、ランダム信号と複合Gabor関数の畳み込みが考えられる。また、無声音については、従来の無声音の生成法を採用してもよい。 In one aspect, the voice is an unvoiced sound, and the driving time is set at a random interval. For example, convolution of a random signal and a composite Gabor function can be considered. As for the unvoiced sound, a conventional unvoiced sound generation method may be employed.
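For unvoiced sounds, the same placement can be driven at random intervals. The sketch below assumes exponentially distributed intervals and a random sign per driving point; both choices, along with the function names and parameter values, are illustrative assumptions and are not specified in the text.

```python
import math
import random

def gabor(fs, freq_hz, std_hz, half_len):
    """Single Gabor function centered in a window of 2 * half_len + 1 samples."""
    return [math.exp(-2.0 * (math.pi * std_hz * n / fs) ** 2)
            * math.cos(2.0 * math.pi * freq_hz * n / fs)
            for n in range(-half_len, half_len + 1)]

def random_drive_points(out_len, mean_interval, rng):
    """Driving time points separated by random (exponentially distributed) intervals."""
    pts, t = [], 0.0
    while True:
        t += rng.expovariate(1.0 / mean_interval)
        if t >= out_len:
            return pts
        pts.append(int(t))

def synthesize_unvoiced(basic, out_len, mean_interval, seed=0):
    rng = random.Random(seed)                 # seeded for reproducibility
    pts = random_drive_points(out_len, mean_interval, rng)
    out = [0.0] * out_len
    half = len(basic) // 2
    for p in pts:
        sign = rng.choice((-1.0, 1.0))        # assumed: random polarity per driving point
        for k, v in enumerate(basic):
            n = p - half + k
            if 0 <= n < out_len:
                out[n] += sign * v
    return out, pts

basic = gabor(8000, 3000.0, 600.0, 40)
noise, points = synthesize_unvoiced(basic, 4000, 10.0, seed=1)
```

This is equivalent to convolving a sparse random signal with the basic waveform, so the output is noise-like while still following the spectral shape of the waveform.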

本発明が採用した音声合成装置は、フレーム毎の音声のスペクトル包絡を所定数の単峰性関数の混合分布で近似する音声分析により得られた音声スペクトル特徴量に基づいて、時間領域において前記単峰性関数に対応する関数を所定数重畳させてなる複合関数を基本波形とし、前記基本波形を所定の駆動時点に配置することを特徴とする。１つの好ましい態様では、前記混合分布は、所定数のガウス分布関数からなる混合ガウス分布であり、前記基本波形は、所定数のガボール関数を重畳してなる複合ガボール関数である。 The speech synthesis apparatus adopted by the present invention is characterized in that, based on speech spectrum feature amounts obtained by speech analysis that approximates the spectral envelope of speech in each frame with a mixture distribution of a predetermined number of unimodal functions, a composite function formed by superimposing, in the time domain, a predetermined number of functions corresponding to the unimodal functions is used as the basic waveform, and the basic waveform is placed at predetermined driving time points. In one preferred aspect, the mixture distribution is a Gaussian mixture consisting of a predetermined number of Gaussian distribution functions, and the basic waveform is a composite Gabor function formed by superimposing a predetermined number of Gabor functions.

１つの態様では、前記音声合成装置は、フレーム毎の音声のスペクトル包絡を所定数の単峰性関数の混合分布で近似して音声スペクトル特徴量を取得する音声分析部と、時間領域において前記単峰性関数に対応する関数を所定数重畳させてなる複合関数を基本波形とし、前記基本波形を所定の駆動時点に配置する音声合成部と、を有する。 In one aspect, the speech synthesis apparatus has: a speech analysis unit that obtains speech spectrum feature amounts by approximating the spectral envelope of speech in each frame with a mixture distribution of a predetermined number of unimodal functions; and a speech synthesis unit that uses, as the basic waveform, a composite function formed by superimposing in the time domain a predetermined number of functions corresponding to the unimodal functions, and places the basic waveform at predetermined driving time points.

１つの態様では、前記音声合成装置は、フレーム毎の音声のスペクトル包絡を所定数の単峰性関数の混合分布で近似する音声分析により得られた音声スペクトル特徴量を記憶する記憶部と、時間領域において前記単峰性関数に対応する関数を所定数重畳させてなる複合関数を基本波形とし、前記基本波形を所定の駆動時点に配置する音声合成部と、を有する。 In one aspect, the speech synthesis apparatus has: a storage unit that stores speech spectrum feature amounts obtained by speech analysis that approximates the spectral envelope of speech in each frame with a mixture distribution of a predetermined number of unimodal functions; and a speech synthesis unit that uses, as the basic waveform, a composite function formed by superimposing in the time domain a predetermined number of functions corresponding to the unimodal functions, and places the basic waveform at predetermined driving time points.

本発明に係る音声合成方法は全てコンピュータによって実行することができる。また、本発明に係る音声合成装置は、コンピュータ（入力手段、出力手段、表示手段、演算手段、記憶手段、を含む）によって構成することができる。したがって、本発明は、さらに、本発明に係る音声合成をコンピュータに実行させるためのコンピュータプログラム、ないし、当該コンピュータプログラムを記録したコンピュータ読み取り可能な記録媒体に係る。 All the speech synthesis methods according to the present invention can be executed by a computer. The speech synthesizer according to the present invention can be configured by a computer (including an input unit, an output unit, a display unit, a calculation unit, and a storage unit). Therefore, the present invention further relates to a computer program for causing a computer to perform speech synthesis according to the present invention, or a computer-readable recording medium on which the computer program is recorded.

１つの態様では、本発明は、フレーム毎の音声のスペクトル包絡を所定数の単峰性関数の混合分布で近似する音声分析により得られた音声スペクトル特徴量に基づいて音声合成を行うためにコンピュータを、時間領域において前記単峰性関数に対応する関数を所定数重畳させてなる複合関数を基本波形とし、前記基本波形を所定の駆動時点に配置する手段として機能させるための音声合成用コンピュータプログラム、である。 In one aspect, the present invention is a computer program for speech synthesis that, in order to perform speech synthesis based on speech spectrum feature amounts obtained by speech analysis that approximates the spectral envelope of speech in each frame with a mixture distribution of a predetermined number of unimodal functions, causes a computer to function as means for using, as the basic waveform, a composite function formed by superimposing in the time domain a predetermined number of functions corresponding to the unimodal functions, and placing the basic waveform at predetermined driving time points.

１つの態様では、本発明は、音声スペクトル特徴量に基づいて音声合成するためにコンピュータを、フレーム毎の音声のスペクトル包絡を所定数の単峰性関数の混合分布で近似する音声分析により得られた音声スペクトル特徴量を記憶する記憶手段と、時間領域において前記単峰性関数に対応する関数を所定数重畳させてなる複合関数を基本波形とし、前記基本波形を所定の駆動時点に配置する手段と、して機能させるための音声合成用コンピュータプログラム、である。 In one aspect, the present invention is a computer program for speech synthesis that, in order to perform speech synthesis based on speech spectrum feature amounts, causes a computer to function as: storage means for storing speech spectrum feature amounts obtained by speech analysis that approximates the spectral envelope of speech in each frame with a mixture distribution of a predetermined number of unimodal functions; and means for using, as the basic waveform, a composite function formed by superimposing in the time domain a predetermined number of functions corresponding to the unimodal functions, and placing the basic waveform at predetermined driving time points.

１つの態様では、本発明は、音声スペクトル特徴量に基づいて音声合成するためにコンピュータを、フレーム毎の音声のスペクトル包絡を所定数の単峰性関数の混合分布で近似する音声分析により音声スペクトル特徴量を取得する手段と、時間領域において前記単峰性関数に対応する関数を所定数重畳させてなる複合関数を基本波形とし、前記基本波形を所定の駆動時点に配置する手段と、して機能させるための音声合成用コンピュータプログラム。 In one aspect, the present invention is a computer program for speech synthesis that, in order to perform speech synthesis based on speech spectrum feature amounts, causes a computer to function as: means for obtaining speech spectrum feature amounts by speech analysis that approximates the spectral envelope of speech in each frame with a mixture distribution of a predetermined number of unimodal functions; and means for using, as the basic waveform, a composite function formed by superimposing in the time domain a predetermined number of functions corresponding to the unimodal functions, and placing the basic waveform at predetermined driving time points.

本発明によれば、フィルタ型音声合成に比べて高品質の合成音声が得られる。本発明に係る音声合成手法は、音声スペクトル特徴量に基づいて音声合成を行うものであり、波形接続型音声合成に比べて加工性に優れる。したがって、本発明によれば、対話音声の生成に適した、高品質かつ多様なスタイルの音声を生成可能な音声合成が可能となる。 According to the present invention, high-quality synthesized speech can be obtained compared to filter-type speech synthesis. The speech synthesis method according to the present invention performs speech synthesis based on speech spectrum feature amounts, and is superior in workability compared to waveform-connected speech synthesis. Therefore, according to the present invention, it is possible to perform speech synthesis capable of generating high-quality and various styles of speech suitable for generating conversational speech.

本発明に係る音声分析合成手法について、１つの好適な実施形態である複合ウェーブレットモデル（ＣＷＭ：Composite Wavelet Model）に基づいて説明する。 The speech analysis and synthesis method according to the present invention will be described based on a composite wavelet model (CWM) which is one preferred embodiment.

［Ａ］基本波形の接続による音声合成 [A] Speech Synthesis by Connection of Basic Waveforms
先ず、本発明に係る複合ウェーブレットモデルの前提となる基本波形の接続による音声合成について説明する。従来例で述べた方式における有声音の合成を、ピッチ周期のインパルス列を入力したある線形系と考えて、その線形系のインパルス応答により整理すると、波形接続型では音声波形のピッチ周期波形そのものをインパルス応答とするのに対し、全極型フィルタでは推定されたスペクトル包絡の逆Fourier変換が対応する。 First, speech synthesis by connecting basic waveforms, which underlies the composite wavelet model of the present invention, is described. If voiced-sound synthesis in the methods described in the background section is regarded as a linear system driven by an impulse train at the pitch period, and the methods are organized by the impulse response of that linear system, then in the waveform-connected type the pitch-period waveform of the speech itself serves as the impulse response, whereas in the all-pole filter the inverse Fourier transform of the estimated spectral envelope plays that role.

これを基本波形の繰り返しとして解釈し比較すると、波形接続型におけるピッチ周期波形は、これを構成する基本正弦波とその多数の高調正弦波の重ね合わせととらえられるが、これら個々の振幅位相はピッチそのものに大きく依存するため、ピッチと独立した制御には適さない。 When this is interpreted as a repetition of the basic waveform and compared, the pitch periodic waveform in the waveform connection type can be seen as a superposition of the basic sine wave and its many harmonic sine waves that compose this, but these individual amplitude phases are the pitch. Because it depends largely on itself, it is not suitable for control independent of pitch.

一方、CSM合成においてはほぼフォルマント周波数に対応する正弦波断片が、全極型フィルタにおいては単振動(二次系)のインパルス応答である指数型減衰正弦波が、それぞれ基本波形となっており、いずれもこれら基本波形の重ね合わせと解釈できる。これら基本波形に必要な性質は、音声のスペクトル包絡をよく近似するスペクトルをもつことである。この意味からは必ずしも巡回型フィルタの場合のような長い基本波形は必要ではなく、巡回型フィルタでは単に時間特性を悪化させる要因になっている。 In CSM synthesis, on the other hand, the basic waveforms are sinusoidal segments roughly corresponding to the formant frequencies, and in the all-pole filter they are exponentially damped sinusoids, the impulse response of a simple oscillation (second-order system); in both cases the output can be interpreted as a superposition of these basic waveforms. The property required of these basic waveforms is that their spectrum closely approximate the spectral envelope of speech. From this viewpoint, long basic waveforms like those of a recursive filter are not necessarily required; in a recursive filter they merely degrade the time characteristics.

したがって、パラメトリックでかつ時間特性が良い音声合成は、少なくとも有声音の合成においては、巡回型フィルタを用いず、スペクトル包絡の逆Fourier 変換をピッチ周期で繰り返し、それに希望する振幅を乗じる方法が有利である。 Therefore, for speech synthesis that is parametric and has good time characteristics, at least for voiced sounds, it is advantageous not to use a recursive filter but to repeat the inverse Fourier transform of the spectral envelope at the pitch period and multiply it by the desired amplitude.

［Ｂ］基本波形のモデル化 [B] Modeling of the Basic Waveform
合成音声の基本波形を少数の扱いやすいパラメータによって表現することができれば、合成音声の声質や感情を操作するなどの加工がしやすくなる可能性がある。その要求条件には、
（１）多様な音声を少数のパラメータで表現できるパラメトリックな方式であること、
（２）音声スペクトルの大きなダイナミックレンジを表現でき、かつQ値は低く抑えるために、巡回型フィルタによらない方式であること、
が要求される。
If the basic waveform of synthesized speech can be expressed by a small number of easy-to-handle parameters, processing such as manipulating the voice quality and emotion of the synthesized speech may become easier. The requirements are as follows:
(1) The method must be parametric, able to express diverse speech with a small number of parameters.
(2) The method must not rely on a recursive filter, so that the large dynamic range of the speech spectrum can be expressed while the Q value is kept low.

そこで、次のFourier変換公式に着目する。ωを周波数、tを時間、a,b, cを任意の実数とすると、

が成り立つ。すなわち、周波数領域のガウス分布関数は、図5(a)に示すように、時間領域ではガウス分布関数と正弦波の積であるGabor 関数で表される。ガウス分布関数はdB尺度で見れば下に開いた放物線であり、これを共振特性と考えるとQ値を抑えつつ、かつ大きな山と谷を形成するのに都合がよい。これらの関数対は、スペクトル領域でも時間領域でも大きく拡がらない利点を持つ。これを音声のフォルマントに対応づけて考える。

We therefore focus on a Fourier transform formula that holds when ω is frequency, t is time, and a, b, and c are arbitrary real numbers (the equation itself is not reproduced in this text). That is, as shown in FIG. 5(a), a Gaussian distribution function in the frequency domain is represented in the time domain by a Gabor function, the product of a Gaussian distribution function and a sinusoid. On a dB scale the Gaussian distribution function is a downward-opening parabola, and regarded as a resonance characteristic it is well suited to forming large peaks and valleys while keeping the Q value low. This function pair has the advantage of not spreading widely in either the spectral domain or the time domain. We consider associating it with the formants of speech.
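Since the formula itself does not survive in this text, the following is one standard form of the Gaussian-Gabor transform pair, offered as a sketch rather than as the patent's own equation. It assumes the inverse transform convention f(t) = (1/2π)∫F(ω)e^{iωt}dω and reads a as the peak height, b as the center frequency, and c as the width of the Gaussian; these roles are assumptions.

```latex
F(\omega) = a\left[ e^{-\frac{(\omega-b)^2}{2c^2}} + e^{-\frac{(\omega+b)^2}{2c^2}} \right]
\quad\Longleftrightarrow\quad
f(t) = a\,c\,\sqrt{\frac{2}{\pi}}\; e^{-\frac{c^2 t^2}{2}} \cos(b t)
```

In this form the time-domain envelope has standard deviation 1/c, so a broader spectral peak (larger c) gives a shorter waveform, which is why neither member of the pair spreads widely.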

したがって、図6に示すように、音声スペクトル包絡を、混合ガウス分布関数モデル(GMM)で近似すれば、GMMで表されたスペクトル包絡から、基本波形を生成することができる。(振幅)スペクトル包絡を図5(b)のように複数のガウス分布関数の重ね合わせによって近似した場合には、基本波形は複数のGabor関数の重ね合わせとなる。このため、本手法を複合正弦波モデル(Composite Sinusoidal Modeling)に倣って、正弦波の代わりにGabor Waveletの重ね合わせを基本波形とするという意味で、複合ウェーブレットモデル(ＣＷＭ：Composite Wavelet Model)と名付ける。尚、通常、GMMはGaussian Mixture Modelの略で、混合ガウス分布密度モデルを意味し、その積分値は1に等しくなければならない。しかし、本明細書において、GMMは、スペクトル(パワースペクトルあるいは0位相化した振幅スペクトル)のモデルとしての混合ガウス分布関数モデルを意味するものとする。 Therefore, as shown in FIG. 6, if the speech spectral envelope is approximated by a mixed Gaussian distribution function model (GMM), a basic waveform can be generated from the spectral envelope represented by the GMM. When the (amplitude) spectral envelope is approximated by a superposition of a plurality of Gaussian distribution functions as shown in FIG. 5(b), the basic waveform is a superposition of a plurality of Gabor functions. For this reason, following Composite Sinusoidal Modeling, this method is named the Composite Wavelet Model (CWM), in the sense that a superposition of Gabor wavelets, rather than sinusoids, is used as the basic waveform. Note that GMM usually stands for Gaussian Mixture Model, meaning a mixed Gaussian distribution density model whose integral must equal 1. In this specification, however, GMM means a mixed Gaussian distribution function model used as a model of a spectrum (a power spectrum or a zero-phased amplitude spectrum).
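As a minimal sketch of the idea of a CWM basic waveform, the following Python function sums one Gabor function per Gaussian component. The function name `composite_gabor`, the use of Hz for means and standard deviations, and the interpretation of each weight as the peak height of its Gaussian are assumptions made for illustration; the specification does not fix these conventions.

```python
import numpy as np

def composite_gabor(t, means_hz, sigmas_hz, weights):
    """Sum of Gabor functions, one per Gaussian spectral component.

    Each term is the inverse Fourier transform of a Gaussian placed
    symmetrically at +/- its mean (assumed convention, see lead-in).
    """
    t = np.asarray(t, dtype=float)
    wave = np.zeros_like(t)
    for mu, sigma, w in zip(means_hz, sigmas_hz, weights):
        omega0 = 2.0 * np.pi * mu      # center frequency in rad/s
        s = 2.0 * np.pi * sigma        # spectral standard deviation in rad/s
        # Gaussian time envelope: wide in frequency means short in time
        envelope = s * np.sqrt(2.0 / np.pi) * np.exp(-0.5 * (s * t) ** 2)
        wave += w * envelope * np.cos(omega0 * t)
    return wave

# Example with three hypothetical formant-like components
t = np.linspace(-0.01, 0.01, 321)
h = composite_gabor(t, [700.0, 1200.0, 2600.0], [80.0, 100.0, 150.0], [1.0, 0.6, 0.3])
```

Because every term is an even function of t, the resulting waveform is symmetric about t = 0 and has its largest magnitude there when all weights are positive.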

［Ｃ］EMアルゴリズムを用いたＧＭＭの近似による音声分析法
少数のガウス分布関数でスペクトル包絡の近似を行って音声スペクトル特徴量（平均、分散、重み）を取得することで、各混合成分の平均がフォルマント周波数に、分散がフォルマントの広がりに対応することが期待でき、分析パラメータ（音声スペクトル特徴量）によって音声のフォルマント構造を直接操作できる可能性がある。これにより、フォルマント音声合成同様に音声学の知見を活かした声質変換の点で有利であると考えられる。また、逆に多数のガウス分布関数でスペクトル包絡の近似を行う場合には、加工は難しくなるが近似の精度がよくなり音声品質が向上することが期待できる。 [C] Speech analysis method by GMM approximation using the EM algorithm. When the spectral envelope is approximated with a small number of Gaussian distribution functions to obtain speech spectral features (mean, variance, weight), the mean of each mixture component can be expected to correspond to a formant frequency and the variance to the spread of the formant, so the formant structure of speech may be directly manipulable through the analysis parameters (speech spectral features). As with formant speech synthesis, this is considered advantageous for voice quality conversion that draws on phonetic knowledge. Conversely, when the spectral envelope is approximated with a large number of Gaussian distribution functions, manipulation becomes harder, but the approximation becomes more accurate and speech quality can be expected to improve.

非特許文献７には、音声スペクトルのフォルマント分析のためにスペクトル包絡を混合ガウス分布関数で近似する手法が開示されている。本発明における音声分析においては、非特許文献７に開示された手法を用いることもできる。しかし、分布密度関数推定に関するEM(Expectation-Maximization)アルゴリズムがパワースペクトルのモデル化にそのまま使用できるかどうかは自明でない。それについては、非特許文献９で議論されており、EMアルゴリズムと同型のアルゴリズムにより、観測したスペクトルに対するモデルパラメータのKL尺度(Kullback-Leibler情報量と同型の関数間の擬距離)を最小化(あるいは極小化)することができることが示されている。ここでは、その原理に基づいて、EMアルゴリズムに同型なアルゴリズムに基づいて、分析フレーム単位の音声スペクトルのGMM推定によりスペクトルパラメータを抽出する。本明細書において、「EMアルゴリズム」は、非特許文献９で用いられているような実質的にEMアルゴリズムと等価であるアルゴリズムも含む意味で用いる。 Non-Patent Document 7 discloses a method of approximating a spectral envelope with a mixed Gaussian distribution function for formant analysis of a speech spectrum. The method disclosed in Non-Patent Document 7 can also be used for the speech analysis in the present invention. However, it is not obvious whether the EM (Expectation-Maximization) algorithm for estimating distribution density functions can be applied as it is to modeling a power spectrum. This point is discussed in Non-Patent Document 9, which shows that an algorithm isomorphic to the EM algorithm can minimize (or locally minimize) the KL measure (a pseudo-distance between functions that has the same form as the Kullback-Leibler information) of the model parameters with respect to the observed spectrum. Here, on that principle, spectral parameters are extracted by GMM estimation of the speech spectrum in each analysis frame, using an algorithm isomorphic to the EM algorithm. In this specification, "EM algorithm" is used in a sense that also includes algorithms substantially equivalent to the EM algorithm, such as the one used in Non-Patent Document 9.

また、非特許文献７では、GMM化が包絡でなくピッチ構造に収束する場合を指摘している。本実施例では、自己相関関数にラグ窓を掛けてフーリエ変換することにより平滑化パワースペクトルを得て用いることによりその問題を回避している。ラグ窓を用いたスペクトル包絡の計算については、特許文献１を参照することができる。 Non-Patent Document 7 also points out cases in which the GMM fit converges to the pitch structure rather than to the envelope. In this embodiment, that problem is avoided by using a smoothed power spectrum obtained by multiplying the autocorrelation function by a lag window and then taking its Fourier transform. For the calculation of the spectral envelope using a lag window, see Patent Document 1.

［Ｄ］ＣＷＭを用いた音声分析合成手順
本発明の音声合成法の手順を示す。本発明に係る音声合成法は、予め実行される音声分析ステップと、その分析結果の蓄積・伝送・加工などを経て行なわれる音声合成ステップと、から構成されている。以下に、音声分析ステップの１つの好ましい態様、音声合成ステップの１つの好ましい態様を例示する。 [D] Speech Analysis / Synthesis Procedure Using CWM The procedure of the speech synthesis method of the present invention will be described. The speech synthesis method according to the present invention includes speech analysis steps that are executed in advance and speech synthesis steps that are performed through storage, transmission, and processing of the analysis results. Hereinafter, one preferred embodiment of the speech analysis step and one preferred embodiment of the speech synthesis step will be exemplified.

［Ｄ−１］分析系の手順
（１）フレーム毎に音声スペクトル特徴量を計算する。
その詳細は、例えば、
（1a）音声波形の差分処理を行う（例えば、高域強調フィルタを通す）；
（1b）短時間ごとに音声波形を切り出しデータ窓(Hamming窓など)を掛ける；
（1c）音声信号の自己相関関数を求める；
（1d）自己相関関数に窓(ラグ窓)を掛ける；
（1e）フーリエ変換する(FFTなどのアルゴリズムによる)；
（1f）周波数点ごとに平方根を求める（これにより零位相化された平滑化振幅スペクトルが求まる）；
ことで行う。
ここでの音声スペクトルの計算については、特許文献１を参照することができる。また、音声スペクトルの計算（音声スペクトル包絡の取得）については、その他の公知の様々な手法を採用することができる。 [D-1] Analysis System Procedure (1) A speech spectrum feature amount is calculated for each frame.
The details are, for example, as follows:
(1a) Apply difference processing to the speech waveform (for example, pass it through a high-frequency emphasis filter);
(1b) Cut out the speech waveform at short intervals and apply a data window (such as a Hamming window);
(1c) Compute the autocorrelation function of the speech signal;
(1d) Multiply the autocorrelation function by a window (lag window);
(1e) Take the Fourier transform (using an algorithm such as the FFT);
(1f) Take the square root at each frequency point (this yields a zero-phased, smoothed amplitude spectrum).
Patent Document 1 can be referred to for the calculation of the speech spectrum here. In addition, various other known methods can be employed for calculation of the voice spectrum (acquisition of the voice spectrum envelope).
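Steps (1a) to (1f) can be sketched in Python with NumPy as follows. This is an illustrative reading of the steps, not the procedure of Patent Document 1: the function name, the pre-emphasis coefficient 0.97, the maximum lag of 64 samples, the FFT size, and the Hann-shaped lag window are all assumptions chosen for the example.

```python
import numpy as np

def smoothed_amplitude_spectrum(frame, n_fft=1024, max_lag=64, preemph=0.97):
    """Zero-phase smoothed amplitude spectrum of one frame (steps 1a-1f)."""
    x = np.asarray(frame, dtype=float)
    # (1a) difference processing: first-order high-frequency emphasis
    x = np.concatenate(([x[0]], x[1:] - preemph * x[:-1]))
    # (1b) data window
    x = x * np.hamming(len(x))
    # (1c) autocorrelation, computed via the power spectrum of the
    #      zero-padded frame so that lags do not wrap around
    n = 2 * len(x)
    r = np.fft.irfft(np.abs(np.fft.rfft(x, n)) ** 2, n)[:max_lag + 1]
    # (1d) lag window: falling half of a Hann window (assumed shape)
    r = r * 0.5 * (1.0 + np.cos(np.pi * np.arange(max_lag + 1) / (max_lag + 1)))
    # (1e) Fourier transform of the symmetric, windowed autocorrelation
    sym = np.zeros(n_fft)
    sym[:max_lag + 1] = r
    sym[-max_lag:] = r[1:][::-1]
    smooth_power = np.fft.rfft(sym).real
    # (1f) square root per frequency point; small negative values caused by
    #      window sidelobes are clipped to zero first
    return np.sqrt(np.maximum(smooth_power, 0.0))
```

For a 30 ms frame at 16 kHz (480 samples) the function returns `n_fft // 2 + 1` values covering 0 Hz to 8 kHz. A shorter `max_lag` gives a smoother envelope at the cost of frequency resolution.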

（２）フレームごとの音声スペクトルを混合ガウス関数で近似する。
その詳細は、たとえば、
混合ガウス関数(GMM)のモデルパラメータ(平均、分散、重み)を、適当な初期値から出発して、EMアルゴリズムに類似したアルゴリズムにより求める。
すなわち、混合数をmとして、
各平均μ_i(i=1,2,3,...,m)、
各分散σ_i ² (i=1,2,3,...,m)、
各重みw_i(i=1,2,3,...,m)、
がスペクトル分析結果である。
ここで用いたEMアルゴリズムに類似したアルゴリズムの詳細については非特許文献９を参照することができる。音声スペクトル包絡をGMMで近似する手法は、ここで述べたものに限定されるものではなく、非特許文献７に記載された手法、その他の手法を用いても良い。例えば、非特許文献１０に記載されたスペクトル包絡推定を用いることも可能である。 (2) The speech spectrum for each frame is approximated by a mixed Gaussian function.
The details are, for example, as follows.
The model parameters (mean, variance, weight) of the mixed Gaussian function (GMM) are obtained, starting from suitable initial values, by an algorithm similar to the EM algorithm.
That is, with m denoting the number of mixture components, the spectrum analysis result consists of
the means μ_i (i = 1, 2, 3, ..., m),
the variances σ_i² (i = 1, 2, 3, ..., m), and
the weights w_i (i = 1, 2, 3, ..., m).
For details of the algorithm similar to the EM algorithm used here, see Non-Patent Document 9. The method of approximating the speech spectral envelope with a GMM is not limited to the one described here; the method described in Non-Patent Document 7 or other methods may be used. For example, the spectral envelope estimation described in Non-Patent Document 10 can also be used.
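One common way to realize an EM-like fit of a GMM to a spectrum is to treat the spectrum as a histogram over frequency and run weighted EM updates. The sketch below follows that idea; it is not claimed to be the algorithm of Non-Patent Document 9. The function name `fit_spectrum_gmm`, the initialization rule, the optional `init_means` argument, and the variance floor are assumptions made for illustration.

```python
import numpy as np

def fit_spectrum_gmm(freqs, spectrum, m=5, init_means=None, n_iter=100, var_floor=1.0):
    """EM-style fit of m Gaussians to a non-negative spectrum sampled at freqs."""
    f = np.asarray(freqs, dtype=float)
    p = np.asarray(spectrum, dtype=float)
    p = p / p.sum()                      # treat the spectrum as a histogram over f
    span = f.max() - f.min()
    # Initial values (an assumption): evenly spread means unless supplied,
    # equal variances, equal weights.
    if init_means is None:
        mu = np.linspace(f.min(), f.max(), m + 2)[1:-1]
    else:
        mu = np.asarray(init_means, dtype=float).copy()
        m = len(mu)
    var = np.full(m, (span / (4.0 * m)) ** 2)
    w = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # E-step: share of each frequency bin attributed to each component
        d2 = (f[None, :] - mu[:, None]) ** 2
        comp = w[:, None] * np.exp(-0.5 * d2 / var[:, None]) / np.sqrt(2.0 * np.pi * var[:, None])
        resp = comp / np.maximum(comp.sum(axis=0, keepdims=True), 1e-300)
        # M-step: spectrum-weighted updates of mean, variance and weight
        mass = np.maximum((resp * p).sum(axis=1), 1e-12)
        mu = (resp * p * f).sum(axis=1) / mass
        var = np.maximum((resp * p * (f[None, :] - mu[:, None]) ** 2).sum(axis=1) / mass, var_floor)
        w = mass / mass.sum()
    return mu, var, w
```

The returned weights sum to 1; the overall level of the frame can be kept separately (for example as `spectrum.sum()`), since in this document a GMM models a spectrum rather than a probability density.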

目的に応じて、さらに有声音/無声音の判定、有声音の場合は基本周波数（ピッチ周波数）を求めて、分析結果に追加してもよい。有声音/無声音の判定とF₀推定には、既存のピッチ抽出手法を利用することができる。また、ラグ窓を用いたピッチ抽出については、非特許文献８を参照することができる。 Depending on the purpose, a voiced/unvoiced decision may additionally be made and, for voiced sound, the fundamental frequency (pitch frequency) may be obtained and added to the analysis result. Existing pitch extraction methods can be used for the voiced/unvoiced decision and for F₀ estimation. For pitch extraction using a lag window, see Non-Patent Document 8.
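As one example of an "existing pitch extraction method", a plain autocorrelation peak picker can provide both the voiced/unvoiced decision and an F₀ estimate. The sketch below is a generic textbook approach, not the method of Non-Patent Document 8; the function name, the 60 to 400 Hz search range, and the voicing threshold of 0.3 are assumptions.

```python
import numpy as np

def estimate_f0(frame, fs=16000, fmin=60.0, fmax=400.0, voicing_threshold=0.3):
    """Return an F0 estimate in Hz, or None if the frame is judged unvoiced."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    if not np.any(x):
        return None                      # silent frame: treat as unvoiced
    # Autocorrelation via the power spectrum of the zero-padded frame
    n = 2 * len(x)
    r = np.fft.irfft(np.abs(np.fft.rfft(x, n)) ** 2, n)[:len(x)]
    r = r / r[0]                         # normalize so that r[0] == 1
    # Search for the strongest peak within the allowed pitch-lag range
    lo = int(fs / fmax)
    hi = min(int(fs / fmin), len(x) - 1)
    lag = lo + int(np.argmax(r[lo:hi + 1]))
    # Voiced only if the normalized peak is strong enough
    return fs / lag if r[lag] >= voicing_threshold else None
```

The returned value can be stored per frame next to the GMM parameters, with `None` marking unvoiced frames.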

［Ｄ−２］合成系の手順
（１）駆動時点を決定する。有声音の場合はピッチ周期ごと、無声音の場合はランダムに、駆動時点を決める。
本明細書において、有声音の合成の場合は、基本波形が周期的に繰り返されるが(実際は基本波形は徐々に形を変えて行くが)、その基本波形を配置する位置を駆動時点と呼ぶ。ＣＷＭ合成の場合は、ＣＷＭ基本波形は原点を中心にした左右対称波形であるが、これを時間軸上に周期的に配置して、周期波形を作る。そのような、基本波形の中心を置く時点のことを駆動時点と呼ぶ。 [D-2] Synthesis System Procedure (1) Determine the driving time. In the case of voiced sound, the driving time is determined at every pitch period, and in the case of unvoiced sound, the driving time is determined randomly.
In this specification, when voiced sound is synthesized, the basic waveform is repeated periodically (although in practice it gradually changes shape), and the position at which the basic waveform is placed is called a driving time point. In CWM synthesis, the CWM basic waveform is symmetric about the origin, and it is placed periodically on the time axis to form a periodic waveform. The time point at which the center of the basic waveform is placed is called the driving time point.

（２）駆動時点に対応するフレームの分析で得られた(あるいはそれを加工した)混合ガウス関数の逆フーリエ変換に相当する「複合Gabor関数」(複数のGabor関数を重畳したもの)を、駆動時点に配置する。フレームごとのガウス関数の平均μ_i、分散σ_i ²、重みw_i (但しi = 1,…,m)から、GMMの逆フーリエ変換に対応するGabor関数の重みつき和を求める。これを、ピッチ周期間隔で周期的に配置して音声合成出力とする。
（２−１）Gabor関数はすべて中心を揃えるのでなく、適度にずらせば、ピークを下げつつ全体のエネルギーを増す(波高率の改善)ができる。同時に、合成音声波形の位相を調整して品質を向上させられる可能性もある。
（２−２）有声音の場合も周期ごととは限らずに、LPCにおけるマルチパルス方式に倣って、大小の駆動時点を適切に配置して、そこに「複合Gabor関数」を配置して良い。合成音声品質の向上が見込まれる。駆動時点に相当するマルチパルスを、単一パルスから複合パルス、さらにランダムパルスまで連続的に変化させることで、有声音から無声音までを連続的に生成することができ、より滑らかに自然な音声を合成することができる。
（２−３）無声音の場合は、ランダム信号と複合Gabor関数の畳み込みでも良い。その他、無声音の生成法のみ従来手法を用いるなど、各種の変形が考えられる。 (2) A "composite Gabor function" (a superposition of a plurality of Gabor functions), corresponding to the inverse Fourier transform of the mixed Gaussian function obtained by analyzing the frame corresponding to the driving time point (or a processed version of that function), is placed at the driving time point. From the mean μ_i, variance σ_i², and weight w_i (i = 1, ..., m) of the Gaussian functions of each frame, the weighted sum of the Gabor functions corresponding to the inverse Fourier transform of the GMM is obtained. This is placed periodically at pitch-period intervals to give the synthesized speech output.
(2-1) If the Gabor functions are not all aligned at the same center but are shifted by suitable amounts, the overall energy can be increased while the peak is lowered (an improvement in crest factor). It may also be possible to improve quality by adjusting the phase of the synthesized speech waveform.
(2-2) For voiced sound as well, placement need not be limited to once per period; following the multi-pulse method in LPC, driving time points of larger and smaller amplitude may be suitably arranged and a "composite Gabor function" placed at each. An improvement in synthesized speech quality is expected. By continuously varying the multi-pulse corresponding to the driving time points from a single pulse to a composite pulse and further to random pulses, sounds ranging from voiced to unvoiced can be generated continuously, so that smoother and more natural speech can be synthesized.
(2-3) For unvoiced sound, a convolution of a random signal with the composite Gabor function may be used. Various other modifications are conceivable, such as using a conventional method only for generating unvoiced sound.
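The placement of basic waveforms at driving time points can be sketched as a simple overlap-add. In this illustration the basic waveform is passed in as a sampled array whose center sample corresponds to the driving time point. The function names, the use of exponentially distributed random intervals for unvoiced sound, and the default mean interval of 2 ms are assumptions; the text only states that the driving time points are random for unvoiced sound.

```python
import numpy as np

def driving_times(duration_s, f0_hz=None, mean_interval_s=0.002, rng=None):
    """Voiced (f0_hz given): one driving time point per pitch period.
    Unvoiced (f0_hz is None): driving time points at random intervals."""
    if f0_hz is not None:
        return np.arange(0.0, duration_s, 1.0 / f0_hz)
    rng = np.random.default_rng() if rng is None else rng
    n_gaps = int(4 * duration_s / mean_interval_s) + 16
    times = np.cumsum(rng.exponential(mean_interval_s, size=n_gaps))
    return times[times < duration_s]

def overlap_add(basic_waveform, times_s, duration_s, fs=16000):
    """Add a copy of the basic waveform, centered on each driving time point."""
    w = np.asarray(basic_waveform, dtype=float)
    out = np.zeros(int(round(duration_s * fs)))
    half = len(w) // 2
    for t in times_s:
        lo = int(round(t * fs)) - half       # first output sample touched
        hi = lo + len(w)
        a, b = max(lo, 0), min(hi, len(out)) # clip to the output buffer
        if a < b:
            out[a:b] += w[a - lo:b - lo]
    return out
```

For example, `overlap_add(h, driving_times(0.5, f0_hz=200.0), 0.5)` repeats a basic waveform `h` every 5 ms, while `driving_times(0.5)` yields irregular time points for a noise-like result. Per-frame changes of the basic waveform are omitted here for brevity.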

［Ｅ］音声合成実験
本発明の音声分析合成手法の有効性を確認するために、分析合成によって音声が再現されるかを確認した。また、従来法の問題点の解決に向けて、LPC法との比較を行った。 [E] Speech synthesis experiment In order to confirm the effectiveness of the speech analysis and synthesis method of the present invention, it was confirmed whether speech was reproduced by analysis and synthesis. In addition, in order to solve the problems of the conventional method, we compared it with the LPC method.

［Ｅ−１］実験条件
まず、本発明に係る音声合成法及びGMMによる近似の動作検証のために、本発明の手法によって音声を低次元のパラメータに分析し、パラメータから合成を行った。実験にはATR音声データベースより3−5秒程度の女性話者による文音声を5 程度選び、用いた。サンプリング周波数16kHz、サンプルサイズ16bitの音声に対して、ラグ窓法によるスペクトル包絡の抽出を行った。さらに、スペクトル包絡を5個のガウス関数の和に近似した。したがって、分析パラメータは１フレームにつき15次元である。今回は、F₀はSnack Sound Toolkit(“The Snack Sound Toolkit,”https://www.speechkth.se/snack/）付属のF₀抽出ツールによって抽出した。また、フレーム長30ms、フレームシフト10msで分析した。 [E-1] Experimental conditions First, to verify the operation of the speech synthesis method according to the present invention and of the GMM approximation, speech was analyzed into low-dimensional parameters by the method of the present invention and then synthesized from those parameters. For the experiment, about five sentence utterances by female speakers, each about 3 to 5 seconds long, were selected from the ATR speech database. The spectral envelope was extracted by the lag window method from speech with a sampling frequency of 16 kHz and a sample size of 16 bits. The spectral envelope was then approximated by a sum of five Gaussian functions, so the analysis parameters have 15 dimensions per frame (a mean, a variance, and a weight for each of the five components). In this experiment, F₀ was extracted with the F₀ extraction tool included in the Snack Sound Toolkit ("The Snack Sound Toolkit," https://www.speechkth.se/snack/). The analysis used a frame length of 30 ms and a frame shift of 10 ms.

まず、ピッチ周期や分析パラメータに変更を加えず、音声を合成した。無声音については、ランダムなピッチ周期を与える方法で合成した。そして聴取による比較の他、スペクトルの比較を行った。さらに、ピッチ周期や分析パラメータの平均を0.7倍−1.3倍程度に変化させ、音声を合成し、音声として破綻していないか聴取によって確認を行った。後者は、フォルマント周波数を変更したことに相当する。 First, speech was synthesized without modifying the pitch period or the analysis parameters. Unvoiced sound was synthesized by assigning random pitch periods. The results were compared with the original both by listening and by comparing spectra. Next, the pitch period and the means among the analysis parameters were scaled by factors of about 0.7 to 1.3, speech was synthesized, and it was confirmed by listening that the result did not break down as speech. The latter (scaling the means) corresponds to changing the formant frequencies.
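The parameter modification used in this test can be sketched as a small helper that scales the component means and the pitch period of one frame. The function name, the argument layout, and the clipping of scaled means to the Nyquist frequency are assumptions for illustration; the variances and weights are left unchanged, as in the description.

```python
import numpy as np

def modify_frame(means_hz, variances, weights, f0_hz,
                 formant_factor=1.0, pitch_period_factor=1.0, nyquist_hz=8000.0):
    """Scale the GMM means (formant positions) and the pitch period of a frame."""
    # Scaling the means shifts the formant frequencies; keep them below Nyquist
    new_means = np.clip(np.asarray(means_hz, dtype=float) * formant_factor, 0.0, nyquist_hz)
    # A longer pitch period means a lower F0; unvoiced frames (None) stay unvoiced
    new_f0 = None if f0_hz is None else f0_hz / pitch_period_factor
    return new_means, np.asarray(variances, dtype=float), np.asarray(weights, dtype=float), new_f0

# Example: raise formants by 30 % and shorten the pitch period to 0.8 times
means, var, w, f0 = modify_frame([500.0, 1500.0, 2500.0, 3500.0, 4500.0],
                                 [1e4] * 5, [0.2] * 5, 200.0,
                                 formant_factor=1.3, pitch_period_factor=0.8)
```

With five components, the three arrays together hold the 15 spectral parameters per frame mentioned in the experimental conditions.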

本発明に係る音声合成手法により時間特性が改善することを示すため、従来例で記載したLPCフィルタについての実験と同様の実験を行い、時間特性と利得特性を調べた。 In order to show that the time characteristic is improved by the speech synthesis method according to the present invention, an experiment similar to the experiment for the LPC filter described in the conventional example was performed, and the time characteristic and the gain characteristic were examined.

［Ｅ−２］実験結果と考察
聴取実験によって、良好な音声が合成されることを確認したが、背景にブザー的な雑音が聴かれた。図8に「うれしいはずが...」の冒頭部分の原音声と提案法の合成音声のスペクトルを示す。この図から分かるように、合成音声はかなり原音声の特徴を再現できているが、基本波形をゼロ位相化しているためにエネルギーの集中が著しくなっていることがわかる。図8に本発明に係る音声合成手法により合成される「あ」の音の一部を示す。原音声とは明らかに異なる波形を持つが、スペクトルはほぼ同じである。ピッチ周期やフォルマント周波数を変更する試験を行ったところ、いずれの条件においても破綻することなく音声を合成することができた。図9および図10に本発明に係る手法の時間特性と利得特性を示す。図2および図3との比較より、本発明に係る手法によって時間特性が改善し、かつ利得が安定したことがわかる。 [E-2] Experimental results and discussion Listening tests confirmed that good-quality speech was synthesized, although a buzzer-like noise was audible in the background. FIG. 8 shows the spectra of the original speech and of the speech synthesized by the proposed method for the opening portion of the utterance 「うれしいはずが...」 ("Ureshii hazu ga ..."). As the figure shows, the synthesized speech reproduces the characteristics of the original speech well, but the concentration of energy is pronounced because the basic waveform is zero-phased. FIG. 8 shows part of the sound "a" synthesized by the speech synthesis method according to the present invention. Its waveform clearly differs from that of the original speech, but the spectrum is almost the same. In tests in which the pitch period and the formant frequencies were changed, speech was synthesized without breaking down under any of the conditions. FIGS. 9 and 10 show the time characteristic and the gain characteristic of the method according to the present invention. Comparison with FIGS. 2 and 3 shows that the method according to the present invention improves the time characteristic and stabilizes the gain.

本発明は、高音質かつ多様な声質の音声を生成可能な音声合成技術を提供するものであり、歌声合成、会話音声や感情音声の生成、音声対話システム、カーナビゲーションシステム、HMM音声合成系と組み合わせた擬人化エージェントやロボット、視覚障害者支援などの様々な場面において利用が可能である。 The present invention provides a speech synthesis technique capable of generating speech with high sound quality and diverse voice qualities, and can be used in a variety of settings, such as singing voice synthesis, generation of conversational and emotional speech, spoken dialogue systems, car navigation systems, anthropomorphic agents and robots combined with an HMM speech synthesis system, and assistance for the visually impaired.

LPC フィルタ出力の時間特性の例を示す。 An example of the time characteristic of the LPC filter output is shown.
LPC フィルタ出力の時間特性の傾向を示す。 The trend of the time characteristic of the LPC filter output is shown.
LPC フィルタ出力の利得特性の傾向を示す。 The trend of the gain characteristic of the LPC filter output is shown.
CSM 法による音声スペクトルの線スペクトル表現を示す。 The line spectrum representation of a speech spectrum by the CSM method is shown.
ガウス関数のFourier 変換対を示す。 A Fourier transform pair of a Gaussian function is shown.
GMM による音声スペクトルの近似の例を示す。 An example of the approximation of a speech spectrum by a GMM is shown.
原音声(上) および合成音声(下) のスペクトログラム例を示す。 Example spectrograms of the original speech (top) and the synthesized speech (bottom) are shown.
原音声(上) および合成音声(下) の波形例を示す。 Example waveforms of the original speech (top) and the synthesized speech (bottom) are shown.
本発明に係る手法の時間特性を示す。 The time characteristic of the method according to the present invention is shown.
本発明に係る手法の利得格差を示す。 The gain disparity of the method according to the present invention is shown.

Claims

A speech synthesis method, wherein, based on speech spectral features obtained by speech analysis that approximates the spectral envelope of speech in each frame by a mixed distribution of a predetermined number of unimodal functions, a composite function formed by superimposing a predetermined number of time-domain functions corresponding to the unimodal functions is used as a basic waveform, and the basic waveform is placed at predetermined driving time points.

The speech synthesis method according to claim 1, wherein the speech spectrum feature amount is a model parameter of a mixed distribution of unimodal functions.

The speech synthesis method according to claim 2, wherein the model parameters include an average, a variance, and a weight of each unimodal function.

The speech synthesis method according to claim 3, wherein the model parameter is acquired using an EM algorithm.

The speech synthesis method according to claim 1, wherein the basic waveform corresponds to an inverse Fourier transform of the mixture distribution.

The speech synthesis method according to claim 1, wherein the mixed distribution is a mixed Gaussian distribution consisting of a predetermined number of Gaussian distribution functions, and the basic waveform is a composite Gabor function formed by superimposing a predetermined number of Gabor functions.

The speech synthesis method, wherein the speech synthesis method includes a speech analysis step of obtaining speech spectral features by approximating the spectral envelope of speech in each frame by a mixed distribution of a predetermined number of unimodal functions.

The speech synthesis method according to claim 7, wherein the spectrum envelope is acquired by smoothing a speech spectrum using a lag window.

The speech synthesis method according to claim 1, wherein the speech synthesis method includes pitch extraction.

The speech synthesis method according to claim 1, wherein the speech synthesis method includes determination of voiced / unvoiced sound.

The speech synthesis method according to claim 1, wherein the voice is voiced sound, and the driving time point is set for each pitch period.

The speech synthesis method according to any one of claims 1 to 11, wherein the voice is a voiced sound, and when the driving time point is set, the superposition time point is shifted for each component of the composite waveform and is superposed at a pitch period.

The speech synthesis method according to claim 1, wherein the voice is a voiced sound, and there are a plurality of the driving time points within a pitch period.

The speech synthesis method according to claim 1, wherein the voice is an unvoiced sound, and the driving time is set at a random interval.

A speech synthesizer, wherein, based on speech spectral features obtained by speech analysis that approximates the spectral envelope of speech in each frame by a mixed distribution of a predetermined number of unimodal functions, a composite function formed by superimposing a predetermined number of time-domain functions corresponding to the unimodal functions is used as a basic waveform, and the basic waveform is placed at predetermined driving time points.

The speech synthesizer according to claim 15, comprising:
a speech analysis unit that obtains speech spectral features by approximating the spectral envelope of speech in each frame by a mixed distribution of a predetermined number of unimodal functions; and
a speech synthesis unit that uses, as a basic waveform, a composite function formed by superimposing a predetermined number of time-domain functions corresponding to the unimodal functions, and places the basic waveform at predetermined driving time points.

The speech synthesizer according to claim 15, comprising:
a storage unit that stores speech spectral features obtained by speech analysis that approximates the spectral envelope of speech in each frame by a mixed distribution of a predetermined number of unimodal functions; and
a speech synthesis unit that uses, as a basic waveform, a composite function formed by superimposing a predetermined number of time-domain functions corresponding to the unimodal functions, and places the basic waveform at predetermined driving time points.

The speech synthesizer, wherein the mixed distribution is a mixed Gaussian distribution consisting of a predetermined number of Gaussian distribution functions, and the basic waveform is a composite Gabor function formed by superimposing a predetermined number of Gabor functions.

A computer to perform speech synthesis based on speech spectral features obtained by speech analysis that approximates the spectral envelope of speech for each frame with a mixed distribution of a predetermined number of unimodal functions;
A computer program for speech synthesis for causing a complex function obtained by superimposing a predetermined number of functions corresponding to the unimodal function in a time domain as a basic waveform and functioning as means for arranging the basic waveform at a predetermined driving time point.

A computer to synthesize speech based on speech spectral features,
Storage means for storing a speech spectrum feature obtained by speech analysis that approximates a spectrum envelope of speech for each frame by a mixed distribution of a predetermined number of unimodal functions;
A function corresponding to the unimodal function in the time domain, a complex function formed by superimposing a predetermined number of times as a basic waveform, and means for arranging the basic waveform at a predetermined driving time point;
A computer program for speech synthesis to make it function.

A computer to synthesize speech based on speech spectral features,
Means for acquiring a speech spectrum feature amount by speech analysis that approximates a spectral envelope of speech for each frame by a mixed distribution of a predetermined number of unimodal functions;
A function corresponding to the unimodal function in the time domain, a complex function formed by superimposing a predetermined number of times as a basic waveform, and means for arranging the basic waveform at a predetermined driving time point;
A computer program for speech synthesis to make it function.

A computer-readable recording medium on which the computer program for speech synthesis according to any one of claims 19 to 21 is recorded.