KR101065188B1

KR101065188B1 - Apparatus and method for speaker adaptation by evolutional learning, and speech recognition system using thereof

Info

Publication number: KR101065188B1
Application number: KR1020090067685A
Authority: KR
Inventors: 육동석; 이협우; 김동현
Original assignee: 고려대학교 산학협력단
Priority date: 2009-07-24
Filing date: 2009-07-24
Publication date: 2011-09-19
Also published as: KR20110010233A

Abstract

The present invention relates to speaker adaptation in a speech recognition system. The apparatus for speaker adaptation by evolutional learning according to the present invention comprises: a feature transformation unit that, in a recognition mode in which the speech recognition system is performing speech recognition, performs feature transformation on the feature vectors of the speech data to be recognized using predetermined environment parameters; a speech database that stores the speech data to be recognized; and a model transformation unit that, in a standby mode in which the speech recognition system is not performing speech recognition, performs model transformation on an existing acoustic model using the speech data stored in the speech database. The invention thereby improves user convenience and the recognition performance of the speech recognition system, and performs speaker adaptation and environment adaptation at the same time.

Description

Apparatus and method for speaker adaptation by evolutional learning, and speech recognition system using the same

The present invention relates to speaker adaptation in a speech recognition system and, more particularly, to a speaker adaptation apparatus and method that perform both feature transformation and model transformation through unsupervised adaptation by evolutional learning, thereby improving user convenience and the recognition performance of the speech recognition system while performing speaker adaptation and environment adaptation at the same time, and to a speech recognition system using the same.

Speech recognition technology has recently been applied in a variety of fields, including home appliances, mobile phones, and security and authentication, and both its range of applications and the demand for it are growing rapidly.

Here, speech recognition (speech or voice recognition) means identifying the linguistic meaning and content of speech uttered by a human by means of an automatic speech recognition system. Specifically, it is the process of taking a speech waveform as input and identifying and processing a word or a word sequence.

Speech recognition systems that perform such recognition are generally classified into word recognition systems, continuous speech recognition systems, speaker recognition systems, and the like. The speech recognition algorithms common to these systems include a voice activity detection (VAD) process, a feature extraction process, and a matching process.

Speech recognition is thus a kind of pattern recognition, and finally recognizing the meaning of input speech data through the matching process requires a “training process” and a “test process”.

The training process collects specific speech samples from speakers and generates an acoustic model; the test process matches a sequence of feature vectors extracted from input speech data against the patterns of the generated acoustic model. The recognition performance of a speech recognition system therefore depends on how accurate and reliable the acoustic model generated in training proves to be when it is used for pattern matching against the input speech samples at test time.

The acoustic models used in such speech recognition systems are either speaker-independent or speaker-dependent. A speaker-independent acoustic model is a general model trained on speech data uttered by many speakers, whereas a speaker-dependent acoustic model is a dedicated model trained on speech data uttered by one specific speaker.

A speaker-independent acoustic model is designed to give the best recognition performance on average over a wide range of unspecified speakers. It cannot, however, give the best recognition performance for any single specific speaker. A speech recognition system for a personal mobile device such as a mobile phone or PDA, which is generally used by a single specific speaker, therefore needs a speaker adaptation scheme that uses that speaker's speech to convert the speaker-independent acoustic model into a speaker-dependent one.

Existing speaker adaptation schemes are either supervised adaptation or unsupervised adaptation.

Supervised adaptation, however, impairs user convenience: in an actual speech recognition service environment, the user must utter specific sentences before recognition begins, or the system must be given the transcription data (label or transcription) of the uttered sentences.

Unsupervised adaptation, for its part, requires considerable adaptation time to produce a speaker-dependent acoustic model. In particular, because the speech recognition system itself must predict the transcription data used for adaptation, inaccurate transcription data degrade recognition performance.

Meanwhile, speech recognition in actual use operates on speech mixed with noise, so the environment at recognition time does not match the training environment and the recognition rate falls. Techniques are therefore being developed that match the training environment and the recognition environment so as to raise recognition performance in noisy conditions.

Existing speaker adaptation techniques, however, offer no scheme that performs speaker adaptation and environment adaptation simultaneously and efficiently in a real noisy environment.

Accordingly, the first technical problem to be solved by the present invention is to provide a speaker adaptation apparatus that performs both feature transformation and model transformation through unsupervised adaptation by evolutional learning, thereby improving user convenience and the recognition performance of the speech recognition system while performing speaker adaptation and environment adaptation at the same time.

The second technical problem to be solved by the present invention is to provide a speaker adaptation method that performs both feature transformation and model transformation through unsupervised adaptation by evolutional learning, thereby improving user convenience and the recognition performance of the speech recognition system while performing speaker adaptation and environment adaptation at the same time.

The third technical problem to be solved by the present invention is to provide a speech recognition system using the speaker adaptation apparatus or the speaker adaptation method.

To solve the first technical problem, the present invention provides an apparatus for performing speaker adaptation by evolutional learning in a speech recognition system, the apparatus comprising: a feature transformation unit that, in a recognition mode in which the speech recognition system is performing speech recognition, performs feature transformation on the feature vectors of the speech data to be recognized using predetermined environment parameters; a speech database that stores the speech data to be recognized; and a model transformation unit that, in a standby mode in which the speech recognition system is not performing speech recognition, performs model transformation on an existing acoustic model using the speech data stored in the speech database.

In one embodiment, the feature transformation unit determines the environment parameters in advance using a maximum likelihood method.

In one embodiment, the speaker adaptation apparatus further comprises an environment identification unit that identifies the environment in which the speech data to be recognized was produced by extracting features of the ambient noise from the non-speech sections, that is, the sections other than the speech sections detected by voice activity detection in the speech recognition system.
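The environment identification step can be sketched as follows. This is only an illustration under stated assumptions: the source does not say which noise features are extracted or how environments are matched, so the sketch uses the average log energy of the non-speech frames and a nearest-profile rule. The names `frame_log_energy`, `identify_environment`, and `noise_profiles` are hypothetical, and the VAD decisions are assumed to be supplied by the recognizer.

```python
import math


def frame_log_energy(frame):
    """Log of the mean squared amplitude of one frame of samples."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + 1e-12)  # small floor avoids log(0)


def identify_environment(frames, is_speech, noise_profiles):
    """Return the environment whose stored noise level is closest to the
    average log energy of the non-speech frames.

    frames:         list of frames, each a list of samples
    is_speech:      list of booleans from a VAD (assumed to be given)
    noise_profiles: dict of environment name -> typical noise log energy
                    (hypothetical; the source does not define this)
    """
    noise = [frame_log_energy(f) for f, s in zip(frames, is_speech) if not s]
    if not noise:
        return None  # no non-speech frames, so nothing to identify from
    level = sum(noise) / len(noise)
    return min(noise_profiles, key=lambda env: abs(noise_profiles[env] - level))
```

A real system would more plausibly compare spectral or cepstral noise features than a single energy value; the structure (use only non-speech frames, then pick the closest known environment) is the point of the sketch.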

In one embodiment, the speech database stores, for each environment identified by the environment identification unit, the speech data produced in that environment.

In one embodiment, the model transformation unit performs model transformation on the existing acoustic model for each identified environment, using the speech data stored in the speech database.

To solve the second technical problem, the present invention provides a method for performing speaker adaptation by evolutional learning in a speech recognition system, the method comprising: a feature transformation step of performing, in a recognition mode in which the speech recognition system is performing speech recognition, feature transformation on the feature vectors of the speech data to be recognized using predetermined environment parameters; a speech data storage step of storing the speech data to be recognized in a database; and a model transformation step of performing, in a standby mode in which the speech recognition system is not performing speech recognition, model transformation on an existing acoustic model using the speech data stored in the database.

To solve the third technical problem, the present invention provides a speech recognition system that recognizes speech while performing speaker adaptation by evolutional learning, the system comprising: a feature transformation unit that, in a recognition mode in which the speech recognition system is performing speech recognition, performs feature transformation on the feature vectors of the speech data to be recognized using predetermined environment parameters; a speech database that stores the speech data to be recognized; a model transformation unit that, in a standby mode in which the speech recognition system is not performing speech recognition, performs model transformation on an existing acoustic model using the speech data stored in the speech database; and a recognition unit that, in the recognition mode, performs speech recognition using the feature vectors transformed by the feature transformation unit and the acoustic model transformed by the model transformation unit.

In one embodiment, the speech recognition system further comprises an environment identification unit that identifies the environment in which the speech data to be recognized was produced by extracting features of the ambient noise from the non-speech sections, that is, the sections other than the speech sections detected by voice activity detection.

In one embodiment, the speech recognition system further comprises a model database that stores the environment-specific acoustic models produced by the model transformation unit for each identified environment.

In one embodiment, the speech recognition system further comprises a model relocation unit that limits the number of acoustic models stored in the model database according to a predetermined threshold.

In one embodiment, the recognition unit performs speech recognition using, among the acoustic models stored in the model database, the acoustic model corresponding to the environment in which the speech data to be recognized was produced.
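A model database that holds one acoustic model per environment, caps the number of stored models, and serves the model matching the current environment might look like the sketch below. The source only states that the count is limited by a predetermined threshold; the least-recently-used eviction rule and the fallback to a default (speaker-independent) model are assumptions made here for illustration, and `ModelDatabase` is a hypothetical name.

```python
from collections import OrderedDict


class ModelDatabase:
    """Per-environment acoustic models with a cap on how many are kept."""

    def __init__(self, max_models, default_model):
        self.max_models = max_models        # the predetermined threshold
        self.default_model = default_model  # assumed fallback model
        self._models = OrderedDict()        # oldest-used entry comes first

    def put(self, env, model):
        """Store or replace the adapted model for an environment."""
        self._models[env] = model
        self._models.move_to_end(env)
        while len(self._models) > self.max_models:
            # Assumption: evict the least recently used environment.
            self._models.popitem(last=False)

    def get(self, env):
        """Return the model for env, or the default if none is stored."""
        if env in self._models:
            self._models.move_to_end(env)
            return self._models[env]
        return self.default_model
```

Other eviction criteria, such as dropping the environment with the least adaptation data, would fit the same interface.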

By applying unsupervised adaptation by evolutional learning to the speaker adaptation of a speech recognition system, the present invention improves user convenience.

In addition, by performing feature transformation and model transformation as appropriate to the operating state of the speech recognition system, it improves the efficiency of speaker adaptation and the recognition performance of the speech recognition system.

Furthermore, by generating a speaker-dependent acoustic model for each speech recognition environment, it performs environment adaptation as well as speaker adaptation at the same time.

Before the invention is described in detail, its technical outline is given first for ease of understanding.

The basic aim of speaker or environment adaptation is to prevent the loss of recognition performance caused by the mismatch between the noise-free training data from many unspecified speakers used to build the acoustic model of a speech recognition system and the noisy speech data from a specific speaker encountered in actual testing.

Such adaptation techniques can be classified, according to what the adaptive transformation is applied to, into adaptation by “feature transformation” and adaptation by “model transformation”.

Feature transformation adaptation removes noise from the actual speech data through feature compensation, maps the data to clean speech data, and then performs recognition. Its advantage is that it can adapt quickly to momentary changes in the environment with a relatively small amount of computation.

Its disadvantage is that the improvement in recognition performance is limited compared with model transformation adaptation.

Model transformation adaptation uses adaptation data to adapt, through model compensation, a previously trained acoustic model into one that is dependent on the actual recognition environment, and then performs recognition. Its advantage is a large improvement in recognition performance.

Its disadvantage is that, because only the model parameters associated with specific adaptation data are each transformed, it requires a relatively large amount of adaptation data and computation time.

Accordingly, the present invention provides a new speaker adaptation technique that applies both techniques effectively in speech recognition while compensating for the disadvantages of each. In the recognition mode, in which the speech recognition system is performing speech recognition, feature transformation, which can adapt quickly to the environment, is performed (online adaptation). In the standby mode, in which the speech recognition system is not performing speech recognition, model transformation is performed using a sufficient amount of previously stored speech data as adaptation data (offline adaptation). By repeating feature transformation and model transformation in this way as an evolutional learning process for speaker adaptation, the present invention ensures stable, high recognition performance regardless of the amount of adaptation data accumulated.
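The alternation between online and offline adaptation can be sketched as a small controller. This is a toy illustration of the control flow only, under stated assumptions: features are scalars, the "acoustic model" is reduced to a single mean, and the class and method names (`EvolutionalAdapter`, `on_recognition`, `on_standby`) are hypothetical rather than taken from the source.

```python
class EvolutionalAdapter:
    """Toy controller: transform features online, update the model offline."""

    def __init__(self, a=1.0, b=0.0):
        self.a, self.b = a, b      # predetermined environment parameters
        self.stored = []           # speech database (one list per utterance)
        self.model_mean = 0.0      # stand-in for the acoustic model
        self.model_updates = 0

    def on_recognition(self, features):
        """Recognition mode: fast feature transformation, then store the data."""
        transformed = [self.a * x + self.b for x in features]
        self.stored.append(list(features))  # kept as adaptation data
        return transformed                  # would be passed to the recognizer

    def on_standby(self):
        """Standby mode: re-estimate the model from all stored data."""
        frames = [x for utt in self.stored for x in utt]
        if not frames:
            return False                    # nothing to adapt on yet
        self.model_mean = sum(frames) / len(frames)
        self.model_updates += 1
        return True
```

Each recognition call is cheap, while the expensive re-estimation is deferred to idle time; repeating the two over the device's lifetime is what the text calls evolutional learning.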

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings in order to clarify how the invention solves its technical problems. In describing the invention, descriptions of related known art are omitted where they would obscure the gist of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of users, operators, and the like. Their definitions should therefore be based on the content of this specification as a whole.

FIG. 1 is a block diagram showing an example of a speech recognition system to which the apparatus 100 for speaker adaptation by evolutional learning according to the present invention is applied.

FIG. 2 is a flowchart showing an example of the method for speaker adaptation by evolutional learning according to the present invention.

Referring to FIGS. 1 and 2, the speaker adaptation apparatus 100 according to the present invention includes a feature transformation unit 102, a speech database 106, and a model transformation unit 108, and may further include an environment identification unit 104.

First, the speech recognition system has generated and stored a speaker- and environment-independent acoustic model through an initial training step (S210). Adaptation is precisely the re-estimation of this speaker- and environment-independent acoustic model into a model suited to a specific speaker and environment.

FIG. 3 shows the basic principle of speaker adaptation.

Referring to FIG. 3, speaker adaptation consists of a process 300 of estimating a speaker-independent acoustic model and a process 310 of re-estimating that speaker-independent acoustic model into a speaker-dependent acoustic model. That is, multidimensional feature vectors such as MFCCs (Mel-Frequency Cepstral Coefficients) are extracted from a large amount of speech data obtained from many unspecified speakers, and an HMM (Hidden Markov Model) is estimated from the extracted feature vectors using, for example, the EM (Expectation-Maximization) method. The HMM estimated in this way is an acoustic model for many unspecified speakers and is therefore speaker-independent. If, moreover, the speech data used to train the HMM were collected in a variety of environments (different kinds of audio equipment, noise conditions, and so on), the HMM is also environment-independent.

When a specific user performs speech recognition with a speaker-independent acoustic model trained in this way, the model does not reflect that user's speech characteristics, so recognition performance is unsatisfactory. To use speech-recognition-based functions smoothly on a personal mobile device such as a mobile phone or PDA, the model therefore needs to be re-estimated (adapted) into an acoustic model optimized for the acoustic characteristics of the specific user. As described further below, the present invention introduces a new adaptation scheme that estimates a speaker- and environment-dependent acoustic model from the already estimated speaker- and environment-independent acoustic model and a relatively small amount of speech data obtained from the specific speaker (user).

In the recognition mode, in which the speech recognition system is performing speech recognition, when speech data to be recognized are input, the feature transformation unit 102 performs feature transformation on the feature vectors of that speech data using predetermined environment parameters (S220, S230). That is, while a mobile phone, PDA, or other mobile device to which the speech recognition system is applied is accepting the user's voice commands, feature transformation, which can adapt quickly to the environment, is performed (online adaptation).

More specifically, feature transformation is a method of transforming the feature vectors extracted from the input speech data using environment parameters predicted in advance. The feature transformation process can basically be expressed as Equation 1.

[Equation 1]
$$X' = A \cdot X + B$$

In Equation 1, X is the feature vector of the input speech data, and A and B are the predetermined environment parameters. Feature transformation thus raises recognition performance by changing only the feature vector X extracted from the input speech data, using the predetermined environment parameters A and B. The environment parameters A and B can be predicted, before the input speech data are recognized, from the speech data obtained in a short section at the beginning of the input speech, using a maximum likelihood method. By performing feature transformation quickly with the parameters determined in this way, the system can adapt in real time to changes in the environment. More specifically, A and B are a kind of transformation parameter used to reduce the difference between the input speech and the existing acoustic model. They can be predicted using the existing acoustic model, that is, the acoustic model as it stands before the feature transformation, together with the data obtained during a short time at the beginning of the input speech. For example, the environment parameters can be predicted by maximizing, using Baum's auxiliary function, the likelihood of the speech given a particular acoustic model. Applying the predicted parameters to the feature vectors of the input speech data quickly transforms them into feature vectors somewhat closer to the actual recognition environment, which compensates recognition performance.
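As a minimal sketch of the transformation X' = A·X + B, the code below applies a per-dimension (diagonal) A and a bias B to a sequence of feature vectors, and estimates B from the first few frames of an utterance. The estimate is a deliberate simplification and not the procedure of the source: it fixes A to the identity and models the acoustic model as a single Gaussian with mean `model_mean`, in which case the maximum likelihood bias is simply the model mean minus the sample mean. The function names are hypothetical.

```python
def transform(frames, a, b):
    """Apply X' = A*X + B per dimension (A is assumed diagonal here)."""
    return [[a_d * x_d + b_d for x_d, a_d, b_d in zip(x, a, b)]
            for x in frames]


def estimate_bias(head_frames, model_mean):
    """Maximum likelihood bias for A = I under a single Gaussian model.

    head_frames: feature vectors from the short leading part of the input
    model_mean:  mean of the (single-Gaussian) acoustic model, an assumption
    The likelihood of the shifted frames is maximized when their sample
    mean coincides with the model mean.
    """
    n = len(head_frames)
    dim = len(model_mean)
    sample_mean = [sum(x[d] for x in head_frames) / n for d in range(dim)]
    return [m - s for m, s in zip(model_mean, sample_mean)]
```

For example, with head frames `[[1.0, 2.0], [3.0, 4.0]]` and a model mean of `[0.0, 0.0]`, the estimated bias is `[-2.0, -3.0]`. With an HMM and a full A, the estimate would instead require iterating on Baum's auxiliary function with state occupancies.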

Of course, various schemes may be applied for the feature transformation in an actual implementation. Examples that take the temporal characteristics of the feature vectors into account include cepstral mean subtraction, mean-variance normalization (see "On real-time mean-variance normalization of speech recognition features," P. Pujol, D. Macho and C. Nadeu, ICASSP, 2006, pp. 773-776), the RASTA algorithm (RelAtive SpecTrAl algorithm; see "Data-driven RASTA filters in reverberation," M. L. Shire et al., ICASSP, 2000, pp. 1627-1630), histogram normalization (see "Quantile based histogram equalization for noise robust large vocabulary speech recognition," F. Hilger and H. Ney, IEEE Trans. Audio, Speech, Language Processing, vol. 14, no. 3, pp. 845-854), and the augmenting delta feature algorithm (see "On the use of high order derivatives for high performance alphabet recognition," J. diMartino, ICASSP, 2002, pp. 953-956). Techniques that transform the feature vectors linearly, such as LDA (Linear Discriminant Analysis) and PCA (Principal Component Analysis; see "Optimization of temporal filters for constructing robust features in speech recognition," Jeih-Weih Hung et al., IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 3, 2006, pp. 808-832), may also be applied. Methods that use nonlinear neural networks may be applied as well, such as TRAP (TempoRAl Patterns; see "Temporal patterns in ASR of noisy speech," H. Hermansky and S. Sharma, ICASSP, 1999, pp. 289-292) and ASAT (Automatic Speech Attribute Transcription; see "A study on knowledge source integration for candidate rescoring in automatic speech recognition," Jinyu Li, Yu Tsao and Chin-Hui Lee, ICASSP, 2005, pp. 837-840).
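Two of the simpler schemes named above, cepstral mean subtraction and mean-variance normalization, can be sketched in a few lines. The sketch computes the statistics over a whole utterance, which is an assumption made for brevity; the real-time variants in the cited literature update the statistics incrementally instead. Function names are illustrative.

```python
import math


def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean over the utterance from every frame."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def mean_variance_normalization(frames, eps=1e-8):
    """Center each dimension and scale it to unit standard deviation."""
    centered = cepstral_mean_subtraction(frames)
    dim = len(frames[0])
    std = [math.sqrt(sum(f[d] ** 2 for f in centered) / len(centered))
           for d in range(dim)]
    # eps guards against division by zero for constant dimensions
    return [[f[d] / (std[d] + eps) for d in range(dim)] for f in centered]
```

Both are special cases of the affine form X' = A·X + B: mean subtraction fixes A to the identity, and mean-variance normalization uses a diagonal A of inverse standard deviations.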

Next, the voice database 106 stores the recognition target voice data as adaptation data to be used for acoustic model adaptation (S240). In a resource-constrained mobile environment such as a mobile phone or a robot, only a limited amount of voice data may be stored, for example on a first-in, first-out basis, rather than storing all voice data.
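The first-in, first-out storage mentioned above can be sketched with a bounded queue. The class name `AdaptationBuffer` and the capacity of three utterances are hypothetical; the sketch only shows the eviction behavior, not how audio is actually persisted on a device.

```python
from collections import deque

class AdaptationBuffer:
    """Bounded store of adaptation utterances with first-in, first-out eviction."""

    def __init__(self, max_utterances):
        # A deque with maxlen silently drops the oldest item when full.
        self._buf = deque(maxlen=max_utterances)

    def store(self, utterance):
        self._buf.append(utterance)

    def snapshot(self):
        """Return the stored utterances, oldest first."""
        return list(self._buf)

buf = AdaptationBuffer(max_utterances=3)
for i in range(5):
    buf.store(f"utt{i}")
```

After the loop, only the three most recent utterances remain in the buffer.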

Next, in a standby mode in which the speech recognition system is not performing speech recognition (S250), the model conversion unit 108 performs model transformation on the existing acoustic model using the voice data stored in the voice database 106 (S260). That is, when the computational load of the device to which the speech recognition system is applied is low (for example, at night or while charging), model transformation is performed using the sufficient amount of adaptation data accumulated in the voice database 106 during the recognition mode (offline adaptation). Here, model transformation refers to a technique of directly transforming the parameters of the HMM that serves as the acoustic model. In other words, the acoustic model consists of mean vectors, covariance matrices, and weights, and the model transformation changes all or a selected subset of them. To this end, the parameters are estimated using the voice data stored in the voice database 106, and the model parameters are transformed using, for example, a linear regression tree. This process corresponds to unsupervised adaptation, in which the system, for example in a mobile environment, itself predicts the transcription needed for acoustic model training and trains on it.
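To make the idea of transforming mean vectors concrete, the sketch below applies one shared affine transform, mu' = A mu + b, to every Gaussian mean in a single regression class, which is the form used by MLLR-style mean adaptation. The function name `transform_means` and the example values of `A` and `b` are assumptions for illustration; estimating `A` and `b` from the adaptation data and building the regression tree are not shown.

```python
def transform_means(means, A, b):
    """Apply mu' = A @ mu + b to every mean vector of one regression class.

    means: list of mean vectors; A: square matrix as a list of rows; b: bias vector.
    """
    adapted = []
    for mu in means:
        adapted.append([
            sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
            for i in range(len(b))
        ])
    return adapted

# Hypothetical two-dimensional means sharing one transform.
means = [[1.0, 2.0], [0.0, -1.0]]
A = [[1.0, 0.5], [0.0, 2.0]]
b = [0.1, -0.2]
adapted = transform_means(means, A, b)
```

Because all Gaussians in a class share one transform, even Gaussians that were never observed in the adaptation data are moved, which is why this style of transformation suits small amounts of data.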

As described above, in the standby mode of the speech recognition system, model transformation is performed, which is relatively computationally expensive but can maximize recognition performance. The features of the stored voice data used by the model conversion unit 108 for model transformation may be MFCCs (Mel-Frequency Cepstral Coefficients), as in the estimation of the speaker-independent acoustic model; however, considering the characteristics of the speaker adaptation process, the MLLR (Maximum Likelihood Linear Regression) or MAP (Maximum A Posteriori) technique may be used in addition to the general EM (Expectation-Maximization) technique.
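For the MAP technique named above, the commonly used update for a single Gaussian mean interpolates between the prior (speaker-independent) mean and the adaptation data. The following is a sketch of that update only; the function name `map_update_mean`, the prior weight `tau`, and the per-frame occupation probabilities passed in as `posteriors` are assumptions for this example, and variances and mixture weights are left untouched.

```python
def map_update_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP re-estimate of one Gaussian mean.

    mu_new = (tau * mu_prior + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)
    where gamma_t is the occupation probability of this Gaussian at frame t.
    """
    occupancy = sum(posteriors)
    dim = len(prior_mean)
    weighted = [
        sum(g * x[d] for g, x in zip(posteriors, frames)) for d in range(dim)
    ]
    return [
        (tau * prior_mean[d] + weighted[d]) / (tau + occupancy)
        for d in range(dim)
    ]

# With little data the estimate stays near the prior; with more data
# it moves toward the sample mean of the adaptation frames.
updated = map_update_mean([0.0, 0.0], [[2.0, 4.0], [2.0, 4.0]], [1.0, 1.0], tau=2.0)
```

With no adaptation frames the function returns the prior mean unchanged, which matches the intended behavior when an environment has not yet been observed.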

When newly stored voice data exists in the voice database 106, the model conversion unit 108 performs model transformation on the final acoustic model using the newly stored voice data (S270). Meanwhile, when the speech recognition system returns to the recognition mode (S250), steps S220 to S240 described above are repeated.

As will be described again below, in one embodiment of the present invention, the recognition target voice data has different characteristics depending on the user's speaking environment (for example, subway, indoors, or street). Accordingly, the speaker adaptation apparatus 100 may further include an environment identification unit 104 that identifies the environment in which the recognition target voice data was generated by extracting features of ambient noise from non-voice sections, that is, sections other than the voice sections detected through voice activity detection in the speech recognition system. In this case, the voice database 106 stores, for each environment identified by the environment identification unit 104, the voice data generated in that environment. The model conversion unit 108 may perform model transformation on the existing acoustic model for each identified environment using the voice data stored in the voice database 106. The resulting environment-specific acoustic models are speaker-dependent in that all of the stored voice data is obtained from a specific user, and environment-dependent in that an acoustic model is trained for each environment.

As a result, the speaker adaptation apparatus according to the present invention performs speaker adaptation and environment adaptation simultaneously through evolutionary learning, which repeatedly performs feature transformation adaptation and model transformation adaptation.

FIG. 4 is a flowchart showing an example of a speech recognition method by evolutionary learning according to the present invention.

Referring to FIGS. 1 and 4, the speech recognition system according to the present invention includes the speaker adaptation apparatus 100 and further includes a voice input unit 110, a voice section detection unit 120, a feature extraction unit 130, a model database 140, a model rearrangement unit 150, and a recognition unit 160.

First, the voice input unit 110 receives an analog voice signal (S400). The voice input unit 110 may also perform preprocessing that converts the voice signal into digital voice data through filtering by an anti-aliasing filter and conversion by an ADC (Analog-to-Digital Converter).

Next, the voice section detection unit 120 performs voice activity detection (VAD) to determine whether the voice data corresponds to human speech (S402). To do so, the voice section detection unit 120 may use a machine learning approach, in which a model is trained on learning data for speech and non-speech sounds and then used to detect voice sections, or an approach in which features closely related to the characteristics of speech (zero crossing rate, spectral entropy, and the like) are modeled and voice sections are detected by searching for the occurrence of those features.
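The feature-based approach can be illustrated with a very small detector that combines zero crossing rate with short-time energy. This is a sketch under stated assumptions, not the detector used by the voice section detection unit 120: the thresholds `energy_min` and `zcr_max` are hypothetical values chosen for the synthetic test signals below, and a practical detector would tune them and add smoothing across frames.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)

def is_voice(frame, energy_min=0.01, zcr_max=0.5):
    """Label a frame as voice if it is energetic and not noise-like."""
    return (short_time_energy(frame) >= energy_min
            and zero_crossing_rate(frame) <= zcr_max)

# Synthetic frames: a low-frequency tone, silence, and a rapidly alternating signal.
tone = [0.5 * math.sin(2 * math.pi * k / 40) for k in range(80)]
silence = [0.0] * 80
alternating = [0.5, -0.5] * 40
```

Here the tone passes both tests, silence fails the energy test, and the alternating signal fails the zero crossing test.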

Next, the feature extraction unit 130 receives the recognition target voice data of the voice section and extracts feature vectors from it (S404). The extracted feature vectors generally carry frequency information over time, in a form that compresses only the components of the recognition target voice data needed for speech recognition. Feature vectors include MFCC (Mel-Frequency Cepstral Coefficients), LPCC (Linear Prediction Cepstral Coefficients), EIH (Ensemble Interval Histogram), and the like; in one embodiment of the present invention, MFCC is used as the feature vector. To extract feature vectors from the recognition target voice data, the feature extraction unit 130 may perform various preprocessing steps, such as framing, Hamming windowing, Fourier transform, filter bank analysis, and cepstrum transform.
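The first two preprocessing steps, framing and Hamming windowing, can be sketched as follows. The frame length of 400 samples and hop of 160 samples are assumptions (they would correspond to 25 ms and 10 ms at a 16 kHz sampling rate, which the patent does not specify), and the later stages (Fourier transform, filter bank, cepstrum transform) are omitted.

```python
import math

def hamming(n):
    """Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames and apply a Hamming window to each.

    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# A constant signal makes the window shape directly visible in each frame.
samples = [1.0] * 1000
frames = frame_signal(samples, frame_len=400, hop=160)
```

Each windowed frame would then be passed to the spectral analysis stages that produce the MFCC feature vector.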

Next, as described above, the feature conversion unit 102 of the speaker adaptation apparatus 100 performs feature transformation on the extracted feature vectors using a predetermined environment parameter in the recognition mode in which the speech recognition system is performing speech recognition (S406). The recognition unit 160 then performs speech recognition using the transformed feature vectors and an acoustic model.

Meanwhile, the environment identification unit 104 of the speaker adaptation apparatus 100 identifies the environment in which the voice data was generated by extracting features of ambient noise from non-voice sections other than the voice sections detected by the voice section detection unit 120 (S408). The environment identification unit 104 may extract the ambient noise features from the non-voice sections on the same principle as the feature extraction unit 130, or may use separate ambient noise estimation algorithms (see, for example, S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-27(2), pp. 113-120, 1979).

Next, in the recognition mode, the recognition unit 160 performs speech recognition on the recognition target voice data using the feature vectors transformed by the feature conversion unit 102 and, among the environment-dependent acoustic models transformed by the model conversion unit 108 of the speaker adaptation apparatus 100, the optimal acoustic model corresponding to the identified environment (S410).

Next, the voice database 106 of the speaker adaptation apparatus 100 stores, for each environment identified by the environment identification unit 104, the voice data generated in that environment as adaptation data to be used for acoustic model adaptation (S412).

Meanwhile, if the speech recognition system remains in the recognition mode (S414), steps S400 to S412 described above are repeated.

On the other hand, if the speech recognition system is in the standby mode, in which it is not performing speech recognition (S414), the model conversion unit 108 of the speaker adaptation apparatus 100 performs model transformation on the existing acoustic model using the voice data stored in the voice database 106 (S416). In particular, the model conversion unit 108 performs model transformation on the existing acoustic model separately for each identified environment, using the voice data stored in the voice database 106. That is, it generates an acoustic model dependent on each of the identified environments.

Next, the model database 140 stores the environment-specific acoustic models produced by the model conversion unit 108 through model transformation for each identified environment (S418). The model database 140 may initially store a speaker-independent acoustic model, or a speaker- and environment-independent acoustic model. Through the evolutionary learning process described above, however, the model database 140 comes to store speaker- and environment-dependent acoustic models.

FIG. 5 shows a process of generating speaker- and environment-dependent acoustic models through evolutionary learning.

Referring to FIG. 5, a speaker- and environment-independent acoustic model 500 is initially generated and stored in the speech recognition system. Model transformation is performed on this speaker- and environment-independent acoustic model for each environment using the environment-specific voice data 510 stored in the voice database 106, thereby generating environment-dependent acoustic models 520 corresponding to the respective environments. Then, when voice data 530 from new environments is stored in the voice database 106, model transformation is performed on each of the environment-dependent acoustic models 520 using the corresponding voice data for each new environment, generating new environment-dependent acoustic models 540.

In one embodiment, if the voice data newly stored in the voice database 106 was generated in an environment for which an environment-dependent acoustic model has already been generated, model transformation may be performed again on that existing environment-dependent acoustic model using the newly stored voice data, so that the model evolutionarily adapts to that specific environment. If, instead, the newly stored voice data was generated in a new environment, model transformation is performed, as described above, on all of the already generated environment-dependent acoustic models using the newly stored voice data, and an acoustic model dependent on the new environment may be generated using the likelihoods of the resulting environment-dependent models.
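The two cases in this embodiment can be sketched with toy models. Everything here is a simplifying assumption: each acoustic model is reduced to a single mean vector, `adapt` is a MAP-style interpolation standing in for the real model transformation, `log_likelihood` assumes unit variance, and "using the likelihoods" is interpreted as picking the highest-likelihood candidate, which is only one possible reading of the text. The table is assumed to contain at least one model before a new environment is added.

```python
def adapt(mean, frames, tau=5.0):
    """Toy stand-in for model transformation: MAP-style mean interpolation."""
    n = len(frames)
    sums = [sum(f[d] for f in frames) for d in range(len(mean))]
    return [(tau * mean[d] + sums[d]) / (tau + n) for d in range(len(mean))]

def log_likelihood(mean, frames):
    """Unit-variance Gaussian log-likelihood, up to an additive constant."""
    return -0.5 * sum(
        (f[d] - mean[d]) ** 2 for f in frames for d in range(len(mean))
    )

def evolve(models, env, frames):
    """Update the per-environment model table with newly stored data."""
    if env in models:
        # Known environment: adapt its existing model again.
        models[env] = adapt(models[env], frames)
    else:
        # New environment: adapt every existing model, keep the best fit.
        candidates = [adapt(m, frames) for m in models.values()]
        models[env] = max(candidates, key=lambda m: log_likelihood(m, frames))
    return models

models = {"indoor": [0.0, 0.0], "street": [4.0, 4.0]}
evolve(models, "subway", [[5.0, 5.0], [5.0, 5.0]])  # new environment
evolve(models, "indoor", [[1.0, 1.0]])              # known environment
```

In this run the subway model is derived from the street model, because the adapted street model explains the subway data better than the adapted indoor model.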

Next, the model rearrangement unit 150 limits the number of acoustic models stored in the model database 140 according to a predetermined threshold (S420). That is, in a resource-constrained mobile device or the like, rather than generating, storing, and using an unlimited number of environment-dependent acoustic models, the number of acoustic models stored and used may be limited in consideration of storage capacity and computational load.

The model rearrangement unit 150 may apply various schemes to keep the number of acoustic models at or below the threshold. For example, in one embodiment, when the number of acoustic models stored in the model database 140 exceeds the threshold, the model rearrangement unit 150 may simply prevent newly generated acoustic models from being stored in the model database 140. In another embodiment, the model rearrangement unit 150 may reduce the number of acoustic models to or below the threshold by combining acoustic models through a weighted sum or a linear combination, using the likelihoods of the generated acoustic models or taking into account the amount of data stored for each environment. In yet another embodiment, the model rearrangement unit 150 may cause the environment identification unit 104 to classify the environments in which voice data is generated into only a predetermined number of types, thereby keeping the number of acoustic models generated by the model conversion unit 108 at or below the threshold.
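The weighted-sum embodiment can be sketched as follows, weighting each model by the amount of data stored for its environment. The choice of which models to combine is not stated in the text; merging the two closest models first is an assumption of this example, as are the function names, the reduction of each model to one mean vector, and the requirements that `max_models` is at least 1 and all counts are positive.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two mean vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def limit_models(means, counts, max_models):
    """Merge the two closest models by count-weighted sum until at most
    max_models remain. Returns new dicts; the inputs are not modified."""
    means, counts = dict(means), dict(counts)
    while len(means) > max_models:
        names = sorted(means)
        pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
        a, b = min(pairs, key=lambda p: sq_dist(means[p[0]], means[p[1]]))
        total = counts[a] + counts[b]
        merged = [
            (counts[a] * x + counts[b] * y) / total
            for x, y in zip(means[a], means[b])
        ]
        for name in (a, b):
            del means[name]
            del counts[name]
        means[a + "+" + b] = merged
        counts[a + "+" + b] = total
    return means, counts

means = {"indoor": [0.0, 0.0], "office": [1.0, 1.0], "street": [10.0, 10.0]}
counts = {"indoor": 30, "office": 10, "street": 20}
kept_means, kept_counts = limit_models(means, counts, max_models=2)
```

In this example the indoor and office models are merged, and the merged mean lies closer to the indoor model because more data was stored for that environment.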

Next, if newly stored voice data exists in the voice database 106 (S422), the speech recognition system repeats steps S416 to S420 described above. However, if the speech recognition system switches to the recognition mode in which it performs speech recognition, steps S400 to S412 described above are repeated.

As described above, each environment-specific acoustic model is speaker-dependent in that all of the stored voice data is obtained from a specific user, and environment-dependent in that an acoustic model is trained for each environment.

As a result, the speech recognition system according to the present invention performs speaker adaptation and environment adaptation simultaneously through evolutionary learning, which repeatedly performs feature transformation adaptation and model transformation adaptation.

Meanwhile, when performing speech recognition on recognition target voice data, the recognition unit 160 has the environment identification unit 104 identify the environment in which the recognition target voice data was generated, and retrieves and uses the acoustic model corresponding to the identified environment from the model database 140. That is, in the recognition mode of the speech recognition system, the recognition unit 160 performs speech recognition using the feature vectors transformed by the feature conversion unit 102 and the speaker- and environment-dependent acoustic model transformed by the model conversion unit 108.

In one embodiment, when performing speech recognition on recognition target voice data, the recognition unit 160 may compute the likelihood of the data under each of the acoustic models stored in the model database 140, determine that the acoustic model with the highest likelihood is the one dependent on the environment in which the recognition target voice data was generated (that is, the current speech recognition environment), and perform speech recognition using that acoustic model.
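Likelihood-based model selection can be sketched with a single diagonal-covariance Gaussian per environment. This is a deliberate simplification: the acoustic models in the patent are HMMs, whose likelihood would be computed by decoding rather than by the closed form below. The function names and the example models are hypothetical.

```python
import math

def diag_gauss_loglik(frames, mean, var):
    """Log-likelihood of frames under one diagonal-covariance Gaussian."""
    ll = 0.0
    for x in frames:
        for xd, md, vd in zip(x, mean, var):
            ll += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
    return ll

def select_model(frames, models):
    """Return the name of the model under which the frames are most likely.

    models maps an environment name to a (mean, variance) pair.
    """
    return max(models, key=lambda name: diag_gauss_loglik(frames, *models[name]))

models = {
    "indoor": ([0.0, 0.0], [1.0, 1.0]),
    "subway": ([3.0, 3.0], [1.0, 1.0]),
}
best = select_model([[2.8, 3.1], [3.2, 2.9]], models)
```

The selected name would then be used to look up the corresponding acoustic model in the model database 140 before recognition proceeds.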

Meanwhile, the present invention can be implemented as computer-readable program code on a computer-readable recording medium. When the present invention is executed through software, the constituent means of the present invention are code segments that perform the necessary tasks. The program or code segments may be stored in a processor-readable medium of a computer, or transmitted as a computer data signal combined with a carrier wave over a transmission medium or a communication network.

Computer-readable recording media include all kinds of recording devices that store data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner.

As described above, the present invention improves user convenience by applying an unsupervised adaptation scheme based on evolutionary learning to speaker adaptation in a speech recognition system. In addition, by performing feature transformation and model transformation as appropriate to the operating state of the speech recognition system, it improves the efficiency of speaker adaptation and the recognition performance of the speech recognition system. Furthermore, by generating a speaker-dependent acoustic model for each speech recognition environment, it performs environment adaptation as well as speaker adaptation at the same time.

The present invention has been described above with reference to embodiments. Those skilled in the art will understand, however, that the present invention may be implemented in modified forms without departing from its essential technical spirit. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. That is, the true technical scope of the present invention is set forth in the appended claims, and all differences within the scope of their equivalents should be construed as being included in the present invention.

FIG. 1 is a block diagram showing an example of a speech recognition system to which the apparatus for speaker adaptation by evolutionary learning according to the present invention is applied.

FIG. 2 is a flowchart showing an example of a method for speaker adaptation by evolutionary learning according to the present invention.

FIG. 3 is a diagram showing the basic principle of speaker adaptation techniques.

FIG. 4 is a flowchart showing an example of a speech recognition method by evolutionary learning according to the present invention.

FIG. 5 is a diagram showing a process of generating speaker- and environment-dependent acoustic models through evolutionary learning.

Claims

An apparatus for performing speaker adaptation by evolutionary learning in a speech recognition system, the apparatus comprising:

a feature conversion unit that, in a recognition mode in which the speech recognition system is performing speech recognition, performs feature transformation on a feature vector of recognition target voice data using a predetermined environment parameter;

a voice database that stores the recognition target voice data; and

a model conversion unit that, in a standby mode in which the speech recognition system is not performing speech recognition, performs model transformation on an existing acoustic model using the voice data stored in the voice database,

wherein speaker adaptation and environment adaptation are performed simultaneously by repeating the feature transformation and the model transformation according to the operating mode of the speech recognition system.

The apparatus of claim 1,

wherein the feature conversion unit determines the environment parameter in advance using a maximum likelihood method.

The apparatus of claim 1,

further comprising an environment identification unit that identifies the environment in which the recognition target voice data was generated by extracting features of ambient noise from a non-voice section other than the voice section detected through voice activity detection in the speech recognition system.

The apparatus of claim 3,

wherein the voice database stores, for each environment identified by the environment identification unit, the voice data generated in that environment.

The apparatus of claim 4,

wherein the model conversion unit performs model transformation on the existing acoustic model for each identified environment using the voice data stored in the voice database.

A method for speaker adaptation by evolutionary learning in a speech recognition system, the method comprising:

a feature conversion step of performing, in a recognition mode in which the speech recognition system is performing speech recognition, feature transformation on a feature vector of recognition target voice data using a predetermined environment parameter;

a voice data storage step of storing the recognition target voice data in a database; and

a model conversion step of performing, in a standby mode in which the speech recognition system is not performing speech recognition, model transformation on an existing acoustic model using the voice data stored in the database.

The method of claim 6,

wherein the feature conversion step includes determining the environment parameter in advance using a maximum likelihood method.

The method of claim 6,

further comprising an environment identification step of identifying the environment in which the recognition target voice data was generated by extracting features of ambient noise from a non-voice section other than the voice section detected through voice activity detection in the speech recognition system.

The method of claim 8,

wherein the voice data storage step stores, for each environment identified in the environment identification step, the voice data generated in that environment.

The method of claim 9,

wherein the model conversion step performs model transformation on the existing acoustic model for each identified environment using the voice data stored in the database.

A computer-readable recording medium having recorded thereon a program for causing a computer to execute the method according to any one of claims 6 to 10.

A speech recognition system that recognizes speech by performing speaker adaptation by evolutionary learning, the system comprising:

a feature conversion unit that, in a recognition mode in which the speech recognition system is performing speech recognition, performs feature transformation on a feature vector of recognition target voice data using a predetermined environment parameter;

a voice database that stores the recognition target voice data;

a model conversion unit that, in a standby mode in which the speech recognition system is not performing speech recognition, performs model transformation on an existing acoustic model using the voice data stored in the voice database; and

a recognition unit that, in the recognition mode, performs speech recognition using the feature vector transformed by the feature conversion unit and the acoustic model transformed by the model conversion unit,

wherein speaker adaptation and environment adaptation are performed simultaneously by repeating the feature transformation and the model transformation according to the operating mode of the speech recognition system.

The system of claim 12,

wherein the feature conversion unit determines the environment parameter in advance using a maximum likelihood method.

The system of claim 12,

further comprising an environment identification unit that identifies the environment in which the recognition target voice data was generated by extracting features of ambient noise from a non-voice section other than the voice section detected through voice activity detection.

The system of claim 14,

wherein the voice database stores, for each environment identified by the environment identification unit, the voice data generated in that environment.

The system of claim 15,

wherein the model conversion unit performs model transformation on the existing acoustic model for each identified environment using the voice data stored in the voice database.

The system of claim 16,

further comprising a model database that stores the environment-specific acoustic models produced by the model conversion unit through model transformation for each identified environment.

The system of claim 17,

further comprising a model rearrangement unit that limits the number of acoustic models stored in the model database according to a predetermined threshold.

The system of claim 17,

wherein the recognition unit performs speech recognition using, from among the acoustic models stored in the model database, the acoustic model corresponding to the environment in which the recognition target voice data was generated.