KR20140077780A - Apparatus for adapting language model scale using signal-to-noise ratio - Google Patents

Apparatus for adapting language model scale using signal-to-noise ratio

Info

Publication number
KR20140077780A
Authority
KR
South Korea
Prior art keywords
signal
language model
noise ratio
model scale
present
Prior art date
Application number
KR1020120146911A
Other languages
Korean (ko)
Other versions
KR102020782B1 (en)
Inventor
정훈
전형배
박전규
오유리
강점자
이윤근
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원 (Electronics and Telecommunications Research Institute)
Priority to KR1020120146911A
Publication of KR20140077780A
Application granted
Publication of KR102020782B1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a speech recognition system and, more particularly, to an apparatus for adapting the language model scale to improve recognition performance in such a system. According to the present invention, for a speech signal with a low signal-to-noise ratio, more weight is given to the discriminating power of the language model, which improves recognition performance in noisy environments.

Description

[0001] The present invention relates to a speech recognition system and, more particularly, to a language model scale adaptation apparatus that uses a signal-to-noise ratio to improve recognition performance in a speech recognition system.

Speech recognition technology is now relatively common and is used in a variety of applications. However, now that isolated-word speech recognition has been commercialized, users increasingly demand speech recognition products with more advanced capabilities.

That is, there is a need for keyword spotting, which can recognize a target word even when other words precede or follow it, and for continuous speech recognition, which can recognize natural sentences.

Continuous speech recognition, however, has not yet reached the level users expect.

In other words, apart from the performance of the acoustic model, the question is how good a language model can be applied.

In most cases, the language model is built from text data, collected as a text corpus so as to cover a wide variety of text.

For example, a general-purpose task such as dictation uses newspaper articles, novels, and other material available on the Internet. Even so, the performance of a language model built from such data is limited.

In particular, if the language model is not adequate for a particular application, the performance users expect is difficult to achieve.

The ideal approach is to obtain text data suited to the application domain, but this is difficult in practice.

Many efforts have been made to overcome these problems. Language model adaptation can be seen as one of them.

Meanwhile, acoustic models and language models produce probabilities in different ranges because they are modeled differently, and the language model scale compensates for this difference.

In general, the language model scale is tuned experimentally: the value that gives the best trade-off between speed and recognition performance on a given evaluation corpus and system is used.
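The experimental tuning described above can be sketched as a simple grid search. This is only an illustration under stated assumptions: the function name `pick_lm_scale`, the N-best data layout, and the sentence-error criterion are hypothetical and not taken from the patent, and the sketch ignores the speed side of the trade-off.

```python
def pick_lm_scale(dev_set, candidate_scales):
    """Return the candidate scale with the fewest sentence errors.

    dev_set: list of (reference_text, hypotheses) pairs, where hypotheses
    is a list of (text, acoustic_logprob, lm_logprob) tuples.
    All names and the data layout are assumptions for illustration.
    """
    def sentence_errors(scale):
        errors = 0
        for reference, hypotheses in dev_set:
            # Pick the hypothesis with the best combined log score.
            best = max(hypotheses, key=lambda h: h[1] + scale * h[2])[0]
            errors += best != reference
        return errors

    return min(candidate_scales, key=sentence_errors)


# Toy development set with made-up log-probabilities.
dev_set = [
    ("recognize speech", [("recognize speech", -120.0, -8.0),
                          ("wreck a nice beach", -118.0, -14.0)]),
    ("I scream", [("I scream", -50.0, -9.0),
                  ("ice cream", -52.0, -4.0)]),
]
best_scale = pick_lm_scale(dev_set, [0.0, 0.2, 0.35, 0.5, 1.0])
```

A scale tuned this way is fixed for all inputs, which is the limitation the following paragraphs address.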

In general, when the signal-to-noise ratio is high, the acoustic model discriminates well between hypotheses, but when the signal-to-noise ratio is low, its discriminating power deteriorates.

However, the probability values and discriminating power of the language model remain the same regardless of the quality of the input signal, which a fixed language model scale does not exploit.

1. Korean Patent Publication No. 10-2012-0066530

The present invention has been proposed to solve the problems described in the background art. To maintain stable recognition performance even in a noisy environment, the language model scale is adjusted according to the noise level of the input signal.

Specifically, when the signal-to-noise ratio is good, the probability value of the acoustic model is weighted more; otherwise, the probability value of the language model is weighted more. By adjusting the language model scale so that the discriminating power of the language model is relied on more in noisy conditions, the present invention provides a language model scale adaptation apparatus using a signal-to-noise ratio that improves recognition performance in noisy environments.

To this end, in a language model scale adaptation apparatus that uses the signal-to-noise ratio in speech recognition, the language model scale is adjusted by assigning different weights to the probability value of the acoustic model according to the signal-to-noise ratio.

Another embodiment of the present invention provides a language model scale adaptation method comprising: a speech signal input step of inputting a speech signal; an end point detection step of detecting an end point of the input speech signal; a signal-to-noise ratio measurement step of measuring the signal-to-noise ratio (SNR) of the speech signal once the end point has been detected; a language model scale adaptation step of adapting the language model scale according to the measured signal-to-noise ratio, weighting the probability value of the acoustic model more when the signal-to-noise ratio is good and the probability value of the language model more when it is not; a search space generation step of generating a search space for the speech signal according to the adapted language model scale; and a decoding step of decoding the speech signal in the search space to generate a final speech recognition result.

According to the present invention, recognition performance in noisy environments is improved by giving more weight to the discriminating power of the language model for speech signals with a low signal-to-noise ratio.

That is, when the signal-to-noise ratio is good, the probability value of the acoustic model is weighted more, and when it is not, the probability value of the language model is weighted more. Because the language model scale is adjusted so that the discriminating power of the language model is relied on more in noisy conditions, recognition performance in noisy environments can be improved.

FIG. 1 is a block diagram of a language model scale adaptation apparatus using a signal-to-noise ratio according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a language model scale adaptation process using a signal-to-noise ratio according to an exemplary embodiment of the present invention.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It is to be understood, however, that the invention is not to be limited to the specific embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Like reference numerals are used for similar elements in describing each drawing.

The terms first, second, etc. may be used to describe various components, but the components should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another.

For example, without departing from the scope of the present invention, the first component may be referred to as a second component, and similarly, the second component may also be referred to as a first component. The term "and / or" includes any combination of a plurality of related listed items or any of a plurality of related listed items.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

Terms such as those defined in commonly used dictionaries are to be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

Hereinafter, a language model scale adaptation apparatus using a signal-to-noise ratio according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a language model scale adaptation apparatus using a signal-to-noise ratio according to an embodiment of the present invention. Referring to FIG. 1, the language model scale adaptation apparatus comprises an end point detector 100 for detecting an end point of an input speech signal; a signal-to-noise ratio measuring unit 110 for measuring the signal-to-noise ratio (SNR) of the speech signal; a language model scale adaptation unit 120 for adapting the language model scale according to the measured signal-to-noise ratio, weighting the probability value of the acoustic model more when the signal-to-noise ratio is good and the probability value of the language model more when it is not; a search space generation unit 130 for generating a search space according to the language model scale adapted by the language model scale adaptation unit 120; and a decoding unit 140 for decoding the speech signal in the search space to generate a final speech recognition result.
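One way the signal-to-noise ratio measuring unit 110 could work is sketched below. The source does not say how the SNR is computed, so the method here is an assumption: samples outside the detected end points are treated as noise, and the noise power is subtracted from the power measured inside them. The names `mean_power` and `estimate_snr_db` are hypothetical.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a non-empty list of samples."""
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(samples, start, end, floor=1e-12):
    """Estimate the SNR in dB of samples[start:end].

    Assumption (not from the patent): everything outside [start, end)
    is background noise, and the speech region contains speech plus
    that same noise. `samples` is a plain Python list.
    """
    speech = samples[start:end]
    noise = samples[:start] + samples[end:]
    if not speech or not noise:
        raise ValueError("need both a speech region and a noise region")
    noise_power = max(mean_power(noise), floor)
    # Remove the noise contribution from the speech-region power.
    signal_power = max(mean_power(speech) - noise_power, floor)
    return 10.0 * math.log10(signal_power / noise_power)


# Synthetic example: quiet background with a louder middle segment.
samples = [0.1] * 100 + [1.0] * 100 + [0.1] * 100
snr_db = estimate_snr_db(samples, 100, 200)
```

The `floor` argument only guards against taking the logarithm of zero when a region is silent.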

In general, a probability-based speech recognition system finds the word sequence W that has the maximum a posteriori probability for an input speech signal X, as shown in Equation (1).

$$\hat{W} = \arg\max_{W} P(X \mid W)\, P(W)^{\alpha} \qquad (1)$$

Here, $P(X \mid W)$ is the acoustic model, $P(W)$ is the language model, and $\alpha$ is called the language model scale.

The acoustic model is the probability that each word or phoneme will generate a specific speech signal, and the language model is the probability of occurrence for successive words.

The acoustic model and the language model have different ranges of probabilities due to differences in modeling methods, and the language model scale plays a role of correcting the differences.
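The role of the language model scale can be illustrated with a small sketch. It assumes the common formulation in which the language model probability is raised to the power of the scale, so that in the log domain the scale multiplies the language model log-probability. The function names and the toy scores are made up for illustration.

```python
def combined_log_score(acoustic_logprob, lm_logprob, lm_scale):
    """log P(X|W) + lm_scale * log P(W), the log-domain combined score."""
    return acoustic_logprob + lm_scale * lm_logprob


def best_hypothesis(hypotheses, lm_scale):
    """hypotheses: list of (text, acoustic_logprob, lm_logprob) tuples."""
    return max(
        hypotheses,
        key=lambda h: combined_log_score(h[1], h[2], lm_scale),
    )[0]


# Made-up scores: the second hypothesis fits the audio slightly better,
# the first is far more plausible as a word sequence.
hypotheses = [
    ("recognize speech", -120.0, -8.0),
    ("wreck a nice beach", -118.0, -14.0),
]
acoustic_only = best_hypothesis(hypotheses, 0.0)
with_language_model = best_hypothesis(hypotheses, 1.0)
```

With a scale of 0 the acoustically better hypothesis wins; with a scale of 1 the language model tips the decision the other way, which is why the choice of scale matters.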

In an embodiment of the present invention, a language model scale adaptation scheme based on the signal-to-noise ratio is used. As shown in Equation (2), the language model scale is a function of time t and the signal-to-noise ratio.

Figure pat00004

Figure pat00005

Here, SNR(t) is the signal-to-noise ratio in time frame t, α is the optimal language model scale obtained through experiments, and β is a weighting factor, also obtained through experiments. The sigmoid function is given by the following equation.

Figure pat00006
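The exact form of Equation (2) is not reproduced in this text, so the sketch below uses one hypothetical form that is merely consistent with the description: the scale stays near α when the SNR is high and rises toward α(1 + β) as the SNR falls. The parameters `snr_ref_db` and `width_db`, and the formula itself, are assumptions rather than the patent's definition.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def adapted_lm_scale(snr_db, alpha, beta, snr_ref_db=15.0, width_db=5.0):
    """Hypothetical SNR-dependent language model scale for one frame.

    alpha: base scale tuned experimentally; beta: weighting factor.
    snr_ref_db and width_db are illustrative knobs, not from the patent.
    """
    noisiness = sigmoid((snr_ref_db - snr_db) / width_db)  # in (0, 1)
    return alpha * (1.0 + beta * noisiness)


clean_scale = adapted_lm_scale(40.0, alpha=10.0, beta=0.5)
mid_scale = adapted_lm_scale(15.0, alpha=10.0, beta=0.5)
noisy_scale = adapted_lm_scale(-10.0, alpha=10.0, beta=0.5)
```

Because the sigmoid is bounded, the adapted scale can never exceed α(1 + β), which keeps the language model from overwhelming the acoustic evidence in very noisy frames.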

FIG. 2 is a flowchart illustrating a language model scale adaptation process using a signal-to-noise ratio according to an embodiment of the present invention. Referring to FIG. 2, the language model scale adaptation process includes a speech signal input step (S200) of inputting a speech signal; an end point detection step (S210) of detecting an end point of the input speech signal; a signal-to-noise ratio measurement step (S220) of measuring the signal-to-noise ratio (SNR) of the speech signal; a language model scale adaptation step (S230) of adapting the language model scale according to the measured signal-to-noise ratio, weighting the probability value of the acoustic model more when the signal-to-noise ratio is good and the probability value of the language model more when it is not; a search space generation step of generating a search space for the speech signal according to the adapted language model scale; and a decoding step (S250) of decoding the speech signal in the search space to generate a final speech recognition result.
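The order of the steps can be sketched as a small pipeline in which each stage is supplied as a function. This is a structural illustration only: the `Recognizer` class, its field names, and the stub stages in the example are hypothetical, and choosing among a fixed list of scored hypotheses stands in for real search space generation and decoding.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# (text, acoustic log-probability, language model log-probability)
Hypothesis = Tuple[str, float, float]


@dataclass
class Recognizer:
    detect_endpoints: Callable[[Sequence[float]], Tuple[int, int]]
    measure_snr_db: Callable[[Sequence[float], int, int], float]
    adapt_lm_scale: Callable[[float], float]
    generate_hypotheses: Callable[[Sequence[float]], List[Hypothesis]]

    def recognize(self, samples: Sequence[float]) -> str:
        # S200: `samples` is the input speech signal.
        start, end = self.detect_endpoints(samples)        # S210
        snr_db = self.measure_snr_db(samples, start, end)  # S220
        lm_scale = self.adapt_lm_scale(snr_db)             # S230
        # Stand-in for search space generation.
        hypotheses = self.generate_hypotheses(samples[start:end])
        # S250: pick the hypothesis with the best combined log score.
        best = max(hypotheses, key=lambda h: h[1] + lm_scale * h[2])
        return best[0]


# Stub stages with made-up values, used only to exercise the flow.
recognizer = Recognizer(
    detect_endpoints=lambda s: (0, len(s)),
    measure_snr_db=lambda s, start, end: 3.0,  # pretend the input is noisy
    adapt_lm_scale=lambda snr_db: 1.5 if snr_db < 10.0 else 0.2,
    generate_hypotheses=lambda s: [
        ("recognize speech", -120.0, -8.0),
        ("wreck a nice beach", -118.0, -14.0),
    ],
)
result = recognizer.recognize([0.0] * 160)
```

Passing the stages in as functions keeps the sketch independent of any particular end point detector, SNR estimator, or decoder.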

In particular, the language model scale adaptation method using a signal-to-noise ratio according to an embodiment of the present invention may be implemented as program instructions that can be executed by various computer means and recorded on a computer-readable storage medium.

The computer-readable storage medium may include program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded on the medium may be those specially designed and constructed for the present invention or may be available to those skilled in the art of computer software.

Examples of computer-readable storage media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory.

The medium may be a transmission medium such as an optical or metal line, a wave guide, or the like, including a carrier wave for transmitting a signal designating a program command, a data structure, or the like.

Examples of program instructions include machine language code such as those produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

In addition, one embodiment of the present invention may be implemented in hardware, software, or a combination thereof. In a hardware implementation, it may be implemented with a digital signal processor (DSP), a programmable logic device (PLD), a field programmable gate array (FPGA), a processor, a controller, a microprocessor, another electronic unit designed to perform the above-described functions, or a combination thereof.

In a software implementation, it may be implemented as a module that performs the functions described above. The software may be stored in a memory unit and executed by a processor. The memory unit or processor may employ various means well known to those skilled in the art.

100: End point detector
110: Signal-to-noise ratio measuring unit
120: Language model scale adaptation unit
130: Search space generation unit
140: Decoding unit

Claims (1)

A language model scale adaptation apparatus using a signal-to-noise ratio in a speech recognition system,
wherein the language model scale is adjusted by assigning different weights to the probability value of the acoustic model based on the signal-to-noise ratio.

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020120146911A KR102020782B1 (en) 2012-12-14 2012-12-14 Apparatus for adapting language model scale using signal-to-noise ratio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020120146911A KR102020782B1 (en) 2012-12-14 2012-12-14 Apparatus for adapting language model scale using signal-to-noise ratio

Publications (2)

Publication Number Publication Date
KR20140077780A KR20140077780A (en) 2014-06-24
KR102020782B1 KR102020782B1 (en) 2019-09-11

Family

ID=51129629

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020120146911A KR102020782B1 (en) 2012-12-14 2012-12-14 Apparatus for adapting language model scale using signal-to-noise ratio

Country Status (1)

Country Link
KR (1) KR102020782B1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080010057A1 (en) * 2006-07-05 2008-01-10 General Motors Corporation Applying speech recognition adaptation in an automated speech recognition system of a telematics-equipped vehicle
KR20100138520A (en) * 2009-06-25 2010-12-31 한국전자통신연구원 Speech recognition apparatus and its method
KR20120066530A (en) 2010-12-14 2012-06-22 한국전자통신연구원 Method of estimating language model weight and apparatus for the same

Also Published As

Publication number Publication date
KR102020782B1 (en) 2019-09-11

Legal Events

Date Code Title Description
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant