US8972256B2 - System and method for dynamic noise adaptation for robust automatic speech recognition - Google Patents
- Publication number: US8972256B2
- Application number: US13/274,694
- Authority: US (United States)
- Legal status: Active, expires (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Definitions
- The present invention relates to speech processing, and more specifically to noise adaptation in automatic speech recognition.
- Automatic speech recognition (ASR) systems try to determine a representative meaning (e.g., text) corresponding to speech inputs.
- The speech input is processed into a sequence of digital frames: multi-dimensional vectors that represent various characteristics of the speech signal present during a short time window of the speech.
- Variable numbers of frames are organized as "utterances," each representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
- The ASR system compares the input utterances against statistical acoustic models to find those that best match the vector sequence characteristics, and determines the corresponding representative text associated with those acoustic models. More formally, given some input observations A, the probability that some string of words W was spoken is represented as P(W|A), and the recognizer selects the word string maximizing it:

  W* = arg max_W P(W) P(A|W)
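The decision rule above can be illustrated with a small sketch (the candidate strings and scores below are hypothetical, not part of the claimed system): each candidate word string W is scored by the sum of its language-model log-probability log P(W) and its acoustic log-likelihood log P(A|W), and the highest-scoring candidate is selected.

```python
import math

def best_hypothesis(hypotheses):
    """Pick W* = argmax_W P(W) * P(A|W), working in log-probabilities
    to avoid numerical underflow. `hypotheses` maps each candidate word
    string W to a (log P(W), log P(A|W)) pair -- toy values here."""
    return max(hypotheses, key=lambda w: sum(hypotheses[w]))

# Hypothetical language-model and acoustic scores for two candidates:
scores = {
    "recognize speech": (math.log(0.6), math.log(0.2)),   # 0.6 * 0.2 = 0.12
    "wreck a nice beach": (math.log(0.4), math.log(0.1)), # 0.4 * 0.1 = 0.04
}
print(best_hypothesis(scores))  # -> recognize speech
```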
- the acoustic models are typically probabilistic state sequence models such as hidden Markov models (HMMs) that model speech sounds using mixtures of probability distribution functions (Gaussians). Acoustic models often represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g. triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of a statistical language model.
- the words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses.
- a system may produce a single best recognition candidate—the recognition result—or multiple recognition hypotheses in various forms such as an N-best list, a recognition lattice, or a confusion network. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
- Some ASR systems pre-process the input speech frames (observation vectors) to account for channel effects and noise, for example, using explicit models of noise, channel distortion, and their interaction with speech.
- Many interesting and effective approximate modeling and inference techniques have been developed to represent these acoustic entities and the reasonably well understood but complicated interactions between them. While there are many results showing the promise of these techniques on less sophisticated systems trained on small amounts of artificially mixed data, there has been little evidence that these techniques can improve state of the art large vocabulary ASR systems.
- Dynamic noise adaptation (DNA) is a model-based technique for improving ASR performance in the presence of noise. See Rennie et al., "Dynamic Noise Adaptation," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2006, 14-19 May 2006; Rennie and Dognin, "Beyond Linear Transforms: Efficient Non-Linear Dynamic Adaptation for Noise Robust Speech Recognition," Proceedings of Interspeech 2008, Brisbane, Australia, Sep. 23-26, 2008; Rennie et al., "Robust Speech Recognition Using Dynamic Noise Adaptation," Proc. ICASSP 2011, Prague, Czech Republic, May 22-27, 2011; all incorporated herein by reference.
- Embodiments of the present invention are directed to a speech processing method and arrangement.
- A dynamic noise adaptation (DNA) model characterizes a speech input reflecting effects of background noise.
- A null noise (NN) DNA model characterizes the speech input based on a null noise mismatch condition.
- A model adaptation module performs Bayesian model selection and re-weighting of the DNA model and the null noise DNA model to realize a modified DNA model that characterizes the speech input for automatic speech recognition and compensates for noise to a varying degree depending on the relative probabilities of the DNA model and the null noise DNA model.
- The Bayesian model selection and re-weighting may reflect the competing likelihoods of which model best characterizes the speech input: for example, by averaging the models; by decreasing the probability of the DNA model (e.g., to zero) when it does not best characterize the speech input; and/or by increasing the probability of the DNA model when it does best characterize the input (e.g., by doubling the probability and then subtracting 1).
- The DNA model may include a probability-based noise model reflecting transient and evolving components of a current noise estimate.
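The model-averaging form of the re-weighting described above can be sketched as follows (a minimal illustration with scalar "reconstructions" and made-up likelihood values; the actual system operates on feature vectors): posterior weights for the two models are computed by Bayes' rule and used to mix their speech estimates.

```python
import math

def blend_estimates(x_dna, x_nn, loglik_dna, loglik_nn, prior_dna=0.5):
    """Model-averaged speech estimate: weight each model's reconstruction
    by its posterior probability under Bayes' rule over the two models."""
    log_post_dna = math.log(prior_dna) + loglik_dna
    log_post_nn = math.log(1.0 - prior_dna) + loglik_nn
    m = max(log_post_dna, log_post_nn)          # stabilize the exponentials
    w_dna = math.exp(log_post_dna - m)
    w_nn = math.exp(log_post_nn - m)
    total = w_dna + w_nn
    w_dna, w_nn = w_dna / total, w_nn / total   # normalized posteriors
    return w_dna * x_dna + w_nn * x_nn

# Equal likelihoods and a flat prior reduce to the plain average:
print(blend_estimates(2.0, 4.0, -10.0, -10.0))  # -> 3.0
```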
- FIG. 1 shows various hardware components of an ASR system according to an embodiment of the present invention.
- FIG. 2 shows an arrangement for null noise DNA processing according to an embodiment.
- FIG. 3 shows a graph illustrating use of a hard threshold probability between the competing DNA models.
- FIG. 1 shows various hardware components of an embodiment of an ASR system which uses a language model according to the present invention.
- A computer system 10 includes a speech input microphone 11 which is connected through a suitable preamplifier 13 to an analog-to-digital (A/D) converter 15.
- A front-end DNA pre-processor 17 typically performs a Fourier transform to extract spectral features that characterize the input speech as a sequence of representative multi-dimensional vectors, and performs the DNA analysis and adaptation in a potentially derived feature space.
- A speech recognition processor 12, e.g., an Intel Core i7 processor or the like, is programmed to run one or more specialized computer software processes to determine a recognition output corresponding to the speech input.
- Processor memory 120, e.g., random access memory (RAM) and/or read-only memory (ROM), stores the speech processing software routines, the speech recognition models, and data for use by the speech recognition processor 12.
- The recognition output may be displayed, for example, as representative text on computer workstation display 14.
- Such a computer workstation would also typically include a keyboard 16 and a mouse 18 for user interaction with the system 10.
- Other embodiments include ASR implemented for a mobile device such as a cell phone, ASR for the cabin of an automobile, client-server based ASR, etc.
- FIG. 2 shows a simplified diagram of the DNA architecture (omitting an explicit channel distortion model).
- For a given frame y_t, the interaction model includes a speech observation vector component x_t and a noise component n_t, per Eq. (4).
- The speech model can specifically use a band-quantized Gaussian mixture model (BQ-GMM), which is a constrained, diagonal-covariance Gaussian mixture model (GMM).
- BQ-GMMs have B << S shared Gaussians per feature, where S is the number of acoustic components, and so can be evaluated very efficiently.
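As an illustration of why band quantization is cheap, the sketch below (hypothetical shapes and parameter names, not the patented implementation) evaluates only the B shared one-dimensional Gaussians per band, then assembles each component's log-likelihood by table lookup, costing O(F*B) Gaussian evaluations instead of O(F*S).

```python
import numpy as np

def bq_gmm_loglik(y, band_means, band_vars, comp_index, log_weights):
    """Band-quantized GMM log-likelihood of one frame y.
    Shapes (all illustrative): y (F,), band_means/band_vars (F, B),
    comp_index (S, F) ints in [0, B), log_weights (S,)."""
    # Log-likelihood of y under each shared Gaussian per band: (F, B)
    ll = -0.5 * (np.log(2 * np.pi * band_vars)
                 + (y[:, None] - band_means) ** 2 / band_vars)
    F = y.shape[0]
    # Component log-likelihood = sum over bands of the indexed entries: (S,)
    comp_ll = ll[np.arange(F), comp_index].sum(axis=1) + log_weights
    # Total log-likelihood via a stable logsumexp over components.
    m = comp_ll.max()
    return m + np.log(np.exp(comp_ll - m).sum())
```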
- Noise can be separated into evolving and transient components, which facilitates robust tracking of the noise level during inference.
- The dynamically evolving component of this noise—the noise level—is assumed to be changing slowly relative to the frame rate, and can be modeled as follows:

  p(l_{f,0}) = N(l_{f,0}; β_f, ω²_{f,0}),   (5)
  p(l_{f,τ} | l_{f,τ−1}) = N(l_{f,τ}; l_{f,τ−1}, γ²_f),   (6)

  where l_{f,τ} is a random variable representing the noise level in frequency band f at frame τ.
- The posterior distribution of x and n is Gaussian.
- The posterior distribution of l can be determined by integrating out the speech and transient noise to get a Gaussian posterior likelihood for l, and then combining it with the current noise level prior. This is more efficient than unnecessarily computing the joint posterior of x, n, and l.
- A Minimum Mean Square Error (MMSE) estimate of the underlying clean speech features can then be computed.
- f(y_t) consists of two terms: g(y_t), which is simply the log likelihood ratio of the two models, and c, a bias term equal to the log of the prior ratio of the models.
- Equation (15) does not directly take into account the relative complexity of the models that are competing to explain the observed speech data. When deciding what model best represents the observed test features, it makes sense to penalize model complexity. In this case, one model is actually contained within the other. If the clean model can explain the speech data just as well as the DNA model, then the clean model should have higher posterior probability because it has fewer parameters. Equation (15) estimates a frame-level model posterior for the DNA model which itself evolves stochastically in online fashion to adapt to changing noise conditions.
- The state of the DNA noise model is not affected by the current posterior probability of the competing model.
- In a previous investigation, a competing noise model was introduced to make DNA more robust to abrupt changes in the noise level: when a reset condition was triggered by a high noise model probability, the evolving noise model in DNA would be re-initialized. In embodiments of the present invention, by contrast, the NN model competes with DNA only for influence in the reconstructed speech estimate.
- FIG. 3 shows one example of the use of such a thresholding arrangement.
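One possible re-weighting rule consistent with the description above (an illustrative sketch only, not necessarily the exact thresholding shown in FIG. 3): the DNA weight is driven to zero when the DNA model is not the more probable model, and doubled-then-reduced-by-one when it is.

```python
def dna_weight(p_dna):
    """Sharpened weight for the DNA model's reconstruction: zero when the
    DNA posterior p is at or below 0.5, and 2p - 1 (doubling, then
    subtracting 1) when the DNA model is the more probable one."""
    return 0.0 if p_dna <= 0.5 else 2.0 * p_dna - 1.0

print(dna_weight(0.3))   # -> 0.0   (NN model dominates)
print(dna_weight(0.75))  # -> 0.5
print(dna_weight(1.0))   # -> 1.0   (fully DNA-compensated)
```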
- Embodiments of the present invention improve ASR performance in clean conditions by allowing a noise-free NN speech model to compete with the DNA model.
- Experimental results indicate that use of the NN model improves the Sentence Error Rate (SER) of a state-of-the-art embedded speech recognizer, which utilizes commercial-grade feature-space Maximum Mutual Information (fMMI), boosted MMI (bMMI), and feature-space Maximum Likelihood Linear Regression (fMLLR) compensation, by 15% relative at signal-to-noise ratios (SNRs) below 10 dB, and by over 8% relative overall.
- Embodiments can be implemented in whole or in part as a computer program product for use with a computer system.
- Such implementation may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium.
- the medium may be either a tangible medium (e.g., optical or analog communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques).
- the series of computer instructions embodies all or part of the functionality previously described herein with respect to the system.
- Such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the network (e.g., the Internet or World Wide Web). Of course, some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention are implemented as entirely hardware, or entirely software (e.g., a computer program product).
Abstract
Description
Given a system of statistical acoustic models, this formula can be re-expressed as:

W* = arg max_W P(W) P(A|W)

where P(A|W) corresponds to the acoustic models and P(W) represents the value of a statistical language model reflecting the probability of a given word sequence in the recognition vocabulary occurring.
y(t) = h(t) * x(t) + n(t), (1)

where * denotes linear convolution. In the frequency domain:

|Y|² = |H|²|X|² + |N|² + 2|H||X||N| cos θ, (2)

where |X| and θ_x represent the magnitude and phase spectrum of x(t), and θ = θ_x + θ_h − θ_n. Ignoring the phase term ε = 2|H||X||N| cos θ and assuming that the channel response |H| is constant over each Mel frequency band, in the log Mel spectral domain:
y ≈ f(x + h, n) = log(exp(x + h) + exp(n)), (3)

where y represents the log Mel transform of |Y|². The error of this approximation can be modeled as zero-mean and Gaussian distributed:

p(y | x + h, n) = N(y; f(x + h, n), ψ²). (4)
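Equation (3) is just a soft maximum of the speech-plus-channel and noise log-spectra; a numerically stable sketch using NumPy's `logaddexp` (illustrative values only):

```python
import numpy as np

def interaction(x_plus_h, n):
    """Log-domain interaction y = log(exp(x+h) + exp(n)) per Eq. (3),
    computed with np.logaddexp for numerical stability at large inputs."""
    return np.logaddexp(x_plus_h, n)

# When speech dominates the noise, y is close to the speech level:
print(interaction(10.0, 0.0))  # ~ 10.0000454
# When levels are equal, y sits log(2) above either one:
print(interaction(3.0, 3.0))   # -> 3.0 + log(2)
```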
p(l_{f,0}) = N(l_{f,0}; β_f, ω²_{f,0}), (5)

p(l_{f,τ} | l_{f,τ−1}) = N(l_{f,τ}; l_{f,τ−1}, γ²_f), (6)

where l_{f,τ} is a random variable representing the noise level in frequency band f at frame τ. Note that it is assumed that the noise evolves independently at each frequency band. The transient component of the noise process at each frequency band is modeled as zero-mean and Gaussian:

p(n_{f,τ} | l_{f,τ}) = N(n_{f,τ}; l_{f,τ}, φ²_f). (7)

p(h_{f,τ}) = δ(h_{f,τ} − ĥ_f(τ)), (8)

where ĥ_f(τ) is the current estimate of the channel in frequency bin f at frame τ. The noise level prior for the next frame is approximated as Gaussian:

p(l_{f,τ+1}) ≈ N(l_{f,τ+1}; β_{f,τ+1}, ω²_{f,τ+1}). (9)-(10)
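The noise model of Eqs. (5)-(7) can be simulated directly, which makes its two components concrete: a slowly drifting level plus frame-to-frame transient variation. The sketch below (illustrative parameter values) samples one frequency band.

```python
import random

def simulate_noise(beta, omega0, gamma, phi, T, seed=0):
    """Sample the noise model of Eqs. (5)-(7) in one frequency band:
    the level l follows a Gaussian random walk (std. dev. gamma per
    frame) from an initial level ~ N(beta, omega0^2), and the observed
    noise n adds zero-mean transient variation (std. dev. phi)."""
    rng = random.Random(seed)
    l = rng.gauss(beta, omega0)           # Eq. (5): initial noise level
    levels, noises = [], []
    for _ in range(T):
        levels.append(l)
        noises.append(rng.gauss(l, phi))  # Eq. (7): transient component
        l = rng.gauss(l, gamma)           # Eq. (6): slow level evolution
    return levels, noises

# Small gamma -> the level changes slowly relative to the frame rate:
levels, noises = simulate_noise(beta=-5.0, omega0=1.0, gamma=0.1, phi=0.5, T=100)
```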
A variation of Algonquin can be used to iteratively estimate the conditional posterior of the noise level and speech for each speech Gaussian. Algonquin iteratively linearizes the interaction function about a context-dependent expansion point α_a for a given Gaussian a, usually taken at the current estimates of the speech and noise. Given α_a, the posterior distribution of x and n is Gaussian. Once the final estimate of α_a has been determined, the posterior distribution of l can be determined by integrating out the speech and transient noise to get a Gaussian posterior likelihood for l, and then combining it with the current noise level prior. This is more efficient than unnecessarily computing the joint posterior of x, n, and l.
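The final combination step, multiplying the Gaussian posterior likelihood for l by the current Gaussian noise level prior, follows the standard precision-weighted product-of-Gaussians update; a scalar sketch with illustrative numbers:

```python
def combine_gaussian(prior_mean, prior_var, lik_mean, lik_var):
    """Combine a Gaussian prior on the noise level with a Gaussian
    likelihood (speech and transient noise integrated out), yielding
    the Gaussian posterior: precisions add, and the mean is the
    precision-weighted average."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / lik_var)
    post_mean = post_var * (prior_mean / prior_var + lik_mean / lik_var)
    return post_mean, post_var

# Equal variances -> posterior mean halfway between prior and likelihood:
print(combine_gaussian(0.0, 1.0, 2.0, 1.0))  # -> (1.0, 0.5)
```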
These features can be passed to the ASR backend for speech recognition.
p(M_DNA | y_t) = 1 / (1 + exp(−α f(y_t))), (15)

with f(y_t) = g(y_t) + c and α = 1. This is simply Bayes' rule for a binary random variable, with states M_DNA and M_matched respectively. α can be tuned to control how "sharp" the posterior estimate is. f(y_t) consists of two terms: g(y_t), which is simply the log likelihood ratio of the two models, and c, a bias term equal to the log of the prior ratio of the models.
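This frame-level posterior computation can be sketched as follows (g and c are hypothetical inputs standing in for the log likelihood ratio and log prior ratio):

```python
import math

def model_posterior(g, c, alpha=1.0):
    """Frame-level posterior of the DNA model from the log likelihood
    ratio g and log prior-ratio bias c: p = sigmoid(alpha * (g + c)).
    This is Bayes' rule for a binary model variable; alpha sharpens
    (alpha > 1) or flattens (alpha < 1) the posterior."""
    f = g + c
    return 1.0 / (1.0 + math.exp(-alpha * f))

# Equal likelihoods and equal priors give an even posterior:
print(model_posterior(g=0.0, c=0.0))  # -> 0.5
```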
p(M_DNA | y_0:t) = γ p(M_DNA | y_0:t−1) + (1 − γ) p(M_matched | y_t), γ ∈ (0, 1). (18)
Note that the state of the DNA noise model is not affected by the current posterior probability of the competing model. In a previous investigation a competing noise model was introduced to make DNA more robust to abrupt changes in the noise level. When a reset condition was triggered by a high noise model probability, the evolving noise model in DNA would be re-initialized. But in embodiments of the present invention, the NN model competes with DNA only for influence in the reconstructed speech estimate.
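The recursive smoothing of Eq. (18) is a simple exponential average of frame posteriors; a sketch with illustrative values (the frame posteriors here are made up):

```python
def smooth_posterior(frame_posteriors, gamma=0.9, p0=0.5):
    """Recursive smoothing per Eq. (18):
    p_t = gamma * p_{t-1} + (1 - gamma) * q_t,
    where q_t is the instantaneous frame posterior. gamma in (0, 1)
    controls how quickly the smoothed estimate adapts."""
    p = p0
    out = []
    for q in frame_posteriors:
        p = gamma * p + (1.0 - gamma) * q
        out.append(p)
    return out

# A sustained run of high frame posteriors pulls the estimate up smoothly:
traj = smooth_posterior([1.0] * 20)
```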
- Process DNA_Null_Noise:
  - DNA(speech_input);
  - DNA_NN(speech_input);
  - DNA_select(DNA, DNA_NN).
Claims (12)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/274,694 US8972256B2 (en) | 2011-10-17 | 2011-10-17 | System and method for dynamic noise adaptation for robust automatic speech recognition |
US14/600,503 US9741341B2 (en) | 2011-10-17 | 2015-01-20 | System and method for dynamic noise adaptation for robust automatic speech recognition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/274,694 US8972256B2 (en) | 2011-10-17 | 2011-10-17 | System and method for dynamic noise adaptation for robust automatic speech recognition |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/600,503 Continuation US9741341B2 (en) | 2011-10-17 | 2015-01-20 | System and method for dynamic noise adaptation for robust automatic speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
US20130096915A1 US20130096915A1 (en) | 2013-04-18 |
US8972256B2 true US8972256B2 (en) | 2015-03-03 |
Family
ID=48086575
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/274,694 Active 2034-01-01 US8972256B2 (en) | 2011-10-17 | 2011-10-17 | System and method for dynamic noise adaptation for robust automatic speech recognition |
US14/600,503 Active 2032-02-03 US9741341B2 (en) | 2011-10-17 | 2015-01-20 | System and method for dynamic noise adaptation for robust automatic speech recognition |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/600,503 Active 2032-02-03 US9741341B2 (en) | 2011-10-17 | 2015-01-20 | System and method for dynamic noise adaptation for robust automatic speech recognition |
Country Status (1)
Country | Link |
---|---|
US (2) | US8972256B2 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10075630B2 (en) | 2013-07-03 | 2018-09-11 | HJ Laboratories, LLC | Providing real-time, personal services by accessing components on a mobile device |
US9373324B2 (en) | 2013-12-06 | 2016-06-21 | International Business Machines Corporation | Applying speaker adaption techniques to correlated features |
US9378735B1 (en) * | 2013-12-19 | 2016-06-28 | Amazon Technologies, Inc. | Estimating speaker-specific affine transforms for neural network based speech recognition systems |
US9530408B2 (en) | 2014-10-31 | 2016-12-27 | At&T Intellectual Property I, L.P. | Acoustic environment recognizer for optimal speech processing |
CN109087659A (en) * | 2018-08-03 | 2018-12-25 | 三星电子(中国)研发中心 | Audio optimization method and apparatus |
US11887583B1 (en) * | 2021-06-09 | 2024-01-30 | Amazon Technologies, Inc. | Updating models with trained model update objects |
Citations (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5749068A (en) * | 1996-03-25 | 1998-05-05 | Mitsubishi Denki Kabushiki Kaisha | Speech recognition apparatus and method in noisy circumstances |
US5970446A (en) * | 1997-11-25 | 1999-10-19 | At&T Corp | Selective noise/channel/coding models and recognizers for automatic speech recognition |
US6188982B1 (en) * | 1997-12-01 | 2001-02-13 | Industrial Technology Research Institute | On-line background noise adaptation of parallel model combination HMM with discriminative learning using weighted HMM for noisy speech recognition |
US20020087306A1 (en) * | 2000-12-29 | 2002-07-04 | Lee Victor Wai Leung | Computer-implemented noise normalization method and system |
US20020165712A1 (en) * | 2000-04-18 | 2002-11-07 | Younes Souilmi | Method and apparatus for feature domain joint channel and additive noise compensation |
US20030115055A1 (en) * | 2001-12-12 | 2003-06-19 | Yifan Gong | Method of speech recognition resistant to convolutive distortion and additive distortion |
US20030182114A1 (en) * | 2000-05-04 | 2003-09-25 | Stephane Dupont | Robust parameters for noisy speech recognition |
US20030191636A1 (en) * | 2002-04-05 | 2003-10-09 | Guojun Zhou | Adapting to adverse acoustic environment in speech processing using playback training data |
US20040064314A1 (en) * | 2002-09-27 | 2004-04-01 | Aubert Nicolas De Saint | Methods and apparatus for speech end-point detection |
US20040093210A1 (en) * | 2002-09-18 | 2004-05-13 | Soichi Toyama | Apparatus and method for speech recognition |
US20040158465A1 (en) * | 1998-10-20 | 2004-08-12 | Cannon Kabushiki Kaisha | Speech processing apparatus and method |
US20040260546A1 (en) * | 2003-04-25 | 2004-12-23 | Hiroshi Seo | System and method for speech recognition |
US20050071159A1 (en) * | 2003-09-26 | 2005-03-31 | Robert Boman | Speech recognizer performance in car and home applications utilizing novel multiple microphone configurations |
US20060195317A1 (en) * | 2001-08-15 | 2006-08-31 | Martin Graciarena | Method and apparatus for recognizing speech in a noisy environment |
US20070050189A1 (en) * | 2005-08-31 | 2007-03-01 | Cruz-Zeno Edgardo M | Method and apparatus for comfort noise generation in speech communication systems |
US20070055508A1 (en) * | 2005-09-03 | 2007-03-08 | Gn Resound A/S | Method and apparatus for improved estimation of non-stationary noise for speech enhancement |
US7236930B2 (en) * | 2004-04-12 | 2007-06-26 | Texas Instruments Incorporated | Method to extend operating range of joint additive and convolutive compensating algorithms |
US20090076813A1 (en) * | 2007-09-19 | 2009-03-19 | Electronics And Telecommunications Research Institute | Method for speech recognition using uncertainty information for sub-bands in noise environment and apparatus thereof |
US20090187402A1 (en) * | 2004-06-04 | 2009-07-23 | Koninklijke Philips Electronics, N.V. | Performance Prediction For An Interactive Speech Recognition System |
US20090271188A1 (en) * | 2008-04-24 | 2009-10-29 | International Business Machines Corporation | Adjusting A Speech Engine For A Mobile Computing Device Based On Background Noise |
US20100204988A1 (en) * | 2008-09-29 | 2010-08-12 | Xu Haitian | Speech recognition method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005036525A1 (en) * | 2003-10-08 | 2005-04-21 | Philips Intellectual Property & Standards Gmbh | Adaptation of environment mismatch for speech recognition systems |
US8180635B2 (en) * | 2008-12-31 | 2012-05-15 | Texas Instruments Incorporated | Weighted sequential variance adaptation with prior knowledge for noise robust speech recognition |
- 2011-10-17: US US13/274,694 patent/US8972256B2/en, Active
- 2015-01-20: US US14/600,503 patent/US9741341B2/en, Active
Non-Patent Citations (3)
Title |
---|
Kristjansson, et al. "Towards non-stationary model-based noise adaptation for large vocabulary speech recognition." Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP'01). 2001 IEEE International Conference on. vol. 1. IEEE, 2001. * |
Steven J. Rennie et al. "Dynamic noise adaptation." Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on. vol. 1. IEEE, 2006. * |
Steven J. Rennie, "Graphical Models for Robust Speech Recognition in Adverse Environments", A PhD thesis submit to Department of Electrical and Computer Engineering University of Toronto, 2008. * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170076719A1 (en) * | 2015-09-10 | 2017-03-16 | Samsung Electronics Co., Ltd. | Apparatus and method for generating acoustic model, and apparatus and method for speech recognition |
US10127905B2 (en) * | 2015-09-10 | 2018-11-13 | Samsung Electronics Co., Ltd. | Apparatus and method for generating acoustic model for speech, and apparatus and method for speech recognition using acoustic model |
CN106683663A (en) * | 2015-11-06 | 2017-05-17 | 三星电子株式会社 | Neural network training apparatus and method, and speech recognition apparatus and method |
CN106683663B (en) * | 2015-11-06 | 2022-01-25 | 三星电子株式会社 | Neural network training apparatus and method, and speech recognition apparatus and method |
US11881211B2 (en) | 2020-03-24 | 2024-01-23 | Samsung Electronics Co., Ltd. | Electronic device and controlling method of electronic device for augmenting learning data for a recognition model |
Also Published As
Publication number | Publication date |
---|---|
US9741341B2 (en) | 2017-08-22 |
US20130096915A1 (en) | 2013-04-18 |
US20150199964A1 (en) | 2015-07-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9741341B2 (en) | System and method for dynamic noise adaptation for robust automatic speech recognition | |
Tu et al. | Speech enhancement based on teacher–student deep learning using improved speech presence probability for noise-robust speech recognition | |
US9406299B2 (en) | Differential acoustic model representation and linear transform-based adaptation for efficient user profile update techniques in automatic speech recognition | |
EP2216775B1 (en) | Speaker recognition | |
US8612224B2 (en) | Speech processing system and method | |
US7664643B2 (en) | System and method for speech separation and multi-talker speech recognition | |
US9280979B2 (en) | Online maximum-likelihood mean and variance normalization for speech recognition | |
US8386254B2 (en) | Multi-class constrained maximum likelihood linear regression | |
US10460729B1 (en) | Binary target acoustic trigger detecton | |
EP2189976A1 (en) | Method for adapting a codebook for speech recognition | |
Chowdhury et al. | Bayesian on-line spectral change point detection: a soft computing approach for on-line ASR | |
US9037463B2 (en) | Efficient exploitation of model complementariness by low confidence re-scoring in automatic speech recognition | |
US20070143112A1 (en) | Time asynchronous decoding for long-span trajectory model | |
US10460722B1 (en) | Acoustic trigger detection | |
Stouten et al. | Model-based feature enhancement with uncertainty decoding for noise robust ASR | |
US20040064315A1 (en) | Acoustic confidence driven front-end preprocessing for speech recognition in adverse environments | |
Soe Naing et al. | Discrete Wavelet Denoising into MFCC for Noise Suppressive in Automatic Speech Recognition System. | |
US9478216B2 (en) | Guest speaker robust adapted speech recognition | |
CN113327596A (en) | Training method of voice recognition model, voice recognition method and device | |
Yu et al. | Bayesian adaptive inference and adaptive training | |
Li et al. | Improved cepstra minimum-mean-square-error noise reduction algorithm for robust speech recognition | |
Raj | Real-time pre-processing for improved feature extraction of noisy speech | |
Delcroix et al. | Discriminative feature transforms using differenced maximum mutual information | |
BabaAli et al. | A model distance maximizing framework for speech recognizer-based speech enhancement | |
Shigli et al. | Automatic dialect and accent speech recognition of South Indian English |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: NUANCE COMMUNICATIONS, INC., MASSACHUSETTS. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:RENNIE, STEVEN J.;DOGNIN, PIERRE;FOUSEK, PETR;SIGNING DATES FROM 20111003 TO 20111004;REEL/FRAME:027160/0029 |
 | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
 | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4 |
 | AS | Assignment | Owner name: CERENCE INC., MASSACHUSETTS. Free format text: INTELLECTUAL PROPERTY AGREEMENT;ASSIGNOR:NUANCE COMMUNICATIONS, INC.;REEL/FRAME:050836/0191. Effective date: 20190930 |
 | AS | Assignment | Owner name: CERENCE OPERATING COMPANY, MASSACHUSETTS. Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE ASSIGNEE NAME PREVIOUSLY RECORDED AT REEL: 050836 FRAME: 0191. ASSIGNOR(S) HEREBY CONFIRMS THE INTELLECTUAL PROPERTY AGREEMENT;ASSIGNOR:NUANCE COMMUNICATIONS, INC.;REEL/FRAME:050871/0001. Effective date: 20190930 |
 | AS | Assignment | Owner name: BARCLAYS BANK PLC, NEW YORK. Free format text: SECURITY AGREEMENT;ASSIGNOR:CERENCE OPERATING COMPANY;REEL/FRAME:050953/0133. Effective date: 20191001 |
 | AS | Assignment | Owner name: CERENCE OPERATING COMPANY, MASSACHUSETTS. Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:BARCLAYS BANK PLC;REEL/FRAME:052927/0335. Effective date: 20200612 |
 | AS | Assignment | Owner name: WELLS FARGO BANK, N.A., NORTH CAROLINA. Free format text: SECURITY AGREEMENT;ASSIGNOR:CERENCE OPERATING COMPANY;REEL/FRAME:052935/0584. Effective date: 20200612 |
 | AS | Assignment | Owner name: CERENCE OPERATING COMPANY, MASSACHUSETTS. Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE REPLACE THE CONVEYANCE DOCUMENT WITH THE NEW ASSIGNMENT PREVIOUSLY RECORDED AT REEL: 050836 FRAME: 0191. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNOR:NUANCE COMMUNICATIONS, INC.;REEL/FRAME:059804/0186. Effective date: 20190930 |
 | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8 |