US10204626B2 - Method and apparatus for recognizing speech by lip reading - Google Patents

Method and apparatus for recognizing speech by lip reading

Publication number
US10204626B2
Authority
US
United States
Prior art keywords
dictation
video
vehicle component
audio
control device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US15/976,834
Other versions
US20180261222A1 (en
Inventor
Yuichiro Takayanagi
Masashi Kusaka
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Intellectual Property Corp of America
Original Assignee
Panasonic Intellectual Property Corp of America
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panasonic Intellectual Property Corp of America
Priority to US15/976,834 (patent US10204626B2)
Assigned to PANASONIC INTELLECTUAL PROPERTY CORPORATION OF AMERICA: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KUSAKA, MASASHI; TAKAYANAGI, YUICHIRO
Priority to US16/126,410 (patent US10424301B2)
Publication of US20180261222A1
Application granted
Publication of US10204626B2
Priority to US16/541,110 (patent US10565992B2)
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • the technical field relates generally to a dictation device, server end device and system for performing speech recognition based upon audio and video signals.
  • Audio based speech recognition services such as, for example, Dragon Dictation, Siri, and SILVIA, can transcribe audio signals including voice data representing speech into text to be rendered on a display.
  • image based speech recognition services transcribe speech into words by, for example, recognizing lip motion.
  • a local binary pattern (LBP) of a series of images in video of lip motion is recognized as text by comparison to a database.
  • extracting the LBP from a video can consume a great amount of processing and memory resources.
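To make the LBP idea concrete, the following is a minimal sketch of the basic 8-neighbour local binary pattern on a grayscale image held as a list of rows. The disclosure does not state which LBP variant is used, so the neighbour ordering, the `>=` comparison, and the names `lbp_code` and `lbp_histogram` are assumptions made for illustration only.

```python
from typing import List


def lbp_code(image: List[List[int]], row: int, col: int) -> int:
    """Return the 8-bit LBP code of the interior pixel at (row, col).

    Each of the 8 neighbours contributes one bit: 1 if the neighbour is at
    least as bright as the centre pixel, otherwise 0.
    """
    center = image[row][col]
    # Neighbour offsets, clockwise from the top-left pixel (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image: List[List[int]]) -> List[int]:
    """Return a 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist
```

Every interior pixel of every frame needs eight comparisons, which gives a rough sense of why extraction over a whole video can be costly.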
  • Both types of speech recognition services will be referred to here as speech-to-text services.
  • the present disclosure concerns various embodiments of a dictation device, a server end device, a system including the dictation device and the server end device, and methods which can perform speech-to-text services at a lower error rate in a user friendly manner.
  • a method for generating text from speech includes: receiving an audio signal representing a voice utterance of one or more words; performing an audio based speech recognition algorithm to convert the voice utterance into first dictation; determining if each of the one or more words has a predetermined characteristic; and performing a video based speech recognition algorithm to convert a portion of a video signal representing lip motion associated with the voice utterance associated with the word determined to have the predetermined characteristic into second dictation.
  • the predetermined characteristic can be that the word in the first dictation has fewer than a predetermined number of syllables or that the word is less than a predetermined length or time duration.
  • Determining if each of the words has the predetermined characteristic can further include determining for each of a plurality of prototype candidate words a probability that the respective prototype candidate word can be generated by a feature signal associated with the portion of the audio signal corresponding to the word.
  • the predetermined characteristic can be that none of the plurality of prototype candidate words have a determined probability higher than a predetermined standard.
  • the performing a video based speech recognition algorithm to convert the portion of the video signal into second dictation can further include receiving a portion of the video signal that is synchronized with the portion of the audio signal including the word determined to have the predetermined characteristic.
  • the determining if each of the one or more words has a predetermined characteristic can include comparing each of the words in the first dictation to the corresponding word in the second dictation.
  • the predetermined characteristic can be that the word in the first dictation is different from the corresponding word in the second dictation.
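As a rough illustration of the audio-side characteristics listed above, the sketch below flags a word for video based recognition when it is short (few syllables or brief duration) or when no candidate reached the required probability. The `RecognizedWord` type, the function name, and all threshold values are hypothetical; the disclosure only says the thresholds are predetermined.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    """One word of the first (audio based) dictation. Hypothetical structure."""
    text: str
    syllables: int
    duration_ms: float
    best_probability: float  # highest probability among the candidate words


def has_predetermined_characteristic(
    word: RecognizedWord,
    min_syllables: int = 2,          # assumed threshold
    min_duration_ms: float = 200.0,  # assumed threshold
    probability_standard: float = 0.6,  # assumed threshold
) -> bool:
    """Return True if the word should be re-checked by lip reading."""
    too_few_syllables = word.syllables < min_syllables
    too_short = word.duration_ms < min_duration_ms
    # "None of the candidates is higher than the standard" is equivalent to
    # the best candidate not exceeding the standard.
    low_confidence = word.best_probability <= probability_standard
    return too_few_syllables or too_short or low_confidence
```

Only the portion of the video stream belonging to flagged words then needs to be processed, which limits the cost of lip reading.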
  • One embodiment of a system for generating text from speech includes a dictation device such as a smartphone and a server and/or server end devices. All of the server and server end devices will be referred to here as a server end device for simplicity.
  • the smartphone and server end devices can execute a software application stored locally or on a remote server.
  • the dictation device includes: an audio input device configured to receive a voice utterance including a plurality of words; a video input device configured to receive video of lip motion during the voice utterance; a memory portion; a controller configured according to instructions in the memory portion to generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion; and a transceiver for sending the first data packets to the server end device (a remote apparatus) and receiving second data packets including combined dictation based upon the audio stream and the video stream from the server end device.
  • first dictation generated based upon the audio stream has been corrected by second dictation generated based upon the video stream.
  • the controller can be further configured to render the combined dictation as text on a display.
  • the controller can be further configured to generate synchronization data to synchronize the audio stream and the video stream to be sent by the transceiver to the server end device.
  • the transceiver can be further configured to send the second data packets to a downstream application such as an Internet website or other hardware such as a television or home appliance.
  • the dictation device can be, for example, a smartphone.
  • the server end device includes: a transceiver configured to send and receive data packets to a remote mobile station (the dictation device) via a connection to a network; an interface for adding formatting information to data to generate the data packets to be sent and removing formatting information from received data packets, the received data packets including an audio stream and a video stream associated with a voice utterance of a plurality of words; a controller coupled to the interface; an audio based speech recognition module coupled to the controller and configured to generate first dictation based upon the audio stream received from the remote mobile station; a video based speech recognition module coupled to the controller and configured to generate second dictation based upon the video stream received from the remote mobile station; and a memory including instructions for configuring the controller to generate a combined dictation based upon a comparison between the first dictation and the second dictation.
  • the controller can be configured to determine if at least one of the words in the first dictation has a predetermined characteristic.
  • the combined dictation can be generated based upon the second dictation for the at least one of the words having the predetermined characteristic and based upon the first dictation for the other of the plurality of words.
  • the predetermined characteristic can be that at least one word is not similar to the corresponding at least one word in the second dictation, that the at least one word has less than a predetermined number of syllables or that the at least one word is determined to be less than a predetermined length or time duration.
  • the audio based speech recognition module can generate the first dictation by: extracting a feature signal associated with the audio stream; for each of a plurality of candidate prototype words, determining the probability that the respective candidate prototype word generates the feature signal; and choosing the candidate prototype word having highest probability among the plurality of candidate prototype words.
  • the video based speech recognition module can generate the second dictation by: generating a feature signal parameter sequence such as an LBP from the sequence of image frames in the video stream; matching the feature signal vector to a stored feature signal parameter sequence; determining the probability of each of a plurality of candidate prototype words generating the stored feature vector; and selecting the candidate prototype word of the highest probability as the text in the second dictation.
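Both recognition modules end with the same selection step: pick the candidate prototype word with the highest probability. A minimal sketch follows, assuming the per-candidate probabilities have already been computed and are passed in as a dictionary; the function name and the example words are illustrative only.

```python
from typing import Dict, Tuple


def choose_candidate(probabilities: Dict[str, float]) -> Tuple[str, float]:
    """Return the candidate word with the highest probability and that probability."""
    if not probabilities:
        raise ValueError("at least one candidate prototype word is required")
    word = max(probabilities, key=probabilities.get)
    return word, probabilities[word]


# Example with made-up probabilities for one utterance portion.
best_word, best_probability = choose_candidate({"pie": 0.2, "pi": 0.7, "by": 0.1})
```

Returning the probability alongside the word lets a caller compare it with a predetermined standard afterwards.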
  • the audio based speech recognition module and the video based speech recognition module can be two separate processors included in two separate computing devices configured according to instructions in memory.
  • the modules can be a processor in a single computing device configured according to instructions in memory. In both cases, the computing device(s) will be connected to the controller.
  • the controller can be configured by instruction in memory to perform audio based and video based speech recognition.
  • the controller can be further configured to obtain synchronization data indicating a portion of the video signal that is synchronized with the portion of the audio signal including the words determined to have the predetermined characteristic.
  • the audio stream and video stream in the received data packets can be in the transport or program stream format associated with the various MPEG formats.
  • the transport or program stream includes synchronization data for synchronizing the audio with video for playback.
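One way to picture the synchronization step is as a mapping from the time span of a flagged word in the audio stream to the video frames that overlap that span. The sketch below assumes both streams share a common timeline starting at 0 ms and that the video has a constant frame rate; neither assumption comes from the disclosure, and a real MPEG stream would carry its own timestamps instead.

```python
import math
from typing import List


def video_frames_for_span(start_ms: float, end_ms: float, fps: float) -> List[int]:
    """Return indices of video frames overlapping the interval [start_ms, end_ms)."""
    if start_ms < 0 or end_ms <= start_ms or fps <= 0:
        raise ValueError("expected 0 <= start_ms < end_ms and fps > 0")
    frame_ms = 1000.0 / fps
    first = int(start_ms // frame_ms)              # frame containing the start
    last = int(math.ceil(end_ms / frame_ms)) - 1   # last frame that begins before end_ms
    return list(range(first, last + 1))
```

For example, a word occupying 90 ms to 135 ms of audio at 30 frames per second would map to a small handful of image frames, and only those frames would be passed to the video based recognizer.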
  • a dictation device includes: an audio input device that receives an audio signal representing user voice; a video input device that receives a video signal representative of lip movement associated with the user voice; and a controller configured according to instructions stored in a memory.
  • the controller is configured to: generate a first dictation based on the audio signal and assign a first conversion value based on a first set of conversion criteria; generate a second dictation based on the video data and assign a second conversion value based on a second set of conversion criteria; and generate a variable text conversion value based on the first conversion value and the second conversion value and generate a third dictation based on the variable text conversion value.
  • the controller generates the variable text conversion value based on said first conversion value when a number of syllables in a word is greater than N and based on said second conversion value when the number of syllables is N or less.
  • the controller generates the variable text conversion value based on said second conversion value during a period when a length of syllable of a word is less than M.
  • the audio input device is disabled when a signal to noise ratio between the audio signal and a background noise is below a predetermined threshold.
  • the video input device is disabled when a signal to brightness ratio is below a predetermined threshold.
  • if the volume of the audio signal is lower than a predetermined value, the controller generates the variable text conversion value without the first conversion value.
  • when the video input device detects no lip movement, the audio input device is disabled.
  • said controller generates a text conversion value based on said first conversion value, said second conversion value and global positioning system (GPS) data.
  • the first set of conversion criteria can include pre-registered data representing a value associated with the user voice.
  • the second set of conversion criteria can include pre-registered data representing a value associated with the user voice.
  • the variable text conversion value is generated based on predetermined criteria which include pre-registered data representing a value of the user voice.
  • the controller is configured to prioritize either the first conversion value or the second conversion value by a predetermined setting.
  • the predetermined setting can be assigned by a user.
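The syllable rule and the low-volume rule for the variable text conversion value can be sketched as a per-word choice between the audio based and video based results. The disclosure does not define how a conversion value is represented, so this sketch simply treats the choice as selecting which dictation supplies each word; `select_source`, `third_dictation`, and the tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple


def select_source(
    syllables: int,
    n: int,
    audio_volume: Optional[float] = None,
    min_volume: Optional[float] = None,
) -> str:
    """Return "audio" or "video" for one word.

    Audio is used when the word has more than n syllables; video is used when
    it has n or fewer, or when the audio volume is below the minimum.
    """
    if audio_volume is not None and min_volume is not None and audio_volume < min_volume:
        return "video"  # generate without the first (audio based) conversion value
    return "audio" if syllables > n else "video"


def third_dictation(words: List[Tuple[str, str, int]], n: int) -> str:
    """Build the third dictation from (audio_text, video_text, syllables) tuples."""
    chosen = [
        audio_text if select_source(syllables, n) == "audio" else video_text
        for audio_text, video_text, syllables in words
    ]
    return " ".join(chosen)
```

A user-assigned priority setting, as described above, could be modelled as an extra argument that overrides the returned source.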
  • FIGS. 1A-1B are diagrams illustrating exemplary operations of a dictation device.
  • FIG. 1C is a schematic diagram illustrating first and second dictations generated by the dictation device.
  • FIG. 2 is a schematic diagram illustrating an exemplary simplified and representative operating environment of a system in which a mobile station communicates with a server via the Internet.
  • FIG. 3 is a block diagram of exemplary portions of a dictation device according to exemplary embodiments.
  • FIG. 4 is a block diagram of exemplary portions of a server end device according to exemplary embodiments.
  • FIG. 5A is a flow diagram illustrating exemplary operations of the system according to an exemplary embodiment.
  • FIG. 5B is a flow diagram illustrating exemplary operations of the system according to an exemplary embodiment.
  • FIG. 6 is a flow diagram illustrating exemplary operations of the system according to an exemplary embodiment.
  • FIG. 7 is a flow diagram illustrating an exemplary process for performing speech-to-text based upon images representing lip motion.
  • FIG. 8A is a flow diagram illustrating exemplary operations of the system according to an exemplary embodiment.
  • FIG. 8B is a schematic diagram illustrating first and second dictations generated by the dictation device.
  • FIG. 9 is a schematic illustration of a voice commanded vehicle system incorporating the dictation device and system.
  • the present disclosure concerns a system including a dictation device for performing speech-to-text services and a server end device.
  • the dictation device may be a mobile station such as, for example, a smart phone, a portable computer or an in-vehicle user apparatus.
  • the server end device can be a server or one or more apparatus connected to the server or a combination of both communicating with the mobile station. Either combination of the server and one or more apparatus connected to the server will be referred to here as a server end device for simplicity.
  • the present disclosure also concerns processors for the mobile station and server end device, memories associated with the processors, and computer readable instructions stored in the memories for configuring the processors. More particularly, various inventive concepts and principles are embodied in systems, apparatuses, and methods for configuring a mobile station to be able to provide more accurate and user friendly speech-to-text recognition services.
  • the computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which suitably execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in block diagrams or flowchart steps discussed below.
  • the computer program instructions may also be stored in a computer usable or computer-readable non-transitory memory that can suitably direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the memory produce an article of manufacture, preferably including instruction means that suitably implement the function specified in the flowchart block or blocks.
  • the portions may be modules implemented by a software or hardware component, such as a field programmable gate-array (FPGA) or application-specific integrated circuit (ASIC), which performs certain tasks.
  • a module may preferably be configured to reside on an addressable storage medium and preferably be configured to execute on one or more processors.
  • an exemplary module may preferably include, but may not be limited only to, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
  • the functionality provided for in the components and modules may be suitably combined into fewer components and modules or further separated into additional components and modules.
  • the dictation device 11 may be implemented within a mobile station such as a smartphone.
  • a problem associated with an audio speech-to-text service will be discussed.
  • a user 10 of the dictation device 11 inputs a voice utterance into an audio input device such as, for example, a microphone associated with the dictation device 11 .
  • the spoken voice utterance is “Please state Pi up to 100 digits.”
  • the voice utterance is converted into an audio stream 12 that includes a number of portions (N).
  • Each of the N portions may be, for example, an audio frame of predetermined time duration (t) such as, for example, 45 ms.
  • each audio frame may represent a certain predetermined number of syllables, words or phonemes.
  • An audio based speech recognition service is used to transcribe the speech into text, which is rendered on the display 14 A associated with the dictation device 11 .
  • the audio based speech recognition service fails to accurately transcribe the words “Pi” and “one hundred”.
  • when a user 10 of the dictation device 11 speaks into an audio input device such as, for example, a microphone associated with the mobile station, not only is the input speech converted into the audio stream 12 , but video including a series of images of lip motion of the user 10 is also recorded and converted into a video stream 16 that includes N portions.
  • the N portions may be image frames. Also, the number N of image frames can be different from the number N of audio frames.
  • the audio stream 12 and the video stream 16 are converted into first and second dictations textually representing the plurality of words in the utterance.
  • the dictation device 11 can recognize that certain words such as “Pi” and “one hundred” are not likely to be transcribed accurately by the audio based speech recognition service in accordance with a predetermined criteria or characteristic. However, the second dictation obtained from the video stream by video based speech recognition (lip reading) can be used to correct such words. Thus, the words “Pi” and “one hundred” are accurately transcribed into combined dictation and rendered in the display 14 B. The transcribed text can then be input into another downstream application for, for example, displaying Pi to 100 digits as requested.
  • the predetermined criteria can be that a comparison 19 is performed for each word (such as word 123 ) in the first dictation to determine if it is different from the corresponding word (such as word 163 ) in the second dictation.
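The word-by-word comparison can be sketched as follows, assuming the two dictations are already aligned so that word i of the first dictation corresponds to word i of the second. That one-to-one alignment, the case-insensitive comparison, and the misrecognized example words are assumptions for illustration; the disclosure does not say how the audio service mis-transcribes the utterance.

```python
from typing import List, Tuple


def combine_dictations(first: List[str], second: List[str]) -> Tuple[List[str], List[int]]:
    """Return the combined dictation and the positions corrected from video.

    Where the two dictations differ, the word from the second (video based)
    dictation replaces the word from the first (audio based) dictation.
    """
    if len(first) != len(second):
        raise ValueError("dictations must be aligned word for word")
    combined: List[str] = []
    corrected: List[int] = []
    for index, (audio_word, video_word) in enumerate(zip(first, second)):
        if audio_word.lower() != video_word.lower():
            combined.append(video_word)
            corrected.append(index)
        else:
            combined.append(audio_word)
    return combined, corrected


# Hypothetical misrecognitions of "Pi" and "hundred" by the audio recognizer.
first_dictation = ["Please", "state", "pie", "up", "to", "one", "hunted", "digits"]
second_dictation = ["Please", "state", "Pi", "up", "to", "one", "hundred", "digits"]
combined_words, corrected_positions = combine_dictations(first_dictation, second_dictation)
```

Keeping the list of corrected positions makes it possible to highlight or log which words came from lip reading.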
  • the dictation device 11 can be implemented by a mobile station such as a smartphone which includes a video camera and/or microphone for recording the user's voice utterances and lip movement.
  • the mobile station can also include one or more processors configured to perform the audio and video based speech recognition.
  • the system can be implemented by a plurality of distributed devices such as a mobile station 202 , a server 206 and server end modules 208 , 210 .
  • This system will be referred to as a distributed device format.
  • the mobile station 202 generates audio and video streams of the voice utterances and lip movement of the user and transmits the streams to the server 206 via a network such as, for example, the Internet 204 .
  • the server 206 utilizes the server end module 208 depicted as “Audio Speech recognition” to perform audio based speech-to-text recognition and the server end module 210 depicted as “Video Speech recognition” to perform video based speech recognition.
  • although each of the server end modules 208 , 210 is shown as a separate module connected to the server 206 , it should be noted that the server end modules 208 , 210 may be a portion of the server 206 . Further, one server end module may perform both audio and video based recognition. Also, the server end module(s) may be connected to the server 206 via the Internet 204 and/or via another server not shown. Moreover, the server end module(s) may utilize online databases of common feature patterns in order to recognize certain words.
  • the mobile station 300 can include a transceiver 302 , an interface 304 , a controller 306 , a memory 308 , an audio input device 314 , a video input device 316 , a digital audio/video processor 318 , a display 320 , and a common bus 312 .
  • the transceiver 302 provides the physical layer functions such as modulating packet bits into electromagnetic waves to be transmitted and demodulating received waves into packet bits to be processed by higher layers.
  • the transceiver 302 can include radio technology circuitry such as, for example, ZigBee, Bluetooth and WiFi.
  • the transceiver 302 may also include Ethernet and a USB connection.
  • the transceiver 302 can include an antenna portion capable of receiving and transmitting the electromagnetic waves from and to, for example, an access point of the mobile station.
  • the antenna portion can also be separate from the transceiver 302 .
  • the antenna portion can include, for example, an RF antenna, a coil antenna and/or a capacitive plate antenna.
  • the interface 304 can provide the data link layer and network layer functions of the mobile station 300 such as formatting the packet bits to an appropriate format for transmission by the transceiver 302 or received packet bits into an appropriate format for processing by the controller 306 .
  • the interface 304 can be configured in accordance with the 802.11 media access control (MAC) protocol and the TCP/IP protocol.
  • packet bits are encapsulated into frames for transmission and the encapsulation is removed from received frames.
  • error control and addressing are introduced to ensure end-to-end delivery.
  • both the interface 304 and the transceiver 302 may be implemented by a network interface consisting of a few integrated circuits.
  • the memory 308 can be one or a combination of a variety of types of memory or computer readable medium such as random access memory (RAM), read only memory (ROM), flash memory, dynamic RAM (DRAM), hard disk drive (HDD) or any type of non-transitory memory.
  • the memory 308 generally includes instructions for configuring the controller 306 as well as a basic operating system, executable code, and data and variables.
  • the bus 312 is a common bus for providing communications between the portions of the mobile station 300 and the controller 306 .
  • the display 320 can be a conventional liquid-crystal display unit for displaying transcribed text.
  • the audio input device 314 can include a microphone internal to the mobile station 300 and/or a connection for an external microphone and a processor for processing the sound signal.
  • the video input device 316 can include a video camera or be coupled to a video camera external to the mobile station 300 to generate digital video.
  • the video input device 316 can include a CCD image sensor that generates image information by capturing a subject image formed through a lens tube. Light from the subject that becomes an object to be captured passes through the lens tube and then forms an image on the light reception surface of the CCD image sensor. The formed subject image can be converted into R, G, or B color information. As a result of the conversion, image information that indicates an entire subject image is generated.
  • the CCD image sensor can generate image information of a new frame at each constant time.
  • the video input device 316 can include an 8 megapixel iSight camera having 1.5 micron pixels, 1.2 megapixel photos and capability to capture 1080p HD at 60 fps.
  • the digital audio/video processor 318 can perform conversion between analog and digital and other necessary processing before storing the digital audio and video data in the memory 308 or an external storage medium or memory as an audio stream and video stream.
  • the controller 306 is the main processing unit for sending control signals and exchanging data with the other portions of the mobile station 300 via, for example, the bus 312 .
  • the controller 306 can be a general purpose CPU or an application specific integrated circuit (ASIC).
  • the memory 308 and controller 306 can be part of the ASIC.
  • the controller 306 , memory 308 , bus 312 , interface 304 and digital audio/video processor 318 will likely be integrated onto a single chip set, referred to as a system on chip (SoC), including a quad-core or dual-core CPU.
  • the controller 306 can alternatively be configured to perform the A/D conversion, etc. so that the digital audio/video processor 318 can be eliminated.
  • the mobile station 300 can only perform some of the necessary steps of the speech-to-text services.
  • the mobile station 202 generates the voice and video streams while the speech-to-text recognition is performed by the server and/or server end modules.
  • a server 206 is connected to or includes one or more modules 208 , 210 for performing audio and video based speech recognition.
  • the mobile station 202 can exchange data with the server 206 via a connection to the Internet 204 .
  • the mobile station 202 can send first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion to the server 206 .
  • the mobile station 202 can receive second data packets including combined dictation based upon the audio stream and the video stream from the server 206 .
  • the mobile station 300 can perform all of the speech-to-text services.
  • the controller 306 can be configured by the instructions in the memory 308 to perform audio based speech recognition and/or video based speech recognition in accordance with the discussion of the audio based recognition module and the lip reading based recognition module below with reference to FIG. 4 and to perform the routines discussed below with reference to the flow diagrams of FIGS. 5A-7 .
  • the mobile station 300 can include the capability to perform all of the speech-to-text services, but can choose to use more powerful resources available from the server 206 depending on the quality of the connection to the network.
  • the one or more modules for performing audio speech and video speech recognition and the server are depicted by server end device 400 .
  • the server end device 400 can include a transceiver 402 , an interface 404 , a controller 406 , a memory 408 , an audio (speech-to-text) recognition module 410 , a video (lip reading) recognition module 412 , and a common bus 414 .
  • the transceiver 402 of the server end device 400 provides the physical layer functions and the interface 404 provides the data link layer and network layer functions.
  • the interface 404 and the transceiver 402 may be implemented by a network interface consisting of a few integrated circuits.
  • the memory 408 can be one or a combination of a variety of types of memory or computer readable medium such as RAM, ROM, DRAM, HDD or any type of non-transitory memory.
  • the memory 408 includes instructions for configuring the controller 406 as well as a basic operating system, executable code, and data and variables.
  • the memory 408 may include: (1) predetermined feature signals (e.g., sets of acoustic feature vector signals for audio streams and LBP features for video streams); (2) reference patterns representative of identified words or phrases of a vocabulary set; (3) one or more recognition criteria for comparing feature signals of voice stream and video stream portions to the reference patterns; and (4) pre-registered data representing a value associated with the user voice.
  • For example, one recognition criterion may be a hidden Markov model for each reference pattern.
  • the controller 406 is the main processing unit for sending control signals and exchanging data with the other portions via, for example, the bus 414 .
  • the controller 406 can receive first and second dictations from the modules 410 , 412 via bus 414 .
  • the modules 410 , 412 can receive the audio and video streams in the data packets from the interface 404 via bus 414 .
  • the controller 406 can be configured by the instructions in the memory 408 to perform audio based speech recognition and/or video based speech recognition in accordance with a routine discussed below with reference to FIGS. 5A-5B, 6 and 7 .
  • the bus 414 is a common bus for providing communications between the controller 406 and the other portions of the server end device 400 .
  • the audio based recognition module 410 and the video based recognition module 412 can include one or more processors configured according to instructions stored in, for example, the memory 408 for performing an audio based speech recognition algorithm to convert the audio stream into first dictation and a video based speech recognition algorithm to convert the video stream into second dictation.
  • both modules 410 , 412 can alternatively be implemented by the controller 406 .
  • both modules 410 and 412 can perform dictation by generating a feature signal from the audio and video stream, matching the feature signal to predetermined feature signal vectors, determining a probability of the predetermined feature signal vector being generated by a reference pattern model for a word or phrase and selecting the word associated with a highest probability.
  • the reference pattern models and predetermined feature signals can be stored in the memory 408 .
  • the reference pattern models and/or predetermined feature signals may have been generated based upon a user speaking a training sequence.
  • the modules 410 , 412 can alternatively be configured to refer to databases on other devices via a connection to, for example, the Internet in order to achieve more robust speech recognition.
  • the algorithm can be, for example, a Hidden Markov Model Speech Recognition Arrangement algorithm as discussed in, for example, U.S. Pat. No. 4,587,670 to Levinson et al, the contents of which are incorporated by reference.
  • the module 410 can be configured to group the audio stream digitally representing the voice input into successive overlapping frames of predetermined time duration such as, for example, 45 ms.
  • the module 410 generates a time frame sequence of acoustic feature signals such as linear prediction coefficient (LPC) vectors from each frame.
  • the module 410 can also be configured to determine the endpoint of the input speech pattern based upon energy of the feature signal sequence.
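The framing and endpoint steps described for the module 410 can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: only the 45 ms frame duration comes from the text, while the 15 ms hop between frames, the energy threshold, and the function names `split_frames`, `frame_energy` and `find_endpoint` are hypothetical.

```python
def split_frames(samples, sample_rate, frame_ms=45, hop_ms=15):
    """Group samples into successive overlapping frames.

    frame_ms=45 follows the text; hop_ms=15 is an assumed overlap setting.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def frame_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(s * s for s in frame)


def find_endpoint(frames, threshold):
    """Return the index of the last frame whose energy exceeds the threshold,
    or None if no frame does (an assumed endpoint rule)."""
    endpoint = None
    for index, frame in enumerate(frames):
        if frame_energy(frame) > threshold:
            endpoint = index
    return endpoint
```

A practical system would likely smooth the energy contour before thresholding; the hard threshold here only illustrates the idea of locating the end of the speech pattern from frame energy.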
  • the module 410 compares the generated feature signals to feature signals stored in memory 408 (or in another device such as an online database) and selects the closest corresponding stored feature signal for each speech pattern feature signal as a vector quantized feature signal for every successive frame of the speech pattern.
  • the closest corresponding stored feature signal can be determined based upon the Itakura distance metric signal.
  • Each of the quantized feature signals is one of a finite set of M symbols that may be input to hidden Markov model processing.
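The vector quantization step can be illustrated with a short sketch. Note the substitution: the text names the Itakura distance metric, whereas this example uses a squared Euclidean distance as a simple stand-in so that it stays self-contained. The `distance` parameter, the codebook contents and the function names are assumptions made for illustration.

```python
def squared_euclidean(a, b):
    """Stand-in distance; the text names the Itakura distance instead."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(feature_vectors, codebook, distance=squared_euclidean):
    """Map each feature vector to the index of its closest codebook entry.

    The returned indices play the role of the finite set of M symbols
    that are fed to hidden Markov model processing.
    """
    symbols = []
    for vector in feature_vectors:
        best = min(range(len(codebook)),
                   key=lambda m: distance(vector, codebook[m]))
        symbols.append(best)
    return symbols
```

Because the distance is passed in as a parameter, a different metric can be supplied without changing the quantization loop.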
  • the module 410 can then determine a probability that a sequence was generated by the reference pattern model.
  • the probability that the sequence O was generated by a reference pattern model K can be determined in accordance with the following Formula (1):

$$P(O \mid K) = P_{i_1}\, b_{i_1}(O_1)\, a_{i_1 i_2}\, b_{i_2}(O_2) \cdots a_{i_{T-1} i_T}\, b_{i_T}(O_T) \tag{1}$$

  • where $i_1, i_2, \ldots, i_T$ is the maximum likelihood sequence of Markov model states and $O_1, O_2, \ldots, O_T$ is the observed sequence.
  • the input speech pattern may be identified as the reference pattern model corresponding to the highest probability signal.
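Formula (1) evaluates the product of initial, transition and output probabilities along the maximum likelihood state sequence. A minimal sketch of that evaluation, using the standard Viterbi recursion, is shown below. The toy model parameters, the dictionary layout of `models` and the function names are assumptions; the text does not specify them.

```python
def best_path_probability(initial, transition, emission, observations):
    """Probability of the observations along the most likely state path.

    initial[s]        : probability of starting in state s
    transition[p][s]  : probability of moving from state p to state s
    emission[s][o]    : probability of state s producing symbol o
    """
    n_states = len(initial)
    # score[s] holds the probability of the best path ending in state s.
    score = [initial[s] * emission[s][observations[0]]
             for s in range(n_states)]
    for symbol in observations[1:]:
        score = [
            max(score[prev] * transition[prev][s] for prev in range(n_states))
            * emission[s][symbol]
            for s in range(n_states)
        ]
    return max(score)


def recognize(models, observations):
    """Return the word whose reference pattern model scores highest."""
    return max(models,
               key=lambda word: best_path_probability(*models[word],
                                                      observations))
```

For long observation sequences, an implementation would normally work with log probabilities to avoid numerical underflow; plain products are used here only to mirror Formula (1).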
  • the video based recognition module 412 can be configured to group the digital video stream representing the lip motion into a sequence of successive overlapping image frames of 45 ms duration. In each image frame, the mouth region is preferably cropped from the face. Similarly to the audio based speech recognition process, feature signals are generated by being extracted from the sequence of image frames.
  • a feature signal can be generated from LBP texture descriptors extracted from the sequence of image frames when analyzed in the XYT space, where XY stands for the image plane and T for the temporal positions of individual frames.
  • the movement of a talking mouth is considered as a continuous process, and the speech video is viewed as a set of images sampled at a fixed rate along a curve representing the voice utterance in the space of the visual features extracted from the images.
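As a rough illustration of an LBP texture descriptor, the sketch below computes the basic 8-neighbour local binary pattern in the XY image plane only. The XYT analysis described in the text would apply a similar operator across time as well, which is not shown. The neighbour ordering, the `>=` comparison and the function names are assumptions.

```python
def lbp_code(image, row, col):
    """8-neighbour local binary pattern code for one interior pixel."""
    center = image[row][col]
    # Neighbours visited clockwise from the top-left corner (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    histogram = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            histogram[lbp_code(image, row, col)] += 1
    return histogram
```

The histogram of codes over the cropped mouth region is one plausible form for the per-frame feature signal.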
  • the module 412 can then identify a reference pattern similar to the process for voice recognition. Particularly, the module 412 can compare the generated feature signals to feature signals stored in memory 408 (or in another device such as an online database) and select the closest corresponding stored feature signal for each lip motion feature signal as a vector for every successive frame. The module 412 can then determine a probability that a sequence was generated by the reference pattern model.
  • the input speech pattern may be identified as the reference pattern model corresponding to the highest probability signal.
  • the probability can be determined by a dynamic Bayesian network (DBN) model or a hidden Markov model.
  • the unknown speech pattern is identified as the closest matching reference pattern in accordance with predetermined recognition criteria.
  • the accuracy of such recognition systems is highly dependent on the selected features and the recognition criteria.
  • the server end device 400 and the mobile station 300 can communicate with each other by exchanging data within, for example, media access control (MAC) frame messages in accordance with the message format defined by IEEE Std. 802.11.
  • Each frame includes a MAC header, a frame body and a frame check sequence (FCS).
  • the header can include frame control, duration, address, optional sequence control information, optional quality of service control information, and optional control fields.
  • the frame body can be a variable-length frame body and contains information specific to the frame type and subtype.
  • the FCS can include an IEEE 32-bit CRC.
  • the frame body of the messages from the mobile station 300 can include the video and audio streams.
  • the frame body of the messages from the server end device 400 can include the transcribed text.
  • the header can indicate if the message is a control, data or management frame.
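The header, frame body and frame check sequence arrangement can be sketched as below. This is a toy frame, not the actual IEEE 802.11 layout: the two-byte header, the little-endian FCS byte order and the function names are assumptions. The example relies on `zlib.crc32` from the Python standard library, which computes the widely used 32-bit CRC.

```python
import struct
import zlib


def build_frame(header: bytes, body: bytes) -> bytes:
    """Append a 32-bit CRC, computed over header and body, as the FCS."""
    fcs = zlib.crc32(header + body) & 0xFFFFFFFF
    return header + body + struct.pack("<I", fcs)


def check_frame(frame: bytes) -> bool:
    """Recompute the CRC over everything but the last 4 bytes and compare."""
    payload, received = frame[:-4], frame[-4:]
    expected = struct.pack("<I", zlib.crc32(payload) & 0xFFFFFFFF)
    return expected == received
```

A receiver would discard any frame for which `check_frame` returns `False` before extracting the audio and video streams from the frame body.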
  • the mobile station 300 can send the audio and video stream to the server end device 400 in a predetermined data stream format such as a Moving Picture Experts Group (MPEG) data stream (program stream, transport stream, etc.).
  • the format is sufficient for recognizing the time synchronization between the video stream and the audio stream.
  • this allows the controller to recognize, in an efficient manner, which time in the video stream is the same as the time in the audio stream at which the words “Pi” and “one hundred” were stated.
  • An advantage of the MPEG format is that the time relationship between the audio and video streams is maintained. However, other data formats may also be sufficient.
  • the controller 406 can be configured to generate a data set including combined dictation from the first dictation and the second dictation.
  • the interface 404 can add formatting information to the data set to generate the data packets to be sent to the mobile station by the transceiver 402 via a connection to a network such as the Internet.
  • both audio and video-based recognition are used, and a combined dictation is generated based upon a comparison between each of the words of the first and second dictations.
  • the audio input device 314 receives an audio signal representing a voice utterance of one or more words.
  • the video input device 316 receives video of lip motion during the voice utterance. Although described as separate steps, 502 and 504 may, and likely will, occur simultaneously.
  • the controller 306 and interface 304 generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion. As mentioned above, the data packets may include an MPEG stream.
  • the transceiver 302 sends the first data packets to the server end device 400 .
  • the audio based recognition module 410 performs an audio based speech recognition algorithm to convert the voice utterance in the audio stream into first dictation.
  • the video based recognition module 412 performs a video based speech recognition algorithm to convert the video stream into second dictation.
  • the controller 406 and interface 404 can extract the audio and video streams from the first data packets and pass them to modules 410 , 412 .
  • Each of the first and second dictations is a textual representation of the words in the voice utterance.
  • the controller 406 determines if each of the words in the first dictation has a predetermined characteristic. Particularly, the controller 406 compares the word of the first dictation with the corresponding word of the second dictation to determine if they are different. At 512 , if a word of the first dictation is different from the corresponding word of the second dictation, the word of the second dictation is chosen for the combined dictation. For example, as shown in FIG. 1C , word 121 , word 122 , word 124 , word 125 and word 126 of first dictation 12 are the same as corresponding word 161 , word 162 , word 164 , word 165 and word 166 of second dictation 16 . Word 123 “Fi” is different from word 163 “Pi”. Thus, in the combined dictation, the word 123 in the first dictation is replaced by word 163 , or only second dictation 16 is used.
  • the controller 406 generates combined dictation, in which each word in the first dictation having the predetermined characteristic has been corrected by the corresponding word in the second dictation.
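The word-by-word comparison can be sketched as follows. The example assumes that the two dictations are already aligned so that words at the same position correspond; the sample sentence and the function name `combine_dictations` are hypothetical.

```python
def combine_dictations(first, second):
    """Return (combined words, indices corrected from the second dictation).

    Assumes both dictations are already aligned word for word.
    """
    combined = []
    corrected = []
    for index, (audio_word, video_word) in enumerate(zip(first, second)):
        if audio_word != video_word:
            # The words differ, so the word from the video based
            # (second) dictation is chosen.
            combined.append(video_word)
            corrected.append(index)
        else:
            combined.append(audio_word)
    return combined, corrected
```

With this rule the combined dictation always equals the second dictation, which matches the remark in the text that only the second dictation may be used; the list of corrected indices is the extra information the comparison yields.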
  • the controller 406 and interface 404 can generate second data packets including the combined dictation to be sent by the transceiver 402 to the mobile station 300 .
  • the controller 306 can render the combined dictation as text on the display 320 and/or send the second data packets as input to a downstream application such as an Internet website or a control command to other devices such as a television or other home appliance.
  • the audio input device 314 receives input speech (voice utterance). If the input speech is analog data, the audio input device 314 and processor 318 can convert the analog voice data into digital data and (together with controller 306 ) generate an audio stream including the input speech.
  • the controller 306 and interface 304 generate first data packets including an audio stream representative of the voice utterance.
  • the transceiver 302 sends the first data packets to the server end device 400 .
  • the audio based recognition module 410 performs the audio based speech recognition algorithm to convert the voice utterance in the audio stream into first dictation.
  • the controller 406 analyzes each word of the first dictation to determine if it has a predetermined characteristic. For example, here it can be determined if the word is shorter than a predetermined time or includes fewer than a predetermined number of syllables. Such shorter words may have a higher error rate when transcribed by voice based recognition services. The end of a word or the number of syllables can be determined by extracting a feature signal of each word.
  • if the word is determined to not have the predetermined characteristic (NO at 522 ), only the first dictation is used.
  • if the word is determined to have the predetermined characteristic (YES at 522 ), the portion of the video stream corresponding to this portion of the audio stream is input to the video based recognition module 412 .
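The decision at 522 can be illustrated with the sketch below. The text determines word length and syllable count from a feature signal of each word; this example substitutes a crude text based vowel-group heuristic for the syllable count so that it stays self-contained. The 300 ms and two-syllable limits, the `(word, start_ms, end_ms)` tuple layout and all function names are assumptions.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def has_predetermined_characteristic(word, duration_ms,
                                     min_duration_ms=300, min_syllables=2):
    """True if the word is short in time or has few syllables (YES at 522)."""
    return (duration_ms < min_duration_ms
            or count_syllables(word) < min_syllables)


def words_needing_video(timed_words):
    """Indices of words whose video portion should go to lip reading.

    timed_words is a list of (word, start_ms, end_ms) tuples.
    """
    return [index
            for index, (word, start_ms, end_ms) in enumerate(timed_words)
            if has_predetermined_characteristic(word, end_ms - start_ms)]
```

Only the words returned by `words_needing_video` would trigger video based recognition, which is how this flow limits the processing spent on lip reading.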
  • the previous Y time units should be greater than or equal to the X time units of the corresponding audio stream. If the audio and video streams are MPEG streams, the synchronization data for synchronizing the audio and video streams can be used.
  • the Y time units of the video stream can be obtained by, for example, sending a request message requesting the specific units to the mobile station 300 and receiving further data packets including the Y time units in a reply message.
  • the video stream can be included in the first data packets and stored in the memory 408 or other memory.
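One way to pick Y time units of video that are at least as long as the X time units of audio is to round the audio interval outward to whole video frames. The sketch below assumes non-overlapping video frames of fixed duration on the same clock as the audio (for example via stream timestamps); the 45 ms default, the optional margin and the function names are illustrative assumptions.

```python
import math


def video_window(audio_start_ms, audio_end_ms, frame_ms=45, margin_ms=0):
    """Return (first_frame, last_frame) indices covering the audio interval.

    The interval is rounded outward to frame boundaries, so the selected
    video span is never shorter than the audio span.
    """
    first = max(0, (audio_start_ms - margin_ms) // frame_ms)
    last = math.ceil((audio_end_ms + margin_ms) / frame_ms) - 1
    return first, last


def window_duration_ms(first, last, frame_ms=45):
    """Duration Y of the selected video window in milliseconds."""
    return (last - first + 1) * frame_ms
```

Rounding outward is the design choice that guarantees Y is greater than or equal to X; a nonzero margin would additionally capture lip motion just before and after the word.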
  • the video based recognition module 412 performs the video based recognition algorithm (lip reading) on the corresponding video data to generate second dictation.
  • the second dictation for the entire stream is used in place of the first dictation, or the second dictation generated from the video stream is merged with that of the audio stream to generate combined dictation.
  • the high probability candidate words generated by the voice-based approach can be combined with the word generated by the video-based approach.
  • a text for an entire sentence can be generated by the video-based approach.
  • the first dictation was generated at 521 prior to analyzing each word to determine if it has a predetermined characteristic.
  • the word can be analyzed prior to generating the first dictation by, for example, performing an energy analysis of the audio frame of the stream.
  • a criterion for using audio or video-based recognition is based upon probability model calculations such as, for example, the Hidden Markov Model (HMM) Speech Recognition discussed above.
  • the audio input device 314 receives input speech (voice utterance).
  • the audio input device 314 (or the controller 306 operating a program stored in memory 308 ) converts the analog voice data into digital data and forms an audio stream including the input speech.
  • the audio based recognition module 410 extracts the input speech pattern from the audio stream and generates a sequence of acoustic feature signal parameters based upon analysis of the input speech pattern.
  • each acoustic feature can be an LPC coefficient feature signal vector.
  • the audio based recognition module 410 compares the generated feature signal vectors to previously stored LPC feature signals to generate a sequence of vector quantized LPC signals representative of the input speech pattern.
  • the probability of the quantized signal having been generated by a reference pattern model of a vocabulary set is determined. After the probability signal for each reference pattern model has been determined, the input speech pattern may be identified as the candidate reference pattern model corresponding to the highest probability signal. If a candidate reference pattern model with suitable probability is determined (YES at 610 ), at 612 the text for the portion of the audio stream is generated based upon the voice data (first dictation) from the candidate. That is, the word associated with the candidate reference pattern model is chosen as the word in the first dictation.
  • In the distributed device format in which the audio based recognition module 410 is implemented at the server end device 400 , the audio stream must first be communicated to the server end device 400 .
  • the interface 304 can generate a message including the audio stream such as a MAC frame message as discussed above to be sent by the transceiver 302 .
  • the transceiver 402 of the server end device 400 receives the message including the audio stream.
  • the interface 404 (or the controller 406 ) can extract the audio stream from the message and the audio based recognition module 410 can determine the candidate reference pattern model with suitable probability.
  • In the single device format in which the voice speech-to-text recognition module is implemented in the controller 306 of the mobile station 300 , the digital voice stream can be sent to the controller 306 via the bus 312 .
  • if no candidate reference pattern model with suitable probability can be determined (NO at 610 ), the prototype signal with the highest probability is chosen.
  • the portion of the video stream corresponding to this portion of the audio stream is input to the video based recognition module 412 .
  • the previous Y time units should be greater than or equal to the X time units of the corresponding audio stream.
  • if the audio and video streams are MPEG streams, the synchronization data for synchronizing the audio and video streams can be used.
  • the video based speech recognition module is used to perform speech-to-text based upon the Y time units of the video of the speaker (lip reading).
  • either the second dictation for the entire portion of the video corresponding to the audio stream is generated based upon the video stream, or the second dictation generated from the video stream is merged with that of the audio stream (combined dictation).
  • the high probability candidate words generated by the voice-based approach can be combined with the word generated by the video-based approach.
  • a text for an entire sentence can be generated by the video-based approach.
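The branch at 610 can be sketched as a simple threshold test on the best audio candidate. The 0.6 standard, the dictionary of candidate probabilities and the callable `recognize_from_video` are assumptions for illustration; the callable stands in for the video based recognition module and is only invoked when no audio candidate is suitable.

```python
def choose_word(audio_candidates, recognize_from_video, standard=0.6):
    """Pick the word for one portion of the utterance.

    audio_candidates     : dict mapping candidate word -> probability
    recognize_from_video : zero-argument callable returning the lip-read
                           word; called only on the NO branch at 610
    """
    best_word = max(audio_candidates, key=audio_candidates.get)
    if audio_candidates[best_word] >= standard:   # YES at 610
        return best_word, "audio"
    return recognize_from_video(), "video"        # NO at 610
```

Passing the video recognizer as a callable reflects the point that lip reading is run only when needed, rather than for every word.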
  • the characteristic word was determined after audio-based recognition was performed.
  • the characteristic word could be determined based upon the prototype candidate signal. If the prototype candidate signal corresponded to a word having fewer than a predetermined number of syllables or less than a predetermined length or time duration, the flow diagram could proceed to 524 . Further, the decision to proceed to 524 could be based upon a combination of whether the word is a characteristic word and the probability of the prototype signal (step 610 ). Particularly, the standard for probability could be lowered for non-characteristic words and raised for characteristic words. Further, FIGS. 5B and 6 show the video based processing only being performed in response to the decisions at 522 and 610 .
  • FIGS. 5B and 6 have the advantage of limiting unnecessary processing consumption associated with the video based recognition.
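The combined decision, in which the probability standard depends on whether the word is a characteristic word, can be written as a two-level threshold. The numeric standards 0.5 and 0.8 and the function name are assumptions chosen only to make the example concrete.

```python
def proceed_to_video(probability, is_characteristic_word,
                     normal_standard=0.5, raised_standard=0.8):
    """Decide whether to run video based recognition for a word.

    Characteristic (short) words must clear a higher probability standard
    before the audio result is trusted; other words use a lower one.
    """
    standard = raised_standard if is_characteristic_word else normal_standard
    return probability < standard
```

With these values, an audio candidate with probability 0.7 is accepted for an ordinary word but still sent to lip reading when the word is a characteristic word.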
  • the video stream is converted into a sequence of N image frames, each having a time duration T.
  • the time durations T may be equal, or may be set for each word of the utterance.
  • the mouth region can be cropped in each frame.
  • a feature signal parameter sequence is generated from the sequence of image frames. For example, a local binary pattern (LBP) operator can be used to generate a feature signal vector.
  • the feature signal vector is matched to a stored feature signal parameter sequence.
  • the probability of each of a plurality of candidate prototype words generating the stored feature vector is determined.
  • the candidate prototype word of the highest probability is selected, and the text (second dictation) for the candidate prototype word is generated.
  • the video based recognition module 412 can be configured according to instructions in memory to perform the video based speech recognition algorithm.
  • the audio input device 314 receives an audio signal representing the user voice and the video input device 316 receives a video signal representative of lip movement associated with the user voice.
  • the audio and video signals can be analog or digital signals.
  • the audio signals are converted into audio streams including a sequence of frames representing the voice utterance.
  • the video signals are converted into video streams including a sequence of image frames representing lip motion associated with the voice utterance as shown in FIG. 1B .
  • the audio based recognition module 410 generates a first dictation based on the audio signal.
  • the controller 406 assigns a first conversion value based on a first set of conversion criteria. Referring to FIG. 8B , the controller 406 assigned first conversion values 820 , 822 , 824 , 826 , 828 , 830 based upon whether the words are long (L) or short (S) as the conversion criteria.
  • the video based recognition module 412 generates a second dictation based on the video data.
  • the controller 406 assigns a second conversion value based on a second set of conversion criteria. Referring to FIG. 8B , the controller 406 assigned second conversion values 840 , 842 , 844 , 846 , 848 , 850 based upon whether the words are different from corresponding words of the first dictation (Y or N).
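The two sets of conversion values described with reference to FIG. 8B can be sketched as simple per-word labels. The 300 ms boundary between long and short words, the use of word durations as input and the function names are assumptions.

```python
def first_conversion_values(durations_ms, long_threshold_ms=300):
    """Label each word of the first dictation long ('L') or short ('S')."""
    return ["L" if duration >= long_threshold_ms else "S"
            for duration in durations_ms]


def second_conversion_values(first_dictation, second_dictation):
    """Label each word 'Y' if the two dictations differ there, else 'N'."""
    return ["Y" if audio_word != video_word else "N"
            for audio_word, video_word in zip(first_dictation,
                                              second_dictation)]
```

Both functions return one label per word, so the two lists can be compared position by position in a later step.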
  • the first and second dictations can be generated based upon known subject dependent (SD) and subject independent (SI) voice based and lip motion based speech to text algorithms such as those discussed above with reference to FIG. 6 (steps 602 - 612 ) and FIG. 7 .
  • the controller 406 generates a variable text conversion value based on the first conversion value and the second conversion value.
  • the variable text conversion value can be one or more rules for determining whether to use a word or words of the first dictation or the second dictation.
  • the variable text conversion value is to use the second dictation when a length of a word of the first dictation is greater than a predetermined time duration.
  • Example rules include: (1) use the second dictation when a number of syllables in a word of the first dictation is not greater than N; (2) use the second dictation when a signal to noise ratio between the audio signal and background noise is not below a predetermined threshold; (3) use the first dictation when a signal to brightness ratio is above a predetermined threshold; (4) use the second dictation when the volume of the audio signal is lower than a predetermined value; (5) use the first dictation when the video input device detects no lip movement; and (6) use the first dictation for certain Global Positioning System (GPS) data.
  • the signal to noise data and signal to brightness data can be obtained by the audio input device 314 and the video input device 316 together with the controller 306 . Particularly, this data can be included in the first data packets sent by the transceiver 302 to the server end device 400 .
  • the controller 406 , audio based recognition module 410 and video based recognition module 412 at the server end device 400 can be configured to determine the signal to noise value of the audio stream and the signal to brightness value of the video stream.
  • the GPS data can be determined by the mobile station 300 and included in the first data packets sent by the transceiver 302 .
  • the controller 306 of the mobile station 300 can utilize the transceiver 302 and/or an associated base station (not shown) for obtaining the GPS data.
  • the controller 406 generates a third dictation based on the variable text conversion value.
  • the third dictation can include only the words of the second dictation, or the words of the first dictation except for word 123 which is replaced with word 163 .
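Generating the third dictation from the conversion values can be sketched as below. The specific rule used here, namely taking the word from the second dictation only when the audio word is short ('S') and the two dictations differ ('Y'), is an assumed example of a variable text conversion value and is not stated in this form in the text. The label conventions and the function name are likewise hypothetical.

```python
def third_dictation(first, second, first_values, second_values):
    """Build the third dictation word by word.

    first_values  : 'L' or 'S' per word (long or short)
    second_values : 'Y' or 'N' per word (dictations differ or not)
    """
    result = []
    for audio_word, video_word, length, differs in zip(
            first, second, first_values, second_values):
        # Assumed rule: trust lip reading only for short, disputed words.
        use_second = length == "S" and differs == "Y"
        result.append(video_word if use_second else audio_word)
    return result
```

Under this rule a long word keeps its audio based transcription even when the lip-read word differs, which is one plausible way of weighing the two conversion values against each other.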
  • the first and second conversion values can be, for example, an evaluation value that increases with the probability that a produced word is a corresponding keyword as the conversion criteria.
  • as the evaluation value, for example, a confidence measure may be adopted that is used when determining whether a voice included in the voice data, and lip motion associated with the voice data, obtained by publicly known voice and video recognition processes, coincide with a corresponding keyword.
  • the confidence measure may be calculated based on a posterior probability in a hidden Markov model (HMM) or a likelihood (an acoustic likelihood or a linguistic likelihood) obtained by voice and video recognition processing.
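A confidence measure based on a posterior probability can be illustrated by normalizing the per-keyword likelihoods with Bayes' rule. The uniform default prior, the dictionary layout and the function name are assumptions; the likelihood values would come from the recognition process.

```python
def posterior_confidences(likelihoods, priors=None):
    """Convert per-keyword likelihoods into posterior probabilities.

    likelihoods : dict mapping keyword -> likelihood of the observation
    priors      : optional dict mapping keyword -> prior probability;
                  a uniform prior is assumed when omitted
    """
    words = list(likelihoods)
    if priors is None:
        priors = {word: 1.0 / len(words) for word in words}
    joint = {word: likelihoods[word] * priors[word] for word in words}
    total = sum(joint.values())
    return {word: joint[word] / total for word in words}
```

The resulting values sum to one across the keywords, so the largest one can serve directly as an evaluation value that increases with the probability that the produced word is the keyword.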
  • the dictation device previously stores an acoustic model, an LBP model, a dictionary, and a language model, which relate to the above mentioned keywords, and performs the voice and video recognition processes with reference to this information, thereby calculating the above-mentioned conversion values.
  • the conversion values may also be obtained from, for example, a similarity based on distances calculated between the pieces of voice data of the preliminarily registered keywords and the input voice data.
  • the similarity may be calculated by calculating distances between pieces of voice data using, for example, a publicly known distance calculation method that utilizes a predetermined sound feature quantity such as a spectrum and by performing dynamic programming (DP) matching in a voice interval.
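DP matching between two feature sequences can be sketched with the classic dynamic time warping recursion. For brevity the features here are single numbers and the local distance is an absolute difference; real sound feature quantities such as spectra would be vectors with a suitable distance. The function names and the template dictionary are assumptions.

```python
def dp_matching_distance(seq_a, seq_b):
    """Accumulated distance of the best monotonic alignment of two sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best accumulated distance aligning the first i elements
    # of seq_a with the first j elements of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch seq_b
                                     cost[i][j - 1],      # stretch seq_a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def closest_keyword(templates, features):
    """Return the registered keyword whose template is nearest the input."""
    return min(templates,
               key=lambda word: dp_matching_distance(templates[word],
                                                     features))
```

A smaller accumulated distance corresponds to a higher similarity, so the conversion value could be derived from the negated or inverted distance.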
  • the dictation device preliminarily stores therein the pieces of voice data or the sound feature quantities of the above-mentioned keywords.
  • the first and second sets of conversion criteria can include pre-registered data representing a value associated with the user voice or lip motion such as in a SD recognition system.
  • As discussed above, in the SD voice based system, the reference pattern models and/or predetermined feature signals may have been generated based upon a user speaking a training sequence. Similarly, in the SD lip reading services, the user inputs a training sample size so that the service can “learn” the user's unique characteristics.
  • the controller can be configured to prioritize either the first conversion value or the second conversion value by a predetermined setting assigned by a user. For example, when the mobile station is in a very noisy location, it may be best to use the second dictation. In this case, the user can manually set the mobile station to use only the second dictation. Alternatively, the mobile station can be set to only use the first or second dictation when the mobile station is determined to be at a certain location.
  • the mobile station can be configured to receive global positioning system (GPS) data so that the location can be determined. Alternatively, the controller can generate the variable text conversion value based on the GPS data as well as the first and second conversion values.
  • one rule associated with the variable text conversion value can be that a signal to noise ratio between the audio signal and a background noise is below a predetermined threshold.
  • the controller can be configured to disable the audio input device.
  • one rule associated with the variable text conversion value can be that a signal to brightness ratio is below a predetermined threshold.
  • the controller can be configured to disable the video input device.
  • one rule associated with the variable text conversion value can be that the video input device detects no lip movement.
  • the controller can be configured to disable the audio input device.
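The three device-disabling rules can be collected in one small function. The thresholds, the string labels `"audio"` and `"video"`, and the function name are assumptions; the mapping from each condition to the disabled device follows the rules as stated in the text.

```python
def devices_to_disable(signal_to_noise, signal_to_brightness,
                       lip_movement_detected,
                       noise_threshold=10.0, brightness_threshold=1.0):
    """Return the set of input devices the controller would disable."""
    disabled = set()
    if signal_to_noise < noise_threshold:
        disabled.add("audio")   # signal to noise ratio below threshold
    if signal_to_brightness < brightness_threshold:
        disabled.add("video")   # signal to brightness ratio below threshold
    if not lip_movement_detected:
        disabled.add("audio")   # no lip movement detected
    return disabled
```

Returning a set keeps the result well defined when several rules fire at once, for example when both the noise rule and the lip movement rule point at the audio input device.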
  • the dictation generated by the speech-to-text services of the various embodiments can be passed to another machine as command data or control data to achieve voice based navigation or control.
  • the dictation device of the various embodiments can be incorporated in a control system for vehicle components.
  • a voice utterance: “Turn on Air Conditioner” is converted to dictation as discussed in the above embodiments.
  • the controller can activate the air conditioning system based upon the dictation.
  • Other vehicle components such as the radio, vehicle navigation system, windshield wiper, etc. can be controlled based upon the dictation.
  • Although the speech is converted into dictation, those skilled in the art will appreciate that the dictation will only be represented in machine code or the like recognizable by the controller.
  • the dictation device of the various embodiments can be incorporated into home appliances such as a television.
  • Voice speech of commands such as to change the channel, activate or deactivate the television, etc. can be converted into dictation to perform control of the television in accordance with the command.
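Using dictation as control data can be sketched as a lookup from normalized text to a component and an action. The command table, the component and action names, and the function name are hypothetical; an actual controller would issue whatever control signals its vehicle component or appliance expects.

```python
# Hypothetical table mapping normalized dictation to (component, action).
COMMAND_TABLE = {
    "turn on air conditioner": ("air_conditioner", "on"),
    "turn off air conditioner": ("air_conditioner", "off"),
    "turn on radio": ("radio", "on"),
    "change the channel": ("television", "next_channel"),
}


def dictation_to_command(dictation):
    """Return the (component, action) pair for a dictation, or None."""
    # Lowercase and collapse repeated whitespace before the lookup.
    normalized = " ".join(dictation.lower().split())
    return COMMAND_TABLE.get(normalized)
```

Returning `None` for unrecognized dictation lets the caller ignore utterances that are not commands instead of acting on a wrong guess.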

Abstract

A dictation device includes: an audio input device configured to receive a voice utterance including a plurality of words; a video input device configured to receive video of lip motion during the voice utterance; a memory portion; a controller configured according to instructions in the memory portion to generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion; and a transceiver for sending the first data packets to a server end device and receiving second data packets including combined dictation based upon the audio stream and the video stream from the server end device. In the combined dictation, first dictation generated based upon the audio stream has been corrected by second dictation generated based upon the video stream.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
The present application is a continuation of U.S. patent application Ser. No. 15/649,251 filed on Jul. 13, 2017, which is a continuation of U.S. patent application Ser. No. 14/826,151 filed on Aug. 13, 2015 and now U.S. Pat. No. 9,741,342, which claims the benefit of U.S. provisional application No. 62/085,243 filed on Nov. 26, 2014.
TECHNICAL FIELD
The technical field relates generally to a dictation device, server end device and system for performing speech recognition based upon audio and video signals.
BACKGROUND
Audio based speech recognition services such as, for example, Dragon Dictation, Siri, and SILVIA, can transcribe audio signals including voice data representing speech into text to be rendered on a display. On the other hand, image based speech recognition services transcribe speech into words by, for example, recognizing lip motion. In one such approach, a local binary pattern (LBP) of a series of images in video of lip motion is recognized as text by comparison to a database. However, extracting the LBP from a video can consume a great amount of processing and memory resources. Both types of speech recognition services will be referred to here as speech-to-text services.
SUMMARY
Although the capability and competence of speech-to-text services have recently improved, such services may still have a prevalent error rate. Using common evaluation measures such as the word error rate and sentence error rate, commonly used audio based speech recognition services have at times demonstrated error rates as high as 18% for words and 85% for sentences. Image based speech recognition services such as lip reading have demonstrated a lower error rate. However, in subject independent (SI) lip reading services, errors can occur due to large variations within lip shapes, skin textures around the mouth, varying speaking speeds and different accents, which could significantly affect the spatiotemporal appearances of a speaking mouth. A recent SI lip reading algorithm developed by Zhou et al. can reportedly achieve recognition rates as high as 92.8% (See “Towards a Practical Lipreading System”, authored by ZIHENG ZHOU, GUOYING ZHAO and MATTI PIETIKAINEN, [database online], [retrieved on Nov. 5, 2014], Retrieved from the Internet: <URL:https://www.researchgate.net/publication/221361770_Towards_a_practical_lipreading_system>, the contents of which are incorporated herein by reference). However, even this algorithm requires the video to have at least 9 frames. Thus, the processing and memory resources consumed can be high. Subject dependent (SD) lip reading services require the user to input a training sample size so that the service can “learn” the user's unique characteristics. Therefore, SD lip reading services are considered user unfriendly.
In view of the above problems, as well as other concerns, the present disclosure concerns various embodiments of a dictation device, a server end device, a system including the dictation device and the server end device, and methods which can perform speech-to-text services at a lower error rate in a user friendly manner.
A method for generating text from speech according to one embodiment includes: receiving an audio signal representing a voice utterance of one or more words; performing an audio based speech recognition algorithm to convert the voice utterance into first dictation; determining if each of the one or more words has a predetermined characteristic; and performing a video based speech recognition algorithm to convert a portion of a video signal representing lip motion associated with the voice utterance associated with the word determined to have the predetermined characteristic into second dictation.
The predetermined characteristic can be that the word in the first dictation has fewer than a predetermined number of syllables or that the word is less than a predetermined length or time duration.
Determining if each of the words has the predetermined characteristic can further include determining for each of a plurality of prototype candidate words a probability that the respective prototype candidate word can be generated by a feature signal associated with the portion of the audio signal corresponding to the word. In this case, the predetermined characteristic can be that none of the plurality of prototype candidate words have a determined probability higher than a predetermined standard.
The performing a video based speech recognition algorithm to convert the portion of the video signal into second dictation can further include receiving a portion of the video signal that is synchronized with the portion of the audio signal including the word determined to have the predetermined characteristic.
The determining if each of the one or more words has a predetermined characteristic can include comparing each of the words in the first dictation to the corresponding word in the second dictation. In this case, the predetermined characteristic can be that the word in the first dictation is different from the corresponding word in the second dictation.
One embodiment of a system for generating text from speech includes a dictation device such as a smartphone and a server and/or server end devices. All of the server and server end devices will be referred to here as a server end device for simplicity. The smartphone and server end devices can execute a software application stored locally or on a remote server.
In one embodiment, the dictation device includes: an audio input device configured to receive a voice utterance including a plurality of words; a video input device configured to receive video of lip motion during the voice utterance; a memory portion; a controller configured according to instructions in the memory portion to generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion; and a transceiver for sending the first data packets to the server end device (a remote apparatus) and receiving second data packets including combined dictation based upon the audio stream and the video stream from the server end device. In the combined dictation, first dictation generated based upon the audio stream has been corrected by second dictation generated based upon the video stream. The controller can be further configured to render the combined dictation as text on a display. The controller can be further configured to generate synchronization data to synchronize the audio stream and the video stream to be sent by the transceiver to the server end device. The transceiver can be further configured to send the second data packets to a downstream application such as an Internet website or other hardware such as a television or home appliance. The dictation device can be, for example, a smartphone.
In one embodiment, the server end device includes: a transceiver configured to send and receive data packets to a remote mobile station (the dictation device) via a connection to a network; an interface for adding formatting information to data to generate the data packets to be sent and removing formatting information from received data packets, the received data packets including an audio stream and a video stream associated with a voice utterance of a plurality of words; a controller coupled to the interface; an audio based speech recognition module coupled to the controller and configured to generate first dictation based upon the audio stream received from the remote mobile station; a video based speech recognition module coupled to the controller and configured to generate second dictation based upon the video stream received from the remote mobile station; and a memory including instructions for configuring the controller to generate a combined dictation based upon a comparison between the first dictation and the second dictation.
The controller can be configured to determine if at least one of the words in the first dictation has a predetermined characteristic. The combined dictation can be generated based upon the second dictation for the at least one of the words having the predetermined characteristic and based upon the first dictation for the other of the plurality of words.
The predetermined characteristic can be that at least one word is not similar to the corresponding at least one word in the second dictation, that the at least one word has less than a predetermined number of syllables or that the at least one word is determined to be less than a predetermined length or time duration.
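The word-by-word selection described above can be sketched as follows. This is only a minimal illustration, not the claimed implementation: the `Word` type, the vowel-group syllable counter, and the threshold values (`min_syllables`, `min_duration_ms`) are hypothetical, and the two dictations are assumed to be already aligned word for word.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str           # word as transcribed in the first (audio based) dictation
    duration_ms: float  # assumed time span of the word in the audio stream


def count_syllables(text: str) -> int:
    """Rough vowel-group count; a stand-in for a real syllabifier."""
    groups = 0
    previous_was_vowel = False
    for ch in text.lower():
        is_vowel = ch in "aeiouy"
        if is_vowel and not previous_was_vowel:
            groups += 1
        previous_was_vowel = is_vowel
    return max(groups, 1)


def has_predetermined_characteristic(first: Word, second_text: str,
                                     min_syllables: int = 2,
                                     min_duration_ms: float = 300.0) -> bool:
    """True if the word differs from the lip-read word, is too short in
    syllables, or is too short in time (thresholds are illustrative)."""
    return (first.text.lower() != second_text.lower()
            or count_syllables(first.text) < min_syllables
            or first.duration_ms < min_duration_ms)


def combine(first_dictation, second_dictation, **thresholds):
    """Use the second dictation for flagged words, the first dictation otherwise."""
    combined = []
    for first, second_text in zip(first_dictation, second_dictation):
        if has_predetermined_characteristic(first, second_text, **thresholds):
            combined.append(second_text)
        else:
            combined.append(first.text)
    return combined
```

For example, with a hypothetical first dictation in which a short word was misheard, `combine([Word("please", 400), Word("state", 350), Word("pie", 200)], ["please", "state", "Pi"])` keeps the first two words from the first dictation and takes the third word from the second dictation.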
The audio based speech recognition module can generate the first dictation by: extracting a feature signal associated with the audio stream; for each of a plurality of candidate prototype words, determining the probability that the respective candidate prototype word generates the feature signal; and choosing the candidate prototype word having highest probability among the plurality of candidate prototype words.
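As a hedged sketch of the candidate selection just described, the function below picks the candidate prototype word with the highest probability and also reports whether that probability exceeds a predetermined standard, which is one of the predetermined characteristics mentioned earlier. The probability values and the default `standard` of 0.5 are assumptions for illustration; the disclosure does not specify them.

```python
def choose_candidate(probabilities, standard=0.5):
    """Return (best_word, is_confident).

    probabilities maps each candidate prototype word to the probability
    that it generates the extracted feature signal. is_confident is False
    when no candidate exceeds the predetermined standard, in which case
    the word could be handed to video based recognition.
    """
    best_word = max(probabilities, key=probabilities.get)
    return best_word, probabilities[best_word] > standard
```

With hypothetical scores `{"pie": 0.31, "pi": 0.29, "buy": 0.20}`, the function returns `("pie", False)`, signalling that the audio result for this word should not be trusted on its own.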
The video based speech recognition module can generate the second dictation by generating a feature signal parameter sequence such as an LBP from the sequence of image frames in the video stream; matching the feature signal vector to a stored feature signal parameter sequence; determining the probability of each of a plurality of candidate prototype words generating the stored feature vector; and selecting the candidate prototype word of the highest probability as the text in the second dictation.
The audio based speech recognition module and video based speech recognition module can be two separate processors included in two separate computing devices configured according to instructions in memory. Alternatively, the modules can be a processor in a single computing device configured according to instructions in memory. In both cases, the computing device(s) will be connected to the controller. Alternatively, the controller can be configured by instructions in memory to perform audio based and video based speech recognition.
The controller can be further configured to obtain synchronization data indicating a portion of the video signal that is synchronized with the portion of the audio signal including the words determined to have the predetermined characteristic. Particularly, according to one aspect, the audio stream and video stream in the received data packets can be in the transport or program stream format associated with the various MPEG formats. The transport or program stream includes synchronization data for synchronizing the audio with video for playback.
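One way to picture the use of synchronization data is to map the time span of a flagged word in the audio stream to the indices of the video frames covering the same span. The sketch below assumes, for illustration only, that both streams share a common clock expressed in integer milliseconds and that the video has a constant frame rate; an actual MPEG stream carries its own timestamps, and the function name is hypothetical.

```python
def video_frames_for_audio_span(start_ms: int, end_ms: int, fps: int) -> range:
    """Indices of the video frames overlapping the audio span [start_ms, end_ms).

    Frame k is assumed to cover [k * 1000 / fps, (k + 1) * 1000 / fps) ms.
    Integer arithmetic avoids floating point rounding at frame boundaries.
    """
    if end_ms <= start_ms:
        return range(0)
    first = start_ms * fps // 1000
    last = (end_ms * fps - 1) // 1000
    return range(first, last + 1)
```

For a word spoken from 1000 ms to 1200 ms in video captured at 30 fps, the function selects frames 30 through 35, which are the frames that would be passed to the video based speech recognition algorithm.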
According to another embodiment, a dictation device includes: an audio input device that receives an audio signal representing user voice; a video input device that receives a video signal representative of lip movement associated with the user voice; and a controller configured according to instructions stored in a memory. The controller is configured to: generate a first dictation based on the audio signal and assign a first conversion value based on a first set of conversion criteria; generate a second dictation based on the video data and assign a second conversion value based on a second set of conversion criteria; and generate a variable text conversion value based on the first conversion value and the second conversion value and generate a third dictation based on the variable text conversion value.
According to a first aspect, the controller generates the variable text conversion value based on said first conversion value when a number of syllables in a word is greater than N and based on said second conversion value when the number of syllables is N or less.
According to a second aspect, the controller generates the variable text conversion value based on said second conversion value during a period when a length of syllable of a word is less than M.
According to a third aspect, the audio input device is disabled when a signal to noise ratio between the audio signal and a background noise is below a predetermined threshold.
According to a fourth aspect, the video input device is disabled when a signal to brightness ratio is below a predetermined threshold.
According to a fifth aspect, if the volume of the audio signal is lower than a predetermined value, the controller generates the variable text conversion value without the first conversion value.
According to a sixth aspect, when the video input device detects no lip movement, the audio input device is disabled.
According to a seventh aspect, said controller generates a text conversion value based on said first conversion value, said second conversion value and global positioning system (GPS) data. The first set of conversion criteria can include pre-registered data representing a value associated with the user voice. The second set of conversion criteria can include pre-registered data representing a value associated with the user voice.
According to an eighth aspect, the variable text conversion value is generated based on predetermined criteria which include pre-registered data representing a value of the user voice.
According to a ninth aspect, the controller is configured to prioritize either the first conversion value or the second conversion value by a predetermined setting.
According to a tenth aspect, the predetermined setting can be assigned by a user.
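The first and fifth aspects can be illustrated with a short sketch. The disclosure does not define the numerical form of a conversion value, so the code below simply assumes each dictation supplies a `(word, conversion_value)` pair; the parameter names `n`, `volume` and `min_volume` and their default values are likewise hypothetical.

```python
def variable_text_conversion(audio, video, syllables, n=1,
                             volume=1.0, min_volume=0.1):
    """Pick the (word, conversion_value) pair used for the third dictation.

    audio and video are (word, conversion_value) pairs from the first and
    second dictations. Fifth aspect: if the audio volume is below a
    predetermined value, the first conversion value is not used.
    First aspect: use the first conversion value when the word has more
    than n syllables, and the second conversion value otherwise.
    """
    if volume < min_volume:
        return video
    return audio if syllables > n else video
```

Under these assumptions, a one-syllable word is taken from the video based result, a two-syllable word from the audio based result, and any word spoken too quietly from the video based result.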
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying figures, in which like reference numerals refer to identical or functionally similar elements, together with the detailed description below are incorporated in and form part of the specification and serve to further illustrate various exemplary embodiments and explain various principles and advantages in accordance with the present invention.
FIGS. 1A-1B are diagrams illustrating exemplary operations of a dictation device.
FIG. 1C is a schematic diagram illustrating first and second dictations generated by the dictation device.
FIG. 2 is a schematic diagram illustrating an exemplary simplified and representative operating environment of a system in which a mobile station communicates with a server via the Internet.
FIG. 3 is a block diagram of exemplary portions of a dictation device according to exemplary embodiments.
FIG. 4 is a block diagram of exemplary portions of a server end device according to exemplary embodiments.
FIG. 5A is a flow diagram illustrating exemplary operations of the system according to an exemplary embodiment.
FIG. 5B is a flow diagram illustrating exemplary operations of the system according to an exemplary embodiment.
FIG. 6 is a flow diagram illustrating exemplary operations of the system according to an exemplary embodiment.
FIG. 7 is a flow diagram illustrating an exemplary process for performing speech-to-text based upon images representing lip motion.
FIG. 8A is a flow diagram illustrating exemplary operations of the system according to an exemplary embodiment.
FIG. 8B is a schematic diagram illustrating first and second dictations generated by the dictation device.
FIG. 9 is a schematic illustration of a voice commanded vehicle system incorporating the dictation device and system.
DETAILED DESCRIPTION
In overview, the present disclosure concerns a system including a dictation device for performing speech-to-text services and a server end device. The dictation device may be a mobile station such as, for example, a smart phone, a portable computer or an in-vehicle user apparatus. The server end device can be a server or one or more apparatus connected to the server or a combination of both communicating with the mobile station. Either combination of the server and one or more apparatus connected to the server will be referred to here as a server end device for simplicity.
The present disclosure also concerns processors for the mobile station and server end device, memories associated with the processors, and computer readable instructions stored in the memories for configuring the processors. More particularly, various inventive concepts and principles are embodied in systems, apparatuses, and methods for configuring a mobile station to be able to provide more accurate and user friendly speech-to-text recognition services.
The instant disclosure is provided to further explain in an enabling fashion the best modes of performing one or more embodiments of the present invention. The use of relational terms such as first and second, top and bottom, and the like, if any, are used solely to distinguish one from another entity, item, or action without necessarily requiring or implying any actual such relationship or order between such entities, items or actions. It is noted that some embodiments may include a plurality of processes or steps, which can be performed in any order, unless expressly and necessarily limited to a particular order; i.e., processes or steps that are not so limited may be performed in any order.
Much of the inventive functionality and the inventive principles when implemented, are best supported with or in computer program instructions (software) or integrated circuits (ICs), and/or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions or ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the exemplary embodiments.
The computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which suitably execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in block diagrams or flowchart steps discussed below. The computer program instructions may also be stored in a computer usable or computer-readable non-transitory memory that can suitably direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the memory produce an article of manufacture, preferably including instruction means that suitably implement the function specified in the flowchart block or blocks.
Further, in certain preferred embodiments, the portions may be modules implemented by a software or hardware component, such as a field programmable gate-array (FPGA) or application-specific integrated circuit (ASIC), which performs certain tasks. Preferably, a module may preferably be configured to reside on an addressable storage medium and preferably be configured to execute on one or more processors. Accordingly, in further preferred embodiments, an exemplary module may preferably include, but may not be limited only to, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. Preferably, the functionality provided for in the components and modules may be suitably combined into fewer components and modules or further separated into additional components and modules.
Referring to the diagrams of FIGS. 1A-1B, an exemplary operation of a dictation device 11 for performing speech-to-text will be discussed. The dictation device 11 may be implemented within a mobile station such as a smartphone. Referring first to FIG. 1A, a problem associated with an audio speech-to-text service will be discussed. A user 10 of the dictation device 11 inputs a voice utterance into an audio input device such as, for example, a microphone associated with the dictation device 11. In this example, the spoken voice utterance is “Please state Pi up to 100 digits.” The voice utterance is converted into an audio stream 12 that includes a number of portions (N). Each of the N portions may be, for example, an audio frame of predetermined time duration (t) such as, for example, 45 ms. Alternatively, or in combination with the time duration t, each audio frame may represent a certain predetermined number of syllables, words or phonemes. An audio based speech recognition service is used to transcribe the speech into text, which is rendered on the display 14A associated with the dictation device 11. However, the audio based speech recognition service fails to accurately transcribe the words “Pi” and “one hundred”.
As shown in FIG. 1B, in the dictation device 11 according to various exemplary embodiments, when a user 10 of the dictation device 11 speaks into an audio input device such as, for example, a microphone associated with the mobile station, not only is the input speech converted into the audio stream 12, but video including a series of images of lip motion of the user 10 is also recorded and converted into a video stream 16 that includes N portions. The N portions may be image frames. Also, the number N of image frames can be different from the number N of audio frames.
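The following sketch shows why the number of audio frames and the number of image frames can differ for the same utterance. The 45 ms frame duration comes from the example above; the utterance length, the optional hop size for overlapping frames, and the 30 fps video rate are assumed values chosen only for illustration.

```python
def audio_frame_count(duration_ms: int, frame_ms: int = 45, hop_ms=None) -> int:
    """Number of complete audio frames of frame_ms in an utterance.

    hop_ms is the step between frame starts; it defaults to frame_ms
    (no overlap). A smaller hop yields overlapping frames.
    """
    hop = frame_ms if hop_ms is None else hop_ms
    if duration_ms < frame_ms:
        return 0
    return 1 + (duration_ms - frame_ms) // hop


def video_frame_count(duration_ms: int, fps: int = 30) -> int:
    """Number of complete image frames captured during the utterance."""
    return duration_ms * fps // 1000
```

For a hypothetical 2700 ms utterance, `audio_frame_count(2700)` gives 60 audio frames, while `video_frame_count(2700)` gives 81 image frames.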
As shown in FIG. 1C, the audio stream 12 and the video stream 16 are converted into first and second dictations textually representing the plurality of words in the utterance.
The dictation device 11 can recognize that certain words such as “Pi” and “one hundred” are not likely to be transcribed accurately by the audio based speech recognition service in accordance with a predetermined criterion or characteristic. However, the second dictation obtained from the video stream by video based speech recognition (lip reading) can be used to correct such words. Thus, the words “Pi” and “one hundred” are accurately transcribed into combined dictation and rendered in the display 14B. The transcribed text can then be input into another downstream application for, for example, displaying Pi to 100 digits as requested. In one embodiment discussed more fully below, the predetermined criterion can be that a comparison 19 is performed for each of the words (such as word 123) in the first dictation to determine if it is different from the corresponding word (such as word 163) in the second dictation.
As discussed with respect to the block diagrams below, the dictation device 11 can be implemented by a mobile station such as a smartphone which includes a video camera and/or microphone for recording the user's voice utterances and lip movement. The mobile station can also include one or more processors configured to perform the audio and video based speech recognition.
As shown in FIG. 2, the system can be implemented by a plurality of distributed devices such as a mobile station 202, a server 206 and server end modules 208, 210. This system will be referred to as a distributed device format. The mobile station 202 generates audio and video streams of the voice utterances and lip movement of the user and transmits the streams to the server 206 via a network such as, for example, the Internet 204. The server 206 utilizes the server end module 208 depicted as “Audio Speech recognition” to perform audio based speech-to-text recognition and the server end module 210 depicted as “Video Speech recognition” to perform video based speech recognition. Although each of the server end modules 208, 210 is shown as a separate module connected to the server 206, it should be noted that the server end modules 208, 210 may be a portion of the server 206. Further, one server end module may perform both audio and video based recognition. Also, the server end module(s) may be connected to the server 206 via the Internet 204 and/or via another server not shown. Moreover, the server end module(s) may utilize online databases of common feature patterns in order to recognize certain words.
Referring to the block diagram of FIG. 3, portions of an exemplary mobile station 300 implementing the dictation device will be discussed. The mobile station 300 can include a transceiver 302, an interface 304, a controller 306, a memory 308, an audio input device 314, a video input device 316, a digital audio/video processor 318, a display 320, and a common bus 312.
Referencing the Open Systems Interconnection reference model (OSI model), the transceiver 302 provides the physical layer functions such as modulating packet bits into electromagnetic waves to be transmitted and demodulating received waves into packet bits to be processed by higher layers. The transceiver 302 can include radio technology circuitry such as, for example, ZigBee, Bluetooth and WiFi. The transceiver 302 may also include Ethernet and a USB connection. Further, the transceiver 302 can include an antenna portion capable of receiving and transmitting the electromagnetic waves from and to, for example, an access point of the mobile station. The antenna portion can also be separate from the transceiver 302. The antenna portion can include, for example, an RF antenna, a coil antenna and/or a capacitive plate antenna.
The interface 304 can provide the data link layer and network layer functions of the mobile station 300 such as formatting the packet bits to an appropriate format for transmission by the transceiver 302 or received packet bits into an appropriate format for processing by the controller 306. For example, the interface 304 can be configured in accordance with the 802.11 media access control (MAC) protocol and the TCP/IP protocol. According to the MAC protocol, packet bits are encapsulated into frames for transmission and the encapsulation is removed from received frames. According to the TCP/IP protocol, error control and addressing are introduced to ensure end to end delivery. Although shown separately here for simplicity, it should be noted that both the interface 304 and the transceiver 302 may be implemented by a network interface consisting of a few integrated circuits.
The memory 308 can be one or a combination of a variety of types of memory or computer readable medium such as random access memory (RAM), read only memory (ROM), flash memory, dynamic RAM (DRAM), hard disk drive (HDD) or any type of non-transitory memory. The memory 308 generally includes instructions for configuring the controller 306 as well as a basic operating system, executable code, and data and variables.
The bus 312 is a common bus for providing communications between the portions of the mobile station 300 with the controller 306.
The display 320 can be a conventional liquid-crystal display unit for displaying transcribed text.
The audio input device 314 can include a microphone internal to the mobile station 300 and/or a connection for an external microphone and a processor for processing the sound signal.
The video input device 316 can include a video camera or be coupled to a video camera external to the mobile station 300 to generate digital video. In a case in which the device 316 includes the video camera, the video input device 316 can include a CCD image sensor that generates image information by capturing a subject image formed through a lens tube. Light from the subject that becomes an object to be captured passes through the lens tube and then forms an image on the light reception surface of the CCD image sensor. The formed subject image can be converted into R, G, or B color information. As a result of the conversion, image information that indicates an entire subject image is generated. The CCD image sensor can generate image information of a new frame at each constant time. The video input device 316 can include an 8 megapixel iSight camera having 1.5 micron pixels, 1.2 megapixel photos and capability to capture 1080p HD at 60 fps.
The digital audio/video processor 318 can perform conversion between analog and digital and other necessary processing before storing the digital audio and video data in the memory 308 or an external storage medium or memory as an audio stream and video stream.
The controller 306 is the main processing unit for sending control signals and exchanging data with the other portions of the mobile station 300 via, for example, the bus 312.
The controller 306 can be a general purpose CPU or an application specific integrated circuit (ASIC). The memory 308 and controller 306 can be part of the ASIC. For example, if the mobile station 300 is a smartphone, the controller 306, memory 308, bus 312, interface 304 and digital audio/video processor 318 will likely be integrated onto a single chip set, referred to as a system on chip (SoC), including a quad-core or dual-core CPU.
It should be noted that the controller 306 can be alternatively configured to perform the A/D conversion, etc. so that the digital audio/video processor 318 can be eliminated.
In the distributed device format, the mobile station 300 can perform only some of the necessary steps of the speech-to-text services. Particularly, the mobile station 202 generates the voice and video streams while the speech-to-text recognition is performed by the server and/or server end modules. Particularly, as shown in FIG. 2, a server 206 is connected to or includes one or more modules 208, 210 for performing audio and video based speech recognition. The mobile station 202 can exchange data with the server 206 via a connection to the Internet 204. Particularly, the mobile station 202 can send first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion to the server 206. The mobile station 202 can receive second data packets including combined dictation based upon the audio stream and the video stream from the server 206.
In an alternative embodiment referred to here as the single device format, the mobile station 300 can perform all of the speech-to-text services. In such a case the controller 306 can be configured by the instructions in the memory 308 to perform audio based speech recognition and/or video based speech recognition in accordance with the discussion of the audio based recognition module and the lip reading based recognition module below with reference to FIG. 4 and to perform the routines discussed below with reference to the flow diagrams of FIGS. 5A-7.
A combination of the single device and distributed device format is also possible. For example, the mobile station 300 can include the capability to perform all of the speech-to-text services, but can choose to use more powerful resources available from the server 206 depending on the quality of the connection to the network.
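A hedged sketch of such a combined format follows. The disclosure states only that the choice depends on the quality of the connection; measuring that quality by round-trip time, the 200 ms limit, and the function name are assumptions made for this example.

```python
def choose_recognizer(connected: bool, round_trip_ms: float,
                      max_round_trip_ms: float = 200.0) -> str:
    """Return "server" when the network connection is good enough to use
    the server end device, otherwise "local" for on-device recognition."""
    if connected and round_trip_ms <= max_round_trip_ms:
        return "server"
    return "local"
```

Under these assumptions, a connected mobile station with an 80 ms round trip would send its streams to the server, while a disconnected station or one with a 500 ms round trip would fall back to the single device format.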
Referring to the block diagram of FIG. 4, the one or more modules for performing audio speech and video speech recognition and the server are depicted by server end device 400.
The server end device 400 can include a transceiver 402, an interface 404, a controller 406, a memory 408, an audio (speech-to-text) recognition module 410, a video (lip reading) recognition module 412, and a common bus 414.
Similarly to the transceiver and interface of the mobile station, the transceiver 402 of the server end device 400 provides the physical layer functions and the interface 404 provides the data link layer and network layer functions. Although shown separately here for simplicity, it should be noted that both the interface 404 and the transceiver 402 may be implemented by a network interface consisting of a few integrated circuits.
The memory 408 can be one or a combination of a variety of types of memory or computer readable medium such as RAM, ROM, DRAM, HDD or any type of non-transitory memory. The memory 408 includes instructions for configuring the controller 406 as well as a basic operating system, executable code, and data and variables.
The memory 408 may include: (1) predetermined feature signals (e.g., sets of acoustic feature vector signals for audio streams and LBP features for video streams); (2) reference patterns representative of identified words or phrases of a vocabulary set; (3) one or more recognition criteria for comparing feature signals of voice stream and video stream portions to the reference patterns; and (4) pre-registered data representing a value associated with the user voice. For example, one recognition criterion may be a hidden Markov model for each reference pattern.
The controller 406 is the main processing unit for sending control signals and exchanging data with the other portions via, for example, the bus 414. For example, the controller 406 can receive first and second dictations from the modules 410, 412 via bus 414. The modules 410, 412 can receive the audio and video streams in the data packets from the interface 404 via bus 414.
The controller 406 can be configured by the instructions in the memory 408 to perform audio based speech recognition and/or video based speech recognition in accordance with a routine discussed below with reference to FIGS. 5A-5B, 6 and 7.
The bus 414 is a common bus for providing communications between the portions of the server end device 400 with the controller 406.
The audio based recognition module 410 and the video based recognition module 412 can include one or more processors configured according to instructions stored in, for example, the memory 408 for performing an audio based speech recognition algorithm to convert the audio stream into first dictation and a video based speech recognition algorithm to convert the video stream into second dictation. Although shown separately, both modules 410, 412 can alternatively be implemented by the controller 406. Generally, both modules 410 and 412 can perform dictation by generating a feature signal from the audio and video stream, matching the feature signal to predetermined feature signal vectors, determining a probability of the predetermined feature signal vector being generated by a reference pattern model for a word or phrase and selecting the word associated with a highest probability. As discussed above, the reference pattern models and predetermined feature signals can be stored in the memory 408. For example, for SD recognition, the reference pattern models and/or predetermined feature signals may have been generated based upon a user speaking a training sequence. However, the modules 410, 412 can alternatively be configured to refer to databases on other devices via a connection to, for example, the Internet in order to achieve more robust speech recognition.
For example, for audio based recognition, the algorithm can be, for example, a Hidden Markov Model Speech Recognition Arrangement algorithm as discussed in, for example, U.S. Pat. No. 4,587,670 to Levinson et al, the contents of which are incorporated by reference. In such a case, the module 410 can be configured to group the audio stream digitally representing the voice input into successive overlapping frames of predetermined time duration such as, for example, 45 ms. The module 410 generates a time frame sequence of acoustic feature signals such as linear prediction coefficient (LPC) vectors from each frame. The module 410 can also be configured to determine the endpoint of the input speech pattern based upon energy of the feature signal sequence. The module 410 compares the generated feature signals to feature signals stored in memory 408 (or in another device such as an online database) and selects the closest corresponding stored feature signal for each speech pattern feature signal as a vector quantized feature signal for every successive frame of the speech pattern. The closest corresponding stored feature signal can be determined based upon the Itakura distance metric signal. Each of the quantized feature signals is one of a finite set of M symbols that may be input to hidden Markov model processing. The module 410 can then determine a probability that a sequence was generated by the reference pattern model. Particularly, referring to the quantized LPC vector feature sequence for an utterance as the sequence O, the probability that the sequence O was generated by a reference pattern model K (a word of a vocabulary set) can be determined in accordance with the following Formula (1):
$$P(O \mid K) = P_{i_1}\, b_{i_1}(O_1)\, a_{i_1 i_2}\, b_{i_2}(O_2) \cdots a_{i_{T-1} i_T}\, b_{i_T}(O_T) \qquad (1)$$
where i1, i2, . . . iT is the maximum likelihood sequence of Markov model states and O1, O2, . . . , OT is the observed sequence.
After the probability signal for each reference pattern model has been generated, the input speech pattern may be identified as the reference pattern model corresponding to the highest probability signal.
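By way of a non-limiting illustration, the evaluation of Formula (1) and the selection of the highest probability model can be sketched as follows. The sketch assumes discrete models given as plain lists; the dictionary layout, the names `viterbi_probability` and `recognize`, and all numeric values are hypothetical and are not taken from the specification or from the Levinson patent.

```python
def viterbi_probability(model, observations):
    """Probability of the most likely state path for the observed symbols.

    model["initial"][i]       : probability of starting in state i
    model["transition"][i][j] : probability of moving from state i to state j
    model["emission"][i][o]   : probability that state i emits symbol o
    """
    initial = model["initial"]
    transition = model["transition"]
    emission = model["emission"]
    n = len(initial)
    # best[i] holds the probability of the best path that ends in state i.
    best = [initial[i] * emission[i][observations[0]] for i in range(n)]
    for symbol in observations[1:]:
        best = [
            max(best[i] * transition[i][j] for i in range(n)) * emission[j][symbol]
            for j in range(n)
        ]
    return max(best)


def recognize(models, observations):
    """Return the word whose reference pattern model scores highest."""
    return max(models, key=lambda word: viterbi_probability(models[word], observations))


# Two toy 2-state models over a 2-symbol alphabet; every number is made up.
MODELS = {
    "pi": {"initial": [1.0, 0.0],
           "transition": [[0.5, 0.5], [0.0, 1.0]],
           "emission": [[0.9, 0.1], [0.2, 0.8]]},
    "fi": {"initial": [1.0, 0.0],
           "transition": [[0.5, 0.5], [0.0, 1.0]],
           "emission": [[0.1, 0.9], [0.8, 0.2]]},
}
```

A call such as `recognize(MODELS, [0, 1, 1])` compares the per-model path probabilities and returns the word of the best scoring model.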
The video based recognition module 412 can be configured to group the digital video stream representing the lip motion into a sequence of successive overlapping image frames of 45 ms duration. In each image frame, the mouth region is preferably cropped from the face. Similarly to the audio based speech recognition process, feature signals are extracted from the sequence of image frames.
For example, a feature signal can be generated from LBP texture descriptors extracted from the sequence of image frames when analyzed in the XYT space. Here XY stands for the image plane and T for temporal positions of individual frames. In this approach, the movement of a talking mouth is considered as a continuous process, and the speech video is viewed as a set of images sampled at a fixed rate along a curve representing the voice utterance in the space of the visual features extracted from the images.
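As a minimal sketch of the texture descriptor mentioned above, the basic 8-neighbor local binary pattern can be computed on a single grayscale image plane as shown below. This covers only the XY plane; applying the same operator to the XT and YT planes of a frame sequence is assumed to work analogously and is not shown. The function names and the list-of-lists image representation are hypothetical.

```python
# Neighbor offsets (row, column), clockwise, starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """8-bit code: bit k is set when neighbor k is at least as bright as the center."""
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels, usable as a feature vector."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist
```

The histogram, rather than the raw codes, is used here as the feature vector because it does not depend on the exact pixel position of the mouth within the cropped region.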
The module 412 can then identify a reference pattern in a manner similar to the process for voice recognition. Particularly, the module 412 can compare the generated feature signals to feature signals stored in memory 408 (or in another device such as an online database) and select the closest corresponding stored feature signal for each lip motion feature signal as a vector for every successive frame. The module 412 can then determine a probability that a sequence was generated by the reference pattern model.
After the probability signal for each reference pattern model has been generated, the input speech pattern may be identified as the reference pattern model corresponding to the highest probability signal. The probability can be determined by a dynamic Bayesian network (DBN) model or a hidden Markov model.
As a result of the comparison, the unknown speech pattern is identified as the closest matching reference pattern in accordance with predetermined recognition criteria. The accuracy of such recognition systems is highly dependent on the selected features and the recognition criteria.
The server end device 400 and the mobile station 300 can communicate with each other by exchanging data within, for example, media access control (MAC) frame messages in accordance with the message format defined by IEEE Std. 802.11. Each frame includes a MAC header, a frame body and a frame check sequence (FCS). The header can include frame control, duration, address, optional sequence control information, optional quality of service control information, and optional control fields. The frame body can be a variable-length frame body and contains information specific to the frame type and subtype. The FCS can include an IEEE 32-bit CRC. The frame body of the messages from the mobile station 300 can include the video and audio streams. The frame body of the messages from the server end device 400 can include the transcribed text. The header can indicate whether the message is a control, data or management frame.
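The header, body and FCS arrangement can be illustrated with the following simplified sketch. It is not the IEEE Std. 802.11 byte layout: the header is treated as an opaque byte string, and the placement of the 32-bit CRC as four little-endian bytes at the end of the frame is an assumption made for the example. The names `build_frame` and `check_frame` are hypothetical.

```python
import struct
import zlib


def build_frame(header: bytes, body: bytes) -> bytes:
    """Append a 32-bit CRC computed over header + body as the frame check sequence."""
    payload = header + body
    fcs = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack("<I", fcs)


def check_frame(frame: bytes) -> bool:
    """Recompute the CRC over everything but the last four bytes and compare."""
    if len(frame) < 4:
        return False
    payload = frame[:-4]
    (fcs,) = struct.unpack("<I", frame[-4:])
    return (zlib.crc32(payload) & 0xFFFFFFFF) == fcs
```

A receiver that finds `check_frame` returning `False` would discard the frame rather than pass a corrupted audio or video stream to the recognition modules.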
The mobile station 300 can send the audio and video streams to the server end device 400 in a predetermined data stream format such as a Moving Picture Experts Group (MPEG) data stream (program stream, transport stream, etc.). Preferably, the format is sufficient for recognizing the time synchronization of the video stream and the audio stream. Particularly, returning to FIGS. 1B-1C, it is preferable that the controller efficiently recognize which time in the video stream is the same as the time in the audio stream at which the words “Pi” and “one hundred” were stated. An advantage of the MPEG format is that the time relationship between the audio and video streams is maintained. However, other data formats are sufficient.
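The use of a maintained time relationship can be sketched as follows, assuming each video frame carries a timestamp on the same clock as the audio stream (as presentation time stamps in an MPEG stream would provide). The tuple layout, the `margin` parameter and the function name are hypothetical.

```python
def video_frames_for_audio(video_frames, audio_start, audio_end, margin=0.0):
    """Select the video frames whose timestamps fall inside an audio interval.

    video_frames : list of (timestamp_seconds, frame) tuples on the shared clock
    audio_start  : start time of the audio portion, in seconds
    audio_end    : end time of the audio portion, in seconds
    margin       : extra time kept on both sides of the interval, in seconds
    """
    lo = audio_start - margin
    hi = audio_end + margin
    return [frame for ts, frame in video_frames if lo <= ts <= hi]
```

A non-zero `margin` yields a video portion that is at least as long as the audio portion it corresponds to.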
The controller 406 can be configured to generate a data set including combined dictation from the first dictation and the second dictation. The interface 404 can add formatting information to the data set to generate the data packets to be sent to the mobile station by the transceiver 402 via a connection to a network such as the Internet.
Exemplary processes for generating the combined dictation are discussed in the following flow diagrams. Each of the flow diagrams will be discussed with respect to the mobile station 300 and server end device 400.
Referring to the exemplary flow diagram of FIG. 5A, an exemplary process for performing speech-to-text according to a first embodiment will be discussed. In this first embodiment, both audio and video-based recognition is used, and a combined dictation is generated based upon a comparison between each of the words of the first and second dictations.
At 502, the audio input device 314 receives an audio signal representing a voice utterance of one or more words. At 504, the video input device 316 receives video of lip motion during the voice utterance. Although described as separate steps, 502 and 504 may, and likely will, occur simultaneously. The controller 306 and interface 304 generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion. As mentioned above, the data packets may include an MPEG stream. The transceiver 302 sends the first data packets to the server end device 400.
At 506, the audio based recognition module 410 performs an audio based speech recognition algorithm to convert the voice utterance in the audio stream into first dictation. At 508, the video based recognition module 412 performs a video based speech recognition algorithm to convert the video stream into second dictation. Although not shown, prior to 506 the controller 406 and interface 404 can extract the audio and video streams from the first data packets and pass them to modules 410, 412. Each of the first and second dictations is a textual representation of the words in the voice utterance.
At 510, the controller 406 determines whether each of the words in the first dictation has a predetermined characteristic. Particularly, the controller 406 compares each word of the first dictation with the corresponding word of the second dictation to determine if they are different. At 512, if a word of the first dictation is different from the corresponding word of the second dictation, the word of the second dictation is chosen for the combined dictation. For example, as shown in FIG. 1C, word 121, word 122, word 124, word 125 and word 126 of first dictation 12 are the same as corresponding word 161, word 162, word 164, word 165 and word 166 of second dictation 16. Word 123 “Fi” is different from word 163 “Pi”. Thus, in the combined dictation, the word 123 in the first dictation is replaced by word 163, or only second dictation 16 is used.
At 514, the controller 406 generates combined dictation, in which each word in the first dictation having the predetermined characteristic has been corrected by the corresponding word in the second dictation. The controller 406 and interface 404 can generate second data packets including the combined dictation to be sent by the transceiver 402 to the mobile station 300. The controller 306 can render the combined dictation as text on the display 320 and/or send the second data packets as input to a downstream application such as an Internet website or as a control command to other devices such as a television or other home appliance.
As discussed above, at 512 only the word in the first dictation which was different was replaced with the corresponding word of the second dictation. However, alternatively, all of the words in the first dictation can be replaced with the second dictation.
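The word-by-word comparison of the first embodiment can be sketched as below, under the assumption that the two dictations are already aligned word for word. The function name, the return of corrected positions, and the example sentence are hypothetical; only the “Fi” and “Pi” pair comes from the description.

```python
def combine_dictations(first, second):
    """Replace each word of the first dictation that differs from the second.

    Returns the combined dictation and the positions that were corrected.
    """
    if len(first) != len(second):
        raise ValueError("dictations must be aligned word for word")
    combined, corrected = [], []
    for position, (audio_word, video_word) in enumerate(zip(first, second)):
        if audio_word != video_word:
            combined.append(video_word)   # differing word: take the lip-reading result
            corrected.append(position)
        else:
            combined.append(audio_word)   # identical word: keep the audio result
    return combined, corrected
```

For aligned dictations the combined text is identical to the second dictation, which is why using only the second dictation is an equivalent alternative; the list of corrected positions is what the word-by-word form adds.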
Referring to the exemplary flow diagram of FIG. 5B, an exemplary process for performing speech-to-text according to a second embodiment will be discussed. At 520, the audio input device 314 receives input speech (voice utterance). If the input speech is analog data, the audio input device 314 and processor 318 can convert the analog voice data into digital data and (together with controller 306) generate an audio stream including the input speech. The controller 306 and interface 304 generate first data packets including an audio stream representative of the voice utterance. The transceiver 302 sends the first data packets to the server end device 400.
At 521, the audio based recognition module 410 performs the audio based speech recognition algorithm to convert the voice utterance in the audio stream into first dictation. At 522, the controller 406 analyzes each word of the first dictation to determine if it has a predetermined characteristic. For example, here it can be determined if the word is shorter than a predetermined time or includes less than a predetermined number of syllables. Such shorter words may have a higher error rate when processed by voice based recognition services to generate dictation. The end of a word or the number of syllables can be determined by extracting a feature signal of each word.
If a word is determined not to have the predetermined characteristic (NO at 522), only the first dictation is used. On the other hand, when the word is determined to have the predetermined characteristic (YES at 522), at 524 the portion of the video stream corresponding to this portion of the audio stream (the previous Y time units) is input to the video based recognition module 412. The previous Y time units should be greater than or equal to the X time units of the corresponding audio stream. If the audio and video streams are MPEG streams, the synchronization data for synchronizing the audio and video streams can be used. The Y time units of the video stream can be obtained by, for example, sending a request message requesting the specific units to the mobile station 300 and receiving further data packets including the Y time units in a reply message. Alternatively, the video stream can be included in the first data packets and stored in the memory 408 or other memory.
At 526, the video based recognition module 412 performs the video based recognition algorithm (lip reading) on the corresponding video data to generate second dictation. At 528, either the second dictation for the entire stream is used in place of the first dictation, or the second dictation generated from the video stream is merged with that of the audio stream to generate combined dictation. For example, when a word by word approach is used for speech-to-text, the high probability candidate words generated by the voice-based approach can be combined with the word generated by the video-based approach. Alternatively, a text for an entire sentence can be generated by the video-based approach.
As discussed above, the first dictation was generated at 521 prior to analyzing each word to determine if it has a predetermined characteristic. Alternatively, the word can be analyzed prior to generating the first dictation by, for example, performing an energy analysis of the audio frames of the stream.
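The decision at 522 and the selective use of lip reading can be sketched as follows. The syllable estimate (counting vowel groups), the default limits, the dictionary layout of each word and the `lip_read` callable standing in for the video based recognition module 412 are all assumptions made for illustration.

```python
VOWELS = set("aeiouy")


def count_syllables(text):
    """Crude syllable estimate: the number of vowel groups in the word."""
    groups, previous_was_vowel = 0, False
    for ch in text.lower():
        is_vowel = ch in VOWELS
        if is_vowel and not previous_was_vowel:
            groups += 1
        previous_was_vowel = is_vowel
    return max(groups, 1)


def has_characteristic(word, min_syllables=2, min_duration=0.30):
    """True when the word is short in syllables or in time (seconds)."""
    duration = word["end"] - word["start"]
    return count_syllables(word["text"]) < min_syllables or duration < min_duration


def second_embodiment(first_dictation, lip_read):
    """Keep audio words, but re-recognize characteristic words from video."""
    combined = []
    for word in first_dictation:
        if has_characteristic(word):
            # Only this portion of the video stream is processed.
            combined.append(lip_read(word["start"], word["end"]))
        else:
            combined.append(word["text"])
    return combined
```

Because `lip_read` is invoked only for words that meet the characteristic, video processing is limited to the portions of the stream where the audio result is least reliable.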
Referring to the exemplary flow diagram of FIG. 6, an exemplary process for performing speech-to-text according to a third embodiment will be discussed. In this example, the criterion for using audio or video-based recognition is based upon probability model calculations such as, for example, the Hidden Markov Model (HMM) Speech Recognition discussed above.
At 602, the audio input device 314 receives input speech (voice utterance). At 604, the audio input device 314 (or the controller 306 operating a program stored in memory 308) converts the analog voice data into digital data and forms an audio stream including the input speech.
At 606, the audio based recognition module 410 extracts the input speech pattern from the audio stream and generates a sequence of acoustic feature signal parameters based upon analysis of the input speech pattern. For example, each acoustic feature can be an LPC feature signal vector.
At 608, the audio based recognition module 410 compares the generated feature signal vectors to previously stored LPC feature signals to generate a sequence of vector quantized LPC signals representative of the input speech pattern.
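Steps 606 and 608 can be illustrated by the following sketch of overlapping framing and vector quantization. Squared Euclidean distance is used here as a simple stand-in for the Itakura distance named earlier in the description, and the codebook values, frame sizes and function names are hypothetical.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; hop < frame_len gives the overlap."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def quantize(features, codebook):
    """Map each feature vector to the index of its closest stored vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(vector, codebook[k]))
        for vector in features
    ]
```

The resulting index sequence plays the role of the vector quantized signal sequence that is passed on to the probability determination at 610.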
At 610, the probability of the quantized signal having been generated by a reference pattern model of a vocabulary set is determined. After the probability signal for each reference pattern model has been determined, the input speech pattern may be identified as the candidate reference pattern model corresponding to the highest probability signal. If a candidate reference pattern model with suitable probability is determined (YES at 610), at 612 the text for the portion of the audio stream is generated based upon the voice data (first dictation) from the candidate. That is, the word associated with the candidate reference pattern model is chosen as the word in the first dictation.
In the distributed device format in which the voice speech-to-text recognition module 414 is implemented at the server end device 400, the audio stream must first be communicated to the server end device 400. The interface 304 can generate a message including the audio stream, such as a MAC frame message as discussed above, to be sent by the transceiver 302. The transceiver 402 of the server end device 400 receives the message including the audio stream. The interface 404 (or the controller 406) can extract the audio stream from the message and the audio based recognition module 410 can determine the candidate reference pattern model with suitable probability. In the single device format in which the voice speech-to-text recognition module is implemented in the controller 306 of the mobile station 300, the digital voice stream can be sent to the controller 306 via the bus 312.
Returning to 610, there will be cases when no candidate reference pattern model with suitable probability can be determined (NO at 610). For example, under the conventional HMM speech recognition approach, the prototype signal with the highest probability is chosen. However, here if no prototype signal has a probability greater than a predetermined standard such as, for example, 90%, the portion of the video stream corresponding to this portion of the audio stream (the previous Y time units) is input to the video based recognition module 412. The previous Y time units should be greater than or equal to the X time units of the corresponding audio stream. If the audio and video streams are MPEG streams, the synchronization data for synchronizing the audio and video streams can be used. At 615, the video based recognition module 412 is used to perform speech-to-text based upon the Y time units of the video of the speaker (lip reading). At 616, either the second dictation for the entire portion of the video corresponding to the audio stream is generated based upon the video stream, or the second dictation generated from the video stream is merged with that of the audio stream (combined dictation). For example, if a word by word approach is used for speech-to-text, the high probability candidate words generated by the voice-based approach can be combined with the word generated by the video-based approach. Alternatively, a text for an entire sentence can be generated by the video-based approach.
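The decision at 610 can be sketched as follows, assuming the audio based recognition yields a probability per candidate word. The dictionary of candidates, the `lip_read` callable standing in for the video based recognition, and the function name are hypothetical; the 0.90 default mirrors the 90% example given above.

```python
def third_embodiment(candidates, lip_read, threshold=0.90):
    """Pick the audio candidate when it is probable enough, else fall back to video.

    candidates : dict mapping candidate word -> probability from audio recognition
    lip_read   : callable returning the word recognized from the video portion
    Returns (word, source) where source is "audio" or "video".
    """
    if not candidates:
        return lip_read(), "video"
    word, probability = max(candidates.items(), key=lambda item: item[1])
    if probability > threshold:
        return word, "audio"      # YES at 610: use the first dictation
    return lip_read(), "video"    # NO at 610: use lip reading
```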
Returning to the flow diagram of FIG. 5B, in this example the characteristic word was determined after audio-based recognition was performed. For example, the characteristic word could be determined based upon the prototype candidate signal. If the prototype candidate signal corresponded to a word having less than a predetermined number of syllables or a predetermined length or time duration, the flow diagram could proceed to 524. Further, the decision to proceed to 524 could be based upon a combination of whether the word is a characteristic word and the probability of the prototype signal (step 610). Particularly, the standard for probability could be lower for non-characteristic words and higher for characteristic words. Further, FIGS. 5B and 6 show the video based processing only being performed in response to the decisions at 522 and 610. However, video based speech-to-text recognition may be performed in parallel with voice based speech-to-text recognition, similar to FIG. 5A discussed above and FIG. 8A discussed later. Nevertheless, the processes of FIGS. 5B and 6 have the advantage of limiting unnecessary processing consumption associated with the video based recognition.
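The combined decision described above, in which the probability standard depends on whether the word is a characteristic word, can be sketched as below. Both threshold values and the function name are assumptions chosen only to show the mechanism.

```python
def should_use_video(is_characteristic, probability,
                     characteristic_standard=0.90, relaxed_standard=0.80):
    """Decide whether to proceed to video based recognition for one word.

    Characteristic (short) words must clear the higher standard to stay with
    audio; other words only need to clear the relaxed standard.
    """
    standard = characteristic_standard if is_characteristic else relaxed_standard
    return probability <= standard
```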
Referring to the exemplary flow diagram of FIG. 7, an exemplary video based speech recognition algorithm for performing speech-to-text will be discussed. At 702, the video stream is converted into a sequence of N image frames, each having a time duration T. The time durations T may be equal, or may be set for each word of the utterance. The mouth region can be cropped in each frame. At 704, a feature signal parameter sequence is generated from the sequence of image frames. For example, a local binary pattern (LBP) operator can be used to generate a feature signal vector. At 706, the feature signal vector is matched to a stored feature signal parameter sequence. At 708, the probability of each of a plurality of candidate prototype words generating the stored feature vector is determined. At 710, the candidate prototype word of the highest probability is selected, and the text (second dictation) for the candidate prototype word is generated. The video based recognition module 412 can be configured according to instructions in memory to perform the video based speech recognition algorithm.
Referring to the exemplary flow diagram of FIG. 8A, an exemplary process for performing speech-to-text in accordance with a fourth embodiment will be discussed.
At 802, the audio input device 314 receives an audio signal representing the user voice and the video input device 316 receives a video signal representative of lip movement associated with the user voice. The audio and video signals can be analog or digital signals. The audio signals are converted into audio streams including a sequence of frames representing the voice utterance. The video signals are converted into video streams including a sequence of image frames representing lip motion associated with the voice utterance as shown in FIG. 1B.
At 804, the audio based recognition module 410 generates a first dictation based on the audio signal. The controller 406 assigns a first conversion value based on a first set of conversion criteria. Referring to FIG. 8B, the controller 406 assigned first conversion values 820, 822, 824, 826, 828, 830 based upon whether the words are long (L) or short (S) as the conversion criteria.
At 806, the video based recognition module 412 generates a second dictation based on the video data. The controller 406 assigns a second conversion value based on a second set of conversion criteria. Referring to FIG. 8B, the controller 406 assigned second conversion values 840, 842, 844, 846, 848, 850 based upon whether the words are different from corresponding words of the first dictation (Y or N).
The first and second dictations can be generated based upon known subject dependent (SD) and subject independent (SI) voice based and lip motion based speech-to-text algorithms such as discussed above in FIG. 6 (steps 602-612) and FIG. 7.
At 808, the controller 406 generates a variable text conversion value based on the first conversion value and the second conversion value. Generally, the variable text conversion value can be one or more rules for determining whether to use a word or words of the first dictation or the second dictation. In FIG. 8B, the variable text conversion value is to use the second dictation when a length of a word of the first dictation is less than a predetermined time duration.
However, other rules or a combination thereof may be used as the variable text conversion value. Example rules include: (1) use the second dictation when a number of syllables in a word of the first dictation is not greater than N; (2) use the second dictation when a signal to noise ratio between the audio signal and background noise is below a predetermined threshold; (3) use the first dictation when a signal to brightness ratio is below a predetermined threshold; (4) use the second dictation when the volume of the audio signal is lower than a predetermined value; (5) use the first dictation when the video input device detects no lip movement; and (6) use the first dictation for certain global positioning system (GPS) data.
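A combination of such rules can be sketched as a simple ordered check. The sketch below covers only rules (1), (4) and (5); the order in which the rules are applied, the default of using the first dictation when no rule fires, and all parameter names and limits are assumptions.

```python
def choose_dictation(syllables, volume, lip_movement_detected,
                     max_syllables=1, min_volume=0.2):
    """Return "first" (audio) or "second" (video) for one word.

    syllables             : number of syllables in the word of the first dictation
    volume                : normalized volume of the audio signal, 0.0 to 1.0
    lip_movement_detected : False when the video input shows no lip movement
    """
    if not lip_movement_detected:
        return "first"    # rule (5): nothing to lip read
    if syllables <= max_syllables:
        return "second"   # rule (1): short word
    if volume < min_volume:
        return "second"   # rule (4): quiet audio
    return "first"        # assumed default when no rule applies
```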
The signal to noise data and signal to brightness data can be obtained by the audio input device 314 and the video input device 316 together with the controller 306. Particularly, this data can be included in the first data packets sent by the transceiver 302 to the server end device 400. Alternatively, the controller 406, audio based recognition module 410 and video based recognition module 412 at the server end device 400 can be configured to determine the signal to noise value of the audio stream and the signal to brightness value of the video stream. The GPS data can be determined by the mobile station 300 and included in the first data packets sent by the transceiver 302. Particularly, the controller 306 of the mobile station 300 can utilize the transceiver 302 and/or an associated base station (not shown) for obtaining the GPS data.
At 810, the controller 406 generates a third dictation based on the variable text conversion value. Returning to FIG. 8B, the third dictation can include only the words of the second dictation, or the words of the first dictation except for word 123 which is replaced with word 163.
The first and second conversion values can be, for example, an evaluation value that increases with the probability that a produced word is a corresponding keyword, as the conversion criteria.
As the evaluation value, for example, a confidence measure may be adopted. The confidence measure is used when determining whether a voice included in the voice data, and lip motion associated with the voice data, as obtained by publicly known voice and video recognition processes, coincide with a corresponding keyword. The confidence measure may be calculated based on a posterior probability in a hidden Markov model (HMM) or a likelihood (an acoustic likelihood or a linguistic likelihood) obtained by voice and video recognition processing. In this case, the dictation device previously stores an acoustic model, an LBP model, a dictionary, and a language model, which relate to the above-mentioned keywords, and performs the voice and video recognition processes with reference to this information, thereby calculating the above-mentioned conversion values.
Alternatively, the conversion values may be obtained from, for example, a similarity based on distances calculated between the pieces of voice data of the preliminarily registered keywords and the input voice data. The similarity may be calculated by calculating distances between pieces of voice data using, for example, a publicly known distance calculation method that utilizes a predetermined sound feature quantity such as a spectrum, and by performing dynamic programming (DP) matching in a voice interval. In this case, the dictation device preliminarily stores therein the pieces of voice data or the sound feature quantities of the above-mentioned keywords.
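DP matching between a registered keyword and the input can be sketched with a standard dynamic time warping recurrence. For brevity each frame is represented by a single number rather than a spectral feature vector, and the absolute difference is used as the per-frame distance; both choices, and the function names, are assumptions.

```python
def dp_matching_distance(reference, observed):
    """Accumulated distance of the best time alignment of two sequences."""
    inf = float("inf")
    n, m = len(reference), len(observed)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(reference[i - 1] - observed[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_keyword(templates, observed):
    """Return the registered keyword with the smallest DP matching distance."""
    return min(templates, key=lambda word: dp_matching_distance(templates[word], observed))
```

A smaller distance corresponds to a higher similarity, so a conversion value could be derived by any decreasing function of the distance.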
The first and second sets of conversion criteria can include pre-registered data representing a value associated with the user voice or lip motion such as in a SD recognition system. As discussed above, in the SD voice based system, the reference pattern models and/or predetermined feature signals may have been generated based upon a user speaking a training sequence. Similarly, in the SD lip reading services, the user inputs a training sample size so that the service can “learn” the user's unique characteristics.
The controller can be configured to prioritize either the first conversion value or the second conversion value by a predetermined setting assigned by a user. For example, when the mobile station is in a very noisy location, it may be best to use the second dictation. In this case, the user can manually set the mobile station to use only the second dictation. Alternatively, the mobile station can be set to only use the first or second dictation when the mobile station is determined to be at a certain location. The mobile station can be configured to receive global positioning system (GPS) data so that the location can be determined. Alternatively, the controller can generate the variable text conversion value based on the GPS data as well as the first and second conversion values.
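A location based setting of this kind can be sketched as a lookup of user-defined zones. The zone dictionary layout, the circular zone shape, the default value and the function names are hypothetical; the great-circle distance uses the standard haversine formula with a mean Earth radius of 6371 km.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def dictation_for_location(position, zones, default="combined"):
    """Return the dictation source configured for the zone containing position."""
    lat, lon = position
    for zone in zones:
        if haversine_km(lat, lon, zone["lat"], zone["lon"]) <= zone["radius_km"]:
            return zone["use"]
    return default
```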
As discussed above, one rule associated with the variable text conversion value can be that a signal to noise ratio between the audio signal and background noise is below a predetermined threshold. In this case, the controller can be configured to disable the audio input device.
As discussed above, one rule associated with the variable text conversion value can be that a signal to brightness ratio is below a predetermined threshold. In this case, the controller can be configured to disable the video input device.
As discussed above, one rule associated with the variable text conversion value can be that the video input device detects no lip movement. In this case, the controller can be configured to disable the video input device.
The dictation generated by the speech-to-text services of the various embodiments can be passed to another machine as command data or control data to achieve voice based navigation or control. For example, as shown in FIG. 9, the dictation device of the various embodiments can be incorporated in a control system for vehicle components. A voice utterance: “Turn on Air Conditioner” is converted to dictation as discussed in the above embodiments. The controller can activate the air conditioning system based upon the dictation. Other vehicle components such as the radio, vehicle navigation system, windshield wiper, etc. can be controlled based upon the dictation. In such cases, although the speech is converted into dictation, those skilled in the art will appreciate that the dictation will only be represented in machine code or the like recognizable by the controller. Alternatively, the dictation device of the various embodiments can be incorporated into home appliances such as a television. Voice commands such as to change the channel, activate or deactivate the television, etc. can be converted into dictation to perform control of the television in accordance with the command.
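The assignment of a control command from dictation can be sketched as a table lookup. The command table, the component and action identifiers, and the function names are hypothetical and only illustrate how recognized text might be mapped to a value a controller can act upon.

```python
# Hypothetical mapping from normalized dictation to (component, action).
COMMANDS = {
    "turn on air conditioner": ("air_conditioner", "on"),
    "turn off air conditioner": ("air_conditioner", "off"),
    "turn on radio": ("radio", "on"),
    "turn on windshield wiper": ("windshield_wiper", "on"),
}


def normalize(dictation):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(dictation.lower().split())


def to_command(dictation):
    """Return the (component, action) pair for the dictation, or None if unknown."""
    return COMMANDS.get(normalize(dictation))
```

Returning `None` for unrecognized text lets the caller ignore the utterance rather than actuate a component on a misrecognized command.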
Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

Claims (16)

The invention claimed is:
1. A vehicle component control device comprising:
an audio input device configured to receive a voice utterance including a plurality of words;
a video input device configured to receive video of lip motion of a user;
a memory portion;
a controller configured according to instructions in the memory portion to generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion; and
a transceiver for sending the first data packets to a remote apparatus and receiving second data packets including machine code for controlling a vehicle component,
wherein the machine code is assigned from configured dictation generated based upon the audio stream and the video stream from the remote apparatus,
wherein in the configured dictation, at least one word in first dictation generated based upon the audio stream which has a predetermined characteristic has been corrected by a feature signal parameter sequence based upon the video stream,
wherein the first data packets further include geographical data and the remote apparatus determines a location associated with the vehicle component control device based on the geographical data,
wherein the machine code is assigned from the first dictation or the configured dictation based upon the location associated with the vehicle component control device.
2. The vehicle component control device of claim 1, wherein the geographical data includes global positioning system (GPS) data.
3. The vehicle component control device of claim 1, wherein a location where the machine code is assigned from the configured dictation is manually set by the user.
4. A vehicle component control device comprising:
an audio input device that receives an audio signal representing a voice utterance;
a video input device that receives a video signal representative of movement of a user;
a controller configured according to instructions stored in a memory, the controller configured to:
generate first dictation based on the audio signal;
generate a feature signal parameter sequence based on the video signal;
generate configured dictation based on the first dictation and the feature signal parameter sequence;
determine a location associated with the vehicle component control device based on geographical data associated with the vehicle component control device; and
assign machine code for controlling a vehicle component based on the first dictation or the configured dictation based upon the location associated with the vehicle component control device.
5. The vehicle component control device of claim 4, wherein the video input device is disabled when a signal to brightness ratio is below a predetermined threshold.
6. The vehicle component control device of claim 4, wherein the controller is further configured to combine the audio signal and the video signal into a Moving Picture Experts Group (MPEG) stream according to an MPEG format, and use synchronization data of the MPEG stream to determine a portion of the video signal that corresponds to a portion of the audio signal when generating the configured dictation.
7. The vehicle component control device of claim 4, wherein:
the controller is configured to refer to a first set of conversion criteria when generating the first dictation; and
the first set of conversion criteria includes pre-registered data representing a value associated with a user voice.
8. The vehicle component control device of claim 4, wherein the configured dictation is generated based on a predetermined criteria which includes pre-registered data representing a user voice.
9. The vehicle component control device of claim 4, wherein a predetermined setting is assigned by the user.
10. The vehicle component control device of claim 4, wherein the video input device generates the feature signal parameter sequence based upon the video stream by:
extracting a sequence of image frames from a predetermined portion of the video stream;
generating a local binary pattern (LBP) from a series of images;
matching the LBP to a feature signal vector stored in the memory;
determining a probability for each of a plurality of candidate prototype words generating the feature signal vector;
selecting a candidate prototype word of the plurality of candidate prototype words of highest probability to be the feature signal parameter sequence.
11. A vehicle component control device comprising:
an audio input device configured to receive a voice utterance including a plurality of words;
a video input device configured to receive video of lip motion of a user;
a controller configured to generate first data packets including an audio stream representative of the voice utterance and a video stream representative of the lip motion; and
a transceiver for sending the first data packets to a remote apparatus and receiving second data packets including machine code for controlling a vehicle component, the machine code assigned from configured dictation generated based upon the audio stream and the video stream from the remote apparatus,
wherein in the configured dictation, at least one word in first dictation generated based upon the audio stream which has a predetermined characteristic has been corrected by a feature signal parameter sequence based upon the video stream,
wherein the vehicle component control device is configured to receive global positioning system (GPS) data to determine a location of the vehicle component control device, and
wherein the machine code is assigned from the first dictation or the configured dictation based upon the location associated with the vehicle component control device.
12. The vehicle component control device of claim 11, further comprising a memory portion for storing pre-registered data representing a user voice.
13. The vehicle component control device of claim 11, further comprising a memory portion storing an instruction for performing an audio based speech recognition algorithm to convert the audio stream into the first dictation.
14. The vehicle component control device of claim 11, wherein the vehicle component controlled by the vehicle component control device includes one of an air conditioner, a radio, a vehicle navigation system and/or a windshield wiper.
15. The vehicle component control device of claim 11, wherein the location is determined by GPS data included in the first data packets.
16. The vehicle component control device of claim 11, wherein the predetermined characteristic further includes that the at least one word has four or fewer syllables and is shorter than a predetermined length or time duration.
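By way of a non-limiting illustration, the video-side steps recited in claim 10 (frame extraction, LBP generation, matching against stored feature vectors, per-word probability, and selection of the most probable candidate) might be sketched as below. This is a simplified sketch under stated assumptions, not the patented implementation: it uses a plain per-frame 8-neighbour LBP histogram averaged over the frames rather than a spatiotemporal descriptor, it scores candidates with a softmax over negative L1 distances, and the names lbp_histogram, sequence_descriptor, word_probabilities, select_word, and the prototypes dictionary are all hypothetical.

```python
from math import exp

# Offsets of the 8 neighbours around a pixel, one per bit of the LBP code.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_histogram(frame):
    """Return a normalized 256-bin histogram of LBP codes for one grayscale
    frame given as a list of equal-length rows of intensities."""
    hist = [0] * 256
    rows, cols = len(frame), len(frame[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = frame[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
                # Set the bit when the neighbour is at least as bright as the center.
                if frame[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist) or 1  # avoid division by zero for frames with no interior
    return [count / total for count in hist]


def sequence_descriptor(frames):
    """Average the per-frame histograms into one vector for the frame sequence
    (an assumed simplification of 'an LBP from a series of images')."""
    hists = [lbp_histogram(frame) for frame in frames]
    return [sum(column) / len(hists) for column in zip(*hists)]


def word_probabilities(descriptor, prototypes):
    """Map each candidate word to a probability using a softmax over the
    negative L1 distance between the descriptor and the stored vector."""
    scores = {
        word: -sum(abs(a - b) for a, b in zip(descriptor, vector))
        for word, vector in prototypes.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {word: exp(score - top) for word, score in scores.items()}
    norm = sum(weights.values())
    return {word: weight / norm for word, weight in weights.items()}


def select_word(frames, prototypes):
    """Return the most probable candidate word and the full distribution."""
    probabilities = word_probabilities(sequence_descriptor(frames), prototypes)
    best = max(probabilities, key=probabilities.get)
    return best, probabilities
```

In this sketch, prototypes stands in for the feature signal vectors stored in the memory: a mapping from each candidate prototype word to a 256-element vector, for example one built by calling sequence_descriptor on enrollment frames for that word. Calling select_word(frames, prototypes) on frames taken from the predetermined portion of the video stream returns the selected word together with the probability assigned to every candidate.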
US15/976,834 2014-11-26 2018-05-10 Method and apparatus for recognizing speech by lip reading Active US10204626B2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US15/976,834 US10204626B2 (en) 2014-11-26 2018-05-10 Method and apparatus for recognizing speech by lip reading
US16/126,410 US10424301B2 (en) 2014-11-26 2018-09-10 Method and apparatus for recognizing speech by lip reading
US16/541,110 US10565992B2 (en) 2014-11-26 2019-08-14 Method and apparatus for recognizing speech by lip reading

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201462085243P 2014-11-26 2014-11-26
US14/826,151 US9741342B2 (en) 2014-11-26 2015-08-13 Method and apparatus for recognizing speech by lip reading
US15/649,251 US9997159B2 (en) 2014-11-26 2017-07-13 Method and apparatus for recognizing speech by lip reading
US15/976,834 US10204626B2 (en) 2014-11-26 2018-05-10 Method and apparatus for recognizing speech by lip reading

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US15/649,251 Continuation US9997159B2 (en) 2014-11-26 2017-07-13 Method and apparatus for recognizing speech by lip reading

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/126,410 Continuation US10424301B2 (en) 2014-11-26 2018-09-10 Method and apparatus for recognizing speech by lip reading

Publications (2)

Publication Number Publication Date
US20180261222A1 US20180261222A1 (en) 2018-09-13
US10204626B2 US10204626B2 (en) 2019-02-12

Family

ID=56010842

Family Applications (5)

Application Number Title Priority Date Filing Date
US14/826,151 Active US9741342B2 (en) 2014-11-26 2015-08-13 Method and apparatus for recognizing speech by lip reading
US15/649,251 Active US9997159B2 (en) 2014-11-26 2017-07-13 Method and apparatus for recognizing speech by lip reading
US15/976,834 Active US10204626B2 (en) 2014-11-26 2018-05-10 Method and apparatus for recognizing speech by lip reading
US16/126,410 Active US10424301B2 (en) 2014-11-26 2018-09-10 Method and apparatus for recognizing speech by lip reading
US16/541,110 Active US10565992B2 (en) 2014-11-26 2019-08-14 Method and apparatus for recognizing speech by lip reading

Family Applications Before (2)

Application Number Title Priority Date Filing Date
US14/826,151 Active US9741342B2 (en) 2014-11-26 2015-08-13 Method and apparatus for recognizing speech by lip reading
US15/649,251 Active US9997159B2 (en) 2014-11-26 2017-07-13 Method and apparatus for recognizing speech by lip reading

Family Applications After (2)

Application Number Title Priority Date Filing Date
US16/126,410 Active US10424301B2 (en) 2014-11-26 2018-09-10 Method and apparatus for recognizing speech by lip reading
US16/541,110 Active US10565992B2 (en) 2014-11-26 2019-08-14 Method and apparatus for recognizing speech by lip reading

Country Status (1)

Country Link
US (5) US9741342B2 (en)

Families Citing this family (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9633656B2 (en) 2010-07-27 2017-04-25 Sony Corporation Device registration process from second display
US9916832B2 (en) * 2016-02-18 2018-03-13 Sensory, Incorporated Using combined audio and vision-based cues for voice command-and-control
US10417021B2 (en) 2016-03-04 2019-09-17 Ricoh Company, Ltd. Interactive command assistant for an interactive whiteboard appliance
US10409550B2 (en) * 2016-03-04 2019-09-10 Ricoh Company, Ltd. Voice control of interactive whiteboard appliances
US11455985B2 (en) * 2016-04-26 2022-09-27 Sony Interactive Entertainment Inc. Information processing apparatus
EP3239981B1 (en) * 2016-04-26 2018-12-12 Nokia Technologies Oy Methods, apparatuses and computer programs relating to modification of a characteristic associated with a separated audio signal
US9853758B1 (en) * 2016-06-24 2017-12-26 Harman International Industries, Incorporated Systems and methods for signal mixing
EP3602544A4 (en) 2017-03-23 2020-02-05 Joyson Safety Systems Acquisition LLC System and method of correlating mouth images to input commands
CN106875941B (en) * 2017-04-01 2020-02-18 彭楚奥 Voice semantic recognition method of service robot
US10013979B1 (en) * 2017-04-17 2018-07-03 Essential Products, Inc. Expanding a set of commands to control devices in an environment
CN107221326B (en) * 2017-05-16 2021-05-28 百度在线网络技术(北京)有限公司 Voice awakening method and device based on artificial intelligence and computer equipment
CN107679449B (en) * 2017-08-17 2018-08-03 平安科技(深圳)有限公司 Lip motion method for catching, device and storage medium
EP4130941A1 (en) * 2018-05-04 2023-02-08 Google LLC Hot-word free adaptation of automated assistant function(s)
US11087748B2 (en) * 2018-05-11 2021-08-10 Google Llc Adaptive interface in a voice-activated network
US11386900B2 (en) * 2018-05-18 2022-07-12 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
US10540974B2 (en) * 2018-05-30 2020-01-21 Intel Corporation Detection and disposition of commands methods and apparatuses
CN110580336B (en) * 2018-06-08 2022-03-01 北京得意音通技术有限责任公司 Lip language word segmentation method and device, storage medium and electronic equipment
JP7143665B2 (en) * 2018-07-27 2022-09-29 富士通株式会社 Speech recognition device, speech recognition program and speech recognition method
DE102018212902A1 (en) * 2018-08-02 2020-02-06 Bayerische Motoren Werke Aktiengesellschaft Method for determining a digital assistant for performing a vehicle function from a multiplicity of digital assistants in a vehicle, computer-readable medium, system, and vehicle
US10861457B2 (en) * 2018-10-26 2020-12-08 Ford Global Technologies, Llc Vehicle digital assistant authentication
KR20200056754A (en) * 2018-11-15 2020-05-25 삼성전자주식회사 Apparatus and method for generating personalization lip reading model
KR102717792B1 (en) * 2018-12-14 2024-10-16 삼성전자 주식회사 Method for executing function and Electronic device using the same
CN110415701A (en) * 2019-06-18 2019-11-05 平安科技(深圳)有限公司 The recognition methods of lip reading and its device
CN110196914B (en) * 2019-07-29 2019-12-27 上海肇观电子科技有限公司 Method and device for inputting face information into database
US11069357B2 (en) * 2019-07-31 2021-07-20 Ebay Inc. Lip-reading session triggering events
CN110827799B (en) * 2019-11-21 2022-06-10 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
KR20210112726A (en) * 2020-03-06 2021-09-15 엘지전자 주식회사 Providing interactive assistant for each seat in the vehicle
CN111462733B (en) * 2020-03-31 2024-04-16 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium
CN111464827A (en) * 2020-04-20 2020-07-28 玉环智寻信息技术有限公司 Data processing method and device, computing equipment and storage medium
CN111667835A (en) * 2020-06-01 2020-09-15 马上消费金融股份有限公司 Voice recognition method, living body detection method, model training method and device
CN111930453A (en) * 2020-07-21 2020-11-13 北京字节跳动网络技术有限公司 Dictation interaction method and device and electronic equipment
JP7536566B2 (en) * 2020-09-10 2024-08-20 株式会社ディーアンドエムホールディングス Audio Equipment
KR20220059629A (en) * 2020-11-03 2022-05-10 현대자동차주식회사 Vehicle and method for controlling thereof
WO2022125351A2 (en) * 2020-12-09 2022-06-16 Cerence Operating Company Automotive infotainment system with spatially-cognizant applications that interact with a speech interface
US20220208185A1 (en) * 2020-12-24 2022-06-30 Cerence Operating Company Speech Dialog System for Multiple Passengers in a Car
WO2022147411A1 (en) 2020-12-30 2022-07-07 Assa Abloy Ab Facial expression to augment face id and presentation attack detection
CN112820274B (en) * 2021-01-08 2021-09-28 上海仙剑文化传媒股份有限公司 Voice information recognition correction method and system
CN114676282B (en) * 2022-04-11 2023-02-03 北京女娲补天科技信息技术有限公司 Event entry method and device based on audio and video data and computer equipment
CN116631452B (en) * 2023-04-06 2024-01-02 深圳市亚通桥文化传播有限公司 Management system is read in drawing book record broadcast based on artificial intelligence

Citations (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4587670A (en) 1982-10-15 1986-05-06 At&T Bell Laboratories Hidden Markov model speech recognition arrangement
US6012027A (en) 1997-05-27 2000-01-04 Ameritech Corporation Criteria for usable repetitions of an utterance during speech reference enrollment
US20020178344A1 (en) * 2001-05-22 2002-11-28 Canon Kabushiki Kaisha Apparatus for managing a multi-modal user interface
US20020198710A1 (en) * 2001-05-31 2002-12-26 Gustavo Hernandez-Abrego System and method for speech verification using a robust confidence measure
US20030002726A1 (en) * 2000-03-31 2003-01-02 Fujitsu Limited Image processing apparatus and image processing program
US20030165320A1 (en) * 2001-05-25 2003-09-04 Nobuyuki Ogawa Information signal edition apparatus, information signal edition method, and information edition program
US20040267521A1 (en) * 2003-06-25 2004-12-30 Ross Cutler System and method for audio/video speaker detection
US20050228676A1 (en) * 2002-03-20 2005-10-13 Tohru Ifukube Audio video conversion apparatus and method, and audio video conversion program
US20060199548A1 (en) * 2005-03-01 2006-09-07 Telefonaktiebolaget Lm Ericsson (Publ) Large scale measurement of subjective quality in mobile communications systems
US20080221862A1 (en) * 2007-03-09 2008-09-11 Yahoo! Inc. Mobile language interpreter with localization
US20090010615A1 (en) * 2007-07-05 2009-01-08 Sony Corporation Recording/reproducing apparatus, recording apparatus, reproducing apparatus, recording method, reproducing method and computer program
US20090225791A1 (en) 2008-03-05 2009-09-10 Nobuyuki Nagafuji Ip stream communicating system, ip stream receiving apparatus and reception processing timing sync method for use in ip stream receiving apparatus
US7606580B2 (en) * 2005-05-11 2009-10-20 Aol Llc Personalized location information for mobile devices
US20090300699A1 (en) * 2008-05-30 2009-12-03 Echostar Technologies L.L.C. Methods and apparatus for presenting substitute content in an audio/video stream using text data
US20100216429A1 (en) * 2009-02-26 2010-08-26 Manish Mahajan Methods and systems for recovering lost or stolen mobile devices
US20110071830A1 (en) * 2009-09-22 2011-03-24 Hyundai Motor Company Combined lip reading and voice recognition multimodal interface system
US20110222556A1 (en) * 2010-03-10 2011-09-15 Shefler David Method circuit and system for adaptive transmission and reception of video
US20110235870A1 (en) * 2010-03-25 2011-09-29 Sony Corporation Information processing device, information processing method, and program
US20110277005A1 (en) * 2010-05-04 2011-11-10 Sony Corporation Geographic internet asset filtering for internet video client
US20130120518A1 (en) * 2011-11-14 2013-05-16 Gold Post Technologies, Inc. Remote Virtual Supervision System
US20140002388A1 (en) * 2012-06-29 2014-01-02 Apple Inc. Biometric Initiated Communication
US20140010418A1 (en) 2011-03-21 2014-01-09 Hewlett-Packard Development Company, L.P. Lip activity detection
US20140156276A1 (en) * 2012-10-12 2014-06-05 Honda Motor Co., Ltd. Conversation system and a method for recognizing speech
US20140223272A1 (en) * 2013-02-04 2014-08-07 Audible, Inc. Selective synchronous presentation
US20140282877A1 (en) * 2013-03-13 2014-09-18 Lookout, Inc. System and method for changing security behavior of a device based on proximity to another device
US20140274144A1 (en) * 2013-03-15 2014-09-18 Comcast Cable Communications, LLC. Information Delivery Targeting
US20140368601A1 (en) * 2013-05-04 2014-12-18 Christopher deCharms Mobile security technology
US20150149169A1 (en) * 2013-11-27 2015-05-28 At&T Intellectual Property I, L.P. Method and apparatus for providing mobile multimodal speech hearing aid
US20150154983A1 (en) * 2013-12-03 2015-06-04 Lenovo (Singapore) Pte. Ltd. Detecting pause in audible input to device
US20150269835A1 (en) * 2012-06-13 2015-09-24 David B. Benoit Systems and methods for managing an emergency situation
US20160035352A1 (en) * 2013-05-21 2016-02-04 Mitsubishi Electric Corporation Voice recognition system and recognition result display apparatus
US20160034253A1 (en) * 2014-07-31 2016-02-04 Samsung Electronics Co., Ltd. Device and method for performing functions
US20160063990A1 (en) * 2014-08-26 2016-03-03 Honeywell International Inc. Methods and apparatus for interpreting clipped speech using speech recognition
US20160080550A1 (en) * 2014-09-11 2016-03-17 Samsung Electronics Co., Ltd. Emergency notification system and server

Family Cites Families (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4271499A (en) * 1978-07-12 1981-06-02 H.F. Communications Corporation Method and apparatus for digitally implementing a linked compressor-expander telecommunications system
CN1157712C (en) * 2000-02-28 2004-07-14 索尼公司 Speech recognition device and method, and recording medium
US7110951B1 (en) * 2000-03-03 2006-09-19 Dorothy Lemelson, legal representative System and method for enhancing speech intelligibility for the hearing impaired
US6785650B2 (en) * 2001-03-16 2004-08-31 International Business Machines Corporation Hierarchical transcription and display of input speech
GB0224806D0 (en) * 2002-10-24 2002-12-04 Ibm Method and apparatus for a interactive voice response system
US7123963B2 (en) * 2002-10-31 2006-10-17 Medtronic, Inc. Method of automatic evoked response sensing vector selection using evoked response waveform analysis
US20040243416A1 (en) * 2003-06-02 2004-12-02 Gardos Thomas R. Speech recognition
US20060122834A1 (en) * 2004-12-03 2006-06-08 Bennett Ian M Emotion detection device & method for use in distributed systems
US8214214B2 (en) * 2004-12-03 2012-07-03 Phoenix Solutions, Inc. Emotion detection device and method for use in distributed systems
WO2007019307A2 (en) * 2005-08-03 2007-02-15 Somatic Technologies, Inc. Somatic, auditory and cochlear communication system and method
US7937052B2 (en) * 2006-06-27 2011-05-03 Cisco Technology, Inc. Multiple input multiple output signal receiving apparatus with optimized performance
US7860716B2 (en) * 2007-04-24 2010-12-28 Microsoft Corporation Speech model refinement with transcription error detection
GB0709329D0 (en) * 2007-05-15 2007-06-20 Ipsotek Ltd Data processing apparatus
JP5098613B2 (en) * 2007-12-10 2012-12-12 富士通株式会社 Speech recognition apparatus and computer program
TWI385932B (en) * 2008-03-26 2013-02-11 Asustek Comp Inc Device and system for remote controlling
US8010545B2 (en) * 2008-08-28 2011-08-30 Palo Alto Research Center Incorporated System and method for providing a topic-directed search
US8676904B2 (en) * 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
JP2010185975A (en) * 2009-02-10 2010-08-26 Denso Corp In-vehicle speech recognition device
CN102349087B (en) * 2009-03-12 2015-05-06 谷歌公司 Automatically providing content associated with captured information, such as information captured in real-time
US20100332229A1 (en) * 2009-06-30 2010-12-30 Sony Corporation Apparatus control based on visual lip share recognition
US8341103B2 (en) * 2009-07-24 2012-12-25 Decision Lens, Inc. Method and system for connecting analytic network process model (ANP) with feedback throughout the ANP model between sub-networks
US8768313B2 (en) * 2009-08-17 2014-07-01 Digimarc Corporation Methods and systems for image or audio recognition processing
US8412530B2 (en) * 2010-02-21 2013-04-02 Nice Systems Ltd. Method and apparatus for detection of sentiment in automated transcriptions
US8700392B1 (en) * 2010-09-10 2014-04-15 Amazon Technologies, Inc. Speech-inclusive device interfaces
US8775172B2 (en) * 2010-10-02 2014-07-08 Noise Free Wireless, Inc. Machine for enabling and disabling noise reduction (MEDNR) based on a threshold
JP5824829B2 (en) * 2011-03-15 2015-12-02 富士通株式会社 Speech recognition apparatus, speech recognition method, and speech recognition program
US20130158977A1 (en) * 2011-06-14 2013-06-20 Andrew Senior System and Method for Evaluating Speech Exposure
US9318129B2 (en) * 2011-07-18 2016-04-19 At&T Intellectual Property I, Lp System and method for enhancing speech activity detection using facial feature detection
JP5799733B2 (en) * 2011-10-12 2015-10-28 富士通株式会社 Recognition device, recognition program, and recognition method
JP6018408B2 (en) * 2012-05-02 2016-11-02 任天堂株式会社 Information processing program, information processing apparatus, information processing system, and information processing method
US20160005150A1 (en) * 2012-09-25 2016-01-07 Benjamin Firooz Ghassabian Systems to enhance data entry in mobile and fixed environment
WO2014087571A1 (en) * 2012-12-07 2014-06-12 テルモ株式会社 Information processing device and information processing method
US9396723B2 (en) * 2013-02-01 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and device for acoustic language model training
US20140278393A1 (en) * 2013-03-12 2014-09-18 Motorola Mobility Llc Apparatus and Method for Power Efficient Signal Conditioning for a Voice Recognition System
US9094584B2 (en) * 2013-07-26 2015-07-28 SkyBell Technologies, Inc. Doorbell communication systems and methods
KR20150031896A (en) * 2013-09-17 2015-03-25 한국전자통신연구원 Speech recognition device and the operation method
US20150086175A1 (en) * 2013-09-25 2015-03-26 Mobile-Vision, Inc. Integrated video and audio recording and transmission
US9058375B2 (en) * 2013-10-09 2015-06-16 Smart Screen Networks, Inc. Systems and methods for adding descriptive metadata to digital content
US20150127340A1 (en) * 2013-11-07 2015-05-07 Alexander Epshteyn Capture
US9373321B2 (en) * 2013-12-02 2016-06-21 Cypress Semiconductor Corporation Generation of wake-up words
US9110635B2 (en) * 2013-12-03 2015-08-18 Lenovo (Singapore) Pte. Ltd. Initiating personal assistant application based on eye tracking and gestures
US20150326949A1 (en) * 2014-05-12 2015-11-12 International Business Machines Corporation Display of data of external systems in subtitles of a multi-media system
US20150364140A1 (en) * 2014-06-13 2015-12-17 Sony Corporation Portable Electronic Equipment and Method of Operating a User Interface
US9883043B2 (en) * 2014-08-20 2018-01-30 Pecan Technologies Inc Management of online interactions
US20170061935A1 (en) * 2015-08-31 2017-03-02 Sap Se Adaptive message display duration

Patent Citations (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4587670A (en) 1982-10-15 1986-05-06 At&T Bell Laboratories Hidden Markov model speech recognition arrangement
US6012027A (en) 1997-05-27 2000-01-04 Ameritech Corporation Criteria for usable repetitions of an utterance during speech reference enrollment
US20030002726A1 (en) * 2000-03-31 2003-01-02 Fujitsu Limited Image processing apparatus and image processing program
US20020178344A1 (en) * 2001-05-22 2002-11-28 Canon Kabushiki Kaisha Apparatus for managing a multi-modal user interface
US20030165320A1 (en) * 2001-05-25 2003-09-04 Nobuyuki Ogawa Information signal edition apparatus, information signal edition method, and information edition program
US20020198710A1 (en) * 2001-05-31 2002-12-26 Gustavo Hernandez-Abrego System and method for speech verification using a robust confidence measure
US20050228676A1 (en) * 2002-03-20 2005-10-13 Tohru Ifukube Audio video conversion apparatus and method, and audio video conversion program
US20040267521A1 (en) * 2003-06-25 2004-12-30 Ross Cutler System and method for audio/video speaker detection
US20060199548A1 (en) * 2005-03-01 2006-09-07 Telefonaktiebolaget Lm Ericsson (Publ) Large scale measurement of subjective quality in mobile communications systems
US7606580B2 (en) * 2005-05-11 2009-10-20 Aol Llc Personalized location information for mobile devices
US20080221862A1 (en) * 2007-03-09 2008-09-11 Yahoo! Inc. Mobile language interpreter with localization
US20090010615A1 (en) * 2007-07-05 2009-01-08 Sony Corporation Recording/reproducing apparatus, recording apparatus, reproducing apparatus, recording method, reproducing method and computer program
US20090225791A1 (en) 2008-03-05 2009-09-10 Nobuyuki Nagafuji Ip stream communicating system, ip stream receiving apparatus and reception processing timing sync method for use in ip stream receiving apparatus
US20090300699A1 (en) * 2008-05-30 2009-12-03 Echostar Technologies L.L.C. Methods and apparatus for presenting substitute content in an audio/video stream using text data
US20100216429A1 (en) * 2009-02-26 2010-08-26 Manish Mahajan Methods and systems for recovering lost or stolen mobile devices
US20110071830A1 (en) * 2009-09-22 2011-03-24 Hyundai Motor Company Combined lip reading and voice recognition multimodal interface system
US8442820B2 (en) 2009-09-22 2013-05-14 Hyundai Motor Company Combined lip reading and voice recognition multimodal interface system
US20110222556A1 (en) * 2010-03-10 2011-09-15 Shefler David Method circuit and system for adaptive transmission and reception of video
US20110235870A1 (en) * 2010-03-25 2011-09-29 Sony Corporation Information processing device, information processing method, and program
US20110277005A1 (en) * 2010-05-04 2011-11-10 Sony Corporation Geographic internet asset filtering for internet video client
US20140010418A1 (en) 2011-03-21 2014-01-09 Hewlett-Packard Development Company, L.P. Lip activity detection
US20130120518A1 (en) * 2011-11-14 2013-05-16 Gold Post Technologies, Inc. Remote Virtual Supervision System
US20150269835A1 (en) * 2012-06-13 2015-09-24 David B. Benoit Systems and methods for managing an emergency situation
US20140002388A1 (en) * 2012-06-29 2014-01-02 Apple Inc. Biometric Initiated Communication
US20140156276A1 (en) * 2012-10-12 2014-06-05 Honda Motor Co., Ltd. Conversation system and a method for recognizing speech
US20140223272A1 (en) * 2013-02-04 2014-08-07 Audible, Inc. Selective synchronous presentation
US20140282877A1 (en) * 2013-03-13 2014-09-18 Lookout, Inc. System and method for changing security behavior of a device based on proximity to another device
US20140274144A1 (en) * 2013-03-15 2014-09-18 Comcast Cable Communications, LLC. Information Delivery Targeting
US20140368601A1 (en) * 2013-05-04 2014-12-18 Christopher deCharms Mobile security technology
US20160035352A1 (en) * 2013-05-21 2016-02-04 Mitsubishi Electric Corporation Voice recognition system and recognition result display apparatus
US20150149169A1 (en) * 2013-11-27 2015-05-28 At&T Intellectual Property I, L.P. Method and apparatus for providing mobile multimodal speech hearing aid
US20150154983A1 (en) * 2013-12-03 2015-06-04 Lenovo (Singapore) Pte. Ltd. Detecting pause in audible input to device
US20160034253A1 (en) * 2014-07-31 2016-02-04 Samsung Electronics Co., Ltd. Device and method for performing functions
US20160063990A1 (en) * 2014-08-26 2016-03-03 Honeywell International Inc. Methods and apparatus for interpreting clipped speech using speech recognition
US20160080550A1 (en) * 2014-09-11 2016-03-17 Samsung Electronics Co., Ltd. Emergency notification system and server

Non-Patent Citations (3)

Title
"Lipreading With Local Spatiotemporal Descriptors", authored by Guoying Zhao, Mark Barnard, and Matti Pietikainen, Aug. 18, 2009, IEEE Transactions on Multimedia, vol. 11, No. 7.
"Towards a Practical Lipreading System", authored by Ziheng Zhou, Guoying Zhao and Matti Pietikainen, [database online], [retrieved on Nov. 5, 2014], Retrieved from the Internet: <URL:https://www.researchgate.net/publication/221361770_Towards_a_practical_lipreading_system>.
"Towards a Visual Speech Learning System for the Deaf by Matching Dynamic Lip Shapes", authored by Shizhi Chen, D. Michael Quintian, and Yingli Tian, [database online], [retrieved on Aug. 13, 2015], Retrieved from the Internet: <URL:https://www-ee.ccny.cuny.edu/wwwn/yltian/Publications/ICCHP12_lipRead.pdf>.

Also Published As

Publication number Publication date
US10424301B2 (en) 2019-09-24
US20190027148A1 (en) 2019-01-24
US20170309275A1 (en) 2017-10-26
US20160148616A1 (en) 2016-05-26
US9997159B2 (en) 2018-06-12
US20180261222A1 (en) 2018-09-13
US9741342B2 (en) 2017-08-22
US20190371334A1 (en) 2019-12-05
US10565992B2 (en) 2020-02-18

Similar Documents

Publication Publication Date Title
US10565992B2 (en) Method and apparatus for recognizing speech by lip reading
US20220139393A1 (en) Driver interface with voice and gesture control
EP2801091B1 (en) Method, apparatus and computer program product for joint use of speech and text-based features for sentiment detection
US9552815B2 (en) Speech understanding method and system
WO2019217100A1 (en) Joint neural network for speaker recognition
US9786284B2 (en) Dual-band speech encoding and estimating a narrowband speech feature from a wideband speech feature
US11094316B2 (en) Audio analytics for natural language processing
US9837068B2 (en) Sound sample verification for generating sound detection model
EP2801092A1 (en) Methods, apparatuses and computer program products for implementing automatic speech recognition and sentiment detection on a device
US11626104B2 (en) User speech profile management
CN112489691A (en) Electronic device and operation method thereof
KR20200029314A (en) Apparatus and Method for controlling the apparatus thereof
US11783809B2 (en) User voice activity detection using dynamic classifier
CN116129942A (en) Voice interaction device and voice interaction method
CN116682420B (en) Speech recognition method, electronic device and readable medium
US20220165263A1 (en) Electronic apparatus and method of controlling the same
CN113096666A (en) Electronic device and control method thereof
CN116153291A (en) Voice recognition method and equipment
KR20220057989A (en) Electronic apparatus and the method thereof

Legal Events

Date Code Title Description
AS Assignment

Owner name: PANASONIC INTELLECTUAL PROPERTY CORPORATION OF AME

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TAKAYANAGI, YUICHIRO;KUSAKA, MASASHI;REEL/FRAME:045775/0021

Effective date: 20150813

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4