CN106653020A - Multi-business control method and system for smart sound and video equipment based on deep learning - Google Patents
Multi-business control method and system for smart sound and video equipment based on deep learning
- Publication number
- CN106653020A (application number CN201611144430.6A)
- Authority
- CN
- China
- Prior art keywords
- control signal
- feature information
- mfcc
- voice
- voice feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS › G10—MUSICAL INSTRUMENTS; ACOUSTICS › G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue (under G10L15/00—Speech recognition)
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063—Training (under G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/34—Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing (under G10L15/28—Constructional details of speech recognition systems)
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique, using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The embodiment of the invention discloses a multi-business control method and system for smart sound and video equipment based on deep learning. In the method, a voice preprocessing module extracts features from a voice control signal to obtain MFCC original voice feature information; a remote GPU server receives the MFCC original voice feature information and derives deep voice feature information from it; an internet connection module transmits control signal identification information to a signal analysis module, which generates a control signal code from the identification information, selects the corresponding control signal output module, and transmits the code to that module; the control signal output module then sends a control signal to the smart sound and video equipment according to the control signal code. The method can control many kinds of smart sound and video equipment that are based on different control protocols and realize different services, and it provides a more unified and natural man-machine interaction and control mode for such equipment.
Description
Technical Field
The invention relates to the technical field of multi-service control for intelligent audio-visual equipment, and in particular to a multi-service control method and system for intelligent audio-visual equipment based on deep learning.
Background
With the progress of the Internet of Things and artificial intelligence, intelligent audio-visual equipment technology has developed rapidly. More and more intelligent audio-visual devices are being designed and produced, realizing a variety of multimedia audio-visual services to meet different needs in daily life. Devices from different manufacturers, however, use different control and man-machine interaction modes: they may adopt control channels such as infrared, Bluetooth, and Z-wave, and they may realize interaction through voice, motion, touch, and other means. Because neither control nor interaction is unified, the barrier for users learning to operate intelligent audio-visual equipment is raised and the user experience easily suffers. Integrating multiple service scenarios and providing a unified, easy, and natural control and man-machine interaction mode for intelligent audio-visual equipment is therefore an urgent problem.
Deep learning is a subfield of artificial intelligence. In recent years, with the progress of Graphics Processing Unit (GPU) and cloud computing technology, breakthroughs have been made in deep learning theory and research. The introduction of deep learning has in turn driven rapid progress in computer vision, speech recognition, and related fields, and it brings new ideas to intelligent audio-visual equipment control technology.
An existing audio- and video-based natural interaction system for the smart home [1] uses a microphone and a camera to collect sound and image information, an information fusion module to process the signals, a machine learning method to derive a usable instruction, and a control signal transmitting module to send out the control signal.
Because that system controls through voice, gestures, faces, and actions together, it cannot offer the user a simple, unified interaction mode, so the cost of learning to use the system is high and the user experience is poor. Its traditional machine learning methods for recognizing voice, images, and other multimedia information yield a low recognition rate and poor robustness. Moreover, its voice and image recognition programs run locally, which increases the user's hardware and energy costs.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a multi-service control method and system for intelligent audio-visual equipment based on deep learning, which can control many kinds of intelligent audio-visual equipment that are based on different control protocols and realize different services, and which provides a more unified and natural man-machine interaction and control mode for such equipment.
In order to solve the above problems, the present invention provides a method for controlling multiple services of an intelligent audiovisual device based on deep learning, the method comprising:
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts features from the voice control signal to obtain original voice feature information in the form of Mel-Frequency Cepstral Coefficients (MFCC), and detects whether the logarithmic energy of the MFCC original voice features is larger than a threshold; if yes, the internet connection module sends the MFCC original voice feature information to a remote Graphics Processing Unit (GPU) server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, a corresponding control signal output module is selected, and the control signal code is transmitted to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
Preferably, the step of extracting the voice control signal by the voice preprocessing module to obtain the MFCC original voice feature information includes:
carrying out endpoint detection and segmentation processing on the voice control signal;
carrying out noise reduction processing on the voice control signal after the segmentation processing;
and performing MFCC original voice feature extraction on the voice control signal subjected to the noise reduction processing to obtain MFCC original voice feature information.
Preferably, the step of receiving, by the remote GPU server, the MFCC original speech feature information, performing deep speech feature extraction on the MFCC original speech feature information, and obtaining deep speech feature information includes:
the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and adopts a Bidirectional Long Short-Term Memory (biLSTM) neural network algorithm to perform deep voice feature extraction on the MFCC original voice feature information, obtaining deep voice feature information.
Preferably, the step of the remote GPU server receiving the MFCC original voice feature information, obtaining the deep voice feature information according to the MFCC original voice feature information, and sending the control signal identification information corresponding to the deep voice feature information to the internet connection module includes:
the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
Correspondingly, the invention also provides a deep learning-based intelligent audio-visual equipment multi-service control system, which comprises: the system comprises a microphone array, a voice preprocessing module, a remote GPU server, an internet connection module, a control signal analysis module and a control signal output module; wherein,
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, a corresponding control signal output module is selected, and the control signal code is transmitted to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
Preferably, the voice preprocessing module includes:
the segmentation unit is used for carrying out endpoint detection and segmentation processing on the voice control signal;
the noise reduction unit is used for carrying out noise reduction processing on the voice control signal after the segmentation processing;
and the extraction unit is used for extracting the MFCC original voice feature of the voice control signal after the noise reduction processing to obtain MFCC original voice feature information.
Preferably, the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and performs deep voice feature extraction on the MFCC original voice feature information by using a biLSTM algorithm to obtain deep voice feature information.
Preferably, the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
By implementing the embodiment of the invention, natural voice can be used to control many kinds of intelligent audio-visual equipment that are based on different control protocols and realize different services, providing a unified, natural, efficient, and low-cost man-machine interaction mode for intelligent audio-visual equipment. Meanwhile, deploying the complex deep learning task on a remote server reduces the user's hardware and energy costs, provides a high-performance, low-cost voice control instruction recognition service for intelligent audio-visual equipment, and improves the recognition accuracy of voice control instructions.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for controlling multiple services of an intelligent audio-visual device based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep learning speech recognition model in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a multi-service control system for intelligent audio-visual equipment based on deep learning according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them; all other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flowchart of a method for controlling multiple services of an intelligent audiovisual device based on deep learning according to an embodiment of the present invention, as shown in fig. 1, the method includes:
s1, the microphone array monitors and collects the voice control signal sent by the user with a specific frequency;
s2, the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server; if not, returning to S1;
s3, the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
s4, the Internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, selects a corresponding control signal output module, and transmits the control signal code to the control signal output module;
and S5, the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
Extracting the voice control signal with the voice preprocessing module to obtain the MFCC original voice feature information comprises the following steps (a code sketch follows the list):
carrying out endpoint detection and segmentation processing on the voice control signal;
carrying out noise reduction processing on the voice control signal after the segmentation processing;
and performing MFCC original voice feature extraction on the voice control signal subjected to the noise reduction processing to obtain MFCC original voice feature information.
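No specific signal-processing libraries are prescribed for these steps. The following Python sketch is an illustration only: it assumes a 16 kHz signal and uses librosa for endpoint detection, segmentation, and MFCC extraction, and the noisereduce package for denoising; the threshold value and the choice of 13 coefficients are likewise assumptions.

```python
# Illustrative sketch of the voice preprocessing module (endpoint detection,
# segmentation, noise reduction, MFCC extraction); libraries and the
# log-energy threshold are assumptions, not specified here.
import numpy as np
import librosa
import noisereduce as nr

SR = 16000                  # assumed microphone sampling rate
ENERGY_THRESHOLD = -6.0     # hypothetical log-energy gate, tuned per deployment

def extract_mfcc_features(y: np.ndarray, sr: int = SR) -> list:
    """Return MFCC matrices for the voice segments that pass the energy gate."""
    # 1. Endpoint detection and segmentation: keep only non-silent intervals.
    intervals = librosa.effects.split(y, top_db=30)

    features = []
    for start, end in intervals:
        segment = y[start:end]

        # 2. Noise reduction on the segmented voice control signal.
        denoised = nr.reduce_noise(y=segment, sr=sr)

        # 3. MFCC original voice feature extraction (13 coefficients assumed).
        mfcc = librosa.feature.mfcc(y=denoised, sr=sr, n_mfcc=13)

        # Log-energy gate (step S2): only sufficiently loud segments are
        # forwarded to the remote GPU server.
        log_energy = np.log(np.mean(denoised ** 2) + 1e-10)
        if log_energy > ENERGY_THRESHOLD:
            features.append(mfcc)
    return features
```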
Specifically, in S3, the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and performs deep voice feature extraction on the MFCC original voice feature information by using the biLSTM algorithm to obtain deep voice feature information.
Further, the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning control signal identification information to the Internet connection module; if not, returning an error mark to the Internet connection module.
In the embodiment of the invention, as shown in fig. 2, the main structure of the deep learning speech recognition model comprises a biLSTM, composed of a forward long short-term memory recurrent neural network and a reverse long short-term memory recurrent neural network, followed by a Softmax classifier. The input to the model is the MFCC speech features sent from the local internet connection module; its output is one of T+1 class identifiers. These comprise T categories, one for each control signal supported by the system, plus a Default category; if the model outputs the Default category, the MFCC speech features do not correspond to any control signal of the intelligent audio-visual equipment. The model is generated in advance in a training and generation stage and then deployed on the remote GPU server to provide the voice control instruction recognition service for intelligent audio-visual equipment to users.
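As one way to realize the structure of fig. 2 (an illustration, not a prescribed implementation), the sketch below builds a biLSTM followed by a Softmax classifier over T+1 classes in PyTorch; the framework, the hidden size, the value of T, and the pooling over time are assumptions.

```python
# Sketch of the deep learning speech recognition model of fig. 2:
# forward + reverse LSTM (biLSTM) feeding a Softmax classifier with
# T + 1 outputs (T control signals plus one Default class).
import torch
import torch.nn as nn

class BiLSTMRecognizer(nn.Module):
    def __init__(self, n_mfcc: int = 13, hidden: int = 128, num_signals: int = 20):
        super().__init__()
        # bidirectional=True creates the forward and reverse LSTMs.
        self.bilstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_signals + 1)  # T + 1 classes

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, time, n_mfcc), the features sent by the client.
        out, _ = self.bilstm(mfcc)          # (batch, time, 2 * hidden)
        pooled = out.mean(dim=1)            # pool the deep voice features over time
        return torch.softmax(self.classifier(pooled), dim=-1)
```

Here the convention is assumed that index num_signals is the Default class; a prediction of that index means no supported control signal was recognized.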
In the implementation, the training and generation process of the deep learning speech recognition model is as follows:
Step 1: simulate real device-use situations according to the types of intelligent audio-visual equipment to be supported and the service functions those devices realize, and collect a large number of voice segments with the microphone array;
Step 2: manually label the control signal type corresponding to each voice segment;
Step 3: extract MFCC voice features from all voice segments with the voice preprocessing module to obtain a labeled control voice feature data set;
Step 4: divide the data set, taking one portion of the labeled control voice feature data set as the training set (Training Set) and another portion as the validation set (Validation Set);
Step 5: randomly initialize all parameters of the deep learning speech recognition model;
Step 6: execute the deep learning forward propagation process with the training set as input;
Step 7: execute the deep learning back propagation process with the Back Propagation Through Time (BPTT) method and update all parameters of the deep learning speech recognition model;
Step 8: when a validation period is reached, verify the current deep learning speech recognition model with the validation set;
Step 9: stop training if the stopping condition is met; otherwise return to Step 6. The stopping condition may be that the number of training iterations reaches a preset value, or that the validation error falls below a threshold. A minimal sketch of this training loop follows.
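An illustrative realization of steps 6 through 9, assuming the BiLSTMRecognizer sketched above and PyTorch data loaders that yield (MFCC, label) batches; calling backward() on the loss computed over the unrolled sequence is what performs BPTT here, and both stopping conditions from step 9 are included.

```python
# Illustrative training loop for the nine-step procedure (steps 6-9).
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, max_epochs=50, target_err=0.05):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    nll = nn.NLLLoss()  # model outputs Softmax probabilities

    for epoch in range(max_epochs):           # stop condition: iteration count
        model.train()
        for mfcc, label in train_loader:
            probs = model(mfcc)               # step 6: forward propagation
            loss = nll(torch.log(probs + 1e-10), label)
            optimizer.zero_grad()
            loss.backward()                   # step 7: BPTT through the biLSTM
            optimizer.step()                  # update all model parameters

        # Step 8: validate the current model on the validation set.
        model.eval()
        wrong, total = 0, 0
        with torch.no_grad():
            for mfcc, label in val_loader:
                pred = model(mfcc).argmax(dim=-1)
                wrong += (pred != label).sum().item()
                total += label.numel()

        # Step 9: stop when the validation error is small enough.
        if wrong / max(total, 1) < target_err:
            break
```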
Correspondingly, an embodiment of the present invention further provides a deep learning-based intelligent audiovisual device multi-service control system, as shown in fig. 3, the system includes: the system comprises a microphone array 1, a voice preprocessing module 2, a remote GPU server 3, an internet connection module 4, a control signal analysis module 5 and a control signal output module 6; wherein,
the microphone array 1 monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module 2 extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module 4 sends MFCC original voice feature information to the remote GPU server 3;
the remote GPU server 3 receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module 4;
the internet connection module 4 transmits the control signal identification information to the control signal analysis module 5, the control signal analysis module 5 generates a control signal code according to the control signal identification information, selects a corresponding control signal output module 6, and transmits the control signal code to the control signal output module 6;
the control signal output module 6 sends a control signal to the intelligent audio-visual equipment according to the control signal code.
In the embodiment of the invention, the microphone array 1 collects the voice signals sent by the user in real time and sends the voice signals to the voice preprocessing module 2.
The voice preprocessing module 2 is responsible for performing endpoint detection, noise reduction and MFCC original voice feature extraction operations on voice signals.
The internet connection module 4 is responsible for establishing the network connection with the remote GPU server 3, sending the MFCC original voice feature information to the remote GPU server 3, and receiving feedback messages from the remote GPU server 3.
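No transport protocol is prescribed for this exchange. As one possibility, the sketch below posts the MFCC feature matrix to a hypothetical HTTP endpoint on the remote GPU server and returns the feedback message; the URL and JSON schema are invented for illustration.

```python
# Hypothetical internet connection module client; endpoint and message
# format are assumptions, not specified here.
import json
import urllib.request

SERVER_URL = "http://gpu-server.example:8080/recognize"  # placeholder address

def send_mfcc(mfcc_matrix) -> dict:
    """Send MFCC original voice features, return the server's feedback message."""
    payload = json.dumps({"mfcc": mfcc_matrix.tolist()}).encode("utf-8")
    request = urllib.request.Request(SERVER_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))
```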
The control signal analysis module 5 is responsible for analyzing the feedback message from the remote GPU server 3, and enables the corresponding control signal output module 6 according to the message content, or performs error processing.
There are multiple control signal output modules 6. Each control signal output unit is equipped with hardware supporting one wireless communication mode and is responsible for controlling all intelligent audio-visual equipment based on that mode; these wireless communication modes include infrared, Bluetooth, Z-wave, and so on.
The remote GPU server 3 provides a voice control instruction recognition service for intelligent audio-visual equipment for users.
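To make the division of labor among modules 4, 5, and 6 concrete, the following sketch shows one possible form of the control signal analysis and dispatch step; every identifier, control code, and class name in it is hypothetical.

```python
# Hypothetical control signal analysis module: map the identifier from the
# server's feedback message to a control signal code and enable the matching
# output module, or perform error processing.
from dataclasses import dataclass

@dataclass
class ControlEntry:
    protocol: str   # which control signal output module handles this signal
    code: bytes     # device-specific control signal code

# Table populated at configuration time (all entries invented).
CONTROL_TABLE = {
    "TV_POWER_ON":  ControlEntry("infrared",  b"\x01\xa0"),
    "SPEAKER_PLAY": ControlEntry("bluetooth", b"\x02\x10"),
    "LIGHT_OFF":    ControlEntry("zwave",     b"\x03\x00"),
}

def analyze_and_dispatch(message: dict, output_modules: dict) -> None:
    """message: feedback from the GPU server, e.g. {"id": "TV_POWER_ON"}."""
    signal_id = message.get("id")
    entry = CONTROL_TABLE.get(signal_id)
    if entry is None:                     # error mark or unknown identifier
        print("error processing: unrecognized control instruction", message)
        return
    # Select the corresponding output module and send the control signal code.
    output_modules[entry.protocol].send(entry.code)
```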
Further, the voice preprocessing module 2 includes:
the segmentation unit is used for carrying out endpoint detection and segmentation processing on the voice control signal;
the noise reduction unit is used for carrying out noise reduction processing on the voice control signal after the segmentation processing;
and the extraction unit is used for extracting the MFCC original voice feature of the voice control signal after the noise reduction processing to obtain MFCC original voice feature information.
The remote GPU server 3 receives the MFCC original voice feature information, starts a deep learning voice recognition program, and adopts the biLSTM algorithm to perform deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information.
The remote GPU server 3 receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module 4;
the remote GPU server 3 classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, control signal identification information is returned to the internet connection module 4.
Specifically, the working principle of the system related function module according to the embodiment of the present invention may refer to the related description of the method embodiment, and is not described herein again.
By implementing the embodiment of the invention, natural voice can be used to control many kinds of intelligent audio-visual equipment that are based on different control protocols and realize different services, providing a unified, natural, efficient, and low-cost man-machine interaction mode for intelligent audio-visual equipment. Meanwhile, deploying the complex deep learning task on a remote server reduces the user's hardware and energy costs, provides a high-performance, low-cost voice control instruction recognition service for intelligent audio-visual equipment, and improves the recognition accuracy of voice control instructions.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
The method and system for multi-service control of intelligent audio-visual equipment based on deep learning provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present invention, vary the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (8)
1. A multi-service control method of intelligent audio-visual equipment based on deep learning is characterized by comprising the following steps:
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, a corresponding control signal output module is selected, and the control signal code is transmitted to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
2. The intelligent audio-visual equipment multi-service control method based on deep learning of claim 1, wherein the step of extracting the voice control signal by the voice preprocessing module to obtain the MFCC original voice feature information comprises:
carrying out endpoint detection and segmentation processing on the voice control signal;
carrying out noise reduction processing on the voice control signal after the segmentation processing;
and performing MFCC original voice feature extraction on the voice control signal subjected to the noise reduction processing to obtain MFCC original voice feature information.
3. The intelligent audio-visual device multi-service control method based on deep learning of claim 1, wherein the remote GPU server receives MFCC original voice feature information, and performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, including:
and the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and adopts a biLSTM algorithm to perform deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information.
4. The intelligent audio-visual device multi-service control method based on deep learning of claim 1, wherein the remote GPU server receives MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the internet connection module, including:
the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
5. A smart audiovisual device multi-service control system based on deep learning, the system comprising: the system comprises a microphone array, a voice preprocessing module, a remote GPU server, an internet connection module, a control signal analysis module and a control signal output module; wherein,
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, a corresponding control signal output module is selected, and the control signal code is transmitted to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
6. The intelligent deep learning based multi-service control system for intelligent audio-visual devices as claimed in claim 5, wherein the voice pre-processing module comprises:
the segmentation unit is used for carrying out endpoint detection and segmentation processing on the voice control signal;
the noise reduction unit is used for carrying out noise reduction processing on the voice control signal after the segmentation processing;
and the extraction unit is used for extracting the MFCC original voice feature of the voice control signal after the noise reduction processing to obtain MFCC original voice feature information.
7. The deep learning-based intelligent audio-visual device multi-service control system as claimed in claim 5, wherein the remote GPU server receives MFCC original speech feature information, starts a deep learning speech recognition procedure, and performs deep speech feature extraction on the MFCC original speech feature information by using a biLSTM algorithm to obtain deep speech feature information.
8. The intelligent audio-visual equipment multi-service control system based on deep learning of claim 5, wherein the remote GPU server receives MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611144430.6A CN106653020A (en) | 2016-12-13 | 2016-12-13 | Multi-business control method and system for smart sound and video equipment based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611144430.6A CN106653020A (en) | 2016-12-13 | 2016-12-13 | Multi-business control method and system for smart sound and video equipment based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106653020A true CN106653020A (en) | 2017-05-10 |
Family
ID=58824998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611144430.6A Pending CN106653020A (en) | 2016-12-13 | 2016-12-13 | Multi-business control method and system for smart sound and video equipment based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106653020A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101221762A (en) * | 2007-12-06 | 2008-07-16 | 上海大学 | MP3 compression field audio partitioning method |
CN105700359A (en) * | 2014-11-25 | 2016-06-22 | 上海天脉聚源文化传媒有限公司 | Method and system for controlling smart home through speech recognition |
CN104952448A (en) * | 2015-05-04 | 2015-09-30 | 张爱英 | Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks |
CN105045122A (en) * | 2015-06-24 | 2015-11-11 | 张子兴 | Intelligent household natural interaction system based on audios and videos |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108074575A (en) * | 2017-12-14 | 2018-05-25 | 广州势必可赢网络科技有限公司 | Identity verification method and device based on recurrent neural network |
CN109559761A (en) * | 2018-12-21 | 2019-04-02 | 广东工业大学 | A kind of risk of stroke prediction technique based on depth phonetic feature |
CN110428821A (en) * | 2019-07-26 | 2019-11-08 | 广州市申迪计算机系统有限公司 | A kind of voice command control method and device for crusing robot |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
WO2021127982A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, smart device, and computer-readable storage medium |
CN111357051B (en) * | 2019-12-24 | 2024-02-02 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111783892A (en) * | 2020-07-06 | 2020-10-16 | 广东工业大学 | Robot instruction identification method and device, electronic equipment and storage medium |
CN113921016A (en) * | 2021-10-15 | 2022-01-11 | 阿波罗智联(北京)科技有限公司 | Voice processing method, device, electronic equipment and storage medium |
CN118708384A (en) * | 2024-08-30 | 2024-09-27 | 江苏博云科技股份有限公司 | Method and system for optimizing GPU remote call performance based on pre-analysis and calculation service |
Similar Documents
Publication | Title |
---|---|
CN106653020A (en) | Multi-business control method and system for smart sound and video equipment based on deep learning |
CN112889108B (en) | Speech classification using audiovisual data |
CN107437415B (en) | Intelligent voice interaction method and system |
CN102843543B (en) | Video conferencing reminding method, device and video conferencing system |
CN110519636B (en) | Voice information playing method and device, computer equipment and storage medium |
CN104049721B (en) | Information processing method and electronic equipment |
CN102890776B (en) | The method that expression figure explanation is transferred by facial expression |
CN109377995B (en) | Method and device for controlling equipment |
CN112863547A (en) | Virtual resource transfer processing method, device, storage medium and computer equipment |
CN111541951B (en) | Video-based interactive processing method and device, terminal and readable storage medium |
CN111966212A (en) | Multi-mode-based interaction method and device, storage medium and smart screen device |
CN110516749A (en) | Model training method, method for processing video frequency, device, medium and calculating equipment |
CN111197841A (en) | Control method, control device, remote control terminal, air conditioner, server and storage medium |
CN109032345A (en) | Apparatus control method, device, equipment, server-side and storage medium |
CN117193524A (en) | Man-machine interaction system and method based on multi-mode feature fusion |
CN108040111A (en) | A kind of apparatus and method for supporting natural language interaction |
CN107452381B (en) | Multimedia voice recognition device and method |
CN111402096A (en) | Online teaching quality management method, system, equipment and medium |
CN109343481B (en) | Method and device for controlling device |
CN111933137B (en) | Voice wake-up test method and device, computer readable medium and electronic equipment |
CN117877125B (en) | Action recognition and model training method and device, electronic equipment and storage medium |
CN113571063A (en) | Voice signal recognition method and device, electronic equipment and storage medium |
WO2024114303A1 (en) | Phoneme recognition method and apparatus, electronic device and storage medium |
WO2018023523A1 (en) | Motion and emotion recognizing home control system |
WO2018023518A1 (en) | Smart terminal for voice interaction and recognition |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170510 |