CN106653020A - Multi-business control method and system for smart sound and video equipment based on deep learning - Google Patents
Multi-business control method and system for smart sound and video equipment based on deep learning
- Publication number
- CN106653020A (application number CN201611144430.6A)
- Authority
- CN
- China
- Prior art keywords
- control signal
- feature information
- mfcc
- voice
- voice feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS › G10—MUSICAL INSTRUMENTS; ACOUSTICS › G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue (under G10L15/00—Speech recognition)
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063—Training (under G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/34—Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing (under G10L15/28—Constructional details of speech recognition systems)
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique, using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The embodiment of the invention discloses a multi-business control method and system for smart sound and video equipment based on deep learning. In the method, a voice preprocessing module extracts features from a voice control signal to obtain MFCC original voice feature information; a remote GPU server receives the MFCC original voice feature information and derives deep voice feature information from it; an internet connection module transmits control signal identification information to a signal analysis module, which generates a control signal code from the identification information, selects the corresponding control signal output module, and transmits the code to that module; the control signal output module then sends a control signal to the smart sound and video equipment according to the control signal code. The method can control many kinds of smart sound and video equipment that are based on different control protocols and realize different services, and it provides a more unified and natural man-machine interaction and control mode for such equipment.
Description
Technical Field
The invention relates to the technical field of multi-service control for intelligent audio-visual equipment, and in particular to a multi-service control method and system for intelligent audio-visual equipment based on deep learning.
Background
With the progress of the Internet of Things and artificial intelligence, intelligent audio-visual equipment technology has developed rapidly. More and more intelligent audio-visual devices are being designed and produced, realizing a variety of multimedia audio-visual services to meet different needs in daily life. Devices from different manufacturers, however, use different control and man-machine interaction modes: they may adopt control channels such as infrared, Bluetooth, and Z-wave, and they may realize interaction through voice, motion, touch, and other means. Because neither control nor interaction is unified, the barrier for users learning to operate intelligent audio-visual equipment is raised and the user experience easily suffers. Integrating multiple service scenarios and providing a unified, easy, and natural control and man-machine interaction mode for intelligent audio-visual equipment is therefore an urgent problem.
Deep learning is a subfield of artificial intelligence. In recent years, with the progress of Graphics Processing Unit (GPU) and cloud computing technology, breakthroughs have been made in deep learning theory and research. The introduction of deep learning has in turn driven rapid progress in computer vision, speech recognition, and related fields, and it brings new ideas to intelligent audio-visual equipment control technology.
An existing audio- and video-based natural interaction system for the smart home [1] uses a microphone and a camera to collect sound and image information, an information fusion module to process the signals, a machine learning method to derive a usable instruction, and a control signal transmitting module to send out the control signal.
Because that system controls through voice, gestures, faces, and actions together, it cannot offer the user a simple, unified interaction mode, so the cost of learning to use the system is high and the user experience is poor. Its traditional machine learning methods for recognizing voice, images, and other multimedia information yield a low recognition rate and poor robustness. Moreover, its voice and image recognition programs run locally, which increases the user's hardware and energy costs.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a multi-service control method and system for intelligent audio-visual equipment based on deep learning, which can control many kinds of intelligent audio-visual equipment that are based on different control protocols and realize different services, and which provides a more unified and natural man-machine interaction and control mode for such equipment.
In order to solve the above problems, the present invention provides a method for controlling multiple services of an intelligent audiovisual device based on deep learning, the method comprising:
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts features from the voice control signal to obtain original voice feature information in the form of Mel-Frequency Cepstral Coefficients (MFCC), and detects whether the logarithmic energy of the MFCC original voice features is larger than a threshold; if yes, the internet connection module sends the MFCC original voice feature information to a remote Graphics Processing Unit (GPU) server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, a corresponding control signal output module is selected, and the control signal code is transmitted to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
Preferably, the step of extracting the voice control signal by the voice preprocessing module to obtain the MFCC original voice feature information includes:
carrying out endpoint detection and segmentation processing on the voice control signal;
carrying out noise reduction processing on the voice control signal after the segmentation processing;
and performing MFCC original voice feature extraction on the voice control signal subjected to the noise reduction processing to obtain MFCC original voice feature information.
Preferably, the step of receiving, by the remote GPU server, the MFCC original speech feature information, performing deep speech feature extraction on the MFCC original speech feature information, and obtaining deep speech feature information includes:
the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and adopts a Bidirectional Long Short-Term Memory (biLSTM) neural network algorithm to perform deep voice feature extraction on the MFCC original voice feature information, obtaining deep voice feature information.
Preferably, the step of the remote GPU server receiving the MFCC original voice feature information, obtaining the deep voice feature information according to the MFCC original voice feature information, and sending the control signal identification information corresponding to the deep voice feature information to the internet connection module includes:
the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
Correspondingly, the invention also provides a deep learning-based intelligent audio-visual equipment multi-service control system, which comprises: the system comprises a microphone array, a voice preprocessing module, a remote GPU server, an internet connection module, a control signal analysis module and a control signal output module; wherein,
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, a corresponding control signal output module is selected, and the control signal code is transmitted to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
Preferably, the voice preprocessing module includes:
the segmentation unit is used for carrying out endpoint detection and segmentation processing on the voice control signal;
the noise reduction unit is used for carrying out noise reduction processing on the voice control signal after the segmentation processing;
and the extraction unit is used for extracting the MFCC original voice feature of the voice control signal after the noise reduction processing to obtain MFCC original voice feature information.
Preferably, the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and performs deep voice feature extraction on the MFCC original voice feature information by using a biLSTM algorithm to obtain deep voice feature information.
Preferably, the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
By implementing the embodiment of the invention, natural voice can be used to control many kinds of intelligent audio-visual equipment that are based on different control protocols and realize different services, providing a unified, natural, efficient, and low-cost man-machine interaction mode for intelligent audio-visual equipment. Meanwhile, deploying the complex deep learning task on a remote server reduces the user's hardware and energy costs, provides a high-performance, low-cost voice control instruction recognition service for intelligent audio-visual equipment, and improves the recognition accuracy of voice control instructions.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for controlling multiple services of an intelligent audio-visual device based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep learning speech recognition model in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a multi-service control system for intelligent audio-visual equipment based on deep learning according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them; all other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flowchart of a method for controlling multiple services of an intelligent audiovisual device based on deep learning according to an embodiment of the present invention, as shown in fig. 1, the method includes:
s1, the microphone array monitors and collects the voice control signal sent by the user with a specific frequency;
s2, the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server; if not, returning to S1;
s3, the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
s4, the Internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, selects a corresponding control signal output module, and transmits the control signal code to the control signal output module;
and S5, the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
Extracting the voice control signal with the voice preprocessing module to obtain the MFCC original voice feature information comprises the following steps (a code sketch follows the list):
carrying out endpoint detection and segmentation processing on the voice control signal;
carrying out noise reduction processing on the voice control signal after the segmentation processing;
and performing MFCC original voice feature extraction on the voice control signal subjected to the noise reduction processing to obtain MFCC original voice feature information.
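No specific signal-processing libraries are prescribed for these steps. The following Python sketch is an illustration only: it assumes a 16 kHz signal and uses librosa for endpoint detection, segmentation, and MFCC extraction, and the noisereduce package for denoising; the threshold value and the choice of 13 coefficients are likewise assumptions.

```python
# Illustrative sketch of the voice preprocessing module (endpoint detection,
# segmentation, noise reduction, MFCC extraction); libraries and the
# log-energy threshold are assumptions, not specified here.
import numpy as np
import librosa
import noisereduce as nr

SR = 16000                  # assumed microphone sampling rate
ENERGY_THRESHOLD = -6.0     # hypothetical log-energy gate, tuned per deployment

def extract_mfcc_features(y: np.ndarray, sr: int = SR) -> list:
    """Return MFCC matrices for the voice segments that pass the energy gate."""
    # 1. Endpoint detection and segmentation: keep only non-silent intervals.
    intervals = librosa.effects.split(y, top_db=30)

    features = []
    for start, end in intervals:
        segment = y[start:end]

        # 2. Noise reduction on the segmented voice control signal.
        denoised = nr.reduce_noise(y=segment, sr=sr)

        # 3. MFCC original voice feature extraction (13 coefficients assumed).
        mfcc = librosa.feature.mfcc(y=denoised, sr=sr, n_mfcc=13)

        # Log-energy gate (step S2): only sufficiently loud segments are
        # forwarded to the remote GPU server.
        log_energy = np.log(np.mean(denoised ** 2) + 1e-10)
        if log_energy > ENERGY_THRESHOLD:
            features.append(mfcc)
    return features
```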
Specifically, in S3, the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and performs deep voice feature extraction on the MFCC original voice feature information by using the biLSTM algorithm to obtain deep voice feature information.
Further, the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning control signal identification information to the Internet connection module; if not, returning an error mark to the Internet connection module.
In the embodiment of the invention, as shown in fig. 2, the main structure of the deep learning speech recognition model comprises a biLSTM, composed of a forward long short-term memory recurrent neural network and a reverse long short-term memory recurrent neural network, followed by a Softmax classifier. The input to the model is the MFCC speech features sent from the local internet connection module; its output is one of T+1 class identifiers. These comprise T categories, one for each control signal supported by the system, plus a Default category; if the model outputs the Default category, the MFCC speech features do not correspond to any control signal of the intelligent audio-visual equipment. The model is generated in advance in a training and generation stage and then deployed on the remote GPU server to provide the voice control instruction recognition service for intelligent audio-visual equipment to users.
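As one way to realize the structure of fig. 2 (an illustration, not a prescribed implementation), the sketch below builds a biLSTM followed by a Softmax classifier over T+1 classes in PyTorch; the framework, the hidden size, the value of T, and the pooling over time are assumptions.

```python
# Sketch of the deep learning speech recognition model of fig. 2:
# forward + reverse LSTM (biLSTM) feeding a Softmax classifier with
# T + 1 outputs (T control signals plus one Default class).
import torch
import torch.nn as nn

class BiLSTMRecognizer(nn.Module):
    def __init__(self, n_mfcc: int = 13, hidden: int = 128, num_signals: int = 20):
        super().__init__()
        # bidirectional=True creates the forward and reverse LSTMs.
        self.bilstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_signals + 1)  # T + 1 classes

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, time, n_mfcc), the features sent by the client.
        out, _ = self.bilstm(mfcc)          # (batch, time, 2 * hidden)
        pooled = out.mean(dim=1)            # pool the deep voice features over time
        return torch.softmax(self.classifier(pooled), dim=-1)
```

Here the convention is assumed that index num_signals is the Default class; a prediction of that index means no supported control signal was recognized.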
In the implementation, the training and generation process of the deep learning speech recognition model is as follows:
Step 1: simulate real device-use situations according to the types of intelligent audio-visual equipment to be supported and the service functions those devices realize, and collect a large number of voice segments with the microphone array;
Step 2: manually label the control signal type corresponding to each voice segment;
Step 3: extract MFCC voice features from all voice segments with the voice preprocessing module to obtain a labeled control voice feature data set;
Step 4: divide the data set, taking one portion of the labeled control voice feature data set as the training set (Training Set) and another portion as the validation set (Validation Set);
Step 5: randomly initialize all parameters of the deep learning speech recognition model;
Step 6: execute the deep learning forward propagation process with the training set as input;
Step 7: execute the deep learning back propagation process with the Back Propagation Through Time (BPTT) method and update all parameters of the deep learning speech recognition model;
Step 8: when a validation period is reached, verify the current deep learning speech recognition model with the validation set;
Step 9: stop training if the stopping condition is met; otherwise return to Step 6. The stopping condition may be that the number of training iterations reaches a preset value, or that the validation error falls below a threshold. A minimal sketch of this training loop follows.
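An illustrative realization of steps 6 through 9, assuming the BiLSTMRecognizer sketched above and PyTorch data loaders that yield (MFCC, label) batches; calling backward() on the loss computed over the unrolled sequence is what performs BPTT here, and both stopping conditions from step 9 are included.

```python
# Illustrative training loop for the nine-step procedure (steps 6-9).
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, max_epochs=50, target_err=0.05):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    nll = nn.NLLLoss()  # model outputs Softmax probabilities

    for epoch in range(max_epochs):           # stop condition: iteration count
        model.train()
        for mfcc, label in train_loader:
            probs = model(mfcc)               # step 6: forward propagation
            loss = nll(torch.log(probs + 1e-10), label)
            optimizer.zero_grad()
            loss.backward()                   # step 7: BPTT through the biLSTM
            optimizer.step()                  # update all model parameters

        # Step 8: validate the current model on the validation set.
        model.eval()
        wrong, total = 0, 0
        with torch.no_grad():
            for mfcc, label in val_loader:
                pred = model(mfcc).argmax(dim=-1)
                wrong += (pred != label).sum().item()
                total += label.numel()

        # Step 9: stop when the validation error is small enough.
        if wrong / max(total, 1) < target_err:
            break
```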
Correspondingly, an embodiment of the present invention further provides a deep learning-based intelligent audiovisual device multi-service control system, as shown in fig. 3, the system includes: the system comprises a microphone array 1, a voice preprocessing module 2, a remote GPU server 3, an internet connection module 4, a control signal analysis module 5 and a control signal output module 6; wherein,
the microphone array 1 monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module 2 extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module 4 sends MFCC original voice feature information to the remote GPU server 3;
the remote GPU server 3 receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module 4;
the internet connection module 4 transmits the control signal identification information to the control signal analysis module 5, the control signal analysis module 5 generates a control signal code according to the control signal identification information, selects a corresponding control signal output module 6, and transmits the control signal code to the control signal output module 6;
the control signal output module 6 sends a control signal to the intelligent audio-visual equipment according to the control signal code.
In the embodiment of the invention, the microphone array 1 collects the voice signals sent by the user in real time and sends the voice signals to the voice preprocessing module 2.
The voice preprocessing module 2 is responsible for performing endpoint detection, noise reduction and MFCC original voice feature extraction operations on voice signals.
The internet connection module 4 is responsible for establishing the network connection with the remote GPU server 3, sending the MFCC original voice feature information to the remote GPU server 3, and receiving feedback messages from the remote GPU server 3.
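No transport protocol is prescribed for this exchange. As one possibility, the sketch below posts the MFCC feature matrix to a hypothetical HTTP endpoint on the remote GPU server and returns the feedback message; the URL and JSON schema are invented for illustration.

```python
# Hypothetical internet connection module client; endpoint and message
# format are assumptions, not specified here.
import json
import urllib.request

SERVER_URL = "http://gpu-server.example:8080/recognize"  # placeholder address

def send_mfcc(mfcc_matrix) -> dict:
    """Send MFCC original voice features, return the server's feedback message."""
    payload = json.dumps({"mfcc": mfcc_matrix.tolist()}).encode("utf-8")
    request = urllib.request.Request(SERVER_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))
```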
The control signal analysis module 5 is responsible for analyzing the feedback message from the remote GPU server 3, and enables the corresponding control signal output module 6 according to the message content, or performs error processing.
There are multiple control signal output modules 6. Each control signal output unit is equipped with hardware supporting one wireless communication mode and is responsible for controlling all intelligent audio-visual equipment based on that mode; these wireless communication modes include infrared, Bluetooth, Z-wave, and so on.
The remote GPU server 3 provides a voice control instruction recognition service for intelligent audio-visual equipment for users.
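To make the division of labor among modules 4, 5, and 6 concrete, the following sketch shows one possible form of the control signal analysis and dispatch step; every identifier, control code, and class name in it is hypothetical.

```python
# Hypothetical control signal analysis module: map the identifier from the
# server's feedback message to a control signal code and enable the matching
# output module, or perform error processing.
from dataclasses import dataclass

@dataclass
class ControlEntry:
    protocol: str   # which control signal output module handles this signal
    code: bytes     # device-specific control signal code

# Table populated at configuration time (all entries invented).
CONTROL_TABLE = {
    "TV_POWER_ON":  ControlEntry("infrared",  b"\x01\xa0"),
    "SPEAKER_PLAY": ControlEntry("bluetooth", b"\x02\x10"),
    "LIGHT_OFF":    ControlEntry("zwave",     b"\x03\x00"),
}

def analyze_and_dispatch(message: dict, output_modules: dict) -> None:
    """message: feedback from the GPU server, e.g. {"id": "TV_POWER_ON"}."""
    signal_id = message.get("id")
    entry = CONTROL_TABLE.get(signal_id)
    if entry is None:                     # error mark or unknown identifier
        print("error processing: unrecognized control instruction", message)
        return
    # Select the corresponding output module and send the control signal code.
    output_modules[entry.protocol].send(entry.code)
```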
Further, the voice preprocessing module 2 includes:
the segmentation unit is used for carrying out endpoint detection and segmentation processing on the voice control signal;
the noise reduction unit is used for carrying out noise reduction processing on the voice control signal after the segmentation processing;
and the extraction unit is used for extracting the MFCC original voice feature of the voice control signal after the noise reduction processing to obtain MFCC original voice feature information.
The remote GPU server 3 receives the MFCC original voice feature information, starts a deep learning voice recognition program, and adopts the biLSTM algorithm to perform deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information.
The remote GPU server 3 receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module 4;
the remote GPU server 3 classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, control signal identification information is returned to the internet connection module 4.
Specifically, the working principle of the system related function module according to the embodiment of the present invention may refer to the related description of the method embodiment, and is not described herein again.
By implementing the embodiment of the invention, natural voice can be used to control many kinds of intelligent audio-visual equipment that are based on different control protocols and realize different services, providing a unified, natural, efficient, and low-cost man-machine interaction mode for intelligent audio-visual equipment. Meanwhile, deploying the complex deep learning task on a remote server reduces the user's hardware and energy costs, provides a high-performance, low-cost voice control instruction recognition service for intelligent audio-visual equipment, and improves the recognition accuracy of voice control instructions.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
The method and system for multi-service control of intelligent audio-visual equipment based on deep learning provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present invention, vary the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (8)
1. A multi-service control method of intelligent audio-visual equipment based on deep learning is characterized by comprising the following steps:
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, a corresponding control signal output module is selected, and the control signal code is transmitted to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
2. The intelligent audio-visual equipment multi-service control method based on deep learning of claim 1, wherein the step of extracting the voice control signal by the voice preprocessing module to obtain the MFCC original voice feature information comprises:
carrying out endpoint detection and segmentation processing on the voice control signal;
carrying out noise reduction processing on the voice control signal after the segmentation processing;
and performing MFCC original voice feature extraction on the voice control signal subjected to the noise reduction processing to obtain MFCC original voice feature information.
3. The intelligent audio-visual device multi-service control method based on deep learning of claim 1, wherein the remote GPU server receives MFCC original voice feature information, and performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, including:
and the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and adopts a biLSTM algorithm to perform deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information.
4. The intelligent audio-visual device multi-service control method based on deep learning of claim 1, wherein the remote GPU server receives MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the internet connection module, including:
the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
5. A smart audiovisual device multi-service control system based on deep learning, the system comprising: the system comprises a microphone array, a voice preprocessing module, a remote GPU server, an internet connection module, a control signal analysis module and a control signal output module; wherein,
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, a corresponding control signal output module is selected, and the control signal code is transmitted to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
6. The intelligent deep learning based multi-service control system for intelligent audio-visual devices as claimed in claim 5, wherein the voice pre-processing module comprises:
the segmentation unit is used for carrying out endpoint detection and segmentation processing on the voice control signal;
the noise reduction unit is used for carrying out noise reduction processing on the voice control signal after the segmentation processing;
and the extraction unit is used for extracting the MFCC original voice feature of the voice control signal after the noise reduction processing to obtain MFCC original voice feature information.
7. The deep learning-based intelligent audio-visual device multi-service control system as claimed in claim 5, wherein the remote GPU server receives MFCC original speech feature information, starts a deep learning speech recognition procedure, and performs deep speech feature extraction on the MFCC original speech feature information by using a biLSTM algorithm to obtain deep speech feature information.
8. The intelligent audio-visual equipment multi-service control system based on deep learning of claim 5, wherein the remote GPU server receives MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611144430.6A CN106653020A (en) | 2016-12-13 | 2016-12-13 | Multi-business control method and system for smart sound and video equipment based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611144430.6A CN106653020A (en) | 2016-12-13 | 2016-12-13 | Multi-business control method and system for smart sound and video equipment based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106653020A true CN106653020A (en) | 2017-05-10 |
Family
ID=58824998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611144430.6A Pending CN106653020A (en) | 2016-12-13 | 2016-12-13 | Multi-business control method and system for smart sound and video equipment based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106653020A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101221762A (en) * | 2007-12-06 | 2008-07-16 | 上海大学 | MP3 compression field audio partitioning method |
CN105700359A (en) * | 2014-11-25 | 2016-06-22 | 上海天脉聚源文化传媒有限公司 | Method and system for controlling smart home through speech recognition |
CN104952448A (en) * | 2015-05-04 | 2015-09-30 | 张爱英 | Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks |
CN105045122A (en) * | 2015-06-24 | 2015-11-11 | 张子兴 | Intelligent household natural interaction system based on audios and videos |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108074575A (en) * | 2017-12-14 | 2018-05-25 | 广州势必可赢网络科技有限公司 | Identity verification method and device based on recurrent neural network |
CN109559761A (en) * | 2018-12-21 | 2019-04-02 | 广东工业大学 | A kind of risk of stroke prediction technique based on depth phonetic feature |
CN110428821A (en) * | 2019-07-26 | 2019-11-08 | 广州市申迪计算机系统有限公司 | A kind of voice command control method and device for crusing robot |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
WO2021127982A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, smart device, and computer-readable storage medium |
CN111357051B (en) * | 2019-12-24 | 2024-02-02 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111783892A (en) * | 2020-07-06 | 2020-10-16 | 广东工业大学 | Robot instruction identification method and device, electronic equipment and storage medium |
CN113921016A (en) * | 2021-10-15 | 2022-01-11 | 阿波罗智联(北京)科技有限公司 | Voice processing method, device, electronic equipment and storage medium |
CN118708384A (en) * | 2024-08-30 | 2024-09-27 | 江苏博云科技股份有限公司 | Method and system for optimizing GPU remote call performance based on pre-analysis and calculation service |
Similar Documents
Publication | Title |
---|---|
CN106653020A (en) | Multi-business control method and system for smart sound and video equipment based on deep learning |
CN112889108B (en) | Speech classification using audiovisual data |
CN107437415B (en) | Intelligent voice interaction method and system |
CN102843543B (en) | Video conferencing reminding method, device and video conferencing system |
CN110519636B (en) | Voice information playing method and device, computer equipment and storage medium |
CN104049721B (en) | Information processing method and electronic equipment |
CN102890776B (en) | The method that expression figure explanation is transferred by facial expression |
CN109377995B (en) | Method and device for controlling equipment |
CN112863547A (en) | Virtual resource transfer processing method, device, storage medium and computer equipment |
CN111541951B (en) | Video-based interactive processing method and device, terminal and readable storage medium |
CN111966212A (en) | Multi-mode-based interaction method and device, storage medium and smart screen device |
CN110516749A (en) | Model training method, method for processing video frequency, device, medium and calculating equipment |
CN111197841A (en) | Control method, control device, remote control terminal, air conditioner, server and storage medium |
CN109032345A (en) | Apparatus control method, device, equipment, server-side and storage medium |
CN117193524A (en) | Man-machine interaction system and method based on multi-mode feature fusion |
CN108040111A (en) | A kind of apparatus and method for supporting natural language interaction |
CN107452381B (en) | Multimedia voice recognition device and method |
CN111402096A (en) | Online teaching quality management method, system, equipment and medium |
CN109343481B (en) | Method and device for controlling device |
CN111933137B (en) | Voice wake-up test method and device, computer readable medium and electronic equipment |
CN117877125B (en) | Action recognition and model training method and device, electronic equipment and storage medium |
CN113571063A (en) | Voice signal recognition method and device, electronic equipment and storage medium |
WO2024114303A1 (en) | Phoneme recognition method and apparatus, electronic device and storage medium |
WO2018023523A1 (en) | Motion and emotion recognizing home control system |
WO2018023518A1 (en) | Smart terminal for voice interaction and recognition |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170510 |