CN106653020A - Multi-business control method and system for smart sound and video equipment based on deep learning - Google Patents

Multi-business control method and system for smart sound and video equipment based on deep learning

Info

Publication number
CN106653020A
Authority
CN
China
Prior art keywords
control signal
feature information
mfcc
voice
voice feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611144430.6A
Other languages
Chinese (zh)
Inventor
曾旭龙
林格
陈小燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201611144430.6A priority Critical patent/CN106653020A/en
Publication of CN106653020A publication Critical patent/CN106653020A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the invention discloses a multi-business control method and system for smart sound and video equipment based on deep learning. The method comprises the steps that a voice preprocessing module performs feature extraction on a voice control signal and obtains MFCC original voice feature information; a remote GPU server receives the MFCC original voice feature information and obtains deep voice feature information from it; an internet connection module transmits control signal identification information to a signal analysis module, and the signal analysis module generates a control signal code according to the control signal identification information, selects the corresponding control signal output module, and transmits the control signal code to it; the control signal output module sends a control signal to the smart sound and video equipment according to the control signal code. The method can control many types of smart sound and video equipment that are based on different control protocols and implement various services, and provides a more unified and natural man-machine interaction and control mode for such equipment.

Description

Intelligent audio-visual equipment multi-service control method and system based on deep learning
Technical Field
The invention relates to the technical field of intelligent audio-visual equipment multi-service control, in particular to an intelligent audio-visual equipment multi-service control method and system based on deep learning.
Background
With the progress of the Internet of Things and artificial intelligence technology, intelligent audio-visual equipment has developed rapidly. More and more intelligent audio-visual devices are being designed and produced, realizing a variety of multimedia audio-visual services to meet people's different needs in daily life. Devices designed and produced by different manufacturers have different control and man-machine interaction modes: they may adopt various control modes such as infrared, Bluetooth and Z-wave, and realize human-computer interaction through voice, action, touch and other modes. Because intelligent audio-visual equipment control and human-computer interaction modes are not unified, the threshold for users to learn to use such equipment is raised, and poor user experience easily results. Integrating multiple service scenarios and providing a uniform, easy and natural control and man-machine interaction mode for intelligent audio-visual equipment is therefore a problem to be solved urgently.
Deep learning is a subfield of artificial intelligence. In recent years, with advances in technologies such as the Graphics Processing Unit (GPU) and cloud computing, breakthroughs have been made in deep learning theoretical research. Meanwhile, the introduction of deep learning technology has led to rapid advances in fields such as computer vision and speech recognition. This also brings new ideas to intelligent audio-visual equipment control technology.
An existing audio- and video-based intelligent home natural interaction system [1] uses a microphone and a camera to collect sound and image information, uses an information fusion module to perform signal processing, then uses a machine learning method to obtain a useful instruction, and finally uses a control signal transmitting module to send out a control signal.
Because that system uses information such as voice, gestures, human faces, and actions for control, it cannot provide a simple and unified interaction mode for the user, so the cost of learning to master the system is high and the user experience is poor. It adopts traditional machine learning methods to recognize multimedia information such as voice and images, so the recognition rate is low and the system robustness is poor. Moreover, its voice and image recognition programs run locally, which increases the user's hardware and energy costs.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a deep learning-based method and system for multi-service control of intelligent audio-visual equipment, which can control multiple intelligent audio-visual devices that are based on different control protocols and realize multiple different services, and which provide a more uniform and natural man-machine interaction and control mode for the intelligent audio-visual equipment.
In order to solve the above problems, the present invention provides a method for controlling multiple services of an intelligent audiovisual device based on deep learning, the method comprising:
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module performs feature extraction on the voice control signal to obtain original voice feature information of Mel-scale Frequency Cepstral Coefficients (MFCC); detects whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; and if yes, the Internet connection module sends the MFCC original voice feature information to a remote Graphics Processing Unit (GPU) server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module; the control signal analysis module generates a control signal code according to the control signal identification information, selects a corresponding control signal output module, and transmits the control signal code to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
Preferably, the step of extracting the voice control signal by the voice preprocessing module to obtain the MFCC original voice feature information includes:
carrying out endpoint detection and segmentation processing on the voice control signal;
carrying out noise reduction processing on the voice control signal after the segmentation processing;
and performing MFCC original voice feature extraction on the voice control signal subjected to the noise reduction processing to obtain MFCC original voice feature information.
Preferably, the step of receiving, by the remote GPU server, the MFCC original speech feature information, performing deep speech feature extraction on the MFCC original speech feature information, and obtaining deep speech feature information includes:
the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and adopts a Bidirectional Long Short-Term Memory (biLSTM) neural network algorithm to perform deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information.
Preferably, the step of the remote GPU server receiving the MFCC original voice feature information, obtaining the deep voice feature information according to the MFCC original voice feature information, and sending the control signal identification information corresponding to the deep voice feature information to the internet connection module includes:
the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
Correspondingly, the invention also provides a deep learning-based intelligent audio-visual equipment multi-service control system, which comprises: the system comprises a microphone array, a voice preprocessing module, a remote GPU server, an internet connection module, a control signal analysis module and a control signal output module; wherein,
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module; the control signal analysis module generates a control signal code according to the control signal identification information, selects a corresponding control signal output module, and transmits the control signal code to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
Preferably, the voice preprocessing module includes:
the segmentation unit is used for carrying out endpoint detection and segmentation processing on the voice control signal;
the noise reduction unit is used for carrying out noise reduction processing on the voice control signal after the segmentation processing;
and the extraction unit is used for extracting the MFCC original voice feature of the voice control signal after the noise reduction processing to obtain MFCC original voice feature information.
Preferably, the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and performs deep voice feature extraction on the MFCC original voice feature information by using a biLSTM algorithm to obtain deep voice feature information.
Preferably, the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
By implementing the embodiment of the invention, natural voice can be used for controlling various intelligent audio-visual equipment based on different control protocols and realizing various different services, and a unified, natural, efficient and low-cost man-machine interaction mode is provided for the intelligent audio-visual equipment; meanwhile, a complex deep learning task is deployed on the remote server, so that the hardware and energy cost of a user is reduced, high-performance and low-cost intelligent audio-visual equipment voice control instruction recognition service is provided for the user, and the recognition accuracy of the intelligent audio-visual equipment voice control instruction is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for controlling multiple services of an intelligent audio-visual device based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deep learning speech recognition model in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a deep learning-based intelligent audio-visual equipment multi-service control system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flowchart of a method for controlling multiple services of an intelligent audiovisual device based on deep learning according to an embodiment of the present invention, as shown in fig. 1, the method includes:
s1, the microphone array monitors and collects the voice control signal sent by the user with a specific frequency;
s2, the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server; if not, returning to S1;
s3, the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
s4, the Internet connection module transmits the control signal identification information to the control signal analysis module, the control signal analysis module generates a control signal code according to the control signal identification information, selects a corresponding control signal output module, and transmits the control signal code to the control signal output module;
and S5, the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
The process by which the voice preprocessing module extracts the voice control signal to obtain the MFCC original voice feature information comprises the following steps (a code sketch follows the list):
carrying out endpoint detection and segmentation processing on the voice control signal;
carrying out noise reduction processing on the voice control signal after the segmentation processing;
and performing MFCC original voice feature extraction on the voice control signal subjected to the noise reduction processing to obtain MFCC original voice feature information.
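The following is a minimal Python sketch of this preprocessing chain, using numpy and librosa. The frame sizes, energy gate, pre-emphasis coefficient, and the use of a simple energy-based endpoint detector with pre-emphasis as the noise-reduction step are illustrative assumptions; the patent does not fix these details.

```python
import numpy as np
import librosa

FRAME_LEN, HOP_LEN = 400, 160   # assumed 25 ms / 10 ms frames at 16 kHz

def preprocess(signal, sr=16000, energy_gate=1e-4):
    # Endpoint detection and segmentation: keep non-overlapping frames
    # whose short-time energy exceeds a gate (a simple stand-in for the
    # unspecified endpoint detector).
    frames = librosa.util.frame(signal, frame_length=FRAME_LEN,
                                hop_length=FRAME_LEN)
    energy = (frames ** 2).mean(axis=0)
    voiced = frames[:, energy > energy_gate]
    segment = voiced.T.reshape(-1) if voiced.size else signal

    # Noise reduction, illustrated here as pre-emphasis; a real system
    # might use spectral subtraction or a Wiener filter instead.
    segment = np.append(segment[0], segment[1:] - 0.97 * segment[:-1])

    # MFCC extraction: 13 coefficients per frame, plus the log energy
    # that the main flow compares against the threshold.
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=HOP_LEN)
    log_energy = float(np.log((segment ** 2).mean() + 1e-10))
    return mfcc, log_energy
```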
Specifically, in S3, the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and performs deep voice feature extraction on the MFCC original voice feature information by using the biLSTM algorithm to obtain deep voice feature information.
Further, the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning control signal identification information to the Internet connection module; if not, returning an error mark to the Internet connection module.
In the embodiment of the invention, as shown in fig. 2, the main structure of the deep learning speech recognition model includes a biLSTM, composed of a forward long short-term memory recurrent neural network and a reverse long short-term memory recurrent neural network, and a Softmax classifier. The input to the deep learning speech recognition model is the MFCC speech features sent from the local internet connection module, and its output is one of T+1 class identifiers. These category identifiers comprise T categories, one for each control signal supported by the system, and a Default category. If the model outputs the Default category, the MFCC speech features do not correspond to any control signal of the intelligent audio-visual equipment. The deep learning speech recognition model is generated in advance in a training stage and then deployed on a remote GPU server to provide a voice control instruction recognition service for intelligent audio-visual equipment to users.
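A minimal PyTorch sketch of a model with this structure, i.e. a bidirectional LSTM over the MFCC frames feeding a linear layer with T+1 outputs (Softmax is applied inside the loss or at inference). The layer sizes and the value of T are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

T = 20                # assumed number of supported control signals
DEFAULT_CLASS = T     # index of the extra Default ("no control") class

class BiLSTMRecognizer(nn.Module):
    """Forward + reverse LSTM over MFCC frames, then a T+1-way classifier."""

    def __init__(self, n_mfcc=13, hidden=128):
        super().__init__()
        # bidirectional=True realises the forward and reverse
        # long short-term memory networks of the model's main structure.
        self.bilstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, T + 1)

    def forward(self, mfcc_frames):             # (batch, time, n_mfcc)
        features, _ = self.bilstm(mfcc_frames)  # deep voice features
        return self.classifier(features[:, -1, :])  # logits over T+1 classes

def identify(model, mfcc_frames):
    """Return a control-signal class index, or None for the Default class
    (i.e. the features correspond to no supported control signal)."""
    with torch.no_grad():
        cls = model(mfcc_frames).argmax(dim=-1).item()  # assumes batch of 1
    return None if cls == DEFAULT_CLASS else cls
```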
In implementation, the training generation process of the deep learning speech recognition model is as follows (a training-loop sketch follows the steps):
First step: simulate real device usage situations according to the types of intelligent audio-visual devices to be supported and the service functions realized by those devices, and collect a large number of voice fragments using a microphone array;
Second step: manually mark the control signal type corresponding to each voice fragment;
Third step: extract MFCC voice features from all voice fragments using the voice preprocessing module to obtain a labeled control voice feature data set;
Fourth step: divide the data set, taking a certain amount of data from the labeled control voice feature data set to form a training data set (Training Set) and a certain amount as a validation data set (Validation Set);
Fifth step: randomly initialize all parameters in the deep learning speech recognition model;
Sixth step: execute the deep learning forward propagation process with the training data set as input;
Seventh step: execute the deep learning back propagation process using the Back Propagation Through Time (BPTT) method, and update all parameters in the deep learning speech model;
Eighth step: if the verification period has been reached, verify the current deep learning speech recognition model with the validation data set;
Ninth step: stop training if the stopping condition is met; otherwise, return to the sixth step. The stopping condition may be that the number of training iterations reaches a certain value, or that the validation error falls below a certain value.
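A minimal PyTorch training-loop sketch of the fifth through ninth steps; calling `loss.backward()` on a network unrolled over the input sequence is how BPTT is realised in such frameworks. The learning rate, validation period, and stopping values are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, max_epochs=100,
          val_every=5, target_val_loss=0.05, lr=1e-3):
    # Fifth step: parameters are randomly initialised when the model is
    # constructed (PyTorch's default initialisers).
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # Softmax + negative log-likelihood

    for epoch in range(max_epochs):
        model.train()
        for mfcc_batch, labels in train_loader:
            optimizer.zero_grad()
            logits = model(mfcc_batch)      # sixth step: forward propagation
            loss = criterion(logits, labels)
            loss.backward()                 # seventh step: BPTT through the LSTM
            optimizer.step()                # update all model parameters

        # Eighth step: validate when the verification period is reached.
        if (epoch + 1) % val_every == 0:
            model.eval()
            with torch.no_grad():
                val_loss = sum(criterion(model(x), y).item()
                               for x, y in val_loader) / len(val_loader)
            # Ninth step: stop on a small validation error; the epoch
            # budget (max_epochs) bounds the number of training rounds.
            if val_loss < target_val_loss:
                break
```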
Correspondingly, an embodiment of the present invention further provides a deep learning-based intelligent audiovisual device multi-service control system, as shown in fig. 3, the system includes: the system comprises a microphone array 1, a voice preprocessing module 2, a remote GPU server 3, an internet connection module 4, a control signal analysis module 5 and a control signal output module 6; wherein,
the microphone array 1 monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module 2 extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module 4 sends MFCC original voice feature information to the remote GPU server 3;
the remote GPU server 3 receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module 4;
the internet connection module 4 transmits the control signal identification information to the control signal analysis module 5, the control signal analysis module 5 generates a control signal code according to the control signal identification information, selects a corresponding control signal output module 6, and transmits the control signal code to the control signal output module 6;
the control signal output module 6 sends a control signal to the intelligent audio-visual equipment according to the control signal code.
In the embodiment of the invention, the microphone array 1 collects the voice signals sent by the user in real time and sends the voice signals to the voice preprocessing module 2.
The voice preprocessing module 2 is responsible for performing endpoint detection, noise reduction and MFCC original voice feature extraction operations on voice signals.
The internet connection module 4 is responsible for establishing network connection with the remote GPU server 3, sending MFCC raw speech feature information to the remote GPU server 3, and receiving feedback messages from the remote GPU server 3.
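The patent does not specify a wire protocol for this exchange. The following sketch assumes JSON over HTTP using the `requests` library; the endpoint URL and the `mfcc`/`identifier` field names are hypothetical.

```python
import numpy as np
import requests

SERVER_URL = "http://gpu-server.example:8080/recognize"  # hypothetical endpoint

def query_gpu_server(mfcc: np.ndarray, timeout=2.0) -> str:
    """Send MFCC feature information; return the control signal
    identification information from the server's feedback message."""
    payload = {"mfcc": mfcc.tolist()}                 # hypothetical schema
    resp = requests.post(SERVER_URL, json=payload, timeout=timeout)
    resp.raise_for_status()
    return resp.json().get("identifier", "ERROR")     # error mark fallback
```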
The control signal analysis module 5 is responsible for analyzing the feedback message from the remote GPU server 3, and enables the corresponding control signal output module 6 according to the message content, or performs error processing.
There are a plurality of control signal output modules 6; each control signal output unit is provided with hardware supporting one wireless communication mode and is responsible for controlling all intelligent audio-visual equipment based on that wireless communication mode. These wireless communication modes include infrared, Bluetooth, Z-wave and the like.
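A minimal sketch of the analysis-and-dispatch step, assuming a lookup table from control signal identification information to an (output module, control signal code) pair. The identifiers, codes, and emitter classes are hypothetical placeholders for real infrared, Bluetooth, and Z-wave drivers.

```python
# Hypothetical emitters standing in for the control signal output
# modules 6; real implementations would drive IR, Bluetooth and Z-wave
# transmitter hardware respectively.
class IREmitter:
    def send(self, code): print(f"IR -> {code:#06x}")

class BluetoothEmitter:
    def send(self, code): print(f"BT -> {code:#06x}")

class ZWaveEmitter:
    def send(self, code): print(f"Z-wave -> {code:#06x}")

EMITTERS = {"ir": IREmitter(), "bt": BluetoothEmitter(), "zwave": ZWaveEmitter()}

# Hypothetical table: control signal identifier -> (protocol, signal code).
SIGNAL_TABLE = {
    "tv_power":    ("ir",    0x20DF),
    "speaker_vol": ("bt",    0x0102),
    "lamp_on":     ("zwave", 0x0001),
}

def dispatch(identifier):
    """Generate the control signal code and route it to the matching
    output module, or perform error handling for unknown identifiers."""
    entry = SIGNAL_TABLE.get(identifier)
    if entry is None:
        print(f"error: no control signal for {identifier!r}")
        return
    protocol, code = entry
    EMITTERS[protocol].send(code)

dispatch("tv_power")  # usage example: prints "IR -> 0x20df"
```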
The remote GPU server 3 provides a voice control instruction recognition service for intelligent audio-visual equipment for users.
Further, the voice preprocessing module 2 includes:
the segmentation unit is used for carrying out endpoint detection and segmentation processing on the voice control signal;
the noise reduction unit is used for carrying out noise reduction processing on the voice control signal after the segmentation processing;
and the extraction unit is used for extracting the MFCC original voice feature of the voice control signal after the noise reduction processing to obtain MFCC original voice feature information.
The remote GPU server 3 receives the MFCC original voice feature information, starts a deep learning voice recognition program, and adopts a biLSTM algorithm to perform deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information.
The remote GPU server 3 receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module 4;
the remote GPU server 3 classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, control signal identification information is returned to the internet connection module 4.
Specifically, the working principle of the system related function module according to the embodiment of the present invention may refer to the related description of the method embodiment, and is not described herein again.
By implementing the embodiment of the invention, natural voice can be used for controlling various intelligent audio-visual equipment based on different control protocols and realizing various different services, and a unified, natural, efficient and low-cost man-machine interaction mode is provided for the intelligent audio-visual equipment; meanwhile, a complex deep learning task is deployed on the remote server, so that the hardware and energy cost of a user is reduced, high-performance and low-cost intelligent audio-visual equipment voice control instruction recognition service is provided for the user, and the recognition accuracy of the intelligent audio-visual equipment voice control instruction is improved.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
In addition, the method and system for controlling multiple services of an intelligent audio-visual device based on deep learning according to the embodiments of the present invention are described in detail above. A specific example is applied herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method and its core ideas. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (8)

1. A multi-service control method of intelligent audio-visual equipment based on deep learning is characterized by comprising the following steps:
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module; the control signal analysis module generates a control signal code according to the control signal identification information, selects a corresponding control signal output module, and transmits the control signal code to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
2. The intelligent audio-visual equipment multi-service control method based on deep learning of claim 1, wherein the step of extracting the voice control signal by the voice preprocessing module to obtain the MFCC original voice feature information comprises:
carrying out endpoint detection and segmentation processing on the voice control signal;
carrying out noise reduction processing on the voice control signal after the segmentation processing;
and performing MFCC original voice feature extraction on the voice control signal subjected to the noise reduction processing to obtain MFCC original voice feature information.
3. The intelligent audio-visual device multi-service control method based on deep learning of claim 1, wherein the remote GPU server receives MFCC original voice feature information, and performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, including:
the remote GPU server receives the MFCC original voice feature information, starts a deep learning voice recognition program, and adopts a biLSTM algorithm to perform deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information.
4. The intelligent audio-visual device multi-service control method based on deep learning of claim 1, wherein the remote GPU server receives MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the internet connection module, including:
the remote GPU server receives the MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
5. A smart audiovisual device multi-service control system based on deep learning, the system comprising: the system comprises a microphone array, a voice preprocessing module, a remote GPU server, an internet connection module, a control signal analysis module and a control signal output module; wherein,
the microphone array monitors and collects voice control signals sent by users at a specific frequency;
the voice preprocessing module extracts the voice control signal to obtain MFCC original voice characteristic information; detecting whether the logarithmic energy of the MFCC original voice features is larger than a threshold value; if yes, the Internet connection module sends MFCC original voice feature information to a remote GPU server;
the remote GPU server receives the MFCC original voice feature information, obtains deep voice feature information according to the MFCC original voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the internet connection module transmits the control signal identification information to the control signal analysis module; the control signal analysis module generates a control signal code according to the control signal identification information, selects a corresponding control signal output module, and transmits the control signal code to the control signal output module;
the control signal output module sends a control signal to the intelligent audio-visual equipment according to the control signal code.
6. The deep learning-based intelligent audio-visual equipment multi-service control system as claimed in claim 5, wherein the voice preprocessing module comprises:
the segmentation unit is used for carrying out endpoint detection and segmentation processing on the voice control signal;
the noise reduction unit is used for carrying out noise reduction processing on the voice control signal after the segmentation processing;
and the extraction unit is used for extracting the MFCC original voice feature of the voice control signal after the noise reduction processing to obtain MFCC original voice feature information.
7. The deep learning-based intelligent audio-visual device multi-service control system as claimed in claim 5, wherein the remote GPU server receives MFCC original speech feature information, starts a deep learning speech recognition program, and performs deep speech feature extraction on the MFCC original speech feature information by using a biLSTM algorithm to obtain deep speech feature information.
8. The intelligent audio-visual equipment multi-service control system based on deep learning of claim 5, wherein the remote GPU server receives MFCC original voice feature information, performs deep voice feature extraction on the MFCC original voice feature information to obtain deep voice feature information, and sends control signal identification information corresponding to the deep voice feature information to the Internet connection module;
the remote GPU server classifies the deep voice characteristic information to obtain a category corresponding to the deep voice characteristic information, and detects whether the category corresponds to a control signal identifier; if yes, returning the control signal identification information to the Internet connection module.
CN201611144430.6A 2016-12-13 2016-12-13 Multi-business control method and system for smart sound and video equipment based on deep learning Pending CN106653020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611144430.6A CN106653020A (en) 2016-12-13 2016-12-13 Multi-business control method and system for smart sound and video equipment based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611144430.6A CN106653020A (en) 2016-12-13 2016-12-13 Multi-business control method and system for smart sound and video equipment based on deep learning

Publications (1)

Publication Number Publication Date
CN106653020A true CN106653020A (en) 2017-05-10

Family

ID=58824998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611144430.6A Pending CN106653020A (en) 2016-12-13 2016-12-13 Multi-business control method and system for smart sound and video equipment based on deep learning

Country Status (1)

Country Link
CN (1) CN106653020A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101221762A (en) * 2007-12-06 2008-07-16 上海大学 MP3 compression field audio partitioning method
CN105700359A (en) * 2014-11-25 2016-06-22 上海天脉聚源文化传媒有限公司 Method and system for controlling smart home through speech recognition
CN104952448A (en) * 2015-05-04 2015-09-30 张爱英 Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks
CN105045122A (en) * 2015-06-24 2015-11-11 张子兴 Intelligent household natural interaction system based on audios and videos

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108074575A (en) * 2017-12-14 2018-05-25 广州势必可赢网络科技有限公司 Identity verification method and device based on recurrent neural network
CN109559761A (en) * 2018-12-21 2019-04-02 广东工业大学 A kind of risk of stroke prediction technique based on depth phonetic feature
CN110428821A (en) * 2019-07-26 2019-11-08 广州市申迪计算机系统有限公司 A kind of voice command control method and device for crusing robot
CN111357051A (en) * 2019-12-24 2020-06-30 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
WO2021127982A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech emotion recognition method, smart device, and computer-readable storage medium
CN111357051B (en) * 2019-12-24 2024-02-02 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
CN111783892A (en) * 2020-07-06 2020-10-16 广东工业大学 Robot instruction identification method and device, electronic equipment and storage medium
CN113921016A (en) * 2021-10-15 2022-01-11 阿波罗智联(北京)科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN118708384A (en) * 2024-08-30 2024-09-27 江苏博云科技股份有限公司 Method and system for optimizing GPU remote call performance based on pre-analysis and calculation service

Similar Documents

Publication Publication Date Title
CN106653020A (en) Multi-business control method and system for smart sound and video equipment based on deep learning
CN112889108B (en) Speech classification using audiovisual data
CN107437415B (en) Intelligent voice interaction method and system
CN102843543B (en) Video conferencing reminding method, device and video conferencing system
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN104049721B (en) Information processing method and electronic equipment
CN102890776B (en) The method that expression figure explanation is transferred by facial expression
CN109377995B (en) Method and device for controlling equipment
CN112863547A (en) Virtual resource transfer processing method, device, storage medium and computer equipment
CN111541951B (en) Video-based interactive processing method and device, terminal and readable storage medium
CN111966212A (en) Multi-mode-based interaction method and device, storage medium and smart screen device
CN110516749A (en) Model training method, method for processing video frequency, device, medium and calculating equipment
CN111197841A (en) Control method, control device, remote control terminal, air conditioner, server and storage medium
CN109032345A (en) Apparatus control method, device, equipment, server-side and storage medium
CN117193524A (en) Man-machine interaction system and method based on multi-mode feature fusion
CN108040111A (en) A kind of apparatus and method for supporting natural language interaction
CN107452381B (en) Multimedia voice recognition device and method
CN111402096A (en) Online teaching quality management method, system, equipment and medium
CN109343481B (en) Method and device for controlling device
CN111933137B (en) Voice wake-up test method and device, computer readable medium and electronic equipment
CN117877125B (en) Action recognition and model training method and device, electronic equipment and storage medium
CN113571063A (en) Voice signal recognition method and device, electronic equipment and storage medium
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
WO2018023523A1 (en) Motion and emotion recognizing home control system
WO2018023518A1 (en) Smart terminal for voice interaction and recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170510