WO2022161077A1 - Speech control method, and electronic device

Info

Publication number: WO2022161077A1
Application number: PCT/CN2021/142083
Authority: WO
Inventors: 王晓博; 许嘉璐
Original assignee: 华为技术有限公司
Priority date: 2021-01-29
Filing date: 2021-12-28
Publication date: 2022-08-04
Also published as: CN114822525A

Abstract

A speech control method and an electronic device. The speech control method is applied to a speech control system, wherein the speech control system comprises at least a first electronic device and a second electronic device, each of which has a speech control function. The speech control method comprises: the first electronic device and the second electronic device each receiving a first speech instruction input by a user, and the first electronic device responding to the first speech instruction; the second electronic device performing recording and storing recording data, wherein the recording is used for recording a second speech instruction input by the user; the second electronic device sending the recording data of the second electronic device to the first electronic device; and the first electronic device responding to the second speech instruction according to recording data of the first electronic device and/or the recording data of the second electronic device, wherein the recording data of the first electronic device comprises recording data obtained when the first electronic device records the second speech instruction input by the user. By means of the method, the problem of false recognition of speech control in a multi-device scenario can be solved, thereby improving the accuracy of speech control.

Description

Voice control method and electronic device

This application claims the priority of the Chinese patent application with the application number 202110130831.0 and the application title "Voice Control Method and Electronic Device" filed with the China Patent Office on January 29, 2021, the entire contents of which are incorporated into this application by reference.

Technical Field

The present application relates to computer technology, and in particular, to a voice control method and electronic device.

Background

As a new type of terminal application (application, APP) based on voice semantic algorithm, voice assistant provides service functions such as interactive dialogue, information query, and device control by receiving and recognizing voice signals sent by users. With the continuous development of deep learning theory and the maturity of intelligent voice hardware, voice assistant applications have become an essential software function for terminal devices such as smartphones, tablet computers, smart TVs, and smart speakers.

With the popularization of terminal devices equipped with voice assistants, many users already own multiple terminal devices of the same or different types. In a scenario where the user uses multiple terminal devices concurrently, or where the user's voice interaction occurs within the effective working range of multiple terminal devices, the terminal devices can, through signal detection and interactive negotiation, select the terminal device with the clearest pickup (that is, the one nearest to the user) as the pickup entrance for the voice assistant application to call, which can improve the recognition accuracy of the voice assistant application. For example, the user's living room has three devices: a speaker, a TV, and a mobile phone. All three devices have a voice assistant application installed, and the wake-up word of each is "Xiao E Xiao E". When the user speaks the wake-up word "Xiao E Xiao E", the voice assistant applications of the speaker, the TV, and the mobile phone select one of the three devices as the answering device by detecting the audio energy information of the wake-up word. Since the speaker is closest to the user, the three devices negotiate and select the speaker as the answering device based on the audio energy information of the wake-up word. The speaker wakes up its own voice assistant application, and the other devices do not respond to the wake-up word, that is, they do not wake up their respective voice assistant applications. In this way, when the user goes on to speak a voice signal, only the speaker recognizes and responds to it. For example, after the user speaks the voice signal "play song 112222", the speaker recognizes the voice signal and responds by outputting the voice signal "Song 112222 will be played for you".
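The negotiation in the living-room example can be sketched as follows. This is only an illustrative sketch: the `WakeReport` type, the decibel values, and the tie-breaking rule are assumptions introduced here, not details given in the application, which only states that the devices compare the audio energy information of the wake-up word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WakeReport:
    """Hypothetical per-device measurement of the wake-up word."""
    device: str        # illustrative device name
    energy_db: float   # illustrative audio energy of the wake-up word


def select_answering_device(reports):
    """Pick the device that heard the wake-up word with the highest energy.

    Ties are broken by device name so that every device, running the same
    rule on the same reports, reaches the same decision (an assumption).
    """
    if not reports:
        raise ValueError("no wake-up word reports")
    best = max(reports, key=lambda r: (r.energy_db, r.device))
    return best.device


# Made-up energies: the speaker is nearest to the user, so it is loudest.
reports = [
    WakeReport("speaker", -18.0),
    WakeReport("tv", -27.5),
    WakeReport("phone", -31.0),
]
answering = select_answering_device(reports)
```

With these made-up values the speaker is selected, matching the example in which only the speaker wakes up its voice assistant.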

In the above multi-device voice control process, the answering device recognizes and responds to the user's voice signal. However, due to the diversity and complexity of usage scenarios, this processing method suffers from misrecognition by the answering device, that is, the answering device may fail to accurately recognize the voice signal that the user inputs after the wake-up word.

Summary of the Invention

The present application provides a voice control method and electronic device, so as to solve the problem of misrecognition of voice control in a multi-device scenario and improve the accuracy of voice control.

In the first aspect, an embodiment of the present application provides a voice control method, which can be applied to a voice control system, and the voice control system can at least include a first electronic device and a second electronic device with a voice control function. The control method may include: the first electronic device and the second electronic device respectively receive a first voice command input by a user, and the first electronic device responds to the first voice command. The second electronic device records and saves the recording data, and the recording is used to record the second voice command input by the user. The second electronic device sends the audio recording data of the second electronic device to the first electronic device. The first electronic device responds to the second voice instruction according to the recorded data of the first electronic device and/or the recorded data of the second electronic device. The recording data of the first electronic device includes recording data of the second voice instruction input by the user recorded by the first electronic device.

The recording of the second electronic device may start before the first electronic device responds to the first voice instruction, which decouples the selection of the answering device from the recording performed by the electronic devices: regardless of whether the first electronic device has yet been determined as the answering device, the second electronic device can record and save the second voice command input by the user. After the first electronic device is determined as the answering device, the recording data of the second electronic device is sent to the first electronic device, and the first electronic device responds to the second voice command.

In this implementation, the first electronic device acts as the answering device and responds to the first voice command, the first electronic device and the second electronic device both record the second voice command and save the recording data, the second electronic device sends its own recording data to the first electronic device, and the first electronic device responds to the second voice instruction according to the recording data of the first electronic device and/or the recording data of the second electronic device. Because the voice command input by the user is also recorded by the non-answering device, and the answering device performs SE, ASR, and other processing based on the recording data of the answering device and/or the recording data of the non-answering device, the communication delay between devices incurred while selecting the answering device is effectively eliminated, which solves the frame loss problem of voice control caused by delay in multi-device scenarios. In addition, because the answering device responds to the second voice command using recording data collected cooperatively by multiple devices, the impact that the audio quality of the voice command picked up by a single electronic device has on ASR recognition accuracy can be reduced, improving the accuracy of voice control.

In a possible design, the method may further include: the first electronic device invokes a voice pickup instruction to the second electronic device, where the voice pickup instruction is used by the second electronic device to return the recording data of the second electronic device.

In a possible design, the recording by the second electronic device may include: recording by the second electronic device when or after the second electronic device receives the first voice instruction input by the user.

In this implementation, the second electronic device starts recording when or after it receives the first voice command input by the user, that is, before the answering device is determined, so the second electronic device can record the second voice command input by the user. This can effectively eliminate the communication delay between devices in the process of selecting the answering device, thereby solving the problem of frame loss in voice control caused by delay in multi-device scenarios.
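The effect of starting the recording before the answering device is determined can be illustrated with a toy timeline. This is a minimal sketch under stated assumptions: each character stands in for one audio frame, and the negotiation delay of 12 frames is a made-up value chosen only for illustration.

```python
def capture(frames, start_index):
    """Return the frames a device holds if it starts recording at start_index."""
    return frames[start_index:]


# Each element stands in for one audio frame of the second voice command.
command_frames = list("play song 112222")

# Hypothetical delay (in frames) spent negotiating the answering device.
negotiation_delay_frames = 12

# A device that only starts recording once negotiation has finished.
late = "".join(capture(command_frames, negotiation_delay_frames))

# A device that starts recording when it receives the wake-up word.
early = "".join(capture(command_frames, 0))
```

Here `late` holds only the tail of the command while `early` holds all of it, which is the frame loss that early recording is meant to avoid.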

In a possible design, the method may further include: when or after the first electronic device receives the first voice instruction input by the user, the first electronic device records, and the recording is used to record the second voice instruction input by the user.

In a possible design, the first voice command is used to wake up the voice control function of the first electronic device and/or the second electronic device.

For ease of understanding, the first voice instruction here may be the voice instruction of step 401 in the following embodiment shown in FIG. 3.

In a possible design, the method may further include: the first electronic device and the second electronic device determine, according to the audio quality information of the first voice command as received by each of them, that the first electronic device is the answering device of the voice control system.

In a possible design, after the first electronic device responds to the first voice command and before the second voice command input by the user is recorded, the method may further include: during the recording process of the first electronic device and the second electronic device, if the first electronic device does not detect the second voice instruction input by the user within a preset time period, the first electronic device deletes the saved recording data and continues to record. The first electronic device invokes a multi-round dialogue pause command to the second electronic device, where the multi-round dialogue pause command is used to instruct the multi-round dialogue to pause temporarily. The second electronic device deletes the saved recording data and continues recording.
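The pause handling in this design can be sketched as below. The `Recorder` class, the `handle_silence` function, and the timing values are hypothetical names and numbers introduced for illustration; in particular, the direct method call on the second recorder merely stands in for the cross-device multi-round dialogue pause command.

```python
class Recorder:
    """Hypothetical stand-in for the recording state of one electronic device."""

    def __init__(self, name):
        self.name = name
        self.saved = []        # saved recording data (frames)
        self.recording = True  # recording is never stopped in this sketch

    def on_frame(self, frame):
        if self.recording:
            self.saved.append(frame)

    def discard_and_continue(self):
        # Delete the saved recording data but keep recording.
        self.saved.clear()


def handle_silence(first, second, silent_s, preset_s):
    """Apply the pause rule; return True if a pause command was issued."""
    if silent_s < preset_s:
        return False
    first.discard_and_continue()
    # Stands in for invoking the multi-round dialogue pause command
    # on the second electronic device.
    second.discard_and_continue()
    return True


first, second = Recorder("first"), Recorder("second")
for frame in ("noise-1", "noise-2"):
    first.on_frame(frame)
    second.on_frame(frame)

# No second voice command within the (made-up) 5-second preset period.
paused = handle_silence(first, second, silent_s=6.0, preset_s=5.0)

# Both devices are still recording, so a later command is captured.
first.on_frame("play")
second.on_frame("play")
```

After the pause, both buffers contain only the frames recorded afterwards, so stale audio is not passed on to later processing.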

For ease of understanding, the first voice instruction here may be the voice instruction before step 701 in the embodiment shown in FIG. 6 below. The second voice instruction here may be the voice instruction of step 703 in the following embodiment shown in FIG. 6 .

In a possible design, the method may further include: the first electronic device receiving audio quality information of the recording data of the second electronic device sent by the second electronic device.

This implementation can speed up the determination of the optimal audio pickup device, thereby improving the response speed of voice control.

In a possible design, the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device may include: the first electronic device determines the optimal audio pickup device from the voice control system according to the audio quality information of the recording data of the first electronic device and the audio quality information of the recording data of the second electronic device. When the optimal audio pickup device is the first electronic device, the first electronic device responds to the second voice command according to the recording data of the first electronic device, or according to the recording data of the first electronic device and the recording data of the second electronic device. When the optimal audio pickup device is the second electronic device, the first electronic device responds to the second voice command according to the recording data of the second electronic device, or according to the recording data of the second electronic device and the recording data of the first electronic device. The audio quality information is used to indicate the audio quality of the recording data.
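One way to picture this selection is the sketch below. It assumes, purely for illustration, that the audio quality information is a single numeric score per device (higher is better); the application does not specify the metric, and the function names and byte strings are hypothetical.

```python
def choose_pickup_device(quality_by_device):
    """Return the device whose recording has the best quality score.

    Iterating over sorted names makes ties resolve deterministically
    (an assumption; the application does not discuss ties).
    """
    if not quality_by_device:
        raise ValueError("no audio quality information")
    return max(sorted(quality_by_device), key=quality_by_device.get)


def recordings_for_response(recordings, quality_by_device, combine=False):
    """Select the recording data the answering device would process.

    The optimal device's recording always comes first; with combine=True
    the other devices' recordings follow it.
    """
    best = choose_pickup_device(quality_by_device)
    selected = [recordings[best]]
    if combine:
        selected += [recordings[d] for d in sorted(recordings) if d != best]
    return selected


# Made-up scores: the second device picked up the command more cleanly.
quality = {"first": 12.0, "second": 21.5}
recordings = {"first": b"first-pcm", "second": b"second-pcm"}

only_best = recordings_for_response(recordings, quality)
best_then_rest = recordings_for_response(recordings, quality, combine=True)
```

The `combine` flag mirrors the two alternatives in the text: responding from the optimal device's recording alone, or from that recording together with the other device's recording.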

In this implementation, by using the recording data of the optimal audio pickup device to respond to the second voice command, the influence of noise on the accuracy of voice control can be reduced.

In a possible design, the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device may include: the first electronic device responds to the second voice instruction according to the audio content information of the recording data of the first electronic device and/or the audio content information of the recording data of the second electronic device. The audio content information is used to represent the audio content of the recording data.

For example, when the recording data of the first electronic device contains more audio content information than the recording data of the second electronic device, the second voice command is responded to according to the audio content information of the recording data of the first electronic device. When the recording data of the first electronic device contains less audio content information than the recording data of the second electronic device, the second voice command is responded to according to the audio content information of the recording data of the second electronic device. For another example, when the audio content information of the recording data of the first electronic device and the audio content information of the recording data of the second electronic device partially overlap, the first electronic device can splice the audio content information of the recording data of the first electronic device with the audio content information of the recording data of the second electronic device, and respond to the second voice command according to the spliced audio content information.
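The splicing case can be sketched as follows. The sketch assumes that the audio content information is available as text and that the partially identical content appears at the end of one recording and the start of the other; both are simplifying assumptions made here, and `overlap` and `splice` are hypothetical helper names.

```python
def overlap(left, right):
    """Length of the longest suffix of left that is also a prefix of right."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def splice(a, b):
    """Merge two partial contents of the same voice command."""
    # If one content fully contains the other, use the one with more content.
    if b in a:
        return a
    if a in b:
        return b
    k_ab = overlap(a, b)  # a ends where b begins
    k_ba = overlap(b, a)  # b ends where a begins
    if k_ab == 0 and k_ba == 0:
        # Nothing in common: fall back to the content with more information.
        return max(a, b, key=len)
    if k_ab >= k_ba:
        return a + b[k_ab:]
    return b + a[k_ba:]


# The first device lost the tail of the command, the second lost the head.
spliced = splice("play song 11", "song 112222")
```

In this made-up case the shared part is "song 11", and joining the two pieces around it recovers the whole command.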

In this implementation, by using the recording data collected cooperatively by multiple devices to respond to the second voice command, frame loss can be avoided and the accuracy of voice control can be improved.

In a second aspect, an embodiment of the present application provides a voice control method, which can be applied to a first electronic device of a voice control system, where the voice control system can further include at least a second electronic device. The voice control method can include: the first electronic device receives the first voice command input by the user, and the first electronic device responds to the first voice command. The first electronic device receives the recording data of the second electronic device sent by the second electronic device, where the recording data of the second electronic device includes recording data obtained by the second electronic device recording the second voice instruction input by the user. The first electronic device responds to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device, where the recording data of the first electronic device includes recording data obtained by the first electronic device recording the second voice command input by the user.

In a possible design, the method may further include: the first electronic device invokes a voice pickup instruction to the second electronic device, and the voice pickup instruction is used for the second electronic device to return the recording data of the second electronic device.

In a possible design, the method may further include: when or after the first electronic device receives the first voice instruction input by the user, the first electronic device records, and the recording is used to record the second voice instruction input by the user.

In a possible design, the method may further include: the first electronic device determines, according to the audio quality information of the first voice command received by the first electronic device and the audio quality information of the first voice command received by the second electronic device, that the first electronic device is the answering device of the voice control system.

In a possible design, after the first electronic device responds to the first voice command and before the second voice command input by the user is recorded, the method may further include: during the recording process of the first electronic device, if the first electronic device does not detect the second voice command input by the user within a preset time period, the first electronic device deletes the saved recording data and continues to record; the first electronic device invokes a multi-round dialogue pause command to the second electronic device, where the multi-round dialogue pause command is used to instruct the multi-round dialogue to pause temporarily; the second electronic device deletes the saved recording data and continues recording.

In a possible design, the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device may include: the first electronic device determines the optimal audio pickup device from the voice control system according to the audio quality information of the recording data of the first electronic device and the audio quality information of the recording data of the second electronic device. When the optimal audio pickup device is the first electronic device, the first electronic device responds to the second voice command according to the recording data of the first electronic device. When the optimal audio pickup device is the second electronic device, the first electronic device responds to the second voice command according to the recording data of the second electronic device, or according to the recording data of the second electronic device and the recording data of the first electronic device. The audio quality information is used to indicate the audio quality of the recording data.

In a third aspect, an embodiment of the present application provides a voice control method. The voice control method can be applied to a second electronic device of a voice control system, where the voice control system can further include at least a first electronic device. The voice control method can include: the second electronic device records and saves the recording data, and the recording is used to record the second voice command input by the user. The second electronic device sends the recording data of the second electronic device to the first electronic device, where the recording data of the second electronic device includes recording data obtained by the second electronic device recording the second voice command input by the user, and the recording data is used by the first electronic device to respond to the second voice instruction after responding to the first voice instruction.

In a possible design, the method may further include: the second electronic device receives a voice pickup instruction called by the first electronic device, and the voice pickup instruction is used for the second electronic device to return the recording data of the second electronic device.

In a possible design, the method may further include: the second electronic device determines, according to the audio quality information of the first voice command received by the second electronic device and the audio quality information of the first voice command received by the first electronic device, that the first electronic device is the answering device of the voice control system.

In a possible design, after the first electronic device responds to the first voice command, the method may further include: during the recording process of the second electronic device, the second electronic device receives a multi-round dialogue pause command invoked by the first electronic device, where the multi-round dialogue pause command is used to instruct the multi-round dialogue to pause temporarily; the second electronic device deletes the saved recording data and continues recording.

In a possible design, the method may further include: the second electronic device sends audio quality information of the recording data of the second electronic device to the first electronic device.

In a fourth aspect, an embodiment of the present application provides a voice control device, the device has the function of implementing the second aspect or any possible design of the second aspect. The functions can be implemented by hardware, or can be implemented by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, for example, a transceiver unit or module, and a processing unit or module.

In a fifth aspect, an embodiment of the present application provides a voice control device, the device has a function of implementing the third aspect or any possible design of the third aspect. The functions can be implemented by hardware, or can be implemented by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, for example, a transceiver unit or module, and a processing unit or module.

In a sixth aspect, an embodiment of the present application provides an electronic device, which may include: one or more processors; and one or more memories; wherein the one or more memories are used to store one or more programs; and the one or more processors are configured to run the one or more programs to implement the method according to the second aspect or any possible design of the second aspect.

In a seventh aspect, an embodiment of the present application provides an electronic device, which may include: one or more processors; and one or more memories; wherein the one or more memories are used to store one or more programs; and the one or more processors are configured to run the one or more programs to implement the method according to the third aspect or any possible design of the third aspect.

In an eighth aspect, an embodiment of the present application provides a computer-readable storage medium, characterized in that it includes a computer program which, when executed on a computer, causes the computer to execute the method described in the second aspect or any possible design of the second aspect.

In a ninth aspect, an embodiment of the present application provides a computer-readable storage medium, characterized in that it includes a computer program which, when executed on a computer, causes the computer to execute the method described in the third aspect or any possible design of the third aspect.

In a tenth aspect, an embodiment of the present application provides a chip, characterized in that it includes a processor and a memory, the memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory, to perform the method described in the second aspect or any possible design of the second aspect.

In an eleventh aspect, an embodiment of the present application provides a chip, characterized in that it includes a processor and a memory, the memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory, to perform the method described in the third aspect or any possible design of the third aspect.

In a twelfth aspect, embodiments of the present application provide a computer program product, which, when the computer program product runs on a computer, causes the computer to execute the method described in the second aspect or any possible design of the second aspect.

In a thirteenth aspect, the embodiments of the present application provide a computer program product, which, when the computer program product runs on a computer, causes the computer to execute the method described in the third aspect or any possible design of the third aspect.

In a fourteenth aspect, an embodiment of the present application provides a voice control system, where the voice control system includes at least a first electronic device and a second electronic device having a voice control function. The first electronic device is adapted to perform the method as described in the second aspect or any possible design of the second aspect. The second electronic device is configured to perform the method as described in the third aspect or any possible design of the third aspect.

The voice control method and electronic device of the embodiments of the present application solve the frame loss problem of voice control in the above multi-device scenario by having multiple devices start recording directly, without first performing cross-device communication, and thereby improve the accuracy of voice control. In addition, responding to the voice command input by the user based on recording data collected cooperatively by multiple devices can effectively solve the problem that the audio quality of the voice command picked up by an electronic device affects the accuracy of ASR recognition, which further improves the accuracy of voice control.

Description of Drawings

FIG. 1 is a schematic diagram of a voice control system provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a hardware structure of an electronic device provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of a voice control method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a scenario of multi-device voice control provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of another multi-device voice control scenario provided by an embodiment of the present application;

FIG. 6 is a schematic flowchart of another voice control method provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of another multi-device voice control scenario provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a voice control device according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a voice control apparatus provided by an embodiment of the present application;

FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The terms "first", "second", etc. involved in the embodiments of the present application are used only for the purpose of distinguishing and describing, and cannot be understood as indicating or implying relative importance, nor as indicating or implying a sequence. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a method, system, product, or device comprising a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.

It should be understood that, in this application, "at least one (item)" refers to one or more, and "a plurality" refers to two or more. "And/or" describes the relationship between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.

Voice assistant: An application built on artificial intelligence that, with the help of speech semantic recognition algorithms, helps users complete operations such as information query, device control, and text input through instant question-and-answer voice interaction. Voice assistants usually use staged cascade processing, providing service functions through a basic workflow that passes in turn through voice wake-up, voice front-end processing, automatic speech recognition (ASR), natural language understanding (NLU), dialogue management (DM), natural language generation (NLG), and text-to-speech (TTS). The voice front-end processing may include but is not limited to speech enhancement (SE). ASR can take the speech signal after SE noise reduction as input, and output a textual description of the user's speech signal. ASR is the basis on which voice assistant applications accurately complete subsequent recognition processing tasks, and the audio quality of the user's voice signal input to the ASR directly determines the accuracy of the ASR recognition result. The voice control method of the embodiments of the present application can ensure the accuracy and reliability of the user voice signal input to the ASR, thereby improving the accuracy of the ASR recognition result so that the subsequent recognition processing tasks can be completed accurately.

Voice wake-up: When the screen is locked or the voice assistant is dormant, the electronic device receives and detects a specific user voice signal (that is, the wake-up word), activates or starts the voice assistant, and puts the voice assistant into a state of waiting for voice signal input.

Acoustic echo cancellation (AEC): A voice front-end processing technology that uses sound wave interference to cancel the noise produced when the speaker's output returns to the microphone through the air, which can effectively alleviate the noise interference caused by the speaker playing audio.

In the multi-device voice control process, multiple electronic devices select an answering device through mutual communication and negotiation, and the answering device recognizes and responds to the user's voice signal. There are two causes of misrecognition in this approach: audio quality and delay. With regard to audio quality, due to the diversity and complexity of usage scenarios, the user voice commands picked up and processed by electronic devices are inevitably disturbed by various external and internal noises, which degrades the audio quality of the picked-up voice command. For example, external noise can be noise from air conditioner fans or unrelated human voices around the device, and internal noise can be the audio or video played by the electronic device itself. In addition, the distance and orientation between the electronic device and the user, the posture of the electronic device, and the performance of the microphone module also affect the audio quality of the user's voice commands picked up by the electronic device. When the audio quality of the picked-up voice command is poor, misrecognition occurs. With regard to delay, in the process of negotiating and selecting the answering device among multiple electronic devices, the communication delay caused by cross-device communication and the delay caused by selecting the answering device lead to frame loss, which in turn leads to misrecognition. For example, because of this delay, the user may say the voice signal "play song 112222" while the answering device only recognizes the voice signal "2222"; that is, the voice signal "play song 11" is not received and recognized, so the answering device cannot accurately recognize and respond to the user's voice command.

The voice control method of the embodiment of the present application can improve the audio quality and/or reduce the delay, so as to solve the problem of misrecognition of voice commands in the process of multi-device voice control. Because the multiple electronic devices start recording directly without cross-device communication, the delay caused by implementing multi-device wake-up and data transmission through communication is eliminated. This removes the impact of the delay on the accuracy of ASR recognition, solves the frame loss problem of voice control in the multi-device scenario, and improves the accuracy of voice control. One or more of the plurality of electronic devices are selected as the optimal sound pickup device, whose recording data has better audio quality than that of the other electronic devices. The voice command input by the user is then responded to based on the recording data of the optimal sound pickup device. Through multi-device cooperative sound pickup, the impact of the audio quality of voice commands picked up by electronic devices on the accuracy of ASR recognition can be reduced, and the accuracy of voice control can be improved.

The voice control method in the embodiment of the present application can be applied to a multi-device scenario. The multi-device scenario may include a scenario where a user uses multiple electronic devices concurrently, or a scenario where user voice interaction occurs within the effective working range of multiple electronic devices. Each of the plurality of electronic devices has a voice control function. This voice control function may be provided by a voice assistant. In the multi-device scenario, after the user speaks the wake-up word and the voice command, the method of this embodiment can ensure the accuracy and reliability of the voice command input to the ASR, thereby improving the accuracy of the ASR recognition result, so that the subsequent recognition and processing tasks and the response to the voice command can be completed. This makes the electronic device more intelligent, enables efficient and accurate interaction between the electronic device and the user, and improves the user experience.

The voice command in the embodiment of the present application refers to the command input by the user to the electronic device in the form of sound. The voice command is used to enable the electronic device to provide the user with service functions such as interactive dialogue, information query, and device control. For example, the voice instruction may be a piece of voice signal input by the user through the microphone of the electronic device.

In some embodiments, a voice assistant may be installed in the electronic device to enable the electronic device to implement a voice control function. The voice assistant is generally dormant. The user can wake up the voice assistant by voice before using the voice control function of the electronic device. The voice signal that wakes up the voice assistant may be called a wake-up word (or wake-up voice). The wake-up word may be pre-registered in the electronic device. For example, the wake-up word may be "small E, small E". Of course, it can be understood that the wake-up word may also be any other word or statement, which can be flexibly set according to requirements, and the embodiments of the present application do not list them one by one.

In addition, the above-mentioned voice assistant may be an embedded application in the electronic device (i.e., a system application of the electronic device), or may be a downloadable application. Embedded applications are applications provided as part of the implementation of an electronic device such as a cell phone. A downloadable application is an application that can provide its own internet protocol multimedia subsystem (IMS) connection. The downloadable application may be pre-installed in the electronic device, or may be a third-party application downloaded by the user and installed in the electronic device.

The implementation of the embodiments of the present application will be described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of a voice control system according to an embodiment of the present application. The voice control system may include multiple electronic devices, and the multiple electronic devices meet one or more of the following conditions: they are connected to the same wireless access point (such as a Wi-Fi access point), or logged into the same account, or set by the user in the same group, or the user's voice interaction occurs within the effective working range of the plurality of electronic devices.
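The grouping conditions above can be expressed as a simple predicate. The sketch below is only one possible reading: the `DeviceInfo` fields and the function name are hypothetical, and the fourth condition (voice interaction within the effective working range) is not modeled because the text does not say how it would be detected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceInfo:
    name: str
    access_point: Optional[str] = None   # identifier of the wireless access point
    account: Optional[str] = None        # logged-in account
    group: Optional[str] = None          # user-defined group


def _same(a, b):
    # A missing value never counts as a match.
    return a is not None and a == b


def in_same_voice_system(a: DeviceInfo, b: DeviceInfo) -> bool:
    """True if at least one of the modeled conditions holds for both devices."""
    return (_same(a.access_point, b.access_point)
            or _same(a.account, b.account)
            or _same(a.group, b.group))


phone = DeviceInfo("phone 201", access_point="home-ap", account="user-a")
speaker = DeviceInfo("speaker 202", access_point="home-ap")
tv = DeviceInfo("tv 203", access_point="office-ap", account="user-a")
guest = DeviceInfo("guest tablet", access_point="cafe-ap", account="user-b")

print(in_same_voice_system(phone, speaker))  # same access point
print(in_same_voice_system(phone, tv))       # same account
print(in_same_voice_system(phone, guest))    # no shared condition
```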

As an example, the voice control system may include three electronic devices, for example, a first electronic device 201, a second electronic device 202 and a third electronic device 203. The first electronic device 201, the second electronic device 202 and the third electronic device 203 all have a voice control function; for example, a voice assistant is installed on each of them.

In some embodiments, the first electronic device 201, the second electronic device 202, and the third electronic device 203 can wake up the voice assistant with the same wake-up word, for example, "small E, small E".

Exemplarily, the electronic devices described in the embodiments of the present application, such as the first electronic device 201, the second electronic device 202, and the third electronic device 203, may be mobile phones, tablet computers, desktop computers, laptop computers, handheld computers, ultra-mobile personal computers (UMPCs), netbooks, cellular phones, personal digital assistants (PDAs), augmented reality (AR)/virtual reality (VR) devices, media players, TVs, smart speakers, smart watches, smart headphones and other devices. The specific form of the electronic device is not particularly limited in the embodiments of the present application. For the specific structure of the electronic device, reference may be made to the description of the embodiment corresponding to FIG. 2.

In addition, in some embodiments, the first electronic device 201, the second electronic device 202 and the third electronic device 203 can be the same type of electronic device; for example, the first electronic device 201, the second electronic device 202 and the third electronic device 203 are all mobile phones. In some other embodiments, the first electronic device 201, the second electronic device 202 and the third electronic device 203 can be different types of electronic devices; for example, the first electronic device 201 is a mobile phone, the second electronic device 202 is a smart speaker, and the third electronic device 203 is a television (as shown in FIG. 1).

In the embodiment of the present application, the first electronic device 201, the second electronic device 202 and the third electronic device 203 directly start recording without cross-device communication, so as to solve the frame loss problem of voice control in a multi-device scenario and improve the accuracy of voice control.

The first electronic device 201, the second electronic device 202 and the third electronic device 203 can each record on its own without being called by another device (e.g., a central device), thus realizing a decentralized recording method. This decentralized recording method does not need to perform the process of selecting a device as the calling device, which can effectively eliminate the delay caused by communication between devices and improve the accuracy of subsequent voice control.
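A minimal sketch of this decentralized recording is given below. It assumes that each device keeps its own bounded buffer and starts filling it as soon as it detects the wake-up word locally; the class name, method names and buffer size are hypothetical and are not taken from this application.

```python
from collections import deque


class LocalRecorder:
    """Per-device recorder that starts on its own wake-up detection."""

    def __init__(self, name, max_frames=500):
        self.name = name
        self.recording = False
        self.frames = deque(maxlen=max_frames)   # bounded local storage

    def on_wake_word(self):
        # Recording starts immediately; no message is exchanged with
        # other devices and no calling device is elected first.
        self.recording = True

    def on_audio_frame(self, frame):
        if self.recording:
            self.frames.append(frame)

    def stop(self):
        self.recording = False
        return list(self.frames)


devices = [LocalRecorder("201"), LocalRecorder("202"), LocalRecorder("203")]
for device in devices:
    device.on_wake_word()                 # each device reacts independently

for frame in ["play ", "song ", "11", "22", "22"]:
    for device in devices:
        device.on_audio_frame(frame)

recordings = {device.name: "".join(device.stop()) for device in devices}
print(recordings)
```

Because no device waits for another, every recorder in this sketch holds the full utterance.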

Afterwards, based on one or more dimensions such as the respective device information and the respective recording data of the first electronic device 201, the second electronic device 202 and the third electronic device 203, one or more of these three electronic devices are selected as the optimal sound pickup device. The voice command input by the user is responded to based on the recording data of the optimal sound pickup device. By means of multi-device cooperative sound pickup, the embodiment of the present application can reduce the impact of the audio quality of the voice command picked up by the electronic device on the accuracy of ASR recognition.
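One way to picture the selection step is to score each device's recording and keep the best one. The sketch below assumes that the score is a signal-to-noise ratio computed from the recorded samples and a per-device noise-power estimate; this criterion, the function names and the sample values are illustrative assumptions, since the text only says that device information and recording data may be used.

```python
import math


def snr_db(samples, noise_power):
    """Signal-to-noise ratio in dB from raw samples and a noise-power estimate."""
    signal_power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(signal_power / noise_power)


def pick_optimal_pickup_device(recordings):
    """recordings maps device name -> (samples, noise_power)."""
    return max(recordings, key=lambda name: snr_db(*recordings[name]))


# Made-up values for illustration; not measurements from any real device.
recordings = {
    "phone 201": ([0.20, -0.18, 0.22, -0.21], 0.0004),
    "speaker 202": ([0.50, -0.45, 0.48, -0.52], 0.0004),
    "tv 203": ([0.30, -0.28, 0.31, -0.29], 0.01),
}

best = pick_optimal_pickup_device(recordings)
print(best)  # speaker 202
```

A real system could combine several dimensions (for example device type or distance to the user) into the score instead of a single ratio.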

In some embodiments, the voice control system may also include a server 204. The server 204 can provide intelligent voice services.

Please refer to FIG. 2, which is a schematic structural diagram of an electronic device according to an embodiment of the present application.

As shown in FIG. 2, the electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.

It can be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device. In other embodiments, the electronic device may include more or fewer components than shown, or some components may be combined, or some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.

A controller can be the nerve center and command center of an electronic device. The controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.

A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby increasing the efficiency of the system.

In some embodiments, the processor 110 may include one or more interfaces. The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.

The charging management module 140 is used to receive charging input from the charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from the wired charger through the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive wireless charging input through a wireless charging coil of the electronic device. While the charging management module 140 charges the battery 142, it can also supply power to the electronic device through the power management module 141.

The power management module 141 is used for connecting the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may also be provided in the same device.

The wireless communication function of the electronic device can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.

Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in an electronic device can be used to cover a single or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.

The mobile communication module 150 can provide wireless communication solutions applied on the electronic device, including 2G/3G/4G/5G and the like. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like. The mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 can also amplify the signal modulated by the modem processor, and then convert it into an electromagnetic wave for radiation through the antenna 1. In some embodiments, at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110. In some embodiments, at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110.

The wireless communication module 160 can provide wireless communication solutions applied on the electronic device, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) technology, and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 can also receive the signal to be sent from the processor 110, perform frequency modulation on it, amplify the signal, and convert it into electromagnetic waves for radiation through the antenna 2. For example, in some embodiments of the present application, the wireless communication module 160 may interact with other electronic devices; for example, after detecting a voice signal matching the wake-up word, it sends energy information of the detected voice signal to other electronic devices. The electronic device in this embodiment of the present application may communicate with other electronic devices through the mobile communication module 150 and/or the wireless communication module 160. For example, the first electronic device 201 sends a voice pickup instruction and the like to the second electronic device 202 through the mobile communication module 150 and/or the wireless communication module 160.

In some embodiments, the antenna 1 of the electronic device is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the electronic device can communicate with the network and other devices through wireless communication technology. The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the Beidou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS) and/or satellite based augmentation systems (SBAS).

The electronic device realizes the display function through the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.

The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro OLED, quantum dot light-emitting diodes (QLED), and so on. In some embodiments, the electronic device may include 1 or N display screens 194, where N is a positive integer greater than 1.

The electronic device can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.

The ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, the light is transmitted to the camera photosensitive element through the lens, the light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing, where it is converted into an image visible to the naked eye. The ISP can also perform algorithm optimization on image noise, brightness, and skin tone. The ISP can also optimize parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.

The camera 193 is used to capture still images or video. An optical image of the object is generated through the lens and projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the electronic device may include 1 or N cameras 193, where N is a positive integer greater than 1.

A digital signal processor is used to process digital signals. In addition to digital image signals, it can also process other digital signals. For example, when the electronic device selects a frequency point, the digital signal processor is used to perform a Fourier transform on the energy of the frequency point.
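To make the frequency-point example concrete, the sketch below evaluates a single bin of the discrete Fourier transform and returns its energy. It is plain Python for illustration only; the function name and the test tone are assumptions, and a real DSP would use an optimized transform.

```python
import cmath
import math


def bin_energy(samples, k):
    """Energy |X[k]|^2 of DFT bin k for a block of real samples."""
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
              for i, x in enumerate(samples))
    return abs(acc) ** 2


# Illustrative input: a unit-amplitude sine that falls exactly on bin 5
# of a 64-sample block.
N = 64
tone = [math.sin(2 * math.pi * 5 * i / N) for i in range(N)]

on_bin = bin_energy(tone, 5)    # close to (N / 2) ** 2 = 1024
off_bin = bin_energy(tone, 9)   # close to 0
print(round(on_bin, 6), off_bin < 1e-9)
```

Comparing the energy at candidate bins in this way is one simple basis for choosing a frequency point.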

Video codecs are used to compress or decompress digital video. An electronic device may support one or more video codecs. In this way, the electronic device can play or record videos in various encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4 and so on.

The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, such as the transfer mode between neurons in the human brain, it can quickly process the input information, and can continuously learn by itself. Through the NPU, applications such as intelligent cognition of electronic devices can be realized, such as image recognition, face recognition, speech recognition, text understanding, etc.

The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device. The external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example, to save files such as music and videos in the external memory card.

The internal memory 121 may be used to store computer executable program code, which includes instructions. The processor 110 executes various functional applications and data processing of the electronic device by executing the instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The program storage area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like. The data storage area can store data (such as audio data, phone book, etc.) created during the use of the electronic device. In addition, the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.

The electronic device can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, the application processor, and the like.

The audio module 170 is used for converting digital audio information into an analog audio signal for output, and also for converting an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.

The speaker 170A, also referred to as a "horn", is used to convert audio electrical signals into sound signals. Through the speaker 170A, the electronic device can play music or the audio of a hands-free call for the user.

The receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals. When the electronic device answers a call or plays a voice message, the voice can be heard by placing the receiver 170B close to the human ear.

The microphone 170C, also called a "mic" or "sound transmitter", is used to convert sound signals into electrical signals. When making a call, sending a voice message, or triggering the electronic device to perform certain events through the voice assistant, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device may be provided with at least one microphone 170C. In other embodiments, the electronic device may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions. For example, the electronic device in this embodiment of the present application may receive a voice instruction input by the user through the microphone 170C.

The earphone interface 170D is used to connect wired earphones. The earphone interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.

The pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals. In some embodiments, the pressure sensor 180A may be provided on the display screen 194. There are many types of pressure sensors 180A, such as resistive pressure sensors, inductive pressure sensors, capacitive pressure sensors, and the like. The capacitive pressure sensor may comprise at least two parallel plates of conductive material. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes. The electronic device determines the intensity of the pressure based on the change in capacitance. When a touch operation acts on the display screen 194, the electronic device detects the intensity of the touch operation according to the pressure sensor 180A. The electronic device can also calculate the touched position according to the detection signal of the pressure sensor 180A. In some embodiments, touch operations acting on the same touch position but with different touch operation intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than the first pressure threshold acts on the short message application icon, the instruction for viewing the short message is executed. When a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the short message application icon, the instruction to create a new short message is executed.
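The threshold rule in the short message example can be written as a small dispatch function. The function name, the returned labels and the numeric threshold below are hypothetical; only the comparison against the first pressure threshold comes from the text.

```python
def sms_icon_instruction(touch_intensity, first_pressure_threshold):
    """Map the intensity of a touch on the short message icon to an instruction."""
    if touch_intensity < first_pressure_threshold:
        return "view_short_message"
    # Intensity greater than or equal to the threshold.
    return "create_short_message"


THRESHOLD = 0.5   # illustrative value in arbitrary units

print(sms_icon_instruction(0.2, THRESHOLD))  # view_short_message
print(sms_icon_instruction(0.5, THRESHOLD))  # create_short_message
print(sms_icon_instruction(0.9, THRESHOLD))  # create_short_message
```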

The gyro sensor 180B can be used to determine the motion attitude of the electronic device. In some embodiments, the angular velocity of the electronic device about three axes (i.e., the x, y, and z axes) may be determined by the gyro sensor 180B. The gyro sensor 180B can be used for image stabilization. Exemplarily, when the shutter is pressed, the gyro sensor 180B detects the shaking angle of the electronic device, calculates the distance to be compensated by the lens module according to the angle, and allows the lens to counteract the shaking of the electronic device through reverse motion to achieve anti-shake. The gyro sensor 180B can also be used for navigation and somatosensory game scenarios.
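The text does not give the formula that turns the shaking angle into a compensation distance. The sketch below assumes a simple pinhole model in which a rotation by an angle shifts the image by the focal length times the tangent of that angle, and the lens moves by the same amount in the opposite direction. The function name and the numbers are illustrative only.

```python
import math


def lens_compensation_mm(shake_angle_deg, focal_length_mm):
    """Lens shift that counteracts a shake angle (assumed pinhole model).

    The sign is negative because the lens moves opposite to the shake.
    """
    image_shift = focal_length_mm * math.tan(math.radians(shake_angle_deg))
    return -image_shift


# Illustrative values: a 1 degree shake with a 4 mm focal length.
shift = lens_compensation_mm(1.0, 4.0)
print(round(shift, 4))
```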

The air pressure sensor 180C is used to measure air pressure. In some embodiments, the electronic device calculates the altitude from the air pressure value measured by the air pressure sensor 180C to assist in positioning and navigation.
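A common way to estimate altitude from air pressure is the international barometric formula. The sketch below uses that formula as an assumption, since the text does not specify how the altitude is calculated; the function name and the default sea-level pressure of 1013.25 hPa are likewise illustrative choices.

```python
def altitude_m(pressure_hpa, sea_level_hpa=1013.25):
    """Approximate altitude in meters from air pressure in hPa.

    Uses the international barometric formula for the lower atmosphere.
    """
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))


print(altitude_m(1013.25))        # 0.0 at the reference pressure
print(round(altitude_m(899.0)))   # roughly 1000 m
```

Because local sea-level pressure changes with the weather, such an estimate is typically used to assist positioning rather than as an absolute height.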

The magnetic sensor 180D includes a Hall sensor. The electronic device can use the magnetic sensor 180D to detect the opening and closing of a flip holster. In some embodiments, when the electronic device is a flip phone, the electronic device can detect the opening and closing of the flip cover according to the magnetic sensor 180D. Further, features such as automatic unlocking when the cover is flipped open are set according to the detected opening and closing state of the holster or of the flip cover.

The acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device in various directions (generally three axes). The magnitude and direction of gravity can be detected when the electronic device is stationary. It can also be used to identify the posture of electronic devices, and can be used in applications such as horizontal and vertical screen switching, pedometers, etc.
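Horizontal and vertical screen switching can be sketched by comparing the gravity components along the two screen axes while the device is roughly stationary. The axis convention (x along the short edge, y along the long edge), the function name and the sample readings are assumptions made for illustration.

```python
def screen_orientation(ax, ay):
    """Classify orientation from gravity components in m/s^2.

    ax: acceleration along the device's short edge.
    ay: acceleration along the device's long edge.
    """
    return "portrait" if abs(ay) >= abs(ax) else "landscape"


print(screen_orientation(0.1, 9.7))    # portrait: gravity along the long edge
print(screen_orientation(9.6, 0.3))    # landscape: gravity along the short edge
print(screen_orientation(-9.6, 0.3))   # landscape, rotated the other way
```

A practical implementation would usually add hysteresis so the screen does not flip back and forth near the diagonal.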

The distance sensor 180F is used for measuring distance. The electronic device can measure distance by infrared or laser. In some embodiments, when shooting a scene, the electronic device can use the distance sensor 180F to measure the distance to achieve fast focusing.

The proximity light sensor 180G may include, for example, light emitting diodes (LEDs) and light detectors, such as photodiodes. The light emitting diodes may be infrared light emitting diodes. The electronic device emits infrared light outward through the light-emitting diodes and uses the photodiodes to detect infrared light reflected from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object in the vicinity of the electronic device. When insufficient reflected light is detected, the electronic device can determine that there is no object in its vicinity. The electronic device can use the proximity light sensor 180G to detect that the user is holding the electronic device close to the ear to talk, so as to automatically turn off the screen to save power. The proximity light sensor 180G can also be used in holster mode and pocket mode to automatically unlock and lock the screen.

The ambient light sensor 180L is used to sense ambient light brightness. The electronic device can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures. The ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the electronic device is in the pocket to prevent accidental touch.

The fingerprint sensor 180H is used to collect fingerprints. The electronic device can use the collected fingerprint characteristics to implement fingerprint unlocking, access to application locks, fingerprint-triggered photographing, fingerprint-based answering of incoming calls, and the like.

The temperature sensor 180J is used to detect the temperature. In some embodiments, the electronic device utilizes the temperature detected by the temperature sensor 180J to implement a temperature handling strategy. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold value, the electronic device may reduce the performance of the processor located near the temperature sensor 180J in order to reduce power consumption and implement thermal protection. In other embodiments, when the temperature is lower than another threshold, the electronic device heats the battery 142 to avoid abnormal shutdown of the electronic device caused by the low temperature. In some other embodiments, when the temperature is lower than another threshold, the electronic device boosts the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperature.
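The temperature handling strategy can be summarized as a set of threshold checks. In the sketch below the three thresholds, their default values and the action labels are hypothetical, and the three behaviors, which the text describes as separate embodiments, are combined in one function only for compactness.

```python
def thermal_actions(temp_c, high_c=45.0, heat_below_c=0.0, boost_below_c=-10.0):
    """Return the protective actions for a reported temperature in Celsius."""
    actions = []
    if temp_c > high_c:
        # Lower the performance of the nearby processor to cut power draw.
        actions.append("reduce_processor_performance")
    if temp_c < heat_below_c:
        # Warm the battery to avoid an abnormal low-temperature shutdown.
        actions.append("heat_battery")
    if temp_c < boost_below_c:
        # Raise the battery output voltage for the same reason.
        actions.append("boost_battery_output_voltage")
    return actions


for reading in (50.0, 25.0, -5.0, -20.0):
    print(reading, thermal_actions(reading))
```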

The touch sensor 180K is also called a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 together form a touchscreen, also called a "touch-control screen". The touch sensor 180K is used to detect a touch operation on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device at a location different from that of the display screen 194.

The bone conduction sensor 180M can acquire vibration signals. In some embodiments, the bone conduction sensor 180M can acquire the vibration signal of the bone that vibrates when a person speaks. The bone conduction sensor 180M can also be in contact with the human pulse and receive the blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M can also be disposed in an earphone to form a bone conduction earphone. The audio module 170 can parse out the voice signal based on the vibration signal of the vibrating bone acquired by the bone conduction sensor 180M, so as to implement a voice function. The application processor can parse heart rate information based on the blood pressure pulsation signal acquired by the bone conduction sensor 180M, so as to implement a heart rate detection function.

The keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device may receive key input and generate key signal input related to user settings and function control of the electronic device.

Motor 191 can generate vibrating cues. The motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback. For example, touch operations acting on different applications (such as taking pictures, playing audio, etc.) can correspond to different vibration feedback effects. The motor 191 can also correspond to different vibration feedback effects for touch operations on different areas of the display screen 194 . Different application scenarios (for example: time reminder, receiving information, alarm clock, games, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect can also support customization.

The indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.

The SIM card interface 195 is used to connect a SIM card. The SIM card can be inserted into the SIM card interface 195 or pulled out from the SIM card interface 195 to achieve contact and separation with the electronic device. The electronic device can support 1 or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 195 can support Nano SIM card, Micro SIM card, SIM card and so on. Multiple cards can be inserted into the same SIM card interface 195 at the same time. The types of the plurality of cards may be the same or different. The SIM card interface 195 can also be compatible with different types of SIM cards. The SIM card interface 195 is also compatible with external memory cards. The electronic device interacts with the network through the SIM card to realize functions such as call and data communication. In some embodiments, the electronic device employs an eSIM, ie: an embedded SIM card. The eSIM card can be embedded in the electronic device and cannot be separated from the electronic device.

The methods in the following embodiments can all be implemented in an electronic device having the above-mentioned hardware structure.

In the embodiment of the present application, in the above-mentioned multi-device scenario, recording is directly started without cross-device communication between multiple devices, so as to solve the problem of frame loss of voice control in the multi-device scenario, and improve the accuracy of voice control.

After that, based on one or more dimensions such as the device information and the recording data of the multiple electronic devices, one or more electronic devices are selected from the multiple electronic devices as the optimal audio pickup device (also referred to below as the optimal radio device), and the voice command input by the user is responded to based on the recording data of the optimal radio device. Selecting the optimal radio device means choosing, as the voice pickup entry point called by the voice assistant, an electronic device that satisfies at least one of the following: the clearest pickup (closest to the user), the least noise interference (farthest from the noise source), or the best SE processing effect (the best microphone noise reduction performance, or support for AEC). This can effectively reduce the impact that the audio quality of the picked-up voice command has on ASR recognition accuracy. The device information may include, but is not limited to, static attribute information or dynamic attribute information of the electronic device. The static attribute information may include, but is not limited to, device model, system version, microphone capability information, and the like. The dynamic attribute information may include, but is not limited to, power information of the electronic device, headphone status information, microphone status information, speaker status information, audio quality information of the recording data, and the like. The speaker status information may be used to indicate whether the speaker of the electronic device is occupied. The audio quality information is used to indicate how good the audio quality of the recording data is. The specific form of the audio quality information may include one or more items such as sound intensity information, noise sound intensity information, and signal-to-noise ratio information.
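As an illustration of the device information described above, the static and dynamic attributes could be grouped into a small record. This is a sketch only; all class names, field names, units and example values are hypothetical and are not defined by the present application.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class StaticAttributes:
    device_model: str
    system_version: str
    microphone_capability: str  # e.g. whether AEC is supported


@dataclass
class AudioQualityInfo:
    # Any subset of these items may be present.
    sound_intensity_db: Optional[float] = None
    noise_intensity_db: Optional[float] = None
    snr_db: Optional[float] = None


@dataclass
class DynamicAttributes:
    battery_percent: int
    headphones_connected: bool
    microphone_in_use: bool
    speaker_in_use: bool  # whether the device's speaker is occupied
    audio_quality: AudioQualityInfo = field(default_factory=AudioQualityInfo)


@dataclass
class DeviceInfo:
    device_id: str
    static: StaticAttributes
    dynamic: DynamicAttributes


# Example values are invented for illustration.
speaker_info = DeviceInfo(
    device_id="speaker-201",
    static=StaticAttributes("model-x", "1.0", "supports AEC"),
    dynamic=DynamicAttributes(
        battery_percent=100,
        headphones_connected=False,
        microphone_in_use=False,
        speaker_in_use=False,
        audio_quality=AudioQualityInfo(snr_db=25.0),
    ),
)
```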

FIG. 3 is a schematic flowchart of a voice control method according to an embodiment of the present application. This embodiment is illustrated by taking the three electronic devices shown in FIG. 1 , a speaker 201 , a TV 202 and a mobile phone 203 as examples. As shown in FIG. 3, the method of this embodiment may include:

Step 401 , the speaker 201 , the television 202 and the mobile phone 203 respectively receive the first voice instruction input by the user.

The first voice instruction is used to wake up the voice assistant of the electronic device. For example, the first voice instruction may be the above-mentioned wake-up word "small E small E". In this embodiment, the first voice command is used to wake up the respective voice assistants of the speaker 201 , the television 202 and the mobile phone 203 .

For an electronic device installed with a voice assistant, if the electronic device has no other software or hardware to use a microphone to collect voice signals, the electronic device can monitor whether the user has a voice signal input in real time through the microphone. Generally, when a user wants to use the voice control function of the electronic device, he or she can make a sound within the sound pickup range of the electronic device, so as to input the emitted sound into the microphone. At this time, if no other software or hardware of the electronic device is using the microphone to collect the voice signal, the electronic device can monitor the corresponding voice signal, such as the first voice command, through the microphone.

For example, as shown in FIG. 4, when the user wants to use the voice control function, the user can say the wake-up word "small E, small E". If the position where the user speaks is within the respective pickup ranges of the speaker 201, the TV 202 and the mobile phone 203, and no other software or hardware on these devices is using the microphone to collect voice signals, the speaker 201, the TV 202 and the mobile phone 203 can each detect, through their respective microphones, the first voice instruction corresponding to the wake-up word "small E small E".

Step 402 , in response to the first voice command, the speaker 201 , the TV 202 and the mobile phone 203 wake up their respective voice assistants and start recording.

When the electronic device detects the first voice command, the electronic device wakes up the voice assistant in response. In one example, after the electronic device receives the first voice command, it can verify the first voice command, that is, determine whether the received first voice command is a wake-up word registered in the electronic device. If the verification passes, the received first voice command is a wake-up word, and the electronic device wakes up the voice assistant. If the verification fails, the received first voice command is not a wake-up word, and the electronic device may leave the voice assistant in a dormant state rather than wake it up.

In this embodiment, when the speaker 201, the TV 202 and the mobile phone 203 detect the first voice command respectively, the speaker 201, the TV 202 and the mobile phone 203 wake up their respective voice assistants and start recording. After the speaker 201, the TV 202 and the mobile phone 203 start recording respectively, they can detect whether the user inputs other voice commands through their respective microphones, and when detecting other voice commands input by the user, generate recording data and save them in their own devices.
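The wake-up check and the immediate start of recording in step 402 can be sketched as follows. This is a simplified illustration: a real device verifies the wake-up word on the audio signal, whereas the sketch compares a text transcript, and the names Device, normalize and REGISTERED_WAKE_WORDS are hypothetical.

```python
import string

# Wake-up words registered on the device (hypothetical registry).
REGISTERED_WAKE_WORDS = {"small e small e"}


def normalize(transcript: str) -> str:
    """Lowercase, drop ASCII punctuation and collapse whitespace."""
    no_punct = transcript.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(no_punct.split())


class Device:
    def __init__(self, name: str) -> None:
        self.name = name
        self.assistant_awake = False
        self.recording = False

    def on_first_voice_instruction(self, transcript: str) -> bool:
        """Wake the assistant and start recording if verification passes."""
        if normalize(transcript) in REGISTERED_WAKE_WORDS:
            self.assistant_awake = True
            # Recording starts right away, without waiting for other devices.
            self.recording = True
        return self.assistant_awake


devices = [Device(n) for n in ("speaker-201", "tv-202", "phone-203")]
results = [d.on_first_voice_instruction("small E, small E") for d in devices]
```

Each device decides on its own whether to wake up and record, which is why no cross-device message is needed at this point.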

For example, as shown in FIG. 3 and FIG. 4, after the speaker 201, the TV 202 and the mobile phone 203 start recording, they each receive the second voice instruction input by the user. Take the second voice command spoken by the user being "play song 112222" as an example. The speaker 201, the TV 202 and the mobile phone 203 each record the second voice command to generate their own recording data, and the content of the recording data is "play song 112222".

It should be noted that, in one possible implementation, recording data may be generated once for every 0.5 s of recording. The value 0.5 may also be another value, for example 0.6 or 1, which are not described one by one in the embodiments of the present application. When saving recording data, the new recording data may overwrite the previous recording data, or the previous recording data and the new recording data may both be kept. The embodiments of the present application use keeping both the previous and the new recording data as an example.
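The two ways of saving recording data mentioned above (overwriting or keeping earlier data) can be sketched with a small store. The class name RecordingStore and its methods are hypothetical, and byte strings stand in for audio segments.

```python
from collections import deque


class RecordingStore:
    """Stores fixed-length recording segments on the device."""

    def __init__(self, segment_seconds: float = 0.5, overwrite: bool = False) -> None:
        self.segment_seconds = segment_seconds
        # With maxlen=1 each new segment replaces the previous one;
        # with maxlen=None all segments are kept.
        self._segments = deque(maxlen=1 if overwrite else None)

    def save(self, segment: bytes) -> None:
        self._segments.append(segment)

    def data(self) -> bytes:
        """Return the stored recording data in recording order."""
        return b"".join(self._segments)

    def duration_seconds(self) -> float:
        return len(self._segments) * self.segment_seconds


keep_all = RecordingStore(overwrite=False)
latest_only = RecordingStore(overwrite=True)
for seg in (b"play ", b"song ", b"112222"):
    keep_all.save(seg)
    latest_only.save(seg)
```

Keeping all segments, the mode used as the example in this embodiment, lets the answering device later obtain the whole command rather than only its last 0.5 s.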

In some embodiments, the electronic device may further determine audio quality information corresponding to the recorded data according to the recorded data. In other words, the electronic device also evaluates the quality of its own recording data. As mentioned above, the audio quality information may include one or more items of sound intensity information, noise sound intensity information, and signal-to-noise ratio information.

Taking the three electronic devices in this embodiment as an example, the speaker 201 , the TV 202 and the mobile phone 203 can respectively perform quality evaluation on the respective recording data, and determine the audio quality information corresponding to the respective recording data.
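One way a device might derive the three items of audio quality information from its own recording is sketched below. This is an assumption-laden illustration: the present application does not specify how the values are computed, the function names are hypothetical, and the sketch presumes that speech samples and noise-only samples have already been separated.

```python
import math
from typing import Dict, Sequence


def level_db(samples: Sequence[float]) -> float:
    """Mean-square level of the samples in decibels (relative to 1.0)."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    # Floor the value so that digital silence does not produce log10(0).
    return 10.0 * math.log10(max(mean_square, 1e-12))


def audio_quality_info(
    speech: Sequence[float], noise: Sequence[float]
) -> Dict[str, float]:
    """Sound intensity, noise intensity and an approximate SNR."""
    sound_db = level_db(speech)
    noise_db = level_db(noise)
    return {
        "sound_intensity_db": sound_db,
        "noise_intensity_db": noise_db,
        # Approximation: the speech segment still contains some noise.
        "snr_db": sound_db - noise_db,
    }


quality = audio_quality_info(speech=[0.1] * 100, noise=[0.01] * 100)
```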

Step 403 , the speaker 201 , the TV 202 and the mobile phone 203 respectively execute the selection of the answering device, determine the answering device, and the answering device plays the answering voice corresponding to the first voice command.

Wherein, the execution order of step 402 and step 403 is not limited by the size of the serial number, and other execution sequences may also be used. For example, an answering device selection is performed while recording is started.

The answering device in this embodiment is used to play the answering voice corresponding to the voice command input by the user. For example, the answering device plays the answering voice corresponding to the first voice command, that is, the wake-up answering voice, such as "I'm here". The other electronic devices that are not used as the answering device wake up their voice assistants but do not play the answering voice corresponding to the voice command input by the user.

The electronic device may select the answering device based on the audio quality information corresponding to the first voice command. In one possible implementation, the electronic device evaluates the quality of the first voice command it received, determines the corresponding audio quality information, and broadcasts that audio quality information together with its own device information. The electronic device also receives the audio quality information and device information that the other electronic devices broadcast for the first voice command they received. The electronic device then selects one electronic device as the answering device according to the audio quality information and device information of all the electronic devices, for example, the electronic device with the best audio quality.

Continuing the example in step 402, when the speaker 201 detects the first voice command, the speaker 201 can also evaluate the quality of the first voice command, determine the audio quality information corresponding to the first voice command received by the speaker 201, and broadcast that audio quality information together with the device information of the speaker 201. Similarly, when the TV 202 detects the first voice command, the TV 202 can also evaluate the quality of the first voice command, determine the audio quality information corresponding to the first voice command received by the TV 202, and broadcast that audio quality information together with the device information of the TV 202. When the mobile phone 203 detects the first voice command, the mobile phone 203 can also evaluate the quality of the first voice command, determine the audio quality information corresponding to the first voice command received by the mobile phone 203, and broadcast that audio quality information together with the device information of the mobile phone 203. In this way, the speaker 201 can receive the audio quality information and device information corresponding to the first voice command from the TV 202 and the mobile phone 203, and can select one electronic device from the speaker 201, the TV 202 and the mobile phone 203 as the answering device according to the audio quality information and device information of the speaker 201, the TV 202 and the mobile phone 203.
Similarly, the TV 202 can receive the audio quality information and device information corresponding to the first voice command from the speaker 201 and the mobile phone 203, and can select one electronic device from the speaker 201, the TV 202 and the mobile phone 203 as the answering device according to the audio quality information and device information of the speaker 201, the TV 202 and the mobile phone 203. The mobile phone 203 can receive the audio quality information and device information corresponding to the first voice command from the speaker 201 and the TV 202, and can likewise select one electronic device from the speaker 201, the TV 202 and the mobile phone 203 as the answering device. Here, the case in which the speaker 201, the TV 202 and the mobile phone 203 all determine the speaker 201 to be the answering device is taken as an example.
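The broadcast-based selection described above works because every device applies the same rule to the same set of reports, so all devices reach the same decision without a central coordinator. A minimal sketch follows; the WakeReport structure, the tie-breaking order and all numeric values are hypothetical and are not specified by the present application.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WakeReport:
    """What a device broadcasts after hearing the first voice command."""
    device_id: str
    snr_db: float      # audio quality of the wake-up word as heard locally
    priority: int = 0  # optional device priority


def select_answering_device(reports: Iterable[WakeReport]) -> str:
    """Pick the device with the best audio quality.

    Ties are broken by priority and then by device_id, so the result
    does not depend on the order in which reports arrived.
    """
    best = max(reports, key=lambda r: (r.snr_db, r.priority, r.device_id))
    return best.device_id


speaker = WakeReport("speaker-201", snr_db=18.0)
tv = WakeReport("tv-202", snr_db=12.0)
phone = WakeReport("phone-203", snr_db=5.0)

# Each device sees its own report plus the two broadcasts it received,
# possibly in a different order.
views = {
    "speaker-201": [speaker, tv, phone],
    "tv-202": [tv, speaker, phone],
    "phone-203": [phone, tv, speaker],
}
decisions = {dev: select_answering_device(view) for dev, view in views.items()}
```

In this sketch all three devices independently conclude that the speaker 201 is the answering device, matching the example in this embodiment.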

For example, as shown in FIG. 4 , the speaker 201 acts as an answering device and plays a wake-up answering voice, such as "I am here". The TV set 202 and the mobile phone 203 do not play the wake-up response voice, but the voice assistants of the TV set 202 and the mobile phone 203 are in the wake-up state as described in step 402 above, and can record.

It should be noted that, during the selection process of the answering device, the answering device may also be selected in combination with other information, such as the priority of each electronic device. In addition, the specific implementation manner of performing the selection of the answering device may also adopt other manners, and this embodiment of the present application does not limit the foregoing manner. For example, the answering device in the last use process of the user or the answering device set by the user may be used as the answering device in this embodiment.

Step 404, the speaker 201 sends a voice pickup instruction to the TV 202 and to the mobile phone 203, where the voice pickup instruction is used to instruct the recipient to return its recording data.

After the above step 403, the speaker 201 starts to perform the distributed sound pickup task. The answering device can send a voice pickup instruction to each of the other non-answering devices, where the voice pickup instruction is used to instruct the non-answering device to return its recording data to the answering device.

Continuing the example of the above steps, the voice assistant of the speaker 201 can call the interface between the voice assistant of the TV 202 and the voice assistant of the speaker 201 to transmit the voice pickup instruction to the TV 202. The voice assistant of the speaker 201 can call the interface between the voice assistant of the mobile phone 203 and the voice assistant of the speaker 201 to transmit the voice pickup instruction to the mobile phone 203. The voice pickup instruction may carry the identification information of the answering device. The identification information of the answering device may be the media access control (MAC) address of the answering device. For example, the voice pickup instruction may carry the identification information of the speaker 201 to instruct the TV 202 to return the recording data to the speaker 201.
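A voice pickup instruction that carries the answering device's identification could be represented as a small message. The JSON encoding, the field names and the MAC address below are hypothetical choices made for illustration; the present application does not define a message format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class PickupInstruction:
    answering_device_mac: str  # identifies where to return recording data
    action: str = "return_recording_data"


def encode(instruction: PickupInstruction) -> bytes:
    return json.dumps(asdict(instruction)).encode("utf-8")


def decode(payload: bytes) -> PickupInstruction:
    return PickupInstruction(**json.loads(payload.decode("utf-8")))


def handle_pickup(payload: bytes, stored_recording: bytes) -> Tuple[str, bytes]:
    """On a non-answering device: return (destination MAC, recording data)."""
    instruction = decode(payload)
    return instruction.answering_device_mac, stored_recording


# The speaker 201 builds the instruction; the MAC address is made up.
message = encode(PickupInstruction(answering_device_mac="AA:BB:CC:DD:EE:01"))
destination, data = handle_pickup(message, stored_recording=b"play song 112222")
```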

Step 405 , the television 202 and the mobile phone 203 respectively send the recording data to the speaker 201 .

The answering device receives recorded data sent by other non-answering devices. After other non-answering devices send their own recording data, they can continue recording and send new recording data to the answering device.

Combined with the example of the above steps, the television 202 sends the audio recording data of the television 202 to the speaker 201 . The mobile phone 203 sends the recording data of the mobile phone 203 to the speaker 201 . The recorded data may include the above-mentioned second voice instruction. For example, the content of the recording data is "play song 112222".

In a possible implementation manner, the speaker 201 performs quality evaluation on the received recording data of the TV set 202 , and determines the audio quality information corresponding to the recording data of the TV set 202 . The speaker 201 performs quality evaluation on the received recording data of the mobile phone 203 , and determines the audio quality information corresponding to the recording data of the mobile phone 203 .

In another implementation manner, the speaker 201 may also receive audio quality information corresponding to the recording data of the television set 202 sent by the television set 202 . The speaker 201 can also receive audio quality information corresponding to the recording data of the mobile phone 203 sent by the mobile phone 203 .

Step 406, the speaker 201 determines the optimal radio device among the speaker 201, the TV 202 and the mobile phone 203 according to the audio quality information, and plays the response voice corresponding to the second voice command according to the recording data of the optimal radio device.

The answering device selects an optimal radio device from the multiple electronic devices according to the audio quality information corresponding to the recording data of the multiple electronic devices (including itself and the other non-answering devices), and uses the recording data of the optimal radio device to perform processing such as SE and ASR, so as to correctly recognize the voice command input by the user and then accurately respond to it. The accurate response to the voice command input by the user includes playing the response voice corresponding to the voice command. In some embodiments, the accurate response may further include triggering the answering device or another non-answering device to execute an event corresponding to the voice command. The event may be playing a song, playing a video, making a call, and so on.

It should be noted that, in some embodiments, the speaker 201 may also send the recording data of the optimal radio device to the server 204 shown in FIG. 1 , and the server 204 uses the recording data of the optimal device to perform SE, ASR and other processing, so as to correctly recognize the voice command input by the user, and then make an accurate response to the voice command input by the user.

For example, as shown in FIG. 4 and FIG. 5, although the user is closest to the mobile phone 203, the user is using the hair dryer 205, and the noise generated by the hair dryer 205 degrades the audio quality of the recording picked up by the mobile phone 203. The speaker 201 in this embodiment determines, according to the audio quality information of the recording data of the speaker 201, the TV 202 and the mobile phone 203, that the speaker 201 is the optimal radio device among the three. For example, as shown in FIG. 5, the speaker 201 can play the answering voice "Song 112222 will be played for you here". The multimedia resource of the song 112222 can be provided by the server 204 or the mobile phone 203.
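The selection in step 406 can be sketched for the hair dryer scenario. The numbers below are invented for illustration, and ranking by signal-to-noise ratio first and sound intensity second is an assumed rule; the present application only states that the choice is based on the audio quality information.

```python
from typing import Dict


def select_optimal_radio_device(quality: Dict[str, Dict[str, float]]) -> str:
    """Return the device whose recording has the best audio quality.

    Devices are compared by SNR and then by sound intensity; iterating
    over sorted names keeps the result deterministic on exact ties.
    """
    return max(
        sorted(quality),
        key=lambda dev: (quality[dev]["snr_db"], quality[dev]["sound_intensity_db"]),
    )


# Hypothetical measurements: the phone is closest to the user (loudest
# speech) but also closest to the hair dryer (loudest noise).
quality_by_device = {
    "speaker-201": {"sound_intensity_db": 60.0, "noise_intensity_db": 35.0, "snr_db": 25.0},
    "tv-202": {"sound_intensity_db": 55.0, "noise_intensity_db": 40.0, "snr_db": 15.0},
    "phone-203": {"sound_intensity_db": 70.0, "noise_intensity_db": 68.0, "snr_db": 2.0},
}
optimal = select_optimal_radio_device(quality_by_device)
```

The example shows why proximity alone is not a sufficient criterion: the device nearest the user has the highest sound intensity but the worst signal-to-noise ratio.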

Optionally, in another implementation, the speaker 201 may also play the response voice corresponding to the second voice command according to both its own recording data and the recording data of the optimal radio device. For example, the speaker 201 can splice its own recording data with the recording data of the optimal radio device, and play the response voice corresponding to the second voice command based on the spliced recording data.
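The present application does not say how the two recordings are spliced. One possible reading, sketched below purely as an assumption, is that segments are indexed by position in time and that the optimal radio device's segment is preferred wherever it is available, with the answering device's own segment filling the gaps. The function name and the segment indexing are hypothetical.

```python
from typing import Dict


def splice_recordings(own: Dict[int, bytes], optimal: Dict[int, bytes]) -> bytes:
    """Merge two segment maps keyed by segment index.

    For each index, the optimal radio device's segment is used when it
    exists; otherwise the answering device's own segment is used.
    """
    indices = sorted(set(own) | set(optimal))
    return b"".join(optimal[i] if i in optimal else own[i] for i in indices)


# Byte strings stand in for audio; upper case marks segments taken from
# the optimal radio device so the origin of each segment is visible.
own_segments = {0: b"pl", 1: b"ay", 2: b" so", 3: b"ng"}
optimal_segments = {1: b"AY", 2: b" SO"}
spliced = splice_recordings(own_segments, optimal_segments)
```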

Optionally, after step 406, steps 404 to 406 may be performed again to process new recording data in a similar manner, so as to correctly recognize a new voice command input by the user and then accurately respond to it.

Optionally, in some embodiments, the voice control method of the embodiment of the present application may further process the new recording data through the following steps.

Step 407 , the speaker 201 sends a stop recording instruction to the TV 202 and the mobile phone 203 respectively.

The answering device sends a stop recording instruction to other non-answering devices, and the stop recording instruction is used to instruct to stop recording and discard the recording data.

In step 408, the television 202 and the mobile phone 203 respectively stop recording, and discard the recording data.

Other non-answering devices stop recording based on the stop recording command to reduce power consumption.

For example, the speaker 201 sends a stop recording instruction to the TV 202 and the mobile phone 203 respectively. The television set 202 and the mobile phone 203 respectively stop recording and discard the recording data. For example, the recorded data corresponding to the second voice instruction is discarded. After that, the speaker 201 receives a new voice command input by the user. For example, take the third voice command spoken by the user as "change a song" as an example. The speaker 201 records the third voice instruction to generate recording data, and the content of the recording data is "change a song". The speaker 201 uses the recorded data to perform processing such as SE, ASR, etc., so as to correctly recognize the voice command input by the user, and then accurately respond to the voice command input by the user. For example, the speaker 201 can play the response voice "OK, switch songs for you", and play the switched songs.
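Steps 407 and 408 can be sketched from the point of view of a non-answering device. The class and method names are hypothetical, and byte strings stand in for audio data.

```python
from typing import List


class NonAnsweringDevice:
    def __init__(self, name: str) -> None:
        self.name = name
        self.recording = True          # recording began at wake-up
        self.segments: List[bytes] = []

    def record(self, segment: bytes) -> None:
        # Audio is only captured while recording is active.
        if self.recording:
            self.segments.append(segment)

    def on_stop_recording(self) -> None:
        """Handle the stop recording instruction from the answering device."""
        self.recording = False   # stop recording to reduce power consumption
        self.segments.clear()    # discard the stored recording data


tv = NonAnsweringDevice("tv-202")
tv.record(b"play song 112222")   # second voice command
tv.on_stop_recording()           # step 408
tv.record(b"change a song")      # third voice command is no longer captured
```

After the stop instruction only the answering device keeps recording, which matches the example above in which the speaker 201 alone records the third voice command.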

It should be noted that this embodiment is described using the example in which the answering device and the optimal radio device are both the speaker 201. The answering device and the optimal radio device may be the same device or different devices. For example, the answering device may be the speaker 201 and the optimal radio device may be the TV 202; the embodiments of the present application are not limited by the above examples. When the answering device and the optimal radio device are different devices, the answering device can call the recording data of the optimal radio device.

In some embodiments, when the voice command received by the answering device is used to turn off the voice assistant, the answering device can stop calling the recording data of other non-answering devices, and then stop its own distributed voice recording task, and discard the recorded data.

In this embodiment of the present application, when multiple electronic devices each receive the first voice command input by the user, the multiple electronic devices wake up their respective voice assistants and start recording, where the first voice command is used to wake up the voice assistant of the electronic device. After the multiple electronic devices negotiate and determine the answering device, the answering device can determine the optimal radio device according to the recording data of each electronic device, and play the response voice corresponding to the second voice command according to the recording data of the optimal radio device. Unlike a method in which recording starts only after a call from a central device, in this embodiment each electronic device starts recording directly after being woken up and no longer relies on a call from a central device, which realizes a decentralized collaborative recording method. Because recording has already started before the answering device is determined, and that recording data is used for processing such as SE and ASR, the effect of inter-device communication delay is effectively eliminated, thereby solving the problem of frame loss in voice control caused by delay in multi-device scenarios.
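The frame loss argument can be made concrete with a small calculation that compares recording started at wake-up with recording started only after a delayed call from a central device. The timings and the 10 ms frame length are hypothetical values chosen for illustration.

```python
from typing import Tuple


def captured_and_lost_frames(
    command_start_ms: int,
    command_end_ms: int,
    recording_start_ms: int,
    frame_ms: int = 10,
) -> Tuple[int, int]:
    """Return (captured, lost) frame counts for one spoken command."""
    total = (command_end_ms - command_start_ms) // frame_ms
    # Frames spoken before recording started cannot be recovered.
    first_recorded_ms = max(command_start_ms, recording_start_ms)
    captured = max(0, (command_end_ms - first_recorded_ms) // frame_ms)
    return captured, total - captured


# Wake-up word detected at t = 0 ms; the user speaks the second voice
# command from 300 ms to 1500 ms.
# Centralized: recording starts at 500 ms, after device negotiation and
# a cross-device call.
centralized = captured_and_lost_frames(300, 1500, recording_start_ms=500)
# Decentralized: every device starts recording at wake-up (0 ms).
decentralized = captured_and_lost_frames(300, 1500, recording_start_ms=0)
```

With these assumed timings the centralized start loses the first 200 ms of the command, while the decentralized start captures all of it.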

By using the recording data of the optimal radio device for processing such as SE and ASR, the voice command input by the user can be correctly recognized and then accurately responded to, which improves the accuracy of voice control.

By combining the wake-up process and the sound pickup process, recording can be started in advance and each electronic device can evaluate the quality of its own recording data. This speeds up audio evaluation on the electronic device and shortens the time needed for the subsequent decision on the optimal radio device, which accelerates the processing flow of the voice control method and improves the response speed of voice control.

It should be noted that the embodiment of FIG. 3 is described using the example of a wake-up word that wakes up the voice assistant and starts recording. The embodiments of the present application are not limited thereto. The embodiments of the present application may also omit the above wake-up process and trigger recording on the electronic device in another manner, with the accuracy of voice control still being improved through multi-device collaborative sound pickup. For example, the other manner may be that the electronic device detects a human voice, or that the electronic device detects the voice of a specific user; such manners are not described one by one in the embodiments of the present application. The specific implementation of a voice control method that triggers recording without the above wake-up process is similar to the embodiment shown in FIG. 3. For example, after recording starts, the answering device issues the voice pickup instruction, the non-answering devices return their recording data, and the answering device determines the optimal radio device according to the recording data of each electronic device and plays the response voice corresponding to the second voice command according to the recording data of the optimal radio device. For the implementation principle and technical effect, reference may be made to the explanations of the above embodiments.

FIG. 6 is a schematic flowchart of another voice control method provided by an embodiment of the present application. This embodiment is illustrated by taking the three electronic devices shown in FIG. 1 , a speaker 201 , a television 202 and a mobile phone 203 , and the answering device being the speaker 201 as an example. This embodiment is not the first invocation after the electronic device wakes up, for example, the second invocation, the third invocation, and the fourth invocation of the multi-round dialogue of the voice assistant. As shown in FIG. 6 , the method of this embodiment may include:

Step 701, the speaker 201 sends a multi-round dialogue pause instruction to the TV 202 and to the mobile phone 203, where the multi-round dialogue pause instruction is used to instruct that the multi-round dialogue temporarily stop.

When the answering device does not detect a new voice command input by the user within a preset time period, that is, when there is a time interval between the voice commands input by the user, the answering device detects this time interval and triggers a multi-round dialogue pause operation. The answering device may send a multi-round dialogue pause instruction to each of the other non-answering devices, where the multi-round dialogue pause instruction is used to instruct that the multi-round dialogue temporarily stop.

For example, the voice assistant of the speaker 201 may call the interface between the voice assistant of the TV 202 and the voice assistant of the speaker 201 to transmit the multi-round dialogue pause instruction to the TV 202. The voice assistant of the speaker 201 may call the interface between the voice assistant of the mobile phone 203 and the voice assistant of the speaker 201 to transmit the multi-round dialogue pause instruction to the mobile phone 203. The speaker 201 deletes its previously saved recording data and continues recording.

Step 702, the TV 202 and the mobile phone 203 each delete their saved recording data and continue recording.

The TV 202 and the mobile phone 203 each delete the recording data saved before the multi-round dialogue pause instruction was invoked, and continue recording.
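Steps 701 and 702 can be sketched as follows: the answering device detects that the preset time period has elapsed without a new command, and every device then deletes its saved recording data while continuing to record. The 5-second period, the class name and the function name are hypothetical, and byte strings stand in for audio data.

```python
from typing import List


class DialogueDevice:
    def __init__(self, name: str) -> None:
        self.name = name
        self.recording = True
        self.segments: List[bytes] = []

    def record(self, segment: bytes) -> None:
        if self.recording:
            self.segments.append(segment)

    def on_multi_round_pause(self) -> None:
        """Delete earlier recording data; recording itself stays on."""
        self.segments.clear()


def silence_exceeded(last_command_s: float, now_s: float, preset_s: float = 5.0) -> bool:
    """True when no new voice command arrived within the preset period."""
    return now_s - last_command_s >= preset_s


devices = [DialogueDevice(n) for n in ("speaker-201", "tv-202", "phone-203")]
for device in devices:
    device.record(b"change a song")          # an earlier round of dialogue

if silence_exceeded(last_command_s=10.0, now_s=16.0):
    for device in devices:                   # steps 701 and 702
        device.on_multi_round_pause()

for device in devices:
    device.record(b"play movie 333333")      # fourth voice command
```

Unlike the stop recording instruction, the pause keeps every microphone active, so the next command is captured by all devices without any start-up delay.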

Step 703 , the speaker 201 , the television 202 and the mobile phone 203 respectively receive the fourth voice command input by the user, and record the fourth voice command respectively to generate respective recording data.

In some embodiments, the speaker 201 , the TV 202 and the mobile phone 203 may further perform quality evaluation on the respective received recording data, and determine the audio quality information corresponding to the respective received recording data.

For example, as shown in FIG. 7, the fourth voice command spoken by the user may be "play movie 333333". The speaker 201, the TV 202 and the mobile phone 203 each record the fourth voice command to generate their respective recording data, the content of which is "play movie 333333".

In step 704, the speaker 201 sends a voice pickup instruction to the TV 202 and to the mobile phone 203, where the voice pickup instruction is used to instruct the receiving device to return its recording data.

After the above step 703, the speaker 201 starts to perform the distributed sound pickup task again. The answering device may send the pickup instruction to each of the other, non-answering devices, where the pickup instruction is used to instruct the non-answering device to return its recording data to the answering device.
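As one possible illustration of the quality evaluation mentioned above, the sketch below computes a signal-to-noise ratio in decibels from two short sample sequences. The use of SNR as the audio quality information, the function names and the sample values are assumptions made for illustration only; this passage does not fix a particular metric. The sketch also assumes that the noise segment is non-empty and has non-zero power.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    # Average of the squared sample values.
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    # SNR in dB: 10 * log10(P_speech / P_noise); assumes P_noise > 0.
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


# Made-up samples: the speech amplitude is ten times the noise amplitude.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.05, -0.05, 0.05, -0.05]
quality = snr_db(speech, noise)
print(round(quality, 3))
```

A tenfold amplitude ratio corresponds to a hundredfold power ratio, so the value obtained here is approximately 20 dB.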

Step 705 , the television 202 and the mobile phone 203 respectively send the recording data to the speaker 201 .

Combined with the example of the above steps, the television 202 sends the audio recording data of the television 202 to the speaker 201 . The mobile phone 203 sends the recording data of the mobile phone 203 to the speaker 201 . For example, the content of the audio recording data is "play movie 333333".
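Steps 704 and 705 amount to a request and reply exchange, which can be sketched as follows. The classes `RecordingData` and `NonAnsweringDevice`, the function `collect_recordings`, the device identifiers and the quality scores are hypothetical; a text string stands in for the recorded audio, and a direct method call stands in for whatever interface the voice assistants actually use.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RecordingData:
    device: str
    content: str      # stand-in for the recorded audio
    quality: float    # hypothetical audio quality score


class NonAnsweringDevice:
    def __init__(self, name: str) -> None:
        self.name = name
        self.recording: Optional[RecordingData] = None

    def record(self, content: str, quality: float) -> None:
        # Step 703: record the voice command and keep the recording data.
        self.recording = RecordingData(self.name, content, quality)

    def on_pickup_instruction(self) -> Optional[RecordingData]:
        # Step 705: return the recording data to the answering device.
        return self.recording


def collect_recordings(devices: List[NonAnsweringDevice]) -> Dict[str, RecordingData]:
    # Step 704: the answering device sends the pickup instruction to each device.
    replies: Dict[str, RecordingData] = {}
    for device in devices:
        data = device.on_pickup_instruction()
        if data is not None:
            replies[device.name] = data
    return replies


tv = NonAnsweringDevice("tv_202")
phone = NonAnsweringDevice("phone_203")
tv.record("play movie 333333", quality=0.6)
phone.record("play movie 333333", quality=0.5)

replies = collect_recordings([tv, phone])
print(sorted(replies))
```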

Step 706, the speaker 201 determines the optimal sound-receiving device among the speaker 201, the TV 202 and the mobile phone 203 according to the audio quality information, and responds to the fourth voice command according to the recording data of the optimal sound-receiving device.

The answering device selects an optimal sound-receiving device from the multiple electronic devices (including itself and the other, non-answering devices) according to the audio quality information corresponding to their recording data, and uses the recording data of the optimal sound-receiving device to perform SE, ASR and similar processing, so as to correctly identify the voice command input by the user and then respond to it accurately. Accurately responding to the voice command input by the user includes playing the response voice corresponding to the voice command. In some embodiments, accurately responding to the voice command input by the user may further include triggering the answering device or another, non-answering device to execute an event corresponding to the voice command. The event may be playing a song, playing a video, making a call, and so on.

For example, as shown in FIG. 7, the speaker 201 of this embodiment determines, according to the audio quality information of the recording data of the speaker 201, the TV 202 and the mobile phone 203, that the optimal sound-receiving device among the three is the speaker 201. The speaker 201 can then play the response voice "The movie 333333 will be played on the TV", and the TV 202 starts to play the movie 333333.

Afterwards, if the user triggers a multi-round dialogue pause again, the above steps 701 to 706 may be executed again. During this process, the optimal sound-receiving device can change. For example, with reference to the example shown in FIG. 7, after the TV starts to play the movie, the fifth voice command spoken by the user may be "turn the volume down". The speaker 201, the TV 202 and the mobile phone 203 each record the fifth voice command to generate their respective recording data, the content of which is "turn the volume down". Afterwards, through the processes of the above steps, the TV 202 is determined to be the optimal sound-receiving device among the speaker 201, the TV 202 and the mobile phone 203, and the speaker 201 can respond to the fifth voice command based on the recording data of the TV 202. In this embodiment, when the user's environment changes, different devices can be selected for sound pickup according to the recording effect. For example, after the TV 202 starts to play the movie, strong self-noise (such as the sound produced during movie playback) is present in the user's home, and the sound played by the TV is mixed into the audio picked up by the voice assistant of the speaker 201. If the recording data of the speaker 201 were used, ASR recognition errors would result. The voice control method of this embodiment can improve the accuracy of ASR recognition by dynamically invoking the TV to pick up sound and perform echo cancellation, thereby responding accurately to the voice commands input by the user and improving the accuracy of voice control.
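The selection of step 706, and the way its result can differ from one round to the next, can be sketched as follows. The function name `select_optimal_device`, the device identifiers and the numeric scores are hypothetical; a higher score is assumed to mean better audio quality.

```python
from typing import Dict


def select_optimal_device(quality_by_device: Dict[str, float]) -> str:
    # Pick the device whose recording data has the highest quality score.
    # On a tie, the device listed first wins (an arbitrary choice in this sketch).
    if not quality_by_device:
        raise ValueError("no audio quality information available")
    return max(quality_by_device, key=quality_by_device.get)


# Made-up scores for the fourth voice command: the room is quiet.
fourth_round = {"speaker_201": 0.9, "tv_202": 0.6, "phone_203": 0.5}

# Made-up scores for the fifth voice command: the TV is playing the movie and,
# after echo cancellation, holds the cleanest recording of the user's voice.
fifth_round = {"speaker_201": 0.3, "tv_202": 0.8, "phone_203": 0.4}

print(select_optimal_device(fourth_round))
print(select_optimal_device(fifth_round))
```

Because the selection is repeated for every voice command, no device is bound permanently to the pickup role; only the answering role of the speaker 201 stays fixed in this example.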

It should be noted that the above embodiments shown in FIG. 3 and FIG. 6 are described by taking as an example the case in which the answering device selects the optimal sound-receiving device according to the audio quality information and responds to the second voice command according to the recording data of the optimal sound-receiving device. Other processing methods are also possible. For example, the answering device may respond to the second voice command directly according to the received recording data, or according to the received recording data together with its own recording data. In the latter case, one specific implementation is that the answering device splices the audio content information of the received recording data with the audio content information of its own recording data, and responds to the second voice command based on the spliced audio content information. For example, the user speaks the voice signal "play song 112222", but the answering device only recognizes the voice signal "2222", so the audio content information of the recording data of the answering device represents the voice signal "2222". The audio content information of the recording data that the answering device receives from another device represents the voice signal "play song 112". The answering device can splice the two to obtain spliced audio content information that represents the voice signal "play song 112222".
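One way to obtain "play song 112222" from the two fragments in this example is to merge them on their longest overlap, as in the sketch below. This is only an assumption: the embodiment does not specify how the splice is computed, and the sketch operates on recognized text rather than on audio. The function name `splice` is hypothetical.

```python
def splice(first: str, second: str) -> str:
    # Find the longest suffix of `first` that is also a prefix of `second`,
    # and join the two fragments so that the shared part appears only once.
    max_overlap = min(len(first), len(second))
    for k in range(max_overlap, 0, -1):
        if first.endswith(second[:k]):
            return first + second[k:]
    # No overlap: plain concatenation.
    return first + second


received = "play song 112"   # content recognized from another device's recording
own = "2222"                 # content recognized from the answering device's recording
spliced = splice(received, own)
print(spliced)
```

Here the fragments share a single "2", so the merged result is "play song 112222". With repeated digits the overlap is inherently ambiguous, so a practical implementation would presumably rely on additional information, such as timestamps, to align the fragments.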

FIG. 8 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present application. As shown in FIG. 8, the apparatus can be applied to an electronic device of a voice control system (such as the above-mentioned first electronic device 201), and the voice control system may further include at least a second electronic device (such as the second electronic device 202 or the third electronic device 203). The apparatus may include a transceiver module 81 and a processing module 82. For example, the transceiver module 81 may specifically be the mobile communication module 150 and/or the wireless communication module 160 in the embodiment shown in FIG. 2. The processing module 82 may be the processor 110 of the embodiment shown in FIG. 2.

The transceiver module 81 is configured to receive the first voice command input by the user, and the processing module 82 is configured to respond to the first voice command. The transceiver module 81 is further configured to receive the recording data of the second electronic device sent by the second electronic device, where the recording data of the second electronic device includes the recording data obtained when the second electronic device records the second voice command input by the user. The processing module 82 is further configured to respond to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device, where the recording data of the first electronic device includes the recording data obtained when the first electronic device records the second voice command input by the user.

In some embodiments, the transceiver module 81 is further configured to call a voice pickup instruction to the second electronic device, and the voice pickup instruction is used for the second electronic device to return the recording data of the second electronic device.

In some embodiments, the processing module 82 is further configured to record when or after the first electronic device receives the first voice instruction input by the user, and the recording is used to record the second voice instruction input by the user.

In some embodiments, the first voice command is used to wake up a voice control function of the first electronic device and/or the second electronic device.

In some embodiments, the processing module 82 is further configured to determine, according to the audio quality information of the first voice command received by the first electronic device and the audio quality information of the first voice command received by the second electronic device, that the first electronic device is the answering device of the voice control system.

In some embodiments, the processing module 82 is further configured to: after the first electronic device responds to the first voice command and before the second voice command input by the user is recorded, if the second voice command input by the user is not detected within a preset time period during recording by the first electronic device, delete the saved recording data and continue recording. The transceiver module 81 is further configured to send a multi-round dialogue pause instruction to the second electronic device, where the multi-round dialogue pause instruction is used to instruct the multi-round dialogue to temporarily stop.

In some embodiments, the transceiver module 81 is further configured to receive audio quality information of the recording data of the second electronic device sent by the second electronic device.

In some embodiments, the processing module 82 is configured to determine the optimal sound-receiving device from the voice control system according to the audio quality information of the recording data of the first electronic device and the audio quality information of the recording data of the second electronic device. When the optimal sound-receiving device is the first electronic device, the second voice command is responded to according to the recording data of the first electronic device. When the optimal sound-receiving device is the second electronic device, the second voice command is responded to according to the recording data of the second electronic device, or according to the recording data of the second electronic device and the recording data of the first electronic device. The audio quality information is used to indicate the audio quality of the recording data.

In some embodiments, the processing module 82 is configured to respond to the second voice instruction according to the audio content information of the audio recording data of the first electronic device and/or the audio content information of the audio recording data of the second electronic device. The audio content information is used to represent the audio content of the recording data.

The voice control apparatus in this embodiment of the present application can be used to execute the steps of the answering device (eg, speaker 201 ) in the above method embodiment, and its technical principle and technical effect can be found in the explanation of the above method embodiment, which will not be repeated here.

FIG. 9 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present application. As shown in FIG. 9, the apparatus can be applied to an electronic device of a voice control system (such as the second electronic device 202 or the third electronic device 203), and the voice control system may further include at least a first electronic device (such as the first electronic device 201). The apparatus may include a transceiver module 91 and a processing module 92. For example, the transceiver module 91 may specifically be the mobile communication module 150 and/or the wireless communication module 160 in the embodiment shown in FIG. 2. The processing module 92 may be the processor 110 of the embodiment shown in FIG. 2.

The processing module 92 is configured to record and to save the recording data, where the recording is used to record the second voice command input by the user. The transceiver module 91 is configured to send the recording data of the second electronic device to the first electronic device, where the recording data of the second electronic device includes the recording data obtained when the second electronic device records the second voice command input by the user, and the recording data is used by the first electronic device to respond to the second voice command after responding to the first voice command.

In a possible design, the transceiver module 91 is further configured to receive a voice pickup instruction called by the first electronic device, and the voice pickup instruction is used for the second electronic device to return the recording data of the second electronic device.

In a possible design, the processing module 92 is configured to record when or after the second electronic device receives the first voice instruction input by the user.

In a possible design, the processing module 92 is further configured to determine, according to the audio quality information of the first voice command received by the second electronic device and the audio quality information of the first voice command received by the first electronic device, that the first electronic device is the answering device of the voice control system.

In a possible design, the processing module 92 is further configured to: after the first electronic device responds to the first voice command, during recording by the second electronic device, receive through the transceiver module 91 the multi-round dialogue pause instruction sent by the first electronic device, where the multi-round dialogue pause instruction is used to instruct the multi-round dialogue to temporarily stop. The processing module 92 is further configured to delete the saved recording data and continue recording.

In a possible design, the transceiver module 91 is further configured to send the audio quality information of the recording data of the second electronic device to the first electronic device.

The voice control apparatus in this embodiment of the present application can be used to perform the steps of any non-answering device (such as the TV 202 or the mobile phone 203) in the above method embodiments. For its technical principle and technical effect, refer to the explanations of the above method embodiments, which will not be repeated here.

Other embodiments of the present application further provide an electronic device, which is used to execute the methods of the electronic device in the above method embodiments. As shown in FIG. 10, the electronic device may include a microphone 1001, one or more processors 1002 and one or more memories 1003, and these components may be connected through one or more communication buses 1005. The memory 1003 stores one or more computer programs 1004, and the one or more processors 1002 are used to execute the one or more computer programs 1004. The one or more computer programs 1004 include instructions, and the instructions can be used to execute each step performed by any electronic device in the above method embodiments. The electronic device may be any of the above-mentioned electronic devices, for example, a smart phone, a smart watch, and the like.

Of course, the electronic device shown in FIG. 10 may also include other devices such as a display screen, which is not limited in this embodiment of the present application. When it includes other devices, it may specifically be the electronic device shown in FIG. 2 .

The electronic device in this embodiment of the present application can be used to execute the steps of the electronic device in any of the above method embodiments, and the technical principles and technical effects of the electronic device can be referred to the explanations of the above method embodiments, which will not be repeated here.

Other embodiments of the present application further provide a computer storage medium, where the computer storage medium may include computer instructions, and when the computer instructions are run on an electronic device, the electronic device is caused to perform each step performed by the electronic device in the foregoing method embodiments.

Other embodiments of the present application further provide a computer program product, which, when run on a computer, causes the computer to perform each step performed by the electronic device in the foregoing method embodiments.

An embodiment of the present application further provides a voice control system. The voice control system may include at least a first electronic device and a second electronic device, where the first electronic device may adopt the structure of the embodiment shown in FIG. 8 or FIG. 10, and the second electronic device may adopt the structure of the embodiment shown in FIG. 9 or FIG. 10. Correspondingly, the system may implement the technical solutions of any of the above method embodiments; the implementation principles and technical effects are similar and will not be repeated here.

From the description of the above embodiments, those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division into functional modules is used as an example. In practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division into modules or units is only a logical function division, and other division methods are possible in actual implementation. For example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may be one physical unit or multiple physical units, that is, they may be located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

The processor mentioned in the above embodiments may be an integrated circuit chip that has signal processing capability. In the implementation process, each step of the above method embodiments may be completed by a hardware integrated logic circuit in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware encoding processor, or executed by a combination of hardware and software modules in the encoding processor. The software module may be located in a storage medium well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.

The memory mentioned in the above embodiments may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.

Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division into units is only a logical function division, and other division methods are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application should be covered within the protection scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims

A voice control method, characterized in that it is applied to a voice control system, the voice control system comprising at least a first electronic device and a second electronic device with a voice control function, the method comprising:

The first electronic device and the second electronic device respectively receive a first voice command input by a user, and the first electronic device responds to the first voice command;

The second electronic device records and saves the recording data, and the recording is used to record the second voice command input by the user;

sending, by the second electronic device, the recording data of the second electronic device to the first electronic device;

The first electronic device responds to the second voice instruction according to the recorded data of the first electronic device and/or the recorded data of the second electronic device;

The recording data of the first electronic device includes recording data of the second voice instruction input by the user recorded by the first electronic device.
The method according to claim 1, wherein the method further comprises:

The first electronic device invokes a voice pickup instruction to the second electronic device, and the voice pickup instruction is used by the second electronic device to return the recording data of the second electronic device.
The method according to claim 1 or 2, wherein the recording by the second electronic device comprises:

The second electronic device records audio when or after the second electronic device receives the first voice instruction input by the user.
The method according to claim 3, wherein the method further comprises:

When or after the first electronic device receives the first voice instruction input by the user, the first electronic device makes a recording, and the recording is used to record the second voice instruction input by the user.
The method according to any one of claims 1 to 4, wherein the first voice instruction is used to wake up a voice control function of the first electronic device and/or the second electronic device.
The method according to any one of claims 1 to 5, wherein the method further comprises:

The first electronic device and the second electronic device respectively determine that the first electronic device is an answering device of the voice control system according to the audio quality information of the first voice command received respectively.
The method according to any one of claims 1 to 6, wherein after the first electronic device responds to the first voice instruction, and before recording the second voice instruction input by the user, the method further comprises:

During the recording process of the first electronic device and the second electronic device, the first electronic device does not detect the second voice command input by the user within a preset time period, and the first electronic device deletes the saved recording data and continues recording; the first electronic device invokes a multi-round dialogue pause instruction to the second electronic device, and the multi-round dialogue pause instruction is used to instruct the multi-round dialogue to temporarily stop; the second electronic device deletes the saved recording data and continues recording.
The method according to any one of claims 1 to 7, wherein the method further comprises:

The first electronic device receives audio quality information of the recording data of the second electronic device sent by the second electronic device.
The method according to any one of claims 1 to 8, wherein the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device comprises:

The first electronic device determines an optimal sound-receiving device from the voice control system according to the audio quality information of the recording data of the first electronic device and the audio quality information of the recording data of the second electronic device;

When the optimal sound-receiving device is the first electronic device, the first electronic device responds to the second voice command according to the recording data of the first electronic device, or according to the recording data of the first electronic device and the recording data of the second electronic device;

When the optimal sound-receiving device is the second electronic device, the first electronic device responds to the second voice command according to the recording data of the second electronic device, or according to the recording data of the second electronic device and the recording data of the first electronic device;

Wherein, the audio quality information is used to indicate the audio quality of the recording data.
The method according to any one of claims 1 to 8, wherein the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device comprises:

The first electronic device responds to the second voice instruction according to the audio content information of the audio recording data of the first electronic device and/or the audio content information of the audio recording data of the second electronic device;

Wherein, the audio content information is used to represent the audio content of the recording data.
A voice control method, characterized in that it is applied to a first electronic device of a voice control system, the voice control system further comprising at least a second electronic device, and the method includes:

The first electronic device receives a first voice command input by a user, and the first electronic device responds to the first voice command;

The first electronic device receives the recording data of the second electronic device sent by the second electronic device, where the recording data of the second electronic device includes the recording data obtained when the second electronic device records the second voice instruction input by the user;

The first electronic device responds to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device, where the recording data of the first electronic device includes the recording data obtained when the first electronic device records the second voice instruction input by the user.
The method according to claim 11, wherein the method further comprises:

The first electronic device invokes a voice pickup instruction to the second electronic device, and the voice pickup instruction is used by the second electronic device to return the recording data of the second electronic device.
The method of claim 12, wherein the method further comprises:

When or after the first electronic device receives the first voice instruction input by the user, the first electronic device makes a recording, and the recording is used to record the second voice instruction input by the user.
The method according to any one of claims 11 to 13, wherein the first voice command is used to wake up a voice control function of the first electronic device and/or the second electronic device.
The method according to any one of claims 11 to 14, wherein the method further comprises:

The first electronic device determines the first electronic device according to the audio quality information of the first voice command received by the first electronic device and the audio quality information of the first voice command received by the second electronic device It is the answering device of the voice control system.
The method according to any one of claims 11 to 15, wherein after the first electronic device responds to the first voice instruction and before it records the second voice instruction input by the user, the method further comprises:

During the recording process of the first electronic device, if the first electronic device does not detect the second voice command input by the user within a preset time period, the first electronic device deletes the saved recording data and continues to record; the first electronic device invokes a multi-round dialogue pause instruction on the second electronic device, where the multi-round dialogue pause instruction is used to instruct that the multi-round dialogue be temporarily stopped.
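The timeout behaviour above can be sketched as a small state holder. Everything here is an assumption made for illustration: the class name `MultiRoundRecorder`, the string `"PAUSE_MULTI_ROUND_DIALOGUE"`, the `has_speech` flag (standing in for whatever detector the device uses), and the explicit `now` timestamps used instead of a real clock.

```python
from typing import List


class MultiRoundRecorder:
    """Hypothetical first-device recorder with a preset silence window."""

    def __init__(self, timeout_s: float, start: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.window_start = start
        self.buffer: List[bytes] = []
        self.detected = False
        self.sent_instructions: List[str] = []

    def on_audio(self, now: float, chunk: bytes, has_speech: bool) -> None:
        self.buffer.append(chunk)
        if has_speech:
            # A second voice command was detected within the window.
            self.detected = True
        elif not self.detected and now - self.window_start >= self.timeout_s:
            # Nothing detected in time: delete saved data, keep recording,
            # and tell the peer device to pause the multi-round dialogue.
            self.buffer.clear()
            self.sent_instructions.append("PAUSE_MULTI_ROUND_DIALOGUE")
            self.window_start = now


rec = MultiRoundRecorder(timeout_s=5.0)
rec.on_audio(1.0, b"silence-1", has_speech=False)
rec.on_audio(5.0, b"silence-2", has_speech=False)  # window expires here
rec.on_audio(6.0, b"play music", has_speech=True)  # recording continued
```

After the window expires the buffer is emptied but `on_audio` keeps accepting chunks, which mirrors "deletes the saved recording data and continues to record".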
The method according to any one of claims 11 to 16, wherein the method further comprises:

The first electronic device receives audio quality information of the recording data of the second electronic device sent by the second electronic device.
The method according to any one of claims 11 to 17, wherein the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device comprises:

The first electronic device determines an optimal sound-receiving device from the voice control system according to the audio quality information of the recording data of the first electronic device and the audio quality information of the recording data of the second electronic device;

When the optimal sound-receiving device is the first electronic device, the first electronic device responds to the second voice command according to the recording data of the first electronic device;

When the optimal sound-receiving device is the second electronic device, the first electronic device responds to the second voice command according to the recording data of the second electronic device, or according to the recording data of the second electronic device and the recording data of the first electronic device;

Wherein, the audio quality information is used to indicate the audio quality of the recording data.
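A minimal sketch of this selection step follows. It assumes the audio quality information is a single comparable score (higher is better) and that ties favour the first device; neither is stated in the claims. The names `Recording`, `select_recording_data`, and the `combine` flag are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Recording:
    device: str
    data: bytes
    quality: float  # hypothetical scalar quality score, higher is better


def select_recording_data(own: Recording, peer: Recording,
                          combine: bool = False) -> List[Recording]:
    """Return the recordings the first device would use to respond.

    own  -- recording data of the first electronic device
    peer -- recording data of the second electronic device
    """
    if own.quality >= peer.quality:
        # The first device is the optimal sound-receiving device.
        return [own]
    # The second device is optimal: use its data alone, or together
    # with the first device's data when combine is set.
    return [peer, own] if combine else [peer]


own = Recording("phone", b"tu.. on th. li..t", quality=0.4)
peer = Recording("speaker", b"turn on the light", quality=0.9)
chosen = select_recording_data(own, peer)
```

The `combine` flag corresponds to the alternative in the claim where the first device uses both recordings when the second device is the optimal sound-receiving device.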
The method according to any one of claims 11 to 17, wherein the first electronic device responding to the second voice command according to the recording data of the first electronic device and/or the recording data of the second electronic device comprises:

The first electronic device responds to the second voice instruction according to the audio content information of the audio recording data of the first electronic device and/or the audio content information of the audio recording data of the second electronic device;

Wherein, the audio content information is used to represent the audio content of the recording data.
A voice control method, characterized in that it is applied to a second electronic device of a voice control system, the voice control system further comprising at least a first electronic device, and the method includes:

The second electronic device records and saves the recording data, and the recording is used to record the second voice command input by the user;

The second electronic device sends the recording data of the second electronic device to the first electronic device, where the recording data of the second electronic device includes recording data obtained when the second electronic device records the second voice command input by the user, and the recording data is used by the first electronic device to respond to the second voice command after responding to the first voice command.
The method of claim 20, wherein the method further comprises:

The second electronic device receives a voice pickup instruction invoked by the first electronic device, where the voice pickup instruction instructs the second electronic device to return the recording data of the second electronic device.
The method according to claim 20 or 21, wherein the recording by the second electronic device comprises:

The second electronic device records audio when or after the second electronic device receives the first voice instruction input by the user.
The method according to any one of claims 20 to 22, wherein the method further comprises:

The second electronic device determines, according to the audio quality information of the first voice command received by the second electronic device and the audio quality information of the first voice command received by the first electronic device, that the first electronic device is the answering device of the voice control system.
The method according to any one of claims 20 to 23, wherein after the first electronic device responds to the first voice command, the method further comprises:

During the recording process of the second electronic device, the second electronic device receives a multi-round dialogue pause instruction invoked by the first electronic device, where the multi-round dialogue pause instruction is used to instruct that the multi-round dialogue be temporarily stopped; the second electronic device deletes the saved recording data and continues the recording.
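From the second device's side, the handling of the pause instruction might look like the sketch below. The class `SecondDeviceRecorder` and its method names are hypothetical, and the list of byte chunks stands in for whatever storage the device actually uses.

```python
from typing import List


class SecondDeviceRecorder:
    """Hypothetical second-device recorder reacting to a pause instruction."""

    def __init__(self) -> None:
        self.saved: List[bytes] = []
        self.recording = True

    def on_audio(self, chunk: bytes) -> None:
        # Save captured audio while recording is active.
        if self.recording:
            self.saved.append(chunk)

    def on_pause_instruction(self) -> None:
        # Delete the saved recording data but leave recording switched on,
        # so a later second voice command is still captured.
        self.saved.clear()


peer_rec = SecondDeviceRecorder()
peer_rec.on_audio(b"background noise")
peer_rec.on_pause_instruction()
peer_rec.on_audio(b"next command")
```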
The method according to any one of claims 20 to 24, wherein the method further comprises:

The second electronic device sends audio quality information of the recording data of the second electronic device to the first electronic device.
An electronic device, comprising: one or more processors and a memory;

The memory is coupled to the one or more processors and is used for storing computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to execute the voice control method according to any one of claims 11 to 19, or cause the electronic device to execute the voice control method according to any one of claims 20 to 25.
A computer storage medium, characterized by comprising computer instructions which, when executed on an electronic device, cause the electronic device to execute the voice control method according to any one of claims 11 to 19, or cause the electronic device to execute the voice control method according to any one of claims 20 to 25.
A computer program product, characterized in that, when the computer program product runs on a computer, the computer is caused to execute the voice control method according to any one of claims 11 to 19, or the computer is caused to execute the voice control method according to any one of claims 20 to 25.
A voice control system, characterized in that the voice control system includes at least a first electronic device and a second electronic device with a voice control function, and the voice control system is used to perform the voice control method according to any one of claims 1 to 10.