CN115457945A - Voice interaction method, server and storage medium - Google Patents
Voice interaction method, server and storage medium
- Publication number
- CN115457945A (application CN202211402578.0A)
- Authority
- CN
- China
- Prior art keywords
- rejection
- semantic
- voice
- service
- request
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
- G06F16/3343—Query execution using phonetics
- G06F40/30—Semantic analysis
- G10L15/063—Training
- G10L15/1822—Parsing for meaning understanding
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26—Speech to text systems
- G10L2015/0638—Interactive procedures
- G10L2015/223—Execution procedure of a spoken command
Abstract
The invention discloses a voice interaction method, a server and a storage medium. The voice interaction method comprises the following steps: receiving a voice request forwarded by a vehicle, and performing downstream logic processing on the voice request; sending an asynchronous request while the downstream logic processing is performed on the voice request, so that a first semantic rejection is performed on the voice request according to context characteristics to obtain a first semantic rejection result; after the downstream logic processing result is received, sending a synchronous request to perform a second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result, obtaining a second semantic rejection result; and transmitting the second semantic rejection result to the vehicle to complete the voice interaction. By performing the first semantic rejection on the voice request according to the context characteristics, and performing the second semantic rejection according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result, the invention completes the voice interaction while reducing the end-to-end latency of the voice interaction process.
Description
Technical Field
The present invention relates to the field of voice interaction technologies, and in particular, to a voice interaction method, a server, and a storage medium.
Background
Semantic rejection based on multi-modal features is developing rapidly in academia, but few such schemes exist in industry. At present, related voice interaction schemes that perform multiple feature extractions and model inference consume considerable end-to-end time, cannot meet the real-time requirement of voice interaction, and provide a poor user experience.
Disclosure of Invention
The embodiment of the invention provides a voice interaction method, a server and a storage medium.
The embodiment of the invention provides a voice interaction method. The voice interaction method comprises the following steps: receiving a voice request forwarded by a vehicle, and performing downstream logic processing on the voice request; sending an asynchronous request while performing the downstream logic processing on the voice request so as to perform first semantic rejection on the voice request according to the context characteristics to obtain a first semantic rejection result; after receiving a downstream logic processing result, sending a synchronous request to perform second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result; and transmitting the second semantic rejection result to the vehicle to complete voice interaction.
Therefore, the voice interaction method carries out the first semantic rejection on the voice request according to the context characteristics and carries out the second semantic rejection on the voice request according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result so as to complete the voice interaction, thereby greatly reducing the end-to-end time delay of the voice interaction process and meeting the real-time requirement of the voice interaction.
Receiving the voice request forwarded by the vehicle and performing the downstream logic processing on the voice request comprises the following steps: sending the voice request to a central control service, so that the central control service passes the voice request as input to the downstream services of the dialogue system, and the downstream services of the dialogue system carry out the downstream logic processing.
Thus, the present invention enables downstream logical processing of voice requests by sending the voice requests to a dialog system downstream service.
Sending the voice request to the central control service, so that the central control service passes the voice request as input to the downstream services of the dialogue system, which carry out the downstream logic processing, includes: sending the voice request to the central control service, so that the central control service passes the voice request as input to the natural language understanding, dialogue management and/or business robot services, whereby the downstream services of the dialogue system carry out the business logic processing corresponding to natural language understanding, dialogue management and the business robot.
Therefore, by passing the voice request as input to natural language understanding, dialogue management or the business robot, the invention realizes the business logic processing corresponding to natural language understanding, dialogue management or the business robot for the voice request.
After the step of receiving the voice request forwarded by the vehicle and performing downstream logic processing on the voice request, the voice interaction method comprises the following steps: and sending the voice recognition text characteristics of the voice request to a context service through a central control service so as to store the voice recognition text characteristics as the context characteristics into a data storage service.
Therefore, the speech interaction method can write the ASR characteristics into the context characteristics and store the context characteristics stored with the ASR characteristics into the data storage service, so that the context characteristics with the ASR characteristics can be directly obtained from the context service during subsequent asynchronous requests.
Before the step of sending an asynchronous request while performing the downstream logic processing on the voice request, so as to perform the first semantic rejection on the voice request according to the context characteristics and obtain the first semantic rejection result, the voice interaction method includes: sending the voice request to an acoustic rejection service for processing while the downstream logic processing is carried out, so as to obtain acoustic features; rejecting or accepting the voice request according to the acoustic features to obtain an acoustic rejection result; and sending the acoustic features and the acoustic rejection result to the context service, so that the acoustic features and the acoustic rejection result are stored as context features in the data storage service.
Therefore, the voice interaction method can write the acoustic features and the acoustic rejection results into the context features, and store the context features in which the acoustic features and the acoustic rejection results are stored in the data storage service, so that the context features with the acoustic features and the acoustic rejection results can be directly obtained from the context service during subsequent asynchronous requests.
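The acoustic-rejection branch described above can be sketched minimally as follows. This is an illustrative sketch only: the acoustic features (energy, SNR) and the threshold rule are placeholder assumptions, since the patent does not specify the actual features or decision logic.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """In-memory stand-in for the context service backed by the data storage service."""

    features: dict = field(default_factory=dict)

    def write(self, key: str, value) -> None:
        self.features[key] = value


def acoustic_rejection(audio_energy: float, snr_db: float) -> tuple[dict, bool]:
    # Extract (assumed) acoustic features and apply an (assumed) threshold rule:
    # very noisy audio is treated as invalid input.
    features = {"energy": audio_energy, "snr_db": snr_db}
    reject = snr_db < 5.0
    return features, reject


# Run the branch while downstream logic is processing, then persist both the
# features and the rejection verdict as context features.
store = ContextStore()
features, reject = acoustic_rejection(audio_energy=0.42, snr_db=18.0)
store.write("acoustic_features", features)
store.write("acoustic_reject", reject)
```

Persisting the verdict alongside the features is what lets the later asynchronous request read both back from the context service without recomputation.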
Sending an asynchronous request while performing the downstream logic processing on the voice request, so as to perform the first semantic rejection on the voice request according to the context characteristics and obtain the first semantic rejection result, includes: sending the asynchronous request, through the central control service, to the semantic rejection service as input; and obtaining, through the semantic rejection service, the speech recognition text features, the acoustic features and the acoustic rejection result, so as to perform the first semantic rejection on the voice request and obtain the first semantic rejection result.
Therefore, the invention sends the asynchronous request through the central control service to the semantic rejection service as input, and performs the first semantic rejection on the voice request through the semantic rejection service to obtain the first semantic rejection result; this hides the rejection latency from the backbone link and thus reduces the end-to-end latency of the voice interaction.
After the step of sending an asynchronous request while performing the downstream logic processing on the voice request to perform the first semantic rejection on the voice request according to the context feature to obtain a first semantic rejection result, the voice interaction method includes: and storing the first semantic rejection result into the data storage service.
Therefore, the first semantic rejection result can be read from the data storage service in time when the second semantic rejection is subsequently performed.
After receiving the downstream logic processing result, sending a synchronous request to perform the second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result, obtaining the second semantic rejection result, includes: after the downstream logic processing result is received, sending the synchronous request, through the central control service and based on the downstream logic processing result, to the semantic rejection service as input; obtaining the first semantic rejection result from the data storage service through the semantic rejection service; and performing the second semantic rejection according to the first semantic rejection result and the downstream logic processing result, fusing them to obtain the second semantic rejection result.
Therefore, when the second, synchronous request is made, the voice interaction method can directly read the rejection model result and apply rule prediction, which further reduces the end-to-end latency of the voice interaction process while ensuring the accuracy of the voice interaction.
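The fusion in the second semantic rejection can be sketched as a simple rule over its two inputs. The specific rules below are assumptions for illustration, not the patent's fusion logic: the cached first-pass verdict is trusted first, then the presence of a downstream intent decides acceptance.

```python
def fuse_rejection(first_result: dict, downstream_result: dict) -> dict:
    # Assumed fusion rules: trust the cached first-pass model verdict first,
    # then fall back to whether downstream processing produced an intent.
    if first_result.get("reject"):
        return {"reject": True, "reason": "first-pass model"}
    if downstream_result.get("intent") is None:
        return {"reject": True, "reason": "no downstream intent"}
    return {"reject": False, "reason": "accepted"}


accepted = fuse_rejection({"reject": False}, {"intent": "set_fan_speed"})
rejected = fuse_rejection({"reject": True}, {"intent": "set_fan_speed"})
```

Because the first result is only read from storage, the synchronous path does no model inference of its own, which is what keeps it fast.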
The invention provides a server. The server comprises a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the voice interaction method of any one of the above embodiments is realized.
Therefore, the server of the invention performs the first semantic rejection on the voice request according to the context characteristics, and performs the second semantic rejection according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result and complete the voice interaction, thereby greatly reducing the end-to-end latency of the voice interaction process.
The present invention also provides a non-transitory computer-readable storage medium containing the computer program. The computer program, when executed by one or more processors, implements the voice interaction method of any of the above embodiments.
Therefore, the storage medium of the invention carries out the first semantic rejection on the voice request according to the context characteristics and carries out the second semantic rejection on the voice request according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result so as to complete the voice interaction, thereby greatly reducing the end-to-end time delay of the voice interaction process.
Additional aspects and advantages of embodiments of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a diagram illustrating a workflow of a voice interaction scheme in the related art;
FIG. 2 is a flow chart of a voice interaction method of the present invention;
FIG. 3 is a schematic illustration of a workflow of the voice interaction method of the present invention;
FIG. 4 is a second flowchart of the voice interaction method of the present invention;
FIG. 5 is a third flowchart of the voice interaction method of the present invention;
FIG. 6 is a fourth flowchart illustrating a voice interaction method according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for the purpose of illustrating the embodiments of the present invention and are not to be construed as limiting the embodiments of the present invention.
The related scheme involving multiple feature extractions and model inference can be, for example, the voice interaction scheme shown in fig. 1. The voice interaction scheme in fig. 1 performs semantic rejection on a voice request through only one synchronous request, and the bold portion in fig. 1 marks the backbone-link latency of that semantic rejection scheme.
Specifically, in fig. 1, the request audio and the ASR features are first obtained through an Automatic Speech Recognition (ASR) service in the cloud; the ASR service then broadcasts the ASR feature result to the central control service, and the central control service writes the ASR features into the dialogue management context service. The dialogue management context service stores the contextual multi-modal features, returns the processing result to the central control service, and saves the ASR feature result in the data repository. After the downstream services complete the downstream logic processing, the downstream logic processing result is returned to the central control service; the central control service then makes a synchronous request, carrying the query field and the audio, to the semantic rejection service, and the semantic rejection service reads the contextual ASR features from the dialogue management context service, which returns the contextual ASR feature result. The semantic rejection service then passes the request audio to the acoustic rejection service, which returns the acoustic features and the acoustic rejection result according to the audio request. Finally, the acoustic features and the acoustic rejection result are stored in the dialogue management context service and in the data storage service (Redis), and the multi-modal rejection result is returned to the semantic rejection service.
In fig. 1, the latency of the speech processing path from the ASR recognition service through the other downstream services of the dialogue system exceeds 200 ms. This is too time-consuming to meet the real-time requirement of voice interaction, and the user experience is poor.
In view of the above, referring to fig. 2, the present invention provides a voice interaction method. The voice interaction method comprises the following steps:
01: receiving a voice request forwarded by a vehicle, and performing downstream logic processing on the voice request;
03: the method comprises the steps that when downstream logic processing is carried out on a voice request, an asynchronous request is sent, so that first semantic rejection is carried out on the voice request according to context characteristics, and a first semantic rejection result is obtained;
05: after receiving the downstream logic processing result, sending a synchronous request to perform second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result;
07: and transmitting the second semantic rejection result to the vehicle to complete voice interaction.
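The four numbered steps above can be sketched as a minimal server-side pipeline. This is an illustrative sketch only: the service implementations, feature names and rejection rules are placeholder assumptions, not the patent's actual logic; the point it demonstrates is that step 03 runs concurrently with the downstream processing of step 01.

```python
import asyncio


async def downstream_logic(request: str) -> dict:
    # Placeholder for NLU / dialogue-management / business-robot processing (step 01).
    await asyncio.sleep(0.05)
    return {"intent": "set_fan_speed", "confidence": 0.9}


async def first_semantic_rejection(request: str, context: dict) -> dict:
    # Step 03: runs while downstream logic is in flight, using only context
    # features (ASR text, acoustic features, acoustic rejection result).
    await asyncio.sleep(0.05)
    return {"reject": context.get("acoustic_reject", False)}


def second_semantic_rejection(first: dict, downstream: dict) -> dict:
    # Step 05: fuse the cached first result with the downstream logic result
    # (the confidence threshold is an assumed rule).
    reject = first["reject"] or downstream["confidence"] < 0.5
    return {"reject": reject, "intent": downstream["intent"]}


async def handle_voice_request(request: str, context: dict) -> dict:
    # Steps 01 and 03 are launched together; step 05 fuses their results.
    downstream_task = asyncio.create_task(downstream_logic(request))
    first_task = asyncio.create_task(first_semantic_rejection(request, context))
    downstream = await downstream_task
    first = await first_task
    return second_semantic_rejection(first, downstream)  # step 07: sent to the vehicle


result = asyncio.run(handle_voice_request("fan speed three", {"acoustic_reject": False}))
```

Running the two coroutines as tasks means the first rejection costs no extra wall-clock time so long as it finishes before the downstream result arrives.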
The invention also provides a server. The server comprises a processor and a memory, wherein the memory is stored with computer programs, and the processor is used for receiving the voice request forwarded by the vehicle and carrying out downstream logic processing on the voice request; the method comprises the steps that when downstream logic processing is carried out on a voice request, an asynchronous request is sent, so that first semantic rejection is carried out on the voice request according to context characteristics, and a first semantic rejection result is obtained; after receiving the downstream logic processing result, sending a synchronous request to perform second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result; and transmitting the second semantic rejection result to the vehicle to complete voice interaction.
That is, referring to fig. 3, a flow chart of the semantic refusal scheme for voice request according to the present invention is shown in fig. 3.
The voice interaction method first receives the voice request forwarded by the vehicle and performs the downstream logic processing on it. Referring to FIG. 3, in steps 1-7 of FIG. 3, the invention first obtains the request audio and the ASR features of the vehicle-forwarded voice request through the ASR recognition service, and then performs a series of downstream logic processing on the voice request. Downstream logic processing here means that the ASR recognition service requests the central control service, which sends the voice request to the other downstream services of the dialogue system for downstream logic processing. The bold portion in fig. 3 marks the backbone-link latency of the semantic rejection scheme of the invention.
An asynchronous request is sent while the downstream logic processing is performed on the voice request, so that the first semantic rejection is performed on the voice request according to the context features, obtaining the first semantic rejection result. Specifically, as shown in steps 8-11 in fig. 3, while the downstream logic processing is performed on the voice request, the central control service sends an asynchronous request to the semantic rejection service, which reads the contextual acoustic features, ASR features and acoustic rejection result from the dialogue management context service, so as to perform the first semantic rejection on the voice request according to the context features and obtain the first semantic rejection result. Here, rejection refers to judging whether the voice request is an out-of-domain utterance or an invalid input. Because the voice interaction method takes the context information into account, the features extracted for the voice request can be more accurate.
And after receiving the downstream logic processing result, sending a synchronous request to perform second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result. Specifically, as shown in step 14-15 in fig. 3, after receiving the downstream logic processing result of other downstream services in the dialog system, the central control service may send a synchronization request to the semantic rejection service, and the semantic rejection service performs a second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result.
Finally, the semantic rejection service reads the multi-modal rejection model result from the data storage service, directly applies rule prediction to the model result to obtain the second semantic rejection result, returns the second semantic rejection result to the central control service, and issues it to the vehicle to complete the voice interaction.
That is, referring to fig. 3, the invention hides the latency from the backbone link through the first, asynchronous request (step 8), and directly reads the model result and applies rule prediction at the second, synchronous request (step 15). Compared with steps 7 to 14 of fig. 1 in the related scheme, steps 1 to 18 of fig. 3 in the scheme of the invention reduce the end-to-end latency from about 200 ms to 6 to 10 ms.
Therefore, the voice interaction method performs the first semantic rejection on the voice request according to the context characteristics, and performs the second semantic rejection according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result and complete the voice interaction, thereby greatly reducing the end-to-end latency of the voice interaction process and meeting the real-time requirement of voice interaction.
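The latency benefit of hiding the first rejection behind downstream processing can be illustrated with a toy timing comparison; the 50 ms delays are arbitrary stand-ins for real service calls, not the patent's measured figures.

```python
import asyncio
import time


async def service_call(delay: float) -> None:
    # Stand-in for one network round trip (downstream logic or rejection model).
    await asyncio.sleep(delay)


async def sequential(a: float, b: float) -> float:
    # Related scheme: rejection waits for downstream processing to finish.
    start = time.perf_counter()
    await service_call(a)
    await service_call(b)
    return time.perf_counter() - start


async def overlapped(a: float, b: float) -> float:
    # Present scheme: the asynchronous first rejection overlaps downstream logic.
    start = time.perf_counter()
    await asyncio.gather(service_call(a), service_call(b))
    return time.perf_counter() - start


seq = asyncio.run(sequential(0.05, 0.05))    # roughly a + b
conc = asyncio.run(overlapped(0.05, 0.05))   # roughly max(a, b)
```

The overlapped variant's wall-clock time approaches the slower of the two calls rather than their sum, which is the effect the backbone-link figures describe.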
011: sending the voice request to the central control service, so that the central control service passes the voice request as input to the downstream services of the dialogue system, which carry out the downstream logic processing.
The processor is configured to send the voice request to the central control service, so that the central control service passes the voice request as input to the downstream services of the dialogue system, which carry out the downstream logic processing.
Specifically, as shown in step 1.2 and step 2.1 in fig. 3, the ASR recognition service sends the voice request to the central control service, so that the central control service passes the voice request as input to the downstream services of the dialogue system, which carry out the downstream logic processing.
Thus, the present invention enables downstream logical processing of voice requests by sending the voice requests to a dialog system downstream service.
More specifically, step 011 includes:
0111: sending the voice request to the central control service, so that the central control service passes the voice request as input to the natural language understanding, dialogue management and/or business robot services of the dialogue system, whereby the downstream services of the dialogue system carry out the business logic processing corresponding to natural language understanding, dialogue management and the business robot.
The processor is configured to send the voice request to the central control service, so that the central control service passes the voice request as input to the natural language understanding, dialogue management and/or business robot services, whereby the downstream services of the dialogue system carry out the business logic processing corresponding to natural language understanding, dialogue management and the business robot.
Specifically, the ways in which the downstream services of the dialogue system receive the voice request comprise one or more of three input modes: input to natural language understanding, input to dialogue management, and input to the business robot.
When the input mode includes input to natural language understanding, the business logic processing corresponding to natural language understanding of the voice request can be realized; when the input mode includes input to dialogue management, the business logic processing corresponding to dialogue management can be realized; and when the input mode includes input to the business robot, the business logic processing corresponding to the business robot can be realized.
Therefore, by passing the voice request as input to natural language understanding, dialogue management or the business robot, the invention realizes the business logic processing corresponding to natural language understanding, dialogue management or the business robot for the voice request.
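Routing a request into one or more of these entry modes amounts to a small dispatch table. The mode keys and handler names below are placeholders for the actual downstream services, which the patent does not name individually in code terms.

```python
def dispatch(request: dict, modes: set[str]) -> list[str]:
    # Map each requested entry mode to its downstream service (placeholder names).
    handlers = {
        "nlu": "natural-language-understanding service",
        "dm": "dialogue-management service",
        "bot": "business-robot service",
    }
    # A request may enter one or more modes; unknown modes are ignored.
    return [handlers[m] for m in sorted(modes) if m in handlers]


routed = dispatch({"text": "fan speed three"}, {"nlu", "dm"})
```

Because the modes are a set, a single voice request can fan out to any combination of the three services, matching the "one or more of three input modes" wording above.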
After step 01, the voice interaction method comprises the following steps:
02: and sending the voice recognition text features of the voice request to a context service through the central control service, so as to store the voice recognition text features as context features in a data storage service.
The processor is configured to send the voice recognition text features of the voice request to the context service through the central control service, so as to store the voice recognition text features as context features in the data storage service.
Specifically, as shown in steps 1.2, 2.1, 2.2, 3.1, and 3.2 in fig. 3, after receiving the ASR feature from the ASR recognition service, the central control service may send the ASR feature to the context service, which stores the ASR feature as a context feature in the data storage service.
In the feature processing stage (steps 1 to 7), take as an example that the user's previous voice request is "turn on the air conditioner" and the current voice request is "air volume level three". For the current voice request, the step of storing its ASR feature in the data storage service can be expressed in table form as shown in Table 1 below.
TABLE 1
Associated steps | Input | Output | Time consumed
---|---|---|---
1.2, 2.1, 2.2, 3.1, 3.2 | User voice request "air volume level three" | The ASR feature of "air volume level three" is stored in the data storage service: {"asrAlignment":[{"conf":1.0,"end":820,"pinyin":"feng","start":640,"word":"wind"},{"conf":1.0,"end":940,"pinyin":"liang","start":820,"word":"volume"},{"conf":1.0,"end":1120,"pinyin":"san","start":940,"word":"three"},{"conf":1.0,"end":1340,"pinyin":"dang","start":1120,"word":"gear"}],"eof":true} | Asynchronous, 50 ms
Therefore, the voice interaction method can write the ASR feature into the context features and store the context features containing the ASR feature in the data storage service, so that the context features with the ASR feature can be obtained directly from the context service in subsequent asynchronous requests.
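A minimal sketch of this write path, with an in-memory dict standing in for the data storage service (the key name and helper function are illustrative assumptions):

```python
# Illustrative sketch: append the ASR feature of the current voice request to
# the record's context features; a dict stands in for the data storage service.
import json

data_store: dict = {}

def store_asr_feature(record_id: str, asr_feature: dict) -> None:
    """Write the ASR feature into the context features for this record."""
    context = json.loads(data_store.get(record_id, '{"data": []}'))
    context["data"].append(asr_feature)
    data_store[record_id] = json.dumps(context)

store_asr_feature("rcidxxxxxx", {
    "asrAlignment": [
        {"conf": 1.0, "start": 640, "end": 820, "pinyin": "feng", "word": "wind"},
        {"conf": 1.0, "start": 820, "end": 940, "pinyin": "liang", "word": "volume"},
    ],
    "eof": True,
})
```

A subsequent asynchronous request can then read the context features back with a single lookup on the same record id.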
Referring to fig. 4, before step 03, the voice interaction method includes:
021: while downstream logic processing is performed on the voice request, the voice request is sent to an acoustic rejection service for processing, so as to obtain acoustic features;
022: rejecting the voice request according to the acoustic characteristics to obtain an acoustic rejection result;
023: and sending the acoustic features and the acoustic rejection results to a context service so as to store the acoustic features and the acoustic rejection results as context features into a data storage service.
The processor is used for sending the voice request to the acoustic rejection service for processing while performing downstream logic processing on the voice request so as to obtain acoustic features; rejecting the voice request according to the acoustic characteristics to obtain an acoustic rejection result; and sending the acoustic characteristics and the acoustic rejection results to a context service so as to store the acoustic characteristics and the acoustic rejection results as context characteristics in a data storage service.
Specifically, as shown in steps 1.1, 4, 5, 6, and 7 in fig. 3, while performing downstream logic processing on the voice request, the ASR recognition service sends the voice request to the acoustic rejection service, and the acoustic rejection service performs processing to obtain acoustic features, and then determines whether the voice request is rejected according to the acoustic features, that is, rejects the voice request according to the acoustic features to obtain an acoustic rejection result. Wherein the acoustic features refer to pure audio features.
The acoustic rejection result takes one of three values: the voice request is not an invalid input, the voice request is an invalid input, or it is uncertain whether the voice request is an invalid input. These can be represented by 0, 1 and 2 respectively: 0 means pass (the voice request is not an invalid input), 1 means reject (the voice request is an invalid input), and 2 means it is uncertain whether the voice request is an invalid input.
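The three-valued encoding can be sketched as follows (the enum and helper names are illustrative, not from the patent):

```python
# Sketch of the acoustic rejection result encoding described above:
# 0 = pass (not an invalid input), 1 = reject (invalid input), 2 = uncertain.
from enum import IntEnum

class AcousticRejection(IntEnum):
    PASS = 0       # the voice request is not an invalid input
    REJECT = 1     # the voice request is an invalid input
    UNCERTAIN = 2  # cannot tell from the pure audio features alone

def describe(code: int) -> str:
    """Map a stored rejection code back to its meaning."""
    return {0: "pass", 1: "reject", 2: "uncertain"}[code]
```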
In the feature processing stage (steps 1 to 7), again taking the previous voice request "turn on the air conditioner" and the current voice request "air volume level three" as an example, the step of storing the acoustic feature and the acoustic rejection result of the current voice request in the data storage service can be expressed in table form as shown in Table 2 below.
TABLE 2
Associated steps | Input | Output | Time consumed
---|---|---|---
1.1, 4, 5, 6, 7 | User audio "air volume level three" | Acoustic rejection result acousticRejection=0 is written to the context service (0: pass; 1: reject; 2: uncertain); the acoustic MFCC features (an N×M feature matrix) are written to the context service and, together with the ASR features, form the context features: {"data":[{"MFCCFeat":[],"acousticRejection":"0","asrAlignment":[{"conf":1.0,"end":1120,"pinyin":"da kai","start":660,"word":"open"},{"conf":0.795,"end":1820,"pinyin":"kongtiao","start":1120,"word":"air conditioner"}],"eof":true},{"MFCCFeat":[],"acousticRejection":"0","asrAlignment":[{"conf":1.0,"end":820,"pinyin":"feng","start":640,"word":"wind"},{"conf":1.0,"end":940,"pinyin":"liang","start":820,"word":"volume"},{"conf":1.0,"end":1120,"pinyin":"san","start":940,"word":"three"},{"conf":1.0,"end":1340,"pinyin":"dang","start":1120,"word":"gear"}],"eof":true}]} | Asynchronous, 100 ms
The acoustic rejection service may then store the acoustic features and the acoustic rejection result as context features in the data storage service.
Therefore, the voice interaction method can write the acoustic features and the acoustic rejection result into the context features and store them in the data storage service, so that the context features with the acoustic features and the acoustic rejection result can be obtained directly from the context service in subsequent asynchronous requests.
The step of performing downstream logic processing on the voice request to obtain the downstream logic processing result in fig. 3 can be expressed in table form as shown in Table 3 below.
TABLE 3
Associated steps | Input | Output | Time consumed
---|---|---|---
2.1, 8 | First asynchronous request body: {"data":{"async":1,"hardwareid":"xxxxxx","msgId":"xxxxxx","query":"air volume level three","recordId":"rcidxxxxxx","sceneIds":xxxxxx},"status":"xxxxxx"} | After the downstream logic processing result is obtained, the second synchronous request body is formed: {"data":{"async":1,"hardwareid":"xxxxxx","msgId":"xxxxxx","query":"air volume level three","recordId":"rcidxxxxxx","sceneIds":xxxxxx},"status":"xxxxxx","domains":[{"domainConfidence":0.91380006,"domainName":"ac","intents":[{"intentConfidence":1.0,"intentName":"ac wind set","slotConfidence":0.0,"slots":[{"name":"number","pos":[2,2],"rawvalue":"three","valueType":"string"}]}]}]} | Synchronous, 200-500 ms
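Sketching the flow in Table 3 (field names follow the tables; the helper itself is an assumption): the first asynchronous request body is extended with the NLU "domains" payload to form the second synchronous request body.

```python
# Illustrative sketch: extend the first asynchronous request body with the
# downstream logic processing result (the "domains" payload) to form the
# second synchronous request body, as in Table 3.
import copy

def build_second_request(first_body: dict, domains: list) -> dict:
    second = copy.deepcopy(first_body)  # keep the original body untouched
    second["domains"] = domains
    return second

first_body = {
    "data": {"async": 1, "query": "air volume level three",
             "recordId": "rcidxxxxxx"},
    "status": "xxxxxx",
}
second_body = build_second_request(first_body, [{
    "domainConfidence": 0.91380006, "domainName": "ac",
    "intents": [{"intentConfidence": 1.0, "intentName": "ac wind set"}],
}])
```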
Further, referring to fig. 5, step 03 includes:
031: sending an asynchronous request to a semantic rejection service through the central control service as input;
032: and obtaining, through the semantic rejection service, the voice recognition text features, the acoustic features and the acoustic rejection result, so as to perform the first semantic rejection on the voice request and obtain a first semantic rejection result.
The processor is configured to send the asynchronous request to the semantic rejection service through the central control service as input; and to obtain, through the semantic rejection service, the voice recognition text features, the acoustic features and the acoustic rejection result, so as to perform the first semantic rejection on the voice request and obtain the first semantic rejection result.
Specifically, as shown in step 8 and step 9 of fig. 3, the central control service sends an asynchronous request to the semantic rejection service as input. Then, as shown in step 10 and step 11 of fig. 3, the semantic rejection service obtains the voice recognition text features, the acoustic features and the acoustic rejection result from the dialogue management context service, and performs the first semantic rejection on the voice request to obtain a first semantic rejection result.
In the first, asynchronous request stage from the central control service to the semantic rejection service (steps 8 to 13), again taking the previous voice request "turn on the air conditioner" and the current voice request "air volume level three" as an example: the central control service sends the asynchronous request to the semantic rejection service as input, and the semantic rejection service obtains the voice recognition text features, the acoustic features and the acoustic rejection result from the dialogue management context service to perform the first semantic rejection on the voice request. These steps can be expressed in table form as shown in Table 4 below.
TABLE 4
Associated steps | Input | Output | Time consumed
---|---|---|---
8, 9, 10, 11 | Asynchronous request body: {"data":{"async":1,"hardwareid":"xxxxxx","msgId":"xxxxxx","query":"air volume level three","recordId":"rcidxxxxxx","sceneIds":xxxxxx},"status":"xxxxxx"} | Context features: {"data":[{"MFCCFeat":[],"acousticRejection":"0","asrAlignment":[{"conf":1.0,"end":1120,"pinyin":"da kai","start":660,"word":"open"},{"conf":0.795,"end":1820,"pinyin":"kongtiao","start":1120,"word":"air conditioner"}],"eof":true},{"MFCCFeat":[],"acousticRejection":"0","asrAlignment":[{"conf":1.0,"end":820,"pinyin":"feng","start":640,"word":"wind"},{"conf":1.0,"end":940,"pinyin":"liang","start":820,"word":"volume"},{"conf":1.0,"end":1120,"pinyin":"san","start":940,"word":"three"},{"conf":1.0,"end":1340,"pinyin":"dang","start":1120,"word":"gear"}],"eof":true}]} | Asynchronous, 20 ms
The first semantic rejection may be performed on the voice request through a pre-trained acoustic rejection model; accordingly, the first semantic rejection result may also be referred to as the multi-modal rejection model result.
Therefore, the invention sends the asynchronous request to the semantic rejection service as input through the central control service and performs the first semantic rejection on the voice request through the semantic rejection service to obtain the first semantic rejection result; the associated delay is hidden in the backbone link, which reduces the end-to-end delay of the voice interaction scheme.
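The latency-hiding idea can be sketched with a background thread (the timings and names are illustrative only):

```python
# Sketch of hiding the first semantic rejection's latency in the backbone
# link: it runs in the background while downstream processing continues,
# so its cost does not add to the end-to-end delay.
import threading
import time

model_results: dict = {}

def first_semantic_rejection(record_id: str) -> None:
    time.sleep(0.03)  # stands in for ~30 ms of model inference
    model_results[record_id] = {"noise": 0.0, "clear": 1.0}

worker = threading.Thread(target=first_semantic_rejection, args=("rcid1",))
worker.start()    # central control fires the asynchronous request ...
time.sleep(0.05)  # ... while the backbone link keeps processing
worker.join()     # by now the model result is already available
```

Because the downstream (backbone) processing takes longer than the rejection model, the model result is ready before the second, synchronous request needs it.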
After step 03, the voice interaction method includes:
04: and storing the first semantic rejection result into a data storage service.
The processor is configured to store the first semantic rejection result in the data storage service.
Specifically, as shown in steps 12 and 13 in fig. 3, after the semantic rejection service obtains the first semantic rejection result, the invention may store the first semantic rejection result in the data storage service.
In the stage of storing the first semantic rejection result in the data storage service (steps 12 and 13), again taking the previous voice request "turn on the air conditioner" and the current voice request "air volume level three" as an example, the step of storing the first semantic rejection result in the data storage service is shown in Table 5.
TABLE 5
Associated steps | Input | Output | Time consumed
---|---|---|---
12, 13 | Context features: {"data":[{"MFCCFeat":[],"acousticRejection":"0","asrAlignment":[{"conf":1.0,"end":1120,"pinyin":"da kai","start":660,"word":"open"},{"conf":0.795,"end":1820,"pinyin":"kongtiao","start":1120,"word":"air conditioner"}],"eof":true},{"MFCCFeat":[],"acousticRejection":"0","asrAlignment":[{"conf":1.0,"end":820,"pinyin":"feng","start":640,"word":"wind"},{"conf":1.0,"end":940,"pinyin":"liang","start":820,"word":"volume"},{"conf":1.0,"end":1120,"pinyin":"san","start":940,"word":"three"},{"conf":1.0,"end":1340,"pinyin":"dang","start":1120,"word":"gear"}],"eof":true}]} and the asynchronous request body: {"data":{"async":1,"hardwareid":"xxxxxx","msgId":"xxxxxx","query":"air volume level three","recordId":"rcidxxxxxx","sceneIds":xxxxxx},"status":"xxxxxx"} | Model result: {"query":"air volume level three","modelConfidence":{"noise":0.0,"clear":1.0,"taskLabel":"xx","detail":{"taskLabel":0.9,"taskLabel2":0.8}},"code":10000,"msg":"ok"} | Asynchronous, 30 ms
Therefore, the first semantic rejection result can be read from the data storage service in time when the second semantic rejection is performed subsequently.
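A minimal sketch of this store/read-back step, keyed by the request's record id (a dict stands in for the REDIS data storage service; helper names are illustrative):

```python
# Illustrative sketch: the first semantic rejection (multi-modal model) result
# is stored under the request's record id, so the later, synchronous second
# semantic rejection can read it back with a single key lookup.
from typing import Optional

store: dict = {}

def save_model_result(record_id: str, model_result: dict) -> None:
    store[record_id] = model_result

def load_model_result(record_id: str) -> Optional[dict]:
    return store.get(record_id)

save_model_result("rcidxxxxxx", {
    "query": "air volume level three",
    "modelConfidence": {"noise": 0.0, "clear": 1.0},
    "code": 10000, "msg": "ok",
})
```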
Referring to fig. 6, step 05 includes:
051: after receiving the downstream logic processing result, sending a synchronous request to the semantic rejection service as input through the central control service based on the downstream logic processing result;
052: acquiring the first semantic rejection result from the data storage service through the semantic rejection service;
053: and performing the second semantic rejection according to the first semantic rejection result and the downstream logic processing result, and fusing them to obtain a second semantic rejection result.
The processor is configured to, after receiving the downstream logic processing result, send the synchronous request to the semantic rejection service as input through the central control service based on the downstream logic processing result; to acquire the first semantic rejection result from the data storage service through the semantic rejection service; and to perform the second semantic rejection according to the first semantic rejection result and the downstream logic processing result, fusing them to obtain the second semantic rejection result.
Referring to fig. 3, as shown in steps 14 to 18, after receiving the downstream logic processing result, the central control service may send a synchronous request to the semantic rejection service as input based on the downstream logic processing result; the semantic rejection service obtains the first semantic rejection result stored in the data storage service, performs the second semantic rejection according to the first semantic rejection result and the downstream logic processing result, and fuses them to obtain the second semantic rejection result.
That is, the central control service receives the downstream logic processing results returned by the other downstream services of the dialogue system (NLU/DM/BOT) and sends a synchronous request carrying this business logic to the semantic rejection service; the semantic rejection service reads the multi-modal rejection model result from the data storage service (REDIS), performs rule inference in combination with the business logic, and then fuses the two to obtain the second semantic rejection result, which is returned to the central control service.
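The fusion step might look like the following sketch; the rule (trusting a high-confidence NLU domain) and the thresholds are illustrative assumptions, not the patent's actual inference rules.

```python
# Hedged sketch: fuse the stored multi-modal model result with rule inference
# over the downstream business logic to form the second semantic rejection
# result. Thresholds and the rule itself are assumptions for illustration.
def second_semantic_rejection(model_result: dict, downstream: dict) -> dict:
    clear = model_result["modelConfidence"]["clear"]
    domains = downstream.get("domains", [])
    # Rule inference: a confidently recognised domain supports acceptance.
    rule_confidence = 1.0 if domains and domains[0]["domainConfidence"] > 0.9 else 0.0
    return {
        "query": model_result["query"],
        "queryState": "clear" if clear >= 0.5 and rule_confidence > 0.0 else "noise",
        "modelConfidence": model_result["modelConfidence"],
        "ruleConfidence": rule_confidence,
        "code": 10000, "msg": "ok",
    }

fused = second_semantic_rejection(
    {"query": "air volume level three",
     "modelConfidence": {"noise": 0.0, "clear": 1.0}},
    {"domains": [{"domainConfidence": 0.91380006, "domainName": "ac"}]},
)
```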
In the second, synchronous request stage from the central control service to the semantic rejection service (steps 14 to 18), again taking the previous voice request "turn on the air conditioner" and the current voice request "air volume level three" as an example: the step in which the central control service receives the downstream logic processing result returned by the other downstream services of the dialogue system (NLU/DM/BOT) and sends the second synchronous request with this business logic to the semantic rejection service can be expressed in table form as shown in Table 6; the step in which the semantic rejection service reads the multi-modal rejection model result from the data storage service (REDIS) can be expressed in table form as shown in Table 7; and the steps of performing rule inference in combination with the business logic, fusing to obtain the second semantic rejection result and returning it to the central control service can be expressed in table form as shown in Table 8.
TABLE 6
Associated steps | Input | Output
---|---|---
15 | Second synchronous request body: {"data":{"async":1,"hardwareid":"xxxxxx","msgId":"xxxxxx","query":"air volume level three","recordId":"rcidxxxxxx","sceneIds":xxxxxx},"status":"xxxxxx","domains":[{"domainConfidence":0.91380006,"domainName":"ac","intents":[{"intentConfidence":1.0,"intentName":"ac wind set","slotConfidence":0.0,"slots":[{"name":"number","pos":[2,2],"rawvalue":"three","valueType":"string"}]}]}]} | None
TABLE 7
Associated steps | Input | Output | Time consumed
---|---|---|---
16, 17 | Key: {"recordId":"rcidxxxxxx"} | Model result in the data storage service: {"query":"air volume level three","modelConfidence":{"noise":0.0,"clear":1.0,"taskLabel":"xx","detail":{"taskLabel":0.9,"taskLabel2":0.8}},"code":10000,"msg":"ok"} | Synchronous, 1-5 ms
TABLE 8
Associated steps | Input | Output | Time consumed
---|---|---|---
18 | Second synchronous request body: {"data":{"async":1,"hardwareid":"xxxxxx","msgId":"xxxxxx","query":"air volume level three","recordId":"rcidxxxxxx","sceneIds":xxxxxx},"status":"xxxxxx","domains":[{"domainConfidence":0.91380006,"domainName":"ac","intents":[{"intentConfidence":1.0,"intentName":"ac wind set","slotConfidence":0.0,"slots":[{"name":"number","pos":[2,2],"rawvalue":"three","valueType":"string"}]}]}]} and the model result in the data storage service: {"query":"air volume level three","modelConfidence":{"noise":0.0,"clear":1.0,"taskLabel":"xx","detail":{"taskLabel":0.9,"taskLabel2":0.8}},"code":10000,"msg":"ok"} | Final output: {"query":"air volume level three","queryRewrite":"air volume level three","queryState":"clear","modelConfidence":{"noise":0.0,"clear":1.0,"taskLabel":"xx","detail":{"taskLabel":0.9,"taskLabel2":0.8}},"ruleConfidence":1.0,"filterReason":"number set","code":10000,"msg":"ok"} | Synchronous, 5 ms
Therefore, when the second synchronous request is made, the voice interaction method of the invention can further reduce the end-to-end delay of the voice interaction process by directly reading the rejection model result and performing rule inference, while ensuring the accuracy of voice interaction.
In summary, the voice interaction method of the invention hides delay in the backbone link through the first, asynchronous request (step 8), and directly reads the model result and performs rule inference during the second, synchronous request (step 15). Comparing steps 1 to 18 in fig. 3 of this method with steps 7 to 14 in fig. 1 of the related art, the method reduces the added end-to-end delay from 200 ms to 6-10 ms: the voice interaction scheme of fig. 1 adds 200 ms end to end, while the scheme of the invention adds 6-10 ms. The detailed time-consumption comparison is shown in Table 9 and Table 10.
TABLE 9
Voice interaction scheme of the related art | Time-consuming procedure | Time consumed
---|---|---
Synchronous scheme | Steps 7-14 of FIG. 1 | Synchronous, 200 ms, added to the backbone link
End-to-end time increase | | 200 ms
TABLE 10
Voice interaction scheme of the invention | Time-consuming procedure | Time consumed
---|---|---
Asynchronous scheme | Steps 1.2, 2.1, 2.2, 3.1, 3.2 | Asynchronous, 50 ms, hidden in the backbone link
| Steps 1.1, 4, 5, 6, 7 | Asynchronous, 100 ms, hidden in the backbone link
| Steps 8, 9, 10, 11 | Asynchronous, 20 ms, hidden in the backbone link
| Steps 2.1, 8 | Backbone link, 200-500 ms
| Steps 12, 13 | Asynchronous, 30 ms, hidden in the backbone link
| Steps 16, 17 | Synchronous, 1-5 ms, added to the backbone link
| Step 18 | Synchronous, 5 ms, added to the backbone link
End-to-end time increase | | 6-10 ms
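The end-to-end figures in Tables 9 and 10 can be checked with simple arithmetic: only the synchronous steps add latency, since the asynchronous stages are hidden in the backbone link.

```python
# Back-of-envelope check of Table 10: asynchronous stages hidden in the
# backbone link add nothing end to end; only the synchronous model-result
# read (steps 16-17, 1-5 ms) and the fusion step (step 18, 5 ms) are added.
hidden_async_ms = [50, 100, 20, 30]   # async stages hidden in the backbone link
added_sync_ms = [(1, 5), (5, 5)]      # (min, max) of the added synchronous steps

total_hidden = sum(hidden_async_ms)               # async work that costs nothing end to end
added_min = sum(lo for lo, _ in added_sync_ms)    # 1 + 5 = 6 ms
added_max = sum(hi for _, hi in added_sync_ms)    # 5 + 5 = 10 ms
# Matches the patent's stated 6-10 ms increase, versus the 200 ms added by
# the fully synchronous related-art scheme in Table 9.
```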
The present invention also provides a non-transitory computer-readable storage medium containing the computer program. The computer program, when executed by one or more processors, implements the voice interaction method described in any of the embodiments above.
For example, the computer program when executed by a processor implements the steps of the following voice interaction method:
01: receiving a voice request forwarded by a vehicle, and performing downstream logic processing on the voice request;
02: while downstream logic processing is performed on the voice request, an asynchronous request is sent, so that a first semantic rejection is performed on the voice request according to context features to obtain a first semantic rejection result;
03: after receiving the downstream logic processing result, sending a synchronous request to perform second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result;
04: and transmitting the second semantic rejection result to the vehicle to complete voice interaction.
It will be appreciated that the computer program comprises computer program code. The computer program code may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), and a software distribution medium.
The storage medium of the invention performs the first semantic rejection on the voice request according to the context features, and performs the second semantic rejection according to the first semantic rejection result and the downstream logic processing result to obtain the second semantic rejection result and complete the voice interaction, thereby greatly reducing the end-to-end delay of the voice interaction process.
Claims (10)
1. A method of voice interaction, comprising:
receiving a voice request forwarded by a vehicle, and performing downstream logic processing on the voice request;
sending an asynchronous request while performing the downstream logic processing on the voice request so as to perform first semantic rejection on the voice request according to the context characteristics to obtain a first semantic rejection result;
after receiving a downstream logic processing result, sending a synchronous request to perform second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result;
and transmitting the second semantic rejection result to the vehicle to complete voice interaction.
2. The method of claim 1, wherein the receiving a voice request forwarded by a vehicle, and performing downstream logic processing on the voice request comprises:
and sending the voice request to a central control service, so that the central control service sends the voice request to a downstream service of the dialogue system as input for processing, the downstream logic processing being implemented by the downstream service of the dialogue system.
3. The voice interaction method of claim 2, wherein the sending the voice request to a central control service, so that the central control service sends the voice request to a downstream service of the dialogue system as input for processing, the downstream logic processing being implemented by the downstream service of the dialogue system, comprises:
and sending the voice request to the central control service, so that the central control service sends the voice request to the downstream service of the dialogue system as input for natural language understanding, dialogue management and/or the business robot, thereby implementing, by the downstream service of the dialogue system, the business logic processing corresponding to the natural language understanding, the dialogue management and the business robot.
4. The voice interaction method of claim 1, wherein after the step of receiving a vehicle-forwarded voice request and performing downstream logic processing on the voice request, the voice interaction method comprises:
and sending the voice recognition text characteristics of the voice request to a context service through a central control service so as to store the voice recognition text characteristics as the context characteristics into a data storage service.
5. The voice interaction method according to claim 4, wherein before the step of sending an asynchronous request to perform the first semantic rejection on the voice request according to the context feature while performing the downstream logic processing on the voice request, and obtaining a first semantic rejection result, the voice interaction method comprises:
the voice request is sent to an acoustic rejection service for processing while downstream logic processing is carried out on the voice request, so that acoustic features are obtained;
rejecting the voice request according to the acoustic features to obtain an acoustic rejection result;
and sending the acoustic feature and the acoustic rejection result to the context service so as to store the acoustic feature and the acoustic rejection result as context features into the data storage service.
6. The method according to claim 5, wherein the sending an asynchronous request while performing the downstream logic processing on the voice request to perform a first semantic rejection on the voice request according to a context feature to obtain a first semantic rejection result includes:
sending the asynchronous request to a semantic rejection service through the central control service as input;
and obtaining the voice recognition text characteristics, the acoustic characteristics and the acoustic rejection result through the semantic rejection service to perform first semantic rejection on the voice request to obtain a first semantic rejection result.
7. The voice interaction method according to claim 6, wherein after the step of sending an asynchronous request to perform the first semantic rejection on the voice request according to the context feature while performing the downstream logic processing on the voice request, so as to obtain a first semantic rejection result, the voice interaction method comprises:
and storing the first semantic rejection result into the data storage service.
8. The voice interaction method according to claim 7, wherein the sending a synchronization request after receiving the downstream logic processing result to perform a second semantic rejection on the voice request according to the downstream logic processing result and the first semantic rejection result to obtain a second semantic rejection result includes:
after receiving the downstream logic processing result, sending the synchronization request to a semantic rejection service through the central control service as input based on the downstream logic processing result;
obtaining the first semantic rejection result of the data storage service through the semantic rejection service;
and performing second semantic rejection according to the first semantic rejection result and the downstream logic processing result, and fusing to obtain a second semantic rejection result.
9. A server, characterized in that the server comprises a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, carries out the method of voice interaction according to any one of claims 1-8.
10. A non-transitory computer-readable storage medium embodying a computer program, wherein the computer program, when executed by one or more processors, implements the voice interaction method of any of claims 1-8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211402578.0A CN115457945B (en) | 2022-11-10 | 2022-11-10 | Voice interaction method, server and storage medium |
PCT/CN2023/130564 WO2024099375A1 (en) | 2022-11-10 | 2023-11-08 | Voice interaction method, and server and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211402578.0A CN115457945B (en) | 2022-11-10 | 2022-11-10 | Voice interaction method, server and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115457945A true CN115457945A (en) | 2022-12-09 |
CN115457945B CN115457945B (en) | 2023-03-31 |
Family
ID=84295593
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211402578.0A Active CN115457945B (en) | 2022-11-10 | 2022-11-10 | Voice interaction method, server and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115457945B (en) |
WO (1) | WO2024099375A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024099375A1 (en) * | 2022-11-10 | 2024-05-16 | 广州小鹏汽车科技有限公司 | Voice interaction method, and server and storage medium |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020198722A1 (en) * | 1999-12-07 | 2002-12-26 | Comverse Network Systems, Inc. | Language-oriented user interfaces for voice activated services |
CN105122354A (en) * | 2012-12-12 | 2015-12-02 | 亚马逊技术有限公司 | Speech model retrieval in distributed speech recognition systems |
CN110136713A (en) * | 2019-05-14 | 2019-08-16 | 苏州思必驰信息科技有限公司 | Dialogue method and system of the user in multi-modal interaction |
CN111583907A (en) * | 2020-04-15 | 2020-08-25 | 北京小米松果电子有限公司 | Information processing method, device and storage medium |
CN111916082A (en) * | 2020-08-14 | 2020-11-10 | 腾讯科技(深圳)有限公司 | Voice interaction method and device, computer equipment and storage medium |
CN112101045A (en) * | 2020-11-02 | 2020-12-18 | 北京淇瑀信息科技有限公司 | Multi-mode semantic integrity recognition method and device and electronic equipment |
CN113488047A (en) * | 2021-07-06 | 2021-10-08 | 思必驰科技股份有限公司 | Man-machine conversation interruption method, electronic device and computer readable storage medium |
CN113990300A (en) * | 2021-12-27 | 2022-01-28 | 广州小鹏汽车科技有限公司 | Voice interaction method, vehicle, server and computer-readable storage medium |
CN115146653A (en) * | 2022-07-21 | 2022-10-04 | 平安科技(深圳)有限公司 | Dialogue script construction method, device, equipment and storage medium |
CN115273841A (en) * | 2022-07-08 | 2022-11-01 | Oppo广东移动通信有限公司 | Voice rejection method, device, service equipment and storage medium |
EP4086894A1 (en) * | 2021-07-08 | 2022-11-09 | Guangzhou Xiaopeng Motors Technology Co., Ltd. | Semantic recognition rejection method, semantic recognition rejection apparatus, transportation means, and medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112000787B (en) * | 2020-08-17 | 2021-05-14 | 上海小鹏汽车科技有限公司 | Voice interaction method, server and voice interaction system |
CN112927688B (en) * | 2021-01-25 | 2022-05-10 | 思必驰科技股份有限公司 | Voice interaction method and system for vehicle |
CN115457945B (en) * | 2022-11-10 | 2023-03-31 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and storage medium |
Application Events
- 2022-11-10: CN application CN202211402578.0A, patent CN115457945B (status: Active)
- 2023-11-08: WO application PCT/CN2023/130564, publication WO2024099375A1 (status: unknown)
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020198722A1 (en) * | 1999-12-07 | 2002-12-26 | Comverse Network Systems, Inc. | Language-oriented user interfaces for voice activated services |
CN105122354A (en) * | 2012-12-12 | 2015-12-02 | 亚马逊技术有限公司 | Speech model retrieval in distributed speech recognition systems |
CN110136713A (en) * | 2019-05-14 | 2019-08-16 | 苏州思必驰信息科技有限公司 | Dialogue method and system of the user in multi-modal interaction |
CN111583907A (en) * | 2020-04-15 | 2020-08-25 | 北京小米松果电子有限公司 | Information processing method, device and storage medium |
US20210327411A1 (en) * | 2020-04-15 | 2021-10-21 | Beijing Xiaomi Pinecone Electronics Co., Ltd. | Method and device for processing information, and non-transitory storage medium |
CN111916082A (en) * | 2020-08-14 | 2020-11-10 | 腾讯科技(深圳)有限公司 | Voice interaction method and device, computer equipment and storage medium |
CN112101045A (en) * | 2020-11-02 | 2020-12-18 | 北京淇瑀信息科技有限公司 | Multi-mode semantic integrity recognition method and device and electronic equipment |
CN113488047A (en) * | 2021-07-06 | 2021-10-08 | 思必驰科技股份有限公司 | Man-machine conversation interruption method, electronic device and computer readable storage medium |
EP4086894A1 (en) * | 2021-07-08 | 2022-11-09 | Guangzhou Xiaopeng Motors Technology Co., Ltd. | Semantic recognition rejection method, semantic recognition rejection apparatus, transportation means, and medium |
CN113990300A (en) * | 2021-12-27 | 2022-01-28 | 广州小鹏汽车科技有限公司 | Voice interaction method, vehicle, server and computer-readable storage medium |
CN115273841A (en) * | 2022-07-08 | 2022-11-01 | Oppo广东移动通信有限公司 | Voice rejection method, device, service equipment and storage medium |
CN115146653A (en) * | 2022-07-21 | 2022-10-04 | 平安科技(深圳)有限公司 | Dialogue script construction method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
DANIEL GEHRIG et al.: "Combining Events and Frames Using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction", IEEE Xplore * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024099375A1 (en) * | 2022-11-10 | 2024-05-16 | 广州小鹏汽车科技有限公司 | Voice interaction method, and server and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN115457945B (en) | 2023-03-31 |
WO2024099375A1 (en) | 2024-05-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11315560B2 (en) | Method for conducting dialog between human and computer | |
CN107665706B (en) | Rapid voice interaction method and system | |
CN110675859B (en) | Multi-emotion recognition method, system, medium, and apparatus combining speech and text | |
CN112818109B (en) | Intelligent reply method, medium, device and computing equipment for mail | |
CN111161726B (en) | Intelligent voice interaction method, device, medium and system | |
CN111276148A (en) | Return visit method, system and storage medium based on convolutional neural network | |
CN115457945B (en) | Voice interaction method, server and storage medium | |
CN110019691A (en) | Conversation message treating method and apparatus | |
CN111858874B (en) | Dialogue service processing method, device, equipment and computer readable storage medium | |
CN109003600B (en) | Message processing method and device | |
CN113362815A (en) | Voice interaction method, system, electronic equipment and storage medium | |
CN113132214B (en) | Dialogue method, dialogue device, dialogue server and dialogue storage medium | |
CN116821290A (en) | Multitasking dialogue-oriented large language model training method and interaction method | |
CN113593565B (en) | Intelligent home device management and control method and system | |
CN108899035B (en) | Message processing method and device | |
CN115757749B (en) | Dialogue processing method and device, electronic equipment and storage medium | |
CN117765932A (en) | Speech recognition method, device, electronic equipment and storage medium | |
CN117370512A (en) | Method, device, equipment and storage medium for replying to dialogue | |
CN113326373B (en) | WeChat group chat record identification method and system fusing session scene information | |
CN116304046A (en) | Dialogue data processing method and device, storage medium and electronic equipment | |
CN112908296A (en) | Dialect identification method | |
CN115658908B (en) | Five-personality perception method and system based on conversation interaction process | |
CN118014039B (en) | Model training method and device, storage medium and electronic equipment | |
CN112233699B (en) | Voice broadcasting method, intelligent voice equipment and computer readable storage medium | |
CN113851121B (en) | Front-end noise filtering method, system, equipment and medium based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||