CN113053380B - Server and voice recognition method
- Publication number
- CN113053380B CN202110335864.9A CN202110335864A
- Authority
- CN
- China
- Prior art keywords
- voice
- data
- server
- slicing
- received
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/41—Structure of client; Structure of client peripherals
- H04N21/422—Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
- H04N21/42204—User interfaces specially adapted for controlling a client device through a remote control device; Remote control devices therefor
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/472—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
Abstract
The embodiments of the application provide a server and a voice recognition method. The server is configured to: receive voice fragment data from a display device; if the voice fragment data is not the last fragment of a voice session and the server has not yet sent the preceding fragment to the voice recognition service device, withhold the fragment temporarily, and send it to the voice recognition service device only after the preceding fragment has been sent or has been marked as ignored; and if the voice fragment data is the last fragment, send the received voice fragment data that has not been sent and is not marked as ignored to the voice recognition service device once all non-ignored fragments preceding the last fragment have been sent, or once the waiting time exceeds a preset time threshold. The application addresses the technical problem of low voice recognition accuracy.
Description
Technical Field
The application relates to the technical field of voice interaction, and in particular to a server and a voice recognition method.
Background
With the rapid development of artificial intelligence in the field of display devices, more and more display devices, such as smart televisions, support voice control. A user can input a voice session to the television, and the television performs semantic recognition on the voice session through a semantic engine to obtain the user's intention and respond accordingly. In some scenarios the voice session input by the user is long, and if the television only starts semantic recognition after the whole voice has been received, the response may take a long time. In the related art, to improve the recognition speed, the television sends the user's voice to the semantic engine in fragments, in real time, while the user is still speaking, which shortens the response time of the voice interaction. However, due to network fluctuations and similar causes, the order in which the semantic engine receives the voice fragments may differ from the order in which the television sent them, which reduces the accuracy of voice recognition and degrades the user experience.
Disclosure of Invention
The application provides a server and a voice recognition method to solve the technical problem of low accuracy in voice interaction.
In a first aspect, the present application provides a server configured to:
receiving voice fragment data from a display device;
if the voice fragment data is not the last fragment of the voice session, and the server has not sent the preceding fragment of the voice fragment data to the voice recognition service device, temporarily withholding the voice fragment data until the server has sent the preceding fragment or the preceding fragment has been marked as ignored, and then sending the voice fragment data to the voice recognition service device;
and if the voice fragment data is the last fragment, sending the received voice fragment data that has not been sent and is not marked as ignored to the voice recognition service device when all non-ignored fragments preceding the last fragment have been sent, or when the waiting time exceeds a preset time threshold.
In some embodiments, for voice fragment data that has not been received, if the number of received fragments ordered after it reaches a preset number threshold, the unreceived voice fragment data is marked as ignored.
In some embodiments, the server is further configured to:
for voice fragment data that has not been received and is not marked as ignored, the closer the fragment's position is to the last fragment, the longer the preset waiting time corresponding to that fragment, and the preset time threshold is the maximum of all the preset waiting times.
In some embodiments, the voice fragment data includes voice data and fragment parameters, the fragment parameters including a fragment sequence number, and the fragment sequence number is used to determine the order of the voice fragment data in the voice session.
In some embodiments, the fragment sequence number comprises an array of two values, the first value of the array representing the order of the voice fragment data in the voice session and the second value of the array representing the order of the next fragment of the voice fragment data in the voice session.
In a second aspect, the present application provides a speech recognition method, the method comprising:
the display device sends the voice fragment data to the server;
the server receives the voice fragment data from the display device;
if the voice fragment data is not the last fragment of the voice session, and the server has not sent the preceding fragment of the voice fragment data to the voice recognition service device, the server temporarily withholds the voice fragment data until it has sent the preceding fragment or the preceding fragment has been marked as ignored, and then sends the voice fragment data to the voice recognition service device;
if the voice fragment data is the last fragment, the server sends the received voice fragment data that has not been sent and is not marked as ignored to the voice recognition service device when all non-ignored fragments preceding the last fragment have been sent, or when the waiting time exceeds a preset time threshold;
the voice recognition service device performs real-time voice recognition based on the received voice fragment data.
The beneficial effects of the server and the voice recognition method provided by the application include the following:
According to the embodiments of the application, after receiving voice fragment data the server does not forward it to the voice recognition service device while its preceding fragment has not been sent; only after the server has sent the preceding fragment, or the preceding fragment has been marked as ignored, is the fragment sent to the voice recognition service device. This preserves the temporal order of the voice fragment data received by the voice recognition service device, thereby improving the voice recognition accuracy of the voice recognition service device, improving the response accuracy of voice interaction, and improving the voice interaction experience.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
A schematic diagram of an operational scenario between a display device and a control apparatus according to some embodiments is schematically shown in fig. 1;
a hardware configuration block diagram of the control apparatus 100 according to some embodiments is exemplarily shown in fig. 2;
a hardware configuration block diagram of a display device 200 according to some embodiments is exemplarily shown in fig. 3;
a schematic diagram of the software configuration in a display device 200 according to some embodiments is exemplarily shown in fig. 4;
a schematic diagram of a speech recognition network architecture according to some embodiments is exemplarily shown in fig. 5;
a schematic diagram of a cached file according to some embodiments is shown schematically in fig. 6;
a schematic diagram of a cached file according to some embodiments is shown schematically in fig. 7;
a schematic diagram of a cached file according to some embodiments is shown schematically in fig. 8;
a schematic diagram of a cached file according to some embodiments is shown schematically in fig. 9;
A schematic diagram of a cached file according to some embodiments is shown schematically in fig. 10;
a schematic diagram of a cached file according to some embodiments is shown schematically in fig. 11;
a schematic diagram of a cached file according to some embodiments is shown schematically in fig. 12;
a schematic diagram of a cached file according to some embodiments is illustrated in fig. 13.
Detailed Description
For the purposes of making the objects and embodiments of the present application clearer, exemplary embodiments of the present application will be described in detail below with reference to the accompanying drawings in which exemplary embodiments of the application are illustrated. It is apparent that the described exemplary embodiments are only some, not all, of the embodiments of the present application.
It should be noted that the brief description of the terminology in the present application is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first", "second", "third", and the like in the description, the claims, and the above figures are used to distinguish similar or identical objects or entities, and are not necessarily intended to describe a particular order or sequence unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the function associated with that element.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller, and the communication between the remote controller and the display device includes infrared protocol communication or bluetooth protocol communication, and other short-range communication modes, and the display device 200 is controlled by a wireless or wired mode. The user may control the display device 200 by inputting user instructions through keys on a remote control, voice input, control panel input, etc.
In some embodiments, a smart device 300 (e.g., mobile terminal, tablet, computer, notebook, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application running on a smart device.
In some embodiments, the display device 200 may also be controlled in manners other than through the control apparatus 100 and the smart device 300; for example, the user's voice commands may be received directly through a module for acquiring voice commands configured inside the display device 200, or through a voice control device configured outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be allowed to communicate via a local area network (LAN), a wireless local area network (WLAN), or other networks. The server 400 may provide various contents and interactions to the display device 200. The server 400 may be one cluster or multiple clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 according to an exemplary embodiment. As shown in fig. 2, the control apparatus 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive input operation instructions from the user and convert them into instructions that the display device 200 can recognize and respond to, acting as an intermediary for interaction between the user and the display device 200.
Fig. 3 shows a hardware configuration block diagram of the display device 200 in accordance with an exemplary embodiment.
In some embodiments, display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, memory, a power supply, a user interface.
In some embodiments, the controller includes a processor, a video processor, an audio processor, a graphics processor, RAM, ROM, and first to n-th interfaces for input/output.
In some embodiments, the display 260 includes a display screen component for presenting pictures and a driving component for driving image display; it is used to receive the image signals output by the controller and to display video content, image content, and menu manipulation interfaces, as well as the UI interface manipulated by the user.
In some embodiments, the display 260 may be a liquid crystal display, an OLED display, a projection device, and a projection screen.
In some embodiments, communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The display device 200 may establish transmission and reception of control signals and data signals with the external control device 100 or the server 400 through the communicator 220.
In some embodiments, the user interface may be configured to receive control signals from the control device 100 (e.g., an infrared remote control, etc.).
In some embodiments, the detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; alternatively, the detector 230 includes an image collector such as a camera, which may be used to collect external environmental scenes, user attributes, or user interaction gestures, or alternatively, the detector 230 includes a sound collector such as a microphone, or the like, which is used to receive external sounds.
In some embodiments, the external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, or the like. The input/output interface may be a composite input/output interface formed by a plurality of interfaces.
In some embodiments, the modem 210 receives broadcast television signals by wired or wireless reception, and demodulates audio/video signals, as well as EPG data signals, from a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored on the memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command to select a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the object may be any one of selectable objects, such as a hyperlink, an icon, or other operable control. The operations related to the selected object are: displaying an operation of connecting to a hyperlink page, a document, an image, or the like, or executing an operation of a program corresponding to the icon.
In some embodiments, the controller includes at least one of a central processing unit (CPU), a video processor, an audio processor, a graphics processing unit (GPU), random access memory (RAM), read-only memory (ROM), first to n-th interfaces for input/output, a communication bus, and the like.
The CPU processor is used to execute the operating system and application program instructions stored in the memory, and to execute various applications, data, and content in accordance with the various interactive instructions received from the outside, so as to finally display and play various audio and video content. The CPU processor may include a plurality of processors, such as one main processor and one or more sub-processors.
In some embodiments, a graphics processor is used to generate various graphical objects, such as icons, operation menus, and graphics displayed for user input instructions. The graphics processor comprises an arithmetic unit, which processes the various interactive instructions input by the user and displays the various objects according to their display attributes, and a renderer, which renders the objects produced by the arithmetic unit so that they can be displayed on the display.
In some embodiments, the video processor is configured to receive an external video signal and, according to the standard codec protocol of the input signal, perform video processing such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, and image composition, so as to obtain a signal that can be directly displayed or played on the display device 200.
In some embodiments, the video processor includes a demultiplexing module, a video decoding module, an image synthesis module, a frame rate conversion module, a display formatting module, and the like. The demultiplexing module demultiplexes the input audio/video data stream. The video decoding module processes the demultiplexed video signal, including decoding and scaling. The image synthesis module, such as an image synthesizer, superimposes and mixes the GUI signal input by the user or generated by the graphics generator with the scaled video image, to generate an image signal for display. The frame rate conversion module converts the frame rate of the input video. The display formatting module converts the frame-rate-converted video signal into a video output signal conforming to the display format, for example outputting RGB data signals.
In some embodiments, the audio processor is configured to receive an external audio signal, decompress and decode the audio signal according to a standard codec protocol of an input signal, and perform noise reduction, digital-to-analog conversion, and amplification processing to obtain a sound signal that can be played in a speaker.
In some embodiments, a user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface recognizes the sound or gesture through the sensor to receive the user input command.
In some embodiments, a "user interface" is a media interface for interaction and exchange of information between an application or operating system and a user that enables conversion between an internal form of information and a form acceptable to the user. A commonly used presentation form of the user interface is a graphical user interface (Graphic User Interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in a display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
In some embodiments, a system of display devices may include a Kernel (Kernel), a command parser (shell), a file system, and an application program. The kernel, shell, and file system together form the basic operating system architecture that allows users to manage files, run programs, and use the system. After power-up, the kernel is started, the kernel space is activated, hardware is abstracted, hardware parameters are initialized, virtual memory, a scheduler, signal and inter-process communication (IPC) are operated and maintained. After the kernel is started, shell and user application programs are loaded again. The application program is compiled into machine code after being started to form a process.
As shown in fig. 4, the system of the display device is divided into three layers, an application layer, a middleware layer, and a hardware layer, from top to bottom.
The application layer mainly comprises the common applications on the television and an application framework (Application Framework). The common applications are mainly browser-based applications, such as HTML5 apps, and native applications (native apps).
The application framework (Application Framework) is a complete program model with all the basic functions required by standard application software, such as file access and data exchange, and the interfaces for using these functions (toolbar, status bar, menu, dialog box).
Native applications (native apps) may support online or offline operation, message pushing, or local resource access.
The middleware layer includes middleware such as various television protocols, multimedia protocols, and system components. The middleware can use basic services (functions) provided by the system software to connect various parts of the application system or different applications on the network, so that the purposes of resource sharing and function sharing can be achieved.
The hardware layer mainly comprises the HAL interface, the hardware, and the drivers. The HAL interface is a unified interface to which all television chips are docked, with the specific logic implemented by each chip. The drivers mainly include: the audio driver, display driver, Bluetooth driver, camera driver, Wi-Fi driver, USB driver, HDMI driver, sensor drivers (e.g., fingerprint sensor, temperature sensor, pressure sensor), power supply driver, and so on.
The hardware or software architecture in some embodiments may be based on the description in the foregoing embodiments, and in some embodiments may be based on other similar hardware or software architectures, so long as the technical solution of the present application may be implemented.
In order to clearly illustrate the embodiments of the present application, a voice recognition network architecture provided by the embodiments of the present application is described below with reference to fig. 5.
Referring to fig. 5, fig. 5 is a schematic diagram of a voice recognition network architecture according to an embodiment of the present application. In fig. 5, the smart device is configured to receive input information and to output a processing result of that information. The voice recognition service device is an electronic device on which a voice recognition service is deployed, the semantic service device is an electronic device on which a semantic service is deployed, and the business service device is an electronic device on which a business service is deployed. An electronic device here may include a server, a computer, and the like. The voice recognition service for recognizing audio as text, the semantic service (which may also be called a semantic engine) for semantically parsing text, and the business services providing concrete services, such as the weather query service of Moji Weather or the music query service of QQ Music, are all web services that may be deployed on such electronic devices. In one embodiment, multiple entity service devices deployed with different business services may exist in the architecture shown in fig. 5, and one or more entity service devices may also aggregate one or more functional services.
In some embodiments, the process of handling information input to the smart device is described below based on the architecture shown in fig. 5, taking a query sentence input by voice as an example. The process may include the following three stages:
[ Speech recognition ]
The intelligent device may upload the audio of the query sentence to the voice recognition service device after receiving the query sentence input through the voice, so that the voice recognition service device recognizes the audio as text through the voice recognition service and returns the text to the intelligent device. In one embodiment, the intelligent device may denoise the audio of the query statement prior to uploading the audio of the query statement to the speech recognition service device, where the denoising may include steps such as removing echoes and ambient noise.
[ Semantic understanding ]
The intelligent device uploads the text of the query sentence recognized by the voice recognition service to the semantic service device, so that the semantic service device performs semantic parsing on the text through the semantic service to obtain the business domain, intent, and so on of the text.
[ semantic response ]
The semantic service device issues a query instruction to the corresponding business service device according to the semantic parsing result of the text of the query sentence, so as to obtain the query result given by the business service. The intelligent device may obtain the query result from the semantic service device and output it. As an embodiment, the semantic service device may also send the semantic parsing result of the query sentence to the intelligent device, so that the intelligent device outputs the feedback sentence contained in the parsing result.
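Put together, the three stages above form a simple pipeline. The following sketch illustrates that flow with stub services standing in for the real voice recognition, semantic, and business service devices; all class names and return values here are illustrative assumptions, not part of the application:

```python
class SpeechRecognitionService:
    def recognize(self, audio: bytes) -> str:
        # Stub: the real service converts the uploaded audio into text.
        return "what is the weather tomorrow"

class SemanticService:
    def parse(self, text: str) -> dict:
        # Stub: the real semantic service returns the business domain, intent, etc.
        return {"domain": "weather", "intent": "query", "text": text}

class WeatherBusinessService:
    def query(self, parsed: dict) -> str:
        # Stub: the real business service answers the concrete query.
        return "Sunny tomorrow"

def handle_voice_query(audio: bytes) -> str:
    text = SpeechRecognitionService().recognize(audio)     # [ Speech recognition ]
    parsed = SemanticService().parse(text)                 # [ Semantic understanding ]
    return WeatherBusinessService().query(parsed)          # [ Semantic response ]

print(handle_voice_query(b"...audio of the query sentence..."))
```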
It should be noted that the architecture shown in fig. 5 is only an example, and is not intended to limit the scope of the present application. Other architectures may be employed in embodiments of the present application to achieve similar functionality, for example: all or part of the three processes can be completed by the intelligent terminal, and are not described in detail herein.
In some embodiments, the smart device shown in fig. 5 may be a display device, such as a smart tv, and the display device may send the collected user voice to a server of the display device, and then the server of the display device sends the user voice to a voice recognition service device for voice recognition.
In some embodiments, the smart device shown in fig. 5 may also be other devices that support voice interactions, such as a smart speaker, a smart phone, and so on.
In some embodiments, a query sentence or other interactive sentence that a user inputs to the display device in a single turn may be referred to as a voice session. Voice interaction scenes can be subdivided into question-and-answer scenes and chat scenes. In a question-and-answer scene, the user inputs a voice session to the display device and the display device exits the voice interaction interface after responding. In a chat scene, the display device does not exit the voice interaction interface after responding but keeps collecting sound; the user can input a new voice session to the display device, and the display device responds to each new voice session until a termination condition is reached, after which it exits the voice interaction interface. An exemplary termination condition is that the user does not input a new voice session within a specified time.
In some embodiments, the voice session input by the user may last a long time. If the display device only starts sending the voice to the server after the user has finished speaking, the upload itself takes a long time, which leads to a long voice interaction response time. To shorten the response time, while the user is inputting the voice session the display device can upload the voice data received during each short interval to the server, so that one voice session is divided into multiple pieces of data. Each time the server receives a piece of data, it sends that piece to the voice recognition service device. By the time the user finishes inputting the voice session, the display device has already sent most of the voice session to the voice recognition service device through the server, and only the remaining data of the voice session still needs to be sent, which takes little time; the response time of the voice interaction is thus shortened.
However, due to network fluctuations and similar causes, the voice fragment data uploaded by the display device may not reach the server as continuous data: a fragment sent later by the display device may arrive at the server earlier. As a result, the voice fragment data the server sends to the voice recognition service device may also not be continuous, which reduces the recognition accuracy of the voice recognition service device for the voice session and degrades the voice interaction experience.
To solve the above technical problem, in some embodiments, after the server receives a piece of voice fragment data, if the fragment is not the first fragment of the voice session and its preceding fragment has not been received, the server does not send the fragment immediately but waits for a period of time. If the preceding fragment arrives during the wait, the server sends the preceding fragment and then the current fragment to the voice recognition service device; if the preceding fragment still has not arrived, the server may send the current fragment anyway. In this way the voice recognition service device receives voice fragment data that is as continuous as possible, which improves the accuracy of voice recognition and improves the voice interaction experience.
The technical solution for improving the accuracy of voice recognition is described in detail below, taking as an example a user who inputs a voice session to a display device.
In some embodiments, the method for improving the accuracy of voice recognition may be divided into two phases, where the first phase occurs at the display device side and includes receiving and processing of a voice session, and the second phase occurs at the server side and includes processing and sending of a voice session.
Illustratively, the first stage is as follows:
in some embodiments, a voice control key may be disposed on a remote controller of the display device, and after a user presses the voice control key on the remote controller, the controller of the display device may control a display of the display device to display a voice interactive interface, and control a sound collector, such as a microphone, to collect sounds around the display device. At this point, the user may input a voice session to the display device. During the user's input of a voice session, the display device may send the voice session to the server in a fragmented form.
In some embodiments, the display device may support a voice wake-up function, in which case the sound collector of the display device may continuously collect sound. After the user speaks the wake-up word, the display device performs voice recognition on the input voice; on recognizing the wake-up word, it displays the voice interaction interface. A voice session whose content is only the wake-up word need not be sent to the server; after the user inputs a new voice session to the display device, the display device sends that new voice session to the server. While the user is inputting the new voice session, the display device may send it to the server in fragmented form.
In some embodiments, after the user inputs a voice session, the sound collector of the display device can remain in the collecting state while the display device obtains the response data of the voice session or responds according to that data. The user can press the voice control key on the remote controller at any time to input another voice session, or speak the wake-up word again; the display device then ends the previous voice interaction process and starts a new one for the newly input voice session, ensuring the real-time performance of voice interaction.
Taking the case where the user enters the voice interaction interface through the remote controller as an example: in some embodiments, the user can speak after pressing the voice control key on the remote controller and release the key when finished. In response to the voice control key being pressed, the display device controls the sound collector to collect sound, stores the collected voice, and generates a piece of voice fragment data every 300 ms. The voice fragment data includes the voice data and fragment parameters, and the fragment parameters may include a session identifier and a fragment sequence number. The session identifier is used to distinguish different voice sessions: different voice sessions have different session identifiers, and the display device generates a new session identifier on the signal that the voice control key has been pressed. The fragment sequence number may be an array of two values, the first value representing the order of this fragment in the voice session and the second value representing the order of the next fragment in the voice session. For example, a fragment sequence number of 1-2 indicates that the current fragment is the first fragment and the voice session has a next fragment; a fragment sequence number of 2-2 indicates that the current fragment is the second fragment and the voice session has no next fragment, i.e., the current fragment is the last fragment of the voice session. When the user releases the voice control key, the display device determines from the key-release signal that the current voice session has ended, and sets the two values of the sequence number of the last fragment to be equal.
In some embodiments, the fragment sequence number may also contain only one value indicating the order of the fragment in the voice session. For example, a fragment sequence number of 1 indicates the first fragment and a fragment sequence number of 2 indicates the second fragment. When the user releases the voice control key, the display device determines from the key-release signal that the current voice session has ended, generates an end flag, and writes it into the fragment parameters. If the fragment parameters of a piece of voice fragment data contain the end flag, that fragment is the last fragment of the current voice session; otherwise it is not.
Taking the array-style fragment sequence number as an example, at each interval the display device extracts the voice data collected during that interval, generates the fragment parameters, assembles the voice fragment data from the voice data and the fragment parameters, and sends the voice fragment data to the server for second-stage processing, as sketched below.
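A minimal sketch of this first stage on the display device side, assuming the 300 ms fragment interval and the two-value sequence-number format described above; the class and function names (VoiceFragment, make_fragments) are illustrative, not taken from the application:

```python
import uuid
from dataclasses import dataclass

FRAGMENT_INTERVAL_MS = 300   # a fragment is cut every 300 ms, as in the example above

@dataclass
class VoiceFragment:
    session_id: str            # distinguishes different voice sessions
    sequence: tuple            # (order of this fragment, order of the next fragment)
    voice_data: bytes          # audio collected during this 300 ms interval

def make_fragments(session_id: str, chunks: list) -> list:
    """Wrap the collected audio chunks into fragments with array-style sequence numbers."""
    fragments = []
    for i, chunk in enumerate(chunks, start=1):
        is_last = (i == len(chunks))
        # For the last fragment the two values are equal, marking the end of the session.
        sequence = (i, i) if is_last else (i, i + 1)
        fragments.append(VoiceFragment(session_id, sequence, chunk))
    return fragments

# Usage: three 300 ms chunks yield sequence numbers (1, 2), (2, 3), (3, 3).
session = str(uuid.uuid4())
chunks = [b"audio 0-300ms", b"audio 300-600ms", b"audio 600-900ms"]
print([f.sequence for f in make_fragments(session, chunks)])
```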
Illustratively, the second stage is as follows:
in some embodiments, after receiving voice fragment data, the server may extract the session identifier from it and check whether a cache file corresponding to that session identifier already exists. If no such cache file exists, this is the first time the server has received voice fragment data for this voice session. The server may create a cache file corresponding to the session identifier and then set the session state parameters corresponding to the session identifier. The session state parameters include waitCounter (waiting count) and lastUpdateTime (last update time): waitCounter indicates the number of fragments the server has received but has not yet sent to the voice recognition service device and that are not marked as ignored, and lastUpdateTime indicates the time at which the server last received a fragment that is not marked as ignored; lastUpdateTime may, for example, be a timestamp. The server may update the session state parameters each time it receives a piece of voice fragment data. The server stores the voice data of the fragments and the session state parameters in the cache file; when storing voice data it may record the fragment sequence number and a state, where the state may be not received, to be sent, sent, or ignored, and "ignored" means the voice data will no longer be sent to the voice recognition service device.
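One possible in-memory representation of such a cache file and its session state parameters is sketched below; the class and field names are illustrative assumptions, with only waitCounter, lastUpdateTime, and the four fragment states taken from the description above:

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional

class FragmentState(Enum):
    NOT_RECEIVED = "not received"
    TO_BE_SENT = "to be sent"
    SENT = "sent"
    IGNORED = "ignored"        # will no longer be sent to the voice recognition service device

@dataclass
class CachedFragment:
    sequence: int
    state: FragmentState = FragmentState.NOT_RECEIVED
    voice_data: Optional[bytes] = None

@dataclass
class SessionCache:
    session_id: str
    fragments: Dict[int, CachedFragment] = field(default_factory=dict)
    wait_counter: int = 0          # waitCounter: received, not yet sent, not ignored
    last_update_time: float = 0.0  # lastUpdateTime: timestamp of the last non-ignored fragment

    def store(self, sequence: int, voice_data: bytes) -> None:
        """Record a received fragment and update the session state parameters."""
        self.fragments[sequence] = CachedFragment(sequence, FragmentState.TO_BE_SENT, voice_data)
        self.wait_counter += 1
        self.last_update_time = time.time()

cache = SessionCache(session_id="session-1")
cache.store(1, b"fragment 1 audio")
print(cache.wait_counter, cache.fragments[1].state)
```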
According to its fragment sequence number, a piece of voice fragment data received by the server may be the first fragment, a middle fragment, or the last fragment of a voice session. Starting from the first time it receives voice fragment data for a given voice session, the server performs different processing according to the fragment sequence number of each received piece of voice fragment data.
The server's processing of voice fragment data is described first by taking the first and the second reception of voice fragment data as examples.
In some embodiments, when the network is running smoothly, the first piece of voice fragment data the server receives for the voice session corresponding to a session identifier is usually the first fragment of that session. After creating the cache file, the server may send a voice recognition request to the voice recognition service device, the request containing an instruction to start a voice session with the voice recognition service device and the voice data of the first fragment. After receiving the voice recognition request, the voice recognition service device starts a voice recognition process according to the start-session instruction and performs real-time voice recognition on the voice data subsequently sent by the server through that process.
In some embodiments, for the voice session corresponding to a session identifier, due to network fluctuations the first piece of voice fragment data the server receives may not be the first fragment of the session but, say, the second or third fragment. In this case, after creating the cache file, the server may refrain from sending a voice recognition request to the voice recognition service device for the time being and continue waiting to receive voice fragment data from the display device.
In some embodiments, for the voice session corresponding to a session identifier, suppose the first piece of voice fragment data the server received was the first fragment of the session, so the server has already sent the first fragment to the voice recognition service device. If the second piece it receives is the second fragment of the session, the server may send the voice data of the second fragment to the voice recognition service device and update the cache file; if it is not the second fragment but, say, the third or fourth fragment, the server may refrain from sending its voice data to the voice recognition service device for the time being and simply update the cache file.
In some embodiments, for the voice session corresponding to a session identifier, suppose the first piece of voice fragment data the server received was not the first fragment of the session, so the server has not yet sent a voice recognition request to the voice recognition service device, and the second piece it receives is the first fragment. If the piece received first happens to be the second fragment, the server may send a voice recognition request to the voice recognition service device and update the cache file, the request containing an instruction to start a voice session with the voice recognition service device together with the voice data of the first fragment and of the second fragment. If the piece received first is not the second fragment, the server may send a voice recognition request containing the start-session instruction and the voice data of the first fragment, update the cache file, and refrain from sending the voice data of the other received fragment for the time being.
In some embodiments, for the voice session corresponding to a session identifier, if neither the first nor the second piece of voice fragment data received by the server is the first fragment of the session, the server may simply update the cache file without sending a voice recognition request to the voice recognition service device.
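The early-reception behaviour described in the preceding embodiments can be condensed as follows: the start-of-session request is issued only once the first fragment is available, and any already-cached fragments that directly follow it are sent along with it. The function below is an illustrative sketch; send_request stands in for the voice recognition request sent to the voice recognition service device:

```python
def maybe_start_session(cache: dict, session_started: bool, send_request) -> bool:
    """cache maps fragment sequence number -> voice data for the fragments received so far."""
    if session_started or 1 not in cache:
        return session_started          # first fragment still missing: cache and keep waiting
    # Fragment 1 has arrived: send the start-session request with the consecutive prefix 1, 2, 3, ...
    batch, seq = [], 1
    while seq in cache:
        batch.append(cache[seq])
        seq += 1
    send_request(batch)                 # start-of-session instruction plus the voice data of the prefix
    return True

# Usage: fragments 2 and 3 arrive first and nothing is sent; once fragment 1 arrives,
# the request carries the voice data of fragments 1-3.
received = {2: b"frag 2", 3: b"frag 3"}
started = maybe_start_session(received, False, print)    # nothing sent, returns False
received[1] = b"frag 1"
started = maybe_start_session(received, started, print)  # sends [b'frag 1', b'frag 2', b'frag 3']
```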
The server's processing of voice fragment data is described below by taking the N-th reception of voice fragment data as an example, where N is greater than or equal to 2.
In some embodiments, after the server receives voice fragment data and determines from its fragment sequence number that it is a middle fragment, the server checks whether the preceding fragment has been sent to the voice recognition service device. If the preceding fragment has been sent, the server sends the current fragment to the voice recognition service device and updates the cache file. If the server has not yet sent the voice data of the preceding fragment to the voice recognition service device, the server temporarily withholds the voice data of the current fragment and updates the cache file; only after the server has sent the voice data of the preceding fragment, or the preceding fragment meets the preset ignore criterion, does it send the voice data of the current fragment to the voice recognition service device and update the cache file. For example, suppose the server has received the middle fragments with sequence numbers 3 and 4 but has not received the fragment with sequence number 2; it then withholds the voice data of fragments 3 and 4 for the time being. If the fragment with sequence number 2 arrives some time later, the voice data of fragments 2, 3, and 4 can be sent to the voice recognition service device together, as sketched below.
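A minimal sketch of this forwarding rule for middle fragments, assuming simple per-sequence-number bookkeeping of which fragments have been sent or ignored (the names are illustrative, not from the application):

```python
def try_forward(pending: dict, sent: set, ignored: set, forward) -> None:
    """Forward any cached fragment whose preceding fragment was already sent or ignored.

    pending maps sequence number -> voice data of received but not yet forwarded fragments.
    """
    progressed = True
    while progressed:
        progressed = False
        for seq in sorted(pending):
            prev = seq - 1
            if prev == 0 or prev in sent or prev in ignored:
                forward(seq, pending.pop(seq))   # preceding fragment handled: safe to send
                sent.add(seq)
                progressed = True
                break                            # re-scan, the next fragment may now be unblocked

# Usage: fragments 3 and 4 are held back until fragment 2 arrives, then 2, 3, 4 go out in order.
pending, sent, ignored = {3: b"f3", 4: b"f4"}, {1}, set()
try_forward(pending, sent, ignored, lambda s, d: print("send", s))   # nothing sent yet
pending[2] = b"f2"
try_forward(pending, sent, ignored, lambda s, d: print("send", s))   # send 2, send 3, send 4
```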
In some embodiments, the server may mark certain voice fragment data as ignored: for an unreceived fragment, if the number of fragments the server has received that are ordered after it reaches a preset number threshold, the unreceived fragment is marked as ignored; the preset number threshold may be 3. For example, if the server has received the middle fragments with sequence numbers 6, 7, and 8 but has not received the fragment with sequence number 5, it marks the fragment with sequence number 5 as ignored, and even if that fragment arrives later it is not sent to the voice recognition service device. A short sketch of this rule follows.
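The sketch below applies this rule with the preset number threshold of 3 from the example above (the names are illustrative):

```python
IGNORE_THRESHOLD = 3   # the preset number threshold from the example above

def mark_ignored(received: set, ignored: set, last_seen: int) -> None:
    """Mark a missing fragment as ignored once enough later fragments have been received."""
    for missing in range(1, last_seen):
        if missing in received or missing in ignored:
            continue
        later = sum(1 for seq in received if seq > missing)
        if later >= IGNORE_THRESHOLD:
            ignored.add(missing)        # even if this fragment arrives later, it is not forwarded

# Usage: fragments 1-4 and 6-8 have arrived, fragment 5 has not; three later fragments
# (6, 7, 8) have been received, so fragment 5 is marked as ignored.
received, ignored = {1, 2, 3, 4, 6, 7, 8}, set()
mark_ignored(received, ignored, last_seen=8)
print(ignored)   # {5}
```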
While the display device is sending the multiple pieces of voice fragment data to the server, once the server receives the last fragment, the user has finished the voice input regardless of whether all of the data before the last fragment has been received, and the voice recognition result needs to be obtained from the voice recognition service device as soon as possible. The server may also run a timed task that checks lastUpdateTime: if the difference between the current time and lastUpdateTime is greater than 10 seconds, the request is considered to have timed out, the current cache file is deleted, and an instruction to end the voice session is sent to the voice recognition service device; if voice fragment data with this session identifier is received afterwards, it is no longer sent to the voice recognition service device. If the difference between the current time and lastUpdateTime is less than 10 seconds, the request is considered not to have timed out, the current cache file is not deleted, and the server continues to wait for the next piece of voice fragment data. Of course, the server may also not start the timed task, or may check lastUpdateTime in real time.
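A small sketch of the 10-second timeout check based on lastUpdateTime; the dictionary keys and callbacks are illustrative assumptions standing in for deleting the cache file and notifying the voice recognition service device:

```python
import time

TIMEOUT_SECONDS = 10

def check_timeout(session: dict, now: float = None) -> bool:
    """Return True if the session timed out and was torn down."""
    now = now if now is not None else time.time()
    if now - session["last_update_time"] > TIMEOUT_SECONDS:
        session["delete_cache"]()            # discard the cached fragments
        session["end_voice_session"]()       # tell the recognition service the session is over
        session["abandoned"] = True          # later fragments with this session id are dropped
        return True
    return False                             # not timed out: keep waiting for the next fragment

# Usage: the last non-ignored fragment arrived 12 s ago, so the session is abandoned.
session = {"last_update_time": time.time() - 12,
           "delete_cache": lambda: print("cache deleted"),
           "end_voice_session": lambda: print("end-session sent"),
           "abandoned": False}
print(check_timeout(session))   # True
```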
The server's processing of voice fragment data is further described below by taking the M-th reception of voice fragment data as an example, where M is greater than or equal to 2.
In some embodiments, after receiving a piece of voice fragment data the server updates the cache file, and if it determines from the fragment sequence number that this is the last fragment, it can determine from the state of each fragment in the cache file whether all voice fragment data before the last fragment, excluding the ignored fragments, has been sent to the voice recognition service device. If all fragments before the last one, other than the ignored ones, have been sent, the server can also send the voice data of the last fragment to the voice recognition service device, together with an instruction to end the voice session. If, apart from the ignored fragments, some fragments before the last one have not been sent, some voice fragment data is still outstanding; in that case the server may compute maxWaitTime (the maximum waiting time) from a preset waiting time table, use the maximum waiting time as the preset time threshold, and continue receiving voice fragment data within that maximum waiting time.
For example, a waiting time table for voice fragment data may be as shown in Table 1.
TABLE 1
Fragment sequence number difference | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
---|---|---|---|---|---|---|---|---|---|---|---|---
Waiting time (ms) | 1200 | 600 | 300 | 300 | 300 | 150 | 150 | 150 | 150 | 0 | 0 | 0
In Table 1, the fragment sequence number difference is the difference between the sequence number of an unreceived piece of voice fragment data and that of the last piece of data, and the waiting time is the threshold for which the server waits to receive that piece after receiving the last piece of data.
According to Table 1, a fragment sequence number difference of 1 corresponds to a waiting time of 1200 ms. This means that if, after the last piece of data is received, the penultimate piece has not been received, the server may wait up to 1200 ms for it. If the penultimate piece is still not received within 1200 ms, the server stops waiting, sends the voice fragment data that has not been sent and is not marked as ignored to the voice recognition service device, and sends an instruction to end the voice session. If the penultimate piece is received within 1200 ms, for example within 100 ms, the server likewise stops waiting, sends the voice fragment data that has not been sent and is not marked as ignored to the voice recognition service device, and sends an instruction to end the voice session.
According to Table 1, if neither the penultimate piece nor the antepenultimate piece has been received when the last piece of data arrives, the penultimate piece corresponds to a waiting time of 1200 ms and the antepenultimate piece corresponds to a waiting time of 600 ms, so the maximum waiting time is 1200 ms. The server can then wait to receive voice fragment data until the waiting time reaches 1200 ms or until all of the fragment data, other than the ignored pieces, has been received.
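Under the assumption that Table 1 is stored as a lookup keyed by the fragment sequence number difference, maxWaitTime could be computed roughly as follows; the function name, arguments and the concrete sequence numbers in the example are illustrative.

```python
# Table 1 as a lookup keyed by the fragment sequence number difference.
WAIT_TABLE_MS = {1: 1200, 2: 600, 3: 300, 4: 300, 5: 300, 6: 150,
                 7: 150, 8: 150, 9: 150, 10: 0, 11: 0, 12: 0}


def max_wait_time_ms(last_seq, missing_seqs):
    """maxWaitTime = largest waiting time among the missing, non-ignored pieces."""
    return max((WAIT_TABLE_MS.get(last_seq - seq, 0) for seq in missing_seqs), default=0)


# Example matching the paragraph above: if the last piece has sequence number 7
# and the penultimate (difference 1) and antepenultimate (difference 2) pieces
# are missing, the maximum waiting time is 1200 ms.
assert max_wait_time_ms(7, [6, 5]) == 1200
```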
In some embodiments, the server spends some time between receiving the last piece of data and computing maxWaitTime, and during this time the server is in effect already waiting. To obtain a more accurate waiting time, the server may add 20 ms to CurrentWaitTime (the current waiting time) when maxWaitTime is computed, i.e. start timing CurrentWaitTime from 20 ms, and stop timing either when CurrentWaitTime reaches maxWaitTime or when all of the fragment data other than the ignored pieces has been received. The 20 ms above is only exemplary; other values may be set, and the server may also start CurrentWaitTime from 0.
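A hedged sketch of this timing, with CurrentWaitTime starting from the exemplary 20 ms offset; the polling interval and the all_received callback are assumptions added for illustration.

```python
import time


def wait_for_missing(max_wait_ms, all_received, start_offset_ms=20, poll_ms=10):
    """Wait until maxWaitTime is reached or every non-ignored piece has arrived.

    start_offset_ms reflects the exemplary 20 ms already spent computing
    maxWaitTime; poll_ms is an assumed polling interval, not from the patent.
    """
    current_wait_ms = start_offset_ms  # CurrentWaitTime starts from 20 ms
    while current_wait_ms < max_wait_ms and not all_received():
        time.sleep(poll_ms / 1000.0)
        current_wait_ms += poll_ms
    return current_wait_ms
```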
The above embodiments show the processing procedure of the server after receiving voice fragment data. To describe the procedure more intuitively, taking as an example a voice input from the user that comprises 7 pieces of voice fragment data, figs. 6 to 13 exemplarily show schematic diagrams of the cache file.
Referring to fig. 6, the first piece of voice fragment data of a voice session received by the server is the first piece of data. The server may send the first piece of data to the voice recognition service device and create the cache file shown in fig. 6, recording lastUpdateTime, waitCounter, and the fragment sequence number, status and voice data in the cache file. For voice fragment data whose status is sent, the server may delete its voice data, so that the voice data no longer appears in the cache file.
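Purely for illustration, the cache file of fig. 6 might take a shape like the following; all field names are assumptions, since the application states which items are recorded but not the storage format.

```python
import json
import time

# One possible shape of the cache file created for fig. 6 (assumed layout).
cache_file = {
    "lastUpdateTime": int(time.time() * 1000),  # time the latest piece was received
    "waitCounter": 0,                           # waitCounter as recorded in the cache file
    "fragments": [
        {"seq": 1, "status": "sent", "voiceData": None},  # voice data removed once sent
    ],
}

print(json.dumps(cache_file))  # the server might persist it, e.g. as a JSON file
```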
Referring to fig. 7, the second piece of voice fragment data of the voice session received by the server is the third piece of data. Because the second piece of data has not yet been sent, the server may temporarily not send the third piece of data to the voice recognition service device, and updates the cache file shown in fig. 6 to obtain the cache file shown in fig. 7. The update process is as follows: lastUpdateTime, waitCounter, the fragment sequence number, status and voice data are recorded in the cache file.
Referring to fig. 8, the third piece of voice fragment data of the voice session received by the server is the fourth piece of data. Because the third piece of data has not yet been sent, the server may temporarily not send the fourth piece of data to the voice recognition service device, and updates the cache file shown in fig. 7 to obtain the cache file shown in fig. 8. The update process is as follows: lastUpdateTime, waitCounter, the fragment sequence number, status and voice data are recorded in the cache file.
Referring to fig. 9, the fourth piece of voice fragment data of the voice session received by the server is the sixth piece of data. Because the fifth piece of data has not yet been sent, the server may temporarily not send the sixth piece of data to the voice recognition service device, and updates the cache file shown in fig. 8 to obtain the cache file shown in fig. 9. The update process is as follows: lastUpdateTime, waitCounter, the fragment sequence number, status and voice data are recorded in the cache file.
After obtaining the cache file shown in fig. 9, the server has received three pieces of voice fragment data (the third, fourth and sixth pieces) that are ordered after the second piece of data, which has still not been received. Since this number reaches the preset number threshold, the server marks the second piece of data as ignored and sends the third and fourth pieces, which follow the second piece, to the voice recognition service device. Because the fifth piece of data has neither been received nor been marked as ignored, the sixth piece is temporarily not sent. The cache file is then updated to obtain the cache file shown in fig. 10.
Referring to fig. 11, the fifth piece of voice fragment data of the voice session received by the server is the fifth piece of data. Since the fourth piece of data has already been sent, the server may send the fifth and sixth pieces of data to the voice recognition service device, and updates the cache file shown in fig. 10 to obtain the cache file shown in fig. 11. The update process is as follows: lastUpdateTime, waitCounter, the fragment sequence number, status and voice data are recorded in the cache file.
Referring to fig. 12, the sixth piece of voice fragment data of the voice session received by the server is the second piece of data. Because the second piece of data has already been marked as ignored, the server may discard it directly without updating the cache file.
Referring to fig. 13, the seventh piece of voice fragment data of the voice session received by the server is the seventh piece of data, i.e. the last piece of data. Since the sixth piece of data has already been sent, the server may send the seventh piece of data to the voice recognition service device, and updates the cache file shown in fig. 12 to obtain the cache file shown in fig. 13. The update process is as follows: lastUpdateTime, waitCounter, the fragment sequence number, status and voice data are recorded in the cache file.
In some embodiments, the voice recognition service device may perform real-time recognition on the received voice fragment data and feed the recognition result back to the server. After receiving the recognition result, the server parses and responds to it, finally generates a response result, and feeds the response result back to the display device.
As can be seen from the foregoing embodiments, in the embodiments of the present application, after voice fragment data is received, if the previous piece of voice fragment data has not yet been sent to the voice recognition service device, the current piece is temporarily not sent until the server has sent the previous piece, or the previous piece is marked as ignored, after which the current piece is sent to the voice recognition service device. This ensures the time sequence of the voice fragment data received by the voice recognition service device, which is beneficial to improving the accuracy of voice recognition by the voice recognition service device, and is in turn beneficial to improving the response accuracy of voice interaction and improving the voice interaction experience.
Since the foregoing embodiments are described with reference to one another, the same or similar parts between the embodiments in this specification may be referred to mutually and are not described in detail herein.
It should be noted that in this specification, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a circuit structure, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such circuit structure, article, or apparatus. Without further limitation, the statement "comprises" or "comprising" a … … "does not exclude the presence of other identical elements in a circuit structure, article or apparatus that comprises the element.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure of the application herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
The above embodiments of the present application do not limit the scope of the present application.
Claims (10)
1. A server, wherein the server is configured to:
receiving voice fragment data from a display device;
if the voice slicing data is not the last piece of data of the voice session, according to the fact that a server does not send the last piece of data of the voice slicing data to voice recognition service equipment, the voice slicing data is temporarily not sent until the server has sent the last piece of data or the last piece of data is marked as an ignore state, and then the voice slicing data is sent to the voice recognition service equipment, wherein for the voice slicing data which are not received, if the number of the voice slicing data which are sequenced after the voice slicing data are received reaches a preset number threshold value, the voice slicing data which are not received are marked as ignore states, and the ignore states indicate that the voice slicing data are not sent to the voice recognition service equipment any more;
and if the voice fragment data is the last piece of data, transmitting the voice fragment data which has not been transmitted and is not marked as the ignore state to the voice recognition service equipment when the non-ignored voice fragment data before the last piece of data has been transmitted or the waiting time exceeds a preset time threshold.
2. The server of claim 1, wherein the server is further configured to:
and for the unreceived voice slicing data, if the number of received voice slicing data sequenced after it reaches a preset number threshold, marking the unreceived voice slicing data as the ignore state.
3. The server of claim 1, wherein the server is further configured to:
for the voice slicing data which has not been received and is not marked as the ignore state, the closer its sequence is to the last piece of data, the larger the preset waiting time corresponding to the voice slicing data, and the preset time threshold is the maximum value among all the preset waiting times.
4. The server of claim 1, wherein the voice chunk data comprises voice data and a chunk parameter, the chunk parameter comprising a chunk sequence number, the chunk sequence number being used to determine an order of the voice chunk data in a voice session.
5. The server of claim 4, wherein the chunk sequence number comprises an array, the array comprising two values, a first value of the array representing an order of the voice chunk data in a voice session, and a second value of the array representing an order of a next chunk data of the voice chunk data in the voice session.
6. The server of claim 4, wherein the chunk sequence number indicates the order of the voice chunk data in a voice session, and wherein the chunk parameter of the last piece of data further comprises an end identifier of the voice session.
7. The server of claim 1, wherein the server is further configured to:
starting from the first piece of data of the voice conversation, detecting whether the difference value between the time interval of the last received voice fragment data and the current time is larger than a preset timeout threshold, and if so, determining that the voice conversation is ended.
8. The server according to claim 7, wherein the detecting whether a difference between a time interval of last received voice-slice data and a current time is greater than a preset timeout threshold comprises: detecting whether the difference value between the time interval of the last received voice fragment data and the current time is larger than a preset timeout threshold value or not at intervals of preset periods.
9. The server of claim 1, wherein the server is further configured to:
after the voice slicing data are received, the voice slicing data are stored in a cache file corresponding to the voice session.
10. A method of speech recognition, comprising:
the display device sends the voice fragment data to the server;
the server receives the voice slicing data from the display device;
if the voice slicing data is not the last piece of data of the voice session, the server temporarily does not send the voice slicing data according to the last piece of data of the voice slicing data which is not sent to the voice recognition service equipment until the server has sent the last piece of data or the last piece of data is marked as an ignore state, and then sends the voice slicing data to the voice recognition service equipment, wherein for the voice slicing data which is not received, if the number of the voice slicing data which is sequenced after the voice slicing data is received reaches a preset number threshold value, marking the voice slicing data which is not received as an ignore state, and the ignore state indicates that the voice slicing data is not sent to the voice recognition service equipment any more;
if the voice fragment data is the last piece of data, the server sends the received voice fragment data which has not been sent and is not marked as the ignore state to the voice recognition service equipment when the non-ignored voice fragment data before the last piece of data has been sent or the waiting time exceeds a preset time threshold;
The voice recognition service equipment performs real-time voice recognition according to the received voice fragment data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110335864.9A CN113053380B (en) | 2021-03-29 | 2021-03-29 | Server and voice recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113053380A CN113053380A (en) | 2021-06-29 |
CN113053380B true CN113053380B (en) | 2023-12-01 |
Family
ID=76516139
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110335864.9A Active CN113053380B (en) | 2021-03-29 | 2021-03-29 | Server and voice recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113053380B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114373449A (en) * | 2022-01-18 | 2022-04-19 | 海信电子科技(武汉)有限公司 | Intelligent device, server and voice interaction method |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101510886A (en) * | 2009-03-09 | 2009-08-19 | 华为技术有限公司 | Method and apparatus for processing message of MP group |
CN101510815A (en) * | 2008-12-31 | 2009-08-19 | 成都市华为赛门铁克科技有限公司 | Method, apparatus and system for processing slicing message |
CN102868635A (en) * | 2012-08-24 | 2013-01-09 | 汉柏科技有限公司 | Multi-core and multi-thread method and system for preserving order of messages |
CN104616652A (en) * | 2015-01-13 | 2015-05-13 | 小米科技有限责任公司 | Voice transmission method and device |
WO2016129188A1 (en) * | 2015-02-10 | 2016-08-18 | Necソリューションイノベータ株式会社 | Speech recognition processing device, speech recognition processing method, and program |
CN107404446A (en) * | 2016-05-19 | 2017-11-28 | 中兴通讯股份有限公司 | A kind of method and device for handling fragment message |
JP2018049058A (en) * | 2016-09-20 | 2018-03-29 | 株式会社東芝 | Speech processing system, speech recognition server and relay processing device group applied to speech processing system, speech processing method applied to relay processing device group, speech conversion method applied to speech recognition server, and program |
CN108683635A (en) * | 2018-04-12 | 2018-10-19 | 国家计算机网络与信息安全管理中心 | A kind of system and method for realizing the homologous chummage of IP fragmentation packet based on network processes chip |
CN110324303A (en) * | 2018-03-30 | 2019-10-11 | 英特尔公司 | The data slot of internet of things equipment recombinates |
CN110971352A (en) * | 2018-09-30 | 2020-04-07 | 大唐移动通信设备有限公司 | HARQ retransmission processing method and device for uplink enhanced RLC (radio link control) fragments |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060029102A1 (en) * | 2004-08-03 | 2006-02-09 | Fujitsu Limited | Processing method of fragmented packet |
US10276177B2 (en) * | 2016-10-01 | 2019-04-30 | Intel Corporation | Technologies for privately processing voice data using a repositioned reordered fragmentation of the voice data |
2021-03-29 CN CN202110335864.9A patent/CN113053380B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN113053380A (en) | 2021-06-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |