CN113053380A - Server and voice recognition method - Google Patents

Server and voice recognition method

Info

Publication number
CN113053380A
CN113053380A
Authority
CN
China
Prior art keywords
voice
data
fragment data
server
voice fragment
Prior art date
Legal status
Granted
Application number
CN202110335864.9A
Other languages
Chinese (zh)
Other versions
CN113053380B (en)
Inventor
胡帆
雷将
徐侃
Current Assignee
Hisense Electronic Technology Wuhan Co ltd
Original Assignee
Hisense Electronic Technology Wuhan Co ltd
Priority date
Filing date
Publication date
Application filed by Hisense Electronic Technology Wuhan Co ltd
Priority to CN202110335864.9A
Publication of CN113053380A
Application granted
Publication of CN113053380B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41 Structure of client; Structure of client peripherals
    • H04N 21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N 21/42204 User interfaces specially adapted for controlling a client device through a remote control device; Remote control devices therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 Feedback of the input speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 2021/02082 Noise filtering the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the present application provide a server and a voice recognition method. The server is configured to: receive voice fragment data from a display device; if the voice fragment data is not the last piece of data of a voice session and the server has not yet sent the previous piece of data to a voice recognition service device, withhold the voice fragment data until the previous piece has been sent or has been marked as ignored, and then send the voice fragment data to the voice recognition service device; if the voice fragment data is the last piece of data, send the received voice fragment data that has not been sent and is not marked as ignored to the voice recognition service device once all non-ignored voice fragment data preceding the last piece has been sent or the waiting time exceeds a preset time threshold. The method and the device address the technical problem of low voice recognition accuracy.

Description

Server and voice recognition method
Technical Field
The present application relates to the field of voice interaction technologies, and in particular, to a server and a voice recognition method.
Background
With the rapid development of artificial intelligence in the field of display devices, more and more display devices, such as smart televisions, support a voice control function: a user can input a voice session to the television, and the television performs semantic recognition on the voice session through a semantic engine to obtain the user's intent and then responds accordingly. In some scenarios, the voice session input by the user is a long utterance; if the television waits until the whole utterance has been received before the semantic engine performs semantic recognition, the response takes a long time. In the related art, to improve recognition speed, the television sends the user's voice to the semantic engine in fragments, in real time, while the user is still inputting the voice session, which improves the response speed of voice interaction. However, due to network fluctuation and other causes, the order in which the semantic engine receives the voice fragment data may differ from the order in which the television sent it, which reduces the accuracy of voice recognition and degrades the user experience.
Disclosure of Invention
In order to solve the technical problem of low voice recognition accuracy in voice interaction, the present application provides a server and a voice recognition method.
In a first aspect, the present application provides a server configured to:
receiving voice fragment data from a display device;
if the voice fragment data is not the last piece of data of a voice session, and the server has not yet sent the previous piece of data of the voice fragment data to a voice recognition service device, withholding the voice fragment data until the server has sent the previous piece of data or the previous piece of data has been marked as ignored, and then sending the voice fragment data to the voice recognition service device;
if the voice fragment data is the last piece of data, sending the received voice fragment data that has not been sent and is not marked as ignored to the voice recognition service device when all non-ignored voice fragment data preceding the last piece of data has been sent or the waiting time exceeds a preset time threshold.
In some embodiments, for voice fragment data that has not been received, if the number of received voice fragment data pieces ordered after it reaches a preset number threshold, the not-yet-received voice fragment data is marked as ignored.
In some embodiments, the server is further configured to:
for voice fragment data that has not been received and is not marked as ignored, the closer the piece is, in order, to the last piece of data, the longer its corresponding preset waiting time; the preset time threshold is the maximum of all the preset waiting times.
In some embodiments, the voice fragment data includes voice data and fragment parameters, and the fragment parameters include a fragment sequence number used to determine an order of the voice fragment data in a voice session.
In some embodiments, the fragment sequence number comprises an array, the array comprising two values, a first value of the array representing an order of the voice fragment data in a voice session, and a second value of the array representing an order of a next fragment data of the voice fragment data in the voice session.
In a second aspect, the present application provides a speech recognition method, comprising:
a display device sends voice fragment data to a server;
the server receives the voice fragment data from the display device;
if the voice fragment data is not the last piece of data of a voice session, and the server has not yet sent the previous piece of data of the voice fragment data to a voice recognition service device, the server withholds the voice fragment data until it has sent the previous piece of data or the previous piece of data has been marked as ignored, and then sends the voice fragment data to the voice recognition service device;
if the voice fragment data is the last piece of data, the server sends the received voice fragment data that has not been sent and is not marked as ignored to the voice recognition service device when all non-ignored voice fragment data preceding the last piece of data has been sent or the waiting time exceeds a preset time threshold;
the voice recognition service device performs real-time voice recognition according to the received voice fragment data.
The server and the voice recognition method provided by the present application have the following beneficial effects:
After receiving a piece of voice fragment data, if the previous piece of data has not yet been sent, the server does not send the received piece to the voice recognition service device until the server has sent the previous piece or the previous piece has been marked as ignored. This preserves the time order of the voice fragment data received by the voice recognition service device, improves the accuracy of its voice recognition, improves the response accuracy of voice interaction, and thereby improves the voice interaction experience.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below. It is obvious that, for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to some embodiments;
Fig. 2 is a block diagram of the hardware configuration of the control apparatus 100 according to some embodiments;
Fig. 3 is a block diagram of the hardware configuration of the display device 200 according to some embodiments;
Fig. 4 is a schematic diagram of the software configuration in the display device 200 according to some embodiments;
Fig. 5 is a schematic diagram of a voice recognition network architecture according to some embodiments;
Figs. 6 to 13 are schematic diagrams of cache files according to some embodiments.
Detailed Description
To make the purpose and embodiments of the present application clearer, the following will clearly and completely describe the exemplary embodiments of the present application with reference to the attached drawings in the exemplary embodiments of the present application, and it is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, a user may operate the display apparatus 200 through the smart device 300 or the control device 100.
In some embodiments, the control apparatus 100 may be a remote controller, which communicates with the display device through infrared protocol communication, Bluetooth protocol communication, or other short-range communication methods, and controls the display device 200 in a wireless or other wired manner. The user may input user instructions through keys on the remote controller, voice input, control panel input, etc., to control the display device 200.
In some embodiments, the smart device 300 (e.g., mobile terminal, tablet, computer, laptop, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application program running on the smart device.
In some embodiments, the display device 200 may also be controlled in a manner other than the control apparatus 100 and the smart device 300, for example, the voice command control of the user may be directly received by a module configured inside the display device 200 to obtain a voice command, or may be received by a voice control device provided outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be allowed to be communicatively connected through a Local Area Network (LAN), a Wireless Local Area Network (WLAN), and other networks. The server 400 may provide various contents and interactions to the display apparatus 200. The server 400 may be a cluster or a plurality of clusters, and may include one or more types of servers.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 according to an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction from a user and convert the operation instruction into an instruction recognizable and responsive by the display device 200, serving as an interaction intermediary between the user and the display device 200.
Fig. 3 shows a hardware configuration block diagram of the display apparatus 200 according to an exemplary embodiment.
In some embodiments, the display apparatus 200 includes at least one of a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, a user interface.
In some embodiments the controller comprises a processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, a first interface to an nth interface for input/output.
In some embodiments, the display 260 includes a display screen component for presenting pictures and a driving component for driving image display, and is used to receive image signals output by the controller and to display video content, image content, menu manipulation interfaces, and user manipulation UI interfaces.
In some embodiments, the display 260 may be a liquid crystal display, an OLED display, and a projection display, and may also be a projection device and a projection screen.
In some embodiments, communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, and other network communication protocol chips or near field communication protocol chips, and an infrared receiver. The display apparatus 200 may establish transmission and reception of control signals and data signals with the external control apparatus 100 or the server 400 through the communicator 220.
In some embodiments, the user interface may be used to receive control signals from the control apparatus 100 (e.g., an infrared remote control).
In some embodiments, the detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for collecting ambient light intensity; alternatively, the detector 230 includes an image collector, such as a camera, which may be used to collect external environment scenes, attributes of the user, or user interaction gestures, or the detector 230 includes a sound collector, such as a microphone, which is used to receive external sounds.
In some embodiments, the external device interface 240 may include, but is not limited to, the following: high Definition Multimedia Interface (HDMI), analog or data high definition component input interface (component), composite video input interface (CVBS), USB input interface (USB), RGB port, and the like. The interface may be a composite input/output interface formed by the plurality of interfaces.
In some embodiments, the tuner-demodulator 210 receives broadcast television signals by wired or wireless means and demodulates audio/video signals and EPG data signals from a plurality of wireless or wired broadcast television signals.
In some embodiments, the controller 250 and the tuner-demodulator 210 may be located in different separate devices; that is, the tuner-demodulator 210 may be located in a device external to the main device in which the controller 250 is located, such as an external set-top box.
In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored in memory. The controller 250 controls the overall operation of the display apparatus 200. For example: in response to receiving a user command for selecting a UI object to be displayed on the display 260, the controller 250 may perform an operation related to the object selected by the user command.
In some embodiments, the object may be any one of selectable objects, such as a hyperlink, an icon, or other actionable control. The operations related to the selected object are: displaying an operation connected to a hyperlink page, document, image, or the like, or performing an operation of a program corresponding to the icon.
In some embodiments the controller comprises at least one of a Central Processing Unit (CPU), a video processor, an audio processor, a Graphics Processing Unit (GPU), a RAM Random Access Memory (RAM), a ROM (Read-Only Memory), a first to nth interface for input/output, a communication Bus (Bus), and the like.
The CPU processor is used to execute operating system and application program instructions stored in the memory, and to execute various applications, data, and content according to various interactive instructions received from external input, so as to finally display and play various audio and video content. The CPU processor may include a plurality of processors, for example a main processor and one or more sub-processors.
In some embodiments, a graphics processor for generating various graphics objects, such as: icons, operation menus, user input instruction display graphics, and the like. The graphic processor comprises an arithmetic unit, which performs operation by receiving various interactive instructions input by a user and displays various objects according to display attributes; the system also comprises a renderer for rendering various objects obtained based on the arithmetic unit, wherein the rendered objects are used for being displayed on a display.
In some embodiments, the video processor is configured to receive an external video signal and perform video processing such as decompression, decoding, scaling, noise reduction, frame rate conversion, resolution conversion, and image synthesis according to the standard codec protocol of the input signal, so as to obtain a signal that can be directly displayed or played on the display device 200.
In some embodiments, the video processor includes a demultiplexing module, a video decoding module, an image synthesis module, a frame rate conversion module, a display formatting module, and the like. The demultiplexing module is used for demultiplexing the input audio and video data stream. And the video decoding module is used for processing the video signal after demultiplexing, including decoding, scaling and the like. And the image synthesis module is used for carrying out superposition mixing processing on the GUI signal input by the user or generated by the user and the video image after the zooming processing by the graphic generator so as to generate an image signal for display. And the frame rate conversion module is used for converting the frame rate of the input video. And the display formatting module is used for converting the received video output signal after the frame rate conversion, and changing the signal to be in accordance with the signal of the display format, such as an output RGB data signal.
In some embodiments, the audio processor is configured to receive an external audio signal, decompress and decode the received audio signal according to a standard codec protocol of the input signal, and perform noise reduction, digital-to-analog conversion, and amplification processing to obtain an audio signal that can be played in the speaker.
In some embodiments, a user may enter user commands on a Graphical User Interface (GUI) displayed on display 260, and the user input interface receives the user input commands through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
In some embodiments, a "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables conversion between an internal form of information and a form that is acceptable to the user. A commonly used presentation form of the User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operations and displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in the display screen of the electronic device, where the control may include a visual interface element such as an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc.
In some embodiments, a system of a display device may include a Kernel (Kernel), a command parser (shell), a file system, and an application program. The kernel, shell, and file system together make up the basic operating system structure that allows users to manage files, run programs, and use the system. After power-on, the kernel is started, kernel space is activated, hardware is abstracted, hardware parameters are initialized, and virtual memory, a scheduler, signals and interprocess communication (IPC) are operated and maintained. And after the kernel is started, loading the Shell and the user application program. The application program is compiled into machine code after being started, and a process is formed.
As shown in fig. 4, the system of the display device is divided into three layers, i.e., an application layer, a middleware layer and a hardware layer from top to bottom.
The Application layer mainly includes common applications on the television and an Application Framework (Application Framework), wherein the common applications are mainly applications developed based on the Browser, such as: HTML5 APPs; and native applications (native apps);
an Application Framework (Application Framework) is a complete program model, and has all basic functions required by standard Application software, such as: file access, data exchange, and interfaces to use these functions (toolbars, status lists, menus, dialog boxes).
Native APPs (Native APPs) may support online or offline, message push, or local resource access.
The middleware layer comprises various television protocols, multimedia protocols, system components and other middleware. The middleware can use basic service (function) provided by system software to connect each part of an application system or different applications on a network, and can achieve the purposes of resource sharing and function sharing.
The hardware layer mainly comprises the HAL interface, the hardware, and the drivers. The HAL interface is a unified interface for adapting all television chips, and the specific logic is implemented by each chip. The drivers mainly include: audio driver, display driver, Bluetooth driver, camera driver, Wi-Fi driver, USB driver, HDMI driver, sensor drivers (such as fingerprint, temperature, and pressure sensors), power driver, etc.
The hardware or software architecture in some embodiments may be based on the description in the above embodiments, and in some embodiments may be based on other hardware or software architectures that are similar to the above embodiments, and it is sufficient to implement the technical solution of the present application.
For clarity of explanation of the embodiments of the present application, a speech recognition network architecture provided by the embodiments of the present application is described below with reference to fig. 5.
Referring to fig. 5, fig. 5 is a schematic diagram of a voice recognition network architecture according to an embodiment of the present application. In fig. 5, the smart device is configured to receive input information and output a processing result of that information. The voice recognition service device is an electronic device on which a voice recognition service is deployed, the semantic service device is an electronic device on which a semantic service is deployed, and the business service device is an electronic device on which a business service is deployed. Such an electronic device may include a server, a computer, and the like. The voice recognition service, the semantic service (also referred to as a semantic engine), and the business service are web services that can be deployed on these electronic devices; the voice recognition service is used to recognize audio as text, the semantic service is used for semantic parsing of the text, and the business service is used to provide specific services, such as a weather query service (e.g., Moji Weather) or a music query service (e.g., QQ Music). In one embodiment, the architecture shown in fig. 5 may contain multiple entity service devices on which different business services are deployed, and one or more functional services may also be aggregated in one or more entity service devices.
In some embodiments, based on the architecture shown in fig. 5, the following describes an example of processing information input to the smart device, taking a query statement input by voice as an example. The process may include the following three stages:
[ Speech recognition ]
The intelligent device can upload the audio of the query sentence to the voice recognition service device after receiving the query sentence input by voice, so that the voice recognition service device can recognize the audio as a text through the voice recognition service and then return the text to the intelligent device. In one embodiment, before uploading the audio of the query statement to the speech recognition service device, the smart device may perform denoising processing on the audio of the query statement, where the denoising processing may include removing echo and environmental noise.
[ semantic understanding ]
The intelligent device uploads the text of the query sentence identified by the voice identification service to the semantic service device, and the semantic service device performs semantic analysis on the text through semantic service to obtain the service field, intention and the like of the text.
[ semantic response ]
And the semantic service equipment issues a query instruction to corresponding business service equipment according to the semantic analysis result of the text of the query statement so as to obtain the query result given by the business service. The intelligent device can obtain the query result from the semantic service device and output the query result. As an embodiment, the semantic service device may further send a semantic parsing result of the query statement to the intelligent device, so that the intelligent device outputs a feedback statement in the semantic parsing result.
It should be noted that the architecture shown in fig. 5 is only an example, and is not intended to limit the scope of the present application. In the embodiment of the present application, other architectures may also be adopted to implement similar functions, for example: all or part of the three processes can be completed by the intelligent terminal, and are not described herein.
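Purely as an illustration of the three stages above, the following minimal Python sketch chains them together; every function here is a hypothetical stand-in rather than an API of this application or of any of the service devices.
```python
def recognize_audio(audio: bytes) -> str:
    """Stand-in for the voice recognition service: audio in, text out."""
    return "how is the weather tomorrow"

def parse_semantics(text: str) -> dict:
    """Stand-in for the semantic service: text in, business field and intent out."""
    return {"domain": "weather", "intent": "query", "text": text}

def query_business_service(semantics: dict) -> str:
    """Stand-in for a business service, e.g. a weather query service."""
    return "Sunny tomorrow"

def handle_voice_query(audio: bytes) -> str:
    text = recognize_audio(audio)              # [Speech recognition]
    semantics = parse_semantics(text)          # [Semantic understanding]
    return query_business_service(semantics)   # [Semantic response]

print(handle_voice_query(b"...audio..."))
```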
In some embodiments, the smart device shown in fig. 5 may be a display device, such as a smart television, and the display device may first send the collected user voice to a server of the display device, and then the server of the display device sends the user voice to a voice recognition service device for voice recognition.
In some embodiments, the smart device shown in fig. 5 may also be other devices that support voice interaction, such as a smart speaker, a smart phone, and so on.
In some embodiments, a single query statement or other interactive statement that the user inputs to the display device may be referred to as a voice session. Voice interaction scenarios may be subdivided into question-and-answer scenarios and chat scenarios. In a question-and-answer scenario, the user inputs one voice session to the display device, and the display device exits the voice interaction interface after responding. In a chat scenario, the display device does not exit the voice interaction interface after responding but keeps collecting sound; the user may input a new voice session, and the display device responds to each new voice session until a termination condition is reached and then exits the voice interaction interface. For example, the termination condition may be that the user does not input a new voice session within a specified time.
In some embodiments, the voice session input by the user may be a long stretch of speech. If the display device only starts sending the voice to the server after the user has finished speaking, the transmission itself takes a long time, which makes the response time of voice interaction long. To shorten the response time, while the user is inputting the voice session the display device can upload, at fixed intervals, the voice data received during each interval to the server, so that one voice session is split into multiple pieces of data, and the server sends each piece of data to the voice recognition service device as it is received. By the time the user finishes the voice session, the display device has already sent most of the data of the voice session to the voice recognition service device via the server, and only a small remainder still needs to be sent, so the response time of voice interaction is shortened.
However, due to network fluctuation and other causes, the voice fragment data uploaded by the display device may not reach the server as continuous, in-order data: a piece sent later by the display device may arrive at the server first. As a result, the voice fragment data the server forwards to the voice recognition service device may also be discontinuous, which reduces the recognition accuracy of the voice recognition service device for the voice session and makes the voice interaction experience poor.
To solve this technical problem, in some embodiments, after the server receives a piece of voice fragment data, if that piece is not the first piece of data of the voice session and the previous piece has not been received, the server does not send this piece immediately. If the previous piece arrives within a period of waiting, the server sends the previous piece and this piece to the voice recognition service device; if the previous piece never arrives, this piece can still eventually be sent to the voice recognition service device. In this way the voice recognition service device receives voice fragment data that is as continuous as possible; because the server sends it more continuous voice fragment data, the voice recognition accuracy can be improved and the voice interaction experience improved.
In the following, the technical scheme for improving the accuracy of speech recognition is described in detail by taking an example that a user sends a speech session to a display device.
In some embodiments, the method for improving the accuracy of speech recognition can be divided into two stages, the first stage occurs at the display device side and includes the receiving and processing of the speech session, and the second stage occurs at the server side and includes the processing and sending of the speech session.
Illustratively, the first stage process is as follows:
in some embodiments, a voice control button may be disposed on the remote controller of the display device, and after the user presses the voice control button on the remote controller, the controller of the display device may control the display of the display device to display the voice interaction interface and control the sound collector, such as a microphone, to collect sound around the display device. At this time, the user may input a voice session to the display device. During the user input of the voice session, the display device may send the voice session to the server in a fragmented form.
In some embodiments, the display device may support a voice wake-up function, with its sound collector continuously collecting sound. After the user speaks the wake-up word, the display device performs voice recognition on the input voice session; upon recognizing the wake-up word, it controls the display of the display device to show the voice interaction interface. A voice session whose content is only the wake-up word is not sent to the server; after the user then inputs a new voice session, the display device sends the new voice session to the server, and during the time the user is inputting the new voice session, the display device may send it to the server in fragmented form.
In some embodiments, after a user inputs a voice session, in a process that the display device acquires response data of the voice session or the display device responds according to the response data, the sound collector of the display device may maintain a sound collection state, and the user may press a voice control button on the remote controller at any time to re-input the voice session or speak a wakeup word, at this time, the display device may end a previous voice interaction process, and open a new voice interaction process according to a newly input voice session of the user, thereby ensuring real-time performance of voice interaction.
Taking the case where the user enters the voice interaction interface through the remote controller as an example, in some embodiments the user may press the voice control key on the remote controller, speak, and release the key after finishing speaking. In response to the voice control key being pressed, the display device controls the sound collector to collect sound, stores the collected sound, and generates a piece of voice fragment data at fixed intervals, for example every 300 ms. The voice fragment data includes voice data and fragment parameters, and the fragment parameters may include a session identifier and a fragment sequence number. The session identifier is used to distinguish different voice sessions: different voice sessions carry different session identifiers, and the display device generates a new session identifier upon receiving the signal that the voice control key has been pressed. The fragment sequence number may be an array containing two values: the first value represents the order of the voice fragment data in the voice session, and the second value represents the order of the next piece of data of the voice fragment data in the voice session. For example, a fragment sequence number of 1-2 indicates that the current voice fragment data is the first piece of data and that the voice session has a next piece; a fragment sequence number of 2-2 indicates that the current voice fragment data is the second piece of data and that the voice session has no next piece, i.e., the current voice fragment data is the last piece of data of the voice session. When the user releases the voice control key, the display device determines that the current voice session has ended according to the key-release signal, and sets the two values of the fragment sequence number of the last piece of voice fragment data to be equal.
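The following is a minimal Python sketch, given only for illustration, of how a display device could assemble such voice fragment data; the 300 ms interval and the two-value sequence number follow the example above, while the function and field names (make_fragment, session_id, fragment_seq) are hypothetical.
```python
import uuid

FRAGMENT_INTERVAL_MS = 300  # generate one piece of voice fragment data every 300 ms (example value)

def make_fragment(session_id: str, index: int, voice_data: bytes, is_last: bool) -> dict:
    """Assemble one piece of voice fragment data (voice data plus fragment parameters).

    The fragment sequence number is an array of two values: the first is the order of this
    piece in the voice session, the second is the order of the next piece; for the last
    piece both values are equal (e.g. 2-2)."""
    next_index = index if is_last else index + 1
    return {
        "session_id": session_id,             # distinguishes different voice sessions
        "fragment_seq": [index, next_index],  # e.g. [1, 2]: first piece, a next piece exists
        "voice_data": voice_data,
    }

# Example: a voice session sliced into three pieces after the voice control key is pressed.
session_id = str(uuid.uuid4())                # a new session identifier for each voice session
pieces = [
    make_fragment(session_id, 1, b"audio 0-300ms", is_last=False),
    make_fragment(session_id, 2, b"audio 300-600ms", is_last=False),
    make_fragment(session_id, 3, b"audio 600-900ms", is_last=True),
]
print([p["fragment_seq"] for p in pieces])    # [[1, 2], [2, 3], [3, 3]]
```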
In some embodiments, the fragment sequence number may include only one value, indicating the order of the voice fragment data in the voice session. For example, a fragment sequence number of 1 indicates the first piece of data, and a fragment sequence number of 2 indicates the second piece. When the user releases the voice control key, the display device determines that the current voice session has ended according to the key-release signal, generates an end identifier, and writes the end identifier into the fragment parameters. If the fragment parameters of a piece of voice fragment data contain the end identifier, that piece is the last piece of data of the current voice session; otherwise it is not.
Taking the fragment sequence number being an array as an example, at each interval the display device intercepts the voice data collected during that period, generates the fragment parameters, assembles the voice data and fragment parameters into voice fragment data, and sends the voice fragment data to the server for the second-stage processing.
Illustratively, the second stage process is as follows:
in some embodiments, after receiving voice fragment data, the server may extract the session identifier from it and then detect whether a cache file corresponding to that session identifier already exists. If no such cache file exists, this is the first time the server has received voice fragment data of this voice session. The server may create a cache file corresponding to the session identifier and then set session state parameters corresponding to the session identifier. The session state parameters include waitCounter (waiting count) and lastUpdateTime (last update time), where waitCounter denotes the number of voice fragment data pieces that the server has received but has neither sent to the voice recognition service device nor marked as ignored, and lastUpdateTime denotes the time at which the server last received a piece of voice fragment data not marked as ignored; for example, lastUpdateTime may be a timestamp. The server may update the session state parameters each time it receives a piece of voice fragment data. The server stores the voice data of the voice fragment data, together with the session state parameters, in the cache file. The stored voice data may be labeled with its fragment sequence number and a state, which may be one of not-received, to-be-sent, sent, and ignored, where "ignored" means the voice data will no longer be sent to the voice recognition service device.
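As a rough illustration only, the cache file and session state parameters described above could be represented by a structure like the following sketch; the fields mirror waitCounter, lastUpdateTime, and the four fragment states named in this paragraph, while everything else (class and method names) is a hypothetical reconstruction.
```python
import time
from dataclasses import dataclass, field
from enum import Enum

class FragmentState(Enum):
    NOT_RECEIVED = "not received"
    TO_BE_SENT = "to be sent"
    SENT = "sent"
    IGNORED = "ignored"   # will no longer be sent to the voice recognition service device

@dataclass
class CacheFile:
    session_id: str
    wait_counter: int = 0          # waitCounter: received, not yet sent, not marked as ignored
    last_update_time: float = 0.0  # lastUpdateTime: last time a non-ignored piece was received
    fragments: dict = field(default_factory=dict)  # fragment sequence number -> record

    def store(self, seq: int, voice_data: bytes) -> None:
        """Record a newly received piece and refresh the session state parameters."""
        self.fragments[seq] = {"state": FragmentState.TO_BE_SENT, "voice_data": voice_data}
        self.wait_counter = sum(1 for f in self.fragments.values()
                                if f["state"] == FragmentState.TO_BE_SENT)
        self.last_update_time = time.time()
```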
According to the fragment sequence number of the voice fragment data, the voice fragment data received by the server at a single time may be a first piece of data, may be a middle piece of data, and may be a last piece of data. From the first time of receiving voice fragment data of a voice session, the server may perform different processing according to the fragment sequence number in the voice fragment data received each time.
The following describes processing of voice fragment data by a server by taking the first time voice fragment data is received by the server and the second time voice fragment data is received by the server as an example.
In some embodiments, when the network is in good condition, for the voice session corresponding to a session identifier, the voice fragment data received by the server for the first time is generally the first piece of data of that voice session. After creating the cache file, the server may send a voice recognition request to the voice recognition service device, where the voice recognition request includes an instruction to open a voice session with the voice recognition service device and the voice data of the first piece. After receiving the voice recognition request, the voice recognition service device starts a voice recognition process according to the instruction to open the voice session, and performs real-time voice recognition on the voice data sent by the server through this process.
In some embodiments, due to network fluctuation and the like, for a voice session corresponding to a session identifier, the voice fragment data received by the server for the first time may not be the first piece of data of the voice session but may instead be the second piece, the third piece, and so on.
In some embodiments, for a voice session corresponding to one session identifier, suppose the voice fragment data received by the server for the first time is the first piece of data of the voice session, so that the server has already sent the first piece to the voice recognition service device. If the voice fragment data received for the second time is the second piece of data of the voice session, the server can send the voice data of the second piece to the voice recognition service device and update the cache file; if the piece received for the second time is not the second piece of data but the third piece, the fourth piece, etc., the server may temporarily withhold the voice data of that piece from the voice recognition service device and update the cache file.
In some embodiments, for a voice session corresponding to one session identifier, suppose the voice fragment data received by the server for the first time is not the first piece of data of the voice session, while the voice fragment data received for the second time is the first piece of data; the server has therefore not yet sent a voice recognition request to the voice recognition service device. At this point, if the piece received first happens to be the second piece of data, the server may send a voice recognition request to the voice recognition service device and update the cache file, the voice recognition request including an instruction to open a voice session with the voice recognition service device, the voice data of the first piece, and the voice data of the second piece. If the piece received first is not the second piece of data, the server may send a voice recognition request containing the instruction to open a voice session and the voice data of the first piece, update the cache file, and temporarily withhold the voice data of the other received piece from the voice recognition service device.
In some embodiments, for a voice session corresponding to one session identifier, if neither the voice fragment data received by the server for the first time nor that received for the second time is the first piece of data of the voice session, the server has not yet sent a voice recognition request to the voice recognition service device. In this case, the server may only update the cache file and does not send a voice recognition request to the voice recognition service device.
The following describes the server's processing of voice fragment data, taking the case where the server receives voice fragment data for the Nth time and that piece is a middle piece of data, where N is greater than or equal to 2.
In some embodiments, after receiving voice fragment data, the server determines from its fragment sequence number that the current piece is a middle piece of data and then checks whether the previous piece of data has already been sent to the voice recognition service device. If the previous piece has been sent, the server sends this voice fragment data to the voice recognition service device and updates the cache file. If the server has not yet sent the voice data of the previous piece to the voice recognition service device, it temporarily withholds the voice data of this piece and updates the cache file; only after the server has sent the voice data of the previous piece, or the previous piece meets a preset ignore criterion, does it send the voice data of this piece to the voice recognition service device and update the cache file. For example, suppose the server has received the voice fragment data with fragment sequence numbers 3 and 4, both middle pieces, but has not received the piece with sequence number 2; the server then withholds the voice data corresponding to pieces 3 and 4. If, after some time, the piece with sequence number 2 arrives, the server can send the voice data corresponding to pieces 2, 3, and 4 to the voice recognition service device.
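A minimal sketch of this "send only after the previous piece has been sent or ignored" rule is given below, assuming the cache file is simply a mapping from fragment sequence number to state; send_to_asr is a hypothetical stand-in for forwarding voice data to the voice recognition service device.
```python
def handle_middle_piece(cache: dict, seq: int, send_to_asr) -> None:
    """cache maps fragment sequence number -> state ('to_be_sent', 'sent', 'ignored')."""
    prev_state = cache.get(seq - 1)
    if seq == 1 or prev_state in ("sent", "ignored"):
        send_to_asr(seq)
        cache[seq] = "sent"
        # Sending this piece may unblock later pieces that arrived out of order.
        nxt = seq + 1
        while cache.get(nxt) == "to_be_sent":
            send_to_asr(nxt)
            cache[nxt] = "sent"
            nxt += 1
    else:
        cache[seq] = "to_be_sent"   # withhold until the previous piece is sent or ignored

# Example matching the paragraph above: pieces 3 and 4 arrive before piece 2.
cache = {1: "sent"}
sent_order = []
for seq in (3, 4, 2):
    handle_middle_piece(cache, seq, sent_order.append)
print(sent_order)   # [2, 3, 4]
```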
In some embodiments, the server may mark some voice fragment data as ignored: for a piece of voice fragment data that has not been received, if the number of received voice fragment data pieces ordered after it reaches a preset number threshold, the not-yet-received piece is marked as ignored; for example, the preset number threshold may be 3. Suppose the server has received the voice fragment data with fragment sequence numbers 6, 7, and 8, all middle pieces, but has not received the piece with sequence number 5; the piece with sequence number 5 is then marked as ignored, and even if it arrives later it will not be sent to the voice recognition service device.
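The ignore rule might be sketched as follows, under the same mapping assumption as the previous sketch; the threshold of 3 is the example value given in this paragraph.
```python
IGNORE_THRESHOLD = 3   # example value of the preset number threshold

def mark_missing_as_ignored(cache: dict, highest_seq_received: int) -> None:
    """Mark a not-yet-received piece as ignored once enough later pieces have arrived."""
    for seq in range(1, highest_seq_received + 1):
        if seq in cache:
            continue   # already received (or already marked)
        later_received = sum(1 for s, st in cache.items() if s > seq and st != "ignored")
        if later_received >= IGNORE_THRESHOLD:
            cache[seq] = "ignored"   # even if it arrives later it will not be forwarded

# Example matching the paragraph: pieces 6, 7 and 8 received, piece 5 missing.
cache = {1: "sent", 2: "sent", 3: "sent", 4: "sent",
         6: "to_be_sent", 7: "to_be_sent", 8: "to_be_sent"}
mark_missing_as_ignored(cache, highest_seq_received=8)
print(cache[5])   # ignored
```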
While the display device is sending multiple pieces of voice fragment data to the server, once the server receives the last piece of data it can conclude that the user has finished voice input, regardless of whether all pieces before the last one have been received, and it needs to obtain the voice recognition result from the voice recognition service device as soon as possible. In some scenarios, however, the last piece of data is lost in transit due to network fluctuation or other causes, so the server never receives it and cannot determine whether the user has finished inputting. To solve this problem, in some embodiments the server may be configured to start a timed task after first receiving voice fragment data for a session identifier, polling the cache file corresponding to that session identifier every 5 s to obtain lastUpdateTime and compare it with the current time. If the difference between the current time and lastUpdateTime is greater than 10 s, the request is considered to have timed out: the current cache file is deleted, an instruction to end the voice session is sent to the voice recognition service device, and any voice fragment data of that session identifier received afterwards is no longer sent to the voice recognition service device. If the difference is less than 10 s, the request is considered not to have timed out: the current cache file is kept and the server continues to receive the next piece of voice fragment data. Of course, instead of starting a timed task, the server may also check lastUpdateTime in real time.
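A sketch of this timed task is shown below; the 5 s polling interval and 10 s timeout are the example values from this paragraph, and threading.Timer plus the two callback parameters are illustrative assumptions rather than the application's actual implementation.
```python
import threading
import time

POLL_INTERVAL_S = 5     # poll the cache file every 5 s (example value)
REQUEST_TIMEOUT_S = 10  # consider the request timed out after 10 s without updates (example value)

def start_timeout_watcher(cache: dict, delete_cache_file, end_voice_session) -> None:
    """Poll lastUpdateTime; on timeout, delete the cache file and end the ASR voice session."""
    def poll() -> None:
        if time.time() - cache["lastUpdateTime"] > REQUEST_TIMEOUT_S:
            delete_cache_file(cache["session_id"])
            end_voice_session(cache["session_id"])  # instruct the voice recognition service to end
            cache["timed_out"] = True               # later pieces of this session are not forwarded
        else:
            threading.Timer(POLL_INTERVAL_S, poll).start()  # not timed out: keep polling
    threading.Timer(POLL_INTERVAL_S, poll).start()
```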
The following describes the server's processing of voice fragment data, taking the case where the server receives voice fragment data for the Mth time and that piece is the last piece of data, where M is greater than or equal to 2.
In some embodiments, after receiving a piece of voice fragment data, the server updates the cache file and determines from its fragment sequence number that it is the last piece of data. From the state of each piece of voice fragment data recorded in the cache file, the server can determine whether all voice fragment data before the last piece, excluding the ignored pieces, have already been sent to the voice recognition service device. If all voice fragment data before the last piece other than the ignored pieces have been sent, the server can also send the voice data of the last piece to the voice recognition service device, together with an instruction to end the voice session. If, apart from the ignored pieces, there is still voice fragment data before the last piece that has not been sent, the server may calculate maxWaitTime (the maximum waiting time) from a preset waiting time table, use it as the preset time threshold, and continue receiving voice fragment data within the maximum waiting time.
An exemplary waiting time table for voice fragment data is shown in Table 1.
TABLE 1
Fragment sequence number difference:    1     2     3     4     5     6     7     8     9    10    11    12
Waiting time (ms):                   1200   600   300   300   300   150   150   150   150     0     0     0
In Table 1, the fragment sequence number difference indicates the difference between the fragment sequence number of the last piece of data and the fragment sequence number of a piece of voice fragment data which has not been received, and the waiting time indicates, for that fragment sequence number difference, the time threshold for which the server waits to receive that voice fragment data after receiving the last piece of data.
According to Table 1, when the fragment sequence number difference is 1 the waiting time is 1200ms. This indicates that after receiving the last piece of data, if the penultimate piece of data has not been received, the server may wait for it for at most 1200ms. If the penultimate piece of data is still not received within 1200ms, the server stops waiting, sends the voice fragment data which has not been sent and is not marked as the ignored state to the voice recognition service device, and sends an instruction for ending the voice session to the voice recognition service device. If the penultimate piece of data is received within 1200ms, for example within 100ms, the server likewise stops waiting, sends the voice fragment data which has not been sent and is not marked as the ignored state to the voice recognition service device, and sends an instruction for ending the voice session to the voice recognition service device.
According to Table 1, if neither the penultimate piece of data nor the third-to-last piece of data has been received after the last piece of data is received, the waiting time corresponding to the penultimate piece of data is 1200ms and the waiting time corresponding to the third-to-last piece of data is 600ms, so the maximum waiting time is 1200ms. The server may wait to receive voice fragment data until the waiting time reaches 1200ms or until all fragment data not in the ignored state has been received.
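A possible way to compute maxWaitTime from Table 1 is sketched below; the table values are taken directly from Table 1, while the function and parameter names are hypothetical.

```python
# Table 1: fragment sequence number difference -> waiting time in milliseconds
WAIT_TABLE_MS = {1: 1200, 2: 600, 3: 300, 4: 300, 5: 300, 6: 150,
                 7: 150, 8: 150, 9: 150, 10: 0, 11: 0, 12: 0}

def max_wait_time_ms(missing_seqs, last_seq):
    """Return maxWaitTime: the largest table entry over all still-missing fragments."""
    waits = [WAIT_TABLE_MS.get(last_seq - seq, 0) for seq in missing_seqs]
    return max(waits, default=0)

# Example from the text: penultimate (difference 1) and third-to-last (difference 2)
# pieces are missing, so maxWaitTime is max(1200, 600) = 1200 ms
print(max_wait_time_ms(missing_seqs=[6, 5], last_seq=7))  # -> 1200
```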
In some embodiments, since the server spends some time from receiving the last piece of data to calculating maxWaitTime, and during this time the server is in effect already waiting, the server may add 20ms to CurrentWaitTime (the current waiting time) when calculating maxWaitTime in order to track the elapsed waiting time more accurately. That is, CurrentWaitTime is counted starting from 20ms, and the counting terminates when CurrentWaitTime reaches maxWaitTime, or when all pieces of data other than those in the ignored state have been received. The 20ms is only exemplary and may be set to another value, and the server may also set CurrentWaitTime to start counting from 0.
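The waiting phase itself might then look like the following sketch, where CurrentWaitTime starts from the 20ms offset mentioned above; `cache.all_non_ignored_received()` is a hypothetical helper standing in for the state check against the cache file.

```python
import time

def wait_for_missing(cache, max_wait_ms, initial_offset_ms=20, step_ms=20):
    """Wait after the last piece until maxWaitTime elapses or nothing is missing."""
    current_wait_ms = initial_offset_ms   # CurrentWaitTime starts at 20 ms
    while current_wait_ms < max_wait_ms:
        if cache.all_non_ignored_received():
            break                         # every non-ignored fragment has arrived
        time.sleep(step_ms / 1000.0)
        current_wait_ms += step_ms
    # afterwards: forward buffered, unsent, non-ignored fragments and
    # send the end-of-session instruction to the recognition service
```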
The foregoing embodiments show the processing procedure of the server after receiving voice fragment data. To introduce the processing procedure more intuitively, taking as an example a voice input by the user that comprises 7 pieces of voice fragment data, figs. 6-13 exemplarily show schematic diagrams of the cache file.
Referring to fig. 6, the voice fragment data of a voice session received by the server for the first time is the first piece of data. The server may send the first piece of data to the voice recognition service device and establish a cache file as shown in fig. 6, recording lastUpdateTime, waitCounter, the fragment sequence number, the status, and the voice data in the cache file. For voice fragment data whose status is sent, the voice data may be deleted, so that its content no longer appears in the cache file.
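One way such a cache file entry could be structured is sketched below; the field names mirror lastUpdateTime, waitCounter, the fragment sequence number, the status and the voice data shown in the figures, but the classes themselves are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class FragmentRecord:
    seq: int                      # fragment sequence number
    status: str                   # "sent", "received" or "ignored"
    voice_data: Optional[bytes]   # cleared once the fragment has been sent

@dataclass
class SessionCache:
    session_id: str
    last_update_time: float = field(default_factory=time.time)   # lastUpdateTime
    wait_counter: int = 0                                         # waitCounter
    fragments: Dict[int, FragmentRecord] = field(default_factory=dict)

    def record(self, seq: int, voice_data: bytes, status: str) -> None:
        """Record a newly received fragment and refresh lastUpdateTime."""
        kept = None if status == "sent" else voice_data   # sent data need not be kept
        self.fragments[seq] = FragmentRecord(seq, status, kept)
        self.last_update_time = time.time()

# First piece of the session: forwarded at once and recorded as sent (cf. fig. 6)
cache = SessionCache(session_id="hypothetical-session-1")
cache.record(seq=1, voice_data=b"...", status="sent")
```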
Referring to fig. 7, the voice fragment data of the voice session received by the server for the second time is the third piece of data. Because the second piece of data has not yet been sent, the server may for the moment not send the third piece of data to the voice recognition service device, and updates the cache file shown in fig. 6 to obtain the cache file shown in fig. 7, recording lastUpdateTime, waitCounter, the fragment sequence number, the status, and the voice data in the cache file.
Referring to fig. 8, the voice fragment data of the voice session received by the server for the third time is the fourth piece of data. Because the third piece of data has not yet been sent, the server may for the moment not send the fourth piece of data to the voice recognition service device, and updates the cache file shown in fig. 7 to obtain the cache file shown in fig. 8, recording lastUpdateTime, waitCounter, the fragment sequence number, the status, and the voice data in the cache file.
Referring to fig. 9, the voice fragment data of the voice session received by the server for the fourth time is the sixth piece of data. Because the fifth piece of data has not yet been sent, the server may for the moment not send the sixth piece of data to the voice recognition service device, and updates the cache file shown in fig. 8 to obtain the cache file shown in fig. 9, recording lastUpdateTime, waitCounter, the fragment sequence number, the status, and the voice data in the cache file.
After obtaining the cache file shown in fig. 9, the server finds that the second piece of data has not been received while the three pieces of data sorted after it have been received, reaching the preset number threshold, and therefore marks the second piece of data as the ignored state. Because the second piece of data is marked as the ignored state, the server then sends the third piece of data and the fourth piece of data, which follow it, to the voice recognition service device; because the fifth piece of data has neither been sent nor been marked as the ignored state, the sixth piece of data is not sent for the moment. The server then updates the cache file, thereby obtaining the cache file shown in fig. 10.
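The forwarding step just described — pieces 3 and 4 become sendable once piece 2 is ignored, while piece 6 still waits for piece 5 — can be sketched as follows; `send` is a hypothetical callback standing in for forwarding a fragment to the voice recognition service.

```python
def flush_ready(fragments, send):
    """Forward buffered fragments in order once every earlier fragment is sent or ignored.

    `fragments` maps fragment sequence number -> "sent", "received" or "ignored".
    """
    for seq in sorted(fragments):
        if fragments[seq] != "received":
            continue
        if all(fragments.get(s) in ("sent", "ignored") for s in range(1, seq)):
            send(seq)
            fragments[seq] = "sent"
        else:
            break   # a predecessor is still missing; keep the rest buffered

# State after fig. 9: piece 2 ignored, pieces 3, 4 and 6 buffered, piece 5 missing
state = {1: "sent", 2: "ignored", 3: "received", 4: "received", 6: "received"}
flush_ready(state, send=lambda seq: print(f"forward piece {seq}"))
# pieces 3 and 4 are forwarded; piece 6 keeps waiting for piece 5
```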
Referring to fig. 11, the voice fragment data of the voice session received by the server for the fifth time is the fifth piece of data. Since the fourth piece of data has already been sent, the server may send the fifth piece of data and the sixth piece of data to the voice recognition service device, and updates the cache file shown in fig. 10 to obtain the cache file shown in fig. 11, recording lastUpdateTime, waitCounter, the fragment sequence number, the status, and the voice data in the cache file.
Referring to fig. 12, the voice fragment data of the voice session received by the server for the sixth time is the second piece of data. Since the second piece of data has already been marked as the ignored state, the server may directly discard it without updating the cache file.
Referring to fig. 13, the voice fragment data of the voice session received by the server for the seventh time is the seventh piece of data, that is, the last piece of data. Since the sixth piece of data has already been sent, the server may send the seventh piece of data to the voice recognition service device, and updates the cache file shown in fig. 12 to obtain the cache file shown in fig. 13, recording lastUpdateTime, waitCounter, the fragment sequence number, the status, and the voice data in the cache file.
In some embodiments, the voice recognition service device may perform real-time recognition on the received voice fragment data and feed back the recognition result to the server in real time. After receiving the recognition result, the server parses and responds to the recognition result, finally generates a response result, and feeds the response result back to the display device. For this process, reference may be made to the description of fig. 5, and details are not repeated here.
As can be seen from the above embodiments, in the embodiments of the present application, after receiving a piece of voice fragment data, the server does not send it for the moment if the previous piece of voice fragment data has not yet been sent to the voice recognition service device; only after the server has sent the previous piece of data, or the previous piece of data has been marked as the ignored state, is the voice fragment data sent to the voice recognition service device. This ensures the time sequence of the voice fragment data received by the voice recognition service device, which is beneficial to improving the accuracy of voice recognition by the voice recognition service device, and is further beneficial to improving the response accuracy of voice interaction and improving the voice interaction experience.
Since the above embodiments are all described by referring to and combining with other embodiments, the same portions are provided between different embodiments, and the same and similar portions between the various embodiments in this specification may be referred to each other. And will not be described in detail herein.
It is noted that, in this specification, relational terms such as "first" and "second," and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a circuit structure, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such circuit structure, article, or apparatus. Without further limitation, the presence of an element identified by the phrase "comprising a/an ..." does not exclude the presence of other like elements in a circuit structure, article, or device comprising the element.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
The above embodiments of the present application do not limit the scope of the present application.

Claims (10)

1. A server, wherein the server is configured to:
receiving voice fragment data from display equipment;
if the voice fragment data is not the last piece of data of the voice session, according to the fact that the server has not sent the previous piece of data of the voice fragment data to the voice recognition service equipment, the voice fragment data is not sent for the moment until the server has sent the previous piece of data or the previous piece of data is marked as an ignored state, and then the voice fragment data is sent to the voice recognition service equipment;
if the voice fragment data is the last piece of data, when all voice fragment data in a non-ignored state before the last piece of data has been sent, or the waiting time exceeds a preset time threshold, sending the received voice fragment data which has not been sent and is not marked as the ignored state to the voice recognition service equipment.
2. The server of claim 1, wherein the server is further configured to:
and for voice fragment data which has not been received, if the number of received voice fragment data sorted after it reaches a preset number threshold, marking the unreceived voice fragment data as an ignored state.
3. The server of claim 1, wherein the server is further configured to:
for voice fragment data which has not been received and is not marked as an ignored state, the closer its order is to the last piece of data, the longer the preset waiting time corresponding to it, and the preset time threshold is the maximum value of all the preset waiting times.
4. The server according to claim 1, wherein the voice fragment data comprises voice data and fragment parameters, and wherein the fragment parameters comprise a fragment sequence number, and wherein the fragment sequence number is used to determine an order of the voice fragment data in a voice session.
5. The server according to claim 3, wherein the fragment sequence number comprises an array, the array comprising two values, a first value of the array representing the order of the voice fragment data in a voice session, and a second value of the array representing the order, in the voice session, of the next piece of voice fragment data after the voice fragment data.
6. The server according to claim 3, wherein the fragment sequence number indicates the order of the voice fragment data in a voice session, and the fragment parameter of the last piece of data further comprises an end identifier of the voice session.
7. The server of claim 1, wherein the server is further configured to:
and detecting, from the moment of receiving the first piece of data of the voice session, whether the difference between the time at which voice fragment data was last received and the current time is greater than a preset timeout threshold, and if so, determining that the voice session has ended.
8. The server according to claim 7, wherein the detecting whether the difference between the time at which voice fragment data was last received and the current time is greater than a preset timeout threshold comprises: detecting, every preset period, whether the difference between the time at which voice fragment data was last received and the current time is greater than the preset timeout threshold.
9. The server of claim 1, wherein the server is further configured to:
and after receiving the voice fragment data, storing the voice fragment data into a cache file corresponding to the voice session.
10. A speech recognition method, comprising:
the display equipment sends the voice fragment data to a server;
the server receives voice fragment data from the display equipment;
if the voice fragment data is not the last piece of data of the voice session, the server, according to the fact that the previous piece of data of the voice fragment data has not been sent to the voice recognition service equipment, temporarily does not send the voice fragment data until the server has sent the previous piece of data or the previous piece of data is marked as an ignored state, and then sends the voice fragment data to the voice recognition service equipment;
if the voice fragment data is the last piece of data, the server sends the received voice fragment data which has not been sent and is not marked as an ignored state to the voice recognition service equipment when all voice fragment data in a non-ignored state before the last piece of data has been sent or the waiting time exceeds a preset time threshold;
and the voice recognition service equipment performs real-time voice recognition according to the received voice fragment data.
CN202110335864.9A 2021-03-29 2021-03-29 Server and voice recognition method Active CN113053380B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110335864.9A CN113053380B (en) 2021-03-29 2021-03-29 Server and voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110335864.9A CN113053380B (en) 2021-03-29 2021-03-29 Server and voice recognition method

Publications (2)

Publication Number Publication Date
CN113053380A true CN113053380A (en) 2021-06-29
CN113053380B CN113053380B (en) 2023-12-01

Family

ID=76516139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110335864.9A Active CN113053380B (en) 2021-03-29 2021-03-29 Server and voice recognition method

Country Status (1)

Country Link
CN (1) CN113053380B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060029102A1 (en) * 2004-08-03 2006-02-09 Fujitsu Limited Processing method of fragmented packet
CN101510815A (en) * 2008-12-31 2009-08-19 成都市华为赛门铁克科技有限公司 Method, apparatus and system for processing slicing message
CN101510886A (en) * 2009-03-09 2009-08-19 华为技术有限公司 Method and apparatus for processing message of MP group
CN102868635A (en) * 2012-08-24 2013-01-09 汉柏科技有限公司 Multi-core and multi-thread method and system for preserving order of messages
CN104616652A (en) * 2015-01-13 2015-05-13 小米科技有限责任公司 Voice transmission method and device
WO2016129188A1 (en) * 2015-02-10 2016-08-18 Necソリューションイノベータ株式会社 Speech recognition processing device, speech recognition processing method, and program
CN107404446A (en) * 2016-05-19 2017-11-28 中兴通讯股份有限公司 A kind of method and device for handling fragment message
JP2018049058A (en) * 2016-09-20 2018-03-29 株式会社東芝 Speech processing system, speech recognition server and relay processing device group applied to speech processing system, speech processing method applied to relay processing device group, speech conversion method applied to speech recognition server, and program
US20180096695A1 (en) * 2016-10-01 2018-04-05 Intel Corporation Technologies for privately processing voice data
US20190253477A1 (en) * 2018-03-30 2019-08-15 Intel Corporation Data Fragment Recombination for Internet of Things Devices
CN110324303A (en) * 2018-03-30 2019-10-11 英特尔公司 The data slot of internet of things equipment recombinates
CN108683635A (en) * 2018-04-12 2018-10-19 国家计算机网络与信息安全管理中心 A kind of system and method for realizing the homologous chummage of IP fragmentation packet based on network processes chip
CN110971352A (en) * 2018-09-30 2020-04-07 大唐移动通信设备有限公司 HARQ retransmission processing method and device for uplink enhanced RLC (radio link control) fragments

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114373449A (en) * 2022-01-18 2022-04-19 海信电子科技(武汉)有限公司 Intelligent device, server and voice interaction method

Also Published As

Publication number Publication date
CN113053380B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
WO2020244266A1 (en) Remote control method for smart television, mobile terminal, and smart television
CN112163086B (en) Multi-intention recognition method and display device
CN112511882B (en) Display device and voice call-out method
CN113630649B (en) Display equipment and video playing progress adjusting method
CN112492371A (en) Display device
CN112601117B (en) Display device and content presentation method
CN112153440B (en) Display equipment and display system
CN112653906A (en) Video hotspot playing method on display device and display device
CN113053380B (en) Server and voice recognition method
CN113066491A (en) Display device and voice interaction method
CN112905149A (en) Processing method of voice instruction on display device, display device and server
CN113111214A (en) Display method and display equipment for playing records
CN115701105A (en) Display device, server and voice interaction method
CN113573149B (en) Channel searching method and display device
CN113593559B (en) Content display method, display equipment and server
CN113079400A (en) Display device, server and voice interaction method
CN115291829A (en) Display device and subscription message reminding method
CN114900386A (en) Terminal equipment and data relay method
CN113079401A (en) Display device and echo cancellation method
CN111914565A (en) Electronic equipment and user statement processing method
CN113658598A (en) Voice interaction method of display equipment and display equipment
CN113490060A (en) Display device and method for determining common contact
CN112883144A (en) Information interaction method
WO2022160910A1 (en) Display device and volume display method
CN113852848B (en) Virtual remote controller control method, display device and terminal device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant