CN110379413A - Voice processing method, device, equipment and storage medium - Google Patents

Voice processing method, device, equipment and storage medium

Info

Publication number
CN110379413A
CN110379413A (application number CN201910580572.4A)
Authority
CN
China
Prior art keywords
voice
speech
semantic
segments
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910580572.4A
Other languages
Chinese (zh)
Other versions
CN110379413B (en)
Inventor
赵泽清
汪俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN201910580572.4A priority Critical patent/CN110379413B/en
Publication of CN110379413A publication Critical patent/CN110379413A/en
Application granted granted Critical
Publication of CN110379413B publication Critical patent/CN110379413B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G10L15/26 - Speech to text systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 - User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04 - Real-time or near real-time messaging, e.g. instant messaging [IM]
    • H04L51/52 - User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail, for supporting social networking services

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the present application discloses a voice processing method, the method comprising: segmenting speech into at least two speech segments; determining the semantic segment corresponding to each speech segment; and correspondingly displaying the speech segments and the semantic segments. The embodiment of the present application also discloses a voice processing apparatus, device and storage medium.

Description

Voice processing method, device, equipment and storage medium
Technical Field
The embodiments of the application relate to the field of computer technology, and relate to, but are not limited to, a voice processing method, apparatus, device and storage medium.
Background
In chat software, a user sometimes receives a long voice message from the other party; if the user wants to listen to a certain part of it again, the user has to listen from the beginning, which is troublesome. In the related art, the current solution is to convert the voice into text and, when a finger touches the position of the corresponding text, to play the voice from the corresponding position; however, when this solution is used on a mobile phone, the characters displayed on the screen are small and mistaken touches easily occur.
Disclosure of Invention
The embodiment of the application provides a voice processing method, a voice processing device, voice processing equipment and a storage medium.
The technical solutions of the embodiments of the application are implemented as follows:
in a first aspect, an embodiment of the present application provides a voice processing method, where the method includes:
segmenting speech into at least two speech segments;
determining the semantic segment corresponding to each speech segment;
and correspondingly displaying the speech segments and the semantic segments.
In a second aspect, an embodiment of the present application provides a voice processing apparatus, including a segmentation module, a determining module and a display module; wherein,
the segmentation module is configured to segment speech into at least two speech segments;
the determining module is configured to determine the semantic segment corresponding to each speech segment;
the display module is configured to correspondingly display the speech segments and the semantic segments.
In a third aspect, an embodiment of the present application further provides a voice processing device, including a processor and a memory for storing a computer program capable of running on the processor; wherein the processor is configured to execute the steps of the voice processing method according to any one of the above schemes when running the computer program.
In a fourth aspect, an embodiment of the present application further provides a storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the voice processing method according to any one of the above schemes.
In the embodiment of the application, speech is segmented into at least two speech segments; the semantic segment corresponding to each speech segment is determined; and the speech segments and the semantic segments are displayed correspondingly. The user can therefore directly select, from the displayed semantic segments, the speech segment to be replayed, which is more convenient and improves the user experience.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in different views. Like reference numerals having different letter suffixes may represent different examples of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed herein.
Fig. 1A is a first schematic flow chart illustrating an implementation of a speech processing method according to an embodiment of the present application;
fig. 1B is a first schematic diagram illustrating an effect of a speech processing method according to an embodiment of the present application;
fig. 2 is a second schematic flow chart illustrating an implementation of the speech processing method according to an embodiment of the present application;
fig. 3 is a third schematic flow chart illustrating an implementation of the speech processing method according to an embodiment of the present application;
fig. 4 is a fourth schematic flow chart illustrating an implementation of the speech processing method according to an embodiment of the present application;
fig. 5 is a fifth schematic flow chart illustrating an implementation of the speech processing method according to an embodiment of the present application;
fig. 6 is a schematic flow chart illustrating a sixth implementation of the speech processing method according to the embodiment of the present application;
fig. 7 is a second schematic diagram illustrating an effect of the speech processing method according to an embodiment of the present application;
fig. 8 is a third schematic diagram illustrating an effect of the speech processing method according to the embodiment of the present application;
fig. 9 is a seventh schematic flow chart illustrating an implementation of the speech processing method according to the embodiment of the present application;
fig. 10 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic hardware structure diagram of a speech processing device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the specific technical solutions of the present application are described in further detail below with reference to the accompanying drawings. The following examples illustrate the present application but are not intended to limit its scope.
It should be noted that the drawings are schematic examples only, are not necessarily drawn to scale, and should not be taken to limit the scope of protection of the present application.
The voice processing method provided by the embodiments of the application can be applied to a voice processing apparatus, which can be implemented on a voice processing device. The voice processing apparatus segments the received speech into at least two speech segments, determines the semantic segments corresponding to the speech segments, and correspondingly displays the speech segments and the semantic segments; the user can then select the speech segment to be replayed according to the displayed speech segments and their corresponding semantic segments.
The embodiment of the present application provides a voice processing method applied to a voice processing device; each functional module in the voice processing apparatus can be implemented cooperatively by the hardware resources of the device (such as a terminal device or a server), including computing resources such as a processor, detection resources such as sensors, and communication resources.
The voice processing device may be any electronic device with information processing capability. In one embodiment, the electronic device may be an intelligent terminal, for example a mobile terminal with wireless communication capability such as a notebook computer or an AR/VR device. In another embodiment, the electronic device may be a non-mobile terminal device with computing capability, such as a desktop computer or a server.
Of course, the embodiments of the present application are not limited to being provided as a method and hardware; they may also be provided, in various implementations, as a storage medium storing instructions for executing the voice processing method provided by the embodiments of the present application.
Fig. 1A is a first schematic flow chart illustrating an implementation process of a speech processing method in an embodiment of the present application, as shown in fig. 1A, the method includes the following steps:
step 101: segmenting speech into at least two speech segments;
here, the speech processing apparatus receives a piece of speech, and divides the received speech into at least two speech segments. The voice received by the voice processing device may be sent by another device or sent by the server.
After the voice processing equipment receives a section of voice, segmenting the received voice according to a specified rule; here, the specified rule may be a division into equal parts according to the duration of the speech, or a division into parts according to the interval in the speech.
If the specified rule is: and averagely segmenting according to the duration of the voice, determining the duration of the voice after receiving a section of voice, averagely segmenting the duration to obtain average duration, and averagely segmenting the received voice according to the average duration. Such as: a voice is received for a period of 18 seconds and is divided into voice segments of 9 seconds on average.
If the specified rule is: the speech is segmented according to intervals in the speech, the intervals in the speech are determined after a section of speech is received, and the received speech is segmented according to the determined intervals. Such as: a segment of speech is received as "A1 … A2 … A3", and the speech is segmented into speech segments "A1", "A2", and "A3" according to a determined interval "…" in the speech.
When segmenting speech, the speech is segmented into at least two speech segments. Such as: dividing the voice A into two voice fragments A1 and A2; for another example: the speech a is divided into five speech segments a1, a2, A3, a4, a 5.
Here, the duration of the speech may be determined before the speech is segmented, and when the duration of the speech is longer, the speech is segmented.
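The two segmentation rules can be illustrated with a minimal Python sketch (a hypothetical helper for illustration only, not an implementation fixed by this application; the frame length, minimum pause length and energy threshold are assumed values, and any voice-activity detector could replace the simple energy test):

import numpy as np

def split_equal(samples: np.ndarray, sample_rate: int, parts: int = 2):
    """Rule 1: divide the speech evenly by duration into `parts` segments."""
    return np.array_split(samples, parts)

def split_on_pauses(samples: np.ndarray, sample_rate: int,
                    min_pause_s: float = 0.4, energy_thresh: float = 1e-3):
    """Rule 2: split the speech at the intervals (pauses) in it."""
    frame = int(0.02 * sample_rate)  # 20 ms analysis frames (assumed)
    n = len(samples) // frame
    energy = (samples[:n * frame].reshape(n, frame) ** 2).mean(axis=1)
    silent = energy < energy_thresh  # a frame belongs to an interval if its energy is low
    segments, start, run = [], 0, 0
    for i, s in enumerate(silent):
        run = run + 1 if s else 0
        if run * frame >= min_pause_s * sample_rate:  # pause long enough: cut here
            end = (i - run + 1) * frame               # cut at the start of the pause
            if end > start:
                segments.append(samples[start:end])
            start, run = (i + 1) * frame, 0
    if start < len(samples):
        segments.append(samples[start:])
    return segments

For example, an 18-second recording passed to split_equal with parts=2 yields the two 9-second speech segments of the example above, while split_on_pauses yields the segments "A1", "A2" and "A3" when the recording contains two sufficiently long pauses.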
Step 102: determining the semantic segment corresponding to each speech segment;
here, when determining the semantic segment corresponding to a speech segment, the semantic segment may be determined according to a semantic recognition model, which is trained in advance on a variety of speech samples and semantic samples.
Here, once the speech has been segmented into at least two speech segments, the semantic segment corresponding to each speech segment is determined. For example, for the speech segments "A1", "A2" and "A3", the semantic segment of "A1" is determined to be "this question", that of "A2" "is too difficult", and that of "A3" "I cannot do it". For another example, for the speech segments "A1", "A2", "A3" and "A4", the semantic segment of "A1" is determined to be "this travel", that of "A2" "was a very good experience", that of "A3" "so great", and that of "A4" "really looking forward to it".
Here, the semantic segments may be determined in either of two ways: determine the semantic segments of the at least two speech segments separately, or first determine the semantics corresponding to the whole speech and then split that semantics into several semantic segments.
Step 103: correspondingly displaying the speech segments and the semantic segments.
Here, after obtaining the at least two speech segments and the semantic segments corresponding to them, the voice processing apparatus displays the speech segments and their semantic segments correspondingly.
The displayed speech segments and their corresponding semantic segments may be arranged in sequence on the display screen of the voice processing device in the order in which the speech was received.
The speech segments and their corresponding semantic segments may be displayed on the display screen of the voice processing device. For example, as shown in FIG. 1B, speech segment A1 is displayed with its semantic segment 11 "this travel", speech segment A2 with its semantic segment 12 "was a very good experience", speech segment A3 with its semantic segment 13 "so great", and speech segment A4 with its semantic segment 14 "really looking forward to it".
The voice processing method provided by the embodiment of the application segments speech into at least two speech segments, determines the semantic segment corresponding to each speech segment, and correspondingly displays the speech segments and the semantic segments; the user can therefore directly select, from the displayed semantic segments, the speech segment to be replayed, which is more convenient and improves the user experience.
An embodiment of the present application provides a speech processing method, as shown in fig. 2, the method includes the following steps:
step 201: determining the duration corresponding to the voice;
here, after the speech processing device receives a piece of speech, the duration corresponding to the received speech is determined, for example: the voice processing equipment receives a section of voice A, and determines that the corresponding time length of the received voice A is 30 seconds according to the starting time and the ending time of the received voice.
Step 202: under the condition that the duration is greater than a preset specified duration, dividing the voice into at least two voice segments;
here, a specified duration is preset, and when the duration corresponding to the received section of speech is greater than the preset specified duration, the received section of speech may be considered as a long speech, and the long speech is divided into at least two speech segments.
Such as: the preset specified time length is 20 seconds, the time length corresponding to the received voice is 30 seconds, the time length corresponding to the received voice is longer than the preset specified time length, and the voice with the length of 30 seconds is divided into at least two voice segments.
When the time length corresponding to the received voice is less than the preset specified time length, the received voice can be regarded as short voice, and the received voice does not need to be segmented.
It should be noted that the preset specified time length may be set according to an actual situation, and the embodiment of the present application does not limit this.
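The duration gate of steps 201 to 202 can be sketched as follows (a sketch assuming the 20-second threshold of the example above; split_on_pauses is the hypothetical helper sketched earlier):

SPECIFIED_DURATION_S = 20.0  # preset specified duration (assumed, as in the example)

def maybe_segment(samples, sample_rate):
    # Step 201: determine the duration corresponding to the speech.
    duration = len(samples) / sample_rate
    # Step 202: segment only when the duration exceeds the specified duration.
    if duration > SPECIFIED_DURATION_S:
        return split_on_pauses(samples, sample_rate)
    return [samples]  # short speech is left whole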
Step 203: determining the semantic segment corresponding to each speech segment;
Step 204: correspondingly displaying the speech segments and the semantic segments.
For steps 203 to 204, refer to steps 102 to 103 in the above embodiment, respectively.
The voice processing method provided by the embodiment of the application determines the duration corresponding to the speech; segments the speech into at least two speech segments under the condition that the duration is greater than a preset specified duration; determines the semantic segment corresponding to each speech segment; and correspondingly displays the speech segments and the semantic segments. In this way, the speech is segmented into at least two speech segments only when it is determined to be long speech, which improves the user experience.
An embodiment of the present application provides a speech processing method, as shown in fig. 3, the method includes the following steps:
step 301: displaying an operation interface based on the received first operation aiming at the voice;
here, after the voice processing device receives a piece of voice, the received voice is displayed on a display screen of the voice processing device, a user operates the voice on the display screen, and the voice processing device displays an operation interface based on the received first operation aiming at the voice; the operation interface includes: and (5) a voice segmentation key.
Such as: the voice processing equipment receives a voice with the duration of 30 seconds, displays the voice on a display screen, and carries out right click operation on the voice by a user; the operation interface includes: and (5) a voice segmentation key.
It should be noted that the first operation may be a touch operation such as clicking and touching, which is not limited in the embodiment of the present application.
Step 302: generating a voice segmentation instruction based on the received second operation aiming at the voice segmentation key;
here, after the operation interface is displayed on the display screen of the voice processing apparatus, the user operates the voice division key on the displayed operation interface, and the voice processing apparatus generates the voice division instruction based on the received second operation on the voice division key.
Such as: and clicking the voice segmentation key on the display operation interface by the user, and generating a voice segmentation instruction by the voice processing equipment after receiving the clicking operation aiming at the voice segmentation key.
It should be noted that the second operation may be a touch operation such as clicking and touching, which is not limited in the embodiment of the present application.
Step 303: based on the voice segmentation instruction, segmenting the voice into at least two voice segments;
here, after a voice division instruction is generated based on the voice division key, the voice is divided into at least two voice segments based on the voice division instruction.
The voice segmentation instruction can also carry a voice segmentation rule, and the voice is segmented according to the voice segmentation instruction and the voice segmentation rule.
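The segmentation instruction and the rule it may carry could be modeled as below (a sketch assuming a simple two-rule scheme; the application does not fix a concrete instruction format, so all names here are illustrative):

from dataclasses import dataclass
from typing import Literal

@dataclass
class SegmentationInstruction:
    rule: Literal["equal_duration", "pause"] = "pause"
    parts: int = 2  # used only by the equal-duration rule

def on_segmentation_key(samples, sample_rate, instr=SegmentationInstruction()):
    # Step 303: segment the speech according to the instruction and its rule,
    # using the hypothetical helpers sketched earlier.
    if instr.rule == "equal_duration":
        return split_equal(samples, sample_rate, instr.parts)
    return split_on_pauses(samples, sample_rate)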
Step 304: determining the semantic segment corresponding to each speech segment;
Step 305: correspondingly displaying the speech segments and the semantic segments.
For steps 304 to 305, refer to steps 102 to 103 in the above embodiment, respectively.
According to the voice processing method provided by the embodiment of the application, an operation interface including a speech segmentation key is displayed based on a received first operation for the speech; a speech segmentation instruction is generated based on a received second operation for the speech segmentation key; the speech is segmented into at least two speech segments based on the instruction; the semantic segment corresponding to each speech segment is determined; and the speech segments and the semantic segments are displayed correspondingly. The speech can thus be segmented in response to the second operation on the speech segmentation key, which improves the user experience.
An embodiment of the present application provides a speech processing method, as shown in fig. 4, the method includes the following steps:
step 401: determining a first segmentation boundary of the voice according to an interval in the voice;
here, when the voice is divided into at least two voice segments, a first division boundary of the voice is determined according to an interval in the voice. Wherein the interval in speech may be a pause in speech.
Such as: the received speech is "A1 … A2 … A3", and the pause in speech is "…", determined as the first segmentation boundary of the speech.
Step 402: according to the first segmentation boundary, segmenting the voice to obtain at least two voice segments;
here, the speech is segmented according to the determined first segmentation boundary to obtain at least two speech segments. Such as: the received voice is 'A1 … A2 … A3', and the voice is segmented according to a pause '…' in the voice to obtain three voice fragments: the voice segment "a 1", the voice segment "a 2", and the voice segment "A3".
Step 403: determining a phoneme corresponding to the voice fragment;
here, when determining the semantic segments corresponding to the speech segments, after obtaining at least two speech segments, the semantic segments corresponding to the at least two speech segments are determined, respectively.
When determining semantic fragments corresponding to at least two voice fragments, determining phonemes corresponding to the voice fragments. Such as: at least two voice fragments are 'A1' and 'A2', the phoneme corresponding to the voice fragment 'A1' is determined to be 'zhedaoti', and the phoneme corresponding to the voice fragment 'A2' is determined to be 'tainale'.
Here, the phoneme corresponding to the speech fragment may be determined according to a semantic recognition model in which a phoneme corresponding to the pronunciation of each character in the speech fragment is stored in advance.
Step 404: matching the phoneme with a set phoneme;
here, the phoneme corresponding to the determined speech fragment is matched with the set phoneme. Such as: determining the phoneme corresponding to the speech fragment "A1" as "zhedaoti", and matching the phoneme with the phonemes corresponding to all the characters.
Here, the set phonemes may be phonemes corresponding to all characters, which are stored in the speech processing apparatus in advance.
Step 405: if the phoneme is matched with a set phoneme, determining semantic information corresponding to the set phoneme as a semantic fragment corresponding to the voice fragment;
here, in the process of matching the phoneme with the set phoneme, if the phoneme corresponding to the voice fragment matches with the set phoneme, the semantic information corresponding to the set phoneme is determined as the semantic fragment corresponding to the voice fragment.
Such as: the phoneme corresponding to the speech fragment "a 1" is "zhedaoti", and is matched with the set phonemes "zhe", "dao", and ti ", and the semantic information corresponding to the set phonemes" zhe "," dao ", and ti" is "this", "track", and "topic", respectively, and the semantic information "this", "track", and "topic" is determined as the semantic fragment "this topic" corresponding to the speech fragment "a 1".
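The phoneme matching of steps 403 to 405 can be illustrated with a toy lookup (the table below is a tiny stand-in for the set phonemes stored on the device; a real system would use a full pronunciation lexicon together with an acoustic model):

# Set phonemes and their semantic information (illustrative subset; "dao" is a
# Chinese measure word and "le" a particle, so they add no English word here).
SET_PHONEMES = {"zhe": "this", "dao": "", "ti": "question",
                "tai": "too", "nan": "difficult", "le": ""}

def phonemes_to_semantic_segment(phonemes):
    # Steps 404-405: keep the phonemes that match a set phoneme and combine
    # the corresponding semantic information into the semantic segment.
    words = [SET_PHONEMES[p] for p in phonemes if p in SET_PHONEMES]
    return " ".join(w for w in words if w)

# e.g. phonemes_to_semantic_segment(["zhe", "dao", "ti"]) -> "this question"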
Step 406: correspondingly displaying the speech segments and the semantic segments.
For step 406, refer to step 103 in the above embodiment.
According to the voice processing method provided by the embodiment of the application, a first segmentation boundary of the speech is determined according to an interval in the speech; the speech is segmented according to the first segmentation boundary to obtain at least two speech segments; the phonemes corresponding to each speech segment are determined and matched with the set phonemes; if they match, the semantic information corresponding to the set phonemes is determined as the semantic segment corresponding to the speech segment; and the speech segments and the semantic segments are displayed correspondingly. The speech is thus segmented at the segmentation boundaries and the semantic segments are determined from the phonemes, which improves the user experience.
An embodiment of the present application provides a speech processing method, as shown in fig. 5, the method includes the following steps:
step 501: segmenting speech into at least two speech segments;
wherein, step 501 refers to step 101 in the above embodiment.
Step 502: determining semantic information corresponding to the voice;
here, when determining the semantic information, the received speech may be taken as a whole to determine the semantic information corresponding to the speech.
When semantic information corresponding to a section of voice is determined, determining a phoneme corresponding to the voice, matching the phoneme corresponding to the determined voice with a set phoneme, and if the phoneme corresponding to the voice is matched with the set phoneme, determining the semantic information corresponding to the set phoneme as the semantic information corresponding to the voice.
Such as: one piece of speech is "a 1 … a 2", a phoneme corresponding to the speech "a 1 … a 2" is determined as "zhendaoti … tainane", the phoneme is matched with phonemes corresponding to all characters, a phoneme corresponding to the speech "a 1 … a 2" is "zhendaoti … tainane", semantic information corresponding to the set phonemes "zhe", "dao", "ti", "tai", "nan", and "le" is matched with the set phonemes, semantic information corresponding to the set phonemes is "this", "track", "title", "too difficult", and the semantic information "this", "track", "title", "too difficult", is determined as semantic information "this" corresponding to the speech fragment "a 1 … a 2", which is too difficult.
Here, the phoneme corresponding to the speech fragment may be determined according to a semantic recognition model in which a phoneme corresponding to the pronunciation of each character in the speech fragment is stored in advance. The set phonemes can be phonemes corresponding to all characters and are stored in the voice processing equipment in advance.
Step 503: adding marks to the semantic information according to intervals in the voice;
here, after the semantic information is determined, a mark is added to the semantic information according to an interval in the speech; the interval in the speech may be a pause in the speech, and the mark added to the semantic information may be a punctuation mark added to the semantic information.
Such as: the received voice is 'A1 … A2', the corresponding semantic information is 'the question is too difficult', and punctuation is added to the semantic information 'the question is too difficult' according to the pause '…' in the voice to obtain 'the question is too difficult'.
Step 504: determining a second segmentation boundary of the semantic information according to the mark;
here, a label added to the semantic information is used as a second division boundary of the semantic information. Such as: the semantic information is 'this question is too difficult', the punctuation mark in the semantic information 'is determined as a second segmentation boundary of the semantic information'.
Step 505: according to the second segmentation boundary, segmenting the semantic information to obtain at least two semantic segments;
here, the semantic information is segmented according to the determined second segmentation boundary to obtain at least two semantic segments. Such as: the received semantic information is ' the question is too difficult ', the semantic information is segmented according to punctuation marks ' in the semantic information to obtain two semantic segments: semantic segment "this question", semantic segment "too difficult".
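In outline, steps 503 to 505 (adding marks at the pauses, then splitting at the marks) might look like this (pause_positions, the character offsets of the pauses, is an assumed input produced by the recognizer):

import re

def mark_and_split(text, pause_positions):
    # Step 503: add a mark (here a comma) at each pause position, working from
    # right to left so that earlier offsets remain valid.
    for pos in sorted(pause_positions, reverse=True):
        text = text[:pos] + "," + text[pos:]
    # Steps 504-505: the marks are the second segmentation boundaries; split there.
    return [seg.strip() for seg in re.split(r"[,，。.!?]", text) if seg.strip()]

# e.g. mark_and_split("this question is too difficult", [13])
#      -> ["this question", "is too difficult"]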
Step 506: correspondingly displaying the speech segments and the semantic segments.
For step 506, refer to step 103 in the above embodiment.
The voice processing method provided by the embodiment of the application segments speech into at least two speech segments; determines the semantic information corresponding to the speech; adds marks to the semantic information according to the intervals in the speech; determines second segmentation boundaries of the semantic information according to the marks; and segments the semantic information according to the second segmentation boundaries to obtain at least two semantic segments, which are displayed correspondingly with the speech segments. Segmenting the semantic information at the second segmentation boundaries in this way improves the user experience.
An embodiment of the present application provides a speech processing method, as shown in fig. 6, the method includes the following steps:
step 601: segmenting speech into at least two speech segments;
step 602: determining semantic fragments corresponding to the voice fragments;
step 603: correspondingly displaying the voice fragments and the semantic fragments;
in step 601 to step 603, refer to step 101 to step 103 in the above embodiment, respectively.
Step 604: receiving a third operation for the voice segment;
here, after the voice segment and the semantic segment corresponding to the voice segment are displayed, a third operation for the voice segment is received. Such as: the method comprises the steps of respectively displaying a semantic segment corresponding to a voice segment A1 and a voice segment A1 as 'travel this time', a semantic segment corresponding to a voice segment A2 and a voice segment A2 as 'experience which is always very excellent', a semantic segment corresponding to a voice segment A3 and a voice segment A3 as 'too excellent', a semantic segment corresponding to a voice segment A4 and a semantic segment A4 as 'expectation really', clicking the voice segment A1 by a user, and receiving click operation aiming at the voice segment A1 by the voice recognition equipment.
It should be noted that the third operation may be a touch operation such as clicking and touching, which is not limited in the embodiment of the present application.
Step 605: and playing the voice clip based on the third operation.
Here, the voice segment is played based on the voice recognition apparatus receiving the third operation. Such as: the user clicks the voice segment "a 1", and the voice recognition device receives the clicking operation for the voice segment "a 1" and plays the voice segment "a 1".
The voice processing method provided by the embodiment of the application divides the voice into at least two voice segments; determining semantic fragments corresponding to the voice fragments; correspondingly displaying the voice fragments and the semantic fragments; receiving a third operation for the voice segment; based on the third operation, the voice segments are played, so that a user can directly select the voice segments to be listened again according to the displayed semantic segments, the operation is more convenient and faster, and the user experience is improved.
The voice processing method provided by the embodiment of the present application is described below through a specific scenario.
In this embodiment, after the user receives a long voice message sent by the other party, the user selects the "long voice segmentation" key; the long voice and its recognized text are segmented into segments according to the pauses in the voice and displayed to the user.
In one example, as shown in FIG. 7, when the voice processing device receives a long voice 71, a function selection box 72 is displayed on the display screen; the function selection box 72 displays an earpiece-mode key 73, a favorite key 74 and a long-voice-segmentation key 75, and the long-voice-segmentation key 75 is selected based on the user's operation on the function selection box 72. The voice processing device executes the segmentation function corresponding to the long-voice-segmentation key 75 and segments the received long voice 71 into three speech segments and their corresponding texts, which are displayed correspondingly, as shown in FIG. 8: speech segment A1 with the corresponding text "this travel was a very good experience", speech segment A2 with the corresponding text "so great", and speech segment A3 with the corresponding text "get your things ready".
An implementation flow diagram of the speech processing method in the embodiment of the present application is shown in fig. 9:
step 901: long speech is recognized as text and punctuated.
Step 902: and dividing the long voice and the text into voice segments and text segments corresponding to the voice segments according to the punctuations.
Step 903: and displaying the voice fragment and the text fragment thereof to the user.
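Wiring the sketches above together gives an outline of the whole flow of FIG. 9 (recognize stands in for any speech-to-text engine that also reports the pause positions; it is assumed to report one pause per audio interval so that the audio and text segments align):

def process_long_voice(samples, sample_rate, recognize):
    # Step 901: recognize the long voice as text, with pause positions.
    text, pause_positions = recognize(samples, sample_rate)
    # Step 902: segment the text at the punctuation added at the pauses,
    # and the audio at the same pauses.
    text_segments = mark_and_split(text, pause_positions)
    voice_segments = split_on_pauses(samples, sample_rate)
    # Step 903: return (speech segment, text segment) pairs for display.
    return list(zip(voice_segments, text_segments))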
The technical effect of the embodiment of the application is as follows: the user can directly select the speech segment to be replayed according to the recognized text, which is more convenient and improves the user experience.
The embodiment of the application also provides a voice processing device, and each module included in the device and each unit included in each module can be realized by a processor of the voice processing device; of course, the implementation can also be realized through a specific logic circuit; in implementation, the processor may be a Central Processing Unit (CPU), a Microprocessor (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
As shown in fig. 10, the speech processing apparatus 100 includes:
a segmentation module 1001 configured to segment speech into at least two speech segments;
a determining module 1002 configured to determine the semantic segment corresponding to each speech segment;
a display module 1003 configured to correspondingly display the speech segments and the semantic segments.
In some embodiments, the segmentation module 1001 includes a first determining unit and a segmentation unit; wherein,
the first determining unit is configured to determine the duration corresponding to the speech;
the segmentation unit is configured to segment the speech into at least two speech segments under the condition that the duration is greater than a preset specified duration.
In some embodiments, the voice processing apparatus 100 further includes a display module and a generating module; wherein,
the display module is configured to display an operation interface based on a received first operation for the speech, the operation interface including a speech segmentation key;
the generating module is configured to generate a speech segmentation instruction based on a received second operation for the speech segmentation key;
accordingly, the segmentation module 1001 is configured to segment the speech into at least two speech segments based on the speech segmentation instruction.
In some embodiments, the segmentation module 1001 further includes a second determining unit and a third determining unit; wherein,
the second determining unit is configured to determine a first segmentation boundary of the speech according to an interval in the speech;
the third determining unit is configured to segment the speech according to the first segmentation boundary to obtain at least two speech segments.
In some embodiments, the determining module 1002 further includes a fourth determining unit, a matching unit and a fifth determining unit; wherein,
the fourth determining unit is configured to determine the phonemes corresponding to the speech segment;
the matching unit is configured to match the phonemes with set phonemes;
the fifth determining unit is configured to determine, if the phonemes match the set phonemes, the semantic information corresponding to the set phonemes as the semantic segment corresponding to the speech segment.
In some embodiments, the determining module 1002 further includes a sixth determining unit, a marking unit, a seventh determining unit and an eighth determining unit; wherein,
the sixth determining unit is configured to determine the semantic information corresponding to the speech;
the marking unit is configured to add marks to the semantic information according to the intervals in the speech;
the seventh determining unit is configured to determine second segmentation boundaries of the semantic information according to the marks;
the eighth determining unit is configured to segment the semantic information according to the second segmentation boundaries to obtain at least two semantic segments.
In some embodiments, the voice processing apparatus 100 further includes a receiving module and a playing module; wherein,
the receiving module is configured to receive a third operation for a speech segment;
the playing module is configured to play the speech segment based on the third operation.
It should be noted that, in the voice processing apparatus provided in the above embodiment, the division into the above program modules is only an example; in practical applications, the processing may be distributed to different program modules as needed, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the voice processing apparatus and the voice processing method provided by the above embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
The speech processing apparatus 110 shown in fig. 11 includes: at least one processor 1110, memory 1140, at least one network interface 1120, and a user interface 1130. The various components in the speech processing device 110 are coupled together by a bus system 1150. It is understood that the bus system 1150 is used to enable communications among the components. The bus system 1150 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are designated as the bus system 1150 in fig. 11.
The user interface 1130 may include a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad or touch screen, among others.
The memory 1140 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM). The volatile Memory may be a Random Access Memory (RAM). Memory 1140 described in embodiments of the present invention is intended to comprise any suitable type of memory.
The memory 1140 in embodiments of the present invention is capable of storing data to support the operation of the speech processing device 110. Examples of such data include: any computer program for operating on the speech processing device 110, such as an operating system and application programs. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
The processor 1110 is configured to execute the computer program to implement the steps in the speech processing method provided in the foregoing embodiments.
As an example of implementing the method provided by the embodiment of the present invention with a combination of hardware and software, the method can be directly embodied as a combination of software modules executed by the processor 1110. For example, for the voice processing apparatus provided by the embodiment of the present invention, the software modules of the apparatus can be stored in the memory 1140; the processor 1110 reads the executable instructions included in these software modules from the memory 1140 and, in combination with the necessary hardware (for example, the processor 1110 and other components connected to the bus 1150), completes the voice processing method provided by the embodiment of the present invention.
By way of example, the processor 1110 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor or any conventional processor.
Here, it should be noted that: the above description of the embodiment of the speech processing apparatus is similar to the above description of the method, and has the same beneficial effects as the embodiment of the method, and therefore, the description thereof is omitted. For technical details that are not disclosed in the embodiment of the speech processing apparatus of the present application, those skilled in the art should refer to the description of the embodiment of the method of the present application for understanding, and for the sake of brevity, will not be described again here.
In an exemplary embodiment, the present application further provides a storage medium, which may be a computer-readable storage medium, for example, a memory storing a computer program that can be executed by a processor to implement the steps of the foregoing method. The computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disc, or CD-ROM.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps in the voice processing method provided in the foregoing embodiments are implemented.
Here, it should be noted that: the above description of the computer medium embodiment is similar to the above description of the method, and has the same beneficial effects as the method embodiment, and therefore, the description thereof is omitted. For technical details not disclosed in the embodiments of the storage medium of the present application, those skilled in the art should refer to the description of the embodiments of the method of the present application for understanding, and for the sake of brevity, will not be described again here.
The method disclosed by the embodiment of the present application can be applied to the processor or implemented by the processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor. The processor described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in a memory and the processor reads the information in the memory and performs the steps of the method described above in conjunction with its hardware.
It will be appreciated that the memory in the embodiments of the present application can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be ROM, Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), ferroelectric random access memory (FRAM), Flash Memory, magnetic surface memory, optical disc, or Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk memory or tape memory. The volatile memory may be Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of the present application are intended to include, but are not limited to, these and any other suitable types of memory.
It should be understood by those skilled in the art that other configurations and functions of the speech processing method according to the embodiments of the present application are known to those skilled in the art, and are not described in detail in order to reduce redundancy.
In the description herein, reference to the description of the terms "one embodiment," "some embodiments," "an example," "a specific example" or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the application, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. A voice processing method, the method comprising:
segmenting speech into at least two speech segments;
determining the semantic segment corresponding to each speech segment;
and correspondingly displaying the speech segments and the semantic segments.
2. The method of claim 1, comprising, prior to said segmenting of speech into at least two speech segments:
determining the duration corresponding to the speech;
and segmenting the speech into the at least two speech segments under the condition that the duration is greater than a preset specified duration.
3. The method of claim 1, further comprising:
displaying an operation interface based on a received first operation for the speech; the operation interface includes a speech segmentation key;
generating a speech segmentation instruction based on a received second operation for the speech segmentation key;
accordingly, the segmenting of speech into at least two speech segments includes:
segmenting the speech into the at least two speech segments based on the speech segmentation instruction.
4. The method of claim 1, the segmenting of speech into at least two speech segments comprising:
determining a first segmentation boundary of the speech according to an interval in the speech;
and segmenting the speech according to the first segmentation boundary to obtain the at least two speech segments.
5. The method of claim 1, the determining of the semantic segment corresponding to each speech segment comprising:
determining the phonemes corresponding to the speech segment;
matching the phonemes with set phonemes;
and if the phonemes match the set phonemes, determining the semantic information corresponding to the set phonemes as the semantic segment corresponding to the speech segment.
6. The method of claim 1, the determining of the semantic segment corresponding to each speech segment comprising:
determining the semantic information corresponding to the speech;
adding marks to the semantic information according to the intervals in the speech;
determining second segmentation boundaries of the semantic information according to the marks;
and segmenting the semantic information according to the second segmentation boundaries to obtain at least two semantic segments.
7. The method of claim 1, further comprising:
receiving a third operation for a speech segment;
and playing the speech segment based on the third operation.
8. A voice processing apparatus, the apparatus comprising: a segmentation module, a determining module and a display module; wherein,
the segmentation module is configured to segment speech into at least two speech segments;
the determining module is configured to determine the semantic segment corresponding to each speech segment;
the display module is configured to correspondingly display the speech segments and the semantic segments.
9. A voice processing device, comprising a processor and a memory for storing a computer program operable on the processor; wherein the processor is configured to execute the steps of the voice processing method according to any one of claims 1 to 7 when running the computer program.
10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the voice processing method of any one of claims 1 to 7.
CN201910580572.4A 2019-06-28 2019-06-28 Voice processing method, device, equipment and storage medium Active CN110379413B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910580572.4A CN110379413B (en) 2019-06-28 2019-06-28 Voice processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910580572.4A CN110379413B (en) 2019-06-28 2019-06-28 Voice processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110379413A (en) 2019-10-25
CN110379413B CN110379413B (en) 2022-04-19

Family

ID=68251304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910580572.4A Active CN110379413B (en) 2019-06-28 2019-06-28 Voice processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110379413B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103187061A (en) * 2011-12-28 2013-07-03 上海博泰悦臻电子设备制造有限公司 Speech conversational system in vehicle
CN104616652A (en) * 2015-01-13 2015-05-13 小米科技有限责任公司 Voice transmission method and device
CN106559541A (en) * 2015-09-30 2017-04-05 北京奇虎科技有限公司 voice data processing method and device
CN106559540A (en) * 2015-09-30 2017-04-05 北京奇虎科技有限公司 voice data processing method and device
CN107305541A (en) * 2016-04-20 2017-10-31 科大讯飞股份有限公司 Speech recognition text segmentation method and device
CN108141498A (en) * 2015-11-25 2018-06-08 华为技术有限公司 A kind of interpretation method and terminal
CN108874904A (en) * 2018-05-24 2018-11-23 平安科技(深圳)有限公司 Speech message searching method, device, computer equipment and storage medium
CN109379641A (en) * 2018-11-14 2019-02-22 腾讯科技(深圳)有限公司 A kind of method for generating captions and device
US20190066683A1 (en) * 2017-08-31 2019-02-28 Interdigital Ce Patent Holdings Apparatus and method for residential speaker recognition
CN109473104A (en) * 2018-11-07 2019-03-15 苏州思必驰信息科技有限公司 Speech recognition network delay optimization method and device

Also Published As

Publication number Publication date
CN110379413B (en) 2022-04-19

Similar Documents

Publication Publication Date Title
US11398236B2 (en) Intent-specific automatic speech recognition result generation
US9299342B2 (en) User query history expansion for improving language model adaptation
CN110069608B (en) Voice interaction method, device, equipment and computer storage medium
US9378651B2 (en) Audio book smart pause
US10282469B2 (en) System and method for summarizing a multimedia content item
WO2017166650A1 (en) Voice recognition method and device
CN109213932B (en) Information pushing method and device
CN109343696B (en) Electronic book commenting method and device and computer readable storage medium
CN102057370A (en) Segmenting words using scaled probabilities
WO2015074548A1 (en) Method for sound control in browser, and browser
US8868419B2 (en) Generalizing text content summary from speech content
CN111128254B (en) Audio playing method, electronic equipment and storage medium
CN106202087A (en) A kind of information recommendation method and device
KR102353797B1 (en) Method and system for suppoting content editing based on real time generation of synthesized sound for video content
CN110020429B (en) Semantic recognition method and device
KR20240128047A (en) Video production method and device, electronic device and readable storage medium
WO2022206198A1 (en) Audio and text synchronization method and apparatus, device and medium
CN112685534B (en) Method and apparatus for generating context information of authored content during authoring process
CN114596859A (en) Conference voice transcription method, system, equipment and storage medium
US11322142B2 (en) Acoustic sensing-based text input method
CN110379413B (en) Voice processing method, device, equipment and storage medium
CN115547337A (en) Speech recognition method and related product
CN112837668B (en) Voice processing method and device for processing voice
KR102488623B1 (en) Method and system for suppoting content editing based on real time generation of synthesized sound for video content
WO2023246140A1 (en) Media control method and vehicle-mounted terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant