CN114341866A - Simultaneous interpretation method, device, server and storage medium - Google Patents

Simultaneous interpretation method, device, server and storage medium

Info

Publication number
CN114341866A
Authority
CN
China
Prior art keywords
data
voice data
language
document
voice
Prior art date
Legal status
Pending
Application number
CN201980099995.2A
Other languages
Chinese (zh)
Inventor
郝杰
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd, Shenzhen Huantai Technology Co Ltd
Publication of CN114341866A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A simultaneous interpretation method, apparatus, server (50) and storage medium. The method includes: obtaining first data to be processed (201); translating first voice data in the first data to be processed to obtain a first translation text (202); generating second voice data according to the first translation text; and performing at least one of: obtaining a typeset document according to the first voice data and the first translation text; and performing image character processing on first image data in the first data to be processed to obtain an image processing result (203), the first image data at least comprising a display document corresponding to the first voice data. The first translation text, the typeset document, the second voice data and the image processing result are used for presentation at the client when the first voice data is played.

Description

Simultaneous interpretation method, device, server and storage medium Technical Field
The present application relates to simultaneous interpretation technology, and in particular, to a simultaneous interpretation method, apparatus, server, and storage medium.
Background
Machine simultaneous interpretation (referred to herein as machine co-transmission) is a speech translation product for conference scenarios that has emerged in recent years. It combines Automatic Speech Recognition (ASR) technology and Machine Translation (MT) technology to present multi-language captions for the speech content of a conference speaker, replacing the manual simultaneous interpretation service.
In the related machine co-transmission technology, the speech content is generally translated and displayed as text, but the displayed content does not really enable a user to understand the speech content.
Disclosure of Invention
In order to solve the related technical problems, embodiments of the present application provide a simultaneous interpretation method, apparatus, server, and storage medium.
The embodiment of the application provides a simultaneous interpretation method, which is applied to a server and comprises the following steps:
obtaining first data to be processed;
translating first voice data in the first data to be processed to obtain a first translation text;
generating second voice data according to the first translation text; and performing at least one of:
obtaining a typesetting document according to the first voice data and the first translation text;
performing image character processing on first image data in the first data to be processed to obtain an image processing result; the first image data at least comprises a display document corresponding to the first voice data; wherein,
the language corresponding to the first voice data is different from the language corresponding to the typesetting document; the language corresponding to the first voice data is different from the language corresponding to the first translation text; the language type corresponding to the first voice data is different from the language type corresponding to the second voice data; the language of the characters displayed by the first image data is different from the language of the characters included in the image processing result; the first translation text, the typeset document, the second voice data and the image processing result are used for presenting at the client when the first voice data is played.
The embodiment of the present application further provides a simultaneous interpretation device, including:
an acquisition unit configured to acquire first data to be processed;
the first processing unit is configured to translate first voice data in the first data to be processed to obtain a first translation text;
the second processing unit is configured to generate second voice data according to the first translation text;
a third processing unit configured to perform at least one of:
obtaining a typesetting document according to the first voice data and the first translation text;
performing image character processing on first image data in the first data to be processed to obtain an image processing result; the first image data at least comprises a display document corresponding to the first voice data; wherein,
the language corresponding to the first voice data is different from the language corresponding to the typesetting document; the language corresponding to the first voice data is different from the language corresponding to the first translation text; the language type corresponding to the first voice data is different from the language type corresponding to the second voice data; the language of the characters displayed by the first image data is different from the language of the characters included in the image processing result; the first translation text, the typeset document, the second voice data and the image processing result are used for presenting at the client when the first voice data is played.
An embodiment of the present application further provides a server, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of any of the above-mentioned simultaneous interpretation methods when executing the program.
Embodiments of the present application also provide a storage medium having stored thereon computer instructions that, when executed by a processor, perform the steps of any of the above-described methods of simultaneous interpretation.
The simultaneous interpretation method, the device, the server and the storage medium provided by the embodiment of the application obtain first data to be processed; translating first voice data in the first data to be processed to obtain a first translation text; generating second voice data according to the first translation text; and performing at least one of: obtaining a typesetting document according to the first voice data and the first translation text; performing image character processing on first image data in the first data to be processed to obtain an image processing result; the first image data at least comprises a display document corresponding to the first voice data; the language corresponding to the first voice data is different from the language corresponding to the typesetting document; the language corresponding to the first voice data is different from the language corresponding to the first translation text; the language type corresponding to the first voice data is different from the language type corresponding to the second voice data; the language of the characters displayed by the first image data is different from the language of the characters included in the image processing result; the first translation text, the typesetting document, the second voice data and the image processing result are used for being displayed at the client when the first voice data is played, and the text translation result, the voice translation result and the typesetting document corresponding to the text translation result and the translation result related to the display document corresponding to the first voice data are provided for a user.
Drawings
FIG. 1 is a schematic diagram of a system architecture for simultaneous interpretation in the related art;
FIG. 2 is a flowchart illustrating a simultaneous interpretation method according to an embodiment of the present application;
FIG. 3 is a system architecture diagram illustrating an application of the simultaneous interpretation method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a structure of a simultaneous interpretation device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
Before the technical solution of the embodiment of the present application is explained in detail, a system applied to the simultaneous interpretation method in the related art is first briefly explained.
FIG. 1 is a schematic diagram of a system architecture for simultaneous interpretation in the related art; as shown in fig. 1, the system may include: a machine co-transmission server, a voice recognition server, a translation server, a mobile terminal distribution server, an audience mobile terminal, a personal computer (PC) client and a display screen.
In practical application, a speaker can give a conference lecture through the PC client and project a presented document, such as a PowerPoint (PPT) document, onto the display screen, which shows the document to the users. During the lecture, the PC client collects the speaker's audio and sends it to the machine co-transmission server; the machine co-transmission server has the audio data recognized by the voice recognition server to obtain a recognition text, and has the recognition text translated by the translation server to obtain a translation result. The machine co-transmission server then sends the translation result to the PC client and, through the mobile terminal distribution server, to the audience mobile terminals, so that the translation result is displayed to the users; in this way the speech content of the speaker is translated into the language required by each user and displayed.
The scheme in the related art can show the speech content (namely the translation result) in different languages, but only the spoken content of the speaker is interpreted; the document presented by the speaker is not translated, so users of other languages can hardly understand the content of the document, and the presentation of the speech content is still deficient. For the speech content, only the raw translated text is displayed, which does not help the user with typesetting or summarizing the speech content. In addition, compared with the manual simultaneous interpretation service, which is mainly listening-based, the current machine co-transmission technology is mostly a visual display of text content, and during a speaker's delivery an excess of displayed text does not allow the user to understand the speech content well. These problems result in a poor sensory experience for the user.
Based on this, in various embodiments of the present application, the speech content is translated to obtain a translation result (which may include translated speech and text), the translation result is sorted (such as abstract extraction and typesetting), a typeset document and an abstract document are obtained, and the displayed document is translated; and the translation result, the document obtained by sorting and the translated display document are sent to the mobile terminal of the audience for displaying, so that the user can be helped to understand the speech content, and the user can conveniently summarize and summarize the speech content subsequently.
The present application will be described in further detail with reference to the following drawings and specific embodiments.
The embodiment of the application provides a simultaneous interpretation method, which is applied to a server, and fig. 2 is a schematic flow chart of the simultaneous interpretation method of the embodiment of the application; as shown in fig. 2, the method includes:
step 201: first data to be processed is obtained.
Here, the first data to be processed includes: first voice data and first image data.
Wherein the first image data at least comprises a presentation document corresponding to the first voice data. The presentation document may be a Word document, a PPT document, or other form of document, which is not limited herein.
In practical application, the first voice data and the first image data may be collected by a first terminal and sent to the server. The first terminal may be a terminal such as a personal computer (PC), a tablet computer, or the like.
The first terminal may be provided with or connected to a voice acquisition module, such as a microphone, and acquires voice through the voice acquisition module to obtain first voice data.
In practical application, the first terminal may be equipped with or connected to an image acquisition module (which may be implemented by a stereo camera, a binocular camera or a structured light camera), and the presentation document can be photographed through the image acquisition module to acquire the first image data. In another embodiment, the first terminal may have a screen capture function, capture a screenshot of the presentation document on its display screen, and use the captured screen as the first image data.
For example, in a simultaneous interpretation scene of a conference, when a speaker is speaking, a first terminal (e.g., a PC) acquires speech content by using a speech acquisition module to obtain first speech data; the method comprises the steps that a speaker displays a document (such as a PPT document) related to speech content, a first terminal shoots the displayed PPT document through an image acquisition module or captures the PPT document on a display screen of the first terminal to obtain first image data.
And establishing communication connection between the first terminal and the server. The first terminal sends the acquired first voice data and first image data to a server as first data to be processed, and the server can acquire the first data to be processed.
Step 202: and translating the first voice data in the first data to be processed to obtain a first translation text.
In an embodiment, the translating the first voice data in the first data to be processed to obtain a first translated text includes:
performing voice recognition on the first voice data to obtain a recognition text;
and translating the recognition text to obtain the first translation text.
Here, the server may perform voice recognition on the first voice data by using a voice recognition technology to obtain a recognition text.
The server may translate the recognition text by using a preset translation model to obtain the first translation text.
The translation model is used for translating the text of the first language into the text of at least one second language; the first language is different from the second language.
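Purely as an illustration and not part of the patent disclosure, a minimal sketch of this recognize-then-translate pipeline in Python follows; both helper functions are hypothetical stand-ins for the speech recognition server and the preset translation model.

```python
# Minimal sketch of step 202: recognize the first voice data, then translate the
# recognition text into one or more target languages. Both helpers are hypothetical
# stand-ins; a real server would call external ASR/MT services here.

from typing import Dict, List


def recognize_speech(first_voice_data: bytes, source_lang: str) -> str:
    """Return the recognition text for the incoming audio (stub)."""
    return f"<recognition text ({source_lang})>"


def translate_text(text: str, source_lang: str, target_langs: List[str]) -> Dict[str, str]:
    """Translate the recognition text into each target language (stub)."""
    return {lang: f"<{text} translated from {source_lang} to {lang}>" for lang in target_langs}


def build_first_translation_text(first_voice_data: bytes,
                                 source_lang: str,
                                 target_langs: List[str]) -> Dict[str, str]:
    recognition_text = recognize_speech(first_voice_data, source_lang)
    return translate_text(recognition_text, source_lang, target_langs)


if __name__ == "__main__":
    print(build_first_translation_text(b"pcm-audio", "zh", ["en", "ja"]))
```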
Step 203: generating second voice data according to the first translation text; and performing at least one of:
obtaining a typesetting document according to the first voice data and the first translation text;
performing image character processing on first image data in the first data to be processed to obtain an image processing result;
the language corresponding to the first voice data is different from the language corresponding to the typesetting document; the language corresponding to the first voice data is different from the language corresponding to the first translation text; the language type corresponding to the first voice data is different from the language type corresponding to the second voice data; the language of the characters displayed by the first image data is different from the language of the characters included in the image processing result;
the first translation text, the typeset document, the second voice data and the image processing result are used for presenting at the client when the first voice data is played.
Specifically, the first translation text, the typeset document and the second voice data are used for being sent to a client, so that the content corresponding to the first voice data is displayed on the client when the first voice data is played; and the image processing result is used for being sent to the client so as to display the content corresponding to the display document included in the first image data at the client when the first voice data is played.
In practical application, to obtain the first translation text and the second voice data corresponding to the first voice data, in addition to the method above, the server may first translate the first voice data with a preset voice translation model to obtain the second voice data, and then perform voice recognition on the second voice data to obtain the first translation text. The second voice data and the first translation text each correspond to at least one language.
In actual application, the typesetting can be performed according to the content of the first voice data to obtain a typesetting document. Through concise and clear typesetting documents, the user can be helped to read and understand intuitively. In addition, the user can also conveniently perform induction and arrangement on the content of the first voice data subsequently.
In one embodiment, determining the typeset document according to the first voice data and the first translation text comprises:
performing Voice Activity Detection (VAD) on the first voice data, and determining a mute point in the first voice data;
acquiring a context corresponding to the mute point in the first translation text;
segmenting the first translation text according to the mute point and the semantics of the context to obtain at least one paragraph;
and typesetting the at least one paragraph to obtain the typeset document.
Here, the server may perform voice activity detection on first voice data, determine a mute period in the first voice data, and record a mute duration of the mute period, and when the mute duration satisfies a condition (for example, the mute duration exceeds a preset duration), take the determined mute period as a mute point in the first voice data.
Because the first translation text is obtained by translating the first voice data, its content corresponds to the content of the first voice data; the server can therefore pre-segment the first translation text according to the mute points to obtain at least one pre-segmented paragraph, obtain the context corresponding to each mute point in the first translation text, perform semantic analysis on the context by using Natural Language Processing (NLP) technology, and determine from the semantic analysis result whether to keep the pre-segmented cut. In this way the first translation text is finally segmented into at least one paragraph.
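The following sketch illustrates this segmentation logic under simplifying assumptions: silence periods are given as (start, end) offsets in seconds, and a sentence-ending check stands in for the NLP semantic analysis. All names are illustrative, not the patent's.

```python
# Sketch of the VAD-plus-semantics segmentation used to build the typeset document.
# Silence periods are assumed to be given as (start_sec, end_sec) pairs on the time
# axis of the first voice data; the semantic check is a simplified heuristic.

from typing import List, Tuple


def mute_points(silence_periods: List[Tuple[float, float]],
                min_silence_sec: float = 0.8) -> List[float]:
    """Keep only silence periods whose duration exceeds the preset threshold."""
    return [start for start, end in silence_periods if end - start >= min_silence_sec]


def semantically_complete(left_context: str) -> bool:
    """Stand-in for the NLP semantic analysis: split only after a full sentence."""
    return left_context.rstrip().endswith((".", "!", "?", "。", "！", "？"))


def segment_translation(sentences: List[Tuple[float, str]],
                        silence_periods: List[Tuple[float, float]]) -> List[str]:
    """sentences: (start_sec, text) pairs of the first translation text, time-aligned
    with the first voice data. Returns the paragraphs of the typeset document."""
    points = mute_points(silence_periods)
    paragraphs: List[str] = []
    current: List[str] = []
    for i, (start, text) in enumerate(sentences):
        current.append(text)
        next_start = sentences[i + 1][0] if i + 1 < len(sentences) else float("inf")
        # Pre-segment at a mute point, then confirm the cut semantically.
        if any(start < p < next_start for p in points) and semantically_complete(" ".join(current)):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```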
Specifically, the generating of the second speech data according to the first translation text includes:
segmenting the first translation text to obtain at least one paragraph in the first translation text;
generating at least one segmented speech from at least one paragraph in the first translated text;
and synthesizing second voice data corresponding to the first translation text by using the at least one segmented voice.
Here, Text-To-Speech (TTS) technology is applied to convert the paragraphs into corresponding segmented speech.
In an embodiment, segmenting the first translation text may include: performing semantic recognition on the first translation text, and segmenting the first translation text according to the semantic recognition result. In another embodiment, a combination of voice activity detection and semantic recognition may also be used for segmentation, as described above for determining the typeset document according to the first voice data and the first translation text, which is not repeated here.
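A minimal sketch of the second-voice-data generation, assuming a hypothetical synthesize() call in place of the TTS service:

```python
# Sketch of generating second voice data: one TTS call per paragraph of the first
# translation text, then concatenation of the segmented speech. synthesize() is a
# hypothetical stand-in for the TTS server.

from typing import List, Tuple


def synthesize(paragraph: str, lang: str) -> bytes:
    """Hypothetical TTS call returning audio bytes for one paragraph."""
    return f"<audio:{lang}:{paragraph}>".encode("utf-8")


def generate_second_voice_data(paragraphs: List[str], lang: str) -> Tuple[List[bytes], bytes]:
    segmented_speech = [synthesize(p, lang) for p in paragraphs]
    # The segments are kept individually so each paragraph can be played while its
    # text is shown on the client, and joined to form the full second voice data.
    return segmented_speech, b"".join(segmented_speech)
```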
In actual application, the displayed document can be abstracted, so that the user can be helped to summarize the content of the first voice data, and the user can better understand the first voice data. In addition, the user can also conveniently perform induction and arrangement on the content of the first voice data subsequently.
Based on this, in an embodiment, the method may further include:
performing abstract extraction on the first translation text to obtain an abstract document for the first translation text; the abstract document is used for presentation at the client when the first voice data is played.
Here, an NLP technique is used to perform automatic summarization on the first translation text, so as to obtain the abstract document for the first translation text.
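As a toy illustration only (the disclosure states merely that NLP-based automatic summarization is applied), a frequency-based extractive summarizer over the first translation text might look like this:

```python
# Naive extractive-summarization sketch: score sentences of the first translation
# text by word frequency and keep the top-scoring ones in their original order.
# This only illustrates producing an abstract document; it is not the patent's method.

import re
from collections import Counter
from typing import List


def summarize(first_translation_text: str, max_sentences: int = 3) -> List[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?。！？])\s+", first_translation_text) if s.strip()]
    freq = Counter(re.findall(r"\w+", first_translation_text.lower()))
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
                    reverse=True)
    chosen = set(ranked[:max_sentences])
    return [s for s in sentences if s in chosen]  # abstract document, original order
```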
Specifically, performing image character processing on the first image data to obtain an image processing result includes:
determining characters in the first image data and positions corresponding to the characters;
extracting characters in the first image data, and translating the extracted characters;
and generating the image processing result according to the translated characters.
Here, the characters in the first image data are determined using Optical Character Recognition (OCR) technology; OCR is a technology that performs character recognition on an image to convert the characters in the image into text. The positions corresponding to the characters are determined using an interface positioning technology.
The translating the extracted words comprises the following steps: and translating the characters by using a preset translation model.
Here, the translation model is used for translating the characters in the first language into at least one character in the second language; the first language is different from the second language.
Specifically, the generating of the image processing result according to the translated text includes at least one of:
replacing the characters corresponding to the positions in the first image data according to the translated characters to obtain second image data, and taking the second image data as the image processing result;
and generating a second translation text by using the translated characters, and taking the second translation text as the image processing result.
Here, the image processing result may include at least one of: second image data, second translated text.
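A sketch combining both forms of the image processing result follows; ocr() and translate() are hypothetical stand-ins, and the drawing assumes the Pillow library is available.

```python
# Sketch of image character processing: OCR yields (text, bounding box) pairs, the
# text is translated, and then either the boxes are overwritten with the translated
# characters (second image data) or the translations are joined into a second
# translation text. ocr() and translate() are hypothetical stand-ins.

from typing import List, Tuple
from PIL import Image, ImageDraw

Box = Tuple[int, int, int, int]  # left, top, right, bottom


def ocr(image: Image.Image) -> List[Tuple[str, Box]]:
    """Hypothetical OCR + interface positioning: each text with its position."""
    return [("会议议程", (40, 30, 400, 70))]


def translate(text: str, target_lang: str) -> str:
    """Hypothetical call to the preset translation model."""
    return f"<{text} in {target_lang}>"


def process_image(image: Image.Image, target_lang: str) -> Tuple[Image.Image, str]:
    second_image = image.copy()
    draw = ImageDraw.Draw(second_image)
    translated_lines = []
    for text, box in ocr(image):
        translated = translate(text, target_lang)
        translated_lines.append(translated)
        draw.rectangle(box, fill="white")                       # cover the original characters
        draw.text((box[0], box[1]), translated, fill="black")   # write the translation in place
    return second_image, "\n".join(translated_lines)


if __name__ == "__main__":
    img, text = process_image(Image.new("RGB", (640, 480), "white"), "en")
```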
In practical applications, in order to help users understand the first voice data and the presentation document, the simultaneous interpretation data may take various forms so that corresponding content (including the first translation text, the second voice data, and at least one of the typeset document, the image processing result and the abstract document) can be provided according to the needs of users.
Based on this, in one embodiment, the simultaneous interpretation data obtained by using the first to-be-processed data corresponds to at least one language; the method may further comprise:
storing the simultaneous interpretation data corresponding to at least one language in different databases according to languages.
Here, the simultaneous interpretation data includes: the first translation text and the second voice data further include: at least one of a layout document, an image processing result, and a digest document.
Here, whether to provide the layout document, the image processing result, and the digest document may be determined according to user requirements. For example, the user sends a request through the client to inform the server whether the document for composition, the image processing result, and the digest document are required. For another example, if it is determined that the user needs to know the summary of the first voice data, the simultaneous interpretation data may include a typeset document and a summary document in addition to the first translation text and the second voice data; when it is determined that the user needs to know the content of the displayed document, the simultaneous interpretation data may further include an image processing result.
In practical application, the simultaneous interpretation data corresponding to at least one language can be stored in different databases according to language: the first translation text and the second voice data of the same language, together with at least one of the typeset document, the image processing result and the abstract document, are stored in the same database, and each database is provided with a corresponding language identification.
In this embodiment, considering that the simultaneous interpretation data is oriented to multiple clients and must be sent to each client in turn, a cache may be adopted in order to ensure the timeliness of sending the simultaneous interpretation data to multiple clients. When the data needs to be sent, the server obtains the corresponding result directly from the cache, which ensures high timeliness of sending the simultaneous interpretation data and protects the computing resources of the server.
Based on this, in an embodiment, the method may further include:
and classifying and caching the simultaneous interpretation data corresponding to at least one language according to languages.
In practical application, the server may determine the preset language of each client in at least one client in advance, and obtain the simultaneous interpretation data corresponding to the preset language from the database for caching.
Through the cache operation, when the client selects other languages different from the preset language, the simultaneous interpretation data of the corresponding language can be directly obtained from the cache, so that the timeliness and the protection of computing resources can be improved.
In practical application, the client selects other languages different from the preset language, the simultaneous interpretation data of the other languages may not be cached, and when the server determines that the client sends an acquisition request for selecting other languages different from the preset language, the server can cache the simultaneous interpretation data of the other languages requested by the client; when other clients select the same language, corresponding simultaneous interpretation data can be directly obtained from the cache, so that timeliness and protection of computing resources can be improved.
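A minimal sketch of this per-language classification and caching, with an in-memory dictionary standing in for the server's cache and database; the field names are illustrative only.

```python
# Sketch of classifying and caching simultaneous interpretation data by language.
# The in-memory dict stands in for the real cache; on a miss, the data would be
# loaded from the per-language database and cached before being returned.

from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class InterpretationData:                # field names are illustrative, not the patent's
    first_translation_text: str
    second_voice_data: bytes
    typeset_document: str = ""
    abstract_document: str = ""
    image_processing_result: bytes = b""


class LanguageCache:
    def __init__(self, database: Dict[str, InterpretationData]) -> None:
        self._database = database        # language identification -> stored data
        self._cache: Dict[str, InterpretationData] = {}

    def preload(self, preset_languages: Iterable[str]) -> None:
        for lang in preset_languages:    # cache the preset language of each client
            if lang in self._database:
                self._cache[lang] = self._database[lang]

    def get(self, language: str) -> Optional[InterpretationData]:
        if language not in self._cache and language in self._database:
            self._cache[language] = self._database[language]   # cache on first request
        return self._cache.get(language)
```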
In practical applications, in order to provide the simultaneous interpretation data corresponding to the language meeting the user requirements, the simultaneous interpretation data corresponding to the target language may be obtained according to an obtaining request sent by the user through the client.
Based on this, in an embodiment, the method may further include:
receiving an acquisition request sent by a client; the acquisition request is used for acquiring simultaneous interpretation data; the acquisition request at least comprises: target language;
obtaining simultaneous interpretation data corresponding to the target language from the cached simultaneous interpretation data;
and sending the obtained simultaneous interpretation data corresponding to the target language to a client.
Here, the client may be provided with a human-computer interaction interface through which a user may select a language, and the client generates an acquisition request including a target language according to the selection of the user and sends the acquisition request to the server, so that the server receives the acquisition request.
The client may be installed on a mobile phone; considering that most users currently carry mobile phones with them, sending the simultaneous interpretation data to a client installed on the mobile phone means no additional equipment is needed to receive and display the simultaneous interpretation data, which saves cost and is convenient to operate.
Here, the first data to be processed corresponds to simultaneous interpretation data corresponding to at least one language, and the simultaneous interpretation data includes: the first translated text, the second speech data; further comprising at least one of: the typesetting document, the image processing result and the abstract document. That is, the first data to be processed corresponds to a first translation text in at least one language, second speech data in at least one language, and at least one of: typesetting documents of at least one language, image processing results of at least one language and summary documents of at least one language.
In practical application, in order to help a user to quickly obtain the simultaneous interpretation data at a certain time point, the corresponding simultaneous interpretation data can be obtained according to the target time sent by the client.
Based on this, in one embodiment, the acquisition request may include a target time; when the simultaneous interpretation data corresponding to the target language is obtained from the cached content, the method further includes:
according to a preset time corresponding relation, simultaneous interpretation data corresponding to the target time are obtained from a cache; the time correspondence represents a time relationship between each piece of data in the simultaneous interpretation data.
Here, the user can also select time through the human-computer interaction interface, and the client generates an acquisition request containing the target time according to the selection of the user. For example: the simultaneous interpretation method is applied to a conference; the user selects a point in time in the conference as the target time.
Here, the time relationship between the respective data in the simultaneous interpretation data refers to the time relationship between the first translated text, the second speech data, and at least one of the layout document, the image processing result, and the digest document in the simultaneous interpretation data.
Specifically, the time correspondence is generated in advance according to a time axis of the first voice data and a time point at which the first image data is acquired.
It should be noted that, for two cases that the acquisition request includes the target language and the acquisition request includes the target time, the two cases may be implemented as separate schemes (for a case that the acquisition request includes only the target time, the target language may adopt a preset language set by the client); or in the same scheme (that is, if the obtaining request includes both the target language and the target time, the server obtains the simultaneous interpretation data of the target time corresponding to the target language).
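A sketch of serving such an acquisition request, assuming the cached data is kept per language as entries ordered along the second time axis; every structure here is an assumption made for illustration.

```python
# Sketch of handling an acquisition request with a target language and an optional
# target time. Cached entries are modelled per language as (start_second, payload)
# pairs ordered along the second time axis; all names are illustrative.

from typing import Dict, List, Optional, Tuple


class TimedLanguageCache:
    def __init__(self) -> None:
        self._entries: Dict[str, List[Tuple[float, dict]]] = {}

    def add(self, language: str, start_second: float, payload: dict) -> None:
        entries = self._entries.setdefault(language, [])
        entries.append((start_second, payload))
        entries.sort(key=lambda e: e[0])

    def handle_request(self, target_language: str,
                       target_time: Optional[float] = None) -> Optional[dict]:
        entries = self._entries.get(target_language)
        if not entries:
            return None                      # nothing cached yet for this language
        if target_time is None:
            return entries[-1][1]            # latest simultaneous interpretation data
        # Return the entry whose segment starts at or before the target time.
        best = entries[0][1]
        for start, payload in entries:
            if start <= target_time:
                best = payload
            else:
                break
        return best
```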
In practical application, the correspondence between the pieces of data in the simultaneous interpretation data can be generated in advance; based on this correspondence, when one piece of the simultaneous interpretation data is acquired, the other corresponding data can be acquired at the same time. For example, when the first translation text is obtained, the second voice data, the abstract document and the typeset document corresponding to the first translation text can be obtained correspondingly, together with the image processing result corresponding to the presentation document.
Based on this, in an embodiment, the method further comprises:
determining a first time axis corresponding to the first voice data and a time point for acquiring the first image data;
generating a time corresponding relation between data in the simultaneous interpretation data according to the first time axis and the time point;
and correspondingly storing each data in the simultaneous interpretation data by utilizing the time corresponding relation.
In an embodiment, when the server receives the first voice data, a receiving time is determined, an ending time is determined according to a duration of the first voice data, and the first time axis for the first voice data is generated according to the receiving time and the ending time. In another embodiment, the first terminal determines a start time and a duration of the first voice data when the first voice data is collected, and sends the start time and the duration to the server, and the server determines a first time axis of the first voice data according to the start time and the duration.
In an embodiment, a time point corresponding to when the server acquires the first image data may be adopted as the time point of acquiring the first image data. In another embodiment, the first terminal determines a corresponding time point when intercepting the first image data, sends the determined time point and the first image data to a server together, and the server receives the time point and the first image data and takes the time point as a time point for acquiring the first image data.
Here, from the first time axis and the time point, a time relationship between the first voice data and the first image data may be determined; the first translation text, the second voice data, the typeset document and the abstract document in the simultaneous interpretation data are all generated on the basis of the first voice data, so that the time relationship between the first translation text, the second voice data, the typeset document and the abstract document and the first voice data can be determined. Based on this, a time correspondence between each data in the simultaneous interpretation data can be generated.
Specifically, the time correspondence may be embodied in the form of a time axis, that is, a second time axis is generated; the second time axis may be based on the time axis of the second voice data, and is marked with the starting time point and ending time point corresponding to each segmented speech in the second voice data.
For the first translation text, the time corresponding to each paragraph in the first translation text is marked on the second time axis; the time may specifically adopt a time point of the segmented speech in the second speech data corresponding to each segment on the second time axis.
For the typeset document, the time corresponding to the typeset document is marked on the second time axis, and specifically, the time point of the segmented speech in the second speech data corresponding to the typeset document on the second time axis may be adopted.
For the summary document, the time corresponding to the summary document is marked on the second time axis, and specifically, the time point of the segmented speech in the second speech data corresponding to the summary document on the second time axis may be adopted.
And for the image processing result, the time corresponding to the image processing result is marked on the second time axis. Here, the relationship between the time corresponding to the image processing result and the second time axis may be determined according to a relationship between a first time axis and a time point of the first image data.
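The time correspondence could be assembled as in the following sketch, which derives the first time axis from the receiving time and duration of the first voice data and marks a (start, end) span for each segmented speech; the structure is illustrative only.

```python
# Sketch of building the time correspondence: the first time axis comes from the
# receiving time and duration of the first voice data, the image capture time is a
# single point, and each segmented speech gets a (start, end) span on the second
# time axis. Names and structure are illustrative only.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TimeCorrespondence:
    first_time_axis: Tuple[float, float]         # (receiving time, ending time)
    image_capture_time: float                    # when the first image data was acquired
    paragraph_spans: List[Tuple[float, float]]   # start/end of each segmented speech


def build_time_correspondence(receiving_time: float,
                              voice_duration: float,
                              image_capture_time: float,
                              segment_durations: List[float]) -> TimeCorrespondence:
    spans: List[Tuple[float, float]] = []
    cursor = 0.0
    for d in segment_durations:                  # mark each segment on the second time axis
        spans.append((cursor, cursor + d))
        cursor += d
    return TimeCorrespondence(
        first_time_axis=(receiving_time, receiving_time + voice_duration),
        image_capture_time=image_capture_time,
        paragraph_spans=spans,
    )
```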
In practical applications, in order to help a user understand the first voice data and better take in the content of the speech, the paragraphs in the first translation text and the corresponding segmented speech are sent to the client together, so that the user can listen to the corresponding speech while viewing a paragraph of the translation text.
Based on this, in an embodiment, sending the first translation text and the second voice data in the simultaneous interpretation data to the client includes:
sending at least one paragraph in the first translation text and the segmented voice corresponding to the paragraph to a client; and the segmented voice is used for playing when the client displays the paragraph corresponding to the segmented voice.
Here, the paragraph and the segmented speech corresponding to the paragraph are sent to the client together, and when the paragraph is presented by the client, the client may play the segmented speech corresponding to the paragraph at the same time.
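For illustration, one possible shape of the message pushed to the client for a single paragraph; the field names are assumptions rather than anything defined by the disclosure.

```python
# Illustrative payload for one paragraph: the paragraph of the first translation
# text together with its segmented speech, so the client plays the audio while the
# text is displayed. The JSON field names are assumptions.

import base64
import json


def paragraph_message(paragraph: str, segmented_speech: bytes, start_second: float) -> str:
    return json.dumps({
        "paragraph": paragraph,
        "segmented_speech_b64": base64.b64encode(segmented_speech).decode("ascii"),
        "start_second": start_second,
    })
```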
Specifically, the method may further include: generating a target document in a preset format according to the typesetting document and the abstract document; the target document is used for presenting at the client when the first voice data is played.
According to the typesetting document and the abstract document, the server generates a target document containing the contents of the typesetting document and the abstract document, and the target document can display the extracted abstract and the typesetting together.
The method provided by the embodiments of the application can be applied to simultaneous interpretation scenarios such as conference interpretation. In such a scenario, translating the presentation document of the conference lets a user understand the lecture content of the speaker more clearly in combination with the presentation document; typesetting and abstracting the lecture content (namely the first voice data) helps users summarize and search it; and sending at least one paragraph of the first translation text together with the segmented speech corresponding to that paragraph to the client helps the user take in the speech content even when the translated text is dense.
It should be understood that the order of steps (e.g., obtaining the first translated text, generating the second speech data, obtaining the typeset document, obtaining the abstract document, and obtaining the image processing result) described in the above embodiments does not imply any order of execution, and the order of execution of the steps should be determined by their functions and inherent logic, and should not limit the implementation process of the embodiments of the present application.
The simultaneous interpretation method provided by the embodiments of the application obtains first data to be processed; translates first voice data in the first data to be processed to obtain a first translation text; generates second voice data according to the first translation text; and performs at least one of: obtaining a typeset document according to the first voice data and the first translation text; and performing image character processing on first image data in the first data to be processed to obtain an image processing result, the first image data at least comprising a display document corresponding to the first voice data. The language corresponding to the first voice data is different from the language corresponding to the typeset document; the language corresponding to the first voice data is different from the language corresponding to the first translation text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the characters displayed by the first image data is different from the language of the characters included in the image processing result; and the first translation text, the typeset document, the second voice data and the image processing result are used for presentation at the client when the first voice data is played. In this way, the text translation result, the voice translation result, the typeset document corresponding to the text translation result, and the translation result of the display document corresponding to the first voice data are all provided to the user, so that the content of the first voice data can be displayed more intuitively and comprehensively; the user can understand the summary of the speech content and the content of the display document through the client, which helps the user better take in the speech content and improves the user experience, and also facilitates the user's subsequent summarizing and organizing of the speech content.
The embodiments of the application further provide an embodiment of the simultaneous interpretation method. Fig. 3 is a schematic diagram of a system architecture applied to the simultaneous interpretation method according to the embodiment of the present application; as shown in fig. 3, the system is applied to simultaneous interpretation in a conference and includes: a machine co-transmission server, a voice recognition server, a translation server, an audience mobile terminal, a PC client, a display screen, a conference management device, a TTS server, an OCR server and an NLP server. Considering that using only one server for simultaneous interpretation would place a high performance requirement on that server, each function can be realized on a plurality of servers, specifically the voice recognition server, translation server, TTS server, OCR server, NLP server, conference management device and the like, to improve data processing efficiency and ensure high timeliness of simultaneous interpretation.
The PC client is used for acquiring the audio of the speech content of the speaker in the conference, namely acquiring the first voice data; projecting the document to be displayed onto the display screen in a screen projection mode, the display screen showing the document to the other users participating in the conference; and capturing first image data for the document. Here, the document may be a PPT document, a Word document, or the like.
And the PC client is also used for sending the acquired first voice data and the acquired first image data to the machine simultaneous transmission server.
Here, the PC client may have a screen capture function in addition to functions such as voice capture, screen projection, and control, so that a document currently displayed by a speaker may be acquired in real time through screen capture operation on a screen, that is, first image data is acquired; and, the time corresponding to the first image data can be recorded correspondingly, and the first image data and the corresponding time can be sent to the machine simultaneous transmission server.
The machine co-transmission server is used for sending the first voice data to the voice recognition server, where the first voice data is recognized by the voice recognition server using voice recognition technology to obtain a recognition text that is sent back to the machine co-transmission server; and,
sending the identification text to a translation server; the recognition text is translated by the translation server by using a preset translation model to obtain a first translation text and sent to the machine co-transmission server;
the machine simultaneous transmission server is further configured to send the first translation text and the first image data to a conference management device.
Here, the first translation text and the first image data respectively carry corresponding time information, where the time information corresponding to the translation result may include time information of each segmented voice in the first voice data corresponding to each segment in the translation result.
The conference management device is used for receiving the first translation text and the first image data;
sending the first translation text to an NLP server; at least one of a typeset document and a summary document is obtained by the NLP server according to the first translation text;
sending the first image data to an OCR server; the first image data is received by the OCR server, characters in the first image data are extracted, and positions of the characters are determined; sending the extracted characters and the positions of the characters to conference management equipment;
sending the received extracted characters to the translation server, receiving the translation result sent by the translation server, and generating an image processing result according to the translation result and the positions of the extracted characters; here, the translation server receives the extracted characters, translates them, and sends the translation result to the conference management device; and,
sending the first translated text to a TTS server; and the TTS server receives the first translation text, generates second voice data according to the first translation text and sends the second voice data to the conference management equipment.
The conference management equipment is also used for sending the first translation text, the second voice data, the typesetting document, the abstract document and the image processing result to the mobile terminal of the audience.
Specifically, the OCR server is configured to obtain, through an OCR technology and an interface positioning technology, extractable characters in a presentation document corresponding to the first image data and interface positioning information corresponding to the characters; performing multi-language translation on the extracted content by a machine translation technology; according to the interface positioning information, corresponding different language translation contents are combined into the picture to obtain an image processing result; and storing the image processing result in a corresponding server according to the language.
Specifically, the NLP server is configured to generate at least one of a layout document and a digest document according to the first translation text.
The NLP server is configured to generate a typeset document according to the first voice data and the first translation text by applying NLP and VAD techniques, and to generate a summary document according to the first translation text. Specifically, Voice Activity Detection (VAD) technology is applied to the first voice data to determine mute periods; when the duration of a mute period exceeds a preset duration threshold, the mute period is taken as a mute point in the first voice data. The first translation text corresponding to the first voice data is pre-segmented according to the mute points to obtain at least one pre-segmented paragraph; the context corresponding to each mute point in the first translation text is obtained, semantic analysis is performed on the context by applying NLP technology, whether to keep the pre-segmented cut is determined, and at least one paragraph is finally determined.
And the NLP server is also used for sorting the abstract content of the first translation text by using an abstract extraction technology of the NLP to obtain an abstract document.
Specifically, the conference management device is further configured to fill in a preset form according to typesetting documents of different languages and abstract documents of different languages, so as to generate a target form.
The preset table may be in the format of table 1 below, and may include: meeting name, meeting time, topic name of the meeting, topic time, speaker, identification content, translation content of language A, translation content of language B, abstract content of language A and abstract content of language B.
The co-transmission account number, the conference name, the conference time, the topic name of the conference, the topic time and the speaker can be filled in by the user in advance according to the actual situation.
The conference management equipment correspondingly fills in language one translation content, language two translation content, … … and language N translation content in a language corresponding table 1 according to a typesetting document of at least one language or at least one image processing result; and correspondingly filling the summary content of the language one, the summary content of the language two, … … and the summary content of the language N in the language corresponding table 1 according to the summary document of at least one language, thereby realizing the arrangement and induction of the conference content.
TABLE 1 (provided as image PCTCN2019109677-APPB-000001; a form containing the fields listed above)
The TTS server provides a TTS-based simultaneous interpretation service; specifically, the TTS server is configured to invoke a TTS service on the first translation text in the different languages, and synthesize audio content in the different languages, that is, obtain the second voice data.
Specifically, the conference management device is further configured to store the first translation text, the second voice data, and at least one of the typeset document, the abstract document and the image processing result in a database of the corresponding language according to the time correspondence. The time correspondence may be implemented by using a time axis; a specific implementation is described in the method shown in fig. 2 and is not repeated here.
Through the time corresponding relation, when the mobile terminal pulls the corresponding first translation document according to the time axis, the corresponding second voice data can be obtained together; at least one of the corresponding layout document, the digest document, and the image processing result may also be obtained together.
In this embodiment of the application, simultaneous interpretation of PPT documents is added through the OCR server, typesetting and abstracts of the conference content are provided through the NLP server, and a listening service for machine simultaneous interpretation is added through the TTS server; this improves the sensory experience of users in a conference, helps users better understand the speech content and document content, and facilitates the audience's subsequent summarizing and organizing of the conference content.
In order to implement the simultaneous interpretation method of the embodiment of the application, the embodiment of the application also provides a simultaneous interpretation device. FIG. 4 is a schematic diagram of a structure of a simultaneous interpretation device according to an embodiment of the present application; as shown in fig. 4, the simultaneous interpretation apparatus includes:
an acquisition unit 41 configured to obtain first data to be processed;
a first processing unit 42 configured to translate first voice data in the first data to be processed to obtain a first translated text;
a second processing unit 43 configured to generate second speech data from the first translated text;
a third processing unit 44 configured to perform at least one of:
obtaining a typesetting document according to the first voice data and the first translation text;
performing image character processing on first image data in the first data to be processed to obtain an image processing result; the first image data at least comprises a display document corresponding to the first voice data; wherein,
the language corresponding to the first voice data is different from the language corresponding to the typesetting document; the language corresponding to the first voice data is different from the language corresponding to the first translation text; the language type corresponding to the first voice data is different from the language type corresponding to the second voice data; the language of the characters displayed by the first image data is different from the language of the characters included in the image processing result; the first translation text, the typeset document, the second voice data and the image processing result are used for presenting at the client when the first voice data is played.
In an embodiment, the third processing unit 44 is configured to perform voice activity detection on the first voice data, and determine a mute point in the first voice data;
acquiring a context corresponding to the mute point in the first translation text;
segmenting the first translation text according to the mute point and the semantics of the context to obtain at least one paragraph;
and typesetting the at least one paragraph to obtain the typeset document.
In an embodiment, the second processing unit 43 is configured to segment the first translation text to obtain at least one paragraph in the first translation text;
generating at least one segmented speech from at least one paragraph in the first translated text;
and synthesizing second voice data corresponding to the first translation text by using the at least one segmented voice.
Here, the second processing unit 43 segments the first translation text, and the same segmentation method as the first processing unit 42 may be adopted.
In an embodiment, the third processing unit 44 is configured to perform abstract extraction on the first translated text, so as to obtain an abstract document for the first translated text; the summary document is used for presenting at the client when the first voice data is played.
Here, the abstract document is provided to the user, helping the user summarize and better take in the content of the first voice data.
In an embodiment, the third processing unit 44 is configured to determine a text in the first image data and a position corresponding to the text;
extracting characters in the first image data, and translating the extracted characters;
and generating the image processing result according to the translated characters.
Specifically, the image processing result may include at least one of: second image data, second translated text.
The third processing unit 44 is configured to perform at least one of the following to generate the image processing result:
replacing the characters corresponding to the positions in the first image data according to the translated characters to obtain second image data, and taking the second image data as the image processing result;
and generating a second translation text by using the translated characters, and taking the second translation text as the image processing result.
In the embodiment of the application, the simultaneous interpretation data obtained by utilizing the first to-be-processed data corresponds to at least one language;
the device further comprises: a storage unit; the storage unit is configured to classify and cache the simultaneous interpretation data corresponding to at least one language according to languages.
In one embodiment, the apparatus further comprises: a communication unit; the communication unit is configured to receive an acquisition request sent by a client; the acquisition request is used for acquiring simultaneous interpretation data; the acquisition request at least comprises: target language;
obtaining simultaneous interpretation data corresponding to the target language from the cached simultaneous interpretation data;
and sending the obtained simultaneous interpretation data corresponding to the target language to a client.
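By way of example and not limitation, the language-classified cache and the handling of an acquisition request carrying a target language can be sketched as follows; the data layout is an assumption made only for illustration.

```python
from collections import defaultdict

class InterpretationCache:
    """Caches simultaneous interpretation data classified by language."""

    def __init__(self):
        self._by_language = defaultdict(list)

    def store(self, language, item):
        # `item` may be a subtitle paragraph, a segmented speech chunk,
        # an image processing result, etc., already produced for `language`.
        self._by_language[language].append(item)

    def fetch(self, target_language):
        # Serve an acquisition request that carries a target language.
        return list(self._by_language.get(target_language, []))

# cache = InterpretationCache()
# cache.store("en", {"text": "Hello everyone", "audio": b"..."})
# response = cache.fetch("en")   # sent back to the requesting client
```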
In one embodiment, the acquisition request further includes a target time;
the communication unit is further configured to, when the simultaneous interpretation data corresponding to the target language is acquired from the cached content, acquire the simultaneous interpretation data corresponding to the target time from the cache according to a preset time correspondence; the time corresponding relation represents the time relation among all data in the simultaneous interpretation data; the time correspondence is generated in advance according to a time axis of the first voice data and a time point at which the first image data is acquired.
In an embodiment, the storage unit is further configured to determine a first time axis corresponding to the acquisition of the first voice data and a time point of the acquisition of the first image data;
generating a time corresponding relation between data in the simultaneous interpretation data according to the first time axis and the time point;
and correspondingly storing each data in the simultaneous interpretation data by utilizing the time corresponding relation.
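By way of example and not limitation, generating the time correspondence from the first voice data's time axis and the image capture time points, and then retrieving data by a target time, can be sketched as follows.

```python
import bisect

def build_time_correspondence(paragraph_times, image_capture_times):
    """Merge subtitle paragraphs and slide captures onto one timeline.

    `paragraph_times` maps a start time on the first voice data's time
    axis to a paragraph of interpretation data; `image_capture_times`
    maps the capture time of each slide image to its processing result.
    """
    timeline = sorted(
        [(t, "paragraph", data) for t, data in paragraph_times.items()]
        + [(t, "image", data) for t, data in image_capture_times.items()],
        key=lambda entry: entry[0],
    )
    return timeline

def lookup(timeline, target_time):
    """Return every piece of interpretation data up to `target_time`."""
    times = [entry[0] for entry in timeline]
    index = bisect.bisect_right(times, target_time)
    return timeline[:index]
```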
In an embodiment, the communication unit is further configured to send at least one paragraph in the first translation text and segmented speech corresponding to the paragraph to a client; when the paragraph is displayed by the client, the segmented voice corresponding to the paragraph is played by the client.
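By way of example and not limitation, pairing each paragraph with its segmented speech for delivery to the client might be sketched as follows; the JSON field names are assumptions, not a defined protocol.

```python
import base64
import json

def build_paragraph_payload(paragraphs, segmented_speech):
    """Pair each translated paragraph with its synthesized speech segment."""
    items = [
        {
            "paragraph": text,
            # raw audio bytes are base64-encoded for transport
            "speech": base64.b64encode(audio).decode("ascii"),
        }
        for text, audio in zip(paragraphs, segmented_speech)
    ]
    return json.dumps({"items": items})
```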
In practical application, the obtaining unit 41 may be implemented by a communication interface; the first processing unit 42, the second processing unit 43 and the third processing unit 44 may be implemented by a processor in the server, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Microcontroller Unit (MCU) or a Field Programmable Gate Array (FPGA); the communication unit may be implemented by a communication interface in the server.
It should be noted that: when the apparatus provided in the above embodiment performs simultaneous interpretation, the division into the above program modules is merely illustrative; in practical applications, the processing may be distributed to different program modules as needed, that is, the internal structure of the device may be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus provided in the above embodiment and the embodiment of the simultaneous interpretation method belong to the same concept; the specific implementation process thereof is described in the method embodiment and is not repeated here.
Based on the hardware implementation of the above device, an embodiment of the present application further provides a server. Fig. 5 is a schematic diagram of the hardware composition structure of the server according to the embodiment of the present application. As shown in Fig. 5, the server 50 includes a memory 53, a processor 52, and a computer program stored on the memory 53 and executable on the processor 52; when the processor 52 in the server executes the program, the method provided by one or more of the above server-side technical solutions is implemented.
In particular, the processor 52 located at the server 50, when executing the program, implements: obtaining first data to be processed; translating first voice data in the first data to be processed to obtain a first translation text; generating second voice data according to the first translation text; and performing at least one of:
obtaining a typesetting document according to the first voice data and the first translation text;
performing image character processing on first image data in the first data to be processed to obtain an image processing result; the first image data at least comprises a display document corresponding to the first voice data; the first translation text, the typeset document, the second voice data and the image processing result are used for presenting at the client when the first voice data is played.
It should be noted that, the specific steps implemented when the processor 52 located in the server 50 executes the program have been described in detail above, and are not described herein again.
It will be appreciated that the server also includes a communication interface 51; the various components in the server are coupled together by a bus system 54. It will be appreciated that the bus system 54 is configured to enable connection and communication between these components. The bus system 54 includes a power bus, a control bus and a status signal bus in addition to the data bus.
It will be appreciated that the memory 53 in this embodiment may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of the present application are intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present application may be applied to the processor 52 or implemented by the processor 52. The processor 52 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 52 or by instructions in the form of software. The processor 52 described above may be a general purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 52 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, the storage medium being located in the memory; the processor 52 reads the information in the memory and completes the steps of the foregoing method in combination with its hardware.
The embodiment of the present application also provides a storage medium, specifically a computer storage medium, and more specifically a computer readable storage medium. Stored thereon are computer instructions, i.e. a computer program, which, when executed by a processor, performs the method provided by one or more of the above server-side technical solutions.
In the several embodiments provided in the present application, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit can be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The technical means described in the embodiments of the present application may be arbitrarily combined without conflict.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (14)

  1. A simultaneous interpretation method is applied to a server and comprises the following steps:
    obtaining first data to be processed;
    translating first voice data in the first data to be processed to obtain a first translation text;
    generating second voice data according to the first translation text; and performing at least one of:
    obtaining a typesetting document according to the first voice data and the first translation text;
    performing image character processing on first image data in the first data to be processed to obtain an image processing result; the first image data at least comprises a display document corresponding to the first voice data; wherein,
    the language corresponding to the first voice data is different from the language corresponding to the typesetting document; the language corresponding to the first voice data is different from the language corresponding to the first translation text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the characters displayed in the first image data is different from the language of the characters included in the image processing result; the first translation text, the typesetting document, the second voice data and the image processing result are used for presenting at the client when the first voice data is played.
  2. The method of claim 1, wherein the obtaining a typesetting document according to the first voice data and the first translation text comprises:
    performing voice activity detection on the first voice data, and determining a mute point in the first voice data;
    acquiring a context corresponding to the mute point in the first translation text;
    segmenting the first translation text according to the mute point and the semantics of the context to obtain at least one paragraph;
    and typesetting the at least one paragraph to obtain the typeset document.
  3. The method of claim 1, wherein the generating second voice data according to the first translation text comprises:
    segmenting the first translation text to obtain at least one paragraph in the first translation text;
    generating at least one segmented speech from at least one paragraph in the first translation text;
    and synthesizing second voice data corresponding to the first translation text by using the at least one segmented voice.
  4. The method of claim 1, wherein the method further comprises:
    performing summary extraction on the first translation text to obtain a summary document for the first translation text; the summary document is used for presenting at the client when the first voice data is played.
  5. The method of claim 1, wherein the performing image character processing on the first image data to obtain an image processing result comprises:
    determining characters in the first image data and positions corresponding to the characters;
    extracting characters in the first image data, and translating the extracted characters;
    and generating the image processing result according to the translated characters.
  6. The method of claim 5, wherein the generating the image processing result according to the translated characters comprises at least one of:
    replacing the characters corresponding to the positions in the first image data according to the translated characters to obtain second image data, and taking the second image data as the image processing result;
    and generating a second translation text by using the translated characters, and taking the second translation text as the image processing result.
  7. The method according to any one of claims 1 to 6, wherein the simultaneous interpretation data obtained by using the first data to be processed corresponds to at least one language; the method further comprises the following steps:
    and classifying and caching the simultaneous interpretation data corresponding to at least one language according to languages.
  8. The method of claim 7, wherein the method further comprises:
    receiving an acquisition request sent by a client; the acquisition request is used for acquiring simultaneous interpretation data; the acquisition request at least comprises: target language;
    obtaining simultaneous interpretation data corresponding to the target language from the cached simultaneous interpretation data;
    and sending the obtained simultaneous interpretation data corresponding to the target language to a client.
  9. The method of claim 8, wherein the acquisition request further includes a target time; when the simultaneous interpretation data corresponding to the target language is obtained from the cached content, the method further includes:
    according to a preset time corresponding relation, simultaneous interpretation data corresponding to the target time are obtained from a cache; the time corresponding relation represents the time relation among all data in the simultaneous interpretation data; the time correspondence is generated in advance according to a time axis of the first voice data and a time point at which the first image data is acquired.
  10. The method of claim 9, wherein the method further comprises:
    determining a first time axis corresponding to the first voice data and a time point for acquiring the first image data;
    generating a time corresponding relation between data in the simultaneous interpretation data according to the first time axis and the time point;
    and correspondingly storing each data in the simultaneous interpretation data by utilizing the time corresponding relation.
  11. The method of claim 8, wherein the sending the first translation text and the second voice data in the simultaneous interpretation data to a client comprises:
    sending at least one paragraph in the first translation text and the segmented voice corresponding to the paragraph to a client; when the paragraph is displayed by the client, the segmented voice corresponding to the paragraph is played by the client.
  12. A simultaneous interpretation apparatus comprising:
    an acquisition unit configured to acquire first data to be processed;
    the first processing unit is configured to translate first voice data in the first data to be processed to obtain a first translation text;
    the second processing unit is configured to generate second voice data according to the first translation text;
    a third processing unit configured to perform at least one of:
    obtaining a typesetting document according to the first voice data and the first translation text;
    performing image character processing on first image data in the first data to be processed to obtain an image processing result; the first image data at least comprises a display document corresponding to the first voice data; wherein,
    the language corresponding to the first voice data is different from the language corresponding to the typesetting document; the language corresponding to the first voice data is different from the language corresponding to the first translation text; the language corresponding to the first voice data is different from the language corresponding to the second voice data; the language of the characters displayed in the first image data is different from the language of the characters included in the image processing result; the first translation text, the typesetting document, the second voice data and the image processing result are used for presenting at the client when the first voice data is played.
  13. A server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 1 to 11 when executing the program.
  14. A storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 11.
CN201980099995.2A 2019-09-30 2019-09-30 Simultaneous interpretation method, device, server and storage medium Pending CN114341866A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/109677 WO2021062757A1 (en) 2019-09-30 2019-09-30 Simultaneous interpretation method and apparatus, and server and storage medium

Publications (1)

Publication Number Publication Date
CN114341866A true CN114341866A (en) 2022-04-12

Family

ID=75336728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980099995.2A Pending CN114341866A (en) 2019-09-30 2019-09-30 Simultaneous interpretation method, device, server and storage medium

Country Status (2)

Country Link
CN (1) CN114341866A (en)
WO (1) WO2021062757A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114023317A (en) * 2021-11-04 2022-02-08 五华县昊天电子科技有限公司 Voice translation system based on cloud platform

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI297123B (en) * 2004-12-29 2008-05-21 Delta Electronics Inc Interactive entertainment center
CN101714140A (en) * 2008-10-07 2010-05-26 英业达股份有限公司 Instant translation system with multimedia display and method thereof
CN109696748A (en) * 2019-02-14 2019-04-30 郑州诚优成电子科技有限公司 A kind of augmented reality subtitle glasses for synchronous translation
CN110121097A (en) * 2019-05-13 2019-08-13 深圳市亿联智能有限公司 Multimedia playing apparatus and method with accessible function

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021854A (en) * 2006-10-11 2007-08-22 鲍东山 Audio analysis system based on content
CN109614628A (en) * 2018-11-16 2019-04-12 广州市讯飞樽鸿信息技术有限公司 A kind of interpretation method and translation system based on Intelligent hardware

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818747A (en) * 2022-04-21 2022-07-29 语联网(武汉)信息技术有限公司 Computer-aided translation method and system of voice sequence and visual terminal
CN114818747B (en) * 2022-04-21 2024-08-09 语联网(武汉)信息技术有限公司 Computer-aided translation method and system for voice sequence and visual terminal

Also Published As

Publication number Publication date
WO2021062757A1 (en) 2021-04-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination