KR20190046124A - Method and apparatus for real-time automatic interpretation based on context information - Google Patents

Method and apparatus for real-time automatic interpretation based on context information

Info

Publication number
KR20190046124A
Authority
KR
South Korea
Prior art keywords
context information
encoding
unit
node
real
Prior art date
Application number
KR1020170139323A
Other languages
Korean (ko)
Inventor
김운
김영길
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원
Priority to KR1020170139323A
Publication of KR20190046124A

Classifications

    • G06F17/289

Landscapes

  • Machine Translation (AREA)

Abstract

The present disclosure relates to a method and apparatus for real-time automatic interpretation based on context information. According to one embodiment of the present disclosure, the method for real-time automatic interpretation using context information may comprise the steps of: encoding current speech content; encoding the context information related to the current speech content; correcting the encoded result of the current speech content based on the encoded result of the context information; and decoding the current speech content based on the encoded result of the corrected current speech content.

Description

METHOD AND APPARATUS FOR REAL-TIME AUTOMATIC INTERPRETATION BASED ON CONTEXT INFORMATION

The present disclosure relates to automatic interpretation technology, and more particularly, to a method and apparatus for providing automatic interpretation based on contextual information in real time.

Real-time automatic interpretation technology refers to a technique of receiving speech data in the original language of a speaker and automatically translating the speech data into a target language of the listener in real time.

In the prior art, automatic interpretation systems based on statistical machine translation (SMT) have had difficulty providing interpretation in real time because the translation process can be performed only after the utterance ends. In addition, a deep learning-based automatic interpretation method can provide real-time interpretation using a sequential deep learning model or a sequence-to-sequence deep learning model, but when the training data is insufficient or differs greatly from the input data, its performance deteriorates significantly.

A technical object of the present invention is to provide a real-time automatic interpretation method and apparatus with improved interpretation accuracy obtained by using the speaker's input data and context information together.

A technical object of the present invention is to provide a real-time automatic interpretation method and apparatus in which interpretation accuracy is improved by using the speaker's topic, keywords, previous utterances, and the like as context information and applying a convolution network method.

The technical objects to be achieved by the present disclosure are not limited to those mentioned above, and other technical objects not mentioned will be clearly understood by those skilled in the art from the following description.

A real-time automatic interpretation method using context information according to an aspect of the present disclosure may include: encoding a current utterance; encoding context information related to the current utterance; correcting the encoding result of the current utterance based on the encoding result of the context information; and decoding the current utterance based on the corrected encoding result of the current utterance.

The features briefly summarized above are only exemplary aspects of the detailed description that follows and are not intended to limit the scope of the disclosure.

According to the present disclosure, a real-time automatic interpretation method and apparatus with improved interpretation accuracy can be provided by using the speaker's input data and context information together.

According to the present disclosure, it is possible to provide a real-time automatic interpretation method and apparatus that improves interpreting accuracy by using a subject, keyword, previous speech content, etc. of a speaker as context information and applying a convolution network method.

The effects obtainable from the present disclosure are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a diagram for explaining a deep learning-based automatic translation method to which the present disclosure can be applied.
FIG. 2 is a diagram for explaining deep learning-based automatic translation using context information according to the present disclosure.
FIG. 3 is a diagram for explaining an example of deep learning-based automatic translation using context information according to the present disclosure.
FIG. 4 is a diagram for explaining a context information encoding unit and a context information combination unit in an example using multiple types of context information according to the present disclosure.
FIG. 5 is a diagram for explaining an example of a context information combination unit according to the present disclosure.
FIG. 6 is a flowchart for explaining a real-time automatic translation method using context information according to the present disclosure.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily carry them out. However, the present disclosure may be embodied in many different forms and is not limited to the embodiments described herein.

In the following description of the embodiments of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present disclosure rather unclear. Parts not related to the description of the present disclosure in the drawings are omitted, and like parts are denoted by similar reference numerals.

In the present disclosure, when an element is referred to as being "connected", "coupled", or "linked" to another element, this may include not only a direct connection relationship but also an indirect connection relationship. Also, when an element is said to "comprise" or "have" another element, this means that it may further include other elements rather than excluding them.

In the present disclosure, the terms first, second, and the like are used only to distinguish one element from another and do not limit the order or importance of the elements unless specifically stated otherwise. Thus, within the scope of this disclosure, a first component in one embodiment may be referred to as a second component in another embodiment, and similarly a second component in one embodiment may be referred to as a first component in another embodiment.

In the present disclosure, the components that are distinguished from each other are intended to clearly illustrate each feature and do not necessarily mean that components are separate. That is, a plurality of components may be integrated into one hardware or software unit, or a single component may be distributed into a plurality of hardware or software units. Thus, unless otherwise noted, such integrated or distributed embodiments are also included within the scope of this disclosure.

In the present disclosure, the components described in the various embodiments are not necessarily essential components, and some may be optional components. Thus, embodiments consisting of a subset of the components described in one embodiment are also included within the scope of the present disclosure. Also, embodiments that include other elements in addition to the elements described in the various embodiments are also included in the scope of the present disclosure.

In the following, various examples according to the present disclosure will be described.

The present disclosure includes various examples of methods and apparatus for using context information in real-time automatic interpretation. More specifically, a method is described of recognizing speech input in real time and using context information related to the recognized speech when performing artificial intelligence-based real-time automatic interpretation on that speech.

In the present disclosure, real-time automatic interpretation using context information includes a deep learning-based real-time automatic interpretation method that uses sentences that have already been uttered, or sentences similar to them.

More specifically, the context information may include a topic of a speaker, a previous utterance of a speaker, a previous utterance translation of a speaker, a key word of a speaker, and the like. Such various context information can be encoded and used for the translation of the current utterance contents. In addition, by applying a convolution network learning technique to real-time automatic interpretation using such various context information, the performance of real-time automatic interpretation can be maximized.

Real-time automatic interpretation technology refers to a technique of receiving speech data in the source language of a speaker and automatically translating the speech data into a target language of the listener in real time. That is, a real-time automatic interpretation system can be used in situations where speech data in the source language is input in real time (for example, when people using different languages are conversing, or when listening to a lecture given in another language) and outputs speech data in the target language to help the listener understand the speaker.

A conventional automatic translation and interpretation system for processing speech data input in real time receives a spoken utterance in one language as input, converts the speech recognition result into text, and uses the text as input to the translation engine. The system automatically translates the input text from one language to another (e.g., from Korean to English) using a rule-based or statistical machine translation (SMT) approach, and the result can then be synthesized into speech and delivered to the user. Such a system is complicated and lacks real-time capability because it does not receive the translation input until a sentence or the speaker's utterance is finished. In addition, rule-based and statistical (SMT) methods are not well suited to real-time automatic interpretation because their performance is not high.

Further, the deep learning-based method used in conventional real-time automatic interpretation systems can be applied to both speech recognition and automatic translation, and has had a great impact across the industry because of its excellent performance. Deep learning can, in theory, achieve the desired performance if sufficient training data is provided. However, it is difficult to obtain the desired performance when the training data is inappropriate or insufficient. For example, when learning is performed only with female voice data, the performance of recognizing male voices deteriorates, and speech not covered by the training data cannot be interpreted at all or is handled with markedly degraded performance.

In order to address this problem of deep learning-based automatic interpretation, according to the various examples of the present disclosure, the performance of real-time automatic interpretation can be improved by using context information.

First, the operating principle of deep learning-based translation will be described, and then the features of this disclosure will be described.

FIG. 1 is a diagram for explaining a deep learning-based automatic translation method to which the present disclosure can be applied.

The deep learning-based automatic translation method can be summarized as generating a learning model from training data and then converting a source language input sentence in one language into a target language sentence in another language using the generated model. The learning stage largely consists of encoding and decoding.

As an example of a deep learning-based automatic translation system, FIG. 1 shows a sequence-to-sequence deep learning-based automatic translation structure 100. For example, the deep learning-based automatic translation structure 100 includes encoding nodes 111, 112, 113, and 114 corresponding to the encoding unit 110, and decoding nodes 121, 122, 123, 124, and 125 corresponding to the decoding unit 120. Each of these nodes may correspond to, for example, a node of a Recurrent Neural Network (RNN) or of a Long Short-Term Memory (LSTM) network, which is a special type of RNN.

In the example of FIG. 1, the source language is Korean and the target language is English: a Korean sentence meaning "I go to school." is translated into the English sentence "I go to school." In the example of FIG. 1, w represents the resulting vector value obtained by encoding the input sentence. Specifically, using the deep learning-based automatic translation system shown in FIG. 1, the source sentence is encoded to generate an output value w in vector form. In the decoding step, the value of the next node can be generated using the w vector and the value of the previous node.

More specifically, in the encoding unit 110, a source language sentence of the training data can be encoded. A source language sentence consists of elements such as words or tokens, which are input sequentially into the deep learning structure. In the example of FIG. 1, the input sentence consists of four words or tokens, the Korean elements of the sentence meaning "I go to school." The elements of the input sentence are converted into vector form in the encoding unit 110, and the words expressed as vectors are processed by a sequential deep learning model, an RNN (Recurrent Neural Network) or an LSTM network (a special form of RNN). That is, the first element of the source sentence is encoded at node 111, the following elements at nodes 112 and 113, and the last element at node 114, finally producing an output value w in the form of a single vector. This output value is involved in generating each element of the target language sentence in the decoding unit 120.

The decoding unit 120 may be represented by an RNN- or LSTM-based network, the same sequential learning model as the encoding unit 110. That is, the decoding unit 120 may generate the words or tokens of the target language one by one based on the encoding output value w. At the first node 121 of the decoding unit 120, the first element (word or token) of the target language is generated based on the encoding result value w; in the example of FIG. 1, the word generated at the first node 121 is "I". Next, each already generated word or token participates, together with the encoding output value w, in generating the next word or token. For example, the second node 122 of the decoding unit 120 may generate the element "go" based on the result "I" of the previous node 121 and the encoding output value w. The third node 123 may generate "to" based on the result "go" of the previous node 122 and w. The fourth node 124 may generate "school" based on the result "to" of the previous node 123 and w. The fifth node 125 may generate "<EOF>" (End Of Frame) based on the result of the previous node 124 and w.

Finally, based on the encoded vector value w for the input sentence meaning "I go to school," the decoding unit 120 may output the translation result "I go to school."
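As an illustration of the encoder-decoder structure described above, the following is a minimal sketch in PyTorch. It is not taken from the patent: the class names, dimensions, and the greedy decoding loop are assumptions chosen only to show how the encoded vector w and the previously generated token drive each decoding step.

```python
# A minimal sketch of the FIG. 1 sequence-to-sequence structure
# (illustrative names and sizes; not taken from the patent).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)  # encoding nodes 111-114
        self.decoder = nn.LSTM(dim, dim, batch_first=True)  # decoding nodes 121-125
        self.out = nn.Linear(dim, tgt_vocab)

    def encode(self, src_tokens):
        # Encode the source sentence into a single vector-form value "w"
        # (carried here as the encoder's final hidden/cell state).
        _, state = self.encoder(self.src_emb(src_tokens))
        return state

    def decode_step(self, prev_token, state):
        # Generate the next target element from the previous element and w.
        out, state = self.decoder(self.tgt_emb(prev_token), state)
        return self.out(out[:, -1]), state

def translate(model, src_tokens, bos_id, eos_id, max_len=20):
    # Greedy decoding: each step uses w (in the LSTM state) and the word
    # generated at the previous step, mirroring nodes 121 -> 125 of FIG. 1.
    state = model.encode(src_tokens)
    token = torch.tensor([[bos_id]])
    result = []
    for _ in range(max_len):
        logits, state = model.decode_step(token, state)
        token = logits.argmax(dim=-1, keepdim=True)
        if token.item() == eos_id:
            break
        result.append(token.item())
    return result
```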

In real-time automatic translation based on this deep learning, both the source language input sentence and the correct target language sentence (that is, an accurate translation of the source sentence) are given during training, and the learning process is repeated while adjusting each node of the deep learning network. Adjusting the nodes of the deep learning network may include adjusting the parameters and bias values of each RNN or LSTM node used for encoding and decoding.

As described with reference to FIG. 1, the encoded source language input sentence has a great influence on the decoding process and is directly related to translation performance.

In this disclosure, a method is described of generating more appropriate target language words or tokens in the decoding step by using context information in the encoding step, in addition to the input sentence generated in real time from speech.

FIG. 2 is a diagram for explaining an automatic translation based on deep learning using context information according to the present disclosure.

The structure of FIG. 2 includes components corresponding to the encoding unit 110 and the decoding unit 120 of FIG. 1, and additionally includes a context information encoding unit 210 and a context information combination unit 220.

The context information encoding unit 210 may encode various types of context information and output an encoding result value. The types of context information may include the topic of the speaker's current utterance, a previous sentence related to the current utterance, the translation result of a previous utterance, keywords of the current utterance, and the like. The encoded context information corresponding to the output of the context information encoding unit 210 may be denoted by C. Alternatively, the output of the context information encoding unit 210 may be separated according to the type of context information. For example, the encoding result of the first type of context information, corresponding to the topic of the current utterance, may be denoted by C1; the encoding result of the second type, corresponding to the previous sentence related to the current utterance, by C2; the encoding result of the third type, corresponding to the translation result of the previous utterance, by C3; and the encoding result of the fourth type, corresponding to keywords of the current utterance, by C4. The scope of the present disclosure is not limited to these examples of context information types, and other context information relevant to interpretation or translation may also be encoded by the context information encoding unit 210.

The context information combination unit 220 receives the output of the context information encoding unit 210 (i.e., the encoded context information C) and the encoding result value of the current utterance or current input sentence (i.e., the w vector), and outputs a corrected encoding result value Wc based on the encoded context information. For example, the context information combination unit 220 may add or subtract the various types of context information encoding results (e.g., C1, C2, C3, C4) included in the context information encoding value C, or combine only values that are equal to or greater than a predetermined threshold.
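As a hedged illustration of this correction step, the sketch below shows one way the encoded context C might adjust the utterance encoding w into Wc. The masked-addition rule, the threshold value, and the function name are assumptions; the text above only states that components may be added, subtracted, or kept when they meet a threshold.

```python
# Illustrative correction of the utterance encoding w by the encoded context C
# (FIG. 2). The masked-addition rule and the threshold are assumptions.
import torch

def combine_context(w, context_encodings, threshold=0.1):
    # context_encodings: list of encoded context vectors, e.g. [C1, C2, C3, C4]
    C = torch.stack(context_encodings).sum(dim=0)   # aggregate the context encodings
    mask = (C.abs() >= threshold).float()           # keep only salient components
    return w + mask * C                             # corrected encoding result Wc
```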

FIG. 3 is a view for explaining an example of an automatic translation based on the deep learning using the context information according to the present disclosure.

The structure of FIG. 3 includes components corresponding to the encoding unit 110 and the decoding unit 120 of FIG. 1, and further includes a context information encoding unit 310 and a context information combination unit 320.

In the example of FIG. 3, the context information encoding unit 310 and the context information combination unit 320 may correspond to an embodiment of the context information encoding unit 210 and the context information combination unit 220 of FIG. 2. That is, in the example of FIG. 3, for clarity of explanation, the context information encoding unit representatively encodes the context information corresponding to the previous sentence related to the current utterance. Here, the previous sentence is not limited to the utterance immediately preceding the current one, but may correspond to one or more sentences related to the current utterance among the plurality of sentences preceding it.

The encoding unit 110 can generate an output value w by encoding the input sentence, meaning "I go to school," corresponding to the current utterance of the speaker in the source language.

The context information encoding unit 310 may generate an output value C by encoding the context information corresponding to the previous sentence related to the current utterance, a sentence meaning "Even though school is far." For example, the three elements of the previous sentence are converted into vector form in the context information encoding unit 310 and encoded by the respective nodes of the RNN or LSTM network, the sequential deep learning model. The first element of the previous sentence is encoded at context information encoding node 311, the next at node 312, and the last at node 313, finally generating an output value C in the form of a single vector. In this example, the encoded context information C carries a concessive meaning ("even though school is far") in relation to the input sentence "I go to school."

The context information combination unit 320 may output the corrected encoding result value Wc, which reflects the context information, by combining the output value w of the encoding unit 110 and the output value C of the context information encoding unit 310.

For example, when the input sentence meaning "I go to school" is translated without context information (i.e., based only on the encoding result w), as in the example of FIG. 1, the result "I go to school." is obtained. However, considering the context information "Even though school is far," it may be more appropriate for the translation of the input sentence to express the speaker's will. That is, the speaker's intention, "(nevertheless) I go to school," can be grasped through the context information. Thus, if the context information carries a concessive meaning, it is more appropriate for the translation of the input sentence to contain an element expressing the speaker's will, such as the English word "still." To this end, the context information combination unit 320 can generate the corrected encoding result value Wc by adding the concessive meaning to the encoded vector value w of the input sentence "I go to school," based on the result C of the context information encoding unit 310.

Accordingly, the decoding unit 120 can generate the words or tokens of the target language one by one based on the corrected encoding result value Wc. For example, at the first node 121 of the decoding unit 120, the first element (word or token) of the target language is generated based on the corrected encoding result value Wc; in the example of FIG. 3, the word generated at the first node 121 is "I". Next, each already generated word or token participates, together with the corrected encoding result value Wc, in generating the next word or token. For example, the second node 122 of the decoding unit 120 can generate the element "still" based on the result "I" of the previous node 121 and the corrected encoding result value Wc. The third node 123 may generate "go" based on the result "still" of the previous node 122 and Wc. The fourth node 124 may generate "to" based on the result "go" of the previous node 123 and Wc. The fifth node 125 may generate "school" based on the result "to" of the previous node 124 and Wc. The sixth node 126 may generate "<EOF>" (End Of Frame) based on the result "school" of the previous node 125 and Wc.

Finally, based on the encoding result value Wc, obtained by correcting the encoded vector value w of the input sentence meaning "I go to school" with the encoded vector value C of the context information "Even though school is far," the decoding unit 120 can output the translation result "I still go to school."

FIG. 4 is a diagram for explaining a context information encoding unit and a context information combination unit in an example using multiple types of context information according to the present disclosure.

In the example of FIG. 4, the context information encoding unit 410 and the context information combination unit 420 may correspond to an embodiment of the context information encoding unit 210 and the context information combination unit 220 of FIG. 2.

The context information encoding unit 410 may generate a context information encoding result value for each of a plurality of types of context information. For example, the context information encoding unit 410 may include a first type context information encoding unit 411 for encoding first type context information, a second type context information encoding unit 412 for encoding second type context information, a third type context information encoding unit 413 for encoding third type context information, and a fourth type context information encoding unit 414 for encoding fourth type context information.

The scope of the present disclosure is not limited by the example or the number of context information types; it may include only some of the types of context information encoding units shown in FIG. 4, and may further include other types of context information encoding units. In the example of FIG. 4, one type-specific context information encoding unit is included for each type of context information; however, the scope of the present disclosure is not limited to this, and a plurality of type-specific context information encoding units may be provided for the same type of context information.

For example, the first type context information encoding unit 411 may output C1, the result value obtained by encoding the first type of context information, which corresponds to the topic of the speaker's current utterance. More specifically, the topic of the current utterance may be selected in advance by the speaker from among topic categories into which utterance contents have been pre-classified. Alternatively, the topic of the current utterance may be determined by calculating its similarity to the pre-classified topics. For example, one or more topic identifiers (topic_id) or topic code values may be assigned to the topics or topic categories, and the selected or computed topic can be expressed as a particular topic_id value. The topic_id value for the current utterance may be input to and processed at node 411_1 to generate the finally encoded context information C1.

The second type context information encoding unit 412 may output C2, the result value obtained by encoding the second type of context information, which corresponds to the previous sentence related to the current utterance. More specifically, if the speaker's current utterance means "I go to school," the previous utterance meaning "Even though school is far" can be used. In general, the present utterance is an extension of the speaker's previous utterance: content or words that are ambiguous in the current utterance may be expressed precisely in the previous one, which may also carry important information such as prepositional phrases or the referents of pronouns. For example, content such as "the reason I have gathered you here today is ..." is long but functions as a descriptive element of the sentence, and without it, it is difficult to understand what follows. Also, since word order can differ between languages (for example, when translating from Korean to English), translating without the previous utterance can produce inaccurate ordering in some sentences. The elements of the previous utterance are input to and processed at nodes 412_1, 412_2, and 412_3, one element per node, to generate the finally encoded context information C2.

The third type context information encoding unit 413 may output C3, the result value obtained by encoding the third type of context information, which corresponds to the translation result of the previous utterance. More specifically, when the current utterance means "I go to school," the translation result of the related previous utterance, e.g., "Even though school is far," can be encoded and used as context information. The translation result of the previous utterance may contain information useful for translating the current utterance, for example, allowing the same translation to be reused for the same element, and can therefore serve as useful context information for improving translation accuracy. The words of the previous utterance's translation, "even," "school," and "far," are input to and processed at nodes 413_1, 413_2, and 413_3, one element per node, to generate the finally encoded context information C3.

The fourth type context information encoding unit 414 may output C4, the result value obtained by encoding the fourth type of context information, which corresponds to keywords of the current utterance. More specifically, the speaker's core keywords can be encoded and utilized. Since most utterances relate to a particular topic, keywords appropriate to that topic can be used as context information. Such keywords tend to appear throughout the speaker's utterances from beginning to end. Therefore, if the frequency and importance of words in the speaker's utterances are calculated in real time and the resulting keywords are added to the context information, they can greatly assist in translating the current utterance. In addition, using the corresponding words as context information can reduce word selection errors in the translation process and thus contribute to translation performance. Keywords of the current utterance, such as kw1, kw2, and kw3, may be input to and processed at nodes 414_1, 414_2, and 414_3, respectively, to generate the finally encoded context information C4.

Thus, by using one or more of various types of context information, it is possible to improve the translation performance by reducing the ambiguity in understanding and interpreting the context of the conversation.
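To make the structure of FIG. 4 concrete, the following sketch shows one plausible arrangement of type-specific encoders producing C1 through C4. It is an assumption-laden illustration: the use of an embedding for the topic identifier, LSTM encoders for the token sequences, and all names and dimensions are choices of this sketch, not details given in the patent.

```python
# One plausible arrangement of the type-specific context encoders of FIG. 4
# (the topic embedding, the LSTM encoders, and all names are assumptions).
import torch
import torch.nn as nn

class ContextEncoders(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, n_topics, dim=256):
        super().__init__()
        self.topic_emb = nn.Embedding(n_topics, dim)          # unit 411: topic_id -> C1
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.prev_src = nn.LSTM(dim, dim, batch_first=True)   # unit 412: previous utterance -> C2
        self.prev_tgt = nn.LSTM(dim, dim, batch_first=True)   # unit 413: previous translation -> C3
        self.keywords = nn.LSTM(dim, dim, batch_first=True)   # unit 414: keywords -> C4

    def forward(self, topic_id, prev_src_tokens, prev_tgt_tokens, keyword_tokens):
        C1 = self.topic_emb(topic_id)                          # (batch, dim)
        _, (h2, _) = self.prev_src(self.src_emb(prev_src_tokens))
        _, (h3, _) = self.prev_tgt(self.tgt_emb(prev_tgt_tokens))
        _, (h4, _) = self.keywords(self.src_emb(keyword_tokens))
        return C1, h2[-1], h3[-1], h4[-1]                      # C1, C2, C3, C4
```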

FIG. 5 is a diagram for explaining an example of a context information combination unit according to the present disclosure.

In the example of FIG. 5, the context information combination unit 520 may correspond to an embodiment of the context information combination unit 220 of FIG. 2.

For example, the context information combination unit 520 includes a context information encoding result value input unit 521, a convolution learning unit 522, a current speech content encoding result value input unit 523, and a correction unit 524.

The context information encoding result value input unit 521 may receive the encoded context information corresponding to the output of the context information encoding units 210, 310, and 410. The encoded context information may include context information of one or more types. In the example of FIG. 5, it is assumed that C1, C2, C3, and C4, the context information encoding result values for the four types, are input. For example, C1, C2, C3, and C4 are input to the first type context information encoding result value input node 521_1, the second type input node 521_2, the third type input node 521_3, and the fourth type input node 521_4, respectively.

When one or more pieces of context information are provided in this way, the improvement in translation performance can be maximized by learning the usefulness of the encoded context information rather than directly specifying which context information influences the translation of the current utterance. To this end, the context information combination unit 520 according to the present disclosure can obtain the final context information value through learning with a convolution network. This final result value may be passed to the correction unit 524.

Specifically, the convolution learning unit 522 can expand and combine the encoded context information input values (C1, C2, C3, C4) through various routes. For example, five different combination nodes 522_1, 522_2, 522_3, 522_4, and 522_5 can each combine one or more of C1, C2, C3, and C4 according to different criteria. The combination of context information types input at one combination node may be the same as or different from that of another combination node, and the combination criterion applied at one combination node may likewise be the same as or different from that of another.

For example, C1, C2, and C3 may be combined at the first combination node 522_1; C1, C2, C3, and C4 at the second combination node 522_2 and at the third combination node 522_3; C2, C3, and C4 at the fourth combination node 522_4; and C1, C3, and C4 at the fifth combination node 522_5. In this case, the combinations of encoded context information types input to the second combination node 522_2 and the third combination node 522_3 are the same, but the combination methods (for example, a sum of squares, or a sum of only the values equal to or greater than a predetermined threshold) may differ, and the parameters or bias values applied to the combination may differ.

The scope of the present disclosure is not limited by the number of combination nodes, combinations of types of context information input to combination nodes, and the like, and may include examples of combining various context information in various ways. Also, some context information may be passed to the corrector 524 without being combined with other context information.

The combination nodes 522_1, 522_2, 522_3, 522_4, and 522_5 included in the convolution learning unit 522 are connected to the context information encoding result value input unit 521 and form a hidden layer, which can be extended into further hidden layers containing various kinds of nodes. The last hidden layer may output the final node value to the correction unit 524.

The correction unit 524 performs correction on the current speech content encoding result value w delivered from the current speech content encoding result value input unit 523, based on the result value output from the convolution learning unit 522. Accordingly, the current speech content encoding result value w is corrected by the final optimal context information C, and the corrected encoding result value Wc can be output.
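The sketch below illustrates, under assumptions, how such a convolution-style combination could be realized: several combination nodes each mix a subset of C1 to C4 with their own learned parameters, a hidden layer condenses them into a final context value, and a correction layer maps (w, C) to Wc. The particular subsets, activation functions, and layer sizes are not specified in the patent and are chosen only for illustration.

```python
# Assumed realization of the FIG. 5 combination unit.
import torch
import torch.nn as nn

class ContextCombiner(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Combination nodes 522_1 .. 522_5: different subsets of [C1, C2, C3, C4];
        # 522_2 and 522_3 take the same subset but learn different parameters.
        self.subsets = [(0, 1, 2), (0, 1, 2, 3), (0, 1, 2, 3), (1, 2, 3), (0, 2, 3)]
        self.nodes = nn.ModuleList([nn.Linear(dim * len(s), dim) for s in self.subsets])
        self.hidden = nn.Linear(dim * len(self.subsets), dim)   # extended hidden layer
        self.correct = nn.Linear(2 * dim, dim)                  # correction unit 524

    def forward(self, w, contexts):  # contexts = [C1, C2, C3, C4]
        mixed = [torch.tanh(node(torch.cat([contexts[i] for i in s], dim=-1)))
                 for s, node in zip(self.subsets, self.nodes)]
        C = torch.tanh(self.hidden(torch.cat(mixed, dim=-1)))   # final context value
        return self.correct(torch.cat([w, C], dim=-1))          # corrected encoding Wc
```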

FIG. 6 is a flowchart for explaining a real-time automatic translation method using context information according to the present disclosure.

The method of FIG. 6 may be performed by a context information-based real-time automatic interpretation apparatus (hereinafter, the apparatus) according to the present disclosure.

In step S610, the apparatus can encode the current utterance. For example, the apparatus may construct an input sentence by performing speech recognition on the current utterance in the source language and encode the elements of the input sentence through one or more nodes based on a deep learning model, thereby generating an encoding result value for the current utterance.

In step S620, the apparatus may encode the context information related to the current utterance. For example, an encoding result value can be generated for one or more types of context information, such as the topic of the current utterance, previous utterances, the translation results of previous utterances, and keywords of the current utterance. In addition, the encoding result values of the plural pieces of context information may be combined in a convolutional manner to generate the finally encoded context information C.

In step S630, the apparatus can correct the encoded current speech content, which is the result of step S610, based on the encoded context information that is the result of step S620. That is, using the context information related to the current speech content, it is possible to generate the current speech content encoding result value Wc corrected to a form in which an optimal translation result suitable for the situation can be expected.

In step S640, the apparatus can decode the current utterance based on the corrected current utterance encoding result value Wc. For example, the decoding result value of the first node in the target language is generated using the corrected source language encoding result value Wc, and the result value of each following node is generated sequentially using the corrected encoding result value Wc and the decoding result of the previous node.
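The following sketch strings steps S610 to S640 together, assuming the hypothetical components sketched above are exposed through simple callable interfaces; it is a wiring illustration rather than the patent's implementation.

```python
# Wiring of steps S610-S640 under the assumed component interfaces above.
def interpret(encode_utterance, encode_contexts, combine, decode,
              src_tokens, topic_id, prev_src, prev_tgt, keywords):
    w = encode_utterance(src_tokens)                      # S610: encode the current utterance
    C1, C2, C3, C4 = encode_contexts(topic_id, prev_src,
                                     prev_tgt, keywords)  # S620: encode the context information
    Wc = combine(w, [C1, C2, C3, C4])                     # S630: correct the encoding with context
    return decode(Wc)                                     # S640: decode in the target language
```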

Although the exemplary methods of this disclosure are represented as a series of acts for clarity of explanation, they are not intended to limit the order in which the steps are performed, and if necessary, each step may be performed simultaneously with or in a different order from the others. In order to implement the method according to the present disclosure, the illustrated steps may additionally include other steps, may include only some of the illustrated steps, or may exclude some steps while including additional ones.

The various embodiments of the disclosure are not intended to be all-inclusive and are intended to be illustrative of the typical aspects of the disclosure, and the features described in the various embodiments may be applied independently or in a combination of two or more.

In addition, various embodiments of the present disclosure may be implemented by hardware, firmware, software, or a combination thereof. In the case of hardware implementation, one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), general processors, controllers, microcontrollers, microprocessors, and the like may be used.

The scope of the present disclosure includes software or machine-executable instructions (e.g., an operating system, applications, firmware, programs, etc.) that cause operations according to the methods of the various embodiments to be executed on a device or computer, and a non-transitory computer-readable medium in which such software or instructions are stored and executable on the device or computer.

100 deep learning-based automatic translation structure
110 encoding unit
111, 112, 113, 114 encoding nodes
120 decoding unit
121, 122, 123, 124, 125, 126 decoding nodes
210, 310, 410 context information encoding unit
220, 320, 420, 520 context information combination unit
311, 312, 313 context information encoding nodes
411, 412, 413, 414 type context information encoding units
411_1, 412_1, 412_2, 412_3, 413_1, 413_2, 413_3, 414_1, 414_2, 414_3 type context information encoding nodes
521 context information encoding result value input unit
521_1, 521_2, 521_3, 521_4 type context information encoding result value input nodes
522 convolution learning unit
522_1, 522_2, 522_3, 522_4, 522_5 combination nodes
523 current speech content encoding result value input unit
524 correction unit

Claims (1)

In a real-time automatic interpretation method using context information,
Encoding the current speech content;
Encoding the context information related to the current speech content;
Correcting the encoded result of the current speech content based on the encoded result of the context information; And
And decoding the current speech content based on the encoded result of the corrected current speech content.
KR1020170139323A 2017-10-25 2017-10-25 Method and apparatus for real-time automatic interpretation based on context information KR20190046124A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020170139323A KR20190046124A (en) 2017-10-25 2017-10-25 Method and apparatus for real-time automatic interpretation based on context information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020170139323A KR20190046124A (en) 2017-10-25 2017-10-25 Method and apparatus for real-time automatic interpretation based on context information

Publications (1)

Publication Number Publication Date
KR20190046124A (en) 2019-05-07

Family

ID=66656438

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020170139323A KR20190046124A (en) 2017-10-25 2017-10-25 Method and apparatus for real-time automatic interpretation based on context information

Country Status (1)

Country Link
KR (1) KR20190046124A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021235586A1 (en) * 2020-05-21 2021-11-25 삼성전자 주식회사 Electronic device for translating text sequence and operation method thereof
KR20220003930A (en) * 2020-07-02 2022-01-11 주식회사 엔씨소프트 Learning method and cognition method for omission restoration and apparatus for executing the method

