WO2024139307A1

WO2024139307A1 - Sentence correction method and apparatus for image, and electronic device and storage medium

Info

Publication number: WO2024139307A1
Application number: PCT/CN2023/115054
Authority: WO
Inventors: 李晓川; 赵雅倩; 李仁刚; 郭振华; 范宝余
Original assignee: 苏州元脑智能科技有限公司
Priority date: 2022-12-27
Filing date: 2023-08-25
Publication date: 2024-07-04
Also published as: CN115659959B; CN115659959A

Abstract

Provided in the embodiments of the present application are a sentence correction method and apparatus for an image, and an electronic device and a non-volatile readable storage medium, which are applied to a multi-modal sentence correction system. The method comprises: firstly, performing feature extraction on both an input image and an input original sentence, so as to obtain an image feature and an original sentence feature; then, generating a comprehensive encoded feature by means of feature splicing, and obtaining an encoded sentence feature by means of capturing from the comprehensive encoded feature a feature corresponding to the position of the original sentence feature; then, performing feature correction on the encoded sentence feature by means of a sentence feature correction module, so as to generate a corrected sentence feature, and performing feature fusion on the corrected sentence feature and the encoded sentence feature by means of a correction vector accessor, so as to obtain a target sentence feature; and finally, performing feature replacement on the original sentence feature by means of a correction decoder and by using the target sentence feature, and outputting target sentence information. In this way, the identification and correction of a fine-grained error in an original sentence are realized, thereby greatly reducing the error rate when a multi-modal task is performed.

Description

Image text error correction method, device, electronic device and storage medium

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to a Chinese patent application filed with the China Patent Office on December 27, 2022, with application number 202211680372.4 and application name “Text Error Correction Method, Device, Electronic Device and Storage Medium for Images”, all contents of which are incorporated by reference into this application.

Technical Field

The present application relates to the field of artificial intelligence technology, and in particular to a method for correcting text errors in images, a device for correcting text errors in images, an electronic device, and a computer non-volatile readable storage medium.

Background Art

In recent years, multimodal artificial intelligence has become one of the important research directions in the field of AI (Artificial Intelligence). Multimodal research aims to integrate inputs from multiple modalities, such as images, video, audio, text, and sensor signals, in order to comprehensively understand or generate information that humans can use. For example, the fields of Visual Question Answering (VQA) and Visual Grounding involve multimodal relationships between images and text. With the widespread application of the Transformer (a deep learning model based on the self-attention mechanism), Transformer-based multimodal network structures have become increasingly popular in multimodal tasks such as Visual Question Answering (VQA), Image Captioning, and Visual Dialog.

In the real world, human language often contains common phenomena such as slips of the tongue and metaphors, which existing computer language technology has difficulty handling. Therefore, when matching text with images, words that involve slips of the tongue or metaphors often cannot be matched to the image. In other words, current multimodal research cannot finely distinguish minor errors in text. For example, a passage of text may contain a wrong character or word, which causes the algorithm to make mistakes when completing multimodal tasks. In tasks based on visual question answering, for instance, a deliberate metaphor in the text may prevent the algorithm from understanding the question the human actually intends to ask, so that the Transformer-based multimodal structure cannot give a correct response and outputs a wrong answer.

Summary of the invention

The embodiments of the present application provide a method, device, electronic device and computer non-volatile readable storage medium for text correction of images to solve or partially solve the problem that the algorithm is prone to errors when completing multimodal tasks due to the inability to strictly match text and image.

The embodiment of the present application discloses a method for text error correction of an image, which is applied to a multimodal text error correction system. The multimodal text error correction system at least includes a text feature correction module, an error correction vector accessor, and an error correction decoder. The method includes:

In response to input operations for images and texts, image information and original text information corresponding to the input operations are acquired, and feature extraction is performed on the image information and the original text information respectively to obtain image features corresponding to the image information and original text features corresponding to the original text information;

The image features are concatenated with the original text features to obtain comprehensive coding features, and feature interception is performed according to the comprehensive coding features and the original text features to obtain text encoding features;

The text encoding features are corrected by the text feature correction module to generate text correction features, and the text correction features are fused with the text encoding features by the error correction vector accessor to obtain the target text features;

The error correction decoder uses the target text features to replace the original text features and outputs the corresponding target text information.

Optionally, the text encoding feature is corrected by a text feature correction module to generate a text correction feature, including:

The text encoding features are self-attention encoded through the text feature correction module to obtain the corresponding initial self-attention vector, and the initial self-attention vector is processed by character prediction to obtain the corresponding target self-attention vector, which contains the correlation features between the image features and the original text features;

The effective information amount of the text encoding feature is predicted to obtain the corresponding effective text information vector, which represents the probability that each character in the text encoding feature contains effective information;

The text encoding features are bidirectionally intercepted to obtain the front dislocation features and the back dislocation features respectively, and the adjacent feature interaction vector is generated according to the front dislocation features and the back dislocation features;

Coherence prediction processing is performed on the adjacent feature interaction vector to obtain the corresponding adjacent text information vector, which represents the probability that adjacent characters in the text encoding features are coherent;

The target self-attention vector, the effective text information vector and the adjacent text information vector are used to perform feature correction on the text encoding features to generate text correction features.

Optionally, the text encoding feature is self-attention encoded by the text feature correction module to obtain a corresponding initial self-attention vector, including:

Input the text encoding features into the self-attention layer, using the formula

Perform self-attention encoding to obtain the corresponding initial self-attention vector; where,

W_q, W_k, and W_v are all learnable weights, and f is the text encoding feature.
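As a non-limiting illustration only: the formula itself is not reproduced in this text, so the following sketch assumes the standard scaled dot-product form softmax(QKᵀ/√d)V with Q = f·W_q, K = f·W_k and V = f·W_v. The helper names `matmul`, `softmax` and `self_attention`, and the toy values, are hypothetical and are not taken from the application.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(f, w_q, w_k, w_v):
    """Assumed form: softmax(Q K^T / sqrt(d)) V, with Q = f W_q, K = f W_k, V = f W_v."""
    q, k, v = matmul(f, w_q), matmul(f, w_k), matmul(f, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]           # transpose K to shape [d, M]
    scores = matmul(q, k_t)                        # [M, M] similarity scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)                      # [M, d] initial self-attention vector

# Toy text encoding feature f of size [M, d] = [3, 2]; identity weights for clarity.
f = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
o = self_attention(f, identity, identity, identity)
```

In this sketch the output keeps the size [M, d] of the input, so that each row can later be paired with the sub-text feature at the same position.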

Optionally, character prediction processing is performed on the initial self-attention vector to obtain a corresponding target self-attention vector, including:

The initial self-attention vector is input into two groups of fully connected layers to perform current character prediction processing and previous character prediction processing respectively, and the current prediction vector and the previous prediction vector are obtained;

The target self-attention vector is determined based on the current prediction vector and the previous prediction vector.

Optionally, determining a target self-attention vector according to a current prediction vector and a previous prediction vector includes:

The current prediction vector is used to predict the text encoding feature to obtain the target current character corresponding to the text encoding feature;

The previous prediction vector is used to predict the target current character to obtain the target preceding character corresponding to the target current character;

The target preceding character is concatenated with the target current character, the corresponding target character is output, and the target self-attention vector corresponding to the target character is generated.

Optionally, the current prediction vector is used to perform prediction processing on the text encoding feature to obtain a target current character corresponding to the text encoding feature, including:

According to the current prediction vector, the text encoding feature is probability matched with each preset character in the preset dictionary to obtain the current prediction probability corresponding to each preset character, and the preset character with the largest current prediction probability is determined as the target current character.

Optionally, predicting the target current character using the previous prediction vector to obtain the target preceding character corresponding to the target current character includes:

According to the previous prediction vector, the target current character is probability matched with each preset character in the preset dictionary to obtain the previous prediction probability corresponding to each preset character, and the preset character with the largest previous prediction probability is determined as the target preceding character.
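By way of a hedged example, the dictionary matching described above may be sketched as a softmax followed by an arg-max over the preset dictionary. The placeholder entry `"<none>"` (standing for "no preceding character"), the score values and the function names are assumptions made for illustration; the application does not specify them.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities over the preset dictionary."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_character(scores, dictionary):
    """Return the preset character with the largest prediction probability."""
    probs = softmax(scores)
    best = max(range(len(dictionary)), key=lambda i: probs[i])
    return dictionary[best], probs[best]

# Hypothetical preset dictionary; "<none>" is an assumed empty placeholder.
dictionary = ["<none>", "red", "white", "truck"]

# Hypothetical outputs of the two groups of fully connected layers for one position.
current_scores = [0.1, 0.5, 2.0, 0.3]
previous_scores = [1.5, 0.2, 0.1, 0.4]

current_char, _ = pick_character(current_scores, dictionary)
previous_char, _ = pick_character(previous_scores, dictionary)

# Concatenate the preceding character with the current character, dropping placeholders.
target = [c for c in (previous_char, current_char) if c != "<none>"]
```

Under this assumption, a position can emit either one or two characters, which is one possible way to realize replacement with text of indefinite length.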

Optionally, predicting the effective information amount of the text encoding feature to obtain a corresponding effective text information vector includes:

Using formula

Predict the effective information volume of the text encoding features to obtain the corresponding effective text information vector; where:

W_iq, W_ik, and W_iv are all transfer matrix weights, W_iw is the information prediction weight, b_ib is the learnable model parameter, and f is the text encoding feature.

Optionally, the size of the text encoding feature is [M, d], the sizes of the front dislocation feature and the back dislocation feature are both [M-1, d], and generating the adjacent feature interaction vector according to the front dislocation feature and the back dislocation feature includes:

The front dislocation feature and the back dislocation feature are vector-concatenated to generate an adjacent feature interaction vector of size [M-1, d×2] corresponding to the text encoding feature.
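The sizes stated above can be checked with a short sketch. It assumes that the front feature consists of rows 0 to M-2 and the back feature of rows 1 to M-1 of the text encoding feature; which slice is called "front" and which "back" is an assumption, and the function name is hypothetical.

```python
def adjacent_interaction(f):
    """Build the [M-1, 2d] adjacent feature interaction vector from an [M, d] feature."""
    front = f[:-1]   # rows 0 .. M-2, size [M-1, d]
    back = f[1:]     # rows 1 .. M-1, size [M-1, d]
    # Concatenate each pair of neighbouring rows along the feature dimension.
    return [a + b for a, b in zip(front, back)]

# Toy text encoding feature with M = 4 characters and d = 2.
f = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
f_nb = adjacent_interaction(f)
```

Each row of `f_nb` then describes one pair of neighbouring characters, which is the unit on which the coherence prediction operates.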

Optionally, performing coherent prediction processing on adjacent feature interaction vectors to obtain corresponding adjacent text information vectors includes:

Using formula

Perform coherent prediction processing on adjacent feature interaction vectors to obtain corresponding adjacent text information vectors;

W_nw, W_nq, W_nv, and W_nk are all transfer matrix weight parameters, b_in is the bias vector parameter, and f_nb is the adjacent feature interaction vector.

Optionally, the error correction vector accessor includes at least a feature storage space. After performing feature interception according to the comprehensive coding feature and the original text feature to obtain the text coding feature, the method further includes:

The text encoding feature is split into several sub-text features, and each sub-text feature is stored in the feature storage space in sequence.

Optionally, the error correction vector accessor includes a repair judgment gate and a feature updater, and the text correction feature and the text encoding feature are fused through the error correction vector accessor to obtain the target text feature, including:

Performing repair judgment on each sub-text feature through a repair judgment gate to determine at least one replacement sub-text feature that needs to be replaced;

The feature updater uses the text correction feature to perform feature replacement on at least one replacement sub-text feature to obtain respective corresponding target sub-text features, and performs feature fusion on at least one target sub-text feature to obtain corresponding target text features.

Optionally, performing a repair judgment on each sub-text feature through a repair judgment gate to determine at least one replacement sub-text feature that needs to be replaced includes:

Using formula

Perform repair judgment on each sub-text feature;

When s(x_k) is 1, the sub-text feature with feature number k is determined as the replacement sub-text feature to be replaced; wherein,

k represents the feature number corresponding to the sub-text feature, p_ifok is the effective text information vector corresponding to the sub-text feature with feature number k, p_nbok is the adjacent text information vector corresponding to the sub-text feature with feature number k, thresh_ifo represents the settable information quantity probability threshold, thresh_nbo represents the settable fluency probability threshold, and s(x_k) represents whether the sub-text feature with feature number k needs to be replaced.
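For illustration only: the gate formula is not reproduced in this text, so the sketch below assumes that a sub-text feature is marked for replacement when either its effective-information probability or its coherence probability falls below the corresponding threshold. This rule, the default threshold values, and the alignment of one coherence probability per feature number k are all assumptions.

```python
def repair_gate(p_ifo, p_nbo, thresh_ifo=0.5, thresh_nbo=0.5):
    """Return s(x_k) for every k: 1 if the sub-text feature should be replaced, else 0.

    Assumed rule: replace when the feature carries too little effective
    information OR reads incoherently next to its neighbour.
    """
    return [
        1 if (p_i < thresh_ifo or p_n < thresh_nbo) else 0
        for p_i, p_n in zip(p_ifo, p_nbo)
    ]

# Hypothetical probabilities for three sub-text features.
p_ifo = [0.9, 0.2, 0.8]   # probability of containing effective information
p_nbo = [0.9, 0.9, 0.1]   # probability of being coherent with the adjacent character
s = repair_gate(p_ifo, p_nbo)
```

Raising either threshold makes the gate more aggressive, so more sub-text features are handed to the feature updater.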

Optionally, the feature updater uses the text correction feature to perform feature replacement on at least one replacement sub-text feature to obtain respective corresponding target sub-text features, including:

According to the text correction features, the formula
f_ko = f_k × (1-μ) + (p_ifok × θ + p_nbok × (1-θ)) × μ × o_emlm

is used to calculate the text feature value corresponding to the replacement sub-text feature, and feature replacement is performed on the replacement sub-text feature according to the text feature value to obtain the corresponding target sub-text feature; wherein,

f_k is the sub-text feature with feature number k, o_emlm is the target self-attention vector, and θ and μ are both preset parameters with values between 0 and 1.
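The update formula above can be exercised with a minimal sketch. It assumes that the two probabilities are scalars for feature number k and that the target self-attention vector has the same length as the sub-text feature, so that the blend is applied element by element; the function name and default parameter values are hypothetical.

```python
def update_feature(f_k, o_emlm, p_ifo_k, p_nbo_k, theta=0.5, mu=0.5):
    """Compute f_k*(1-mu) + (p_ifo_k*theta + p_nbo_k*(1-theta))*mu*o_emlm element-wise."""
    # theta balances the information probability against the coherence probability.
    weight = p_ifo_k * theta + p_nbo_k * (1 - theta)
    # mu controls how much of the original feature is kept versus the corrected one.
    return [x * (1 - mu) + weight * mu * o for x, o in zip(f_k, o_emlm)]

f_k = [1.0, 2.0]        # sub-text feature with feature number k
o_emlm = [3.0, 4.0]     # matching slice of the target self-attention vector

kept = update_feature(f_k, o_emlm, 1.0, 1.0, theta=0.5, mu=0.0)      # mu = 0 keeps f_k
replaced = update_feature(f_k, o_emlm, 1.0, 1.0, theta=0.5, mu=1.0)  # mu = 1 uses o_emlm
```

With μ = 0 the original feature is returned unchanged, and with μ = 1 the result depends only on the weighted correction term, which shows that μ acts as the replacement strength.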

Optionally, feature replacement is performed on the replacement sub-text feature according to the text feature value to obtain the corresponding target sub-text feature, including:

The text feature value overwrites the original feature value of the replacement sub-text feature to obtain the corresponding target sub-text feature.

Optionally, the image features are concatenated with the original text features to obtain comprehensive coding features, including:

The image features are concatenated with the original text features, and cross-modal encoding is performed to obtain comprehensive encoding features.

Optionally, performing feature interception according to the comprehensive coding feature and the original text feature to obtain the text encoding feature includes:

The features in the comprehensive coding features corresponding to the original text feature positions are intercepted to obtain text coding features corresponding to the original text features.
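As a rough sketch of the concatenation and interception steps, the following assumes that the image features are placed first and the text features second, so that the text positions are the last M rows of the encoded sequence. The ordering, the function name, and the use of an identity function in place of a real cross-modal encoder are assumptions for illustration.

```python
def encode_and_intercept(image_feats, text_feats, cross_modal_encoder):
    """Concatenate image and text features, encode them, and keep the text positions."""
    combined = image_feats + text_feats          # size [N + M, d]
    encoded = cross_modal_encoder(combined)      # comprehensive coding features
    start = len(image_feats)                     # text rows begin after the N image rows
    return encoded[start:start + len(text_feats)]

# Toy inputs: N = 2 image features and M = 3 text features, d = 2.
image_feats = [[0.1, 0.2], [0.3, 0.4]]
text_feats = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]

# An identity encoder stands in for a Transformer-style cross-modal encoder.
text_encoding = encode_and_intercept(image_feats, text_feats, lambda rows: rows)
```

With a real encoder the intercepted rows would differ from the input text features, because each row would already have attended to the image features.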

The embodiment of the present application also discloses a text error correction device for an image, which is applied to a multimodal text error correction system. The multimodal text error correction system at least includes a text feature correction module, an error correction vector accessor, and an error correction decoder. The device includes:

A feature extraction module, for obtaining image information and original text information corresponding to the input operation in response to the input operation for the image and the text, and performing feature extraction on the image information and the original text information respectively to obtain image features corresponding to the image information and original text features corresponding to the original text information;

The text coding feature generation module is used to perform feature concatenation of image features and original text features to obtain comprehensive coding features, and perform feature interception based on the comprehensive coding features and original text features to obtain text coding features;

The target text feature generation module is used to perform feature correction on the text encoding feature through the text feature correction module to generate text correction features, and to fuse the text correction features with the text encoding features through the error correction vector accessor to obtain the target text features;

The text feature replacement module is used to replace the original text features with the target text features through the error correction decoder and output the corresponding target text information.

Optionally, the text encoding feature generation module includes:

A target self-attention vector generation module is used to perform self-attention encoding on the text encoding features through the text feature correction module to obtain the corresponding initial self-attention vector, and perform character prediction processing on the initial self-attention vector to obtain the corresponding target self-attention vector, where the target self-attention vector contains the correlation features between the image features and the original text features;

An effective text information vector generation module is used to predict the effective information amount of the text encoding feature and obtain the corresponding effective text information vector. The effective text information vector represents the probability that each character in the text encoding feature contains effective information.

The adjacent feature interaction vector generation module is used to perform bidirectional interception on the text encoding features to obtain the front dislocation features and the back dislocation features respectively, and generate the adjacent feature interaction vector according to the front dislocation features and the back dislocation features;

An adjacent text information vector generation module is used to perform coherent prediction processing on adjacent feature interaction vectors to obtain corresponding adjacent text information vectors, where the adjacent text information vectors represent the probability of adjacent characters being coherent in text encoding features;

The text correction feature generation module is used to use the target self-attention vector, the effective text information vector and the adjacent text information vector to perform feature correction on the text encoding feature to generate text correction features.

Optionally, the target self-attention vector generation module includes:

The initial self-attention vector generation module is used to input the text encoding features into the self-attention layer using the formula

Optionally, the target self-attention vector generation module includes:

A character prediction processing module is used to input the initial self-attention vector into two groups of fully connected layers to perform current character prediction processing and previous character prediction processing respectively, and obtain a current prediction vector and a previous prediction vector;

The target self-attention vector determination submodule is used to determine the target self-attention vector based on the current prediction vector and the previous prediction vector.

Optionally, the target self-attention vector determination submodule includes:

A target current character generation module is used to use the current prediction vector to predict the text encoding feature to obtain the target current character corresponding to the text encoding feature;

A target preceding character generating module is used to predict the target current character using the preceding prediction vector to obtain the target preceding character corresponding to the target current character;

The target character output module is used to concatenate the target preceding character with the target current character, output the corresponding target character, and generate a target self-attention vector corresponding to the target character.

Optionally, the target current character generation module is specifically used for:

Optionally, the target preceding character generating module is specifically used for:

Optionally, the effective text information vector generation module is specifically used for:

Using formula

Predict the effective information volume of the text encoding features to obtain the corresponding effective text information vector;

Optionally, the size of the text encoding feature is [M, d], the sizes of the front dislocation feature and the back dislocation feature are both [M-1, d], and the adjacent feature interaction vector generation module is specifically used for:

Optionally, the adjacent text information vector generation module is specifically used for:

Using formula

Optionally, the error correction vector accessor includes at least a feature storage space, and the device further includes:

The sub-text feature splitting module is used to split the text encoding feature into several sub-text features, and store each sub-text feature in the feature storage space in turn.

Optionally, the error correction vector accessor includes a repair judgment gate and a feature updater, and the target text feature generation module includes:

A replacement subtext feature determination module is used to perform repair judgment on each subtext feature through a repair judgment gate to determine at least one replacement subtext feature that needs to be replaced;

The target sub-text feature determination module is used to replace at least one replacement sub-text feature with a text correction feature through a feature updater to obtain the corresponding target sub-text features, and to fuse at least one target sub-text feature to obtain the corresponding target text feature.

Optionally, the replacement subtext feature determination module is specifically used for:

Using formula

Perform repair judgment on each sub-text feature;

Optionally, the target subtext feature determination module includes:

The text feature value calculation module is used to, according to the text correction features, use the formula
f_ko = f_k × (1-μ) + (p_ifok × θ + p_nbok × (1-θ)) × μ × o_emlm

Calculate the text feature value corresponding to the replacement sub-text feature, and perform feature replacement on the replacement sub-text feature according to the text feature value to obtain the corresponding target sub-text feature; wherein,

Optionally, the text feature value calculation module is specifically used for:

Optionally, the text encoding feature generation module includes:

The cross-modal coding processing module is used to concatenate image features with original text features and perform cross-modal coding processing to obtain comprehensive coding features.

Optionally, the text encoding feature generation module includes:

The corresponding feature position interception module is used to intercept the features corresponding to the original text feature positions in the comprehensive coding features to obtain text coding features corresponding to the original text features.

The embodiment of the present application also discloses an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;

The memory is used to store computer programs;

The processor is used to implement the method of the embodiment of the present application when executing the program stored in the memory.

The embodiment of the present application also discloses a computer non-volatile readable storage medium having instructions stored thereon, which, when executed by one or more processors, enables the processors to execute the method as in the embodiment of the present application.

The embodiments of the present application include the following advantages:

In an embodiment of the present application, an image-based text error correction method for a multimodal text error correction system is provided. First, feature extraction is performed on an input image and an original text respectively to obtain image features and original text features. Then, the image features and the original text features are spliced to generate a comprehensive coding feature, and the text coding feature is obtained by intercepting the feature at the position of the original text feature in the comprehensive coding feature. Next, the text feature correction module performs feature correction on the text coding feature to generate a text correction feature, and the error correction vector accessor fuses the text correction feature with the text coding feature to obtain a target text feature. Finally, the error correction decoder replaces the original text feature with the target text feature and outputs the corresponding target text information. Fine-grained errors in the original text are thereby recognized and corrected according to the image, which greatly reduces the error rate when performing multimodal tasks.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application. Those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic diagram of the interference of the current non-strictly matched multimodal samples on the existing methods;

FIG. 2 is a schematic diagram of a multimodal sample of image-based text error correction provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of a text error correction system based on a visual elastic mask provided in an embodiment of the present application;

FIG. 4 is a flowchart of a method for correcting text errors in an image provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of a visual elastic mask provided in an embodiment of the present application;

FIG. 6 is a schematic diagram of adjacent word relationship prediction provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of a structural framework of an error correction vector accessor provided in an embodiment of the present application;

FIG. 8 is a structural block diagram of a text error correction device for an image provided in an embodiment of the present application;

FIG. 9 is a schematic diagram of a computer non-volatile readable medium provided in an embodiment of the present application;

FIG. 10 is a block diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

In order to make the above-mentioned objects, features and advantages of the present application more obvious and easier to understand, the present application is further described in detail below in conjunction with the accompanying drawings and specific embodiments.

In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present application, some technical features involved in the embodiments of the present application are explained and illustrated below:

Sentence Correction (SC): Detect text errors and make corresponding corrections.

Multi Modal (MM): refers to the collaborative reasoning of multiple heterogeneous modal data. In the field of artificial intelligence, it often refers to the collaboration of perceptual information, such as images, text, video, audio, etc., to help artificial intelligence understand the external world more accurately.

Masked Language Modeling (MLM): Mask is often used in image processing scenarios. Mask technology can be used to predict the corresponding text in the image.

Elastic Masked Language Modeling (EMLM): A more accurate text prediction method than masked text prediction. Unlike existing masking methods, the elastic masked text prediction used in this application can replace vocabulary in text of indefinite length.

As an example, in the real world, human language often contains common phenomena such as slips of the tongue and metaphors, which existing computer language technology has difficulty handling. Therefore, when matching text with images, words that involve slips of the tongue or metaphors often cannot be matched to the image. In other words, current multimodal research cannot finely distinguish minor errors in text. For example, a passage of text may contain a wrong character or word, which may cause the algorithm to make errors when completing multimodal tasks. In tasks based on visual question answering, for instance, a deliberate metaphor in the text may prevent the algorithm from understanding the question the human actually intends to ask, so that the Transformer-based multimodal structure cannot give a correct response and outputs a wrong answer.

For better explanation, referring to FIG. 1, a schematic diagram of the interference of a current non-strictly matched multimodal sample on an existing method is shown. The right frame of FIG. 1 shows a non-strictly matched multimodal sample, presented in question-and-answer form: the questioner asks “What is the red pickup truck with a tall bucket attached to it doing?”, and the answerer replies “Stop the car”. From the image content, the actual situation is as follows: the body of the pickup truck with a tall bucket attached to it is white, not red; the white pickup truck is at a traffic light with a car in front of it; and it can be clearly seen that the white pickup truck is driving, not parked. The matching relationship between the image and the text should therefore be as shown in the left frame of FIG. 1, which is a strictly matched multimodal sample, also presented in question-and-answer form: the questioner asks “What is the white pickup truck with a tall bucket attached to it doing?”, and the answerer replies “Driving”. It can be concluded that, when matching images with text, if the image and the text cannot be strictly matched, an incorrect answer is easily output because the information is judged incorrectly.

Further, referring to FIG. 2, a schematic diagram of a multimodal sample of image-based text error correction provided in an embodiment of the present application is shown. As shown in the figure, the image is the same as in FIG. 1, and the input text is "a red pickup truck with a tall bucket attached is driving on the road". The body color of the pickup truck is clearly wrong, and the erroneous information in the input text needs to be corrected according to the information shown in the image; that is, "red" should be corrected to "white". After the input text is corrected by the image-based text error correction method provided in the present application, the corresponding output text should be "a white pickup truck with a tall bucket attached is driving on the road", which achieves a strict match between the image and the text and yields the correct text content.

Therefore, one of the core points of the embodiment of the present application is to provide an image-based text correction method for a multimodal text correction system. First, features are extracted from the input image and the original text respectively to obtain image features and original text features. Then, a comprehensive coding feature is generated from the image features and the original text features by feature splicing, and the text coding feature is obtained by intercepting the feature at the original text feature position in the comprehensive coding feature. Next, the text coding feature is corrected by the text feature correction module to generate a text correction feature, and the text correction feature is fused with the text coding feature by the error correction vector accessor to obtain the target text feature. Finally, the error correction decoder uses the target text feature to replace the original text feature and outputs the corresponding target text information. In this way, fine-grained errors in the original text are identified and corrected according to the image, which greatly reduces the error rate when performing multimodal tasks.

Referring to Figure 3, a schematic diagram of a text correction system based on a visual elastic mask provided in an embodiment of the present application is shown. Through this text correction system, combined with the image text correction method provided in the present application, erroneous or potentially erroneous input text can be further judged and corrected to obtain correct output text that strictly matches the image.

As shown in the figure, the text error correction system may at least include an image/text encoding module 301 , a feature extraction module 302 , a text feature correction module 303 , an error correction vector accessor 304 and an error correction decoder 305 .

Among them, first, the image and text that need to be corrected can be encoded separately through the image/text encoding module 301. For example, after receiving the image and the corresponding input text "a red pickup truck with a tall bucket attached is driving on the road", the image and the input text can be encoded separately to obtain the image code corresponding to the image and the text code corresponding to the input text.

Then, the feature extraction module 302 can first merge the features of the image code and the text code, then perform feature encoding to obtain the corresponding comprehensive coding features, and then extract the text feature segments of the comprehensive coding features to obtain the text coding features corresponding to the text coding.

The text encoding features can then be input into the text feature correction module 303 for feature correction to obtain corresponding text correction features. Specifically, for error correction of text features, the text feature correction module 303 is provided with three submodules, namely, a visual elastic mask submodule, an information prediction network submodule, and an adjacent vocabulary relationship prediction submodule.

Specifically, the visual elastic mask submodule can be used to correct sentences of indefinite length. For example, the wrong word in the input text may have 2 characters while the corrected word actually has 3 characters. Variable-length error correction of the input text makes text correction more accurate and more reliable.

Furthermore, in addition to possibly lengthening the original sentence, error correction may also shorten the original sentence. In other words, some characters in the original sentence should be deleted, such as changing from 3 to 2. In order to make the model capability of the text error correction system more comprehensive, the present application designs an information prediction network sub-module in the text feature correction module 303, so that the information prediction network sub-module can enable the features of the corresponding position to predict whether the characters at that position contain effective information.

If there are errors in the input text, adjacent words may be incoherent. Therefore, the present application also designs an adjacent word relationship prediction submodule in the text feature correction module 303 to predict the text coherence of the features corresponding to the input text.

After the text encoding features are corrected by the visual elastic mask submodule, the information prediction network submodule and the adjacent vocabulary relationship prediction submodule in the text feature correction module 303 and the corresponding text correction features are obtained, the text correction features can be input into the error correction vector accessor 304. At the same time, the text encoding features can also be input into the error correction vector accessor 304. The text correction features and the text encoding features can be fused through the error correction vector accessor 304 to obtain the target text features.

Finally, the target text features can be input into the error correction decoder 305, and the target text features can be used by the error correction decoder 305 to replace the original text features, and output the corresponding target text information. For example, for the input text "a red pickup truck with a tall bucket attached is driving on the road", after being corrected by the text error correction system of the present application, the correct output text "a white pickup truck with a tall bucket attached is driving on the road" can be obtained.

It should be pointed out that, to assist the explanation, this embodiment uses the image-based text correction multimodal sample in FIG. 2 as an example, and describes the processes of the image-based text correction method using the text correction system only briefly, as a simple explanation of the implementation principle. The more specific implementation steps are given in the detailed description of FIG. 4 below. It can be understood that this application does not limit this.

It should be noted that the embodiments of the present application include but are not limited to the above examples. It is understandable that those skilled in the art can also make settings according to actual needs under the guidance of the ideas of the embodiments of the present application, and the present application does not impose any restrictions on this.

In an embodiment of the present application, a multimodal text error correction system is provided, which may include at least an image/text encoding module, a feature extraction module, a text feature correction module, an error correction vector accessor, and an error correction decoder. The text error correction system uses the currently popular Transformer network structure as the backbone network, and gives the model the ability to correct text errors through submodules such as the visual elastic mask, the information volume prediction network, and the adjacent vocabulary relationship prediction. In the image-based text error correction process, features are first extracted from the input image and the original text respectively to obtain image features and original text features. A comprehensive coding feature is then generated from the image features and the original text features by feature splicing, and the text coding feature is obtained by intercepting the feature at the original text feature position in the comprehensive coding feature. Next, the text coding feature is corrected by the text feature correction module to generate a text correction feature, and the text correction feature is fused with the text coding feature by the error correction vector accessor to obtain the target text feature. Finally, the error correction decoder uses the target text feature to replace the original text feature and outputs the corresponding target text information. In this way, the text error correction system, combined with the image-based text error correction method, can identify and correct fine-grained errors in the original text, which greatly reduces the error rate when performing multimodal tasks.

Referring to FIG. 4, a flowchart of a method for correcting text errors in an image provided in an embodiment of the present application is shown. The method can be applied to a multimodal text error correction system. The multimodal text error correction system includes at least a text feature correction module, an error correction vector accessor, and an error correction decoder. The method may specifically include the following steps:

Step 401, in response to an input operation for an image and text, image information and original text information corresponding to the input operation are obtained, and feature extraction is performed on the image information and the original text information respectively to obtain image features corresponding to the image information and original text features corresponding to the original text information;

The Transformer is an N-input, N-output structure; that is, each Transformer unit is comparable to a layer of an RNN (Recurrent Neural Network): it receives all the words in a sentence as input and produces an output for each word in the sentence. Unlike an RNN, however, the Transformer can process all the words in a sentence at the same time, and the operation distance between any two words is 1.

Based on the Transformer multimodal network structure, the present application can realize image-based text correction. Specifically, in response to input operations on images and texts, image information and original text information corresponding to the input operations can be obtained, and feature extraction can be performed on the image information and the original text information respectively to obtain image features corresponding to the image information and original text features corresponding to the original text information. In this way, features in the image and text can be extracted respectively through feature extraction, so that more refined text correction can be performed based on the extracted features in the subsequent process.

Exemplarily, for an image input of size N and a text input of size M, after each is encoded to obtain the corresponding image encoding and text encoding, an existing encoder model can be used for feature extraction to obtain image features of size [N, d] corresponding to the image and text features of size [M, d] corresponding to the text, where "d" is the dimension of the feature, that is, how many numbers each feature consists of. Feature extraction uses current mainstream feature extraction models, such as Convolutional Neural Network (CNN) encoders and BERT (Bidirectional Encoder Representations from Transformers, a bidirectional language model) encoders, so it is not elaborated here. Those skilled in the art can use other similar encoders or image/text models for feature extraction, and this application does not limit this.

Step 402, performing feature concatenation of the image features and the original text features to obtain a comprehensive coding feature, and performing feature interception based on the comprehensive coding feature and the original text features to obtain a text coding feature;

In a specific implementation, after the image features and the original text features are obtained, the two can be feature-concatenated to obtain a comprehensive coding feature. Specifically, the image features and the original text features can be concatenated and then subjected to cross-modal encoding to obtain the comprehensive coding feature.

For example, for image features of size [N, d] and text features of size [M, d], the two can be concatenated to obtain a comprehensive feature of size [N+M, d], and the comprehensive feature is input into the Transformer structure for cross-modal encoding to obtain a comprehensive coded feature of size [N+M, d]. Among them, cross-modal coding essentially utilizes the semantics between multimodal code streams to perform joint encoding of correlations, and is one of the key technologies for realizing cross-modal communication. Therefore, by concatenating image features with original text features, the comprehensive coded features of image and text can be obtained to realize cross-modal interaction.

After the comprehensive coding feature is obtained, feature interception can be performed based on the comprehensive coding feature and the original text features to obtain the text coding feature. Specifically, the features at the positions of the original text features in the comprehensive coding feature are intercepted to obtain the text coding feature corresponding to the original text features.

That is, the positions corresponding to the text feature of size [M, d] can be intercepted from the comprehensive coding feature of size [N+M, d] to obtain the text coding feature of size [M, d], and the text coding feature can be stored in the error correction vector accessor. It should be noted that although the text feature and the text coding feature both have size [M, d], their contents are completely different. The text feature represents only the features of the text, whereas the text coding feature is intercepted from the comprehensive coding feature, so in addition to the features of the text it also carries correlation with the features of the image. Therefore, in the subsequent feature correction of the text, the correlation with the image features cannot be ignored.
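The concatenation and interception in step 402 can be illustrated with a minimal sketch. It assumes that the image features are placed before the text features in the concatenation, and it uses a hypothetical identity function, cross_modal_encode, as a stand-in for the Transformer encoder; neither assumption is stated in the application, and all function names are illustrative only.

```python
def concat_features(image_feats, text_feats):
    """Concatenate [N, d] image features and [M, d] text features into [N+M, d]."""
    return image_feats + text_feats  # list concatenation along the sequence axis


def cross_modal_encode(joint_feats):
    """Hypothetical stand-in for the Transformer encoder: returns a same-shaped copy."""
    return [row[:] for row in joint_feats]


def intercept_text_features(encoded, num_image_tokens):
    """Keep only the rows at the original text positions (assumed to be the last M rows)."""
    return encoded[num_image_tokens:]


N, M, d = 3, 2, 4
image_feats = [[float(i)] * d for i in range(N)]    # size [N, d]
text_feats = [[10.0 + j] * d for j in range(M)]     # size [M, d]

joint = concat_features(image_feats, text_feats)    # size [N+M, d]
encoded = cross_modal_encode(joint)                 # size [N+M, d]
text_coding = intercept_text_features(encoded, N)   # size [M, d]
```

With a real encoder, text_coding would differ from text_feats because each text position has attended to the image positions; with the identity stand-in used here the two are equal.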

Step 403, performing feature correction on the text encoding feature through the text feature correction module to generate a text correction feature, and performing feature fusion on the text correction feature and the text encoding feature through the error correction vector accessor to obtain a target text feature;

In a specific implementation, the text encoding feature is corrected by a text feature correction module to generate a text correction feature, which may include the following sub-steps:

Sub-step 4031, performs self-attention encoding on the text encoding features through the text feature correction module to obtain the corresponding initial self-attention vector, and performs character prediction processing on the initial self-attention vector to obtain the corresponding target self-attention vector, wherein the target self-attention vector contains the correlation features between the image features and the original text features.

In the field of natural language processing, the task of masked text prediction is indispensable. Masked text prediction can be used to correct occluded or erroneous words. However, this method can only correct sentences with a corresponding number of characters and cannot change the length of the original sentence. For example, if the original sentence has 10 characters, the corrected sentence output after correction also has 10 characters.

However, in actual use, it cannot be guaranteed that the corrected content has the same length as the original sentence. For example, for the sentence "A boy is playing baseball on the playground", if "baseball" should actually be "basketball", masked text prediction can perform the correction. However, if the correct content corresponding to "baseball" is "hockey" (that is, the correction changes the length of the original sentence), masked text prediction fails. To overcome this problem, the present application designs an elastic masked text prediction method for error correction of sentences of variable length.

For better explanation, referring to FIG. 5, a schematic diagram of a visual elastic mask provided in an embodiment of the present application is shown. In the indefinite-length character mask prediction corresponding to (a), the input text corresponding to the image is "A boy is playing basketball and he is very happy", but the text should actually be "A boy is playing hockey and he is very happy"; the word "basketball" (篮球, 2 characters) and the word "hockey" (曲棍球, 3 characters) are not of the same length. In this case, as shown in the current character prediction of (b) and the preceding character prediction of (c), for the correction from "篮" (s_k) to "曲棍" (t_{k-1}, t_k), "棍" (t_k) is the current character of "篮" (s_k), and "曲" (t_{k-1}) is the preceding character of "篮" (s_k). Assuming that the feature finally corresponding to the character "篮" (s_k) is a 768-dimensional text encoding feature f, two sets of fully connected layers can each forward the text encoding feature f to obtain two new vectors. Assuming that there are 3000 words in the preset dictionary, the sizes of these two vectors are both 3000×1. The first vector is then used to predict the current character, and the other is used to predict the preceding character. The prediction method is to find the position of the maximum value among the 3000 numbers and output the corresponding word in the preset dictionary. This process introduces a mechanism that can predict multiple characters at the same time, which provides the so-called elasticity, realizes text prediction based on the visual elastic mask, and further realizes error correction of sentences of indefinite length.

As an optional embodiment, the text encoding feature is self-attention encoded by the text feature correction module to obtain the corresponding initial self-attention vector, which may include: inputting the text encoding feature of size [M, d] into the self-attention layer, using the formula

Here, i_emlm is the initial self-attention vector, W_q, W_k, and W_v are all learnable weights, f is the text encoding feature, size represents the size of the text encoding feature, and T represents the matrix transpose.

The softmax function, also known as the normalized exponential function, is a generalization of the binary classification function sigmoid in multi-classification. Its purpose is to present the results of multi-classification in the form of probability. Self-attention refers to the attention model in which attention is calculated entirely based on feature vectors. Since the self-attention mechanism is a common method in current image/text processing, it will not be described here.
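Since the self-attention formula itself does not appear in this text, the following is only a sketch under the assumption that it has the standard scaled dot-product form, with the scores divided by the square root of the feature dimension d. The helper names (matmul, transpose, softmax_rows, self_attention) are illustrative and are not taken from the application.

```python
import math


def matmul(a, b):
    """Multiply an [n, k] matrix by a [k, m] matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Apply a numerically stable softmax to each row independently."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(f, w_q, w_k, w_v):
    """Assumed form: softmax((f W_q)(f W_k)^T / sqrt(d)) (f W_v), with f of size [M, d]."""
    d = len(f[0])
    q, k, v = matmul(f, w_q), matmul(f, w_k), matmul(f, w_v)
    scores = matmul(q, transpose(k))                               # size [M, M]
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)                         # size [M, d]


identity = [[1.0, 0.0], [0.0, 1.0]]          # toy weights; real weights are learned
f = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # toy text encoding feature, M = 3, d = 2
i_emlm = self_attention(f, identity, identity, identity)
```

The output keeps the size [M, d] of the input, and each output row is a softmax-weighted average of the value rows.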

Furthermore, performing character prediction processing on the initial self-attention vector to obtain a corresponding target self-attention vector may include: inputting the initial self-attention vector i_emlm into two groups of fully connected layers to perform current character prediction processing and preceding character prediction processing respectively, obtaining a current prediction vector and a preceding prediction vector, and determining a target self-attention vector o_emlm according to the current prediction vector and the preceding prediction vector.

Among them, the above two groups of fully connected layers are new fully connected layers, which need to be trained and tuned separately. Specifically, during model training, the cross entropy between the fully connected layer output and the real character can be calculated to perform tuning, so that when the model is put into use, the trained model can be used to output the target self-attention vector o_emlm.
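The cross entropy between a fully connected layer output and the real character can be sketched as follows. This is a generic softmax cross-entropy over dictionary scores; the function name cross_entropy and the toy scores are assumptions for illustration, and the application does not specify the optimizer or training schedule.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy of one score vector against the index of the real character."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]


logits = [2.0, 0.5, -1.0]                 # toy scores over a 3-word dictionary
loss_correct = cross_entropy(logits, 0)   # real character is the top-scored word
loss_wrong = cross_entropy(logits, 2)     # real character is the lowest-scored word
```

The loss is smaller when the real character already has the highest score, which is the signal used to tune each of the two heads.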

As an embodiment, determining the target self-attention vector according to the current prediction vector and the preceding prediction vector may include: using the current prediction vector to perform prediction processing on the text encoding feature to obtain the target current character corresponding to the text encoding feature. Specifically, based on the current prediction vector, the text encoding feature is probability matched with each preset character in a preset dictionary to obtain the current prediction probability of each preset character, and the preset character with the largest current prediction probability is determined as the target current character. Performing current prediction processing on the text encoding feature thus yields the target current character for the subsequent preceding character prediction, which increases the flexibility of mask text prediction to a certain extent.

Then, the target current character can be predicted using the preceding prediction vector to obtain the target preceding character corresponding to the target current character. Specifically, the target current character can be probability matched with each preset character in a preset dictionary according to the preceding prediction vector to obtain the preceding prediction probability corresponding to each preset character, and the preset character with the largest preceding prediction probability can be determined as the target preceding character. Thus, the corresponding target preceding character can be determined through preceding prediction processing of the target current character, thereby increasing the flexibility of mask text prediction and realizing error correction for text of indefinite length.

As shown in Figure 5 above, the feature corresponding to the character "篮" is responsible for predicting the character "棍" in the current character prediction process, and the character "曲" in the preceding character prediction process. Since the specific prediction example is described in detail in the analysis of Figure 5, it will not be repeated here.

Then the target preceding character can be concatenated with the target current character to output the corresponding target character, and a target self-attention vector corresponding to the target character can be generated. For example, the target preceding character "曲" corresponding to "篮" can be concatenated with the target current character "棍" to obtain the target character "曲棍", and a target self-attention vector corresponding to "曲棍" can be generated for subsequent processing steps.
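The two-head prediction and the concatenation of the preceding and current characters can be sketched as follows. This is a toy illustration in which both heads read the same feature, as in the description of FIG. 5; the 4-word dictionary, the hand-set weights, and the function names are assumptions, whereas a real system would use trained weights and a dictionary of, for example, 3000 words.

```python
def linear(feature, weights, bias):
    """Fully connected layer: weights has size [vocab, d]; returns one score per word."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]


def argmax(values):
    """Index of the maximum value."""
    return max(range(len(values)), key=values.__getitem__)


def elastic_predict(feature, current_head, preceding_head, dictionary):
    """Predict the current and preceding characters from one feature and join them."""
    current = dictionary[argmax(linear(feature, *current_head))]
    preceding = dictionary[argmax(linear(feature, *preceding_head))]
    return preceding + current


dictionary = ["篮", "曲", "棍", "球"]
zero_bias = [0.0, 0.0, 0.0, 0.0]
# Hand-set toy weights: the current head favours index 2, the preceding head index 1.
current_head = ([[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0]], zero_bias)
preceding_head = ([[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]], zero_bias)

f = [1.0, 0.0, 0.0]   # toy feature standing in for the position of "篮"
target = elastic_predict(f, current_head, preceding_head, dictionary)
```

Here a single input position yields two output characters, which is how a 2-character word can be corrected into a 3-character word.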

Sub-step 4032, predicting the effective information amount of the text encoding feature to obtain a corresponding effective text information vector, where the effective text information vector represents the probability that each character in the text encoding feature contains effective information.

It can be seen from the contents of the aforementioned embodiments that in addition to lengthening the length of the original sentence, error correction may also shorten the length of the original sentence. In order to make the model capability of the text error correction system more comprehensive, the present application designs an information prediction network sub-module in the text feature correction module, so that the information prediction network sub-module can enable the features of the corresponding position to predict whether the characters at that position contain effective information.

In a specific implementation, predicting the effective information amount of the text encoding feature to obtain the corresponding effective text information vector may include: using the formula

The effective information volume of the text coding feature is predicted to obtain the corresponding effective text information vector, wherein p_ifo is an effective text information vector of size [M, 1], indicating the probability that each character in the text coding feature contains effective information; W_iq, W_ik, and W_iv are all transfer matrix weights of size [d, d]; W_iw is the information volume prediction weight of size [d, 1]; b_ib is a learnable model parameter; and f is the text coding feature. It should be noted that the parameters involved in the embodiment are all randomly initialized and then obtained by training and tuning on an actual data set, and the subscripts of the parameter symbols in each formula, such as q, w, and k, are used only to distinguish the parameters and have no special meaning. Those skilled in the art can set them according to the actual situation or actual needs. It can be understood that this application does not limit this.

Among them, the sigmoid function is a common S-shaped function in biology, also known as the S-shaped growth curve. In information science, because the sigmoid function and its inverse function are both monotonically increasing, it is often used as an activation function of a neural network to map variables into the range between 0 and 1.
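The information volume formula is not reproduced in this text, so the following is a deliberately reduced sketch: it keeps only a final projection by a weight vector of size [d, 1] plus a scalar bias, passed through the sigmoid, and omits the stage that uses the [d, d] transfer matrices. The function names and the toy parameter values are assumptions for illustration.

```python
import math


def sigmoid(x):
    """Map a real number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_information(features, w_iw, b_ib):
    """Reduced sketch: one probability per row of an [M, d] feature matrix."""
    return [sigmoid(sum(w * x for w, x in zip(w_iw, row)) + b_ib)
            for row in features]


features = [[0.0, 0.0], [10.0, 0.0], [-10.0, 0.0]]   # toy text coding feature, M = 3, d = 2
p_ifo = predict_information(features, w_iw=[1.0, 0.0], b_ib=0.0)
```

A value of p_ifo close to 0 at a position suggests that the character there carries little effective information and is a candidate for deletion.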

For better explanation, refer to FIG6 , which shows a schematic diagram of adjacent word relationship prediction provided in an embodiment of the present application, wherein (a) represents the original feature, the shaded portion represents the portion of the original feature that needs to be corrected for text errors, (b) represents the front misalignment feature and the back misalignment feature after bidirectional feature interception, and (c) represents the adjacent feature interaction vector obtained after vector concatenation of the front misalignment feature and the back misalignment feature, wherein the shaded portion can be represented as the adjacent prediction content corresponding to the text error correction. Specifically, this embodiment will describe the generation process of the adjacent feature interaction vector in conjunction with the following sub-steps 4033 to 4034:

Sub-step 4033, bidirectionally intercepts the text encoding features to obtain the front misalignment features and the back misalignment features respectively, and generates an adjacent feature interaction vector based on the front misalignment features and the back misalignment features.

Specifically, the size of the text encoding feature can be [M, d]. After the bidirectional interception of the feature, the size of the front misalignment feature and the back misalignment feature are both [M-1, d]. Then, further, according to the front misalignment feature and the back misalignment feature, the adjacent feature interaction vector is generated, which can include: the front misalignment feature and the back misalignment feature are vector cascaded to generate an adjacent feature interaction vector of size [M-1, d×2] corresponding to the text encoding feature. Thus, by bidirectional interception of the feature and vector cascade of the misalignment feature, the adjacent feature interaction vector can be obtained to facilitate the subsequent probability calculation of adjacent predictions and further improve the accuracy of text error correction.
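The bidirectional interception and cascading can be sketched as follows. It assumes that the front misalignment feature drops the last row and the back misalignment feature drops the first row, so that row k of the result pairs character k with character k+1; the function name adjacent_interaction is illustrative.

```python
def adjacent_interaction(features):
    """Build the [M-1, 2d] adjacent feature interaction vector from [M, d] features."""
    front = features[:-1]   # rows 0 .. M-2, size [M-1, d]
    back = features[1:]     # rows 1 .. M-1, size [M-1, d]
    return [a + b for a, b in zip(front, back)]   # concatenate each pair along d


f = [[1, 2], [3, 4], [5, 6]]   # toy text encoding feature, M = 3, d = 2
f_nb = adjacent_interaction(f)
```

Each row of f_nb then contains the features of two neighbouring characters side by side, which is the input to the coherence prediction.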

Sub-step 4034, performs coherence prediction processing on the adjacent feature interaction vector to obtain a corresponding adjacent text information vector, where the adjacent text information vector represents the probability of adjacent characters in the text encoding feature being coherent.

Specifically, performing coherent prediction processing on adjacent feature interaction vectors to obtain corresponding adjacent text information vectors may include: using the formula

Among them, p_nbo is the adjacent text information vector of size [M-1, 1]; W_nw, W_nq, W_nv, and W_nk are all transfer matrix weight parameters of size [d, d]; b_in is the bias vector parameter of size [1, d]; and f_nb is the adjacent feature interaction vector.

To keep the sizes of all vectors the same, p_nbo can then be updated to a vector of size [M, 1]: since the first character of a sentence does not need to be coherent with any preceding text, it is reasonable by default, so a 1 can be added at the front of the vector.

Sub-step 4035, uses the target self-attention vector, the effective text information vector and the adjacent text information vector to perform feature correction on the text encoding feature to generate a text correction feature.

For the text feature correction module, the target self-attention vector, the valid text information vector and the adjacent text information vector can be output, so that the target self-attention vector, the valid text information vector and the adjacent text information vector can be used in the subsequent process to perform feature correction on the text encoding to generate text correction features.

The above is the content corresponding to sub-steps 4031 to 4035. After the text encoding features are processed by the text feature correction module, the text correction features can be input into the error correction vector accessor for the next step of processing.

Referring to Figure 7, a schematic diagram of the structural framework of an error correction vector accessor provided in an embodiment of the present application is shown, wherein the error correction vector accessor can mainly include three parts, a feature storage space 701 for storing features, a repair judgment gate 702 for judging feature repair, and a feature updater 703 that can update features. After feature interception is performed based on the comprehensive coding feature and the original text feature to obtain the text coding feature, the text coding feature can also be split into several sub-text features, and each sub-text feature can be stored in the feature storage space 701 in sequence. Specifically, the text coding feature of size [M, d] can be split into M sub-text features.

After being processed by the text feature correction module, since the text correction feature is newly added with repair information, the repair judgment gate 702 can be used to judge whether each text correction feature needs to be repaired to determine whether the corresponding vector in the feature storage space 701 needs to be updated.

In a specific implementation, fusing the text correction feature and the text encoding feature through the error correction vector accessor to obtain the target text feature may be as follows: the repair judgment gate 702 performs a repair judgment on each sub-text feature to determine at least one replacement sub-text feature that needs to be replaced. Specifically, a formula is used to perform the repair judgment on each sub-text feature and determine whether that sub-text feature needs to be replaced.

Wherein, k represents the feature number corresponding to the sub-text feature, p_ifok is the valid text information vector corresponding to the sub-text feature with feature number k, p_nbok is the adjacent text information vector corresponding to the sub-text feature with feature number k, thresh_ifo represents the settable information quantity probability threshold, thresh_nbo represents the settable fluency probability threshold, and s(x_k) represents whether the sub-text feature with feature number k is determined as a replacement sub-text feature that needs to be replaced. Exemplarily, if s(x_k) is 1, the sub-text feature with feature number k needs to be replaced and is determined as a replacement sub-text feature; if s(x_k) is 0, the sub-text feature with feature number k does not need to be replaced and is not determined as a replacement sub-text feature.
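The repair judgment formula is not reproduced in this text. As one possible reading, the sketch below flags position k for replacement when either its information probability or its coherence probability falls below the corresponding threshold. The "either" rule, the default threshold values of 0.5, and the function name repair_gate are all assumptions.

```python
def repair_gate(p_ifo, p_nbo, thresh_ifo=0.5, thresh_nbo=0.5):
    """Return s(x_k) for every position: 1 if the sub-text feature should be replaced."""
    if len(p_ifo) != len(p_nbo):
        raise ValueError("p_ifo and p_nbo must have the same length")
    return [1 if (p_i < thresh_ifo or p_n < thresh_nbo) else 0
            for p_i, p_n in zip(p_ifo, p_nbo)]


p_ifo = [0.9, 0.3, 0.8]   # toy information probabilities, one per character
p_nbo = [1.0, 0.9, 0.2]   # toy coherence probabilities, first entry padded with 1
flags = repair_gate(p_ifo, p_nbo)
```

In this toy input, the second position is flagged for low information and the third for low coherence, while the first is kept.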

Then, the feature updater 703 can use the text correction feature to replace the feature of at least one replacement sub-text feature to obtain the corresponding target sub-text features, and perform feature fusion on at least one target sub-text feature to obtain the corresponding target text feature, so that the text correction feature can be updated through feature replacement and feature fusion.

In a specific implementation, the feature updater 703 uses the text correction feature to perform feature replacement on at least one replacement sub-text feature to obtain the respective corresponding target sub-text features, which can be: according to the text correction feature, using the formula

f_ko = f_k × (1-μ) + (p_ifok × θ + p_nbok × (1-θ)) × μ × o_emlm

to calculate the text feature value corresponding to the replacement sub-text feature, and performing feature replacement on the replacement sub-text feature according to the text feature value to obtain the corresponding target sub-text feature.

Specifically, the text feature value can overwrite the original text feature value of the replacement sub-text feature to obtain the corresponding target sub-text feature. Finally, the target sub-text features obtained after feature replacement are fused to obtain the corresponding target text feature, thereby realizing the update of the text correction feature.
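The update step can be illustrated with a minimal Python sketch. It is not the implementation of the application: μ and θ are not defined in this passage and are assumed here to be scalar weights between 0 and 1, o_emlm is assumed to be a per-position corrected vector with the same dimension as f_k, and the 0/1 flags s(x_k) are taken as already computed by the repair judgment gate. The function and parameter names are hypothetical.

```python
def updated_value(f_k, o_emlm_k, p_ifo_k, p_nbo_k, mu, theta):
    """Element-wise f_ko = f_k*(1-mu) + (p_ifok*theta + p_nbok*(1-theta))*mu*o_emlm.

    mu and theta are assumed to be scalars in [0, 1]; o_emlm_k is assumed to
    have the same length as f_k.
    """
    gain = (p_ifo_k * theta + p_nbo_k * (1.0 - theta)) * mu
    return [x * (1.0 - mu) + gain * o for x, o in zip(f_k, o_emlm_k)]


def fuse_features(features, corrected, p_ifo, p_nbo, replace_flags, mu=0.5, theta=0.5):
    """Overwrite only the sub-text features flagged for replacement (flag == 1)."""
    target = []
    for k, f_k in enumerate(features):
        if replace_flags[k] == 1:
            # Replacement sub-text feature: overwrite with the computed value.
            target.append(updated_value(f_k, corrected[k], p_ifo[k], p_nbo[k], mu, theta))
        else:
            # Not flagged: the stored sub-text feature is kept unchanged.
            target.append(list(f_k))
    return target


if __name__ == "__main__":
    feats = [[1.0, 1.0], [1.0, 1.0]]
    corr = [[2.0, 2.0], [2.0, 2.0]]
    print(fuse_features(feats, corr, [1.0, 1.0], [1.0, 1.0], [1, 0]))
```

In this sketch, μ controls how strongly the stored feature is moved toward the corrected vector, and θ balances the information probability against the fluency probability; both roles are an interpretation of the formula, not a statement from the text.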

Step 404: replace the original text features with the target text features through the error correction decoder, and output the corresponding target text information.

Finally, the target text features can be transmitted from the error correction vector accessor to the error correction decoder, and the target text features can be used by the error correction decoder to replace the original text features. After decoding, the corresponding target text information is output, so that the correct text corresponding to the image can be generated, completing the image-based text error correction process.

Exemplarily, the error correction decoder in this embodiment can be a sentence generator, which can be implemented using current mainstream models such as GPT (Generative Pre-Training, generative pre-trained language model). It can be understood that this application does not impose any restrictions on this.

It should be noted that, for the method embodiments, for the sake of simplicity, they are all expressed as a series of action combinations, but those skilled in the art should be aware that the embodiments of the present application are not limited by the described action sequence, because according to the embodiments of the present application, certain steps can be performed in other sequences or simultaneously. Secondly, those skilled in the art should also be aware that the embodiments described in the specification belong to preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.

Referring to FIG. 8, a structural block diagram of a text error correction device for an image provided in an embodiment of the present application is shown, which is applied to a multimodal text error correction system. The multimodal text error correction system includes at least a text feature correction module, an error correction vector accessor, and an error correction decoder. The device may specifically include the following modules:

The feature extraction module 801 is used to obtain image information and original text information corresponding to the input operation in response to the input operation of the image and text, and to extract features from the image information and the original text information respectively to obtain the image features corresponding to the image information and the original text features corresponding to the original text information;

The text coding feature generation module 802 is used to perform feature splicing of the image feature and the original text feature to obtain a comprehensive coding feature, and perform feature interception based on the comprehensive coding feature and the original text feature to obtain a text coding feature;

The target text feature generation module 803 is used to perform feature correction on the text encoding feature through the text feature correction module to generate text correction features, and perform feature fusion of the text correction features with the text encoding features through the error correction vector accessor to obtain the target text features;

The text feature replacement module 804 is used to replace the original text features with the target text features through the error correction decoder, and output the corresponding target text information.
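As a rough illustration of the feature splicing and feature interception performed by module 802, the following Python sketch concatenates image rows and text rows along the sequence axis and then keeps only the rows at the positions occupied by the original text feature. It assumes the text rows are appended after the image rows, and it uses a hypothetical `encode` placeholder (identity by default) where a real system would apply its joint encoder; none of these names come from the application.

```python
def splice(image_feature, text_feature):
    """Concatenate image rows and text rows into one sequence (image first, by assumption)."""
    return list(image_feature) + list(text_feature)


def intercept_text(comprehensive, num_text_rows):
    """Keep the rows at the positions the original text feature occupied (the tail, by assumption)."""
    return comprehensive[len(comprehensive) - num_text_rows:]


def text_encoding_feature(image_feature, text_feature, encode=lambda rows: rows):
    """Splice, encode with a caller-supplied encoder, then intercept the text positions."""
    comprehensive = encode(splice(image_feature, text_feature))
    return intercept_text(comprehensive, len(text_feature))


if __name__ == "__main__":
    image = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 image rows, d = 2
    text = [[1.0, 1.1], [2.0, 2.1]]               # M = 2 text rows, d = 2
    print(text_encoding_feature(image, text))
```

With the identity encoder the intercepted rows equal the original text rows; with a real encoder they would carry information mixed in from the image rows, which is the point of encoding the spliced sequence before interception.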

In an optional embodiment, the text encoding feature generation module 802 includes:

In an optional embodiment, the target self-attention vector generation module includes:

In an optional embodiment, the target self-attention vector determination submodule includes:

The target character output module is used to concatenate the target preceding character with the target current character, output the corresponding target character, and generate the target self-attention vector corresponding to the target character.

In an optional embodiment, the target current character generation module is specifically used for:

In an optional embodiment, the target prefix character generating module is specifically configured to include:

In an optional embodiment, the effective text information vector generation module is specifically used for:

Using formula

In an optional embodiment, the size of the text encoding feature is [M, d], the sizes of the front misalignment feature and the back misalignment feature are both [M-1, d], and the adjacent feature interaction vector generation module is specifically used for:

In an optional embodiment, the adjacent text information vector generation module is specifically used to:

Using formula

In an optional embodiment, the error correction vector accessor includes at least a feature storage space, and the device further includes:

In an optional embodiment, the error correction vector accessor includes a repair judgment gate and a feature updater, and the target text feature generation module 803 includes:

In an optional embodiment, the replacement subtext feature determination module is specifically used to:

Using formula

Perform repair judgment on each sub-text feature;

In an optional embodiment, the target subtext feature determination module includes:

In an optional embodiment, the text feature value calculation module is specifically used for:

As for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the partial description of the method embodiment.
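For the adjacent feature interaction vector mentioned above, where a text encoding feature of size [M, d] yields front and back misalignment features of size [M-1, d] that are cascaded into a vector of size [M-1, d×2], a minimal Python sketch is given below. Which slice counts as "front" and which as "back" is an assumption here (front drops the last row, back drops the first row), and the function name is hypothetical.

```python
def adjacent_interaction(f):
    """Pair each row with its right-hand neighbour and concatenate along the feature axis.

    f has size [M, d]; the result has size [M-1, 2*d].
    """
    m = len(f)
    front = f[: m - 1]  # rows 0 .. M-2 (assumed "front misalignment feature")
    back = f[1:]        # rows 1 .. M-1 (assumed "back misalignment feature")
    return [list(a) + list(b) for a, b in zip(front, back)]


if __name__ == "__main__":
    f = [[1, 2], [3, 4], [5, 6]]  # M = 3, d = 2
    print(adjacent_interaction(f))
```

Each output row holds the features of two neighbouring characters side by side, which is what a subsequent coherence prediction over adjacent characters would need as input.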

In addition, an embodiment of the present application also provides an electronic device, including: a processor, a memory, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, each process of the above-mentioned text error correction method embodiment of the image is implemented, and the same technical effect can be achieved. To avoid repetition, it will not be described here.

As shown in FIG. 9, the embodiment of the present application further provides a computer non-volatile readable storage medium 901, on which a computer program is stored. When the computer program is executed by a processor, each process of the above-mentioned text error correction method embodiment of the image is implemented, and the same technical effect can be achieved. To avoid repetition, it is not repeated here. Among them, the computer non-volatile readable storage medium 901 is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.

FIG. 10 is a schematic diagram of the hardware structure of an electronic device for implementing various embodiments of the present application.

The electronic device 1000 includes but is not limited to: a radio frequency unit 1001, a network module 1002, an audio output unit 1003, input unit 1004, sensor 1005, display unit 1006, user input unit 1007, interface unit 1008, memory 1009, processor 1010, and power supply 1011 and other components. Those skilled in the art will understand that the electronic device structure involved in the embodiments of the present application does not constitute a limitation on the electronic device, and the electronic device may include more or fewer components than shown in the figure, or combine certain components, or arrange components differently. In the embodiments of the present application, the electronic device includes but is not limited to a mobile phone, a tablet computer, a laptop computer, a PDA, a vehicle-mounted terminal, a wearable device, and a pedometer.

It should be understood that in the embodiment of the present application, the RF unit 1001 can be used for receiving and sending signals during information transmission or calls. Specifically, after receiving downlink data from the base station, it is sent to the processor 1010 for processing; in addition, uplink data is sent to the base station. Generally, the RF unit 1001 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, etc. In addition, the RF unit 1001 can also communicate with the network and other devices through a wireless communication system.

The electronic device provides users with wireless broadband Internet access through the network module 1002, such as helping users to send and receive emails, browse web pages, and access streaming media.

The audio output unit 1003 can convert the audio data received by the RF unit 1001 or the network module 1002 or stored in the memory 1009 into an audio signal and output it as sound. Moreover, the audio output unit 1003 can also provide audio output related to a specific function performed by the electronic device 1000 (for example, a call signal reception sound, a message reception sound, etc.). The audio output unit 1003 includes a speaker, a buzzer, a receiver, etc.

The input unit 1004 is used to receive audio or video signals. The input unit 1004 may include a graphics processor (GPU) 10041 and a microphone 10042, and the graphics processor 10041 processes the image data of a static picture or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The processed image frame can be displayed on the display unit 1006. The image frame processed by the graphics processor 10041 can be stored in the memory 1009 (or other storage medium) or sent via the radio frequency unit 1001 or the network module 1002. The microphone 10042 can receive sound and can process such sound into audio data. The processed audio data can be converted into a format output that can be sent to a mobile communication base station via the radio frequency unit 1001 in the case of a telephone call mode.

The electronic device 1000 also includes at least one sensor 1005, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor, wherein the ambient light sensor can adjust the brightness of the display panel 10061 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 10061 and/or the backlight when the electronic device 1000 is moved to the ear. As a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), and can detect the magnitude and direction of gravity when stationary, which can be used to identify the posture of the electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer, tapping), etc.; the sensor 1005 can also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which will not be repeated here.

The display unit 1006 is used to display information input by the user or information provided to the user. The display unit 1006 may include a display panel 10061, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.

The user input unit 1007 may be used to receive input digital or character information and generate key signal input related to user settings and function control of the electronic device. Specifically, the user input unit 1007 includes a touch panel 10071 and other input devices 10072. The touch panel 10071, also known as a touch screen, may collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel 10071 using any suitable object or accessory such as a finger or a stylus). The touch panel 10071 may include two parts: a touch detection device and a touch controller. Among them, the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact point coordinates, sends them to the processor 1010, and receives and executes the commands sent by the processor 1010. In addition, the touch panel 10071 can be implemented in various types such as resistive, capacitive, infrared and surface acoustic wave. In addition to the touch panel 10071, the user input unit 1007 may also include other input devices 10072. Specifically, other input devices 10072 may include but are not limited to physical keyboards, function keys (such as volume control buttons, switch buttons, etc.), trackballs, mice, and joysticks, which will not be repeated here.

Further, the touch panel 10071 may be covered on the display panel 10061. When the touch panel 10071 detects a touch operation on or near it, it is transmitted to the processor 1010 to determine the type of the touch event, and then the processor 1010 provides a corresponding visual output on the display panel 10061 according to the type of the touch event. It can be understood that in one embodiment, the touch panel 10071 and the display panel 10061 are used as two independent components to implement the input and output functions of the electronic device, but in some embodiments, the touch panel 10071 and the display panel 10061 can be integrated to implement the input and output functions of the electronic device, which is not limited here.

The interface unit 1008 is an interface for connecting an external device to the electronic device 1000. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device with an identification module, an audio input/output (I/O) port, a video I/O port, a headphone port, etc. The interface unit 1008 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic device 1000 or may be used to transmit data between the electronic device 1000 and an external device.

The memory 1009 can be used to store software programs and various data. The memory 1009 can mainly include a program storage area and a data storage area, wherein the program storage area can store an operating system, an application required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the data storage area can store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), etc. In addition, the memory 1009 can include a high-speed random access memory, and can also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other volatile solid-state storage devices.

The processor 1010 is the control center of the electronic device. It uses various interfaces and lines to connect various parts of the entire electronic device. By running or executing software programs and/or modules stored in the memory 1009 and calling data stored in the memory 1009, it performs various functions of the electronic device and processes data, thereby monitoring the electronic device as a whole. The processor 1010 may include one or more processing units; preferably, the processor 1010 may integrate an application processor and a modem processor, wherein the application processor mainly processes the operating system, user interface, and application programs, and the modem processor mainly processes wireless communications. It is understandable that the above-mentioned modem processor may not be integrated into the processor 1010.

The electronic device 1000 may also include a power supply 1011 (such as a battery) for supplying power to each component. Preferably, the power supply 1011 may be logically connected to the processor 1010 via a power management system, thereby implementing functions such as managing charging, discharging, and power consumption management through the power management system.

In addition, the electronic device 1000 includes some functional modules not shown, which will not be described in detail here.

It should be noted that, in this document, the terms "include", "comprise" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, article or device. In the absence of further restrictions, an element defined by the phrase "comprises a..." does not exclude the existence of other identical elements in the process, method, article or device that includes that element.

Through the description of the above implementation methods, those skilled in the art can clearly understand that the above-mentioned embodiment methods can be implemented by means of software plus a necessary general hardware platform, and of course by hardware, but in many cases the former is a better implementation method. Based on such an understanding, the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, a disk, or an optical disk), and includes a number of instructions for a terminal (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods of each embodiment of the present application.

The embodiments of the present application are described above in conjunction with the accompanying drawings, but the present application is not limited to the above-mentioned specific implementation methods. The above-mentioned specific implementation methods are merely illustrative and not restrictive. Under the guidance of the present application, ordinary technicians in this field can also make many forms without departing from the purpose of the present application and the scope of protection of the claims, all of which are within the protection of the present application.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of each example described in conjunction with the embodiments disclosed in the embodiments of the present application can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professional and technical personnel can use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working processes of the systems, devices and units described above can refer to the corresponding processes in the aforementioned method embodiments and will not be repeated here.

In the embodiments provided in the present application, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are only schematic. For example, the division of units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part that contributes to the prior art or the part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for a computer device (which can be a personal computer, server, or network device, etc.) to perform all or part of the steps of the various embodiments of the present application. The aforementioned storage medium includes: various media that can store program codes, such as USB flash drives, mobile hard drives, ROM, RAM, magnetic disks, or optical disks.

The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any technician familiar with the technical field can easily think of changes or substitutions within the technical scope disclosed in the present application, which should be included in the protection scope of the present application. Therefore, the protection scope of the present application should be based on the protection scope of the claims.

Claims

1. A method for correcting text errors in an image, characterized in that it is applied to a multimodal text error correction system, the multimodal text error correction system at least comprising a text feature correction module, an error correction vector accessor and an error correction decoder, the method comprising:

In response to an input operation for an image and text, image information and original text information corresponding to the input operation are acquired, and feature extraction is performed on the image information and the original text information respectively to obtain image features corresponding to the image information and original text features corresponding to the original text information;

Performing feature concatenation of the image feature and the original text feature to obtain a comprehensive coding feature, and performing feature interception based on the comprehensive coding feature and the original text feature to obtain a text coding feature;

The text encoding feature is corrected by the text feature correction module to generate a text correction feature, and the text correction feature is fused with the text encoding feature by the error correction vector accessor to obtain a target text feature;

The error correction decoder uses the target text features to perform feature replacement on the original text features, and outputs corresponding target text information.

2. The method according to claim 1, characterized in that the step of performing feature correction on the text encoding feature by the text feature correction module to generate a text correction feature comprises:

The text encoding feature is self-attention encoded by the text feature correction module to obtain a corresponding initial self-attention vector, and the initial self-attention vector is subjected to character prediction processing to obtain a corresponding target self-attention vector, wherein the target self-attention vector includes correlation features between the image feature and the original text feature;

Predicting the effective information amount of the text encoding feature to obtain a corresponding effective text information vector, wherein the effective text information vector represents the probability that each character in the text encoding feature contains effective information;

Bidirectionally intercepting the text encoding feature to obtain a front misalignment feature and a back misalignment feature respectively, and generating an adjacent feature interaction vector according to the front misalignment feature and the back misalignment feature;

Performing coherence prediction processing on the adjacent feature interaction vector to obtain a corresponding adjacent text information vector, wherein the adjacent text information vector represents a probability of coherence of adjacent characters in the text encoding feature;

The target self-attention vector, the effective text information vector and the adjacent text information vector are used to perform feature correction on the text encoding feature to generate a text correction feature.

3. The method according to claim 2, characterized in that the step of performing self-attention encoding on the text encoding feature through the text feature correction module to obtain a corresponding initial self-attention vector comprises:

The text encoding features are input into the self-attention layer, using the formula

Perform self-attention encoding to obtain the corresponding initial self-attention vector; where,

W_q, W_k, and W_v are all learnable weights, and f is the text encoding feature.

4. The method according to claim 2 or 3, characterized in that the performing character prediction processing on the initial self-attention vector to obtain the corresponding target self-attention vector comprises:

The initial self-attention vector is input into two groups of fully connected layers for current character prediction processing and preceding character prediction processing respectively, obtaining the current prediction vector and the preceding prediction vector;

A target self-attention vector is determined according to the current prediction vector and the preceding prediction vector.

5. The method according to claim 4, characterized in that the determining the target self-attention vector according to the current prediction vector and the preceding prediction vector comprises:

Using the current prediction vector to perform prediction processing on the text encoding feature to obtain a target current character corresponding to the text encoding feature;

Using the preceding prediction vector to perform prediction processing on the target current character to obtain a target preceding character corresponding to the target current character;

The target preceding character is concatenated with the target current character, the corresponding target character is output, and a target self-attention vector corresponding to the target character is generated.

6. The method according to claim 5, characterized in that the step of using the current prediction vector to perform prediction processing on the text encoding feature to obtain the target current character corresponding to the text encoding feature comprises:

According to the current prediction vector, the text encoding feature is probability matched with each preset character in the preset dictionary to obtain the current prediction probability corresponding to each preset character, and the preset character with the maximum current prediction probability is determined as the target current character.

7. The method according to claim 6, characterized in that the step of using the preceding prediction vector to perform prediction processing on the target current character to obtain a target preceding character corresponding to the target current character comprises:

According to the preceding prediction vector, the target current character is probability matched with each preset character in the preset dictionary to obtain the preceding prediction probability corresponding to each preset character, and the preset character with the largest preceding prediction probability is determined as the target preceding character.

8. The method according to claim 2, characterized in that the step of predicting the effective information amount of the text encoding feature to obtain the corresponding effective text information vector comprises:

Using formula

Predicting the effective information volume of the text encoding feature to obtain a corresponding effective text information vector; wherein,

W_iq, W_ik, and W_iv are all transfer matrix weights, W_iw is the information prediction weight, b_ib is the learnable model parameter, and f is the text encoding feature.

9. The method according to claim 2, characterized in that the size of the text encoding feature is [M, d], the sizes of the front misalignment feature and the back misalignment feature are both [M-1, d], and generating an adjacent feature interaction vector based on the front misalignment feature and the back misalignment feature comprises:

The front misalignment feature and the back misalignment feature are subjected to vector cascade processing to generate an adjacent feature interaction vector of a size of [M-1, d×2] corresponding to the text encoding feature.

10. The method according to claim 2 or 9, characterized in that the step of performing coherence prediction processing on the adjacent feature interaction vector to obtain the corresponding adjacent text information vector comprises:

Using formula

The adjacent feature interaction vectors are processed for coherent prediction to obtain corresponding adjacent text information vectors; wherein,

W_nw, W_nq, W_nv, and W_nk are all transfer matrix weight parameters, b_in is the bias vector parameter, and f_nb is the adjacent feature interaction vector.

11. The method according to claim 2, characterized in that the error correction vector accessor comprises at least a feature storage space, and after performing feature interception according to the comprehensive coding feature and the original text feature to obtain the text coding feature, the method further comprises:

The text encoding feature is split into a plurality of sub-text features, and each of the sub-text features is stored in the feature storage space in sequence.

12. The method according to claim 11, characterized in that the error correction vector accessor includes a repair judgment gate and a feature updater, and the step of fusing the text correction feature with the text encoding feature through the error correction vector accessor to obtain the target text feature comprises:

Performing a repair judgment on each of the sub-text features through the repair judgment gate to determine at least one replacement sub-text feature that needs to be replaced;

The feature updater uses the text correction feature to perform feature replacement on at least one of the replacement sub-text features to obtain the corresponding target sub-text features, and performs feature fusion on at least one of the target sub-text features to obtain the corresponding target text features.
The method according to claim 12, characterized in that the step of performing repair judgment on each of the sub-text features through the repair judgment gate to determine at least one replacement sub-text feature that needs to be replaced comprises:

Using formula

Performing repair judgment on each of the sub-text features;

When $s(x_k)$ is 1, determining the sub-text feature with feature number $k$ as a replacement sub-text feature to be replaced; wherein,

$k$ represents the feature number corresponding to the sub-text feature, $p_{ifok}$ is the valid text information vector corresponding to the sub-text feature with feature number $k$, $p_{nbok}$ is the adjacent text information vector corresponding to the sub-text feature with feature number $k$, $thresh_{ifo}$ represents the settable information quantity probability threshold, $thresh_{nbo}$ represents the settable fluency probability threshold, and $s(x_k)$ represents whether the sub-text feature with feature number $k$ needs to be replaced.
The method according to claim 12, characterized in that the step of replacing at least one of the replacement sub-text features with the text correction feature by the feature updater to obtain the respective corresponding target sub-text features comprises:

Calculating, from the text correction feature and by the formula

$$f_{ko} = f_k \times (1-\mu) + \left(p_{ifok} \times \theta + p_{nbok} \times (1-\theta)\right) \times \mu \times o_{emlm}$$

the text feature value corresponding to the replacement sub-text feature, and performing feature replacement on the replacement sub-text feature according to the text feature value to obtain the corresponding target sub-text feature; wherein,

$f_k$ is the sub-text feature with feature number $k$, $o_{emlm}$ is the target self-attention vector, and $\theta$ and $\mu$ are both preset parameters with values between 0 and 1.
The method according to claim 14, characterized in that the step of performing feature replacement on the replacement sub-text feature according to the text feature value to obtain the corresponding target sub-text feature comprises:

Overwriting the original feature value of the replacement sub-text feature with the text feature value to obtain the corresponding target sub-text feature.
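The update and overwrite steps can be sketched in Python as follows. The sketch applies the weighted formula f_k×(1-μ) + (p_ifok×θ + p_nbok×(1-θ))×μ×o_emlm element by element and writes the result back into the stored sub-text feature. It assumes that the two probabilities are scalars per sub-text feature, that one correction vector is available per feature, and that the gate flags are already known; the function names and toy values are hypothetical.

```python
def update_feature(f_k, o_emlm, p_ifo_k, p_nbo_k, theta=0.5, mu=0.5):
    """Blend the stored feature f_k with the correction vector o_emlm."""
    # Scalar weight on the correction vector (probabilities assumed scalar).
    weight = (p_ifo_k * theta + p_nbo_k * (1.0 - theta)) * mu
    return [x * (1.0 - mu) + weight * o for x, o in zip(f_k, o_emlm)]


def apply_updates(store, flags, corrections, p_ifo, p_nbo, theta=0.5, mu=0.5):
    """Overwrite, in place, every stored feature whose gate flag is 1."""
    for k, flag in enumerate(flags):
        if flag == 1:
            store[k] = update_feature(
                store[k], corrections[k], p_ifo[k], p_nbo[k], theta, mu
            )
    return store


# Toy feature store with two sub-text features of dimension 2.
store = [[1.0, 2.0], [4.0, 8.0]]
flags = [0, 1]                      # only feature 1 is marked for replacement
corrections = [[0.0, 0.0], [2.0, 4.0]]
p_ifo = [0.5, 1.0]
p_nbo = [0.5, 0.5]
apply_updates(store, flags, corrections, p_ifo, p_nbo)
print(store)  # [[1.0, 2.0], [2.75, 5.5]]
```

With θ = μ = 0.5, the flagged feature keeps half of its original value and receives the correction vector scaled by 0.375, while the unflagged feature is left untouched.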
The method according to claim 1, characterized in that the step of performing feature splicing on the image feature and the original text feature to obtain a comprehensive coding feature comprises:

Concatenating the image feature with the original text feature, and performing cross-modal encoding to obtain the comprehensive coding feature.
The method according to claim 1, characterized in that the step of performing feature interception based on the comprehensive coding feature and the original text feature to obtain the text coding feature comprises:

The feature corresponding to the original text feature position in the comprehensive coding feature is intercepted to obtain the text coding feature corresponding to the original text feature.
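The splicing and interception steps can be pictured with the short Python sketch below. It concatenates image and text features along the sequence axis, passes them through a stand-in encoder, and slices out the rows at the positions the text features occupied. Placing the image features first and using an identity function in place of the cross-modal encoder are assumptions made for the example; all names are hypothetical.

```python
def splice(image_feat, text_feat):
    """Concatenate along the sequence axis: [N, d] and [M, d] -> [N+M, d]."""
    return image_feat + text_feat


def cross_modal_encode(seq):
    """Placeholder encoder (assumption): copies each row unchanged."""
    return [list(row) for row in seq]


def intercept_text(encoded, num_image_tokens, num_text_tokens):
    """Keep only the rows at the positions of the original text features."""
    start = num_image_tokens
    return encoded[start:start + num_text_tokens]


image_feat = [[0.1, 0.2], [0.3, 0.4]]             # N = 2, d = 2
text_feat = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # M = 3, d = 2

combined = cross_modal_encode(splice(image_feat, text_feat))
text_enc = intercept_text(combined, len(image_feat), len(text_feat))
print(len(combined), text_enc == text_feat)  # 5 True
```

A real cross-modal encoder would change the row values, but the slice would still return M rows of width d, matching the [M, d] size of the text encoding feature.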
A text error correction device for an image, characterized in that it is applied to a multimodal text error correction system, the multimodal text error correction system at least comprising a text feature correction module, an error correction vector accessor and an error correction decoder, the device comprising:

A feature extraction module, configured to, in response to an input operation for an image and text, obtain image information and original text information corresponding to the input operation, and perform feature extraction on the image information and the original text information respectively to obtain image features corresponding to the image information and original text features corresponding to the original text information;

A text coding feature generation module, configured to perform feature splicing on the image features and the original text features to obtain a comprehensive coding feature, and perform feature interception based on the comprehensive coding feature and the original text features to obtain a text coding feature;

A target text feature generation module, configured to perform feature correction on the text coding feature through the text feature correction module to generate a text correction feature, and perform feature fusion of the text correction feature with the text coding feature through the error correction vector accessor to obtain a target text feature;

A text feature replacement module, configured to replace the original text features with the target text feature through the error correction decoder, and output the corresponding target text information.
An electronic device, characterized in that it comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;

The memory is used to store computer programs;

The processor is used to implement the method according to any one of claims 1 to 17 when executing the program stored in the memory.
A non-volatile computer-readable storage medium having instructions stored thereon which, when executed by one or more processors, cause the processors to execute the method according to any one of claims 1 to 17.