WO2024139307A1 - Sentence correction method and apparatus for image, and electronic device and storage medium - Google Patents
- Publication number
- WO2024139307A1 (application PCT/CN2023/115054)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- text
- vector
- features
- target
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
Definitions
- the present application relates to the field of artificial intelligence technology, and in particular to a method for correcting text errors in images, a device for correcting text errors in images, an electronic device, and a computer non-volatile readable storage medium.
- Multimodal artificial intelligence has become one of the important research directions in the field of AI (Artificial Intelligence).
- Multimodal research aims to integrate multiple modal inputs such as images, videos, audio, text, sensor signals, and comprehensively understand or generate information that can be used by humans.
- VQA: Visual Question Answering.
- fields such as Visual Question Answering (VQA) and Visual Grounding involve multimodal relationships between images and text.
- Transformer: a deep learning model based on the self-attention mechanism.
- the Transformer-based multimodal network structure has become more and more popular in multimodal tasks such as Visual Question Answering (VQA), Image Caption, and Visual Dialog.
- the embodiments of the present application provide a method, device, electronic device and computer non-volatile readable storage medium for text correction of images to solve or partially solve the problem that the algorithm is prone to errors when completing multimodal tasks due to the inability to strictly match text and image.
- the embodiment of the present application discloses a method for text error correction of an image, which is applied to a multimodal text error correction system.
- the multimodal text error correction system at least includes a text feature correction module, an error correction vector accessor, and an error correction decoder.
- the method includes:
- image information and original text information corresponding to the input operations are acquired, and feature extraction is performed on the image information and the original text information respectively to obtain image features corresponding to the image information and original text features corresponding to the original text information;
- the text encoding features are corrected by the text feature correction module to generate text correction features, and the text correction features are fused with the text encoding features by the error correction vector accessor to obtain the target text features;
- the error correction decoder uses the target text features to replace the original text features and outputs the corresponding target text information.
- the text encoding feature is corrected by a text feature correction module to generate a text correction feature, including:
- the text encoding features are self-attention encoded through the text feature correction module to obtain the corresponding initial self-attention vector, and the initial self-attention vector is processed by character prediction to obtain the corresponding target self-attention vector, which contains the correlation features between the image features and the original text features;
- the target self-attention vector is determined based on the current prediction vector and the previous prediction vector.
- the error correction vector accessor includes a repair judgment gate and a feature updater, and the text correction feature and the text encoding feature are fused through the error correction vector accessor to obtain the target text feature, including:
- the image features are concatenated with the original text features to obtain comprehensive coding features, including:
- the initial self-attention vector generation module is used to input the text encoding features into the self-attention layer using the formula
- W_q, W_k, and W_v are all learnable weights, and f is the text encoding feature.
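The formula itself is not reproduced in this extract. The following is a minimal sketch that assumes the standard scaled dot-product form, softmax(Q K^T / sqrt(d)) V with Q = f W_q, K = f W_k, V = f W_v. The helper names and the toy weights are illustrative assumptions, not taken from the application.

```python
import math

def matmul(a, b):
    """Multiply an [n, k] matrix by a [k, m] matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(f, w_q, w_k, w_v):
    """Assumed standard form: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = matmul(f, w_q), matmul(f, w_k), matmul(f, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose K
    scores = matmul(q, k_t)                       # [M, M] similarity scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)                     # [M, d] attended features

# Toy input: M = 3 text positions, d = 2 channels, identity weights (assumed).
identity = [[1.0, 0.0], [0.0, 1.0]]
f = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(f, identity, identity, identity)
```

The output keeps the [M, d] shape of the text encoding feature, so it can be passed on to the character prediction layers position by position.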
- the target self-attention vector generation module includes:
- a character prediction processing module is used to input the initial self-attention vector into two groups of fully connected layers to perform current character prediction processing and previous character prediction processing respectively, and obtain a current prediction vector and a previous prediction vector;
- the target self-attention vector determination submodule includes:
- a target current character generation module is used to use the current prediction vector to predict the text encoding feature to obtain the target current character corresponding to the text encoding feature;
- a target preceding character generating module is used to predict the target current character using the preceding prediction vector to obtain the target preceding character corresponding to the target current character;
- the target character output module is used to concatenate the target preceding character with the target current character, output the corresponding target character, and generate a target self-attention vector corresponding to the target character.
- the target current character generation module is specifically used for:
- the text encoding feature is probability matched with each preset character in the preset dictionary to obtain the current prediction probability corresponding to each preset character, and the preset character with the largest current prediction probability is determined as the target current character.
- the target current character is probability matched with each preset character in the preset dictionary to obtain the pre-prediction probability corresponding to each preset character, and the preset character with the largest pre-prediction probability is determined as the target pre-prediction character.
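As a rough illustration of the probability matching described above, the sketch below turns one score per preset character into probabilities and picks the most likely entry. The dictionary contents, the score values, and the use of softmax are all assumptions made for the example; the application does not specify them in this extract.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_character(scores, dictionary):
    """Return the dictionary entry with the largest predicted probability."""
    probs = softmax(scores)
    best = max(range(len(dictionary)), key=lambda i: probs[i])
    return dictionary[best], probs[best]

# Hypothetical preset dictionary and hypothetical scores from the two heads.
dictionary = ["red", "white", "<empty>"]
current_scores = [0.2, 2.5, -1.0]      # current-character prediction head
preceding_scores = [-0.5, 0.1, 3.0]    # preceding-character prediction head

current_char, current_prob = pick_character(current_scores, dictionary)
preceding_char, _ = pick_character(preceding_scores, dictionary)
```

Here the current head selects "white" and the preceding head selects the placeholder "<empty>", which in this sketch stands for "nothing needs to be inserted before this position".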
- the effective text information vector generation module is specifically used for:
- W_iq, W_ik, and W_iv are all transfer matrix weights
- W_iw is the information prediction weight
- b_ib is the learnable model parameter
- f is the text encoding feature.
- the size of the text encoding feature is [M, d]
- the sizes of the front misalignment feature and the back misalignment feature are both [M-1, d]
- the adjacent feature interaction vector generation module is specifically used for:
- the front misaligned features and the back misaligned features are vector-concatenated to generate an adjacent feature interaction vector of size [M-1, d × 2] corresponding to the text encoding features.
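The shapes above can be checked with a small sketch. It assumes that the front misaligned feature is the first M-1 rows of the text encoding feature and the back misaligned feature is the last M-1 rows, so that row i of the result pairs position i with position i+1. That slicing rule is an assumption; only the sizes [M, d], [M-1, d] and [M-1, d × 2] come from the text.

```python
def adjacent_interaction(f):
    """Pair every position with its right-hand neighbour.

    f has shape [M, d]; the result has shape [M-1, 2*d].
    """
    front = f[:-1]   # assumed front misaligned feature, rows 0 .. M-2
    back = f[1:]     # assumed back misaligned feature, rows 1 .. M-1
    return [a + b for a, b in zip(front, back)]  # concatenate along channels

# Toy text encoding feature with M = 4 positions and d = 2 channels.
f = [[1, 2], [3, 4], [5, 6], [7, 8]]
pairs = adjacent_interaction(f)
```

With M = 4 and d = 2 the result has 3 rows of 4 values each, matching the [M-1, d × 2] size stated above.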
- W_nw, W_nq, W_nv, and W_nk are all transfer matrix weight parameters, b_in is the bias vector parameter, and f_nb is the adjacent feature interaction vector.
- the target text feature generation module includes:
- the replacement subtext feature determination module is specifically used for:
- the subtext feature with feature number k is determined as the replacement subtext feature to be replaced;
- the target subtext feature determination module includes:
- f_k is the sub-text feature with feature number k
- o_emlm is the target self-attention vector
- the two coefficients are both preset parameters with values between 0 and 1.
- the text feature value calculation module is specifically used for:
- the original text feature value of the replacement sub-text feature is replaced by overwriting the original value using the text feature value to obtain the corresponding target sub-text feature.
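The exact update formula is not recoverable from this extract. One plausible reading, shown here purely as an assumption, is a weighted sum of the sub-text feature with feature number k and the target self-attention vector, using two preset coefficients between 0 and 1 (named `alpha` and `beta` here for illustration only), followed by overwriting the stored value.

```python
def fuse(sub_text_feature, target_vector, alpha, beta):
    """Assumed update rule: alpha * f_k + beta * o, element by element."""
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("both preset parameters must lie between 0 and 1")
    return [alpha * x + beta * y for x, y in zip(sub_text_feature, target_vector)]

def overwrite(features, k, new_value):
    """Return a copy of the feature list with entry k replaced."""
    updated = [list(row) for row in features]
    updated[k] = list(new_value)
    return updated

# Toy values: two sub-text features, position k = 1 is the one to replace.
features = [[0.0, 0.0], [1.0, 2.0]]
target_vector = [3.0, 0.0]
new_value = fuse(features[1], target_vector, alpha=0.25, beta=0.75)
target_features = overwrite(features, 1, new_value)
```

Only position k changes; every other sub-text feature is carried over unchanged, which matches the overwrite behaviour described above.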
- the text encoding feature generation module includes:
- the cross-modal coding processing module is used to concatenate image features with original text features and perform cross-modal coding processing to obtain comprehensive coding features.
- the corresponding feature position interception module is used to intercept the features corresponding to the original text feature positions in the comprehensive coding features to obtain text coding features corresponding to the original text features.
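The concatenate, encode, and intercept sequence can be sketched as follows. The sketch assumes that image features are placed before text features along the sequence axis and uses an identity copy as a stand-in for the cross-modal encoder; both choices, and all function names, are illustrative assumptions.

```python
def concatenate(image_features, text_features):
    """Join the two feature sequences along the sequence axis (image first)."""
    return image_features + text_features

def cross_modal_encode(features):
    """Placeholder for the cross-modal encoder: returns an unchanged copy."""
    return [list(row) for row in features]

def intercept_text(comprehensive, num_image, num_text):
    """Slice out the rows that sit at the original text feature positions."""
    return comprehensive[num_image:num_image + num_text]

image_features = [[0.1, 0.2], [0.3, 0.4]]              # 2 image positions
text_features = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]   # 3 text positions

comprehensive = cross_modal_encode(concatenate(image_features, text_features))
text_encoding = intercept_text(comprehensive, len(image_features), len(text_features))
```

Because the placeholder encoder preserves sequence length, the intercepted slice has exactly one row per original text position, which is what the later correction modules expect.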
- the embodiment of the present application also discloses an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
- a memory, used to store computer programs
- the embodiment of the present application also discloses a computer non-volatile readable storage medium having instructions stored thereon, which, when executed by one or more processors, enables the processors to execute the method as in the embodiment of the present application.
- an image-based text error correction method for a multimodal text error correction system is provided.
- feature extraction is performed on an input image and an original text respectively to obtain image features and original text features.
- the image features and the original text features are generated into a comprehensive coding feature by feature splicing, and the text coding feature is obtained by intercepting the feature corresponding to the original text feature position in the comprehensive coding feature.
- the text coding feature is feature corrected by a text feature correction module to generate a text correction feature
- the text correction feature is feature fused with the text coding feature by an error correction vector accessor to obtain a target text feature.
- FIG. 2 is a schematic diagram of a multimodal sample of image-based text error correction provided in an embodiment of the present application.
- FIG. 4 is a flowchart of a method for correcting text errors in an image provided in an embodiment of the present application.
- FIG. 9 is a schematic diagram of a computer non-volatile readable medium provided in an embodiment of the present application.
- Sentence Correction: detecting text errors and making the corresponding corrections.
- Multi Modal refers to the collaborative reasoning of multiple heterogeneous modal data. In the field of artificial intelligence, it often refers to the collaboration of perceptual information, such as images, text, video, audio, etc., to help artificial intelligence understand the external world more accurately.
- FIG. 1 shows a schematic diagram of how a current non-strictly matched multimodal sample interferes with an existing method.
- the non-strictly matched multimodal sample is presented as a question and answer: the questioner asks “What is the red pickup truck with a tall bucket attached to it doing?”, and the answerer replies “Stop the car”.
- FIG. 2 shows a schematic diagram of a multimodal sample of image-based text correction provided in an embodiment of the present application.
- the image is the same as in FIG. 1, and the input text is "a red pickup truck with a tall bucket attached is driving on the road".
- the body color of the pickup truck is obviously wrong, and the erroneous information in the input text needs to be corrected according to the information displayed in the image, such as "red” should be corrected to "white”, so that the input text is corrected through an image-based text correction method provided in the present application, and the corresponding output text should be "a white pickup truck with a tall bucket attached is driving on the road", achieving a strict match between the image and the text, and obtaining the correct text content.
- the visual elastic mask submodule can be used to correct sentences of indefinite length. For example, the wrong word in the input text may occupy 2 characters while the corrected word actually occupies 3, and the submodule allows the longer corrected word to be produced.
- variable-length error correction of the input text makes text correction more accurate and more reliable.
- error correction may also shorten the original sentence.
- in that case some characters in the original sentence should be deleted, for example reducing 3 characters to 2.
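One way to picture variable-length correction is to let each input position emit a pair (preceding character, current character), where either slot may be empty, and then join the non-empty slots in order. This is a sketch under that assumption; the empty marker and the function name are hypothetical and not taken from the application.

```python
EMPTY = ""  # hypothetical marker meaning "no character at this slot"

def assemble(predictions):
    """Join (preceding, current) pairs in order, skipping empty slots."""
    output = []
    for preceding, current in predictions:
        for character in (preceding, current):
            if character != EMPTY:
                output.append(character)
    return output

# Lengthening: the second input position yields two output characters.
longer = assemble([(EMPTY, "a"), ("b", "c"), (EMPTY, "d")])

# Shortening: the second input position yields nothing at all.
shorter = assemble([(EMPTY, "a"), (EMPTY, EMPTY), (EMPTY, "d")])
```

Three input positions produce four characters in the first case and two in the second, so the output length is no longer tied to the input length.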
- the present application designs an information prediction network sub-module in the text feature correction module 303, which uses the feature at each position to predict whether the character at that position contains effective information.
- the present application also designs an adjacent word relationship prediction submodule in the text feature correction module 303 to predict the text coherence of the features corresponding to the input text.
- the target text features can be input into the error correction decoder 305, and the target text features can be used by the error correction decoder 305 to replace the original text features, and output the corresponding target text information.
- the correct output text "a white pickup truck with a tall bucket attached is driving on the road” can be obtained.
- a multimodal text error correction system which may include at least an image/text encoding module, a feature extraction module, a text feature correction module, an error correction vector accessor, and an error correction decoder.
- the text error correction system provided by the present application uses the currently popular Transformer network structure as the backbone network, and implements the model's ability to correct text errors by designing submodules such as visual elastic mask, information volume prediction network, and adjacent vocabulary relationship prediction.
- feature extraction can be performed on the input image and the original text respectively to obtain image features and original text features. A comprehensive coding feature is then generated from the image features and the original text features by feature splicing, and the text coding feature is obtained by intercepting the feature corresponding to the original text feature position in the comprehensive coding feature. Next, the text coding feature is corrected by a text feature correction module to generate a text correction feature, and the text correction feature is fused with the text coding feature by an error correction vector accessor to obtain the target text feature. Finally, the error correction decoder uses the target text feature to replace the original text feature and outputs the corresponding target text information.
- the above-mentioned text error correction system combined with the image-based text error correction method, can realize the identification and correction of fine-grained errors in the original text, which greatly reduces the error rate when performing multimodal tasks.
- the method can be applied to a multimodal text error correction system.
- the multimodal text error correction system includes at least a text feature correction module, an error correction vector accessor, and an error correction decoder.
- the method may specifically include the following steps:
- after the image features and the original text features are obtained, the two can be feature-concatenated to obtain comprehensive coding features. Specifically, the image features and the original text features can be feature-concatenated and then cross-modally encoded to obtain the comprehensive coding features.
- the text encoding feature is corrected by a text feature correction module to generate a text correction feature, which may include the following sub-steps:
- the repair judgment gate 702 can be used to judge whether each text correction feature needs to be repaired to determine whether the corresponding vector in the feature storage space 701 needs to be updated.
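The gating behaviour can be sketched as a per-position decision: update the stored vector only when the gate judges that repair is needed. The decision rule used here, a repair score compared against a threshold of 0.5, is an assumption for illustration; this extract does not state how the repair judgment gate 702 makes its decision.

```python
def apply_repair_gate(storage, corrections, repair_scores, threshold=0.5):
    """Return updated storage: corrected vectors where the gate opens, else the stored ones.

    storage and corrections are [M, d] lists; repair_scores has one value per position.
    """
    updated = []
    for stored, corrected, score in zip(storage, corrections, repair_scores):
        needs_repair = score > threshold          # assumed gate criterion
        updated.append(list(corrected) if needs_repair else list(stored))
    return updated

feature_storage = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # toy feature storage space
text_corrections = [[9.0, 9.0], [8.0, 8.0], [7.0, 7.0]]  # toy text correction features
repair_scores = [0.1, 0.9, 0.4]                          # hypothetical gate outputs

result = apply_repair_gate(feature_storage, text_corrections, repair_scores)
```

Only the middle position exceeds the threshold, so only that vector is replaced while the others are left as stored.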
- the target text feature is used to replace the original text feature by an error correction decoder, and the corresponding target text information is output, thereby realizing the recognition and correction of fine-grained errors in the original text according to the image, greatly reducing the error rate when performing multimodal tasks.
- the adjacent text information vector generation module is specifically used to:
- the embodiment of the present application further provides a computer non-volatile readable storage medium 901, on which a computer program is stored.
- when the computer program is executed by a processor, each process of the above-mentioned embodiment of the text error correction method for images is implemented, and the same technical effect can be achieved. To avoid repetition, details are not repeated here.
- the computer non-volatile readable storage medium 901 is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.
- the RF unit 1001 can be used for receiving and sending signals during information transmission or calls. Specifically, after receiving downlink data from the base station, it is sent to the processor 1010 for processing; in addition, uplink data is sent to the base station.
- the RF unit 1001 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, etc.
- the RF unit 1001 can also communicate with the network and other devices through a wireless communication system.
- the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), and can detect the magnitude and direction of gravity when stationary, which can be used to identify the posture of the electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer, tapping), etc.; the sensor 1005 can also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which will not be repeated here.
- the display unit 1006 is used to display information input by the user or information provided to the user.
- the display unit 1006 may include a display panel 10061, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
- the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, and converts it into the contact point coordinates, and then sends it to the processor 1010, receives the command sent by the processor 1010 and executes it.
- the touch panel 10071 can be implemented in various types such as resistive, capacitive, infrared and surface acoustic wave.
- the user input unit 1007 may also include other input devices 10072.
- other input devices 10072 may include but are not limited to physical keyboards, function keys (such as volume control buttons, switch buttons, etc.), trackballs, mice, and joysticks, which will not be repeated here.
- the touch panel 10071 may be covered on the display panel 10061.
- after the touch panel 10071 detects a touch operation on or near it, the operation is transmitted to the processor 1010 to determine the type of the touch event, and the processor 1010 then provides a corresponding visual output on the display panel 10061 according to the type of the touch event.
- the touch panel 10071 and the display panel 10061 are used as two independent components to implement the input and output functions of the electronic device, but in some embodiments, the touch panel 10071 and the display panel 10061 can be integrated to implement the input and output functions of the electronic device, which is not limited here.
- the memory 1009 can be used to store software programs and various data.
- the memory 1009 can mainly include a program storage area and a data storage area, wherein the program storage area can store an operating system, an application required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the data storage area can store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), etc.
- the memory 1009 can include a high-speed random access memory, and can also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other volatile solid-state storage devices.
- the electronic device 1000 may also include a power supply 1011 (such as a battery) for supplying power to each component.
- the power supply 1011 may be logically connected to the processor 1010 via a power management system, thereby implementing functions such as managing charging, discharging, and power consumption management through the power management system.
- the electronic device 1000 includes some functional modules not shown, which will not be described in detail here.
- the technical solution of the present application can be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, a disk, or an optical disk), and includes a number of instructions for a terminal (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods of each embodiment of the present application.
- the disclosed devices and methods can be implemented in other ways.
- the device embodiments described above are only schematic.
- the division of units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
- Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- the technical solution of the present application or the part that contributes to the prior art or the part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for a computer device (which can be a personal computer, server, or network device, etc.) to perform all or part of the steps of the various embodiments of the present application.
- the aforementioned storage medium includes: various media that can store program codes, such as USB flash drives, mobile hard drives, ROM, RAM, magnetic disks, or optical disks.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
Provided in the embodiments of the present application are a sentence correction method and apparatus for an image, and an electronic device and a non-volatile readable storage medium, which are applied to a multi-modal sentence correction system. The method comprises: firstly, performing feature extraction on both an input image and an input original sentence, so as to obtain an image feature and an original sentence feature; then, generating a comprehensive encoded feature by means of feature splicing, and obtaining an encoded sentence feature by means of capturing from the comprehensive encoded feature a feature corresponding to the position of the original sentence feature; then, performing feature correction on the encoded sentence feature by means of a sentence feature correction module, so as to generate a corrected sentence feature, and performing feature fusion on the corrected sentence feature and the encoded sentence feature by means of a correction vector accessor, so as to obtain a target sentence feature; and finally, performing feature replacement on the original sentence feature by means of a correction decoder and by using the target sentence feature, and outputting target sentence information. In this way, the identification and correction of a fine-grained error in an original sentence are realized, thereby greatly reducing the error rate when a multi-modal task is performed.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to a Chinese patent application filed with the China Patent Office on December 27, 2022, with application number 202211680372.4 and application name “Text Error Correction Method, Device, Electronic Device and Storage Medium for Images”, the entire contents of which are incorporated into this application by reference.
The present application relates to the field of artificial intelligence technology, and in particular to a method for correcting text errors in images, a device for correcting text errors in images, an electronic device, and a computer non-volatile readable storage medium.
近年来,多模态人工智能成为AI(Artificial Intelligence,人工智能)领域中重要的研究方向之一。多模态研究,旨在综合诸如图像、视频、音频、文本、传感器信号等多种模态输入,并综合理解或生成人类可用的信息的科学,如在视觉问答(Visual Question Answering,VQA)、视觉定位(Visual Grounding)等领域均包含图像、文本等多模态关系。随着Transformer(基于自注意力机制的深度学习模型)结构的广泛应用,从而在诸如视觉问答VQA、图片描述(Image Caption)、视觉对话(Visual Dialog)等多模态任务中,基于Transformer的多模态网络结构也越来越受到人们的青睐。In recent years, multimodal artificial intelligence has become one of the important research directions in the field of AI (Artificial Intelligence). Multimodal research aims to integrate multiple modal inputs such as images, videos, audio, text, sensor signals, and comprehensively understand or generate information that can be used by humans. For example, in the fields of Visual Question Answering (VQA) and Visual Grounding, multimodal relationships such as images and texts are included. With the widespread application of the Transformer (deep learning model based on self-attention mechanism) structure, the Transformer-based multimodal network structure has become more and more popular in multimodal tasks such as Visual Question Answering (VQA), Image Caption, and Visual Dialog.
在现实世界中,人类的语言往往存在口误、比喻等常见语言现象,这些现象难以被现有的计算机语言技术掌握,从而在进行文本与图像间的匹配时,往往无法将这些带有口误或者比喻修辞手法的词语与图像进行对应匹配,也就是说,现阶段多模态理论研究无法精细地区分文本中的微小错误,例如一段文字可能错了某个字或某个词语,从而导致算法在完成多模态任务时出现错误,例如,在基于视觉问答的任务中,极有可能会遇到因带有刻意比喻的文字内容,导致算法无法理解人类实际想要描述的问题的情况,使得基于Transformer的多模态结构无法通过算法给出正确应答,从而给出错误答案。In the real world, human language often contains common language phenomena such as slips of the tongue and metaphors, which are difficult to be grasped by existing computer language technology. Therefore, when matching text and images, it is often impossible to match these words with slips of the tongue or metaphors with the images. In other words, the current multimodal theoretical research cannot finely distinguish minor errors in the text. For example, a paragraph of text may have a wrong word or phrase, which will cause the algorithm to make mistakes when completing multimodal tasks. For example, in tasks based on visual question answering, it is very likely that the algorithm will not be able to understand the problem that humans actually want to describe due to deliberate metaphors in the text content, making it impossible for the Transformer-based multimodal structure to give a correct response through the algorithm, and thus giving a wrong answer.
SUMMARY OF THE INVENTION
The embodiments of the present application provide a text error correction method and apparatus for an image, an electronic device, and a computer non-volatile readable storage medium, so as to solve, or partially solve, the problem that algorithms are prone to errors on multimodal tasks because text and image cannot be strictly matched.
An embodiment of the present application discloses a text error correction method for an image, applied to a multimodal text error correction system that includes at least a text feature correction module, an error correction vector accessor and an error correction decoder. The method includes:
in response to an input operation for an image and text, acquiring the image information and the original text information corresponding to the input operation, and performing feature extraction on each, to obtain an image feature corresponding to the image information and an original text feature corresponding to the original text information;
concatenating the image feature with the original text feature to obtain a comprehensive encoding feature, and performing feature interception according to the comprehensive encoding feature and the original text feature to obtain a text encoding feature;
performing feature correction on the text encoding feature by means of the text feature correction module to generate a text correction feature, and fusing the text correction feature with the text encoding feature by means of the error correction vector accessor to obtain a target text feature; and
performing, by means of the error correction decoder, feature replacement on the original text feature using the target text feature, and outputting the corresponding target text information.
Optionally, performing feature correction on the text encoding feature by means of the text feature correction module to generate the text correction feature includes:
performing self-attention encoding on the text encoding feature by means of the text feature correction module to obtain a corresponding initial self-attention vector, and performing character prediction processing on the initial self-attention vector to obtain a corresponding target self-attention vector, the target self-attention vector containing association features between the image feature and the original text feature;
performing effective information amount prediction on the text encoding feature to obtain a corresponding effective text information vector, the effective text information vector representing the probability that each character in the text encoding feature contains effective information;
performing bidirectional interception on the text encoding feature to obtain a front misaligned feature and a back misaligned feature respectively, and generating an adjacent feature interaction vector according to the front misaligned feature and the back misaligned feature;
performing coherence prediction processing on the adjacent feature interaction vector to obtain a corresponding adjacent text information vector, the adjacent text information vector representing the probability that adjacent characters in the text encoding feature are coherent; and
performing feature correction on the text encoding feature using the target self-attention vector, the effective text information vector and the adjacent text information vector, to generate the text correction feature.
Optionally, performing self-attention encoding on the text encoding feature by means of the text feature correction module to obtain the corresponding initial self-attention vector includes:
inputting the text encoding feature into a self-attention layer and performing self-attention encoding using the formula
[formula image not reproduced]
to obtain the corresponding initial self-attention vector; where
$W_q$, $W_k$ and $W_v$ are all learnable weights, and $f$ is the text encoding feature.
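By way of non-limiting illustration, the self-attention encoding step may be sketched as follows. The formula itself is not available in this text, so the sketch assumes the widely used scaled dot-product form, softmax((f·W_q)(f·W_k)ᵀ/√d)·(f·W_v), with a single attention head. The scaling factor, the single head and the helper names `matmul`, `transpose`, `softmax_rows` and `self_attention` are assumptions made for the example and are not taken from the application.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(f, w_q, w_k, w_v):
    """Assumed scaled dot-product self-attention over f of size [M, d]."""
    q, k, v = matmul(f, w_q), matmul(f, w_k), matmul(f, w_v)
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights


# Toy input: M = 3 characters, d = 2. Identity matrices stand in for the
# learnable weights W_q, W_k and W_v (hypothetical values).
f = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
initial_vector, attention_weights = self_attention(f, identity, identity, identity)
```

Under these assumptions the output keeps the [M, d] size of the text encoding feature, so each character position receives one row of the initial self-attention vector.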
Optionally, performing character prediction processing on the initial self-attention vector to obtain the corresponding target self-attention vector includes:
inputting the initial self-attention vector into two groups of fully connected layers, which perform current character prediction processing and preceding character prediction processing respectively, to obtain a current prediction vector and a preceding prediction vector; and
determining the target self-attention vector according to the current prediction vector and the preceding prediction vector.
Optionally, determining the target self-attention vector according to the current prediction vector and the preceding prediction vector includes:
performing prediction processing on the text encoding feature using the current prediction vector, to obtain a target current character corresponding to the text encoding feature;
performing prediction processing on the target current character using the preceding prediction vector, to obtain a target preceding character corresponding to the target current character; and
concatenating the target preceding character with the target current character, outputting the corresponding target character, and generating the target self-attention vector corresponding to the target character.
Optionally, performing prediction processing on the text encoding feature using the current prediction vector to obtain the target current character corresponding to the text encoding feature includes:
performing, according to the current prediction vector, probability matching between the text encoding feature and each preset character in a preset dictionary, to obtain a current prediction probability for each preset character, and determining the preset character with the largest current prediction probability as the target current character.
Optionally, performing prediction processing on the target current character using the preceding prediction vector to obtain the target preceding character corresponding to the target current character includes:
performing, according to the preceding prediction vector, probability matching between the target current character and each preset character in the preset dictionary, to obtain a preceding prediction probability for each preset character, and determining the preset character with the largest preceding prediction probability as the target preceding character.
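By way of non-limiting illustration, selecting the preset character with the largest prediction probability may be sketched as follows. The three-character dictionary, the hand-written scores and the use of a softmax to turn scores into probabilities are assumptions made for the example; the application states only that the character with the largest probability is chosen and that the preceding and current characters are concatenated.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_character(logits, dictionary):
    """Return the preset character with the highest predicted probability."""
    probs = softmax(logits)
    best = max(range(len(dictionary)), key=lambda i: probs[i])
    return dictionary[best], probs[best]


# Hypothetical preset dictionary and scores.
dictionary = ["c", "a", "t"]
current_logits = [0.1, 2.0, 0.3]    # stands in for the current prediction vector
preceding_logits = [1.5, 0.2, 0.1]  # stands in for the preceding prediction vector

current_char, _ = pick_character(current_logits, dictionary)
preceding_char, _ = pick_character(preceding_logits, dictionary)

# Concatenate the target preceding character with the target current character.
target = preceding_char + current_char
```

Here the current prediction selects "a", the preceding prediction selects "c", and the concatenated target character is "ca".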
Optionally, performing effective information amount prediction on the text encoding feature to obtain the corresponding effective text information vector includes:
performing effective information amount prediction on the text encoding feature using the formula
[formula image not reproduced]
to obtain the corresponding effective text information vector; where
$W_{iq}$, $W_{ik}$ and $W_{iv}$ are all transfer matrix weights, $W_{iw}$ is the information amount prediction weight, $b_{ib}$ is a learnable model parameter, and $f$ is the text encoding feature.
Optionally, the size of the text encoding feature is [M, d], the sizes of the front misaligned feature and the back misaligned feature are both [M-1, d], and generating the adjacent feature interaction vector according to the front misaligned feature and the back misaligned feature includes:
performing vector concatenation on the front misaligned feature and the back misaligned feature, to generate an adjacent feature interaction vector of size [M-1, d×2] corresponding to the text encoding feature.
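By way of non-limiting illustration, the bidirectional interception and concatenation may be sketched as follows. The sizes [M, d], [M-1, d] and [M-1, d×2] come from the text above; which slice is called "front" and which "back" is an assumption made for the example, as is the function name `adjacent_interaction`.

```python
def adjacent_interaction(f):
    """Pair every position with its right-hand neighbour and concatenate them.

    f has size [M, d]; the result has size [M-1, 2*d].
    """
    front = f[:-1]  # rows 0 .. M-2, size [M-1, d] (assumed "front misaligned")
    back = f[1:]    # rows 1 .. M-1, size [M-1, d] (assumed "back misaligned")
    return [a + b for a, b in zip(front, back)]


# Toy text encoding feature with M = 4 and d = 2.
f = [[1, 2], [3, 4], [5, 6], [7, 8]]
f_nb = adjacent_interaction(f)
# f_nb == [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]]
```

Each row of `f_nb` holds the features of two neighbouring characters side by side, which is what a later coherence prediction over adjacent characters needs as input.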
Optionally, performing coherence prediction processing on the adjacent feature interaction vector to obtain the corresponding adjacent text information vector includes:
performing coherence prediction processing on the adjacent feature interaction vector using the formula
[formula image not reproduced]
to obtain the corresponding adjacent text information vector; where
$W_{nw}$, $W_{nq}$, $W_{nv}$ and $W_{nk}$ are all transfer matrix weight parameters, $b_{in}$ is a bias vector parameter, and $f_{nb}$ is the adjacent feature interaction vector.
Optionally, the error correction vector accessor includes at least a feature storage space, and after the feature interception is performed according to the comprehensive encoding feature and the original text feature to obtain the text encoding feature, the method further includes:
splitting the text encoding feature into a number of sub-text features, and storing the sub-text features in the feature storage space in sequence.
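By way of non-limiting illustration, the feature storage space may be sketched as a simple ordered store. The class `FeatureStore`, its method names, and the choice of one sub-text feature per row of the [M, d] text encoding feature are assumptions made for the example; the application does not specify how the feature is split.

```python
class FeatureStore:
    """Hypothetical stand-in for the feature storage space."""

    def __init__(self):
        self._slots = []

    def store(self, text_encoding_feature):
        # Split into sub-text features (assumed: one per row) and keep the order.
        self._slots = [list(row) for row in text_encoding_feature]

    def get(self, k):
        # Read the sub-text feature with feature number k.
        return list(self._slots[k])

    def put(self, k, feature):
        # Overwrite the sub-text feature with feature number k.
        self._slots[k] = list(feature)

    def fuse(self):
        # Reassemble the stored sub-text features into one [M, d] feature.
        return [list(row) for row in self._slots]


store = FeatureStore()
store.store([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
store.put(1, [9.0, 9.0])  # replace only the second sub-text feature
fused = store.fuse()
```

Keeping the sub-text features addressable by feature number lets a later step overwrite individual positions without touching the rest.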
Optionally, the error correction vector accessor includes a repair judgment gate and a feature updater, and fusing the text correction feature with the text encoding feature by means of the error correction vector accessor to obtain the target text feature includes:
performing repair judgment on each sub-text feature by means of the repair judgment gate, to determine at least one replacement sub-text feature on which feature replacement needs to be performed; and
performing, by means of the feature updater, feature replacement on the at least one replacement sub-text feature using the text correction feature, to obtain the respective target sub-text features, and performing feature fusion on the at least one target sub-text feature to obtain the corresponding target text feature.
Optionally, performing repair judgment on each sub-text feature by means of the repair judgment gate, to determine the at least one replacement sub-text feature on which feature replacement needs to be performed, includes:
performing repair judgment on each sub-text feature using the formula
[formula image not reproduced]
and, when $s(x_k)$ is 1, determining the sub-text feature with feature number $k$ as a replacement sub-text feature on which feature replacement needs to be performed; where
$k$ is the feature number of a sub-text feature, $p_{ifo_k}$ is the effective text information vector corresponding to the sub-text feature with feature number $k$, $p_{nbo_k}$ is the adjacent text information vector corresponding to the sub-text feature with feature number $k$, $thresh_{ifo}$ is a configurable information amount probability threshold, $thresh_{nbo}$ is a configurable fluency probability threshold, and $s(x_k)$ indicates whether the sub-text feature with feature number $k$ needs feature replacement.
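By way of non-limiting illustration, a repair judgment gate may be sketched as follows. The gate formula is not available in this text, so the decision rule is an assumption: a sub-text feature is flagged when either its effective-information probability or its coherence probability falls below the corresponding threshold. The direction of the comparisons, the use of "or", the default thresholds of 0.5, and the treatment of both probabilities as one scalar per feature number are all assumptions made for the example.

```python
def repair_gate(p_ifo, p_nbo, thresh_ifo=0.5, thresh_nbo=0.5):
    """Return s(x_k) for every feature number k: 1 means "replace", 0 means "keep"."""
    flags = []
    for p_i, p_n in zip(p_ifo, p_nbo):
        # Assumed rule: low information content OR low coherence triggers repair.
        needs_repair = p_i < thresh_ifo or p_n < thresh_nbo
        flags.append(1 if needs_repair else 0)
    return flags


# Hypothetical per-character probabilities.
p_ifo = [0.9, 0.2, 0.8]  # probability that each character carries effective information
p_nbo = [0.7, 0.6, 0.3]  # probability that each character is coherent with its neighbour

s = repair_gate(p_ifo, p_nbo)
to_replace = [k for k, flag in enumerate(s) if flag == 1]
```

With these values the first feature is kept and the second and third are marked for replacement. Both thresholds are exposed as parameters because the text describes them as configurable.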
Optionally, performing, by means of the feature updater, feature replacement on the at least one replacement sub-text feature using the text correction feature, to obtain the respective target sub-text features, includes:
calculating, according to the text correction feature, the text feature value corresponding to the replacement sub-text feature using the formula
$$f_{ko} = f_k \times (1-\mu) + \left(p_{ifo_k} \times \theta + p_{nbo_k} \times (1-\theta)\right) \times \mu \times o_{emlm}$$
and performing feature replacement on the replacement sub-text feature according to the text feature value, to obtain the corresponding target sub-text feature; where
$f_k$ is the sub-text feature with feature number $k$, $o_{emlm}$ is the target self-attention vector, and $\theta$ and $\mu$ are both preset parameters with values between 0 and 1.
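By way of non-limiting illustration, the update formula f_ko = f_k×(1-μ) + (p_ifok×θ + p_nbok×(1-θ))×μ×o_emlm may be evaluated as in the sketch below. Treating the two probabilities as scalars for feature number k, treating o_emlm as a vector of the same length as f_k, and the default values θ = μ = 0.5 are assumptions made for the example.

```python
def update_feature(f_k, o_emlm, p_ifo_k, p_nbo_k, theta=0.5, mu=0.5):
    """Blend the stored sub-text feature f_k with the target self-attention vector.

    Implements f_ko = f_k*(1-mu) + (p_ifo_k*theta + p_nbo_k*(1-theta))*mu*o_emlm.
    """
    weight = (p_ifo_k * theta + p_nbo_k * (1.0 - theta)) * mu
    return [a * (1.0 - mu) + weight * b for a, b in zip(f_k, o_emlm)]


f_k = [1.0, 2.0]      # stored sub-text feature (toy values)
o_emlm = [3.0, 4.0]   # target self-attention vector (toy values)
f_ko = update_feature(f_k, o_emlm, p_ifo_k=0.2, p_nbo_k=0.6)
# weight = (0.2*0.5 + 0.6*0.5)*0.5 = 0.2, so f_ko is approximately [1.1, 1.8]
```

In this reading μ controls how much of the original feature is kept (μ = 0 leaves f_k unchanged) and θ balances the two probabilities against each other.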
Optionally, performing feature replacement on the replacement sub-text feature according to the text feature value, to obtain the corresponding target sub-text feature, includes:
replacing the original text feature value of the replacement sub-text feature with the text feature value by overwriting the original value, to obtain the corresponding target sub-text feature.
Optionally, concatenating the image feature with the original text feature to obtain the comprehensive encoding feature includes:
concatenating the image feature with the original text feature and performing cross-modal encoding, to obtain the comprehensive encoding feature.
Optionally, performing feature interception according to the comprehensive encoding feature and the original text feature to obtain the text encoding feature includes:
intercepting, from the comprehensive encoding feature, the features at the positions corresponding to the original text feature, to obtain the text encoding feature corresponding to the original text feature.
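By way of non-limiting illustration, the concatenation and interception steps may be sketched as follows. Placing the image rows before the text rows, and the pass-through `cross_modal_encode` placeholder that stands in for a real cross-modal encoder, are assumptions made for the example; all function names are hypothetical.

```python
def concatenate(image_feature, text_feature):
    """Stack image rows [N, d] and text rows [M, d] into one [N+M, d] feature."""
    return [list(row) for row in image_feature] + [list(row) for row in text_feature]


def cross_modal_encode(features):
    """Placeholder encoder: returns a copy of its input unchanged."""
    return [list(row) for row in features]


def intercept_text(comprehensive, num_image_rows, num_text_rows):
    """Slice out the rows at the positions occupied by the original text feature."""
    start = num_image_rows
    return comprehensive[start:start + num_text_rows]


image_feature = [[0.1, 0.2], [0.3, 0.4]]             # N = 2, d = 2
text_feature = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]  # M = 3, d = 2

comprehensive = cross_modal_encode(concatenate(image_feature, text_feature))
text_encoding = intercept_text(comprehensive, len(image_feature), len(text_feature))
```

Because the placeholder encoder changes nothing, `text_encoding` equals `text_feature` here; with a real encoder the sliced rows would keep the [M, d] size but carry information from the image rows as well.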
An embodiment of the present application further discloses a text error correction apparatus for an image, applied to a multimodal text error correction system that includes at least a text feature correction module, an error correction vector accessor and an error correction decoder. The apparatus includes:
a feature extraction module, configured to, in response to an input operation for an image and text, acquire the image information and the original text information corresponding to the input operation, and perform feature extraction on each, to obtain an image feature corresponding to the image information and an original text feature corresponding to the original text information;
a text encoding feature generation module, configured to concatenate the image feature with the original text feature to obtain a comprehensive encoding feature, and to perform feature interception according to the comprehensive encoding feature and the original text feature to obtain a text encoding feature;
a target text feature generation module, configured to perform feature correction on the text encoding feature by means of the text feature correction module to generate a text correction feature, and to fuse the text correction feature with the text encoding feature by means of the error correction vector accessor to obtain a target text feature; and
a text feature replacement module, configured to perform, by means of the error correction decoder, feature replacement on the original text feature using the target text feature, and to output the corresponding target text information.
Optionally, the text encoding feature generation module includes:
a target self-attention vector generation module, configured to perform self-attention encoding on the text encoding feature by means of the text feature correction module to obtain a corresponding initial self-attention vector, and to perform character prediction processing on the initial self-attention vector to obtain a corresponding target self-attention vector, the target self-attention vector containing association features between the image feature and the original text feature;
an effective text information vector generation module, configured to perform effective information amount prediction on the text encoding feature to obtain a corresponding effective text information vector, the effective text information vector representing the probability that each character in the text encoding feature contains effective information;
an adjacent feature interaction vector generation module, configured to perform bidirectional interception on the text encoding feature to obtain a front misaligned feature and a back misaligned feature respectively, and to generate an adjacent feature interaction vector according to the front misaligned feature and the back misaligned feature;
an adjacent text information vector generation module, configured to perform coherence prediction processing on the adjacent feature interaction vector to obtain a corresponding adjacent text information vector, the adjacent text information vector representing the probability that adjacent characters in the text encoding feature are coherent; and
a text correction feature generation module, configured to perform feature correction on the text encoding feature using the target self-attention vector, the effective text information vector and the adjacent text information vector, to generate the text correction feature.
Optionally, the target self-attention vector generation module includes:
an initial self-attention vector generation module, configured to input the text encoding feature into a self-attention layer and perform self-attention encoding using the formula
[formula image not reproduced]
to obtain the corresponding initial self-attention vector; where
$W_q$, $W_k$ and $W_v$ are all learnable weights, and $f$ is the text encoding feature.
Optionally, the target self-attention vector generation module includes:
a character prediction processing module, configured to input the initial self-attention vector into two groups of fully connected layers, which perform current character prediction processing and preceding character prediction processing respectively, to obtain a current prediction vector and a preceding prediction vector; and
a target self-attention vector determination submodule, configured to determine the target self-attention vector according to the current prediction vector and the preceding prediction vector.
Optionally, the target self-attention vector determination submodule includes:
a target current character generation module, configured to perform prediction processing on the text encoding feature using the current prediction vector, to obtain a target current character corresponding to the text encoding feature;
a target preceding character generation module, configured to perform prediction processing on the target current character using the preceding prediction vector, to obtain a target preceding character corresponding to the target current character; and
a target character output module, configured to concatenate the target preceding character with the target current character, output the corresponding target character, and generate the target self-attention vector corresponding to the target character.
Optionally, the target current character generation module is specifically configured to:
perform, according to the current prediction vector, probability matching between the text encoding feature and each preset character in a preset dictionary, to obtain a current prediction probability for each preset character, and determine the preset character with the largest current prediction probability as the target current character.
Optionally, the target preceding character generation module is specifically configured to:
perform, according to the preceding prediction vector, probability matching between the target current character and each preset character in the preset dictionary, to obtain a preceding prediction probability for each preset character, and determine the preset character with the largest preceding prediction probability as the target preceding character.
Optionally, the effective text information vector generation module is specifically configured to:
perform effective information amount prediction on the text encoding feature using the formula
[formula image not reproduced]
to obtain the corresponding effective text information vector; where
$W_{iq}$, $W_{ik}$ and $W_{iv}$ are all transfer matrix weights, $W_{iw}$ is the information amount prediction weight, $b_{ib}$ is a learnable model parameter, and $f$ is the text encoding feature.
Optionally, the size of the text encoding feature is [M, d], the sizes of the front misaligned feature and the back misaligned feature are both [M-1, d], and the adjacent feature interaction vector generation module is specifically configured to:
perform vector concatenation on the front misaligned feature and the back misaligned feature, to generate an adjacent feature interaction vector of size [M-1, d×2] corresponding to the text encoding feature.
Optionally, the adjacent text information vector generation module is specifically configured to:
perform coherence prediction processing on the adjacent feature interaction vector using the formula
[formula image not reproduced]
to obtain the corresponding adjacent text information vector; where
$W_{nw}$, $W_{nq}$, $W_{nv}$ and $W_{nk}$ are all transfer matrix weight parameters, $b_{in}$ is a bias vector parameter, and $f_{nb}$ is the adjacent feature interaction vector.
Optionally, the error correction vector accessor includes at least a feature storage space, and the apparatus further includes:
a sub-text feature splitting module, configured to split the text encoding feature into a number of sub-text features, and to store the sub-text features in the feature storage space in sequence.
Optionally, the error correction vector accessor includes a repair judgment gate and a feature updater, and the target text feature generation module includes:
a replacement sub-text feature determination module, configured to perform repair judgment on each sub-text feature by means of the repair judgment gate, to determine at least one replacement sub-text feature on which feature replacement needs to be performed; and
a target sub-text feature determination module, configured to perform, by means of the feature updater, feature replacement on the at least one replacement sub-text feature using the text correction feature, to obtain the respective target sub-text features, and to perform feature fusion on the at least one target sub-text feature to obtain the corresponding target text feature.
Optionally, the replacement sub-text feature determination module is specifically configured to:
perform repair judgment on each sub-text feature using the formula
[formula image not reproduced]
and, when $s(x_k)$ is 1, determine the sub-text feature with feature number $k$ as a replacement sub-text feature on which feature replacement needs to be performed; where
$k$ is the feature number of a sub-text feature, $p_{ifo_k}$ is the effective text information vector corresponding to the sub-text feature with feature number $k$, $p_{nbo_k}$ is the adjacent text information vector corresponding to the sub-text feature with feature number $k$, $thresh_{ifo}$ is a configurable information amount probability threshold, $thresh_{nbo}$ is a configurable fluency probability threshold, and $s(x_k)$ indicates whether the sub-text feature with feature number $k$ needs feature replacement.
Optionally, the target sub-text feature determination module includes:
a text feature value calculation module, configured to calculate, according to the text correction feature, the text feature value corresponding to the replacement sub-text feature using the formula
$$f_{ko} = f_k \times (1-\mu) + \left(p_{ifo_k} \times \theta + p_{nbo_k} \times (1-\theta)\right) \times \mu \times o_{emlm}$$
and to perform feature replacement on the replacement sub-text feature according to the text feature value, to obtain the corresponding target sub-text feature; where
$f_k$ is the sub-text feature with feature number $k$, $o_{emlm}$ is the target self-attention vector, and $\theta$ and $\mu$ are both preset parameters with values between 0 and 1.
Optionally, the text feature value calculation module is specifically configured to:
replace the original text feature value of the replacement sub-text feature with the text feature value by overwriting the original value, to obtain the corresponding target sub-text feature.
Optionally, the text encoding feature generation module includes:
a cross-modal encoding processing module, configured to concatenate the image feature with the original text feature and perform cross-modal encoding, to obtain the comprehensive encoding feature.
Optionally, the text encoding feature generation module includes:
a corresponding feature position interception module, configured to intercept, from the comprehensive encoding feature, the features at the positions corresponding to the original text feature, to obtain the text encoding feature corresponding to the original text feature.
An embodiment of the present application further discloses an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program; and
the processor is configured to implement the method of the embodiments of the present application when executing the program stored in the memory.
An embodiment of the present application further discloses a computer non-volatile readable storage medium storing instructions which, when executed by one or more processors, cause the processors to perform the method of the embodiments of the present application.
本申请实施例包括以下优点:The embodiments of the present application include the following advantages:
在本申请实施例中,提供了一种应用于多模态文本纠错系统的基于图像的文本纠错方法,首先将输入的图像与原始文本分别进行特征提取,获得图像特征以及原始文本特征,接着将图像特征以及原始文本特征通过特征拼接方式生成综合编码特征,并通过截取综合编码特征中对应于原始文本特征位置的特征,获得文本编码特征,接着通过文本特征修正模块对文本编码特征进行特征纠正,生成文本纠正特征,并通过纠错向量存取器将文本纠正特征与文本编码特征进行特征融合,获得目标文本特征,最后通过纠错解码器采用目标文本特征对原始文本特征进行特征替换,并输出对应的目标文本信息,从而根据图像实现了对原始文本中细粒度错误进行识别并纠正,大大降低了在进行多模态任务时的出错率。In an embodiment of the present application, an image-based text error correction method for a multimodal text error correction system is provided. First, feature extraction is performed on an input image and an original text respectively to obtain image features and original text features. Then, the image features and the original text features are generated into a comprehensive coding feature by feature splicing, and the text coding feature is obtained by intercepting the feature corresponding to the original text feature position in the comprehensive coding feature. Then, the text coding feature is feature corrected by a text feature correction module to generate a text correction feature, and the text correction feature is feature fused with the text coding feature by an error correction vector accessor to obtain a target text feature. Finally, the target text feature is used to replace the original text feature by an error correction decoder, and the corresponding target text information is output, thereby realizing the recognition and correction of fine-grained errors in the original text according to the image, greatly reducing the error rate when performing multimodal tasks.
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of how a current non-strictly matched multimodal sample interferes with existing methods;
FIG. 2 is a schematic diagram of a multimodal sample for image-based text error correction provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a text error correction system based on a visual elastic mask provided in an embodiment of the present application;
FIG. 4 is a flowchart of the steps of a text error correction method for an image provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a visual elastic mask provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of adjacent word relationship prediction provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of the structural framework of an error correction vector accessor provided in an embodiment of the present application;
FIG. 8 is a structural block diagram of a text error correction apparatus for an image provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a non-volatile computer-readable medium provided in an embodiment of the present application;
FIG. 10 is a block diagram of an electronic device provided in an embodiment of the present application.
In order to make the above objects, features and advantages of the present application more apparent and easier to understand, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
In order to help those skilled in the art better understand the technical solutions in the embodiments of the present application, some of the technical features involved in these embodiments are explained below:
Sentence Correction (SC): detecting errors in text and correcting them accordingly.
Multi Modal (MM): collaborative reasoning over multiple heterogeneous modalities of data. In the field of artificial intelligence, it usually refers to the combined use of perceptual information such as images, text, video and audio to help artificial intelligence understand the external world more accurately.
Masked Language Modeling (MLM): masks are commonly used in image processing scenarios, and masking techniques can be used to predict the text corresponding to an image.
Elastic Masked Language Modeling (EMLM): a more precise text prediction method than masked text prediction. Unlike existing masking methods, the elastic masked text prediction used in the present application can replace vocabulary in text of variable length.
As an example, in the real world human language often contains common phenomena such as slips of the tongue and metaphors, which existing computer language technology has difficulty handling. As a result, when matching text to images, words containing slips of the tongue or metaphorical rhetoric often cannot be matched to the image. In other words, current multimodal research cannot finely distinguish minor errors in text, such as a single wrong character or word in a passage, and this causes algorithms to make mistakes when performing multimodal tasks. For example, in visual question answering tasks, text containing deliberate metaphors is very likely to prevent the algorithm from understanding the question the human actually wants to ask, so that a Transformer-based multimodal structure cannot produce the correct response and gives a wrong answer instead.
For better explanation, referring to FIG. 1, a schematic diagram of how a current non-strictly matched multimodal sample interferes with existing methods is shown. The right frame of FIG. 1 contains a non-strictly matched multimodal sample presented in question-and-answer form: the questioner asks "What is the red pickup truck with a tall bucket attached to it doing?", and the answerer replies "Parked". From the image content, however, the actual situation is that the body of the pickup truck with a tall bucket attached is white, and that this white pickup truck is at a traffic light with a vehicle in front of it, so it is evident that the white pickup truck is actually driving rather than parked. The matching relationship between the image and the text should therefore be the one shown in the left frame of FIG. 1, which contains a strictly matched multimodal sample, likewise presented in question-and-answer form: the questioner asks "What is the white pickup truck with a tall bucket attached to it doing?", and the answerer replies "Driving". It follows that, when matching images with text, failing to match them strictly can easily lead to a wrong answer being output because the information is misjudged.
Further, referring to FIG. 2, a schematic diagram of a multimodal sample for image-based text error correction provided in an embodiment of the present application is shown. The image is the same as in FIG. 1, and the input text is "a red pickup truck with a tall bucket attached is driving on the road". The body color of the pickup truck is clearly wrong, so the erroneous information in the input text needs to be corrected according to the information shown in the image, for example by correcting "red" to "white". When the input text is corrected using the image-based text error correction method provided in the present application, the corresponding output text should be "a white pickup truck with a tall bucket attached is driving on the road", which strictly matches the image and gives the correct text content.
Therefore, one of the core points of the embodiments of the present application is to provide an image-based text error correction method applied to a multimodal text error correction system. First, feature extraction is performed separately on the input image and the original text to obtain image features and original text features. The image features and the original text features are then concatenated to generate a comprehensive encoding feature, and the features at the positions corresponding to the original text features are intercepted from the comprehensive encoding feature to obtain the text encoding feature. Next, the text feature correction module performs feature correction on the text encoding feature to generate a text correction feature, and the error correction vector accessor fuses the text correction feature with the text encoding feature to obtain the target text feature. Finally, the error correction decoder replaces the original text feature with the target text feature and outputs the corresponding target text information. Fine-grained errors in the original text are thereby identified and corrected according to the image, with the aim of greatly reducing the error rate when performing multimodal tasks.
Referring to FIG. 3, a schematic diagram of a text error correction system based on a visual elastic mask provided in an embodiment of the present application is shown. Using this text error correction system together with the text error correction method for an image provided in the present application, input text that is wrong or may be wrong can be further examined and corrected to obtain correct output text that strictly matches the image.
As shown in the figure, the text error correction system may include at least an image/text encoding module 301, a feature interception module 302, a text feature correction module 303, an error correction vector accessor 304 and an error correction decoder 305.
First, the image/text encoding module 301 can separately encode the image and the text that need correction. For example, after receiving the image and the corresponding input text "a red pickup truck with a tall bucket attached is driving on the road", the module can encode each of them to obtain the image encoding corresponding to the image and the text encoding corresponding to the input text.
Next, the feature interception module 302 can first merge the features of the image encoding and the text encoding and then encode the merged features to obtain the corresponding comprehensive encoding feature, and can intercept the text feature segment from the comprehensive encoding feature to obtain the text encoding feature corresponding to the text encoding.
The text encoding feature can then be input into the text feature correction module 303 for feature correction to obtain the corresponding text correction feature. For correcting text features, the text feature correction module 303 contains three submodules: a visual elastic mask submodule, an information content prediction network submodule and an adjacent word relationship prediction submodule.
Specifically, the visual elastic mask submodule can be used to correct sentences of variable length. For example, an erroneous word in the input text may occupy 2 characters while the corrected word actually occupies 3 characters. Variable-length correction of the input text can thus be achieved, making text error correction more accurate and more reliable.
Furthermore, besides lengthening the original sentence, error correction may also shorten it. In other words, some characters in the original sentence may need to be deleted, for example going from 3 characters to 2. To make the model capability of the text error correction system more complete, the present application designs an information content prediction network submodule in the text feature correction module 303, so that the feature at each position can predict whether the character at that position carries effective information.
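The deletion behaviour described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the information content prediction network emits one raw score per character position, that a sigmoid turns the score into a probability, and that positions below a 0.5 threshold are dropped. The function names, the scores and the threshold are hypothetical and are not taken from the application.

```python
import math


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def drop_uninformative(chars, info_scores, threshold=0.5):
    """Keep only the characters whose predicted information probability
    reaches the threshold (hypothetical decision rule)."""
    return "".join(
        ch for ch, score in zip(chars, info_scores)
        if sigmoid(score) >= threshold
    )


# Four positions; the second one is predicted to carry no effective information.
print(drop_uninformative("abcd", [2.0, -3.0, 1.0, 4.0]))
```

Under these assumptions the sentence shrinks by one character, which is the shortening case that fixed-length masked prediction cannot express.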
If the input text contains errors, adjacent characters may be incoherent or not fluent. The present application therefore also designs an adjacent word relationship prediction submodule in the text feature correction module 303 to predict the fluency of the text from the features corresponding to the input text.
After the visual elastic mask submodule, the information content prediction network submodule and the adjacent word relationship prediction submodule in the text feature correction module 303 have corrected the text encoding feature and produced the corresponding text correction feature, the text correction feature can be input into the error correction vector accessor 304 together with the text encoding feature. The error correction vector accessor 304 can then fuse the text correction feature with the text encoding feature to obtain the target text feature.
Finally, the target text feature can be input into the error correction decoder 305, which replaces the original text feature with the target text feature and outputs the corresponding target text information. For example, for the input text "a red pickup truck with a tall bucket attached is driving on the road", correction by the text error correction system of the present application yields the correct output text "a white pickup truck with a tall bucket attached is driving on the road".
It should be pointed out that, to aid the explanation, this embodiment uses the image-based text error correction multimodal sample of FIG. 2 as an example, and that the description here of how the text error correction system is combined with the text error correction method for an image is deliberately brief and serves only to outline the implementation principle. More specific implementation steps are given in the detailed description of FIG. 4 below. It can be understood that the present application does not limit this.
It should be noted that the embodiments of the present application include but are not limited to the above examples. It can be understood that those skilled in the art, guided by the ideas of the embodiments of the present application, may also configure the system according to actual needs, and the present application does not limit this.
In an embodiment of the present application, a multimodal text error correction system is provided, which may include at least an image/text encoding module, a feature interception module, a text feature correction module, an error correction vector accessor and an error correction decoder. The text error correction system provided by the present application uses the currently popular Transformer network structure as its backbone network, and gives the model the ability to correct text errors through submodules such as the visual elastic mask, the information content prediction network and adjacent word relationship prediction. In the image-based text error correction process, feature extraction can thus be performed separately on the input image and the original text to obtain image features and original text features. The image features and the original text features are then concatenated to generate a comprehensive encoding feature, and the features at the positions corresponding to the original text features are intercepted from the comprehensive encoding feature to obtain the text encoding feature. Next, the text feature correction module performs feature correction on the text encoding feature to generate a text correction feature, and the error correction vector accessor fuses the text correction feature with the text encoding feature to obtain the target text feature. Finally, the error correction decoder replaces the original text feature with the target text feature and outputs the corresponding target text information. The above text error correction system, combined with the image-based text error correction method, thereby identifies and corrects fine-grained errors in the original text, which greatly reduces the error rate when performing multimodal tasks.
Referring to FIG. 4, a flowchart of the steps of a text error correction method for an image provided in an embodiment of the present application is shown. The method can be applied to a multimodal text error correction system that includes at least a text feature correction module, an error correction vector accessor and an error correction decoder. The method may specifically include the following steps:
Step 401: in response to an input operation for an image and text, obtain the image information and the original text information corresponding to the input operation, and perform feature extraction on the image information and the original text information separately to obtain image features corresponding to the image information and original text features corresponding to the original text information;
The Transformer is an N-in, N-out structure. That is, each Transformer unit is equivalent to one RNN (Recurrent Neural Network) layer: it receives all the words of a whole sentence as input and produces one output for each word in the sentence. Unlike an RNN, however, a Transformer can process all the words in a sentence simultaneously, and the operation distance between any two words is 1.
Building on the Transformer multimodal network structure, the present application can realize image-based text error correction. Specifically, in response to an input operation for an image and text, the image information and the original text information corresponding to the input operation can be obtained, and feature extraction can be performed on each of them to obtain image features corresponding to the image information and original text features corresponding to the original text information. Features are thus extracted from the image and from the text separately, so that finer-grained text error correction can be performed on the extracted features in the subsequent process.
For example, an image of input size N and a text of size M are each encoded to obtain the corresponding image encoding and text encoding. Existing encoder models can then be used for feature extraction, yielding image features of size [N, d] for the image and text features of size [M, d] for the text, where "d" is the feature dimension, that is, how many numbers each feature consists of. Feature extraction uses current mainstream models, such as a convolutional neural network (CNN) and a BERT (Bidirectional Encoder Representations from Transformers) encoder, and is therefore not described in detail. Those skilled in the art may use other similar encoders or image/text models for feature extraction, and the present application does not limit this.
Step 402: concatenate the image features and the original text features to obtain a comprehensive encoding feature, and perform feature interception based on the comprehensive encoding feature and the original text features to obtain a text encoding feature;
In a specific implementation, once the image features and the original text features have been obtained, they can be concatenated to obtain the comprehensive encoding feature. More specifically, the image features and the original text features can be concatenated and then subjected to cross-modal encoding to obtain the comprehensive encoding feature.
For example, image features of size [N, d] and text features of size [M, d] can be concatenated to obtain a comprehensive feature of size [N+M, d], and this comprehensive feature can be input into the Transformer structure for cross-modal encoding to obtain a comprehensive encoding feature of size [N+M, d]. Cross-modal encoding essentially exploits the semantics shared between multimodal code streams to encode them jointly according to their correlation, and is one of the key technologies for cross-modal communication. Concatenating the image features with the original text features therefore yields a comprehensive encoding feature of the image and the text that enables cross-modal interaction.
Once the comprehensive encoding feature has been obtained, feature interception can be performed based on the comprehensive encoding feature and the original text features to obtain the text encoding feature. Specifically, the features in the comprehensive encoding feature at the positions corresponding to the original text features are intercepted to obtain the text encoding feature corresponding to the original text features.
That is, the positions in the comprehensive encoding feature of size [N+M, d] that correspond to the text features of size [M, d] can be intercepted to obtain a text encoding feature of size [M, d], and this text encoding feature can be stored in the error correction vector accessor. It should be noted that although the text features and the text encoding feature both have size [M, d], their contents are entirely different. The text features represent only the features of the text, whereas the text encoding feature, having been intercepted from the comprehensive encoding feature, is also correlated with the features of the image. The correlation with the image features therefore cannot be ignored in the subsequent feature correction of the text.
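The concatenation and interception of step 402 can be sketched with plain nested lists. This is only an illustration under stated assumptions: the image rows are placed before the text rows, and `cross_modal_encode` is a hypothetical stand-in that mixes every row with the mean of all rows so that text rows pick up image information. The application itself uses a Transformer for this stage.

```python
def concatenate(image_features, text_features):
    """Stack [N, d] image rows and [M, d] text rows into [N+M, d]."""
    assert len(image_features[0]) == len(text_features[0]), "feature dims differ"
    return image_features + text_features


def cross_modal_encode(features):
    """Hypothetical stand-in for the Transformer cross-modal encoder:
    every row is averaged with the mean over all rows."""
    dim = len(features[0])
    mean = [sum(row[j] for row in features) / len(features) for j in range(dim)]
    return [[0.5 * x + 0.5 * m for x, m in zip(row, mean)] for row in features]


def intercept_text_part(comprehensive, num_image_rows):
    """Keep only the rows at the text positions (the last M rows here)."""
    return comprehensive[num_image_rows:]


image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3, d = 2
text_features = [[2.0, 2.0], [3.0, 3.0]]               # M = 2, d = 2

comprehensive = cross_modal_encode(concatenate(image_features, text_features))
text_encoding = intercept_text_part(comprehensive, len(image_features))

print(len(comprehensive), len(text_encoding), len(text_encoding[0]))
```

In this toy run the intercepted rows have the same [M, d] shape as the original text features but different values, because they were mixed with the image rows before slicing.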
Step 403: perform feature correction on the text encoding feature through the text feature correction module to generate a text correction feature, and fuse the text correction feature with the text encoding feature through the error correction vector accessor to obtain the target text feature;
In a specific implementation, performing feature correction on the text encoding feature through the text feature correction module to generate the text correction feature may include the following sub-steps:
Sub-step 4031: perform self-attention encoding on the text encoding feature through the text feature correction module to obtain the corresponding initial self-attention vector, and perform character prediction on the initial self-attention vector to obtain the corresponding target self-attention vector, where the target self-attention vector contains the features that associate the image features with the original text features.
Masked text prediction is an indispensable task in natural language processing, and it can be used to correct occluded or erroneous characters. However, this method can only correct a sentence character for character and cannot change the length of the original sentence: if the original sentence has 10 characters, the corrected sentence that is output also has 10 characters.
In practice, however, there is no guarantee that the corrected content has the same length as the original. Take the sentence "操场上一个男孩在打棒球" ("A boy is playing baseball on the playground"). If "棒球" (baseball, 2 characters) should actually be "篮球" (basketball, 2 characters), masked text prediction can perform the correction. If the correct content is instead "曲棍球" (hockey, 3 characters), so that the correction changes the length of the original sentence, masked text prediction fails. To overcome this problem, the present application designs an elastic masked text prediction method for correcting sentences of variable length.
For better explanation, referring to FIG. 5, a schematic diagram of a visual elastic mask provided in an embodiment of the present application is shown. In the variable-length character mask prediction of panel (a), the input text corresponding to the image is "一个男孩在打篮球他很开心" ("A boy is playing basketball and he is very happy"), whereas the text that actually matches the image should be "一个男孩在打曲棍球他很开心" ("A boy is playing hockey and he is very happy"). The 2-character "篮球" (basketball) and the 3-character "曲棍球" (hockey) therefore differ in length. This case is handled by the current character prediction of panel (b) and the preceding character prediction of panel (c). In mapping "篮" (s_k) to "曲棍" (t_{k-1}, t_k), "棍" (t_k) is the current character of "篮" (s_k) and "曲" (t_{k-1}) is the preceding character of "篮" (s_k). Suppose the feature finally corresponding to "篮" (s_k) is a 768-dimensional text encoding feature f. Two sets of fully connected layers can each forward-propagate f, producing two new vectors. Assuming the preset dictionary contains 3000 characters, both vectors have size 3000×1. The first vector is used to predict the current character and the second to predict the preceding character: the position of the maximum among the 3000 values is found, and the character at that position in the preset dictionary is output. This introduces a mechanism that can predict multiple characters at once, which provides the so-called elasticity, realizes text prediction based on the visual elastic mask, and in turn enables error correction of variable-length sentences.
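The two-head prediction described above can be sketched in a few lines. The sketch uses a toy 5-entry dictionary and 2-dimensional features instead of the 3000-entry dictionary and 768-dimensional features in the text, and all weights are hand-picked for illustration. The `<blank>` entry, which lets the preceding-character head signal "no extra character", is an assumption of this sketch and is not described in the application.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def linear(weights, bias, feature):
    """Fully connected layer: one score per dictionary entry."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]


def predict_elastic(feature, current_head, preceding_head, dictionary):
    """Return (preceding character, current character) for one position."""
    current_scores = linear(*current_head, feature)
    preceding_scores = linear(*preceding_head, feature)
    return (dictionary[argmax(preceding_scores)],
            dictionary[argmax(current_scores)])


def correct(features, current_head, preceding_head, dictionary, blank="<blank>"):
    """Decode every position and join the non-blank characters."""
    out = []
    for feature in features:
        pair = predict_elastic(feature, current_head, preceding_head, dictionary)
        out.extend(ch for ch in pair if ch != blank)
    return "".join(out)


# Hypothetical toy values, chosen so that "篮球" is rewritten as "曲棍球".
DICTIONARY = ["<blank>", "篮", "曲", "棍", "球"]
ZERO_BIAS = [0.0] * len(DICTIONARY)
CURRENT_HEAD = ([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.9, 0.0], [0.0, 0.5]],
                ZERO_BIAS)
PRECEDING_HEAD = ([[0.0, 0.6], [0.1, 0.0], [0.8, 0.0], [0.2, 0.0], [0.0, 0.3]],
                  ZERO_BIAS)
FEATURES = [[1.0, 0.0],   # stands in for the encoded feature of "篮"
            [0.0, 1.0]]   # stands in for the encoded feature of "球"

print(correct(FEATURES, CURRENT_HEAD, PRECEDING_HEAD, DICTIONARY))
```

Because each source position can emit up to two characters, the two-character input decodes to the three-character output "曲棍球" in this toy setup.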
As an optional embodiment, performing self-attention encoding on the text encoding feature through the text feature correction module to obtain the corresponding initial self-attention vector may include: inputting the text encoding feature of size [M, d] into the self-attention layer and performing self-attention encoding according to the self-attention formula to obtain the corresponding initial self-attention vector, where i_emlm is the initial self-attention vector; W_q, W_k and W_v are learnable weights; f is the text encoding feature; size denotes the size of the text encoding feature; and T denotes the matrix transpose.
The softmax function, also known as the normalized exponential function, generalizes the binary sigmoid function to multi-class classification; its purpose is to present multi-class results as probabilities. Self-attention means that the attention in the model is computed entirely from the feature vectors themselves. Since the self-attention mechanism is a common technique in current image and text processing, it is not described further here.
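For readers who want a concrete picture of the self-attention step, the sketch below uses standard scaled dot-product self-attention. The exact formula of this embodiment is not available in the text, so the scaling by the square root of d and the order of the matrix products are assumptions; the function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply an [n, m] matrix by an [m, p] matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(f, w_q, w_k, w_v):
    """f: [M, d]; w_q, w_k, w_v: [d, d]. Returns an [M, d] matrix."""
    q, k, v = matmul(f, w_q), matmul(f, w_k), matmul(f, w_v)
    d = len(f[0])
    scores = matmul(q, transpose(k))  # [M, M] pairwise similarities
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)         # weighted sum of value rows

identity = [[1.0, 0.0], [0.0, 1.0]]
f = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # M = 3, d = 2
i_emlm = self_attention(f, identity, identity, identity)
print(len(i_emlm), len(i_emlm[0]))  # 3 2
```

The output keeps the [M, d] shape of the input, which is what allows the later steps to treat each row as the feature of one character position.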
Further, performing character prediction processing on the initial self-attention vector to obtain the corresponding target self-attention vector may include: inputting the initial self-attention vector i_emlm into two groups of fully connected layers to perform current character prediction processing and preceding character prediction processing respectively, obtaining a current prediction vector and a preceding prediction vector, and determining the target self-attention vector o_emlm according to the current prediction vector and the preceding prediction vector.
The two groups of fully connected layers are new layers that need to be trained and tuned separately. Specifically, during model training, tuning can be performed by computing the cross entropy between the output of the fully connected layers and the ground-truth characters, so that when the model is put into use, the trained model can be used to compute and output the target self-attention vector o_emlm.
As an embodiment, determining the target self-attention vector according to the current prediction vector and the preceding prediction vector may include: using the current prediction vector to perform prediction processing on the text encoding feature to obtain the target current character corresponding to the text encoding feature. Specifically, according to the current prediction vector, the text encoding feature is probability matched against each preset character in the preset dictionary to obtain the current prediction probability of each preset character, and the preset character with the largest current prediction probability is determined as the target current character. Performing current prediction processing on the text encoding feature thus yields the target current character for the subsequent preceding character prediction, which increases the elasticity of masked text prediction to a certain extent.
The preceding prediction vector can then be used to perform prediction processing on the target current character to obtain the target preceding character corresponding to the target current character. Specifically, according to the preceding prediction vector, the target current character is probability matched against each preset character in the preset dictionary to obtain the preceding prediction probability of each preset character, and the preset character with the largest preceding prediction probability is determined as the target preceding character. The target preceding character is thus determined through preceding prediction processing of the target current character, which increases the elasticity of masked text prediction and realizes error correction of text of indefinite length.
As in FIG. 5 above, the feature corresponding to the character "篮" is responsible for predicting the character "棍" during current character prediction and the character "曲" during preceding character prediction. Since this prediction example was described in detail in the analysis of FIG. 5, it is not repeated here.
The target preceding character can then be concatenated with the target current character to output the corresponding target characters, and the target self-attention vector corresponding to the target characters is generated. For example, the target preceding character "曲" corresponding to "篮" is concatenated with the target current character "棍" to obtain the target characters "曲棍", and the target self-attention vector corresponding to "曲棍" is generated for subsequent processing.
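The effect of the concatenation on the sentence can be illustrated at the character level. This is only an analogy under an explicit assumption: the embodiment generates a target self-attention vector rather than editing strings, and `splice_correction` is a hypothetical helper name.

```python
def splice_correction(sentence, k, preceding_char, current_char):
    """Replace the single character at index k with preceding_char + current_char."""
    return sentence[:k] + preceding_char + current_char + sentence[k + 1:]

source = "一个男孩在打篮球他很开心"
k = source.index("篮")  # position s_k of the erroneous character
corrected = splice_correction(source, k, "曲", "棍")
print(corrected)  # 一个男孩在打曲棍球他很开心
```

One source position yields two output characters, so the corrected sentence is one character longer than the input; this is the lengthening case of indefinite-length correction.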
Sub-step 4032: perform effective information amount prediction on the text encoding feature to obtain the corresponding effective text information vector, where the effective text information vector represents the probability that each character in the text encoding feature contains effective information.
As can be seen from the foregoing embodiments, error correction may shorten the original sentence as well as lengthen it. To make the model of the text error correction system more comprehensive, the present application designs an information amount prediction network sub-module within the text feature correction module, so that the feature at each position can predict whether the character at that position contains an effective amount of information.
In a specific implementation, performing effective information amount prediction on the text encoding feature to obtain the corresponding effective text information vector may include: using a formula [formula not reproduced in this text] to perform effective information amount prediction on the text encoding feature and obtain the corresponding effective text information vector, where p_ifo is the effective text information vector of size [M, 1] and represents the probability that each character in the text encoding feature contains effective information, W_iq, W_ik and W_iv are transfer matrix weights of size [d, d], W_iw is the information amount prediction weight of size [d, 1], b_ib is a learnable model parameter, and f is the text encoding feature. It should be noted that the parameters involved in the embodiments are all randomly initialized and obtained through training and tuning on an actual data set, and that the subscripts of the parameter symbols in each formula, such as q, w and k, serve only to distinguish the parameters and have no special meaning. Those skilled in the art can set them according to the actual situation or actual needs, and the present application does not limit this.
The sigmoid function is an S-shaped function common in biology, also known as the S-shaped growth curve. In information science, because the sigmoid function and its inverse are both monotonically increasing, it is often used as the activation function of a neural network to map variables to values between 0 and 1.
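The role of the sigmoid in producing per-character probabilities can be sketched as below. The formula of this embodiment is not available in the text, so this is an assumption-laden illustration: the attention step with W_iq, W_ik and W_iv is omitted, `h` stands in for its [M, d] output, and `information_probability` is a hypothetical name.

```python
import math

def sigmoid(x):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def information_probability(h, w_iw, b_ib):
    """h: [M, d] features; w_iw: [d] weight; b_ib: scalar bias.

    Returns M probabilities, one per character position (assumed form).
    """
    return [sigmoid(sum(x * w for x, w in zip(row, w_iw)) + b_ib) for row in h]

h = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]  # M = 3, d = 2
p_ifo = information_probability(h, w_iw=[1.0, 0.0], b_ib=0.0)
print([round(p, 3) for p in p_ifo])  # [0.881, 0.5, 0.119]
```

A position with a low probability is a candidate for removal, which corresponds to the case where correction shortens the sentence.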
For better explanation, refer to FIG. 6, which shows a schematic diagram of adjacent word relationship prediction provided in an embodiment of the present application. In the figure, (a) represents the original feature, with the shaded part indicating the portion that requires text error correction; (b) represents the front misalignment feature and the back misalignment feature obtained after bidirectional interception of the feature; and (c) represents the adjacent feature interaction vector obtained by vector concatenation of the front misalignment feature and the back misalignment feature, with the shaded part indicating the adjacent prediction content corresponding to the text error correction. This embodiment describes the generation of the adjacent feature interaction vector with reference to sub-steps 4033 and 4034 below:
Sub-step 4033: perform bidirectional interception on the text encoding feature to obtain the front misalignment feature and the back misalignment feature respectively, and generate the adjacent feature interaction vector according to the front misalignment feature and the back misalignment feature.
Specifically, the text encoding feature may have size [M, d], in which case the front misalignment feature and the back misalignment feature obtained by bidirectional interception both have size [M-1, d]. Generating the adjacent feature interaction vector according to the front misalignment feature and the back misalignment feature may then include: performing vector concatenation on the front misalignment feature and the back misalignment feature to generate an adjacent feature interaction vector of size [M-1, d×2] corresponding to the text encoding feature. Bidirectional interception followed by concatenation of the misalignment features thus yields the adjacent feature interaction vector, which facilitates the subsequent probability calculation for adjacent prediction and further improves the accuracy of text error correction.
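The shape bookkeeping of sub-step 4033 can be checked with a short sketch. It assumes that the front misalignment feature drops the last row and the back misalignment feature drops the first row, which the text does not state explicitly; `adjacent_interaction` is a hypothetical name.

```python
def adjacent_interaction(f):
    """f: [M, d] list of rows. Returns an [M-1, 2d] list of rows."""
    front = f[:-1]  # rows 0 .. M-2, size [M-1, d]
    back = f[1:]    # rows 1 .. M-1, size [M-1, d]
    # Concatenate each row with its right-hand neighbour along the feature axis.
    return [a + b for a, b in zip(front, back)]

f = [[1, 2], [3, 4], [5, 6]]  # M = 3, d = 2
f_nb = adjacent_interaction(f)
print(f_nb)  # [[1, 2, 3, 4], [3, 4, 5, 6]]
```

Row j of the result pairs the features of characters j and j+1, so each row describes one adjacent pair.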
Sub-step 4034: perform coherence prediction processing on the adjacent feature interaction vector to obtain the corresponding adjacent text information vector, where the adjacent text information vector represents the probability that adjacent characters in the text encoding feature are coherent.
Specifically, performing coherence prediction processing on the adjacent feature interaction vector to obtain the corresponding adjacent text information vector may include: using a formula [formula not reproduced in this text] to perform coherence prediction processing on the adjacent feature interaction vector and obtain the corresponding adjacent text information vector, where p_nbo is the adjacent text information vector of size [M-1, 1], W_nw, W_nq, W_nv and W_nk are transfer matrix weight parameters of size [d, d], b_in is a bias vector parameter of size [1, d], and f_nb is the adjacent feature interaction vector.
To keep the sizes of all vectors consistent, p_nbo can then be updated to a vector of size [M, 1]. By default, the first character of a sentence does not need to be coherent with any preceding text and is therefore always reasonable, so a 1 can be prepended to the vector.
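The padding step can be written in one line. The sketch below assumes p_nbo is held as a flat list of M-1 probabilities; `pad_adjacent_probabilities` is a hypothetical name.

```python
def pad_adjacent_probabilities(p_nbo):
    """Prepend 1.0 so that the first character is always treated as coherent."""
    return [1.0] + list(p_nbo)

p_nbo = [0.9, 0.2, 0.8]  # M - 1 = 3 adjacent pairs
padded = pad_adjacent_probabilities(p_nbo)
print(padded)  # [1.0, 0.9, 0.2, 0.8]
```

After padding, the vector has M entries, so entry k can be compared directly with entry k of the effective text information vector.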
Sub-step 4035: use the target self-attention vector, the effective text information vector and the adjacent text information vector to perform feature correction on the text encoding feature and generate the text correction feature.
The text feature correction module outputs the target self-attention vector, the effective text information vector and the adjacent text information vector, which can then be used in the subsequent process to perform feature correction on the text encoding feature and generate the text correction feature.
The above covers sub-steps 4031 to 4035. After the text encoding feature has been processed by the text feature correction module, the text correction feature can be input into the error correction vector accessor for the next stage of processing.
Referring to FIG. 7, a schematic diagram of the structural framework of an error correction vector accessor provided in an embodiment of the present application is shown. The error correction vector accessor mainly includes three parts: a feature storage space 701 for storing features, a repair judgment gate 702 for judging whether a feature needs repair, and a feature updater 703 for updating features. After feature interception is performed according to the comprehensive encoding feature and the original text feature to obtain the text encoding feature, the text encoding feature can further be split into several sub-text features, which are stored in sequence in the feature storage space 701. Specifically, a text encoding feature of size [M, d] can be split into M sub-text features.
After processing by the text feature correction module, the text correction feature carries newly added repair information, so the repair judgment gate 702 can judge whether each text correction feature needs to be repaired and thereby decide whether the corresponding vector in the feature storage space 701 needs to be updated.
In a specific implementation, performing feature fusion of the text correction feature and the text encoding feature through the error correction vector accessor to obtain the target text feature may be: performing a repair judgment on each sub-text feature through the repair judgment gate 702 to determine at least one replacement sub-text feature that requires feature replacement. Further, this repair judgment may be: using a formula [formula not reproduced in this text] to perform a repair judgment on each sub-text feature and determine whether the sub-text feature needs to be replaced.
Here, k denotes the feature number of a sub-text feature, p_ifok is the effective text information vector corresponding to the sub-text feature with feature number k, p_nbok is the adjacent text information vector corresponding to the sub-text feature with feature number k, thresh_ifo denotes a settable information amount probability threshold, thresh_nbo denotes a settable fluency probability threshold, and s(x_k) indicates whether the sub-text feature with feature number k is determined to be a replacement sub-text feature that requires feature replacement. For example, if s(x_k) is 1, the sub-text feature with feature number k requires feature replacement and is determined to be a replacement sub-text feature; if s(x_k) is 0, the sub-text feature with feature number k does not require feature replacement and is not determined to be a replacement sub-text feature.
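A possible shape for the repair judgment gate is sketched below. The gate formula itself is not available in the text, so the rule used here is purely an assumption: a position is flagged for replacement when either its information probability or its fluency probability falls below the corresponding threshold. The function name `repair_gate` and the default threshold values are hypothetical.

```python
def repair_gate(p_ifo, p_nbo, thresh_ifo=0.5, thresh_nbo=0.5):
    """Return s(x_k) for every position k: 1 = replace, 0 = keep.

    Assumed rule (not from the source): flag k when p_ifo[k] < thresh_ifo
    or p_nbo[k] < thresh_nbo. Both inputs must have the same length M.
    """
    return [1 if (pi < thresh_ifo or pn < thresh_nbo) else 0
            for pi, pn in zip(p_ifo, p_nbo)]

p_ifo = [0.9, 0.3, 0.8]  # probability that each character carries information
p_nbo = [1.0, 0.9, 0.2]  # probability that each character is coherent with its predecessor
s = repair_gate(p_ifo, p_nbo)
print(s)  # [0, 1, 1]
```

Only the positions flagged with 1 would be passed on to the feature updater 703; the others keep their stored sub-text features.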
Next, the feature updater 703 can use the text correction feature to perform feature replacement on the at least one replacement sub-text feature to obtain the respective target sub-text features, and perform feature fusion on the at least one target sub-text feature to obtain the corresponding target text feature, so that the text correction feature is updated through feature replacement and feature fusion.
In a specific implementation, using the text correction feature to perform feature replacement on the at least one replacement sub-text feature through the feature updater 703 to obtain the respective target sub-text features may be: according to the text correction feature, using the formula
f_ko = f_k × (1 - μ) + (p_ifok × θ + p_nbok × (1 - θ)) × μ × o_emlm
to calculate the text feature value f_ko corresponding to the replacement sub-text feature, and performing feature replacement on the replacement sub-text feature according to the text feature value to obtain the corresponding target sub-text feature, where f_k is the sub-text feature with feature number k, o_emlm is the target self-attention vector, and θ and μ are both preset parameters in the range 0 to 1.
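The update formula can be evaluated directly. The sketch below assumes that p_ifok and p_nbok are scalars for position k and that o_emlm contributes the row belonging to the same position, applied element-wise; `update_feature` and the default values of θ and μ are hypothetical.

```python
def update_feature(f_k, o_emlm_k, p_ifo_k, p_nbo_k, theta=0.5, mu=0.5):
    """Compute f_ko = f_k*(1-mu) + (p_ifo_k*theta + p_nbo_k*(1-theta))*mu*o_emlm_k."""
    gate = (p_ifo_k * theta + p_nbo_k * (1.0 - theta)) * mu
    return [x * (1.0 - mu) + gate * o for x, o in zip(f_k, o_emlm_k)]

f_k = [1.0, 2.0]  # stored sub-text feature
o_k = [3.0, 4.0]  # target self-attention vector for the same position
f_ko = update_feature(f_k, o_k, p_ifo_k=1.0, p_nbo_k=1.0)
print(f_ko)  # [2.0, 3.0]
```

With μ = 0 the stored feature is returned unchanged, and larger μ shifts the result toward the corrected vector; θ balances how much the information probability and the fluency probability each contribute.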
Feature replacement can then be performed on the replacement sub-text feature according to the text feature value to obtain the corresponding target sub-text feature. Specifically, the text feature value overwrites the original text feature value of the replacement sub-text feature. Finally, the target sub-text features whose replacement is complete are fused to obtain the corresponding target text feature, which completes the update of the text correction feature.
Step 404: use the target text feature to perform feature replacement on the original text feature through the error correction decoder, and output the corresponding target text information.
Finally, the target text feature can be transmitted from the error correction vector accessor to the error correction decoder, which uses the target text feature to perform feature replacement on the original text feature and, after decoding, outputs the corresponding target text information. The correct text corresponding to the image is thus generated, completing the image-based text error correction process.
For example, the error correction decoder in this embodiment may be a sentence generator, which can be implemented with a current mainstream model such as GPT (Generative Pre-Training, a generative pre-trained language model). The present application does not limit this.
It should be noted that the embodiments of the present application include but are not limited to the above examples. Guided by the ideas of the embodiments of the present application, those skilled in the art can also make settings according to actual needs, and the present application does not limit this.
In the embodiments of the present application, an image-based text error correction method applied to a multimodal text error correction system is provided. First, feature extraction is performed separately on the input image and the original text to obtain the image feature and the original text feature. The image feature and the original text feature are then spliced to generate a comprehensive encoding feature, and the text encoding feature is obtained by intercepting the part of the comprehensive encoding feature that corresponds to the position of the original text feature. Next, the text feature correction module performs feature correction on the text encoding feature to generate the text correction feature, and the error correction vector accessor fuses the text correction feature with the text encoding feature to obtain the target text feature. Finally, the error correction decoder uses the target text feature to perform feature replacement on the original text feature and outputs the corresponding target text information. Fine-grained errors in the original text are thereby identified and corrected according to the image, which greatly reduces the error rate in multimodal tasks.
It should be noted that, for simplicity of description, the method embodiments are all expressed as a series of action combinations. Those skilled in the art should understand, however, that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application certain steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.
Referring to FIG. 8, a structural block diagram of an image-based text error correction apparatus provided in an embodiment of the present application is shown. The apparatus is applied to a multimodal text error correction system that includes at least a text feature correction module, an error correction vector accessor and an error correction decoder, and may specifically include the following modules:
a feature extraction module 801, configured to, in response to an input operation for an image and text, obtain the image information and the original text information corresponding to the input operation, and perform feature extraction on the image information and the original text information respectively to obtain the image feature corresponding to the image information and the original text feature corresponding to the original text information;
a text encoding feature generation module 802, configured to splice the image feature with the original text feature to obtain a comprehensive encoding feature, and perform feature interception according to the comprehensive encoding feature and the original text feature to obtain the text encoding feature;
a target text feature generation module 803, configured to perform feature correction on the text encoding feature through the text feature correction module to generate the text correction feature, and perform feature fusion of the text correction feature and the text encoding feature through the error correction vector accessor to obtain the target text feature;
a text feature replacement module 804, configured to use the target text feature to perform feature replacement on the original text feature through the error correction decoder, and output the corresponding target text information.
In an optional embodiment, the text encoding feature generation module 802 includes:
a target self-attention vector generation module, configured to perform self-attention encoding on the text encoding feature through the text feature correction module to obtain the corresponding initial self-attention vector, and perform character prediction processing on the initial self-attention vector to obtain the corresponding target self-attention vector, where the target self-attention vector contains the associated features of the image feature and the original text feature;
an effective text information vector generation module, configured to perform effective information amount prediction on the text encoding feature to obtain the corresponding effective text information vector, where the effective text information vector represents the probability that each character in the text encoding feature contains effective information;
an adjacent feature interaction vector generation module, configured to perform bidirectional interception on the text encoding feature to obtain the front misalignment feature and the back misalignment feature respectively, and generate the adjacent feature interaction vector according to the front misalignment feature and the back misalignment feature;
an adjacent text information vector generation module, configured to perform coherence prediction processing on the adjacent feature interaction vector to obtain the corresponding adjacent text information vector, where the adjacent text information vector represents the probability that adjacent characters in the text encoding feature are coherent;
a text correction feature generation module, configured to use the target self-attention vector, the effective text information vector and the adjacent text information vector to perform feature correction on the text encoding feature and generate the text correction feature.
In an optional embodiment, the target self-attention vector generation module includes:
an initial self-attention vector generation module, configured to input the text encoding feature into the self-attention layer and perform self-attention encoding using a formula [formula not reproduced in this text] to obtain the corresponding initial self-attention vector, where W_q, W_k and W_v are learnable weights and f is the text encoding feature.
In an optional embodiment, the target self-attention vector generation module includes:
a character prediction processing module, configured to input the initial self-attention vector into two groups of fully connected layers to perform current character prediction processing and preceding character prediction processing respectively, obtaining a current prediction vector and a preceding prediction vector;
a target self-attention vector determination sub-module, configured to determine the target self-attention vector according to the current prediction vector and the preceding prediction vector.
In an optional embodiment, the target self-attention vector determination sub-module includes:
a target current character generation module, configured to use the current prediction vector to perform prediction processing on the text encoding feature to obtain the target current character corresponding to the text encoding feature;
a target preceding character generation module, configured to use the preceding prediction vector to perform prediction processing on the target current character to obtain the target preceding character corresponding to the target current character;
a target character output module, configured to concatenate the target preceding character with the target current character, output the corresponding target characters, and generate the target self-attention vector corresponding to the target characters.
In an optional embodiment, the target current character generation module is specifically configured to:
according to the current prediction vector, match the text encoding feature against each preset character in a preset dictionary to obtain a current prediction probability for each preset character, and determine the preset character with the highest current prediction probability as the target current character.
In an optional embodiment, the target preceding character generation module is specifically configured to:
according to the preceding prediction vector, match the target current character against each preset character in the preset dictionary to obtain a preceding prediction probability for each preset character, and determine the preset character with the highest preceding prediction probability as the target preceding character.
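The two selection steps above both amount to choosing the dictionary entry with the highest predicted probability. A minimal sketch follows, assuming each fully connected head emits one score per dictionary entry and that a softmax turns scores into probabilities. The dictionary contents, the score values, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_character(logits, dictionary):
    """Return the dictionary entry with the highest probability, plus that probability."""
    probs = softmax(logits)
    best = max(range(len(dictionary)), key=lambda i: probs[i])
    return dictionary[best], probs[best]

dictionary = ["c", "a", "t", "r"]           # hypothetical preset dictionary
current_logits = [0.1, 2.0, 0.3, 0.2]       # hypothetical output of the current-character head
preceding_logits = [1.5, 0.2, 0.1, 0.4]     # hypothetical output of the preceding-character head

current_char, p_cur = pick_character(current_logits, dictionary)
preceding_char, p_pre = pick_character(preceding_logits, dictionary)

# Concatenate the preceding character with the current character.
target = preceding_char + current_char      # "ca"
```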
In an optional embodiment, the effective text information vector generation module is specifically configured to:
use the formula [formula not reproduced] to predict the effective information amount of the text encoding feature and obtain the corresponding effective text information vector, where $W_{iq}$, $W_{ik}$, and $W_{iv}$ are transfer matrix weights, $W_{iw}$ is the information amount prediction weight, $b_{ib}$ is a learnable model parameter, and $f$ is the text encoding feature.
In an optional embodiment, the text encoding feature has size [M, d], the front misaligned feature and the back misaligned feature each have size [M-1, d], and the adjacent feature interaction vector generation module is specifically configured to:
concatenate the front misaligned feature and the back misaligned feature along the feature dimension, generating an adjacent feature interaction vector of size [M-1, d×2] corresponding to the text encoding feature.
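The shape arithmetic above can be checked with a short sketch. It assumes that the front misaligned feature drops the last row of the text encoding feature and the back misaligned feature drops the first row, so that row i of the result pairs character i with character i+1. That assignment, and the function name, are assumptions made for illustration.

```python
def adjacent_interaction(f):
    """Pair every row with its right-hand neighbour: [M, d] -> [M-1, 2d]."""
    front = f[:-1]   # rows 0 .. M-2 (assumed "front misaligned" feature)
    back = f[1:]     # rows 1 .. M-1 (assumed "back misaligned" feature)
    return [a + b for a, b in zip(front, back)]   # list concatenation per row

f = [[1, 2], [3, 4], [5, 6], [7, 8]]   # M = 4, d = 2
f_nb = adjacent_interaction(f)
# f_nb == [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]], i.e. size [3, 4]
```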
In an optional embodiment, the adjacent text information vector generation module is specifically configured to:
use the formula [formula not reproduced] to perform coherence prediction on the adjacent feature interaction vector and obtain the corresponding adjacent text information vector, where $W_{nw}$, $W_{nq}$, $W_{nv}$, and $W_{nk}$ are transfer matrix weight parameters, $b_{in}$ is a bias vector parameter, and $f_{nb}$ is the adjacent feature interaction vector.
In an optional embodiment, the error correction vector accessor includes at least a feature storage space, and the apparatus further includes:
a sub-text feature splitting module, configured to split the text encoding feature into several sub-text features and store the sub-text features in the feature storage space in order.
In an optional embodiment, the error correction vector accessor includes a repair judgment gate and a feature updater, and the target text feature generation module 803 includes:
a replacement sub-text feature determination module, configured to evaluate each sub-text feature through the repair judgment gate and determine at least one replacement sub-text feature that requires feature replacement;
a target sub-text feature determination module, configured to use the feature updater to replace the at least one replacement sub-text feature with the text correction feature, obtaining the corresponding target sub-text features, and to fuse the at least one target sub-text feature to obtain the corresponding target text feature.
In an optional embodiment, the replacement sub-text feature determination module is specifically configured to:
use the formula [formula not reproduced] to perform the repair judgment on each sub-text feature;
when $s(x_k)$ is 1, determine the sub-text feature with feature index $k$ as a replacement sub-text feature that requires feature replacement, where $k$ is the feature index of the sub-text feature, $p_{ifok}$ is the effective text information vector corresponding to the sub-text feature with feature index $k$, $p_{nbok}$ is the adjacent text information vector corresponding to the sub-text feature with feature index $k$, $thresh_{ifo}$ is a settable information amount probability threshold, $thresh_{nbo}$ is a settable fluency probability threshold, and $s(x_k)$ indicates whether the sub-text feature with feature index $k$ requires feature replacement.
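The gate formula is not reproduced in this text, so the exact comparison is unknown. Purely as an illustration of how two probabilities and two thresholds could yield a 0/1 decision, the sketch below assumes that a sub-text feature is flagged when either its information probability or its fluency probability falls below the matching threshold. The OR rule, the direction of the comparisons, the default thresholds, and all names are assumptions.

```python
def repair_gate(p_ifo, p_nbo, thresh_ifo=0.5, thresh_nbo=0.5):
    """Return s(x_k) for every index k: 1 = replace, 0 = keep (assumed OR rule)."""
    return [
        1 if (p_i < thresh_ifo or p_n < thresh_nbo) else 0
        for p_i, p_n in zip(p_ifo, p_nbo)
    ]

p_ifo = [0.9, 0.3, 0.8]   # hypothetical information probabilities per sub-text feature
p_nbo = [0.7, 0.9, 0.2]   # hypothetical fluency probabilities per sub-text feature

flags = repair_gate(p_ifo, p_nbo)                        # [0, 1, 1]
to_replace = [k for k, s in enumerate(flags) if s == 1]  # [1, 2]
```

Raising either threshold flags more sub-text features for replacement; lowering it makes the gate more conservative.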
In an optional embodiment, the target sub-text feature determination module includes:
a text feature value calculation module, configured to use the text correction feature and the formula
$$f_{ko} = f_k \times (1-\mu) + \left(p_{ifok} \times \theta + p_{nbok} \times (1-\theta)\right) \times \mu \times o_{emlm}$$
to calculate the text feature value of each replacement sub-text feature, and to replace the replacement sub-text feature according to that value, obtaining the corresponding target sub-text feature, where $f_k$ is the sub-text feature with feature index $k$, $o_{emlm}$ is the target self-attention vector, and $\theta$ and $\mu$ are preset parameters in the range 0 to 1.
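The update formula can be read as an interpolation: μ controls how much of the original feature is kept, and θ balances the information probability against the fluency probability. A minimal sketch follows, assuming that the two probabilities are scalars for the sub-text feature in question and that $f_k$ and $o_{emlm}$ are vectors of equal length combined element-wise. Those shape choices, the default parameter values, and the function name are assumptions.

```python
def updated_feature(f_k, p_ifok, p_nbok, o_emlm, theta=0.5, mu=0.5):
    """f_ko = f_k*(1-mu) + (p_ifok*theta + p_nbok*(1-theta)) * mu * o_emlm, element-wise."""
    if not (0.0 <= theta <= 1.0 and 0.0 <= mu <= 1.0):
        raise ValueError("theta and mu must lie in [0, 1]")
    blend = p_ifok * theta + p_nbok * (1.0 - theta)   # weighted confidence term
    return [x * (1.0 - mu) + blend * mu * o for x, o in zip(f_k, o_emlm)]

# Worked example: blend = 0.8*0.5 + 0.4*0.5 = 0.6
# first element: 2.0*0.5 + 0.6*0.5*4.0 = 1.0 + 1.2 = 2.2
result = updated_feature([2.0, 4.0], p_ifok=0.8, p_nbok=0.4, o_emlm=[4.0, 0.0])
# result is approximately [2.2, 2.0]
```

With μ = 0 the original feature is returned unchanged; with μ = 1 it is fully replaced by the confidence-weighted correction.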
In an optional embodiment, the text feature value calculation module is specifically configured to:
overwrite the original text feature value of the replacement sub-text feature with the calculated text feature value, obtaining the corresponding target sub-text feature.
In an optional embodiment, the text encoding feature generation module 802 includes:
a cross-modal encoding processing module, configured to concatenate the image features with the original text features and perform cross-modal encoding, obtaining a comprehensive encoding feature.
In an optional embodiment, the text encoding feature generation module 802 includes:
a corresponding feature position interception module, configured to extract from the comprehensive encoding feature the features at the positions of the original text features, obtaining the text encoding feature corresponding to the original text features.
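The concatenate-then-intercept step can be sketched as sequence slicing. The example assumes that image features are placed before text features in the joint sequence and uses an identity function as a stand-in for the cross-modal encoder, which the text does not specify. Both choices, and all names, are assumptions.

```python
def cross_modal_encode(joint):
    """Placeholder for the cross-modal encoder (assumption: identity copy)."""
    return [row[:] for row in joint]

def text_encoding_feature(image_feats, text_feats):
    """Concatenate along the sequence axis, encode, then keep the text positions."""
    joint = image_feats + text_feats                # [N + M, d]
    encoded = cross_modal_encode(joint)             # comprehensive encoding feature
    start = len(image_feats)                        # text rows begin after the image rows
    return encoded[start:start + len(text_feats)]   # [M, d]

image_feats = [[0.1, 0.2], [0.3, 0.4]]              # N = 2 image tokens (hypothetical)
text_feats = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]   # M = 3 text tokens (hypothetical)
text_enc = text_encoding_feature(image_feats, text_feats)
```

With a real encoder the sliced rows would differ from the inputs, because each text position would have attended to the image positions; only the slicing logic is the point here.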
Because the apparatus embodiments are substantially similar to the method embodiments, they are described only briefly; for the relevant details, refer to the corresponding parts of the method embodiments.
In addition, an embodiment of the present application provides an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor. When executed by the processor, the computer program implements each process of the above embodiments of the text error correction method for an image and achieves the same technical effect. To avoid repetition, details are not repeated here.
As shown in FIG. 9, an embodiment of the present application further provides a computer non-volatile readable storage medium 901 on which a computer program is stored. When executed by a processor, the computer program implements each process of the above embodiments of the text error correction method for an image and achieves the same technical effect. To avoid repetition, details are not repeated here. The computer non-volatile readable storage medium 901 is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
FIG. 10 is a schematic diagram of the hardware structure of an electronic device for implementing the embodiments of the present application.
The electronic device 1000 includes, but is not limited to, a radio frequency unit 1001, a network module 1002, an audio output unit 1003, an input unit 1004, a sensor 1005, a display unit 1006, a user input unit 1007, an interface unit 1008, a memory 1009, a processor 1010, and a power supply 1011. Those skilled in the art will understand that the electronic device structure described in the embodiments of the present application does not limit the electronic device: the electronic device may include more or fewer components than shown, combine certain components, or arrange the components differently. In the embodiments of the present application, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a laptop computer, a palmtop computer, a vehicle-mounted terminal, a wearable device, and a pedometer.
It should be understood that, in the embodiments of the present application, the radio frequency unit 1001 can be used to receive and send signals while sending or receiving information or during a call. Specifically, it receives downlink data from a base station and passes it to the processor 1010 for processing, and it sends uplink data to the base station. Typically, the radio frequency unit 1001 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, and a duplexer. The radio frequency unit 1001 can also communicate with networks and other devices through a wireless communication system.
Through the network module 1002, the electronic device provides the user with wireless broadband Internet access, for example to send and receive e-mail, browse web pages, and access streaming media.
The audio output unit 1003 can convert audio data received by the radio frequency unit 1001 or the network module 1002, or stored in the memory 1009, into an audio signal and output it as sound. The audio output unit 1003 can also provide audio output related to a specific function performed by the electronic device 1000 (for example, a call signal reception sound or a message reception sound). The audio output unit 1003 includes a speaker, a buzzer, a receiver, and the like.
The input unit 1004 is used to receive audio or video signals. The input unit 1004 may include a graphics processing unit (GPU) 10041 and a microphone 10042. The graphics processor 10041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The processed image frames can be displayed on the display unit 1006, stored in the memory 1009 (or another storage medium), or sent via the radio frequency unit 1001 or the network module 1002. The microphone 10042 can receive sound and process it into audio data. In telephone call mode, the processed audio data can be converted into a format that can be sent to a mobile communication base station via the radio frequency unit 1001.
The electronic device 1000 also includes at least one sensor 1005, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 10061 according to the ambient light level, and the proximity sensor can turn off the display panel 10061 and/or the backlight when the electronic device 1000 is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (generally three axes) and, when stationary, the magnitude and direction of gravity. It can be used to identify the posture of the electronic device (for example, switching between landscape and portrait, related games, and magnetometer posture calibration) and for vibration-recognition functions (for example, a pedometer or tap detection). The sensor 1005 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and the like, which are not described further here.
The display unit 1006 is used to display information entered by the user or information provided to the user. The display unit 1006 may include a display panel 10061, which may be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The user input unit 1007 can be used to receive numeric or character input and to generate key signal input related to user settings and function control of the electronic device. Specifically, the user input unit 1007 includes a touch panel 10071 and other input devices 10072. The touch panel 10071, also called a touch screen, can collect the user's touch operations on or near it (for example, operations performed on or near the touch panel 10071 with a finger, a stylus, or any other suitable object or accessory). The touch panel 10071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch and the signal produced by the touch operation, and passes the signal to the touch controller. The touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends the coordinates to the processor 1010, and receives and executes commands sent by the processor 1010. The touch panel 10071 can be implemented as a resistive, capacitive, infrared, or surface acoustic wave panel, among other types. Besides the touch panel 10071, the user input unit 1007 may include other input devices 10072, which may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, and a joystick, which are not described further here.
Further, the touch panel 10071 may be overlaid on the display panel 10061. When the touch panel 10071 detects a touch operation on or near it, it passes the operation to the processor 1010 to determine the type of touch event, and the processor 1010 then provides the corresponding visual output on the display panel 10061 according to that type. It can be understood that in one embodiment the touch panel 10071 and the display panel 10061 are two independent components that implement the input and output functions of the electronic device, while in some embodiments the touch panel 10071 and the display panel 10061 can be integrated to implement those functions; this is not limited here.
The interface unit 1008 is the interface through which an external device connects to the electronic device 1000. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device that has an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and so on. The interface unit 1008 can be used to receive input (for example, data or power) from an external device and transmit it to one or more elements within the electronic device 1000, or to transfer data between the electronic device 1000 and an external device.
The memory 1009 can be used to store software programs and various data. The memory 1009 may mainly include a program storage area and a data storage area. The program storage area can store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function). The data storage area can store data created through use of the mobile phone (such as audio data and a phone book). In addition, the memory 1009 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 1010 is the control center of the electronic device. It connects the parts of the electronic device through various interfaces and lines, and performs the functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 1009 and calling the data stored in the memory 1009, thereby monitoring the electronic device as a whole. The processor 1010 may include one or more processing units. Preferably, the processor 1010 integrates an application processor and a modem processor: the application processor mainly handles the operating system, the user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the modem processor need not be integrated into the processor 1010.
The electronic device 1000 may also include a power supply 1011 (such as a battery) that powers the components. Preferably, the power supply 1011 is logically connected to the processor 1010 through a power management system, so that charging, discharging, and power consumption are managed through the power management system.
In addition, the electronic device 1000 includes some functional modules that are not shown, which are not described further here.
It should be noted that, in this document, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
From the above description of the implementations, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. On this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions that cause a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods of the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific implementations described, which are illustrative rather than restrictive. Guided by the present application, those of ordinary skill in the art can devise many other forms without departing from the purpose of the present application or the scope protected by the claims, and all of these fall within the protection of the present application.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the embodiments provided in the present application, it should be understood that the disclosed apparatuses and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented as software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, ROM, RAM, a magnetic disk, or an optical disk.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited to them. Any change or substitution that a person familiar with the technical field could readily conceive within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be determined by the protection scope of the claims.
Claims (20)
- A text error correction method for an image, characterized in that the method is applied to a multimodal text error correction system comprising at least a text feature correction module, an error correction vector accessor, and an error correction decoder, the method comprising:
  in response to an input operation on an image and text, acquiring the image information and the original text information corresponding to the input operation, and performing feature extraction on the image information and the original text information respectively to obtain image features corresponding to the image information and original text features corresponding to the original text information;
  concatenating the image features with the original text features to obtain a comprehensive encoding feature, and performing feature interception based on the comprehensive encoding feature and the original text features to obtain a text encoding feature;
  performing feature correction on the text encoding feature through the text feature correction module to generate a text correction feature, and fusing the text correction feature with the text encoding feature through the error correction vector accessor to obtain a target text feature;
  replacing the original text features with the target text feature through the error correction decoder, and outputting the corresponding target text information.
- The method according to claim 1, characterized in that performing feature correction on the text encoding feature through the text feature correction module to generate a text correction feature comprises:
  performing self-attention encoding on the text encoding feature through the text feature correction module to obtain a corresponding initial self-attention vector, and performing character prediction processing on the initial self-attention vector to obtain a corresponding target self-attention vector, wherein the target self-attention vector contains the association features between the image features and the original text features;
  predicting the effective information amount of the text encoding feature to obtain a corresponding effective text information vector, wherein the effective text information vector represents the probability that each character in the text encoding feature contains effective information;
  intercepting the text encoding feature in both directions to obtain a front misaligned feature and a back misaligned feature, and generating an adjacent feature interaction vector from the front misaligned feature and the back misaligned feature;
  performing coherence prediction processing on the adjacent feature interaction vector to obtain a corresponding adjacent text information vector, wherein the adjacent text information vector represents the probability that adjacent characters in the text encoding feature are coherent;
  performing feature correction on the text encoding feature using the target self-attention vector, the effective text information vector, and the adjacent text information vector to generate the text correction feature.
- The method according to claim 2, characterized in that performing self-attention encoding on the text encoding feature through the text feature correction module to obtain the corresponding initial self-attention vector comprises:
  - inputting the text encoding feature into a self-attention layer and performing self-attention encoding using the formula [formula not reproduced in the source] to obtain the corresponding initial self-attention vector; wherein $W_q$, $W_k$, and $W_v$ are all learnable weights, and $f$ is the text encoding feature.
- The method according to claim 2 or 3, characterized in that performing character prediction processing on the initial self-attention vector to obtain the corresponding target self-attention vector comprises:
  - inputting the initial self-attention vector into two groups of fully connected layers to perform current character prediction processing and preceding character prediction processing respectively, obtaining a current prediction vector and a preceding prediction vector;
  - determining the target self-attention vector according to the current prediction vector and the preceding prediction vector.
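The self-attention formula of claim 3 is not reproduced in the source, so the following is only a minimal sketch that assumes the standard scaled dot-product form, softmax((f·W_q)(f·W_k)ᵀ / √d)·(f·W_v). The helper names (`matmul`, `softmax`, `self_attention`) and the toy weights are hypothetical and are not taken from the patent.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(f, w_q, w_k, w_v):
    """Assumed form: softmax((f W_q)(f W_k)^T / sqrt(d)) (f W_v)."""
    q, k, v = matmul(f, w_q), matmul(f, w_k), matmul(f, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]        # transpose of k
    scores = matmul(q, k_t)                      # [M, M] similarity scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)                    # [M, d] attended features

# Toy text encoding feature: M = 3 characters, d = 2 dimensions.
f = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]              # stand-in for learned weights
out = self_attention(f, identity, identity, identity)
```

In this sketch the output keeps the input shape [M, d], so it could feed the two fully connected prediction heads described in the claim that follows.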
- The method according to claim 4, characterized in that determining the target self-attention vector according to the current prediction vector and the preceding prediction vector comprises:
  - performing prediction processing on the text encoding feature using the current prediction vector to obtain a target current character corresponding to the text encoding feature;
  - performing prediction processing on the target current character using the preceding prediction vector to obtain a target preceding character corresponding to the target current character;
  - concatenating the target preceding character with the target current character, outputting the corresponding target character, and generating the target self-attention vector corresponding to the target character.
- The method according to claim 5, characterized in that performing prediction processing on the text encoding feature using the current prediction vector to obtain the target current character corresponding to the text encoding feature comprises:
  - according to the current prediction vector, performing probability matching between the text encoding feature and each preset character in a preset dictionary to obtain a current prediction probability for each preset character, and determining the preset character with the largest current prediction probability as the target current character.
- The method according to claim 6, characterized in that performing prediction processing on the target current character using the preceding prediction vector to obtain the target preceding character corresponding to the target current character comprises:
  - according to the preceding prediction vector, performing probability matching between the target current character and each preset character in the preset dictionary to obtain a preceding prediction probability for each preset character, and determining the preset character with the largest preceding prediction probability as the target preceding character.
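The dictionary lookup described in claims 5 to 7 amounts to taking the most probable entry for each of the two prediction heads and joining the results. The sketch below illustrates that idea only; the dictionary contents, the score values, and the helper `pick_character` are hypothetical, and a softmax is assumed as the way scores become probabilities.

```python
import math

DICTIONARY = ["c", "a", "t", "r"]  # hypothetical preset dictionary

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_character(scores):
    """Return the dictionary character with the highest probability, and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return DICTIONARY[best], probs[best]

# Hypothetical outputs of the two fully connected heads for one position.
current_scores = [0.1, 2.5, 0.3, 0.2]    # "current character" head
preceding_scores = [3.0, 0.2, 0.1, 0.4]  # "preceding character" head

current_char, p_cur = pick_character(current_scores)
preceding_char, p_pre = pick_character(preceding_scores)

# Concatenate the preceding character with the current character.
target = preceding_char + current_char
```

Here `target` is the two-character string formed from the two argmax results; how the patent derives the target self-attention vector from it is not specified in the claims and is left out.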
- The method according to claim 2, characterized in that performing effective-information prediction on the text encoding feature to obtain the corresponding effective text information vector comprises:
  - performing effective-information prediction on the text encoding feature using the formula [formula not reproduced in the source] to obtain the corresponding effective text information vector; wherein $W_{iq}$, $W_{ik}$, and $W_{iv}$ are all transfer matrix weights, $W_{iw}$ is the information prediction weight, $b_{ib}$ is a learnable model parameter, and $f$ is the text encoding feature.
- The method according to claim 2, characterized in that the size of the text encoding feature is $[M, d]$, the sizes of the front misalignment feature and the back misalignment feature are both $[M-1, d]$, and generating the adjacent feature interaction vector from the front misalignment feature and the back misalignment feature comprises:
  - performing vector concatenation on the front misalignment feature and the back misalignment feature to generate an adjacent feature interaction vector of size $[M-1, d \times 2]$ corresponding to the text encoding feature.
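The shape bookkeeping in the concatenation claim can be checked with a short sketch. It assumes that the front misalignment feature is the text encoding feature without its last row and the back misalignment feature is the same feature without its first row; the claims give only the sizes, so this assignment and the function name `adjacent_interaction` are assumptions.

```python
def adjacent_interaction(f):
    """Pair each character feature with its right-hand neighbour."""
    front = f[:-1]   # rows 0 .. M-2, size [M-1, d] (assumed "front" slice)
    back = f[1:]     # rows 1 .. M-1, size [M-1, d] (assumed "back" slice)
    # Row-wise concatenation gives size [M-1, 2*d].
    return [a + b for a, b in zip(front, back)]

f = [[1, 2], [3, 4], [5, 6], [7, 8]]  # toy feature with M = 4, d = 2
f_nb = adjacent_interaction(f)
```

Each row of `f_nb` then holds the features of one adjacent character pair, which is the input a coherence predictor would score.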
- The method according to claim 2 or 9, characterized in that performing coherence prediction processing on the adjacent feature interaction vector to obtain the corresponding adjacent text information vector comprises:
  - performing coherence prediction processing on the adjacent feature interaction vector using the formula [formula not reproduced in the source] to obtain the corresponding adjacent text information vector; wherein $W_{nw}$, $W_{nq}$, $W_{nv}$, and $W_{nk}$ are all transfer matrix weight parameters, $b_{in}$ is a bias vector parameter, and $f_{nb}$ is the adjacent feature interaction vector.
- The method according to claim 2, characterized in that the error correction vector accessor comprises at least a feature storage space, and after performing feature interception according to the comprehensive encoding feature and the original text feature to obtain the text encoding feature, the method further comprises:
  - splitting the text encoding feature into several sub-text features, and storing each sub-text feature in the feature storage space in sequence.
- The method according to claim 11, characterized in that the error correction vector accessor comprises a repair judgment gate and a feature updater, and fusing the text correction feature with the text encoding feature through the error correction vector accessor to obtain the target text feature comprises:
  - performing a repair judgment on each sub-text feature through the repair judgment gate to determine at least one replacement sub-text feature that requires feature replacement;
  - replacing, through the feature updater, the at least one replacement sub-text feature using the text correction feature to obtain the respective target sub-text features, and fusing the at least one target sub-text feature to obtain the corresponding target text feature.
- The method according to claim 12, characterized in that performing a repair judgment on each sub-text feature through the repair judgment gate to determine at least one replacement sub-text feature that requires feature replacement comprises:
  - performing a repair judgment on each sub-text feature using the formula [formula not reproduced in the source];
  - when $s(x_k)$ is 1, determining the sub-text feature with feature index $k$ as a replacement sub-text feature that requires feature replacement; wherein $k$ is the feature index of the sub-text feature, $p_{ifo_k}$ is the effective text information vector corresponding to the sub-text feature with feature index $k$, $p_{nbo_k}$ is the adjacent text information vector corresponding to the sub-text feature with feature index $k$, $\mathrm{thresh}_{ifo}$ is a configurable information-amount probability threshold, $\mathrm{thresh}_{nbo}$ is a configurable fluency probability threshold, and $s(x_k)$ indicates whether the sub-text feature with feature index $k$ requires feature replacement.
- The method according to claim 12, characterized in that replacing, through the feature updater, the at least one replacement sub-text feature using the text correction feature to obtain the respective target sub-text features comprises:
  - according to the text correction feature, calculating a text feature value corresponding to the replacement sub-text feature using the formula $f_{ko} = f_k \times (1-\mu) + (p_{ifo_k} \times \theta + p_{nbo_k} \times (1-\theta)) \times \mu \times o_{emlm}$, and performing feature replacement on the replacement sub-text feature according to the text feature value to obtain the corresponding target sub-text feature; wherein $f_k$ is the sub-text feature with feature index $k$, $o_{emlm}$ is the target self-attention vector, and $\theta$ and $\mu$ are both preset parameters with values between 0 and 1.
- The method according to claim 14, characterized in that performing feature replacement on the replacement sub-text feature according to the text feature value to obtain the corresponding target sub-text feature comprises:
  - replacing the original text feature value of the replacement sub-text feature with the text feature value by overwriting the original value, to obtain the corresponding target sub-text feature.
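The repair gate and the feature update can be illustrated together. The update below follows the formula stated in claim 14; the gate formula is not reproduced in the source, so the sketch assumes a feature is flagged when either probability falls below its threshold. The function names, the default thresholds, the treatment of the two probabilities as per-feature scalars, and the toy values are all hypothetical.

```python
def needs_repair(p_ifo, p_nbo, thresh_ifo=0.5, thresh_nbo=0.5):
    """Assumed gate s(x_k): 1 if either probability is below its threshold, else 0."""
    return 1 if (p_ifo < thresh_ifo or p_nbo < thresh_nbo) else 0

def update_feature(f_k, o_emlm, p_ifo, p_nbo, theta=0.5, mu=0.5):
    """f_ko = f_k*(1-mu) + (p_ifo*theta + p_nbo*(1-theta))*mu*o_emlm, element-wise."""
    gain = (p_ifo * theta + p_nbo * (1 - theta)) * mu
    return [x * (1 - mu) + gain * o for x, o in zip(f_k, o_emlm)]

f_k = [1.0, 2.0]           # sub-text feature with index k
o_emlm = [4.0, 8.0]        # target self-attention vector for the same position
p_ifo, p_nbo = 0.25, 0.75  # effective-information and coherence probabilities

if needs_repair(p_ifo, p_nbo):
    # Overwrite the original value with the computed text feature value.
    f_k = update_feature(f_k, o_emlm, p_ifo, p_nbo)
```

With θ = μ = 0.5, the old feature keeps half its weight and the correction term is scaled by the average of the two probabilities, which shows how μ controls the strength of the repair and θ the balance between the two signals.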
- The method according to claim 1, characterized in that concatenating the image feature with the original text feature to obtain the comprehensive encoding feature comprises:
  - concatenating the image feature with the original text feature and performing cross-modal encoding processing to obtain the comprehensive encoding feature.
- The method according to claim 1, characterized in that performing feature interception according to the comprehensive encoding feature and the original text feature to obtain the text encoding feature comprises:
  - intercepting, from the comprehensive encoding feature, the features at the positions corresponding to the original text feature, to obtain the text encoding feature corresponding to the original text feature.
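Concatenation followed by interception can be sketched as joining the two feature sequences, encoding them jointly, and slicing out the rows at the text positions. The sketch assumes the image rows come first in the concatenation and uses an identity function as a stand-in for the cross-modal encoder; both choices, and the names `cross_modal_encode` and `text_encoding_feature`, are hypothetical.

```python
def cross_modal_encode(features):
    """Placeholder for the cross-modal encoder: returns a copy of its input."""
    return [row[:] for row in features]

def text_encoding_feature(image_feat, text_feat):
    """Concatenate [N, d] image rows with [M, d] text rows, encode, keep the text rows."""
    combined = cross_modal_encode(image_feat + text_feat)  # size [N + M, d]
    start = len(image_feat)                                # text rows start after image rows
    return combined[start:start + len(text_feat)]          # size [M, d]

image_feat = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # N = 3 image features
text_feat = [[1.0, 1.1], [1.2, 1.3]]               # M = 2 text features
f = text_encoding_feature(image_feat, text_feat)
```

With a real encoder the sliced rows would differ from `text_feat`, since each would carry image context mixed in during cross-modal encoding; the slice positions stay the same.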
- A text error correction apparatus for an image, characterized in that it is applied to a multimodal text error correction system comprising at least a text feature correction module, an error correction vector accessor, and an error correction decoder, the apparatus comprising:
  - a feature extraction module, configured to, in response to an input operation for an image and text, acquire image information and original text information corresponding to the input operation, and perform feature extraction on each to obtain an image feature corresponding to the image information and an original text feature corresponding to the original text information;
  - a text encoding feature generation module, configured to concatenate the image feature with the original text feature to obtain a comprehensive encoding feature, and perform feature interception according to the comprehensive encoding feature and the original text feature to obtain a text encoding feature;
  - a target text feature generation module, configured to perform feature correction on the text encoding feature through the text feature correction module to generate a text correction feature, and fuse the text correction feature with the text encoding feature through the error correction vector accessor to obtain a target text feature;
  - a text feature replacement module, configured to replace the original text feature with the target text feature through the error correction decoder, and output corresponding target text information.
- An electronic device, characterized in that it comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
  - the memory is configured to store a computer program;
  - the processor is configured to implement the method according to any one of claims 1 to 17 when executing the program stored in the memory.
- A non-volatile computer-readable storage medium having instructions stored thereon which, when executed by one or more processors, cause the processors to execute the method according to any one of claims 1 to 17.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211680372.4A CN115659959B (en) | 2022-12-27 | 2022-12-27 | Image text error correction method and device, electronic equipment and storage medium |
CN202211680372.4 | 2022-12-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024139307A1 (en) | 2024-07-04 |
Family
ID=85022954
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/115054 WO2024139307A1 (en) | 2022-12-27 | 2023-08-25 | Sentence correction method and apparatus for image, and electronic device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115659959B (en) |
WO (1) | WO2024139307A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115659959B (en) * | 2022-12-27 | 2023-03-21 | 苏州浪潮智能科技有限公司 | Image text error correction method and device, electronic equipment and storage medium |
CN116701734B (en) * | 2023-08-07 | 2024-04-02 | 深圳市智慧城市科技发展集团有限公司 | Address text processing method and device and computer readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112686030A (en) * | 2020-12-29 | 2021-04-20 | 科大讯飞股份有限公司 | Grammar error correction method, grammar error correction device, electronic equipment and storage medium |
CN114241279A (en) * | 2021-12-30 | 2022-03-25 | 中科讯飞互联(北京)信息科技有限公司 | Image-text combined error correction method and device, storage medium and computer equipment |
CN114462356A (en) * | 2022-04-11 | 2022-05-10 | 苏州浪潮智能科技有限公司 | Text error correction method, text error correction device, electronic equipment and medium |
WO2022095345A1 (en) * | 2020-11-05 | 2022-05-12 | 苏州浪潮智能科技有限公司 | Multi-modal model training method, apparatus, device, and storage medium |
CN115659959A (en) * | 2022-12-27 | 2023-01-31 | 苏州浪潮智能科技有限公司 | Image text error correction method and device, electronic equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114444479B (en) * | 2022-04-11 | 2022-06-24 | 南京云问网络技术有限公司 | End-to-end Chinese speech text error correction method, device and storage medium |
CN115017890A (en) * | 2022-05-26 | 2022-09-06 | 深圳价值在线信息科技股份有限公司 | Text error correction method and device based on character pronunciation and character font similarity |
2022
- 2022-12-27: CN application CN202211680372.4A (patent CN115659959B), status: Active
2023
- 2023-08-25: WO application PCT/CN2023/115054 (patent WO2024139307A1), status: unknown
Also Published As
Publication number | Publication date |
---|---|
CN115659959B (en) | 2023-03-21 |
CN115659959A (en) | 2023-01-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2024139307A1 (en) | Sentence correction method and apparatus for image, and electronic device and storage medium | |
KR102270394B1 (en) | Method, terminal, and storage medium for recognizing an image | |
US20210174170A1 (en) | Sequence model processing method and apparatus | |
CN111402866B (en) | Semantic recognition method and device and electronic equipment | |
US10769253B2 (en) | Method and device for realizing verification code | |
US11308694B2 (en) | Image processing apparatus and image processing method | |
CN111612093A (en) | Video classification method, video classification device, electronic equipment and storage medium | |
CN108763317B (en) | Method for assisting in selecting picture and terminal equipment | |
CN111061383B (en) | Text detection method and electronic equipment | |
CN109495616B (en) | Photographing method and terminal equipment | |
CN107918496A (en) | It is a kind of to input error correction method and device, a kind of device for being used to input error correction | |
US20210158031A1 (en) | Gesture Recognition Method, and Electronic Device and Storage Medium | |
CN110837734A (en) | Text information processing method and mobile terminal | |
CN110808019A (en) | Song generation method and electronic equipment | |
CN111382748A (en) | Image translation method, device and storage medium | |
CN112488157B (en) | Dialogue state tracking method and device, electronic equipment and storage medium | |
CN112464831B (en) | Video classification method, training method of video classification model and related equipment | |
CN111144065B (en) | Display control method and electronic equipment | |
CN107329584A (en) | A kind of word input processing method, mobile terminal and computer-readable recording medium | |
CN115661727B (en) | Video behavior positioning method and device, electronic equipment and storage medium | |
CN110389666A (en) | A kind of input error correction method and device | |
CN112653789A (en) | Voice mode switching method, terminal and storage medium | |
CN108471549B (en) | Remote control method and terminal | |
CN117726003A (en) | Response defense method, device, equipment and storage medium based on large model reasoning | |
CN115240250A (en) | Model training method and device, computer equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23909247; Country of ref document: EP; Kind code of ref document: A1 |