CN115393864A - Text recognition method, electronic device and computer storage medium - Google Patents
- Publication number
- CN115393864A
- Authority
- CN
- China
- Prior art keywords
- text
- granularity
- image
- text prediction
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/765—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
Embodiments of the present application provide a text recognition method, an electronic device, and a computer storage medium. The text recognition method includes: obtaining an image block sequence corresponding to a text image to be recognized; performing image feature extraction on the image block sequence through an attention-based machine learning model to obtain corresponding image features; performing multi-granularity token parsing based on the image features, where the multi-granularity token parsing includes token parsing at character granularity, token parsing at subword granularity, and token parsing at whole-word granularity; and performing text recognition on the text image according to the multi-granularity token parsing results. Through the embodiments of the present application, text recognition performance is raised to a higher level.
Description
Technical Field
Embodiments of the present application relate to the technical field of image recognition, and in particular to a text recognition method, an electronic device, and a computer storage medium.
Background
Text recognition is a technique for detecting and recognizing text in an image to obtain the corresponding text information. With the widespread use of the Transformer model in natural language processing, more and more practitioners attempt to replace convolutional neural network models with Transformer models in the field of image recognition, so as to perform text recognition on images containing text.

However, because images containing text generally lack semantic information, applying a Transformer model to such images often yields inaccurate recognition; for low-quality images in particular, recognition accuracy is low.
Disclosure of Invention
In view of the above, embodiments of the present application provide a text recognition scheme to at least partially solve the above problems.
According to a first aspect of embodiments of the present application, there is provided a text recognition method, including: obtaining an image block sequence corresponding to a text image to be recognized; performing image feature extraction on the image block sequence through an attention-based machine learning model to obtain corresponding image features; performing multi-granularity token parsing based on the image features, wherein the multi-granularity token parsing includes token parsing at character granularity, token parsing at subword granularity, and token parsing at whole-word granularity; and performing text recognition on the text image according to the multi-granularity token parsing results.
According to a second aspect of embodiments of the present application, there is provided an electronic device including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus; the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the corresponding operation of the method according to the first aspect.
According to a third aspect of embodiments herein, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method as described in the first aspect.
According to a fourth aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions for instructing a computing device to perform operations corresponding to the method according to the first aspect.
According to the scheme provided by the embodiments of the present application, performing multi-granularity token parsing on the image features implicitly injects semantic information into the model that processes the text image, so that the model can combine image features and semantic information for text recognition, improving recognition efficiency and accuracy. Moreover, because semantic information is characterized at character granularity, subword granularity, and whole-word granularity, image features and semantic information can be obtained at multiple granularities, raising text recognition performance to a higher level.
Drawings
To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description cover only some embodiments described in the present application; those skilled in the art can obtain other drawings based on these drawings.
FIG. 1 is a schematic diagram of an exemplary system to which embodiments of the present application may be applied;
FIG. 2A is a flowchart illustrating steps of a method for text recognition according to an embodiment of the present application;
FIG. 2B is a diagram illustrating a structure of a text recognition model in the embodiment shown in FIG. 2A;
FIG. 2C is a schematic illustration of a multi-granular prediction result obtained using the text recognition model shown in FIG. 2B;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application shall fall within the scope of protection of the embodiments of the present application.
The following further describes specific implementations of embodiments of the present application with reference to the drawings of the embodiments of the present application.
Fig. 1 illustrates an exemplary system to which embodiments of the present application may be applied. As shown in fig. 1, the system 100 may include a cloud server 102, a communication network 104, and/or one or more user devices 106, illustrated in fig. 1 as a plurality of user devices.
The cloud server 102 may be any suitable device for storing information, data, programs, and/or any other suitable type of content, including but not limited to distributed storage system devices, server clusters, computing cloud clusters, and the like. In some embodiments, the cloud server 102 may perform any suitable function. For example, in some embodiments, the cloud server 102 may perform text recognition on a text image. As an optional example, in some embodiments, the cloud server 102 may perform multi-granularity token parsing based on the image features corresponding to a text image, so as to blend text semantic information into the image features for text recognition. As an alternative example, in some embodiments, the cloud server 102 may host a text recognition model and perform text recognition on the text image through that model. As another example, in some embodiments, the cloud server 102 may receive a text recognition request sent by a user device 106 and perform text recognition on the requested text image. Further, in some embodiments, the cloud server 102 may also send the text recognition result back to the user device 106.

In some embodiments, the communication network 104 may be any suitable combination of one or more wired and/or wireless networks. For example, the communication network 104 may include, but is not limited to, any one or more of the following: the internet, an intranet, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a Digital Subscriber Line (DSL) network, a frame relay network, an Asynchronous Transfer Mode (ATM) network, a Virtual Private Network (VPN), and/or any other suitable communication network. The user device 106 can be connected to the communication network 104 by one or more communication links (e.g., communication link 112), and the communication network 104 can be linked to the cloud server 102 via one or more communication links (e.g., communication link 114). A communication link may be any link suitable for communicating data between the user device 106 and the cloud server 102, such as a network link, dial-up link, wireless link, hardwired link, any other suitable communication link, or any suitable combination of such links.

The user device 106 may include any one or more user devices suitable for interacting with a user or with the cloud server 102. In some embodiments, the user device 106 may send the cloud server 102 a text recognition request, so that the cloud server 102 obtains the requested text image and performs text recognition on it. In some embodiments, the text recognition request sent by the user device 106 to the cloud server 102 carries the corresponding text image, or the address from which it can be acquired, so that the cloud server 102 can obtain the text image. In some embodiments, user devices 106 may comprise any suitable type of device. For example, in some embodiments, the user device 106 may include a mobile device, a tablet computer, a laptop computer, a desktop computer, a wearable computer, a game console, a media player, a vehicle entertainment system, and/or any other suitable type of user device.
Based on the above system, the text recognition scheme of the present application is explained below by way of an embodiment.
Referring to fig. 2A, a flow chart of steps of a method of text recognition according to an embodiment of the present application is shown.
The text recognition method of this embodiment comprises the following steps:

Step S202: obtaining an image block sequence corresponding to the text image to be recognized.
In the embodiments of the present application, the text image may be a plain-text image or an image containing both text image elements and non-text image elements; both are applicable to the scheme of the embodiments of the present application.
In addition, in the embodiments of the present application, to facilitate subsequent image feature extraction by the attention-based machine learning model, the text image is first divided into a series of non-overlapping, contiguous image blocks, i.e., an image block sequence. The image blocks in this sequence, together with their position information, serve as the input of the subsequent attention-based machine learning model.

For example, the text image to be recognized may be partitioned by a convolution operation and each block flattened into a sequence element; a position encoding and a CLS token (class token) are then added to the image block sequence, which is input into the subsequent attention-based machine learning model, such as a Transformer encoder.
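As a minimal sketch of this patchify-and-embed step (PyTorch-style; the class name, image size, patch size, and embedding dimension are illustrative assumptions, not values from the patent):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping P x P patches and linearly
    project each patch to a D-dimensional vector, then prepend a CLS
    token and add position encodings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = P performs the non-overlapping
        # partition and the linear projection in one operation.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, N, D) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)        # prepend the [CLS] token
        return x + self.pos_embed             # add the position encoding
```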
Step S204: performing image feature extraction on the image block sequence through an attention-based machine learning model to obtain corresponding image features.
Unlike the conventional approach of extracting image features with a convolutional neural network, the embodiments of the present application adopt an attention-based machine learning model to extract features from the image block sequence, owing to the strong performance of the attention mechanism in feature extraction. For example, the machine learning model may use the encoder structure of a Transformer.

As described above, to use the Transformer encoder for image feature extraction without changing its structure, a position encoding and a CLS token are added after the image is processed into an image block sequence, so that the Transformer encoder can process the sequence directly.

Performing image feature extraction on the image block sequence through the attention-based machine learning model, such as a Transformer encoder, yields the corresponding image features, which are represented in the form of feature tokens.
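Continuing the sketch, the feature tokens can then be obtained from a standard Transformer encoder (the depth and head count here are assumptions for illustration):

```python
# Extract feature tokens z_L from the embedded patch sequence.
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))  # (1, 197, 768)
z_L = encoder(patches)                                   # feature tokens, (1, 197, 768)
```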
Step S206: performing multi-granularity token parsing based on the image features.

The multi-granularity token parsing includes: token parsing at character granularity, token parsing at subword granularity, and token parsing at whole-word granularity.
Token parsing, also known as tokenization, generates individual character substrings from a feature representation, each substring carrying relatively complete semantics. In the embodiments of the present application, token parsing introduces textual semantic information into the processing of the image, so that text recognition of the text image can be performed based on image features and semantic information simultaneously, yielding a more accurate recognition result.

Further, in the embodiments of the present application, token parsing is implemented at three granularities: character granularity, subword granularity, and whole-word granularity.
Character granularity, i.e., character level, covers the most basic characters: token parsing at character granularity over the image features analyzes and predicts characters such as 'a', 'b', and 'c' in English, or 'you', 'I', 'he' in Chinese. Whole-word granularity, i.e., wordpiece level, covers natural language units: token parsing at whole-word granularity analyzes and predicts units such as 'Transformer' and 'coffee' in English, or whole phrases such as 'the dishes this restaurant makes are good' and 'this dress is really beautiful' in Chinese. Subword granularity lies between character granularity and whole-word granularity, dividing words by common character combinations; token parsing at subword granularity analyzes and predicts subwords between single characters and whole words. For example, 'Transformer' in English may be parsed into 'Transform' and 'er', and 'coffee' into 'co', 'ff', and 'ee'; in Chinese, 'the dishes this restaurant makes are good' may be parsed into chunks such as 'this restaurant', 'makes', 'the dishes', and 'good', and 'this dress is really beautiful' into 'this dress', 'really', and 'beautiful'.
In one possible approach, the subword-granularity token parsing of the embodiments of the present application may be implemented as token parsing at Byte Pair Encoding (BPE) granularity. BPE-granularity token parsing divides subwords according to the frequency of the most commonly combined characters, so that the predicted subwords are more reasonable. In fact, wordpiece-level token parsing can also be regarded as a BPE variant: a coarser-granularity encoding built on subwords.
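For illustration of the BPE idea only (a simplified sketch; the patent does not prescribe this implementation), the most frequent adjacent symbol pair is merged repeatedly:

```python
from collections import Counter

def merge_pair(symbols, a, b):
    """Replace every adjacent occurrence of (a, b) with the merged symbol a+b."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(symbols[i]); i += 1
    return out

def bpe_merges(words, num_merges=3):
    """Greedily merge the most frequent adjacent symbol pair (simplified BPE)."""
    vocab = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in vocab:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        vocab = [merge_pair(symbols, a, b) for symbols in vocab]
    return merges

# One possible result on this toy corpus: ['of', 'off', 'offe']
print(bpe_merges(["coffee", "coffees", "toffee"]))
```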
Multi-granularity token parsing can be implemented by attention processing followed by classification. For example, character-granularity token parsing based on the image features can be implemented as: using a spatial attention function to select, at single-character granularity, the image features related to the i-th character, where i indexes the text characters in the text image (for example, 'coffee' contains 6 characters); aggregating the selected image features to generate a vector corresponding to the i-th character; and classifying through a classifier to obtain the character text corresponding to the i-th character, such as 'c'.

Subword-granularity token parsing based on the image features can be implemented as: performing spatial attention computation and classification processing on the image features based on a BPE algorithm, and taking the processing result as the subword-granularity token parsing result. The BPE algorithm counts the frequency of each pair of consecutive bytes and merges the most frequent pair into a new symbol. On this basis, the spatial attention function obtained after training selects, at subword granularity, the image features related to the j-th subword and aggregates them to generate a vector corresponding to the j-th subword; classification through a classifier then yields the character text corresponding to the j-th subword, where j indexes the text subwords in the text image. For example, 'coffee' contains 2 subwords, and the character text obtained for a subword may be 'co' or 'ffee'.

Whole-word-granularity token parsing based on the image features can be implemented as: performing spatial attention computation and classification processing on the image features based on the WordPiece algorithm, and taking the processing result as the whole-word-granularity token parsing result. The WordPiece algorithm can be viewed as a variant of the BPE algorithm, except that WordPiece selects the pair to merge into a new subword based on likelihood rather than on the highest frequency. On this basis, the spatial attention function obtained after training selects, at whole-word granularity, the image features related to the m-th whole word and aggregates them to generate a vector corresponding to the m-th whole word; classification through a classifier then yields the character text corresponding to the m-th whole word, where m indexes the whole words in the text image. For example, 'I like coffee' contains 3 whole words, and the character text obtained for a whole word may be 'I', 'like', or 'coffee'.
Through the above multi-granularity token parsing, on the one hand, richer image features and semantic information are obtained through analysis and prediction at multiple levels, improving text recognition performance on text images; on the other hand, the spatial attention mechanism together with classification processing enables accurate analysis and prediction at the different levels, improving prediction accuracy and efficiency.

Step S208: performing text recognition on the text image according to the multi-granularity token parsing results.

After the multi-granularity token parsing results are obtained, text recognition can be performed based on them: obtaining text prediction results at the corresponding granularities from the token parsing results; fusing the text prediction results of the multiple granularities; and performing text recognition on the text image according to the fusion result. Note that a text prediction result is obtained from a token parsing result combined with a preset vocabulary, and parsing results of different granularities correspond to different vocabularies. For example, the character-granularity vocabulary includes symbols such as '#' and '@' in addition to characters such as 'a', 'b', 'c', and 'd'. The embodiments of the present application do not limit the size or the specific contents of the vocabularies at the different granularities; for ease of explanation, the following takes vocabularies of 256 elements as an example.
Fusing the text prediction results of the multiple granularities and performing text recognition on the text image according to the fusion result can be implemented in the following ways:

Method one: averaging the probability distributions indicated by the text prediction result of each of the multiple granularities to obtain a probability-distribution mean corresponding to each granularity's text prediction result; determining the text prediction result corresponding to the largest of the probability-distribution means of the multiple granularities as the target text prediction result; and performing text recognition on the text image according to the target text prediction result.

Still taking the text 'coffee' in the text image as an example: based on the 256 characters in the vocabulary, each character in 'coffee' corresponds to a probability distribution, such as [0, 0.9, 0, 0.02, 0.05, 0, …, 0], which indicates that this position is the character 'c' with probability 0.9, the character 'e' with probability 0.02, the character 'o' with probability 0.05, and any other character with probability 0. Each of the other characters has a similar probability distribution. Then, at character granularity, the probability distributions of all the characters corresponding to 'coffee' are averaged to obtain the probability-distribution mean at character granularity, denoted P1.
Similarly, at subword granularity, 'coffee' has probability distributions over the corresponding subwords, based on the subword-granularity vocabulary. The probability distributions of all the subwords corresponding to 'coffee' are then averaged to obtain the probability-distribution mean at subword granularity, denoted P2.

At whole-word granularity, 'coffee' has a corresponding probability distribution based on the whole-word-granularity vocabulary. The probability distribution corresponding to 'coffee' is then averaged to obtain the probability-distribution mean at whole-word granularity, denoted P3.

In the above example, if P3 > P2 > P1, the text prediction result corresponding to P3 is determined as the target text prediction result, and whole-word recognition is performed based on it, so that the characters in the 'coffee' image region of the text image are recognized as 'coffee'.

Using the probability-distribution mean, the prediction quality of each granularity's text prediction result is represented in a balanced manner, yielding a relatively objective and accurate prediction result.
Method two: multiplying the probabilities indicated by the text prediction result of each of the multiple granularities to obtain a probability-product result corresponding to each granularity's text prediction result; determining the text prediction result corresponding to the largest of the probability products of the multiple granularities as the target text prediction result; and performing text recognition on the text image according to the target text prediction result.

Still taking the text 'coffee' in the text image as an example: based on the 256 characters in the vocabulary, each character in 'coffee' corresponds to a probability distribution, such as [0, 0.9, 0, 0.02, 0.05, 0, …, 0], indicating the character 'c' with probability 0.9, 'e' with probability 0.02, 'o' with probability 0.05, and other characters with probability 0. Each of the other characters has a similar probability distribution. Then, at character granularity, the probabilities of all the characters corresponding to 'coffee' are multiplied to obtain the probability product at character granularity, denoted M1.
Similarly, at subword granularity, 'coffee' has probability distributions over the corresponding subwords based on the subword-granularity vocabulary; multiplying the probabilities of all the subwords corresponding to 'coffee' gives the probability product at subword granularity, denoted M2.

At whole-word granularity, 'coffee' has a corresponding probability distribution based on the whole-word-granularity vocabulary; the probability product at whole-word granularity is obtained accordingly and denoted M3.

In the above example, if M1 > M2 > M3, the text prediction result corresponding to M1 is determined as the target text prediction result; character recognition is performed based on it, the characters corresponding to the 'coffee' image region of the text image are recognized as 'c', 'o', 'f', 'f', 'e', and 'e', and they are then combined into 'coffee'.

Using the probability product, the text prediction result of the more accurately predicted granularity stands out more sharply, so an accurate prediction result can be obtained quickly and efficiently.
Method three: averaging the probability distributions indicated by the text prediction result of each of the multiple granularities to obtain a probability-distribution mean corresponding to each granularity's text prediction result, and determining the text prediction result corresponding to the largest mean as the first text prediction result; multiplying the probabilities indicated by the text prediction result of each granularity to obtain a probability-product result for each granularity, and determining the text prediction result corresponding to the largest product as the second text prediction result; determining the target text prediction result according to the confidence corresponding to the first text prediction result and the confidence corresponding to the second text prediction result; and performing text recognition on the text image according to the target text prediction result.

In this method, the prediction results obtained by method one and method two are considered together, and the better of the two is selected: the text prediction result obtained as in method one serves as the first text prediction result, and the one obtained as in method two serves as the second text prediction result. On this basis, the confidences of the two prediction results are determined, the prediction result with the higher confidence is selected as the target text prediction result, and text recognition is performed on the text image based on it. The confidences corresponding to the first and second text prediction results can be obtained in a conventional manner, which is not detailed here.

In this way, a better target text prediction result can be selected, providing an accurate basis for subsequent text recognition.
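A compact sketch of the three fusion modes (function names and inputs are illustrative; using each winner's own score as its confidence is an assumption, since the patent leaves the confidence computation to conventional means):

```python
import numpy as np

def pick_granularity(candidates, score_fn):
    """Return (granularity, score) with the highest score under score_fn."""
    scores = {g: float(score_fn(p)) for g, p in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

def fuse_predictions(char_probs, subword_probs, word_probs):
    """Each *_probs lists the probabilities of one granularity's predicted
    tokens (e.g. the max probability of each predicted character/subword/word)."""
    candidates = {"char": char_probs, "subword": subword_probs, "word": word_probs}
    g_mean, conf_mean = pick_granularity(candidates, np.mean)   # method one
    g_prod, conf_prod = pick_granularity(candidates, np.prod)   # method two
    # Method three: keep whichever of the two winners has the higher confidence.
    return g_mean if conf_mean >= conf_prod else g_prod

# 'coffee' example: per-character, per-subword, and whole-word probabilities.
print(fuse_predictions([0.9, 0.8, 0.7, 0.9, 0.95, 0.9], [0.85, 0.9], [0.97]))  # word
```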
Through this embodiment, performing multi-granularity token parsing on the image features implicitly injects semantic information into the model that processes the text image, so that the model can combine image features and semantic information for text recognition, improving recognition efficiency and accuracy. Moreover, because semantic information is characterized at character, subword, and whole-word granularities, image features and semantic information can be obtained at multiple granularities, raising text recognition performance to a higher level.
In practical applications, the text recognition method can also be implemented by a text recognition model. In one possible implementation, the text recognition model may include: a linear projection layer, an attention-based encoder, an adaptive addressing and aggregation layer, and a fusion output layer.
Wherein:

The linear projection layer is used to project the image block sequence of the text image into vectors of a preset dimension.

The attention-based encoder, such as a Transformer encoder, implements the function of the aforementioned attention-based machine learning model: it performs attention computation on the vectors of the preset dimension to extract and output the corresponding image features.

The adaptive addressing and aggregation layer is used to perform multi-granularity token parsing based on the image features to obtain token parsing results at the corresponding granularities, and to obtain the corresponding text prediction results of the multiple granularities from those parsing results.

The fusion output layer is used to determine and output the target text prediction result according to the text prediction results of the multiple granularities, so as to obtain the text recognition result of the text image from the target text prediction result.

The adaptive addressing and aggregation layer may include: a character-granularity adaptive addressing and aggregation layer, a subword-granularity adaptive addressing and aggregation layer, and a whole-word-granularity adaptive addressing and aggregation layer.
An exemplary implementation of the text recognition model described above is shown in fig. 2B, in which a W × H RGB image is segmented by a patch operation into a series of non-overlapping image blocks forming an image block sequence, schematically indicated as P × P Patches, where P × P denotes the resolution of each image block.

The image block sequence is linearly projected into D-dimensional image block vectors by a Linear Projection layer, where D is set by those skilled in the art according to actual requirements.
After the D-dimensional image block vectors are obtained, the text recognition model adds a learnable [CLS] token at the head of the sequence, and then adds the position information corresponding to each image block to the [CLS] token and to the image block vectors, forming the Position + Patch embedding with the [CLS] token, which is input into the attention-based encoder. In fig. 2B this is shown as the vectors between the linear projection layer and the Transformer encoder: 0, 1, …, 10 denote position vectors, '+' denotes the [CLS] token, and the open ellipse paired with each of the position vectors 1, 2, …, 10 denotes an image block vector.
The processed vectors are fed into the attention-based encoder, shown schematically in fig. 2B as a Transformer Encoder, which performs image feature extraction on them to obtain the corresponding image features.

Next, the image features are input into the Adaptive Addressing and Aggregation layer (hereinafter the A3 layer, for convenience of description), which performs token parsing at multiple granularities. Because multi-granularity token parsing is required, the A3 layer is implemented as three mutually independent modules: a Character A3 Module (character granularity), a BPE A3 Module (BPE granularity), and a WordPiece A3 Module (whole-word granularity).
In the traditional approach, when a Transformer encoder performs text recognition on a text image, the first 27 of the 256 output tokens are taken directly as the final output, while the other tokens are not effectively utilized and much useful information is discarded. To make full use of the Transformer encoder's output sequence for text sequence prediction, in the scheme of the embodiments of the present application, all output tokens of the Transformer encoder are integrated by the A3 modules of the multiple granularities into a sequence of preset length (for example, 27, assuming the longest text is 27 characters). Suppose y_i is the i-th element output after attention processing by a granularity's A3 module, z_L is the output of the Transformer encoder, and A is the aggregation function; then the conversion formula used by the A3 module is: y_i = A_i(z_L).
In one possible way, y_i = A_i(z_L) = softmax(α_i(z_L))^T (z_L U)^T,

where α_i(·) denotes a group convolution with 1 × 1 convolution kernels (also known as block convolution), z_L denotes the token sequence output by the Transformer encoder, and U denotes a learnable linear mapping matrix.
Based on this, for a given A3 module, the output vector Y after the above computation is:

Y = [y_1; y_2; …; y_T] = [A_1(z_L); A_2(z_L); …; A_T(z_L)]

where T is the preset text length, which as noted above is 27 in this embodiment.
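A possible reading of this formula as code (a sketch only: the patent specifies α_i as a 1 × 1 group convolution, simplified here to an ordinary 1 × 1 convolution, and all shapes are assumptions):

```python
class A3Module(nn.Module):
    """Adaptive Addressing and Aggregation: y_i = softmax(alpha_i(z_L))^T (z_L U)."""
    def __init__(self, dim=768, max_len=27):
        super().__init__()
        # alpha: a 1x1 convolution producing one spatial-attention map per
        # output position (the patent describes a group convolution here).
        self.alpha = nn.Conv1d(dim, max_len, kernel_size=1)
        self.U = nn.Linear(dim, dim, bias=False)   # learnable mapping matrix U

    def forward(self, z):                          # z: (B, N, D) encoder tokens
        attn = self.alpha(z.transpose(1, 2))       # (B, T, N) attention logits
        attn = attn.softmax(dim=-1)                # address: normalize over N tokens
        return attn @ self.U(z)                    # aggregate: (B, T, D)
```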
Then, the classifier corresponding to each granularity's A3 module classifies the text sequence based on the vector Y, realizing text sequence prediction.

In one example, the classifier is denoted G = Y W^T, where W represents a linear mapping matrix. As described above, the A3 modules of different granularities correspond to different classifiers; for simplicity, G is used here as a unified notation. Those skilled in the art will appreciate, however, that the classifiers of different granularities have different Y and W, and hence different G.
After processing by the A3 modules, text prediction results of multiple granularities are obtained; an example of a multi-granularity prediction result is shown in fig. 2C. As can be seen in the figure, character granularity splits a word into single characters of the finest granularity. BPE granularity splits words into common character substrings: 'methodist' is split into 'method' and 'ist'; 'univorsity' is split into 'un' and 'iversity'; '41km' is split into '41' and 'km'. Whole-word granularity predicts an entire word.
Specifically, in the example shown in fig. 2B, the character-granularity prediction result for 'coffee' is the single characters, the BPE-granularity prediction results are 'co' and 'ffee', and the whole-word-granularity prediction result is 'coffee'.
Furthermore, the multi-granularity prediction results output by the A3 modules are fused and output through the fusion output layer, using any of method one, method two, or method three described in the fusion section above, which is not repeated here.

As can be seen from the above, in the text recognition model example shown in fig. 2B, the backbone (the linear projection layer and the attention-based encoder) is fully shared between image processing and semantic processing of the text image, with no separate semantic processing module; semantic information is implicitly injected into the text recognition model through the A3 modules; the A3 modules automatically aggregate and select all tokens output by the Transformer encoder through a spatial attention mechanism; and the A3 layer contains modules of multiple granularities that predict at character, subword, and whole-word granularity respectively, so that image features and semantic information can be obtained at more granularities, achieving more accurate and efficient text recognition.
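Putting the sketches above together, the overall model of fig. 2B might be assembled as follows (a sketch reusing PatchEmbedding and A3Module from earlier; the vocabulary sizes, depth, and head count are illustrative assumptions):

```python
class TextRecognitionModel(nn.Module):
    """End-to-end sketch of fig. 2B: shared backbone, one A3 module and one
    classifier per granularity (character / BPE / WordPiece); the fusion
    output layer (methods one to three above) picks the final result."""
    def __init__(self, vocab_sizes=(256, 256, 256), max_len=27, dim=768):
        super().__init__()
        self.embed = PatchEmbedding(embed_dim=dim)        # linear projection layer
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        self.a3 = nn.ModuleList([A3Module(dim, max_len) for _ in vocab_sizes])
        self.heads = nn.ModuleList([nn.Linear(dim, v, bias=False)  # G = Y W^T
                                    for v in vocab_sizes])

    def forward(self, img):                               # img: (B, 3, H, W)
        z_L = self.encoder(self.embed(img))               # shared backbone features
        # per-granularity predictions, each of shape (B, T, vocab_size)
        return [head(a3(z_L)) for a3, head in zip(self.a3, self.heads)]
```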
Referring to fig. 3, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown, and a specific implementation of the electronic device is not limited in the specific embodiment of the present application.
As shown in fig. 3, the electronic device may include: a processor (processor) 302, a communication Interface 304, a memory 306, and a communication bus 308.
Wherein:
the processor 302, communication interface 304, and memory 306 communicate with each other via a communication bus 308.
A communication interface 304 for communicating with other electronic devices or servers.
The processor 302 is configured to execute the program 310, and may specifically execute the relevant steps in the foregoing text recognition method embodiment.
In particular, program 310 may include program code comprising computer operating instructions.
The processor 302 may be a CPU, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The one or more processors included in the electronic device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.

The memory 306 is used for storing a program 310. The memory 306 may comprise high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.
The program 310 may be specifically configured to enable the processor 302 to execute operations corresponding to the text recognition method described in any of the foregoing method embodiments.
For specific implementation of each step in the program 310, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing method embodiments, and corresponding beneficial effects are provided, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
The embodiment of the present application further provides a computer program product, which includes computer instructions, where the computer instructions instruct a computing device to execute an operation corresponding to any text recognition method in the foregoing method embodiments.
It should be noted that, according to implementation requirements, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to embodiments of the present application may be implemented in hardware or firmware, or as software or computer code storable in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium. The methods described herein may thus be processed by such software on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the methods described herein. Furthermore, when a general-purpose computer accesses code for implementing the methods illustrated herein, execution of that code transforms the general-purpose computer into a special-purpose computer for performing those methods.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of the embodiments of the present application should be defined by the claims.
Claims (13)
1. A text recognition method, comprising:
obtaining an image block sequence corresponding to a text image to be recognized;
performing image feature extraction on the image block sequence through a machine learning model based on an attention mechanism to obtain corresponding image features;
performing multi-granularity token parsing based on the image features, wherein the multi-granularity token parsing comprises: token parsing at character granularity, token parsing at subword granularity, and token parsing at whole-word granularity; and
performing text recognition on the text image according to the multi-granularity token parsing results.
2. The method of claim 1, wherein the token parsing at subword granularity is token parsing at byte pair encoding (BPE) granularity.
3. The method of claim 1 or 2, wherein performing token parsing at subword granularity based on the image features comprises:
performing spatial attention computation and classification processing on the image features based on a BPE algorithm, and taking the processing result as the subword-granularity token parsing result.
4. The method of claim 1 or 2, wherein performing token parsing at whole-word granularity based on the image features comprises:
performing spatial attention computation and classification processing on the image features based on a WordPiece algorithm, and taking the processing result as the whole-word-granularity token parsing result.
5. The method according to claim 1 or 2, wherein performing text recognition on the text image according to the multi-granularity token parsing results comprises:
obtaining text prediction results at the corresponding granularities according to the multi-granularity token parsing results;
fusing the text prediction results of the multiple granularities, and performing text recognition on the text image according to the fusion result.
6. The method of claim 5, wherein the fusing of the text prediction results of the multiple granularities and the text recognition of the text image according to the fusion result comprise:
averaging the probability distributions indicated by the text prediction result of each of the multiple granularities to obtain a probability-distribution mean corresponding to each granularity's text prediction result;
determining the text prediction result corresponding to the largest of the probability-distribution means of the multiple granularities as a target text prediction result;
performing text recognition on the text image according to the target text prediction result.
7. The method of claim 5, wherein the fusing of the text prediction results of the multiple granularities and the text recognition of the text image according to the fusion result comprise:
multiplying the probabilities indicated by the text prediction result of each of the multiple granularities to obtain a probability-product result corresponding to each granularity's text prediction result;
determining the text prediction result corresponding to the largest of the probability products of the multiple granularities as a target text prediction result;
performing text recognition on the text image according to the target text prediction result.
8. The method of claim 5, wherein the fusing of the text prediction results of the multiple granularities and the text recognition of the text image according to the fusion result comprise:
averaging the probability distributions indicated by the text prediction result of each of the multiple granularities to obtain a probability-distribution mean corresponding to each granularity's text prediction result, and determining the text prediction result corresponding to the largest of the probability-distribution means as a first text prediction result;
multiplying the probabilities indicated by the text prediction result of each of the multiple granularities to obtain a probability-product result corresponding to each granularity's text prediction result, and determining the text prediction result corresponding to the largest of the probability products as a second text prediction result;
determining a target text prediction result according to the confidence corresponding to the first text prediction result and the confidence corresponding to the second text prediction result;
performing text recognition on the text image according to the target text prediction result.
9. The method of claim 1 or 2, wherein the text recognition method is performed by a text recognition model, the attention-based machine learning model being an attention-based encoder;
the text recognition model comprises: a linear projection layer, the encoder, an adaptive addressing and aggregation layer, and a fusion output layer;
wherein:
the linear projection layer is used for projecting the image block sequence into vectors of a preset dimension;
the encoder is used for performing attention computation on the vectors so as to extract and output corresponding image features from them;
the adaptive addressing and aggregation layer is used for performing multi-granularity token parsing based on the image features to obtain corresponding token parsing results of multiple granularities, and obtaining corresponding text prediction results of the multiple granularities according to the token parsing results;
the fusion output layer is used for determining and outputting a target text prediction result according to the text prediction results of the multiple granularities, so as to obtain a text recognition result of the text image according to the target text prediction result.
10. The method of claim 9, wherein the adaptive addressing and aggregation layer comprises: a character-granularity adaptive addressing and aggregation layer, a subword-granularity adaptive addressing and aggregation layer, and a whole-word-granularity adaptive addressing and aggregation layer.
11. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the corresponding operation of the method according to any one of claims 1-10.
12. A computer storage medium having stored thereon a computer program which, when executed by a processor, carries out the method of any one of claims 1-10.
13. A computer program product comprising computer instructions for instructing a computing device to perform operations corresponding to the method of any one of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211034977.6A CN115393864A (en) | 2022-08-26 | 2022-08-26 | Text recognition method, electronic device and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211034977.6A CN115393864A (en) | 2022-08-26 | 2022-08-26 | Text recognition method, electronic device and computer storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115393864A (en) | 2022-11-25
Family
ID=84122147
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211034977.6A Pending CN115393864A (en) | 2022-08-26 | 2022-08-26 | Text recognition method, electronic device and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115393864A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118628843A (en) * | 2024-08-12 | 2024-09-10 | 北京小蝇科技有限责任公司 | Long tail target data classification method and device for medical data |
- 2022-08-26: CN application CN202211034977.6A filed; patent CN115393864A (en), status Pending
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118628843A (en) * | 2024-08-12 | 2024-09-10 | 北京小蝇科技有限责任公司 | Long tail target data classification method and device for medical data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Borisyuk et al. | Rosetta: Large scale system for text detection and recognition in images | |
CN109344884B (en) | Media information classification method, method and device for training picture classification model | |
JP6629678B2 (en) | Machine learning device | |
CN111709406B (en) | Text line identification method and device, readable storage medium and electronic equipment | |
EP3869385B1 (en) | Method for extracting structural data from image, apparatus and device | |
CN111476256A (en) | Model training method and device based on semi-supervised learning and electronic equipment | |
US8233726B1 (en) | Image-domain script and language identification | |
CN109598231A (en) | A kind of recognition methods of video watermark, device, equipment and storage medium | |
CN108959474B (en) | Entity relation extraction method | |
CN113254654B (en) | Model training method, text recognition method, device, equipment and medium | |
CN104978354B (en) | Text classification method and device | |
CN113051356A (en) | Open relationship extraction method and device, electronic equipment and storage medium | |
JP2006252333A (en) | Data processing method, data processor and its program | |
JP6763426B2 (en) | Information processing system, information processing method, and program | |
CN109783805B (en) | Network community user identification method and device and readable storage medium | |
WO2017188048A1 (en) | Preparation apparatus, preparation program, and preparation method | |
CN110751234A (en) | OCR recognition error correction method, device and equipment | |
CN116982089A (en) | Method and system for image semantic enhancement | |
CN113704623A (en) | Data recommendation method, device, equipment and storage medium | |
CN110222234B (en) | Video classification method and device | |
CN111898704A (en) | Method and device for clustering content samples | |
CN115393864A (en) | Text recognition method, electronic device and computer storage medium | |
CN110909768A (en) | Method and device for acquiring marked data | |
Hou et al. | Hybrid pyramid convolutional network for multiscale face detection | |
CN111488400B (en) | Data classification method, device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |