Detailed Description
First, an application scenario of the present disclosure is explained. The present disclosure may be applied to a character recognition scenario, in which a character recognition algorithm mainly includes two steps: character detection and character recognition. At present, character detection can be divided into two modes, namely single-character detection and text line extraction. Single-character detection directly detects individual characters in a target image, while text line extraction extracts character areas distributed in lines. Comparing the two modes: single-character detection is prone to missed detection, that is, one or more characters in the target image are not detected, which affects the accuracy of character recognition; text line extraction treats the characters distributed in a line as a whole, so missed detection is less likely to occur, but each character in the text line needs to be segmented after the text line is detected, which places a high demand on segmentation accuracy. The character recognition method also differs with the detection mode: when single-character detection is adopted, the extracted single characters can be recognized directly and respectively, and all the single characters are arranged and combined according to their position information to generate a final recognition result; when text line extraction is adopted, the characters in each text line are segmented first, the segmented characters are then recognized, and the character recognition results of the text lines are arranged and combined according to the position information of each text line to generate a final recognition result.
Current text images can be divided into document images and scene images. A document image generally contains a large number of characters, a regular character distribution, and a simple image background; in contrast, a scene image generally contains a small number of characters of rich types, randomly distributed characters, and a complex image background. Because the image characteristics of document images and scene images differ, a current character recognition algorithm cannot handle both at the same time, and character recognition must be performed through different algorithms respectively, resulting in poor universality of the character recognition algorithm.
In order to solve the above problem, the present disclosure provides a character recognition method, an apparatus, a storage medium, and an electronic device. The method determines an image category of a target image, determines a correction processing manner corresponding to the target image according to the image category, performs correction processing on the target image according to that correction processing manner, extracts at least one text line image from the corrected target image, and recognizes the characters to be recognized in the at least one text line image through a character recognition model. Because different image categories correspond to different correction processing manners, images of different categories can each be corrected according to the corresponding correction processing manner and then recognized by the same method, improving the universality of the character recognition algorithm.
The present disclosure is described in detail below with reference to specific examples.
FIG. 1 is a flow diagram illustrating a character recognition method according to an exemplary embodiment. As shown in FIG. 1, the method includes:
S101, determining an image category corresponding to a target image including characters to be recognized.
In this step, the image category may include a document image and a scene image. A document image generally contains a large number of characters, a regular character distribution, and a simple image background; in contrast, a scene image generally contains a small number of characters of rich types, randomly distributed characters, and a complex image background. Considering that document images and scene images have different image characteristics, different image categories correspond to different correction processing manners. These image categories are merely examples and do not limit the disclosure.
In a possible implementation manner, image samples of determined image categories may be obtained, and the image category corresponding to the target image may be determined according to the image samples. Further, the image samples may include document image samples and scene image samples, where the difference between the number of document image samples and the number of scene image samples is less than or equal to a preset threshold. Based on a deep learning method, a target classifier may be obtained by training a preset classifier on the document image samples and the scene image samples; when the target image is input into the target classifier, the target classifier outputs the image category corresponding to the target image.
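For illustration only, the following is a minimal PyTorch sketch of such a preset classifier trained on document and scene image samples; the network shape, the label convention, and the toy batch are hypothetical, not details fixed by the disclosure:

```python
import torch
import torch.nn as nn

class ImageCategoryClassifier(nn.Module):
    """Binary classifier: label 0 = document image, label 1 = scene image."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def train_step(model, optimizer, images, labels):
    # images: (N, 3, H, W) document/scene samples; labels: 0/1 categories.
    # The two sample counts are assumed roughly balanced, mirroring the
    # "difference <= preset threshold" condition described above.
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

model = ImageCategoryClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
images = torch.randn(8, 3, 224, 224)        # toy batch standing in for samples
labels = torch.randint(0, 2, (8,))
print(train_step(model, optimizer, images, labels))
```

At inference time, feeding the target image through the trained classifier and taking the argmax of its two outputs yields the image category used by the following steps.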
S102, performing correction processing on the target image in the correction processing manner corresponding to the image category.
When the image category is a document image: because the characters to be recognized in a document image are usually densely distributed, any inclination and/or distortion of those characters may affect the accuracy of character recognition. To avoid this problem, the present disclosure may perform correction processing on the document image, where the correction processing manner includes direction correction processing and/or distortion correction processing. In this case, performing correction processing on the target image in the correction processing manner corresponding to the image category may include the following steps:
S11, acquiring a first inclination angle between the characters to be recognized in the document image and the horizontal axis.
In a possible implementation manner, the first inclination angle may be obtained by a projection analysis method or a Hough transform method. Alternatively, the document image may be subjected to threshold segmentation to obtain a binary document image, and the first inclination angle may be obtained according to the pixel point information of the characters to be recognized in the binary document image; the specific process may refer to the prior art and is not repeated here.
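As one illustration of the Hough transform option, the following OpenCV sketch estimates the inclination angle from the near-horizontal lines found in the binarized image; the Canny thresholds, the vote threshold, and the median aggregation are hypothetical choices:

```python
import cv2
import numpy as np

def estimate_skew_angle(image_bgr):
    """Estimate the inclination angle (in degrees) between the text and
    the horizontal axis using the Hough line transform."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Threshold segmentation yields the binary image mentioned above.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    edges = cv2.Canny(binary, 50, 150)
    lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=200)
    if lines is None:
        return 0.0
    # theta is the angle of each line's normal; the line itself lies at
    # theta - 90 degrees from the horizontal axis.
    angles = [np.degrees(theta) - 90.0 for _, theta in lines[:, 0]]
    near_horizontal = [a for a in angles if abs(a) < 45]
    return float(np.median(near_horizontal)) if near_horizontal else 0.0
```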
S12, determining whether the first inclination angle is greater than or equal to a preset angle.
When the first inclination angle is greater than or equal to the preset angle, steps S13 and S14 are executed;
when the first inclination angle is smaller than the preset angle, step S14 is executed.
S13, performing direction correction processing on the document image.
The direction correction processing may continuously rotate the document image until the first inclination angle between the characters to be recognized in the document image and the horizontal axis is smaller than the preset angle.
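A minimal sketch of such a rotation loop, reusing the estimate_skew_angle sketch above; the default preset angle, the iteration bound, and the white border fill are hypothetical:

```python
import cv2

def correct_direction(image, preset_angle=1.0, max_iters=3):
    """Rotate the image until the measured skew falls below preset_angle;
    in practice one rotation by the measured angle usually suffices."""
    for _ in range(max_iters):
        angle = estimate_skew_angle(image)
        if abs(angle) < preset_angle:
            break
        h, w = image.shape[:2]
        matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        image = cv2.warpAffine(image, matrix, (w, h),
                               borderValue=(255, 255, 255))
    return image
```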
S14, determining whether the characters to be recognized in the document image have distortion.
When a scanner or a camera is used to capture a document image, inclined or curved text, or an inclined shooting angle, may distort the image, so that text lines that were originally horizontal or vertical become curved. The resulting interference between text lines affects the final recognition result of the characters to be recognized.
When there is distortion in the character to be recognized in the document image, step S15 is executed;
when there is no distortion in the character to be recognized in the document image, it is determined that the correction processing is completed.
S15, performing distortion correction processing on the document image.
The distortion correction processing may use the blank positions between the text lines to correct the text lines so that they are restored to a horizontal or vertical distribution; the specific process may refer to the prior art and is not repeated here.
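The dewarping itself is deferred to the prior art, but as an illustration of the blank positions it relies on, the following sketch locates the blank rows between text lines with a horizontal projection profile; the blank threshold is a hypothetical heuristic:

```python
import cv2

def find_line_gaps(document_bgr, blank_thresh=2):
    """Return the row indices that contain (almost) no ink; these blank
    positions between text lines are the anchors a distortion-correction
    step could use. The correction itself is not implemented here."""
    gray = cv2.cvtColor(document_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ink_per_row = binary.sum(axis=1) // 255   # ink pixel count per row
    return [y for y, count in enumerate(ink_per_row) if count <= blank_thresh]
```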
It should be noted that, for simplicity, the above method embodiments are expressed as a series of action combinations, but those skilled in the art should understand that the present disclosure is not limited by the described action sequence, because some steps may be performed in other orders or simultaneously. For example, steps S14 and S15 may be performed before step S11, in which case the distortion correction processing precedes the direction correction processing. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the disclosure.
In summary, based on the image features of the document image, steps S11 to S15 may correct the first inclination angle and distortion of the characters to be recognized in the document image, thereby improving the accuracy of character recognition in the subsequent steps.
When the image category is a scene image: because the characters to be recognized in a scene image are usually sparsely distributed in a small number of randomly placed text lines, the influence between text lines in the scene image is small and distortion correction processing is not required. For a scene image, the corresponding correction processing manner is therefore direction correction processing. Specifically, performing correction processing on the target image in the correction processing manner corresponding to the image category includes the following steps:
and S21, detecting the character area of the scene image to obtain at least one character area.
The text region detection may adopt any one of edge detection, region detection, texture detection, or learning-based detection; of course, two, three, or all four detection methods may also be combined. The above are merely examples and do not limit the disclosure.
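For illustration, the following sketch implements the region-detection option with OpenCV's MSER detector; the minimum-area filter is a hypothetical heuristic, and edge-, texture-, or learning-based detectors could be substituted or combined as described above:

```python
import cv2

def detect_text_regions(scene_bgr, min_area=100):
    """Detect candidate text regions in a scene image via MSER
    (maximally stable extremal regions), one region-based option."""
    gray = cv2.cvtColor(scene_bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)   # each box is (x, y, w, h)
    # Keep only regions large enough to plausibly contain characters.
    return [tuple(b) for b in boxes if b[2] * b[3] >= min_area]
```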
S22, sequentially acquiring a second inclination angle between the characters to be recognized in the at least one text region and the horizontal axis.
Similarly, the second inclination angle may be obtained by a projection analysis method or a Hough transform method; the scene image may also be subjected to threshold segmentation to obtain a binary scene image, and the second inclination angle may be obtained according to the pixel point information of the characters to be recognized in the binary scene image. The specific process may refer to the prior art and is not repeated here.
When the second inclination angle is greater than or equal to the preset angle, performing step S23;
and when the second inclination angle is smaller than the preset angle, determining that the correction processing is finished.
S23, performing direction correction processing on the at least one text region.
The direction correction processing may continuously rotate the text region until the second inclination angle between the characters to be recognized in the text region and the horizontal axis is smaller than the preset angle.
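Tying the two sketches above together, each detected region can be cropped and deskewed on its own; here `scene` is a hypothetical BGR image loaded elsewhere (e.g. via cv2.imread):

```python
# Step S23 per region: crop each detected text region, then apply the
# direction-correction sketch to that crop individually.
corrected_regions = [
    correct_direction(scene[y:y + h, x:x + w])
    for (x, y, w, h) in detect_text_regions(scene)
]
```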
In summary, based on the image features of the scene image, steps S21 to S23 may correct the second inclination angle of the characters to be recognized in the scene image, thereby improving the accuracy of character recognition in the subsequent steps.
S103, extracting at least one text line image from the corrected target image.
In this step, at least one text line image may be extracted based on a deep learning method, and specifically, the following steps may be included:
S31, extracting spatial features of the target image through the multilayer convolutional layers in a text line detection model.
The spatial features may be correlations between pixels in the target image.
S32, inputting the spatial features to a recurrent neural network layer in the text line detection model to obtain sequence features of the target image.
In this step, the recurrent neural network layer may be an LSTM (Long Short-Term Memory) network, a BLSTM (Bidirectional Long Short-Term Memory) network, or a GRU (Gated Recurrent Unit, an LSTM variant). The above are merely examples and do not limit the disclosure.
S33, acquiring candidate text boxes in the target image according to a preset rule, and classifying the candidate text boxes based on the sequence features.
In a possible implementation manner, sliding windows with preset sizes and preset aspect ratios may slide over the target image to extract the candidate text boxes; the specific process may refer to the prior art and is not described in detail in this disclosure.
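A minimal sketch of one such preset rule, in which windows of preset sizes and aspect ratios slide over the image at a fixed stride; all three parameters are hypothetical illustrations, not values fixed by the disclosure:

```python
def candidate_text_boxes(img_w, img_h, sizes=((16, 16),),
                         ratios=(0.5, 1.0, 2.0), stride=16):
    """Slide preset-size, preset-ratio windows over the image and return
    candidate text boxes as (x, y, w, h) tuples."""
    boxes = []
    for base_w, base_h in sizes:
        for r in ratios:
            w, h = int(base_w * r), base_h
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    boxes.append((x, y, w, h))
    return boxes
```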
The classification can be completed through a classification layer in the text line detection model. For example, the classification layer may be a softmax layer, whose input and output dimensions are consistent; when the dimension of the features fed to the softmax layer does not match its required dimension, a fully connected layer needs to be added before the softmax layer so that the dimensions become consistent.
S34, obtaining text box position information of the candidate text boxes by using a regression convolutional layer in the text line detection model.
S35, screening the candidate text boxes according to the text box position information and the classification results by using non-maximum suppression (NMS) to obtain the text line images.
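Putting steps S31 to S35 together, the following is a minimal PyTorch sketch of a text line detection model in the spirit described above: convolutional layers for spatial features, a BLSTM for sequence features, a softmax classification head, a box-regression head, and NMS screening. Every layer size and the anchor count are hypothetical:

```python
import torch
import torch.nn as nn
from torchvision.ops import nms

class TextLineDetector(nn.Module):
    def __init__(self, num_anchors=10):
        super().__init__()
        self.conv = nn.Sequential(                        # S31: spatial features
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.rnn = nn.LSTM(128, 128, bidirectional=True,  # S32: sequence features
                           batch_first=True)
        # The fully connected layer aligns the BLSTM output with the heads,
        # matching the note above about inserting an FC layer before softmax.
        self.fc = nn.Linear(256, 256)
        self.cls_head = nn.Linear(256, num_anchors * 2)   # S33: text / non-text
        self.reg_head = nn.Linear(256, num_anchors * 2)   # S34: box position

    def forward(self, x):
        f = self.conv(x)                                  # (N, 128, H/4, W/4)
        n, c, h, w = f.shape
        seq = f.permute(0, 2, 3, 1).reshape(n * h, w, c)  # one sequence per row
        seq, _ = self.rnn(seq)
        seq = torch.relu(self.fc(seq)).reshape(n, h, w, -1)
        cls = self.cls_head(seq).reshape(n, h, w, -1, 2).softmax(dim=-1)
        reg = self.reg_head(seq).reshape(n, h, w, -1, 2)
        return cls, reg

def screen_boxes(boxes_xyxy, text_scores, iou_threshold=0.5):
    """S35: screen the decoded candidate boxes by score and position
    using non-maximum suppression."""
    return boxes_xyxy[nms(boxes_xyxy, text_scores, iou_threshold)]

cls, reg = TextLineDetector()(torch.randn(1, 3, 64, 128))
print(cls.shape, reg.shape)   # per-anchor class probabilities and offsets
```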
S104, recognizing the characters to be recognized in the at least one text line image through a preset character recognition model.
Generally, character recognition is carried out in units of single characters: the text line is first segmented into characters, and a character classifier then predicts each one. However, when the text line image is complex, character segmentation is difficult and may damage the character structure, and the accuracy of segmentation directly affects the final recognition result. To avoid the low recognition accuracy caused by character segmentation, the text line image may be treated as a whole: the characters to be recognized in the text line image are not cut apart, and all the characters in the text line image are recognized directly, so that the context relationship between characters can be fully utilized during recognition.
Before this step, the method further includes: acquiring position information of the at least one text line image. After a text line image is determined in step S103, the position information corresponding to the text line image can be determined according to the text box position information. The characters to be recognized in the at least one text line image are then recognized through the preset character recognition model and the position information, where the preset character recognition model includes a deep learning layer, a recurrent network layer, and an encoding layer. Specifically, the character recognition process may include the following steps:
S41, extracting character features from the at least one text line image through the deep learning layer.
The deep learning layer may be a convolutional neural network (CNN), through which the at least one text line image is divided into a plurality of slices along the horizontal direction, each slice corresponding to a character feature; because adjacent slices may overlap, the character features carry a certain amount of context.
S42, inputting the extracted character features into the recurrent network layer to obtain the feature vectors corresponding to the at least one text line image.
The recurrent network layer may be an LSTM, a BLSTM, a GRU, or the like, so that the character features are further learned by the recurrent network layer to obtain the feature vector corresponding to each slice. The above are merely examples and do not limit the disclosure.
S43, inputting the feature vectors into the encoding layer to obtain an encoding result of the at least one text line image, and obtaining text information of the at least one text line image according to the encoding result.
In this step, the encoding layer may be a CTC (Connectionist Temporal Classification) layer, so that the encoding result may be obtained through the CTC layer. Because the text line image may include a plurality of characters to be recognized, the encoding result may include a plurality of codes; each code in the encoding result is matched against a preset code correspondence to obtain the character corresponding to that code, and the characters are arranged in the coding order of the plurality of codes to obtain the text information of the text line image. The preset code correspondence is a correspondence between code samples and character samples. The above is merely an example and does not limit the disclosure.
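For illustration, steps S41 to S43 can be sketched as a CRNN-style model with a greedy CTC-style decoder; the 32-pixel line height, the layer sizes, and the charset mapping (standing in for the preset code correspondence) are all hypothetical:

```python
import torch
import torch.nn as nn

class TextLineRecognizer(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.cnn = nn.Sequential(                      # S41: deep learning layer
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes + 1)      # +1 for the CTC blank

    def forward(self, x):                              # x: (N, 1, 32, W) lines
        f = self.cnn(x)                                # (N, 128, 8, W/2)
        n, c, h, w = f.shape
        seq = f.permute(0, 3, 1, 2).reshape(n, w, c * h)  # one slice per column
        seq, _ = self.rnn(seq)                         # S42: feature vectors
        return self.fc(seq).log_softmax(-1)            # (N, W/2, classes + 1)

def greedy_ctc_decode(logits, charset):
    """S43: collapse repeats, drop blanks (index 0), and map each remaining
    code to its character via charset, the code-to-character table."""
    best, out, prev = logits.argmax(-1)[0].tolist(), [], 0
    for code in best:
        if code != prev and code != 0:
            out.append(charset[code - 1])
        prev = code
    return "".join(out)

charset = "abcdefghijklmnopqrstuvwxyz0123456789"       # hypothetical table
model = TextLineRecognizer(num_classes=len(charset))
print(greedy_ctc_decode(model(torch.randn(1, 1, 32, 128)), charset))
```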
S44, arranging the text information of the at least one text line image in order according to the position information to obtain a target recognition result of the target image.
In this step, the order of the at least one text line image within the target image can be obtained according to the position information, so that the text information of the at least one text line image is sorted in that order to obtain the target recognition result.
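A minimal sketch of this ordering step, assuming each line result carries an (x, y, w, h) box taken from the text box position information; top-to-bottom, then left-to-right, is one hypothetical reading order:

```python
def assemble_recognition_result(line_results):
    """line_results: list of ((x, y, w, h), text) pairs, one per text line.
    Sorting by the y then x coordinate orders the text information and
    yields the target recognition result of the target image."""
    ordered = sorted(line_results, key=lambda r: (r[0][1], r[0][0]))
    return "\n".join(text for _, text in ordered)

print(assemble_recognition_result([((10, 80, 200, 24), "second line"),
                                   ((10, 20, 200, 24), "first line")]))
```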
It should be noted that the present disclosure is described by taking horizontally arranged characters to be recognized as an example. When the characters to be recognized are arranged vertically, at least one text column image may be extracted from the target image, and the characters to be recognized in the at least one text column image may be recognized through the preset character recognition model; the specific process may refer to the description of the text line image and is not repeated here.
By adopting the above method, the image category of the target image is first determined; the correction processing manner corresponding to the target image is then determined according to the image category, and the target image is corrected accordingly; next, at least one text line image is extracted from the corrected target image; finally, the characters to be recognized in the at least one text line image are recognized through the character recognition model. Because different image categories correspond to different correction processing manners, images of different categories can each be corrected according to the corresponding correction processing manner and then recognized, so that one method serves both document images and scene images, improving the universality of character recognition.
FIG. 2 is a block diagram illustrating a character recognition apparatus 20 according to an exemplary embodiment. As shown in FIG. 2, the apparatus 20 includes:
a determining module 201, configured to determine an image category corresponding to a target image including characters to be recognized, where different image categories correspond to different correction processing manners;
a correction module 202, configured to perform correction processing on the target image in the correction processing manner corresponding to the image category;
an extracting module 203, configured to extract at least one text line image from the corrected target image;
the recognition module 204 is configured to recognize the character to be recognized in at least one of the text line images through a preset character recognition model.
Optionally, the image category includes a document image and a scene image.
FIG. 3 is a block diagram illustrating the determining module 201 according to an exemplary embodiment. As shown in FIG. 3, the determining module 201 includes:
a first obtaining sub-module 2011, configured to obtain image samples of determined image categories;
a first determining sub-module 2012, configured to determine the image category corresponding to the target image according to the image samples.
FIG. 4 is a block diagram of the correction module 202 according to an exemplary embodiment. As shown in FIG. 4, when the image category is a document image, the correction processing manner includes direction correction processing and/or distortion correction processing; when the correction processing manner includes both the direction correction processing and the distortion correction processing, the correction module 202 includes:
a second obtaining sub-module 2021, configured to obtain a first inclination angle between the characters to be recognized in the document image and the horizontal axis;
a first correction sub-module 2022, configured to perform direction correction processing on the document image when the first inclination angle is greater than or equal to a preset angle;
a second determining sub-module 2023, configured to determine whether the characters to be recognized in the document image have distortion;
a second correction sub-module 2024, configured to perform distortion correction processing on the document image when the characters to be recognized in the document image have distortion.
FIG. 5 is a block diagram of the correction module 202 according to an exemplary embodiment. As shown in FIG. 5, when the image category is a scene image, the correction processing manner includes direction correction processing; the correction module 202 includes:
the detection submodule 2025 is configured to perform text region detection on the scene image to obtain at least one text region;
a third obtaining sub-module 2026, configured to sequentially obtain a second inclination angle between the characters to be recognized in the at least one text region and the horizontal axis;
the third correction submodule 2027 is configured to perform direction correction processing on at least one text region when the second inclination angle in the at least one text region is greater than or equal to a preset angle.
FIG. 6 is a block diagram illustrating the character recognition apparatus 20 according to an exemplary embodiment. As shown in FIG. 6, the apparatus further includes:
an obtaining module 305, configured to obtain position information of at least one text line image before the character to be recognized in the at least one text line image is recognized through a preset character recognition model;
the recognition module 304 is configured to recognize the character to be recognized in at least one of the text line images through the preset character recognition model and the position information.
FIG. 7 is a block diagram illustrating the recognition module 304 according to an exemplary embodiment, where the preset character recognition model includes a deep learning layer, a recurrent network layer, and an encoding layer. As shown in FIG. 7, the recognition module 304 includes:
an extracting submodule 3041, configured to perform character feature extraction on at least one text line image according to the deep learning layer;
a fourth obtaining submodule 3042, configured to input the extracted character features to the recurrent network layer to obtain the feature vectors corresponding to the at least one text line image;
a fifth obtaining submodule 3043, configured to input the feature vectors to the encoding layer to obtain an encoding result of the at least one text line image, and obtain text information of the at least one text line image according to the encoding result;
the sixth obtaining sub-module 3044 is configured to sequentially arrange the text information of at least one text line image according to the position information to obtain a target recognition result of the target image.
By adopting the above apparatus, the image category of the target image is first determined; the correction processing manner corresponding to the target image is then determined according to the image category, and the target image is corrected accordingly; next, at least one text line image is extracted from the corrected target image; finally, the characters to be recognized in the at least one text line image are recognized through the character recognition model. Because different image categories correspond to different correction processing manners, images of different categories can each be corrected according to the corresponding correction processing manner and then recognized, improving the universality of character recognition.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 8 is a block diagram illustrating an electronic device 800 according to an exemplary embodiment. As shown in FIG. 8, the electronic device 800 may include: a processor 801 and a memory 802. The electronic device 800 may also include one or more of a multimedia component 803, an input/output (I/O) interface 804, and a communication component 805.
The processor 801 is configured to control the overall operation of the electronic device 800 so as to complete all or part of the steps in the character recognition method. The memory 802 is used to store various types of data to support operation at the electronic device 800, such as instructions for any application or method operating on the electronic device 800 and application-related data, for example contact data, transmitted and received messages, pictures, audio, video, and so forth. The memory 802 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The multimedia component 803 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may further be stored in the memory 802 or transmitted through the communication component 805. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 804 provides an interface between the processor 801 and other interface modules, such as a keyboard, a mouse, or buttons, which may be virtual buttons or physical buttons. The communication component 805 is used for wired or wireless communication between the electronic device 800 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them, and the corresponding communication component 805 may accordingly include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the character recognition method described above.
In another exemplary embodiment, there is also provided a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the character recognition method described above. For example, the computer readable storage medium may be the memory 802 described above that includes program instructions that are executable by the processor 801 of the electronic device 800 to perform the character recognition method described above.
The preferred embodiments of the present disclosure are described in detail above with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Various simple modifications may be made to the technical solution of the present disclosure within the scope of its technical idea, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the present disclosure. In order to avoid unnecessary repetition, the various possible combinations are not separately described in this disclosure.
In addition, the various embodiments of the present disclosure may be combined in any manner, and such combinations should likewise be regarded as content disclosed herein, as long as they do not depart from the spirit of the present disclosure.