KR20220122566A

KR20220122566A - Text recognition model training method, text recognition method, and apparatus

Info

Publication number: KR20220122566A
Application number: KR1020220101802A
Authority: KR
Inventors: 청콴 장; 위에천 위; 위린 리; 지안지안 차오; 샤멍 친; 쿤 야오; 준위 한; 징투오 리우; 얼루이 딩; 징동 왕
Original assignee: Beijing Baidu Netcom Science Technology Co., Ltd.
Priority date: 2022-03-22
Filing date: 2022-08-16
Publication date: 2022-09-02
Also published as: CN114399769A; CN115035538A; CN114399769B; CN115035538B; JP2022177242A

Abstract

The present invention relates to a method for training a text recognition model, a method for recognizing text, and a device for recognizing text. The present invention relates to the field of artificial intelligence technology and, specifically, to the field of deep learning and computer vision technology, and the present invention can be applied to scenarios such as optical character recognition. The method for training a text recognition model comprises the steps of: performing mask prediction on a part of an obtained first sample image to obtain a predicted complete image corresponding to the first sample image; performing mask prediction on a part of the texts in an obtained second sample image to obtain predicted text content for the part of the texts; training a model according to the predicted complete image and the predicted text content to obtain a pre-training model; and generating a text recognition model based on the pre-training model, wherein the text recognition model is for performing text recognition on the image to be recognized. The present invention enables the pre-training model to learn stronger image vision reasoning ability and text meaning reasoning ability. Therefore, when the text recognition model generated based on the pre-training model performs the text recognition, the accuracy and the reliability of the text recognition are improved.

Description

TEXT RECOGNITION MODEL TRAINING METHOD, TEXT RECOGNITION METHOD, AND APPARATUS

This application relates to the field of artificial intelligence (AI), specifically to deep learning and computer vision, and can be applied to scenarios such as optical character recognition (OCR). In particular, it relates to a method for training a text recognition model, a text recognition method, and an apparatus.

OCR technology has attracted wide attention and is applied in many industries, such as education, finance, healthcare, transportation, and insurance.

In the prior art, OCR technology is combined with deep learning to build a text recognition model, and text recognition is performed on images based on that model.

However, such a text recognition model generally relies on visual information and identifies the text content in an image from visual information alone, so its recognition accuracy is comparatively low.

The present application provides a method for training a text recognition model, a text recognition method, and an apparatus, for improving the reliability of text recognition.

According to a first aspect of the present application, a method for training a text recognition model is provided, the method comprising:

performing mask prediction on a partial image in an obtained first sample image to obtain a predicted full image corresponding to the first sample image;

performing the mask prediction on partial text in an obtained second sample image to obtain predicted text content corresponding to the partial text; and

training based on the predicted full image and the predicted text content to obtain a pre-training model, and generating a text recognition model based on the pre-training model, wherein the text recognition model is used to perform text recognition on an image to be recognized.

According to a second aspect of the present application, a text recognition method is provided, the method comprising:

obtaining an image to be recognized, wherein the image to be recognized includes text; and

performing text recognition on the image to be recognized based on a pre-trained text recognition model to obtain the text content in the image to be recognized,

wherein the text recognition model is obtained by the method according to the first aspect.

According to a third aspect of the present application, an apparatus for training a text recognition model is provided, the apparatus comprising a prediction unit, a training unit, and a generating unit, wherein:

the prediction unit performs mask prediction on a partial image in an obtained first sample image to obtain a predicted full image corresponding to the first sample image;

the prediction unit further performs the mask prediction on partial text in an obtained second sample image to obtain predicted text content corresponding to the partial text;

the training unit trains based on the predicted full image and the predicted text content to obtain a pre-training model; and

the generating unit generates a text recognition model based on the pre-training model, wherein the text recognition model is used to perform text recognition on an image to be recognized.

According to a fourth aspect of the present application, a text recognition apparatus is provided, the apparatus comprising:

an obtaining unit that obtains an image to be recognized, wherein the image to be recognized includes text; and

a recognition unit that performs text recognition on the image to be recognized based on a pre-trained text recognition model to obtain the text content in the image to be recognized,

wherein the text recognition model is obtained by the method according to the first aspect.

According to a fifth aspect of the present application, an electronic device is provided, the electronic device comprising:

at least one processor; and

a memory communicatively connected to the at least one processor,

wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to the first aspect or the second aspect.

According to a sixth aspect of the present application, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions cause a computer to perform the method according to the first aspect or the second aspect.

According to a seventh aspect of the present application, a computer program is provided. The computer program is stored in a readable storage medium, at least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device performs the method according to the first aspect or the second aspect.

Under the technical solution of the present application, a predicted full image corresponding to the first sample image is obtained by mask prediction, predicted text content for partial text in the second sample image is obtained by mask prediction, a pre-training model is generated by combining the predicted full image and the predicted text content, and a text recognition model is generated based on the pre-training model. The pre-training model thereby learns stronger visual reasoning ability over images and stronger semantic reasoning ability over text, so that the text recognition model generated from it recognizes text with improved accuracy and reliability.

It should be understood that the content described in this section is not intended to identify key or essential features of the embodiments of the present application, nor to limit the scope of the present application. Other features of the present application will be readily understood from the following description.

The accompanying drawings are provided for a better understanding of the present solution and do not limit the present application. In the drawings:
FIG. 1 is a diagram according to a first embodiment of the present application.
FIG. 2 is a diagram according to a second embodiment of the present application.
FIG. 3 is a diagram according to a third embodiment of the present application.
FIG. 4 is a diagram according to a fourth embodiment of the present application.
FIG. 5 is a diagram according to a fifth embodiment of the present application.
FIG. 6 is a diagram according to a sixth embodiment of the present application.
FIG. 7 is a diagram according to a seventh embodiment of the present application.
FIG. 8 is a diagram according to an eighth embodiment of the present application.
FIG. 9 is a diagram according to a ninth embodiment of the present application.
FIG. 10 is a block diagram of an electronic device for implementing the method for training a text recognition model and the text recognition method according to embodiments of the present application.

Exemplary embodiments of the present application are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art should understand that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted below.

When OCR technology is combined with deep learning to build a text recognition model, the model may be implemented with a "module separation" approach or with an "end-to-end model" approach.

Exemplarily, the "module separation" approach refers to building a text detection module, an information extraction module, and a text recognition module, and combining these three modules into a text recognition model.

With the "module separation" approach, each module must be built in advance and the modules must then be combined. The process is relatively cumbersome and inefficient, and errors accumulate from module to module, so a text recognition model built this way has comparatively low recognition accuracy.

Exemplarily, the "end-to-end model" approach refers to obtaining a single prediction result from the input end to the output end: for example, an image is fed to the input end and the predicted text content for that image is obtained at the output end.

However, the "end-to-end model" approach requires data labeling, for example labeling the true text content of each image, and the data supplied for training must be comparatively valid, so the trained text recognition model has comparatively low reliability.

A text recognition model trained by either of the above approaches generally makes only a two-class decision. When different verticals require different field classes, the text recognition model, and in particular its number of classification channels, must be redesigned and the model must be retrained; it cannot be reused.

For example, OCR models such as the image text detection model (EAST), the segmentation-based text detection model (DB), and the text detector (LOMO) can generally be used only for a two-class decision between text and non-text. To meet the field recognition needs of users in a specific vertical, the number of classification classes must be increased.

In some embodiments, a new text recognition model may be obtained by training with a class-detection extension approach, for example by adding a separate language model on top of an existing text recognition model to classify fields.

For example, if the text recognition model is the end-to-end text detection and recognition model (FOTS) or the text detection and recognition model (Mask Text Spotter), a separate language model such as BERT (Bidirectional Encoder Representations from Transformers) must be added to obtain a new text recognition model. Because a separate language model is added, additional training is required, which raises the training cost and lowers efficiency.

To avoid at least one of the above technical problems, the inventors of the present application arrived at the inventive concept of the present application through creative effort. Specifically, a pre-training model is obtained by training in the "end-to-end model" manner, that is, the model backbone is pre-trained end to end; the pre-training combines the visual dimension and the semantic dimension; and a text recognition model is generated based on the pre-trained backbone.

Based on this inventive concept, the present application provides a method for training a text recognition model, a text recognition method, and an apparatus. It relates to the field of artificial intelligence, specifically to deep learning and computer vision, and can be applied to scenarios such as OCR to improve the reliability of text recognition by the text recognition model.

FIG. 1 is a diagram according to a first embodiment of the present application. As shown in FIG. 1, the method for training a text recognition model provided in this embodiment includes the following steps (S101 to S103).

S101: Perform mask prediction on a partial image in an obtained first sample image to obtain a predicted full image corresponding to the first sample image.

Exemplarily, this embodiment may be performed by an apparatus for training a text recognition model (hereinafter, the "training apparatus"). The training apparatus may be a server (for example, a cloud server, a local server, or a server cluster), a terminal, a computer, a processor, a chip, or the like; this embodiment imposes no limitation.

Here, mask prediction refers to applying mask processing (also called occlusion processing) to part of an image, text, or the like, and then restoring the complete image, text, or the like as it was before the mask processing.

Accordingly, this step may be understood as follows: a first sample image containing text is obtained, mask processing is applied to a partial image of the first sample image, and the complete first sample image (that is, the predicted full image) is predicted from the image after mask processing.

In other words, this step may be understood as an image reconstruction task (masked image modelling), in which image reconstruction is performed on the first sample image by means of mask prediction.
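The image reconstruction task in S101 can be illustrated with a minimal sketch. It assumes the image is a plain 2-D list of floats split into square patches; the function names, the patch size, the 50% ratio, and the mean-fill "predictor" are all hypothetical stand-ins for the learned model, which the source does not specify at this point.

```python
import random


def mask_random_patches(image, patch, ratio, rng):
    """Zero out a random subset of patch x patch blocks.

    Returns the masked copy and the set of (row, col) patch coordinates hidden.
    """
    rows, cols = len(image) // patch, len(image[0]) // patch
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    hidden = set(rng.sample(coords, int(len(coords) * ratio)))
    masked = [row[:] for row in image]
    for r, c in hidden:
        for y in range(r * patch, (r + 1) * patch):
            for x in range(c * patch, (c + 1) * patch):
                masked[y][x] = 0.0
    return masked, hidden


def reconstruct_with_mean(masked, hidden, patch):
    """Stand-in predictor (assumption): fill hidden patches with the mean visible pixel."""
    visible = [
        v
        for y, row in enumerate(masked)
        for x, v in enumerate(row)
        if (y // patch, x // patch) not in hidden
    ]
    fill = sum(visible) / len(visible)
    return [
        [fill if (y // patch, x // patch) in hidden else v for x, v in enumerate(row)]
        for y, row in enumerate(masked)
    ]


def mse(a, b):
    """Mean squared error between two equally sized images."""
    n = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


rng = random.Random(0)
sample = [[float((x + y) % 4) for x in range(8)] for y in range(8)]
masked, hidden = mask_random_patches(sample, patch=2, ratio=0.5, rng=rng)
predicted = reconstruct_with_mean(masked, hidden, patch=2)
loss = mse(sample, predicted)  # reconstruction error a real model would minimize
```

In a real system the mean-fill function would be replaced by a trained network; the sketch only shows the data flow from sample image to masked image to predicted full image.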

S102: Perform mask prediction on partial text in an obtained second sample image to obtain predicted text content corresponding to the partial text.

In line with the above analysis, this step may be understood as follows: a second sample image containing text is obtained, mask processing is applied to partial text in the second sample image, and the text content of the masked partial text (that is, the predicted text content) is predicted from the text after mask processing.

In other words, this step may be understood as a text reconstruction task (masked OCR modelling), in which text reconstruction is performed on the second sample image by means of mask prediction, specifically on the partial text in the second sample image.
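The text reconstruction task in S102 can be sketched in the same spirit. For simplicity the sketch below masks characters of an already known string, whereas the source masks text that appears inside the second sample image; the `[MASK]` token, the function names, and the 30% ratio are assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not named in the source


def mask_text(chars, ratio, rng):
    """Hide a random subset of characters; return the masked sequence and the targets."""
    count = max(1, int(len(chars) * ratio))
    positions = sorted(rng.sample(range(len(chars)), count))
    masked = list(chars)
    targets = {}
    for i in positions:
        targets[i] = masked[i]  # ground truth the model must recover
        masked[i] = MASK
    return masked, targets


def score(predictions, targets):
    """Fraction of masked positions whose predicted character is correct."""
    hits = sum(1 for i, ch in targets.items() if predictions.get(i) == ch)
    return hits / len(targets)


rng = random.Random(1)
masked, targets = mask_text("INVOICE", ratio=0.3, rng=rng)
# A trained model would map `masked` (plus image context) to predictions for
# the hidden positions; `score` compares those predictions with `targets`.
```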

It should be noted that the first sample image and the second sample image may be the same image or different images; this embodiment imposes no limitation.

S103: Train based on the predicted full image and the predicted text content to obtain a pre-training model, and generate a text recognition model based on the pre-training model.

Here, the text recognition model is used to perform text recognition on an image to be recognized.

The pre-training model may be understood as the backbone of the text recognition model, or as the hidden layers of the text recognition model.

As the above analysis shows, the pre-training model is obtained by training on both image reconstruction and text reconstruction. It therefore learns stronger visual reasoning ability over images and stronger semantic reasoning ability over text, which gives the text recognition model generated from it higher accuracy and reliability.

In this embodiment, end-to-end model training can be achieved: the prediction result for each sample is output directly from the first sample image and the second sample image. For example, the prediction result for the first sample image is the predicted full image, and the prediction result for the second sample image is the predicted text content. No additional step is needed, such as detecting text in the second sample image manually or with OCR technology to obtain the text, so training efficiency is improved and training resources and cost are saved.
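One common way to train on both prediction results at once is to add an image reconstruction loss to a text prediction loss. The sketch below is an assumption about how S103 could be realized, not the method stated in the source: the mean squared error, the negative log-likelihood, the function names, and the weights `w_img` and `w_txt` are all hypothetical.

```python
import math


def reconstruction_loss(predicted, target):
    """Mean squared error over flattened pixel values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def text_loss(probabilities, target_ids):
    """Average negative log-likelihood of the true character at each masked slot."""
    total = sum(math.log(dist[t]) for dist, t in zip(probabilities, target_ids))
    return -total / len(target_ids)


def pretraining_loss(img_pred, img_true, txt_probs, txt_true, w_img=1.0, w_txt=1.0):
    """Weighted sum of the two objectives (the weighting scheme is an assumption)."""
    return (w_img * reconstruction_loss(img_pred, img_true)
            + w_txt * text_loss(txt_probs, txt_true))


loss = pretraining_loss(
    img_pred=[0.0, 0.5, 1.0],              # predicted pixels of the full image
    img_true=[0.0, 1.0, 1.0],              # pixels of the original sample image
    txt_probs=[[0.5, 0.5], [0.25, 0.75]],  # per-slot distribution over 2 characters
    txt_true=[0, 1],                       # true character index per masked slot
)
```

Minimizing a combined quantity of this kind is what lets a single backbone pick up both visual and semantic cues; the relative weighting would need tuning in practice.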

As the above analysis shows, this embodiment of the present application provides a method for training a text recognition model, comprising: performing mask prediction on a partial image in an obtained first sample image to obtain a predicted full image corresponding to the first sample image; performing mask prediction on partial text in an obtained second sample image to obtain predicted text content corresponding to the partial text; training based on the predicted full image and the predicted text content to obtain a pre-training model; and generating a text recognition model based on the pre-training model, wherein the text recognition model is used to perform text recognition on an image to be recognized. Through these technical features, the pre-training model learns stronger visual reasoning ability over images and stronger semantic reasoning ability over text, so that the text recognition model generated from the pre-training model recognizes text with improved accuracy and reliability.

FIG. 2 is a diagram according to a second embodiment of the present application. As shown in FIG. 2, the method for training a text recognition model provided in this embodiment includes steps S201 to S203.

S201: Obtain a target object.

Here, the target object includes a first sample image and a second sample image.

It should be understood that, to avoid redundancy, technical features of this embodiment that are the same as those of the above embodiment are not described again.

S202: Randomly occlude a partial object in the target object and, based on the unoccluded part of the target object, predict the occluded partial object to obtain a prediction result.

Here, if the target object is the first sample image, the partial object is a partial image and the prediction result is the predicted full image.

If the target object is the second sample image, the partial object is partial text and the prediction result is the predicted text content.

In some embodiments, predicting the occluded partial object based on the unoccluded part of the target object to obtain a prediction result includes the following steps.

Step 1: Extract the object features corresponding to the unoccluded part of the target object to obtain first object features.

Step 2: Based on the first object features, predict the occluded partial object in the target object to obtain a prediction result.

Here, if the target object is the first sample image, the first object features are first visual features. If the target object is the second sample image, the first object features are first semantic features.
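The two steps of S202 apply to both kinds of target object, which can be shown with one generic routine. This is a toy sketch under stated assumptions: the target object is reduced to a flat list of units (patch intensities or characters), and the `encode` and `predict` callables are trivial hypothetical stand-ins for the learned feature extractor and predictor.

```python
import random
from collections import Counter


def masked_prediction(units, ratio, encode, predict, rng):
    """Occlude random units, encode the visible ones, predict the occluded ones."""
    count = max(1, int(len(units) * ratio))
    hidden = set(rng.sample(range(len(units)), count))
    visible = {i: u for i, u in enumerate(units) if i not in hidden}
    features = encode(visible)           # step 1: first object features
    guesses = predict(features, hidden)  # step 2: prediction for occluded units
    return hidden, guesses


def mean_feature(visible):
    """Toy 'visual feature': average intensity of the visible patches."""
    return sum(visible.values()) / len(visible)


def most_common_char(visible):
    """Toy 'semantic feature': most frequent visible character."""
    return Counter(visible.values()).most_common(1)[0][0]


def fill_with_feature(feature, hidden):
    """Toy predictor: use the feature itself as the guess for every hidden unit."""
    return {i: feature for i in hidden}


rng = random.Random(2)
# Target object = first sample image (units are patch intensities).
patch_hidden, patch_guess = masked_prediction(
    [0.2, 0.4, 0.6, 0.8], 0.5, mean_feature, fill_with_feature, rng)
# Target object = second sample image (units are characters of its text).
char_hidden, char_guess = masked_prediction(
    list("aaab"), 0.25, most_common_char, fill_with_feature, rng)
```

Passing the encoder and predictor as parameters mirrors the way the source describes a single procedure whose feature type depends on the target object.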

S203: Train based on the predicted full image and the predicted text content to obtain a pre-training model, and generate a text recognition model based on the pre-training model.

Here, the text recognition model is used to perform text recognition on an image to be recognized.

To help the reader understand the implementation principle of the present application more fully, the above embodiments (the embodiments shown in FIG. 1 and FIG. 2) are described in detail below with reference to FIG. 3.

FIG. 3 is a diagram according to a third embodiment of the present application. As shown in FIG. 3, the method for training a text recognition model provided in this embodiment includes the following steps (S301 to S307).

S301: Obtain a first sample image.

Likewise, to avoid redundancy, technical features of this embodiment that are the same as those of the above embodiments are not described again.

S302: Randomly occlude a partial image in the first sample image.

It should be understood that training a network model is generally an iterative process. In this embodiment, every training iteration randomly occludes a partial image of the first sample image, so a single first sample image may suffice. The number of first sample images may of course also be more than one; this embodiment imposes no limitation.

S303: Based on the unmasked part of the first sample image, predict the masked part of the first sample image to obtain the predicted complete image.

For example, after the first sample image is randomly masked, part of the image is masked and the rest is not, so the complete first sample image (that is, the predicted complete image) can be determined from the unmasked part.

In this embodiment, determining the predicted complete image by combining random masking with prediction increases the non-determinism of the training process, which in turn improves the reliability with which the resulting pre-training model restores a complete image.

S302 to S303 may be implemented based on a masked autoencoder (MAE). In other words, the first sample image may be input to the masked autoencoder, which outputs the predicted complete image.
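The random masking of S302 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the image has already been divided into a fixed number of patches, and the function name `random_patch_mask`, the 4x4 patch grid, and the 0.75 mask ratio are all hypothetical choices that the source does not specify.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices to mask in one training iteration.

    Returns (masked, visible), two sorted lists of patch indices.
    The patch grid and the mask ratio are assumptions for illustration.
    """
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return sorted(masked), visible


# A new random subset is drawn in every iteration, so even a single
# sample image yields different masked inputs across iterations.
rng = random.Random(0)
for iteration in range(3):
    masked, visible = random_patch_mask(16, 0.75, rng)
    print(iteration, "masked:", masked, "visible:", visible)
```

Because the masked subset changes from iteration to iteration, the model cannot memorize one fixed completion, which is the effect the paragraph above attributes to random masking.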

In some embodiments, S303 may include the following steps.

Step 1: Extract the visual features corresponding to the unmasked part of the first sample image to obtain first visual features.

Here, visual features include texture features, contour features, color features, shape features, and the like, which are not listed exhaustively here.

Correspondingly, the first visual features are the texture, contour, color, shape, and similar features of the unmasked part of the first sample image.

Step 2: Based on the first visual features, predict the masked part of the first sample image to obtain the predicted complete image.

In this embodiment, obtaining the predicted complete image from the visual features of the unmasked part (texture, contour, color, shape, and so on) amounts to obtaining it from the visual context, so that training yields a pre-training model that has learned contextual knowledge from visual cues.

In some embodiments, Step 2 may include the following sub-steps.

First sub-step: Based on the first visual features, predict the visual features corresponding to the masked part of the first sample image to obtain second visual features.

For example, with reference to the analysis above, this sub-step can be understood as predicting the visual features (texture, contour, color, shape, and so on) of the masked part from the corresponding visual features of the unmasked part.

Second sub-step: Based on the second visual features, determine the masked part of the first sample image.

For example, once the visual features of the masked part have been obtained, the masked part of the image can be filled in and restored based on those features.

Third sub-step: Generate the predicted complete image based on the unmasked part of the first sample image and the determined masked part of the first sample image.

With reference to the analysis above, filling in and restoring the masked part recovers that part of the image. The unmasked part and the recovered masked part are then spliced together to obtain the predicted complete image. In other words, the first sample image is reconstructed so that the predicted complete image closely matches the first sample image, which improves the accuracy and reliability of the predicted complete image.
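The splicing in the third sub-step can be sketched as below. This is an illustrative sketch under assumptions: patches are stored in dictionaries keyed by patch index, and the function name `splice_full_image` is hypothetical. The predicted patches would come from a decoder, which is not modeled here.

```python
def splice_full_image(visible, predicted, num_patches):
    """Merge unmasked patches with predicted patches, in patch order.

    visible and predicted map a patch index to that patch's content.
    Together they must cover every index exactly once.
    """
    overlap = visible.keys() & predicted.keys()
    if overlap:
        raise ValueError(f"patches both visible and predicted: {sorted(overlap)}")
    full = []
    for index in range(num_patches):
        if index in visible:
            full.append(visible[index])      # keep the original content
        elif index in predicted:
            full.append(predicted[index])    # use the restored content
        else:
            raise KeyError(f"patch {index} is neither visible nor predicted")
    return full


# Toy example with strings standing in for pixel data.
visible = {0: "a", 2: "c"}
predicted = {1: "B", 3: "D"}
print(splice_full_image(visible, predicted, 4))
```

Keeping the unmasked patches untouched means that only the masked positions can differ from the first sample image, so any reconstruction error is confined to the predicted patches.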

S304: Acquire a second sample image.

As noted in the analysis above, the first sample image and the second sample image may be the same image. If they are the same image, this step may be omitted.

S305: Randomly mask part of the text in the second sample image.

Likewise, training a network model is generally an iterative process. In this embodiment, every training iteration randomly masks part of the text in the second sample image, so a single second sample image may suffice. Of course, there may also be a plurality of second sample images, and this embodiment places no limit on the number.

For example, some of the words or phrases in the second sample image may be randomly masked.
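A minimal sketch of random text masking follows. It works on a token list standing in for the text of the second sample image, which is an assumption made for illustration (the source masks text within the image itself). The names `mask_tokens` and `MASK`, and the 0.3 mask ratio, are hypothetical.

```python
import random

MASK = "[MASK]"  # placeholder symbol; an assumption, not taken from the source


def mask_tokens(tokens, mask_ratio, rng):
    """Randomly replace some tokens with MASK.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token that the model should predict.
    """
    if not tokens:
        return [], {}
    num_masked = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), num_masked))
    masked_tokens = list(tokens)
    targets = {}
    for position in positions:
        targets[position] = masked_tokens[position]
        masked_tokens[position] = MASK
    return masked_tokens, targets


tokens = ["date", "train", "number", "seat", "number", "price"]
masked_tokens, targets = mask_tokens(tokens, 0.3, random.Random(0))
print(masked_tokens)
print(targets)
```

The `targets` dictionary records what was hidden, so the predicted text content can later be compared against it.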

S306: Based on the unmasked text in the second sample image, predict the masked text in the second sample image to obtain the predicted text content.

For example, after the second sample image is randomly masked, part of its text is masked and the rest is not, and the text content of the masked part (that is, the predicted text content) can be determined from the unmasked text.

In this embodiment, determining the text content by combining random masking with prediction increases the non-determinism of the training process, which in turn improves the reliability with which the resulting pre-training model restores a complete image.

S305 to S306 may be implemented based on a masked language model (MLM). In other words, the second sample image may be input to the masked language model, which outputs the predicted text content.

In some embodiments, S306 may include the following steps.

Step 1: Extract the semantic features corresponding to the unmasked text in the second sample image to obtain first semantic features.

Here, semantic features are features of the logical relationships between character strings. Correspondingly, the first semantic features can be understood as features of the logical relationships between the character strings in the unmasked text, or as features of the associations between the individual characters (characters and/or words) in the unmasked text.

Step 2: Based on the first semantic features, predict the masked text in the second sample image to obtain the predicted text content.

In this embodiment, obtaining the predicted text content from the semantic features of the unmasked text (such as the logical relationships between its character strings) amounts to obtaining it from the semantic context, so that training yields a pre-training model that has learned contextual knowledge from semantic cues.

In some embodiments, Step 2 may include the following sub-steps.

First sub-step: Based on the first semantic features, predict the semantic features corresponding to the masked text in the second sample image to obtain second semantic features.

For example, with reference to the analysis above, this sub-step can be understood as predicting the semantic features of the masked text (such as the logical relationships between its character strings) from the corresponding semantic features of the unmasked text.

Second sub-step: Based on the second semantic features, generate the predicted text content.

For example, once the semantic features of the unmasked text (such as the logical relationships between its character strings) have been obtained, the semantic features of the masked text can be filled in and restored based on them.

With reference to the analysis above, filling in and restoring the semantic features of the masked text recovers those features, and the text content corresponding to them (that is, the predicted text content) is then determined. The predicted text content therefore closely matches the actual content of the masked text, which improves the accuracy and reliability of the predicted text content.

S307: Train based on the predicted complete image and the predicted text content to obtain a pre-training model, and generate a text recognition model based on the pre-training model.

FIG. 4 is a diagram according to a fourth embodiment of the present application. As shown in FIG. 4, the method for training a text recognition model provided in this embodiment includes the following steps (S401 to S405).

S401: Perform mask prediction on part of an acquired first sample image to obtain a predicted complete image corresponding to the first sample image.

S402: Perform mask prediction on part of the text in an acquired second sample image to obtain predicted text content corresponding to that part of the text.

S403: Train based on the predicted complete image and the predicted text content to obtain a pre-training model.

For example, a base network model may be trained based on the predicted complete image and the predicted text content to obtain the pre-training model.

For instance, the model parameters of the base network model may be adjusted based on the predicted complete image and the predicted text content to obtain the pre-training model.
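The source does not state which loss functions drive the parameter adjustment. The sketch below shows one plausible arrangement, purely as an assumption: a mean squared error between predicted and original pixel values for the image branch, a mean negative log-likelihood over the masked tokens for the text branch, and a weighted sum of the two. All function names and the `text_weight` parameter are hypothetical.

```python
import math


def reconstruction_loss(predicted_patches, original_patches):
    """Mean squared error over all pixel values of the masked patches."""
    total, count = 0.0, 0
    for predicted, original in zip(predicted_patches, original_patches):
        for p, o in zip(predicted, original):
            total += (p - o) ** 2
            count += 1
    return total / count


def masked_text_loss(predicted_probs, original_tokens):
    """Mean negative log-likelihood of the original masked tokens.

    predicted_probs holds one dict per masked position, mapping each
    candidate token to its predicted probability.
    """
    log_likelihood = sum(
        math.log(probs[token])
        for probs, token in zip(predicted_probs, original_tokens)
    )
    return -log_likelihood / len(original_tokens)


def pretraining_loss(image_loss, text_loss, text_weight=1.0):
    """Weighted sum of the two branch losses (the weighting is an assumption)."""
    return image_loss + text_weight * text_loss


image_loss = reconstruction_loss([[1.0, 2.0]], [[1.0, 4.0]])
text_loss = masked_text_loss([{"seat": 0.5, "date": 0.5}], ["seat"])
print(image_loss, text_loss, pretraining_loss(image_loss, text_loss))
```

Minimizing a combined objective of this kind would push the base network to improve at both image completion and text completion at once, which is the behavior S403 describes.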

Here, the base network model may be a Vision Transformer (ViT), a backbone neural network such as a convolutional neural network (CNN), or another network model, and this embodiment places no limit on it.

S404: Acquire a recognition task and a training image.

Here, the training image includes text.

The recognition task may be determined according to what the text recognition model is required to recognize. For example, the recognition task may be a character detection task, a text recognition task, a field classification task, or another recognition task, which are not listed exhaustively here.

S405: Based on the recognition task and the training image, train the pre-training model to obtain the text recognition model.

As noted in the analysis above, the pre-training model includes not only a model for learning contextual knowledge from visual cues but also a model for learning contextual knowledge from semantic cues. That is, the pre-training model is a multimodal feature extraction backbone, so a text recognition model obtained by training on top of it is able to recognize contextual knowledge from both visual cues and semantic cues.

In addition, training the pre-training model in combination with a recognition task makes it possible to obtain a text recognition model matched to each recognition requirement. This improves the flexibility and diversity of the resulting text recognition models, which can be applied to a variety of recognition scenarios and satisfy different recognition requirements.

In some embodiments, the pre-training model (that is, the multimodal feature extraction backbone) may be loaded into a text detection network such as EAST (Efficient and Accurate Scene Text detector), the segmentation-based character detection network DB (Differentiable Binarization), or the text detection network LOMO (Look More Than Once) to implement the character detection task of the text recognition model. As another example, the pre-training model may be loaded into a convolutional recurrent neural network (CRNN) to implement the text recognition task of the text recognition model, where the CRNN may use Connectionist Temporal Classification (CTC) decoding, attention-based decoding, Transformer decoding, or the like. As a further example, the pre-training model may be loaded into a fully connected (FC) network model or a convolutional neural network (CNN) model to implement the field classification task of the text recognition model.
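Of the decoding options named above, greedy CTC decoding is simple enough to sketch directly. The sketch assumes the recognition head emits one score vector per time step over an alphabet whose index 0 is the CTC blank. The function name `ctc_greedy_decode` and the toy alphabet are hypothetical, and a production system might use beam search instead.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding.

    1. Take the highest-scoring class at every time step.
    2. Collapse consecutive repeats of the same class.
    3. Drop blanks.
    """
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Index 0 ("-") is the blank. The blank between the two "1" frames is
# what allows a genuinely repeated character to survive the collapse.
alphabet = ["-", "G", "1", "2"]
frame_scores = [
    [0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0],
    [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
]
print(ctc_greedy_decode(frame_scores, alphabet))  # prints "G112"
```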

In some embodiments, S405 may include the following steps.

Step 1: Input the training image to the pre-training model to obtain the multimodal feature maps corresponding to the training image.

With reference to the analysis above, a multimodal feature map represents features of the training image in several dimensions, such as the visual dimension and the semantic dimension. For example, the multimodal feature map may represent the image features and the semantic features of the training image.

In some embodiments, the multimodal feature map may be expressed as (d*h*w), where d is the number of feature channels, and h and w are the height and width of the multimodal feature map.

Step 2: Generate the text recognition model based on the recognition task and the multimodal feature map.

In this embodiment, the multimodal feature map represents the training image in several dimensions: it captures both the visual features and the semantic features of the training image, and the features it captures are more reliable and more comprehensive. A text recognition model generated with the multimodal feature map is therefore more reliable and more accurate.

In some embodiments, Step 2 may include the following sub-steps.

First sub-step: Based on the multimodal feature map, obtain the predicted recognition result of the training image for the recognition task.

For example, the multimodal feature map may be input to a convolutional recurrent neural network to obtain the predicted recognition result (a predicted text result).

Second sub-step: Build the text recognition model based on the preset ground-truth recognition result of the training image and the predicted recognition result.

Here, the ground-truth recognition result may be obtained by labeling the training image in advance. This embodiment places no limit on the labeling method, which may be manual or automatic.

For example, a loss value between the ground-truth recognition result and the predicted recognition result may be computed. If the loss value is greater than (or equal to) a preset loss threshold, training continues with another iteration. Conversely, if the loss value is less than the preset loss threshold, construction of the text recognition model is complete. Alternatively, construction of the text recognition model is complete once the number of iterations reaches a preset number.
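The stopping rule described above can be sketched as a small loop. Here `train_step` is a hypothetical callable that runs one training iteration and returns the resulting loss value. The function name `fine_tune` and the numbers in the example are illustrative assumptions.

```python
def fine_tune(train_step, loss_threshold, max_iterations):
    """Iterate until the loss drops below the threshold or the
    iteration budget is used up.

    Returns (iterations_run, last_loss).
    """
    loss = float("inf")
    for iteration in range(1, max_iterations + 1):
        loss = train_step()
        if loss < loss_threshold:
            # Loss is below the preset threshold: the model is complete.
            return iteration, loss
        # Loss is greater than or equal to the threshold: keep training.
    # The preset number of iterations has been reached.
    return max_iterations, loss


# Toy loss sequence standing in for real training iterations.
losses = iter([0.9, 0.5, 0.2, 0.05, 0.01])
print(fine_tune(lambda: next(losses), loss_threshold=0.1, max_iterations=10))
```

In this toy run the loop stops at the fourth iteration, the first one whose loss (0.05) is below the threshold of 0.1.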

For example, suppose a text recognition model is to be trained to recognize the text on train tickets. The training images are then train ticket images. A train ticket image is input to the pre-training model, which outputs the multimodal feature map of the image. The multimodal feature map is input to, for example, a convolutional recurrent neural network, which outputs predicted recognition results such as the "date, train number, seat number" in the train ticket image. These predicted recognition results are compared with the pre-labeled "date, train number, seat number" (that is, the ground-truth recognition results), and training on this basis yields the text recognition model. The trained text recognition model can then be used to recognize the "date, train number, seat number" text in a ticket image to be recognized.

FIG. 5 is a diagram according to a fifth embodiment of the present application. As shown in FIG. 5, the text recognition method provided in this embodiment includes the following steps (S501 to S502).

S501: Acquire an image to be recognized.

Here, the image to be recognized includes text.

For example, this embodiment may be performed by a text recognition apparatus. The text recognition apparatus may be the same apparatus as the training apparatus or a different one, and this embodiment places no limit on it.

S502: Perform text recognition on the image to be recognized based on a pre-trained text recognition model to obtain the text content in the image to be recognized.

Here, the text recognition model is obtained by the method for training a text recognition model according to any one of the embodiments described above.

In some embodiments, S502 may include the following steps.

Step 1: Determine the multimodal feature map of the image to be recognized based on the text recognition model.

Step 2: Determine the text content in the image to be recognized based on the multimodal feature map.

Here, the multimodal feature map of the image to be recognized represents the visual features and the semantic features of that image.

For example, with reference to the analysis above, the text recognition model includes the pre-training model. If the text recognition model was obtained by loading the pre-training model into a convolutional recurrent neural network and training it, the text recognition model also includes the convolutional recurrent neural network, and this embodiment can be understood as follows.

The image to be recognized is input to the pre-training model, which outputs the multimodal feature map. The multimodal feature map is then input to the convolutional recurrent neural network, which outputs the text content in the image to be recognized.
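The two-stage inference flow just described can be sketched as a thin wrapper around two callables. Everything here is an assumption made for illustration: the class name `TextRecognizer`, the stub functions, the placeholder file name, and the placeholder feature-map shape stand in for a real pre-training model and recognition head.

```python
class TextRecognizer:
    """Chains a feature-extraction backbone with a recognition head."""

    def __init__(self, backbone, head):
        self.backbone = backbone  # image -> multimodal feature map
        self.head = head          # multimodal feature map -> text content

    def __call__(self, image):
        feature_map = self.backbone(image)
        return self.head(feature_map)


# Stubs standing in for trained networks; they only show the data flow.
def stub_backbone(image):
    # The (d, h, w) values are arbitrary placeholders.
    return {"source": image, "shape": (256, 8, 32)}


def stub_head(feature_map):
    return "text from " + feature_map["source"]


recognizer = TextRecognizer(stub_backbone, stub_head)
print(recognizer("ticket.png"))
```

Keeping the backbone and the head as separate components mirrors the way the same pre-training model can be combined with different task-specific networks.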

FIG. 6 is a diagram according to a sixth embodiment of the present application. As shown in FIG. 6, the training apparatus 600 for a text recognition model provided in this embodiment includes a prediction unit 601, a training unit 602, and a generation unit 603, wherein:

the prediction unit 601 is configured to perform mask prediction on part of an acquired first sample image to obtain a predicted complete image corresponding to the first sample image;

the prediction unit 601 is further configured to perform mask prediction on part of the text in an acquired second sample image to obtain predicted text content corresponding to that part of the text;

the training unit 602 is configured to train based on the predicted complete image and the predicted text content to obtain a pre-training model; and

the generation unit 603 is configured to generate a text recognition model based on the pre-training model, where the text recognition model is used to perform text recognition on an image to be recognized.

FIG. 7 is a diagram according to a seventh embodiment of the present application. As shown in FIG. 7, the training apparatus 700 for a text recognition model provided in this embodiment includes a prediction unit 701.

The prediction unit 701 is configured to perform mask prediction on part of an acquired first sample image to obtain a predicted complete image corresponding to the first sample image.

The prediction unit 701 is further configured to perform mask prediction on part of the text in an acquired second sample image to obtain predicted text content corresponding to that part of the text.

Referring to FIG. 7, in some embodiments, the prediction unit 701 includes:

a masking sub-unit 7011 configured to randomly mask part of a target object; and

a prediction sub-unit 7012 configured to predict the masked part of the target object based on the unmasked part of the target object to obtain a prediction result.

Here, if the target object is the first sample image, the part of the target object is part of the image and the prediction result is the predicted complete image; if the target object is the second sample image, the part of the target object is part of the text and the prediction result is the predicted text content.

In some embodiments, the prediction sub-unit 7012 includes:

an extraction module configured to extract the object features corresponding to the unmasked part of the target object to obtain first object features; and

a prediction module configured to predict the masked part of the target object based on the first object features to obtain the prediction result.

Here, if the target object is the first sample image, the first object features are the first visual features; if the target object is the second sample image, the first object features are the first semantic features.

In some embodiments, the target object is the first sample image and the first object features are the first visual features, and the prediction module includes:

a first prediction sub-module configured to predict, based on the first visual features, the visual features corresponding to the masked part of the first sample image to obtain second visual features;

a first determination sub-module configured to determine the masked part of the first sample image based on the second visual features; and

a first generation sub-module configured to generate the predicted complete image based on the unmasked part of the first sample image and the determined masked part of the first sample image.

In some embodiments, the target object is the second sample image and the first object features are the first semantic features, and the prediction module includes:

a second prediction sub-module configured to predict, based on the first semantic features, the semantic features corresponding to the masked text in the second sample image to obtain second semantic features; and

a second generation sub-module configured to generate the predicted text content based on the second semantic features.

The training apparatus 700 for a text recognition model further includes:

a training unit 702 configured to train based on the predicted complete image and the predicted text content to obtain a pre-training model; and

a generation unit 703 configured to generate a text recognition model based on the pre-training model, where the text recognition model is used to perform text recognition on an image to be recognized.

Referring to FIG. 7, in some embodiments, the generation unit 703 includes:

an acquisition sub-unit 7031 configured to acquire a recognition task and a training image, where the training image includes text; and

a training sub-unit 7032 configured to train the pre-training model based on the recognition task and the training image to obtain the text recognition model.

In some embodiments, the training sub-unit 7032 includes:

an input module configured to input the training image into the pre-trained model to obtain a multimodal feature map corresponding to the training image; and

a generation module configured to generate the text recognition model based on the task to be recognized and the multimodal feature map.

In some embodiments, the generation module includes:

a third prediction sub-module configured to predict, based on the multimodal feature map, a predicted recognition result of the training image under the task to be recognized; and

a building sub-module configured to build the text recognition model based on a preset ground-truth recognition result of the training image and the predicted recognition result.
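
The application does not specify how the predicted recognition result is compared with the preset ground-truth recognition result. As an illustration only, the discrepancy between the two can be measured with the character error rate, a common OCR measure based on Levenshtein distance. An actual training procedure would typically use a differentiable loss instead, and the function names below are hypothetical.

```python
def edit_distance(predicted, truth):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(truth) + 1))
    for i, p in enumerate(predicted, start=1):
        current = [i]
        for j, t in enumerate(truth, start=1):
            cost = 0 if p == t else 1
            current.append(min(previous[j] + 1,          # delete p
                               current[j - 1] + 1,       # insert t
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def character_error_rate(predicted, truth):
    """Edit distance normalized by the length of the ground-truth text."""
    if not truth:
        raise ValueError("ground-truth text must be non-empty")
    return edit_distance(predicted, truth) / len(truth)
```

A value of 0 means the predicted recognition result matches the ground truth exactly. Larger values indicate more character-level errors.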

FIG. 8 is a diagram according to an eighth embodiment of the present application. As shown in FIG. 8, the text recognition apparatus 800 provided in this embodiment includes:

an acquiring unit 801 configured to acquire an image to be recognized, where the image to be recognized includes text; and

a recognition unit 802 configured to perform text recognition on the image to be recognized based on a text recognition model trained in advance, to obtain the text content in the image to be recognized.

As can be seen with reference to FIG. 8, in some embodiments, the recognition unit 802 includes:

a first determining unit 8021 configured to determine a multimodal feature map of the image to be recognized based on the text recognition model; and

a second determining unit 8022 configured to determine the text content in the image to be recognized based on the multimodal feature map.
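
The two-stage flow of the recognition unit 802 can be sketched as below. The application does not describe how text is read out of the multimodal feature map. The sketch assumes a hypothetical `score_head` that turns the feature map into per-position character scores, followed by greedy decoding. `feature_extractor`, `score_head`, and `charset` are all placeholder names.

```python
def greedy_decode(scores, charset):
    """Pick the highest-scoring character at each position (assumed decoding rule).

    scores: list of per-position score lists, one score per charset entry.
    """
    text = []
    for position_scores in scores:
        best_index = max(range(len(position_scores)),
                         key=position_scores.__getitem__)
        text.append(charset[best_index])
    return "".join(text)

def recognize(image, feature_extractor, score_head, charset):
    """Two-stage recognition with hypothetical callables."""
    # Stage 1 (role of the first determining unit 8021): multimodal feature map.
    feature_map = feature_extractor(image)
    # Stage 2 (role of the second determining unit 8022): text content.
    scores = score_head(feature_map)
    return greedy_decode(scores, charset)
```

Keeping the feature extractor and the read-out separate mirrors the split between the first and second determining units.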

FIG. 9 is a diagram according to a ninth embodiment of the present application. As shown in FIG. 9, the electronic device 900 according to the present application may include a processor 901 and a memory 902.

The memory 902 is configured to store programs. The memory 902 may include volatile memory, for example random-access memory (RAM) such as static random-access memory (SRAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include non-volatile memory, for example flash memory. The memory 902 is configured to store computer programs (such as application programs and functional modules that implement the above methods), computer instructions, and the like, which may be stored in sections across one or more memories 902. The computer programs, computer instructions, data, and the like may be called by the processor 901.

The processor 901 is configured to execute the computer program stored in the memory 902 to implement each step of the methods in the above embodiments.

For details, refer to the relevant description in the foregoing method embodiments.

The processor 901 and the memory 902 may be separate components or may be integrated together. When the processor 901 and the memory 902 are separate components, they may be coupled via a bus 903.

The electronic device of this embodiment can carry out the technical solutions of the above methods; its specific implementation process and technical principles are the same and are not repeated here.

In the technical solutions of the present application, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of any user personal information involved (for example, face images) all comply with the relevant laws and regulations and do not violate public order and good morals.

According to embodiments of the present application, the present application further provides an electronic device, a readable storage medium, and a computer program product.

According to embodiments of the present application, the present application further provides a computer program stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device carries out the solution of any one of the above embodiments.

FIG. 10 is a block diagram of an exemplary electronic device 1000 that can be used to implement embodiments of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present application described and/or claimed herein.

As shown in FIG. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate operations and processing based on a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random-access memory (RAM) 1003. The RAM 1003 may also store various programs and data required for the operation of the device 1000. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to one another via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.

A plurality of components of the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard or a mouse; an output unit 1007, such as various types of displays and speakers; a storage unit 1008, such as a magnetic disk or an optical disc; and a communication unit 1009, such as a network card, a modem, or a wireless communication transceiver. The communication unit 1009 allows the device 1000 to exchange information/data with other devices over a computer network such as the Internet and/or various communication networks.

The computing unit 1001 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 1001 performs the methods and processing described above, for example the training method of the text recognition model and the text recognition method. For example, in some embodiments, the training method of the text recognition model and the text recognition method may be implemented as a computer software program that is explicitly stored in a machine-readable medium such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the training method of the text recognition model or the text recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the training method of the text recognition model or the text recognition method in any other suitable manner (for example, by means of firmware).

Various implementations of the systems and techniques described herein can be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be special-purpose or general-purpose, and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that when the program code is executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are carried out. The program code may execute entirely on a machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present application, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also provide interaction with the user. For example, the feedback provided to the user may be any form of sensory feedback (for example, visual, auditory, or tactile feedback), and input from the user may be received in any form (including sound, speech, or tactile input).

The systems and techniques described herein can be implemented in a computing system that includes a back-end component (for example, a data server), or a middleware component (for example, an application server), or a front-end component (for example, a user computer with a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described herein), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the high management difficulty and weak service scalability of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.

It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired result of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.

The specific embodiments described above do not limit the scope of protection of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims

1. A method of training a text recognition model, comprising:
performing mask prediction on a partial image in an obtained first sample image to obtain a predicted full image corresponding to the first sample image;
performing the mask prediction on partial text in an obtained second sample image to obtain predicted text content corresponding to the partial text; and
performing training based on the predicted full image and the predicted text content to obtain a pre-trained model, and generating a text recognition model based on the pre-trained model, wherein the text recognition model is used to perform text recognition on an image to be recognized.

2. The method of claim 1, wherein the mask prediction comprises:
randomly masking a partial object of a target object; and
predicting the masked partial object of the target object based on the unmasked object of the target object to obtain a prediction result,
wherein, if the target object is the first sample image, the partial object is the partial image and the prediction result is the predicted full image; and if the target object is the second sample image, the partial object is the partial text and the prediction result is the predicted text content.

3. The method of claim 2, wherein predicting the masked partial object of the target object based on the unmasked object of the target object to obtain the prediction result comprises:
extracting an object feature corresponding to the unmasked object of the target object to obtain a first object feature; and
predicting the masked partial object of the target object based on the first object feature to obtain the prediction result,
wherein, if the target object is the first sample image, the first object feature is a first vision feature; and if the target object is the second sample image, the first object feature is a first semantic feature.

4. The method of claim 3, wherein the target object is the first sample image and the first object feature is the first vision feature, and wherein predicting the masked partial object of the target object based on the first object feature to obtain the prediction result comprises:
predicting, based on the first vision feature, a vision feature corresponding to the masked partial image of the first sample image to obtain a second vision feature;
determining the masked partial image of the first sample image based on the second vision feature; and
generating the predicted full image based on the unmasked image of the first sample image and the determined masked partial image of the first sample image.

5. The method of claim 3 or 4, wherein the target object is the second sample image and the first object feature is the first semantic feature, and wherein predicting the masked partial object of the target object based on the first object feature to obtain the prediction result comprises:
predicting, based on the first semantic feature, a semantic feature corresponding to the masked partial text in the second sample image to obtain a second semantic feature; and
generating the predicted text content based on the second semantic feature.

6. The method of any one of claims 1 to 4, wherein generating the text recognition model based on the pre-trained model comprises:
acquiring a task to be recognized and a training image, wherein the training image includes text; and
training the pre-trained model based on the task to be recognized and the training image to obtain the text recognition model.

7. The method of claim 6, wherein training the pre-trained model based on the task to be recognized and the training image to obtain the text recognition model comprises:
inputting the training image into the pre-trained model to obtain a multimodal feature map corresponding to the training image; and
generating the text recognition model based on the task to be recognized and the multimodal feature map.

8. The method of claim 7, wherein generating the text recognition model based on the task to be recognized and the multimodal feature map comprises:
predicting, based on the multimodal feature map, a predicted recognition result of the training image under the task to be recognized; and
building the text recognition model based on a preset ground-truth recognition result of the training image and the predicted recognition result.

9. A text recognition method, comprising:
acquiring an image to be recognized, wherein the image to be recognized includes text; and
performing text recognition on the image to be recognized based on a text recognition model trained in advance to obtain text content in the image to be recognized,
wherein the text recognition model is obtained by the method of training a text recognition model according to any one of claims 1 to 4.

10. The method of claim 9, wherein performing text recognition on the image to be recognized based on the text recognition model trained in advance to obtain the text content in the image to be recognized comprises:
determining a multimodal feature map of the image to be recognized based on the text recognition model, and determining the text content in the image to be recognized based on the multimodal feature map,
wherein the multimodal feature map of the image to be recognized represents vision features and semantic features of the image to be recognized.

11. An apparatus for training a text recognition model, comprising a prediction unit, a training unit, and a generating unit, wherein:
the prediction unit is configured to perform mask prediction on a partial image in an obtained first sample image to obtain a predicted full image corresponding to the first sample image;
the prediction unit is further configured to perform the mask prediction on partial text in an obtained second sample image to obtain predicted text content corresponding to the partial text;
the training unit is configured to perform training based on the predicted full image and the predicted text content to obtain a pre-trained model; and
the generating unit is configured to generate a text recognition model based on the pre-trained model, wherein the text recognition model is used to perform text recognition on an image to be recognized.

12. The apparatus of claim 11, wherein the prediction unit comprises:
a masking sub-unit configured to randomly mask a partial object of a target object; and
a prediction sub-unit configured to predict the masked partial object of the target object based on the unmasked object of the target object to obtain a prediction result,
wherein, if the target object is the first sample image, the partial object is the partial image and the prediction result is the predicted full image; and if the target object is the second sample image, the partial object is the partial text and the prediction result is the predicted text content.

13. The apparatus of claim 12, wherein the prediction sub-unit comprises:
an extraction module configured to extract an object feature corresponding to the unmasked object of the target object to obtain a first object feature; and
a prediction module configured to predict the masked partial object of the target object based on the first object feature to obtain the prediction result,
wherein, if the target object is the first sample image, the first object feature is a first vision feature; and if the target object is the second sample image, the first object feature is a first semantic feature.

14. The apparatus of claim 13, wherein the target object is the first sample image and the first object feature is the first vision feature, and wherein the prediction module comprises:
a first prediction sub-module configured to predict, based on the first vision feature, a vision feature corresponding to the masked partial image of the first sample image to obtain a second vision feature;
a first determining sub-module configured to determine the masked partial image of the first sample image based on the second vision feature; and
a first generation sub-module configured to generate the predicted full image based on the unmasked image of the first sample image and the determined masked partial image of the first sample image.

15. The apparatus of claim 13 or 14, wherein the target object is the second sample image and the first object feature is the first semantic feature, and wherein the prediction module comprises:
a second prediction sub-module configured to predict, based on the first semantic feature, a semantic feature corresponding to the masked partial text in the second sample image to obtain a second semantic feature; and
a second generation sub-module configured to generate the predicted text content based on the second semantic feature.

16. The apparatus of any one of claims 11 to 14, wherein the generating unit comprises:
an acquiring sub-unit configured to acquire a task to be recognized and a training image, wherein the training image includes text; and
a training sub-unit configured to train the pre-trained model based on the task to be recognized and the training image to obtain the text recognition model.

17. The apparatus of claim 16, wherein the training sub-unit comprises:
an input module configured to input the training image into the pre-trained model to obtain a multimodal feature map corresponding to the training image; and
a generation module configured to generate the text recognition model based on the task to be recognized and the multimodal feature map.

18. The apparatus of claim 17, wherein the generation module comprises:
a third prediction sub-module configured to predict, based on the multimodal feature map, a predicted recognition result of the training image under the task to be recognized; and
a building sub-module configured to build the text recognition model based on a preset ground-truth recognition result of the training image and the predicted recognition result.

19. A text recognition apparatus, comprising:
an acquiring unit configured to acquire an image to be recognized, wherein the image to be recognized includes text; and
a recognition unit configured to perform text recognition on the image to be recognized based on a text recognition model trained in advance to obtain text content in the image to be recognized,
wherein the text recognition model is obtained by the method of training a text recognition model according to any one of claims 1 to 4.

20. The apparatus of claim 19, wherein the recognition unit comprises:
a first determining unit configured to determine a multimodal feature map of the image to be recognized based on the text recognition model; and
a second determining unit configured to determine the text content in the image to be recognized based on the multimodal feature map,
wherein the multimodal feature map of the image to be recognized represents vision features and semantic features of the image to be recognized.

21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor,
wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to perform the method of training a text recognition model according to any one of claims 1 to 4, or cause the at least one processor to perform the text recognition method according to claim 9.

22. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions cause a computer to perform the method of training a text recognition model according to any one of claims 1 to 4, or cause the computer to perform the text recognition method according to claim 9.

23. A computer program stored in a computer-readable storage medium, wherein the computer program, when executed by a processor, implements the method of training a text recognition model according to any one of claims 1 to 4, or implements the text recognition method according to claim 9.