CN113723301B - OCR recognition row-splitting method and device for import goods customs declaration forms

Info

Publication number: CN113723301B
Application number: CN202111012168.0A
Authority: CN
Inventors: 洪志权; 卢山; 崔伟成; 李双
Original assignee: Guangzhou Xinsilu Information Technology Co ltd
Current assignee: Guangzhou Xinsilu Information Technology Co ltd
Priority date: 2021-08-31
Filing date: 2021-08-31
Publication date: 2024-08-30
Anticipated expiration: 2041-08-31
Also published as: CN113723301A

Abstract

The application discloses a row-splitting method and device for OCR recognition of import goods customs declaration forms. Based on a first coordinate corresponding to the table header of the commodity code, the method traverses the start character of the nth row within a preset coordinate range and determines whether that row begins a new row of the table by checking whether a preset number of its leading characters have a preset attribute, for example, whether they are all digits. The row height is then determined from the difference between the second ordinates of the start characters of two consecutive rows, so that column separation and row splitting are achieved at the same time. This solves the following technical problem: after OCR recognition of an import goods customs declaration form, the commodity number column and the commodity name and specification/model column lie so close together in the table that OCR generally recognizes them as a single column, and because the commodity number column is not accurately recognized, the boundaries between the rows of the commodity name and specification/model column cannot be distinguished.

Description

OCR recognition row-splitting method and device for import goods customs declaration forms

Technical Field

The application relates to the technical field of image recognition, and in particular to a row-splitting method and device for OCR recognition of import goods customs declaration forms.

Background

In the field of cross-border e-commerce, a customs brokerage enterprise usually obtains customs declaration materials from its clients, and these materials may be provided in electronic or paper form.

The import declaration materials provided by a client must include an import goods customs declaration form. If the form is provided on paper, it is converted into an electronic form by OCR recognition.

After OCR recognition of an import goods customs declaration form, the commodity number column and the commodity name and specification/model column lie so close together in the table that OCR generally recognizes them as a single column. Because the commodity number column is not accurately recognized, the boundaries between the rows of the commodity name and specification/model column cannot be distinguished.

Disclosure of Invention

The application provides a row-splitting method and device for OCR recognition of import goods customs declaration forms, which solve the following technical problem: after OCR recognition of an import goods customs declaration form, the commodity number column and the commodity name and specification/model column lie so close together in the table that OCR generally recognizes them as a single column, and because the commodity number column is not accurately recognized, the boundaries between the rows of the commodity name and specification/model column cannot be distinguished.

In view of this, a first aspect of the present application provides a row-splitting method for OCR recognition of import goods customs declaration forms, the method comprising:

S101, acquiring a first coordinate corresponding to the table header of the commodity code, wherein the first coordinate comprises a first abscissa and a first ordinate;

S102, determining the start character of the nth row within a preset coordinate range based on the first coordinate, wherein n is greater than or equal to 1;

S103, if a preset number of leading characters in the nth row have a preset attribute, recording a second coordinate of the start character, wherein the second coordinate comprises a second abscissa and a second ordinate, and the second ordinate is the top ordinate of the start character;

S104, subtracting the second ordinate of the nth row from the second ordinate of the (n-1)th row to determine the ordinate interval of the (n-1)th row;

S105, setting n = n + 1 and returning to step S102.

Optionally, step S102 specifically includes:

traversing downwards from the first ordinate within a preset coordinate range around the first abscissa, and determining the start character of the nth row encountered, wherein n is greater than or equal to 1.

Optionally, step S103 specifically includes:

starting from the start character of the nth row, identifying the attribute of a preset number of characters;

if the preset number of characters in the nth row have the preset attribute, recording a second coordinate of the start character, wherein the second coordinate comprises a second abscissa and a second ordinate, and the second ordinate is the top ordinate of the start character.

Optionally, step S103 further includes:

if any character among the preset number of characters in the nth row does not have the preset attribute, matching that character against a preset character template; if the match succeeds, converting the character into the corresponding character having the preset attribute; and if the match fails, ignoring the nth row.

Optionally, after step S102 and before step S103, the method further includes:

if a preset keyword is identified in the nth row, stopping the row-splitting processing; otherwise, continuing execution.

A second aspect of the present application provides a row-splitting device for OCR recognition of import goods customs declaration forms, the device comprising:

an acquisition unit, configured to acquire a first coordinate corresponding to the table header of the commodity code, wherein the first coordinate comprises a first abscissa and a first ordinate;

an identification unit, configured to determine the start character of the nth row within a preset coordinate range based on the first coordinate, wherein n is greater than or equal to 1;

an attribute judging unit, configured to record a second coordinate of the start character if a preset number of leading characters in the nth row have a preset attribute, wherein the second coordinate comprises a second abscissa and a second ordinate, and the second ordinate is the top ordinate of the start character;

a determining unit, configured to subtract the second ordinate of the nth row from the second ordinate of the (n-1)th row to determine the ordinate interval of the (n-1)th row;

an iteration unit, configured to set n = n + 1 and return to the identification unit.

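Optionally, the identification unit is specifically configured to:

traverse downwards from the first ordinate within a preset coordinate range around the first abscissa, and determine the start character of the nth row encountered, wherein n is greater than or equal to 1.

Optionally, the attribute judging unit is specifically configured to:

starting from the start character of the nth row, identify the attribute of a preset number of characters; and if the preset number of characters in the nth row have the preset attribute, record a second coordinate of the start character, wherein the second coordinate comprises a second abscissa and a second ordinate, and the second ordinate is the top ordinate of the start character.

Optionally, the attribute judging unit is further configured to:

if any character among the preset number of characters in the nth row does not have the preset attribute, match that character against a preset character template; if the match succeeds, convert the character into the corresponding character having the preset attribute; and if the match fails, ignore the nth row.

Optionally, the device further comprises:

an ending judging unit, configured to stop the row-splitting processing if a preset keyword is identified in the nth row, and to continue execution otherwise.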

From the above technical solutions, the embodiment of the present application has the following advantages:

The application provides a row-splitting method and device for OCR recognition of import goods customs declaration forms. Based on a first coordinate corresponding to the table header of the commodity code, the start character of the nth row is traversed within a preset coordinate range, and whether that row begins a new row of the table is determined by checking whether a preset number of its leading characters have a preset attribute, for example, whether they are all digits. The row height is then determined from the difference between the second ordinates of the start characters of two consecutive rows, so that column separation and row splitting are achieved at the same time. This solves the technical problem that, after OCR recognition of an import goods customs declaration form, the commodity number column and the commodity name and specification/model column lie so close together in the table that OCR generally recognizes them as a single column, and because the commodity number column is not accurately recognized, the boundaries between the rows of the commodity name and specification/model column cannot be distinguished.

Drawings

FIG. 1 is a flow chart of the row-splitting method for OCR recognition of import goods customs declaration forms according to the present application;

FIG. 2 is a schematic structural diagram of the row-splitting device for OCR recognition of import goods customs declaration forms according to the present application.

Detailed Description

In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The application provides a row-splitting method and device for OCR recognition of import goods customs declaration forms, which solve the technical problem that, after OCR recognition of an import goods customs declaration form, the commodity number column and the commodity name and specification/model column lie so close together in the table that OCR generally recognizes them as a single column, and because the commodity number column is not accurately recognized, the boundaries between the rows of the commodity name and specification/model column cannot be distinguished.

For ease of understanding, refer to FIG. 1, which is a flowchart of the row-splitting method for OCR recognition of import goods customs declaration forms according to an embodiment of the present application. As shown in FIG. 1, the method specifically includes:

S101, acquiring a first coordinate corresponding to the table header of the commodity code, wherein the first coordinate comprises a first abscissa and a first ordinate;

It should be noted that, after the commodity code header is recognized by OCR, the first character of the header is "商" (the first character of the Chinese header text for the commodity code), and the first coordinate corresponding to this character is obtained. The first abscissa may be the abscissa of either the left border or the right border of the "商" character, and the first ordinate may be either its top ordinate or its bottom ordinate.

S102, determining the start character of the nth row within a preset coordinate range based on the first coordinate, wherein n is greater than or equal to 1;

It should be noted that the start character of the nth row is determined within the preset coordinate range based on the first abscissa and the first ordinate of the header character obtained in step S101. The first start character found when traversing downwards from the first ordinate corresponds to the first row. The second start character found may correspond to the second row of commodity information, but in an actual import declaration form a complete commodity name and specification/model entry may span multiple lines, so the second start character may instead belong to the second line of the commodity name and specification/model. Further judgment is therefore required.

S103, if a preset number of leading characters in the nth row have a preset attribute, recording a second coordinate of the start character, wherein the second coordinate comprises a second abscissa and a second ordinate, and the second ordinate is the top ordinate of the start character;

It should be noted that if the first preset number of characters in the nth row have the preset attribute, the nth row can be determined to be the first line of a commodity code. For example, the first 10 characters are digits, or the first 2 characters are English letters and the 3rd to 10th characters are digits; the specific rule is set according to the coding rules of the commodity code. Once the start of a commodity code has been determined, the actual second coordinate of its start character needs to be recorded, and the top ordinate of the start character is the one recorded so that the rows can be split more accurately.
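The attribute check described for step S103 can be sketched as a pair of regular expressions, one per example rule in the text. This is only an illustration: the names `ALL_DIGITS`, `TWO_LETTERS_THEN_DIGITS`, and `starts_commodity_code` are hypothetical, and the real rules would be set according to the commodity-code scheme actually in use.

```python
import re

# Assumed rules, taken from the two examples in the text:
# ten leading digits, or two English letters followed by eight digits.
ALL_DIGITS = re.compile(r"[0-9]{10}")
TWO_LETTERS_THEN_DIGITS = re.compile(r"[A-Za-z]{2}[0-9]{8}")


def starts_commodity_code(line_text, rules=(ALL_DIGITS, TWO_LETTERS_THEN_DIGITS)):
    """Return True if the leading characters of the line satisfy any rule.

    re.match anchors at the start of the string, so only the first
    characters of the recognized line are examined.
    """
    return any(rule.match(line_text) for rule in rules)
```

A line that passes this check would be treated as the first line of a new commodity row; a line that fails it may be a continuation of the commodity name and specification/model.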

S104, subtracting the second ordinate of the nth row from the second ordinate of the (n-1)th row to determine the ordinate interval of the (n-1)th row;

After the second ordinate of the start character of the nth row is obtained, it is subtracted from the second ordinate of the start character determined for the previous row, that is, the (n-1)th row, so as to determine the ordinate interval of the (n-1)th row. For example, if the second ordinate of the start character of row 1 is 10 and the second ordinate of the start character of row 2 is 0, the row height of row 1 is 10 and its ordinate interval is 0 to 10.
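The arithmetic of step S104 can be illustrated with a short sketch. It assumes, as the worked example implies, that ordinates decrease from the top of the page to the bottom; the function name `row_intervals` and the tuple layout are hypothetical.

```python
def row_intervals(top_ordinates):
    """Turn the top ordinates of consecutive start characters into intervals.

    top_ordinates is ordered from the first row to the last. For each
    pair (row n-1, row n) the result holds (lower, upper, height) for
    row n-1, where height = ordinate of row n-1 minus ordinate of row n.
    The last row has no successor, so no interval is produced for it.
    """
    intervals = []
    for prev_top, cur_top in zip(top_ordinates, top_ordinates[1:]):
        height = prev_top - cur_top
        intervals.append((cur_top, prev_top, height))
    return intervals


# The example from the text: row 1 at ordinate 10, row 2 at ordinate 0.
print(row_intervals([10, 0]))  # [(0, 10, 10)]
```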

S105, setting n = n + 1 and returning to step S102.

It should be noted that an import declaration form may contain information on two or more commodities, so n is incremented and the method returns to step S102 to search for the start character of the next row.
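Taken together, steps S101 to S105 amount to a single pass over the recognized lines. The following is a minimal sketch under several assumptions that are not in the source: OCR output is available as `(text, x, top)` tuples per line, ordinates decrease downwards (as in the example for S104), the preset attribute is "ten leading digits", and the preset range is ±50 around the header abscissa. The name `split_rows` is hypothetical.

```python
import re

CODE_RULE = re.compile(r"[0-9]{10}")  # assumed preset attribute


def split_rows(lines, header_x, header_y, tolerance=50.0):
    """Return (code_text, lower, upper) for each row that has a successor.

    lines: iterable of (text, x, top) tuples, where x and top are the
    abscissa and top ordinate of the line's start character.
    header_x, header_y: the first coordinate obtained in S101.
    """
    # S102: keep lines near the header abscissa and below the header,
    # ordered from the top of the page downwards.
    candidates = sorted(
        (ln for ln in lines
         if abs(ln[1] - header_x) <= tolerance and ln[2] < header_y),
        key=lambda ln: ln[2],
        reverse=True,
    )
    # S103 and S105: record the top ordinate of every line whose
    # leading characters have the preset attribute, then move on.
    starts = [(text, top) for text, _x, top in candidates
              if CODE_RULE.match(text)]
    # S104: the interval of row n-1 runs from the top of row n up to
    # the top of row n-1.
    return [
        (prev_text, next_top, prev_top)
        for (prev_text, prev_top), (_next_text, next_top)
        in zip(starts, starts[1:])
    ]
```

Lines that fail the attribute check (for example, a wrapped commodity name) are skipped, so they fall inside the interval of the preceding commodity row rather than opening a new one.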

Further, step S102 specifically includes:

traversing downwards from the first ordinate within a preset coordinate range around the first abscissa, and determining the start character of the nth row encountered, wherein n is greater than or equal to 1.

It should be noted that the start character of the nth row is determined within the preset coordinate range based on the first abscissa and the first ordinate of the header character. For example, characters are searched for downwards from the first ordinate within ±50 of the first abscissa.

It will be appreciated that the first start character found when traversing downwards from the first ordinate corresponds to the first row. The second start character found may correspond to the second row of commodity information, but in an actual import declaration form a complete commodity name and specification/model entry may span multiple lines, so the second start character may instead belong to the second line of the commodity name and specification/model. Further judgment is therefore required.
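The downward traversal within ±50 of the first abscissa might look like the sketch below. The `OcrLine` record and the function `candidate_rows` are hypothetical, and the code assumes that ordinates decrease towards the bottom of the page; with a top-left image origin the comparison and the sort order would be reversed.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    text: str
    x: float    # abscissa of the line's start character
    top: float  # top ordinate of the line's start character


def candidate_rows(lines, header_x, header_y, tolerance=50.0):
    """Return lines below the header whose start character lies within
    `tolerance` of the header abscissa, ordered from top to bottom."""
    below = [
        ln for ln in lines
        if abs(ln.x - header_x) <= tolerance and ln.top < header_y
    ]
    return sorted(below, key=lambda ln: ln.top, reverse=True)
```

Each returned line is only a candidate: whether it opens a new commodity row is decided afterwards by the attribute check of step S103.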

Further, step S103 specifically includes:

starting from the start character of the nth row, identifying the attribute of a preset number of characters;

if the preset number of characters in the nth row have the preset attribute, recording a second coordinate of the start character, wherein the second coordinate comprises a second abscissa and a second ordinate, and the second ordinate is the top ordinate of the start character.

It should be noted that if, counting from the start character of the nth row, the first preset number of characters have the preset attribute, the row can be determined to be the first line of a commodity code. For example, the first 10 characters are digits, or the first 2 characters are English letters and the 3rd to 10th characters are digits; the specific rule is set according to the coding rules of the commodity code. Once the start of a commodity code has been determined, the actual second coordinate of its start character needs to be recorded, and the top ordinate of the start character is the one recorded so that the rows can be split more accurately.

Further, step S103 further includes:

if any character among the preset number of characters in the nth row does not have the preset attribute, matching that character against a preset character template; if the match succeeds, converting the character into the corresponding character having the preset attribute; and if the match fails, ignoring the nth row.

It should be noted that if, counting from the start character of the nth row, any of the first preset number of characters does not have the preset attribute, that character is matched against the preset character template. For example, the letter O or o is converted to the digit 0, l to 1, S or s to 5, b to 6, B to 8, q to 9, and so on. Conversely, digits may be converted to letters where the attribute rule of the actual commodity code requires a letter at a specific position. If the match succeeds, the mismatch is attributed to insufficient OCR accuracy, and the second coordinate of the start character is still recorded after all conversions have been made. If the match fails, the content of the row is ignored. For example, if the preset number of characters in line 2 cannot be matched, line 2 is likely part of the commodity name and specification/model; it is ignored and the search continues with the next line.
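The template matching can be sketched as a lookup table applied to the leading characters. The mapping below follows the examples in the text, but the exact letter cases (for instance, lowercase b for 6 and uppercase B for 8) are assumptions, as are the names `LETTER_TO_DIGIT` and `repair_code_prefix` and the all-digits rule for the first ten characters.

```python
# Assumed template: letters that OCR commonly confuses with digits.
LETTER_TO_DIGIT = {
    "O": "0", "o": "0",
    "l": "1",
    "S": "5", "s": "5",
    "b": "6",
    "B": "8",
    "q": "9",
}


def repair_code_prefix(text, length=10):
    """Coerce the first `length` characters of `text` to digits.

    Returns the repaired text if every leading character is a digit or
    has a template entry, and None if any character cannot be matched
    (in which case the row would be ignored).
    """
    if len(text) < length:
        return None
    fixed = []
    for ch in text[:length]:
        if "0" <= ch <= "9":
            fixed.append(ch)
        elif ch in LETTER_TO_DIGIT:
            fixed.append(LETTER_TO_DIGIT[ch])
        else:
            return None
    return "".join(fixed) + text[length:]
```

For instance, `repair_code_prefix("847l3O1000 laptop")` yields `"8471301000 laptop"`, while a line beginning with ordinary words returns `None` and is skipped.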

Further, after step S102 and before step S103, the method further includes:

if a preset keyword is identified in the nth row, stopping the row-splitting processing; otherwise, continuing execution.

It should be noted that the preset keyword is configured according to the content that appears at the end of the table in the import goods customs declaration form. If the preset keyword appears in the recognized content of the nth row, recognition of the table content has ended and no further row splitting is performed.
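The end-of-table check can be sketched as an early exit from the traversal. The source does not name the keyword, so the keyword list is left to the caller; the function name `rows_until_keyword` is hypothetical.

```python
def rows_until_keyword(line_texts, stop_keywords):
    """Return the lines that precede the first line containing a stop keyword.

    line_texts is ordered from the top of the table downwards. Once a
    keyword is found, that line and everything after it are left out.
    """
    kept = []
    for text in line_texts:
        if any(keyword in text for keyword in stop_keywords):
            break
        kept.append(text)
    return kept
```

Running this check before the attribute check keeps footer content that happens to start with digits from being mistaken for a commodity code.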

Referring to FIG. 2, which is a schematic structural diagram of the row-splitting device for OCR recognition of import goods customs declaration forms according to an embodiment of the present application, the device specifically includes:

an acquisition unit 201, configured to acquire a first coordinate corresponding to the table header of the commodity code, wherein the first coordinate comprises a first abscissa and a first ordinate;

an identification unit 202, configured to determine the start character of the nth row within a preset coordinate range based on the first coordinate, wherein n is greater than or equal to 1;

an attribute judging unit 203, configured to record a second coordinate of the start character if a preset number of leading characters in the nth row have a preset attribute, wherein the second coordinate comprises a second abscissa and a second ordinate, and the second ordinate is the top ordinate of the start character;

a determining unit 204, configured to subtract the second ordinate of the nth row from the second ordinate of the (n-1)th row to determine the ordinate interval of the (n-1)th row;

an iteration unit 205, configured to set n = n + 1 and return to the identification unit.

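Further, the identification unit is specifically configured to:

traverse downwards from the first ordinate within a preset coordinate range around the first abscissa, and determine the start character of the nth row encountered, wherein n is greater than or equal to 1.

Further, the attribute judging unit is specifically configured to:

starting from the start character of the nth row, identify the attribute of a preset number of characters; and if the preset number of characters in the nth row have the preset attribute, record a second coordinate of the start character, wherein the second coordinate comprises a second abscissa and a second ordinate, and the second ordinate is the top ordinate of the start character.

Further, the attribute judging unit is further configured to:

if any character among the preset number of characters in the nth row does not have the preset attribute, match that character against a preset character template; if the match succeeds, convert the character into the corresponding character having the preset attribute; and if the match fails, ignore the nth row.

Further, the device further comprises:

an ending judging unit, configured to stop the row-splitting processing if a preset keyword is identified in the nth row, and to continue execution otherwise.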

According to the row-splitting method and device for OCR recognition of import goods customs declaration forms provided by the embodiments of the present application, the start character of the nth row is traversed within a preset coordinate range based on the first coordinate corresponding to the table header of the commodity code, and whether that row begins a new row of the table is determined by checking whether a preset number of its leading characters have a preset attribute, for example, whether they are all digits. The row height is then determined from the difference between the second ordinates of the start characters of two consecutive rows, so that column separation and row splitting are achieved at the same time. This solves the technical problem that, after OCR recognition of an import goods customs declaration form, the commodity number column and the commodity name and specification/model column lie so close together in the table that OCR generally recognizes them as a single column, and because the commodity number column is not accurately recognized, the boundaries between the rows of the commodity name and specification/model column cannot be distinguished.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: only A is present, only B is present, or both A and B are present, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.

In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A row-splitting method for OCR recognition of import goods customs declaration forms, characterized by comprising the following steps:

S105, n=n+1, and returns to step S102;

the step S102 specifically includes:

Traversing downwards along the first ordinate within a preset coordinate range of the first abscissa, and determining the initial character of the traversed nth row, wherein n is more than or equal to 1;

The step S103 specifically includes:

If the preset number of characters in the nth row are preset attributes, recording a second coordinate of the initial character, wherein the second coordinate comprises a second abscissa and a second ordinate, and the second ordinate is a top ordinate of the initial character;

the step S103 further includes:

2. The row-splitting method for OCR recognition of import goods customs declaration forms according to claim 1, wherein after step S102 and before step S103, the method further comprises:

3. A row-splitting device for OCR recognition of import goods customs declaration forms, comprising:

an iteration unit, configured to set n = n + 1 and return to the identification unit;

the attribute judging unit is specifically configured to:

the attribute judgment unit is further configured to:

4. The row-splitting device for OCR recognition of import goods customs declaration forms according to claim 3, further comprising: