CN107358208B - Method and device for extracting structured information from PDF documents - Google Patents

Info

Publication number
CN107358208B
CN107358208B CN201710576556.9A CN201710576556A CN107358208B CN 107358208 B CN107358208 B CN 107358208B CN 201710576556 A CN201710576556 A CN 201710576556A CN 107358208 B CN107358208 B CN 107358208B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710576556.9A
Other languages
Chinese (zh)
Other versions
CN107358208A (en)
Inventor
徐龙
李德彦
杨宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingfu Intelligent Technology Co., Ltd
Original Assignee
China Science And Technology (beijing) Co Ltd
Beijing Shenzhou Taiyue Software Co Ltd
Application filed by China Science And Technology (beijing) Co Ltd and Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201710576556.9A
Publication of CN107358208A
Application granted
Publication of CN107358208B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G06V30/43 Editing text-bitmaps, e.g. alignment, spacing; Semantic analysis of bitmaps of text without OCR

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a method for extracting structured information from a PDF document. The method includes: obtaining the original pages of the PDF document; extracting from the original pages at least one actual page containing text content or titles; extracting the titles at each level, and the text content subordinate to each title, from the actual pages; and storing each title and its subordinate text content in a structured manner. The method extracts the titles at each level of a PDF document together with their subordinate text content and stores them in a structured manner to obtain structured information, so that the extraction can be automated and manual reprocessing is avoided, which is convenient and efficient.

Description

Method and device for extracting structured information from PDF documents
Technical field
This application relates to the field of information extraction from PDF documents, and in particular to a method for extracting structured information from a PDF document. The application further relates to a device for extracting structured information from a PDF document.
Background
PDF (Portable Document Format) is a file format developed by Adobe Systems for exchanging files in a manner independent of application program, operating system, and hardware; it is a fixed-layout document format. The pages of a PDF are relatively independent of one another, and the format faithfully reproduces every character, color, and image of the original. However, PDF is an unstructured storage format: it does not record the logical structure of the document and has no logical elements such as paragraphs or tables.
Information in a PDF document is usually extracted with OCR (Optical Character Recognition). However, the information extracted this way is rendered as vectors, and there is no logical relationship (such as adjacency or ordering) between the characters. The text formed by the extracted characters is merely a rendering matrix of x, y, z coordinates plus a rotation. The format and position of such text are largely arbitrary, and it must be processed again by hand before structured information with a clear hierarchy can be obtained.
Therefore, when existing methods are used to extract information from a PDF document, the format and position of the extracted text are arbitrary and structured information cannot be obtained conveniently. This is a problem that those skilled in the art urgently need to solve.
Summary of the invention
The present application provides a method and a device for extracting structured information from a PDF document, to solve the problem that the prior art cannot conveniently obtain structured information from a PDF document.
In a first aspect, the present application provides a method for extracting structured information from a PDF document, the method including:
obtaining the original pages of the PDF document;
extracting from the original pages at least one actual page containing text content or titles;
extracting the titles at each level, and the text content subordinate to each title, from the actual pages; and
storing each title and its subordinate text content in a structured manner.
With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting from the original pages at least one actual page containing text content or titles includes:
judging whether the original pages contain a catalogue page (table of contents), a header, or a footer; and
deleting the catalogue page, header, or footer from the original pages to obtain at least one actual page.
With reference to the first aspect and the above possible implementation, in a second possible implementation of the first aspect, the step of extracting the titles at each level, and the text content subordinate to each title, from the actual pages includes:
extracting the first-level titles in each actual page;
extracting the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extracting the content after the current first-level title in that actual page as the content corresponding to the current first-level title; and
taking each first-level title, together with the content corresponding to it, as one level-1 logical page.
If no next-level title exists in the level-1 logical pages, the step of storing each title and its subordinate text content in a structured manner includes:
storing each first-level title and the text content subordinate to it in a structured manner, where the text content subordinate to a first-level title is the content corresponding to that first-level title.
With reference to the first aspect and the above possible implementations, in a third possible implementation of the first aspect, before the step of taking each first-level title, together with the content corresponding to it, as one level-1 logical page, the method further includes the following steps:
if the current actual page contains no first-level title, merging all the content of the current actual page into the content corresponding to the previous first-level title; and
if the first first-level title in the current actual page is not on the first line of the page, merging the content that precedes that title in the current actual page into the content corresponding to the previous first-level title.
With reference to the first aspect and the above possible implementations, in a fourth possible implementation of the first aspect, the step of extracting the titles at each level, and the text content subordinate to each title, from the actual pages further includes the following step:
extracting the level-(N+1) titles, and the text content subordinate to each level-(N+1) title, from each level-N logical page, where N is an integer greater than or equal to 1.
With reference to the first aspect and the above possible implementations, in a fifth possible implementation of the first aspect, the step of extracting the level-(N+1) titles, and the text content subordinate to each level-(N+1) title, from each level-N logical page includes:
extracting the level-(N+1) titles in each level-N logical page;
extracting the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, extracting the content after the current level-(N+1) title in that level-N logical page as the content corresponding to the current level-(N+1) title; and
taking each level-(N+1) title, together with the content corresponding to it, as one level-(N+1) logical page.
The step of storing each title and its subordinate text content in a structured manner includes:
storing the titles of levels 1 to N+1, and the text content subordinate to each of them, in a structured manner, where the text content subordinate to a level-(N+1) title is the content corresponding to that title, and the text content subordinate to a level-i title is the content corresponding to that title excluding its level-(i+1) logical pages, i = 1, 2, ..., N.
With reference to the first aspect and the above possible implementations, in a sixth possible implementation of the first aspect, the step of extracting the level-(N+1) titles, and the text content subordinate to each level-(N+1) title, from each level-N logical page includes:
determining whether a table exists in each level-N logical page and, if a table exists, cutting the table into table blocks, and extracting the level-(N+1) titles and the text content subordinate to those titles.
With reference to the first aspect and the above possible implementations, in a seventh possible implementation of the first aspect, the step of extracting the first-level titles in each actual page includes:
obtaining the title lines in the actual page and the Y-axis coordinate of each title line in the actual page;
if the difference between the Y-axis coordinates of the current title line and the next title line in the same actual page is less than 3 Y-axis units, merging the next title line with the current title line; and
obtaining the text content of the line above the title line that is nearest to it as a first-level title in the actual page.
In a second aspect, the present application further provides a device for extracting structured information from a PDF document, including:
an acquiring unit, configured to obtain the original pages of the PDF document;
a first extraction unit, configured to extract from the original pages at least one actual page containing text content or titles;
a second extraction unit, configured to extract the titles at each level, and the text content subordinate to each title, from the actual pages; and
a storage unit, configured to store each title and its subordinate text content in a structured manner.
With reference to the second aspect, in a first possible implementation of the second aspect, the first extraction unit includes:
a judging unit, configured to judge whether the original pages contain a catalogue page, a header, or a footer; and
a deleting unit, configured to delete the catalogue page, header, or footer from the original pages to obtain at least one actual page.
Compared with the prior art, this method first removes from the original pages of the PDF document the parts that may interfere with the extraction of structured information, such as catalogue pages, headers, and footers, and thereby produces the actual pages; this completes the step of extracting actual pages from the original pages. The titles at each level, and the text content subordinate to each title, are then extracted from the actual pages and stored in a structured manner to obtain structured information. The extraction of structured information from a PDF document can therefore be automated, avoiding manual processing, which is convenient and efficient.
Description of the drawings
To explain the technical solutions of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 to Fig. 7 are flowcharts of a specific implementation of the PDF document structured information extraction method of the present application;
Fig. 8 to Fig. 19 are schematic diagrams of the effects of the sub-steps in one embodiment of the PDF document structured information extraction method of the present application;
Fig. 20 is a structural schematic diagram of one embodiment of the PDF document structured information extraction device of the present application.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the drawings.
Referring to Fig. 1, in a specific embodiment, the PDF document structured information extraction method includes:
S100: Obtain the original pages of the PDF document.
S200: Extract from the original pages at least one actual page containing text content or titles.
S300: Extract the titles at each level, and the text content subordinate to each title, from the actual pages.
S400: Store each title and its subordinate text content in a structured manner.
Structured information is information that, after analysis, can be decomposed into multiple interrelated components with a clear hierarchy among them. In this application, the structured information of a PDF document means the text extracted from the PDF document in which the titles at each level, and the text content subordinate to them, have a clear hierarchy. The structured information can subsequently be presented in files of various formats such as HTML, Word, and TXT.
Structured storage means saving content that would otherwise need multiple files into a single file according to a tree structure and its levels. In this application, storing each title and its subordinate text content in a structured manner means storing the titles at each level, and the content subordinate to them, according to a tree structure and its levels, thereby obtaining the structured information of the PDF document.
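The tree-shaped storage described above can be sketched as follows. This is only a minimal illustration in Python: the class name `Section`, its fields, and the choice of JSON as the single output file are assumptions made for the example, not details given in the patent.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class Section:
    """One title plus the text that belongs directly to it (hypothetical name)."""
    title: str
    level: int
    text: str = ""
    children: list[Section] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Nested dictionaries mirror the tree structure and its levels.
        return {
            "title": self.title,
            "level": self.level,
            "text": self.text,
            "children": [child.to_dict() for child in self.children],
        }


# A level-1 title with two level-2 titles below it.
doc = Section("Chapter 1 General", 1, "Intro text.", [
    Section("1.1 Scope", 2, "Scope text."),
    Section("1.2 Terms", 2, "Terms text."),
])

# Everything is saved as one document, here serialized as JSON.
serialized = json.dumps(doc.to_dict(), ensure_ascii=False, indent=2)
```

A nested structure like this can later be rendered to HTML, Word, or plain text by walking the tree.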
The above method first removes from the original pages of the PDF document the parts that may interfere with the extraction of structured information, such as catalogue pages, headers, and footers, and thereby produces the actual pages; this completes the step of extracting actual pages from the original pages. The titles at each level, and the text content subordinate to each title, are then extracted from the actual pages and stored in a structured manner to obtain structured information. The extraction of structured information from a PDF document can therefore be automated, avoiding manual processing, which is convenient and efficient.
Steps S100 to S400 are described in detail below.
In step S100, the original pages of the PDF document may be obtained from user input or from a storage medium.
Step S200, referring to Fig. 2, may specifically include steps S210 and S220.
S210: Judge whether the original pages contain a catalogue page, a header, or a footer.
Step S210 includes the following steps:
S211: Obtain the page number of the current original page, the characters of the current original page, and the total number of character lines;
S212: Match the page number and characters of the current original page against a first preset rule to determine whether the current original page is a catalogue page.
In step S211, the page number of the current original page, the characters of the current original page, and the total number of character lines can be obtained directly with tools such as PDFBox or iText. PDFBox is an open-source Java class library for working with PDF documents; anyone can build on it to create PDF documents, manipulate existing documents, and extract the text of documents. iText is also an open-source Java class library for generating PDF documents; with iText one can not only generate PDF or RTF documents but also convert XML and HTML files into PDF files.
In step S212, the first preset rule may be set in advance by a developer or a user. For example, the first preset rule may determine that the current original page is a catalogue page in any of the following cases: the current original page is the first or second page, and the number of lines occupied by title serial numbers on it exceeds 40% of its total number of character lines; or the current original page is the first or second page, and the number of lines occupied by character strings of the form "Chinese characters, a run of consecutive non-Chinese symbols, a number" exceeds 40% of its total number of character lines; or the current original page is the first or second page, and its characters include a preset keyword.
For example, if the current original page is the first or second page, and the lines occupied by title serial numbers such as "1.1", "1.1.1", "1,", or "2," exceed 40% of the total number of character lines of the page, the current original page is determined to be a catalogue page. Alternatively, if the current original page is the first or second page, and the lines occupied by character strings of a form such as "If you request surrender within 10 days of signing for this contract (the hesitation period), our company deducts only the cost fee ……… 1.4" or "Chapter 1 ……… 15" exceed 40% of the total number of character lines of the page, the current original page is determined to be a catalogue page. Alternatively, referring to Fig. 8, if the current original page is the first or second page and contains preset keywords such as "chapter 1", "first", "Co., Ltd", or "catalogue", the current original page is determined to be a catalogue page.
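The three catalogue-page rules above can be sketched roughly as follows. The regular expressions, the keyword list, and the function name `is_catalogue_page` are assumptions chosen for the illustration; the patent only gives example strings and the 40% threshold, and it matches Chinese text rather than the generic patterns used here.

```python
import re

# Assumed pattern for title serial numbers such as "1.1", "1.1.1", "1," or "2,".
NUMBERED = re.compile(r"^\s*\d+(\.\d+)*[,、.]?\s")
# Assumed pattern for "text, a run of leader symbols, a number", e.g. "Chapter 1 ...... 15".
LEADER = re.compile(r"\S.*?[.…·\-_]{3,}\s*\d+(\.\d+)*\s*$")
# Assumed keyword list; a real rule set would be configured by the developer or user.
KEYWORDS = ("catalogue", "table of contents")


def is_catalogue_page(page_number, lines, ratio=0.40):
    """Return True if the page looks like a catalogue page under the sketched rules."""
    if page_number not in (1, 2) or not lines:
        return False
    total = len(lines)
    numbered = sum(1 for ln in lines if NUMBERED.match(ln))
    leaders = sum(1 for ln in lines if LEADER.search(ln))
    has_keyword = any(k in ln.lower() for ln in lines for k in KEYWORDS)
    return numbered / total > ratio or leaders / total > ratio or has_keyword
```

Here `lines` stands for the text lines of one original page, which in practice would come from a PDF library.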
As another example, the first preset rule of step S212 may include a rule for judging whether the original pages contain a header: if the first line of characters is identical across 3 to 5 consecutive original pages, the original pages are determined to contain a header. Similarly, the rule for judging whether the original pages contain a footer may be: if the last line of characters is identical across 3 to 5 consecutive original pages, the original pages are determined to contain a footer.
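A minimal sketch of this header and footer rule, together with the deletion it leads to, is shown below. The function names and the default run length of 3 pages are assumptions; pages are represented simply as lists of line strings.

```python
def repeated_edge_line(pages, edge, run=3):
    """Return the line repeated at one edge of `run` consecutive pages, or None.

    pages: list of pages, each a list of line strings.
    edge: 0 for the first line of a page, -1 for the last line.
    """
    streak_text, streak = None, 0
    for lines in pages:
        text = lines[edge].strip() if lines else None
        if text and text == streak_text:
            streak += 1
        else:
            streak_text, streak = text, (1 if text else 0)
        if streak >= run:
            return streak_text
    return None


def strip_headers_and_footers(pages, run=3):
    """Delete the detected header and footer lines from every page that carries them."""
    header = repeated_edge_line(pages, 0, run)
    footer = repeated_edge_line(pages, -1, run)
    cleaned = []
    for lines in pages:
        lines = list(lines)
        if header is not None and lines and lines[0].strip() == header:
            lines.pop(0)
        if footer is not None and lines and lines[-1].strip() == footer:
            lines.pop()
        cleaned.append(lines)
    return cleaned
```

Exact string equality follows the rule as stated; footers that carry a changing page number would need a looser comparison, which is outside this sketch.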
S220: Delete the catalogue page, header, or footer from the original pages to obtain at least one actual page.
Specifically, if the original pages contain a catalogue page, the whole catalogue page is deleted from the original pages; if the original pages contain a header, the header is deleted; if the original pages contain a footer, the footer is deleted. This removes the original pages, or the parts of original pages, that may interfere with the extraction of structured text from the PDF document, and yields at least one actual page.
Before step S300, the characters on the same line of an actual page may first be merged to form line text, as shown in Fig. 9. To merge the characters on the same line, the coordinate information of each character on each actual page, including its X-axis and Y-axis coordinates, can be obtained in advance with tools such as PDFBox; characters whose Y-axis coordinates are identical, or differ by no more than a preset range, are merged to obtain line text. The steps of extracting the titles at each level and their subordinate text content then traverse the content line by line. For example, the line text in the actual pages is traversed to extract the first-level titles and the text content subordinate to them, and the level-1 logical pages are traversed to extract the second-level titles in them and the text content subordinate to those titles.
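The merging of characters into line text can be illustrated with the short sketch below. It assumes each character is already available as an `(x, y, text)` tuple (for example, exported from a PDF library), that Y grows down the page, and that the preset range is 1.0 unit; all three are assumptions for the example.

```python
def merge_chars_into_lines(chars, y_tolerance=1.0):
    """Group characters whose Y coordinates are equal or within y_tolerance.

    chars: iterable of (x, y, text) tuples.
    Returns a list of (line_y, line_text) in order of increasing Y.
    """
    lines = []  # each entry: [line_y, [(x, text), ...]]
    for x, y, text in sorted(chars, key=lambda c: (c[1], c[0])):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))   # same line as the previous character
        else:
            lines.append([y, [(x, text)]])   # start a new line
    # Within a line, order the characters from left to right by X coordinate.
    return [(y, "".join(t for _, t in sorted(parts))) for y, parts in lines]
```

The Y coordinate of the first character seen on a line is used as the line's reference position, which keeps small baseline jitter from splitting one visual line into several.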
Step S300 and the corresponding step S400 cover two cases: one in which no next-level title exists in the level-1 logical pages, and one in which next-level titles do exist in the level-1 logical pages.
Please refer to Fig. 3, Fig. 4, and Fig. 10 to Fig. 14. Fig. 3 is a flowchart of S300-S400 in the first embodiment, and Fig. 4 is a flowchart of step S311 in the first embodiment. Fig. 10 is a schematic diagram of the effect of step S311 in the first embodiment; Fig. 11 is a schematic diagram of the effect of step S312 in the first embodiment; Fig. 12 is a schematic diagram of the effect of step S313 in the first embodiment; Fig. 13 is a schematic diagram of the effect of step S314 in the first embodiment; Fig. 14 is a schematic diagram of the effect of step S410 in the first embodiment. In the first embodiment, step S300 includes:
S311: Extract the first-level titles in each actual page;
S312: Extract the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extract the content after the current first-level title in that actual page as the content corresponding to the current first-level title;
S313: If the current actual page contains no first-level title, merge all the content of the current actual page into the content corresponding to the previous first-level title; if the first first-level title in the current actual page is not on the first line of the page, merge the content that precedes that title in the current actual page into the content corresponding to the previous first-level title;
S314: Take each first-level title, together with the content corresponding to it, as one level-1 logical page.
If no next-level title exists in the level-1 logical pages, the corresponding step S400 includes:
S410: Store each first-level title and the text content subordinate to it in a structured manner, where the text content subordinate to a first-level title is the content corresponding to that first-level title.
In step S311, the first-level titles in an actual page can be extracted according to font size, font style, text content, title lines, and so on; the font size, font style, text content, and title lines can all be obtained with tools such as PDFBox or iText.
The first-level titles can be extracted by font size, for example by comparing the font size of each line of text: if the font of the current line is the largest, the current line is determined to be a first-level title. They can be extracted by font style, for example by matching the font style of each line of text against a preset font style to determine whether the current line is a first-level title. The font size of a line of text may be taken as the size of the first character in the line, or as the size shared by multiple characters of the same size in the line; likewise, the font style of a line of text may be taken as the style of its first character, or as the style shared by multiple characters of the same style in the line. The first-level titles can also be extracted by text content, for example by matching the text against preset keywords: if the text contains preset keywords such as "chapter 1", "chapter 2", "first", or "part 1", the current line is determined to be a first-level title.
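A rough sketch of the font-size and keyword checks described above follows. The keyword pattern, the function names, and the extra guard that the largest size must exceed the most common (body) size are assumptions added for the example; the patent itself only states that the line with the largest font, or a line containing a preset keyword, is treated as a first-level title.

```python
import re
from collections import Counter

# Assumed keyword pattern standing in for preset keywords such as "chapter 1".
CHAPTER = re.compile(r"^\s*(Chapter|Part)\s+\d+", re.IGNORECASE)


def line_font_size(char_sizes):
    """Use the size shared by the most characters as the font size of the line."""
    return Counter(char_sizes).most_common(1)[0][0]


def first_level_titles(lines):
    """lines: list of (text, [character sizes]). Returns indices of likely first-level titles."""
    sizes = [line_font_size(char_sizes) for _, char_sizes in lines]
    largest = max(sizes)
    body = Counter(sizes).most_common(1)[0][0]  # most frequent line size (assumed body text)
    result = []
    for i, (text, _) in enumerate(lines):
        by_size = sizes[i] == largest and largest > body
        by_keyword = bool(CHAPTER.match(text))
        if by_size or by_keyword:
            result.append(i)
    return result
```

The guard `largest > body` avoids marking every line as a title on pages where all text has the same size.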
For PDF documents in which the first-level titles are set off by title lines, the first-level titles can also be extracted by means of the title lines in the actual pages. Referring to Fig. 4, this specifically includes:
S3111: Obtain the title lines in the actual page and the Y-axis coordinate of each title line in the actual page;
S3112: If the difference between the Y-axis coordinates of the current title line and the next title line in the same actual page is less than 3 Y-axis units, merge the next title line with the current title line;
S3113: Obtain the text of the line above the title line that is nearest to it as a first-level title in the actual page.
In step S3111, the Y-axis coordinates of the title lines in the actual page can be obtained with tools such as PDFBox or iText.
In step S3113, the line nearest to the title line can be determined by comparing the distance between the Y-axis coordinate of each line of text and the Y-axis coordinate of the current title line; the text of that line is then taken as a first-level title in the actual page.
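Steps S3111 to S3113 can be sketched as below. The sketch assumes that the Y coordinates of the title lines and of the text lines are already known, and that Y grows down the page so that "above" means a smaller Y value; both are assumptions, as are the function names.

```python
def merge_title_lines(rule_ys, min_gap=3.0):
    """Merge title lines whose Y coordinates differ by less than min_gap units."""
    merged = []
    for y in sorted(rule_ys):
        if merged and y - merged[-1] < min_gap:
            continue  # the next title line is merged into the current one
        merged.append(y)
    return merged


def titles_above_rules(text_lines, rule_ys, min_gap=3.0):
    """text_lines: list of (y, text). Returns the text line nearest above each title line."""
    titles = []
    for rule_y in merge_title_lines(rule_ys, min_gap):
        # Distance from each text line above the rule to the rule itself.
        above = [(rule_y - y, text) for y, text in text_lines if y < rule_y]
        if above:
            titles.append(min(above)[1])
    return titles
```

Merging first prevents a double rule (two strokes drawn close together) from producing the same title twice.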
While the level-1 logical pages are extracted from the actual pages, extraction proceeds page by page in the original order of the actual pages, so content that should belong to the same first-level title may be split across two consecutive actual pages. Step S313 above merges such content into the content corresponding to the previous first-level title. This ensures that each level-1 logical page contains complete content and overcomes the inability of common PDF information extraction methods to rejoin content split by pagination.
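The combined effect of steps S312 to S314, including the cross-page merging of S313, can be sketched with a single pass over the lines of all actual pages. The predicate `is_title` is a hypothetical stand-in for whichever first-level title test is used, and the function name is likewise an assumption.

```python
def build_level_one_pages(actual_pages, is_title):
    """Split the actual pages into level-1 logical pages.

    actual_pages: list of pages, each a list of line strings.
    is_title: hypothetical predicate telling whether a line is a first-level title.
    Returns a list of (title, [content lines]).
    """
    logical_pages = []
    for lines in actual_pages:
        for line in lines:
            if is_title(line):
                logical_pages.append((line, []))
            elif logical_pages:
                # Content before the first title of a page, or on a page with no
                # title at all, is attached to the previous first-level title.
                logical_pages[-1][1].append(line)
            # Text before the very first title of the document has no owner here.
    return logical_pages


pages = [["Chapter 1", "a", "b"], ["c", "Chapter 2", "d"], ["e"]]
result = build_level_one_pages(pages, lambda s: s.startswith("Chapter"))
```

In this example the line "c" on the second page and the whole third page are merged into the preceding title's content, which is the behaviour step S313 describes.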
Please refer to Fig. 5, Fig. 6, and Fig. 15 to Fig. 18. Fig. 5 is a flowchart of S300-S400 in the second embodiment, and Fig. 6 is a flowchart of step S320 in the second embodiment. Fig. 15 is a schematic diagram of the effect of step S321 in the second embodiment; Fig. 16 is a schematic diagram of the effect of step S322 in the second embodiment; Fig. 17 is a schematic diagram of the effect of step S323 in the second embodiment; Fig. 18 is a schematic diagram of the effect of the text content subordinate to a level-i title involved in step S420 in the second embodiment. In the second embodiment, step S300 includes:
S311: Extract the first-level titles in each actual page;
S312: Extract the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extract the content after the current first-level title in that actual page as the content corresponding to the current first-level title;
S313: If the current actual page contains no first-level title, merge all the content of the current actual page into the content corresponding to the previous first-level title; if the first first-level title in the current actual page is not on the first line of the page, merge the content that precedes that title in the current actual page into the content corresponding to the previous first-level title;
S314: Take each first-level title, together with the content corresponding to it, as one level-1 logical page;
If next-level titles exist in a level-1 logical page, the method further includes step S320: extract the level-(N+1) titles, and the text content subordinate to each level-(N+1) title, from each level-N logical page, where N is an integer greater than or equal to 1. This step can be applied recursively until the level-N logical pages contain no level-(N+1) titles. Specifically, it includes:
S321 extracts N+1 grades of titles in each N grades of logical page (LPAGE), and N takes >=1 integer;
S322 extracts the content between current N+1 grades of titles and next N+1 grades of titles, as with current N+1 The corresponding content of grade title;If the last one N+1 grades of title in N+1 grades of current entitled N grades of logical page (LPAGE)s, extract the N Content in grade logical page (LPAGE) after current N+1 grades of titles, as content corresponding with current N+1 grades of titles;
S323 is by each N+1 grades of title, and content corresponding with the N+1 grades of titles, as a N+1 grades of logics Page.
Correspondingly, step S400 includes:
S420: Store in structured form the level-1 to level-(N+1) titles and the text content belonging to each of them, where the text content belonging to a level-(N+1) title is the content corresponding to that level-(N+1) title, and the text content belonging to a level-i title is the content corresponding to that level-i title excluding the level-(i+1) logical pages, i = 1, 2, ..., N.
Note that when a level-N logical page contains level-(N+1) titles, the content corresponding to the level-N title comprises both the text content belonging to the level-N title and the level-(N+1) logical pages. When a level-N logical page contains no level-(N+1) title, the content corresponding to the level-N title is exactly the text content belonging to the level-N title. That is, in this application, the content corresponding to a level-N title and the text content belonging to that title stand in a containing and contained relationship.
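The distinction between "content corresponding to a title" and "text content belonging to a title" can be made concrete with a short sketch. It assumes logical pages are nested dictionaries with hypothetical keys `title`, `content`, and `children`, and that the child logical pages form a contiguous tail of the parent's content, which follows from the way S321 to S323 cut the page.

```python
from typing import Any, Dict, List


def to_structured(page: Dict[str, Any]) -> Dict[str, Any]:
    """Build a storable record whose `text` excludes the child logical pages."""
    children: List[Dict[str, Any]] = page.get("children", [])
    # Each child logical page occupies its title line plus its content lines.
    consumed = sum(1 + len(child["content"]) for child in children)
    own_text = page["content"][: len(page["content"]) - consumed]
    return {
        "title": page["title"],
        "text": own_text,  # text content belonging to this title only
        "children": [to_structured(child) for child in children],
    }
```

For a page without children nothing is subtracted, so the stored text equals the full corresponding content, matching the second case in the note above.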
Note also that several first-level logical pages may be extracted from the actual pages, some of which contain no next-level title while others do. For a first-level logical page without next-level titles, the structured storage step is that of the first embodiment; for a first-level logical page with next-level titles, the structured storage step is that of the second embodiment. The structured information finally obtained for the PDF document then contains the structured storage results of both embodiments.
In some PDF documents, a level-N logical page contains a table and the table contains titles, as in the PDF document shown in Fig. 19. For such documents, refer to Fig. 7 and Fig. 19: Fig. 7 is the flowchart of S300-S400 in the third embodiment, and Fig. 19 is a schematic diagram of the table cutting in S320a of the third embodiment. In the third embodiment, step S320 of the foregoing PDF document structured information extraction method includes:
S320a: Determine whether a table exists in each level-N logical page; if a table exists, cut the table into table blocks, then extract the level-(N+1) titles and the text content belonging to the level-(N+1) titles.
Specifically, within S320a, the step of determining whether a table exists in each level-N logical page and, if a table exists, cutting the table into table blocks may include:
S320a1: Determine, according to a second preset rule, whether the level-N logical page contains a table. The second preset rule is: if a line in the content corresponding to the level-N title contains at least two consecutive spaces, and the position of those spaces is the same in at least three consecutive lines, a table is determined to exist in the current level-N logical page; the first line containing at least two consecutive spaces is taken as the start row of the table, and the last line containing at least two consecutive spaces is taken as the end row of the table;
S320a2: Take the positions of the runs of at least two consecutive spaces in the table as the vertical cutting lines and the blank rows in the table as the horizontal cutting lines, and cut the table into table blocks;
S320a3: Read the content of the table blocks in order, from left to right and from top to bottom, and take it, together with the content of the current level-N logical page outside the table, as the content corresponding to the level-N title of the current level-N logical page.
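A rough Python sketch of S320a1 to S320a3 follows. It works on plain text lines and interprets "the position of the spaces is the same" as "the column at which each gap ends is the same", which suits left-aligned columns; that interpretation, like all function names here, is an assumption rather than something the patent specifies.

```python
import re
from typing import List, Optional, Tuple

GAP = re.compile(r" {2,}")  # a run of at least two consecutive spaces


def gap_ends(line: str) -> Tuple[int, ...]:
    """Columns at which each run of two or more spaces ends."""
    return tuple(m.end() for m in GAP.finditer(line.rstrip()))


def find_table(lines: List[str]) -> Optional[Tuple[int, int]]:
    """S320a1: return (start row, end row) of the table, or None."""
    has_run = any(
        gap_ends(lines[i])
        and gap_ends(lines[i]) == gap_ends(lines[i + 1]) == gap_ends(lines[i + 2])
        for i in range(len(lines) - 2)
    )
    if not has_run:
        return None
    gapped = [i for i, line in enumerate(lines) if gap_ends(line)]
    return gapped[0], gapped[-1]


def cut_table(rows: List[str]) -> List[List[str]]:
    """S320a2: cut at gap columns (vertical) and blank rows (horizontal)."""
    cuts = [0, *gap_ends(rows[0])]
    bands: List[List[str]] = []
    band: List[str] = []
    for row in rows:
        if row.strip():
            band.append(row)
        elif band:
            bands.append(band)
            band = []
    if band:
        bands.append(band)
    blocks: List[List[str]] = []
    for band in bands:  # top to bottom
        for left, right in zip(cuts, cuts[1:] + [None]):  # left to right
            cells = [row[left:right].strip() for row in band]
            block = [cell for cell in cells if cell]
            if block:
                blocks.append(block)
    return blocks


def flatten_table(lines: List[str]) -> List[str]:
    """S320a3: replace the table by the block contents in reading order."""
    span = find_table(lines)
    if span is None:
        return lines
    start, end = span
    block_lines = [line for block in cut_table(lines[start : end + 1]) for line in block]
    return lines[:start] + block_lines + lines[end + 1 :]
```

After flattening, a title that sat in a table cell occupies a line of its own, so the ordinary level-(N+1) extraction can pick it up. The sketch handles one table per logical page and takes the cut columns from the start row only.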
Step S320a cuts the table in a level-N logical page and reads out its content to replace the original table. This updates the content corresponding to the level-N title in the original level-N logical page and forms a new level-N logical page that replaces the original one. In the subsequent steps S321-S323, which extract the level-(N+1) titles and the text content belonging to them from the level-N logical page, "level-N logical page" refers to the new level-N logical page.
Note that when a PDF document is processed, some level-N logical pages may contain a table while others do not. For level-N logical pages without a table, the level-(N+1) titles and the text content belonging to them are extracted with step S320 of the second embodiment; for level-N logical pages containing a table with titles, they are extracted with step S320a of the third embodiment.
Referring to Fig. 20, another embodiment provides a PDF document structured information extraction device, including:
Acquiring unit 1, for obtaining the original pages of a PDF document;
First extraction unit 2, for extracting from the original pages at least one actual page containing text content or titles;
Second extraction unit 3, for extracting from the actual pages the titles of each level and the text content belonging to each title;
Storage unit 4, for storing in structured form each title and the text content belonging to that title.
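The four units form a simple pipeline. The following sketch wires them together as interchangeable callables; the class name `ExtractionDevice` and its field names are illustrative assumptions, and the units themselves are left to the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ExtractionDevice:
    acquire: Callable[[str], List[Any]]                      # acquiring unit 1
    extract_actual_pages: Callable[[List[Any]], List[Any]]   # first extraction unit 2
    extract_titles: Callable[[List[Any]], Any]               # second extraction unit 3
    store: Callable[[Any], Any]                              # storage unit 4

    def run(self, path: str) -> Any:
        """Pass the document through the four units in order."""
        original_pages = self.acquire(path)
        actual_pages = self.extract_actual_pages(original_pages)
        structure = self.extract_titles(actual_pages)
        return self.store(structure)
```

Keeping each unit as a separate callable mirrors the optional sub-units of the device: a table-aware second extraction unit, for example, can be swapped in without touching the other three.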
Optionally, the first extraction unit 2 includes:
Judging unit 21, for judging whether the original pages contain table-of-contents pages, headers, and footers;
Deleting unit 22, for deleting the table-of-contents pages, headers, or footers in the original pages to obtain at least one actual page.
The above PDF document structured information extraction device extracts the structured information of a PDF document automatically and avoids manual processing, which is convenient and fast. The first extraction unit 2 deletes the table-of-contents pages, headers, and footers that would interfere with the extraction, which further ensures the accuracy of the extracted structured information.
Optionally, the second extraction unit 3 includes:
First-level title extraction unit, for extracting the first-level titles in each actual page;
First-level content extraction unit, for extracting the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; and, if the current first-level title is the last first-level title in the actual page, for extracting the content after the current first-level title in that actual page as the content corresponding to the current first-level title;
First-level logical page generation unit, for taking each first-level title, together with the content corresponding to that first-level title, as one first-level logical page.
The storage unit 4 includes a first-level storage unit, used, when the first-level logical page contains no next-level title, to store in structured form each first-level title and the text content belonging to that first-level title, where the text content belonging to a first-level title is the content corresponding to that first-level title.
Optionally, the second extraction unit 3 further includes a merging unit, connected to the first-level content extraction unit and to the first-level logical page generation unit. If the current actual page contains no first-level title, the merging unit merges all content of the current actual page into the content corresponding to the previous first-level title. If the first first-level title in the current actual page is not on the first line of that page, the merging unit merges the content preceding that title in the current actual page into the content corresponding to the previous first-level title.
First-level logical pages are extracted from the actual pages page by page, in the original page order, so content that should correspond to a single first-level title may be split across two consecutive actual pages. The merging unit merges that part of the later actual page into the content corresponding to the previous first-level title, so that each first-level logical page contains complete content. This overcomes the inability of common PDF document information acquisition methods to reassemble content split by pagination.
Optionally, the second extraction unit 3 further includes a level-N extraction unit, for extracting from each level-N logical page the level-(N+1) titles and the text content belonging to the level-(N+1) titles, where N is an integer ≥ 1. The level-N extraction unit runs only when a level-N logical page contains level-(N+1) titles; when a level-N logical page contains no level-(N+1) title, the level-N extraction unit stops running.
Optionally, the level-N extraction unit includes:
Level-(N+1) title extraction unit, for extracting the level-(N+1) titles in each level-N logical page;
Level-(N+1) content extraction unit, for extracting the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; and, if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, for extracting the content after the current level-(N+1) title in that level-N logical page as the content corresponding to the current level-(N+1) title;
Level-(N+1) logical page generation unit, for taking each level-(N+1) title, together with the content corresponding to that level-(N+1) title, as one level-(N+1) logical page.
The storage unit 4 further includes a level-N storage unit, for storing in structured form the level-1 to level-(N+1) titles and the text content belonging to each of them, where the text content belonging to a level-(N+1) title is the content corresponding to that level-(N+1) title, and the text content belonging to a level-i title is the content corresponding to that level-i title excluding the level-(i+1) logical pages, i = 1, 2, ..., N. The level-N storage unit runs only when a first-level logical page contains next-level titles; if a first-level logical page contains no next-level title, the first-level storage unit runs instead.
Note that the second extraction unit may extract several first-level logical pages from the actual pages, some of which contain no next-level title while others do. For a first-level logical page without next-level titles, structured storage uses the first-level storage unit; for a first-level logical page with next-level titles, structured storage uses the level-N storage unit. When a PDF document is processed, both storage units may be used, or only one of them.
Optionally, the second extraction unit 3 further includes a table cutting and acquiring unit, for determining whether a table exists in each level-N logical page and, if a table exists, cutting the table into table blocks and extracting the level-(N+1) titles and the text content belonging to the level-(N+1) titles. When the content corresponding to the level-N title in a level-N logical page contains a table and the level-(N+1) titles are inside the table, the table cutting and acquiring unit can be used to cut the table directly and then extract the level-(N+1) titles and the text content belonging to them. The table cutting and acquiring unit can sometimes be used alone, in place of the level-N extraction unit, and sometimes needs to be used together with the level-N extraction unit.
Optionally, the first-level title extraction unit may include:
Title line acquiring unit, for obtaining the title lines in an actual page and the Y-axis coordinate of each title line in the actual page;
Title line merging unit, for merging the next title line into the current title line when, within the same actual page, the difference between the Y-axis coordinates of the current title line and the next title line is less than 3 Y-axis units;
First-level title acquiring unit, for obtaining the text of the line that lies above a title line and is nearest to it as a first-level title in the actual page.
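The three sub-units can be sketched in a few lines of Python. The sketch assumes the usual PDF coordinate convention in which Y grows upward, so "above the title line" means a larger Y value, and it assumes title lines and text rows have already been obtained as coordinates; the function names are hypothetical.

```python
from typing import List, Tuple


def merge_title_lines(line_ys: List[float], min_gap: float = 3.0) -> List[float]:
    """Merge title lines whose Y coordinates differ by less than `min_gap`."""
    merged: List[float] = []
    for y in sorted(line_ys, reverse=True):  # top of the page first
        if merged and abs(merged[-1] - y) < min_gap:
            continue  # absorbed into the current title line
        merged.append(y)
    return merged


def first_level_titles(
    text_rows: List[Tuple[float, str]],  # (Y coordinate, text) of each text row
    line_ys: List[float],                # Y coordinate of each title line
) -> List[str]:
    """Return the text row directly above each merged title line."""
    titles: List[str] = []
    for line_y in merge_title_lines(line_ys):
        above = [(y, text) for y, text in text_rows if y > line_y]
        if above:
            # The nearest row above the line has the smallest Y distance.
            titles.append(min(above, key=lambda row: row[0] - line_y)[1])
    return titles
```

Merging first means a double rule under a title is treated as a single title line rather than producing the same title twice.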
Those skilled in the art will clearly understand that the techniques in the embodiments of the present invention can be implemented by software plus the necessary general-purpose hardware platform. On this understanding, the technical solutions in the embodiments of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.
For identical or similar parts, the embodiments in this specification may be referred to one another. The embodiments of the invention described above do not limit the scope of protection of the present invention.

Claims (10)

1. A PDF document structured information extraction method, characterized in that the method includes:
obtaining the original pages of a PDF document;
extracting from the original pages at least one actual page containing text content or titles;
extracting from the actual pages the titles of each level and the text content belonging to each title;
storing in structured form each title and the text content belonging to that title;
wherein the step of extracting from the actual pages the titles of each level and the text content belonging to each title includes:
taking each level-N title, together with the content corresponding to that level-N title, as one level-N logical page, N being an integer ≥ 1;
and the step of storing in structured form each title and the text content belonging to that title includes:
storing in structured form the level-1 to level-(N+1) titles and the text content belonging to each of them, wherein the text content belonging to a level-(N+1) title is the content corresponding to that level-(N+1) title, and the text content belonging to a level-i title is the content corresponding to that level-i title excluding the level-(i+1) logical pages, i = 1, 2, ..., N.
2. The PDF document structured information extraction method according to claim 1, characterized in that the step of extracting from the original pages at least one actual page containing text content or titles includes:
judging whether the original pages contain table-of-contents pages, headers, and footers;
deleting the table-of-contents pages, headers, or footers in the original pages to obtain at least one actual page.
3. The PDF document structured information extraction method according to claim 1, characterized in that the step of extracting from the actual pages the titles of each level and the text content belonging to each title further includes:
extracting the first-level titles in each actual page;
extracting the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extracting the content after the current first-level title in that actual page as the content corresponding to the current first-level title;
taking each first-level title, together with the content corresponding to that first-level title, as one first-level logical page;
if the first-level logical page contains no next-level title, the step of storing in structured form each title and the text content belonging to that title includes:
storing in structured form each first-level title and the text content belonging to that first-level title, wherein the text content belonging to a first-level title is the content corresponding to that first-level title.
4. The PDF document structured information extraction method according to claim 3, characterized in that, before the step of taking each first-level title, together with the content corresponding to that first-level title, as one first-level logical page, the method further includes the following steps:
if the current actual page contains no first-level title, merging all content of the current actual page into the content corresponding to the previous first-level title;
if the first first-level title in the current actual page is not on the first line of the current actual page, merging the content preceding that first-level title in the current actual page into the content corresponding to the previous first-level title.
5. The PDF document structured information extraction method according to claim 3, characterized in that the step of extracting from the actual pages the titles of each level and the text content belonging to each title further includes the following step:
extracting from each level-N logical page the level-(N+1) titles and the text content belonging to the level-(N+1) titles, N being an integer ≥ 1.
6. The PDF document structured information extraction method according to claim 5, characterized in that the step of extracting from each level-N logical page the level-(N+1) titles and the text content belonging to the level-(N+1) titles includes:
extracting the level-(N+1) titles in each level-N logical page;
extracting the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, extracting the content after the current level-(N+1) title in that level-N logical page as the content corresponding to the current level-(N+1) title;
taking each level-(N+1) title, together with the content corresponding to that level-(N+1) title, as one level-(N+1) logical page.
7. The PDF document structured information extraction method according to claim 5, characterized in that the step of extracting from each level-N logical page the level-(N+1) titles and the text content belonging to the level-(N+1) titles includes:
determining whether a table exists in each level-N logical page; if a table exists, cutting the table into table blocks, and extracting the level-(N+1) titles and the text content belonging to the level-(N+1) titles.
8. The PDF document structured information extraction method according to any one of claims 3-7, characterized in that the step of extracting the first-level titles in each actual page includes:
obtaining the title lines in the actual page and the Y-axis coordinate of each title line in the actual page;
if, within the same actual page, the difference between the Y-axis coordinates of the current title line and the next title line is less than 3 Y-axis units, merging the next title line into the current title line;
obtaining the text of the line that lies above a title line and is nearest to it as a first-level title in the actual page.
9. A PDF document structured information extraction device, characterized in that it includes:
an acquiring unit, for obtaining the original pages of a PDF document;
a first extraction unit, for extracting from the original pages at least one actual page containing text content or titles;
a second extraction unit, for extracting from the actual pages the titles of each level and the text content belonging to each title;
a storage unit, for storing in structured form each title and the text content belonging to that title;
wherein the second extraction unit is specifically used for taking each level-N title, together with the content corresponding to that level-N title, as one level-N logical page, N being an integer ≥ 1;
and the storage unit is specifically used for storing in structured form the level-1 to level-(N+1) titles and the text content belonging to each of them, wherein the text content belonging to a level-(N+1) title is the content corresponding to that level-(N+1) title, and the text content belonging to a level-i title is the content corresponding to that level-i title excluding the level-(i+1) logical pages, i = 1, 2, ..., N.
10. The PDF document structured information extraction device according to claim 9, characterized in that the first extraction unit includes:
a judging unit, for judging whether the original pages contain table-of-contents pages, headers, and footers;
a deleting unit, for deleting the table-of-contents pages, headers, or footers in the original pages to obtain at least one actual page.
CN201710576556.9A 2017-07-14 2017-07-14 A kind of PDF document structured message extracting method and device Active CN107358208B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710576556.9A CN107358208B (en) 2017-07-14 2017-07-14 A kind of PDF document structured message extracting method and device


Publications (2)

Publication Number Publication Date
CN107358208A CN107358208A (en) 2017-11-17
CN107358208B true CN107358208B (en) 2018-07-13

Family

ID=60292655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710576556.9A Active CN107358208B (en) 2017-07-14 2017-07-14 A kind of PDF document structured message extracting method and device

Country Status (1)

Country Link
CN (1) CN107358208B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943956A (en) * 2017-11-24 2018-04-20 北京金堤科技有限公司 Conversion of page method, apparatus and conversion of page equipment
CN108614898B (en) * 2018-05-10 2021-06-25 爱因互动科技发展(北京)有限公司 Document analysis method and device
CN109492199B (en) * 2018-10-17 2023-04-28 四川译讯信息科技有限公司 PDF file conversion method based on OCR pre-judgment
CN110287785A (en) * 2019-05-20 2019-09-27 深圳壹账通智能科技有限公司 Text structure information extracting method, server and storage medium
CN110363102B (en) * 2019-06-24 2022-05-17 北京融汇金信信息技术有限公司 Object identification processing method and device for PDF (Portable document Format) file
CN110334346B (en) * 2019-06-26 2020-09-29 京东数字科技控股有限公司 Information extraction method and device of PDF (Portable document Format) file
CN110728240A (en) * 2019-10-14 2020-01-24 北京华宇信息技术有限公司 Method and device for automatically identifying title of electronic file
CN111985306B (en) * 2020-07-06 2024-09-27 北京欧应科技有限公司 OCR and information extraction method applied to medical field document
CN111881650B (en) * 2020-07-20 2024-08-20 北京百度网讯科技有限公司 PDF document generation method and device and electronic equipment
CN112712085A (en) * 2020-12-28 2021-04-27 哈尔滨工业大学 Method for extracting date in multi-language PDF document
CN113673294B (en) * 2021-05-11 2024-06-18 苏州超云生命智能产业研究院有限公司 Method, device, computer equipment and storage medium for extracting document key information
CN113298914B (en) * 2021-07-28 2021-10-15 北京明略软件系统有限公司 Knowledge chunk extraction method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101534306A (en) * 2009-04-14 2009-09-16 深圳市腾讯计算机系统有限公司 Detecting method and a device for fishing website
CN102855244A (en) * 2011-06-28 2013-01-02 北大方正集团有限公司 Method and device for file catalogue processing
CN106383817A (en) * 2016-09-29 2017-02-08 北京理工大学 Paper title generation method capable of utilizing distributed semantic information

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1605369A1 (en) * 2004-06-07 2005-12-14 ArchiveOnline AB Document database
US20050289161A1 (en) * 2004-06-29 2005-12-29 The Boeing Company Integrated document directory generator apparatus and methods
CN1320481C (en) * 2004-11-22 2007-06-06 北京北大方正技术研究院有限公司 Method for conducting title and text logic connection for newspaper pages
US20080244715A1 (en) * 2007-03-27 2008-10-02 Tim Pedone Method and apparatus for detecting and reporting phishing attempts
CN100552673C (en) * 2007-08-30 2009-10-21 上海交通大学 Open type document isomorphism engines system
CN102541929B (en) * 2010-12-22 2014-04-02 北大方正集团有限公司 Method and device for extracting format file catalogue
CN102541948A (en) * 2010-12-23 2012-07-04 北大方正集团有限公司 Method and device for extracting document structure
CN104699714B (en) * 2013-12-09 2017-10-20 北大方正集团有限公司 Book version formatted file is converted to the method and device of EPUB formatted files
CN106446072B (en) * 2016-09-07 2019-10-18 百度在线网络技术(北京)有限公司 The treating method and apparatus of web page contents
CN106951400A (en) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 The information extraction method and device of a kind of pdf document


Also Published As

Publication number Publication date
CN107358208A (en) 2017-11-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190905

Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Patentee after: China Science and Technology (Beijing) Co., Ltd.

Address before: Room 601, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Co-patentee before: China Science and Technology (Beijing) Co., Ltd.

Patentee before: Beijing Shenzhou Taiyue Software Co., Ltd.

CP03 Change of name, title or address

Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Patentee after: Dingfu Intelligent Technology Co., Ltd

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Patentee before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.