Summary of the Invention
The present application provides a method for extracting structured information from a PDF document and a device for extracting structured information from a PDF document, to solve the problem that the prior art cannot conveniently obtain structured information from a PDF document.
In a first aspect, the present application provides a method for extracting structured information from a PDF document, the method including:
obtaining the original pages of the PDF document;
extracting, from the original pages, at least one actual page containing text content or titles;
extracting, from the actual page, the titles of each level and the text content subordinate to each title; and
storing each title and the text content subordinate to it in a structured manner.
With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting at least one actual page containing text content or titles from the original pages includes:
determining whether the original pages contain a table-of-contents page, a header and a footer, respectively; and
deleting the table-of-contents page, header or footer from the original pages to obtain at least one actual page.
With reference to the first aspect and the foregoing possible implementation, in a second possible implementation of the first aspect, the step of extracting the titles of each level and the text content subordinate to each title from the actual page includes:
extracting the first-level titles in each actual page;
extracting the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extracting the content after the current first-level title in the actual page as the content corresponding to the current first-level title; and
taking each first-level title, together with the content corresponding to that first-level title, as one first-level logical page.
If no next-level title exists in the first-level logical page, the step of storing each title and the text content subordinate to it in a structured manner includes:
storing each first-level title and the text content subordinate to that first-level title in a structured manner, where the text content subordinate to a first-level title is the content corresponding to that first-level title.
With reference to the first aspect and the foregoing possible implementations, in a third possible implementation of the first aspect, before the step of taking each first-level title, together with the content corresponding to that first-level title, as one first-level logical page, the method further includes the following steps:
if the current actual page contains no first-level title, merging all the content of the current actual page into the content corresponding to the previous first-level title; and
if the first first-level title in the current actual page is not on the first line of the current actual page, merging the content before the first first-level title in the current actual page into the content corresponding to the previous first-level title.
With reference to the first aspect and the foregoing possible implementations, in a fourth possible implementation of the first aspect, the step of extracting the titles of each level and the text content subordinate to each title from the actual page further includes the following step:
extracting, from each level-N logical page, the level-(N+1) titles and the text content subordinate to each level-(N+1) title, where N is an integer greater than or equal to 1.
With reference to the first aspect and the foregoing possible implementations, in a fifth possible implementation of the first aspect, the step of extracting, from each level-N logical page, the level-(N+1) titles and the text content subordinate to each level-(N+1) title includes:
extracting the level-(N+1) titles in each level-N logical page;
extracting the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, extracting the content after the current level-(N+1) title in the level-N logical page as the content corresponding to the current level-(N+1) title; and
taking each level-(N+1) title, together with the content corresponding to that level-(N+1) title, as one level-(N+1) logical page.
The step of storing each title and the text content subordinate to it in a structured manner includes:
storing the level-1 to level-(N+1) titles, and the text content subordinate to each of the level-1 to level-(N+1) titles, in a structured manner, where the text content subordinate to a level-(N+1) title is the content corresponding to that level-(N+1) title, and the text content subordinate to a level-i title is the content corresponding to that level-i title excluding the level-(i+1) logical pages, i = 1, 2, ..., N.
With reference to the first aspect and the foregoing possible implementations, in a sixth possible implementation of the first aspect, the step of extracting, from each level-N logical page, the level-(N+1) titles and the text content subordinate to each level-(N+1) title includes:
determining whether a table exists in each level-N logical page and, if a table exists, cutting the table into table blocks, then extracting the level-(N+1) titles and the text content subordinate to each level-(N+1) title.
With reference to the first aspect and the foregoing possible implementations, in a seventh possible implementation of the first aspect, the step of extracting the first-level titles in each actual page includes:
obtaining the title lines in the actual page and the Y coordinate of each title line in the actual page;
if, within the same actual page, the difference between the Y coordinates of the current title line and the next title line is less than 3 Y-axis units, merging the next title line with the current title line; and
obtaining the line of text above the title line and nearest to it as a first-level title in the actual page.
In a second aspect, the present application further provides a device for extracting structured information from a PDF document, including:
an acquiring unit, configured to obtain the original pages of the PDF document;
a first extraction unit, configured to extract, from the original pages, at least one actual page containing text content or titles;
a second extraction unit, configured to extract, from the actual page, the titles of each level and the text content subordinate to each title; and
a storage unit, configured to store each title and the text content subordinate to it in a structured manner.
With reference to the second aspect, in a first possible implementation of the second aspect, the first extraction unit includes:
a judging unit, configured to determine whether the original pages contain a table-of-contents page, a header and a footer, respectively; and
a deleting unit, configured to delete the table-of-contents page, header or footer from the original pages to obtain at least one actual page.
Compared with the prior art, the method first removes from the original pages of the PDF document the parts that may interfere with the extraction of structured information, such as table-of-contents pages, headers and footers, and generates the actual pages, thereby completing the step of extracting the actual pages from the original pages. The titles of each level, and the text content subordinate to the titles of each level, are then extracted from the actual pages and stored in a structured manner to obtain the structured information. The extraction of structured information from a PDF document can thus be automated, which avoids manual processing and is convenient and efficient.
Detailed Description
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings.
Referring to Fig. 1, in a specific embodiment, the method for extracting structured information from a PDF document includes:
S100: obtain the original pages of the PDF document.
S200: extract, from the original pages, at least one actual page containing text content or titles.
S300: extract, from the actual page, the titles of each level and the text content subordinate to each title.
S400: store each title and the text content subordinate to it in a structured manner.
Structured information means information that, after analysis, can be decomposed into multiple interrelated components with a definite hierarchical structure among them. In the present application, the structured information of a PDF document means text extracted from the PDF document in which the titles of each level and the text content subordinate to the titles have a definite hierarchical structure. The structured information can subsequently be presented through files in various formats such as HTML, Word and TXT.
Structured storage means saving content that would otherwise require multiple files into a single file according to a tree structure and hierarchy. In the present application, storing each title and the text content subordinate to it in a structured manner means storing the titles of each level, and the content subordinate to the titles of each level, according to a tree structure and hierarchy, thereby obtaining the structured information of the PDF document.
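As an illustration of the tree-shaped storage described above, the following is a minimal sketch in Python. The class name Section, its fields and the dictionary layout are assumptions made for this example only; the application does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Section:
    """One title plus the text subordinate to it (hypothetical structure)."""
    title: str
    level: int
    text: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        # Serialize this node and, recursively, every lower-level title.
        return {
            "title": self.title,
            "level": self.level,
            "text": self.text,
            "children": [child.to_dict() for child in self.children],
        }


# Example data invented for illustration.
chapter = Section("Chapter 1", 1, ["Intro line."])
chapter.children.append(Section("1.1", 2, ["Detail line."]))
stored = chapter.to_dict()
```

The resulting nested dictionary could then be written to a single file, for example as JSON, and rendered later as HTML, Word or plain text.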
The above method first removes from the original pages of the PDF document the parts that may interfere with the extraction of structured information, such as table-of-contents pages, headers and footers, and generates the actual pages, thereby completing the step of extracting the actual pages from the original pages. The titles of each level, and the text content subordinate to the titles of each level, are then extracted from the actual pages and stored in a structured manner to obtain the structured information. The extraction of structured information from a PDF document can thus be automated, which avoids manual processing and is convenient and efficient.
Steps S100 to S400 are described in detail below.
In step S100, the original pages of the PDF document may be obtained from user input or from a storage medium.
Referring to Fig. 2, step S200 may specifically include steps S210 and S220.
S210: determine whether the original pages contain a table-of-contents page, a header and a footer.
Step S210 includes the following steps:
S211: obtain the page number of the current original page, the characters of the current original page and the total number of character lines;
S212: match the page number and the characters of the current original page against a first preset rule to determine whether the current original page is a table-of-contents page.
In step S211, the page number of the current original page, the characters of the current original page and the total number of character lines can be obtained directly with tools such as PDFBox or iText. PDFBox is an open-source Java class library for working with PDF documents; anyone can program on top of it to create PDF documents, manipulate existing documents and extract the text of a document. iText is also an open-source Java class library for generating PDF documents; with iText one can not only generate PDF or RTF documents but also convert XML and HTML files into PDF files.
In step S212, the first preset rule may be set in advance by a developer or a user. For example, in the first preset rule, the rules for determining whether a page is a table-of-contents page include: the current original page is the first or second page, and the lines occupied by title sequence numbers on the current original page account for more than 40% of the total character lines of the current original page, in which case the current original page is a table-of-contents page; or, the current original page is the first or second page, and among the characters of the current original page the lines occupied by character strings of the form "Chinese characters, a run of consecutive non-Chinese symbols, a number", appearing in that order, account for more than 40% of the total character lines of the current original page, in which case the current original page is a table-of-contents page; or, the current original page is the first or second page, and the characters of the current original page contain a preset keyword, in which case the current original page is a table-of-contents page.
For example, if the current original page is the first or second page, and the lines occupied by title sequence numbers in the original page, such as "1.1", "1.1.1", "1," and "2,", account for more than 40% of the total character lines of the current original page, the current original page is determined to be a table-of-contents page. Alternatively, if the current original page is the first or second page, and among the characters of the current original page the lines occupied by character strings of a form such as "If you request surrender within 10 days of signing this contract (the cooling-off period), our company deducts only the cost fee ...... 1.4" or "Chapter 1 ...... 15" account for more than 40% of the total character lines of the current original page, the current original page is determined to be a table-of-contents page. As a further alternative, referring to Fig. 8, if the current original page is the first or second page, and the current original page contains preset keywords such as "Chapter 1", "first", "Co., Ltd." and "Contents", the current original page is determined to be a table-of-contents page.
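The three alternative rules above can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: the function name is_toc_page, both regular expressions, the default keyword and the representation of a page as a list of text lines are all hypothetical; only the first-or-second-page condition and the 40% threshold come from the description.

```python
import re
from typing import Sequence

# Assumed pattern: a line that starts with a title sequence number
# such as "1.1", "1.1.1" or "1,".
HEADING_NUMBER = re.compile(r"^\s*\d+(\.\d+)*[,.]?\s")
# Assumed pattern: Chinese characters, then a run of at least two
# non-Chinese, non-digit symbols, then a number (e.g. a page number).
LEADER_ENTRY = re.compile(r"^.*[\u4e00-\u9fff][^\u4e00-\u9fff0-9]{2,}\d[\d.]*\s*$")


def is_toc_page(page_number: int, lines: Sequence[str],
                keywords: Sequence[str] = ("Contents",),
                ratio: float = 0.4) -> bool:
    """Return True if the page looks like a table-of-contents page."""
    if page_number not in (1, 2) or not lines:
        return False
    total = len(lines)
    numbered = sum(1 for line in lines if HEADING_NUMBER.match(line))
    leaders = sum(1 for line in lines if LEADER_ENTRY.match(line))
    has_keyword = any(k in line for line in lines for k in keywords)
    # Any one of the three alternative rules is sufficient.
    return numbered / total > ratio or leaders / total > ratio or has_keyword
```

In practice the patterns and keywords would be tuned to the document family being processed, since the description leaves the first preset rule to the developer or user.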
As another example, in the first preset rule of step S212, the rule for determining whether the original pages contain a header includes: if the first line of characters is identical in 3 to 5 consecutive original pages, the original pages are determined to contain a header. As a further example, the rule for determining whether the original pages contain a footer includes: if the last line of characters is identical in 3 to 5 consecutive original pages, the original pages are determined to contain a footer.
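A minimal sketch of the header and footer rule follows, assuming each page is already available as a list of text lines. The function names and the choice of a minimum run of 3 pages (the lower bound of the 3 to 5 range above) are assumptions for illustration.

```python
from typing import List, Optional, Sequence


def repeated_edge_line(pages: Sequence[Sequence[str]], edge: int,
                       min_run: int = 3) -> Optional[str]:
    """Return the first (edge=0) or last (edge=-1) line if it is identical
    on at least min_run consecutive pages, otherwise None."""
    run, previous = 0, None
    for lines in pages:
        current = lines[edge].strip() if lines else ""
        if current and current == previous:
            run += 1
        else:
            run = 1 if current else 0
        if run >= min_run:
            return current
        previous = current
    return None


def strip_header_footer(pages: Sequence[Sequence[str]],
                        min_run: int = 3) -> List[List[str]]:
    """Remove the detected header and footer line from every page that has it."""
    header = repeated_edge_line(pages, 0, min_run)
    footer = repeated_edge_line(pages, -1, min_run)
    cleaned = []
    for lines in pages:
        body = list(lines)
        if header is not None and body and body[0].strip() == header:
            body = body[1:]
        if footer is not None and body and body[-1].strip() == footer:
            body = body[:-1]
        cleaned.append(body)
    return cleaned
```

Note that this sketch only catches footers whose text is literally identical, as the rule states; a footer that embeds a changing page number would need a looser comparison.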
S220: delete the table-of-contents page, header or footer from the original pages to obtain at least one actual page.
Specifically, if the original pages contain a table-of-contents page, the whole table-of-contents page is deleted from the original pages; if the original pages contain a header, the header is deleted from the original pages; if the original pages contain a footer, the footer is deleted from the original pages. In this way, the original pages, or the parts of original pages, that may interfere with the extraction of structured text from the PDF document are removed, and at least one actual page is obtained.
Before step S300 is performed, the characters located on the same line of an actual page may first be merged to form line text, as shown in Fig. 9. To merge the characters of a line, the coordinates of the characters on each actual page, including the X coordinate and the Y coordinate, can be obtained in advance with a tool such as PDFBox; characters whose Y coordinates are identical or differ by no more than a preset range are merged to obtain the line text. The steps of extracting the titles of each level and the text content subordinate to each title are then performed by traversing in units of line text. For example, the line text in the actual pages is traversed to extract the first-level titles and the text content subordinate to each first-level title; the first-level logical pages are traversed to extract the second-level titles in each first-level logical page and the text content subordinate to each second-level title.
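The merging of characters into line text can be sketched as below. The input is assumed to be a list of (character, x, y) tuples such as a PDF library might report; the function name, the tuple layout and the default tolerance of 1.0 are hypothetical, and the Y axis is assumed to grow downward.

```python
from typing import List, Sequence, Tuple

Char = Tuple[str, float, float]  # (character, x, y), an assumed layout


def merge_chars_into_lines(chars: Sequence[Char],
                           y_tolerance: float = 1.0) -> List[Tuple[float, str]]:
    """Group characters whose Y coordinates are equal or within y_tolerance
    of the first character of the line, then order each group by X."""
    lines: list = []  # each entry is [line_y, [(x, character), ...]]
    for ch, x, y in sorted(chars, key=lambda c: c[2]):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, ch))
        else:
            lines.append([y, [(x, ch)]])
    return [(y, "".join(ch for _, ch in sorted(items))) for y, items in lines]


# Example data invented for illustration.
sample = [("i", 12.0, 100.2), ("H", 10.0, 100.0),
          ("o", 10.0, 120.0), ("k", 12.0, 120.0)]
merged = merge_chars_into_lines(sample)
```

Each returned pair keeps the Y coordinate of the line, which later steps can use when comparing line text against title lines.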
Step S300 and the corresponding step S400 cover two cases: one in which no next-level title exists in a first-level logical page, and one in which a next-level title also exists in a first-level logical page.
Please refer to Fig. 3, Fig. 4 and Figs. 10 to 14. Fig. 3 is a flowchart of S300-S400 in the first embodiment, and Fig. 4 is a flowchart of step S311 in the first embodiment. Fig. 10 is a schematic diagram of the effect of step S311 in the first embodiment; Fig. 11 is a schematic diagram of the effect of step S312 in the first embodiment; Fig. 12 is a schematic diagram of the effect of step S313 in the first embodiment; Fig. 13 is a schematic diagram of the effect of step S314 in the first embodiment; Fig. 14 is a schematic diagram of the effect of step S410 in the first embodiment. In the first embodiment, step S300 includes:
S311: extract the first-level titles in each actual page;
S312: extract the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extract the content after the current first-level title in the actual page as the content corresponding to the current first-level title;
S313: if the current actual page contains no first-level title, merge all the content of the current actual page into the content corresponding to the previous first-level title; if the first first-level title in the current actual page is not on the first line of the current actual page, merge the content before the first first-level title in the current actual page into the content corresponding to the previous first-level title;
S314: take each first-level title, together with the content corresponding to that first-level title, as one first-level logical page.
If no next-level title exists in the first-level logical page, the corresponding step S400 includes:
S410: store each first-level title and the text content subordinate to that first-level title in a structured manner, where the text content subordinate to a first-level title is the content corresponding to that first-level title.
In step S311, the first-level titles in an actual page can be extracted according to the font size, the font style, the text content or the title lines in the actual page; the font size, font style, text content and title lines can all be obtained with tools such as PDFBox or iText.
To extract the first-level titles in an actual page by font size, for example, the font sizes of the line texts are compared, and if the font of the current line text is the largest, the current line text is determined to be a first-level title. To extract the first-level titles in an actual page by font style, for example, the font style of a line text is matched against a preset font style to determine that the current line text is a first-level title. For the font size of a line text, the size of the first character in the current line text may be used, or the size shared by multiple characters of identical size in the current line text may be used. For the font style of a line text, the style of the first character in the line text may be used, or the style shared by multiple characters of identical style in the current line text may be used. To extract the first-level titles in an actual page by text content, for example, the text content is matched against preset keywords, and if the text content contains a preset keyword such as "Chapter 1", "Chapter 2", "first" or "Part 1", the current line text is determined to be a first-level title.
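The font-size and keyword criteria can be combined as in the sketch below. Each line is assumed to arrive as a pair of its text and the list of its character sizes; the keyword pattern, the function names and the choice of the most common character size as the size of the line are assumptions for illustration.

```python
import re
from collections import Counter
from typing import List, Sequence, Tuple

# Assumed keyword pattern for lines such as "Chapter 1" or "Part 2".
TITLE_KEYWORD = re.compile(r"^\s*(Chapter|Part)\s+\d+", re.IGNORECASE)


def line_font_size(char_sizes: Sequence[float]) -> float:
    """Use the size shared by the most characters as the size of the line."""
    return Counter(char_sizes).most_common(1)[0][0]


def find_first_level_titles(
        lines: Sequence[Tuple[str, Sequence[float]]]) -> List[int]:
    """Return the indices of lines that have the largest font on the page
    or that start with a preset keyword."""
    if not lines:
        return []
    sizes = [line_font_size(char_sizes) for _, char_sizes in lines]
    largest = max(sizes)
    # If every line has the same size, size alone cannot identify titles.
    size_is_informative = largest > min(sizes)
    return [
        index
        for index, ((text, _), size) in enumerate(zip(lines, sizes))
        if (size_is_informative and size == largest) or TITLE_KEYWORD.match(text)
    ]


# Example data invented for illustration.
page = [("Chapter 1 Scope", [16, 16, 16]),
        ("Body text", [10, 10, 12]),
        ("Part 2 Terms", [10, 10])]
title_indices = find_first_level_titles(page)
```

Matching on font style would follow the same shape, comparing a style name per line against a preset style instead of comparing sizes.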
For PDF documents in which the first-level titles are separated by title lines, the first-level titles can also be extracted by means of the title lines in the actual page. Referring to Fig. 4, this specifically includes:
S3111: obtain the title lines in the actual page and the Y coordinate of each title line in the actual page;
S3112: if, within the same actual page, the difference between the Y coordinates of the current title line and the next title line is less than 3 Y-axis units, merge the next title line with the current title line;
S3113: obtain the line of text above the title line and nearest to it as a first-level title in the actual page.
In step S3111, the Y coordinates of the title lines in the actual page can be obtained with tools such as PDFBox or iText.
In step S3113, the line nearest to the title line can be determined by comparing the distance between the Y coordinate of each line text and the Y coordinate of the current title line; the text of that line is then obtained as a first-level title in the actual page.
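Steps S3111 to S3113 can be sketched as follows, assuming the Y coordinates of the title lines and of the line texts are already known and that Y grows downward, so that "above" means a smaller Y value. The function names and data layout are hypothetical; the threshold of 3 Y-axis units is taken from the description.

```python
from typing import List, Sequence, Tuple


def merge_title_lines(rule_ys: Sequence[float], min_gap: float = 3.0) -> List[float]:
    """S3112: treat title lines closer than min_gap Y-axis units as one line."""
    merged: List[float] = []
    for y in sorted(rule_ys):
        if merged and y - merged[-1] < min_gap:
            continue  # merged into the title line that was kept
        merged.append(y)
    return merged


def titles_from_title_lines(text_lines: Sequence[Tuple[float, str]],
                            rule_ys: Sequence[float],
                            min_gap: float = 3.0) -> List[str]:
    """S3113: for each title line, take the nearest line of text above it."""
    titles = []
    for rule_y in merge_title_lines(rule_ys, min_gap):
        above = [(rule_y - y, text) for y, text in text_lines if y < rule_y]
        if above:
            titles.append(min(above)[1])
    return titles


# Example data invented for illustration: (y, text) pairs and title-line Ys.
lines_on_page = [(90.0, "Chapter 1"), (110.0, "body"),
                 (190.0, "Chapter 2"), (210.0, "more body")]
found = titles_from_title_lines(lines_on_page, [100.0, 101.5, 200.0])
```

If the PDF library reports Y growing upward instead, the comparison y < rule_y and the distance calculation would need to be reversed.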
In the process of extracting first-level logical pages from the actual pages, extraction proceeds page by page in the original order of the actual pages, so the following situation may arise: content that should correspond to the same first-level title is split across two consecutive actual pages. Through step S313 above, this part of the content in the actual page can be merged into the content corresponding to the previous first-level title, which ensures that each first-level logical page contains complete content and overcomes the problem that common PDF document information acquisition methods cannot aggregate content split by pagination.
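Steps S312 to S314, including the cross-page merging of S313, can be sketched in a few lines. The function name, the is_title callback and the representation of pages as lists of line texts are assumptions for illustration; how a title is recognized is left to the caller.

```python
from typing import Callable, List, Sequence, Tuple

LogicalPage = Tuple[str, List[str]]  # (first-level title, corresponding content)


def build_first_level_pages(pages: Sequence[Sequence[str]],
                            is_title: Callable[[str], bool]) -> List[LogicalPage]:
    """Walk the actual pages in order and attach every non-title line to the
    most recent first-level title, regardless of page boundaries."""
    logical: List[LogicalPage] = []
    for lines in pages:
        for line in lines:
            if is_title(line):
                logical.append((line, []))
            elif logical:
                # Covers S313: content on a page without a title, or before
                # the first title of a page, joins the previous title.
                logical[-1][1].append(line)
            # Lines before the very first title of the document have no
            # previous title; this sketch simply skips them.
    return logical


# Example data invented for illustration.
actual_pages = [["Chapter 1", "a"], ["b", "Chapter 2", "c"], ["d"]]
first_level = build_first_level_pages(actual_pages,
                                      lambda s: s.startswith("Chapter"))
```

Because the pages are traversed as one continuous stream, no separate merge pass is needed in this formulation; the page boundary simply has no effect on which title a line is attached to.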
Please refer to Fig. 5, Fig. 6 and Figs. 15 to 18. Fig. 5 is a flowchart of S300-S400 in the second embodiment, and Fig. 6 is a flowchart of step S320 in the second embodiment. Fig. 15 is a schematic diagram of the effect of step S321 in the second embodiment; Fig. 16 is a schematic diagram of the effect of step S322 in the second embodiment; Fig. 17 is a schematic diagram of the effect of step S323 in the second embodiment; Fig. 18 is a schematic diagram of the effect of the text content subordinate to a level-i title involved in step S420 in the second embodiment. In the second embodiment, step S300 includes:
S311: extract the first-level titles in each actual page;
S312: extract the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extract the content after the current first-level title in the actual page as the content corresponding to the current first-level title;
S313: if the current actual page contains no first-level title, merge all the content of the current actual page into the content corresponding to the previous first-level title; if the first first-level title in the current actual page is not on the first line of the current actual page, merge the content before the first first-level title in the current actual page into the content corresponding to the previous first-level title;
S314: take each first-level title, together with the content corresponding to that first-level title, as one first-level logical page;
If a next-level title exists in the first-level logical page, the method further includes the following step. S320: extract, from each level-N logical page, the level-(N+1) titles and the text content subordinate to each level-(N+1) title, where N is an integer greater than or equal to 1. This step may be performed recursively until the level-N logical pages contain no level-(N+1) titles. It specifically includes:
S321: extract the level-(N+1) titles in each level-N logical page, where N is an integer greater than or equal to 1;
S322: extract the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, extract the content after the current level-(N+1) title in the level-N logical page as the content corresponding to the current level-(N+1) title;
S323: take each level-(N+1) title, together with the content corresponding to that level-(N+1) title, as one level-(N+1) logical page.
Correspondingly, step S400 includes:
S420: store the level-1 to level-(N+1) titles, and the text content subordinate to each of the level-1 to level-(N+1) titles, in a structured manner, where the text content subordinate to a level-(N+1) title is the content corresponding to that level-(N+1) title, and the text content subordinate to a level-i title is the content corresponding to that level-i title excluding the level-(i+1) logical pages, i = 1, 2, ..., N.
It should be noted here that when a level-N logical page contains level-(N+1) titles, the content corresponding to the level-N title comprises the text content subordinate to the level-N title and the level-(N+1) logical pages. When no level-(N+1) title exists in the level-N logical page, the content corresponding to the level-N title is exactly the text content subordinate to the level-N title. That is, in the present application, the content corresponding to a level-N title contains the text content subordinate to that level-N title.
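The recursion of S320 and the distinction between "content corresponding to a title" and "text content subordinate to a title" can be illustrated with the sketch below. It assumes, purely for the example, that the level of a title can be read from a dotted sequence number ("1.2" is level 2, "1.2.3" is level 3); the function names and the dictionary layout are hypothetical.

```python
import re
from typing import Any, Dict, List, Sequence, Tuple


def heading_level(line: str) -> int:
    """Assumed convention: level = number of dot-separated numeric parts."""
    match = re.match(r"^(\d+(?:\.\d+)*)\s", line)
    return len(match.group(1).split(".")) if match else 0


def split_level(lines: Sequence[str], level: int) -> Dict[str, Any]:
    """Split the content corresponding to a title into the text subordinate
    to that title ("text") and the next-level logical pages ("children")."""
    own: List[str] = []
    children: List[Tuple[str, List[str]]] = []
    for line in lines:
        if heading_level(line) == level:
            children.append((line, []))      # S321: a level-(N+1) title
        elif children:
            children[-1][1].append(line)     # S322: content of that title
        else:
            own.append(line)                 # text subordinate to the parent
    return {
        "text": own,
        # S323 plus recursion: each child is split again one level deeper,
        # and the recursion stops when no deeper title is found.
        "children": [{"title": title, **split_level(body, level + 1)}
                     for title, body in children],
    }


# Example: the content corresponding to a first-level title.
content = ["intro", "1.1 A", "a1", "1.1.1 AA", "aa1", "1.2 B", "b1"]
tree = split_level(content, 2)
```

Here "intro" is the text subordinate to the first-level title, while the content corresponding to that title is the whole list, which matches the containment relationship noted above.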
It should be noted that if multiple first-level logical pages are extracted from the actual pages, and no next-level title exists in some of the first-level logical pages while a next-level title exists in others, then for the first-level logical pages in which no next-level title exists, the structured storage step is that of the first embodiment, and for the first-level logical pages in which a next-level title exists, the structured storage step is that of the second embodiment. The structured information of the PDF document finally obtained contains the structured storage results of both embodiments.
For PDF documents in which some level-N logical pages contain a table and the table contains titles, such as the PDF document shown in Fig. 19, please refer to Fig. 7 and Fig. 19. Fig. 7 is a flowchart of S300-S400 in the third embodiment, and Fig. 19 is a schematic diagram of the table cutting in step S320a in the third embodiment. In the third embodiment, in the foregoing method for extracting structured information from a PDF document, step S320 includes:
S320a: determine whether a table exists in each level-N logical page and, if a table exists, cut the table into table blocks, then extract the level-(N+1) titles and the text content subordinate to each level-(N+1) title.
Specifically, in step S320a, the step of "determining whether a table exists in each level-N logical page and, if a table exists, cutting the table into table blocks" may include:
S320a1: determine, according to a second preset rule, whether the level-N logical page contains a table. The second preset rule includes: if a line in the content corresponding to the level-N title contains at least two consecutive spaces, and the spaces are at the same position in at least three consecutive lines, determine that a table exists in the current level-N logical page, take the first line in which at least two consecutive spaces occur as the start row of the table, and take the last line in which at least two consecutive spaces occur as the end row of the table;
S320a2: take the positions of the at least two consecutive spaces in the table as the vertical cutting lines of the table and the blank lines in the table as the horizontal cutting lines, and cut the table into table blocks;
S320a3: obtain the content of the table blocks in order from left to right and from top to bottom and, together with the content of the current level-N logical page other than the table, use it as the content corresponding to the level-N title in the current level-N logical page.
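Steps S320a1 and S320a2 can be sketched on plain line text as below. Several simplifications are assumed and are not part of the description: the "position" of a gap is taken to be the column where the text after the gap begins, rows are split on runs of two or more spaces rather than sliced at the exact cutting positions, and all names are hypothetical.

```python
import re
from typing import List, Optional, Sequence, Tuple

GAP = re.compile(r" {2,}")  # at least two consecutive spaces


def column_starts(line: str) -> Tuple[int, ...]:
    """Columns at which text resumes after a gap (assumed gap 'position')."""
    return tuple(m.end() for m in GAP.finditer(line.rstrip()))


def find_table(lines: Sequence[str]) -> Optional[Tuple[int, int]]:
    """S320a1: a table exists if three consecutive lines share gap positions;
    it spans from the first to the last line that contains a gap."""
    starts = [column_starts(line) for line in lines]
    found = any(starts[i] and starts[i] == starts[i + 1] == starts[i + 2]
                for i in range(len(lines) - 2))
    if not found:
        return None
    rows = [i for i, s in enumerate(starts) if s]
    return rows[0], rows[-1]


def cut_table(table_lines: Sequence[str]) -> List[str]:
    """S320a2: blank lines cut horizontally, gaps cut vertically; blocks are
    returned left to right within a band, bands top to bottom."""
    blocks: List[str] = []
    band: List[List[str]] = []
    for line in list(table_lines) + [""]:  # trailing "" flushes the last band
        if line.strip():
            band.append(GAP.split(line.strip()))
            continue
        if band:
            for col in range(max(len(row) for row in band)):
                text = " ".join(row[col] for row in band
                                if col < len(row) and row[col])
                if text:
                    blocks.append(text)
            band = []
    return blocks


def row(left: str, right: str) -> str:
    """Helper for the example: pad the left cell so the columns align."""
    return left.ljust(12) + right


page_lines = ["Intro", row("Item", "Fee"), row("1.1 Basic", "10"),
              row("1.2 Extra", "20"), "", row("1.3 Other", "30"), "Outro"]
span = find_table(page_lines)
blocks = cut_table(page_lines[span[0]:span[1] + 1]) if span else []
```

For S320a3, the blocks would then be substituted for the table lines in the level-N logical page before level-(N+1) titles are extracted.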
In step S320a, the table in the level-N logical page is cut to obtain the content of the table, which replaces the original table. The content corresponding to the level-N title in the original level-N logical page is thereby updated, and a new level-N logical page is formed to replace the original level-N logical page. In the subsequent steps, that is, steps S321 to S323 of extracting the level-(N+1) titles and the text content subordinate to each level-(N+1) title from the level-N logical page, the level-N logical page refers to the new level-N logical page.
It should be noted that, when a PDF document is processed, some level-N logical pages may contain tables while others do not. In this case, for level-N logical pages that contain no table, step S320 of the second embodiment is used to extract the level-(N+1) titles and the text content subordinate to each level-(N+1) title; for level-N logical pages that contain a table with titles, step S320a of the third embodiment is used to extract the level-(N+1) titles and the text content subordinate to each level-(N+1) title.
Referring to Fig. 20, in another embodiment, a device for extracting structured information from a PDF document is further provided, including:
an acquiring unit 1, configured to obtain the original pages of the PDF document;
a first extraction unit 2, configured to extract, from the original pages, at least one actual page containing text content or titles;
a second extraction unit 3, configured to extract, from the actual page, the titles of each level and the text content subordinate to each title; and
a storage unit 4, configured to store each title and the text content subordinate to it in a structured manner.
Optionally, the first extraction unit 2 includes:
a judging unit 21, configured to determine whether the original pages contain a table-of-contents page, a header and a footer, respectively; and
a deleting unit 22, configured to delete the table-of-contents page, header or footer from the original pages to obtain at least one actual page.
The above device for extracting structured information from a PDF document can automatically extract the structured information of a PDF document, which avoids manual processing and is convenient and efficient. The first extraction unit 2 deletes the table-of-contents pages, headers and footers that would affect the extraction of the structured information of the PDF document, which further ensures the accuracy of the structured information extraction.
Optionally, the second extraction unit 3 includes:
a first-level title extraction unit, configured to extract the first-level titles in each actual page;
a first-level content extraction unit, configured to extract the content between the current first-level title and the next first-level title in an actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, the content after the current first-level title in that actual page is extracted as the content corresponding to the current first-level title;
a level-1 logical page generation unit, configured to take each first-level title, together with the content corresponding to that first-level title, as one level-1 logical page.
The storage unit 4 includes a first-level storage unit, configured to, when no next-level title exists in a level-1 logical page, store each first-level title and the text content belonging to that first-level title in a structured manner, where the text content belonging to the first-level title is the content corresponding to that first-level title.
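As a non-limiting illustration, the splitting performed by the first-level content extraction unit and the level-1 logical page generation unit may be sketched as follows. The function name `split_level_one` and the regular expression used to recognise a first-level title (a leading integer followed by a space, such as "1 Scope") are assumptions made only for this sketch; the embodiment does not prescribe them.

```python
import re
from typing import List, Tuple

# Assumption for this sketch: a first-level title is a line such as "1 Scope".
# Lines such as "1.1 Terms" do not match, because "." follows the first integer.
FIRST_LEVEL = re.compile(r"^\d+\s+\S")


def split_level_one(lines: List[str]) -> List[Tuple[str, List[str]]]:
    """Return one (title, content_lines) pair per level-1 logical page."""
    title_idx = [i for i, line in enumerate(lines) if FIRST_LEVEL.match(line)]
    pages = []
    for k, start in enumerate(title_idx):
        # Content runs up to the next first-level title; for the last
        # title it runs to the end of the actual page.
        end = title_idx[k + 1] if k + 1 < len(title_idx) else len(lines)
        pages.append((lines[start], lines[start + 1:end]))
    return pages


if __name__ == "__main__":
    page = ["1 Scope", "text a", "text b", "2 Terms", "text c"]
    for title, content in split_level_one(page):
        print(title, content)
```

Each returned pair corresponds to one level-1 logical page, which is also the unit that the first-level storage unit would store when no next-level title is present.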
Optionally, the second extraction unit 3 further includes a combining unit, which is connected to the first-level content extraction unit and to the level-1 logical page generation unit, respectively. If the current actual page contains no first-level title, the combining unit merges all content of the current actual page into the content corresponding to the previous first-level title; or, if the first first-level title in the current actual page is not on the first line of that page, the combining unit merges the content preceding that first first-level title into the content corresponding to the previous first-level title.
When level-1 logical pages are extracted from the actual pages, extraction proceeds page by page in the original order of the actual pages, so the following situation may arise: content that should belong to the same first-level title is split across two consecutive actual pages. The above combining unit merges this part of the content into the content corresponding to the previous first-level title, which ensures that each level-1 logical page contains complete content and overcomes the inability of common PDF document information acquisition methods to re-aggregate content split by pagination.
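The merging behaviour described above may be sketched, by way of example only, as a single pass over the actual pages in their original order. The names `merge_across_pages` and `starts_with_digit` are hypothetical, the digit-based title test is an assumption, and the handling of content that precedes the very first title of the document (skipped here) is not specified by the embodiment.

```python
from typing import Callable, List, Tuple


def starts_with_digit(line: str) -> bool:
    # Assumption for this sketch: first-level titles begin with a digit.
    return line[:1].isdigit()


def merge_across_pages(
    pages: List[List[str]],
    is_title: Callable[[str], bool] = starts_with_digit,
) -> List[Tuple[str, List[str]]]:
    """Build level-1 logical pages, carrying content over page breaks."""
    logical: List[Tuple[str, List[str]]] = []
    for lines in pages:  # actual pages in their original order
        for line in lines:
            if is_title(line):
                logical.append((line, []))
            elif logical:
                # Covers both cases: a page without any first-level title,
                # and content that precedes the first title on a page.
                logical[-1][1].append(line)
            # Content before the first title of the whole document is
            # skipped in this sketch.
    return logical


if __name__ == "__main__":
    pages = [["1 A", "x"], ["y", "2 B", "z"], ["w"]]
    print(merge_across_pages(pages))
```

In this example "y" sits at the top of the second page and "w" fills a page with no title, and both are attached to the preceding first-level title.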
Optionally, the second extraction unit 3 further includes an N-level extraction unit, configured to extract, from each level-N logical page, the level-(N+1) titles and the text content belonging to the level-(N+1) titles, where N is an integer greater than or equal to 1. The N-level extraction unit runs only when level-(N+1) titles exist in the level-N logical page; when no level-(N+1) title exists in the level-N logical page, the N-level extraction unit stops running.
Optionally, the N-level extraction unit includes:
a level-(N+1) title extraction unit, configured to extract the level-(N+1) titles in each level-N logical page;
a level-(N+1) content extraction unit, configured to extract the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, the content after the current level-(N+1) title in that level-N logical page is extracted as the content corresponding to the current level-(N+1) title;
a level-(N+1) logical page generation unit, configured to take each level-(N+1) title, together with the content corresponding to that level-(N+1) title, as one level-(N+1) logical page.
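The level-by-level extraction may be illustrated, purely as a sketch, by a recursive function that stops as soon as a logical page contains no title of the next level. The dictionary layout, the function names and the numbering convention used to recognise titles (a level-k title carries k dot-separated integers, such as "2.1.3 Name") are assumptions introduced for this example only.

```python
import re
from typing import Any, Dict, List


def title_pattern(level: int) -> "re.Pattern[str]":
    # Assumption: a level-k title is numbered with k dot-separated integers.
    return re.compile(r"^\d+(\.\d+){%d}\s+\S" % (level - 1))


def extract_logical_page(title: str, lines: List[str], level: int) -> Dict[str, Any]:
    """Build the level-N logical page and, recursively, its sub-pages."""
    pattern = title_pattern(level + 1)
    idx = [i for i, line in enumerate(lines) if pattern.match(line)]
    node: Dict[str, Any] = {
        "title": title,
        "level": level,
        "content": lines,   # content corresponding to this title
        "children": [],     # level-(N+1) logical pages
    }
    # If idx is empty, no level-(N+1) title exists and the recursion stops.
    for k, start in enumerate(idx):
        end = idx[k + 1] if k + 1 < len(idx) else len(lines)
        child = extract_logical_page(lines[start], lines[start + 1:end], level + 1)
        node["children"].append(child)
    return node


if __name__ == "__main__":
    body = ["intro", "1.1 A", "a1", "1.1.1 Deep", "d", "1.2 B", "b1"]
    tree = extract_logical_page("1 Top", body, 1)
    print([c["title"] for c in tree["children"]])
```

The recursion depth is driven by the document itself, so pages with different nesting depths can be handled by the same routine.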
The storage unit 4 further includes an N-level storage unit, configured to store, in a structured manner, the titles of levels 1 to N+1 and the text content belonging to each of those titles, where the text content belonging to a level-(N+1) title is the content corresponding to that level-(N+1) title, and the text content belonging to a level-i title is the content corresponding to that level-i title excluding the level-(i+1) logical pages, i = 1, 2, ..., N. The N-level storage unit runs only when a next-level title exists in the level-1 logical page; if no next-level title exists in the level-1 logical page, the first-level storage unit runs.
It should be noted that the second extraction unit may extract multiple level-1 logical pages from the actual pages, some of which contain no next-level title while others do. For a level-1 logical page without a next-level title, structured storage uses the first-level storage unit; for a level-1 logical page with a next-level title, structured storage uses the N-level storage unit. When one PDF document is processed, both storage units may be used, or only one of them.
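The two storage cases can be sketched with one routine, under stated assumptions. Here each logical page is assumed to be a dictionary with "title", "content" (a list of lines) and "children" (the next-level logical pages, which occupy the tail of the content), and JSON is chosen as the storage format; the embodiment only requires structured storage and does not name a format. The function name `to_record` is hypothetical.

```python
import json
from typing import Any, Dict


def to_record(node: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a logical page into a nested record for structured storage."""
    # Each child logical page occupies its title line plus its content lines.
    child_len = sum(1 + len(c["content"]) for c in node["children"])
    # Text belonging to this title: its content minus the next-level logical
    # pages. With no children, the whole content belongs to the title.
    own = node["content"][: len(node["content"]) - child_len]
    return {
        "title": node["title"],
        "text": "\n".join(own),
        "children": [to_record(c) for c in node["children"]],
    }


if __name__ == "__main__":
    page = {
        "title": "1 Top",
        "content": ["intro", "1.1 A", "a1", "1.2 B", "b1"],
        "children": [
            {"title": "1.1 A", "content": ["a1"], "children": []},
            {"title": "1.2 B", "content": ["b1"], "children": []},
        ],
    }
    print(json.dumps(to_record(page), indent=2))
```

A logical page without children follows the first-level storage case, and a page with children follows the N-level storage case, so a single document may exercise both paths.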
Optionally, the second extraction unit 3 further includes a table cutting and acquiring unit, configured to determine whether a table exists in each level-N logical page and, if a table exists, to cut the table into table blocks and extract the level-(N+1) titles and the text content belonging to those level-(N+1) titles. When, in a level-N logical page, the content corresponding to the level-N title includes a table and the level-(N+1) titles are located in that table, the table cutting and acquiring unit may be used to cut the table directly and then extract the level-(N+1) titles and the text content belonging to them. The table cutting and acquiring unit can sometimes be used alone in place of the N-level extraction unit, and sometimes needs to be used together with the N-level extraction unit.
Optionally, the first-level title extraction unit may include:
a title line acquiring unit, configured to obtain the title lines in an actual page and the Y-axis coordinate of each title line in that actual page;
a title line combining unit, configured to merge the next title line with the current title line when the difference between the Y-axis coordinates of the current title line and the next title line in the same actual page is less than 3 Y-axis units;
a first-level title acquiring unit, configured to obtain the text content of the row above the title line that is nearest to it, as a first-level title in the actual page.
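The title-line handling may be sketched as follows, as an example only. The threshold of 3 Y-axis units comes from the description above; the function names, the representation of text rows as (y, text) pairs and the assumption that Y grows downward from the top of the page (so that "above" means a smaller Y value) are choices made for this sketch.

```python
from typing import List, Optional, Tuple

Y_MERGE_THRESHOLD = 3  # title lines closer than 3 Y-axis units are merged


def merge_title_lines(ys: List[float]) -> List[float]:
    """Merge title lines whose Y coordinates differ by less than 3 units."""
    merged: List[float] = []
    for y in sorted(ys):
        if merged and y - merged[-1] < Y_MERGE_THRESHOLD:
            continue  # the next line is merged into the current one
        merged.append(y)
    return merged


def title_above(line_y: float, rows: List[Tuple[float, str]]) -> Optional[str]:
    """Return the text of the nearest row above a title line, if any.

    Assumption: Y grows downward, so rows above the line have a smaller Y.
    """
    above = [row for row in rows if row[0] < line_y]
    if not above:
        return None
    return max(above, key=lambda row: row[0])[1]


if __name__ == "__main__":
    rows = [(40.0, "page header"), (80.0, "1 Scope"), (120.0, "body text")]
    for y in merge_title_lines([100.0, 101.5, 300.0]):
        print(y, title_above(y, rows))
```

If the extraction library reports coordinates with the origin at the bottom of the page, the comparison in `title_above` would need to be reversed.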
Those skilled in the art will clearly understand that the technology in the embodiments of the present invention may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.
For identical or similar parts among the embodiments in this specification, reference may be made to one another. The embodiments of the invention described above do not limit the scope of protection of the invention.