CN1320481C - Method for conducting title and text logic connection for newspaper pages - Google Patents

Method for conducting title and text logic connection for newspaper pages

Info

Publication number
CN1320481C
Authority
CN
China
Prior art keywords
text
title
chapter
word set
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2004100914324A
Other languages
Chinese (zh)
Other versions
CN1604073A (en)
Inventor
贾娟
陈晓鸥
陈堃銶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING, Peking University
Priority to CNB2004100914324A
Publication of CN1604073A
Application granted
Publication of CN1320481C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The present invention belongs to intelligent text and graphic information processing technology, and specifically relates to a method for logically associating titles with body text on a newspaper page. Existing layout-understanding techniques classify the logical objects of a page using style information alone and do not extract the semantic structure of newspaper pages that carry multiple articles and multiple titles. To address this defect, the invention for the first time uses graph theory to build a mathematical model: a bipartite-graph matching model describes the one-to-one correspondence, at matching granularity, between the set of non-body-text regions and the set of body-text regions, and a weighted bipartite graph is built from their spatial relationships. Natural language processing is used, also for the first time, to compute the edge weights of the bipartite graph, and the paired saturated vertices of the optimal matching are taken as a successfully associated title and body text. The invention thus combines the Kuhn-Munkres optimal matching algorithm with artificial intelligence to solve the title-text logical association problem. It achieves high matching accuracy and can be used for structuring historical data and for metadata extraction.

Description

A method for logically associating titles with body text on a newspaper page
Technical field
The invention belongs to intelligent text and graphic information processing technology, and specifically relates to a method for logically associating titles with body text on a newspaper page.
Background art
Newspaper titles play an important role in content management systems, for example in classification and retrieval, and both Dublin Core and NewsML treat the title as an important item of metadata. In cross-media publishing in particular, the title is a key element of the metadata and of the XML information structure, so the correctness of its logical association with the body text directly affects the reuse and further processing of information in a digital asset management system, such as retrieval, publishing, and hyperlinking. Logical association means classifying each text block tiled on the two-dimensional space of the newspaper page by its semantic function as title, body text, header, speech, and so on, and then associating the title and body text that express the same news item as one structural unit. As a traditional media format, the newspaper differs from books and magazines in that its information is densely packed: several articles are typeset on a single page. To improve readability, each article has a title that summarizes its content. In position, the title is embedded in the article region or adjacent to it; in appearance, the title is made eye-catching by features such as spanning columns and bold, enlarged type. However, in newspaper layouts on every kind of carrier (paper, typesetting software, PDF), the article body has no intrinsic structural association with its title; the blocks are merely laid out side by side on the page. Title position is arbitrary, font size is not fixed, horizontal or vertical orientation is not fixed, and one title may be positioned next to several body-text blocks, so deciding which body text a title matches is ambiguous. Other title-like blocks such as headers and speech share the title's style, so style information alone cannot classify text blocks correctly.
Moreover, people associate body text with titles through visual reasoning and semantic logic, but a computer cannot directly "understand" this structural association from the raw information. Because the volume of historical newspaper assets is huge, manually assisted methods are both time-consuming and too costly. There is therefore an urgent need for computers to perform the logical association of newspaper titles and body text automatically and intelligently during layout understanding and structural reconstruction.
Associating titles with body text has to alternate with the logical classification of text blocks: the text blocks are first coarsely classified into non-body-text blocks and body-text blocks, logical association is then performed, and the matching result is finally used to decide which non-body-text blocks are real titles. At present, however, the logical classification of titles is always carried out independently using style information alone, as in "Document page similarity based on layout visual saliency: Application to query by example and document classification" (Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003, pp. 1208-1212). The table-of-contents (TOC) extraction method of "Automated Detection and Segmentation of Table of Contents Page from Document Images" (S. Mandal, S. P. Chowdhury, and A. K. Das, Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003, pp. 398-402) suits only book layouts and cannot cope with complex newspaper layouts. The rule-based matching model of "Layout analysis, understanding and reconstruction of complex Chinese newspapers" (Chen Ming, Ding Xiaoqing, and Liang Jian, Journal of Tsinghua University (Science and Technology), 2001, vol. 41, no. 1, pp. 29-32, 59) handles only common cases with regular regions. When a body-text region is irregular in shape, or the positional relationship between title and body text is complex, any situation the matching model does not describe cannot be matched correctly; in addition, when one title is adjacent to several articles, the resulting ambiguity leads to wrong matches. The prior art lacks a unified mathematical model for quantitatively evaluating the quality of a matching as a whole, and none of it considers semantic information; style and position information alone are not enough to handle complex newspaper layouts. In layout reconstruction, logically associating a title with its body text is the inverse of writing a title for the text when the page is created, so headline-generation methods from natural language processing, such as "Description of the UAM system for generating very short summaries at DUC-2004" (Enrique Alfonseca, Jose Maria Guirao, Antonio Moreno-Sandoval, Document Understanding Conference 2004), are worth drawing on.
Summary of the invention
In view of the unsatisfactory title-matching results of the prior art on newspaper layouts, the object of the invention is to provide a method for logically associating titles with body text on a newspaper page. The method extracts the article structure of a newspaper page and substantially improves title matching.
To achieve this object, the invention adopts the following technical solution: a method for logically associating titles with body text on a newspaper page, comprising the following steps:
(1) Read in the newspaper document after layout analysis. Classify each text block as a body-text block or a non-body-text block according to its font style and the number of lines in the block, and divide the body-text blocks into several article regions with independent content according to reading order and block style;
(2) Build a weighted bipartite graph. The two vertex sets of the bipartite graph contain all non-body-text blocks and all article regions respectively, and each edge corresponds to an adjacency relation between a non-body-text block and an article region in the two-dimensional space of the page;
(3) Determine the edge weights of the bipartite graph with natural language processing, from the semantics of the non-body-text block content and the article-region content at the two vertices of each edge. The method exploits the fact that a title summarizes the topic of the article. Lexical analysis of the text in the body-text blocks yields word set a, with m distinct words, and the dispersion and co-reference degree of each word in a are computed: the dispersion is the distance between the sentences in which the word first and last occurs in the article text, and the co-reference degree is the number of times the word occurs in the article. Lexical analysis of the text in the non-body-text block likewise yields word set b, with n distinct words, and the relative dispersion and relative co-reference degree of each word in b are computed with respect to the article text: the relative dispersion is the distance between the sentences in which the word first and last occurs in the article text, and the relative co-reference degree is the number of times the word occurs in the article. The sum of the n largest dispersions in a is the total dispersion of a; the sum of the n largest co-reference degrees in a is the total co-reference degree of a; the sum of all relative dispersions in b is the total relative dispersion of b; and the sum of all relative co-reference degrees in b is the total relative co-reference degree of b. The dispersion coefficient is the total relative dispersion of b divided by the total dispersion of a, and the co-reference coefficient is the total relative co-reference degree of b divided by the total co-reference degree of a. The word coverage of the title over the article text is the number of words of b that occur in the article text divided by the number of all words in b. The edge weight is a linear weighted combination of the dispersion coefficient, the co-reference coefficient, and the word coverage;
(4) Apply the Kuhn-Munkres algorithm to the weighted bipartite graph to obtain an optimal matching. The content of each saturated vertex in the non-body-text vertex set of the optimal matching is a title, and the saturated vertex in the article-region vertex set joined to it by a matched edge is the article body logically associated with that title. The two are output as the title and body-text items of the article's XML structure.
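The edge weight of step (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: regex tokenization stands in for real lexical analysis (Chinese text would need a word segmenter), sentence distance is measured as the difference of sentence indices, and the equal weights in `alphas` are hypothetical because the text does not give the coefficients of the linear combination. The function names are likewise invented for this sketch.

```python
import re
from typing import Dict, List, Tuple


def split_sentences(text: str) -> List[List[str]]:
    """Split text into sentences, then each sentence into lowercase tokens.
    Assumption: regex tokenization replaces proper lexical analysis."""
    sentences = [s for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    return [re.findall(r"\w+", s.lower()) for s in sentences]


def word_stats(sentences: List[List[str]]) -> Dict[str, Tuple[int, int]]:
    """Map each word of the article to (dispersion, co-reference degree).
    dispersion = index of last sentence with the word - index of the first;
    co-reference degree = number of occurrences in the article."""
    first: Dict[str, int] = {}
    last: Dict[str, int] = {}
    count: Dict[str, int] = {}
    for idx, words in enumerate(sentences):
        for w in words:
            first.setdefault(w, idx)
            last[w] = idx
            count[w] = count.get(w, 0) + 1
    return {w: (last[w] - first[w], count[w]) for w in count}


def edge_weight(candidate_title: str, article_text: str,
                alphas: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Linear combination of dispersion coefficient, co-reference coefficient
    and word coverage. The values in alphas are illustrative assumptions."""
    stats = word_stats(split_sentences(article_text))            # word set a
    title_words = {w for s in split_sentences(candidate_title) for w in s}  # word set b
    n = len(title_words)
    if n == 0 or not stats:
        return 0.0
    # Totals for word set a: sums of the n largest values.
    total_disp = sum(sorted((d for d, _ in stats.values()), reverse=True)[:n])
    total_coref = sum(sorted((c for _, c in stats.values()), reverse=True)[:n])
    # Totals for word set b, measured inside the article text.
    rel_disp = sum(stats[w][0] for w in title_words if w in stats)
    rel_coref = sum(stats[w][1] for w in title_words if w in stats)
    covered = sum(1 for w in title_words if w in stats)
    disp_coef = rel_disp / total_disp if total_disp else 0.0
    coref_coef = rel_coref / total_coref if total_coref else 0.0
    coverage = covered / n
    a1, a2, a3 = alphas
    return a1 * disp_coef + a2 * coref_coef + a3 * coverage
```

Because the totals for word set a use its n largest values, each coefficient lies between 0 and 1, so with weights that sum to 1 the edge weight also stays in that range. A candidate block that shares no words with the article gets weight 0.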
Logical association, as used above, means classifying each text block tiled on the two-dimensional space of the newspaper page by its semantic function as title, body text, header, speech, and so on, and then associating the title and body text that express the same news item as one structural unit. In performing this association, the theory, algorithms, and results of bipartite graphs from graph theory are applied to a measure of how well the content of one text block summarizes and covers that of another. Specifically, the Kuhn-Munkres optimal matching algorithm from graph theory is used for content-based association of titles with body text.
The effect of the invention is as follows. With the method of the invention, an information processing apparatus can effectively extract the article structure of a newspaper page, and the matching of body text to titles on the page is substantially improved. By modeling the problem and simulating human reasoning, the method achieves high matching accuracy and is widely applicable to historical-data structuring and metadata extraction in digital asset management systems.
The invention has this effect because it addresses the complexity of text regions on a newspaper page and the variety of positional relationships between text blocks, and proposes a new method for logically associating titles with body text. It uses a bipartite-graph matching model to describe precisely the one-to-one correspondence between titles and body text at matching granularity. It uses style information to classify the text blocks of the page into a non-body-text set and a body-text set, and builds an initial bipartite graph from the spatial relationships between the elements of the two sets. In particular, it applies natural language processing for the first time, taking both extractive and summarizing types of title into account: the title's semantic summary coverage of the body text, computed from the length and dispersion of co-reference word chains, serves as the criterion for logical association between a non-body-text block and a body-text block and as the edge weight of the weighted bipartite graph. After optimal matching, the edges joining saturated vertices are the title-text associations.
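One way to picture the initial bipartite graph built from spatial relationships is the Python sketch below. It is an illustration only: the text does not define adjacency numerically, so the bounding-box representation, the `gap` tolerance, and the names `adjacent` and `build_edges` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical block geometry: (left, top, right, bottom) in page units.
Box = Tuple[float, float, float, float]


def adjacent(a: Box, b: Box, gap: float = 10.0) -> bool:
    """Assumed adjacency test: the boxes overlap along one axis and are
    at most `gap` apart along the other. Overlapping boxes (a title
    embedded in an article region) also count as adjacent."""
    h_gap = max(b[0] - a[2], a[0] - b[2], 0.0)  # horizontal distance
    v_gap = max(b[1] - a[3], a[1] - b[3], 0.0)  # vertical distance
    return (h_gap == 0.0 and v_gap <= gap) or (v_gap == 0.0 and h_gap <= gap)


def build_edges(non_body: Dict[str, Box],
                articles: Dict[str, List[Box]],
                gap: float = 10.0) -> List[Tuple[str, str]]:
    """Return the edges (non-body block id, article region id) of the
    bipartite graph. An article region is a list of body-text boxes;
    an edge exists if the block is adjacent to any of them."""
    edges = []
    for block_id, block_box in non_body.items():
        for region_id, boxes in articles.items():
            if any(adjacent(block_box, box, gap) for box in boxes):
                edges.append((block_id, region_id))
    return edges
```

Only pairs joined by such an edge need a semantic weight; non-adjacent pairs can be given weight 0 before matching.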
Description of the drawings
Fig. 1 is a flow chart of the invention;
Fig. 2 is a schematic diagram of a newspaper page after layout analysis and classification;
Fig. 3 is a schematic diagram of the newspaper page with article regions after the reading order has been recovered;
Fig. 4 is a schematic diagram of the bipartite graph generated from the adjacency relations between non-body-text blocks and article regions;
Fig. 5 is a schematic diagram of the result of the Kuhn-Munkres optimal matching algorithm.
Detailed description of the embodiments
The invention is further described below with reference to the drawings. The flow of the invention is shown in Fig. 1:
(1) Read in the newspaper document after layout analysis. Newspaper documents include documents obtained by scanning paper newspapers and applying OCR, PDF files, and documents generated by professional typesetting software such as Founder FeiTeng. Layout analysis divides the page bottom-up into block regions and physically classifies them as text blocks and image blocks. Each text block is classified as a body-text block or a non-body-text block according to its font style and the number of lines in the block; in Fig. 2, solid rectangles denote body-text blocks and dashed rectangles denote non-body-text blocks. The adjacency relations among body-text blocks are represented as a directed graph, which is split and converted into a weighted bipartite graph; natural language processing is used to compute its edge weights, and optimal matching yields several continuous sequences. Each sequence is further divided into subsequences according to text-block style information. The merged region of each subsequence is an independent article region, and the text stream formed by concatenating the corresponding content is the content of that article region. In Fig. 3, arrows indicate the reading order of the body-text blocks, each continuous sequence of arrows forms an article region, circled numbers are the numbers of article regions, and plain numbers are the numbers of non-body-text blocks;
(2) Build a weighted bipartite graph. The two vertex sets of the bipartite graph contain all non-body-text blocks and all article regions respectively, and each edge corresponds to an adjacency relation between a non-body-text block and an article region in the two-dimensional space of the page. As shown in Fig. 4, the left vertex set represents the non-body-text blocks and the right vertex set represents the article regions;
(3) Determine the edge weights of the bipartite graph with natural language processing, from the semantics of the non-body-text block content and the article-region content at the two vertices of each edge. The method exploits the fact that a title summarizes the topic of the article. Lexical analysis of the text in the body-text blocks yields word set a, with m distinct words, and the dispersion and co-reference degree of each word in a are computed: the dispersion is the distance between the sentences in which the word first and last occurs in the article text, and the co-reference degree is the number of times the word occurs in the article. Lexical analysis of the text in the non-body-text block likewise yields word set b, with n distinct words, and the relative dispersion and relative co-reference degree of each word in b are computed with respect to the article text: the relative dispersion is the distance between the sentences in which the word first and last occurs in the article text, and the relative co-reference degree is the number of times the word occurs in the article. The sum of the n largest dispersions in a is the total dispersion of a; the sum of the n largest co-reference degrees in a is the total co-reference degree of a; the sum of all relative dispersions in b is the total relative dispersion of b; and the sum of all relative co-reference degrees in b is the total relative co-reference degree of b. The dispersion coefficient is the total relative dispersion of b divided by the total dispersion of a, and the co-reference coefficient is the total relative co-reference degree of b divided by the total co-reference degree of a. The word coverage of the title over the article text is the number of words of b that occur in the article text divided by the number of all words in b. The edge weight is a linear weighted combination of the dispersion coefficient, the co-reference coefficient, and the word coverage;
(4) Apply the Kuhn-Munkres algorithm to the weighted bipartite graph to obtain an optimal matching. The content of each saturated vertex in the non-body-text vertex set of the optimal matching is a title, and the saturated vertex in the article-region vertex set joined to it by a matched edge is the article body logically associated with that title. As shown in Fig. 5, each left vertex joined by an edge represents a title and the right vertex represents the article body logically associated with it; for example, title 6 and body text 7 are components of the same news item. The two are output as the title and body-text items of the article's XML structure. A text block corresponding to an unsaturated vertex of the optimal matching is neither a title nor body text; it is content of another type on the page, such as a header or speech. The method therefore solves the logical classification of page objects and completes the logical association of titles and body text at the same time. The Kuhn-Munkres algorithm for computing the optimal matching is as follows:
1) Give the initial labeling $l(x_i) = \max_j \omega_{ij}$, $l(y_j) = 0$, $i, j = 1, 2, \ldots, t$, $t = \max(n, m)$;
2) Obtain the edge set $E_l = \{(x_i, y_j) \mid l(x_i) + l(y_j) = \omega_{ij}\}$, the graph $G_l = (X, Y_k, E_l)$, and a matching $M$ in $G_l$;
3) If $M$ saturates all nodes of $X$, then $M$ is the optimal matching of $G$ and the computation ends; otherwise go to the next step;
4) Find an $M$-unsaturated node $x_0$ in $X$ and let $A \leftarrow \{x_0\}$, $B \leftarrow \varnothing$, where $A$ and $B$ are two sets;
5) If $N_{G_l}(A) = B$, go to step 9); otherwise go to the next step, where $N_{G_l}(A) \subseteq Y_k$ is the set of nodes adjacent to the nodes in $A$;
6) Find a node $y \in N_{G_l}(A) - B$;
7) If $y$ is $M$-saturated, find the match $z$ of $y$, let $A \leftarrow A \cup \{z\}$, $B \leftarrow B \cup \{y\}$, and go to step 5); otherwise go to the next step;
8) There is an augmenting path $P$ from $x_0$ to $y$; let $M \leftarrow M \oplus E(P)$ and go to step 3);
9) Compute the value $a = \min_{x_i \in A,\ y_j \notin N_{G_l}(A)} \{ l(x_i) + l(y_j) - \omega_{ij} \}$ and revise the labeling:
[Figure C20041009143200091: label-update formula]
Obtain $E_{l'}$ and $G_{l'}$ from $l'$;
10) Let $l \leftarrow l'$, $G_l \leftarrow G_{l'}$, and go to step 6).
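The matching procedure above can be sketched in Python as shown below. This is a compact, slack-based variant of the Kuhn-Munkres algorithm, not a line-by-line transcription of steps 1) to 10). Two points are assumptions of the sketch: the label update is the textbook one (subtract a from the labels of nodes in A, add a to the labels of nodes in B), since the patent's own update formula is an image not available in this text; and the rectangular weight matrix is padded with zero-weight entries to size t = max(n, m), with pairs matched through a zero weight treated as unsaturated. The names `kuhn_munkres` and `associate` are hypothetical.

```python
from math import inf
from typing import Dict, List

EPS = 1e-9  # tolerance for treating an edge as tight (label sum equals weight)


def kuhn_munkres(w: List[List[float]]) -> List[int]:
    """Maximum-weight perfect matching on a square matrix w.
    Returns match_y, where match_y[j] is the row matched to column j."""
    t = len(w)
    lx = [max(row) for row in w]  # initial labels l(x_i) = max_j w_ij
    ly = [0.0] * t                # initial labels l(y_j) = 0
    match_y = [-1] * t

    def augment(i: int, in_a: List[bool], in_b: List[bool],
                slack: List[float]) -> bool:
        """Depth-first search for an augmenting path over tight edges."""
        in_a[i] = True
        for j in range(t):
            if in_b[j]:
                continue
            gap = lx[i] + ly[j] - w[i][j]
            if gap < EPS:  # edge (x_i, y_j) belongs to the equality subgraph
                in_b[j] = True
                if match_y[j] == -1 or augment(match_y[j], in_a, in_b, slack):
                    match_y[j] = i
                    return True
            elif gap < slack[j]:
                slack[j] = gap
        return False

    for x0 in range(t):
        slack = [inf] * t
        while True:
            in_a = [False] * t
            in_b = [False] * t
            if augment(x0, in_a, in_b, slack):
                break
            # No augmenting path: relabel by the smallest slack outside B.
            a = min(slack[j] for j in range(t) if not in_b[j])
            for i in range(t):
                if in_a[i]:
                    lx[i] -= a
            for j in range(t):
                if in_b[j]:
                    ly[j] += a
                else:
                    slack[j] -= a
    return match_y


def associate(weights: List[List[float]]) -> Dict[int, int]:
    """weights[i][j] is the edge weight between non-body block i and
    article region j (0 where there is no edge). Returns {block: region}
    for matched pairs with positive weight; other blocks stay unmatched."""
    n = len(weights)
    m = len(weights[0]) if weights else 0
    t = max(n, m)
    padded = [[weights[i][j] if i < n and j < m else 0.0 for j in range(t)]
              for i in range(t)]
    match_y = kuhn_munkres(padded)
    return {i: j for j, i in enumerate(match_y)
            if i < n and j < m and weights[i][j] > 0}
```

For example, `associate([[9, 2], [3, 8], [4, 1]])` pairs block 0 with region 0 and block 1 with region 1, and leaves block 2 unmatched, which corresponds to a non-title block such as a header in the description above.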

Claims (3)

1. A method for logically associating titles with body text on a newspaper page, comprising the following steps:
(1) reading in the newspaper document after layout analysis, classifying each text block as a body-text block or a non-body-text block according to its font style and the number of lines in the block, and dividing the body-text blocks into several article regions with independent content according to reading order and block style;
(2) building a weighted bipartite graph, wherein the two vertex sets of the bipartite graph contain all non-body-text blocks and all article regions respectively, and each edge corresponds to an adjacency relation between a non-body-text block and an article region in the two-dimensional space of the page;
(3) determining the edge weights of the bipartite graph with natural language processing, from the semantics of the non-body-text block content and the article-region content at the two vertices of each edge, the method exploiting the fact that a title summarizes the topic of the article: lexical analysis of the text in the body-text blocks yields word set a, with m distinct words, and the dispersion and co-reference degree of each word in a are computed, the dispersion being the distance between the sentences in which the word first and last occurs in the article text, and the co-reference degree being the number of times the word occurs in the article; lexical analysis of the text in the non-body-text block likewise yields word set b, with n distinct words, and the relative dispersion and relative co-reference degree of each word in b are computed with respect to the article text, the relative dispersion being the distance between the sentences in which the word first and last occurs in the article text, and the relative co-reference degree being the number of times the word occurs in the article; the sum of the n largest dispersions in a is the total dispersion of a, the sum of the n largest co-reference degrees in a is the total co-reference degree of a, the sum of all relative dispersions in b is the total relative dispersion of b, and the sum of all relative co-reference degrees in b is the total relative co-reference degree of b; the dispersion coefficient is the total relative dispersion of b divided by the total dispersion of a, and the co-reference coefficient is the total relative co-reference degree of b divided by the total co-reference degree of a; the word coverage of the title over the article text is the number of words of b that occur in the article text divided by the number of all words in b; and the edge weight is a linear weighted combination of the dispersion coefficient, the co-reference coefficient, and the word coverage;
(4) applying the Kuhn-Munkres algorithm to the weighted bipartite graph to obtain an optimal matching, wherein the content of each saturated vertex in the non-body-text vertex set of the optimal matching is a title, the saturated vertex in the article-region vertex set joined to it by a matched edge is the article body logically associated with that title, and the two are output as the title and body-text items of the article's XML structure;
wherein logical association means classifying each text block tiled on the two-dimensional space of the newspaper page by its semantic function as title, body text, header, or speech, and then associating the title and body text that express the same news item as one structural unit.
2. The method for logically associating titles with body text on a newspaper page as claimed in claim 1, characterized in that: in step (1), newspaper documents include documents obtained by scanning paper newspapers and applying OCR, PDF files, and documents generated by professional typesetting software; layout analysis divides the page bottom-up into block regions and physically classifies them as text blocks and image blocks; each text block is classified as a body-text block or a non-body-text block according to its font style and the number of lines in the block; the adjacency relations among body-text blocks are represented as a directed graph, which is split and converted into a weighted bipartite graph; natural language processing is used to compute the edge weights of the bipartite graph, and optimal matching yields several continuous sequences; each sequence is further divided into subsequences according to text-block style information; the merged region of each subsequence is an independent article region, and the text stream formed by concatenating the corresponding content is the content of that article region.
3. The method for logically associating titles with body text on a newspaper page as claimed in claim 1, characterized in that: in step (4), a text block corresponding to an unsaturated vertex of the optimal matching is neither a title nor body text but a header or speech on the page, so that the logical classification of page objects is solved and the logical association of titles and body text is completed at the same time; the Kuhn-Munkres algorithm for computing the optimal matching is as follows:
1) Give the initial labeling $l(x_i) = \max_j \omega_{ij}$, $l(y_j) = 0$, $i, j = 1, 2, \ldots, t$, $t = \max(n, m)$;
2) Obtain the edge set $E_l = \{(x_i, y_j) \mid l(x_i) + l(y_j) = \omega_{ij}\}$, the graph $G_l = (X, Y_k, E_l)$, and a matching $M$ in $G_l$;
3) If $M$ saturates all nodes of $X$, then $M$ is the optimal matching of $G$ and the computation ends; otherwise go to the next step;
4) Find an $M$-unsaturated node $x_0$ in $X$ and let $A \leftarrow \{x_0\}$, $B \leftarrow \varnothing$, where $A$ and $B$ are two sets;
5) If $N_{G_l}(A) = B$, go to step 9); otherwise go to the next step, where $N_{G_l}(A) \subseteq Y_k$ is the set of nodes adjacent to the nodes in $A$;
6) Find a node $y \in N_{G_l}(A) - B$;
7) If $y$ is $M$-saturated, find the match $z$ of $y$, let $A \leftarrow A \cup \{z\}$, $B \leftarrow B \cup \{y\}$, and go to step 5); otherwise go to the next step;
8) There is an augmenting path $P$ from $x_0$ to $y$; let $M \leftarrow M \oplus E(P)$ and go to step 3);
9) Compute the value $a = \min_{x_i \in A,\ y_j \notin N_{G_l}(A)} \{ l(x_i) + l(y_j) - \omega_{ij} \}$ and revise the labeling:
[Figure C2004100914320003C6: label-update formula]
Obtain $E_{l'}$ and $G_{l'}$ from $l'$;
10) Let $l \leftarrow l'$, $G_l \leftarrow G_{l'}$, and go to step 6).
CNB2004100914324A 2004-11-22 2004-11-22 Method for conducting title and text logic connection for newspaper pages Expired - Fee Related CN1320481C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100914324A CN1320481C (en) 2004-11-22 2004-11-22 Method for conducting title and text logic connection for newspaper pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2004100914324A CN1320481C (en) 2004-11-22 2004-11-22 Method for conducting title and text logic connection for newspaper pages

Publications (2)

Publication Number Publication Date
CN1604073A CN1604073A (en) 2005-04-06
CN1320481C true CN1320481C (en) 2007-06-06

Family

ID=34667254

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100914324A Expired - Fee Related CN1320481C (en) 2004-11-22 2004-11-22 Method for conducting title and text logic connection for newspaper pages

Country Status (1)

Country Link
CN (1) CN1320481C (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271463B (en) * 2007-06-22 2014-03-26 北大方正集团有限公司 Structure processing method and system of layout file
CN101206639B (en) * 2007-12-20 2012-05-23 北大方正集团有限公司 Complex layout indexing method based on PDF
US8290268B2 (en) * 2008-08-13 2012-10-16 Google Inc. Segmenting printed media pages into articles
CN101727438B (en) * 2008-10-30 2012-07-18 北大方正集团有限公司 Method for automatically extracting layout information of digital newspaper
CN102262618B (en) * 2010-05-28 2014-07-09 北京大学 Method and device for identifying page information
CN102890827B (en) * 2011-10-09 2015-05-13 北京多看科技有限公司 Method for resetting scanned document
CN103577818B (en) * 2012-08-07 2018-09-04 北京百度网讯科技有限公司 A kind of method and apparatus of pictograph identification
CN102929843B (en) * 2012-09-14 2015-10-14 《中国学术期刊(光盘版)》电子杂志社有限公司 A kind of method that word is adapted system and adapted
CN103092828B (en) * 2013-02-06 2015-08-12 杭州电子科技大学 Text similarity measure based on semantic analysis and a semantic relation network
CN104239282B (en) * 2014-09-09 2017-11-14 百度在线网络技术(北京)有限公司 Method and apparatus for processing e-books
CN106951400A (en) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 Information extraction method and device for PDF documents
CN108268429B (en) * 2017-06-15 2021-08-06 阿里巴巴(中国)有限公司 Method and device for determining network literature chapters
CN107358208B (en) * 2017-07-14 2018-07-13 北京神州泰岳软件股份有限公司 Method and device for extracting structured information from PDF documents
CN111143230B (en) * 2018-11-02 2022-03-29 群联电子股份有限公司 Data merging method, memory storage device and memory control circuit unit

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1995002221A1 (en) * 1993-07-07 1995-01-19 Inference Corporation Case-based organizing and querying of a database
CN1393806A (en) * 2001-06-26 2003-01-29 索尼株式会社 Information treater and method, recording medium and system for providing electronic publishing data

Also Published As

Publication number Publication date
CN1604073A (en) 2005-04-06

Similar Documents

Publication Publication Date Title
Mouchère et al. Advancing the state of the art for handwritten math recognition: the CROHME competitions, 2011–2014
CN1320481C (en) Method for conducting title and text logic connection for newspaper pages
CN103049435B (en) Text fine granularity sentiment analysis method and device
Smith et al. Detecting and modeling local text reuse
US20090144277A1 (en) Electronic table of contents entry classification and labeling scheme
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
Perez-Arriaga et al. TAO: system for table detection and extraction from PDF documents
CN102662969B (en) Internet information object positioning method based on webpage structure semantic meaning
CN109446423B (en) System and method for judging sentiment of news and texts
Al-Zaidy et al. Automatic summary generation for scientific data charts
CN106055667A (en) Method for extracting core content of webpage based on text-tag density
Boubaker et al. Online Arabic databases and applications
CN103246644A (en) Method and device for processing Internet public opinion information
CN112667940A (en) Webpage text extraction method based on deep learning
Klampfl et al. An unsupervised machine learning approach to body text and table of contents extraction from digital scientific articles
CN116994282B (en) Reinforcing steel bar quantity identification and collection method for bridge design drawing
CN117173730A (en) Document image intelligent analysis and processing method based on multi-mode information
CN104298985A (en) Iteration based image text region detecting method
CN118377950A (en) Webpage text extraction method and device
CN102103700A (en) Earth mover's distance-based image spam similarity-detection method
Hirayama et al. Development of template-free form recognition system
Ishihara et al. Analyzing visual layout for a non-visual presentation-document interface
CN115618833A (en) Form and context analysis method and system for field of geoscience
Prieto et al. Information extraction in handwritten historical logbooks
Souza et al. ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070606