JP2009211480A

JP2009211480A - Structured document processing system, structured document processing method, and structured document processing program

Info

Publication number: JP2009211480A
Application number: JP2008054648A
Authority: JP
Inventors: Masakazu Moriguchi; 昌和森口; Isaomi Tatsumi; 勇臣辰巳
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-03-05
Filing date: 2008-03-05
Publication date: 2009-09-17

Abstract

PROBLEM TO BE SOLVED: To provide a structured document processing technology that maps sections without rendering a structured document, and that maps sections in consideration of both the layout and the structure of the structured document.

SOLUTION: A weight is assigned to each tag constituting a structured document on the basis of its degree of influence on the layout, and the dissimilarity between structured documents is calculated on the basis of the assigned weights and the tag structure within each structured document.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a structured document processing system, a structured document processing method, and a structured document processing program for processing structured documents. More particularly, it relates to a structured document processing system, method, and program that, in a structured document whose layout is formed on the basis of a specific document type definition (DTD: Document Type Definition), such as HTML (HyperText Markup Language), map a region obtained by dividing the document under specific conditions to another region that is similar in both layout and structure.

In recent years, systems have been studied that automatically analyze the document structure of structured documents containing large amounts of information, such as Web content, and extract a plurality of sections for use. One example is an information providing system in which the information a user needs is selected from among the sections, either automatically or manually, and then used. Here, a section is a region obtained by subdividing a structured document and, like the structured document itself, is composed of markup-language elements (hereinafter, tags).

However, in applications that use sections of Web content whose contents are constantly changing, such as the information providing system above, the display position of a section may change, or the section may disappear, as time passes. As a result, information from the wrong section may be provided to the user.

To address this, Patent Document 1 focuses on the layout characteristics of structured documents and proposes a method that uses layout information, such as the extraction order of sections, the display coordinates of sections, and headings, to estimate the position of a section even when the section layout changes.

Another conceivable method applies the technique proposed in Patent Document 2, which detects the similarity between structured documents, to sections: the document structures of the sections before and after a change are compared, and sections with more similar structures are mapped to each other. In particular, because the document structure of a structured document can be expressed as a tree, the similarity of structured documents is determined using the edit distance between the trees.

JP 2007-293543 A
JP 2007-052556 A

The first problem is that, when sections with more similar layouts are mapped to each other using layout features such as coordinates, as in Patent Document 1, the structured document must be rendered at least once in order to obtain the layout information.

The second problem is that, when the similarity of document structures is determined without rendering, as in Patent Document 2, and sections with more similar structures are mapped to each other, an appropriate mapping cannot be obtained because the layout is not taken into account.

An object of the present invention is to provide a structured document processing system, a structured document processing method, and a structured document processing program that can map sections without rendering the structured document.

Another object of the present invention is to provide a structured document processing system, a structured document processing method, and a structured document processing program that can map sections in consideration of both the layout and the structure of the structured document.

The present invention for solving the above problems is a structured document processing apparatus that calculates the dissimilarity between a plurality of structured documents, characterized by comprising: layout determination means for assigning a weight to each tag constituting a structured document on the basis of its degree of influence on the layout; and dissimilarity calculation means for calculating the dissimilarity between structured documents on the basis of the assigned weights and the tag structure within each structured document.

The present invention for solving the above problems is a structured document processing system that associates structured documents with one another on the basis of their layout and structure, characterized by comprising: layout determination means for assigning a weight to each tag constituting the structured document on the basis of its degree of influence on the layout; dissimilarity calculation means for calculating the dissimilarity between sections obtained by subdividing the structured documents, on the basis of the assigned weights and the tag structure of the structured documents; mapping means for associating structured documents with one another on the basis of the calculated dissimilarity and generating display information indicating that association; and information distribution means for transmitting the generated display information to an information terminal via a communication network.

The present invention for solving the above problems is a structured document processing method for calculating the dissimilarity between a plurality of structured documents, characterized by comprising: a layout determination step of assigning a weight to each tag constituting a structured document on the basis of its degree of influence on the layout; and a dissimilarity calculation step of calculating the dissimilarity between structured documents on the basis of the assigned weights and the tag structure within each structured document.

The present invention for solving the above problems is a program for calculating the dissimilarity between a plurality of structured documents, characterized in that the program causes an information processing apparatus to execute: layout determination processing for assigning a weight to each tag constituting a structured document on the basis of its degree of influence on the layout; and dissimilarity calculation processing for calculating the dissimilarity between structured documents on the basis of the assigned weights and the tag structure within each structured document.

According to the present invention, sections can be mapped without rendering. This is because what is compared is the tags constituting each section rather than the rendered layout.

In addition, according to the present invention, mapping can be performed on the basis of both the layout and the structure of sections. This is because the degree of influence of each tag on the layout and the tag structure of each section are both used as determination factors.

To explain the features of the present invention, a specific description is given below with reference to the drawings.

A feature of the structured document processing system according to the present invention is that it includes a layout determination unit 102, a dissimilarity calculation unit 103, and a mapping unit 104.

The layout determination unit 102 assigns a weight to each tag of a structured document on the basis of the tag's degree of influence on the layout.

The dissimilarity calculation unit 103 calculates the dissimilarity between the sections being compared on the basis of the tag structure and the weights assigned to the tags by the layout determination unit 102.

The mapping unit 104 maps structured documents appropriately on the basis of the dissimilarity calculated by the dissimilarity calculation unit 103.

FIG. 1 is a block diagram showing an example of the configuration of the structured document processing system according to the present invention. In this embodiment, the structured document processing system can be implemented in hardware, but the following description assumes that it is realized by an information processing terminal, such as a personal computer, that operates according to a program. The structured document processing system may also be applied to a business model such as a system that divides a structured document into a plurality of sections and distributes them. In that case, the structured document processing system may include, for example, user terminals such as mobile phones, PDAs, and personal computers equipped with software for rendering structured documents, together with a structured document processing server that processes the structured documents.

As shown in FIG. 1, in this embodiment the structured document processing system includes a data processing device 10 that operates under program control and a storage device 11 that stores information.

Specifically, the data processing device 10 is realized by a personal computer, a server, or the like that operates according to a program.

The data processing device 10 includes a document input unit 100, a document analysis unit 101, a layout determination unit 102, a dissimilarity calculation unit 103, a mapping unit 104, and an output unit 105. The storage device 11 is realized, specifically, by a memory, a hard disk device, or the like. The storage device 11 includes a section storage unit 110 and a mapping storage unit 111.

The document input unit 100 has a function of acquiring a structured document from outside and outputting it to the document analysis unit 101. For example, the document input unit 100 reads a structured document from the storage device 11 in accordance with a user operation and outputs it to the document analysis unit 101. As another example, the document input unit 100 receives a structured document (for example, Web content) via a communication network such as the Internet and outputs it to the document analysis unit 101.

The document analysis unit 101 has a function of analyzing the structured document acquired from the document input unit 100, extracting a plurality of sections, and storing them in the section storage unit 110. The acquired structured document may also be stored in the section storage unit 110 as is, as a single section.

The layout determination unit 102 has a function of acquiring sections from the section storage unit 110 and assigning a weight to each tag constituting each section on the basis of the tag's degree of influence on the layout. The layout determination unit 102 also has a function of outputting the sections whose tags have been assigned weights to the dissimilarity calculation unit 103. The sections described here may include a section group, that is, a set of sections (including the structured document itself).

For example, the layout determination unit 102 obtains the layout definitions of tags from the DTD (Document Type Definition) and classifies each tag's degree of influence on the layout into two categories: block elements (basic elements that make up the layout, such as headings and paragraphs) and inline elements (elements that give a role or function to displayed information, such as emphasis and links). The layout determination unit 102 then assigns large weights to block elements and small weights to inline elements, thereby weighting tags according to their degree of influence on the layout.
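The weighting step can be illustrated with a minimal Python sketch. The tag table `BLOCK_TAGS`, the function names, and the rule that every tag not in the table counts as inline are assumptions made for illustration; the text only states that the classification is taken from the DTD. The weights 100 and 1 follow the example values given for FIG. 5.

```python
from html.parser import HTMLParser

# Hypothetical, abbreviated table of block-level tags; a real system would
# derive this classification from the DTD as the description states.
BLOCK_TAGS = {"div", "p", "ul", "ol", "table", "blockquote"}

# Example weights (block = 100, inline = 1), as in the FIG. 5 example.
WEIGHTS = {"block": 100, "inline": 1}


def tag_weight(tag, weights=WEIGHTS):
    """Return the weight of a tag; unknown tags are treated as inline (assumption)."""
    kind = "block" if tag.lower() in BLOCK_TAGS else "inline"
    return weights[kind]


class TagWeigher(HTMLParser):
    """Collects (tag, weight) pairs for every start tag, without rendering."""

    def __init__(self):
        super().__init__()
        self.weighted = []

    def handle_starttag(self, tag, attrs):
        self.weighted.append((tag, tag_weight(tag)))


def weigh_section(html):
    """Parse one section's markup and return its tags with assigned weights."""
    parser = TagWeigher()
    parser.feed(html)
    parser.close()
    return parser.weighted


print(weigh_section('<div><a href="#"><img src="x.png"></a></div>'))
```

Passing a different `weights` mapping to `tag_weight`, such as block 3 and inline 1, would correspond to evaluating the layout influence more loosely.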

The dissimilarity calculation unit 103 has a function of acquiring the sections to be compared (at least two sections) from the layout determination unit 102 and calculating, on the basis of the tag structure of the sections, a dissimilarity that indicates how similar the structures of the sections are. The dissimilarity calculation unit 103 also has a function of outputting the dissimilarity between sections to the mapping unit 104.

For example, the dissimilarity calculation unit 103 converts the tag structure of each section into a tree and calculates the edit distance between the trees. In doing so, it uses the weights calculated by the layout determination unit 102 as the edit cost of each tree node, and thereby calculates a dissimilarity based on both the degree of influence on the layout and the structure of the section.
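One way to realize such a weighted tree edit distance is sketched below in Python. This is an illustrative sketch, not the algorithm disclosed in the publication: it uses the textbook recursive formulation of ordered-forest edit distance with memoization, and it adopts the constraints described later for the first example (nodes match only by element type, substitution is not used, sibling order is preserved). The node representation `(kind, children)` and all function names are assumptions.

```python
from functools import lru_cache

# Example weights used as per-node edit costs (block = 100, inline = 1).
WEIGHT = {"block": 100, "inline": 1}

# A node is (kind, children): kind is "block" or "inline",
# children is a tuple of nodes. A forest is a tuple of nodes.


def forest_cost(forest):
    """Total cost of inserting (or deleting) every node in a forest."""
    return sum(WEIGHT[kind] + forest_cost(children) for kind, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Weighted edit distance between two ordered forests (insert/delete only)."""
    if not f1:
        return forest_cost(f2)
    if not f2:
        return forest_cost(f1)
    (k1, c1), (k2, c2) = f1[-1], f2[-1]
    best = min(
        # Delete the rightmost root of f1; its children take its place.
        forest_distance(f1[:-1] + c1, f2) + WEIGHT[k1],
        # Insert the rightmost root of f2.
        forest_distance(f1, f2[:-1] + c2) + WEIGHT[k2],
    )
    if k1 == k2:
        # Same element type: match the two roots at no cost (tag names ignored).
        best = min(best, forest_distance(c1, c2) + forest_distance(f1[:-1], f2[:-1]))
    return best


def dissimilarity(tree1, tree2):
    """Dissimilarity between two sections given as single-rooted trees."""
    return forest_distance((tree1,), (tree2,))


inline = ("inline", ())
# Hypothetical sections: a block with three inline children vs. one inline child.
section_a = ("block", (inline, inline, inline))
section_b = ("block", (inline,))
print(dissimilarity(section_a, section_b))
```

The recursion is simple rather than fast; for large sections an established algorithm such as Zhang-Shasha would be the usual choice.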

The mapping unit 104 has a function of acquiring the dissimilarity of each pair of sections from the dissimilarity calculation unit 103 and mapping (associating) the combination of sections with the smallest dissimilarity, that is, the most similar sections. The mapping unit 104 also has a function of storing the section mapping results in the mapping storage unit 111. For example, when comparing structured document Da with structured document Db, the mapping unit 104 maps all sections of Da and Db, starting from the combination of sections with the smallest dissimilarity. Sections left unmapped because the two documents have different numbers of sections are mapped to the empty section φ. The mapping unit 104 may also take the relative positions of sections into account: using a mapped pair of sections as a reference, it divides the set of sections in each structured document into two groups and performs mapping within each group. Alternatively, the mapping unit 104 may take the absolute positions of sections into account: it divides the set of sections into a plurality of groups on the basis of the hierarchical structure of the sections in the structured document and performs mapping within each group.
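The basic mapping rule, pairing sections in ascending order of dissimilarity and mapping leftovers to the empty section φ, might look like the following Python sketch. The function name, the dictionary `dissim` keyed by section pairs, and the use of `None` to stand for φ are assumptions for illustration, and the sample dissimilarity values are invented.

```python
def map_greedy(sections_a, sections_b, dissim):
    """Pair sections of Da and Db, smallest dissimilarity first.

    dissim[(a, b)] holds the dissimilarity between section a of Da and
    section b of Db. Unpaired sections are mapped to None (standing for φ).
    """
    candidates = sorted(
        (dissim[(a, b)], a, b) for a in sections_a for b in sections_b
    )
    used_a, used_b, mapping = set(), set(), []
    for _, a, b in candidates:
        if a not in used_a and b not in used_b:
            mapping.append((a, b))
            used_a.add(a)
            used_b.add(b)
    # Sections left over because the section counts differ map to φ.
    mapping += [(a, None) for a in sections_a if a not in used_a]
    mapping += [(None, b) for b in sections_b if b not in used_b]
    return mapping


# Invented example values.
dissim = {
    ("a1", "b1"): 4, ("a1", "b2"): 2, ("a1", "b3"): 102,
    ("a2", "b1"): 1, ("a2", "b2"): 3, ("a2", "b3"): 100,
}
print(map_greedy(["a1", "a2"], ["b1", "b2", "b3"], dissim))
```

This greedy rule ignores section positions; the position-aware variants mentioned above restrict the candidate pairs to sections in corresponding groups.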

The output unit 105 has a function of outputting the section mapping results stored in the mapping storage unit 111 to the outside as display information. For example, the output unit 105 extracts from the mapping storage unit 111 a section designated by the user and the section mapped to it, and either displays them on a display device, such as a liquid crystal display, or transmits them to an information terminal via a communication network.

The storage device 11 includes the section storage unit 110 and the mapping storage unit 111.

The section storage unit 110 stores, for each structured document, the sections of the structured document analyzed by the document analysis unit 101.

The mapping storage unit 111 stores the section mapping results calculated by the mapping unit 104.

Next, the operation is described. FIG. 2 is a flowchart showing an example of the processing by which the structured document processing system maps sections on the basis of layout and document structure.

First, the document input unit 100 acquires a structured document. For example, the document input unit 100 reads a structured document stored in the storage device 11. As another example, the document input unit 100 receives a structured document (for example, Web content) via a communication network.

Next, the document analysis unit 101 analyzes the structured document acquired from the document input unit 100, extracts a plurality of sections, and stores them in the section storage unit 110 for each structured document (step S11).

The layout determination unit 102 then acquires the sections from the section storage unit 110 and assigns a weight to each tag on the basis of the tag's degree of influence on the layout (step S12).

Next, the dissimilarity calculation unit 103 acquires the sections of each structured document from the layout determination unit 102 and, on the basis of the weights assigned to the tags and the tag structure, calculates the dissimilarity between the comparison-source section and each of the other sections (step S13).

The mapping unit 104 then acquires the section dissimilarities from the dissimilarity calculation unit 103, maps the sections of the plurality of structured documents to one another on the basis of those dissimilarities, and stores the result in the mapping storage unit 111 (step S14).

The output unit 105 outputs the section mapping results stored in the mapping storage unit 111.

As described above, according to this embodiment, weights are assigned to tags on the basis of their degree of influence on the layout, the dissimilarity is calculated on the basis of those weights and the tag structure of the sections, and the sections are mapped on the basis of the dissimilarity. This makes it possible to map sections appropriately on the basis of layout and document structure without rendering.

For example, when two structured documents are compared and some sections differ slightly in layout, a section can still be mapped if the types and structure of its constituent tags are more similar to those of one particular section than to those of any other. In addition, similarity that could previously be grasped only qualitatively can be grasped quantitatively in the form of a dissimilarity value.

Next, specific examples of the structured document processing system according to the present invention are described.

First, a first example of the structured document processing system is described. The structured document processing system in this example corresponds to the structured document processing system shown in the first embodiment. In this example, the data processing device is assumed to be a personal computer and the data storage device a magnetic disk device.

The personal computer (data processing device) includes a central processing unit that functions as document input means, document analysis means, layout determination means, dissimilarity calculation means, mapping means, and output means. The magnetic disk device (storage device) stores the section information and dissimilarity information analyzed or calculated by the personal computer. The data processing device may instead be a server, a mobile phone, or the like; the type of terminal does not matter. In this example, Web content is used as an example of a structured document. For example, the personal computer receives Web content via the Internet.

In this example, the central processing unit first receives Web content, analyzes the received Web content, and extracts sections. The central processing unit then stores the extracted section information in the magnetic disk device. The received Web content may also be treated as is, as a single section. The structured document may be of any type, such as HTML or XML, as long as it was created to be rendered. In this example, HTML is handled as the structured document.

FIG. 3 is a diagram showing an example of the document structure of one structured document, and FIG. 4 is a diagram showing the layout of the sections after two structured documents have been analyzed and a plurality of sections extracted. In this example, the central processing unit extracts a plurality of sections from the structured document D1 as shown in FIG. 3 and numbers them as sections 1 to 8 according to their extraction order, as shown in FIG. 4. Sections are extracted from the other structured document, D2, in the same way, and the section information is stored in the magnetic disk device.

Next, the central processing unit acquires the section information from the magnetic disk device and assigns weights to the tags of each section on the basis of their degree of influence on the layout. In this example, the central processing unit calculates the weight assigned to each tag by classifying tags, on the basis of the DTD, into two types: block elements (basic elements that make up the layout, such as headings and paragraphs) and inline elements (elements that give a role or function to displayed information, such as emphasis and links). The sections acquired from the magnetic disk device are not limited to sections obtained by analyzing a structured document; they may be other sections or the structured document itself. A plurality of sections or structured documents may also be acquired.

FIG. 5 is an explanatory diagram showing an example in which the tags of the structured document D1 in FIG. 3 are classified into block elements and inline elements and assigned weights of 100 and 1, respectively. In FIG. 5, because the influence on the layout is to be evaluated strongly, the weights differ widely: "100" for tags indicating block elements, such as "div", and "1" for tags indicating inline elements, such as "h", "a", and "img". If the influence on the layout is to be evaluated more loosely, the difference may be reduced, for example to "3" for block elements and "1" for inline elements. If the similarity of the tag structure, rather than the layout, is to be evaluated strongly, both block and inline elements may be given a weight of 1 so that there is no difference. The tag categories that determine the degree of influence on the layout need not be the block-element and inline-element definitions; they may follow other definitions in the DTD, or a layout definition outside the DTD that specifies how elements in the structured document are displayed, such as CSS (Cascading Style Sheets). They may also be defined in advance by the user. Alternatively, the degree of influence of each tag on the layout may be calculated during an earlier rendering and stored, and the weights may be based on the stored result.

Next, the central processing unit calculates the edit distance to each section being compared on the basis of the tag structure of the sections. In this example, the tag structure is converted into a tree and the edit distance between the trees is calculated. Furthermore, the edit cost of each node is set to the weight of its tag, and the dissimilarity is calculated from the edit distance on that basis.

FIG. 6 is an explanatory diagram showing an example in which section 1 of FIG. 5 is converted into a tree. R in FIG. 6 denotes the root of the tree. In FIG. 6, nodes with large weights are drawn larger and nodes with small weights are drawn smaller.

FIG. 7 is an explanatory diagram showing an example of calculating the edit distance and dissimilarity between the tree of section 1 in FIG. 6, used as the comparison source, and each of the other sections: section 2, section 3, and section 4. In this example, the tree edit distance is calculated with a focus on the influence on the layout, so tag names are ignored as long as the layout element type (block element, inline element, and so on) is the same. Substitution is not used in this example; a substitution is instead regarded as a deletion plus an insertion. Furthermore, sibling order in the tree is considered to affect the layout, so structural matches obtained by reordering sibling elements are not permitted.

For example, tree (a), which shows the tree structure of section 2, is equivalent to the comparison-source tree with the two inline elements "a" and "img" deleted (del). The edit distance is therefore 2, and the weight-based dissimilarity is 2.

Similarly, tree (b), which shows the tree structure of section 3, is equivalent to the comparison-source tree with one inline element deleted, one inline element inserted (ins), and one block element deleted. The edit distance is therefore 3, and the weight-based dissimilarity is 102.

Finally, tree (c), which shows the tree structure of section 4, is equivalent to the comparison-source tree with two inline elements deleted and two inline elements inserted. The edit distance is therefore 4, and the weight-based dissimilarity is 4.
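The three dissimilarity values follow from summing the weights of the individual edit operations, using the FIG. 5 weights of 100 per block element and 1 per inline element. The notation $D(S_1, S_k)$ for the dissimilarity between section 1 and section $k$ is introduced here only for illustration:

```latex
\begin{align*}
D(S_1, S_2) &= 1 + 1 = 2       && \text{(two inline deletions)} \\
D(S_1, S_3) &= 1 + 1 + 100 = 102 && \text{(one inline deletion, one inline insertion, one block deletion)} \\
D(S_1, S_4) &= 4 \times 1 = 4  && \text{(two inline deletions, two inline insertions)}
\end{align*}
```

Section 3 has a smaller edit distance than section 4 (3 versus 4) but a far larger dissimilarity, because one of its edits involves a block element. Under these weights, section 2 is the closest to section 1, followed by section 4 and then section 3.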

To make the similarity determination stricter, substitution may be used, with a fixed weight added to the edit cost when the tag names differ even though the elements are of the same layout class. Conversely, to relax the determination, matching structures by reordering sibling elements of the tree may be permitted. As long as the tag structure is used, conversion to a tree is not required. For example, the tag structure may be converted into a binary-style representation in which a block element is 1 and an inline element is 0, and the edit distance between the converted character strings may be computed. The central processing unit may also end processing once the dissimilarity has been calculated, in which case the information stored in the magnetic disk device is the dissimilarity.
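
The string-based variant can be sketched as follows. This is a minimal illustration under stated assumptions: the set `BLOCK_TAGS` is an illustrative subset chosen here, tags are flattened in document order (which discards nesting), and the function names are hypothetical. The distance allows only insertion and deletion, following the rule that a substitution counts as a deletion plus an insertion.

```python
# Assumption: an illustrative subset of block-level tags; every other tag is
# treated as inline. The source does not list which tags fall in each class.
BLOCK_TAGS = {"div", "p", "ul", "table", "h1"}

# Weights from the embodiment's example: block element = 100, inline element = 1.
WEIGHTS = {"1": 100, "0": 1}


def to_binary(tags):
    """Encode a flat sequence of tag names as a string of 1 (block) and 0 (inline)."""
    return "".join("1" if tag in BLOCK_TAGS else "0" for tag in tags)


def weighted_indel_distance(src, dst):
    """Minimum total weight of insertions and deletions that turn src into dst."""
    rows, cols = len(src) + 1, len(dst) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = cost[i - 1][0] + WEIGHTS[src[i - 1]]  # delete every symbol
    for j in range(1, cols):
        cost[0][j] = cost[0][j - 1] + WEIGHTS[dst[j - 1]]  # insert every symbol
    for i in range(1, rows):
        for j in range(1, cols):
            delete = cost[i - 1][j] + WEIGHTS[src[i - 1]]
            insert = cost[i][j - 1] + WEIGHTS[dst[j - 1]]
            best = min(delete, insert)
            if src[i - 1] == dst[j - 1]:
                best = min(best, cost[i - 1][j - 1])  # matching symbols cost nothing
            cost[i][j] = best
    return cost[-1][-1]


source = to_binary(["div", "a", "img", "p", "span"])  # "10010"
target = to_binary(["div", "a", "img"])               # "100"
print(weighted_indel_distance(source, target))        # 101: one block and one inline deleted
```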

Next, the central processing unit maps sections in order, starting from the pair of sections with the smallest dissimilarity. In this embodiment, the following process is repeated: the sections are divided into two groups with the mapped section as the reference point, and sections are then mapped within each group.

FIG. 8 is an explanatory diagram showing an example of mapping between the structured documents D1 and D2 of FIG. 4. First, the pair with the smallest dissimilarity, section 4 (D1) and section 4 (D2), is mapped. This mapping divides each structured document into two groups: G11 and G12 in D1, and G21 and G22 in D2.

Next, the pair of sections with the smallest dissimilarity is mapped within each group. For example, as shown in FIG. 9, section 2 (D1) and section 3 (D2) are mapped in G11:G21, and section 8 (D1) and section 9 (D2) are mapped in G12:G22.

As a result, in D1, group G11 is divided into G111 and G112, and G12 into G121 and G122; in D2, G21 is divided into G211 and G212, and G22 into G221 and G222. G122, G212, and G222 are empty groups that contain no sections.

Then, as before, the pair of sections with the smallest dissimilarity is mapped within each group. For example, as shown in FIG. 10, section 1 (D1) and section 1 (D2) are mapped in G111:G211, and section 5 (D1) and section 7 (D2) are mapped in G121:G221.

As a result, in D1, group G111 is divided into G1111 and G1112, and G121 into G1211 and G1212; in D2, G211 is divided into G2111 and G2112, and G221 into G2211 and G2212. In G112:G212, section 3 (D1) has no counterpart in D2, so it is mapped to the nonexistent section φ.

Mapping is repeated in the same way within the divided groups, and the final mapping result is as shown in FIG. 11.
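
The divide-and-map procedure walked through above can be sketched as a short recursive function. This is one reading of the example, not the patented implementation: it assumes each document is an ordered list of sections, that a group is split into the sections before and after the mapped one, and that `None` stands in for the nonexistent section φ. The names `map_sections` and `toy_dissimilarity` are hypothetical.

```python
def map_sections(d1, d2, dissim):
    """Recursively map the sections of d1 to the sections of d2.

    d1, d2: lists of sections in document order.
    dissim: function returning the dissimilarity of two sections.
    Returns a list of (section_from_d1, section_from_d2) pairs; None marks a
    section with no counterpart in the other document.
    """
    if not d1:
        return [(None, b) for b in d2]
    if not d2:
        return [(a, None) for a in d1]
    # Pick the pair with the smallest dissimilarity within the current groups.
    i, j = min(
        ((i, j) for i in range(len(d1)) for j in range(len(d2))),
        key=lambda pair: dissim(d1[pair[0]], d2[pair[1]]),
    )
    # Split both groups around the mapped pair and map each side independently.
    before = map_sections(d1[:i], d2[:j], dissim)
    after = map_sections(d1[i + 1:], d2[j + 1:], dissim)
    return before + [(d1[i], d2[j])] + after


def toy_dissimilarity(a, b):
    """Toy stand-in for the weighted edit distance: 0 if equal, otherwise 10."""
    return 0 if a == b else 10


print(map_sections(["a", "b", "c"], ["a", "c"], toy_dissimilarity))
# [('a', 'a'), ('b', None), ('c', 'c')]
```

Because each recursive call only considers sections on the same side of an already mapped pair, the resulting mapping never crosses: the relative order of mapped sections is the same in both documents.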

Alternatively, as shown in FIG. 12, the groups may be divided using the hierarchical structure of the sections within the document.

For example, when searching to depth 2, groups are first mapped at the depth-1 level of the hierarchy, as shown in FIG. 13(a). Groups are then mapped at the depth-2 level, as shown in FIG. 13(b). After that, sections are mapped within each of the resulting groups.

In an application that divides a structured document having a layout into multiple sections, the present invention described above can map each section appropriately to a similar section based on both the layout and the document structure, which makes the mapping robust to changes in the structured document. It also makes it possible to grasp the similarity of structured documents quantitatively.

For example, for Web content whose layout changes frequently, such as a blog, an application that manages update information for a specific section can keep managing that section consistently even when other sections are deleted or added and the section configuration changes.

Likewise, an application that manages update information for an entire piece of Web content can identify which sections have been deleted or added. Furthermore, because similarity can be assessed section by section even between Web content at entirely different URLs, processing such as hiding sections with a particular similarity across all Web content becomes possible.
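
As a rough illustration of how an update-tracking application might read a mapping result, the sketch below classifies sections as deleted, added, or kept. It assumes the mapping is available as a list of (old section, new section) pairs in which `None` stands for the nonexistent section φ; this representation and the name `classify_sections` are assumptions made for illustration.

```python
def classify_sections(mapping):
    """Split a mapping result into deleted, added, and kept sections.

    mapping: list of (old_section, new_section) pairs, where None marks a
    section that has no counterpart in the other version of the document.
    """
    deleted = [old for old, new in mapping if new is None]
    added = [new for old, new in mapping if old is None]
    kept = [(old, new) for old, new in mapping if old is not None and new is not None]
    return deleted, added, kept


# Hypothetical mapping between an old and a new version of a page.
mapping = [("header", "header"), ("ad", None), (None, "banner"), ("body", "body")]
deleted, added, kept = classify_sections(mapping)
print(deleted)  # ['ad']
print(added)    # ['banner']
print(kept)     # [('header', 'header'), ('body', 'body')]
```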

A block diagram showing another configuration example of the structured document processing system.
A flowchart showing an example of the process by which the structured document processing system maps sections.
An explanatory diagram showing an example of the document structure of a structured document.
An explanatory diagram showing the layout configuration of sections.
An explanatory diagram showing an example in which weights of 100 and 1 are assigned to block element tags and inline element tags, respectively.
An explanatory diagram showing an example of converting a section into a tree.
An explanatory diagram showing an example of calculating the edit distance and the dissimilarity.
An explanatory diagram showing the mapping procedure for structured documents D1 and D2.
An explanatory diagram showing the mapping procedure for structured documents D1 and D2.
An explanatory diagram showing the mapping procedure for structured documents D1 and D2.
An explanatory diagram showing an example in which the mapping of structured documents D1 and D2 is complete.
An explanatory diagram showing an example of the hierarchical structure of structured document D1.
An explanatory diagram showing an example of mapping structured documents D1 and D2 using the hierarchical structure.

Explanation of symbols

10 Data processing device
11 Storage device
100 Document input unit
101 Document analysis unit
102 Layout determination unit
103 Dissimilarity calculation unit
104 Mapping unit
105 Output unit
110 Section storage unit
111 Mapping storage unit

Claims

1. A structured document processing apparatus that calculates dissimilarity between a plurality of structured documents, comprising:
layout determination means for assigning a weight to each tag constituting a structured document based on the degree of influence on the layout; and
dissimilarity calculating means for calculating dissimilarity between structured documents based on the assigned weight and the tag structure in each structured document.

2. The structured document processing apparatus according to claim 1, further comprising mapping means for associating structured documents having a small calculated dissimilarity.

3. The structured document processing apparatus according to claim 2, wherein the mapping means associates a plurality of structured documents in order from structured documents having a low dissimilarity.

4. The structured document processing apparatus according to claim 2 or claim 3, wherein the mapping means divides each structured document into a plurality of groups on the basis of the first associated structured document, and associates the structured documents in each group with each other.

5. The structured document processing apparatus according to claim 1, wherein the dissimilarity calculating means calculates dissimilarity between sections of each structured document based on the assigned weight and the tag structure in each section obtained by subdividing the structured document.

6. The structured document processing apparatus according to claim 5, further comprising document analysis means for analyzing the input structured document and extracting a section.

7. The structured document processing apparatus according to claim 1, wherein the layout determination means uses a layout definition predetermined for the structured document as a weight calculation criterion.

8. The structured document processing apparatus according to claim 1, wherein the layout determination means uses a definition predetermined by the user as a weight calculation criterion.

9. The structured document processing apparatus according to any one of the above, wherein the layout determination means uses, as a weight calculation criterion, a definition determined by the system based on a prior measurement of the degree of influence of each tag on the layout when rendered.

10. The structured document processing apparatus according to claim 1, wherein the dissimilarity calculating means uses a structure obtained by converting the tag structure into a tree for calculating the dissimilarity.

11. The structured document processing apparatus according to claim 1, wherein the dissimilarity calculating means uses a structure obtained by converting the tag structure into a single-byte character string for calculating the dissimilarity.

12. The structured document processing apparatus according to claim 10, wherein the dissimilarity calculating means uses the edit distance of the structure for calculating the dissimilarity.

13. A structured document processing system that associates structured documents based on the layout and structure of a structured document, comprising:
layout determination means for assigning a weight to each tag constituting the structured document based on the degree of influence on the layout;
dissimilarity calculation means for calculating dissimilarity between sections obtained by subdividing the structured document based on the assigned weight and the tag structure of the structured document;
mapping means for associating structured documents with each other based on the calculated dissimilarity and generating display information indicating the association between the structured documents; and
information distribution means for transmitting the generated display information to the network via an information terminal.

14. A structured document processing method for calculating dissimilarity between a plurality of structured documents, comprising:
a layout determination step of assigning a weight to each tag constituting a structured document based on the degree of influence on the layout; and
a dissimilarity calculation step of calculating dissimilarity between structured documents based on the assigned weight and the tag structure in each structured document.

15. The structured document processing method according to claim 14, further comprising a mapping step of associating structured documents having a small calculated dissimilarity.

16. The structured document processing method according to claim 15, wherein the mapping step associates a plurality of structured documents in order from structured documents having a low dissimilarity.

17. The structured document processing method according to claim 15 or claim 16, wherein the mapping step divides each structured document into a plurality of groups on the basis of the first associated structured document, and associates the structured documents in each group with each other.

18. The structured document processing method according to claim 14, wherein the dissimilarity calculation step calculates dissimilarity between sections of each structured document based on the assigned weight and the tag structure in each section obtained by subdividing the structured document.

19. The structured document processing method according to claim 18, further comprising a document analysis step of extracting a section by analyzing the input structured document.

20. The structured document processing method according to claim 14, wherein the layout determination step uses a layout definition predetermined for the structured document as a weight calculation criterion.

21. The structured document processing method according to claim 14, wherein the layout determination step uses a definition predetermined by the user as a weight calculation criterion.

22. The structured document processing method according to any one of the above, wherein the layout determination step uses, as a weight calculation criterion, a definition determined by the system based on a prior measurement of the degree of influence of each tag on the layout when rendered.

23. The structured document processing method according to any one of claims 14 to 22, wherein the dissimilarity calculation step uses a structure obtained by converting the tag structure into a tree for calculating the dissimilarity.

24. The structured document processing method according to claim 14, wherein the dissimilarity calculation step uses a structure obtained by converting the tag structure into a single-byte character string for calculating the dissimilarity.

25. The structured document processing method according to claim 23, wherein the dissimilarity calculation step uses a structure edit distance for calculating the dissimilarity.

26. The structured document processing method according to any one of claims 15 to 25, further comprising an information transmission step of transmitting, via a communication network, the result of associating structured documents with each other in the mapping step.

27. A program for calculating the dissimilarity between a plurality of structured documents, the program causing an information processing device to execute:
layout determination processing for assigning a weight to each tag constituting a structured document based on the degree of influence on the layout; and
dissimilarity calculation processing for calculating dissimilarity between structured documents based on the assigned weight and the tag structure in each structured document.

28. The program according to claim 27, wherein a mapping process for associating structured documents with a low calculated dissimilarity is executed.

29. The program according to claim 28, wherein the mapping process associates a plurality of structured documents in order from structured documents having a low dissimilarity.

30. The program according to claim 28 or claim 29, wherein the mapping process divides each structured document into a plurality of groups on the basis of the first associated structured document, and associates the structured documents in each group with each other.

31. The program according to any one of claims 27 to 30, wherein the dissimilarity calculation processing calculates dissimilarity between sections of each structured document based on the assigned weight and the tag structure in each section obtained by subdividing the structured document.

32. The program according to claim 31, wherein a document analysis process for analyzing the input structured document and extracting a section is executed.

33. The program according to any one of claims 27 to 32, wherein the layout determination processing uses a layout definition predetermined for the structured document as a weight calculation criterion.

34. The program according to any one of claims 27 to 32, wherein the layout determination processing uses a definition predetermined by a user as a weight calculation criterion.

35. The program according to any one of the above, wherein the layout determination processing uses, as a weight calculation criterion, a definition determined by the system based on a prior measurement of the degree of influence of each tag on the layout when rendered.

36. The program according to claim 27, wherein the dissimilarity calculation process uses a structure obtained by converting the tag structure into a tree type for calculating the dissimilarity.

37. The program according to any one of claims 27 to 35, wherein the dissimilarity calculation processing uses a structure obtained by converting the tag structure into a single-byte character string for calculating the dissimilarity.

38. The program according to claim 36 or claim 37, wherein the dissimilarity calculation processing uses the edit distance of the structure for calculating the dissimilarity.

39. The program according to any one of claims 28 to 38, wherein information transmission processing for transmitting the result of the association in the mapping process via a communication network is executed.