JP2010267092A

JP2010267092A - Information processor and information processing method

Info

Publication number: JP2010267092A
Application number: JP2009118047A
Authority: JP
Inventors: Hitoshi Uchida; 均内田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2009-05-14
Filing date: 2009-05-14
Publication date: 2010-11-25

Abstract

PROBLEM TO BE SOLVED: To provide a technique for improving the efficiency of converting text XML that contains large text data as attribute values or element content into binary XML.

SOLUTION: An information processor includes: a means for acquiring a structured document in a text format; a means for analyzing the structured document and detecting text data; and a generation means for converting the structured document by converting the text data into child elements and for generating a structured document in a binary format based on the converted structured document.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a technique for processing a structured document.

Until now, XML data has generally been stored in text format. However, XML data needs redundant data to express its structure, and it takes time for a computer to read and write it. For this reason, binary XML technology has attracted attention in recent years.

For example, the W3C has standardized Efficient XML Interchange (http://www.w3.org/XML/EXI/), and ISO has standardized Fast Infoset (http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=41327&scopelist=PROGRAMME).

These technologies reduce the data size by encoding each vocabulary item in the XML data, such as an element name or attribute name, as a number assigned in order of appearance within the XML data. The table that maps codes to vocabulary items is called a coding table. Data in binary XML format can be read and written faster than conventional text-format XML data.
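As a rough illustration of the coding table described above, the following Python sketch numbers element and attribute names in order of first appearance. The function name `build_name_table` and the sample document are assumptions made for illustration only; EXI and Fast Infoset define their own, more elaborate table formats.

```python
import xml.etree.ElementTree as ET


def build_name_table(xml_text):
    """Map each element/attribute name to a code in order of first appearance."""
    table = {}
    # iter() walks the tree in document order (the element itself first).
    for elem in ET.fromstring(xml_text).iter():
        for name in (elem.tag, *elem.attrib):
            # setdefault leaves the code untouched if the name was seen before.
            table.setdefault(name, len(table))
    return table


# Hypothetical sample document, not taken from the publication.
DOC = '<svg><path fill="none" d="M 100 100 L 300 100"/><path fill="red" d="M 0 0"/></svg>'
NAME_TABLE = build_name_table(DOC)
```

Because the second `path` element reuses names that are already in the table, it adds no new entries; only the small integer codes would need to be written for it.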

In addition, the data size can be reduced further by encoding attribute values and element content not as conventional string data but as integers, floating-point numbers, or with a proprietary compression algorithm. When such data is parsed, the application no longer has to convert string data to numeric data, so the processing time required for parsing can be shortened further.
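The size difference between a string encoding and a typed encoding can be sketched as follows. This is a minimal illustration using Python's standard `struct` module; the helper names and the choice of a 4-byte big-endian float are assumptions, not the format of any particular binary XML specification.

```python
import struct


def encode_as_text(value):
    """Encode a numeric literal as UTF-8 text, one byte per character here."""
    return value.encode("utf-8")


def encode_as_float32(value):
    """Encode a numeric literal as a 4-byte big-endian IEEE 754 float."""
    return struct.pack(">f", float(value))


def decode_float32(data):
    """A reader gets a number back directly, with no string parsing step."""
    return struct.unpack(">f", data)[0]


literal = "123.456789"
text_size = len(encode_as_text(literal))      # 10 bytes
typed_size = len(encode_as_float32(literal))  # 4 bytes
```

Note that a 32-bit float cannot represent every decimal literal exactly, so a real encoder would have to pick a type that preserves the precision the application needs.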

The document structure of XML data, and the data types of the attribute values and element content it contains, can be defined with a schema. Schemas are generally used to validate whether the document structure of XML data and the data types of its values strictly match the schema definition. The W3C standard schema language is XML Schema (http://www.w3.org/XML/Schema), and the ISO standard schema language is RELAX NG (http://www.oasis-open.org/committees/relax-ng/).

Another technique that uses a schema is data binding. At development time, class files for holding XML data are generated automatically from the schema, and at run time the application uses those classes, which reduces application development cost. When XML data is bound to a class generated from the schema, the text data it contains is converted according to the data types defined in the schema and stored in the corresponding members. Therefore, unlike with a conventional text-based XML parser, an application that uses schema-generated classes can handle the text data in the XML data as values of the respective data types. JAXB from Sun Microsystems (https://jaxb.dev.java.net/) and Relaxer (http://www.relaxer.jp/) are commonly used.
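The idea of data binding can be sketched in a few lines of Python. Tools such as JAXB generate the binding class from a schema; in this sketch the class `Circle` and its `from_element` method are written by hand and are purely hypothetical, standing in for what a generator would produce.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Circle:
    """Hand-written stand-in for a class a binding tool would generate."""
    cx: float
    cy: float
    r: float
    fill: str

    @classmethod
    def from_element(cls, elem):
        # Each attribute string is converted to the type its member declares.
        return cls(
            cx=float(elem.get("cx")),
            cy=float(elem.get("cy")),
            r=float(elem.get("r")),
            fill=elem.get("fill"),
        )


circle = Circle.from_element(ET.fromstring('<circle cx="10" cy="20.5" r="5" fill="red"/>'))
```

After binding, the application works with `circle.cx` as a number rather than as the string `"10"`.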

KDDI's XEUS can compress XML data efficiently by encoding each number individually when an attribute value or element content holds an array of numbers separated by commas or similar delimiters.

There is also a technique that uses schema information to automatically generate a style sheet for converting between multiple XML data sets with different document structures (Patent Document 1). Another technique converts tag names so that the hierarchy in the XML data becomes shallower, in order to improve the efficiency with which an application accesses the XML data (Patent Document 2).

Patent Document 1: JP-A-2005-352945
Patent Document 2: JP-A-2002-297569

With conventional binary XML, when attribute values or element content contain large text data, that data is not compressed efficiently even after encoding into binary XML format, so it has been difficult to reduce the data size.

Consider text-format XML data (FIG. 1(a)) that describes a graphic object (FIG. 1(b)) using the W3C standard SVG. Even when this text-format XML data is encoded into binary XML format, the value of the d attribute of the path element is encoded as-is as text data and stored in the string table inside the binary XML data. Large text data that is unlikely to appear repeatedly in the XML data has therefore been difficult to compress efficiently, even with conventional binary XML techniques. In SVG in particular, graphic objects are often described with long text data as the value of the d attribute of the path element, so it has been difficult to obtain a compression benefit from binary XML.
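The reason a string table helps little for path data can be shown with a small sketch. The function `string_table_bytes` and the sample values below are assumptions for illustration; the sketch only counts the bytes needed to store each distinct string once and ignores index overhead.

```python
def string_table_bytes(values):
    """Bytes needed to store every distinct string once (UTF-8)."""
    return sum(len(v.encode("utf-8")) for v in set(values))


# Hypothetical attribute values from three path elements.
fills = ["none", "none", "none"]
path_data = [
    "M 100 100 L 300 100 L 200 300",
    "M 10 20 L 30 40 L 50 60",
    "M 5 5 L 15 5 L 15 15 L 5 15",
]

fill_bytes = string_table_bytes(fills)      # the repeated value is stored once
path_bytes = string_table_bytes(path_data)  # every value is distinct
```

The repeated `fill` value shrinks to a single entry, while the path data keeps its full length because no two values are alike.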

The present invention has been made in view of the above problem, and its object is to provide a technique for improving the efficiency of converting text XML that contains large text data as attribute values or element content into binary XML.

To achieve this object, an information processing apparatus of the present invention has, for example, the following configuration. That is, it comprises: means for obtaining a structured document in text format; means for analyzing the structured document and detecting text data; and generating means for converting the structured document by converting the text data into child elements, and generating a structured document in binary format based on the converted structured document.

With the configuration of the present invention, the efficiency of converting text XML that contains large text data as attribute values or element content into binary XML can be improved.

FIG. 1 is a diagram showing text-format XML data and its rendering result.
FIG. 2 is a diagram showing a configuration example of a system according to the embodiment.
FIG. 3 is a block diagram showing a hardware configuration example of the PC 101.
FIG. 4 is a processing function block diagram of the PC 101.
FIG. 5 is a flowchart of processing performed by the PC 101.
FIG. 6 is a flowchart showing details of step S113.
FIG. 7 is a diagram showing an example of a partial schema.
FIG. 8(a) is a diagram showing an example of an XPath expression, and FIG. 8(b) is a diagram showing binary-format XML data.
FIG. 9 is a diagram showing the code of each event.
FIG. 10 is a diagram showing a name table.
FIG. 11(a) is a diagram showing a value table, and FIG. 11(b) is a diagram showing a code table.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. The embodiment described below is one example of a concrete implementation of the present invention, and is one specific example of the configuration recited in the claims.

This embodiment describes a technique for converting text XML, as a structured document containing an element whose attribute has relatively large text data as its value, into binary XML with high compression efficiency.

First, the system according to this embodiment is described with reference to FIG. 2. As shown in FIG. 2, the system includes a PC (personal computer) 101 serving as the information processing apparatus according to this embodiment, a LAN 102 as an example of a network, a digital camera 103, a multifunction peripheral 104, and a file server 105.

The digital camera 103 captures images to be used together with the XML data generated by the PC 101. Data of an image captured by the digital camera 103 is transferred to the PC 101 or the file server 105 via the LAN 102. Of course, the use of the digital camera 103 is not limited to this.

The multifunction peripheral 104 obtains, by copying, images to be used together with the XML data generated by the PC 101. Data of an image copied by the multifunction peripheral 104 is transferred to the PC 101 or the file server 105 via the LAN 102. Of course, the use of the multifunction peripheral 104 is not limited to this.

The file server 105 holds the image data transferred from the digital camera 103 and the multifunction peripheral 104, and also holds the XML data transferred from the PC 101. The image data and XML data held by the file server 105 can be accessed as needed via the LAN 102.

The PC 101 is used as the information processing apparatus according to this embodiment, and executes the processing described later for converting text-format XML data into binary-format XML data. Next, a hardware configuration example of the PC 101 is described with reference to FIG. 3.

The CPU 201 controls the entire PC 101 using computer programs and data stored in the ROM 202 and the RAM 203, and executes each process described later as being performed by the PC 101.

The ROM 202 is an example of a computer-readable storage medium and stores setting data, a boot program, and the like for the PC 101. The RAM 203 is an example of a computer-readable storage medium and has an area for temporarily storing computer programs and data loaded from the storage unit 204, data received from outside via the LAN I/F 207 or the USB I/F 209, and so on. The RAM 203 also has a work area that the CPU 201 uses when executing various processes. In other words, the RAM 203 can provide various areas as needed.

The storage unit 204 is an example of a computer-readable storage medium, and is a large-capacity information storage device typified by a hard disk drive. The storage unit 204 stores an OS (operating system) as well as computer programs and data for causing the CPU 201 to execute each process described later as being performed by the PC 101. These computer programs and data are loaded into the RAM 203 as needed under the control of the CPU 201 and are then processed by the CPU 201.

The operation unit 205 includes a keyboard, a mouse, and the like, and allows the operator of the PC 101 to input various instructions to the CPU 201. The display unit 206 includes a CRT, a liquid crystal screen, or the like, and can display the results of processing by the CPU 201 as images, text, and so on.

The LAN I/F 207 connects the PC 101 to the LAN 102, and the PC 101 communicates with devices connected to the LAN 102 via the LAN I/F 207. The USB I/F 209 connects the PC 101 to the USB line 210, and the PC 101 communicates with devices connected to the USB line 210 via the USB I/F 209. All of the above units are connected to a common bus and can communicate with one another.

The system configuration and the configuration of the PC 101 described above are only examples. Other configurations may be used as long as they can realize the processing described later and allow the binary-format XML data obtained by that processing to be put to use.

Next, efficient compression encoding from text-format XML data into binary-format XML data is described with reference to FIG. 5. The computer programs and data for causing the CPU 201 to execute processing according to the flowchart shown in FIG. 5 are stored in the storage unit 204. These computer programs and data are loaded into the RAM 203 as needed under the control of the CPU 201 and are then processed by the CPU 201. When the CPU 201 executes processing using the loaded computer programs and data, the PC 101 executes each process described below (the processing in each step shown in FIG. 5).

In the following, to simplify the description, the text-format XML data to be compression-encoded is assumed to be the XML data shown in FIG. 1. Of course, the essence of the processing described below is the same for other text-format XML data.

First, in step S102, the CPU 201 selects the text-format XML data to be compression-encoded into binary-format XML data. The selection may be made by an application running on the PC 101, or based on a selection instruction entered by the operator via the operation unit 205. As stated above, the XML data shown in FIG. 1 is assumed to have been selected here. This XML data contains an element (the path element in FIG. 1) that includes an attribute (the d attribute in FIG. 1) whose value is text data that is difficult to compress efficiently (text data that matches the XPath expression).

Next, in step S103, a partial schema that describes the data type of the text data that is difficult to compress efficiently, and the definition name serving as the root of that data type information, are selected. The partial schema is then parsed and a DOM tree is created in the RAM 203.

The partial schema and the definition name serving as the root of its data type information may be selected by displaying a user interface screen for the selection on the display unit 206 and having the user enter a selection instruction with the operation unit 205 while viewing the screen. Alternatively, the partial schema and the definition name may be built into the application logic in advance.

As described above, in this embodiment the data that is difficult to compress efficiently is the value of the d attribute of the path element in FIG. 1. An example of a partial schema, written in RELAX NG, that defines the data type of the d attribute of the path element in FIG. 1 is described with reference to FIG. 7.

The partial schema illustrated in FIG. 7 describes several data types using define elements, and the definition name serving as the root of the data type of the d attribute value is "SVG.PathData.datatype". Accordingly, in the case of FIG. 7, the definition name selected in step S103 is "SVG.PathData.datatype".

Returning to FIG. 5, in step S104 the document structure of the selected partial schema is simplified according to Chapter 4, "Simplification", of the RELAX NG specification. Then, in step S105, the text data corresponding to the defined partial schema is selected using an XPath expression. FIG. 8(a) shows an example of an XPath expression that denotes the d attributes of all path elements appearing in the text-format XML data. The text data may be selected by displaying a user interface screen for the selection on the display unit 206 and having the user enter a selection instruction with the operation unit 205 while viewing the screen. Alternatively, the XPath expression may be built into the application logic in advance.
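Selecting the d attribute values of all path elements can be sketched with the limited XPath support in Python's `xml.etree.ElementTree`. The expression in FIG. 8(a) is not reproduced in the text, so the expression used here, the sample document, and the helper name `select_path_data` are assumptions.

```python
import xml.etree.ElementTree as ET

SVG_NS = {"svg": "http://www.w3.org/2000/svg"}

# Hypothetical sample document, not the one shown in FIG. 1.
DOC = """<svg xmlns="http://www.w3.org/2000/svg">
  <path fill="none" stroke="blue" stroke-width="3" d="M 100 100 L 300 100 L 200 300"/>
  <g><path fill="red" d="M 0 0 L 10 10"/></g>
  <rect width="10" height="10"/>
</svg>"""


def select_path_data(xml_text):
    """Return the d attribute value of every path element, in document order."""
    root = ET.fromstring(xml_text)
    # ElementTree cannot select attributes directly, so the sketch selects
    # the path elements that carry a d attribute and then reads the value.
    return [p.get("d") for p in root.findall(".//svg:path[@d]", SVG_NS)]


selected = select_path_data(DOC)
```

A full XPath engine could address the attributes themselves; the element-then-attribute approach here is only a workaround for the standard library's subset.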

Next, in step S106, the text-format XML data is parsed. The parsing proceeds sequentially from the beginning of the text-format XML data. In step S107, it is determined whether encoding of the text-format XML data into binary-format XML data has been completed. If it has been completed, the processing ends; if not, the process proceeds to step S108.

In step S108, it is determined whether the text-format XML data contains text data that matches the XPath expression. If it does (that is, if text data matching the XPath expression can be detected in the text-format XML data), the process proceeds to step S110; if it does not, the process proceeds to step S109. In step S109, the text-format XML data is encoded into binary-format XML data as-is, in the conventional manner.

In step S110, the data in the text-format XML data that does not match the XPath expression is encoded. For the text-format XML data in FIG. 1, the first start tag, svg, contains no text data matching the XPath expression, so it is encoded as a start tag as-is. The next element, path, contains a d attribute that matches the XPath expression, so the start tag path and the fill, stroke, and stroke-width attributes are encoded. The d attribute that matched the XPath expression and its value are not encoded at this point but are saved in the RAM 203.

Next, in step S111, it is determined whether the text data that matched the XPath expression belongs to an attribute value. If it belongs to an attribute value, the process proceeds to step S112; if it does not, the process proceeds to step S115.

In step S112, a start tag is encoded whose tag name is the name of the attribute to which the matched text data belongs. In the text-format XML data of FIG. 1, the text data that matched the XPath expression belongs to an attribute value and the attribute name is "d", so a start tag <d> with the name "d" is encoded.

Next, in step S113, the text data that matched the XPath expression is parsed sequentially from the beginning, and is converted into child elements using the data type information of the selected partial schema.

The details of step S113 are described with reference to FIG. 6. First, in step S202, the internal state of the DOM tree of the partial schema is initialized. Next, in step S203, a pointer p that points to an arbitrary node in the partial schema, and that is used to remember the state partway through parsing the text data, is initialized to NULL, to a pointer to the first node, or the like.

Next, in step S204, it is determined whether parsing of the text data that matched the XPath expression has been completed. If it has been completed, step S113 ends and the process proceeds to step S114. If it has not been completed, the process proceeds to step S205. In parsing the text data, whitespace characters in the text data are treated as delimiters, the text data is divided at these delimiters, and each piece of divided text data is parsed in turn.

In step S205, it is determined whether the divided text data is one of the enumerated values defined under a group element of the partial schema. If it is, the process proceeds to step S206; if it is not, the process proceeds to step S207.

In step S206, the divided text data that is one of the enumerated values defined under a group element of the partial schema is encoded as a start tag whose tag name is that enumerated value. With the partial schema shown in FIG. 7, the enumerated values are "M", "m", "L", and "l", so the divided text data "M" that is parsed first corresponds to the content under the define element named "moveType". In this case, the divided text data "M" is encoded as a start tag with the tag name "M". Then, in step S208, the pointer p is updated to remember the parsing state; in the case of FIG. 7, for example, it is updated to point to the choice element under the define element named "moveType" in the partial schema.

In step S207, on the other hand, the node in the partial schema referenced by the pointer p is examined, and the divided text data is encoded as one text event according to its data type. With the partial schema shown in FIG. 7, after the first divided text data "M" has been encoded as a start tag, the divided text data "100" that follows "M" is encoded as one text event of the float type defined in the partial schema.

Then, in step S208, the pointer p is updated to refer to the sibling of the node it currently points to. Thus, after the divided text data "100" has been encoded, the pointer p refers to a <data type="float"> node under the define element named "moveType". In this way, all pieces of divided text data that make up one piece of text data are encoded in sequence.
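The loop of steps S204 to S208 can be sketched as follows. This is a simplified illustration, not the implementation in the publication: the dictionary `PARTIAL_SCHEMA` is a hypothetical stand-in for the simplified partial schema of FIG. 7, the integer `p` plays the role of the pointer p, and the explicit end events for command elements are an assumption, since the text only describes the start tags and text events.

```python
# Hypothetical stand-in for the partial schema: each enumerated command
# value maps to the data types of the operands that may follow it.
PARTIAL_SCHEMA = {
    "M": ("float", "float"),
    "m": ("float", "float"),
    "L": ("float", "float"),
    "l": ("float", "float"),
}
CASTS = {"float": float}


def convert_path_data(text):
    """Turn whitespace-separated path data into a list of typed events."""
    events = []
    command = None  # command element that is currently open
    p = None        # position among the operand types of that command
    for token in text.split():               # S204: divide at whitespace
        if token in PARTIAL_SCHEMA:          # S205: enumerated value?
            if command is not None:
                events.append(("end", command))
            events.append(("start", token))  # S206: start tag named after it
            command, p = token, 0            # S208: point at the first operand
        else:
            if command is None:
                raise ValueError(f"operand {token!r} before any command")
            types = PARTIAL_SCHEMA[command]
            dtype = types[p % len(types)]    # S207: type at the pointer
            events.append(("text", CASTS[dtype](token)))
            p += 1                           # S208: move to the sibling node
    if command is not None:
        events.append(("end", command))
    return events


events = convert_path_data("M 100 100 L 300 100")
```

The sketch follows the whitespace-only splitting described above, so comma-separated coordinates, which SVG also permits, would need extra handling.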

Returning to FIG. 5, in step S114 the end tag corresponding to the start tag encoded in step S112 is encoded. In this embodiment, the end tag </d> with the tag name "d" is encoded. In step S115, on the other hand, the same processing as step S113 is performed without outputting the start tag and end tag encoded in steps S112 and S114.
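The outer flow of steps S110 to S114 can be sketched as an event generator over a parsed document. All names here (`encode_events`, `split_tokens`, the event tuples) are hypothetical. The sketch emits readable tuples instead of binary codes, matches by a fixed element and attribute name rather than by evaluating an XPath expression, and leaves out the element-content case of step S115.

```python
import xml.etree.ElementTree as ET


def split_tokens(text):
    """Placeholder for step S113: one text event per whitespace-separated token."""
    return [("text", tok) for tok in text.split()]


def encode_events(elem, match=("path", "d"), convert=split_tokens):
    """Emit events for elem, turning the matched attribute into a child element."""
    tag, attr = match
    events = [("start", elem.tag)]
    held = None
    for name, value in elem.attrib.items():
        if elem.tag == tag and name == attr:
            held = value                      # S110: hold back the matched attribute
        else:
            events.append(("attr", name, value))
    if held is not None:
        events.append(("start", attr))        # S112: start tag named after the attribute
        events.extend(convert(held))          # S113: convert the text data
        events.append(("end", attr))          # S114: matching end tag
    if elem.text and elem.text.strip():
        events.append(("text", elem.text.strip()))
    for child in elem:
        events.extend(encode_events(child, match, convert))
    events.append(("end", elem.tag))
    return events


doc_events = encode_events(ET.fromstring('<svg><path fill="none" d="M 100 100"/></svg>'))
```

Passing a schema-aware converter as `convert` would replace the plain token split with typed child elements.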

By applying the encoding process described above to the text-format XML data shown in FIG. 1, the binary-format XML data shown in FIG. 8(b) can be created. Each binary value shown in FIG. 8(b) is the binary corresponding to the respective unit.

The data of the d attribute contained in the path element of the text-format XML data in FIG. 1 is converted so that it is included as a child element of the path element. The text data of the d attribute value is converted into child elements, with each command such as "M" or "L" forming one group (a structural unit of the text data). The content of each such element holds the coordinate information specified for the "M" or "L" command, stored as individual text events. When an application decodes the coordinate information specified for an "M" or "L" command, it can obtain each coordinate as a separate text event.

FIG. 9 shows the code of each event used in the binary-format XML data shown in FIG. 8(b). The application is assumed to know the code of each event in advance. FIG. 10 shows the name table referenced in FIG. 8(b). This name table stores the names of elements and attributes, the code assigned to each name, and the data size of each name.

FIG. 11(a) shows the value table referenced by the binary-format XML data shown in FIG. 8(b). This value table stores the values of element content and attribute values, the code assigned to each value, the data type of each value, and the data size of each value. The tables shown in FIGS. 10 and 11(a) depend on the text-format XML data being converted, so if the content of the text-format XML data changes, the contents of these tables change accordingly. FIG. 11(b) is the table of codes referenced as the data type of each value in the table of FIG. 11(a), and the application is assumed to know it in advance.
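A value table with the four columns described above (value, code, data type, data size) could be built as in the following sketch. The concrete layouts of FIGS. 11(a) and 11(b) are not reproduced in the text, so the type codes, the 4-byte float size, and the function name `build_value_table` are assumptions.

```python
import struct

# Hypothetical data type codes, standing in for the table of FIG. 11(b).
TYPE_CODES = {"string": 0, "float": 1}


def build_value_table(values):
    """Assign a code to each distinct value and record its type and size."""
    table = []
    seen = set()
    for value in values:
        key = (type(value).__name__, value)
        if key in seen:
            continue  # a repeated value reuses its existing code
        seen.add(key)
        if isinstance(value, float):
            dtype, size = "float", struct.calcsize(">f")
        else:
            dtype, size = "string", len(value.encode("utf-8"))
        table.append(
            {"code": len(table), "value": value, "type": TYPE_CODES[dtype], "size": size}
        )
    return table


value_table = build_value_table(["none", 100.0, 100.0, 300.0])
```

Because the table is derived from the values that actually occur, a different input document yields a different table, which matches the dependency noted above.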

Next, the processing function blocks of the PC 101 will be described with reference to FIG. 4. The structured document analysis unit 304 acquires and analyzes the text-format XML data that is to be compression-encoded into binary-format XML data.

The schema selection unit 303 selects a partial schema that describes the data type of text data that is difficult to compress efficiently, together with the definition name that serves as the root of that data type's information.

The schema analysis unit 305 analyzes this partial schema and creates a DOM tree in the RAM 203. The schema analysis unit 305 then simplifies the document structure of the selected partial schema in accordance with Chapter 4, "Simplification", of the RELAX NG specification.

The text data selection unit 302 acquires an XPath expression. Using this XPath expression, the encoded document generation unit 306 selects the text data corresponding to the defined partial schema. The encoded document generation unit 306 includes a text data analysis unit 307 and a text data encoding unit 309, which operate as follows. The text data analysis unit 307 analyzes the text data that matches the XPath expression, and the text data encoding unit 309 performs the encoding processes described above.
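
The selection step can be sketched with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. Since that subset cannot select attribute nodes directly, this sketch selects the owning elements and then reads the attribute. The function name `select_text_data`, the sample document, and the expression `.//path[@d]` are illustrative assumptions, not taken from the embodiment.

```python
import xml.etree.ElementTree as ET


def select_text_data(xml_text, element_xpath, attribute):
    """Return the attribute values of all elements matching element_xpath.

    ElementTree implements only part of XPath, so the attribute is read
    from each matched element instead of being addressed in the expression.
    """
    root = ET.fromstring(xml_text)
    return [elem.get(attribute) for elem in root.findall(element_xpath)]


# Illustrative document without namespaces.
doc = (
    "<svg>"
    '<path d="M 10 20 L 30 40"/>'
    '<rect width="5"/>'
    '<g><path d="M 0 0 L 1 1"/></g>'
    "</svg>"
)

values = select_text_data(doc, ".//path[@d]", "d")
```

Each selected string would then be handed to the analysis and encoding steps; a full XPath engine would be needed for expressions outside the ElementTree subset.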

Claims

An information processing apparatus comprising:
means for obtaining a structured document in text format;
means for analyzing the structured document and detecting text data; and
generating means for converting the structured document by converting the text data into child elements, and generating a structured document in a binary format based on the converted structured document.

The information processing apparatus according to claim 1, wherein the text data is an attribute value included in the structured document.

The information processing apparatus according to claim 1, wherein the text data is element content included in the structured document.

The information processing apparatus according to claim 1, wherein the generating means includes:
means for obtaining a schema defining the structure of the text data; and
converting means for converting the structured document by converting the text data into child elements using the schema.

The information processing apparatus according to claim 4, wherein the converting means includes:
means for converting, for a variable in the text data that corresponds to an enumerated value defined below the group element of the schema, the enumerated value into a start tag having a tag name; and
means for setting a value that follows the variable in the text data as data that follows the start tag.

An information processing method comprising:
a step of obtaining a structured document in text format;
a step of analyzing the structured document and detecting text data; and
a step of converting the structured document by converting the text data into child elements, and generating a structured document in a binary format based on the converted structured document.

A computer program for causing a computer to function as each means according to any one of claims 1 to 5.

A computer-readable storage medium storing the computer program according to claim 7.