JP3552318B2 - Document search method and system

Info

Publication number: JP3552318B2
Application number: JP00240595A
Authority: JP
Inventors: 敦畠山; 勝己多田; 寛次加藤; 悟志浅川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1995-01-11
Filing date: 1995-01-11
Publication date: 2004-08-11
Anticipated expiration: 2019-08-11
Also published as: JPH08190571A

Description

【０００１】
【産業上の利用分野】
本発明は、文書データベースを、所定の文字列すなわち検索語を指定して文書の全文を対象として検索することにより、所望の文書を検索する文書検索方法に係わるものである。特に大量な文書を高速な検索を行う場合に好適な情報検索方法に関し、大規模文書データベースに適用されるものである。
【０００２】
【従来の技術】
先に、文書の登録の際にキーワード付けを行う必要のないフルテキストサーチ方式を特願平２−１９３０１５号（特開平３−１７４６５２号公報参照）で提案した。この方式は、文書を単語単位に圧縮した凝縮本文と、文書中の使用文字を一文字単位で登録した文字成分表を用いて、検索語に関連しない文書をふるい落とすことによってサーチ速度を等価的に高め、フルテキストサーチを実用レベルで高速に行うことを目的としたものである。また、この文字成分表を改良し更に高速なフルテキストサーチを実現する連接文字成分表方式を特願平３−３４２６９５号（特開平５−１７４０６４号公報参照）で提案した。この従来技術で用いる連接文字成分表は、テキストの中に含まれる所定の長さの連接する文字列を重複なく全て取り出し、これらを含む文書の識別子情報をビット列で記述するものである。しかし、全ての連接文字について識別子情報をビット列で記述すると、文字の組み合わせの個数分だけビット列が必要となり、連接文字成分表が膨大な容量になる。そこで、この従来技術では、ハッシュ関数を用いて１個のビット列に複数個の連接文字を割り当てるようにして、容量を抑える工夫をしている。
【０００３】
【発明が解決しようとする課題】
しかしながら、従来のハッシュ関数を用いて１個のビット列に複数個の連接文字を割り当てた場合には、同じビット列にまったく別の連接文字の文書識別子情報も重畳されることになる。従って、ある連接文字を指定して該当するビット列から文書識別子情報を取り出した場合、その情報からはまったく別の連接文字を含む文書が得られる可能性がある。つまり、ハッシュ関数を用いた連接文字成分表による検索結果には検索ノイズが含まれることになる。このことは、大量の文書を登録する大規模な文書検索システムでは、検索語に関連しない不要な文書のふるい落とし、すなわち絞り込みが適切に行われない可能性があることを意味し、その場合には検索性能の低下につながる。
【０００４】
ハッシュ関数を用いずに、全ての連接文字についてそれぞれ１個のビット列を対応させることも考えられるが、その場合にはビット列のデータ量が膨大なものとなり、実用的ではない。具体的に説明すると、日本語で使用する文字コードは、現在約８，０００種類あるので、２文字の組み合わせとしての連接文字の種類は、８，０００×８，０００＝６，４００万種類となる。登録する文書数を１００万件とした場合、この６，４００万種類のそれぞれの連接文字に１００万ｂｉｔの文書識別子情報を対応させるので、６，４００万種類×１００万ｂｉｔ＝８ＴＢｙｔｅもの容量が必要になる。この文字成分表の大きさに対し、文書本体の大きさを２０ＫＢ／件としても、１００万件で、２０ＫＢ×１００万件＝２０ＧＢｙｔｅであり、圧倒的に文字成分表の容量のほうが大きくなってしまう。
【０００５】
すなわち、本発明の解決しようとする課題は、大規模な情報検索システムにおいても検索ノイズの少ない連接文字成分表を、実用的な容量で実現することにある。
【０００６】
【課題を解決するための手段】
本発明は、以下の構成を採ることにより上述の課題を解決する。
【０００７】
文書のテキストデータにおける複数の文字の共起関係を記述した連接文字を連接文字ファイルに重複なく格納する連接文字格納ステップと、前記連接文字ファイルに格納された連接文字を参照して、指定した条件式中の検索語に含まれる連接文字を含む文書を検索結果の候補とする文書検索方法において、連接文字格納ステップとして、テキストデータ中に現れる連接文字成分の種類および各連接文字成分の出現する文書数を算出し、算出された文書数が所定のしきい値より大きい場合は該当文書の文書番号に対応する位置を“１”とするビット列として登録し、しきい値より小さい場合には該当文書の文書番号をバイナリデータとして格納することを特徴とする。
【０００８】
より詳細に言うと以下の（１）〜（６）の各ステップに分けることができる。
【０００９】
（１）テキストデータ分割ステップ
（２）文書識別子情報作成ステップ
（３）文書識別子情報マージステップ
（４）検索語分割ステップ
（５）文書識別子情報探索ステップ
（６）文書識別子情報ＡＮＤステップ
（１）から（３）は文字成分表の登録のための処理であり、（４）から（６）はこれを用いた検索のための処理である。これより、各ステップの処理内容を説明する。
【００１０】
（１）テキストデータ分割ステップ
文字成分表への登録の際、文字の組合せの個数および各組合せに対応する文書識別子の記憶容量を抑えるために一回に処理する文書数を適切な数に分割する。分割する文書数は、予め設定してもよいし、登録に使用する計算機のメモリ容量から算出してもよい。
【００１１】
（２）文書識別子情報作成ステップ
（１）で分割した文書群のそれぞれについて別個に文書識別子情報を作成していく。具体的には、文書中に実際に現われた文字の組合せとその文字の組合せが現われた文書識別子の情報を対にして格納する。
【００１２】
（３）文書識別子情報マージステップ
（２）で作成した文書識別子情報を（１）で分割した文書群の数分マージして、登録文書全体の文字成分表を作成する。
【００１３】
（４）検索語分割ステップ
与えられた検索語を登録時と同じ方法で文字の組合せに分割する。
【００１４】
（５）文書識別子情報探索ステップ
（４）で分割した文字のそれぞれについて、文書識別子情報を探索する。
【００１５】
（６）文書識別子情報ＡＮＤステップ
（５）で得られた文書識別子情報のそれぞれについて、ＡＮＤ処理を行うことにより、与えられた検索語の全ての連接文字を含む文書を文字成分表サーチ結果として出力する。
【００１６】
【作用】
以下、これらのステップからなる本発明の文書検索方法の原理を説明した上で、その作用を説明する。
【００１７】
まず、本発明で用いる文字成分表の構成について説明する。本発明では、連接文字に対応する文書識別子情報を管理するのに、文字テーブル、ファイルポインタテーブルを用いる。図２は文字テーブルおよびファイルポインタテーブルの概要を示す図である。
【００１８】
たとえば、“構成”という文字列を含む文書を検索する場合には、まず文字テーブルについて“構”の文字に対応するレコードを参照してファイルポインタテーブルへのポインタ情報５８０を得る。次に、ファイルポインタテーブルの先頭から５８０バイト目からの各レコードを参照して、第二文字目が“成”のレコードを探索する。ファイルポインタテーブルには、各連接文字の第一文字目ごとに、先頭に第二文字目が０のレコードを格納しておく。第二文字目が０のレコードには、第一文字目の一文字を含んでいる全ての文書の文書識別子情報へのポインタを格納しておく。すなわち、第二文字目が０のレコードは、第一文字だけからなる単一文字に対応する文書識別子情報をアクセスするためのファイル識別子（以後ファイルＩＤとも呼ぶ）とファイル内バイト位置（以後オフセットとも呼ぶ）を格納する。したがって、各連接文字ごとに第二文字目が０のレコードが必ず存在するため、例えば、“構成”の連接文字を探索する場合は、“構”に対応するファイルポインタテーブルの先頭から５８０バイト目のレコードから探索を開始し、再び第二文字目が０になるまで探索を続け、もし“成”の文字が見つからない場合は、該当する連接文字がないと判断できる。図２の例では、“成”のレコードが存在するため、ここからファイルＩＤが１、オフセットが１０３４という文書識別子情報へアクセスするための情報を得ることができる。
【００１９】
文書識別子情報は、図３のように複数のファイルに分割格納する。ファイルポインタテーブルのファイルＩＤ情報により、どのファイルに文書識別子情報が格納されているかを特定する。なおかつ特定のファイルＩＤは、文書識別子情報をビット列で持つとあらかじめ決めておく。図３の例では、ファイル１が文書識別子情報をビット列で持つファイルとしている。図２の例では、連接文字“構成”に関する文書識別子情報へのアクセス情報として、ファイルＩＤが１、オフセットが１，０３４が得られる。したがって、ファイル１内の１，０３４バイト目からのビット列“０１１１０１０１０１．．．．”が文書識別子情報として得られることになる。このビット列は、先頭ビットから文書番号に対応して、“１”が連接文字“構成”を含む文書を示すことになる。すなわち、この例では、“構成”を含む文書の文書番号は、１、２、３、５、７、９．．．．となる。図３の他のファイル（ファイル２及びファイル３）は文書識別子情報をＩＤリストの形式で格納したものである。各ＩＤリストの先頭は格納してある文書番号の個数を示している。例えば、連接文字“構造”の場合、図２の例では、ファイルＩＤが２、オフセットが３４０であるので、ファイル２の先頭から３４０バイト目を参照することによって、連接文字“構造”を含む文書数が５６個あり、文書番号が５６２、１０３８、．．．であることがわかる。
【００２０】
このように、ファイルポインタテーブルには、データベース中に存在する連接文字のみを登録するので、データベース中に存在しない文字の組み合わせは全て排除できるという利点がある。したがって、文字テーブルやファイルポインタテーブルで実現している連接文字の管理情報を格納するファイル量やメモリ量を大幅に削減することができる。また、文書識別子情報をビット列あるいはＩＤリストの形式で格納し、多くの文書を格納する場合はビット列で、少ない文書を格納する場合はＩＤリストの形式で管理することによりファイル容量を大幅に削減することができる。具体的に説明すると、ビットリストの形式で文書識別子情報を格納するには、常にデータベースに登録した全件分のビット数が必要になるが、ＩＤリストの形式で文書識別子情報を格納する場合には、文書識別子を表わすビット数×登録文書数ですむことになる。例えば、データベースの全登録件数が１００万件で、一個の文書識別子情報を表わすのに３２ビットを割り当てるとすると以下の格納領域が必要となる。連接文字“構造”を含む文書を１０件登録する場合に、ビット列ならば、１００万ｂｉｔ＝１２５ＫＢの格納領域が必要となるが、ＩＤリスト形式ならば、３２ｂｉｔ×１０件＝４０Ｂの格納領域ですむことになる。一方、例えば、連接文字“構成”を含む文書が１００万件中で９０万件ある場合には、ビット列ならば、１００万ｂｉｔ＝１２５ＫＢの格納領域にすむのに対し、ＩＤリスト形式の場合、３２ｂｉｔ×９０万件＝３．６ＭＢの領域が必要となる。したがって、この１００万件を、文書識別子３２ビットで格納する場合には、１００万ｂｉｔ÷３２ｂｉｔ＝３１，２５０件を境として、これよりも登録件数が多い場合はビット列形式で、少ない場合はＩＤリスト形式で文書識別子情報を格納するのが、最も格納領域を有効に使用する方法である。
【００２１】
次に、このような文字成分表の登録の方法について、原理を説明する。文字テーブルとファイルポインタテーブルを用い、データベース中に用いられる連接文字のみを文字成分表に登録することにより、ファイル容量を実用容量に抑えることができることは既に説明した。
【００２２】
したがって、登録時に全ての連接文字成分について管理をしようとすると、メモリ容量が足りなくなり、文字成分表を作ることが不可能となる。磁気ディスクをワークにして情報を一旦退避する方法もあるが、アクセス速度が遅いので登録処理に極めて時間が掛かることになる。そこで、図４のように登録するテキストデータを分割して、分割したテキストデータ毎に文字成分表を作成し、最後にこれらをマージして全テキストデータの文字成分表を作成する。図４では、全部で２万４千件のテキストデータを８千件毎に分割して文字成分表を作成する例を示している。“構成”という連接文字について、最初の８千件のテキストデータでは、文書番号５０、１４５、２９０．．．．が文書識別子情報として蓄えられる。同様に、次の８千件、その次の８千件についても各分割したテキストデータ毎に文字成分表を作成する。最後に、それぞれで得られた文書識別子情報をマージして、本図の例では、“構成”の連接文字に対する文書識別子情報として、５０、１４５、２９０、８０９６、１２３６５、１７８５１、２２９８９．．．という情報を作成する。
【００２３】
検索の際には、入力された検索語を連接文字に分割し、それぞれの連接文字に対応する文書識別子情報を読み出してきて、それらの情報の積集合を取り、これを文字成分表の検索結果とする。すなわち、“建造物”という検索語については、“建造”と“造物”の２種類の連接文字について、それぞれ文字成分表の文書識別子情報を読み出してそれらの積を演算する。例えば、連接文字“建造”に対応する文書識別子情報が５６２、１０３８、２４５８．．．．で、連接文字“造物”に対応する文書識別子情報が２６１、５６２、２４５８．．．．の場合は、検索語“建造物”の文字成分表サーチ結果は文書番号で５６２、２４５８．．．．となる。
【００２４】
このように、各連接文字に対する文書識別子情報はノイズのない情報であるため、これらの文書識別子情報を論理式演算（ＡＮＤ）して得られる文字成分表サーチ結果も、従来のハッシングを行う文字成分表のサーチ結果に比べ、ハッシングに起因するノイズが除去されることになり、検索精度が大幅に向上できることになる。
【００２５】
【実施例】
以下、本発明の実施例について図を用いて詳細に説明する。
【００２６】
図１は、本実施例の構成を示す図である。本実施例は、登録検索用の端末１０１、１０２、．．．１１０、ネットワーク２００、文書サーバ１０００からなる。文書サーバ１０００には、ＬＡＮアダプタ１０１０、ＣＰＵ１０２０、ワークメモリ１０３０、文字テーブル１１００とファイルポインタテーブル１２００を格納するメモリ、テキストデータ分割プログラム１３１０、文書識別子情報作成プログラム１３２０、文書識別子情報マージプログラム１３３０、検索語分割プログラム１３４０、文書識別子情報探索プログラム１３５０、文書識別子情報ＡＮＤプログラム１３６０を格納するメモリ、文字成分表を分割して格納するファイル１４０１、１４０２、．．．、テキストデータ１４１０からなる。
【００２７】
データの登録時には、テキストデータ分割プログラム１３１０で登録する文書データを一定の件数に分割し、分割したそれぞれのテキストデータについて文書識別子情報作成プログラム１３２０で文書識別子情報を作成して、最後に分割して作成したそれぞれの文書識別子情報を文書識別子情報マージプログラム１３３０でマージして文字テーブル１１００、ファイルポインタテーブル１２００、文字成分表１４０１、１４０２、１４０３を作成する。
【００２８】
また、データの検索時には、各端末から与えられた検索語を検索語分割プログラム１３４０によって文書識別子情報を作成したときと同じアルゴリズムで連接文字に分割し、それぞれの連接文字について文書識別子情報探索プログラム１３５０で該当する文書識別子情報を文字成分表１４０１、１４０２、１４０３から取り出す。そして、検索語を構成する全ての連接文字に対応する文書識別子情報を文書識別子情報ＡＮＤプログラム１３６０によってＡＮＤすることで検索語を含む文書を文字成分表のサーチ結果とする。
【００２９】
まず、データの登録処理に従い、文字成分表の作成手順を説明し、次に検索処理に従って文字成分表による候補文書の抽出過程を説明する。作用の項でも説明したように、大量の文書について文字成分表を一度に登録するには、大量のメモリを使用しなければならないので、本実施例では８，０００件ごとに小さな文字成分表を作成し、最後に一つの文字成分表に統合する処理を行う。図５に、この文書識別子情報作成処理の手順を示す。まず、８，０００件のそれぞれの文書について（５０１０）連接文字の抽出（５０２０）を行い、切り出した連接文字についてその出現頻度情報を計数（５０３０）する。そして、算出した出現頻度にしたがって文書識別子情報を格納するメモリエリアをワークメモリ上に確保し、それぞれの連接文字の出現頻度情報が所定のしきい値より大きい場合にはビット列で各連接文字が出現する文書番号を文書識別子情報として登録（５０４０）していく。８，０００件の全ての文書について文書識別子情報を登録し終わったらファイルに文字テーブル、ファイルポインタテーブル、文書識別子情報を格納（５０５０）しメモリ領域を解放する。８，０００件単位にこのように小さな分割文字成分表を作成し、最後に各分割文字成分表をマージ（５０６０）してデータベース全体の文字成分表を作成する。
【００３０】
この分割文字成分表のマージ処理（５０６０）は、図６に示すとおり各分割文字成分表の文字テーブルとファイルポインタテーブルを参照し、それぞれの連接文字に対応する文書識別子情報を統合する形で進めていく。図６は二個の分割文字成分表を一個の文字成分表に統合する例を示している。具体的な処理の手順を図７に示す。まず、それぞれの分割文字成分表の文字テーブルを参照（７０１０）し、統合した文字テーブルを作成（７０３０）する。この時文字テーブルの各レコードについて（７０２０）、どちらか一方にしか登録されていないレコードについては、登録されている側に記録されているファイルポインタテーブルの各文字について（７０４０）内容を統合したファイルポインタテーブルに登録する（７０５０）とともに、ファイルポインタテーブルで管理されている文書識別子情報をマージ前の小さな文字成分表からマージ後の文字成分表へコピー（７０６０）していく。また、双方の文字テーブルに同じ文字が存在する場合には、記録されているファイルポインタテーブルの各文字について（７０７０）、ファイルポインタテーブルに記載された第二文字目を比較しながら統合したファイルポインタテーブルを作成（７０８０）していく。すなわち、ファイルポインタテーブルの第二文字目が一致しない場合には、該当する文書識別子情報をコピー（７０９０）し、一致する場合には双方の文書識別子情報をマージ（７１００）して格納する。
【００３１】
この文書識別子情報のマージ及びコピーの際には、マージ後の登録件数から所定の件数よりも多い場合にはビット列に、少ない場合にはＩＤリストの形式にして格納する。
【００３２】
以上のマージ処理アルゴリズムを図６を用いて具体的に説明する。“構”の文字は文字テーブル１および文字テーブル２のどちらにも存在する。したがって、“構”の文字に対応するファイルポインタテーブル１の内容とファイルポインタテーブル２の内容を統合ファイルポインタテーブルに登録していく。ファイルポインタテーブルにおける該当レコードの先頭の第二文字目が０のレコードは、“構”の一文字を含む文書の識別子情報をアクセスするための情報を格納している。この第二文字目が０のレコードはファイルポインタテーブル１とファイルポインタテーブル２の両方に存在するので、双方のファイルＩＤとオフセットで与えられる文書識別子情報をマージして統合文字成分表に登録する。“構”に対応するファイルポインタテーブルの第２レコード“成”についても同様である。第３レコードについてはファイルポインタテーブル１が“造”であるのに対して、ファイルポインタテーブル２では“築”と異なっている。したがって、それぞれの文書識別子情報をマージ前の小さな文字成分表から統合文字成分表へコピーする。
【００３３】
検索処理は、図８に示す手順で行う。まず、検索語から連接文字を切り出す（８０１０）。次に、切り出したそれぞれの連接文字について（８０２０）、文字テーブルを探索する（８０３０）。そして、該当するファイルポインタテーブルの各レコードについて、第二文字目の探索を行い（８０４０）該当するファイルＩＤとオフセットを得る。こうして、得られた文書識別子情報を格納したファイルとそのオフセット値より、該当する連接文字に対応するＩＤリストまたはビット列を読み出し、ＩＤリストの場合にはこれをビット列に変換することにより文書識別子情報を取得する（８０５０）。この文書識別子情報の取得の過程で該当する連接文字が文字成分表に登録されていない場合（８０６０、８０７０）には、すなわち検索語を構成する連接文字のうちどれか一つでも文字成分表に登録されていなければ、検索語を含む該当文書がないということを意味することになるため検索結果として０件という結果を、文書識別子情報探索プログラム１３５０がＬＡＮアダプタ１０１０を介して検索端末に返す。
【００３４】
検索語を構成する全ての連接文字について該当する文書識別子情報が得られた場合は、得られたそれぞれの文書識別子情報の積集合をとることによって、指定された検索語中の全ての連接文字を含む文書のみを抽出することができる。
【００３５】
このようにして得られた文字成分表の検索結果は、検索ノイズが非常に少ないので、文字成分表のサーチ結果を表示しても十分実用できる。もちろん、文字成分表のサーチ結果をもとに、文書本文を検索し実際に検索語を含む文書のみに絞り込むかあるいは、複数の検索語間の位置的関係を満たす文書を探すことも可能である。また、文字成分表の検索結果を一度検索端末に表示し、ユーザの指定により本文の探索を行うかどうかを決定してもよい。
【００３６】
以上、本実施例によれば、データベース中に存在する連接文字のみを登録するので、データベース中に存在しない文字の組み合わせは全て排除できるという利点がある。また、文書識別子情報をビット列とＩＤリストの形式で格納し、多くの文書識別子情報を格納する場合はビット列で、少ない文書識別子情報を格納する場合はＩＤリストの形式で格納することでファイル容量を大幅に削減することができる。
【００３７】
さらに、各連接文字に対して文書識別子情報は必ずその連接文字を含むノイズのない情報であるから、これらの文書識別子情報をＡＮＤして得られる文字成分表サーチ結果も、検索精度を大幅に向上することができる。
【００３８】
また、本発明によれば２文字以上の連接文字についても登録することにより、さらに文字成分表サーチの検索ノイズを少なくすることも可能である。
【００３９】
【発明の効果】
本発明によれば、文書識別子情報をビット列とＩＤリストのどちらかの形式で選択的に格納することにし、多くの文書識別子情報を格納する場合はビット列で、少ない文書識別子情報を格納する場合はＩＤリストの形式で格納することでファイル容量を大幅に削減することができる。
【００４０】
また、各連接文字に対して文書識別子情報は必ずその連接文字を含むノイズのない情報であり、これらの文書識別子情報を検索語の連接文字の個数分ＡＮＤするので、文字成分表サーチの検索精度を大幅に向上することができる。これにより、検索語間の位置的な条件などを検索する場合にも、より本文情報の検索範囲を狭めることができるという利点がある。
【００４１】
さらに、文字テーブル及びファイルポインタテーブルを用いることにより、データベース中に存在する連接文字のみを登録するので、データベース中に存在しない文字の組み合わせは全て排除できるので、連接文字を管理するために必要なメモリ量を少なくできるという利点がある。
【００４２】
さらにまた、文字成分表の登録の際に、登録文書を分割して小さな分割文字成分表を作成し、後でこれらの分割文字成分表をマージして目的の文字成分表を作成することにより、少ないメモリ容量でも効率的に大きなデータベースの文字成分表を作成することができる。
【図面の簡単な説明】
【図１】本発明の第一の実施例の構成図である。
【図２】文字成分表のテーブル構成図である。
【図３】文書識別子情報格納ファイルの概要を示す図である。
【図４】文字成分表登録処理の概要を示す図である。
【図５】登録処理の流れを示すＰＡＤ図である。
【図６】分割文字成分表の統合処理を示す概念図である。
【図７】統合処理の流れを示すＰＡＤ図である。
【図８】検索処理の流れを示すＰＡＤ図である。

[0001]
[Industrial applications]
The present invention relates to a document search method that retrieves desired documents by searching a document database over the full text of each document for a specified character string, that is, a search term. In particular, it relates to an information search method suited to high-speed search of a large volume of documents, and is applicable to large-scale document databases.
[0002]
[Prior art]
A full-text search method that requires no keyword assignment when documents are registered was previously proposed in Japanese Patent Application No. 2-193015 (see Japanese Patent Application Laid-Open No. 3-174652). That method uses a condensed text, in which each document is compressed in word units, and a character component table, in which the characters used in each document are registered one character at a time, to screen out documents unrelated to the search term. This effectively raises the search speed, and the aim is to make full-text search fast enough for practical use. A concatenated character component table method that improves on this character component table and achieves still faster full-text search was then proposed in Japanese Patent Application No. 3-342695 (see Japanese Patent Application Laid-Open No. 5-174064). The concatenated character component table used in that prior art extracts, without duplication, every run of adjacent characters of a predetermined length contained in the text, and describes the identifier information of the documents containing each run as a bit string. However, if identifier information is described as a bit string for every concatenated character, as many bit strings are needed as there are character combinations, and the table becomes enormous. The prior art therefore uses a hash function to assign multiple concatenated characters to a single bit string, which keeps the capacity down.
[0003]
[Problems to be solved by the invention]
However, when a conventional hash function assigns multiple concatenated characters to one bit string, the document identifier information of entirely unrelated concatenated characters is superimposed on the same bit string. Consequently, when a concatenated character is specified and the document identifier information is read from the corresponding bit string, that information may point to documents that contain a completely different concatenated character. In other words, search results obtained from a hashed concatenated character component table contain search noise. In a large-scale document search system holding a large number of documents, this means that unwanted documents unrelated to the search term may not be screened out properly, that is, the narrowing-down may fail, and search performance degrades as a result.
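The noise described above can be reproduced with a toy example. The sketch below is not the hash function of the prior art, which the text does not specify; `toy_hash`, `NUM_SLOTS`, and the two sample documents are assumptions chosen only to force a collision between two different character pairs.

```python
from collections import defaultdict

NUM_SLOTS = 8  # assumed, deliberately tiny so that collisions are easy to see


def toy_hash(bigram: str) -> int:
    """Hypothetical hash: sum of the code points modulo the slot count."""
    return sum(ord(ch) for ch in bigram) % NUM_SLOTS


def build_hashed_table(docs: dict[int, str]) -> dict[int, set[int]]:
    """Map each hash slot to the documents containing any bigram that hashes there."""
    table: dict[int, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for i in range(len(text) - 1):
            table[toy_hash(text[i:i + 2])].add(doc_id)
    return table


docs = {0: "ab", 1: "ba"}          # document 1 does not contain "ab"
table = build_hashed_table(docs)
candidates = table[toy_hash("ab")]  # "ab" and "ba" share a slot, so both documents come back
```

Because "ab" and "ba" fall into the same slot, document 1 appears as a candidate for "ab" even though it never contains that pair. This is the kind of false candidate the text calls search noise.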
[0004]
One could instead associate a separate bit string with every concatenated character without using a hash function, but the volume of bit-string data would then be enormous and impractical. Concretely, about 8,000 character codes are currently used in Japanese, so the number of distinct two-character concatenated characters is 8,000 × 8,000 = 64 million. If one million documents are registered, each of these 64 million concatenated characters needs one million bits of document identifier information, so 64 million × 1 million bits = 8 TBytes of capacity is required. By comparison, even if each document body is taken to be 20 KB, one million documents amount to 20 KB × 1 million = 20 GBytes, so the character component table would be overwhelmingly larger than the documents themselves.
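The estimate above can be checked with a few lines of arithmetic. This is only a restatement of the figures in the text; decimal units (1 KB = 1,000 bytes) are assumed, since that is what makes the quoted totals come out exactly.

```python
CHAR_KINDS = 8_000        # approximate number of Japanese character codes (from the text)
NUM_DOCS = 1_000_000      # registered documents (from the text)
DOC_SIZE_BYTES = 20_000   # 20 KB per document, assuming 1 KB = 1,000 bytes

bigram_kinds = CHAR_KINDS * CHAR_KINDS        # distinct two-character combinations
table_bytes = bigram_kinds * NUM_DOCS // 8    # one bit per (combination, document) pair
corpus_bytes = DOC_SIZE_BYTES * NUM_DOCS      # total size of the document bodies

print(f"combinations : {bigram_kinds:,}")
print(f"table size   : {table_bytes / 1e12:g} TB")
print(f"corpus size  : {corpus_bytes / 1e9:g} GB")
print(f"ratio        : {table_bytes // corpus_bytes}x")
```

Under these assumptions the unhashed table is several hundred times larger than the corpus it indexes, which is the motivation for registering only the combinations that actually occur.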
[0005]
That is, the problem to be solved by the present invention is to realize, at a practical capacity, a concatenated character component table with little search noise even in a large-scale information search system.
[0006]
[Means for Solving the Problems]
The present invention solves the above-mentioned problems by adopting the following configuration.
[0007]
The document search method comprises a concatenated character storing step of storing, without duplication, concatenated characters that describe the co-occurrence of multiple characters in the text data of documents in a concatenated character file, and refers to the concatenated characters stored in that file so that documents containing the concatenated characters included in a search term of a specified conditional expression become search result candidates. The method is characterized in that the concatenated character storing step calculates the kinds of concatenated character components appearing in the text data and the number of documents in which each component appears; if the calculated number of documents is larger than a predetermined threshold, the documents are registered as a bit string in which the position corresponding to each document number is set to "1", and if it is smaller than the threshold, the document numbers are stored as binary data.
[0008]
More specifically, it can be divided into the following steps (1) to (6).
[0009]
(1) Text data division step
(2) Document identifier information creation step
(3) Document identifier information merge step
(4) Search term division step
(5) Document identifier information search step
(6) Document identifier information AND step
Steps (1) to (3) are the processing for registering the character component table, and steps (4) to (6) are the processing for searching with it. The processing of each step is described below.
[0010]
(1) Text data division step
When registering into the character component table, the documents are divided into groups of a suitable size to be processed at one time, in order to limit the number of character combinations and the storage needed for the document identifiers corresponding to each combination. The number of documents per group may be set in advance or may be calculated from the memory capacity of the computer used for registration.
[0011]
(2) Document identifier information creation step
Document identifier information is created separately for each of the document groups divided in (1). Specifically, each character combination that actually appears in the documents is stored as a pair with the identifiers of the documents in which it appears.
[0012]
(3) Document identifier information merge step
The document identifier information created in (2) is merged across all the document groups divided in (1) to create the character component table for the whole set of registered documents.
[0013]
(4) Search term division step
The given search term is divided into character combinations in the same way as at registration.
[0014]
(5) Document identifier information search step
The document identifier information is looked up for each of the character combinations obtained in (4).
[0015]
(6) Document identifier information AND step
The pieces of document identifier information obtained in (5) are ANDed together, and the documents that contain all the concatenated characters of the given search term are output as the character component table search result.
[0016]
[Action]
Hereinafter, the principle of the document search method of the present invention including these steps will be described, and then the operation thereof will be described.
[0017]
First, the configuration of the character component table used in the present invention will be described. In the present invention, a character table and a file pointer table are used to manage the document identifier information corresponding to the connected characters. FIG. 2 is a diagram showing an outline of the character table and the file pointer table.
[0018]
For example, to search for documents containing the character string "構成" ("configuration"), the record for the character "構" in the character table is consulted first to obtain the pointer information 580 into the file pointer table. Next, the records starting at byte 580 from the head of the file pointer table are examined to find the record whose second character is "成". In the file pointer table, for each first character of a concatenated character, a record whose second character is 0 is stored at the head. That record holds a pointer to the document identifier information of all documents containing the first character alone. That is, the record whose second character is 0 stores the file identifier (hereinafter also called the file ID) and the byte position within the file (hereinafter also called the offset) used to access the document identifier information for the single character consisting of the first character only. Because such a record always exists for every first character, a search for the concatenated character "構成", for example, starts from the record at byte 580 of the file pointer table, which corresponds to "構", and continues until a record whose second character is 0 is reached again. If "成" has not been found by then, it can be concluded that the concatenated character is not registered. In the example of FIG. 2, a record for "成" exists, and it yields the information needed to access the document identifier information: file ID 1 and offset 1034.
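The two-level lookup can be sketched as follows. This is a minimal illustration, not the patent's data layout: it indexes the file pointer table by record position instead of byte offset, uses an empty string in place of the 0 code, and every value other than the pairs (file ID 1, offset 1034) for "構成" and (file ID 2, offset 340) for "構造" is hypothetical.

```python
from typing import NamedTuple, Optional


class PointerRecord(NamedTuple):
    second_char: str  # "" stands in for the 0 code marking a single-character record
    file_id: int
    offset: int


def find_bigram(char_table: dict[str, int],
                pointer_table: list[PointerRecord],
                bigram: str) -> Optional[tuple[int, int]]:
    """Return (file_id, offset) for a two-character string, or None if unregistered."""
    first, second = bigram[0], bigram[1]
    start = char_table.get(first)
    if start is None:
        return None  # the first character never appears in the database
    # Skip the leading single-character record, then scan this character's group.
    for record in pointer_table[start + 1:]:
        if record.second_char == "":
            break  # reached the next first character's group: not registered
        if record.second_char == second:
            return record.file_id, record.offset
    return None


char_table = {"構": 0, "検": 3}  # positions are hypothetical
pointer_table = [
    PointerRecord("", 3, 0),       # "構" alone (hypothetical location)
    PointerRecord("成", 1, 1034),   # "構成", values from the text
    PointerRecord("造", 2, 340),    # "構造", values from the text
    PointerRecord("", 3, 64),      # "検" alone (hypothetical)
    PointerRecord("索", 2, 900),    # "検索" (hypothetical)
]

print(find_bigram(char_table, pointer_table, "構成"))
print(find_bigram(char_table, pointer_table, "構築"))
```

The sentinel record does double duty here, as in the text: it gives access to the single-character postings and it terminates the scan for a missing second character.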
[0019]
The document identifier information is stored divided across multiple files, as shown in FIG. 3. The file ID in the file pointer table identifies which file holds the document identifier information. In addition, it is decided in advance that particular file IDs hold document identifier information as bit strings. In the example of FIG. 3, file 1 is the file that holds document identifier information as bit strings. In the example of FIG. 2, the access information obtained for the concatenated character "構成" is file ID 1 and offset 1,034. The bit string "0111010101...." starting at byte 1,034 of file 1 is therefore obtained as the document identifier information. In this bit string, the bits correspond to document numbers in order from the first bit, and a "1" marks a document that contains the concatenated character "構成". That is, in this example the documents containing "構成" are document numbers 1, 2, 3, 5, 7, 9, ...., which is consistent with the first bit standing for document number 0. The other files in FIG. 3 (file 2 and file 3) store document identifier information in ID list form. The head of each ID list gives the number of document numbers stored. For example, for the concatenated character "構造" ("structure"), the example of FIG. 2 gives file ID 2 and offset 340, so referring to byte 340 from the head of file 2 shows that 56 documents contain "構造" and that their document numbers are 562, 1038, ....
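Reading back the two storage formats might look like the sketch below. The on-disk encoding is not given in the text, so the choices here are assumptions: bits are taken most-significant-first with the first bit standing for document number 0, and ID lists are a 32-bit big-endian count followed by 32-bit big-endian document numbers. The sample ID list is shortened to two entries for illustration.

```python
import struct


def decode_bit_string(data: bytes, num_docs: int) -> list[int]:
    """Return the document numbers whose bit is 1 (MSB first, first bit = document 0)."""
    return [i for i in range(num_docs) if data[i // 8] >> (7 - i % 8) & 1]


def decode_id_list(data: bytes, offset: int = 0) -> list[int]:
    """Read a count, then that many document numbers (assumed 32-bit big-endian)."""
    (count,) = struct.unpack_from(">I", data, offset)
    return list(struct.unpack_from(f">{count}I", data, offset + 4))


# "0111010101" padded with zeros to two bytes: 01110101 01000000
bit_string = bytes([0b01110101, 0b01000000])
print(decode_bit_string(bit_string, 10))

# A shortened ID list: count 2, then document numbers 562 and 1038
id_list = struct.pack(">3I", 2, 562, 1038)
print(decode_id_list(id_list))
```

Which decoder to apply is decided by the file ID alone, since the text fixes in advance which files hold bit strings.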
[0020]
As described above, only the concatenated characters that exist in the database are registered in the file pointer table, so every character combination that does not occur in the database is excluded. This greatly reduces the file and memory space needed for the concatenated character management information held in the character table and the file pointer table. Furthermore, the document identifier information is stored either as a bit string or as an ID list: a bit string is used when many documents are to be recorded and an ID list when few are, which greatly reduces the file capacity. Concretely, storing document identifier information as a bit string always requires as many bits as there are documents registered in the database, whereas storing it as an ID list requires only (number of bits per document identifier) × (number of documents recorded). For example, suppose the database holds one million documents in total and 32 bits are allotted to each document identifier. If 10 documents contain the concatenated character "構造", a bit string needs 1 million bits = 125 KB of storage, while an ID list needs only 32 bits × 10 = 40 B. On the other hand, if 900,000 of the one million documents contain the concatenated character "構成", a bit string still needs only 1 million bits = 125 KB, while an ID list needs 32 bits × 900,000 = 3.6 MB. Therefore, with one million documents and 32-bit document identifiers, the break-even point is 1 million bits ÷ 32 bits = 31,250 documents: storing the document identifier information as a bit string above that count and as an ID list below it makes the most effective use of storage.
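The size comparison and the break-even point can be written out directly. This sketch follows the arithmetic in the text and, like the text, ignores the count field at the head of an ID list; the function names are illustrative.

```python
ID_BITS = 32  # bits allotted to one document identifier (from the text)


def bit_string_bytes(total_docs: int) -> int:
    """A bit string always needs one bit per registered document."""
    return (total_docs + 7) // 8


def id_list_bytes(matching_docs: int) -> int:
    """An ID list needs one identifier per document that contains the character pair."""
    return matching_docs * ID_BITS // 8


def break_even(total_docs: int) -> int:
    """Number of matching documents at which both formats cost the same."""
    return total_docs // ID_BITS


def choose_format(matching_docs: int, total_docs: int) -> str:
    return "bit_string" if matching_docs > break_even(total_docs) else "id_list"


TOTAL = 1_000_000
for matching in (10, 900_000):
    print(matching,
          bit_string_bytes(TOTAL),
          id_list_bytes(matching),
          choose_format(matching, TOTAL))
```

Choosing per character pair rather than globally is what keeps rare pairs at a few bytes while capping frequent pairs at the fixed bit-string size.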
[0021]
Next, the principle of the method of registering such a character component table will be described. It has already been described that the file capacity can be reduced to a practical capacity by using the character table and the file pointer table and registering only the connected characters used in the database in the character component table.
[0022]
Consequently, if all concatenated character components were managed at once during registration, memory would run short and the character component table could not be built. Information could be saved temporarily to magnetic disk as a work area, but disk access is slow and registration would take an extremely long time. Instead, as shown in FIG. 4, the text data to be registered is divided, a character component table is created for each division, and these are finally merged into the character component table for the whole text data. FIG. 4 shows an example in which 24,000 text data items in total are divided into groups of 8,000 to create the character component table. For the concatenated character "構成", the first 8,000 items yield document numbers 50, 145, 290, .... as document identifier information. A character component table is created in the same way for the next 8,000 items and for the 8,000 after that. Finally, the document identifier information obtained from each division is merged; in the example of the figure, the document identifier information for the concatenated character "構成" becomes 50, 145, 290, 8096, 12365, 17851, 22989, ....
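The divide-then-merge registration can be sketched with plain dictionaries. This is an illustration under assumptions: postings are kept as Python lists instead of files, two-character combinations are used, batches are processed in ascending document-number order, and the batch size of 2 stands in for the 8,000 of the example.

```python
from collections import defaultdict


def bigrams(text: str) -> set[str]:
    """Distinct adjacent two-character combinations in one document."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def build_partial(docs: list[tuple[int, str]]) -> dict[str, list[int]]:
    """Build a small table for one batch of (document number, text) pairs."""
    table: dict[str, list[int]] = defaultdict(list)
    for doc_no, text in docs:
        for bg in bigrams(text):
            table[bg].append(doc_no)
    return dict(table)


def merge_partials(partials: list[dict[str, list[int]]]) -> dict[str, list[int]]:
    """Concatenate postings batch by batch; order is kept because batches ascend."""
    merged: dict[str, list[int]] = defaultdict(list)
    for partial in partials:
        for bg, doc_nos in partial.items():
            merged[bg].extend(doc_nos)
    return dict(merged)


def build_in_batches(docs: list[tuple[int, str]], batch_size: int) -> dict[str, list[int]]:
    partials = [build_partial(docs[i:i + batch_size])
                for i in range(0, len(docs), batch_size)]
    return merge_partials(partials)


docs = [(0, "構成図"), (1, "構造"), (2, "再構成"), (3, "構成")]
index = build_in_batches(docs, batch_size=2)
print(index["構成"])
```

Only one batch's table has to be held in memory at a time during the build phase, which is the point of the division.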
[0023]
At search time, the input search term is divided into concatenated characters, the document identifier information for each concatenated character is read, and the intersection of these sets is taken as the search result of the character component table. For the search term "建造物" ("building"), for example, the document identifier information is read from the character component table for the two concatenated characters "建造" and "造物", and their product is computed. If the document identifier information for "建造" is 562, 1038, 2458, .... and that for "造物" is 261, 562, 2458, ...., the character component table search result for the term "建造物" is document numbers 562, 2458, ....
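The split-and-intersect step can be sketched as below, using the document numbers quoted in the text with the trailing "...." dropped. The index is a plain dictionary of sorted lists, which is an assumption made for brevity; single-character search terms, which the single-character records would serve, are not handled here.

```python
def split_into_bigrams(term: str) -> list[str]:
    """Divide a search term into overlapping two-character combinations."""
    return [term[i:i + 2] for i in range(len(term) - 1)]


def intersect_sorted(a: list[int], b: list[int]) -> list[int]:
    """Intersection of two ascending lists of document numbers."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index: dict[str, list[int]], term: str) -> list[int]:
    result = None
    for bg in split_into_bigrams(term):
        postings = index.get(bg, [])
        result = postings if result is None else intersect_sorted(result, postings)
    return result or []


index = {"建造": [562, 1038, 2458], "造物": [261, 562, 2458]}
print(split_into_bigrams("建造物"))
print(search(index, "建造物"))
```

A document that survives the intersection contains every pair of the term, though not necessarily adjacent to one another, so a scan of the body text can still be applied afterwards if exact matches are required.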
[0024]
Because the document identifier information for each concatenated character is thus free of noise, the character component table search result obtained by combining these pieces of information with a logical AND is also free of the noise caused by hashing, unlike the search result of the conventional hashed character component table, and search accuracy improves greatly.
[0025]
[Example]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0026]
FIG. 1 shows the configuration of this embodiment, which consists of registration and search terminals 101, 102, ... 110, a network 200, and a document server 1000. The document server 1000 consists of a LAN adapter 1010, a CPU 1020, a work memory 1030, a memory that stores a character table 1100 and a file pointer table 1200, a memory that stores a text data division program 1310, a document identifier information creation program 1320, a document identifier information merge program 1330, a search term division program 1340, a document identifier information search program 1350, and a document identifier information AND program 1360, files 1401, 1402, ... that hold the character component table in divided form, and text data 1410.
[0027]
At data registration, the text data division program 1310 divides the document data to be registered into groups of a fixed number of documents, the document identifier information creation program 1320 creates document identifier information for each group of text data, and finally the document identifier information merge program 1330 merges the separately created document identifier information to create the character table 1100, the file pointer table 1200, and the character component tables 1401, 1402, 1403.
[0028]
At data search, the search term division program 1340 divides the search term received from a terminal into concatenated characters using the same algorithm that was used when the document identifier information was created, and for each concatenated character the document identifier information search program 1350 retrieves the corresponding document identifier information from the character component tables 1401, 1402, 1403. The document identifier information AND program 1360 then ANDs the document identifier information for all the concatenated characters that make up the search term, and the resulting documents, which contain the search term's concatenated characters, become the search result of the character component table.
[0029]
First, the procedure for creating the character component table is described by following the data registration process; then the process of extracting candidate documents with the character component table is described by following the search process. As explained in the Action section, registering a character component table for a large number of documents in one pass requires a large amount of memory, so this embodiment creates a small character component table for every 8,000 documents and finally integrates them into a single character component table. FIG. 5 shows the procedure of this document identifier information creation process. First, for each of the 8,000 documents (5010), concatenated characters are extracted (5020) and the occurrence frequency of each extracted concatenated character is counted (5030). Then a memory area for the document identifier information is allocated in the work memory according to the calculated frequencies, and when the frequency of a concatenated character exceeds a predetermined threshold, the numbers of the documents in which it appears are registered as document identifier information in bit-string form (5040). Once document identifier information has been registered for all 8,000 documents, the character table, the file pointer table, and the document identifier information are stored in files (5050) and the memory area is released. Small divided character component tables are created in this way for each unit of 8,000 documents, and finally the divided character component tables are merged (5060) to create the character component table for the entire database.
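One way to picture steps 5020 to 5040 for a single batch is a two-pass routine: count first, then allocate and fill. The sketch below is an assumption-laden illustration, not the patent's implementation. It uses two-character combinations, numbers documents from 0 within the batch, stores a `bytearray` (most significant bit first) above the threshold, and, following the summary of the invention, a plain list of document numbers otherwise.

```python
from collections import Counter
from typing import Union

Posting = Union[bytearray, list[int]]


def doc_bigrams(text: str) -> set[str]:
    return {text[i:i + 2] for i in range(len(text) - 1)}


def document_frequencies(batch: list[str]) -> Counter:
    """Pass 1: number of documents in which each character pair appears."""
    freq: Counter = Counter()
    for text in batch:
        freq.update(doc_bigrams(text))
    return freq


def register_batch(batch: list[str], threshold: int) -> dict[str, Posting]:
    """Pass 2: allocate a bit string or an ID list per pair, then record documents."""
    freq = document_frequencies(batch)
    table: dict[str, Posting] = {
        bg: bytearray((len(batch) + 7) // 8) if n > threshold else []
        for bg, n in freq.items()
    }
    for doc_no, text in enumerate(batch):
        for bg in doc_bigrams(text):
            slot = table[bg]
            if isinstance(slot, bytearray):
                slot[doc_no // 8] |= 0x80 >> (doc_no % 8)  # set this document's bit
            else:
                slot.append(doc_no)
    return table


batch = ["構成図", "構成", "構造", "再構成"]
table = register_batch(batch, threshold=2)
print(table["構成"], table["構造"])
```

Counting before allocating is what lets the storage format be fixed per character pair up front, so nothing has to be converted while the batch is being filled.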
[0030]
The merge process (5060) for the divided character component tables proceeds, as shown in FIG. 6, by referring to the character table and the file pointer table of each divided character component table and integrating the document identifier information for each concatenated character. FIG. 6 shows an example in which two divided character component tables are integrated into one. FIG. 7 shows the specific procedure. First, the character tables of the divided character component tables are consulted (7010) and an integrated character table is created (7030). In doing so, for each record of the character tables (7020), if the record is registered in only one of them, then for each character in the file pointer table recorded on that side (7040), its contents are registered in the integrated file pointer table (7050), and the document identifier information managed by the file pointer table is copied from the small pre-merge character component table to the merged character component table (7060). If the same character exists in both character tables, then for each character in the recorded file pointer tables (7070), the integrated file pointer table is created (7080) while the second characters recorded in the file pointer tables are compared. That is, where the second characters do not match, the corresponding document identifier information is copied (7090); where they match, the document identifier information from both sides is merged (7100) and stored.
[0031]
When document identifier information is merged or copied, it is stored as a bit string if the number of registered documents after the merge is larger than a predetermined number, and in the form of an ID list if the number is smaller.
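A minimal sketch of this format re-selection at merge time follows. The tagged-tuple representation, the function names, and the treatment of a count exactly equal to the threshold (stored as an ID list here) are assumptions; the text does not specify them.

```python
def to_id_set(info):
    """Decode ("ids", [..]) or ("bits", int) into a set of document numbers."""
    kind, payload = info
    if kind == "ids":
        return set(payload)
    ids, n = set(), 0
    while payload:
        if payload & 1:
            ids.add(n)
        payload >>= 1
        n += 1
    return ids


def encode(doc_nos, threshold):
    """Choose the bit string form for large sets and the ID list form otherwise."""
    if len(doc_nos) > threshold:
        bits = 0
        for n in doc_nos:
            bits |= 1 << n  # bit n set means document number n is registered
        return ("bits", bits)
    return ("ids", sorted(doc_nos))


def merge_info(a, b, threshold):
    """Merge two pieces of document identifier information and re-encode."""
    return encode(to_id_set(a) | to_id_set(b), threshold)


small = merge_info(("ids", [0]), ("ids", [5]), threshold=3)
large = merge_info(("ids", [0, 2]), ("ids", [5, 7]), threshold=3)
```

Two ID lists that are each below the threshold can exceed it once combined, which is why the format has to be decided again after the merge.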
[0032]
The above merge processing algorithm is described concretely with reference to the drawings. The character “composition” exists in both character table 1 and character table 2. Therefore, the contents of file pointer table 1 and of file pointer table 2 corresponding to the character “composition” are registered in the integrated file pointer table. The record at the head of the corresponding records in each file pointer table, whose second character is 0, stores the information for accessing the identifier information of documents that contain the single character “composition”. Since a record whose second character is 0 exists in both file pointer table 1 and file pointer table 2, the document identifier information located by the file ID and offset of each record is merged and registered in the integrated character component table. The same applies to the second record, “composed”, of the file pointer tables corresponding to “composition”. For the third record, file pointer table 1 holds “build”, whereas the entry in file pointer table 2 is a different concatenated character. Therefore, each piece of document identifier information is copied as is from the small character component table before merging to the integrated character component table.
[0033]
The search process is performed according to the procedure shown in the drawings. First, concatenated characters are cut out from the search word (8010). Next, for each of the cut-out concatenated characters (8020), the character table is searched (8030). The second character is then searched for among the records of the corresponding file pointer table (8040), and the corresponding file ID and offset are obtained. From the file that stores the document identifier information and the offset value thus obtained, the ID list or bit string corresponding to the concatenated character is read; an ID list is converted into a bit string, so that the document identifier information is always acquired as a bit string (8050). If, in the course of acquiring the document identifier information, a concatenated character is found not to be registered in the character component table (8060, 8070), that is, if even one of the concatenated characters constituting the search word is not registered, no document contains the search word, so the document identifier information search program 1350 returns zero hits as the search result to the search terminal via the LAN adapter 1010.
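The two-level lookup of steps 8030 to 8050 can be sketched as follows. The on-disk layout is invented for illustration (a one-byte tag, then either a fixed-width bit string or a counted list of 4-byte document numbers); `N_DOCS`, the tag bytes, and all function names are hypothetical and do not come from the embodiment.

```python
import struct

N_DOCS = 16                     # hypothetical database size
BITS_LEN = (N_DOCS + 7) // 8    # bytes needed for one bit string


def write_bits(buf, ids):
    """Append a bit string record to buf and return its offset."""
    offset = len(buf)
    bits = 0
    for n in ids:
        bits |= 1 << n
    buf += b"B" + bits.to_bytes(BITS_LEN, "little")
    return offset


def write_ids(buf, ids):
    """Append an ID list record to buf and return its offset."""
    offset = len(buf)
    buf += b"L" + struct.pack("<I", len(ids)) + struct.pack(f"<{len(ids)}I", *ids)
    return offset


def read_as_bits(buf, offset):
    """Read a record; an ID list is converted to a bit string (8050)."""
    if buf[offset:offset + 1] == b"B":
        return int.from_bytes(buf[offset + 1:offset + 1 + BITS_LEN], "little")
    (count,) = struct.unpack_from("<I", buf, offset + 1)
    bits = 0
    for n in struct.unpack_from(f"<{count}I", buf, offset + 5):
        bits |= 1 << n
    return bits


def lookup(char_table, files, bigram):
    """Return the bit string for bigram, or None if it is not registered."""
    records = char_table.get(bigram[0])            # character table (8030)
    if records is None:
        return None
    for second, file_id, offset in records:        # file pointer table (8040)
        if second == bigram[1]:
            return read_as_bits(files[file_id], offset)
    return None


store = bytearray()
off_ab = write_bits(store, [0, 1, 2, 5])
off_ac = write_ids(store, [3])
char_table = {"a": [("b", 0, off_ab), ("c", 0, off_ac)]}
files = {0: store}
```

A `None` result corresponds to the unregistered case (8060, 8070), in which the search can stop and report zero hits.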
[0034]
When the corresponding document identifier information has been obtained for all the concatenated characters constituting the search word, taking the intersection of the obtained document identifier information extracts only the documents that contain all of the concatenated characters in the specified search word.
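The intersection step can be sketched as a bitwise AND over the bit strings of all concatenated characters in the search word. The sketch assumes two-character concatenated characters and an in-memory index mapping each one to an integer bit string; the names are hypothetical, and single-character search words (handled in the embodiment through the record whose second character is 0) are not covered.

```python
def cut_bigrams(term):
    """Cut the two-character concatenated characters out of a search word."""
    return [term[i:i + 2] for i in range(len(term) - 1)]


def search_candidates(index, term):
    """Return the numbers of documents containing every bigram of term."""
    result = None
    for bigram in cut_bigrams(term):
        bits = index.get(bigram)
        if bits is None:
            return []          # one unregistered bigram means zero hits
        result = bits if result is None else result & bits
    if result is None:
        return []              # term shorter than two characters
    return [n for n in range(result.bit_length()) if (result >> n) & 1]


# Bit n of each value is set when document n contains the bigram.
index = {"ab": 0b0111, "bc": 0b1011, "cd": 0b1001}
hits_abc = search_candidates(index, "abc")
hits_abcd = search_candidates(index, "abcd")
hits_abx = search_candidates(index, "abx")
```

Each additional concatenated character can only clear bits, so longer search words narrow the candidate set further.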
[0035]
Since the search result of the character component table obtained in this way contains very little search noise, it is sufficiently usable as it stands. Of course, based on the search result of the character component table, it is also possible to search the document body and narrow the result down to only those documents that actually include the search term, or to search for documents that satisfy a positional relationship between a plurality of search terms. Further, the search result of the character component table may first be displayed on the search terminal, and whether to search the body may be decided according to the user's instruction.
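The optional narrowing against the document body can be sketched as below. The example is a toy constructed for illustration: it shows one way a candidate can pass the concatenated-character test without containing the search term, namely when its concatenated characters occur at separate positions. The function names and sample texts are hypothetical.

```python
def bigram_candidates(docs, term):
    """Documents whose text contains every two-character piece of term."""
    pieces = [term[i:i + 2] for i in range(len(term) - 1)]
    return [n for n, text in sorted(docs.items())
            if all(p in text for p in pieces)]


def verify(docs, candidates, term):
    """Keep only the candidates whose body actually contains term."""
    return [n for n in candidates if term in docs[n]]


docs = {0: "abcd", 1: "ab-bc", 2: "xyz"}
candidates = bigram_candidates(docs, "abc")
confirmed = verify(docs, candidates, "abc")
```

Document 1 contains both "ab" and "bc" but not "abc", so it is a candidate that the body search removes; the body search only has to read the candidates, not the whole database.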
[0036]
As described above, according to the present embodiment, only the concatenated characters that exist in the database are registered, so there is an advantage that all character combinations not present in the database can be excluded. In addition, the document identifier information is stored in either bit string form or ID list form: document identifier information covering a large number of documents is stored as a bit string, and document identifier information covering a small number of documents is stored as an ID list, so the file capacity can be significantly reduced.
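The saving can be made concrete with a rough size comparison. The symbols below are assumptions introduced for illustration: N is the number of registered documents, k is the number of documents containing a given concatenated character, and b is the number of bytes used per document number in an ID list.

```latex
\begin{align*}
S_{\text{bits}} &= \frac{N}{8}\ \text{bytes}, &
S_{\text{list}}(k) &= b\,k\ \text{bytes},\\
S_{\text{list}}(k) < S_{\text{bits}} &\iff k < \frac{N}{8b}.
\end{align*}
```

For example, with N = 1,000,000 documents and an assumed b = 4 bytes per document number, a bit string always occupies 125,000 bytes, whereas an ID list is smaller whenever the concatenated character appears in fewer than 31,250 documents. The embodiment does not state how its threshold is chosen; this break-even point is only one plausible basis for it.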
[0037]
Furthermore, since the document identifier information for each concatenated character is always noise-free information identifying documents that include that concatenated character, the accuracy of the character component table search result obtained by ANDing these pieces of document identifier information can be greatly improved.
[0038]
Further, according to the present invention, by registering two or more consecutive characters, it is possible to further reduce the search noise of the character component table search.
[0039]
[Effects of the Invention]
According to the present invention, the document identifier information is selectively stored in either bit string form or ID list form. Document identifier information covering a large number of documents is stored as a bit string, and document identifier information covering a small number of documents is stored as an ID list, so the file capacity can be significantly reduced.
[0040]
In addition, the document identifier information for each concatenated character is always noise-free information identifying documents that include that concatenated character, and this information is ANDed across all the concatenated characters of the search word, so the search accuracy can be greatly improved. As a result, there is an advantage that the range of text to be searched can be narrowed even when searching with positional conditions between search words.
[0041]
Furthermore, by using the character table and the file pointer table, only the concatenated characters that exist in the database are registered, so all character combinations not present in the database can be excluded, and there is an advantage that the amount of memory required for managing the concatenated characters can be reduced.
[0042]
Furthermore, when the character component table is registered, the documents to be registered are divided into groups, a small divided character component table is created for each group, and these divided character component tables are later merged to create the target character component table. The character component table of a large database can therefore be created efficiently with a small memory capacity.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a first embodiment of the present invention.
FIG. 2 is a table configuration diagram of a character component table.
FIG. 3 is a diagram showing an outline of a document identifier information storage file.
FIG. 4 is a diagram illustrating an outline of a character component table registration process.
FIG. 5 is a PAD diagram showing the flow of a registration process.
FIG. 6 is a conceptual diagram showing a process of integrating a divided character component table.
FIG. 7 is a PAD diagram showing a flow of an integration process.
FIG. 8 is a PAD diagram showing a flow of a search process.

Claims

In the document search method,
Calculate the number of document files including the concatenated characters to be searched from the document files stored in the recording device for each of the concatenated characters,
If the calculated number of document files containing the concatenated characters is greater than a threshold, create document identifier information using a bit string corresponding to the number of the document file from the first bit,
If the calculated number of document files including the concatenated characters is smaller than a threshold, create document identifier information in the form of a list of document file numbers including the concatenated characters,
The created document identifier information is stored in a recording device,
Cut out concatenated characters from the entered search term,
Read the document identifier information for each of the cut-out concatenated characters,
A document search method characterized by extracting a document including a concatenated character included in the input search word by obtaining an intersection of the read document identifier information.

In the document search method,
Calculate the number of document files containing concatenated characters to be searched from the document files stored in the recording device for each type of the concatenated characters,
If the calculated number of document files containing the concatenated characters is greater than a threshold, create document identifier information using a bit string corresponding to the number of the document file from the first bit,
If the calculated number of document files including the concatenated characters is smaller than a threshold, create document identifier information in the form of a list of document file numbers including the concatenated characters,
The created document identifier information is stored in a recording device,
A document search method, wherein a document search is performed based on the document identifier information.

In the document search method,
From the document files stored in the recording device, calculate the appearance frequency of the concatenated character to be searched for each type of the concatenated character,
If the calculated appearance frequency of the document file containing the concatenated character is greater than a threshold, create document identifier information using a bit string corresponding to the number of the document file from the first bit,
If the calculated appearance frequency of the document file including the concatenated character is smaller than a threshold, create document identifier information in the form of a list of document file numbers including the concatenated character;
The created document identifier information is stored in a recording device,
A document search method, wherein a document search is performed based on the document identifier information.

In the document search method,
The computer stores a character table including the first character of the concatenated character in an element of the array in the recording device,
A file pointer table including a second character of the concatenated character and pointer information to a file storing document identifier information that is information on a document file including the concatenated character is stored in the recording device,
Storing the pointer information from the character table to the file pointer table in the character table so as to associate the first character of the connected character included in the character table with the second character of the connected character,
For the concatenated characters included in the specified search word, refer to the pointer information of the character table associated with the first character of the concatenated characters included in the search word,
A document search method characterized by reading the document identifier information of documents including a concatenated character included in the search word, by referring, based on the referred pointer information, to the file pointer table storing the second character of the concatenated character included in the search word.

In the document search method,
Divide the document file stored in the recording device,
Create document identifier information of a concatenated character to be searched for each of the divided document files,
The document identifier information created for each of the divided document files is merged and stored in the recording device,
A document search method, wherein a search process is performed based on the stored document identifier information.

In a document search system,
Means for calculating, for each type of the concatenated characters, the number of document files containing concatenated characters to be searched from the document files stored in the recording device,
Means for creating document identifier information using a bit string corresponding to the number of the document file from the first bit, if the calculated number of document files containing the concatenated character is greater than a threshold,
Means for creating document identifier information in the form of a list of document file numbers including the concatenated character, if the calculated number of document files including the concatenated character is smaller than a threshold,
Means for storing the created document identifier information in a recording device;
Means for performing a document search based on the document identifier information.