JP3985483B2

JP3985483B2 - SEARCH DEVICE, SEARCH SYSTEM, SEARCH METHOD, PROGRAM, AND RECORDING MEDIUM USING LANGUAGE SENTENCE

Info

Publication number: JP3985483B2
Application number: JP2001297675A
Authority: JP
Inventors: 俊今井
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2001-09-27
Filing date: 2001-09-27
Publication date: 2007-10-03
Anticipated expiration: 2021-09-27
Also published as: JP2003108583A

Description

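[0001]
[Technical field of the invention]
The present invention relates to a technique for searching with natural-language sentences, and more specifically to a technique for evaluating the correlation between retrieved information and the search sentence.
[0002]
[Prior art]
A database or similar collection is usually searched by entering a search word and retrieving the data that contain, or do not contain, that word. When a single word hits too many items, the user narrows the results by adding words, or by combining the results of several word searches with AND or OR conditions. Getting the desired result from such word-based searches takes a certain amount of training.
[0003]
Various proposals have therefore been made in pursuit of better search methods. One proposed technique, given a search word, uses a thesaurus to identify words with the same meaning and searches for those as well, aiming at a more precise search. A thesaurus also makes it possible to search by a superordinate concept of the word. Other proposals search using natural language (for example, the "Natural language understanding method and information retrieval device" disclosed in Japanese Unexamined Patent Publication No. H1-180046, and the response device disclosed in Japanese Unexamined Patent Publication No. 2001-14165). These techniques prepare a search scenario in advance that fits the search target (plant monitoring, maintenance, and so on) and carry the search forward along that scenario. Compared with plain word search, they let the searcher proceed using natural-language sentences.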
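[0004]
[Problems to be solved by the invention]
These search techniques, however, have the following problems, and for systems that search large volumes of data, such as information on sites connected to a network like the Internet, an adequate search technique could not yet be said to exist. First, a search that uses words and a thesaurus is still a word search: when many items are found, the results must be narrowed down, and the search requires the same skill as before. It has therefore been difficult to raise search precision with a thesaurus.
[0005]
Second, techniques that search with natural-language sentences rely on a scenario prepared in advance from the characteristics of the search target, and cannot cope once a natural-language query departs from the anticipated patterns. Natural-language search therefore could not be applied to targets for which no scenario can be assumed in advance, such as searching sites on the Internet.
[0006]
The device of the present invention aims to solve these problems and realize a high-precision search using natural-language sentences.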
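[0007]
[Means for solving the problems, and their operation and effects]
A device of the present invention that solves at least part of the above problems is
a device that performs a search using a language sentence, comprising:
search sentence input means for inputting a search sentence;
search means for performing a search using the input search sentence;
first classification means for analyzing target sentences, which are at least the sentences contained in the retrieved items, extracting partial sentences, each being a minimal syntactic unit containing at least one predicate, and classifying the extracted partial sentences, according to their role in the sentence, into at least a condition part and a conclusion part;
second classification means for analyzing the search sentence, extracting the partial sentences, and classifying the extracted partial sentences, according to their role in the sentence, into at least a condition part and a conclusion part; and
target sentence evaluation means for determining whether each independent word contained in the partial sentences extracted from the search sentence and the target sentence belongs to the classified condition part or conclusion part, assigning to each target sentence a similarity to the search sentence based on that determination, and arranging the target sentences in descending order of similarity.
This is the gist of the device.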
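[0008]
The method invention corresponding to this device is
a method in which a computer performs a search using a language sentence, wherein:
a search sentence is input from input means such as a keyboard;
the computer performs a search using the input search sentence;
the computer analyzes target sentences, which are at least the sentences contained in the retrieved items, extracts partial sentences, each being a minimal syntactic unit containing at least one predicate, and classifies the extracted partial sentences, according to their role in the sentence, into at least a condition part and a conclusion part;
the computer analyzes the search sentence, extracts the partial sentences, and classifies the extracted partial sentences, according to their role in the sentence, into at least a condition part and a conclusion part; and
it is determined whether each independent word contained in the partial sentences extracted from the search sentence and the target sentence belongs to the classified condition part or conclusion part, a similarity to the search sentence is assigned to each target sentence based on that determination, and the target sentences are arranged in descending order of similarity.
This is the gist of the method.
[0009]
With this device and method, a search is performed using the search sentence entered by the person searching. The target sentences contained in the retrieved items are obtained and analyzed: partial sentences, each a minimal syntactic unit containing at least one predicate, are extracted and classified by their role in the sentence into at least a condition part and a conclusion part. The search sentence is analyzed in the same way. A similarity to the search sentence is assigned to each target sentence by determining whether the independent words in the partial sentences of the search sentence and the target sentence belong to the condition part or the conclusion part, so the target sentences can be arranged according to that similarity.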
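[0010]
[Other aspects of the invention]
The search device can be realized on a server, or as a system in which a server computer and a client computer cooperate. Since the search method can be implemented by a program running on a computer, the invention can also be understood as a program, or as a recording medium on which the program is recorded (for example a flexible disk, CD-ROM, DVD-ROM, or magnetic tape). The program may be handled on a recording medium, or placed on a server on a network, downloaded over the network, and executed on the client computer.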
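[0011]
[Embodiments of the invention]
Embodiments of the present invention are described below. FIG. 1 is a block diagram of a search system 100 as one embodiment. The blocks shown are in practice made up of a server computer 200 and a client computer 300, connected via a network 110. The server computer 200 hosts the search engine; the client computer 300 sends search requests to the server computer 200, receives the results from the server computer 200, and displays them. The concrete hardware configuration of both computers is left to the example described later; here the configuration and operation are described at the block level.
[0012]
As shown in FIG. 1, the client computer 300 has a search sentence input unit 310 that accepts a search sentence in Japanese, one natural language; a search sentence analysis unit 320 that analyzes the search sentence; a search term output unit 330 that takes the search word string out of the analyzed search sentence and outputs it to the server computer 200; and a result display unit 340 that receives the search results from the server computer 200 and displays them on screen. The server computer 200 has a search term receiving unit 210 that receives the search terms from the client computer 300; a search engine 220 that searches with the received terms; a target sentence analysis unit 230 that takes the search results sentence by sentence and analyzes them, for example by morphological analysis; a comparison execution unit 240 that compares the analyzed target sentences with the search terms; an arrangement unit 250 that orders the target sentences according to the comparison; and a search result output unit 260 that sends the ordered target sentences to the client computer 300 in sequence.
[0013]
The search sentence input unit 310 of the client computer 300 accepts Japanese text that the user enters with a keyboard or similar device. When searching sites connected to a network such as the Internet, this corresponds to typing a Japanese sentence, using an IME (Japanese input method) or the like, into the search box displayed by an ordinary browser. A natural sentence such as 「電源を入れると壊れた」 ("It broke when I turned the power on") is entered through the search sentence input unit 310. This embodiment assumes that the user consults a failure diagnosis site over the Internet to obtain a diagnosis (cause and remedy) of a computer failure.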
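[0014]
Users can usually describe the state of their computer's failure in words, but often find it difficult to identify the cause and choose search terms, or to enter several words and narrow the search step by step. In this embodiment the user therefore queries in the natural language he or she ordinarily uses (Japanese in this example), in the form of a sentence the user can compose unaided. When such a Japanese search sentence is entered, the search sentence analysis unit 320 analyzes it. The details are given in the example below; typically the search sentence is first morphologically analyzed and divided into the phrases (bunsetsu) that make up a natural Japanese sentence, and the words to use for the search are then determined. For the search sentence 「電源を入れると壊れた」, morphological analysis yields the phrases 「電源を」 (noun + particle), 「入れると」 (verb, terminal form + particle), and 「壊れた」 (verb, continuative form + past auxiliary), from which 「電源」 (power), 「入れる」 (turn on), and 「壊れる」 (break) are extracted as search terms. In addition to extracting these words, the search sentence analysis unit 320 may consult a thesaurus and also extract synonyms and near-synonyms as search terms (for example 「パワーサプライ」 (power supply) for 「電源」, or 「破損する」 (be damaged) for 「壊れる」). Beyond morphological analysis, it may also analyze the dependency relations between phrases and the structure of partial sentences such as the phrases and clauses that make up the sentence.
[0015]
The extracted search terms are output over the network by the search term output unit 330 and delivered, using the IP address attached to the packets, to the server computer 200 on which the failure diagnosis program runs. The server computer 200 receives them at the search term receiving unit 210 and passes them to the search engine 220. The search engine 220 uses the received search terms (usually several) to access the failure diagnosis knowledge database 225 and retrieves every sentence it finds that contains them. In the example above, sentences containing all of 「電源」, 「入れる」, and 「壊れる」, or at least one of them, are retrieved as target sentences. If the knowledge database 225 contains sentences such as 「電源を入れると、ＯＳが起動する前に、『ＮｏＤｉｓｋ』が表示されて、止まってしまいます。」 ("When I turn the power on, 'NoDisk' is displayed before the OS starts and the machine stops.") or 「コンピュータの使用中にハングアップして、コンピュータの電源を切ることもできません。」 ("The computer hangs during use, and I cannot even turn its power off."), the search engine 220 retrieves them as matching target sentences.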
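The retrieval step described above (sentences containing all of the search terms, or at least one) can be sketched as follows. This is only an illustration: the function name `retrieve`, the `require_all` flag, and the use of plain substring matching are assumptions, and the third knowledge-base sentence is invented filler; the text does not say how the search engine 220 matches inflected forms.

```python
def retrieve(sentences, terms, require_all=False):
    """Return the sentences containing all (or at least one) of the terms.

    Plain substring matching is an assumption made for this sketch.
    """
    match = all if require_all else any
    return [s for s in sentences if match(t in s for t in terms)]


knowledge_base = [
    "電源を入れると、ＯＳが起動する前に、『ＮｏＤｉｓｋ』が表示されて、止まってしまいます。",
    "コンピュータの使用中にハングアップして、コンピュータの電源を切ることもできません。",
    "プリンタの用紙が詰まりました。",  # hypothetical unrelated sentence
]

terms = ["電源", "入れる", "壊れる"]
hits = retrieve(knowledge_base, terms)                  # "at least one term"
strict = retrieve(knowledge_base, terms, require_all=True)  # "all terms"
```

With the "at least one" condition both example sentences from the text are retrieved; with the "all" condition none is, because neither contains 「壊れる」.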
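[0016]
The target sentences obtained in this way are analyzed by the target sentence analysis unit 230. This analysis is almost identical to that of the search sentence analysis unit 320: on the basis of morphological analysis, it analyzes phrase structure through dependency relations and further identifies partial sentences, each a minimal syntactic unit containing at least one predicate. The target sentence analysis unit 230 then classifies the extracted partial sentences by their role in the sentence. The roles may simply be condition part versus conclusion part, or may be subdivided, for example classifying condition parts as "condition", "reason", "adversative", "parallel", and so on. After the target sentences have been classified, the comparison execution unit 240 compares them with the search terms according to this classification, taking into account, for example, whether a search term appears in the conclusion part or the condition part of the target sentence. In this example the server computer 200 receives several search terms at the search term receiving unit 210 and runs the search engine 220, so the comparison is simply between the search terms and the target sentences; however, the server may also receive the result of analyzing the search sentence and use it in the comparison, for example by taking the structure of the words in the search sentence (dependencies, clause roles, and so on) into account. A "partial sentence" here is a minimal syntactic unit containing at least one predicate; grammatically it corresponds roughly to what is called a main clause, subordinate clause, or conditional clause.
[0017]
After the comparison, the server computer 200 sorts and orders the target sentences in the arrangement unit 250 and returns them to the client computer 300 from the search result output unit 260. It is preferable to place higher the target sentences considered to have a stronger correlation with the search terms. Besides simply listing highly correlated information first, the output may also be organized hierarchically. The results are output by attaching to the packets the IP address of the client computer 300 that sent the search terms and sending the target sentences onto a network such as the Internet. The data are routed by that IP address back to the client computer 300 that issued the search terms.
[0018]
The client computer 300 that receives the results lists them, with the target sentences in order, using a browser or the like. In this kind of failure diagnosis each result usually carries a URL; a user who reads a result and wants more detail clicks the URL to jump directly to the relevant information on the server computer 200 and learn the details of the diagnosis (cause of the failure, remedy, and so on). The user does not need to pick out search terms or enter them one after another to narrow the results: it is enough to enter a sentence describing the situation, in natural Japanese, as far as the user understands it. The results are also natural Japanese sentences, and the sentences considered more relevant are shown first, so the user reaches the needed information quickly.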
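[0019]
In the embodiment above, when the target sentence is morphologically analyzed and divided into phrases, the conjunctions and conjunctive particles that indicate how the partial sentences in the target sentence are connected may be identified and used to extract the partial sentences. In Japanese a partial sentence often ends at the point where a conjunction or conjunctive particle is used, and the conjunctive particle makes it easy to recognize whether the preceding partial sentence expresses a condition, a reason, and so on.
[0020]
Furthermore, when analyzing the search sentence, partial sentences (minimal syntactic units containing at least one predicate) may be extracted, the part corresponding to the conclusion identified among them, and the search itself performed with the words contained in that conclusion part. For a search sentence such as 「スイッチを入れたら、電源が壊れた」 ("When I flipped the switch, the power supply broke"), the conclusion part 「電源が壊れた」 is often the more useful for failure diagnosis, so the search uses its words 「電源」 and 「壊れる」. Depending on the application, the condition part may instead be identified and its words used. In a poisoning diagnosis system, for example, given the search sentence 「乾電池を飲んだので、腹が痛い」 ("I swallowed a dry-cell battery, so my stomach hurts"), the condition part is treated as more useful, and words such as 「乾電池」 (dry-cell battery) and 「飲む」 (swallow) are extracted and used for the search.
[0021]
In the embodiment above the search target was a database, but it may equally be information contained in sites on a network. Applied to a network search engine, the technique makes it easy to display a large number of related sites with the more highly correlated ones first.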
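[0022]
In the embodiment above the search system consists of the server computer 200 and the client computer 300, but a stand-alone configuration, in which the database and search engine reside on the user's own computer, is also possible. Also, in the embodiment the client computer 300 analyzes the search sentence in the search sentence analysis unit 320 and passes search terms to the server computer 200; instead, the search sentence may be sent unchanged to the server computer 200, which then performs everything from the analysis of the search sentence onward. In that case the analysis capability is determined by the programs and databases on the server computer 200, so it can be upgraded simply by replacing the server-side program, and it does not vary from one client computer 300 to another. Alternatively, the search sentence analysis unit 320 may be built as a browser plug-in that is sent from the server computer 200 to the client computer 300. This provides nearly the same analysis capability on every client computer 300, and also reduces the load on the server computer 200, which is accessed by many client computers 300.
[0023]
In the embodiment above the target sentences obtained by the search engine 220 are analyzed by the target sentence analysis unit 230 on the server computer 200; instead, the target sentences may be sent unanalyzed to the client computer 300, which performs the analysis and comparison. The client computer 300 is the machine on which the search sentence was entered and is closest to the user, so it can analyze the target sentences according to the user's needs, judge their correlation with the search sentence, and display them in the desired order. For example, if the search sentence is entered by speech recognition, information such as intonation and emphasized words in the spoken input can be taken into account when judging the correlation between target sentences and the search sentence. If the target sentences are analyzed on the client computer 300, it also becomes possible to run search engines on several server computers 200, receive their results, analyze them together, and display them ordered by degree of correlation.
[0024]
The embodiment above was described as a search system consisting of the server computer 200 and the client computer 300, but the invention can also be practiced as a program that implements these search functions on a computer, recorded on a recording medium such as a CD-ROM. The program may be divided into a server-side program and a client-side program, each recorded on a recording medium, or recorded as a single program or program group. Alternatively, the necessary program may be placed on the server, with a program that cooperates with it made available for download on the server, to be read and executed by a client that wishes to search.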
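[0025]
[Example]
To describe the above embodiment more concretely, an example is described below.
(1) Configuration of the example:
The hardware configuration of the example is described first, with reference to the schematic diagram of FIG. 2. In the example of FIG. 2, a program is installed on a server computer 20 connected to a network 10 such as the Internet; by executing this program, a search system 50 is realized that carries out searches in response to search requests from client computers 30. The server computer 20 of the search system 50 (hereinafter the search server) can be used by itself as a stand-alone search device, but, as described below, it can also be used as a server by other client computers 30 (hereinafter simply clients). That is, users of the many clients 30 connected to the network 10 access the search server 20 over the network 10 and are provided with natural-language search and its results. Since the search server 20 and the clients 30 are nearly identical as far as the input side is concerned, the hardware configuration is described using the search server 20 as the example.
[0026]
The internal configuration of the search server 20 is described with reference to FIG. 2. The search server 20 comprises a network interface (NT-I/F) 21 that controls the exchange of data with the network 10 via a modem or router 18, a CPU 22 that performs processing, a ROM 23 that stores processing programs and fixed data, a RAM 24 serving as work area, a timer 25 that keeps time, a display circuit 26 that drives a monitor 29, a hard disk (HD) 27 that stores text data as a database, and an input interface (I/F) 28 for a keyboard 11, mouse 12, and microphone 13. The hard disk 27 is described as fixed, but it may be removable, and removable storage (for example CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, or flexible disk) may be used alongside it. In this example the processing program of the search server 20 is stored in the ROM 23, but it may instead be stored on the hard disk 27 and loaded into the RAM 24 at start-up, read from one of the removable recording media mentioned above, or read from another server over the network 10 and executed. Similarly, as described below, not all the necessary data need be stored on the hard disk 27; large volumes of data may be distributed across other servers connected via the network 10, and stored, updated, and managed there.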
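[0027]
The hard disk 27 stores a morphological analysis dictionary IDC, sentence determination rules SDI, a thesaurus TSR, and a search target database DB. The morphological analysis dictionary IDC holds almost the same content as a kana-kanji conversion dictionary, but with headword and reading reversed. It is therefore also possible to use, unchanged, the kana-kanji conversion dictionary that converts kana strings entered via the keyboard 11 or the network 10 into kana-kanji strings, and to hold only the reading-to-headword relation in the form of an index. FIG. 3 shows an example of the morphological analysis dictionary IDC. The figure shows only reading, notation, and grammatical information, but the actual dictionary IDC also stores, associated with each word, colloquial forms with the same meaning, synonyms, near-synonyms, abbreviations, and dependency information. The search server 20 uses the dictionary IDC when it morphologically analyzes the search sentence received from a client 30. With it the search server 20 can analyze the received search sentence accurately; for example, even when the string to be analyzed is colloquial natural language, it can be analyzed correctly.
[0028]
The sentence determination rules SDI are a dictionary of rules that define dependency relations and the relations between partial sentences. Dependency relations are also used in morphological analysis, but here they serve to identify the relations between the phrases obtained by morphological analysis. The rules for relations between partial sentences are, broadly, rules that identify whether a partial sentence is a condition part or a conclusion part; for condition parts, rules that further distinguish condition, reason, adversative, parallel, and so on are stored. For conclusion parts, rules for removing unneeded portions that do not affect the conclusion are also stored.
[0029]
The word thesaurus TSR is a dictionary that organizes semantically related words (for example near-synonyms and antonyms) according to their conceptual relations. In addition to superordinate, subordinate, and parallel relations, various other relations are provided, and a large number of words are organized by them. For example, the verbs 「入れる」 (turn on), 「切る」 (turn off), 「回す」 (turn), and 「ひねる」 (twist) are grouped as near-synonyms from the viewpoint of "human actions".
[0030]
The search target database DB is the very thing the user wants to search; in this example it is a database for failure analysis and diagnosis. In this example the database DB is recorded on the hard disk 27, but the many sites on the Internet can of course also be treated as the search target database DB. In that case a crawler-type search engine may search the data of Internet sites and store it as an index on the hard disk 27 in the search server 20, or the search may be performed each time.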
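[0031]
(2) Operation of the search system: analysis processing:
The operation of the search system 50, consisting of the search server 20 and the clients 30 connected to it, is described next. In this example a browser for browsing Internet sites runs on the client 30. When the user types what he or she wants to search for, in natural Japanese, into the search box that the browser displays from data sent by the search server 20, the client sends the text unanalyzed to the search server 20 over the network 10. In the embodiment the search sentence was analyzed on the client computer 300; in this example everything from the analysis of the search sentence onward is done on the search server 20. The client 30 handles only the input and transmission of the search sentence and the display of the search results.
[0032]
The client 30 side is therefore described only briefly, and the operation of the search server 20 is described in detail with reference to FIG. 4. On receiving a request from a client 30 over the network 10, the search server 20 starts the processing shown in FIG. 4, which consists broadly of analysis processing and matching processing. The analysis processing comprises morphological analysis (step S100), dependency analysis (step S110), and partial sentence determination (step S120). The matching processing comprises word matching (step S130), dependency matching (step S140), and partial sentence matching (step S150).
[0033]
The processing of FIG. 4 starts when a search sentence is received from a client 30, beginning with morphological analysis (step S100). As noted above, morphological analysis refers to the morphological analysis dictionary IDC and extracts words and phrases from the received search sentence. The flowchart of FIG. 5 shows the details of the morphological analysis (step S100).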
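[0034]
When morphological analysis starts, the search sentence received from the client 30 is taken as the object of analysis, and L characters (L = 1, 2, ...) starting at the M-th character from the beginning of the sentence (M = 1, 2, ...) are taken out and looked up in the dictionary IDC (step S102). M is the start position of the character string under consideration and L is the number of characters taken. The lookup starts with M = 1 and L = 1, that is, the single character at the start of the sentence, and retrieves any matching entry. L is incremented while the dictionary IDC is consulted; when no matching headword remains, the start position M is incremented, L is reset to 1, and the lookup is repeated. The lookup ends when the position under consideration passes the end of the sentence being analyzed.
[0035]
For example, if the search sentence 「電源を入れたら壊れた」 ("It broke when I turned the power on") is received from the client 30, consulting the dictionary IDC yields candidate words such as 「電源を」「電」「源」「源を」「を」「入れたら」「入れた」「ら」「入れ」「たら」「た」「入」「れたら」「壊れた」「壊れ」「た」「壊」「れた」「れ」. Single kana such as 「た」 are also extracted as words because, for instance, the past auxiliary 「た」 can appear in a sentence.
[0036]
The dictionary IDC stores these words together with their grammatical information. The extracted words are next arranged according to that information to find a sequence that is grammatically consistent. Known methods for this include the multiple-phrase longest-match method and the minimum-cost method; they test which of the possible word combinations is the most plausible as Japanese. This example uses the minimum-cost method, so a cost is next calculated for the many candidate strings obtained (step S104). The cost calculation applies costs prepared in advance so that the more Japanese-like a sequence is, the lower its score. Roughly, an independent word costs 2, and an attached word following it costs 0. Taking 「電源を」 as an example, 「電源」＋「を」 is independent word + attached word (particle) and costs 2, whereas 「電」＋「源」＋「を」 is independent word + independent word + attached word (particle) and costs 4. The rules of the minimum-cost method are tuned to real Japanese, and various rules are provided; for example, when words in a co-occurrence relation such as 「まったく」＋「ない」 ("not at all") appear in the sentence, the cost is −1.
[0037]
The costs are calculated in this way for all the words obtained from the reverse-lookup dictionary, and the sentence with the minimum cost is identified (step S106). In the example above, 「電源」 (independent word, noun) + 「を」 (attached word, particle) is judged more likely as Japanese than 「電」 (independent word, noun) + 「源」 (independent word, noun) + 「を」 (attached word, particle). This calculation is carried out over at least a whole sentence, and the word sequence that minimizes the cost of the entire sentence is selected. A different combination may therefore be selected when, for example, a co-occurrence relation lowers the cost.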
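The minimum-cost selection of steps S102 to S106 can be sketched as a small dynamic program over start positions. This is a minimal sketch under stated assumptions, not the patented routine: the toy `LEXICON`, the function name `segment`, and the flat per-word costs (independent word 2, attached word 0) are illustrative, and co-occurrence discounts such as the −1 rule are left out.

```python
import functools

# Hypothetical toy dictionary: surface form -> word class.
LEXICON = {
    "電源": "independent",
    "電": "independent",
    "源": "independent",
    "を": "attached",
}
# Simplified costs taken from the rough rule in the text.
COST = {"independent": 2, "attached": 0}


def segment(text):
    """Return (cost, words) for the cheapest split of text, or None if no split exists."""

    @functools.lru_cache(maxsize=None)
    def best(start):
        if start == len(text):
            return (0, ())
        candidates = []
        # Try every length from this start position, as in step S102.
        for length in range(1, len(text) - start + 1):
            word = text[start:start + length]
            if word in LEXICON:
                rest = best(start + length)
                if rest is not None:
                    cost = COST[LEXICON[word]] + rest[0]
                    candidates.append((cost, (word,) + rest[1]))
        return min(candidates) if candidates else None

    return best(0)


result = segment("電源を")
```

For 「電源を」 the sketch prefers 「電源」＋「を」 (cost 2) over 「電」＋「源」＋「を」 (cost 4), matching the comparison made in the text.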
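[0038]
Once the minimum-cost sentence has been identified, the combination of phrases making up the search sentence has been obtained together with its grammatical information, and the phrases are stored in the array shown in FIG. 6 (step S108). FIG. 6 shows an example of the arrays used in analyzing the search sentence. As a whole, the search sentence is analyzed and stored as word information (FIG. 6), phrase information (FIG. 7), and partial sentence information (FIG. 8). FIG. 6 shows the content (array) of the word information, which consists of the word, its reading, and its part of speech. The word array is referred to below as T[t] (t = 0, 1, ...).
[0039]
When morphological analysis is complete, dependency analysis follows (step S110; see FIG. 4). Dependency analysis identifies the relations among the phrases of a sentence and produces the phrase information; it tells which phrase each phrase depends on. For example, from the rule that noun + 「を」 (particle) depends on the nearest following predicate, the relation 「電源を」→「切ると」 is identified. The phrase information obtained is stored in an array B. FIG. 7 shows an example of this array B[b] (b = 0, 1, ...). The phrase information consists of the index B[b], the numbers t of the words belonging to B[b], the number b of the phrase it depends on, and the number b of the phrase that depends on it. In the table of FIG. 7, "−" indicates that no such phrase exists. Given the word numbers t belonging to an entry, the actual words can be obtained from the array T[t] of FIG. 6.
[0040]
After dependency analysis (step S110), partial sentence determination is performed (step S120). This processing uses the relations between phrases found by dependency analysis to identify the relations between partial sentences, each consisting of one or more phrases. A partial sentence here is roughly the same concept as a clause, the minimal syntactic unit containing at least one predicate. As shown in FIG. 8, the relations between partial sentences are given as an array S[s] (s = 0, 1, ...), in which each entry is associated with the numbers b of its phrases, its distance from the conclusion part, and the meaning of the condition part. FIG. 9 shows the relation among the word array T[t], the phrase array B[b], and the partial sentence array S[s]. As illustrated, they form a hierarchy, so that the phrases and words contained in a partial sentence can be freely referenced from it.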
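The three-level hierarchy of T[t], B[b], and S[s] can be pictured with the following sketch. The field names, the part-of-speech labels, and the concrete entries for 「電源を入れると壊れた」 are assumptions made for illustration; the actual table contents of FIGS. 6 to 8 are not reproduced in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:                 # one entry of T[t]
    surface: str
    reading: str
    pos: str


@dataclass
class Phrase:               # one entry of B[b]
    words: List[int]        # word numbers t
    head: Optional[int]     # phrase this one depends on (None for "-")
    dependents: List[int]   # phrases that depend on this one


@dataclass
class PartialSentence:      # one entry of S[s]
    phrases: List[int]      # phrase numbers b
    distance: int           # distance from the conclusion part (0 = conclusion)
    meaning: Optional[str]  # "condition", "reason", ... for condition parts


T = [
    Word("電源", "でんげん", "noun"), Word("を", "を", "particle"),
    Word("入れ", "いれ", "verb"), Word("る", "る", "inflectional ending"),
    Word("と", "と", "conjunctive particle"),
    Word("壊れ", "こわれ", "verb"), Word("た", "た", "auxiliary"),
]
B = [Phrase([0, 1], 1, []), Phrase([2, 3, 4], 2, [0]), Phrase([5, 6], None, [1])]
S = [PartialSentence([0, 1], 1, "condition"), PartialSentence([2], 0, None)]


def surface(s):
    """Rebuild the text of partial sentence S[s] by walking S -> B -> T."""
    return "".join(T[t].surface for b in S[s].phrases for t in B[b].words)
```

Here `surface(0)` walks down from the condition part to its words and `surface(1)` does the same for the conclusion part, which is the kind of top-down reference the hierarchy is meant to allow.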
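[0041]
FIG. 10 shows the processing that extracts partial sentences. It refers to the sentence determination rules SDI, an example of which is shown in FIG. 11. FIG. 11 shows the determination word strings Rm used in the flowchart of FIG. 10; each heading gives the meaning of the condition part. In the table of FIG. 11, "*" is a wildcard that matches any word, and "*:*:verb" matches any word whose part of speech is verb, regardless of reading or headword. For example, the pattern labeled IN in FIG. 11 has the condition-part meaning "condition" and specifies "(*:*:verb, *:*:inflectional ending, と:*:conjunctive particle)", so it denotes every partial sentence in which a verb is followed by an inflectional ending and then by the conjunctive particle 「と」. The verb 「入れ」 + inflectional ending 「る」 + conjunctive particle 「と」 matches this pattern.
[0042]
The partial sentence analysis routine of FIG. 10 is as follows. When the routine starts, unneeded sentences are first deleted from the search sentence (step S200). An unneeded sentence is a portion, such as 「どうしたらよいですか」 ("What should I do?"), that has nothing to do with the content being searched for. Such portions are stored in advance as a list of unneeded sentences, and matching portions are deleted. For a search sentence such as 「電源を入れると壊れたのですが、どうしたらよいですか」 ("It broke when I turned the power on; what should I do?"), morphological and dependency analysis identify the portion corresponding to an unneeded sentence, which is then deleted. The deleted phrases are removed from the word array (FIG. 6), the phrase array (FIG. 7), and the partial sentence array (FIG. 8).
[0043]
Next, to start the partial sentence analysis, the total number n of words in the search sentence being analyzed is set, and the variable j, which counts the subordinate clauses found, is initialized (j ← 0) (step S210). In the next step S220 the variable m, which indexes the condition-part patterns shown in FIG. 11, is initialized (m ← 0), and the following processing is repeated until m reaches the total number of patterns in FIG. 11. Each pattern in FIG. 11 is the portion enclosed in parentheses, and the patterns can be designated in order from the first as m = 1, 2, .... The m-th pattern is obtained as the determination word string Rm, and k is set to the number of words making up Rm (step S230). For the pattern (*:*:verb, *:*:inflectional ending, と:*:conjunctive particle) mentioned above, k is 3.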
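[0044]
Next, working from the end of the search sentence, the word string W(n−k+1, n), from the (n−k+1)-th word to the n-th, is obtained (step S240). Since the pattern being compared has k words, a string of k words is taken from the search sentence as well; it is easily obtained from the word array T[t]. If 「電源を入れると壊れた」 is the search sentence, the last three words 「と」＋「壊れ」＋「た」 are obtained. The two strings are then compared (step S250). If they do not match, m is incremented by 1 to fetch the next pattern (step S260), and processing returns to step S230 until the patterns of FIG. 11 are exhausted (step S270). In this example the words cut from the end of the sentence match no pattern, so all patterns are eventually tried without a match between Rm and W(n−k+1, n).
[0045]
To move the word string under consideration one position back from the end, n is then decremented by 1 (step S280), and until n becomes less than 0 (step S290) processing returns to step S220 and repeats from the initialization of m. As this is repeated, the k words ending at 「と」, the third word from the end, are eventually obtained, and one of the patterns Rm, "(*:*:verb, *:*:inflectional ending, と:*:conjunctive particle)", matches the word string W(n−k+1, n) = 「入れ」＋「る」＋「と」 (step S250). Processing then branches to step S300: a subordinate clause has been found, so the variable j is incremented by 1 (step S300) and the information on this subordinate clause is set (step S310), as described in the next paragraph. The position under consideration is then advanced by k−1 words (step S320), and the processing is repeated from the decrement of n (step S280). Processing continues after one subordinate clause has been found because natural language allows several subordinate clauses. For the search sentence 「突然パソコンが終了したので、電源を入れると壊れた」 ("The PC suddenly shut down, so when I turned the power on it broke"), both 「突然パソコンが終了したので」 and 「電源を入れると」 are set as subordinate clauses expressing conditions.
[0046]
The information set for a subordinate clause consists of the two items shown in FIG. 8: the distance from the conclusion part and the meaning of the condition part. The conclusion part itself has distance 0; moving toward the beginning of the sentence, the subordinate clauses are assigned distances 1, 2, ... starting with the one nearest the conclusion part (「壊れた」 in this example). In accordance with the classification attached to the matching determination word string Rm, the distinction among "condition", "reason", "adversative", "parallel", and so on is stored in association with the array S[s] as information on the subordinate clause.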
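The backward scan of FIG. 10 can be approximated as below. This is a simplified sketch, not the flowchart itself: the rule table `RULES` holds only the IN pattern quoted in the text plus one hypothetical "reason" pattern, words are modeled as (surface, reading, part-of-speech) tuples, and the field order of the patterns is an assumption.

```python
# Each rule: (meaning of the condition part, pattern); "*" is a wildcard.
RULES = [
    ("condition", [("*", "*", "verb"),
                   ("*", "*", "inflectional ending"),
                   ("と", "*", "conjunctive particle")]),
    ("reason", [("ので", "*", "conjunctive particle")]),  # hypothetical extra rule
]


def matches(pattern, words):
    """True if every field of every pattern element is "*" or equals the word's field."""
    return len(pattern) == len(words) and all(
        p == "*" or p == w
        for pat, word in zip(pattern, words)
        for p, w in zip(pat, word)
    )


def find_subordinate_clauses(words):
    """Scan from the end of the sentence; return (end position, meaning, distance) triples."""
    found = []
    n = len(words)
    while n > 0:
        for meaning, pattern in RULES:
            k = len(pattern)
            if k <= n and matches(pattern, words[n - k:n]):
                # Distance 1 is the clause nearest the conclusion part.
                found.append((n, meaning, len(found) + 1))
                n -= k - 1  # skip the matched words, as in step S320
                break
        n -= 1              # move one word toward the beginning (step S280)
    return found


WORDS = [
    ("電源", "でんげん", "noun"), ("を", "を", "particle"),
    ("入れ", "いれ", "verb"), ("る", "る", "inflectional ending"),
    ("と", "と", "conjunctive particle"),
    ("壊れ", "こわれ", "verb"), ("た", "た", "auxiliary"),
]
clauses = find_subordinate_clauses(WORDS)
```

For 「電源を入れると壊れた」 the scan finds one subordinate clause ending at the fifth word 「と」, classified as "condition" at distance 1 from the conclusion part 「壊れた」.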
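[0047]
(3) Processing of the search system: matching processing:
The above completes the analysis processing of FIG. 4. Matching processing follows. It matches the entered search sentence against the target sentences retrieved from the database DB on the basis of that sentence, and begins with word matching (step S130). Here the database DB is basically searched with the words contained in the search sentence, but the thesaurus TSR is also consulted for the search terms so that synonyms and near-synonyms are searched broadly. For example, the search covers not only the words 「電源」 and 「入れる」 but also near-synonyms such as 「パワースイッチ」 (power switch) and 「パワーサプライ」 (power supply), and, for 「入れる」, words such as 「切る」 and 「回す」 that are classified in the same category of human bodily actions. This retrieval yields a broad set of target sentences from the database DB, so that few relevant sentences are missed even though the search sentence is entered in natural language.
[0048]
Word matching then proceeds as follows. For each target sentence,
(1) if it contains an independent word that appears in the search sentence, a similarity point of 1 is given, and (2) if it contains a word whose superordinate concept in the thesaurus TSR matches, 0.9 is given.
For example, if the database DB contains the sentence 「ＰＣの電源を切ると」 ("when the PC's power is turned off") for the search sentence 「電源を入れると壊れた」, the word 「電源」 is given a similarity point of 1, and 「入れる」 and 「切る」, which share the superordinate concept "bodily action", are given 0.9. The similarity points of the two sentences are therefore 1 + 0.9 = 1.9. The points may be adjusted further according to sentence-final expressions and the like. For example, when a sentence such as 「壊れるようだ」 ("it seems to break") or 「壊れるらしい」 ("apparently it breaks") is found, it is also appropriate, in judging the similarity of two sentences, to apply a rule that subtracts 0.1 to 0.3 for hearsay or conjecture, based on the expression at the end of the sentence.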
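The word-matching rule (1 point for an identical independent word, 0.9 for a shared superordinate concept) can be sketched like this. The names `HYPERNYM` and `word_points` and the tiny thesaurus slice are illustrative assumptions, and the independent words are supplied by hand rather than produced by morphological analysis.

```python
# Hypothetical slice of the thesaurus TSR: word -> superordinate concept.
HYPERNYM = {"入れる": "身体動作", "切る": "身体動作", "回す": "身体動作"}


def word_points(query_words, target_words):
    """Sum 1 per exact match and 0.9 per match through a shared superordinate concept."""
    total = 0.0
    for q in query_words:
        if q in target_words:
            total += 1.0
        elif q in HYPERNYM and any(HYPERNYM.get(t) == HYPERNYM[q] for t in target_words):
            total += 0.9
    return total


# Independent words of 「電源を入れると壊れた」 and 「ＰＣの電源を切ると」.
query = ["電源", "入れる", "壊れる"]
target = ["ＰＣ", "電源", "切る"]
points = word_points(query, target)
```

「電源」 matches exactly and 「入れる」 matches 「切る」 through the shared concept, so the total is 1 + 0.9 = 1.9 up to floating-point rounding, as in the text; 「壊れる」 has no counterpart and contributes nothing.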
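[0049]
Dependency matching is performed next (step S140). When, for a given word, the word it depends on also matches, the similarity point for that word is increased. For the two sentences 「電源を入れる」 and 「電源を切る」, the word 「電源」 in the phrase 「電源を」 depends on 「入れる」 in one and on 「切る」 in the other, and 「入れる」 and 「切る」 belong to the same category, bodily action. In this case the point of 1 given to 「電源」 is increased by 50% to 1.5. The increase need not be a 50% raise; adding a fixed value (for example 0.5) is equally acceptable. It is also desirable to give a still higher value when the words depended on match exactly. Combined with the word matching above, the similarity points of 「電源を入れる」 and 「電源を切る」 become 1.5 + 0.9 = 2.4.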
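[0050]
After dependency matching, partial sentence matching is performed (step S150). It varies the increase in similarity points according to whether the partial sentence in question is a conclusion part or a condition part. FIG. 12 shows the relation. Using 「電源を入れる」 and 「電源を切る」 as the example sentences:
(1) if both lie in the conclusion parts of the search sentence and the target sentence, the similarity points are increased by 100%;
(2) if one lies in a conclusion part and the other in a condition part, the similarity points are increased by 50%;
(3) if both lie in condition parts, their distances from the conclusion part are also compared, and if the distances j are equal, the similarity points are increased by 20%;
(4) if both lie in condition parts and the distances j from the conclusion part differ, the similarity points are increased by 10%.
[0051]
As a result, if 「電源を入れる」 and 「電源を切る」 are both in conclusion parts, the similarity points are 2.4 × 2 = 4.8; if one is in a conclusion part and the other in a condition part, 2.4 × 1.5 = 3.6; if both are in condition parts at the same distance from the conclusion part, 2.4 × 1.2 = 2.88; and if the distances from the conclusion part differ, 2.4 × 1.1 = 2.64.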
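The four cases of FIG. 12 reduce to a multiplier chosen from the clause roles. The sketch below reproduces the arithmetic of the paragraph above; the function name `clause_factor` and the string labels "conclusion" and "condition" are assumptions.

```python
def clause_factor(query_part, target_part, query_distance=0, target_distance=0):
    """Multiplier applied to the similarity points, following cases (1) to (4)."""
    if query_part == target_part == "conclusion":
        return 2.0   # (1) both in conclusion parts: +100%
    if "conclusion" in (query_part, target_part):
        return 1.5   # (2) one conclusion, one condition: +50%
    if query_distance == target_distance:
        return 1.2   # (3) both conditions, same distance j: +20%
    return 1.1       # (4) both conditions, different distance j: +10%


base = 2.4  # points of 「電源を入れる」 vs 「電源を切る」 after dependency matching
scores = [
    base * clause_factor("conclusion", "conclusion"),
    base * clause_factor("conclusion", "condition", 0, 1),
    base * clause_factor("condition", "condition", 1, 1),
    base * clause_factor("condition", "condition", 1, 2),
]
```

Up to floating-point rounding, `scores` holds 4.8, 3.6, 2.88, and 2.64, the four values given in the text.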
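[0052]
A calculation for somewhat more complex sentences follows. Suppose the search sentence entered from the client 30 is 「コンピュータの使用中にハングアップして、コンピュータの電源を切ることもできません」 ("The computer hangs during use, and I cannot even turn its power off"), and the following two sentences (A) and (B) are retrieved from the database DB.
(A) ＰＣの電源をいれると、オペレーティングシステムが起動する前に、「ＮｏＳｙｓｔｅｍＤｉｓｋ」が表示されて起動が止まってしまいます。 ("When I turn the PC's power on, 'NoSystemDisk' is displayed before the operating system starts, and start-up stops.")
(B) コンピュータの電源が切れません。 ("The computer's power will not turn off.")
Word matching on these gives, for sentence (A), an exact match on 「電源」, while 「コンピュータ」 and 「ＰＣ」, and 「切る」 and 「入れる」, are similar according to the thesaurus TSR. The word similarity points are therefore 0.9 + 1 + 0.9 = 2.8. For sentence (B), 「コンピュータ」, 「電源」, and 「切る (negated)」 match exactly, so the similarity points are 3.
[0053]
Dependency matching then gives, for sentence (A): the words that 「ＰＣ」 and 「電源」 depend on are judged to be in the same category, so their points are increased by 50%, giving 0.9 × 1.5 + 1 × 1.5 + 0.9 = 3.75. For sentence (B), the words that 「コンピュータ」 and 「電源」 depend on are judged identical, so the same 50% increase gives 1 × 1.5 + 1 × 1.5 + 1 = 4.
[0054]
Partial sentence matching then gives, for sentence (A): 「コンピュータを、使用中に→ハングアップして」 is in a condition part and 「ＰＣの→電源を→切る（否定）」 is in a conclusion part, so the total of 3.75 is increased by 50%, giving final similarity points of 5.63 (3.75 × 1.5 = 5.625, rounded). The similarity of sentence (A) to the search sentence is therefore 5.63 + 1 = 6.63. For sentence (B), the partial sentences with similar words are both in conclusion parts, so the total of 4 is increased by 100% to 4 × 2 = 8, and its similarity to the search sentence (similarity point 1) is 8 + 1 = 9.
[0055]
As a result, sentence (B) is judged to correlate more strongly with the search sentence than sentence (A), and the search server 20 places (B) above (A) and outputs them to the client 30. The browser running on the client 30 receives the data from the search server 20 and, as illustrated in FIG. 13, displays sentence (B) above sentence (A). The results can thus be consulted in descending order of correlation with what the user was searching for, and the desired information is obtained more easily. In the example above, similarity is judged and the information considered more highly correlated is displayed first; the similarity points may also be displayed, or information may be added such as whether the match was in the conclusion part or the condition part. The user can then not only read the results from the top down but also judge under what conditions each item matched, which is advantageous.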
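The worked example for sentences (A) and (B) can be replayed end to end as follows. This is a sketch under stated assumptions: the per-word points, the flags marking which words receive the 50% dependency increase, and the clause multipliers are entered by hand from the figures in the text, and the names `similarity` and `QUERY_POINT` are illustrative.

```python
QUERY_POINT = 1.0  # the "similarity point 1" of the search sentence itself


def similarity(word_points, boosted, clause_factor):
    """Word points, +50% where the dependency target also matches, times the clause factor."""
    subtotal = sum(p * 1.5 if b else p for p, b in zip(word_points, boosted))
    return subtotal * clause_factor + QUERY_POINT


candidates = {
    # (A): 0.9, 1, 0.9 with the first two words boosted; one side conclusion, one condition.
    "A": similarity([0.9, 1.0, 0.9], [True, True, False], 1.5),
    # (B): three exact matches with the first two boosted; both sides in conclusion parts.
    "B": similarity([1.0, 1.0, 1.0], [True, True, False], 2.0),
}

# Arrange the target sentences in descending order of similarity.
ranking = sorted(candidates, key=candidates.get, reverse=True)
```

Sentence (B) scores 9 and sentence (A) scores 6.625 (the text rounds the intermediate 5.625 to 5.63 and reports 6.63), so `ranking` places (B) first, matching the display order of FIG. 13.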
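[0056]
In the example above the search sentence was also analyzed on the search server 20, but the analysis may instead be done on the client 30. Alternatively, the search server 20 may perform only the word search with the words received from the client 30 and pass all data in which a partial match with the search terms is found to the client 30, which then performs all the analysis and matching processing of FIG. 4. The analysis and matching processing may also be divided between the search server 20 and the client 30, or a dedicated server may be placed between the client 30 and the search server 20 to perform them.
[0057]
In the example above the thesaurus TSR is provided and the search is broadened to include near-synonyms of the words in the search sentence, preventing omissions caused by the choice of search terms; omissions may instead be prevented by standardizing the search sentence before the search is executed. Standardization can be considered at several levels: standardizing characters, such as unifying half-width and full-width characters; standardizing notation, such as the presence or absence of okurigana and long-vowel marks; and standardizing independent words, such as unifying them to another independent word of the same meaning. If such standardization is done before the search, the thesaurus TSR need not be consulted, or need be consulted only to a limited extent.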
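The standardization levels listed above might be sketched as follows. Unicode NFKC normalization is one existing way to unify half-width and full-width characters and is swapped in here as an assumption, since the text names no specific method; the `CANONICAL` table of notation and word variants is hypothetical.

```python
import unicodedata

# Hypothetical table: notation variant or same-meaning word -> canonical form.
CANONICAL = {
    "コンピューター": "コンピュータ",  # long-vowel mark variant
    "パワーサプライ": "電源",          # same-meaning independent word
}


def standardize(text):
    """Unify character widths with NFKC, then map listed variants to canonical forms."""
    text = unicodedata.normalize("NFKC", text)
    for variant, canonical in CANONICAL.items():
        text = text.replace(variant, canonical)
    return text
```

For example, `standardize("ＮｏＤｉｓｋ")` yields the ASCII form `NoDisk`, and half-width katakana input is converted to full-width katakana, so that the search sentence and the stored sentences can be compared in a single notation.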
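[0058]
Embodiments of the present invention have been described above, but the invention is in no way limited to them and can of course be practiced in various further forms without departing from its gist. For example, the search system of this example was realized as a client-server system, but it may be realized on a stand-alone computer. Sites on a network may also be the search target. Further, in the embodiment and example above the search results are evaluated, the target sentences are sorted, and the result is output to the client, but processing may stop at the evaluation and sorting. Rather than simply outputting the evaluated and sorted target sentences, they may be used, for example, as the input to an inference engine; many applications are possible. A configuration in which the search sentence is entered by speech recognition using the microphone 13, or in which the search results are announced by voice, is also possible.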
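[Brief description of the drawings]
[FIG. 1] A block diagram showing the schematic configuration of the search system 100 as an embodiment of the present invention.
[FIG. 2] A schematic diagram showing the configuration of the search system 50 as an example of the present invention.
[FIG. 3] An explanatory diagram illustrating part of the morphological analysis dictionary IDC.
[FIG. 4] An explanatory diagram outlining the search processing executed by the search server 20.
[FIG. 5] A flowchart showing the morphological analysis routine.
[FIG. 6] An explanatory diagram showing an example of the word array T[t] obtained by morphological analysis.
[FIG. 7] An explanatory diagram showing an example of the phrase array B[b] obtained by dependency analysis.
[FIG. 8] An explanatory diagram showing an example of the partial sentence array S[s] obtained by partial sentence analysis.
[FIG. 9] An explanatory diagram showing an example of the structure of words, phrases, and partial sentences.
[FIG. 10] A flowchart showing the partial sentence analysis routine.
[FIG. 11] An explanatory diagram illustrating the determination word strings Rm used in partial sentence analysis.
[FIG. 12] An explanatory diagram showing the conditions and rates of increase of similarity points in partial sentence matching.
[FIG. 13] An explanatory diagram showing an example of the display of search results.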
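[Explanation of reference numerals]
10: network
11: keyboard
12: mouse
13: microphone
18: router
20: search server
22: CPU
23: ROM
24: RAM
25: timer
26: display circuit
27: hard disk
29: monitor
30: client
50: search system
100: search system
110: network
200: server computer
210: search term receiving unit
220: search engine
225: knowledge database
230: target sentence analysis unit
240: comparison execution unit
250: arrangement unit
260: search result output unit
300: client computer
310: search sentence input unit
320: search sentence analysis unit
330: search term output unit
340: result display unit
DB: search target database
IDC: morphological analysis dictionary
Rm: determination word string
SDI: sentence determination rules
TSR: word thesaurus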
BACKGROUND OF THE INVENTION
The present invention relates to a technique for performing a search using a language sentence, and more particularly to a technique for evaluating a correlation between searched information and a search sentence.
[0002]
[Prior art]
A search of a database or the like is usually performed under conditions such as inputting a word for search and data including the word or data not including the word. If too much data is found as a result of a single word search, additional words can be added to perform a refined search, or each search result for several words can be searched for AND conditions or OR conditions, etc. The search target is also narrowed down by specifying. Some training was required to obtain the desired results using these words.
[0003]
Therefore, various proposals have been conventionally made in search of a better search method. For example, when a word to be searched is specified, a technique is proposed in which a word that has the same meaning as the word is specified using a thesaurus, and the word is also searched to perform a high-precision search. Has been. If a thesaurus is used, it is possible to perform a search using a superordinate concept of a word to be searched. In addition, those which are intended to perform a search using a natural language (for example, “Natural Language Understanding Method and Information Retrieval Device” disclosed in Japanese Patent Laid-Open No. 1-180046 and Japanese Patent Laid-Open No. 2001-14165). Response devices) have been proposed. These are techniques for creating a search scenario in advance according to the search target (plant monitoring, maintenance, etc.) and proceeding with the search scenario. In such a search technique, a search operation can be performed with a natural language sentence as compared with a simple search using words.
[0004]
[Problems to be solved by the invention]
However, such search technology has problems in the following points. In particular, in a system that searches a large amount of data, for example, information on a site connected to a network such as the Internet, sufficient search technology has been proposed. I could not say. First of all, search using words and thesaurus does not change that the search is based on words, so if a large amount of data is found by search, it must be narrowed down. It was the same as before. For this reason, it has been difficult to improve the search accuracy using a thesaurus.
[0005]
In addition, a search that uses natural language sentences is performed after creating a scenario that takes advantage of the characteristics of the search target in advance. There was a problem that it could not be handled if it came off. For this reason, for example, a search using a natural language sentence cannot be performed on a target that cannot be assumed in advance, such as a site search on the Internet.
[0006]
The apparatus of the present invention is intended to solve such problems and realize a high-precision search using a natural language sentence.
[0007]
[Means for solving the problems and their functions and effects]
An apparatus of the present invention that solves at least a part of the above problems
A device for performing a search using a language sentence,
Search sentence input means for inputting a search sentence for search;
A search means for performing a search using the input search sentence;
Analyzing at least a target sentence that is a sentence included in the searched target, and extracting a partial sentence that is a syntactic minimum unit including at least one predicate, and extracting the extracted partial sentence in the sentence Focusing on the role, first classification means for classifying at least a condition part and a conclusion part;
Analyzing the search sentence, extracting the partial sentence, and focusing on the role of the extracted partial sentence in at least a condition part and a conclusion part;
It is determined whether an independent word included in the partial sentence extracted from the search sentence and the target sentence belongs to the classified condition part or the conclusion part, and the search is performed on the target sentence based on the determination result. A target sentence evaluation unit that assigns similarity to sentences and arranges the target sentences in descending order of the degree of similarity;
The gist is that
[0008]
The invention of the method corresponding to this device is
A method in which a computer performs a search using a language sentence,
Enter search text for search from input means such as a keyboard,
The computer performs a search using the input search sentence,
The computer analyzes at least a target sentence that is a sentence included in the searched target, and the computer extracts a partial sentence that is a syntactic minimum unit including at least one predicate, and the extracted partial sentence is , Focusing on the role in the sentence, classifying it into at least a conditional part and a conclusion part,
Analyzing the search sentence, the computer extracts the partial sentence, and classifies the extracted partial sentence into at least a condition part and a conclusion part, focusing on the role in the sentence,
It is determined whether an independent word included in the partial sentence extracted from the search sentence and the target sentence belongs to the classified condition part or the conclusion part, and the search is performed on the target sentence based on the determination result. Assigning similarity to sentences and arranging the target sentences in descending order of similarity
Is the gist.
[0009]
According to such an apparatus and method, a search is performed by using a search sentence for search input by a person who intends to search. At this time, a target sentence that is a sentence included in the searched target is acquired, and the target sentence is analyzed to extract a partial sentence that is a syntactic minimum unit including at least one predicate, and the extraction is performed. The divided partial sentences are classified into at least a conditional part and a conclusion part, focusing on the role in the sentence. Similarly, by analyzing the search sentence and extracting the partial sentence, focusing on the role in the sentence, the extracted partial sentence is classified into at least a conditional part and a conclusion part, and extracted from the search sentence and the target sentence By determining whether the independent words included in the selected partial sentence belong to the classified condition part or conclusion part, the similarity to the search sentence is given to the target sentence. Sentences can be arranged.
[0010]
Other aspects of the invention
Moreover, the invention of such a search device can be realized on a server as a form of realization thereof, or can be realized as a system in which a server computer and a client computer cooperate. Further, since the above search method can be realized by a program running on a computer, the present invention is used as a program or a recording medium (for example, a flexible disk, a CD-ROM, a DVD-ROM, It can also be grasped as a magnetic tape. Although the program can be recorded on a recording medium and handled, it can also be handled by placing it on a server on the network, downloading it via the network, and executing it on the client computer.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below. FIG. 1 is a block diagram showing a search system 100 as one embodiment of the present invention. Each block shown is actually composed of a server computer 200 and a client computer 300. Both computers are connected via a network 110. The server computer 200 includes a search engine, and the client computer 300 outputs a search request to the server computer 200, receives the search result from the server 200, and displays it. Details of the specific hardware configuration of the server computer 200 and the client computer 300 will be given in the embodiments described later, and the configuration and operation thereof will be described here at the block level.
[0012]
As shown in FIG. 1, the client computer 300 has analyzed a search sentence input unit 310 that accepts a search sentence for search in Japanese, which is one of natural languages, and a search sentence analysis unit 320 that analyzes the search sentence. A search word output unit 330 extracts a search word string from the search sentence and outputs it to the server computer 200, and further includes a result display unit 340 that receives a search result from the server computer 200 and displays it on the screen. On the other hand, the server computer 200 includes a search word receiving unit 210 that receives a search word from the client computer 300, a search engine 220 that performs a search using the received search word, a search result is extracted in sentence units, and is analyzed by morphological analysis. Target sentence analysis unit 230, a comparison execution unit 240 that compares the analyzed target sentence and the search word, an arrangement unit 250 that arranges the target sentence according to the comparison result, and a search that sequentially transmits the arranged target sentences to the client computer 300 A result output unit 260 and the like are provided.
[0013]
The search sentence input unit 310 of the client computer 300 accepts Japanese input by the user using a keyboard or the like. When searching for a site connected to a network such as the Internet, the search text input unit 310 uses IME (Japanese input method) or the like in a search word input box displayed by a normal browser. This corresponds to the process of inputting Japanese sentences. For example, a natural language sentence such as “broken when the power is turned on” is inputted via the search sentence input unit 310. In the present embodiment, it is assumed that a failure diagnosis site connected via the Internet receives a diagnosis (cause and countermeasure) about a computer failure.
[0014]
Users can usually describe the failure of their computer in words, but they often find it difficult to identify the cause and choose search terms, or to enter several words and gradually narrow down the search range. Therefore, in this embodiment, the query is made as a sentence that users can compose themselves in the natural language they ordinarily use (in this example, Japanese). When a search sentence is entered in Japanese in this way, the search sentence analysis unit 320 analyzes it. The analysis is described in detail in the example below; usually, the search sentence is first subjected to morphological analysis and divided into the phrases that make up a natural Japanese sentence. After this division, the words to be used in the search are determined. For example, if the search sentence is “broken when the power is turned on”, morphological analysis divides it into phrases such as “the power” (noun + particle), “when turned on” (verb combined form + particle), and “broke” (verb ending form + auxiliary verb indicating the past), and “power”, “turn on”, “break”, and so on are extracted from them as search terms. In addition to extracting such words, the search sentence analysis unit 320 may refer to a thesaurus and also extract synonyms and near-synonyms (for example, “power supply” for “power” and “damage” for “break”) as search terms. Further, in addition to the morphological analysis, the dependency between phrases and the structure of partial sentences, such as the phrases and clauses constituting the sentence, may be analyzed.
[0015]
The search terms extracted in this way are output to the network by the search word output unit 330 and are delivered, using the IP address attached to the packet, to the server computer 200 on which the failure diagnosis program is running. The server computer 200 receives the search terms with the search word receiving unit 210 and passes them to the search engine 220. The search engine 220 accesses the failure diagnosis knowledge database 225 using the received search words (usually several) and retrieves each sentence found to contain them. In the above example, a sentence containing all of the words “power”, “turn on”, and “break”, or a sentence containing at least one of them, is taken out as a retrieved target sentence. For example, if the knowledge database 225 contains sentences such as “When the power is turned on, ‘NoDisk’ is displayed before the OS starts up, and the computer stops.” or “The computer hangs during use, and the power cannot be turned off.”, the search engine 220 retrieves these as search target sentences.
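The two retrieval conditions mentioned above (a sentence containing all of the search words, or a sentence containing at least one of them) can be illustrated with a minimal Python sketch. This is not the disclosed search engine 220: the list `KNOWLEDGE_DB`, the function `search`, and the pre-split English words are hypothetical stand-ins for the knowledge database 225 and for the result of morphological analysis.

```python
# Hypothetical miniature of the failure diagnosis knowledge database 225.
# Each sentence is already split into words; in the embodiment the words
# would come from morphological analysis instead.
KNOWLEDGE_DB = [
    ["power", "turn on", "NoDisk", "display", "OS", "start", "stop"],
    ["computer", "use", "hang", "power", "turn off"],
    ["printer", "paper", "jam"],
]


def search(db, search_words, require_all=False):
    """Return the indices of the sentences that contain the search words.

    require_all=False: a sentence matches if it contains at least one word.
    require_all=True:  a sentence matches only if it contains every word.
    """
    wanted = set(search_words)
    hits = []
    for index, sentence_words in enumerate(db):
        present = wanted & set(sentence_words)
        matched = (present == wanted) if require_all else bool(present)
        if matched:
            hits.append(index)
    return hits


any_hits = search(KNOWLEDGE_DB, ["power", "turn on", "break"])
all_hits = search(KNOWLEDGE_DB, ["power", "turn on"], require_all=True)
```

With the "at least one word" condition, both failure sentences are retrieved because each contains “power”; the stricter condition keeps only the sentence that contains every search word.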
[0016]
The target sentence analysis unit 230 analyzes the search target sentences obtained in this way. This analysis is almost the same as that performed by the search sentence analysis unit 320 described above: based on morphological analysis, it analyzes the phrase structure through dependency relations and identifies the partial sentences, each of which is a minimum syntactic unit including at least one predicate. The target sentence analysis unit 230 further classifies the extracted partial sentences according to their role in the sentence. The role may be classified as condition part or conclusion part, and the condition part may be further subdivided into, for example, “condition”, “reason”, “reverse connection”, and “parallel”. After the search target sentences are classified in this manner, the comparison execution unit 240 compares them with the search words according to the classification. That is, the comparison takes into account whether a search word appears in the conclusion part or in the condition part of the search target sentence. In this example, the server computer 200 receives several search words with the search word receiving unit 210 and runs the search engine 220, so the comparison is simply between the search words and the search target sentences. However, the server computer 200 may instead receive the analysis result of the search sentence and use it in the comparison. For example, the search target sentence may be compared while taking into account the structure around each word in the search sentence (such as dependency and the role of the clause). Note that a “partial sentence” here is a minimum syntactic unit including at least one predicate, a concept that roughly corresponds to grammatical units such as a main clause, a subordinate clause, or a conditional clause.
[0017]
After performing the comparison in this manner, the server computer 200 sorts the search target sentences with the arrangement unit 250 and returns them to the client computer 300 from the search result output unit 260. The search target sentences judged to correlate more strongly with the search words are preferably placed first. It is also suitable not only to order the sentences by correlation but also to output them in a hierarchy. The search result is output by attaching the IP address of the client computer 300 that transmitted the search words to the packets and sending the search target sentences to a network such as the Internet. The data sent to the network in this way is returned, by means of the IP address, to the client computer 300 that output the search words.
[0018]
On receiving this search result, the client computer 300 uses a browser or the like to display a list in which the search target sentences are arranged. In failure diagnosis of this kind, a URL or the like is normally attached to each search result. A user who reads a result and wants more detail can click this URL to jump immediately to the relevant information in the server computer 200 and learn the detailed failure diagnosis information (cause of the failure, countermeasures, and so on). In addition, users need not specify several search terms or enter them one after another to narrow down the results; it is enough to enter a sentence, in natural Japanese, that describes the situation as far as they understand it, which is a great advantage. Furthermore, since the search results are natural Japanese sentences and the sentences considered most relevant are displayed at the top, the necessary information can be reached in a short time.
[0019]
In the embodiment described above, when partial sentences are extracted by morphological analysis of a search target sentence, the conjunctions and connecting particles that indicate how the partial sentences in the target sentence are connected may be identified, and the partial sentences may be extracted using them. In Japanese, a partial sentence often extends up to the point where a conjunction or connecting particle appears, so by focusing on connecting particles and the like, it is easy to recognize that the preceding text is a partial sentence indicating a condition, a reason, or the like.
[0020]
Furthermore, in the above embodiment, when the search sentence is analyzed, the partial sentences, each a minimum syntactic unit including at least one predicate, may be extracted, the part corresponding to the conclusion part may be identified among them, and the search itself may be performed using the words in that conclusion part. For example, for a search sentence such as “The power supply broke when the switch was turned on”, the conclusion part “the power supply broke” is often more useful for failure diagnosis, so the search is performed using the words “power” and “break” from the conclusion part. Of course, depending on the application, the part corresponding to the condition part may be identified instead, and the search may be performed using the words in the condition part. For example, in a system for diagnosing poisoning, if a search sentence such as “I swallowed a dry battery and my stomach hurts” is entered, the condition part is considered more useful, so words such as “dry battery” and “swallow” may be extracted and used for the search.
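The choice between searching with the conclusion part and searching with the condition part can be illustrated by the following minimal Python sketch. It assumes, purely for illustration, that the search sentence has already been split into partial sentences labeled with their role; the function name `select_search_words` and the tuple layout are hypothetical and do not appear in the embodiment.

```python
def select_search_words(partial_sentences, use_part="conclusion"):
    """Pick search words from the partial sentences that have the given role.

    partial_sentences: list of (role, words) tuples, where role is either
    "condition" or "conclusion" (an assumed, simplified representation).
    """
    return [
        word
        for role, words in partial_sentences
        if role == use_part
        for word in words
    ]


# Failure diagnosis: the conclusion part is usually the more useful one.
failure_query = [
    ("condition", ["switch", "turn on"]),
    ("conclusion", ["power", "break"]),
]
failure_words = select_search_words(failure_query)

# Poisoning diagnosis: the condition part is considered more useful.
poison_query = [
    ("condition", ["dry battery", "swallow"]),
    ("conclusion", ["stomach", "hurt"]),
]
poison_words = select_search_words(poison_query, use_part="condition")
```

Which part to use is an application-level setting here, mirroring the statement that the preferred part depends on the application.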
[0021]
In the above embodiment, the search target is a database, but it may instead be the information contained in sites on the network. When the invention is applied to a so-called search engine on a network, the many related sites can easily be displayed with priority given to those with higher correlation.
[0022]
In the above-described embodiment, the search system consists of the server computer 200 and the client computer 300. However, a so-called stand-alone configuration, in which the database and the search engine are placed on the computer the user operates, may also be used. Also, in the above-described embodiment, the client computer 300 analyzes the search sentence with the search sentence analysis unit 320 and passes the result to the server computer 200 as search words. Alternatively, the client computer 300 may output the search sentence as-is to the server computer 200, and the server computer 200 may perform the search sentence analysis. In this case, since the ability to analyze the search sentence is determined by the program and database on the server computer 200, the analysis capability can be upgraded simply by replacing the program on the server computer 200. Nor does the analysis capability vary from one client computer 300 to another. Of course, the search sentence analysis unit 320 may be implemented as a browser plug-in and transmitted from the server computer 200 to the client computer 300. In this way, almost the same analysis capability is available on every client computer 300, and the load on the server computer 200, which is accessed from many client computers 300, is reduced.
[0023]
In the embodiment described above, the search target sentences obtained by the search engine 220 are analyzed by the target sentence analysis unit 230 on the server computer 200. However, the server computer 200 may instead output the search target sentences to the client computer 300 as they are, and the analysis and comparison may be performed on the client computer 300. Since the client computer 300 is the machine on which the search sentence is entered and is closest to the user, it can analyze the search target sentences according to the user's wishes, determine their correlation with the search sentence, and display them in the desired order. For example, if the search sentence is entered by speech recognition, information such as intonation and emphasized words at the time of speech input can also be taken into account in determining the correlation between the search target sentences and the search sentence. If the client computer 300 analyzes the search target sentences, search engines can also be run on several server computers 200, and the search results from those server computers 200 can be received, analyzed together, and displayed in order of correlation.
[0024]
The above embodiment has been described as a search system including the server computer 200 and the client computer 300. However, the present invention can also be implemented as a recording medium, such as a CD-ROM, on which a program that realizes these search functions on a computer is recorded. In this case, the server-side program and the client-side program may be recorded on the recording medium separately, or they may be recorded as one program or a group of programs. Furthermore, the invention may be implemented in a form in which the necessary program is installed on the server side, a program that operates in cooperation with it is made available on the server side for download, and the client side reads and executes that program when a search is performed.
[0025]
【Example】
In order to describe the embodiment described above more specifically, an example will be described.
(1) Configuration of the embodiment:
First, the hardware configuration of the embodiment will be described with reference to the schematic configuration diagram. In the embodiment shown in FIG. 2, a search system 50 is realized by installing a program on a server computer 20 connected to a network 10 such as the Internet and executing this program to perform searches in response to search requests from client computers 30. The server computer 20 (hereinafter referred to as the search server) in the search system 50 can itself be used as a stand-alone search device, but, as described below, it can also be used as a server from other client computers (hereinafter referred to as clients 30). That is, the users of the many clients 30 connected to the network 10 can perform searches in natural language and receive the results by accessing the search server 20 via the network 10. Since the search server 20 and the client 30 are substantially the same as far as the input section is concerned, the hardware configuration is described taking the search server 20 as an example.
[0026]
The internal configuration of the search server 20 will be described with reference to the drawing. The search server 20 includes a network interface (NT-I/F) 21 that controls the exchange of data with the network 10 via a modem or router 18, a CPU 22 that performs processing, a ROM 23 that stores processing programs and fixed data, a RAM 24 serving as a work area, a timer 25 for managing time, a display circuit 26 that controls display on a monitor 29, a hard disk (HD) 27 that stores text data as a database, and an input interface (I/F) 28 that interfaces with a keyboard 11, a mouse 12, and a microphone 13. Although the hard disk 27 is described as fixed, it may be removable, and a removable storage device (for example, CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, or flexible disk) can also be used with it. In this embodiment, the processing program of the search server 20 is stored in the ROM 23; however, it may be stored on the hard disk 27 and loaded into the RAM 24 for execution at startup. Alternatively, it may be read from one of the removable recording media mentioned above, or read from another server via the network 10 and executed. Similarly, as described below, not all of the necessary data has to be stored on the hard disk 27; a huge amount of data may be distributed to other servers connected via the network 10 and stored, updated, and managed there.
[0027]
The hard disk 27 stores a morphological analysis dictionary IDC, a sentence determination rule SDI, a thesaurus TSR, and a search target database DB. The morphological analysis dictionary IDC stores almost the same contents as a so-called kana-kanji conversion dictionary, except that the headword and the reading are reversed. For this reason, a kana-kanji conversion dictionary, which is used to analyze a kana character string input via the keyboard 11 or the network 10 and convert it into a mixed kana-kanji character string, can be used as it is, with only the relationship between reading and headword provided in the form of an index. An example of the morphological analysis dictionary IDC is shown in the drawing. Only the reading, notation, and grammatical information are shown in this example, but the actual morphological analysis dictionary IDC stores, in association with each word, not only its reading, notation, and grammatical information but also colloquial words of the same meaning, synonyms, near-synonyms, abbreviations, and dependency information. This analysis dictionary IDC is used when the search server 20 performs morphological analysis on the search sentence received from the client 30, and it allows the search server 20 to analyze the received search sentence with high accuracy. For example, even if the character string to be analyzed is colloquial natural language, the colloquial expressions can be analyzed accurately.
[0028]
The sentence determination rule SDI is a dictionary that stores rules defining dependency relations and the relationships between partial sentences. Dependency is also used in morphological analysis, but here it is used to specify the relationships between the phrases obtained by morphological analysis. The rules that define the relationships between partial sentences are, broadly, rules that determine whether a partial sentence is a condition part or a conclusion part. For the condition part, rules that distinguish condition, reason, reverse connection, parallel, and so on are stored. For the conclusion part, rules for removing unnecessary portions that do not affect the conclusion are also stored.
[0029]
The thesaurus TSR is a dictionary in which semantically related words (for example, synonyms and antonyms) are organized according to their conceptual relationships. Various conceptual relationships are provided in addition to superordinate, subordinate, and parallel relationships, and a large number of words are arranged according to them. For example, verbs such as “put”, “cut”, “turn”, and “twist” are grouped as near-synonyms from the viewpoint of “human motion”.
[0030]
The search target database DB is the very thing the user searches. In this embodiment, it is a database for failure analysis and diagnosis. Here the database DB is recorded on the hard disk 27, but the many sites on the Internet can of course also be treated as the search target database DB. In that case, the data of the sites on the Internet may be collected with a recursive search engine and stored on the hard disk 27 in the search server 20 in the form of an index.
[0031]
(2) Operation of the search system (analysis processing):
The operation of the search system 50, which includes the search server 20 and the clients 30 connected to it, is described next. In this example, the client 30 runs a browser for viewing site information on the Internet. When the user enters what is to be searched for, in natural Japanese, into the search box that the browser displays based on data sent from the search server 20, the entered sentence is transmitted to the search server 20 via the network 10 without being analyzed. In the embodiment described above, the search sentence was analyzed on the client computer 300 side, but in this example all processing, from the analysis of the search sentence onward, is performed on the search server 20 side. The client 30 is responsible only for the input and output of search sentences and the display of search results.
[0032]
Therefore, the operation on the client 30 side is described only briefly, and the operation on the search server 20 side is described in detail with reference to the explanatory diagram. On receiving a request from the client 30 via the network 10, the search server 20 starts the process shown in FIG. 4. The processing executed by the search server 20 consists mainly of analysis processing and collation processing. The analysis processing includes morphological analysis (step S100), dependency analysis (step S110), and partial sentence determination (step S120). The collation processing includes word collation (step S130), dependency collation (step S140), and partial sentence collation (step S150).
[0033]
The process shown in FIG. 4 starts when a search sentence is received from the client 30, and morphological analysis is performed first (step S100). As described above, the morphological analysis refers to the morphological analysis dictionary IDC and extracts words and phrases from the search sentence received from the client 30. The details of the morphological analysis (step S100) are shown in the flowchart.
[0034]
When the morphological analysis process is started, the search sentence received from the client 30 is taken as the object of analysis, and L characters (L = 1, 2, ...) are extracted starting from the M-th character (M = 1, 2, ...) from the head of the sentence and looked up in the analysis dictionary IDC (step S102). M indicates the head position of the character string of interest, and L indicates the number of characters to extract. The lookup starts with M = 1 and L = 1, that is, one character is extracted from the head position and the dictionary is consulted for a matching word. The dictionary IDC is then consulted while L is incremented. When there is no longer a matching headword, the head position M is incremented, the number of characters L is reset to 1, and the dictionary is consulted again. The lookup ends when the position or the number of characters of interest goes beyond the end of the sentence being analyzed.
[0035]
For example, assuming that the search sentence “broken when power is turned on” is input from the client 30, consulting the analysis dictionary IDC allows candidate words such as “power on” “power” “source” “source” “on” “If you put” “I put it” “La” “I put it” “Tara” “Ta” “I entered” “If it was” “Broken” “Broken” “Ta” “Broken” “Re” “Re” to be cut out. Here, a single kana such as “ta” is also extracted as a word, because the past-tense auxiliary verb “ta” and the like may appear in the sentence.
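The dictionary lookup of step S102 can be illustrated with a minimal Python sketch that tries every start position and every extraction length. Several things here are assumptions and not taken from the source: the Japanese string is an assumed rendering of the example search sentence, the set `TOY_IDC` is a tiny stand-in for the analysis dictionary IDC, and the `max_len` limit is added only to keep the sketch bounded.

```python
# Toy stand-in for the morphological analysis dictionary IDC (assumed entries).
TOY_IDC = {"電", "電源", "源", "を", "入れ", "入れる", "る", "と", "壊れ", "た"}


def cut_out_words(sentence, dictionary, max_len=4):
    """Enumerate every dictionary word found at every start position.

    m is the 0-based start position (M in the text is 1-based) and
    length plays the role of L, the number of characters extracted.
    """
    found = []
    for m in range(len(sentence)):
        for length in range(1, max_len + 1):
            if m + length > len(sentence):
                break  # the extraction would run past the end of the sentence
            candidate = sentence[m:m + length]
            if candidate in dictionary:
                found.append((m, candidate))
    return found


# Assumed Japanese rendering of "broken when the power is turned on".
candidates = cut_out_words("電源を入れると壊れた", TOY_IDC)
```

The result contains overlapping candidates such as “電” and “電源” at the same position; choosing among them is the job of the subsequent cost calculation.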
[0036]
These words are stored in the analysis dictionary IDC together with their grammatical information. The extracted words are therefore arranged according to the grammatical information, and an arrangement that does not break down grammatically is sought. Known methods for this analysis include the longest-match method over several phrases and the minimum cost method, each of which tests which combination of the candidate words is most plausible as Japanese. In this embodiment, the minimum cost method is adopted, so a cost is next calculated for the many word strings obtained in this way (step S104). The cost calculation assigns each candidate word string a cost according to rules prepared in advance so that the more plausible a word arrangement is as Japanese, the lower its cost. Roughly, an independent word has a cost of 2, and an adjunct attached to it has a cost of 0. Taking “power” followed by the particle “wo” as an example, “power” + “wo” is a sequence of independent word + adjunct (particle) and costs 2, whereas “den” + “source” + “wo” is a sequence of independent word + independent word + adjunct (particle) and costs 4. The rules of the minimum cost method are tuned to actual Japanese usage; for example, various rules are provided, such as assigning a cost of −1 when words having a co-occurrence relationship, such as “nothing” + “no”, occur in the sentence.
[0037]
In this way, the cost is calculated for all the word arrangements obtained by consulting the reverse-lookup dictionary, and the arrangement with the lowest cost is identified (step S106). In the above example, “power” (independent word/noun) + “wo” (adjunct/particle) is judged more plausible as Japanese than “den” (independent word/noun) + “source” (independent word/noun) + “wo” (adjunct/particle). This calculation is of course performed over at least a whole sentence, and the word arrangement that minimizes the cost of the entire sentence is selected. Therefore, if, for example, a co-occurrence relationship reduces the cost, a different combination may be selected.
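The cost comparison of steps S104 and S106 can be illustrated by the following minimal Python sketch. It implements only the rough rule stated above (independent word = 2, adjunct = 0) and omits the co-occurrence adjustments; it also enumerates all segmentations exhaustively, which is a simplification chosen for clarity and not necessarily how the embodiment searches. The dictionary contents and the Japanese string are assumed for illustration.

```python
INDEPENDENT, ADJUNCT = "independent", "adjunct"

# Toy dictionary: word -> grammatical class (assumed entries).
TOY_DICT = {
    "電": INDEPENDENT,
    "源": INDEPENDENT,
    "電源": INDEPENDENT,
    "を": ADJUNCT,
}

# Rough cost rule from the text: independent word 2, attached adjunct 0.
WORD_COST = {INDEPENDENT: 2, ADJUNCT: 0}


def segmentations(text, dictionary):
    """Yield every way of splitting text into dictionary words."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            for rest in segmentations(text[end:], dictionary):
                yield [head] + rest


def cost(words, dictionary):
    """Total cost of one word arrangement under the rough rule."""
    return sum(WORD_COST[dictionary[word]] for word in words)


def min_cost_segmentation(text, dictionary):
    """Return the arrangement with the lowest total cost."""
    return min(segmentations(text, dictionary),
               key=lambda words: cost(words, dictionary))


best = min_cost_segmentation("電源を", TOY_DICT)
```

For the assumed input, the two-word arrangement costs 2 and the three-word arrangement costs 4, so the former is selected, matching the worked example in the text.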
[0038]
When the minimum-cost arrangement has been identified by the minimum cost method in this way, the combination of phrases constituting the search sentence is obtained together with its grammatical information, and the result is stored in the array shown in FIG. 6 (step S108). FIG. 6 is an explanatory diagram showing an example of the arrays used when analyzing a search sentence. Overall, the analyzed search sentence is stored in the form of word information (FIG. 6), phrase information (FIG. 7), and partial sentence information (FIG. 8). Of these, FIG. 6 shows the contents of the word information array, each entry of which consists of a word, its reading, and its part of speech. Hereinafter, the word array is referred to as T[t] (t = 0, 1, ...).
[0039]
When the morphological analysis is completed in this way, dependency analysis (step S110) is performed next (see FIG. 4). Dependency analysis identifies the relationships between the phrases that make up a sentence and thereby produces the phrase information; that is, it determines which phrase a given phrase relates to. For example, from the rule that a noun + “wo” (particle) attaches to the nearest predicate that follows it, the relationship “the power” → “turn on” is identified. The phrase information obtained by this dependency analysis is stored in an array B. An example of this array B[b] (b = 0, 1, ...) is shown in FIG. 7. Each entry consists of the index b of the array B[b], which identifies the phrase, the numbers t of the words belonging to that phrase, and the numbers b of the phrases related to it by dependency. In the table of FIG. 7, “-” indicates that no corresponding phrase exists. Given a word number t belonging to an entry, the actual word can be obtained by referring to the array T[t] shown in FIG. 6.
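The single dependency rule quoted above (a noun + “wo” phrase attaches to the nearest predicate that follows it) can be illustrated with a short Python sketch. The phrase representation, the `kind` labels, and the function name `attach_object_phrases` are hypothetical; a real dependency analysis would apply many more rules.

```python
def attach_object_phrases(phrases):
    """Apply one rule: a 'noun+wo' phrase depends on the nearest later predicate.

    phrases: list of dicts with a 'text' gloss and a 'kind' label
             (an assumed representation, not the layout of FIG. 7).
    Returns a dict mapping a phrase index to the index it depends on.
    """
    links = {}
    for b, phrase in enumerate(phrases):
        if phrase["kind"] != "noun+wo":
            continue
        for later in range(b + 1, len(phrases)):
            if phrases[later]["kind"] == "predicate":
                links[b] = later  # nearest predicate behind this phrase
                break
    return links


example_phrases = [
    {"text": "the power (wo)", "kind": "noun+wo"},
    {"text": "when turned on", "kind": "predicate"},
    {"text": "broke", "kind": "predicate"},
]
example_links = attach_object_phrases(example_phrases)
```

A phrase with no following predicate simply receives no link, which corresponds to the “-” entry described for FIG. 7.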
[0040]
When the dependency analysis (step S110) is completed, partial sentence determination is performed (step S120). This process uses the relationships between phrases found by the dependency analysis to identify the relationships between partial sentences, each composed of one or more phrases. Here, a partial sentence is a minimum syntactic unit that includes at least one predicate, a concept roughly equivalent to a clause. As shown in FIG. 8, the relationships between the partial sentences are given as an array S[s] (s = 0, 1, ...), each entry of which associates the numbers b of the phrases belonging to the partial sentence with its distance from the conclusion part and the meaning of the condition part. The relationship among the word array T[t], the phrase array B[b], and the partial sentence array S[s] is shown in the drawing. As shown there, they form a hierarchy, so the phrases, words, and so on contained in a partial sentence can be referred to freely.
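The three-level structure of the arrays T[t], B[b], and S[s] can be illustrated with the following Python sketch. The field names, the romanized readings, and the sample values are assumptions made for illustration; the actual layouts are those of FIGS. 6 to 8, which are not reproduced here, and the phrase entry is simplified to a single dependency link.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:                 # one entry of the word array T[t]
    notation: str
    reading: str
    part_of_speech: str


@dataclass
class Phrase:               # one entry of the phrase array B[b]
    word_numbers: List[int]         # numbers t of the words in this phrase
    depends_on: Optional[int]       # None plays the role of "-" in FIG. 7


@dataclass
class PartialSentence:      # one entry of the partial sentence array S[s]
    phrase_numbers: List[int]       # numbers b of the phrases it contains
    distance_from_conclusion: int   # 0 for the conclusion part itself
    meaning: Optional[str]          # e.g. "condition"; None for the conclusion


def words_of(s, S, B, T):
    """Follow S[s] -> B[b] -> T[t] and return the notations of the words."""
    return [T[t].notation for b in S[s].phrase_numbers for t in B[b].word_numbers]


# Assumed analysis of "broken when the power is turned on".
T = [
    Word("power", "dengen", "noun"), Word("wo", "wo", "particle"),
    Word("turn on", "ire", "verb"), Word("ru", "ru", "inflection ending"),
    Word("to", "to", "connecting particle"),
    Word("break", "koware", "verb"), Word("ta", "ta", "auxiliary verb"),
]
B = [Phrase([0, 1], 1), Phrase([2, 3, 4], 2), Phrase([5, 6], None)]
S = [PartialSentence([0, 1], 1, "condition"), PartialSentence([2], 0, None)]

condition_words = words_of(0, S, B, T)
```

Because each level stores only index numbers of the level below, any word of a partial sentence can be reached by two lookups, which is the free cross-referencing the text describes.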
[0041]
FIG. 10 shows the partial sentence extraction process. This process refers to the sentence determination rule SDI, an example of which is shown in FIG. 11. FIG. 11 lists the determination word strings Rm used in the flowchart of FIG. 10, and the heading of each group indicates the meaning of the condition part. “*” in the table of FIG. 11 is a so-called wildcard, indicating that any word matches. For example, “*:*:verb” matches any word whose part of speech is verb, regardless of its reading or headword. In FIG. 11, the sentence pattern indicated by the symbol IN gives the condition part the meaning “condition” and is written “(*:*:verb, *:*:inflection ending, to:*:connecting particle)”. It covers every partial sentence in which a verb is followed by an inflection ending and then by the connecting particle “to”. The verb “turn on” + inflection ending “ru” + connecting particle “to” matches this sentence pattern.
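The wildcard matching of a determination word string against a word string can be illustrated with a minimal Python sketch. It assumes that each rule element and each word is a triple of reading, notation, and part of speech, and that the third element of the pattern fixes the reading “to”; the function names and the romanized sample words are hypothetical.

```python
def element_matches(rule_element, word):
    """Compare one rule element with one word, field by field.

    Both are (reading, notation, part_of_speech) triples; "*" in the rule
    element is a wildcard that accepts any value in that field.
    """
    return all(r == "*" or r == w for r, w in zip(rule_element, word))


def pattern_matches(pattern, words):
    """True if the word string has the same length as the pattern and every
    word matches the rule element at the same position."""
    return len(pattern) == len(words) and all(
        element_matches(p, w) for p, w in zip(pattern, words)
    )


# Assumed encoding of the "condition" sentence pattern discussed in the text.
CONDITION_PATTERN = [
    ("*", "*", "verb"),
    ("*", "*", "inflection ending"),
    ("to", "*", "connecting particle"),
]

turn_on_to = [
    ("ire", "turn on", "verb"),
    ("ru", "ru", "inflection ending"),
    ("to", "to", "connecting particle"),
]
tail_of_sentence = [
    ("to", "to", "connecting particle"),
    ("koware", "break", "verb"),
    ("ta", "ta", "auxiliary verb"),
]
```

Here `pattern_matches(CONDITION_PATTERN, turn_on_to)` holds, while the three words at the tail of the example sentence do not match because their parts of speech differ from the pattern.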
[0042]
The partial sentence analysis routine shown in FIG. 10 is described next. When this routine starts, unnecessary sentences are first deleted from the search sentence (step S200). An unnecessary sentence is a portion unrelated to the content being searched for, such as “What should I do?”. Such portions may be stored in advance as a list of unnecessary sentences and the matching portions deleted. For example, if a search sentence such as “What should I do when I turn on the power? What should I do?” is given, the portions corresponding to unnecessary sentences can be identified by morphological analysis and dependency analysis, and are therefore deleted. The deleted phrases are removed from the word array (see FIG. 6), the phrase array (see FIG. 7), the partial sentence array (see FIG. 8), and so on.
[0043]
Next, to start the analysis of partial sentences, the number n of all words constituting the search sentence being analyzed is set, and the variable j indicating the number of the subordinate clause of interest is initialized (j ← 0) (step S210). In the next step S220, the variable m indicating the number of the sentence example for the condition part shown in FIG. 11 is initialized (m ← 0), and the following processing is repeated until m reaches the total number of sentence examples shown in FIG. 11. Each sentence example in FIG. 11 is one parenthesized pattern and can be designated by m = 1, 2, .... First, the m-th sentence example is acquired as the determination word string Rm, and the number of words constituting Rm is set to k (step S230). For example, for the sentence example “(*:*:verb, *:*:inflection ending, to:*:connecting particle)” described above, the number of constituent words k is 3.
[0044]
Next, attention is directed to the tail of the search sentence, and the word string W(n−k+1, n) from the (n−k+1)-th word to the n-th word is obtained (step S240). Since the sentence example to be compared has k words, a word string of k words is extracted from the search sentence. The word string can easily be extracted using the word array T[t]. For example, when the sentence “broken when the power is turned on” is input as the search sentence, “to” + “broken” + “ta” are obtained as the three words from the end. Once the word string to compare has been obtained in this way, the two are collated (step S250). If they do not match, the variable m is incremented by 1 to obtain the next sentence example (step S260), and the process returns to step S230 and continues until the sentence examples shown in FIG. 11 are exhausted (step S270). In the above example, the words extracted from the end match none of the sentence examples, so the determination over all sentence examples finishes without any match between the determination word string Rm and the word string W(n−k+1, n).
[0045]
Therefore, the variable n is decremented by 1 to move the word string of interest one position back from the end (step S280), and, until the variable n becomes smaller than 0 (step S290), the process returns to step S220 and repeats the above processing from the initialization of the variable m. As this is repeated, when the k words ending at the third word from the end, “to”, are obtained, the determination word string Rm “(*:*:verb, *:*:inflection ending, to:*:connecting particle)” matches “turn on” + “ru” + “to”, which is the word string W(n−k+1, n) of the search sentence (step S250). The process then branches to step S300 and the steps that follow: since one subordinate clause has been found, the variable j indicating the subordinate clause is incremented by 1 (step S300), and the information on the subordinate clause is set (step S310). The setting of this information is explained in the next paragraph. After this processing, the position of the word of interest is advanced by k−1 (step S320), and the above processing is repeated from the decrement of the variable n (step S280). Processing continues even after one subordinate clause has been found because natural language allows several subordinate clauses. For example, if the search sentence “Since the computer suddenly shut down, it broke when I turned on the power” is entered, two subordinate clauses are set: “the computer suddenly shut down” and “turned on the power”.
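The tail-first scan of steps S220 to S320 can be illustrated by a simplified Python sketch. It is not a transcription of the flowchart of FIG. 10: the loop variables are condensed, the subordinate clause information is reduced to an end position and a meaning, and the word triples, rule list, and function name are assumptions made for illustration.

```python
def find_subordinate_clauses(words, rules):
    """Scan the word string from its tail and report every rule match.

    words: list of (reading, notation, part_of_speech) triples.
    rules: list of (pattern, meaning); a pattern is a list of triples in
           which "*" is a wildcard.
    Returns [(end_position, meaning), ...], tail-most match first, where
    end_position is the number of words up to and including the match.
    """
    def matches(pattern, window):
        return all(r == "*" or r == w
                   for p, word in zip(pattern, window)
                   for r, w in zip(p, word))

    found = []
    n = len(words)
    while n > 0:
        step = 1                      # default: move one word toward the head
        for pattern, meaning in rules:
            k = len(pattern)
            if k <= n and matches(pattern, words[n - k:n]):
                found.append((n, meaning))
                step = k              # skip over the k matched words
                break
        n -= step
    return found


RULES = [([("*", "*", "verb"),
           ("*", "*", "inflection ending"),
           ("to", "*", "connecting particle")], "condition")]

SENTENCE = [
    ("dengen", "power", "noun"), ("wo", "wo", "particle"),
    ("ire", "turn on", "verb"), ("ru", "ru", "inflection ending"),
    ("to", "to", "connecting particle"),
    ("koware", "break", "verb"), ("ta", "ta", "auxiliary verb"),
]
clauses = find_subordinate_clauses(SENTENCE, RULES)
```

The scan does not stop at the first match, so a sentence with several subordinate clauses yields several entries, in line with the remark that natural language allows more than one.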
[0046]
The information set for a subordinate clause consists of two items: its distance from the conclusion part and the meaning of the condition part, as illustrated in the drawings. The conclusion part itself has a distance of 0, and subordinate clauses are assigned distances 1, 2, ... in order of closeness to the conclusion part (in this example sentence, “broken”). Further, according to the classification attached to the matched determination word string Rm, a distinction such as “condition”, “reason”, “reverse connection”, or “parallel” is stored in the array S [s] as information on the subordinate clause.
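The tail-first scan of steps S220 to S320 can be sketched roughly as follows. This is only an illustration under assumptions: the part-of-speech labels, the two patterns in `PATTERNS`, and the function name `find_subordinate_clauses` are hypothetical stand-ins for the determination word strings Rm of FIG. 11, which are not reproduced in the text.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (surface form, part of speech)

# Hypothetical determination word strings Rm: (POS pattern, clause type).
# The real table of FIG. 11 is not available here.
PATTERNS = [
    (("verb", "inflection ending", "connection particle"), "condition"),
    (("verb", "inflection ending", "reason particle"), "reason"),
]


def find_subordinate_clauses(words: List[Word]):
    """Scan from the tail and return (end_index, clause_type, distance) tuples.

    end_index is the exclusive index at which the matched k-word window ends;
    distance counts subordinate clauses outward from the conclusion part
    (1 = closest), as in the variable j of the description.
    """
    found = []
    n = len(words)  # exclusive end of the window under inspection
    while n > 0:
        matched = False
        for pattern, label in PATTERNS:  # corresponds to the loop over m
            k = len(pattern)
            if n - k < 0:
                continue  # not enough words left for this pattern
            window = tuple(pos for _, pos in words[n - k:n])
            if window == pattern:
                found.append((n, label, len(found) + 1))
                n -= k  # advance by k-1, then the usual decrement by 1
                matched = True
                break
        if not matched:
            n -= 1  # move the window one word toward the head
    return found


sentence = [
    ("power", "noun"), ("wo", "case particle"),
    ("insert", "verb"), ("ru", "inflection ending"), ("to", "connection particle"),
    ("broken", "verb"), ("ta", "auxiliary verb"),
]
clauses = find_subordinate_clauses(sentence)
```

With the seven-word example above, the window ending at “to” is the only one that matches, so a single condition clause at distance 1 is reported.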
[0047]
(3) Search system processing-collation processing:
With the above processing, the analysis processing shown in FIG. 4 is completed. Next, collation processing is performed. The collation processing collates the input search sentence with the search target sentences retrieved from the database DB on the basis of that search sentence. First, word collation is performed (step S130). In this step, the database DB is basically searched using the words included in the search sentence. However, each search word is looked up in the thesaurus TSR so that synonyms and near-synonyms are also searched broadly. For example, not only the words “power” and “turn on” but also related words such as “power switch” and “power supply”, and “cut” and “turn”, are searched. By such search processing, a large number of search target sentences are obtained from the database DB. Therefore, even when the search sentence is input in natural language, search omissions rarely occur.
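The broad retrieval step can be pictured with a minimal sketch like the one below. The thesaurus contents, the function names `expand_terms` and `search`, and the substring-based matching are all assumptions made for illustration; the text does not specify how the database DB or the thesaurus TSR are implemented.

```python
# Hypothetical thesaurus TSR: each word maps to related words to search as well.
THESAURUS = {
    "power": ["power switch", "power supply"],
    "computer": ["pc"],
}


def expand_terms(terms, thesaurus=THESAURUS):
    """Return the search terms plus their thesaurus entries, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def search(terms, sentences):
    """Return every sentence containing any expanded term (simple substring match)."""
    expanded = expand_terms(terms)
    return [s for s in sentences if any(t in s.lower() for t in expanded)]


hits = search(["computer"], ["My PC hangs.", "The printer jams."])
```

Because the expansion deliberately casts a wide net, ranking is left to the later collation steps rather than to this retrieval stage.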
[0048]
The word matching process further proceeds as follows. For each sentence retrieved:
(1) if it contains an independent word included in the search sentence, a value of 1 is given as similarity points; and
(2) if it contains a word whose superordinate concept matches according to the thesaurus TSR, a value of 0.9 is given.
For example, when the sentence “When the PC is turned off” exists in the database DB for the search sentence “It is broken when the power is turned on”, a value of 1 is given as the similarity for the word “power”. “Put” and “cut” share the superordinate concept “body motion”, and therefore a value of 0.9 is given as their similarity. The similarity between the two sentences is therefore 1 + 0.9 = 1.9. The similarity points may be further fine-tuned according to the sentence-end expression. For example, if a sentence such as “It seems to break” is found, attention is paid to the expression at the end of the sentence, and if it expresses hearsay or conjecture, it is also suitable to adjust the similarity of the two sentences by applying a rule that subtracts a value of 0.1 to 0.3.
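The point assignment above can be written down as a short sketch. The dictionary `SUPERORDINATE`, the function name `word_similarity`, and the `hearsay_penalty` parameter are hypothetical; only the values 1, 0.9 and the 0.1 to 0.3 deduction come from the text.

```python
# Hypothetical excerpt of the thesaurus TSR: word -> superordinate concept.
SUPERORDINATE = {
    "put": "body motion",
    "cut": "body motion",
}


def word_similarity(query_words, target_words, hearsay_penalty=0.0):
    """Sum similarity points for the independent words of the search sentence.

    1.0 for an exact match, 0.9 when only the superordinate concept matches.
    hearsay_penalty (0.1 to 0.3 in the text) is subtracted when the target
    sentence ends in a hearsay or conjecture expression.
    """
    score = 0.0
    for q in query_words:
        if q in target_words:
            score += 1.0
            continue
        concept = SUPERORDINATE.get(q)
        if concept is not None and any(
            SUPERORDINATE.get(t) == concept for t in target_words
        ):
            score += 0.9
    return score - hearsay_penalty


points = word_similarity(["power", "put"], ["power", "cut"])
```

For the pair of sentences in the example, this gives the 1 + 0.9 = 1.9 points stated above.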
[0049]
Next, dependency checking is performed (step S140). In this process, when attention is paid to a certain word, the similarity for that word is increased if the word it depends on also matches. For example, in the two sentences “turn on power” and “turn off power”, the word “power” has a dependency relation to “turn on” in one and to “turn off” in the other. Moreover, “put” and “cut” belong to the same category, body motion. In such a case, the value 1 given as the similarity for “power” is increased by 50% to 1.5. The method of increase is not limited to a 50% increase; a predetermined value (for example, 0.5) may be added instead. It is also desirable to give a higher value when the word at the dependency destination is a perfect match. As a result, combined with the preceding word collation, the similarity between “turn on power” and “turn off power” is 1.5 + 0.9 = 2.4.
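A possible reading of the dependency check is sketched below. The data layout (matched word pairs mapped to points, plus a head-word table per sentence), the `CATEGORY` table, and the function names are assumptions for illustration; only the 50% increase is taken from the text.

```python
# Hypothetical category table used to decide whether two head words are related.
CATEGORY = {"put": "body motion", "cut": "body motion"}


def related(a, b):
    """True if the two words are identical or share a category."""
    if a == b:
        return True
    category = CATEGORY.get(a)
    return category is not None and category == CATEGORY.get(b)


def boost_by_dependency(matches, query_heads, target_heads, factor=1.5):
    """Raise the points of a matched word pair when their head words also match.

    matches:      {(query_word, target_word): points from word collation}
    query_heads:  {word: word it depends on} for the search sentence
    target_heads: {word: word it depends on} for the target sentence
    """
    boosted = {}
    for (q, t), points in matches.items():
        q_head, t_head = query_heads.get(q), target_heads.get(t)
        if q_head is not None and t_head is not None and related(q_head, t_head):
            points *= factor
        boosted[(q, t)] = points
    return boosted


result = boost_by_dependency(
    {("power", "power"): 1.0, ("put", "cut"): 0.9},
    query_heads={"power": "put"},
    target_heads={"power": "cut"},
)
total = sum(result.values())
```

Here only “power” has a head word on both sides, so it alone is raised to 1.5 and the total becomes 2.4, matching the figure in the text.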
[0050]
After the dependency check, the partial sentences are collated (step S150). In this collation, the amount by which the similarity is increased varies depending on whether the partial sentence of interest corresponds to a conclusion part or a condition part. This relationship is illustrated in the drawings. Taking “turn on power” and “turn off power” as an example:
(1) If the two partial sentences are both in the conclusion parts of the search sentence and the target sentence, the similarity is increased by 100%.
(2) If one is in a conclusion part and the other is in a condition part, the similarity is increased by 50%.
(3) If both are in condition parts, the distance of each from its conclusion part is further examined. If the distance j is the same, the similarity is increased by 20%.
(4) If both are in condition parts and the distance j from the conclusion part differs, the similarity is increased by 10%.
[0051]
As a result, if both “turn on power” and “turn off power” are in the conclusion part, the similarity is 2.4 × 2 = 4.8; if one is in the conclusion part and the other is in the condition part, 2.4 × 1.5 = 3.6; if both are in the condition part and their distances from the conclusion part are equal, 2.4 × 1.2 = 2.88; and if their distances from the conclusion part differ, 2.4 × 1.1 = 2.64.
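The four cases can be condensed into a small lookup, sketched here under the assumption that each partial sentence is tagged with a part label and a distance j; the function name and argument names are illustrative only.

```python
def partial_sentence_factor(query_part, target_part, query_distance=0, target_distance=0):
    """Return the multiplier applied to the similarity for one pair of partial sentences.

    query_part / target_part are "conclusion" or "condition"; the distances are
    the distance j of each condition part from its conclusion part.
    """
    query_is_conclusion = query_part == "conclusion"
    target_is_conclusion = target_part == "conclusion"
    if query_is_conclusion and target_is_conclusion:
        return 2.0  # case (1): +100%
    if query_is_conclusion or target_is_conclusion:
        return 1.5  # case (2): +50%
    if query_distance == target_distance:
        return 1.2  # case (3): +20%
    return 1.1      # case (4): +10%


base = 2.4  # similarity after word and dependency collation in the example
scores = [
    base * partial_sentence_factor("conclusion", "conclusion"),
    base * partial_sentence_factor("conclusion", "condition", 0, 1),
    base * partial_sentence_factor("condition", "condition", 1, 1),
    base * partial_sentence_factor("condition", "condition", 1, 2),
]
```

Applied to the base value 2.4, the four multipliers reproduce the values 4.8, 3.6, 2.88 and 2.64 given above.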
[0052]
Taking a slightly more complex example, the calculation of similarity is described below. Suppose that the sentence entered from the client 30 as a search sentence is “The computer hangs up during use and the power of the computer cannot be turned off”, and that the following two sentences (A) and (B) are retrieved from the database DB by the search.
(A) When the PC is turned on, “NoSystemDisk” is displayed before the operating system starts, and the startup stops.
(B) The computer cannot be turned off.
When the words are collated against these two sentences, in example sentence (A) “power” matches completely, while “computer” and “PC”, and “cut” and “put”, are found to be similar by referring to the thesaurus TSR. The word similarity is therefore 0.9 + 1 + 0.9 = 2.8. For example sentence (B), on the other hand, “computer”, “power”, and “turn off (negation)” all match completely, so the similarity is 3.
[0053]
Next, when collation by dependency is performed, for example sentence (A) “PC” and “power” can be judged to be in the same category in terms of their dependency, so each is increased by 50%, giving 0.9 × 1.5 + 1 × 1.5 + 0.9 = 3.75. For example sentence (B), “computer” and “power” are judged to be identical, so they are likewise increased by 50%, giving 1 × 1.5 + 1 × 1.5 + 1 = 4.
[0054]
Furthermore, in the collation of partial sentences, for example sentence (A), “computer is in use → hangs up” is in the condition part and “PC → power → turn off (negation)” is in the conclusion part. Therefore, the total similarity of 3.75 is increased by 50%, giving 3.75 × 1.5 ≈ 5.63. Accordingly, the similarity between example sentence (A) and the search sentence is given as 5.63 + 1 = 6.63 similarity points. For example sentence (B), on the other hand, the partial sentences containing the similar words are both in the conclusion part, so the total of 4 similarity points is increased by 100% to 4 × 2 = 8, and the similarity to the search sentence (similarity point 1) is 8 + 1 = 9.
[0055]
As a result, example sentence (B) is judged to be more highly correlated with the search sentence than example sentence (A), and the search server 20 ranks example sentence (B) above example sentence (A) and outputs them to the client 30. Upon receiving the data from the search server 20, the browser running on the client 30 displays example sentence (B) above example sentence (A), as illustrated in the drawings. The search results can therefore be consulted in descending order of correlation, and the desired information can be obtained more easily. In the above embodiment, the similarity is judged for the search results and the information considered more highly correlated is displayed higher. In doing so, the similarity points may be displayed together, and information such as whether the match occurred in the conclusion part or the condition part may also be added to the display. This is preferable because the user can not only look through the search results in order from the top but also see under what conditions each item matched.
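The arithmetic for example sentences (A) and (B) and the resulting ordering can be retraced with the following sketch. The helper `final_score` and its argument layout are assumptions; the `bonus` of 1 simply follows the “+ 1” in the worked example, whose exact meaning the text does not spell out. Note that the unrounded value for (A) is 6.625, which the text reports as 6.63 after rounding the intermediate result.

```python
def final_score(word_points, boosted, part_factor, bonus=1.0):
    """Combine word points, the 50% dependency increase, and the partial-sentence factor.

    word_points: points from word collation, one per matched word
    boosted:     flags marking which of those words receive the 50% increase
    part_factor: multiplier from the partial-sentence collation
    bonus:       the extra 1 added at the end of the worked example (assumed)
    """
    subtotal = sum(p * 1.5 if b else p for p, b in zip(word_points, boosted))
    return subtotal * part_factor + bonus


candidates = {
    # (A): 0.9 x 1.5 + 1 x 1.5 + 0.9 = 3.75, then +50%, then + 1
    "A": final_score([0.9, 1.0, 0.9], [True, True, False], part_factor=1.5),
    # (B): 1 x 1.5 + 1 x 1.5 + 1 = 4, then +100%, then + 1
    "B": final_score([1.0, 1.0, 1.0], [True, True, False], part_factor=2.0),
}

# Arrange the target sentences in descending order of similarity.
ranking = sorted(candidates, key=candidates.get, reverse=True)
```

Sorting by score places (B) ahead of (A), which is the order in which the browser on the client 30 is described as displaying them.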
[0056]
In the above embodiment, the search sentence is also analyzed by the search server 20, but the search sentence may instead be analyzed on the client 30 side. Alternatively, the search server 20 may perform only the search using the words received from the client 30 side and pass all data in which a partial match with the search words is found to the client 30 side, where all of the analysis processing and collation processing described above is performed. The analysis processing and the collation processing may also be divided between the search server 20 side and the client 30 side. Alternatively, a dedicated server may be provided between the client 30 and the search server 20 to perform the analysis processing and the collation processing.
[0057]
In the above embodiment, the thesaurus TSR is provided so that a broad search including synonyms of the words in the search sentence is performed and search omissions due to the choice of search words are prevented. Search omissions may instead be prevented by standardizing the search sentence before executing the search. Several levels of standardization are conceivable: standardization of characters, such as unifying half-width and full-width characters; standardization of notation, such as the presence or absence of okurigana and long-vowel marks; and standardization of independent words, such as unifying words having the same meaning into a single independent word. If such standardization is performed before the search, reference to the thesaurus TSR can be omitted or, if still performed, limited.
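As one way to picture the character-level and word-level standardization, the sketch below uses Unicode NFKC normalization from Python's standard `unicodedata` module, which maps full-width Latin letters to ASCII and half-width katakana to full-width. The `CANONICAL` table and the function name `standardize` are hypothetical, and notation-level rules such as okurigana or long-vowel marks are not covered here.

```python
import unicodedata

# Hypothetical table unifying independent words with the same meaning.
CANONICAL = {"pc": "computer"}


def standardize(text, canonical=CANONICAL):
    """Unify character widths, lowercase, and map words to a canonical form."""
    # NFKC folds full-width/half-width variants (including U+3000 spaces).
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(canonical.get(token, token) for token in text.split())


normalized = standardize("ＰＣ　ｐｏｗｅｒ")
```

Applying the same function to both the search sentence and the indexed sentences would let plain string comparison stand in for part of the thesaurus lookup.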
[0058]
Although an embodiment of the present invention has been described above, the present invention is in no way limited to this embodiment and can of course be implemented in various forms without departing from the gist of the invention. For example, the search system of the present embodiment is realized as a client-server system, but it may also be realized on a stand-alone computer. Sites on a network may also be used as the search target. Furthermore, in the above embodiment and examples, the search results are evaluated, the target sentences are arranged, and the result is then output to the client side; however, it is also acceptable to perform only the evaluation of the search results and the arrangement of the target sentences. Various applications are possible: in addition to outputting the evaluated and arranged target sentences, they may, for example, be used as the inference target of an inference engine. A configuration in which the search sentence is input by voice recognition using the microphone 13, or in which the search results are reported by voice, is also possible.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a search system 100 as an embodiment of the present invention.
FIG. 2 is a schematic configuration diagram showing a configuration of a search system 50 as an embodiment of the present invention.
FIG. 3 is an explanatory diagram illustrating a part of a morphological analysis dictionary IDC.
FIG. 4 is an explanatory diagram showing an outline of search processing executed by the search server 20.
FIG. 5 is a flowchart showing a morphological analysis processing routine.
FIG. 6 is an explanatory diagram showing an example of a word array T [t] obtained by morphological analysis.
FIG. 7 is an explanatory diagram showing an example of a phrase array B [b] obtained by dependency analysis.
FIG. 8 is an explanatory diagram showing an example of a partial sentence array S [s] obtained by analyzing a partial sentence.
FIG. 9 is an explanatory diagram showing the composition of words, phrases, and partial sentences.
FIG. 10 is a flowchart showing a partial sentence analysis routine.
FIG. 11 is an explanatory diagram illustrating a determination word string Rm used for partial sentence analysis.
FIG. 12 is an explanatory diagram showing a condition for increasing similarities and a ratio thereof when collating partial sentences.
FIG. 13 is an explanatory diagram illustrating a display example of a search result.
[Explanation of symbols]
10 ... Network
11 ... Keyboard
12 ... Mouse
13 ... Microphone
18 ... Router
20 ... Search server
22 ... CPU
23 ... ROM
24 ... RAM
25 ... Timer
26 ... Display circuit
27 ... Hard disk
29 ... Monitor
30 ... Client
30 ... Search server
50 ... Search system
100 ... Search system
110 ... Network
200 ... Server computer
210 ... Search term receiver
220 ... Search engine
225 ... Knowledge database
230 ... Target sentence analysis part
240 ... Comparison execution unit
250 ... Arrangement part
260 ... Search result output part
300 ... Client computer
310 ... Search sentence input part
320 ... Search sentence analysis part
330 ... Search term output part
340 ... Result display section
DB ... Search target database
IDC ... Morphological analysis dictionary
Rm ... Determination word string
SDI ... Sentence decision rule
TSR ... Thesaurus

Claims

A device for performing a search using a language sentence,
Search sentence input means for inputting a search sentence for search;
A search means for performing a search using the input search sentence;
First classification means for analyzing at least a target sentence, which is a sentence included in the retrieved target, extracting partial sentences, each being a minimum syntactic unit including at least one predicate, and classifying the extracted partial sentences into at least a condition part and a conclusion part, focusing on their role in the sentence;
Second classification means for analyzing the search sentence, extracting the partial sentences, and classifying the extracted partial sentences into at least a condition part and a conclusion part, focusing on their role in the sentence; and
Target sentence evaluation means for determining whether an independent word included in the partial sentences extracted from the search sentence and the target sentence belongs to the classified condition part or conclusion part, assigning to the target sentence a similarity to the search sentence based on the determination result, and arranging the target sentences in descending order of similarity.

The search device according to claim 1,
The first classification means includes
Morphological analysis means for morphologically analyzing the target sentence and cutting out a clause;
A search device comprising: partial sentence extracting means for identifying, by the morphological analysis, conjunctions and connection particles indicating the connection relation of the partial sentences included in the target sentence, and extracting partial sentences using the conjunctions and connection particles.

The search device according to claim 1,
A search apparatus comprising target sentence display means for displaying a target sentence included in a target searched by the search means in a manner corresponding to the degree of similarity given by the target sentence evaluation means.

The search device according to claim 1,
The classification means includes
Morphological analysis means for morphologically analyzing the target sentence and cutting out a clause;
A phrase relation specifying means for analyzing the dependency relation of the extracted phrases and specifying the relation between phrases;
The target sentence evaluation means is means for giving a similarity to the search sentence to the target sentence based on the dependency relationship between the specified clauses.

The search device according to claim 3, wherein the target sentence evaluation unit arranges the target sentences appearing in the conclusion part of the target sentence as sentences having a high degree of similarity.

The search device according to any one of claims 1 to 5,
The search device is a search device for performing a search using a word included in the classified conclusion section.

The search device according to claim 1, wherein, in assigning the similarity, the target sentence evaluation means increases the similarity when an independent word included in the search sentence is included in the target sentence.

The search device according to claim 1, wherein, in assigning the similarity, the target sentence evaluation means increases the similarity when the target sentence includes a word corresponding to a superordinate concept obtained by referring to a thesaurus based on an independent word included in the search sentence.

The search device according to claim 1, wherein the second classifying unit performs a process of removing an unnecessary sentence registered in advance from the search sentence before extracting the partial sentence.

A search system that is realized by a client computer and a server computer connected via a network, and that searches and displays data on another computer connected via the network,
The server computer
Search means for searching for data on another computer connected via the network using a search sentence for search input on the client computer side;
First classification means for analyzing at least a target sentence, which is a sentence included in the retrieved target, extracting partial sentences, each being a minimum syntactic unit including at least one predicate, and classifying the extracted partial sentences into at least a condition part and a conclusion part, focusing on their role in the sentence;
Second classification means for analyzing the search sentence, extracting the partial sentences, and classifying the extracted partial sentences into at least a condition part and a conclusion part, focusing on their role in the sentence;
Target sentence evaluation means for determining whether an independent word included in the partial sentences extracted from the search sentence and the target sentence belongs to the classified condition part or conclusion part, assigning to the target sentence a similarity to the search sentence based on the determination result, and arranging the target sentences in descending order of the similarity;
A search system comprising: transmission means for sending the sorted target sentence in a form that can be displayed on the client computer in a predetermined structure to the client computer that has input the search sentence.

A search system that is realized by a client computer and a server computer connected via a network, and that searches and displays data on another computer connected via the network,
The server computer
Search means for searching for data on another computer connected via the network using a search sentence for search input on the client computer side,
The client computer is
Receiving means for receiving, as a result of the search from the server computer, a target sentence that is a sentence included in the searched target;
First classification means for analyzing the target sentence, extracting partial sentences, each being a minimum syntactic unit including at least one predicate, and classifying the extracted partial sentences into at least a condition part and a conclusion part, focusing on their role in the sentence;
Second classification means for analyzing the search sentence, extracting the partial sentences, and classifying the extracted partial sentences into at least a condition part and a conclusion part, focusing on their role in the sentence;
Target sentence evaluation means for determining whether an independent word included in the partial sentences extracted from the search sentence and the target sentence belongs to the classified condition part or conclusion part, assigning to the target sentence a similarity to the search sentence based on the determination result, and arranging the target sentences in descending order of the similarity;
A search system comprising: display means for displaying the arranged target sentences in a predetermined structure.

A method in which a computer performs a search using a language sentence,
A search sentence for the search is input from input means such as a keyboard,
The computer performs a search using the input search sentence,
The computer analyzes at least a target sentence, which is a sentence included in the retrieved target, extracts partial sentences, each being a minimum syntactic unit including at least one predicate, and classifies the extracted partial sentences into at least a condition part and a conclusion part, focusing on their role in the sentence,
The computer analyzes the search sentence, extracts the partial sentences, and classifies the extracted partial sentences into at least a condition part and a conclusion part, focusing on their role in the sentence,
A search method in which it is determined whether an independent word included in the partial sentences extracted from the search sentence and the target sentence belongs to the classified condition part or conclusion part, a similarity to the search sentence is assigned to the target sentence based on the determination result, and the target sentences are arranged in descending order of the similarity.

A program for realizing a search function using a language sentence on a computer,
A function of inputting a search sentence for the search;
A function of performing a search using the input search sentence;
A function of analyzing at least a target sentence, which is a sentence included in the retrieved target, extracting partial sentences, each being a minimum syntactic unit including at least one predicate, and classifying the extracted partial sentences into at least a condition part and a conclusion part, focusing on their role in the sentence;
A function of analyzing the search sentence, extracting the partial sentences, and classifying the extracted partial sentences into at least a condition part and a conclusion part, focusing on their role in the sentence; and
A program that realizes on the computer a function of determining whether an independent word included in the partial sentences extracted from the search sentence and the target sentence belongs to the classified condition part or conclusion part, assigning to the target sentence a similarity to the search sentence based on the determination result, and arranging the target sentences in descending order of the similarity.

A recording medium on which the program according to claim 13 is recorded.