JP2015232855A

JP2015232855A - Event identity determination method, event identity determination device, and event identity determination program

Info

Publication number: JP2015232855A
Application number: JP2014120061A
Authority: JP
Inventors: Kaname Funakoshi (船越 要); Yoshimasa Koike (小池 義昌)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-06-11
Filing date: 2014-06-11
Publication date: 2015-12-24
Anticipated expiration: 2034-06-11
Also published as: JP6209492B2

Abstract

PROBLEM TO BE SOLVED: To appropriately set a threshold for determining the identity of events expressed by a plurality of documents. SOLUTION: An event DB 102 of an event identity determination device 100 stores event information that specifies events, and a document DB 103 stores document information of the electronic documents from which the events were extracted. A threshold determination unit 104 refers to the DBs 102 and 103 in advance, calculates statistical data from the set of event information and the set of document information, determines a threshold of similarity between electronic documents, and stores it in a threshold storage unit 105. An identity determination unit 101 refers to the event DB 102 and reads out the event information to be judged. Based on that event information, it reads the electronic documents from the document DB 103, calculates the similarity between them, and compares the calculated similarity with the threshold in the threshold storage unit 105 to determine the identity between the electronic documents.

Description

The present invention relates to an information processing technique for determining the identity of the described contents of a plurality of electronic documents (hereinafter referred to as documents).

Techniques have been proposed for extracting event information from text as a way of extracting the described contents of documents such as social text, typified by blogs, that is, the facts described in the documents (hereinafter referred to as “events”).

For example, Non-Patent Document 1 proposes a method that uses structural information to extract the triple of name, place, and date/time contained in a text and stores it as event information. With such a method, event information described in social text documents such as blogs can be stored and reused.

However, no method has been proposed for determining whether multiple pieces of event information extracted from text express the same event. One conceivable approach is to calculate the similarity between the source documents and, when the similarity is at or above a certain level, to judge that the documents describe the same event and treat the extracted event information as identical.

As a method for calculating the similarity between documents, the comparison of keyword vectors used in the field of information retrieval is well known (see Non-Patent Document 2).

Non-Patent Document 1: Yoshihiko Suhara, Jun Suzuki, Seiji Washizaki. Extraction of local event information from text using structured learning (構造学習を用いたテキストからの地域イベント情報抽出). Annual Conference of the Japanese Society for Artificial Intelligence, 2013.
Non-Patent Document 2: Kenji Kita, Kazuhiko Tsuda, Masami Shishibori. Information Retrieval Algorithms (情報検索アルゴリズム). Kyoritsu Shuppan, 2002.

In social text, the same event is frequently described in multiple documents. Therefore, when providing event information extracted from social text, the identity of event information must be determined in order to aggregate or exclude duplicates of the same event. A simple approach would be to regard events as the same when the information expressing them, such as name, place, and date/time, is identical.

However, when the name, place, and date/time of an event are extracted from social text, each of them is often written differently from document to document, so it is frequently difficult to aggregate the same event described in multiple documents as one event.

In addition, when keyword vectors are used directly for identity determination as in Non-Patent Document 2, a threshold for judging identity must be set, yet no method has been proposed for rationally determining that threshold for event information.

The present invention has been made to solve these problems of the prior art, and its object is to appropriately set the threshold used when determining the identity of events expressed by multiple documents.

The event identity determination method of the present invention comprises: a threshold determination step of calculating, in advance, statistical data from a set of event information for specifying events and a set of document information of the documents from which the events were extracted, and determining a threshold of similarity between documents; and an identity determination step of reading out documents based on the event information to be judged, calculating the similarity between the documents, and comparing the calculated similarity with the threshold to determine the identity between the documents.

The event identity determination device of the present invention comprises: a threshold determination unit that calculates, in advance, statistical data from a set of event information for specifying events and a set of document information of the documents from which the events were extracted, and determines a threshold of similarity between documents; and an identity determination unit that reads out documents based on the event information to be judged, calculates the similarity between the documents, and compares the calculated similarity with the threshold to determine the identity between the documents.

The present invention can also be configured as a program that causes a computer to function as the event identity determination device. This program can be provided through a network, a recording medium, or the like.

According to the present invention, the threshold used when determining the identity of events expressed by multiple documents can be set appropriately.

A configuration diagram of the event identity determination device according to an embodiment of the present invention.
A batch processing flowchart of the identity determination unit.
An incremental processing flowchart of the identity determination unit.
A processing flowchart of the threshold determination unit.
A graph of the similarity distribution of document pairs (in steps of 0.01).
An explanatory diagram of FIG. 6.

An event identity determination device according to an embodiment of the present invention is described below. This device uses the feature vectors of the documents describing events to determine event identity, and it optimizes the threshold applied when identity is judged from the similarity between feature vectors.

≪Configuration example≫
A configuration example of the event identity determination device is described with reference to FIG. 1. The event identity determination device 100 is mainly used for providing event information extracted from social text documents such as blogs, and it determines the identity of events in order to aggregate or exclude duplicates of the same event.

The event identity determination device 100 is configured as a computer and has hardware resources such as a CPU, main storage (RAM, ROM, etc.), and auxiliary storage (a hard disk drive, flash SSD, etc.).

Through the cooperation of these hardware resources with software resources, the event identity determination device 100 implements an identity determination unit 101, an event DB 102, a document DB 103, a threshold determination unit 104, and a threshold storage unit 105. The DBs 102 and 103 and the storage unit 105 are assumed to be built on the storage devices.

Table 1 shows an example of the data stored in the event DB 102, which holds event information specifying events. Here, each record of event information consists of an event ID identifying the event, the name of the extracted event, the venue of the event, the date and time of the event, a non-display flag indicating whether the event is to be hidden, the document ID of the source document from which the event was extracted, and the update date and time of the record in the event DB 102.

Table 2 shows an example of the data stored in the document DB 103, which holds document information. Here, each record of document information consists of a document ID (for example, a URL) that can be referenced from the source document ID in the event DB 102, the update date and time of the record in the document DB 103, and the body text. The body text may be the raw text or text that has been morphologically analyzed in advance.
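The two record layouts described above can be sketched as plain data classes. This is only an illustration: the class names, field names, and sample values are hypothetical and are not taken from Tables 1 and 2, and a dictionary keyed by document ID stands in for the document DB 103.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EventRecord:
    """One row of the event DB (field names are illustrative assumptions)."""
    event_id: int
    name: str
    venue: str
    held_at: str            # date/time of the event as extracted from the text
    hidden: int             # non-display flag: 0 = shown, 1 or more = hidden
    source_doc_id: str      # document ID of the source document
    updated_at: datetime


@dataclass
class DocumentRecord:
    """One row of the document DB (field names are illustrative assumptions)."""
    doc_id: str             # for example, a URL
    updated_at: datetime
    body: str               # raw or pre-analyzed body text


def source_document(event, documents):
    """Look up the document an event was extracted from, by document ID."""
    return documents[event.source_doc_id]
```

Keeping the source document ID on the event record is what lets the later steps go from an event to the text whose feature vector is compared.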

The identity determination unit 101 reads the event information stored in the event DB 102, retrieves from the document DB 103 the documents from which the respective events were extracted, and calculates the similarity between the retrieved documents. It then reads the threshold from the threshold storage unit 105 and compares the similarity between documents with the threshold. If the similarity is equal to or greater than the threshold, the documents are judged to be identical and the non-display flag in the event DB 102 is updated.

The threshold determination unit 104 determines the threshold in advance using the data stored in the DBs 102 and 103. The determined threshold is saved in the threshold storage unit 105. The processing of the units 101 and 104 is described below.

≪Identity determination unit 101≫
The identity determination by the identity determination unit 101 is executed as batch processing or incremental processing, depending on how the DBs 102 and 103 are generated. When documents are analyzed periodically and multiple pieces of event information are stored in the event DB 102 at the same time, batch processing is performed.

On the other hand, when event extraction runs continuously on a stream of documents and only one piece of event information is stored in the event DB 102 at a time, incremental processing is performed each time. When incremental processing is used, batch processing may also be run periodically to recompute the identity determination for all event information in the event DB 102.

(1) Batch processing
The batch processing of the identity determination unit 101 is described with reference to FIG. 2. It is assumed that a threshold has already been stored in the threshold storage unit 105.

S201, S202: When processing starts, the identity determination unit 101 first reads the threshold from the threshold storage unit 105 (S201). This threshold is a single numerical value.

Next, it reads event information from the event DB 102 (S202). Selecting and reading only the event information whose non-display flag is "0" reduces the amount of computation.

S203, S204: The identity determination unit 101 reads document information from the document DB 103 based on the source document ID of each piece of event information read in S202 (S203). It then generates a feature vector for each document from the event information and document information, and stores the generated feature vectors in an intermediate file, namely the document feature vector file 210 (S204).

S205: The identity determination unit 101 calculates the similarity between vectors for all feature vectors stored in the feature vector file 210. If the calculated similarity between two vectors is equal to or greater than the threshold "θ", the documents are judged to deal with the same event.

Events judged to be the same are grouped into what is called a same-event group. Within each same-event group, one event is selected based on the update date, and the non-display flag of the selected event is left at "0". The non-display flags of the other events in the group are set to "1" or more, the records in the event DB 102 are updated, and the processing ends.
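The grouping and flag update of S205 can be sketched as follows. The sketch makes several assumptions that the text does not fix: events are held in a dictionary with hypothetical keys `updated` and `hidden`, same-event groups are formed transitively with a union-find structure, the event with the newest update date is the one kept, and the similarity function is passed in by the caller.

```python
from itertools import combinations


def batch_dedupe(events, vectors, similarity, threshold):
    """Group events whose documents are similar and hide all but one per group.

    events:  dict event_id -> {"updated": comparable, "hidden": int}
    vectors: dict event_id -> feature vector of the source document
    Returns a dict mapping a group representative to its member event IDs.
    """
    parent = {eid: eid for eid in events}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair of feature vectors against the threshold.
    for a, b in combinations(events, 2):
        if similarity(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for eid in events:
        groups.setdefault(find(eid), []).append(eid)

    # Assumption: keep the most recently updated event of each group visible.
    for members in groups.values():
        keep = max(members, key=lambda e: events[e]["updated"])
        for eid in members:
            events[eid]["hidden"] = 0 if eid == keep else 1
    return groups
```

The pairwise loop is quadratic in the number of events, which is consistent with the remark that reading only events whose non-display flag is "0" reduces the amount of computation.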

(2) Incremental processing
The incremental processing of the identity determination unit 101 is described with reference to FIG. 3. It is assumed that, before this processing starts, batch processing has already calculated the document feature vectors for the existing event information in the event DB 102, and that these feature vectors have been stored beforehand in an intermediate file, namely the feature vector file 310.

S301: When processing starts, the identity determination unit 101 reads the threshold from the threshold storage unit 105, as in S201. This threshold is also a single numerical value.

S302: Next, the identity determination unit 101 reads one piece of the latest event information from the event DB 102. If no latest, that is, unprocessed, event information is stored at this point, processing is suspended until such event information becomes available.

S303, S304: The identity determination unit 101 reads document information from the document DB 103 based on the source document ID of the event information read in S302 (S303). It then generates a feature vector of the document from the event information and document information, and stores it in the feature vector file 310 (S304).

S305: The identity determination unit 101 calculates the similarity between the feature vector newly stored in S304 and every existing feature vector in the feature vector file 310.

If the similarity between vectors is equal to or greater than the threshold "θ", the documents are judged to deal with the same event. For an event judged to be the same as another event, the non-display flag of its event information is set to "1" or more and the record in the event DB 102 is updated. After this, processing returns to S302.
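One pass of S304 and S305 for a single new event might look like the sketch below. The function name, the dictionary standing in for the feature vector file 310, and the injected similarity function are all assumptions. Unlike the order given in the text, the sketch compares before storing, which gives the same result while avoiding a comparison of the new vector with itself.

```python
def incremental_check(new_id, new_vec, store, similarity, threshold):
    """Compare one new feature vector with every stored vector.

    store: dict event_id -> feature vector (stand-in for the feature vector file)
    Returns the non-display flag for the new event: 1 if it duplicates an
    existing event, otherwise 0. The new vector is added to the store either way.
    """
    duplicate = any(similarity(new_vec, vec) >= threshold for vec in store.values())
    store[new_id] = new_vec
    return 1 if duplicate else 0
```

Each call costs one comparison per stored vector, so the per-event cost grows linearly with the size of the store.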

≪Threshold determination unit 104≫
The processing of the threshold determination unit 104 is described with reference to FIG. 4. When processing starts, the threshold determination unit 104 reads event information from the event DB 102 (S401). It then reads the document information corresponding to this event information from the document DB 103 based on the source document ID (S402).

Statistical data is calculated from the set of event information and the set of document information read in S401 and S402, and the threshold is determined (S403). The determined threshold is stored in the threshold storage unit 105, and the processing ends.

≪Specific processing contents≫
(1) Similarity calculation
As an example of the similarity calculation in the identity determination unit 101 (S205, S305), similarity calculation based on word sets, that is, using keyword weight vectors, is described.

In this similarity calculation, the words contained in a document are treated equally to form a weight vector over keywords, and this weight vector is used as the feature vector of the document. The method described in Non-Patent Document 2 can be adopted to construct the weight vector.

Specifically, for the document information read in S202 and S302, the document text is morphologically analyzed and split into words, a weight vector over the words is constructed, and the similarity between vectors is calculated.

As the method of constructing the weight vector, instead of converting words individually into feature vector elements, a feature vector whose elements are sequences of adjacent words in the document text is constructed. This allows characteristic expressions about an event in forms such as "adjective + noun" (for example, "soulful diva" or "laughter and tears") to become elements of the feature vector.
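A feature vector over adjacent word pairs can be sketched as below. The sketch assumes the text has already been split into words by a morphological analyzer, which is not shown, and it uses raw occurrence counts as weights; the weighting scheme of Non-Patent Document 2 may differ.

```python
from collections import Counter


def bigram_vector(tokens):
    """Count adjacent word pairs; the counts serve as the feature weights."""
    return Counter(zip(tokens, tokens[1:]))
```

For the token list `["soulful", "diva", "live"]` this yields the two elements `("soulful", "diva")` and `("diva", "live")`, each with weight 1, so a phrase such as "soulful diva" is kept as one feature rather than two unrelated words.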

In this case, an index called the cosine distance is used to compare the feature vectors of the documents. For example, the similarity of the feature vectors d_i and d_j of two documents is given by equation (1).

$$\mathrm{sim}(d_i, d_j) = \cos\theta_{ij} = \frac{d_i \cdot d_j}{\|d_i\|\,\|d_j\|} \tag{1}$$

Here, θ_ij represents the angle formed by d_i and d_j, x·y represents the inner product of two vectors, and ||x|| represents the norm of a vector.
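The cosine measure described above, the inner product of two vectors divided by the product of their norms, can be sketched for sparse vectors held as dictionaries. The dictionary representation and the choice to return 0.0 for an all-zero vector are assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_u * norm_v)
```

With non-negative weights such as word counts the result lies between 0 and 1, which matches the similarity range used when the threshold is determined.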

(2) Determination of the threshold
When the similarity calculation using keyword weight vectors described above is used, a similarity threshold must be set for the event identity determination. A method for obtaining the similarity threshold experimentally is described here.

In this method, the similarity between documents is computed for a set of events extracted from a document collection of a certain size, and the threshold is determined from the resulting distribution. Here, from the set of events extracted from blogs written during the three months from October 2013 to January 2014, 10,988 pieces of event information were selected at random, weight vectors were constructed from the frequency of occurrence of words in the documents, and the similarities were calculated.

FIG. 5 plots the resulting similarity distribution, with similarity on the horizontal axis and the number of event pairs (logarithmic) on the vertical axis. If no identical events were included, the graph would be expected to decrease roughly monotonically toward the right. That is, most document pairs referring to different events would be distributed near similarity "0", and document pairs with high similarity (close to "1") would be fewer than pairs with low similarity.

However, as shown in FIG. 6, although most document pairs are indeed distributed near similarity "0" as expected, the number of document pairs was found to increase in the region of similarity above "0.8", with a valley at similarity "0.6 to 0.8".

Therefore, in this case the threshold can be set between similarities "0.6" and "0.8". Document pairs whose similarity happens to reach the threshold by chance may still exist, but they are few relative to the whole and are not considered a problem in practice.

A sliding window can be used to set the threshold from such a distribution. For example, the similarity is binned in steps of "0.01": the number of event pairs with similarity "0 to 0.01" is denoted a_1, the number with similarity "0.01 to 0.02" is denoted a_2, and likewise the number with similarity "0.01(k−1) to 0.01k" is denoted a_k (k = 1, ..., 100 when the similarity is binned in steps of 0.01).
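The binning of pair similarities into fixed-width intervals can be sketched as follows. The function name and the handling of the upper edge are assumptions: a similarity of exactly 1.0 is placed in the last bin, and list index 0 corresponds to the first interval (0 to 0.01 when 100 bins are used).

```python
def similarity_histogram(similarities, n_bins=100):
    """Count similarities in [0, 1] per bin of width 1 / n_bins."""
    counts = [0] * n_bins
    for s in similarities:
        # Clamp so that s == 1.0 lands in the last bin instead of overflowing.
        idx = min(max(int(s * n_bins), 0), n_bins - 1)
        counts[idx] += 1
    return counts
```

Values that sit exactly on a bin boundary can fall on either side because of floating-point rounding, which is harmless for locating a valley that spans many bins.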

For a window size w, "i" is increased in order from "1", and the point at which the sum of the number of event pairs within the window turns to an increase is taken as the threshold. More precisely, the threshold when the similarity range is divided into "n" parts is obtained by equation (2), where "min" denotes the smallest value among the elements of a set.
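One possible reading of this rule is sketched below. It is not a transcription of equation (2): the sketch assumes that the threshold is reported as the center of the last window before the windowed sum first increases, and the function name and zero-based indexing are choices made for the example.

```python
def rising_point_threshold(counts, w):
    """Return the similarity at which the windowed sum first starts to rise.

    counts: number of event pairs per similarity bin (equal-width bins over [0, 1])
    w:      window size in bins
    Returns None when the windowed sum never increases.
    """
    n = len(counts)
    if not 1 <= w <= n:
        raise ValueError("window size must be between 1 and the number of bins")
    sums = [sum(counts[i:i + w]) for i in range(n - w + 1)]
    for i in range(len(sums) - 1):
        if sums[i + 1] > sums[i]:
            # Assumption: use the center of window i as the threshold.
            return (i + w / 2) / n
    return None
```

A wider window suppresses small fluctuations near similarity 0 that would otherwise be mistaken for the upturn, at the cost of a coarser threshold.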

Alternatively, the distribution may likewise be smoothed with a sliding window, and the center of the interval with the lowest frequency taken as the threshold. In this case the threshold is obtained by equation (3).
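The alternative rule can be sketched in the same style. Again this is an interpretation rather than a transcription of equation (3): smoothing is done with a plain windowed sum, ties are resolved in favor of the lowest similarity, and the function name is hypothetical.

```python
def smoothed_minimum_threshold(counts, w):
    """Return the center of the window with the smallest summed count.

    counts: number of event pairs per similarity bin (equal-width bins over [0, 1])
    w:      window size in bins
    """
    n = len(counts)
    if not 1 <= w <= n:
        raise ValueError("window size must be between 1 and the number of bins")
    sums = [sum(counts[i:i + w]) for i in range(n - w + 1)]
    # min() returns the first index on ties, i.e. the lowest similarity.
    i = min(range(len(sums)), key=sums.__getitem__)
    return (i + w / 2) / n
```

Unlike the rising-point rule, this variant always returns a value, so the caller may want to check that the distribution actually has a valley before trusting the result.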

As described above, the event identity determination device 100 obtains experimentally the threshold used when determining the identity of events expressed by multiple documents, and an appropriate threshold can be set mechanically from the experimentally obtained distribution by means of a sliding window.

≪Others / program≫
The present invention is not limited to the above embodiment and can be modified within the scope described in each claim. For example, the threshold determination unit 104 and the threshold storage unit 105 may be hosted in the cloud.

The present invention can also be configured as a document search program that causes a computer to function as some or all of the units 101 to 105 of the event identity determination device 100. This program makes it possible to have a computer execute some or all of S201 to S205, S301 to S305, and S401 to S404.

The program can be provided through a network, for example via a website or e-mail. The program can also be recorded on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, BD-ROM, BD-R, or BD-RE, and stored or distributed. The recording medium is read using a recording medium drive, and because the program code itself realizes the processing of the embodiment, the recording medium also constitutes the present invention.

DESCRIPTION OF SYMBOLS
100 ... Event identity determination device
101 ... Identity determination unit
102 ... Event DB
103 ... Document DB
104 ... Threshold determination unit
105 ... Threshold storage unit
210, 310 ... Feature vector file (intermediate file)

Claims

An event identity determination method in which a computer determines the identity of events described in a plurality of electronic documents, the method comprising:
a threshold determination step of calculating, in advance, statistical data from a set of event information for specifying the events and a set of document information of the electronic documents from which the events were extracted, and determining a threshold of similarity between the electronic documents; and
an identity determination step of reading out electronic documents based on event information to be determined, calculating the similarity between the electronic documents, and comparing the calculated similarity with the threshold to determine the identity between the electronic documents.

The event identity determination method according to claim 1, wherein the threshold determination step comprises:
a step of calculating, in advance, the similarity between documents for a set of electronic documents based on the set of event information;
a step of graphing the distribution of the calculated similarity, with the horizontal axis representing the similarity and the vertical axis representing the number of event pairs; and
a step of setting, as the threshold, the point in the graph at which, when a sliding window over the similarity binned in increments of an arbitrary value is advanced step by step, the number of event pairs per window size turns from a decreasing trend to an increasing trend.

The event identity determination method according to claim 1, wherein the threshold determination step comprises:
a step of calculating, in advance, the similarity between documents for a set of electronic documents based on the set of event information;
a step of graphing the distribution of the calculated similarity, with the horizontal axis representing the similarity and the vertical axis representing the number of event pairs; and
a step of leveling the distribution in the graph with a sliding window over the similarity binned in increments of an arbitrary value, and then setting the center of the section with the lowest frequency as the threshold.

An event identity determination device for determining the identity of events described in a plurality of electronic documents, the device comprising:
a threshold determination unit that calculates, in advance, statistical data from a set of event information for specifying the events and a set of document information of the electronic documents from which the events were extracted, and determines a threshold of similarity between the electronic documents; and
an identity determination unit that reads out electronic documents based on event information to be determined, calculates the similarity between the electronic documents, and compares the calculated similarity with the threshold to determine the identity between the electronic documents.

The event identity determination device according to claim 4, wherein the threshold determination unit comprises:
means for calculating, in advance, the similarity between documents for a set of electronic documents based on the set of event information;
means for graphing the distribution of the calculated similarity, with the horizontal axis representing the similarity and the vertical axis representing the number of event pairs; and
means for setting, as the threshold, the point in the graph at which, when a sliding window over the similarity binned in increments of an arbitrary value is advanced step by step, the number of event pairs per window size turns from a decreasing trend to an increasing trend.

The event identity determination device according to claim 4, wherein the threshold determination unit comprises:
means for calculating, in advance, the similarity between documents for a set of electronic documents based on the set of event information;
means for graphing the distribution of the calculated similarity, with the horizontal axis representing the similarity and the vertical axis representing the number of event pairs; and
means for leveling the distribution in the graph with a sliding window over the similarity binned in increments of an arbitrary value, and then setting the center of the section with the lowest frequency as the threshold.

An event identity determination program that causes a computer to function as the event identity determination device according to any one of claims 4 to 6.