JPH0785101A - Keyword extract processing unit - Google Patents

Keyword extract processing unit

Info

Publication number
JPH0785101A
Authority
JP
Japan
Prior art keywords
value
word
importance
processing unit
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP5232751A
Other languages
Japanese (ja)
Inventor
Hiroshi Onodera
浩 小野寺
Masaki Hosoi
正樹 細井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu FIP Corp
Original Assignee
Fujitsu FIP Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu FIP Corp
Priority to JP5232751A
Publication of JPH0785101A

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PURPOSE: To provide a keyword extraction processing device that adds the semantic attributes of words to the extraction conditions and evaluates them comprehensively together with features of the surface structure of the text. CONSTITUTION: A syntax analysis unit 1 performs a predetermined syntax analysis on a given document. A weighting unit 2 assigns a cumulative importance value to each word or phrase produced by the analysis. Within the weighting unit, a semantic processing unit 4 adds to the cumulative importance value a predetermined weight based on the semantic attribute of each word; a syntax processing unit 5 adds a predetermined weight based on each word's part of speech and case; and an appearance state processing unit 6 adds a predetermined weight based on how each word appears in the document. An extraction processing unit 3 then outputs as keywords the words and phrases whose cumulative importance value exceeds a threshold.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a keyword extraction processing device that extracts from a document, as keywords, the words and phrases most closely related to the content its text expresses.

[0002]

[Prior Art and Problems to Be Solved by the Invention] When a person picks out keywords from a text, the operator reads the text, understands what is written, and, based on that understanding, intuitively or empirically extracts the words and phrases judged to be closely related to the content.

[0003] Extracting high-quality keywords this way therefore requires highly skilled operators, makes quality hard to maintain, and tends to increase cost and time. For these reasons, automating keyword extraction has been considered as a way to extract keywords efficiently from large volumes of document data.

[0004] For example, Japanese Unexamined Patent Publication No. 03-127176, "Keyword Extraction Device," describes a method that extracts compound-word keywords by focusing on affixes, and Japanese Unexamined Patent Publication No. 03-135669, "Automatic Keyword Extraction System," describes a method that extracts keywords based on the number of occurrences of syntactic subjects and objects.

[0005] Japanese Unexamined Patent Publication No. 02-32469, "Information Retrieval Method," describes a method that focuses on the syntactic dependency structure and uses the depth of that structure as a factor representing importance in keyword extraction.

[0006] As described above, all of these methods use as factors for keyword extraction only matters relating to the surface structure of the text, such as the syntactic position of a word or phrase and its number of occurrences, which are not directly related to what the word or phrase means.

[0007] Consequently, unlike a person who understands the content of the text and considers how that content relates to the meaning of each word or phrase, these methods are prone to omissions and superfluous extractions from a semantic standpoint. The object of the present invention is to provide a keyword extraction processing device that automates keyword extraction by taking the semantic attribute of words as one of the extraction conditions and evaluating it comprehensively together with matters relating to the surface structure of the text.

[0008]

[Means for Solving the Problems] FIG. 1 is a block diagram showing the configuration of the present invention. The figure shows a keyword extraction processing device comprising a syntax analysis unit 1, a weighting unit 2, and an extraction processing unit 3.

[0009] The syntax analysis unit 1 performs syntax analysis on a given document, divides the character strings that make up the document into words and phrases (single words, compound words, and phrases), and determines the part of speech and case of each.

[0010] The weighting unit 2 comprises a semantic processing unit 4, a syntax processing unit 5, and an appearance state processing unit 6. It assigns a cumulative importance value, set to an initial value, to each word or phrase in the syntax analysis result. The semantic processing unit 4 then adds to the cumulative importance value of each word or phrase a predetermined weight based on its semantic attribute.

[0011] The syntax processing unit 5 adds to the cumulative importance value of each word or phrase predetermined weights based on its part of speech and on its case. The appearance state processing unit 6 adds to the cumulative importance value of each word or phrase a predetermined weight based on how it appears in the document.

[0012] From the results produced by the weighting unit 2, the extraction processing unit 3 selects the cumulative importance values that exceed a predetermined threshold and outputs the corresponding words and phrases as keywords.
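The division of labor among the units described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Term` class, the `Scorer` type, the three stand-in scorer functions, and all weights are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Term:
    """One word, compound word, or phrase produced by syntax analysis."""
    text: str
    score: int = 0  # cumulative importance value, starting at the initial value


# A scorer returns the weight that one evaluation contributes for a term.
Scorer = Callable[[Term], int]


def weight_terms(terms: Iterable[Term], scorers: List[Scorer]) -> List[Term]:
    """Weighting unit 2: add every scorer's weight to each term's score."""
    result = []
    for term in terms:
        for scorer in scorers:
            term.score += scorer(term)
        result.append(term)
    return result


def extract_keywords(terms: Iterable[Term], threshold: int) -> List[str]:
    """Extraction unit 3: keep terms whose score is strictly above the threshold."""
    return [t.text for t in terms if t.score > threshold]


# Hypothetical stand-ins for units 4, 5, and 6 (weights are placeholders).
def semantic(t: Term) -> int:
    return 3 if t.text == "treaty" else 0


def syntactic(t: Term) -> int:
    return 2


def appearance(t: Term) -> int:
    return 1


terms = weight_terms([Term("treaty"), Term("meeting")],
                     [semantic, syntactic, appearance])
print(extract_keywords(terms, threshold=4))
```

Treating each evaluation as an independent function that returns a weight keeps the accumulation step trivial, which matches the additive scheme the text describes.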

[0013] In the second invention, the appearance state processing unit 6 applies to the cumulative importance value of each word or phrase a predetermined weight based on its position in the document and a predetermined weight based on how often identically written words or phrases occur in the document.

[0014] In the third invention, the appearance state processing unit 6 of the first or second invention treats predetermined symbols as emphasis symbols and applies to the corresponding cumulative importance value a predetermined weight based on the positional relationship between an emphasis symbol appearing in the document and the word or phrase.

[0015] In the fourth invention, the semantic processing unit 4 of the first, second, or third invention holds, for each predetermined field, predetermined weights based on the semantic attributes of words and phrases, and selects the weights to apply to the cumulative importance value according to the field specified for the document.

[0016]

[Operation] The keyword extraction processing device of the present invention parses the document from which keywords are to be extracted and evaluates the importance of each word or phrase (word, compound word, or phrase) from its syntactic role.

[0017] In addition, it evaluates the semantic attribute of each word or phrase, where necessary against criteria specific to the field of the document's content. To combine the syntactic and semantic evaluations, a weight is defined for each evaluation item; the weights from the individual evaluations are accumulated for each word or phrase to give its cumulative importance value, and the words and phrases with large cumulative importance values are extracted as keywords.

[0018] The present invention thus makes it possible, in automatic keyword extraction, to evaluate the importance of words and phrases while taking the semantic content of the document into account.

[0019]

[Embodiment] As an example of the keyword extraction processing device of the present invention, a device that extracts keywords from Japanese documents is described below.

[0020] In this case, the syntax analysis unit 1 of FIG. 1 performs syntax analysis on the input Japanese document. As in ordinary translation processing and the like, this syntax analysis determines the grammatical structure of each sentence of the document by referring to a built-in word dictionary and similar resources.

[0021] As a result of this syntax analysis, the syntax analysis unit 1 breaks each sentence down into words, determines the part of speech of each word and the compound words and phrases formed by sequences of words, and determines the syntactic case of the words, compound words, and phrases that require it.

[0022] The weighting unit 2 receives the analysis result from the syntax analysis unit 1 and first assigns a cumulative importance value to each word or phrase that needs one, such as nouns. The initial value of each cumulative importance value is 0, and in the processing that follows, the weight determined by each evaluation is added to the corresponding cumulative importance value.

[0023] The weighting unit 2 first uses the semantic processing unit 4 to evaluate importance from the semantic attributes of words. For this purpose, the semantic processing unit 4 holds, for the relevant words such as nouns, a word table giving their semantic attributes, as shown in FIG. 2(a), and a semantic attribute table giving the weight of each semantic attribute, as shown in FIG. 2(b).

[0024] Semantic attribute tables are best provided separately for each relevant field. If the semantic attribute table of FIG. 2(b) gives, for example, the weights for the field of diplomacy, then a table of weights such as that shown in FIG. 2(c) is prepared for, say, the information industry field.

[0025] For each relevant word in the syntax analysis result, the semantic processing unit 4 looks up the word table to obtain the word's semantic attribute, looks up the semantic attribute table with that attribute to obtain a weight, and adds the weight to the word's cumulative importance value.
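The two-step lookup described above (word to semantic attribute, then attribute to a field-specific weight) might be sketched as follows. The table contents, attribute names, field keys, and weights are all hypothetical, since the figures that define them are not reproduced here; treating unknown words as contributing 0 is also an assumption.

```python
from typing import Dict, Optional

# Hypothetical word table: word -> semantic attribute (cf. FIG. 2(a)).
WORD_TABLE: Dict[str, str] = {
    "米": "country",
    "核": "weapon",
    "計算機": "machine",
}

# Hypothetical per-field semantic attribute tables (cf. FIG. 2(b) and 2(c)).
ATTRIBUTE_WEIGHTS: Dict[str, Dict[str, int]] = {
    "diplomacy": {"country": 3, "weapon": 2, "machine": 0},
    "information_industry": {"country": 0, "weapon": 0, "machine": 3},
}


def semantic_weight(word: str, field: str) -> int:
    """Return the weight of a word's semantic attribute in the given field.

    Words missing from the word table, or attributes missing from the
    field's table, contribute 0 (an assumption; the source does not say).
    """
    attribute: Optional[str] = WORD_TABLE.get(word)
    if attribute is None:
        return 0
    return ATTRIBUTE_WEIGHTS[field].get(attribute, 0)


print(semantic_weight("米", "diplomacy"))
print(semantic_weight("米", "information_industry"))
```

Keeping one weight table per field means the same word table can be reused across fields, and only the field designation passed with the document changes the result.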

[0026] Next, the weighting unit 2 uses the syntax processing unit 5 to evaluate importance based on the part of speech and case of each word or phrase, and to determine the cumulative importance values of compound words and phrases. For this purpose, the syntax processing unit 5 holds a part-of-speech table giving part-of-speech names and their weights, as shown in FIG. 3(a); a phrase table giving the compositions of compound words and phrases and their weights, as in FIG. 3(b); and a case table giving syntactic cases and their weights, as in FIG. 3(c).

[0027] The syntax processing unit 5 first looks up the part-of-speech table for each word using the part-of-speech name given in the syntax analysis result, obtains a weight, and adds it to the word's cumulative importance value.

[0028] Next, for each compound word and phrase in the syntax analysis result, the syntax processing unit 5 takes the largest of the cumulative importance values of its constituent words, adds the weight determined from the phrase table, and sets the result as the cumulative importance value of that compound word or phrase.
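The rule for compound words and phrases (largest constituent value plus a phrase-table weight) can be sketched as below. The phrase table's keying by part-of-speech pattern and its entries are assumptions made for illustration; the source only states that the table gives compositions and weights.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase table: pattern of constituent parts of speech -> weight
# (cf. FIG. 3(b)). The entries here are placeholders.
PHRASE_TABLE: Dict[Tuple[str, ...], int] = {
    ("noun", "noun"): 1,
    ("noun", "noun", "noun"): 1,
}


def compound_score(constituents: List[Tuple[str, int]]) -> int:
    """Score a compound from (part_of_speech, cumulative value) pairs.

    The result is the largest constituent value plus the phrase-table
    weight for the compound's part-of-speech pattern (0 if not listed).
    """
    if not constituents:
        raise ValueError("a compound needs at least one constituent word")
    pattern = tuple(pos for pos, _ in constituents)
    best = max(score for _, score in constituents)
    return best + PHRASE_TABLE.get(pattern, 0)


print(compound_score([("noun", 5), ("noun", 7)]))
```

Because the compound inherits the maximum rather than the sum, a long compound does not outrank its parts merely by having more constituents.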

[0029] The syntax processing unit 5 then adds, for each case indicated in the syntax analysis result, the weight determined from the case table to the cumulative importance value of the corresponding word or phrase. The weighting unit 2 also uses the appearance state processing unit 6 to evaluate the importance of each word or phrase based on how it appears in the document.

[0030] The appearance state processing unit 6 distinguishes, for example, three cases for the word or phrase being processed: it is in the title; it is in the body and within the first part, say the first 200 characters; or it is later in the body. The weights for these cases are, for example, 2, 1, and 0. The unit identifies the position of the word or phrase in the document and adds the corresponding weight to its cumulative importance value.
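The three-case position rule above is simple enough to state directly in code. The sketch below uses the example weights 2, 1, and 0 and the 200-character limit from the text; the function name, its parameters, and the choice to measure position by the 0-based offset of the term's first character are assumptions.

```python
def position_weight(in_title: bool, char_offset: int = 0,
                    early_limit: int = 200) -> int:
    """Return 2 for a title term, 1 for an early body term, otherwise 0.

    `char_offset` is the 0-based offset of the term's first character
    within the body; it is ignored for title terms.
    """
    if in_title:
        return 2
    if char_offset < early_limit:
        return 1
    return 0


print(position_weight(in_title=True))
print(position_weight(in_title=False, char_offset=350))
```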

[0031] The appearance state processing unit 6 also holds an emphasis symbol table, as shown in FIG. 3(d), that lists the symbols defined as emphasis symbols, such as brackets, underlining, and special character font designations, together with their weights. When an emphasis symbol listed in the table is present, the weight obtained from the table is added to the cumulative importance value of the word or phrase it emphasizes.

[0032] When emphasis is given by brackets, the whole of the wording inside the brackets is the target of the emphasis; when it is given by a character font such as bold or by underlining, the marked wording is the target.
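For bracket-style emphasis, one way to find the emphasized wording is a regular-expression scan, sketched below. The bracket pairs, their weights, and the decision to credit every term contained in a bracketed span are assumptions; underline and font emphasis depend on document markup and are left out.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical emphasis symbol table: (opening, closing) bracket pair -> weight
# (cf. FIG. 3(d)). The weights are placeholders.
EMPHASIS_BRACKETS: Dict[Tuple[str, str], int] = {
    ("『", "』"): 1,
    ("「", "」"): 1,
}


def emphasized_spans(text: str) -> List[Tuple[str, int]]:
    """Return (bracketed wording, weight) for each bracket pair found in text."""
    spans = []
    for (opening, closing), weight in EMPHASIS_BRACKETS.items():
        # Non-greedy match so that each pair is closed at the nearest bracket.
        pattern = re.escape(opening) + r"(.*?)" + re.escape(closing)
        for match in re.finditer(pattern, text):
            spans.append((match.group(1), weight))
    return spans


def emphasis_weight(term: str, text: str) -> int:
    """Sum the weights of every bracketed span that contains the term."""
    return sum(w for span, w in emphasized_spans(text) if term in span)


sentence = "米・ソが『中距離核禁止』で合意"
print(emphasized_spans(sentence))
print(emphasis_weight("核", sentence))
```

A real implementation would work on token positions from the syntax analysis rather than substring matching, which cannot tell apart two occurrences of the same term inside and outside the brackets.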

[0033] Next, the appearance state processing unit 6 counts, over all the words and phrases in the syntax analysis results, the frequency of occurrence of each identically written word or phrase, and adds a weight corresponding to that frequency to each cumulative importance value. The weight corresponding to the frequency of occurrence is determined, for example, as follows.

[0034] With N as the total number of characters in the document:

  frequency ≤ (5/200) × N                   weight = 0
  (5/200) × N < frequency ≤ (5/100) × N     weight = 1
  (5/100) × N < frequency                   weight = 2

FIG. 4 illustrates the state after the above processing for the example sentence 「米・ソが『中距離核禁止』で合意」 ("US and USSR agree on 'intermediate-range nuclear ban'"). As the syntax analysis result from the syntax analysis unit 1, the sentence is divided into words as shown in the word-division row of the figure. "Diplomacy" is assumed to be specified as the field for this document.

[0035] The syntax analysis unit 1 further determines the part of speech of each word as shown in the part-of-speech row of the figure, indicates that three compound words are formed as shown in the compound word/phrase row, identifies the case of the relevant words and phrases as shown in the case row, and passes these analysis results to the weighting unit 2.

[0036] The weighting unit 2 then determines the semantic attribute of each word by referring to the word table (FIG. 2(a)), as shown in the semantic attribute row of the figure. Referring to the semantic attribute table for the diplomacy field (FIG. 2(b)) and the part-of-speech table (FIG. 3(a)), it determines the weights shown as numbers in parentheses in the figure and adds them to the respective cumulative importance values.

[0037] As for position, this example sentence is assumed to be the title, so the position weight for every word is 2, as described above. Next, the cumulative importance values with the above weights added are used to determine the cumulative importance values of the compound words and phrases as described above. For each of the three compound words, the largest cumulative importance value among its constituent words is taken; since all constituents are nouns in this case, the weight 1 determined from the phrase table (FIG. 3(b)) is added, and the result is set as the cumulative importance value shown in the compound word/phrase row of FIG. 4.

[0038] In addition, the weights shown in the case row and the emphasis symbol weight row of FIG. 4 are added: for the cases in the syntax analysis result, from the case table (FIG. 3(c)), and for emphasis symbols, from the emphasis symbol table (FIG. 3(d)).

[0039] In this example, 「米」 (US) and 「ソ」 (USSR) are joined by the coordinating symbol 「・」, so they are identified as being of equal importance, and the nominative case weight is applied to both words.
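The sharing of a case weight across coordinated words can be sketched as follows. This assumes the syntax analysis attaches the case to the last word of a chain joined by 「・」; the `Token` class, the case names, and the weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical case table (cf. FIG. 3(c)); the weights are placeholders.
CASE_WEIGHTS: Dict[str, int] = {"nominative": 2, "accusative": 1}

COORDINATOR = "・"


@dataclass
class Token:
    surface: str
    case: Optional[str] = None  # case assigned by syntax analysis, if any
    score: int = 0              # cumulative importance value


def add_case_weights(tokens: List[Token]) -> None:
    """Add each case weight to its token and to the tokens coordinated with it."""
    for i, token in enumerate(tokens):
        if token.case is None:
            continue
        weight = CASE_WEIGHTS.get(token.case, 0)
        token.score += weight
        # Walk left over "word ・ word" chains so coordinated words share the weight.
        j = i
        while j >= 2 and tokens[j - 1].surface == COORDINATOR:
            tokens[j - 2].score += weight
            j -= 2


tokens = [Token("米"), Token("・"), Token("ソ", case="nominative"), Token("が")]
add_case_weights(tokens)
print([(t.surface, t.score) for t in tokens])
```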

[0040] For frequency of occurrence, assume that this example sentence belongs to a 300-character document and that counting the occurrences of each identically written word or phrase over the whole document gives the values shown in the frequency row of FIG. 4.

[0041] Applying the weight conditions above to these frequencies, a frequency of (5/200) × 300 = 7.5 or less gives a weight of 0, a frequency above 7.5 and up to (5/100) × 300 = 15 gives a weight of 1, and a frequency above 15 gives a weight of 2. This determines the weights shown in parentheses in the figure.
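The frequency thresholds and the 300-character worked example above can be checked with a short function. The function and parameter names are illustrative; the thresholds (5/200 and 5/100 of the character count) and the weights 0, 1, and 2 are the example values given in the text.

```python
def frequency_weight(frequency: int, total_chars: int) -> int:
    """Map a term's occurrence count to a weight of 0, 1, or 2."""
    low = total_chars * 5 / 200   # 2.5% of the document's character count
    high = total_chars * 5 / 100  # 5% of the document's character count
    if frequency <= low:
        return 0
    if frequency <= high:
        return 1
    return 2


# For a 300-character document the boundaries fall at 7.5 and 15.
for count in (7, 8, 15, 16):
    print(count, frequency_weight(count, total_chars=300))
```

Scaling the thresholds by document length keeps the weight comparable between short and long documents, at the cost of counting occurrences against characters rather than against the number of terms.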

[0042] The result of adding the above weights to each cumulative importance value is shown in the bottom row of FIG. 4, the cumulative importance value row. From these values, the extraction processing unit 3 selects those greater than a predetermined threshold, for example 7, and outputs the corresponding words and phrases as keywords.

[0043] In this example, therefore, the extraction processing unit 3 extracts 「米」 (US), 「ソ」 (USSR), 「核」 (nuclear), 「禁止」 (ban), 「合意」 (agreement), 「中距離核」 (intermediate-range nuclear), and 「中距離核禁止」 (intermediate-range nuclear ban) as keywords. FIG. 5 shows an example of the processing flow of the present invention. First, in processing step 10, the syntax analysis unit 1 receives the document and the field designation.

[0044] Next, the syntax analysis unit 1 reads one sentence of the specified document in processing step 11. If processing step 12 finds that the end of the document has not been reached and a sentence was read, it proceeds to processing step 13, performs the syntax analysis described above, and passes the analysis result and the field to the weighting unit 2.

[0045] In processing step 14, the weighting unit 2 initializes and assigns a cumulative importance value to each word or phrase. In processing step 15, the semantic processing unit 4 adds the weight for the semantic attribute of each word according to the specified field, as described above. In processing step 16, the appearance state processing unit 6 adds the weight for the position of each word, as described above.

[0046] Next, the syntax processing unit 5 adds the part-of-speech weight for each word in processing step 17, sets the cumulative importance values of compound words and phrases in processing step 18, and adds the case weights for the relevant words and phrases in processing step 19, each as described above.

[0047] Then, in processing step 20, the appearance state processing unit 6 adds the emphasis symbol weights for the relevant words and phrases, as described above. Next, in processing step 21, the weighting unit 2 counts the characters in the sentence and adds the count to a running character total, then returns to processing step 11 so that the syntax analysis unit 1 analyzes the next sentence.

[0048] The processing is repeated in this way for each sentence. Finally, when the syntax analysis unit 1 determines in processing step 12 that every sentence of the document has been processed, it notifies the weighting unit 2 that the document has ended.

[0049] The weighting unit 2 then, in processing step 22, groups the processing results for the words and phrases of all sentences by identical notation, and in processing step 23 uses the character total accumulated earlier as the document's total character count N to calculate the frequency-based weight conditions described above.

[0050] The following processing is repeated for each group of identically written words and phrases, with processing step 24 checking whether all groups have been processed. In processing step 25, the appearance state processing unit 6 counts the number of words or phrases in one group as its frequency of occurrence.

[0051] Next, in processing step 26, the frequency-based weight is determined from the conditions calculated above. In processing step 27, the largest cumulative importance value in the group is taken as the cumulative importance value for that notation, the frequency-based weight is added to it, and the notation and its cumulative importance value are passed to the extraction processing unit 3.

[0052] In processing step 28, the extraction processing unit 3 compares the received cumulative importance value with the predetermined threshold. Only if the value exceeds the threshold does it output the received word or phrase as a keyword, in processing step 29. Processing then returns to step 24, and the weighting unit 2 processes the next group of identically written words and phrases.
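Processing steps 22 through 29 (group by notation, take the group maximum, add the frequency weight, compare with the threshold) can be sketched in a few lines. The function names and the sample data are hypothetical, and the per-occurrence scores are assumed to have been produced already by the earlier steps.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def frequency_weight(count: int, total_chars: int) -> int:
    """Example thresholds from the text: 2.5% and 5% of the character count."""
    if count <= total_chars * 5 / 200:
        return 0
    if count <= total_chars * 5 / 100:
        return 1
    return 2


def select_keywords(scored: Iterable[Tuple[str, int]],
                    total_chars: int, threshold: int) -> List[str]:
    """scored holds one (notation, cumulative importance value) pair per occurrence."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for notation, score in scored:              # step 22: group by notation
        groups[notation].append(score)
    keywords = []
    for notation, scores in groups.items():     # step 24: loop over groups
        count = len(scores)                     # step 25: occurrence count
        final = max(scores) + frequency_weight(count, total_chars)  # steps 26-27
        if final > threshold:                   # step 28: compare with threshold
            keywords.append(notation)           # step 29: output as keyword
    return keywords


# Hypothetical per-occurrence scores for a 300-character document.
occurrences = [("核", 7), ("核", 5)] + [("核", 5)] * 6 + [("会談", 7), ("会談", 3)]
print(select_keywords(occurrences, total_chars=300, threshold=7))
```

In this sample, the eight occurrences of 「核」 earn a frequency weight of 1 and lift its best score from 7 to 8, while 「会談」 stays at 7 and is not selected because the comparison is strict.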

[0053]

[Effects of the Invention] As is clear from the above description, according to the present invention, keyword extraction parses the document and evaluates the importance of each word or phrase from its syntactic role and the like, and also evaluates the semantic attribute of each word or phrase against criteria specific to the field of the document's content. Because these evaluations are combined into a cumulative importance value, the importance of words and phrases can be evaluated with the semantic content of the document taken into account, and appropriate keywords can be extracted automatically, which is a significant industrial benefit.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of the present invention.

FIG. 2 is a diagram explaining the semantic attribute table and related tables.

FIG. 3 is a diagram explaining the part-of-speech table and related tables.

FIG. 4 is a diagram explaining an example of the processing of the present invention.

FIG. 5 is a flowchart of the processing of the present invention.

[Explanation of Reference Numerals]

1 Syntax analysis unit
2 Weighting unit
3 Extraction processing unit
4 Semantic processing unit
5 Syntax processing unit
6 Appearance state processing unit
10-29 Processing steps

Claims (4)

[Claims]
1. A keyword extraction processing device comprising a syntax analysis section (1), a weighting section (2), and an extraction processing section (3), wherein:
the syntax analysis section (1) performs syntax analysis on a given document, divides the character strings constituting the document into terms including words, compound words, and phrases, and determines the part of speech and the case of each term;
the weighting section (2) has a semantic processing section (4), a syntax processing section (5), and an appearance state processing section (6), and assigns an initial value to an importance accumulation value for each term in the syntax analysis result;
the semantic processing section (4) adds to the importance accumulation value of each term a predetermined weight value based on the semantic attribute of that term;
the syntax processing section (5) adds to the importance accumulation value of each term predetermined weight values based respectively on the part of speech and the case of that term;
the appearance state processing section (6) adds to the importance accumulation value of each term a predetermined weight value based on the appearance state of that term in the document; and
the extraction processing section (3) selects, from the result of the processing by the weighting section (2), each importance accumulation value that is larger than a predetermined threshold value, and outputs the term corresponding to each selected importance accumulation value as a keyword.
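For illustration only, and not as part of the claims, the weighting and threshold selection recited in claim 1 might be sketched as follows. The `Term` record, the weight tables, the cap in `appearance_weight`, and all numeric values are hypothetical; the claim states only that the weights and the threshold are predetermined.

```python
from dataclasses import dataclass


@dataclass
class Term:
    text: str
    pos: str            # part of speech decided by the syntax analysis
    case: str           # grammatical case decided by the syntax analysis
    semantic: str       # semantic attribute of the term
    occurrences: int = 1


# Hypothetical predetermined weight tables (values are assumptions).
SEMANTIC_WEIGHTS = {"technology": 3, "generic": 0}
POS_WEIGHTS = {"noun": 2, "verb": 0}
CASE_WEIGHTS = {"subject": 2, "object": 1}


def appearance_weight(term):
    # Placeholder appearance-state rule: one point per repeat, capped at 3.
    return min(term.occurrences - 1, 3)


def extract_keywords(terms, threshold, initial=0):
    """Accumulate the weights for each term and keep those above the threshold."""
    keywords = []
    for term in terms:
        score = initial                                   # initial value
        score += SEMANTIC_WEIGHTS.get(term.semantic, 0)   # semantic processing (4)
        score += POS_WEIGHTS.get(term.pos, 0)             # syntax processing (5)
        score += CASE_WEIGHTS.get(term.case, 0)
        score += appearance_weight(term)                  # appearance state (6)
        if score > threshold:                             # extraction (3)
            keywords.append(term.text)
    return keywords
```

With these assumed tables, a noun in subject position with the semantic attribute "technology" that occurs three times accumulates 3 + 2 + 2 + 2 = 9 and passes a threshold of 5, while a single-occurrence generic verb in object position accumulates only 1.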
2. The keyword extraction processing device according to claim 1, wherein the appearance state processing section (6) applies, for each term, a predetermined weight value based on the position at which the term appears in the document and a predetermined weight value based on the frequency with which terms of the same notation appear in the document, to the corresponding importance accumulation value.
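For illustration only, the two appearance-state weights of claim 2 might be computed as in the sketch below. The rules themselves (a bonus for terms in the first tenth of the document, one point per repeat capped at 3) and the use of the first occurrence as the position are assumptions; the claim requires only that both weights be predetermined.

```python
from collections import Counter


def position_weight(index, total):
    # Hypothetical rule: a term in the first tenth of the document gets 2 points.
    return 2 if index < max(1, total // 10) else 0


def frequency_weight(count):
    # Hypothetical rule: one point per repeated occurrence, capped at 3.
    return min(count - 1, 3)


def appearance_scores(terms):
    """Map each distinct term to its position weight plus its frequency weight.

    `terms` is the document as an ordered list of term strings; the position
    of a term is taken to be that of its first occurrence (an assumption).
    """
    counts = Counter(terms)
    total = len(terms)
    scores = {}
    for index, term in enumerate(terms):
        if term not in scores:
            scores[term] = position_weight(index, total) + frequency_weight(counts[term])
    return scores
```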
3. The keyword extraction processing device according to claim 1 or claim 2, wherein the appearance state processing section (6) treats predetermined symbols as emphasis symbols, and applies, to the corresponding importance accumulation value, a predetermined weight value based on the positional relationship between the position at which an emphasis symbol appears in the document and the term.
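For illustration only, one possible reading of the positional relationship in claim 3 is that a term is emphasized when it is directly enclosed by a pair of emphasis symbols. The symbol pairs, the weight value, and this enclosure rule are all assumptions made for the sketch.

```python
# Hypothetical emphasis symbols: opening symbol -> matching closing symbol.
EMPHASIS_PAIRS = {"「": "」", '"': '"'}
EMPHASIS_WEIGHT = 2  # assumed predetermined weight


def emphasis_weight(document, term):
    """Return EMPHASIS_WEIGHT if some occurrence of `term` is directly
    enclosed by a matching pair of emphasis symbols, otherwise 0."""
    start = document.find(term)
    while start != -1:
        end = start + len(term)
        if start > 0 and end < len(document):
            opener = document[start - 1]
            if EMPHASIS_PAIRS.get(opener) == document[end]:
                return EMPHASIS_WEIGHT
        start = document.find(term, start + 1)
    return 0
```

Other relationships, such as a term immediately following an emphasis symbol, would fit the claim language equally well and would only change the test inside the loop.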
4. The keyword extraction processing device according to claim 1, claim 2, or claim 3, wherein the semantic processing section (4) holds, for each predetermined field, weight values predetermined based on the semantic attributes of terms, and selects the weight value to be applied to the importance accumulation value according to the field specified for the document.
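For illustration only, the field-dependent weight selection of claim 4 might be held as one weight table per field. The field names, semantic attributes, and values below are hypothetical.

```python
# Hypothetical per-field weight tables: field -> semantic attribute -> weight.
FIELD_WEIGHTS = {
    "medicine": {"disease": 3, "organization": 0},
    "business": {"disease": 0, "organization": 3},
}


def semantic_weight(semantic_attribute, field):
    """Select the weight for a semantic attribute from the table of the field
    specified for the document; unknown fields or attributes contribute 0."""
    table = FIELD_WEIGHTS.get(field, {})
    return table.get(semantic_attribute, 0)
```

The same term can therefore receive different weights depending on the field specified for the document, for example 3 for a "disease" term in a medical document and 0 in a business document.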
JP5232751A 1993-09-20 1993-09-20 Keyword extract processing unit Pending JPH0785101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP5232751A JPH0785101A (en) 1993-09-20 1993-09-20 Keyword extract processing unit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP5232751A JPH0785101A (en) 1993-09-20 1993-09-20 Keyword extract processing unit

Publications (1)

Publication Number Publication Date
JPH0785101A (en) 1995-03-31

Family

ID=16944191

Family Applications (1)

Application Number Title Priority Date Filing Date
JP5232751A Pending JPH0785101A (en) 1993-09-20 1993-09-20 Keyword extract processing unit

Country Status (1)

Country Link
JP (1) JPH0785101A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10105555A (en) * 1996-09-26 1998-04-24 Sharp Corp Translation-with-original example sentence retrieving device
JP2011014010A (en) * 2009-07-03 2011-01-20 Nec Corp Information assessment system, information assessment method and program
US10198426B2 (en) 2014-07-28 2019-02-05 International Business Machines Corporation Method, system, and computer program product for dividing a term with appropriate granularity
JP2022079442A (en) * 2020-11-16 2022-05-26 深▲ゼン▼市世強元件網絡有限公司 Method and system for identifying user search scene

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6446831A (en) * 1987-08-17 1989-02-21 Nippon Telegraph & Telephone Automatic key word extracting device
JPH01112331A (en) * 1987-10-26 1989-05-01 Nippon Telegr & Teleph Corp <Ntt> Automatic evaluation device for significance of key word
JPH038070A (en) * 1989-04-21 1991-01-16 Hitachi Ltd Keyword extracting system
JPH05135107A (en) * 1991-11-14 1993-06-01 Ricoh Co Ltd Document retrieval device

Similar Documents

Publication Publication Date Title
Ahonen et al. Applying data mining techniques for descriptive phrase extraction in digital document collections
US5369577A (en) Text searching system
Kraaij et al. Porter’s stemming algorithm for Dutch
US5708829A (en) Text indexing system
US5323316A (en) Morphological analyzer
US5890103A (en) Method and apparatus for improved tokenization of natural language text
Keller et al. Using the web to obtain frequencies for unseen bigrams
JP3266246B2 (en) Natural language analysis apparatus and method, and knowledge base construction method for natural language analysis
US5878386A (en) Natural language parser with dictionary-based part-of-speech probabilities
US7949676B2 (en) Information search system, information search supporting system, and method and program for information search
US20070016863A1 (en) Method and apparatus for extracting and structuring domain terms
EP2354967A1 (en) Semantic textual analysis
WO1997004405A9 (en) Method and apparatus for automated search and retrieval processing
WO2009123260A1 (en) Cooccurrence dictionary creating system and scoring system
JP2011118689A (en) Retrieval method and system
JPH0785101A (en) Keyword extract processing unit
JPH10254900A (en) Automatic document summarizing device and its method
JP4378106B2 (en) Document search apparatus, document search method and program
JP3985483B2 (en) SEARCH DEVICE, SEARCH SYSTEM, SEARCH METHOD, PROGRAM, AND RECORDING MEDIUM USING LANGUAGE SENTENCE
JP4033093B2 (en) Natural language processing system, natural language processing method, and computer program
JP2000137718A (en) Similarity deciding method for word and record medium where similarity deciding program for word is recorded
JP4114580B2 (en) Natural language processing system, natural language processing method, and computer program
JP4543819B2 (en) Information search system, information search method, and information search program
JP3609252B2 (en) Automatic character string classification apparatus and method
JP3222173B2 (en) Japanese parsing system