JPH11203415A

JPH11203415A - Device and method for preparing similar pattern category discrimination dictionary

Info

Publication number: JPH11203415A
Application number: JP10006785A
Authority: JP
Inventors: Masaharu Ozaki; 正治尾崎
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1998-01-16
Filing date: 1998-01-16
Publication date: 1999-07-30

Abstract

PROBLEM TO BE SOLVED: To prepare a discrimination dictionary that achieves both high accuracy and low calculation cost with a small number of feature dimensions. SOLUTION: Learning samples of the patterns needed to prepare the discrimination dictionary are prepared in advance, and their pattern information is input to a similar pattern category creating means 1. The means 1 finds similar patterns and merges them into similar pattern categories. A category dividing means 2 divides each similar pattern category into a plurality of categories so that misrecognition is reduced. A reintegrated category creating means 3 then merges the divided categories into similar categories again. The result is a hierarchical dictionary consisting of the divided categories and the reintegrated categories. When similar pattern category discrimination first recognizes the reintegrated category and then the divided category using this dictionary, high recognition accuracy is obtained and the amount of calculation for recognition is reduced.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an apparatus and method for creating a similar pattern category identification dictionary. More particularly, patterns or characters whose shapes are similar in terms of image features are grouped in advance into the same category, and the invention relates to an apparatus and method for creating the dictionary used to identify which of these similar pattern categories or similar character categories a feature amount extracted from an unknown pattern or character image belongs to.

[0002]

[Prior Art] Document filing apparatuses have been put into practical use that convert hard-copy documents into images with an image scanner, store them electronically, and allow them to be retrieved later. In most of them, however, search attributes such as keywords must be assigned manually to every input image, which takes a great deal of labor. Ideally, documents should be searchable by full-text search over their text content. That is possible for electronic documents created by DTP (DeskTop Publishing) and the like, but it cannot be done directly on document images. For this reason, Japanese Unexamined Patent Publication No. Sho 62-44878 performs character recognition on the text portions of a document and enables full-text search over the resulting coded text. However, character recognition, particularly for languages with many character types such as Japanese, generally computes a feature vector of several hundred dimensions and matches it against the features of 3,000 or more character types, so the feature vector matching requires a very large amount of computing power. In addition, because the character recognition rate is not high, a keyword to be searched may be misrecognized.

[0003] In the electronic filing system described in Japanese Unexamined Patent Publication No. Sho 62-44878, the candidates obtained for each character during character recognition are retained, which reduces search omissions caused by misrecognition. Fundamentally, however, performing character recognition requires a great deal of computing power at document registration time, and if what is ultimately wanted is the document images containing the word specified at search time, most of the character recognition results go to waste.

[0004] In the literature (Tanaka et al., "Implementation of a character string search function for Japanese document images", Information Processing Society of Japan, Information Media Research Group material 19-1, January 1995), the feature amount obtained from each character image is not used for character recognition but is converted directly into a 36-bit code. Character string search is then realized by matching against the feature amounts of a search keyword image. However, the search keyword must either be input as an image or be rendered into an image from a character font, so the method is vulnerable to font variation.

[0005] Another document (Reynar, J. et al., "Document Reconstruction: A Thousand Words from One Picture", in Proc. of 4th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, April 1995) discloses an attempt to classify the characters in text images of a European language (English) into a small number of categories according to their size and position, and to identify words from the sequence of categories. For languages with many character types, such as Japanese and Chinese, however, it is difficult to choose such clue features intuitively. Moreover, unlike European languages, there are no spaces between words, so word units cannot be obtained directly from the image. It was therefore difficult to apply the disclosed method directly to identify words in text such as Japanese.

[0006] To solve the above problems, the applicant has already disclosed the following method in the specification of Japanese Patent Application No. Hei 8-274732. First, character types with similar shapes (for example, "道" and "通", or the numeral "0" and the Roman letter "O") are grouped in advance into one category. When an actual image is analyzed, each character image is identified as one of these similar character categories. From the resulting sequence of similar character categories, morphological analysis, a technique for extracting words from Japanese text, picks out only the portions that can be determined as words, and detailed identification is performed only on the characters that remain ambiguous. The benefits are twofold. Because the number of feature dimensions is as small as that used for coarse classification in character recognition, and matching is performed against only a small number of similar character categories, the calculation cost is greatly reduced. And when morphological analysis is used, detailed identification is performed only on the ambiguous items among those acceptable as words, so the cost of feature matching in detailed identification is also reduced. In that earlier invention, similar character categories were formed by clustering the representative vectors of the character types in the feature space, that is, by successively merging character types whose representative vectors (mean vectors) are close to one another, and each cluster center was used as the representative vector of its similar character category. An unknown character sample was identified by minimum-distance classification against those representative vectors. However, the identification into similar character categories in this disclosed method was not always accurate. The likely reason is that the distribution of actual character feature vectors spreads out in the feature space as similar character categories are formed, so that samples located at the edge of the distribution, far from the representative vector, are increasingly misidentified as other similar character categories. This situation is explained with reference to FIG. 16.
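The clustering and minimum-distance classification described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the merging criterion or stopping rule of the earlier application, so the distance threshold `merge_threshold`, the unweighted cluster center, the function names, and the two-dimensional toy vectors (including the extra pair "1"/"l" and the type "X") are all hypothetical.

```python
import math


def mean(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def build_categories(type_means, merge_threshold):
    """Greedily merge the two closest clusters until the closest pair is
    farther apart than merge_threshold (an assumed stopping rule).

    type_means maps a character type to its mean feature vector.
    Returns a list of (member character types, cluster center).
    """
    clusters = [([ch], vec) for ch, vec in type_means.items()]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > merge_threshold:
            break
        members = clusters[i][0] + clusters[j][0]
        # Assumed: the center is the unweighted mean of the member type means.
        center = mean([type_means[ch] for ch in members])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, center))
    return clusters


def classify(sample, clusters):
    """Minimum-distance classification: index of the nearest cluster center."""
    return min(range(len(clusters)), key=lambda k: math.dist(sample, clusters[k][1]))


# Toy 2-D mean vectors; real feature vectors would have many more dimensions.
type_means = {
    "0": (5.0, 5.0), "O": (5.1, 5.2),
    "1": (1.0, 1.0), "l": (1.2, 0.9),
    "X": (9.0, 0.0),
}
categories = build_categories(type_means, merge_threshold=1.0)
```

With these toy values the numeral "0" and the letter "O" end up in one category, and an unknown sample is assigned to whichever category center is closest.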

[0007] FIG. 16 illustrates the problem in similar character category identification. To simplify the explanation, the feature space is drawn schematically in two dimensions. The samples of one similar character category have some distribution in the feature space. Three distributions A, B, and C are shown here as examples, each approximated by an ellipse. The representative vector of each distribution is taken to be its mean, here the center of the ellipse. In minimum-distance classification, the perpendicular bisector of the line connecting two representative vectors becomes the decision boundary, and samples distributed beyond this boundary cause misrecognition. For example, in the distribution A of one similar character category, both ends of the major axis of its ellipse extend beyond the decision boundaries with the distributions of other similar character categories, so samples at such positions are misidentified as other similar character categories.
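A small numeric illustration of this effect, assuming made-up coordinates rather than anything taken from FIG. 16: category A is elongated along the x axis, and its neighbours B and C sit on either side, so the perpendicular bisectors lie at x = 2 and x = -2.

```python
import math


def nearest(sample, representatives):
    """Name of the representative vector closest to the sample."""
    return min(representatives, key=lambda name: math.dist(sample, representatives[name]))


# Hypothetical representative vectors (category means).
representatives = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (-4.0, 0.0)}

# Samples of category A spread along its major axis, from x = -3 to x = 3.
samples_a = [(x * 0.5, 0.0) for x in range(-6, 7)]

# Samples of A that minimum-distance classification assigns to B or C.
misidentified = [s for s in samples_a if nearest(s, representatives) != "A"]
```

Here the four samples with |x| > 2 fall beyond a bisector and are assigned to B or C, even though they belong to A.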

[0008] The literature (Ito, Endo et al., "Dictionary construction method that registers character types in multiple clusters in a hierarchical printed kanji system", IEICE Transactions D-II, Vol. J78-D-II, No. 6, pp. 896-905, June 1995) proposes, likewise to reduce the calculation cost of character recognition, a method in which character types that are close in the feature space, that is, similar in shape, are clustered in advance into categories, the categories are organized hierarchically, and the similar character candidates are narrowed down stage by stage. That work also points out the misrecognition problem caused by the spread of the character sample distribution described above. To address it, at the intermediate stages of the hierarchy, learning character samples are used to check whether misidentification can occur, and where it can, the character type is additionally registered in the category it is misidentified as. However, this method presupposes that the correct character type can be identified with high accuracy by the matching at the final stage. In the identification into similar pattern categories of the present application, as described above, accuracy is not very good even when matching exhaustively against the representative vectors of the similar character categories, so that presupposition does not hold. It would be possible to identify the similar character category at the final stage using another feature with more dimensions, but that newly incurs the calculation cost of matching and feature extraction for the added dimensions. A statistical distance such as the Mahalanobis distance could be used instead of the Euclidean distance, but this also increases the calculation cost. Another way to improve accuracy would be to allow easily misrecognized character types to be registered in multiple categories at the final stage. In that case, however, a problem arises in the subsequent word extraction. For example, if character type "a" is registered in both similar character categories "A" and "B", and character type "b" is registered in categories "C" and "D", then the similar character category strings AC, AD, BC, and BD can all correspond to the word "ab", and when morphological analysis is performed the number of entries in the word dictionary increases greatly. When a bi-gram index is used, in which every sequence of two characters is registered as an index entry, one search word must be searched for as each of several similar character category strings at search time, which increases the calculation cost of searching. It is therefore desirable that each character type belong to a single similar character category.
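The growth in candidate category strings under duplicate registration can be reproduced with a few lines. The function name and the dictionary layout are assumptions made for illustration; only the "a"/"b" and "A" to "D" example comes from the text.

```python
from itertools import product


def category_strings(word, registrations):
    """All similar-character-category strings that a word can map to when a
    character type may be registered in more than one category."""
    return ["".join(combo) for combo in product(*(registrations[ch] for ch in word))]


# Duplicate registration as in the example: "a" in A and B, "b" in C and D.
registrations = {"a": ["A", "B"], "b": ["C", "D"]}
print(category_strings("ab", registrations))  # prints ['AC', 'AD', 'BC', 'BD']
```

The number of strings is the product of the registration counts of the characters in the word, which is why a single category per character type keeps both the word dictionary and the bi-gram search small.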

[0009] Several other methods of identifying into categories containing similar characters have been disclosed. For example, Japanese Unexamined Patent Publication No. Sho 63-263590 describes a method in which similar character samples are grouped hierarchically and identified stage by stage. To deal with misrecognition between groups of similar characters, multiple candidates are selected and identification is performed against all of their lower-level groups. However, because this method forms similar character groups in units of character samples rather than character types, the same character type may end up belonging to different classes. Japanese Unexamined Patent Publications No. Hei 4-337888 and No. Hei 5-174193 narrow down the character types hierarchically using binary and ternary trees, but in both the same character type may belong to different classes, and recovering from this relies on final detailed identification using features with many dimensions. As explained above, this increases the calculation cost at the later stage.

[0010] Accordingly, a method has been needed that, when identifying unknown characters into similar character categories in which no character type is duplicated, secures identification accuracy at a low calculation cost.

[0011]

[Problems to Be Solved by the Invention] All of the conventional techniques have problems in identification accuracy or calculation cost in the way they identify similar character categories.

[0012] The present invention has been made in view of the above points. Its object is to provide an apparatus and method for creating a similar pattern category identification dictionary that, for uniquely identifying predetermined similar character categories that do not allow duplicated character types, makes it possible to achieve both high accuracy and low calculation cost with a small number of feature dimensions.

[0013]

[Means for Solving the Problems] To solve the above problems, the present invention provides a similar pattern category identification dictionary creating apparatus that creates a similar pattern category identification dictionary used for matching when patterns contained in image information are identified as similar pattern categories, the apparatus comprising: similar pattern category creating means for obtaining the similarity between patterns from pattern information in which the feature amounts of the patterns contained in an image have been extracted, and grouping similar patterns to create similar pattern categories; category dividing means for examining how the learning samples of the patterns belonging to the created similar pattern categories are misrecognized, dividing the similar pattern categories on that basis, and creating a detailed-classification identification dictionary consisting of the divided categories; and reintegrated category creating means for obtaining the representative vectors of the divided categories, creating a coarse-classification identification dictionary consisting of reintegrated categories in which similar divided categories are merged again, and constructing, together with the detailed-classification identification dictionary, a similar pattern category identification dictionary having a hierarchical structure.

[0014] With this apparatus, pattern information in which the feature amounts of the patterns contained in an image have been extracted in advance is input. First, the similar pattern category creating means finds similar patterns and groups them to create similar pattern categories. The category dividing means then subdivides each created similar pattern category into categories for which misrecognition is reduced. The reintegrated category creating means merges the divided categories into similar categories again. This yields a hierarchical dictionary of divided categories and reintegrated categories. When this dictionary is used for similar pattern category identification, the reintegrated category at the shortest distance from the feature vector extracted from the image is found among the reintegrated categories, and then the divided category at the shortest distance is found among the divided categories belonging to that reintegrated category. High recognition accuracy is thereby obtained and the amount of calculation for recognition is greatly reduced.
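The two-stage lookup described here can be sketched as below. The dictionary layout (field names `rep`, `children`, `category`), the function name, and the toy vectors are assumptions for illustration; the text only specifies the order of the two nearest-neighbour searches.

```python
import math


def two_stage_lookup(x, coarse, fine):
    """Stage 1: nearest reintegrated (coarse) category.
    Stage 2: nearest divided (fine) category among its children only.
    Returns (similar pattern category, divided category number)."""
    top = min(coarse, key=lambda c: math.dist(x, c["rep"]))
    best = min(top["children"], key=lambda k: math.dist(x, fine[k]["rep"]))
    return fine[best]["category"], best


# Divided categories: similar category "A" has been split into numbers 0 and 1.
fine = {
    0: {"rep": (0.0, 0.0), "category": "A"},
    1: {"rep": (3.0, 0.0), "category": "A"},
    2: {"rep": (3.5, 1.0), "category": "B"},
    3: {"rep": (10.0, 10.0), "category": "C"},
}
# Reintegrated categories: a representative vector plus the divided categories under it.
coarse = [
    {"rep": (0.0, 0.0), "children": [0]},
    {"rep": (3.25, 0.5), "children": [1, 2]},
    {"rep": (10.0, 10.0), "children": [3]},
]

print(two_stage_lookup((2.9, 0.1), coarse, fine))  # prints ('A', 1)
```

Because each divided category maps back to exactly one similar pattern category, the final answer is unique even though category "A" is represented by two vectors. In a dictionary of realistic size, only the coarse representatives and the children of one coarse category are compared, rather than every divided category.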

[0015] The present invention also provides a similar pattern category identification dictionary creating method for creating a similar pattern category identification dictionary used for matching when patterns contained in image information are identified as similar pattern categories, the method comprising: inputting image information; extracting the feature amounts of the patterns contained in the image information and accumulating learning samples; examining the similarity between the patterns of the learning samples and creating similar pattern categories in which similar patterns are grouped; dividing each similar pattern category, based on the learning samples of the patterns belonging to it, into divided categories so that misidentification is reduced; and grouping the divided categories again into similar categories to create reintegrated categories that, together with the divided categories, constitute a hierarchical identification dictionary.

[0016] With this method, the learning samples needed for dictionary creation are first extracted from image information and grouped into categories of similar patterns to create similar pattern categories. Next, based on the learning samples, each similar pattern category is divided so that misrecognition is reduced, and a detailed-classification dictionary consisting of the divided categories is created. The divided categories are then grouped again into a small number of similar categories to create a coarse-classification dictionary consisting of reintegrated categories. A hierarchical identification dictionary is thus constructed, and at identification time it makes it possible to arrive uniquely at a similar pattern category with no duplicated patterns.

[0017]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows the principle of the present invention. The similar pattern category identification dictionary creating apparatus according to the present invention comprises similar pattern category creating means 1, which receives pattern information in which the feature amounts of the patterns contained in an image have been extracted; category dividing means 2, which receives the output of the similar pattern category creating means 1; and reintegrated category creating means 3, which receives the output of the category dividing means 2 and outputs the similar pattern category identification dictionary.

[0018] The similar pattern category creating means 1 examines the similarity between patterns from the input pattern information and groups similar patterns to create similar character categories. The category dividing means 2 divides each similar pattern category, based on the learning samples of the patterns belonging to the categories created by the similar pattern category creating means 1, so that misidentification is reduced. It performs principal component analysis on the samples belonging to a category and obtains the end point vectors on the principal component axes; if an end point vector is misidentified as another category, the category is clustered and divided until the misidentification disappears. Among the end point vectors on the principal component axes that are misidentified as other categories, division proceeds starting from the axis with the largest eigenvalue. The reintegrated category creating means 3 groups the divided categories again into similar categories to construct a hierarchical identification dictionary. It clusters the representative vectors of the categories divided by the category dividing means 2 to obtain representative vectors, and if a learning sample is misidentified when matched against the obtained representative vectors, it registers the divided category to which that learning sample belongs in the category it was misidentified as, thereby building up the identification dictionary.
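The end-point check used by the category dividing means can be sketched as follows. Several choices here are assumptions not stated in this passage: only the first principal axis is examined, the end points are taken as the mean shifted by the smallest and largest sample projections onto that axis, and power iteration stands in for a full eigen-decomposition. All function names and sample coordinates are hypothetical.

```python
import math


def first_principal_axis(samples, iterations=200):
    """Mean and first principal axis of the samples, by power iteration on
    the covariance matrix. The start vector is all ones, which would fail
    only if it were exactly orthogonal to the principal axis."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    centered = [[x - m for x, m in zip(s, mean)] for s in samples]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    axis = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * axis[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all samples identical: no meaningful axis
            break
        axis = [x / norm for x in w]
    return mean, axis


def end_points(samples):
    """Assumed definition: mean plus the extreme projections along the axis."""
    mean, axis = first_principal_axis(samples)
    proj = [sum((x - m) * a for x, m, a in zip(s, mean, axis)) for s in samples]
    return [tuple(m + t * a for m, a in zip(mean, axis))
            for t in (min(proj), max(proj))]


def needs_split(name, samples, representatives):
    """True if any end point is closer to another category's representative."""
    return any(
        min(representatives, key=lambda k: math.dist(p, representatives[k])) != name
        for p in end_points(samples)
    )


representatives = {"A": (0.0, 0.0), "B": (4.0, 0.0)}
samples_a = [(-3.0, 0.0), (-1.0, 0.2), (0.0, -0.2), (1.0, 0.1), (3.0, -0.1)]
samples_b = [(3.8, 0.0), (4.0, 0.1), (4.2, -0.1)]
```

With these toy samples the elongated category A has an end point near (3, 0) that is closer to B's representative, so A would be flagged for splitting, while the compact category B would not.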

[0019] The similar pattern category identification dictionary created in this way is used, for example, to extract character or word information from a document image input from an image scanner or the like, in the processing that identifies each character region in the image as one of the similar character categories that group character types of similar shape.

[0020] Next, an example is described in which the similar pattern category identification dictionary creating apparatus is applied to creating a similar character category identification dictionary for character recognition. FIG. 2 shows the hardware configuration for implementing the present invention. The apparatus consists of a general-purpose personal computer 10 and its peripheral devices. The personal computer 10 consists of a central processing unit (CPU) 11, a main memory 12, and a peripheral device controller 13. Connected to the peripheral device controller 13 are an external storage device 14, a display 15, a keyboard 16, a pointing device 17 such as a mouse, an image scanner 18 serving as the image input device, and a network 19.

[0021] All processing of the apparatus is implemented in software, which is stored in the external storage device 14, loaded into the main memory 12 as needed, and executed by the CPU.

[0022] FIG. 3 is a flowchart showing the flow of processing for creating the similar character category identification dictionary. First, image input processing is performed (step S1): a document placed on the platen of the image scanner 18 is read as a binary image, the binary image is segmented into individual characters, preprocessing such as size normalization is applied, and each image is stored in the external storage device 14 together with its corresponding character type. Next, feature extraction processing is performed (step S2): for each learning character image stored in the external storage device 14, the feature amount is expressed as a multidimensional vector and saved in the external storage device 14. Next, the mean vector of the learning samples of each character type is calculated and taken as the representative vector of that character type, and the resulting representative vectors are clustered to create similar character categories (step S3). Next, based on the distribution of the learning samples of each created similar character category, the categories are divided so that misrecognition as other categories is reduced, and the representative vector of each divided similar character category is numbered and stored in the external storage device 14 together with the characters representing that category (step S4). Finally, the representative vectors of the divided similar character categories are clustered again and merged into a small number of categories, and reintegrated categories are created by registering each representative vector together with the numbers of the divided similar character categories that belong to it (step S5). The dictionary creation processing is described in more detail below, following the flow of processing.
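The first half of step S3, computing one representative (mean) vector per character type from labelled learning samples, might look like the sketch below. The function name and the input layout, a list of (character type, feature vector) pairs, are assumptions for illustration, and the two-dimensional vectors are toy values.

```python
from collections import defaultdict


def type_representatives(samples):
    """Group labelled feature vectors by character type and return the
    component-wise mean vector of each type."""
    groups = defaultdict(list)
    for char_type, vector in samples:
        groups[char_type].append(vector)
    return {
        char_type: tuple(sum(column) / len(vectors) for column in zip(*vectors))
        for char_type, vectors in groups.items()
    }


# Hypothetical learning samples: (character type, feature vector).
samples = [("0", (1.0, 2.0)), ("0", (3.0, 4.0)), ("O", (5.0, 6.0))]
print(type_representatives(samples))  # prints {'0': (2.0, 3.0), 'O': (5.0, 6.0)}
```

These per-type mean vectors are what step S3 then clusters into similar character categories.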

[0023] FIG. 4 is a flowchart showing the flow of the image input processing. First, a document for learning is read from the image scanner 18 (step S11). To obtain a wide variety of samples, it is desirable that the documents vary in typeface and character size. The image may be binarized simply with a fixed threshold, or it may be captured as a grayscale multi-valued image and then binarized by threshold processing. The latter is effective for increasing the number of samples, because several binarization thresholds yield images with both blotted and faint strokes. In the present embodiment, different sample images are created by using several different thresholds (step S12). Next, a region is cut out for each character from the obtained binary image (step S13). Small isolated-point noise that does not appear to form part of a character is removed from each cut-out character image (step S14), and the size is normalized based on the circumscribed rectangle of the character (step S15). Here, each character is normalized to an image of 64 × 64 pixels. Several known techniques exist for noise removal and size normalization, and any suitable one may be used. The preprocessed images are labeled with their corresponding character types and stored in the external storage device 14 (step S16). Steps S14 to S16 are repeated for each cut-out character image, and steps S13 to S16 are repeated for each threshold.
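
The multi-threshold sampling of step S12 can be sketched as follows. This is only an illustration under stated assumptions: the function names `binarize` and `make_samples`, the 0 to 255 gray scale, and the threshold values are all hypothetical and do not come from the specification.

```python
from typing import List

Gray = List[List[int]]    # grayscale image, 0 (black) .. 255 (white); assumed scale
Binary = List[List[int]]  # 1 = black (ink) pixel, 0 = white pixel


def binarize(gray: Gray, threshold: int) -> Binary:
    """Mark a pixel as ink (1) when it is darker than the threshold."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def make_samples(gray: Gray, thresholds: List[int]) -> List[Binary]:
    """Create one binary sample image per threshold (cf. step S12)."""
    return [binarize(gray, t) for t in thresholds]


# Hypothetical 2x3 grayscale patch and thresholds.
gray = [[250, 120, 30],
        [150, 90, 10]]
samples = make_samples(gray, [64, 128, 192])
for image in samples:
    print(image)
```

A low threshold keeps only the darkest pixels (faint strokes), while a high threshold also turns mid-gray pixels into ink (blotted strokes), which is how one scan can yield several learning samples.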

[0024] In the present embodiment, the subsequent feature extraction uses peripheral features, which have a small number of dimensions. The peripheral features are described with reference to FIG. 5, which is an explanatory diagram of peripheral features. A peripheral feature is the number of pixels from each side of the circumscribed rectangle to the point where a black pixel first appears (primary peripheral), and to the point where, after the pixels have become white again, a black pixel appears once more (secondary peripheral). The primary and secondary peripherals are measured at each of the 64 pixel positions along each side and averaged in groups of eight pixels, and each average becomes one element of the feature vector. Each side therefore contributes eight dimensions per peripheral, and taking the peripherals up to the secondary one over the four sides gives a feature vector of 64 dimensions in total (4 sides × 8 × 2). This feature vector is stored in the external storage device 14 in association with its character type. Through this processing, feature vectors are computed for the learning character images of all prepared typefaces and sizes.
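
A minimal sketch of the 64-dimensional peripheral feature described above is given below. The function names, the ordering of the elements in the vector, and the choice to use the full line length when no black run is found are assumptions made for illustration; the specification does not state them.

```python
from typing import List, Tuple

Binary = List[List[int]]  # 1 = black pixel, 0 = white pixel


def peripherals(line: List[int]) -> Tuple[int, int]:
    """Return (primary, secondary) depths along one scan line.

    primary:   pixels crossed before the first black pixel.
    secondary: pixels crossed before black reappears after a white gap.
    When no such pixel exists, the full line length is used (assumption).
    """
    n = len(line)
    i = 0
    while i < n and line[i] == 0:   # leading white run
        i += 1
    primary = i
    while i < n and line[i] == 1:   # first black run
        i += 1
    while i < n and line[i] == 0:   # white gap
        i += 1
    return primary, i


def peripheral_features(img: Binary, block: int = 8) -> List[float]:
    """Scan a square image from its four sides and average in blocks."""
    size = len(img)
    cols = [[img[r][c] for r in range(size)] for c in range(size)]
    sides = [
        img,                           # scanned from the left
        [row[::-1] for row in img],    # scanned from the right
        cols,                          # scanned from the top
        [col[::-1] for col in cols],   # scanned from the bottom
    ]
    features: List[float] = []
    for lines in sides:
        depths = [peripherals(line) for line in lines]
        for order in (0, 1):           # 0 = primary, 1 = secondary
            for start in range(0, size, block):
                chunk = depths[start:start + block]
                features.append(sum(d[order] for d in chunk) / len(chunk))
    return features


# Usage: a 64x64 image containing a filled 32x32 square.
img = [[1 if 16 <= r < 48 and 16 <= c < 48 else 0 for c in range(64)]
       for r in range(64)]
features = peripheral_features(img)
print(len(features))
```

With a 64 × 64 input and blocks of eight pixels, the vector has 4 × 2 × 8 = 64 elements, matching the dimension stated in the text.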

[0025] When the feature extraction is complete, the similar character category generation processing is started. This processing first retrieves all the learning samples of each character type from the external storage device 14, computes their mean vector, and takes it as the representative vector of the character type. Once this has been done for all character types, the representative vectors are clustered. The clustering uses the method described in Duda and Hart, "Pattern Classification and Scene Analysis" (Wiley-Interscience). This method first applies hierarchical clustering and then, taking the result as the initial clusters, optimizes the clustering so that the total squared error between the center of each cluster and the feature vectors of the learning samples is minimized.

[0026] Hierarchical clustering actually consists of the following steps.
(1) Let $m$ be the desired number of clusters and $n$ the total number of character types, and let the initial clusters be $C = \{c_1, c_2, c_3, \ldots, c_n\}$, where each $c_i$ holds the representative feature vectors of similar character types. As the initial value, each cluster contains the representative feature vector of exactly one character type.

[0027] (2) If the current number of clusters equals $m$, the processing ends with the current $C$ as the clustering result. Otherwise, proceed to the next step.
(3) Find the pair of clusters with the smallest inter-cluster distance $d$ in the feature space, merge them into one cluster, and return to (2).

[0028] Here, the desired number of clusters $m$ may be chosen arbitrarily. Various methods of computing the inter-cluster distance $d$ are conceivable; here the centroid method is used, in which the distance between the centers of two clusters is taken as the distance $d$ between them.
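
Steps (1) to (3) with the centroid method can be sketched as follows. The function name `centroid_clustering`, the representation of a cluster as a list of indices, and the tie-breaking rule (the first pair found wins) are illustrative assumptions, not part of the specification.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_clustering(reps: List[Vector], m: int) -> List[List[int]]:
    """Merge clusters (lists of indices into reps) until m remain."""
    if m < 1:
        raise ValueError("m must be at least 1")
    clusters = [[i] for i in range(len(reps))]           # step (1)
    while len(clusters) > m:                              # step (2)
        centers = [mean([reps[i] for i in c]) for c in clusters]
        best = None
        for a in range(len(clusters)):                    # step (3)
            for b in range(a + 1, len(clusters)):
                # Squared distance gives the same ordering as distance.
                d = sq_dist(centers[a], centers[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Usage with five hypothetical 2-D representative vectors.
reps = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [5.0, 5.0]]
clusters = centroid_clustering(reps, 2)
print(clusters)
```

The sketch recomputes every cluster center on each pass for clarity; a practical implementation would update only the merged cluster.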

[0029] Since the result of this hierarchical clustering cannot be called an optimal clustering, it is used as the starting point for optimizing the clusters. For the optimization, the sum of the squared distances between each feature vector and the mean of the feature vectors in its cluster is computed for each cluster, and the total over all clusters is used as the evaluation function. The smaller the value of this evaluation function, the more densely the feature vectors are packed within each cluster, and the better the clustering. Finding the clustering that minimizes this function is generally difficult, but an approximate optimization is possible. It is carried out in the following steps.

[0030] (1) Take an arbitrary feature vector $x$.
(2) Let $c_i$ be the cluster to which $x$ currently belongs. If $x$ is the only feature vector registered in $c_i$, return to (1). Otherwise, perform the following calculation for every cluster $c_j$.

[0031] That is, when $j \neq i$,

[0032]

(Equation 1)

[0033] is calculated, and when $j = i$,

[0034]

(Equation 2)

[0035] is calculated. Here, $n_j$ is the number of vectors registered in $c_j$, and $M_j$ is the mean of the feature vectors belonging to $c_j$. The above expressions give the change in the evaluation function when $x$ is moved to $c_j$.

[0036] (3) If the $j$ that minimizes $a$ is other than $i$, move $x$ to cluster $c_j$ and proceed to (4).
(4) Take the next feature vector as $x$ and repeat from (2). If no feature vector can be moved to another cluster any longer, the processing ends with the clusters at that point as the result.
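
The optimization loop of steps (1) to (4) can be sketched as below. Equations 1 and 2 are not reproduced in this text, so the sketch assumes the standard transfer criterion from the cited Duda and Hart procedure: $a_j = \frac{n_j}{n_j + 1}\lVert x - M_j \rVert^2$ for $j \neq i$ and $a_i = \frac{n_i}{n_i - 1}\lVert x - M_i \rVert^2$ for $j = i$. That form, the function name `refine`, and the rule of staying put on ties are assumptions.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def refine(vectors: List[Vector], assign: List[int], k: int) -> List[int]:
    """Move vectors between k clusters until no move lowers the criterion.

    assign[t] is the cluster index of vectors[t]; it is updated in place.
    """
    moved = True
    while moved:
        moved = False
        for t, x in enumerate(vectors):                   # steps (1), (4)
            i = assign[t]
            members = [[v for v, c in zip(vectors, assign) if c == j]
                       for j in range(k)]
            n_i = len(members[i])
            if n_i == 1:                                   # step (2): keep singletons
                continue
            # Assumed Equation 2: cost of keeping x in its own cluster.
            best_j = i
            best_a = sq_dist(x, mean(members[i])) * n_i / (n_i - 1)
            for j in range(k):
                n_j = len(members[j])
                if j == i or n_j == 0:
                    continue
                # Assumed Equation 1: cost of moving x into cluster j.
                a = sq_dist(x, mean(members[j])) * n_j / (n_j + 1)
                if a < best_a:
                    best_j, best_a = j, a
            if best_j != i:                                # step (3)
                assign[t] = best_j
                moved = True
    return assign


# Usage: the vector [2.0] starts in the wrong cluster and is moved.
vectors = [[0.0], [1.0], [2.0], [10.0], [11.0]]
assign = refine(vectors, [0, 0, 1, 1, 1], 2)
print(assign)
```

Because a vector is moved only when the criterion strictly decreases, the loop terminates; the result still depends on the order in which vectors are visited.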

[0037] In this way the similar characters are clustered. The same processing is repeated with various different ways of choosing the arbitrary character in step (1), and the result that minimizes the evaluation function (the total, over all clusters, of the sum of squared distances between each feature vector and the mean of the feature vectors in its cluster) is adopted.

[0038] Each cluster is stored as a similar character category table and is used when registering documents. FIG. 6 shows examples of the stored tables created by the similar character category generation processing: (A) shows an example of the similar character category table, and (B) shows an example of the character code/category correspondence table. As partly shown in (A), the similar character category table consists, for each category, of the character codes of the characters belonging to it (similar characters), the representative vector of the category features (representative vector), and a character code representing the category (representative character), and is stored in the storage device 14. The category representative vector is the mean of the feature vectors of the characters belonging to the category. Any one of the character codes of the characters belonging to the category is assigned as the character code representing the category. In addition, so that a search keyword can be converted into a similar character category string during search processing, a character code/category correspondence table as shown in (B), which pairs each character code with the representative character code of its category, is created at the same time as a reverse lookup table for the similar character category table.
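
The relation between the two tables can be sketched as follows. The table contents are invented for illustration (the actual entries of FIG. 6 are not reproduced here), and the field names `similar`, `vector`, and `representative` as well as both function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical miniature of a similar character category table (cf. FIG. 6(A)).
# The vectors are short placeholders, not real 64-dimensional features.
category_table: List[dict] = [
    {"similar": ["O", "0", "Q"], "vector": [0.1, 0.2], "representative": "O"},
    {"similar": ["l", "1", "I"], "vector": [0.8, 0.9], "representative": "l"},
]


def build_reverse_table(table: List[dict]) -> Dict[str, str]:
    """Map each character to its category's representative (cf. FIG. 6(B))."""
    return {ch: row["representative"] for row in table for ch in row["similar"]}


def to_category_string(keyword: str, reverse: Dict[str, str]) -> str:
    """Convert a search keyword into a similar character category string."""
    return "".join(reverse[ch] for ch in keyword)


reverse_table = build_reverse_table(category_table)
print(to_category_string("Q1", reverse_table))
```

A character that is missing from the table raises `KeyError` in this sketch; how such characters are handled is not described in the text.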

[0039] Once the similar character categories have been created, the category division processing is started. From the learning samples of the character types included in a similar character category, this processing assumes the distribution of what the category contains. In the directions in which the variance of that distribution is large, it assumes the endpoints of the distribution, that is, the points farthest from the representative vector that can statistically occur. If such a point is closer to the representative vector of another category than to the representative vector of its own category, the category is divided.

[0040] The category division basically uses the technique disclosed in the literature (Omachi, Son, et al., "Construction method of multi-template dictionary for character recognition considering distribution between categories," IEICE Transactions D-II, Vol. J79-D-II, No. 9, pp. 1525-1533). That technique aims to improve character recognition accuracy: the learning samples of each character type are divided into several clusters so that a single character type has multiple representative vectors, which reduces misrecognition at the expense of some increase in computational cost. FIG. 7 shows the flow of the actual processing.

[0041] FIG. 7 is a flowchart showing the flow of the category division processing. First, as preprocessing, one similar character category is selected, principal component analysis is performed on the learning samples of the character types belonging to it, and the eigenvalues and eigenvectors corresponding to the top $l$ principal components are stored (step S21). The number of eigenvalues and eigenvectors obtained is the smaller of the dimensionality of the feature vector and the number of learning samples minus one, and the eigenvectors indicate the directions of largest variance in descending order of eigenvalue. Here it is assumed that more learning samples than the number of dimensions are available. $l$ is a constant, given when the dictionary is created, that specifies how many components are examined in descending order of eigenvalue. Since misrecognition is generally considered to occur along axes with large variance, a value of about $l = 5$ is sufficient.

[0042] When the preprocessing is complete, for each similar character category, two sample endpoints that can appear farthest from the center are assumed along each of the $l$ principal component axes in the feature space (step S22). The endpoint vector $p$ is given by the following equation.

[0043]

(Equation 3)

[0044] Here, $m$ is the representative vector of the category, $a$ is a constant (a positive real number), and $\lambda_i$ and $\Phi_i$ are the $i$-th eigenvalue and eigenvector. That is, $p$ denotes the position in the feature space of the sample farthest from the representative vector along the principal component axis. FIG. 8 illustrates the meaning of this equation schematically in a two-dimensional feature space.

[0045] FIG. 8 is a diagram for explaining the meaning of the endpoint vector equation in the feature space. In FIG. 8, the small black dots represent samples in a similar character category, and their distribution is approximated by an ellipse through the statistical processing known as principal component analysis. The center of the ellipse is the representative vector $m$, the mean of the distribution. The arrow along the major axis, in the direction of the first principal component, is the eigenvector $\Phi_1$, and the arrow along the minor axis, in the direction of the second principal component, is the eigenvector $\Phi_2$. The two ends of the major axis, marked with ×, are the endpoint vectors $p$ given by Equation (3). It is known that the principal component axes obtained by principal component analysis are mutually orthogonal in the feature space and have no covariance between them. Each principal component axis can therefore be treated as statistically independent. The constant $a$ specifies up to how many standard deviations along the principal component axis are regarded as the range of the distribution. Assuming a normal distribution, $a = 3.5$ means that 99.96% of the distribution is contained within this range.
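
The two endpoints on one principal component axis can be sketched as below. Equation 3 is not reproduced in this text; the sketch assumes the form $p = m \pm a\sqrt{\lambda_i}\,\Phi_i$, which is inferred from the description of $a$ as a multiple of the standard deviation along the axis, and assumes that $\Phi_i$ is a unit vector. The function name `endpoints` is hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def endpoints(m: Vector, eigenvalue: float, eigenvector: Vector,
              a: float = 3.5) -> Tuple[Vector, Vector]:
    """Return the two assumed endpoints m + a*sqrt(lambda)*phi and m - a*sqrt(lambda)*phi."""
    step = a * math.sqrt(eigenvalue)   # a standard deviations along the axis
    plus = [mi + step * ei for mi, ei in zip(m, eigenvector)]
    minus = [mi - step * ei for mi, ei in zip(m, eigenvector)]
    return plus, minus


# Usage: variance 4 (standard deviation 2) along the first coordinate axis.
p_plus, p_minus = endpoints([1.0, 2.0], 4.0, [1.0, 0.0])
print(p_plus, p_minus)
```

The eigenvalue and eigenvector are taken as inputs here; computing them requires a principal component analysis of the category's learning samples, which is outside this sketch.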

[0046] Once the endpoint vectors $p$ have been obtained on each principal component axis, the category representative vector nearest to each of them is retrieved (step S23). If the nearest representative vector belongs to the category currently under consideration, or to a category already split off from the same original category, nothing is done (steps S24, S25). Otherwise, the category is regarded as one in which misrecognition occurs and is saved as a division candidate together with the principal component order and the corresponding eigenvalue and eigenvector (step S26). After the checks of steps S21 to S26 have been performed for all similar character categories, the division candidate with the largest eigenvalue is selected and divided (step S28). The division is performed by clustering the learning samples of all the character types belonging to that similar character category. The clustering here uses the k-means method with Euclidean distance. The k-means method is used so that the division can be controlled to split the samples along the principal component axis on which misrecognition occurs. For this reason, the learning samples closest to the two endpoint vectors $p$ of the similar character category to be divided are given as the initial cluster centers. Unlike the method of the literature (Omachi, Son, et al., "Construction method of multi-template dictionary for character recognition considering distribution between categories," IEICE Transactions D-II, Vol. J79-D-II, No. 9, pp. 1525-1533), this makes it possible to divide the region containing the endpoints more reliably. If the similar character category is itself one of the parts into which a category has already been divided, clustering is performed not only on the target cluster but on the whole of the original similar character category. In that case, the initial cluster centers are the learning samples closest to the endpoint vectors for the category to be divided, and the representative vectors for the other categories. FIG. 9 shows an example schematically.
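
The first division of a category (two clusters seeded near the endpoints) can be sketched as follows. The function names, the iteration cap, and the tie-breaking by lowest index are assumptions; the endpoint vectors are supplied by the caller, and the case of re-dividing an already divided category (extra centers initialized with representative vectors) is not covered.

```python
from typing import List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(candidates: List[Vector], point: Vector) -> int:
    """Index of the candidate closest to point (lowest index wins ties)."""
    return min(range(len(candidates)), key=lambda i: sq_dist(candidates[i], point))


def split_category(samples: List[Vector], p_plus: Vector, p_minus: Vector,
                   max_iter: int = 100) -> Tuple[List[int], List[Vector]]:
    """Two-cluster k-means seeded with the samples nearest the two endpoints."""
    centers = [list(samples[nearest_index(samples, p_plus)]),
               list(samples[nearest_index(samples, p_minus)])]
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [nearest_index(centers, x) for x in samples]
        if new_labels == labels:       # assignments stable: converged
            break
        labels = new_labels
        for k in range(2):
            members = [x for x, lab in zip(samples, labels) if lab == k]
            if members:                # keep the old center if a cluster empties
                centers[k] = mean(members)
    return labels, centers


# Usage: samples spread along the first axis, endpoints beyond both ends.
samples = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [8.0, 0.0], [9.0, 0.0], [10.0, 0.0]]
labels, centers = split_category(samples, [15.0, 0.0], [-5.0, 0.0])
print(labels, centers)
```

Seeding with the samples nearest the endpoints, rather than with random samples, is what steers the split along the axis on which misrecognition was detected.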

[0047] FIG. 9 illustrates category division by the k-means method: (A) shows an example of the first division of a similar character category, and (B) shows an example of the further division of an already divided similar character category. When a similar character category is first divided by the k-means method, the samples closest to the endpoints of the major axis are taken as the initial cluster centers. These samples are shown as white circles in the left-hand diagram of (A), which shows the state before division. The right-hand diagram shows the clusters finally obtained by k-means clustering with these samples as the initial cluster centers. As a result, the cluster whose major-axis endpoints were farthest away, and which was therefore most likely to cause misrecognition, has been divided into clusters each having a shorter major axis. A principal component analysis of each cluster then gives new representative vectors, eigenvectors, and endpoint vectors. (B) shows the case where one of two already divided similar character categories becomes the division target; as in (A), the white circles on the left show the initial cluster centers, and the result is as shown on the right.

[0048] When the division by clustering is complete, the representative vector of each divided cluster is obtained, principal component analysis is performed, and the eigenvalues and eigenvectors of the top $l$ principal components are obtained and stored. This is repeated over all $l$ principal components of all divided categories until no more division candidates are obtained (step S27). When the processing is finished, an identification dictionary is created from pairs of the representative vector of each divided similar character category (hereinafter called a divided category) and the similar character representative character code, and is registered in the external storage device.

[0049] FIG. 10 is a diagram showing the data structure of the identification dictionary produced by category division. The identification dictionary consists of a category number, a similar character representative character code, and a representative vector. The category number is used to identify the divided categories belonging to a reintegrated category created in the reintegrated category creation processing described later.

[0050] A category with a small number of samples may be generated in the course of division. If principal component analysis is performed on such a category, the error may become large. For this reason, a divided category that contains no more than a certain number of samples is excluded from further division. This prevents meaningless divisions. In the present embodiment, the sample-count threshold is set to 65 so that the order of the principal component analysis does not fall below the 64 dimensions of the feature vector.

[0051] Performing shortest-distance identification of unknown characters with the identification dictionary of divided categories obtained in this way reduces misrecognition. However, dividing the categories naturally increases the number of representative vectors to be matched, which may exceed the original number of character types. This defeats the purpose of reducing computational cost by identifying characters as one of a number of similar character categories smaller than the number of character types. To solve this problem, the divided similar character categories are re-clustered and integrated anew into a small number of categories. This is the reintegrated category creation processing. In actual identification, the input is first matched against the small number of categories obtained by this re-clustering, the best one is selected, and the input is then matched against the representative vectors of the divided clusters belonging to it, which reduces the amount of computation. The reintegrated category creation processing uses, unchanged, the centroid method and the optimization technique that were first used to cluster the representative vectors of the character types into similar character categories. Whether or not divided categories originated from the same similar character category is disregarded entirely, and the results of division are treated without distinction. The clusters obtained by this processing are called reintegrated categories.

[0052] Shortest-distance identification against the reintegrated categories obtained by this processing alone causes misrecognition, just as shortest-distance identification against the representative vectors of the original similar character categories does. To avoid this, the learning samples of each divided category are used: where a sample is misrecognized, the divided category is additionally registered in the reintegrated category for which the misrecognition occurs. FIG. 11 shows this processing flow.

[0053] FIG. 11 is a flowchart showing the flow of the reintegrated category creation processing. In this processing, first, for each divided category obtained by the category division processing, all the learning samples belonging to it are retrieved (step S31); one of them is taken, and the representative vector of the reintegrated category at the shortest distance from it is obtained (step S32). If that category is the reintegrated category to which the divided category under consideration belongs, nothing is done and the next sample is examined (step S33). If it is a different reintegrated category, the divided category under consideration is registered in that reintegrated category (step S34). That is, the divided category under consideration comes to belong to more than one reintegrated category. This operation is performed for all samples of all divided categories, and when it is finished, each reintegrated category is stored in the external storage device as a large classification dictionary in which its representative vector is paired with the numbers of the divided categories belonging to it (step S35). FIG. 12 shows the data structure.
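
Steps S31 to S34 can be sketched as follows. The data layout (a list of reintegrated-category centers, a parallel list of sets of divided-category numbers, and a dictionary of samples per divided category) and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Set

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(candidates: List[Vector], point: Vector) -> int:
    return min(range(len(candidates)), key=lambda i: sq_dist(candidates[i], point))


def register_duplicates(samples_by_divided: Dict[int, List[Vector]],
                        reint_centers: List[Vector],
                        membership: List[Set[int]]) -> List[Set[int]]:
    """membership[r] holds the divided-category numbers under reintegrated category r."""
    for divided, samples in samples_by_divided.items():    # step S31
        for x in samples:
            r = nearest_index(reint_centers, x)             # step S32
            # Steps S33/S34: adding to a set is a no-op when the divided
            # category is already registered under r.
            membership[r].add(divided)
    return membership


# Usage: divided category 1 has one sample that falls nearer to center 1.
reint_centers = [[0.0], [10.0]]
membership = [{0, 1}, {2}]
samples_by_divided = {0: [[0.5], [1.0]], 1: [[4.0], [6.0]], 2: [[9.0], [11.0]]}
membership = register_duplicates(samples_by_divided, reint_centers, membership)
print(membership)
```

After this pass, divided category 1 appears under both reintegrated categories, which is the duplicate registration the text describes.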

[0054] FIG. 12 is a diagram showing the data structure of the large classification dictionary produced by the reintegrated category creation processing. Each entry of the large classification dictionary created by reintegration consists of a representative vector and the numbers of the divided categories belonging to it, and the example shows that some divided categories are registered more than once among the divided category numbers. Here, a divided category is registered only in the category at the shortest distance; however, if the divided category under consideration is instead registered in every reintegrated category lying within the shortest distance from the sample plus a certain value, variations in unknown characters can be accommodated.

[0055] In this way, the representative vectors of the reintegrated categories obtained by the reintegrated category creation processing are used as the large classification dictionary, the representative vectors of the divided clusters are used as the detailed classification dictionary, and matching of unknown characters is performed in two stages. FIG. 13 shows a flowchart of the pseudo character recognition processing that identifies characters as similar character categories.

[0056] FIG. 13 is a flowchart showing the flow of the similar character category identification processing. First, a document image (binary image) is input from the image scanner (step S41). Character blocks are extracted from the input image, and each character is cut out (step S42). For the extraction of character blocks, a region division technique based on marginal distributions can be used, such as the one disclosed in the literature (Akiyama, Masuda, "Segmentation of Document Image Using Peripheral Distribution, Line Density, and Bounding Rectangular Features," IEICE Transactions D-II, Vol. J69, No. 8). A peripheral feature vector is computed for each cut-out character image (step S43). As the large classification, this feature vector is first matched against the representative vectors of the reintegrated categories obtained by re-clustering, the one at the shortest distance is selected, and the representative vectors of the divided categories belonging to that cluster are obtained (step S44). Next, the feature vector is matched against the representative vectors of those divided categories, the one at the shortest distance is selected (step S45), and the representative character of the corresponding similar character category is output (step S46). The processing of steps S43 to S46 is repeated in turn for each cut-out character.
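
The two-stage matching of steps S44 to S46 can be sketched as follows. The dictionary layouts (`coarse` as a list of pairs of a representative vector and divided-category numbers, `fine` as a mapping from a divided-category number to its representative vector and representative character), the function name `classify`, and the one-dimensional sample data are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Coarse = List[Tuple[Vector, List[int]]]   # large classification dictionary
Fine = Dict[int, Tuple[Vector, str]]      # detailed classification dictionary


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(x: Vector, coarse: Coarse, fine: Fine) -> str:
    """Return the representative character for feature vector x."""
    # Step S44: nearest reintegrated category, then its divided categories.
    _, members = min(coarse, key=lambda entry: sq_dist(x, entry[0]))
    # Step S45: nearest divided category among those members only.
    best = min(members, key=lambda n: sq_dist(x, fine[n][0]))
    # Step S46: output the representative character.
    return fine[best][1]


# Usage with hypothetical one-dimensional dictionaries; divided category 1
# is registered under both reintegrated categories.
fine = {0: ([0.0], "O"), 1: ([5.0], "l"), 2: ([10.0], "B")}
coarse = [([2.5], [0, 1]), ([9.0], [1, 2])]
print(classify([6.0], coarse, fine), classify([0.4], coarse, fine))
```

Only the representative vectors of the selected reintegrated category are compared in the second stage, which is where the saving in computation comes from.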

[0057] Although a two-stage identification dictionary is created in the present embodiment, it can also be given more levels. That is, the divided categories are reintegrated, and the result is then reintegrated again into an even smaller number of categories. Since the possibility of misrecognition increases with the number of stages, the appropriate number of stages can be determined experimentally by varying the number of categories and the number of stages.

[0058] As a modification, misidentification can be reduced by extracting, at the large classification stage, not only the category at the shortest distance but several categories in ascending order of distance, and then matching against the divided categories belonging to them. In this case the number of comparisons increases, but accuracy naturally improves. The number of candidates to take can be determined experimentally by varying it.
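
A minimal sketch of this candidate-widening variant is given below. The function name, the `num_candidates` parameter, and the 2-D toy data are assumptions introduced for illustration; the toy query is chosen so that it lies near a boundary between two reintegrated categories.

```python
import heapq
import math

def identify_top_k(feature, coarse_dict, members, fine_dict, num_candidates=3):
    """Search the divided categories of the nearest `num_candidates` reintegrated categories."""
    clusters = heapq.nsmallest(
        num_candidates, coarse_dict,
        key=lambda c: math.dist(feature, coarse_dict[c]))
    # Union of the divided categories registered under the shortlisted clusters.
    shortlist = {d for c in clusters for d in members[c]}
    return min(shortlist, key=lambda d: math.dist(feature, fine_dict[d]))

# Hypothetical toy dictionaries, for illustration only.
coarse_dict = {"R0": (0.0, 0.0), "R1": (4.0, 0.0)}
members = {"R0": ["A1", "A2"], "R1": ["B1", "B2"]}
fine_dict = {"A1": (-1.0, 0.0), "A2": (1.0, 0.0),
             "B1": (2.2, 0.0), "B2": (6.0, 0.0)}

query = (1.9, 0.0)
print(identify_top_k(query, coarse_dict, members, fine_dict, num_candidates=1))  # A2
print(identify_top_k(query, coarse_dict, members, fine_dict, num_candidates=2))  # B1
```

With one candidate the query is confined to R0 and matched to A2, while with two candidates the closer divided category B1 is found, at the price of two extra distance calculations.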

[0059] The resulting string of similar character category representative codes is passed, as described in the specification of Japanese Patent Application No. 8-274732, either to a process that extracts and registers bi-grams so that later retrieval is easy, or to a process that performs morphological analysis on the similar character category string and extracts and registers the sequences that are acceptable as words. At this time, if necessary, detailed identification of the character candidates is performed in order to determine the word. The processing disclosed in Japanese Patent Application No. 8-274732 can be used as is, both for the processing after the similar character category string is obtained and for the case where there are multiple candidate character segmentation positions.

[0060] The present invention realizes a similar character category identification method that achieves both accuracy and speed in the process, disclosed in Japanese Patent Application No. 8-274732, of extracting words from similar character category strings. It can also be used as a replacement for the large classification process in conventional character recognition.

[0061] We now compare conventional character recognition, in which large classification is performed by exhaustively matching against the representative vectors of all N character types and the top n candidates are then classified in detail, with the case where the large classification is replaced by the present invention. Let x be the number of feature dimensions for large classification in the conventional method, and y the number of feature dimensions for detailed classification. Identification is by minimum Euclidean distance. The amount of computation is roughly proportional to the number of multiplications, that is, to the product of the number of distance calculations and the number of dimensions. Taking this as the measure of computation, the cost for one unknown character is Nx + ny. In the present method, if the number of similar character categories is M, each similar character category contains N/M character types on average. Let L be the number of clusters in re-clustering, and suppose each re-clustered cluster contains K divided categories on average. With the number of feature dimensions for identification into similar character categories and for detailed identification being x and y, as in the conventional method, the cost is (L + K)x + (N/M)y. Here the appearance probabilities of all character types are assumed to be equal.
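
For reference, the two cost measures can be written side by side. The symbols are those of the text; the labels C_conv and C_prop are names introduced here only for illustration.

```latex
\begin{align}
C_{\mathrm{conv}} &= N x + n y, \\
C_{\mathrm{prop}} &= (L + K)\,x + \frac{N}{M}\,y .
\end{align}
```

The first term of each expression counts the multiplications of the large classification stage, and the second term those of the detailed classification stage.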

[0062] The effect is shown with figures obtained from an actual experiment. The target character set was about one quarter of the JIS level-1 character types. The feature used was the peripheral feature, with 64 dimensions. For each character type, about 400 learning samples were created by varying the font and size. The given parameters were N = 835, M = 500, and L = 100. As a result, the total number of divided clusters was 3339, and K = 103.5. When the learning samples were actually identified, the identification rate into similar character categories was 99.3%, a sufficient accuracy. Assume that, in the conventional method, the number of feature dimensions for large classification is x = 64, the number of dimensions for detailed classification is y = 256, and the number of candidates from large classification is n = 20. Applying these values to the cost measure, the conventional method gives 835 × 64 + 20 × 256 = 58560. The present method gives (100 + 103.5) × 64 + 835/500 × 256 = 13451.52. The amount of computation is therefore about 1/4 or less, a substantial reduction. If, in order to improve accuracy, the top 10 categories from the similar character category identification are taken and all character types contained in them are identified in detail, the cost is (100 + 103.5) × 64 + 835/500 × 256 × 10 = 17299.2, which is still 1/3 or less. Once a character has been identified into a similar character category, the number of detailed classifications can be reduced further by using the method, disclosed in Japanese Patent Application No. 8-274732, of morphologically analyzing the similar character category string. In addition, when only a single character type is registered in the identified similar character category (of the 500 similar character categories obtained in the experiment, 317 contained only one character type), or when the character type can be narrowed down to one by morphological analysis, detailed classification of that character is unnecessary, so the feature extraction and matching for detailed classification can be omitted. The difference in the amount of computation therefore becomes even larger.
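
The arithmetic in this paragraph can be checked with a few lines of code. The function and parameter names below are chosen for illustration; the numeric values are the ones stated in the text.

```python
def conventional_cost(N, n, x, y):
    """Exhaustive large classification over N types, then detailed classification of the top n."""
    return N * x + n * y

def hierarchical_cost(N, M, L, K, x, y, top=1):
    """L reintegrated plus K divided comparisons, then detailed classification of the
    character types in the `top` best categories (N/M types each on average)."""
    return (L + K) * x + (N / M) * y * top

conv = conventional_cost(N=835, n=20, x=64, y=256)
prop = hierarchical_cost(N=835, M=500, L=100, K=103.5, x=64, y=256)
prop10 = hierarchical_cost(N=835, M=500, L=100, K=103.5, x=64, y=256, top=10)

print(conv)                     # 58560
print(round(prop, 2))           # 13451.52
print(round(prop10, 2))         # 17299.2
print(round(prop / conv, 3))    # 0.23
print(round(prop10 / conv, 3))  # 0.295
```

The two ratios, about 0.23 and 0.295, correspond to the "1/4 or less" and "1/3 or less" statements above.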

[0063] Next, another embodiment of the category division process is described. In the second embodiment of creating the similar character category identification dictionary, the basic processing again follows the flow shown in FIG. 3, and its details are the same except for the category division process. Only the category division process is therefore described.

[0064] FIG. 14 is a flowchart showing the flow of the category division process in the second embodiment. First, one similar character category is selected, and the learning samples of all character types belonging to it are extracted (step S51). Next, one of those learning samples is selected, and the similar character category at the shortest distance from it is obtained (step S52). If that category is the similar character category currently under consideration, or a category already divided from the same similar character category, nothing is done (steps S53, S54). Otherwise, a misidentification has occurred, so the learning samples are grouped by the category into which they were misidentified.

[0065] FIG. 15 is an explanatory diagram of category division in the second embodiment. As shown in FIG. 15, suppose, for example, that there are similar character categories A, B, C, and D, and that some of the samples of similar character category A are distributed beyond the identification boundaries with categories B, C, and D. Some of the samples belonging to similar character category A are then misidentified as one of the three categories B, C, and D. In such a case, the misidentified learning samples are grouped by the category into which they were misidentified, and similar character category A is divided into four, as illustrated.

[0066] At this point, however, no division is performed; only the correspondence between each misidentified sample and its shortest-distance category is stored (step S55). The checks of steps S52 to S55 are performed for all samples, and the checks of steps S51 to S55 for all similar character categories, after which the division is actually carried out. That is, the mean vector of the samples is calculated for each category into which samples were misidentified (step S57), and the cluster is divided using, as the initial clusters, the representative vector of the category under consideration and the mean vectors of the samples misidentified into each category (step S58). When the division is complete, the mean vector of the samples belonging to each divided category is computed and used as its representative vector. This is done for every similar character category that contains misidentified samples, and the same processing is repeated until no misidentification remains (step S56).

[0067] The division uses the k-means method. The initial cluster centers are the mean vectors of the misidentified samples, collected for each similar character category into which they were misidentified, together with the representative vector of the similar character category under consideration. As a result, the identification planes move as shown on the right side of FIG. 15, and misidentification is reduced.
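
One pass of this division for a single category can be sketched as below. This is an illustrative reading under stated assumptions, not the patent's implementation: the function names and toy data are hypothetical, the k-means step is a plain Lloyd iteration, and the outer loop that repeats the pass over all categories until no misidentification remains (step S56) is omitted.

```python
import math

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def nearest_index(vector, centers):
    return min(range(len(centers)), key=lambda i: math.dist(vector, centers[i]))

def split_category(samples, own_rep, other_reps, max_iter=100):
    """Split one category's samples, seeding k-means from its misidentifications.

    samples:    learning samples of the category under consideration
    own_rep:    representative vector of that category
    other_reps: label -> representative vector of every other category
    Returns the representative vectors of the resulting divided categories.
    """
    # Steps S52-S55: group samples that lie closer to another category's vector.
    confused = {}
    for s in samples:
        label = min(other_reps, key=lambda k: math.dist(s, other_reps[k]))
        if math.dist(s, other_reps[label]) < math.dist(s, own_rep):
            confused.setdefault(label, []).append(s)
    # Step S57: initial centers = own representative + mean of each confused group.
    centers = [own_rep] + [mean(group) for group in confused.values()]
    # Step S58: k-means from those initial centers until assignments settle.
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for s in samples:
            clusters[nearest_index(s, centers)].append(s)
        updated = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers

# Hypothetical toy data: category A leaks toward B and C.
samples_a = [(0.0, 0.0), (0.2, 0.0), (-0.2, 0.0), (0.0, 0.2),
             (3.0, 0.0), (3.2, 0.0), (0.0, 3.0)]
others = {"B": (4.0, 0.0), "C": (0.0, 4.0)}
centers = split_category(samples_a, mean(samples_a), others)
print(len(centers))  # 3
```

In the toy data, the two samples near B and the one near C seed their own clusters, so category A ends up with three divided categories whose representative vectors sit on the respective sample groups.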

【００６８】[0068]

[Effect of the Invention] As described above, in the present invention the similar pattern category identification dictionary has a two-stage structure. The similar pattern categories, once created, are divided, and the representative vectors of the divided categories form the detailed classification dictionary; the divided categories are then reintegrated, and the representative vectors of the reintegrated categories form the large classification dictionary. This yields a similar pattern category identification dictionary that supports two-stage matching, large classification followed by detailed classification, when an unknown pattern is matched. With such a dictionary, the identification rate for the reintegrated categories improves to 99.3%, compared with about 92% for the similar pattern categories created initially, and the amount of computation is about 1/4 or less of that of exhaustive matching, so the identification speed can be improved.

[Brief description of the drawings]

FIG. 1 is a diagram showing the principle of the present invention.

FIG. 2 is a diagram showing a hardware configuration for implementing the present invention.

FIG. 3 is a flowchart showing the flow of the process of creating a similar character category identification dictionary.

FIG. 4 is a flowchart showing the flow of the image input process.

FIG. 5 is an explanatory diagram of peripheral features.

FIG. 6 is a diagram showing examples of tables created by the similar character category generation process, where (A) shows an example of a similar character category table and (B) shows an example of a character code/category correspondence table.

FIG. 7 is a flowchart showing the flow of the category division process.

FIG. 8 is a diagram for explaining the meaning of the end point vector formula in the feature space.

FIG. 9 is a diagram illustrating category division by the k-means method, where (A) shows an example of the first division of a similar character category and (B) shows an example of the further division of a divided similar character category.

FIG. 10 is a diagram showing the data structure of the identification dictionary produced by category division.

FIG. 11 is a flowchart showing the flow of the reintegrated category creation process.

FIG. 12 is a diagram showing the data structure of the large classification dictionary produced by the reintegrated category creation process.

FIG. 13 is a flowchart showing the flow of the similar character category identification process.

FIG. 14 is a flowchart showing the flow of the category division process in the second embodiment.

FIG. 15 is an explanatory diagram of category division in the second embodiment.

FIG. 16 is a diagram illustrating a problem in similar character category identification.

[Explanation of symbols]

1 Similar pattern category creating means
2 Category dividing means
3 Reintegrated category creating means
10 Personal computer
11 Central processing unit (CPU)
12 Main memory
13 Peripheral device controller
14 External storage device
15 Display
16 Keyboard
17 Pointing device
18 Image scanner
19 Network

Claims

[Claims]

1. A similar pattern category identification dictionary creating apparatus for creating a similar pattern category identification dictionary used for matching when a pattern included in image information is identified as belonging to a similar pattern category, the apparatus comprising: similar pattern category creating means for obtaining similarities between patterns from extracted pattern information and creating similar pattern categories by grouping similar patterns; category dividing means for examining the state of misrecognition of learning samples of patterns belonging to the created similar pattern categories and dividing the similar pattern categories based on that state to create a detailed classification identification dictionary composed of the divided categories; and reintegrated category creating means for obtaining representative vectors of the divided categories, reintegrating similar divided categories to create a large classification identification dictionary composed of the reintegrated categories, and thereby creating, together with the detailed classification identification dictionary, a similar pattern category identification dictionary having a hierarchical structure.

2. A similar pattern category identification dictionary creating method for creating a similar pattern category identification dictionary used for matching when a pattern included in image information is identified as belonging to a similar pattern category, the method comprising the steps of: extracting feature values of the patterns included in learning samples and accumulating the learning samples; examining the similarity between the patterns of the learning samples and creating similar pattern categories in which similar patterns are grouped; dividing each similar pattern category, based on the learning samples of the patterns belonging to the created similar pattern categories, to create divided categories; and reintegrating similar divided categories to create reintegrated categories that constitute, together with the divided categories, a hierarchical identification dictionary.

3. The similar pattern category identification dictionary creating method according to claim 2, wherein the step of creating the divided categories includes performing principal component analysis on the samples belonging to a similar pattern category, obtaining end point vectors on the principal component axes, dividing the similar pattern category by clustering when an end point vector is erroneously identified as another similar pattern category, and repeating the clustering until no erroneous identification remains.

4. The similar pattern category identification dictionary creating method according to claim 2, wherein the step of creating the divided categories includes performing principal component analysis on the samples belonging to a similar pattern category, determining the end points of the sample distribution on the principal component axes in the feature space, dividing the samples belonging to the similar pattern category by clustering, starting from the end points with the largest eigenvalues among those erroneously identified as other similar pattern categories, and repeating the division until erroneous identification is eliminated.

5. The similar pattern category identification dictionary creating method according to claim 2, wherein the step of creating the reintegrated categories includes obtaining representative vectors of the divided categories and, if a learning sample is erroneously identified when compared with the previously obtained representative vectors, performing the reintegration by registering the divided category to which the learning sample belongs in the category into which it was misidentified.

6. The similar pattern category identification dictionary creating method according to claim 2, wherein the step of creating the reintegrated categories includes obtaining representative vectors of the divided categories; if a learning sample is erroneously identified when compared with the previously obtained representative vectors, creating reintegrated categories with a large number of categories by registering the divided category to which the learning sample belongs in the category into which it was misidentified; and recursively creating reintegrated categories with a smaller number of categories using the result as input, thereby creating a multi-stage identification dictionary.

7. The similar pattern category identification dictionary creating method according to claim 2, wherein the step of creating the divided categories includes, when learning samples belonging to a similar pattern category are erroneously identified as matching the representative vector of a category other than the one to which they belong, collecting the samples for each erroneously identified category, performing the division with each collection as a new cluster, and repeating the division until no erroneous identification remains.