JP7559762B2

JP7559762B2 - Information processing device, information processing method, and program

Info

Publication number: JP7559762B2
Application number: JP2021545233A
Authority: JP
Inventors: 紘士飯田
Original assignee: Sony Corp; Sony Group Corp
Current assignee: Sony Corp; Sony Group Corp
Priority date: 2019-09-11
Filing date: 2020-09-01
Publication date: 2024-10-02
Anticipated expiration: 2040-09-01
Also published as: JPWO2021049365A1; WO2021049365A1

Description

The present technology relates to an information processing device, an information processing method, and a program applicable to the learning process of a prediction model using machine learning.

Conventionally, techniques for building prediction models using machine learning have been developed. A properly built prediction model enables a variety of prediction processing. A prediction model is built by training it on a large amount of data, but this training can take a long time.

For example, Patent Document 1 describes a system in which hardware resources can be added while the learning process of deep learning is in progress. In this system, a button for adding hardware resources is presented to the user together with the progress of the learning process. This allows the user to add hardware resources and speed up the learning process when, for example, it is progressing slowly (see paragraphs [0030], [0034], and [0035], FIG. 4, etc. of the specification of Patent Document 1).

Patent Document 1: JP 2017-182114 A

In the learning process of a prediction model, increasing the computing resources as described above makes it possible to train on a large amount of data in a short time. On the other hand, if increasing the amount of training data is not expected to improve prediction accuracy, time and money may be wasted. There is therefore a demand for technology that can train prediction models efficiently.

In view of the above circumstances, an object of the present technology is to provide an information processing device, an information processing method, and a program capable of training a prediction model efficiently.

To achieve the above object, an information processing device according to one embodiment of the present technology includes an acquisition unit and an estimation processing unit.
The acquisition unit acquires features of a partial dataset that is a part of an entire dataset used to generate a prediction model.
The estimation processing unit estimates, based on the features of the partial dataset, accuracy information representing the prediction accuracy of the prediction model generated using the entire dataset.

In this information processing device, features of a partial dataset, which is part of the entire dataset, are acquired. Based on these features, accuracy information is estimated that represents the prediction accuracy that would be obtained if the prediction model were generated using the entire dataset. This makes it possible to judge, for example, whether the entire dataset should be used, and thus to generate the prediction model efficiently.

The estimation processing unit may estimate, as the accuracy information, the change in prediction accuracy of the prediction model generated using the entire dataset relative to the prediction accuracy of the prediction model generated using the partial dataset.

The estimation processing unit may be configured using an estimation model that estimates the change in prediction accuracy.

The estimation model may be a model that has learned the relationship between features of a part of a predetermined dataset and the change in prediction accuracy that arises between generating a predetermined prediction model using all of the predetermined dataset and generating it using that part.

The estimation model may be a classification model that classifies the amount of change in prediction accuracy into a plurality of levels.

The estimation model may be a rule-based approximation of a classification model that classifies the amount of change in prediction accuracy into a plurality of levels.

The estimation model may be a regression model that estimates the amount of change in prediction accuracy.
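
The classification and regression variants described above differ mainly in how the estimated change is reported. The following is a minimal sketch, assuming the change is expressed as a numeric improvement amount; the function name and the level boundaries 0.01 and 0.05 are hypothetical and chosen only for illustration, since the source does not specify thresholds or the number of levels.

```python
def improvement_level(alpha, thresholds=(0.01, 0.05)):
    """Map a numeric improvement amount to a discrete level.

    `thresholds` are hypothetical boundaries, not values from the source.
    With the default two thresholds the result is 0 (small), 1 (medium),
    or 2 (large).
    """
    # Each threshold that alpha reaches raises the level by one.
    return sum(alpha >= t for t in thresholds)


# A regression-style estimate is the number itself; a classification-style
# estimate is the level that the number falls into.
levels = [improvement_level(a) for a in (0.002, 0.03, 0.12)]
```

A rule-based approximation of a classifier could take a similar form, with the rules applied to the features of the partial dataset rather than to an already estimated value.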

The features of the partial dataset may include a first feature that depends on the content of the partial dataset. In this case, the acquisition unit may calculate the first feature by analyzing the partial dataset.

The first feature may include at least one of the number of data items included in the partial dataset, the number of features included in each data item, and the ratio between the number of data items and the number of features included in each data item.
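
As an illustration of the first feature, the three quantities listed above can be computed directly from the partial dataset without training any model. This is a sketch under assumptions: the dataset is represented as a list of rows, each row being a list of data feature values with the label excluded, and the function and key names are hypothetical.

```python
def first_meta_features(rows):
    """Compute content-based meta-features of a partial dataset.

    `rows` is assumed to be a list of records, each a list of data feature
    values (the correct label is not included).
    """
    n_data = len(rows)
    n_features = len(rows[0]) if rows else 0
    # Ratio of data count to feature count; defined as 0.0 for an empty schema.
    ratio = n_data / n_features if n_features else 0.0
    return {"n_data": n_data, "n_features": n_features, "data_per_feature": ratio}


# Four records with three data features each (for example age, sex, purchase count).
partial = [[34, 1, 5], [51, 0, 2], [27, 1, 9], [45, 0, 1]]
features = first_meta_features(partial)
```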

The features of the partial dataset may include a second feature that depends on the prediction model generated using the partial dataset. In this case, the acquisition unit may calculate the second feature by executing the generation process of the prediction model using the partial dataset.

The partial dataset may include a plurality of data groups used for different purposes. In this case, the second feature may include at least one of an evaluation value that evaluates, for each of the data groups, the predicted values of the prediction model generated using the partial dataset, and a comparison value obtained by comparing the evaluation values.

The plurality of data groups may include a group of training data, a group of validation data, and a group of test data.
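
One way to obtain the three data groups is to shuffle the partial dataset and cut it by fixed proportions. The sketch below assumes a 60/20/20 split and a fixed random seed; both are illustrative choices that the source does not specify, and `split_groups` is a hypothetical helper name.

```python
import random


def split_groups(rows, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split rows into training, validation, and test groups.

    `ratios` and `seed` are assumptions made for illustration only.
    """
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder goes to the test group
    return train, validation, test


train, validation, test = split_groups(list(range(10)))
```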

The evaluation value may include at least one of a median error, a mean squared error, and a median error rate of the predicted values of the prediction model generated using the partial dataset.

The comparison value may include at least one of a difference and a ratio between the evaluation values calculated for two of the plurality of data groups.
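
The evaluation values and comparison values named above can be sketched as follows. The definitions are assumptions: the error is taken as the absolute difference between the true and predicted value, the error rate as that error divided by the magnitude of the true value (so true values must be nonzero), and the comparison as the second group's value minus or divided by the first group's value. The function names are hypothetical.

```python
from statistics import mean, median


def evaluation_values(y_true, y_pred):
    """Evaluate predicted values for one data group (assumed definitions)."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return {
        "median_error": median(abs_err),
        "mse": mean((t - p) ** 2 for t, p in zip(y_true, y_pred)),
        # Error rate assumes every true value is nonzero.
        "median_error_rate": median(e / abs(t) for e, t in zip(abs_err, y_true)),
    }


def comparison_values(eval_a, eval_b):
    """Difference and ratio of each evaluation value between two groups."""
    out = {}
    for name, a in eval_a.items():
        b = eval_b[name]
        out[name + "_diff"] = b - a
        out[name + "_ratio"] = b / a if a else float("inf")
    return out


# Predictions of the first prediction model on two data groups (made-up numbers).
train_eval = evaluation_values([10, 20, 30], [11, 19, 33])
val_eval = evaluation_values([10, 20, 40], [12, 16, 46])
cmp_vals = comparison_values(train_eval, val_eval)
```

A large gap between the training and validation groups would be one signal available to the estimation model, although how such signals relate to the achievable improvement is learned rather than fixed by this sketch.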

The information processing device may further include a screen generation unit that generates a screen presenting the accuracy information.

The estimation processing unit may estimate, as the accuracy information, the change in prediction accuracy of the prediction model generated using the entire dataset relative to the prediction accuracy of the prediction model generated using the partial dataset. In this case, the screen generation unit may generate at least one of a screen that presents the amount of change in prediction accuracy divided into a plurality of levels and a screen that presents the value of the amount of change in prediction accuracy.

The screen generation unit may generate a selection screen for selecting whether to execute the generation process of the prediction model using the partial dataset. In this case, when execution of the generation process is selected, the acquisition unit may execute the generation process and calculate the features of the partial dataset. The estimation processing unit may then estimate the accuracy information based on the features of the partial dataset.

An information processing method according to one embodiment of the present technology is an information processing method executed by a computer system, and includes acquiring features of a partial dataset that is a part of an entire dataset used to generate a prediction model.
Based on the features of the partial dataset, accuracy information representing the prediction accuracy of the prediction model generated using the entire dataset is estimated.

A program according to one embodiment of the present technology causes a computer system to execute the following steps.
A step of acquiring features of a partial dataset that is a part of an entire dataset used to generate a prediction model.
A step of estimating, based on the features of the partial dataset, accuracy information representing the prediction accuracy of the prediction model generated using the entire dataset.

FIG. 1 is a block diagram showing a configuration example of a model generation system according to an embodiment of the present technology.
FIG. 2 is a block diagram showing a configuration example of the terminal device shown in FIG. 1.
FIG. 3 is a schematic diagram showing an overview of the estimation model generation process.
FIG. 4 is a schematic diagram for explaining an overview of the model generation system.
FIG. 5 is a table showing specific examples of meta-features.
FIG. 6 is a flowchart showing a basic operation example of the model generation system.
FIG. 7 is a schematic diagram showing an example of a setting screen.
FIG. 8 is a schematic diagram showing an example of an interface of a selection area.
FIG. 9 is a schematic diagram showing an example of an evaluation screen for the first prediction model.
FIG. 10 is a schematic diagram showing an example of an interface of a display area for the improvement width α.
FIG. 11 is a time chart showing an example of a learning process that includes computation on the server device.

Embodiments of the present technology are described below with reference to the drawings.

[System Configuration]
FIG. 1 is a block diagram showing a configuration example of a model generation system according to an embodiment of the present technology. The model generation system 100 is a system that generates, using machine learning techniques, a prediction model that performs prediction processing. The prediction model enables predictive analysis of a prediction target.
In the model generation system 100, an application for generating prediction models (hereinafter referred to as a predictive analysis tool) runs. By using the predictive analysis tool, a user can generate a prediction model that performs the desired prediction processing.
The type of prediction model, the prediction target, and so on are not limited and can be set freely by the user.

The model generation system 100 includes a terminal device 10 and a server device 30. The terminal device 10 and the server device 30 are connected via a communication network 31 so that they can communicate with each other.
The terminal device 10 is an information processing device operated directly by the user and functions as the operating terminal of the predictive analysis tool. A PC (Personal Computer) or the like is used as the terminal device 10. Alternatively, a tablet terminal, a smartphone, or the like may be used as the terminal device 10.
The server device 30 is an information processing device remotely connected to the terminal device 10. The server device 30 executes, for example, a predetermined process designated by the terminal device 10 (such as the learning process of a prediction model) and transmits the processing result to the terminal device 10. The server device 30 is, for example, a network server reachable over a predetermined network or a cloud server reachable via the cloud. Here, it is assumed that a pay-as-you-go server device 30 is used.
The communication network 31 is a network that connects the terminal device 10 and the server device 30 so that they can communicate, and is, for example, an Internet connection. Alternatively, a dedicated local network or the like may be used.

FIG. 2 is a block diagram showing a configuration example of the terminal device 10 shown in FIG. 1. The terminal device 10 includes a display unit 11, an operation unit 12, a communication unit 13, a storage unit 14, and a control unit 15.

The display unit 11 is a display that shows various kinds of information, for example the UI (User Interface) screen of the predictive analysis tool. A liquid crystal display (LCD), an organic EL (Electro-Luminescence) display, or the like is used as the display unit 11. The specific configuration of the display unit 11 is not limited; for example, a display equipped with a touch panel that also functions as the operation unit 12 may be used. An HMD (Head Mounted Display) may also be used as the display unit 11.

The operation unit 12 includes an operating device with which the user inputs various kinds of information. A device capable of information input, such as a mouse or a keyboard, is used as the operation unit 12. Beyond this, the specific configuration of the operation unit 12 is not limited. For example, a touch panel or the like may be used as the operation unit 12. A camera or the like that captures the user may also be used as the operation unit 12, enabling input by gaze or gesture.

The communication unit 13 is a module that handles communication between the terminal device 10 and other devices (for example, the server device 30). The communication unit 13 is configured, for example, by a wireless LAN (Local Area Network) module such as Wi-Fi or by a wired LAN module. In addition, a communication module capable of short-range wireless communication such as Bluetooth (registered trademark), optical communication, or the like may be used.

The storage unit 14 is a non-volatile storage device, such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). Beyond this, the type of recording medium used as the storage unit 14 is not limited, and any recording medium that records data non-transitorily may be used.
The storage unit 14 stores a control program 16 according to the present embodiment. The control program 16 is, for example, a program that controls the overall operation of the terminal device 10.

The storage unit 14 also stores a learning dataset 17 used to generate the prediction model. The learning dataset 17 is a dataset containing a plurality of data items used for machine learning of the prediction model. The learning dataset 17 is generated as appropriate for the target (prediction item) of the prediction model 50 and stored in the storage unit 14. When the prediction model is built, the data contained in the learning dataset 17 is read and used as needed.

Each data item in the learning dataset 17 is, for example, a record in which a plurality of attribute values (features) are associated with a corresponding correct label. In this case, a prediction model that predicts the correct-label item can be trained.
For example, suppose customer data is used as the learning dataset 17 to generate a model that predicts which products a customer prefers. In this case, the item in the customer data that indicates the customer's preferred products (for example, products the customer has purchased or viewed) serves as the correct label. Items for other attributes (such as the customer's age, sex, and purchase frequency) serve as input items for training the prediction model.
Beyond this, the type of the learning dataset 17 is not limited, and any dataset suited to the prediction model may be used.

As described later, the terminal device 10 executes processing that uses a part of the learning dataset 17. Hereinafter, a dataset that is a part of the learning dataset 17 is referred to as a partial dataset 18.
The partial dataset 18 consists of, for example, a plurality of data items sampled from the learning dataset 17. The data items that make up the partial dataset 18 are sampled as appropriate, for example each time the partial dataset 18 is needed. Alternatively, the data items that make up the partial dataset 18 may be set in advance.
In the present embodiment, the learning dataset 17 corresponds to the entire dataset used to generate the prediction model, and the partial dataset 18 corresponds to the partial dataset that is a part of the entire dataset.
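
Sampling the partial dataset 18 from the learning dataset 17 can be sketched as follows. The source does not specify the sampling method or size, so uniform random sampling without replacement, a 10% fraction, and the function name are assumptions made for illustration.

```python
import random


def sample_partial_dataset(full_dataset, fraction=0.1, seed=None):
    """Draw a partial dataset by uniform sampling without replacement.

    `fraction` is a hypothetical default. Passing a fixed `seed` yields the
    same subset every time, which resembles a partial dataset set in advance;
    leaving it as None resamples on each call.
    """
    rng = random.Random(seed)
    k = max(1, int(len(full_dataset) * fraction))  # keep at least one record
    return rng.sample(full_dataset, k)


learning_dataset = list(range(1000))          # stand-in for real records
partial_dataset = sample_partial_dataset(learning_dataset, seed=0)
```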

The control unit 15 controls the operation of each block of the terminal device 10. The control unit 15 has the hardware configuration required for a computer, such as a CPU and memory (RAM, ROM). Various processes are executed when the CPU loads a program stored in the storage unit 14 into the RAM and executes it. A PLD (Programmable Logic Device) such as an FPGA (Field Programmable Gate Array), or another device such as an ASIC (Application Specific Integrated Circuit), may be used as the control unit 15.

In the present embodiment, the CPU of the control unit 15 executes the program according to the present embodiment, thereby implementing a UI generation unit 20, a prediction model generation unit 21, a meta-feature calculation unit 22, and an accuracy estimation unit 23 as functional blocks. These functional blocks execute the information processing method according to the present embodiment. Dedicated hardware such as an IC (integrated circuit) may be used as appropriate to implement each functional block.

The UI generation unit 20 generates a UI for exchanging information between the user and the terminal device 10 (or the server device 30). Specifically, the UI generation unit 20 generates the UI screens (see FIGS. 7 and 9) and the like that are displayed on the display unit 11 when the prediction model 50 is generated. These UI screens are the screens of the predictive analysis tool described above.
A UI screen displays, for example, information to be presented to the user and input fields in which the user enters information. While viewing the UI screen, the user can operate the operation unit (a keyboard or the like) to specify various settings and values. The UI generation unit 20 accepts the information that the user specifies through the UI screen in this way.
In the present embodiment, the UI generation unit 20 corresponds to the screen generation unit.

The prediction model generation unit 21 executes the generation process of a prediction model. In the present embodiment, the prediction model generation unit 21 executes the generation process using the partial dataset 18. This process can be executed in a shorter time than the generation process using the whole of the learning dataset 17. The generation process using the whole of the learning dataset 17 is executed by, for example, the server device 30.
Hereinafter, the prediction model generated using the partial dataset 18 is referred to as the first prediction model, and the prediction model generated using the whole of the learning dataset 17 is referred to as the second prediction model.

The generation process of a prediction model includes the series of processes required to build the prediction model. For example, in the predictive analysis tool, the generation process includes, as appropriate, a learning process that trains the prediction model, a verification process that checks the state of the prediction model (such as its learning tendency), and a test process that confirms the prediction accuracy of the prediction model.
Accordingly, the prediction model generation unit 21 executes the learning process, the verification process, the test process, and so on, each using the partial dataset 18.
The machine learning algorithm used for the prediction model is not limited, and any algorithm suited to the processing performed by the prediction model may be used. The present technology is applicable regardless of the type of algorithm.
Hereinafter, the generation process of a prediction model may be referred to simply as the learning process.

The meta-feature calculation unit 22 acquires features of the partial dataset 18, which is a part of the learning dataset 17 used to generate the prediction model.
Here, the features of the partial dataset 18 are features that represent properties of the partial dataset 18 itself. Hereinafter, such features of a dataset itself are referred to as meta-features. That is, the meta-feature calculation unit 22 acquires the meta-features of the partial dataset 18.
Note that meta-features differ from the attribute values recorded in the data items that make up the partial dataset 18 (hereinafter referred to as data features). For example, characteristics of the dataset itself, such as the number of data items it contains or the number of data features, are meta-features.

The meta-features include features obtained by analyzing the partial dataset 18 (first features). These features are calculated by analyzing the partial dataset 18.
The meta-features also include features obtained by actually using the partial dataset 18 (second features). These features are calculated using the prediction model generated by the prediction model generation unit 21 described above.
The meta-features are described in detail later with reference to FIG. 5 and other figures.
In the present embodiment, the prediction model generation unit 21 and the meta-feature calculation unit 22 cooperate to implement the acquisition unit.

The accuracy estimation unit 23 estimates, based on the features (meta-features) of the partial dataset 18, accuracy information representing the prediction accuracy of the prediction model generated using the learning dataset 17 (the second prediction model).
The accuracy information is information that can express the prediction accuracy of the second prediction model. By referring to this accuracy information, it is possible to judge, for example, how much prediction accuracy can be achieved if the second prediction model is built using the whole of the learning dataset 17.

In the present embodiment, the accuracy estimation unit 23 estimates, as the accuracy information, the change in prediction accuracy of the prediction model generated using the learning dataset 17 (the second prediction model) relative to the prediction accuracy of the prediction model generated using the partial dataset 18 (the first prediction model).
In machine learning, prediction accuracy is generally expected to improve as the amount of training data increases. However, increasing the amount of data does not necessarily improve prediction accuracy sufficiently.
The accuracy estimation unit 23 takes the first prediction model trained on the partial dataset 18 as the baseline and estimates the improvement in prediction accuracy expected when the second prediction model is generated using the whole of the learning dataset 17. This improvement in prediction accuracy corresponds to the change in prediction accuracy described above.

The accuracy estimation unit 23 is configured using an estimation model that estimates the change in prediction accuracy. The estimation model is a learned model trained to take the meta-features of the partial dataset 18 as input and output the change in prediction accuracy of the second prediction model relative to the first prediction model. The accuracy estimation unit 23 can thus also be regarded as a module that implements the estimation model (an estimation module).
The estimation model is built, for example, by learning from the meta-features of a large number of datasets available on the web and elsewhere. The method for generating the estimation model is described in detail later.

In the predictive analysis tool, an estimation model (estimation module) that estimates the improvement in prediction accuracy is configured first. The estimation model is generated, for example, to match the type of prediction model. Alternatively, a general-purpose estimation model that can handle different types of prediction models may be generated. The data of the estimation model is, for example, stored in the storage unit 14 in advance and is read and used as needed each time the accuracy estimation unit 23 operates.
Using the estimation model configured in this way, the accuracy estimation unit 23 estimates, for the learning dataset 17 actually in use, the improvement in prediction accuracy (the change in prediction accuracy) that would result from using the whole dataset for training.

[Estimation model generation process]
FIG. 3 is a schematic diagram showing an overview of the process for generating an estimation model. A method for generating the estimation model 40, which estimates the improvement in prediction accuracy, is described below with reference to FIG. 3.
In recent years, many data sets that can be used for machine learning have become available from the web and similar sources. Information learned from the meta-features of these data sets can be used to predict the properties of a new data set.
In this embodiment, a large number of such existing data sets are used to construct the estimation model 40. Hereinafter, the data sets used to construct the estimation model 40 are referred to as estimation data sets.

For example, suppose that a prediction model (hereinafter referred to as an estimation prediction model) is generated using an estimation data set. In this case, the prediction accuracy differs between a model trained on a part of the estimation data set and a model trained on the whole estimation data set. The estimation model 40, which estimates the improvement in prediction accuracy, is constructed by having it learn this difference in prediction accuracy for each of a plurality of estimation data sets.
The estimation prediction model may be set arbitrarily, for example in accordance with the estimation data set.
In this embodiment, the estimation data set corresponds to a predetermined data set, and the estimation prediction model corresponds to a predetermined prediction model.

As shown in FIG. 3, the process of generating the estimation model uses sets of input data 25 and answer data 26 corresponding to the input data 25. One set of input data 25 and answer data 26 is generated for each estimation data set. The number of estimation data sets (that is, the number of sets of input data 25 and answer data 26) used in the generation process is, for example, on the order of several hundred.

The input data 25 consists of meta-features of an estimation data set. Specifically, the meta-features of a part (for example, 10%) of the target estimation data set are used as the input data 25.
Examples of the meta-features included in the input data 25 include the number of data items, the number of data features, and the prediction evaluation values for the training data (train), validation data (validation), and test data (test) described later. The number and types of these meta-features are set to match the meta-features that the estimation model 40 actually refers to when estimating the improvement (see FIG. 5).
The input data 25 is generated by the same method that the meta-feature calculation unit 22 uses to calculate the meta-features of the partial data set 18.

The answer data 26 is the correct label for the item that the estimation model 40 is to learn (the improvement in prediction accuracy). Specifically, the correct label is the difference (improvement) between the prediction accuracy of the estimation prediction model trained on a part (for example, 10%) of the estimation data set and the prediction accuracy of the estimation prediction model trained on the entire estimation data set.
The answer data 26 is therefore calculated by actually training the estimation prediction model on the part and on the whole of the estimation data set. Note that some of the input data 25 described above is calculated from the estimation prediction model generated at this time.

In the process of generating the estimation model 40, the input data 25 and the answer data 26 described above are generated. That is, for each of a plurality of estimation data sets, the meta-features (input data 25) and the improvement in prediction accuracy actually obtained by training on all the data (answer data 26) are calculated.
Machine learning is then executed based on the input data 25 and answer data 26 calculated in this manner. Specifically, learning processing is executed using the improvement in prediction accuracy as the correct label and the meta-features as the features.
This process can also be regarded as learning the characteristics of data sets for which using all the data yields a large improvement in prediction accuracy. In other words, the estimation model 40 learns the characteristics (meta-features) of data sets whose prediction accuracy improves when the amount of data is increased. As a result, a trained estimation model 40 that estimates the improvement in prediction accuracy from the meta-features of a data set is constructed.
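The pairing of input data 25 and answer data 26 described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of this disclosure: `build_estimation_examples`, `train_and_score`, and `meta_features` are hypothetical names, the 10% default follows the example given in the text, and each data set is assumed to be shuffled already so that its leading rows form a fair sample.

```python
from typing import Any, Callable, List, Sequence, Tuple


def build_estimation_examples(
    datasets: Sequence[Sequence[Any]],
    train_and_score: Callable[[Sequence[Any]], Tuple[float, Any]],
    meta_features: Callable[[Sequence[Any], Any], List[float]],
    percent: int = 10,
) -> List[Tuple[List[float], float]]:
    """Return one (input data, answer data) pair per estimation data set.

    train_and_score(data) is a hypothetical callback that trains the
    estimation prediction model on `data` and returns (accuracy, model).
    meta_features(part, model) is a hypothetical callback that returns the
    meta-features of the partial data and of the model trained on it.
    """
    examples = []
    for data in datasets:
        n_part = max(1, len(data) * percent // 100)
        part = data[:n_part]  # assumption: rows are already shuffled
        acc_part, model_part = train_and_score(part)
        acc_full, _ = train_and_score(data)
        # Answer data: gain from training on everything instead of the part.
        examples.append((meta_features(part, model_part), acc_full - acc_part))
    return examples
```

The resulting pairs could then be passed to any supervised learner, with the first element as the features and the second as the correct label.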

In this way, the estimation model 40 is a model that has learned the relationship between the features of a part of an estimation data set and the change in prediction accuracy that arises between training the estimation prediction model on that part and training it on the entire estimation data set.
By using the estimation model 40, the improvement in prediction accuracy of a prediction model can be estimated accurately and easily even when an unknown learning data set 17 is used.
The estimation model 40 may be the learning model obtained by learning from the meta-features and correct labels, or it may be a model that approximates that learning model. The types of estimation model 40 are described below.

For example, the estimation model 40 is a classification model that classifies the amount of change in prediction accuracy into a plurality of levels. In this case, the correct label (answer data 26) is, for example, a binary label for each level representing the amount of change in prediction accuracy.
Each level represents the degree to which prediction accuracy changes when training on the entire data set compared with training on only a part of it. For example, correct labels are set for three levels: prediction accuracy "improves significantly (5% or more)", "improves to some extent (2-5%)", or "hardly improves (less than 2%)". This makes it possible to estimate the improvement in prediction accuracy in multiple stages.
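As a small illustration of the three-level labeling described above, the following sketch maps an improvement (in percentage points) to a level and to one binary label per level. The function names and the short level names are hypothetical, and assigning exactly 5% to the top level and exactly 2% to the middle level is an assumption based on the "5% or more" and "less than 2%" wording.

```python
LEVELS = ("significant", "moderate", "negligible")


def improvement_level(delta_percent: float) -> str:
    """Map an improvement in percentage points to one of three levels."""
    if delta_percent >= 5.0:
        return "significant"  # "improves significantly (5% or more)"
    if delta_percent >= 2.0:
        return "moderate"     # "improves to some extent (2-5%)"
    return "negligible"       # "hardly improves (less than 2%)"


def binary_labels(delta_percent: float) -> dict:
    """Return one binary correct label per level (1 for the matching level)."""
    level = improvement_level(delta_percent)
    return {name: int(name == level) for name in LEVELS}
```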

As another example, the estimation model 40 may be a rule-based approximation of a classification model that classifies the amount of change in prediction accuracy into a plurality of levels. In this case, the estimation model 40 is a rule-based classifier that simplifies the classification model.
For example, the final estimation model 40 is obtained by approximating the classification model described above with a predetermined algorithm. Algorithms that can be used for the approximation include a decision tree, a random forest that randomly combines decision trees, and RuleFit, which replaces the processing of the classification model with a set of rules.
Using a rule-based model reduces the amount of computation and the time required to estimate the improvement. It also makes it possible to explain the estimation process in a way that users can understand.

As yet another example, the estimation model 40 may be a regression model that estimates the amount of change in prediction accuracy. In this case, the correct label (answer data 26) is set, for example, to the value of the amount of change in prediction accuracy (for example, an improvement of X%).
In this manner, an estimation model 40 that directly regresses the amount of change (improvement) in prediction accuracy as a specific numerical value may be constructed. This makes it possible to present the user with a specific estimate of the improvement in prediction accuracy.
Other than this, the specific configuration of the estimation model 40 is not limited.

[Overview of the model generation system]
FIG. 4 is a schematic diagram for explaining an overview of the model generation system 100. It schematically illustrates the flow of processing from estimating the improvement in prediction accuracy using the estimation model 40 described above to presenting the estimation result.
FIG. 4 includes a process for generating a prediction model 50 (first prediction model 51) (step 1), a process for calculating meta-features (step 2), a process for estimating the improvement (step 3), and a process for presenting a UI (step 4). These are described in order below.

[Prediction model generation process]
When the improvement in prediction accuracy is to be estimated, a process of generating the first prediction model 51 using the partial data set 18 is executed. This is a preliminary generation process executed before training on the entire learning data set 17 (the process of generating the second prediction model 52).
Specifically, the prediction model generation unit 21 samples a part of the learning data set 17 (the partial data set 18). Machine learning is then executed using this partial data set 18.

In this generation process, the partial data set 18 is divided into a plurality of data groups that are used for different purposes. In other words, the partial data set 18 can be said to include a plurality of data groups with different uses.
Each data group includes at least one data item, and each group is used for a different purpose. The method for setting the data groups is not limited.

In this embodiment, the plurality of data groups are a group of training data, a group of validation data, and a group of test data.
The training data is the data used in the learning process of the prediction model 50, that is, the data from which the prediction model 50 actually learns (trains). The more training data there is, the more the accuracy of the prediction model 50 tends to improve.
The validation data is the data used in a validation process that checks the learning state (learning tendency, etc.) of the prediction model 50. The validation data can therefore be regarded as data for checking the learning of the prediction model 50.
The test data is the data used in a test process that confirms the final prediction accuracy, etc., of the prediction model 50 trained on the training data. The test data can therefore be regarded as data for evaluating the prediction model 50.
Depending on the type of learning or the settings, the validation data may be unnecessary. In this case, the validation data group may be omitted.
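The sampling of the partial data set 18 and its division into data groups might look like the sketch below. The text does not fix a sampling method or split ratios, so uniform random sampling, the 60/20/20 split, and the names `sample_partial` and `split_groups` are all assumptions made for illustration; the 10% default echoes the example fraction used elsewhere in this description.

```python
import random
from typing import List, Sequence, Tuple


def sample_partial(dataset: Sequence, percent: int = 10, seed: int = 0) -> List:
    """Randomly sample `percent` % of the learning data set without replacement."""
    k = max(1, len(dataset) * percent // 100)
    return random.Random(seed).sample(list(dataset), k)


def split_groups(
    partial: Sequence, train_pct: int = 60, val_pct: int = 20
) -> Tuple[List, List, List]:
    """Split the partial data set into training, validation, and test groups."""
    items = list(partial)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    # Whatever remains after the training and validation groups is test data.
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )
```

Setting `val_pct=0` corresponds to the case where the validation data group is omitted.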

The prediction model generation unit 21 uses these data groups to execute the learning process, validation process, test process, and so on described above as appropriate. As a result, a trained prediction model 50 (the first prediction model 51) that has learned from the partial data set 18 is generated.
The information on each data group and the data of the first prediction model 51 are output to the meta-feature calculation unit 22.

[Meta-feature calculation process]
The meta-feature calculation unit 22 calculates the meta-features F of the partial data set 18 used to generate the first prediction model 51.
First, the data required to calculate the meta-features is read as appropriate. Specifically, each data group included in the partial data set 18 and the first prediction model 51 generated using the partial data set 18 are read. FIG. 4 schematically shows the training data and test data groups of the partial data set 18 and the first prediction model 51. Although not shown, the validation data group is also read as appropriate.
In this way, for the data set whose improvement in prediction accuracy is of interest (the learning data set 17), the partial data set 18 sampled from it (training data, validation data, and test data) and the first prediction model 51 trained on the partial data set 18 are prepared.
The meta-features F calculated from these data are described in detail below.

FIG. 5 is a table showing specific examples of meta-features. For a plurality of meta-features, FIG. 5 shows the item of each meta-feature and its specific content. These meta-features are used, for example, when a regression model is used as the prediction model.
In the following description, the meta-features are numbered F1 to F16. Note that the table shown in FIG. 5 is an example, and the number and types of meta-features are not limited.

The meta-features of the partial data set 18 include first features that depend on the content of the partial data set 18. The first features are features of the partial data set 18 itself.
In this embodiment, the meta-feature calculation unit 22 calculates the first features by analyzing the partial data set 18. In the table shown in FIG. 5, the meta-features F1 to F4 and F9 correspond to the first features.

The meta-feature F1 (number of data) is the number of data items included in the partial data set 18. For example, the total number of data items included in the partial data set 18 is calculated as the meta-feature. Alternatively, the total number of training data items included in the partial data set 18 may be used.
The meta-feature F2 (number of features) is the number of features (data features) included in the data of the partial data set 18. For example, the total number of data features set for each data item is calculated as the meta-feature. When the number (types) of features differs from one data item to another, the cumulative total or a similar value may be calculated.
The meta-feature F3 (number of features/number of data) is the ratio between the number of data items included in the partial data set 18 and the number of data features included in the data. For example, the value obtained by dividing the meta-feature F2 by the meta-feature F1 is calculated as a new meta-feature.
The meta-feature F4 (number of features after expansion) is the number of data features used for the training data after predetermined preprocessing has been completed. For example, when preprocessing such as one-hot encoding is performed, dummy variables are introduced, so the number of data features changes before and after the processing. The total number of data features after this processing is calculated as the meta-feature.
The meta-feature F9 (variance of correct values) is the variance of the prediction target label (correct label). For example, a measure of dispersion (for example, the standard deviation) of the values of the prediction target label predicted by the regression model is calculated as the meta-feature.
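The first features can be computed directly from the data, as in the following sketch. It assumes each data item is a dictionary from feature name to value, that one-hot encoding creates one dummy column per distinct category (no column is dropped), and that F9 is taken as the population variance; the function name and these choices are illustrative assumptions rather than details given in FIG. 5.

```python
from statistics import pvariance
from typing import Dict, Sequence, Set


def first_features(
    rows: Sequence[Dict[str, object]],
    targets: Sequence[float],
    categorical: Set[str],
) -> Dict[str, float]:
    """Compute the meta-features F1 to F4 and F9 for a partial data set."""
    n_data = len(rows)                                   # F1: number of data
    names = sorted({name for row in rows for name in row})
    n_features = len(names)                              # F2: number of features
    expanded = 0                                         # F4: after one-hot expansion
    for name in names:
        if name in categorical:
            # One dummy column per distinct category value.
            expanded += len({row[name] for row in rows if name in row})
        else:
            expanded += 1
    return {
        "F1": n_data,
        "F2": n_features,
        "F3": n_features / n_data,                       # F3: F2 divided by F1
        "F4": expanded,
        "F9": pvariance(targets),                        # F9: variance of correct values
    }
```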

The meta-features of the partial data set 18 also include second features that depend on the prediction model 50 (first prediction model 51) generated using the partial data set 18. In other words, the second features are features obtained by actually using the partial data set 18.
In this embodiment, the second features are calculated when the prediction model generation unit 21 executes the process of generating the first prediction model 51 using the partial data set 18. In the table shown in FIG. 5, the meta-features F5 to F8 and F10 to F16 correspond to the second features.

Here, evaluation values are used as second features. An evaluation value evaluates the predicted values that the first prediction model 51, generated using the partial data set 18, produces for each of the plurality of data groups. That is, an evaluation value is a parameter that can evaluate the predicted values output from the first prediction model 51 when a given data group (the training data, validation data, or test data group) is input.
Parameters used to evaluate the predicted values include, for example, the median error (MAE: Mean Absolute Error), the mean squared error (RMSE: Root Mean Squared Error), and the median error rate (MAPE: Mean Absolute Percentage Error) of the predicted values. Alternatively, the variance of the predicted values or the like may be used as an evaluation value. The parameters used as evaluation values are not limited, and other indices may be used.

The meta-feature F5 (change in the median error of the test data according to the number of iterations) is the amount of change in the median error for the test data over the iteration process. The iteration process is, for example, a process (cross-validation) in which the prediction accuracy of the model is verified multiple times while changing how the test data is selected, which reduces evaluation bias caused by the choice of test data. Specifically, the difference between the median error at half the number of iterations at which the process converged and the finally converged median error is calculated as the meta-feature. The meta-feature F5 is an example of an evaluation value for the test data.
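One way to read the F5 definition is sketched below. It assumes that a per-iteration history of the test-data error is available and that its last entry is the converged value; the function name and the choice of index for "half the number of iterations" are assumptions made for illustration.

```python
from typing import Sequence


def f5_error_change(error_history: Sequence[float]) -> float:
    """Difference between the error at the halfway iteration and the final error.

    error_history[i] is the test-data error after iteration i + 1, and the
    last entry is taken as the converged value (an assumption of this sketch).
    """
    n = len(error_history)
    if n < 2:
        raise ValueError("at least two iterations are needed")
    half = n // 2  # iteration count at the halfway point
    return error_history[half - 1] - error_history[-1]
```

With a history of six iterations, the value after the third iteration is compared with the final value.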

The meta-feature F6 (median error of training/validation/test data) is the value of the median error (MAE) for each of the training data, validation data, and test data groups when predictions are made using the trained first prediction model 51.
The meta-feature F7 (mean squared error of training/validation/test data) is the value of the mean squared error (RMSE) for each of the training data, validation data, and test data groups when predictions are made using the trained first prediction model 51.
The meta-feature F8 (median error rate of training/validation/test data) is the value of the median error rate (MAPE) for each of the training data, validation data, and test data groups when predictions are made using the trained first prediction model 51.
The meta-feature F10 (variance of predicted values) is the variance (or standard deviation, etc.) of the predicted values output by the trained first prediction model 51.
All or some of these evaluation values may be used.
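The evaluation values F6 to F8 could be computed per data group as in the sketch below. The formulas follow the acronym expansions given in the text (Mean Absolute Error, Root Mean Squared Error, Mean Absolute Percentage Error), which is an assumption about how the terms are meant; `evaluation_values` and `predict` are hypothetical names, and MAPE is assumed to be applied only where no true value is zero.

```python
from math import sqrt
from typing import Callable, Dict, Sequence, Tuple


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluation_values(
    groups: Dict[str, Tuple[Sequence, Sequence[float]]],
    predict: Callable[[object], float],
) -> Dict[str, Dict[str, float]]:
    """Evaluate a trained model on each data group (name -> (inputs, true values))."""
    result = {}
    for name, (xs, ys) in groups.items():
        preds = [predict(x) for x in xs]
        result[name] = {
            "F6_mae": mae(ys, preds),
            "F7_rmse": rmse(ys, preds),
            "F8_mape": mape(ys, preds),
        }
    return result
```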

Comparison values obtained by comparing the evaluation values described above are also used as second features. A comparison value is a value obtained by comparing, between groups, the evaluation values calculated for the individual data groups (the training data, validation data, and test data groups).
Specifically, at least one of the difference and the ratio between the evaluation values calculated for two of the plurality of data groups is used as a comparison value.

The meta-feature F11 (difference in median error between training data and test data) is the difference between the median error for the training data and the median error for the test data.
The meta-feature F12 (ratio of median errors between training data and test data) is the ratio between the median error for the training data and the median error for the test data.
The meta-feature F13 (difference in median error between validation data and test data) is the difference between the median error for the validation data and the median error for the test data.
The meta-feature F14 (ratio of median errors between validation data and test data) is the ratio between the median error for the validation data and the median error for the test data.
The meta-feature F15 (difference in median error between training data and validation data) is the difference between the median error for the training data and the median error for the validation data.
The meta-feature F16 (ratio of median errors between training data and validation data) is the ratio between the median error for the training data and the median error for the validation data.

These meta-features F11 to F16 are calculated, for example, from the results for the meta-feature F6 described above.
The reference used when calculating the difference and the ratio may be set arbitrarily. For example, for the meta-feature F11, the median error for the test data may be subtracted from the median error for the training data, or vice versa. Alternatively, the absolute value of the difference may be used. Similarly, for the meta-feature F12, the ratio may be calculated by dividing the median error for the training data by the median error for the test data, or vice versa.
Instead of the median error, comparison values based on the mean squared error or the median error rate may be used as meta-information.
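Given the per-group F6 values, the comparison values F11 to F16 reduce to a few subtractions and divisions. The sketch below fixes one direction (first-named group minus, or divided by, the second-named group), which is just one of the options the text allows; `comparison_values` and the dictionary keys are hypothetical names, and the error values are assumed to be non-zero so that the ratios are defined.

```python
from typing import Dict


def comparison_values(error_by_group: Dict[str, float]) -> Dict[str, float]:
    """Compute F11 to F16 from the errors of the three data groups.

    error_by_group must contain the keys "train", "validation", and "test".
    """
    tr = error_by_group["train"]
    va = error_by_group["validation"]
    te = error_by_group["test"]
    return {
        "F11": tr - te,  # training vs. test, difference
        "F12": tr / te,  # training vs. test, ratio
        "F13": va - te,  # validation vs. test, difference
        "F14": va / te,  # validation vs. test, ratio
        "F15": tr - va,  # training vs. validation, difference
        "F16": tr / va,  # training vs. validation, ratio
    }
```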

In this way, the median error and similar values can be calculated given the trained first prediction model 51 and the partial data set 18 used to train it. All the other features can likewise be calculated from the sampled training data, validation data, test data, and the first prediction model 51.
Note that most of these values are already calculated in the course of creating the first prediction model 51, so no additional calculation is required for them.

[Improvement estimation process]
Returning to FIG. 4, the accuracy estimation unit 23 calculates the improvement α in the prediction accuracy of the prediction model 50 based on the meta-features F calculated as described above.
The improvement α is estimated using the estimation model 40, which was constructed by learning from meta-features to estimate the improvement in prediction accuracy (see FIG. 3). Specifically, the meta-features F of the partial data set 18 are input to the estimation model 40 as input data. A calculation using the estimation model 40 is then executed, and a classification or a numerical value of the improvement α is output.

For example, when the estimation model 40 is a classification model or a rule-based model approximating a classification model, a classification result in which the improvement α is classified into a plurality of levels is output. In this case, the output values are the predicted probabilities for the respective levels, such as "improves significantly (5% or more)", "improves to some extent (2-5%)", and "hardly improves (less than 2%)". That is, values such as the probability that the improvement will be 5% or more are calculated.
When the estimation model 40 is a regression model, on the other hand, the value of the improvement α in prediction accuracy is estimated directly by solving a regression problem. In this case, the output value is a value that specifically represents the improvement α (for example, α = 4%).
The output of the estimation model 40 is passed to the UI generation unit 20.
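The two output forms could be turned into a short message for display as sketched below. The function name, the message wording, and the representation of the classification output as a dictionary from level to probability are assumptions for illustration only.

```python
from typing import Dict, Union


def summarize_estimate(output: Union[Dict[str, float], float]) -> str:
    """Format the estimation model output for presentation.

    A dict is treated as per-level probabilities (classification case);
    a number is treated as the improvement in percent (regression case).
    """
    if isinstance(output, dict):
        level = max(output, key=output.get)  # most probable level
        return f"{level} (p={output[level]:.2f})"
    return f"expected improvement: {output:.1f}%"
```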

[UI presentation process]
The UI generation unit 20 displays the estimated improvement α in prediction accuracy. Specifically, the UI generation unit 20 generates a screen that presents the estimation result for the improvement α in prediction accuracy (the change in prediction accuracy). The generated screen is then displayed on the display unit 11.
As a result, the user is shown the improvement α in prediction accuracy expected when the prediction model 50 (second prediction model 52) is generated using the entire learning data set 17, which helps the user decide whether or not to generate the second prediction model 52.

In this way, the model generation system 100 (predictive analysis tool) can train in a short time using a part of the learning data set 17 and, from the information obtained, estimate how much the prediction accuracy would improve if the entire data set were used for training. In other words, by training only once on a part of the data set, the improvement α in prediction accuracy that would result from using all the data for training can be estimated.

The present inventor actually constructed an estimation model 40 that estimates the improvement α in prediction accuracy and verified its accuracy. As a result, the AUC (an evaluation index for classification problems) of the estimation model 40 that classifies the improvement α was 0.75, showing that the improvement α can be predicted with high accuracy. This means that data sets whose prediction accuracy will improve can be properly predicted from the meta-features.

また本発明者は、実際の実験結果から、データ数を増やした際に精度が向上するデータセットの傾向についての知見を得た。具体的には、学習データとテストデータに対する予測値の評価指標（例えば上記した評価値）に大きな差があるデータセットほど、全データで学習した時に精度の向上幅が大きい傾向にあることを見出した。例えば学習データとテストデータの評価指標の差は、予測モデル５０がどの程度学習データに過学習しているかを表す指標となっており、これらの差が大きい場合にはデータ数の増加と共に精度の向上を見込めることが多い。 The inventors also obtained knowledge from actual experimental results about the tendency of datasets for which accuracy improves when the amount of data is increased. Specifically, they found that the greater the difference in the evaluation index (e.g., the above-mentioned evaluation value) of the predicted value for the training data and the test data, the greater the improvement in accuracy when training with all the data. For example, the difference in the evaluation index between the training data and the test data is an index that indicates the degree to which the prediction model 50 has overfitted to the training data, and when the difference between these is large, it is often possible to expect an improvement in accuracy as the amount of data increases.

従ってメタ特徴量の中でも、学習データとテストデータに対する予測値の評価指標（評価値）や、評価指標を比較した値（比較値）は、特に重要な特徴量となる。このようなメタ特徴量を入力とする推定モデル４０を用いることで、予測精度の向上幅αを精度よく推定することが可能となる。Therefore, among meta-features, the evaluation index (evaluation value) of the predicted value for the training data and the test data and the value obtained by comparing the evaluation index (comparison value) are particularly important features. By using an estimation model 40 that uses such meta-features as input, it becomes possible to accurately estimate the improvement range α of the prediction accuracy.
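
As a minimal sketch of the evaluation-value and comparison-value meta-features described above, the following Python computes an evaluation index on the training data and on the test data and then takes their difference. The function names (`auc`, `evaluation_meta_features`), the choice of AUC as the evaluation index, and the use of a simple difference as the comparison value are assumptions made for illustration; the source does not specify how the estimation model 40 computes these inputs.

```python
from typing import Dict, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def evaluation_meta_features(
    train_labels: Sequence[int],
    train_scores: Sequence[float],
    test_labels: Sequence[int],
    test_scores: Sequence[float],
) -> Dict[str, float]:
    """Evaluation values for training and test data plus a comparison value.

    The comparison value here is a plain difference (an assumption); a large
    gap suggests the model has overfitted to the training data.
    """
    train_auc = auc(train_labels, train_scores)
    test_auc = auc(test_labels, test_scores)
    return {
        "train_auc": train_auc,
        "test_auc": test_auc,
        "auc_gap": train_auc - test_auc,
    }
```

A dictionary of this kind could serve as part of the meta-feature input; any other evaluation index could be substituted for AUC in the same way.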

［モデル生成システムの基本動作］
図６は、モデル生成システムの基本的な動作例を示すフローチャートである。図６に示す処理は、例えば端末装置１０を使用するユーザが予測分析ツールで予測モデル５０を生成する際に実行される処理である。
まず、予測モデル５０の各設定値が読み込まれる（ステップ１０１）。具体的には、ＵＩ生成部２０により、予測モデル５０に関する設定画面が生成され、表示部１１に出力される。そしてユーザが設定画面を介して入力した内容（設定値）が読み込まれる。
[Basic operation of the model generation system]
FIG. 6 is a flowchart showing an example of the basic operation of the model generation system. The process shown in FIG. 6 is executed, for example, when a user of the terminal device 10 generates a prediction model 50 with the predictive analysis tool.
First, each setting value of the prediction model 50 is read (step 101). Specifically, the UI generating unit 20 generates a setting screen related to the prediction model 50 and outputs it to the display unit 11. Then, the contents (setting values) input by the user via the setting screen are read.

図７は、設定画面の一例を示す模式図である。図７に示すように、設定画面３５には、複数の設定欄が設けられる。ここでは、商品の購入記録を含む顧客データを学習データセット１７として、商品の購入の有無を予測する予測モデル５０を生成する場合について説明する。 Figure 7 is a schematic diagram showing an example of a settings screen. As shown in Figure 7, the settings screen 35 is provided with a plurality of setting fields. Here, a case will be described in which customer data including product purchase records is used as a learning dataset 17 to generate a predictive model 50 that predicts whether or not a product is purchased.

「入力項目」の設定欄（画面右側）では、学習データセット１７に含まれる項目のうち、予測モデル５０の学習に用いる項目（データ特徴量）を指定可能である。ここでは、顧客に関する"年齢"、"性別"、"顧客ランク"、"過去購入額"、"クーポン利用回数"、"メールアドレス登録"、"オプション購入"等の項目が選択可能に提示される。また、各項目についてのデータタイプやユニーク数等が合わせて表示される。 In the "Input Items" setting field (right side of the screen), it is possible to specify which of the items included in the learning dataset 17 are used to train the prediction model 50 (data features). Here, customer-related items such as "age", "gender", "customer rank", "past purchase amount", "number of coupons used", "email address registration", and "option purchase" are presented as selectable items. In addition, the data type, the number of unique values, and the like are displayed for each item.

「予測タイプ」の設定欄では、予測モデル５０のタイプを指定可能である。ここでは、「二値分類」、「多値分類」、「数値予測」（回帰予測）の項目が選択可能に表示される。ここでは、二値分類が予測モデル５０のタイプとして選択される。
「予測値」の設定欄では、予測モデル５０の予測対象（対象項目）を指定することが可能である。ここでは、"購入あり"及び"購入なし"の項目のうち"購入あり"が予測対象として選択される。なお項目名（"購入あり"及び"購入なし"）の隣には、学習データセット１７における各項目の割合が表示される。 In the "Prediction type" setting field, the type of prediction model 50 can be specified. Here, the items "binary classification", "multiple-value classification", and "numerical prediction" (regression prediction) are displayed as selectable items. Here, binary classification is selected as the type of prediction model 50.
In the "Predicted value" setting field, it is possible to specify the prediction target (target item) of the prediction model 50. Here, of the items "Purchase made" and "No purchase", "Purchase made" is selected as the prediction target. Note that the percentage of each item in the learning dataset 17 is displayed next to the item names ("Purchase made" and "No purchase").

設定画面３５において点線で示したエリアは、部分データセット１８を用いた学習を選択するための選択エリア３６である。図７に示す例では、選択エリア３６には、「使用するデータの割合」の設定欄が設けられる。この設定欄では、部分データセット１８として用いられるデータの割合をいくつかの候補から選択して指定することが可能である。
例えば学習データセット１７に対する部分データセット１８の割合が０％～１００％の範囲で選択可能に提示される（ここでは１０％が選択される）。このＵＩでは、部分データセット１８の割合が０％より大きい有限値である場合、部分データセット１８を用いた学習が選択されることになる。なお、部分データセット１８の割合が０％である場合には、部分データセット１８を用いた学習は選択されない。 The area indicated by the dotted line on the setting screen 35 is a selection area 36 for selecting learning using the partial dataset 18. In the example shown in Fig. 7, a setting field for "proportion of data to be used" is provided in the selection area 36. In this setting field, it is possible to select and specify the proportion of data to be used as the partial dataset 18 from several candidates.
For example, the ratio of the partial data set 18 to the learning data set 17 is presented so that it can be selected within the range of 0% to 100% (10% is selected here). In this UI, when the ratio of the partial data set 18 is a finite value greater than 0%, learning using the partial data set 18 is selected. When the ratio is 0%, learning using the partial data set 18 is not selected.

各設定欄に必要な情報を入力した後で、"学習及び評価を実行"と書かれた実行ボタンを押すと、予測モデル５０についての学習処理等が開始される。また"キャンセル"と書かれたキャンセルボタンを押すと、設定画面３５での各設定値の入力がキャンセルされ、ひとつ前の画面が表示される。 After entering the necessary information in each setting field, pressing the execute button labeled "Execute learning and evaluation" will start the learning process for the predictive model 50. Pressing the cancel button labeled "Cancel" will cancel the input of each setting value on the setting screen 35 and display the previous screen.

図８は、選択エリア３６のインターフェースの一例を示す模式図である。
図８Ａに示す選択エリア３６には、「使用するデータの割合」の設定欄が設けられる。この設定欄では、部分データセット１８として用いられるデータの割合を０％～１００％の範囲で自由に入力して指定することが可能である。この場合、入力値が０よりも大きい場合に、部分データセット１８を用いた学習が選択される。 FIG. 8 is a schematic diagram showing an example of an interface of the selection area 36.
The selection area 36 shown in FIG. 8A is provided with a setting field for "proportion of data to be used." In this setting field, the proportion of data to be used as the partial data set 18 can be freely entered within the range of 0% to 100%. In this case, when the input value is greater than 0, learning using the partial data set 18 is selected.

図８Ｂに示す選択エリア３６には、「学習モード」の設定欄が設けられる。この設定欄では、"クイックモード"という項目と"全データで学習"という項目とがそれぞれ選択可能に提示される。
ここで、クイックモードとは、部分データセット１８を用いた学習を行い、本番の学習の前に予測精度の向上幅を短時間で算出するモードである。クイックモードでは、例えば予め設定されたデフォルトの割合で部分データセット１８が選択されて用いられる。なお部分データセット１８の割合が選択可能であってもよい。このように、学習モードを選択させることで、部分データセット１８での学習の有無が設定されてもよい。 The selection area 36 shown in FIG. 8B is provided with a setting field for "learning mode." In this setting field, the items "quick mode" and "learning with all data" are presented as selectable items.
Here, the quick mode is a mode in which learning is performed using the partial data set 18, and the improvement in prediction accuracy is calculated in a short time before actual learning. In the quick mode, for example, the partial data set 18 is selected and used at a preset default ratio. Note that the ratio of the partial data set 18 may be selectable. In this way, by having the user select the learning mode, it may be set whether or not learning is performed using the partial data set 18.

図８Ｃに示す選択エリア３６には、「学習を行う端末」の設定欄が設けられる。この設定欄では、"この端末で学習"という項目と"クラウド上で学習"という項目とがそれぞれ選択可能に提示される。
"この端末で学習"という項目は、部分データセット１８（一部データ）を用いた学習を端末装置１０で実行する場合に選択される。また"クラウド上で学習"という項目は、全ての学習データセット１７を用いた学習をサーバ装置３０で実行する場合に選択される。このように、学習処理を行う装置を選択させることで、部分データセット１８での学習の有無が設定されてもよい。 The selection area 36 shown in FIG. 8C is provided with a setting field for "terminal on which learning is performed." In this setting field, the items "learn on this terminal" and "learn on the cloud" are presented as selectable items.
The item "learn on this terminal" is selected when learning using the partial data set 18 (part of the data) is executed on the terminal device 10. The item "learn on the cloud" is selected when learning using the entire learning data set 17 is executed on the server device 30. In this way, whether or not to learn using the partial data set 18 may be set by having the user select the device that performs the learning process.

このように、ＵＩ生成部２０は、部分データセット１８を用いた第１の予測モデル５１の生成処理の実行を選択するための設定画面３５を生成する。本実施形態では、設定画面３５は、選択画面に相当する。
これにより、ユーザは部分データセット１８での学習を行うか否か、すなわち向上幅αを推定するか否かを適宜選択することが可能となる。 In this manner, the UI generation unit 20 generates the setting screen 35 for selecting execution of the generation process of the first prediction model 51 using the partial data set 18. In this embodiment, the setting screen 35 corresponds to a selection screen.
This allows the user to appropriately select whether or not to perform learning with the partial data set 18, that is, whether or not to estimate the improvement range α.
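
The ratio-based selection rule described for FIG. 7 and FIG. 8A can be sketched as follows. This is an illustrative sketch only: the function names, the use of uniform random sampling, and the fixed seed are assumptions, since the source states only that a ratio greater than 0% selects learning with the partial data set 18.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partial_learning_selected(ratio_percent: float) -> bool:
    """Learning with the partial data set is selected when the ratio exceeds 0%."""
    if not 0.0 <= ratio_percent <= 100.0:
        raise ValueError("ratio must be within 0% to 100%")
    return ratio_percent > 0.0


def sample_partial_dataset(rows: Sequence[T], ratio_percent: float, seed: int = 0) -> List[T]:
    """Draw the partial data set by uniform random sampling (an assumption)."""
    if not rows or not partial_learning_selected(ratio_percent):
        return []
    # Keep at least one row, and never more rows than exist.
    count = min(len(rows), max(1, round(len(rows) * ratio_percent / 100.0)))
    return random.Random(seed).sample(list(rows), count)
```

With 1,000 rows and a ratio of 10%, this sketch draws 100 rows; with a ratio of 0% it returns an empty list, which corresponds to partial learning not being selected.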

図６に戻り、設定画面３５から入力された設定値が読み込まれると、部分データセット１８での学習処理を開始するか否かが判定される（ステップ１０２）。
例えば選択エリア３６に表示されたＵＩにおいて、部分データセット１８での学習を行う旨が選択されたとする。この状態で、図７に示す実行ボタンが押された場合、部分データセット１８での学習を行うと判定され（ステップ１０２のＹｅｓ）、部分データセット１８での学習及びその学習結果を用いた向上幅αの推定処理が開始される。
また例えば、部分データセット１８での学習を行う旨が選択されていない状態で実行ボタンが押された場合、部分データセット１８での学習は行わないと判定され（ステップ１０２のＮｏ）、後述するステップ１０７が実行される。 Returning to FIG. 6, when the setting values inputted from the setting screen 35 are read, it is determined whether or not to start the learning process for the partial data set 18 (step 102).
For example, assume that learning with the partial data set 18 is selected in the UI displayed in the selection area 36. In this state, when the execute button shown in Fig. 7 is pressed, it is determined that learning with the partial data set 18 is to be performed (Yes in step 102), and learning with the partial data set 18 and a process of estimating the improvement range α using the learning result are started.
Also, for example, if the execute button is pressed when learning with partial data set 18 has not been selected, it is determined that learning with partial data set 18 will not be performed (No in step 102), and step 107 described below is executed.

部分データセット１８での学習を行うと判定された場合、予測モデル生成部２１により、部分データセット１８を用いた第１の予測モデル５１の生成処理が実行される（ステップ１０３）。この処理は、図４を参照して説明したステップ１の予測モデルの生成処理に相当する。
例えば、設定画面３５の設定値で選択された入力項目から予測値を出力するようなモデルが構成され、部分データセット１８を用いた学習処理・検証処理・テスト処理等が実行され、学習済みの第１の予測モデル５１が構築される。 When it is determined that learning is to be performed using the partial data set 18, the prediction model generation unit 21 executes a process of generating a first prediction model 51 using the partial data set 18 (step 103). This process corresponds to the prediction model generation process of step 1 described with reference to FIG. 4.
For example, a model is constructed that outputs a predicted value from the input item selected by the setting value of the setting screen 35, and learning processing, verification processing, test processing, etc. are performed using the partial dataset 18, and a learned first prediction model 51 is constructed.

第１の予測モデル５１が構築されると、メタ特徴量算出部２２により、部分データセット１８のメタ特徴量Ｆが算出される（ステップ１０４）。この処理は、図４を参照して説明したステップ２のメタ特徴量の算出処理に相当する。
例えば、第１の予測モデル５１のデータと、その学習に用いられた部分データセット１８とが読み込まれ、既に用意されている推定モデル４０の入力となるメタ特徴量Ｆがそれぞれ算出される。 Once the first prediction model 51 is constructed, the meta-feature calculation unit 22 calculates the meta-feature F of the partial data set 18 (step 104). This process corresponds to the meta-feature calculation process of step 2 described with reference to FIG. 4.
For example, data of the first prediction model 51 and the partial dataset 18 used for learning the first prediction model 51 are read, and meta-features F that serve as inputs to the estimation model 40 that has already been prepared are calculated.

このように、本実施形態では、第１の予測モデル５１の生成処理の実行が選択された場合に、その生成処理を実行して部分データセット１８のメタ特徴量Ｆが算出される。
In this manner, in this embodiment, when execution of the generation process of the first prediction model 51 is selected, the generation process is executed and the meta-feature F of the partial data set 18 is calculated.

メタ特徴量Ｆが算出されると、精度推定部２３により、部分データセット１８のメタ特徴量Ｆに基づいて予測精度の向上幅α（精度情報）が推定される（ステップ１０５）。この処理は、図４を参照して説明したステップ３の向上幅の推定処理に相当する。
例えば、推定モデル４０に対して前のステップで算出された各メタ特徴量Ｆが入力され、予測精度の向上幅αの分類レベルや値が算出される。 After the meta-feature F is calculated, the accuracy estimation unit 23 estimates an improvement range α (accuracy information) of the prediction accuracy based on the meta-feature F of the partial data set 18 (step 105). This process corresponds to the improvement range estimation process of step 3 described with reference to FIG. 4.
For example, each meta-feature F calculated in the previous step is input to the estimation model 40, and the classification level and value of the improvement range α of the prediction accuracy are calculated.
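
When the improvement range α is handled as a classification level rather than a value, the mapping might look like the sketch below. The level names follow the "large, medium, small" example given elsewhere in the description; the function name and both threshold values are hypothetical placeholders, not values taken from the source.

```python
def improvement_level(
    alpha: float,
    medium_threshold: float = 0.01,  # hypothetical boundary between small and medium
    large_threshold: float = 0.03,   # hypothetical boundary between medium and large
) -> str:
    """Map an estimated improvement range alpha to a coarse level."""
    if medium_threshold > large_threshold:
        raise ValueError("medium_threshold must not exceed large_threshold")
    if alpha >= large_threshold:
        return "large"
    if alpha >= medium_threshold:
        return "medium"
    return "small"
```

In practice the thresholds would need to be chosen to match how the estimation model 40 was trained; the values above only illustrate the shape of the mapping.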

向上幅αが算出されると、ＵＩ生成部２０により向上幅αを提示する画面が生成される（ステップ１０６）。この処理は、図４を参照して説明したステップ４のＵＩの提示処理に相当する。
本実施形態では、第１の予測モデル５１の評価結果とともに、向上幅αを提示する評価画面が生成され、表示部１１に表示される。 When the improvement range α is calculated, the UI generating unit 20 generates a screen for presenting the improvement range α (step 106). This process corresponds to the UI presentation process of step 4 described with reference to FIG. 4.
In this embodiment, an evaluation screen presenting the improvement range α together with the evaluation result of the first prediction model 51 is generated and displayed on the display unit 11.

図９は、第１の予測モデル５１に関する評価画面の一例を示す模式図である。
図９に示す評価画面３７の左側には、モデルの選択エリア３６が設けられる。選択エリアには、既に評価を行った第１の予測モデル５１が、その評価値、生成日時、使用したデータ名とともに、選択可能に提示される。
また評価画面３７の右側には、選択エリア３６で選択された第１の予測モデル５１の予測精度のレベルを示す「予測精度レベル」の表示欄と、「項目の寄与度」の表示欄とが設けられる。また評価画面３７の右側には、向上幅αの推定結果を提示する表示エリア３８が設けられる。 FIG. 9 is a schematic diagram showing an example of an evaluation screen for the first prediction model 51.
A model selection area 36 is provided on the left side of the evaluation screen 37 shown in Fig. 9. In the selection area, first prediction models 51 that have already been evaluated are presented in a selectable manner together with their evaluation values, generation dates and times, and the names of the data used.
Also provided on the right side of the evaluation screen 37 are a display field for "prediction accuracy level" indicating the level of prediction accuracy of the first prediction model 51 selected in the selection area 36, and a display field for "contribution of item." Also provided on the right side of the evaluation screen 37 is a display area 38 presenting the estimated result of the improvement range α.

「予測精度レベル」の表示欄には、第１の予測モデル５１の性能を示す評価指標として、例えばＲＯＣ（Receiver Operating Characteristic）曲線のＡＵＣ（Area Under the Curve）が表示される。ＡＵＣは、分類モデルの分類精度を示す指標である。この他、評価指標に関連する説明項目（モデルの精度についてのコメント等）が表示される。
また「項目の寄与度」の表示欄には、分類に影響した項目ごとの寄与度を示す棒グラフが表示される。これにより、例えば"購入あり"という分類に影響した項目や、"購入なし"という分類に影響した項目を、項目間で比較することが可能となる。 In the "Prediction accuracy level" display field, for example, the AUC (Area Under the Curve) of a ROC (Receiver Operating Characteristic) curve is displayed as an evaluation index indicating the performance of the first prediction model 51. The AUC is an index indicating the classification accuracy of a classification model. In addition, explanatory items related to the evaluation index (comments on the accuracy of the model, etc.) are displayed.
In addition, the "Item Contribution" display field displays a bar graph showing the contribution of each item that influenced the classification. This makes it possible to compare items, for example, items that influenced the classification "Purchase" and items that influenced the classification "No Purchase".

図９に示す表示エリア３８には、向上幅αについて説明するテキストが提示される。ここでは、第１の予測モデル５１の学習に用いられたデータ（部分データセット１８）の割合とともに、全データ（全ての学習データセット１７）を用いることで期待される予測精度の向上幅αを提示する説明分が用いられる。ここでは、向上幅αを具体値（Ｘ％）で提示する説明文が用いられているが、向上幅αを複数のレベル（例えば大、中、小等）にわけて提示する説明分が用いられてもよい。
このように、説明文等の文章で推定結果を提示することで、全ての学習データセット１７を用いた学習を行うべきか否かについてのアドバイスを明示的に行うことが可能となる。 In the display area 38 shown in FIG. 9, text explaining the improvement range α is presented. Here, an explanatory sentence is used that presents the proportion of the data (partial data set 18) used to train the first prediction model 51 together with the improvement range α of the prediction accuracy expected from using all the data (all of the learning data sets 17). Although the explanatory sentence here presents the improvement range α as a specific value (X%), an explanatory sentence that presents the improvement range α divided into a plurality of levels (e.g., large, medium, and small) may also be used.
In this way, by presenting the estimation results in text such as an explanatory sentence, it is possible to explicitly give advice as to whether or not learning should be performed using all of the learning datasets 17.

図１０は、向上幅αの表示エリア３８のインターフェースの一例を示す模式図である。
図１０Ａに示す表示エリア３８には、全ての学習データセット１７を用いて行われる第２の予測モデル５２の生成処理を実行する実行ボタン３９が設けられる。そして実行ボタン３９の近くに第２の予測モデル５２の生成処理に要する処理時間と、期待される予測精度の向上幅αとが提示される。
これより、ユーザは、向上幅αと処理時間とを参照して、第２の予測モデル５２の生成処理を行うか否かを判断することが可能となる。また、実行ボタン３９を選択することで、そのまま全ての学習データセット１７を用いた生成処理が開始可能であるため、設定値を再度入力する必要等はない。 FIG. 10 is a schematic diagram showing an example of an interface of a display area 38 for the improvement width α.
The display area 38 shown in FIG. 10A is provided with an execute button 39 for executing the process of generating the second prediction model 52 using all of the learning data sets 17. The processing time required for generating the second prediction model 52 and the expected improvement range α of the prediction accuracy are presented near the execute button 39.
This allows the user to refer to the improvement range α and the processing time to determine whether or not to perform the generation process of the second prediction model 52. In addition, by selecting the execute button 39, the generation process using all the learning data sets 17 can be started as is, so there is no need to re-enter the setting values.

図１０Ｂに示す表示エリア３８には、向上幅αの推定結果がそのまま提示される。ここでは、複数のレベルに分類された向上幅αの推定結果が文字データを用いて提示される。
推定結果を表す方法は限定されず、例えば複数のレベルを表すグラフィックス等を用いて向上幅αのレベルが表されてもよい。あるいは向上幅αの値を表すゲージやグラフ等が用いられてもよい。
これより、ユーザは、向上幅αのレベルや値を容易に把握することが可能となる。
このように、ＵＩ生成部２０は、予測精度の向上幅αを複数のレベルにわけて提示する評価画面３７や、予測精度の向上幅αの値を提示する評価画面３７を生成する。 In the display area 38 shown in FIG. 10B, the estimation result of the improvement range α is presented as it is. Here, the estimation result of the improvement range α classified into a plurality of levels is presented using character data.
The method of representing the estimation result is not limited. For example, the level of the improvement range α may be represented using graphics or the like that represent a plurality of levels. Alternatively, a gauge, a graph, or the like that represents the value of the improvement range α may be used.
This allows the user to easily grasp the level or value of the improvement range α.
In this manner, the UI generating unit 20 generates the evaluation screen 37 that presents the improvement range α of the prediction accuracy divided into a plurality of levels, and the evaluation screen 37 that presents the value of the improvement range α of the prediction accuracy.
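
The two presentation styles described for the display area 38 (a specific value or a level) can be sketched as a small text generator. The function name, parameter names, and message wording are assumptions made for illustration; the source does not give the actual text shown on the evaluation screen 37.

```python
from typing import Optional


def improvement_message(
    used_ratio_percent: float,
    alpha_percent: Optional[float] = None,
    level: Optional[str] = None,
) -> str:
    """Build explanatory text presenting alpha as a value or as a level."""
    prefix = f"This model was trained on {used_ratio_percent:g}% of the data. "
    if alpha_percent is not None:
        # Value style: present the expected improvement as X%.
        return prefix + (
            "Training on all data is expected to improve prediction accuracy "
            f"by about {alpha_percent:g}%."
        )
    if level is not None:
        # Level style: present the expected improvement as large, medium, small, etc.
        return prefix + f"The expected improvement from training on all data is {level}."
    raise ValueError("either alpha_percent or level is required")
```

The same estimation result could equally be rendered as a gauge or graph; the text form is simply the easiest to show in a few lines.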

図６に戻り、向上幅α（評価画面３７）が提示されると、全ての学習データセット１７（全データセット）での学習処理を開始するか否かが判定される（ステップ１０７）。すなわち、第２の予測モデル５２を生成するか否かが判定される。
例えば、向上幅αが高くユーザが全ての学習データセット１７での学習を選択した場合（ステップ１０７のＹｅｓ）、全ての学習データセット１７での学習や評価が開始される（ステップ１０８）。本実施形態では、例えばサーバ装置３０に学習データセット１７及び設定値等が出力され、サーバ装置３０により第２の予測モデル５２を生成する一連の処理が実行される。あるいは、端末装置１０の予測モデル生成部２１により、第２の予測モデル５２が生成されてもよい。なお、第２の予測モデル５２が生成された後は、その評価画面等が表示される。
また例えば、向上幅αが低くユーザが全ての学習データセット１７での学習を選択しなかった場合（ステップ１０７のＹｅｓ）、予測モデル５０を生成する処理が終了する。 Returning to FIG. 6, when the improvement range α (evaluation screen 37) is presented, it is determined whether or not to start the learning process using all of the learning data sets 17 (the entire data set) (step 107). In other words, it is determined whether or not to generate the second prediction model 52.
For example, if the improvement range α is high and the user selects learning with all the learning data sets 17 (Yes in step 107), learning and evaluation with all the learning data sets 17 is started (step 108). In this embodiment, for example, the learning data set 17 and the setting values are output to the server device 30, and the server device 30 executes a series of processes for generating the second prediction model 52. Alternatively, the second prediction model 52 may be generated by the prediction model generation unit 21 of the terminal device 10. After the second prediction model 52 is generated, an evaluation screen or the like is displayed.
Also, for example, if the improvement range α is low and the user does not select learning with all of the learning data sets 17 (No in step 107), the process of generating the prediction model 50 ends.
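
The control flow of FIG. 6 (steps 101 to 108) can be summarized in the following sketch. The function and parameter names, the settings key `partial_ratio_percent`, and the use of injected callables in place of the units 20 to 23 and the server device 30 are all assumptions; the sketch only mirrors the branching described above and returns the sequence of steps that would be executed.

```python
from typing import Callable, Dict, List, Optional


def run_model_generation(
    settings: Dict[str, float],
    train_partial: Callable[[float], object],
    compute_meta_features: Callable[[object], Dict[str, float]],
    estimate_improvement: Callable[[Dict[str, float]], float],
    user_wants_full_training: Callable[[Optional[float]], bool],
    train_full: Callable[[], object],
) -> List[str]:
    """Return the labels of the executed steps, following FIG. 6."""
    executed = ["101"]  # step 101: setting values have been read
    executed.append("102")  # step 102: is learning with the partial data set selected?
    alpha: Optional[float] = None
    ratio = settings.get("partial_ratio_percent", 0.0)
    if ratio > 0.0:
        model = train_partial(ratio)             # step 103: first prediction model
        executed.append("103")
        features = compute_meta_features(model)  # step 104: meta-features
        executed.append("104")
        alpha = estimate_improvement(features)   # step 105: improvement range alpha
        executed.append("105")
        executed.append("106")                   # step 106: present alpha to the user
    # On "No" in step 102, processing jumps directly to step 107.
    executed.append("107")
    if user_wants_full_training(alpha):
        train_full()                             # step 108: second prediction model
        executed.append("108")
    return executed
```

Passing simple lambdas for each callable is enough to trace both branches of step 102 and step 107.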

これにより、例えば大規模なデータセットから学習する際にローカルＰＣ（端末装置１０）を長時間占有することなく、またクラウド上の従量課金制サーバ（サーバ装置３０）等を長時間使用することなく、短時間で予測モデル５０と予測精度及び全データセットで学習した際の精度の目安を知ることが可能となる。これにより、予測モデル５０を効率的に生成することが可能となる。 This makes it possible to know the prediction model 50 and its prediction accuracy, as well as an estimate of the accuracy when learning from the entire data set, in a short time, for example, without occupying a local PC (terminal device 10) for a long time when learning from a large data set, and without using a pay-as-you-go server (server device 30) on the cloud for a long time. This makes it possible to efficiently generate the prediction model 50.

本技術に係るモデル生成システム１００（予測分析ツール）の適用例について具体的な事例を挙げて説明する。
［適用例１］
大規模データで予測モデルを学習させる際に、有用な特徴量の組合せを特定した後に、クラウド上の従量課金制のサーバ装置３０を用いて全データでの学習を実行する事例。
ここでは、保険会社において、顧客がどのような保険商品を好むか予測する予測モデル５０を構築するものとする。 An application example of the model generation system 100 (predictive analysis tool) according to the present technology will be described below using a specific example.
[Application Example 1]
In this example, when training a predictive model using large-scale data, after identifying a useful combination of features, training is performed on all data using a pay-as-you-go server device 30 on the cloud.
Here, it is assumed that an insurance company constructs a prediction model 50 that predicts what type of insurance products customers will prefer.

このケースでは、例えば学習データセット１７として用いる顧客データは、顧客の入金や保険商品に対する手続きのログが含まれる膨大なデータである。このため、全ての学習データセット１７を用いて学習を行うと６時間～１２時間程度の時間がかかってしまい、業務時間内に完了させるといったことが難しい。
さらに、手続きのログの種類は数百種類あり多岐に渡る。このため、学習に用いる特徴量として設定する行動の種類（特徴量の組合せ）を特定することが難しい。 In this case, for example, the customer data used as the learning dataset 17 is a huge amount of data that includes logs of customer deposits and procedures for insurance products. For this reason, if learning is performed using all of the learning dataset 17, it takes about 6 to 12 hours, making it difficult to complete the process within business hours.
Furthermore, there are hundreds of different types of procedure logs, making it difficult to specify the type of behavior (combination of features) to set as features for learning.

例えば、ユーザにより、普段の業務の仮説に基づいて、入力データとして用いる特徴量の組合せが１０パターン程度用意される。その上で、モデル生成システム１００を用いて、ユーザが用意した各パターンについて、部分データセット１８での予備的な学習が実行される。これにより、各パターンについて、全ての学習データセット１７で学習した時の予測精度（向上幅α）が推定される。
この予備的な学習（図６のステップ１０３等）は、学習データセット１７からサンプリングされた部分データセット１８に対して行うため、１回あたり３０分程度で完了する。１０パターンの特徴量の組合せを３０分おきに学習することで、業務時間内に特に有用な特徴量の組合せのパターンを３つ程度に絞ることが可能である。 For example, a user prepares about 10 patterns of combinations of features to be used as input data based on hypotheses about daily work. Then, using the model generation system 100, preliminary learning is performed on the partial data set 18 for each pattern prepared by the user. As a result, the prediction accuracy (improvement range α) when learning with all the learning data sets 17 is estimated for each pattern.
This preliminary learning (e.g., step 103 in FIG. 6) is performed on the partial data set 18 sampled from the learning data set 17, and therefore takes only about 30 minutes per run. By running the 10 feature-combination patterns one after another at 30-minute intervals, the particularly useful patterns can be narrowed down to about three within business hours.

上記の予備的な学習で絞り込んだ３つ程度の有用な特徴量の組合せのパターンの各々について、全ての学習データセット１７を用いて第２の予測モデル５２が生成される。この処理は、例えば夜間や土日といった時間を利用して、従量課金制のサーバ装置３０を用いて長時間かけて実行される。
例えば、翌朝（あるいは週明け）に出勤したユーザにより、サーバ装置３０で学習させた第２の予測モデル５２の学習結果が確認される。そして、予測精度等が最も優れたモデルが、最終的に使用するモデルとして決定される。 For each of the approximately three useful feature combination patterns narrowed down in the above preliminary learning, a second prediction model 52 is generated using all of the learning data sets 17. This process is executed over a long period of time, for example, during nighttime or on weekends, using a pay-as-you-go server device 30.
For example, when a user comes to work the next morning (or at the beginning of the next week), the learning result of the second prediction model 52 trained by the server device 30 is confirmed. Then, the model with the highest prediction accuracy, etc. is determined to be the model to be finally used.

このように、モデル生成システム１００を用いることで、見込みのあるパラメータの候補等を短時間で絞り込むことが可能である。これにより、業務時間を無駄にすることなく、予測モデル５０を効率的に生成することが可能となる。In this way, by using the model generation system 100, it is possible to narrow down promising parameter candidates in a short time. This makes it possible to efficiently generate the predictive model 50 without wasting work time.
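
The narrowing-down step in Application Example 1 can be sketched as a simple ranking over the estimated accuracy of each feature-combination pattern. The function names and the dictionary-based input are assumptions; the source does not say how the user compares the patterns, only that about three are kept.

```python
from typing import Dict, List


def shortlist_patterns(estimated_accuracy: Dict[str, float], keep: int = 3) -> List[str]:
    """Return the names of the `keep` patterns with the highest estimated accuracy."""
    ranked = sorted(estimated_accuracy.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:keep]]


def preliminary_minutes(pattern_count: int, minutes_per_run: int = 30) -> int:
    """Total time for running the preliminary learning once per pattern."""
    return pattern_count * minutes_per_run
```

With 10 patterns at about 30 minutes each, the preliminary runs total about 300 minutes (5 hours), whereas 10 full runs at 6 to 12 hours each would take 60 to 120 hours.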

［適用例２］
顧客のログから顧客がサービスに払う金額が予測出来るかどうか試行錯誤する事例。
ここでは、ウェブ上で提供されるサービスにおいて、顧客が１カ月の間にサービスで使用する金額を予測する予測モデルを構築するものとする。このような予測モデルを構築することで、例えば、使用金額の少ないユーザに対してクーポンを発行するといった対策を行うことが可能となり、顧客にサービスの使用を促すことが可能になると期待される。
[Application Example 2]
An example of trial and error to see if it is possible to predict the amount a customer will pay for a service from customer logs.
Here, it is assumed that, for a service provided on the web, a prediction model is constructed that predicts the amount of money a customer will spend on the service in one month. Building such a prediction model makes it possible, for example, to take measures such as issuing coupons to users who spend little, which is expected to encourage customers to use the service.

このケースでは、例えばアクセス時間等を記録した顧客のログデータが学習データセット１７として用いられる。
なおログデータは、膨大なデータ量であり、またノイズも多く混ざっていると考えられる。このため、ログデータを学習したからと言って、そこから顧客が使用する金額を実際に予測出来るどうかは不明である。
一方で、顧客の使用する金額は、サービスにおいてＫＰＩ（Key Performance Indicator）の１つである。このため、もし顧客の使用する金額が予測可能であるならば、ビジネス的な価値は大きく、可能な限り予測モデルの構築を試みるものとする。 In this case, customer log data that records, for example, access times, etc., is used as the learning data set 17 .
However, log data is a huge amount of data and is thought to contain a lot of noise. For this reason, it is unclear whether learning from log data can actually predict the amount of money a customer will spend.
On the other hand, the amount of money used by customers is one of the KPIs (Key Performance Indicators) for a service. Therefore, if the amount of money used by customers can be predicted, it will have great business value, and we will try to build a prediction model as much as possible.

例えば、数十ギガバイトのログデータが存在し、全てのログデータ（学習データセット１７）を使用して学習を行うと、学習処理に数日程度時間を要する事が判明したとする。
そこで、モデル生成システム１００を用いて、まずは全てのログデータのうち、サンプリングした一部のログデータ（部分データセット１８）で学習処理等が実行される。この学習処理は、例えば６時間程度で行われる。 For example, assume that there are several tens of gigabytes of log data, and it is found that if learning is performed using all of the log data (learning dataset 17), the learning process will take several days.
Therefore, using the model generation system 100, the learning process and the like are first executed on a sampled portion of the log data (partial data set 18). This learning process takes, for example, about six hours.

部分データセット１８による学習結果を参照すると、第１の予測モデル５１での誤差率中央値が１２０％となり、十分な予測精度が出ていないことが判明した。さらに、全てのデータセットを使用した時の精度も、誤差率中央値が１００％程度になると示唆されており、期待した精度が得られないことが判明した。
このような場合は、データ数を増やしてローカルの端末装置１０やクラウド上のサーバ装置３０で処理を実行したとしても、時間や費用が無駄になってしまう。このため、ログデータから顧客の使用する金額を予測する予測モデルの構築は断念されることになる。 With reference to the learning results using the partial data set 18, it was found that the median error rate for the first prediction model 51 was 120%, indicating that sufficient prediction accuracy was not achieved. Furthermore, it was suggested that the accuracy when all data sets were used would also have a median error rate of approximately 100%, indicating that the expected accuracy could not be achieved.
In such a case, even if the amount of data is increased and processing is performed on the local terminal device 10 or the cloud server device 30, it will be a waste of time and money. For this reason, the construction of a prediction model for predicting the amount of money used by customers from log data is abandoned.
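
For reference, a median error rate of the kind quoted above could be computed as in the sketch below. The definition used here (the median of |predicted - actual| / |actual|, expressed in percent, skipping rows whose actual value is zero) is an assumption; this passage does not define the metric.

```python
from statistics import median
from typing import Sequence


def median_error_rate_percent(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Median of per-row relative errors, in percent (assumed definition)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Rows with an actual value of zero are skipped to avoid division by zero.
    rates = [abs(p - a) / abs(a) * 100.0 for a, p in zip(actual, predicted) if a != 0]
    if not rates:
        raise ValueError("no rows with a non-zero actual value")
    return median(rates)
```

Under this definition, a value of 100% or more means that the typical prediction is off by at least the size of the true amount itself.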

上記したように、顧客が使用する金額を直接予測できないことがわかった。そこで問題設定を変更し、顧客が１カ月に１０００円以上のお金をサービスで支払うかどうかを分類する二値分類を行う予測モデルについて検討した。
具体的には、モデル生成システム１００を用いて、上記の二値分類を行う予測モデルについて、全てのログデータからサンプリングした一部のデータセットで学習したところ、ＡＵＣが０．６５となった。さらに、全てのログデータで学習することで、ＡＵＣが０．７まで上がることが示唆されたとする。
これは、実用可能な精度であるため、実際にクラウド上のサーバ装置３０を用いて全てのログデータを用いて学習処理を実行し、ＡＵＣが０．７１の予測モデルが得られた。 As described above, it was found that the amount of money a customer spends cannot be predicted directly. Therefore, the problem setting was changed, and a prediction model was considered that performs binary classification of whether or not a customer pays 1,000 yen or more per month for the service.
Specifically, when the model generation system 100 was used to train a prediction model that performs the above binary classification using a partial data set sampled from all log data, the AUC was 0.65. Furthermore, it was suggested that the AUC could be increased to 0.7 by training using all log data.
Since this is a practically feasible level of accuracy, a learning process was actually performed using all the log data using the server device 30 on the cloud, and a prediction model with an AUC of 0.71 was obtained.

このように、問題設定（予測モデルのターゲット）を顧客が１カ月に１０００円以上のお金をサービスで支払うかどうかの二値分類に変更する事で、実用可能な予測が出来る事がわかった。これにより、例えば月に１０００円以下しかサービスにお金を払わない確率の高い顧客に対して、クーポンや割引等を発行し、顧客の消費金額を促す施策を開始することが可能となる。
この適用例では、問題設定を試行錯誤して適切な問題設定を見つける間に、モデル生成システム１００が用いられる。これにより、実際に全てのデータを使った学習を行わなくても予測精度が推定されるため、不要な学習時間や費用を費やすことなくモデルを構築することが可能となっている。 In this way, it was found that practical predictions can be made by changing the problem setting (the target of the prediction model) to a binary classification of whether or not a customer pays 1,000 yen or more per month for the service. This makes it possible, for example, to start measures that encourage spending, such as issuing coupons or discounts to customers who are likely to pay only 1,000 yen or less per month for the service.
In this application example, the model generation system 100 is used while finding an appropriate problem setting through trial and error. This allows prediction accuracy to be estimated without actually performing learning using all the data, making it possible to build a model without spending unnecessary learning time and money.
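
The change of problem setting in Application Example 2, from predicting the amount itself to classifying whether the amount reaches a threshold, amounts to relabeling the target. A minimal sketch follows; the function and parameter names are assumptions, and the threshold is treated as inclusive, following the original wording `１０００円以上` (1,000 yen or more).

```python
from typing import List, Sequence


def to_binary_target(monthly_spend_yen: Sequence[float], threshold_yen: float = 1000.0) -> List[int]:
    """Label 1 when the monthly spend is at or above the threshold, otherwise 0."""
    return [1 if spend >= threshold_yen else 0 for spend in monthly_spend_yen]
```

The resulting labels can then be used as the prediction target of a binary classification model evaluated with AUC, as in the example.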

［適用例３］
大規模データでの学習を行う際に、初めにローカルの端末装置１０を用いて全ての学習データセット１７で学習した時の精度を見積もり、実用可能な見込みが得られた場合に従量課金制のサーバ装置３０で全ての学習データセット１７での学習を実行する事例。
[Application Example 3]
A case in which, when training on large-scale data, the accuracy that would be obtained by training on all of the learning data sets 17 is first estimated using the local terminal device 10, and if the estimate indicates practical feasibility, training on all of the learning data sets 17 is executed on the pay-as-you-go server device 30.

例えば、端末装置１０を用いて、部分データセット１８での予測モデルの学習が行われる。その後、学習結果等が提示され、全ての学習データセット１７で学習したときに実用に耐えうる予測精度が出るかどうかがユーザにより確認される。例えば、全ての学習データセット１７で学習するとＡＵＣが０．７２であることが予測されたとする。この場合、期待される予測精度は実用に達しているとして、クラウド上のサーバ装置３０を用いて、全ての学習データセット１７を使用した学習処理を行うことが決定される。For example, a predictive model is trained on a partial data set 18 using the terminal device 10. After that, the learning results etc. are presented, and the user checks whether a practical prediction accuracy is obtained when learning is performed on all of the learning data sets 17. For example, assume that it is predicted that the AUC will be 0.72 when learning is performed on all of the learning data sets 17. In this case, it is determined that the expected prediction accuracy has reached a practical level, and a decision is made to perform a learning process using all of the learning data sets 17 using the server device 30 on the cloud.

Suppose that, as a result of actually performing the learning process using the server device 30, a prediction model with an AUC of 0.71 is constructed, as expected. In this case, the prediction model is judged to be suitable for practical use, and it is decided to deploy it to the production environment.
In this way, when learning with large-scale data, the model generation system 100 can estimate in advance the accuracy that would be obtained using all of the data. By referring to this estimation result, the user can make efficient use of computing resources such as the server device 30.
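The go/no-go decision in Application Example 3 can be sketched as a simple threshold check. This is only an illustration: the function names, the required AUC of 0.70, and the tolerance of 0.02 are assumptions that do not appear in the text. Only the AUC values 0.72 (estimated) and 0.71 (actual) are taken from the example.

```python
def should_train_on_server(estimated_full_auc: float, required_auc: float) -> bool:
    """Return True when the estimated full-data AUC reaches the practical bar."""
    return estimated_full_auc >= required_auc


def matches_estimate(estimated_auc: float, actual_auc: float, tolerance: float) -> bool:
    """Return True when the actual AUC is within `tolerance` of the estimate."""
    return abs(estimated_auc - actual_auc) <= tolerance


REQUIRED_AUC = 0.70  # hypothetical threshold for "practical" accuracy
TOLERANCE = 0.02     # hypothetical margin for "as expected"

# 0.72: AUC estimated locally from the partial data set (from the example).
run_on_server = should_train_on_server(0.72, REQUIRED_AUC)
# 0.71: AUC actually obtained on the server (from the example).
as_expected = matches_estimate(0.72, 0.71, TOLERANCE)
print(run_on_server, as_expected)
```

With a lower estimate, for example 0.65 against the same hypothetical threshold, the check returns False and the pay-as-you-go server is never invoked.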

Fig. 11 is a time chart showing an example of a learning process that includes computation on the server device 30. Fig. 11 shows the flow of processing in the model generation system 100 in, for example, the case of learning with large-scale data described in Application Example 3.
First, with the learning data set 17 to be used for learning loaded, the user presses the learning button, and an instruction to start the learning process is input to the terminal device 10 (step 201).
At this time, the terminal device 10 calculates the data size of the learning data set 17, the learning time, and the like. If, for example, the data size or the learning time exceeds a threshold, a message is displayed informing the user that, because the data is very large, learning will be performed with a portion of the data (the partial data set 18) (step 202).
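The check in step 202 can be sketched as follows. This is a minimal sketch under assumptions: the function name, the returned mode strings, the message wording, and the limits of 500 MB and one hour are all hypothetical, since the text only states that a threshold on data size or learning time is used.

```python
def select_learning_mode(data_bytes, estimated_seconds, max_bytes, max_seconds):
    """Return ("partial", message) when either limit is exceeded, else ("full", None)."""
    if data_bytes > max_bytes or estimated_seconds > max_seconds:
        message = "The data is very large, so learning will use part of the data."
        return "partial", message
    return "full", None


# Hypothetical limits: 500 MB of data or one hour of estimated learning time.
MAX_BYTES = 500 * 1024 ** 2
MAX_SECONDS = 3600

mode, message = select_learning_mode(
    data_bytes=2 * 1024 ** 3,    # a 2 GB data set (illustrative value)
    estimated_seconds=5 * 3600,  # five hours of estimated learning (illustrative value)
    max_bytes=MAX_BYTES,
    max_seconds=MAX_SECONDS,
)
print(mode, message)
```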

The terminal device 10 executes the learning process with the partial data set 18 (step 203). Thus, in the example shown in Fig. 11, the terminal device 10 automatically selects and executes the learning process with the partial data set 18 according to, for example, the data size of the learning data set 17. Note that learning with the partial data set 18 may instead be executed after confirmation by the user.
When the learning process with the partial data set 18 is completed, the learning result (the evaluation result of the first prediction model 51) and the estimated prediction accuracy expected when learning with all of the data (the improvement range α) are displayed (step 204).

Suppose that the estimated prediction accuracy is high and the user decides to execute the learning process using the entire learning data set 17. In this case, a predetermined execution button is pressed, and an instruction to execute learning on the cloud (the server device 30) is input to the terminal device 10 (step 205). The terminal device 10 then uploads the entire learning data set 17 and data such as the setting values of the prediction model to the server device 30 (step 206).

The server device 30 executes the learning process with the entire learning data set 17 (step 207). Because the server device 30 generally has high computing power, it can complete the learning process in a shorter time than the terminal device 10 could. Note that no computational load is placed on the terminal device 10 while the server device 30 is executing the learning process. The user can therefore use this time to have the terminal device 10 execute other processing.

When the learning process with the entire learning data set 17 is completed, the learning result (the evaluation result of the second prediction model 52) is transmitted from the server device 30 to the terminal device 10 (step 208). The terminal device 10 then generates an evaluation screen including the learning result for the entire learning data set 17 and displays it on the display unit (step 209).
In this way, the estimate of the prediction accuracy is presented before the actual learning using all of the data is performed. This allows the user to judge whether or not the actual learning should be performed. In particular, when learning with large-scale data, unnecessary computation time and cost can be suppressed and only the necessary computation executed. This makes it possible to greatly improve the efficiency of the prediction model generation process.
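The order of steps 203 to 209 can be sketched as a small driver function. This is an illustrative sketch only: the four callables passed in are hypothetical stand-ins for local learning, estimation of the improvement range α, the user's decision, and learning on the server. Treating the estimated full-data accuracy as the partial accuracy plus α, and the numeric values used, are assumptions made for the example.

```python
def run_time_chart(train_partial, estimate_alpha, user_approves, train_on_server):
    """Run steps 203-209 in order and return (result, executed_step_numbers)."""
    steps = []
    partial_result = train_partial()          # step 203: learn with the partial data set
    steps.append(203)
    alpha = estimate_alpha(partial_result)    # step 204: present result and estimate
    steps.append(204)
    if not user_approves(partial_result, alpha):
        return partial_result, steps          # stop before any server cost is incurred
    steps.extend([205, 206])                  # steps 205-206: instruction and upload
    full_result = train_on_server()           # step 207: learn with the entire data set
    steps.extend([207, 208, 209])             # steps 208-209: return and display result
    return full_result, steps


# Hypothetical stand-ins; the numbers are illustrative, not measured results.
result, executed = run_time_chart(
    train_partial=lambda: {"auc": 0.68},
    estimate_alpha=lambda partial: 0.04,
    user_approves=lambda partial, alpha: partial["auc"] + alpha >= 0.70,
    train_on_server=lambda: {"auc": 0.71},
)
print(result, executed)
```

The early return after step 204 is the point of the design: when the estimate is discouraging, the upload and server-side learning never run.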

As described above, the control unit 15 according to the present embodiment acquires the meta-feature F of the partial data set 18, which is a part of the learning data set 17. Based on this meta-feature F, accuracy information (the improvement range α) representing the prediction accuracy obtained when the prediction model 50 (first prediction model 51) is generated using the learning data set 17 is estimated. This makes it possible, for example, to judge whether or not the learning data set 17 should be used, and thus to generate the prediction model 50 efficiently.

In machine learning, it is generally known that prediction accuracy improves as the amount of training data increases. On the other hand, the time required for learning also increases as the amount of data increases.
As one example, the increase in learning time often becomes a problem when a parameter search or a feature search is performed. In predictive analysis services provided for non-experts, for example, searching for parameters and features is essential. Consequently, when learning a large data set exceeding, for example, several hundred megabytes, a great deal of time may be required for the parameter search and similar processes.

One method for estimating the degree to which prediction accuracy improves as the amount of data increases is to train on a plurality of data sets of different sizes. In this case, the improvement in prediction accuracy obtained by increasing the amount of data is estimated by examining the relationship between the number of data items in each data set and the prediction accuracy on test data. However, because this method targets a plurality of data sets, learning must be performed multiple times (for example, about 5 to 10 times). As a result, even though the aim is to grasp the prediction accuracy within a short learning time, estimating the prediction accuracy may itself take a long time.
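For comparison, the multi-run approach described above can be sketched as a learning-curve extrapolation. This is not the method of the present embodiment, and the text does not specify a curve shape. The power-law form error = a * n^(-b), the function names, and the synthetic data points below are assumptions chosen for illustration. Each point in `sizes` would correspond to one full training run, which is where the cost of this approach comes from.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * n**(-b), done as a line fit in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_error(a, b, n):
    """Extrapolate the fitted curve to a data set with n items."""
    return a * n ** (-b)


# Synthetic test errors for five training runs (one run per size); not real results.
sizes = [1_000, 2_000, 4_000, 8_000, 16_000]
errors = [0.5 * n ** -0.25 for n in sizes]

a, b = fit_power_law(sizes, errors)
full_error = predict_error(a, b, 100_000)  # extrapolated error for the full data set
print(a, b, full_error)
```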

In the present embodiment, the improvement range α in the prediction accuracy of the prediction model trained with the learning data set 17 is estimated from the meta-feature F of the partial data set 18, which is a part of the learning data set 17. The meta-feature F is calculated from a single learning run using the partial data set 18.
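A meta-feature of this kind could be assembled as in the following sketch. It is a hedged illustration and not the embodiment's actual definition of F. It draws only on quantities named in configurations (9), (13), and (14) of this document: the number of data items, the number of feature amounts, their ratio, evaluation values per data group, and their difference or ratio. The function names, dictionary keys, and sample values are hypothetical.

```python
from statistics import mean, median


def mean_squared_error(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def median_abs_error(y_true, y_pred):
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


def meta_features(n_rows, n_columns, groups):
    """Build a flat meta-feature dict from one learning run on the partial data set.

    `groups` maps a data-group name to (y_true, y_pred) for the model trained once
    on the partial data set. It must contain "train" and "test", and the training
    MSE must be non-zero for the ratio to be defined.
    """
    features = {
        "n_rows": n_rows,
        "n_columns": n_columns,
        "rows_per_column": n_rows / n_columns,
    }
    for name, (y_true, y_pred) in groups.items():
        features[f"{name}_mse"] = mean_squared_error(y_true, y_pred)
        features[f"{name}_median_error"] = median_abs_error(y_true, y_pred)
    # Comparison values between two data groups: difference and ratio of the MSE.
    features["mse_diff_test_train"] = features["test_mse"] - features["train_mse"]
    features["mse_ratio_test_train"] = features["test_mse"] / features["train_mse"]
    return features


# Tiny illustrative inputs; real data groups would be far larger.
F = meta_features(
    n_rows=5,
    n_columns=2,
    groups={
        "train": ([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]),
        "test": ([2.0, 4.0], [1.0, 6.0]),
    },
)
print(F)
```

Everything here is computed from a single trained model, in contrast to the several training runs that a learning-curve comparison needs.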

This makes it possible to estimate, in a short time, the prediction accuracy of the prediction model that would be obtained by learning with the entire learning data set 17. The user can therefore immediately obtain a rough indication of the prediction result on the local terminal device 10 and can appropriately judge whether or not to execute learning with all of the data.
For example, when the data is large-scale, when parameters or features are being searched, or when the problem setting is being worked out by trial and error, the prediction accuracy that would be obtained by learning from the entire data set can be estimated in a short time without performing unnecessary learning.

In addition, the user can learn the estimated accuracy that would be obtained by learning with all of the data without occupying the terminal device 10 for a long time. This enables usage such as, for example, running learning with a portion of the data (the partial data set 18) during working hours to grasp the approximate prediction accuracy when all of the data is used, and then running learning with all of the data at night or on holidays.

In addition, for a data set for which learning is estimated in advance not to go well (for example, the prediction accuracy is low or no improvement can be expected), there is no need to run learning on the cloud. The user can therefore execute learning on the server device 30 only when it is estimated to be effective. This eliminates unnecessary payments for the pay-as-you-go server device 30 and makes it possible to keep development costs down.

In this way, in the present embodiment, an estimate of the prediction accuracy can be obtained without occupying the local terminal device 10 for a long time and without occupying the server device 30 on the cloud for a long time.
This makes it possible to avoid a situation in which, for example, learning is performed for a long time of half a day to a full day with the entire learning data set 17, only for the expected accuracy not to be achieved, wasting time and server fees.
In addition, when trying to improve the accuracy obtained with a data set, an estimate of the prediction accuracy that would result from increasing the number of data items can serve as a guideline for improving accuracy. That is, by referring to the improvement range α of the prediction accuracy and the like, it is also possible to develop a data set for which the improvement range α becomes large.

<Other embodiments>
The present technology is not limited to the embodiment described above, and various other embodiments can be realized.

In the above, a single control unit 15 (terminal device 10) was given as an example of an embodiment of the information processing device according to the present technology. However, the information processing device according to the present technology may be realized by any computer that is configured separately from the control unit 15 and connected to the control unit 15 by wire or wirelessly. For example, the information processing method according to the present technology may be executed by a cloud server. Alternatively, the information processing method according to the present technology may be executed by the control unit 15 operating in conjunction with another computer.

That is, the information processing method and the program according to the present technology can be executed not only in a computer system composed of a single computer but also in a computer system in which a plurality of computers operate in conjunction with one another. In the present disclosure, a system means a set of a plurality of components (devices, modules (parts), and the like), regardless of whether or not all of the components are in the same housing. Accordingly, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.

Execution of the information processing method and the program according to the present technology by a computer system includes both the case in which, for example, the acquisition of the feature amount of the partial data set, the estimation of the accuracy information, and the like are executed by a single computer, and the case in which each process is executed by a different computer. Execution of each process by a given computer also includes causing another computer to execute part or all of the process and acquiring the result.

That is, the information processing method and the program according to the present technology can also be applied to a cloud computing configuration in which one function is shared among a plurality of devices via a network and processed jointly.

At least two of the characteristic features of the present technology described above may be combined. That is, the various characteristic features described in the embodiments may be combined arbitrarily, without distinction between the embodiments. The various effects described above are merely examples and are not limiting, and other effects may also be obtained.

In the present disclosure, "same," "equal," "orthogonal," and the like are concepts that include "substantially the same," "substantially equal," "substantially orthogonal," and the like. For example, they also include states that fall within a predetermined range (for example, a range of ±10%) relative to "completely the same," "completely equal," "completely orthogonal," and the like.

The present technology can also adopt the following configurations.
(1) An information processing device including:
an acquisition unit that acquires a feature amount of a partial data set that is a part of an entire data set used to generate a prediction model; and
an estimation processing unit that estimates, based on the feature amount of the partial data set, accuracy information representing a prediction accuracy of the prediction model generated using the entire data set.
(2) The information processing device according to (1), in which
the estimation processing unit estimates, as the accuracy information, a change in the prediction accuracy of the prediction model generated using the entire data set relative to the prediction accuracy of the prediction model generated using the partial data set.
(3) The information processing device according to (2), in which
the estimation processing unit is configured using an estimation model that estimates the change in the prediction accuracy.
(4) The information processing device according to (3), in which
the estimation model is a model that has learned the relationship between a feature amount of a data set that is a part of a predetermined data set and a change in prediction accuracy that arises when a predetermined prediction model is generated using the whole of the predetermined data set and using the part of the predetermined data set.
(5) The information processing device according to (3) or (4), in which
the estimation model is a classification model that classifies the amount of change in the prediction accuracy into a plurality of levels.
(6) The information processing device according to (3) or (4), in which
the estimation model is a model that approximates, on a rule basis, a classification model that classifies the amount of change in the prediction accuracy into a plurality of levels.
(7) The information processing device according to (3) or (4), in which
the estimation model is a regression model that estimates the amount of change in the prediction accuracy.
(8) The information processing device according to any one of (1) to (7), in which
the feature amount of the partial data set includes a first feature amount according to the content of the partial data set, and
the acquisition unit calculates the first feature amount by analyzing the partial data set.
(9) The information processing device according to (8), in which
the first feature amount includes at least one of the number of data items included in the partial data set, the number of feature amounts included in the data, and the ratio between the number of data items and the number of feature amounts included in the data.
(10) The information processing device according to any one of (1) to (9), in which
the feature amount of the partial data set includes a second feature amount according to the prediction model generated using the partial data set, and
the acquisition unit calculates the second feature amount by executing a generation process of the prediction model using the partial data set.
(11) The information processing device according to (10), in which
the partial data set includes a plurality of data groups that differ from one another in use, and
the second feature amount includes at least one of an evaluation value that evaluates, for each of the plurality of data groups, a predicted value of the prediction model generated using the partial data set, and a comparison value obtained by comparing the evaluation values.
(12) The information processing device according to (11), in which
the plurality of data groups includes a group of training data, a group of validation data, and a group of test data.
(13) The information processing device according to (11) or (12), in which
the evaluation value includes at least one of a median error, a mean squared error, and a median error rate for the predicted value of the prediction model generated using the partial data set.
(14) The information processing device according to any one of (11) to (13), in which
the comparison value includes at least one of a difference and a ratio of the evaluation values calculated for two of the plurality of data groups.
(15) The information processing device according to any one of (1) to (14), further including
a screen generation unit that generates a screen presenting the accuracy information.
(16) The information processing device according to (15), in which
the estimation processing unit estimates, as the accuracy information, a change in the prediction accuracy of the prediction model generated using the entire data set relative to the prediction accuracy of the prediction model generated using the partial data set, and
the screen generation unit generates at least one of a screen presenting the amount of change in the prediction accuracy divided into a plurality of levels and a screen presenting the value of the amount of change in the prediction accuracy.
(17) The information processing device according to (15) or (16), in which
the screen generation unit generates a selection screen for selecting execution of a generation process of the prediction model using the partial data set,
the acquisition unit, when execution of the generation process is selected, executes the generation process and calculates the feature amount of the partial data set, and
the estimation processing unit estimates the accuracy information based on the feature amount of the partial data set.
(18) An information processing method executed by a computer system, including:
acquiring a feature amount of a partial data set that is a part of an entire data set used to generate a prediction model; and
estimating, based on the feature amount of the partial data set, accuracy information representing a prediction accuracy of the prediction model generated using the entire data set.
(19) A program that causes a computer system to execute the steps of:
acquiring a feature amount of a partial data set that is a part of an entire data set used to generate a prediction model; and
estimating, based on the feature amount of the partial data set, accuracy information representing a prediction accuracy of the prediction model generated using the entire data set.
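Configurations (5) and (16) refer to dividing the amount of change in prediction accuracy into a plurality of levels. A minimal sketch of such a mapping is shown below. The three level names and the boundary values 0.01 and 0.05 are hypothetical, since the text does not state how many levels are used or where their boundaries lie.

```python
from bisect import bisect_right

# Hypothetical boundaries between levels, in units of the accuracy metric.
LEVEL_BOUNDS = (0.01, 0.05)
LEVEL_NAMES = ("low", "medium", "high")


def improvement_level(alpha: float) -> str:
    """Map an estimated amount of change in prediction accuracy to a level name."""
    return LEVEL_NAMES[bisect_right(LEVEL_BOUNDS, alpha)]


levels = [improvement_level(a) for a in (0.0, 0.03, 0.2)]
print(levels)
```

A regression-type estimation model as in configuration (7) would output the value of the change directly, and a mapping like this could then be applied only for display.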

F... meta-feature
10... terminal device
14... storage unit
15... control unit
16... control program
17... learning data set
18... partial data set
20... UI generation unit
21... prediction model generation unit
22... meta-feature calculation unit
23... accuracy estimation unit
30... server device
35... setting screen
37... evaluation screen
40... estimation model
50... prediction model
51... first prediction model
52... second prediction model
100... model generation system

Claims

1. An information processing device comprising:
an acquisition unit that acquires a feature amount of a partial data set that is a part of an entire data set used to generate a prediction model; and
an estimation processing unit that estimates, based on the feature amount of the partial data set, accuracy information representing a prediction accuracy of the prediction model generated using the entire data set, wherein
the feature amount of the partial data set includes a first feature amount according to the content of the partial data set, and
the acquisition unit calculates the first feature amount by analyzing the partial data set.

2. The information processing device according to claim 1, wherein
the estimation processing unit estimates, as the accuracy information, a change in the prediction accuracy of the prediction model generated using the entire data set relative to the prediction accuracy of the prediction model generated using the partial data set.

3. The information processing device according to claim 2, wherein
the estimation processing unit is configured using an estimation model that estimates the change in the prediction accuracy.

4. The information processing device according to claim 3, wherein
the estimation model is a model that has learned the relationship between a feature amount of a data set that is a part of a predetermined data set and a change in prediction accuracy that arises when a predetermined prediction model is generated using the whole of the predetermined data set and using the part of the predetermined data set.

5. The information processing device according to claim 3, wherein
the estimation model is a classification model that classifies the amount of change in the prediction accuracy into a plurality of levels.

6. The information processing device according to claim 3, wherein
the estimation model is a model that approximates, on a rule basis, a classification model that classifies the amount of change in the prediction accuracy into a plurality of levels.

7. The information processing device according to claim 3, wherein
the estimation model is a regression model that estimates the amount of change in the prediction accuracy.

8. The information processing device according to claim 1, wherein
the first feature amount includes at least one of the number of data items included in the partial data set, the number of feature amounts included in the data, and the ratio between the number of data items and the number of feature amounts included in the data.

9. The information processing device according to claim 1, wherein
the feature amount of the partial data set includes a second feature amount according to the prediction model generated using the partial data set, and
the acquisition unit calculates the second feature amount by executing a generation process of the prediction model using the partial data set.

10. The information processing device according to claim 9, wherein
the partial data set includes a plurality of data groups that differ from one another in use, and
the second feature amount includes at least one of an evaluation value that evaluates, for each of the plurality of data groups, a predicted value of the prediction model generated using the partial data set, and a comparison value obtained by comparing the evaluation values.

11. The information processing device according to claim 10, wherein
the plurality of data groups includes a group of training data, a group of validation data, and a group of test data.

12. The information processing device according to claim 10, wherein
the evaluation value includes at least one of a median error, a mean squared error, and a median error rate for the predicted value of the prediction model generated using the partial data set.

13. The information processing device according to claim 10, wherein
the comparison value includes at least one of a difference and a ratio of the evaluation values calculated for two of the plurality of data groups.

14. The information processing device according to claim 1, further comprising
a screen generation unit that generates a screen presenting the accuracy information.

15. The information processing device according to claim 14, wherein
the estimation processing unit estimates, as the accuracy information, a change in the prediction accuracy of the prediction model generated using the entire data set relative to the prediction accuracy of the prediction model generated using the partial data set, and
the screen generation unit generates at least one of a screen presenting the amount of change in the prediction accuracy divided into a plurality of levels and a screen presenting the value of the amount of change in the prediction accuracy.

16. The information processing device according to claim 14, wherein
the screen generation unit generates a selection screen for selecting execution of a generation process of the prediction model using the partial data set,
the acquisition unit, when execution of the generation process is selected, executes the generation process and calculates the feature amount of the partial data set, and
the estimation processing unit estimates the accuracy information based on the feature amount of the partial data set.

17. An information processing method executed by a computer system, comprising:
a step of acquiring a feature amount of a partial data set that is a part of an entire data set used to generate a prediction model; and
a step of estimating, based on the feature amount of the partial data set, accuracy information representing a prediction accuracy of the prediction model generated using the entire data set, wherein
the feature amount of the partial data set includes a first feature amount according to the content of the partial data set, and
the step of acquiring the feature amount of the partial data set includes calculating the first feature amount by analyzing the partial data set.

18. A program that causes a computer system to execute:
a step of acquiring a feature amount of a partial data set that is a part of an entire data set used to generate a prediction model; and
a step of estimating, based on the feature amount of the partial data set, accuracy information representing a prediction accuracy of the prediction model generated using the entire data set, wherein
the feature amount of the partial data set includes a first feature amount according to the content of the partial data set, and
the step of acquiring the feature amount of the partial data set includes calculating the first feature amount by analyzing the partial data set.