Contents
- 1 Introduction
- 2 Get started: A quick tour of functionality
  - 2.1 Get started: train - training
  - 2.2 Models: Why one call of training results in several model files
  - 2.3 predict: To apply a model to new data
  - 2.4 Executable rgf and Perl script call_exe.pl
  - 2.5 train_test: train, apply, and evaluate models
  - 2.6 train_predict: train and apply the models and save predictions
- 3 Input/output file format
- 4 Parameters
- References
Regularized greedy forest (RGF) [2] is a tree ensemble learning method. This document guides you through the RGF software package; refer to [2] for an understanding of the method itself. With the code provided in the package, you can do the following:
Using the stand-alone executable:
- Conduct RGF training for regression tasks and binary classification tasks.
- Apply trained models to new data.
It is possible to call RGF functions from your code, but documentation for this purpose is not included in this guide.
This guide is organized as follows. Section 2 quickly goes through the functionality. Data formats are described in Section 3. Section 4 explains all the parameters.
The first thing to do is to download the package, extract the content, and create the executable if needed. Please follow the instructions in the README file.
The top directory of the downloaded content is rgf, and this software package is located in the RGF subfolder.
Make sure that the executable rgf (or rgf.exe on Windows) is at rgf/RGF/bin.
To go through the examples in this section, always set the current directory to rgf/RGF/examples, because the resource files configured for the sample toy data assume that the current directory is rgf/RGF/examples.
For the same reason, the path expressions in this section should be understood as relative to rgf/RGF/examples unless they begin with rgf.
To get started, set the current directory to rgf/RGF/examples
and enter in the command line:
perl call_exe.pl ../bin/rgf train sample/train
If successful, the last several lines of the screen should look like:
Generated 5 model file(s):
output/sample.model-01
output/sample.model-02
output/sample.model-03
output/sample.model-04
output/sample.model-05
Sat Dec 10 12:17:39 2011: Done ... elapsed: 0.115
What happened is: the Perl script call_exe.pl read the parameters in the configuration file sample/train.inp and passed them to the executable ../bin/rgf along with the action to take (train); as a result, training was conducted and five models were saved to files.
Here is part of the configuration file sample/train.inp that was given to call_exe.pl:
#### sample input to "train" ####
train_x_fn=sample/train.data.x      # Training data points
train_y_fn=sample/train.data.y      # Training targets
model_fn_prefix=output/sample.model
reg_L2=1                            # Regularization parameter
algorithm=RGF                       # RGF with L2 regularization on leaf-only models
loss=LS                             # Square loss
test_interval=100                   # Save models every time 100 leaves are added
max_leaf_forest=500                 # Stop training when #leaf reaches 500
Verbose                             # Display info during training
It essentially says: conduct training with the training data points in sample/train.data.x and the training targets in sample/train.data.y and save the trained models to the files.
Any text from # to the end of the line is a comment.
The line model_fn_prefix...
indicates that the system should generate model path names using the string output/sample.model
as a prefix and attaching sequential numbers -01
, -02
, ... to it.
It also says: training should proceed until the number of leaf nodes in the forest reaches 500 (max_leaf_forest=500); and every time another 100 leaf nodes are added to the forest, the system should simulate the end of training and save the model for later testing (test_interval=100).
As a result, you should obtain five models each of which contains approximately 100, 200, 300, 400, or 500 leaf nodes.
We delay explanation of other parameters until Section 4 where complete lists of parameters are given.
The format of training data files is described in Section 3.1.
As seen above, one call of RGF training typically results in several model files. This fact may deserve some explanation as it differs from typical boosting methods.
For example, suppose that you conducted Gradient Boosting [1] training using regression trees as base learners, stopped training when you obtained 500 trees, and saved them to a file for later testing. Then in principle, using these 500 trees, you can test 500 distinct models, each of which consists of the first $k$ trees, where $k = 1, \ldots, 500$, by simply changing the number of trees to be used for making predictions. In principle, you do not have to save 500 models individually. This is because Gradient Boosting does not change the previously-generated trees; it only adds new trees as it proceeds. The same can be said about other typical boosting methods such as AdaBoost.
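The prefix property described above can be illustrated with a toy additive ensemble (a hypothetical Python sketch, not part of the RGF package): because boosting only appends trees, the first k trees alone always form a valid model.

```python
# Toy additive ensemble: each "tree" is just a function x -> float.
# Boosting-style ensembles only append trees, so the first k trees
# always form a valid (smaller) model on their own.
trees = [lambda x, s=s: s * x for s in (0.5, 0.25, 0.125)]

def predict(x, k):
    """Prediction of the model consisting of the first k trees."""
    return sum(t(x) for t in trees[:k])

# One stored ensemble of 3 trees yields 3 distinct testable models.
preds = [predict(2.0, k) for k in (1, 2, 3)]
assert preds == [1.0, 1.5, 1.75]
```

As the next paragraph explains, RGF lacks exactly this property, because the fully-corrective update rewrites the weights of all earlier trees.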
By contrast, RGF performs fully-corrective update of weights, which updates the weights of all the leaf nodes of all the trees, in the designated interval and at the end of training. For this reason, if we save the model of, for example, 500 trees, then these 500 trees can be used only for testing the additive model of all 500 trees. Unlike the Gradient Boosting example above, the first $k < 500$ trees of these 500 trees do not constitute a meaningful model. If we stopped training when $k$ trees were obtained, the weights assigned to the nodes of the $k$ trees would be totally different from those of the first $k$ trees of the 500 trees.
It might be simpler if the system let a user only specify when to stop training and only return one model, but it would be very inefficient to train several models of different sizes this way.
For efficiency, our implementation trains several models of different sizes in one call by simulating the end of training in the interval designated by test_interval.
More precisely, training branches into two in the designated interval.
One continues training as if nothing happened, and the other ends training, which triggers weight optimization (if it has not been triggered by the designated optimization interval), and tests or saves the model.
That is how one call of training produces several models.
The next example reads a model from one of the five model files generated in Section 2.1 and applies the model to new data.
Set the current directory to rgf/RGF/examples
and enter:
perl call_exe.pl ../bin/rgf predict sample/predict
If successful, after parameters are displayed, something similar to the following should be displayed:
output/sample.pred: output/sample.model-03,#leaf=301,#tree=73
Sat Dec 10 13:20:54 2011: Done ...
which indicates that the prediction values were saved to output/sample.pred; the model was read from the file output/sample.model-03, and it contained 301 leaf nodes and 73 trees.
The configuration file sample/predict.inp we used is:
#### sample input to "predict" ####
test_x_fn=sample/test.data.x       # Test data points
model_fn=output/sample.model-03    # Model file
prediction_fn=output/sample.pred   # Where to write prediction values
It says: read the model from output/sample.model-03; apply it to the data points in sample/test.data.x; and save the prediction values to output/sample.pred.
The format of the prediction file is described in Section 3.3.
The executable rgf, called through the Perl script in the examples above, takes two arguments:
rgf action parameters
| argument | description |
|---|---|
| action | One of train \| predict \| train_test \| train_predict (see below). |
| parameters | Parameters in the form of keyword1=value1,keyword2=value2,Option1,... |

| action | description |
|---|---|
| train | Conduct training and save the trained models to files. Input: training data; Output: models. |
| predict | Apply a model to new data. Input: a model and test data; Output: predictions. |
| train_test | Train and test the models in one call. Input: training data and test data; Output: performance results; Optional output: models. |
| train_predict | Train and apply the models to new data in one call. Input: training data and test data; Output: predictions, model information, and models. |
Although what is done by train_test or train_predict can also be done by combining train and predict, use of train_test or train_predict has advantages in some situations, as discussed in Sections 2.5 and 2.6.
To get help on parameters, call rgf with an action but without parameters, for example:
rgf train
rgf predict
Since the parameters can be long and tedious to type, the Perl script call_exe.pl introduced above is provided to ease the job. It essentially reads parameters from a configuration file and concatenates them with the delimiter , to pass to rgf.
The syntax is as follows:
perl call_exe.pl executable action config_pathname
| argument | description |
|---|---|
| executable | Typically ../bin/rgf, i.e., rgf/RGF/bin/rgf. |
| action | train \| predict \| train_test \| train_predict |
| config_pathname | Path name of the configuration file without the extension. The extension of configuration files must be .inp. |
In the configuration files, any text from # to the end of the line is considered a comment.
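For illustration, that concatenation step can be sketched in Python as follows (the actual script is Perl; the function name here is hypothetical):

```python
def config_to_arg(text):
    """Strip '#' comments and blank lines from a configuration file,
    then join the remaining parameter lines with ',' in the form the
    rgf executable expects."""
    params = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # text after '#' is a comment
        if line:
            params.append(line)
    return ",".join(params)

cfg = """train_x_fn=sample/train.data.x  # Training data points
reg_L2=1  # Regularization parameter
Verbose
"""
assert config_to_arg(cfg) == "train_x_fn=sample/train.data.x,reg_L2=1,Verbose"
```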
Additionally, call_exe.pl provides an interface to perform several runs in one call with one configuration file. This is convenient, for example, for testing different degrees of regularization with the other parameters fixed. sample/regress_train_test.inp provides a self-explanatory example.
train_test performs training and test in one call. What train_test does can also be done by combining train and predict and writing an evaluation routine yourself. One advantage of train_test, besides convenience, is that it can save disk space by not having to write the models to files.
To try the example configuration for train_test
, set the current directory to rgf/RGF/examples
, and enter:
perl call_exe.pl ../bin/rgf train_test sample/train_test
If successful, the last several lines of the screen should look like:
Generated 5 model file(s):
output/m-01
output/m-02
output/m-03
output/m-04
output/m-05
Sat Dec 10 10:17:50 2011: Done ... elapsed: 0.135
Part of the configuration file sample/train_test.inp is:
#### sample input to "train test" ####
train_x_fn=sample/train.data.x            # Training data points
train_y_fn=sample/train.data.y            # Training targets
test_x_fn=sample/test.data.x              # Test data points
test_y_fn=sample/test.data.y              # Test targets
evaluation_fn=output/sample.evaluation    # Where to write evaluation results
model_fn_prefix=output/m                  # Save models. This is optional
algorithm=RGF                             # RGF with L2 regularization on leaf-only models
reg_L2=1                                  # Regularization parameter
loss=LS                                   # Square loss
test_interval=100                         # Test (and save) models every time 100 leaves are added
max_leaf_forest=500                       # Stop training when #leaf reaches 500
Verbose                                   # Display info during training
It is mostly the same as the configuration file for train in Section 2.1, except that test data is specified by test_x_fn (data points) and test_y_fn (targets), and evaluation_fn indicates where the performance evaluation results should be written.
In this example, models are saved to files, as model_fn_prefix is specified. If model_fn_prefix is omitted, the models are not saved.
Now check the evaluation file (output/sample.evaluation) that was just generated. It should look like the following, except that the items following cfg are omitted here:
#tree,29,#leaf,100,acc,0.61,rmse,0.9886,sqerr,0.9773,#test,100,cfg,...,output/m-01
#tree,52,#leaf,200,acc,0.66,rmse,0.9757,sqerr,0.952,#test,100,cfg,...,output/m-02
#tree,73,#leaf,301,acc,0.66,rmse,0.9824,sqerr,0.9651,#test,100,cfg,...,output/m-03
#tree,94,#leaf,400,acc,0.69,rmse,0.9767,sqerr,0.9539,#test,100,cfg,...,output/m-04
#tree,115,#leaf,501,acc,0.67,rmse,0.985,sqerr,0.9702,#test,100,cfg,...,output/m-05
The five lines indicate that five models were trained and tested. For example, the first line says: a model with 29 trees and 100 leaf nodes was applied to 100 data points, its classification accuracy was found to be 61%, and the model was saved to output/m-01.
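Since each evaluation line is a flat CSV of alternating descriptors and values, it is easy to parse programmatically. A minimal Python sketch (the function name is hypothetical, not part of the package):

```python
def parse_eval_line(line):
    """Parse one evaluation line of alternating descriptor,value fields.
    A trailing odd field (the model path, if models were saved) is kept
    under the key 'model'."""
    fields = line.strip().split(",")
    rec = {}
    it = iter(fields)
    for key in it:
        try:
            rec[key] = next(it)
        except StopIteration:
            rec["model"] = key      # odd trailing field: model path name
    return rec

rec = parse_eval_line("#tree,29,#leaf,100,acc,0.61,rmse,0.9886,"
                      "sqerr,0.9773,#test,100,cfg,...,output/m-01")
assert rec["#tree"] == "29" and rec["acc"] == "0.61"
assert rec["model"] == "output/m-01"
```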
The evaluation file format is described in Section 3.5. The format of training data and test data files is described in Section 3.1.
The primary function of train_predict is to perform training, apply the trained models to test data, and write the predictions to files.
What train_predict does can also be done by combining train and predict. One advantage of train_predict is that it can save disk space by not having to write model files. (Typically, predictions take up much less disk space than models.)
In particular, train_predict can be used for one-vs-all training for multi-class categorization, whereas train_test cannot, since testing (evaluation) of predictions has to wait until training of the one-vs-all models for all classes is done.
Note that by default all the models are written to files; to save disk space as discussed above, the option switch SaveLastModelOnly needs to be turned on. With this switch on, only the last (and largest) model is written to a file, to enable future warm-start. (Warm-start resumes training from where the training stopped before; it is also explained in Section 4.3.1.)
Model information such as sizes is also written to files. The original purpose is to save information that would otherwise be disposed of when SaveLastModelOnly is on, but for simplicity, train_predict always generates model information files irrespective of the switch.
Part of the provided sample configuration file for train_predict, sample/train_predict.inp, is as follows.
#### sample input to "train predict" ####
train_x_fn=sample/train.data.x     # Training data points
train_y_fn=sample/train.data.y     # Training targets
test_x_fn=sample/test.data.x       # Test data points
model_fn_prefix=output/m
SaveLastModelOnly                  # Only the last (largest) model will be saved
  :
test_interval=100                  # Test (and save) models every time 100 leaves are added
max_leaf_forest=500                # Stop training when #leaf reaches 500
In this example, the model path names will be output/m-01, ..., output/m-05, but only the last one, output/m-05, is actually written to a file, as SaveLastModelOnly is turned on.
The path names for saving the predictions and model information are generated by attaching .pred and .info to the model path names, respectively.
Therefore, after entering the following in the command line,
perl call_exe.pl ../bin/rgf train_predict sample/train_predict
we should have the following 11 files in the output directory:
- Five prediction files: m-01.pred, ..., m-05.pred
- Five model information files: m-01.info, ..., m-05.info
- One model file: m-05
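The file-name bookkeeping above can be sketched as follows (a hypothetical Python helper that only mirrors the naming rules described in this section):

```python
def expected_outputs(prefix, test_interval, max_leaf_forest,
                     save_last_model_only=True):
    """Enumerate the files train_predict is expected to produce:
    one .pred and one .info per model, plus the saved model file(s).
    Assumes max_leaf_forest is a multiple of test_interval."""
    n = max_leaf_forest // test_interval            # number of models
    models = ["%s-%02d" % (prefix, i) for i in range(1, n + 1)]
    preds = [m + ".pred" for m in models]
    infos = [m + ".info" for m in models]
    saved = models[-1:] if save_last_model_only else models
    return preds + infos + saved

files = expected_outputs("output/m", 100, 500)
assert len(files) == 11                  # 5 .pred + 5 .info + 1 model
assert files[-1] == "output/m-05"
```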
The data format is described in Section 3.1.
This section describes the format of input/output files.
The data points (or feature vectors) should be given in a plain text file of the following format.
Each line represents one data point.
In each line, values should be separated by one or more white space characters.
All the lines should have exactly the same number of values.
The values should be in a format recognized as a valid floating-point number expression by atof of the C library.
The following example represents three data points of five dimensions.
0.3 -0.5 1 0 2
1.555 0 0 2.8 0
0 0 0 3 0
(NOTE) Currently, there is no support for categorical values. All the values must be numbers. This means that categorical attributes, if any, need to be converted to indicator vectors in advance.
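For example, a categorical column can be converted to indicator vectors with a few lines of Python (a hypothetical preprocessing sketch; the RGF package itself provides no such tool):

```python
def to_indicator(values):
    """One-hot encode a categorical column into indicator vectors,
    as required because RGF input must be all numeric."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

rows = to_indicator(["red", "blue", "red"])
# columns correspond to the sorted categories: blue, red
assert rows == [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```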
Alternative data format for sparse data points: For sparse data with many zero components (e.g., bag-of-word data), the following format can be used instead. The first line should be "sparse d", where d is the feature dimensionality. Starting from the second line, each line represents one data point. In each line, non-zero components should be specified as feature#:value, where feature# begins from 0 and goes up to d - 1. For example, the three data points above can be expressed as:
sparse 5
0:0.3 1:-0.5 2:1 4:2
0:1.555 3:2.8
3:3
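A small Python sketch (a hypothetical helper, not part of the package) that converts dense rows to this format:

```python
def dense_to_sparse(rows):
    """Convert dense rows to the 'sparse d' format: the first line gives
    the dimensionality, then one 'feature#:value' list per data point."""
    d = len(rows[0])
    lines = ["sparse %d" % d]
    for row in rows:
        lines.append(" ".join("%d:%g" % (j, v)
                              for j, v in enumerate(row) if v != 0))
    return "\n".join(lines)

dense = [[0.3, -0.5, 1, 0, 2],
         [1.555, 0, 0, 2.8, 0],
         [0, 0, 0, 3, 0]]
assert dense_to_sparse(dense).splitlines()[1] == "0:0.3 1:-0.5 2:1 4:2"
```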
The target values should be given in a plain text file of the following format.
Each line contains the target value of one data point.
The order must be in sync with the data point file.
If the data is for the classification task, the values must be in {1,-1}
, for example:
+1
-1
-1
If paired with the data point file example above, this means that the target value of the first data point [0.3, -0.5, 1, 0, 2]
is 1
and the target value of the second data point [1.555, 0, 0, 2.8, 0]
is -1
, and so on.
For regression tasks, the target values can be any real values, for example:
0.35
1.23
-0.0028
As introduced later, training optionally takes the user-specified weights of data points as input. The data point weights should be given in a plain text file of the same format as the target file. That is, each line should contain the user-specified weight of one data point, and the order must be in sync with the data point file of training data.
predict and train_predict output prediction values to a file. The prediction file is a plain text file that contains one prediction value per line. The order of the values is in sync with the data point file of the test data.
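A quick sanity check of this one-value-per-line correspondence can be written as follows (a hypothetical Python helper; it assumes the dense data format, since the sparse format has an extra header line):

```python
def check_in_sync(test_x_text, prediction_text):
    """Sanity check: the prediction file must contain exactly one value
    per data point, in the same order as the test data point file.
    Assumes the dense data format (no 'sparse d' header line)."""
    n_points = sum(1 for line in test_x_text.splitlines() if line.strip())
    preds = [float(line) for line in prediction_text.splitlines()
             if line.strip()]
    if len(preds) != n_points:
        raise ValueError("out of sync: %d predictions for %d data points"
                         % (len(preds), n_points))
    return preds

assert check_in_sync("1 2\n3 4\n", "0.5\n-0.5\n") == [0.5, -0.5]
```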
train_predict outputs model information to files. The model information file is a plain text file with one line, for example:
#tree,378,#leaf,5000,sign,-___-_RGF_,cfg,reg_L2=0.1;loss=LS...
This example means that the model consists of 378 trees and 5000 leaf nodes, and that it was trained with RGF with the parameter settings following cfg.
train_test outputs performance evaluation results to a file in CSV format. Here is an example:
#tree,115,#leaf,500,acc,0.64,rmse,0.9802,sqerr,0.9607,#test,100,cfg,...
#tree,213,#leaf,1000,acc,0.65,rmse,0.9721,sqerr,0.945,#test,100,cfg,...
In the evaluation file each line represents the evaluation results of one model.
In each line, each value is preceded by its descriptor; e.g., #tree,115
indicates that the number of trees is 115 in the tested model.
In the following, $y_i$ and $p_i$ are the target value and the prediction value of the $i$-th data point, respectively; $I(x)$ is the indicator function, so that $I(x) = 1$ if $x$ is true and 0 otherwise; and $m$ is the number of test data points. The metrics are:

- acc (classification accuracy): $\frac{1}{m} \sum_{i=1}^{m} I(\mathrm{sign}(p_i) = y_i)$
- rmse (root mean square error): $\sqrt{\frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)^2}$
- sqerr (mean square error): $\frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)^2$
In addition, if models were saved to files, the last item of each line will be the model path name.
(NOTE) Although performances are shown in several metrics, depending on the task some are obviously meaningless and should be ignored; e.g., accuracy should be ignored on regression tasks, and RMSE and square error should be ignored on classification tasks, especially when exponential loss is used.
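The metrics can be reproduced from predictions and targets with a few lines of Python (a hypothetical sketch; in particular, treating a prediction of exactly 0 as positive is an assumption, not documented behavior):

```python
import math

def evaluate(targets, preds):
    """Compute the three metrics reported in the evaluation file:
    classification accuracy, RMSE, and mean square error."""
    m = len(targets)
    sqerr = sum((p - y) ** 2 for y, p in zip(targets, preds)) / m
    acc = sum(1 for y, p in zip(targets, preds)
              if (1 if p >= 0 else -1) == y) / m   # sign(0) -> +1: assumption
    return {"acc": acc, "rmse": math.sqrt(sqerr), "sqerr": sqerr}

r = evaluate([1, -1, -1, 1], [0.8, -0.2, 0.1, -0.4])
assert r["acc"] == 0.5          # two of the four signs match the targets
```

Note that rmse is just the square root of sqerr, which you can verify against the sample evaluation lines above (0.9802 squared is about 0.9607).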
The model files generated by train or train_test are binary files. Caution is needed if you wish to share model files between environments with different endianness. By default, the code assumes little-endian. To share model files between environments with different endianness, the executable used in the big-endian environment needs to be compiled in a certain way; see the README for details.
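The underlying issue can be demonstrated in a few lines of Python: the same eight bytes decode to different numbers under different byte orders.

```python
import struct

# The same 8 bytes decode to different numbers depending on byte order,
# which is why a model file written on a little-endian machine cannot be
# read directly by a naive big-endian reader (or vice versa).
payload = struct.pack("<d", 3.14)          # little-endian double
little = struct.unpack("<d", payload)[0]   # correct round-trip
big = struct.unpack(">d", payload)[0]      # bytes misread in reverse order
assert little == 3.14
assert big != 3.14
```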
Since many of the parameters are for controlling training, let us first give a brief overview of RGF training, focusing on the things that can be controlled via parameters. [2] should be referred to for more precise and complete definition.
Suppose that we are given training data points $\{\mathbf{x}_i\}$ and targets $\{y_i\}$. The additive model obtained by RGF training is in the form of:

$$h(\mathbf{x}) = \sum_{v \in F} \alpha_v \, g_v(\mathbf{x}),$$

where $v$ goes through all the leaf nodes in the forest $F$, $g_v$ is the basis function associated with node $v$, and $\alpha_v$ is its weight or coefficient. Initially, we have an empty forest with $h(\mathbf{x}) = 0$. As training proceeds, the forest obtains more and more nodes, so the model obtains more and more basis functions. The training objective of RGF is to find the model that minimizes regularized loss, which is the sum of loss and a regularization penalty term:
$$Q(h) = \sum_i \ell(h(\mathbf{x}_i), y_i) + \mathcal{R}(h), \qquad (1)$$

where $\ell$ is a loss function and $\mathcal{R}(h)$ is the regularization penalty term. RGF grows the forest with greedy search so that the regularized loss is minimized, while it performs fully-corrective update of weights to minimize the regularized loss in the designated interval. The loss function and the interval of weight optimization can be specified by parameters.
There are three methods of regularization discussed in [2]. One is $L_2$ regularization on leaf-only models, in which the regularization penalty term is:

$$\mathcal{R}(h) = \lambda \sum_{v \in F} \alpha_v^2,$$

where $\lambda$ is a constant. This is equivalent to standard $L_2$ regularization and penalizes larger weights. The other two are called min-penalty regularizers. Their definition of the regularization penalty term over each tree is in the form of:

$$\min_{\{\beta_v\}} \Big\{ \lambda \sum_{v} \gamma^{d_v} \beta_v^2 \Big\},$$

where the minimum is taken over all equivalent representations $\{\beta_v\}$ of the tree's output, $d_v$ is the depth of node $v$, and $\lambda$ and $\gamma$ are constants. While [2] should be consulted for the precise definition of the min-penalty regularizers, one thing to note here is that a larger $\gamma$ penalizes deeper nodes (corresponding to more complex basis functions) more severely. Parameters are provided to choose the regularizer and to specify the degree of regularization through $\lambda$ or $\gamma$.
Optionally, it is possible to have different regularization parameters for the process of growing a forest and the process of weight correction.
On regression tasks, it is sensible to normalize the targets so that their average becomes zero, since regularization shrinks weights towards zero. An option switch NormalizeTarget is provided for this purpose. When it is turned on, the model is fitted to the normalized targets $y_i - \bar{y}$, where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$, and the final model is set to $h(\mathbf{x}) + \bar{y}$.
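The effect of NormalizeTarget can be sketched in Python (a hypothetical illustration; `fit` stands in for any regression trainer that returns a prediction function):

```python
def fit_with_normalized_targets(xs, ys, fit):
    """Sketch of NormalizeTarget: train on zero-mean targets and add the
    mean back into the final model. `fit` is any regression trainer
    returning a function x -> prediction (hypothetical placeholder)."""
    mean = sum(ys) / len(ys)
    h = fit(xs, [y - mean for y in ys])
    return lambda x: h(x) + mean

# With a trivial trainer that always predicts 0, the final model
# predicts the target mean.
model = fit_with_normalized_targets([1, 2, 3], [4.0, 6.0, 8.0],
                                    lambda xs, ys: (lambda x: 0.0))
assert model(1) == 6.0
```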
The regularized loss in (1) can be customized not only by specifying a loss function but also by specifying user-defined weights. Let $w_i$ be the user-defined weight assigned to the $i$-th data point. Then, instead of (1), the system will minimize the following:

$$Q_w(h) = \sum_i w_i \, \ell(h(\mathbf{x}_i), y_i) + \mathcal{R}(h). \qquad (2)$$
Finally, in our implementation, the fully-corrective weight update is done by coordinate descent: cycling through the leaf nodes, each weight is updated by a Newton step scaled by a step size, and the cycle is repeated a fixed number of times in each optimization round.
The constants involved (the step size and the number of iterations) can be changed using the parameters described below, although, in our experiments, we never changed them as the default values worked fine on a number of datasets.
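As an illustration of fully-corrective coordinate descent, here is a minimal Python sketch for square loss with $L_2$ regularization over a fixed set of basis functions (a toy stand-in for the leaf basis functions; the damping step size and iteration count below are illustrative defaults of this sketch, not the package's):

```python
def coordinate_descent(G, y, lam, eta=0.5, n_iters=10):
    """Fully-corrective update for square loss + L2 penalty:
    minimize 0.5*||y - G a||^2 + 0.5*lam*||a||^2 over ALL weights a,
    cycling through coordinates with a damped exact minimization."""
    n, V = len(G), len(G[0])
    a = [0.0] * V
    r = list(y)                                  # residual y - G a
    for _ in range(n_iters):
        for v in range(V):
            gv = [G[i][v] for i in range(n)]
            num = sum(g * ri for g, ri in zip(gv, r)) - lam * a[v]
            den = sum(g * g for g in gv) + lam
            delta = eta * num / den              # damped Newton step
            a[v] += delta
            r = [ri - delta * g for ri, g in zip(r, gv)]
    return a

# With orthogonal basis columns, each weight converges to the ridge
# solution g'y / (g'g + lam).
G = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
a = coordinate_descent(G, [1.0, 2.0, 1.0], lam=1.0, n_iters=50)
assert abs(a[0] - 2.0 / 3.0) < 1e-6 and abs(a[1] - 1.0) < 1e-6
```

Because every weight is revisited, the earlier "trees" (columns of G) do not keep their old weights, which is exactly why RGF models have no meaningful prefix sub-models (Section 2.2).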
There are two types of parameters: keyword-value pairs and options.
The keywords begin with a lower-case letter and should be followed by =value, e.g., loss=LS
.
The options begin with an upper-case letter, e.g., Verbose
, and indicate to turn on corresponding option switches, which are off by default.
In the lists below, * in front of a keyword indicates that the keyword-value pair is required and cannot be omitted.
The parameters described in this section are used by the train function, which trains models and saves them to files.
A pair of training data files (one contains the data points and the other contains the target values) are required. Another required parameter is the prefix of model path names, which is used to generate model path names by attaching to it sequential numbers "-01", "-02", and so on. The reason why one call of training typically produces multiple model files is explained in Section 2.2.
Optionally, training can be resumed from the point where training was ended last time, which we call warm-start.
To do warm-start, the model file from which training should be resumed needs to be specified.
Also optionally, user-defined weights of training data points can be specified through train_w_fn; they are used as in (2).
Required parameters to control input and output for train:

| parameter | description |
|---|---|
| * train_x_fn= | Path to the data point file of training data. |
| * train_y_fn= | Path to the target file of training data. |
| * model_fn_prefix= | To save models to files, path names are generated by attaching "-01", "-02", ... to this value. |

Optional parameters to control input and output for train:

| parameter | description |
|---|---|
| train_w_fn= | Path to the file of user-defined weights assigned to training data points. |
| model_fn_for_warmstart= | Path to the model file from which training should do warm-start. |
In the list below, the first group of parameters are most important in the sense that they would affect either accuracy of the models or speed of training directly, and they were actually used in the experiments reported in [2]. The second group of parameters never needed to be specified in our experiments, as the default values worked fine on a number of datasets, but they may be useful in some situations. The third group is for displaying information and specifying the memory allocation policy.
The variables below refer to the corresponding variables in the overview in Section 4.1.
Parameters to control training:

| parameter | description |
|---|---|
| algorithm= | RGF \| RGF_Opt \| RGF_Sib. RGF: $L_2$ regularization on leaf-only models; RGF_Opt and RGF_Sib: min-penalty regularization. (Default: RGF) |
| loss= | LS \| Expo \| Log. LS: square loss; Expo: exponential loss; Log: logistic loss. (Default: LS) |
| max_leaf_forest= | Training will be terminated when the number of leaf nodes in the forest reaches this value. It should be large enough so that a good model can be obtained at some point of training, whereas a smaller value makes training time shorter. Appropriate values are data-dependent and in [2] varied from 1000 to 10000. (Default: 10000) |
| NormalizeTarget | If turned on, training targets are normalized so that the average becomes zero. It was turned on in all the regression experiments in [2]. |
| * reg_L2= | $\lambda$. Used to control the degree of $L_2$ regularization. Crucial for good performance. Appropriate values are data-dependent. Either 1.0, 0.1, or 0.01 often produces good results, though with exponential loss (loss=Expo) and logistic loss (loss=Log) some data requires smaller values such as 1e-10 or 1e-20. |
| reg_sL2= | $\lambda_s$. Overrides the regularization parameter for the process of growing the forest. That is, if specified, the weight correction process uses $\lambda$ and the forest growing process uses $\lambda_s$. If omitted, no override takes place and $\lambda$ is used throughout training. On some data, a smaller $\lambda_s$ (e.g., $\lambda/100$) works well. |
| reg_depth= | $\gamma$. Must be no smaller than 1. Meant for use with algorithm=RGF_Opt\|RGF_Sib. A larger value penalizes deeper nodes more severely. (Default: 1) |
| test_interval= | Test interval in terms of the number of leaf nodes. For example, if test_interval=500, every time 500 leaf nodes are newly added to the forest, the end of training is simulated and the model is tested or saved for later testing. For efficiency, it must be either a multiple or a divisor of the optimization interval (opt_interval: default 100). If not, it may be changed by the system automatically. (Default: 500) |

Other parameters:

| parameter | description |
|---|---|
| Verbose | Print information during training. |
| Time | Measure and display elapsed time for node search and weight optimization. |
| memory_policy= | Conservative \| Generous. (Default: Generous) |
predict reads a model saved by train, train_test, or train_predict, applies it to new data, and saves prediction values to a file.

Parameters for predict:

| parameter | description |
|---|---|
| * test_x_fn= | Path to the data point file of test data. |
| * model_fn= | Path to the model file. |
| * prediction_fn= | Path to the prediction file to write prediction values to. |
train_test trains models with training data and evaluates them on test data in one call.
train_test requires a pair of training data files (one contains the data points and the other contains the target values) and a pair of test data files.
Optionally, the models can be saved to files by specifying model_fn_prefix. The value specified with model_fn_prefix is used to generate model path names by attaching sequential numbers "-01", "-02", and so on. The reason why one call of training typically produces multiple model files is explained in Section 2.2. If SaveLastModelOnly is turned on, only the last (and largest) model will be saved, which enables warm-start later on.
The other optional features are the same as for train: training can be resumed from the point where it ended last time (warm-start), and user-defined weights of training data points can be specified through train_w_fn; see Section 4.1 for how they are used.
Parameters to control input and output for train_test:

| parameter | description |
|---|---|
| * train_x_fn= | Path to the data point file of training data. |
| * train_y_fn= | Path to the target file of training data. |
| * test_x_fn= | Path to the data point file of test data. |
| * test_y_fn= | Path to the target file of test data. |
| evaluation_fn= | Path to the file to write performance evaluation results to. If omitted, the results are written to stdout. |
| Append_evaluation | Open the evaluation file in append mode. |
| model_fn_prefix= | If omitted, the models are not saved to files. Model path names are generated by attaching "-01", "-02", ... to this value. |
| train_w_fn= | Path to the file of user-defined weights assigned to training data points. |
| model_fn_for_warmstart= | Path to the input model file from which training should do warm-start. |
The parameters to control training for train_test are the same as those for train; see Section 4.3.2.
train_predict trains models with training data, applies the models to test data, and saves the obtained predictions and model information to files in one call. Model files are also saved to files, but whether all the models or only the last one is saved is optional.
train_predict requires a pair of training data files (one contains the data points and the other contains the target values) and a test data file of data points. The target values of the test data are not required.
The value specified with model_fn_prefix is used to generate model path names by attaching sequential numbers "-01", "-02", and so on. The reason why one call of training typically produces multiple model files is explained in Section 2.2.
To write predictions to files, the path names are generated by attaching .pred to the corresponding model path names.
When the SaveLastModelOnly switch is turned on, only the last (and largest) model is written to a file. This option is useful for reducing the amount of disk space needed while enabling warm-start later on. See Section 2.6 for more on the situations in which train_predict is suitable.
Information on models, such as sizes, is also written to files; their path names are generated by attaching .info to the model path names. The original purpose is to save information that would otherwise be disposed of when SaveLastModelOnly is on, but for simplicity, train_predict always generates model information files irrespective of SaveLastModelOnly.
Parameters to control input and output for train_predict:

| parameter | description |
|---|---|
| * train_x_fn= | Path to the data point file of training data. |
| * train_y_fn= | Path to the target file of training data. |
| * test_x_fn= | Path to the data point file of test data. |
| * model_fn_prefix= | Model path names are generated by attaching "-01", "-02", ... to this value. Prediction path names and model information path names are generated by attaching .pred and .info to the model path names, respectively. |
| SaveLastModelOnly | If turned on, only the last model is saved to a file. |
| train_w_fn= | Path to the file of user-defined weights assigned to training data points. |
| model_fn_for_warmstart= | Path to the input model file from which training should do warm-start. |
The parameters to control training for train_predict are the same as those for train; see Section 4.3.2.
[1] Jerome Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29, 2001.
[2] Rie Johnson and Tong Zhang. Learning Nonlinear Functions Using Regularized Greedy Forest. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(5):942-954, 2014, also available as arXiv:1109.0887.