CN109034398B - Gradient lifting tree model construction method and device based on federal training and storage medium


Info

Publication number
CN109034398B
Authority
CN
China
Prior art keywords
round
node
splitting
sample
training
Legal status
Active
Application number
CN201810918867.3A
Other languages
Chinese (zh)
Other versions
CN109034398A
Inventor
Kewei Cheng (成柯葳)
Tao Fan (范涛)
Yang Liu (刘洋)
Tianjian Chen (陈天健)
Qiang Yang (杨强)
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Application filed by WeBank Co Ltd
Priority to CN201810918867.3A
Publication of CN109034398A
Application granted
Publication of CN109034398B


Abstract

The invention discloses a feature selection method based on federal training, which comprises the following steps: performing federal training on two aligned training samples by adopting an XGboost algorithm to construct a gradient lifting tree model, wherein the gradient lifting tree model comprises a plurality of regression trees, and one split node of each regression tree corresponds to one feature of the training sample; counting average profit values of split nodes corresponding to the same feature in the gradient lifting tree model, and taking the average profit values as scores of the corresponding features; and carrying out feature sorting based on the scores of the features, and outputting sorting results for feature selection, wherein if features without corresponding split nodes exist in the training sample, default scores are used for those features. The invention also discloses a feature selection device based on federal training and a computer readable storage medium. The invention realizes federal training modeling by using training samples of different data parties, thereby realizing feature selection over multi-party sample data.

Description

Gradient lifting tree model construction method and device based on federal training and storage medium
Technical Field
The invention relates to the technical field of machine learning, in particular to a gradient lifting tree model construction method and device based on federal training and a computer readable storage medium.
Background
In the current information age, many human behaviors, such as consumption behavior, are reflected in data. Big data analysis has grown out of this: a corresponding behavior analysis model is built through machine learning, and people's behaviors can then be classified or predicted based on users' behavioral features.
In existing machine learning practice, one party usually trains on its own sample data alone, i.e., unilateral modeling. Based on the resulting mathematical model, the features of relatively high importance in the sample feature set can then be determined. However, many big data analysis scenarios span domains. For example, a user may have both consumption behavior and loan behavior: the consumption behavior data is generated at a consumption service provider, while the loan behavior data is generated at a financial service provider. If the financial service provider needs to predict a user's loan behavior based on the user's consumption behavior features, the consumption data of the consumption service provider must be used and machine-learned together with the financial service provider's own loan data to construct a prediction model.
Therefore, for such application scenarios, a new modeling approach is needed that enables joint training on sample data from different data providers, so that both parties can participate in modeling together.
Disclosure of Invention
The invention mainly aims to provide a gradient lifting tree model construction method and device based on federal training and a computer readable storage medium, and aims to solve the technical problem that the prior art cannot realize joint training of sample data of different data providers and further cannot realize joint participation modeling of both parties.
In order to achieve the above object, the present invention provides a gradient lifting tree model construction method based on federal training, which includes the following steps:
performing federal training on two aligned training samples by adopting an XGboost algorithm to construct a gradient lifting tree model, wherein the gradient lifting tree model comprises a plurality of regression trees, and one split node of each regression tree corresponds to one feature of the training sample;
counting average profit values of split nodes corresponding to the same feature in the gradient lifting tree model, and taking the average profit values as scores of the corresponding features;
and carrying out feature sorting based on the scores of the features, and outputting sorting results for feature selection, wherein if features without corresponding split nodes exist in the training sample, default scores are used for those features.
Optionally, the two aligned training samples are a first training sample and a second training sample, respectively;
the first training sample attribute comprises a sample ID and a part of sample characteristics, and the second training sample attribute comprises a sample ID, another part of sample characteristics and a data tag;
the first training sample is provided by and stored locally to the first data party and the second training sample is provided by and stored locally to the second data party.
Optionally, the performing federal training on the two aligned training samples using the XGboost algorithm to construct the gradient-lifting tree model includes:
on the second data side, acquiring a first-order gradient and a second-order gradient of each training sample in a sample set corresponding to the node splitting of the round;
if the first-round node splitting is the first-round node splitting for constructing a regression tree, encrypting the first-order gradient and the second-order gradient, and then sending the first-order gradient and the second-order gradient to the first data side together with the sample ID of the sample set so as to calculate the benefit value of the split node of the training sample locally corresponding to the sample ID in each splitting mode on the first data side based on the encrypted first-order gradient and the second-order gradient;
if the node splitting of the round is a non-first-round node splitting for constructing the regression tree, a sample ID of the sample set is sent to the first data side, so that the first data side, continuing to use the first-order gradient and the second-order gradient used for the first-round node splitting, calculates the benefit value of the split node of the training sample locally corresponding to the sample ID in each splitting mode;
the second data party receives the encryption gain values of all the split nodes returned by the first data party and decrypts the encryption gain values;
on the second data side, calculating the benefit value of splitting nodes of a training sample corresponding to the sample ID locally in each splitting mode based on the first-order gradient and the second-order gradient;
based on the profit values of all the split nodes calculated by the two parties, determining the global optimal split node for splitting the node of the round;
based on the global optimal splitting node of the current round of node splitting, splitting is carried out on a sample set corresponding to the current node, and a new node is generated to construct a regression tree of the gradient lifting tree model.
Optionally, before the step of obtaining the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the node splitting of the present round at the second data side, the method further includes:
When node splitting is carried out, judging whether the node splitting of the round corresponds to constructing a first regression tree;
if the node splitting of the current round corresponds to the first regression tree construction, judging whether the node splitting of the current round is the first node splitting of the first regression tree construction;
if the node splitting of the round is the first-round node splitting for constructing the first regression tree, initializing a first-order gradient and a second-order gradient of each training sample in the sample set corresponding to the node splitting of the round at the second data side; if the node splitting of the round is a non-first-round node splitting for constructing the first regression tree, the first-order gradient and the second-order gradient used by the first-round node splitting continue to be used;
if the node splitting of the round corresponds to the construction of the non-first regression tree, judging whether the node splitting of the round is the first round node splitting of the construction of the non-first regression tree;
if the node splitting of the round is the first-round node splitting for constructing a non-first regression tree, updating a first-order gradient and a second-order gradient according to the last round of federal training; if the node splitting of the round is a non-first-round node splitting for constructing a non-first regression tree, the first-order gradient and the second-order gradient used by the first-round node splitting continue to be used.
Optionally, the gradient lifting tree model building method based on federal training further comprises:
When generating new nodes to construct a regression tree of the gradient lifting tree model, judging whether the depth of the regression tree of the round reaches a preset depth threshold value or not at the second data side;
if the depth of the regression tree of the current round reaches the preset depth threshold, stopping node splitting to obtain a regression tree of the gradient lifting tree model, and if not, continuing the node splitting of the next round.
Optionally, the gradient lifting tree model building method based on federal training further comprises:
when node splitting is stopped, judging whether the total number of the round of regression trees reaches a preset number threshold value at the second data side;
if the total number of the regression trees in the round reaches the preset number threshold, stopping the federal training, otherwise, continuing the next round of federal training.
Optionally, the gradient lifting tree model building method based on federal training further comprises:
on the second data side, recording the related information of the global optimal splitting node determined by each round of node splitting;
wherein the related information includes: the provider of the corresponding sample data, the feature encoding of the corresponding sample data, and the benefit value.
Optionally, the counting the average benefit value of the split nodes corresponding to the same feature in the gradient-lifted tree model includes:
and on the second data side, taking each global optimal splitting node as a splitting node of each regression tree in the gradient lifting tree model, and counting the average benefit value of the splitting nodes corresponding to the same feature encoding.
Further, in order to achieve the above object, the present invention further provides a gradient lifting tree model building device based on federal training, where the gradient lifting tree model building device based on federal training includes a memory, a processor, and a gradient lifting tree model building program stored on the memory and capable of running on the processor, where the gradient lifting tree model building program, when executed by the processor, implements the steps of the gradient lifting tree model building method based on federal training as set forth in any one of the above.
Further, to achieve the above object, the present invention further provides a computer readable storage medium having stored thereon a gradient-lifting tree-model building program which, when executed by a processor, implements the steps of the federally-trained gradient-lifting tree-model building method according to any one of the above.
The method adopts an XGboost algorithm to perform federal training on two aligned training samples to construct a gradient lifting tree model, wherein the gradient lifting tree model is a set of regression trees comprising a plurality of regression trees, and a split node of each regression tree corresponds to one feature of the training sample; by counting the average profit value of the split nodes corresponding to the same feature in the gradient lifting tree model and taking the average profit value as the score of the corresponding feature, scoring of the features of both parties' training sample data is achieved; finally, feature sorting is carried out based on the scores of the features, and sorting results are output for feature selection, wherein the higher the score, the higher the importance of the feature. The invention realizes federal training modeling by using training samples of different data parties, thereby realizing feature selection over multi-party sample data.
Drawings
FIG. 1 is a schematic diagram of a hardware operating environment involved in an embodiment of a federally trained gradient-lifting tree model building apparatus of the present invention;
FIG. 2 is a schematic flow chart of an embodiment of a federally trained gradient-lifted tree model construction method according to the present invention;
FIG. 3 is a detailed flowchart of the embodiment of step S10 in FIG. 2;
FIG. 4 is a schematic diagram of training results of an embodiment of a federally trained gradient-lifted tree model construction method according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a gradient lifting tree model building device based on federal training.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a hardware operation environment related to an embodiment scheme of a gradient lifting tree model building device based on federal training.
The gradient lifting tree model building device based on federal training may be a personal computer, a server, or other equipment with computing capability.
As shown in fig. 1, the gradient-lifting tree model building apparatus based on federal training may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a stable memory (non-volatile memory), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the federally trained gradient-lifting tree model building apparatus structure shown in fig. 1 is not limiting of the apparatus and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, the memory 1005, which is a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a gradient lifting tree model building program.
In the gradient lifting tree model building device based on federal training shown in fig. 1, the network interface 1004 is mainly used for connecting a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call a gradient-lifting tree model building program stored in the memory 1005 and perform the following operations:
performing federal training on two aligned training samples by adopting an XGboost algorithm to construct a gradient lifting tree model, wherein the gradient lifting tree model comprises a plurality of regression trees, and one split node of each regression tree corresponds to one feature of the training sample;
counting average profit values of split nodes corresponding to the same feature in the gradient lifting tree model, and taking the average profit values as scores of the corresponding features;
and carrying out feature sorting based on the scores of the features, and outputting sorting results for feature selection, wherein if features without corresponding split nodes exist in the training sample, default scores are used for those features.
Further, the two aligned training samples are a first training sample and a second training sample, respectively; the first training sample attribute comprises a sample ID and a part of sample characteristics, and the second training sample attribute comprises a sample ID, another part of sample characteristics and a data tag; the first training sample is provided by a first data party and is stored locally to the first data party, and the second training sample is provided by a second data party and is stored locally to the second data party; the processor 1001 calls the gradient-lifting tree-model building program stored in the memory 1005 to further perform the following operations:
On the second data side, acquiring a first-order gradient and a second-order gradient of each training sample in a sample set corresponding to the node splitting of the round;
if the first-round node splitting is the first-round node splitting for constructing a regression tree, encrypting the first-order gradient and the second-order gradient, and then sending the first-order gradient and the second-order gradient to the first data side together with the sample ID of the sample set so as to calculate the benefit value of the split node of the training sample locally corresponding to the sample ID in each splitting mode on the first data side based on the encrypted first-order gradient and the second-order gradient;
if the node splitting of the round is a non-first-round node splitting for constructing the regression tree, a sample ID of the sample set is sent to the first data side, so that the first data side, continuing to use the first-order gradient and the second-order gradient used for the first-round node splitting, calculates the benefit value of the split node of the training sample locally corresponding to the sample ID in each splitting mode;
the second data party receives the encryption gain values of all the split nodes returned by the first data party and decrypts the encryption gain values;
on the second data side, calculating the benefit value of splitting nodes of a training sample corresponding to the sample ID locally in each splitting mode based on the first-order gradient and the second-order gradient;
Based on the profit values of all the split nodes calculated by the two parties, determining the global optimal split node for splitting the node of the round;
based on the global optimal splitting node of the current round of node splitting, splitting is carried out on a sample set corresponding to the current node, and a new node is generated to construct a regression tree of the gradient lifting tree model.
Further, the processor 1001 calls the gradient-lifting tree model building program stored in the memory 1005 to further perform the following operations:
when node splitting is carried out, judging whether the node splitting of the round corresponds to constructing a first regression tree;
if the node splitting of the current round corresponds to the first regression tree construction, judging whether the node splitting of the current round is the first node splitting of the first regression tree construction;
if the node splitting of the round is the first-round node splitting for constructing the first regression tree, initializing a first-order gradient and a second-order gradient of each training sample in the sample set corresponding to the node splitting of the round at the second data side; if the node splitting of the round is a non-first-round node splitting for constructing the first regression tree, the first-order gradient and the second-order gradient used by the first-round node splitting continue to be used;
if the node splitting of the round corresponds to the construction of the non-first regression tree, judging whether the node splitting of the round is the first round node splitting of the construction of the non-first regression tree;
If the node splitting of the round is the first-round node splitting for constructing a non-first regression tree, updating a first-order gradient and a second-order gradient according to the last round of federal training; if the node splitting of the round is a non-first-round node splitting for constructing a non-first regression tree, the first-order gradient and the second-order gradient used by the first-round node splitting continue to be used.
Further, the processor 1001 calls the gradient-lifting tree model building program stored in the memory 1005 to further perform the following operations:
on the first data side, calculating the benefit value of splitting nodes of a training sample locally corresponding to the sample ID in each splitting mode based on the encrypted first-order gradient and second-order gradient;
or on the first data side, calculating the benefit value of split nodes of a training sample locally corresponding to the sample ID in each splitting mode, continuing to use the first-order gradient and the second-order gradient used for the first-round node splitting;
encrypting the benefit values of all the split nodes and then sending the encrypted benefit values to the second data party.
Further, the processor 1001 calls the gradient-lifting tree model building program stored in the memory 1005 to further perform the following operations:
when generating new nodes to construct a regression tree of the gradient lifting tree model, judging whether the depth of the regression tree of the round reaches a preset depth threshold value or not at the second data side;
If the depth of the regression tree of the current round reaches the preset depth threshold, stopping node splitting to obtain a regression tree of the gradient lifting tree model, and if not, continuing the node splitting of the next round.
Further, the processor 1001 calls the gradient-lifting tree model building program stored in the memory 1005 to further perform the following operations:
when node splitting is stopped, judging whether the total number of the round of regression trees reaches a preset number threshold value at the second data side;
if the total number of the regression trees in the round reaches the preset number threshold, stopping the federal training, otherwise, continuing the next round of federal training.
Further, the processor 1001 calls the gradient-lifting tree model building program stored in the memory 1005 to further perform the following operations:
on the second data side, recording the related information of the global optimal splitting node determined by each round of node splitting;
wherein the related information includes: the provider of the corresponding sample data, the feature encoding of the corresponding sample data, and the benefit value.
Further, the processor 1001 calls the gradient-lifting tree model building program stored in the memory 1005 to further perform the following operations:
and on the second data side, taking each global optimal splitting node as a splitting node of each regression tree in the gradient lifting tree model, and counting the average benefit value of the splitting nodes corresponding to the same feature encoding.
Based on the hardware operation environment related to the embodiment scheme of the gradient lifting tree model building device based on the federal training, the following embodiments of the gradient lifting tree model building method based on the federal training are provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating an embodiment of a federally trained gradient-lifted tree model construction method according to the present invention. In this embodiment, the method for constructing the gradient lifting tree model based on federal training includes the following steps:
step S10, performing federal training on two aligned training samples by adopting an XGboost algorithm to construct a gradient lifting tree model, wherein the gradient lifting tree model comprises a plurality of regression trees, and one split node of each regression tree corresponds to one feature of the training sample;
the XGboost (eXtreme Gradient Boosting) algorithm is an improvement on the Boosting algorithm based on the GBDT (Gradient Boosting Decision Tree, gradient lifting tree) algorithm, wherein the internal decision tree is a regression tree, the algorithm output is a set of regression trees and comprises a plurality of regression trees, the basic idea of training and learning is to traverse all segmentation methods of all features of a training sample (namely, a node splitting mode), select a segmentation method with the minimum loss, obtain two leaves (namely, splitting nodes to generate new nodes), and then continue traversing until:
(1) If the splitting stopping condition is met, outputting a regression tree;
(2) And if the iteration stopping condition is met, outputting a regression tree set.
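For concreteness, the following is a minimal single-party Python sketch of this traverse-and-split loop on local plaintext data, without the federation and encryption described later; the dict-based tree representation, the helper names, and the default λ and γ values are illustrative assumptions, not prescribed by the patent.

```python
import numpy as np

def split_gain(g, h, left, lam=1.0, gamma=0.0):
    """Gain of cutting the current sample set into left/right children;
    a larger gain means a smaller loss after the split."""
    GL, HL = g[left].sum(), h[left].sum()
    GR, HR = g[~left].sum(), h[~left].sum()
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

def grow_tree(X, g, h, depth, max_depth):
    """Traverse every cut of every feature, keep the best, recurse."""
    leaf = {"leaf": -g.sum() / (h.sum() + 1.0)}   # optimal leaf weight -G/(H+lambda)
    if depth >= max_depth or len(g) < 2:          # stop condition (1): depth threshold
        return leaf
    cands = [(split_gain(g, h, X[:, j] <= t), j, t)
             for j in range(X.shape[1]) for t in np.unique(X[:, j])[:-1]]
    if not cands:                                 # no valid cut: node cannot split further
        return leaf
    gain, j, t = max(cands)                       # segmentation with minimum loss
    if gain <= 0:
        return leaf
    left = X[:, j] <= t
    return {"feat": j, "thr": t, "gain": gain,
            "left": grow_tree(X[left], g[left], h[left], depth + 1, max_depth),
            "right": grow_tree(X[~left], g[~left], h[~left], depth + 1, max_depth)}
```

Repeating `grow_tree` once per boosting round, with refreshed gradients, yields the set of regression trees of stop condition (2).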
In this embodiment, the training samples used by the XGboost algorithm are two independent training samples; that is, each training sample belongs to a different data party. If the two training samples are regarded as one overall training sample, then, because they belong to different data parties, the overall training sample can be regarded as segmented, with each party holding different features of the same samples (vertical partitioning of the samples).
It should also be noted that, because the two training samples belong to different data parties, sample alignment needs to be carried out on the original sample data provided by the two parties in order to realize federal training modeling.
In this embodiment, federal training means that the sample training process is completed by cooperation of two data parties, and the final training results in a regression tree contained in the gradient lifting tree model, where the split nodes of the regression tree correspond to the features of the training samples of both parties.
Step S20, counting average profit values of split nodes corresponding to the same feature in the gradient lifting tree model, and taking the average profit values as scores of the corresponding features;
In the XGboost algorithm, when all segmentation methods of all features of a training sample are traversed, the quality of each segmentation method is evaluated through its benefit value, and the segmentation method with the minimum loss is selected for each split node. Therefore, the benefit value of a split node can be used as the basis for evaluating feature importance: the larger the benefit value of a split node, the smaller the segmentation loss at that node, and the greater the importance of the feature corresponding to the split node.
In this embodiment, since the gradient lifting tree model obtained by training includes multiple regression trees, and different regression trees may use the same feature to perform node segmentation, it is necessary to count average benefit values of split nodes corresponding to the same feature in all regression trees included in the gradient lifting tree model, and use the average benefit values as scores of the corresponding features.
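As an illustrative sketch (not prescribed by the patent), this per-feature averaging can be computed from the (provider, feature encoding, gain) split records described later in this embodiment:

```python
from collections import defaultdict

def average_gains(split_records):
    """Average the gains of all split nodes that share the same feature.
    split_records: (site, feature_code, gain) triples collected from every
    split node of every regression tree in the trained model."""
    sums, counts = defaultdict(float), defaultdict(int)
    for site, feature_code, gain in split_records:
        sums[(site, feature_code)] += gain
        counts[(site, feature_code)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```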
Step S30, feature sorting is performed based on the scores of the features, and sorting results are output for feature selection, wherein if features without corresponding split nodes exist in the training sample, default scores are used for those features.
In this embodiment, the importance of a feature is represented by its score. After the score of each feature is obtained, the features are ranked and the ranking result is output, for example from high to low, so that a feature ranked in front is more important than a feature ranked behind it. Through feature selection, features in the sample that are unrelated to sample prediction or classification can thus be eliminated. For example, suppose student samples include: gender, school grades, attendance rate, and number of times surfing the Internet. If the classification target is "three-good student" versus "non-three-good student" (a Chinese student merit title), the gender feature is obviously irrelevant, or barely relevant, to the target, and can therefore be removed.
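A small sketch of this ranking step, assuming the score dictionary comes from an aggregation like the one above (keys may equally be the (site, feature-code) pairs) and that features never used on a split node fall back to a default score:

```python
def rank_features(scores, all_features=(), default=0.0):
    """Rank features by score, highest first; features that never appear
    on any split node receive the default score."""
    full = dict.fromkeys(all_features, default)
    full.update(scores)
    return sorted(full.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage mirroring the student example in the text:
print(rank_features({"grades": 2.1, "attendance": 0.8},
                    all_features=["gender", "grades", "attendance"]))
# [('grades', 2.1), ('attendance', 0.8), ('gender', 0.0)]
```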
In this embodiment, two aligned training samples are subjected to federal training using an XGboost algorithm to construct a gradient lifting tree model, wherein the gradient lifting tree model is a set of regression trees comprising a plurality of regression trees, and one split node of each regression tree corresponds to one feature of the training sample; by counting the average profit value of the split nodes corresponding to the same feature in the gradient lifting tree model and taking the average profit value as the score of the corresponding feature, scoring of the features of both parties' training sample data is achieved; finally, feature sorting is carried out based on the scores of the features, and sorting results are output for feature selection, wherein the higher the score, the higher the importance of the feature. The embodiment realizes federal training modeling by using training samples of different data parties, and further realizes feature selection of multi-party sample data.
Further, to facilitate description of a specific implementation of the joint training of the present invention, this embodiment is specifically illustrated with two independent training samples.
In this embodiment, the first data party provides a first training sample, where the first training sample attribute includes a sample ID and a portion of sample features; the second data party provides a second training sample, the second training sample attribute including a sample ID, another portion of sample characteristics, and a data tag.
Where a sample characteristic refers to a characteristic that the sample represents or has, such as the sample is a person, the corresponding sample characteristic may be age, gender, income, academic, etc. The data tag is used for classifying a plurality of different samples, and the classification result is obtained by judging according to the characteristics of the samples.
The federal training modeling method has the main significance of realizing the bidirectional privacy protection of the sample data of both parties. Thus, during the federal training process, the first training sample is stored locally on the first data party and the second training sample is stored locally on the second data party, e.g., the data in Table 1 below is provided by the first data party and stored locally on the first data party, and the data in Table 2 below is provided by the second data party and stored locally on the second data party.
TABLE 1
As shown in table 1 above, the first training sample attributes include sample IDs (X1 to X5), age feature, gender feature, and Amount of given credit feature.
TABLE 2

Sample ID | Bill Payment | Education | Label
X1 | 3102 | 2 | 24
X2 | 17250 | 3 | 14
X3 | 14027 | 2 | 16
X4 | 6787 | 1 | 10
X5 | 280 | 1 | 26

As shown in Table 2 above, the second training sample attributes include the sample IDs (X1 to X5), the Bill Payment feature, the Education feature, and the data label (Label).
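For concreteness only, Table 2 can be represented as a pandas DataFrame; the representation is an assumption for illustration, as the patent does not prescribe any data format:

```python
import pandas as pd

# Party B's local view (Table 2); party A separately holds Table 1, and the
# two sides share only the aligned sample IDs X1..X5 (vertical partitioning).
second_party = pd.DataFrame({
    "Sample ID": ["X1", "X2", "X3", "X4", "X5"],
    "Bill Payment": [3102, 17250, 14027, 6787, 280],
    "Education": [2, 3, 2, 1, 1],
    "Label": [24, 14, 16, 10, 26],
})
```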
Further, referring to fig. 3, fig. 3 is a schematic diagram of a refinement process of an embodiment of step S10 in fig. 2. Based on the above embodiment, in this embodiment, the step S10 specifically includes:
Step S101, at the side of the second data side, acquiring a first-order gradient and a second-order gradient of each training sample in a sample set corresponding to the node splitting of the round;
the XGboost algorithm is a machine learning modeling method that requires the use of a classifier (i.e., a classification function) to map sample data to one of a given class, and thus can be applied to data prediction. In learning classification rules with a classifier, a loss function is required to determine the magnitude of a fitting error of machine learning.
In this embodiment, when node splitting is performed each time, a first-order gradient and a second-order gradient of each training sample in the sample set corresponding to the node splitting of the present round are obtained at the second data side.
The gradient lifting tree model needs to be subjected to multiple rounds of federal training, each round of federal training correspondingly generates a regression tree, and the generation of the regression tree needs to be subjected to multiple node splitting.
Therefore, in each round of federal training, the first node splitting uses the initially stored training samples, and each subsequent node splitting uses the training samples of the sample set corresponding to the new node generated by the previous splitting. Within the same round of federal training, every node splitting uses the first-order gradient and second-order gradient computed for the first-round node splitting. The next round of federal training then updates the first-order and second-order gradients used in the previous round based on that round's training result.
The XGboost algorithm supports a self-defined loss function, and uses the self-defined loss function to calculate a first-order partial derivative and a second-order partial derivative of the objective function, so as to correspondingly obtain a first-order gradient and a second-order gradient of sample data to be trained locally.
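As one example of such a self-defined loss (an assumption for illustration; the patent does not fix a particular loss function), the logistic loss yields the per-sample first-order and second-order gradients below:

```python
import numpy as np

def logistic_grad_hess(pred, label):
    """Per-sample first/second-order derivatives of the logistic loss
    with respect to the current raw prediction (margin)."""
    p = 1.0 / (1.0 + np.exp(-pred))    # predicted probability
    g = p - label                      # first-order gradient g_i
    h = p * (1.0 - p)                  # second-order gradient h_i
    return g, h

# First round of federal training: predictions start from a base score
# (here zero, an assumption), fixing the initial g_i, h_i for the first tree.
g, h = logistic_grad_hess(np.zeros(5), np.array([1, 0, 1, 0, 1]))
```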
Based on the description of the XGboost algorithm and the gradient lifting tree model in the above embodiments, split nodes need to be determined for constructing the regression tree, and the split nodes can be determined by the benefit values. The calculation formula of the gain value gain is as follows:

gain = \frac{1}{2}\left[\frac{\left(\sum_{i\in I_L} g_i\right)^2}{\sum_{i\in I_L} h_i+\lambda}+\frac{\left(\sum_{i\in I_R} g_i\right)^2}{\sum_{i\in I_R} h_i+\lambda}-\frac{\left(\sum_{i\in I} g_i\right)^2}{\sum_{i\in I} h_i+\lambda}\right]-\gamma

wherein I_L denotes the sample set contained in the left child node after the current node splits, I_R denotes the sample set contained in the right child node, I = I_L ∪ I_R denotes the sample set of the current node, g_i denotes the first-order gradient of sample i, h_i denotes the second-order gradient of sample i, and λ and γ are constants.
Because the sample data to be trained are held by the first data party and the second data party respectively, the benefit values of the split nodes of each party's sample data under each splitting mode need to be calculated on the first data party's side and on the second data party's side respectively.
In this embodiment, since the samples of the first data party and the second data party are aligned in advance, the two sides hold the same samples and hence share the same per-sample gradients. Moreover, since the data labels exist only in the sample data of the second data party, the benefit values of split nodes under each splitting mode of both parties' sample data are calculated based on the first-order gradient and second-order gradient computed on the second data party's side.
Step S102, if the round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order gradient and the second-order gradient, and then sending the first-order gradient and the second-order gradient to the first data side together with the sample ID of the sample set so as to calculate the benefit value of the split node of the training sample locally corresponding to the sample ID in each splitting mode based on the encrypted first-order gradient and the second-order gradient on the first data side;
in this embodiment, in order to realize bidirectional privacy protection of sample data of both parties in the federal training process, if the node splitting of the present round is first node splitting of constructing a regression tree, the first-order gradient and the second-order gradient of the sample data are obtained by calculation on the second data party side, then encrypted, and then sent to the first data party.
On the first data side, the profit value of the split node of the local sample data of the first data side is calculated based on the first-order gradient and the second-order gradient of the received sample data and the calculation formula of the profit value gain, and the profit value obtained by calculation is also an encryption value because the first-order gradient and the second-order gradient are encrypted, so that encryption of the profit value is not needed.
After the benefit values of split nodes under the various splitting modes of the sample data are calculated, node splitting can be performed to generate new nodes and thus construct the regression tree. This embodiment preferably lets the second data party, which holds the sample data with the data labels, take charge of constructing the regression trees of the gradient lifting tree model. Therefore, the benefit values of split nodes under each splitting mode of the first data party's local sample data, calculated on the first data party's side, need to be sent to the second data party.
Step S103, if the node splitting of the present round is a non-first-round node splitting for constructing the regression tree, a sample ID of the sample set is sent to the first data side, so that the first data side, continuing to use the first-order gradient and the second-order gradient used for the first-round node splitting, calculates the benefit value of split nodes of the training sample locally corresponding to the sample ID in each splitting mode;
in this embodiment, if the present round of node splitting is a non-first round of node splitting for constructing the regression tree, only the sample ID of the sample set corresponding to the present round of node splitting is required to be sent to the first data party, and the first data party continues to use the first-order gradient and the second-order gradient used during the first round of node splitting, so as to calculate the benefit value of splitting the node of the training sample corresponding to the received sample ID locally in each splitting mode.
Step S104, the second data party receives the encryption income values of all the split nodes returned by the first data party and decrypts the encryption income values;
step S105, on the second data side, calculating a benefit value of splitting nodes of the training sample corresponding to the sample ID locally in each splitting mode based on the first-order gradient and the second-order gradient;
on the second data side, based on the first-order gradient and the second-order gradient of the sample data obtained by calculation and the calculation formula of the gain value gain, the gain value of the splitting node of the sample data to be trained locally on the second data side is calculated in each splitting mode.
Step S106, determining the global optimal splitting node of the splitting of the round of nodes based on the profit values of all the splitting nodes calculated by the two parties respectively;
since the initial sample data of the two parties are aligned by the samples, the profit value of all the split nodes calculated by the two parties can be regarded as the profit value of the split nodes in each split mode for the whole data sample of the two parties, and therefore, the split node with the largest profit value is taken as the global optimal split node for splitting the nodes of the round by comparing the sizes of the profit values.
It should be noted that, the sample feature corresponding to the global best split node may belong to the training sample of the first data party or the training sample of the second data party.
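A minimal sketch of this global comparison on the second data party's side; the tuple layout and the decrypt callback are assumptions for illustration:

```python
def global_best_split(local_cands, remote_cands_encrypted, decrypt):
    """Pick the split with the largest gain across both parties.
    local_cands: (feature, threshold, gain) computed by party B;
    remote_cands_encrypted: (feature_code, threshold, enc_gain) from party A."""
    remote = [("Site A", f, t, decrypt(eg)) for f, t, eg in remote_cands_encrypted]
    local = [("Site B", f, t, g) for f, t, g in local_cands]
    return max(remote + local, key=lambda rec: rec[3])
```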
Optionally, since the regression tree construction of the gradient lifting tree model is dominated by the second data party, the related information of the global optimal splitting node determined by each round of node splitting needs to be recorded on the second data party's side. The related information includes: the provider of the corresponding sample data, the feature encoding of the corresponding sample data, and the benefit value.
For example, if data party A holds the feature f_i corresponding to the global optimal split point, the record is (Site A, E_A(f_i), gain). Otherwise, if data party B holds the feature f_i, the record is (Site B, E_B(f_i), gain). Here E_A(f_i) denotes data party A's encoding of the feature f_i, and E_B(f_i) denotes data party B's encoding of the feature f_i; through this encoding, the feature f_i can be marked without revealing its original feature data.
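The patent does not specify how E_A(f_i) or E_B(f_i) is computed; one plausible realization, sketched below under that assumption, is a keyed hash that each party applies to its own feature names:

```python
import hashlib
import hmac

def encode_feature(party_key: bytes, feature_name: str) -> str:
    """E(f_i): a keyed hash that identifies a feature consistently across
    split records without revealing the original feature name."""
    return hmac.new(party_key, feature_name.encode(), hashlib.sha256).hexdigest()

# A recorded global optimal split held by party A, e.g. on its Age feature:
record = ("Site A", encode_feature(b"site-a-secret", "Age"), 0.42)  # (provider, E_A(f_i), gain)
```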
Optionally, when selecting the features in the foregoing embodiment, it is preferable to use each global optimal splitting node as a splitting node of each regression tree in the gradient lifting tree model, and calculate an average benefit value of the splitting nodes corresponding to the same feature code.
Step S107, splitting a sample set corresponding to the current node based on the global optimal splitting node of the current node splitting, and generating a new node to construct a regression tree of the gradient lifting tree model.
If the sample feature corresponding to the global optimal splitting node of the current round of node splitting belongs to the training sample of the first data party, the sample data corresponding to the current node of the current round of splitting belongs to the first data party. Correspondingly, if the sample feature corresponding to the global optimal splitting node of the current round of node splitting belongs to the training sample of the second data party, the sample data corresponding to the current node of the current round of splitting belongs to the second data party.
By node splitting, new nodes (left child node and right child node) can be generated, thereby constructing a regression tree. Through multiple rounds of node splitting, new nodes can be continuously generated, so that a regression tree with deeper tree depth is obtained, and if node splitting is stopped, a regression tree of the gradient lifting tree model can be obtained.
In this embodiment, since the data exchanged in the two parties' computation and communication are encrypted intermediate results of the model, the training process does not leak the original feature data. An encryption algorithm is used throughout the training process to ensure data privacy; preferably, a partially homomorphic encryption algorithm supporting additive homomorphism is used.
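A sketch of the additive homomorphism this relies on, using the third-party python-paillier (phe) package; the package choice and key size are assumptions, as the patent only requires a partially homomorphic scheme:

```python
from phe import paillier  # third-party "python-paillier" package

pub, priv = paillier.generate_paillier_keypair(n_length=2048)

# Party B encrypts per-sample gradients before sending them to party A.
enc_g = [pub.encrypt(gi) for gi in (0.31, -0.12, 0.07)]

# Party A sums ciphertexts for a candidate left partition without ever
# seeing the plaintext gradients (additive homomorphism).
enc_GL = enc_g[0] + enc_g[1]

# Only party B, holding the private key, can recover the aggregate.
print(priv.decrypt(enc_GL))   # ~0.19
```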
Further, in an embodiment, based on the difference of the node splitting conditions, the first-order gradient and the second-order gradient of the training sample for node splitting are obtained specifically by:
1. the round of node splitting corresponds to the construction of a first regression tree
1.1, if the node splitting of the round is the first round node splitting for constructing a first regression tree, initializing a first-order gradient and a second-order gradient of each training sample in a sample set corresponding to the node splitting of the round at a second data side;
1.2, if the node splitting of the round is a non-first-round node splitting for constructing the first regression tree, the first-order gradient and the second-order gradient used by the first-round node splitting continue to be used.
2. The node splitting of the round corresponds to constructing a non-first regression tree
2.1, if the node splitting of the round is the first-round node splitting for constructing a non-first regression tree, the first-order gradient and the second-order gradient are updated according to the previous round of federal training;
2.2, if the node splitting of the round is a non-first-round node splitting for constructing a non-first regression tree, the first-order gradient and the second-order gradient used by the first-round node splitting continue to be used.
Further, in one embodiment, to reduce the complexity of the regression tree, a depth threshold of the regression tree is preset for node splitting limitation.
In this embodiment, when each round of generating new nodes to construct a regression tree of the gradient lifting tree model, determining, on the second data side, whether the depth of the regression tree of the round reaches a preset depth threshold;
if the depth of the regression tree of the current round reaches a preset depth threshold, stopping node splitting, further obtaining a regression tree of the gradient lifting tree model, and if not, continuing the node splitting of the next round.
It should be noted that the condition for limiting node splitting may also be to stop node splitting when a node cannot continue to split, for example when only a single sample corresponds to the current node.
Further, in another embodiment, to avoid overfitting during training, a number threshold of regression trees is preset to limit the number of regression trees generated.
In this embodiment, when node splitting is stopped, on the second data side, it is determined whether the total number of the present round of regression trees reaches a preset number threshold;
if the total number of the regression trees in the round reaches a preset number threshold, stopping the federal training, otherwise, continuing the next round of federal training.
It should be noted that the condition for limiting the number of regression trees to be generated may be to stop constructing the regression tree when the node cannot continue to split.
For a better understanding of the present invention, the federal training and modeling process of the present invention is illustrated below based on the sample data in tables 1 and 2 in the above examples.
First pass federal training: training a first regression tree
(1) First round node splitting
1.1, on the second data side, the first-order gradient (g_i) and the second-order gradient (h_i) of each sample of the data in Table 2 are calculated; g_i and h_i are encrypted and then sent to the first data party;
1.2, on the first data side, the gain values gain of split nodes under all possible splitting modes of the sample data in Table 1 are calculated based on the encrypted g_i and h_i; the gain values are then sent to the second data party;
since the Age feature in table 1 has 5 sample data division modes, the Gender feature has 2 sample data division modes, and the Amount of given credit feature has 5 sample data division modes, the sample data in table 1 has 12 split modes in total, that is, the profit value of the split node corresponding to the 12 split modes needs to be calculated.
1.3, on the second data side, calculating the gain value gain of the split node in all possible splitting modes of the sample data in the table 2;
because the Bill Payment feature in Table 2 has 5 sample data dividing modes and the effect feature has 3 sample data dividing modes, the sample data in Table 2 has 8 splitting modes in total, that is, the profit value of the splitting node corresponding to the 8 dividing modes needs to be calculated.
1.4, selecting a feature corresponding to the maximum profit value from profit values of split nodes corresponding to 12 division modes calculated from a first data side and profit values of split nodes corresponding to 8 division modes calculated from a second data side as a global optimal split node for splitting the node of the round;
and 1.5, splitting the sample data corresponding to the current node based on the global optimal splitting node of the current node splitting, and generating a new node to construct a regression tree of the gradient lifting tree model.
1.6, judging whether the depth of the round of regression tree reaches a preset depth threshold value; if the depth of the regression tree of the current round reaches a preset depth threshold, stopping node splitting, further obtaining a regression tree of the gradient lifting tree model, and if not, continuing the node splitting of the next round;
1.7, judging whether the total number of the round of regression trees reaches a preset number threshold value; if the total number of the regression trees in the round reaches a preset number threshold, stopping the federal training, otherwise, continuing the next round of federal training.
(2) Second, third-round node splitting
2.1, assume the feature condition corresponding to the split node of the previous round is Bill Payment less than or equal to 3102 (corresponding samples X1, X2, X3, X4 and X5). Splitting on it generates two new nodes: the left node corresponds to the sample set (X1, X5) with values less than or equal to 3102, and the right node corresponds to the sample set (X2, X3, X4) with values greater than 3102. The sample sets (X1, X5) and (X2, X3, X4) are then used as the new sample sets for the second- and third-round node splitting, so that the two new nodes are each split in turn to generate further nodes.
2.2 since the second and third rounds of node splitting belong to the same round of federal training, the sample gradient values used for the first round of node splitting continue to be used. Assuming that a split node of the present round corresponds to a feature Amount of given credit which is less than or equal to 200, the feature is taken as a split node (corresponding samples are X1 and X5), two new split nodes are generated, wherein a left node corresponds to a sample X5 which is less than or equal to 200, and a right node corresponds to a sample X1 which is greater than 200; likewise, another split node of the present round corresponds to a feature with Age less than or equal to 35, and that feature is taken as a split node (corresponding samples X2, X3, X4), resulting in two new split nodes, with the left node corresponding to samples X2, X3 less than or equal to 35 and the right node corresponding to sample X4 greater than 35. The specific implementation flow refers to the first round of node splitting process.
Second round of federal training: training a second regression tree
3.1, since this round of node splitting belongs to the next round of federal training, the first-order gradient and the second-order gradient used in the previous round of federal training are first updated with the result of that round; node splitting of the second round of federal training then continues, generating new nodes to construct the next regression tree. For the specific implementation, refer to the construction flow of the previous regression tree.
3.2, as shown in fig. 4, in the above example, after two rounds of federal training the sample data in Tables 1 and 2 generate two regression trees. The first regression tree includes three split nodes: Bill Payment less than or equal to 3102, Amount of given credit less than or equal to 200, and Age less than or equal to 35. The second regression tree includes two split nodes: Bill Payment less than or equal to 6787, and Gender = 1.
3.3, based on the two regression trees of the gradient lifting tree model shown in fig. 4, the average benefit values corresponding to the features of the sample data are: Bill Payment is (gain1 + gain4)/2; Education is 0; Age is gain3; Gender is gain5; Amount of given credit is gain2.
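Reusing the average_gains helper sketched earlier under step S20, this bookkeeping looks as follows; the numeric values are placeholders standing in for gain1 to gain5, and plaintext feature names are used in place of the encodings for readability:

```python
gain1, gain2, gain3, gain4, gain5 = 0.9, 0.8, 0.6, 0.5, 0.4  # placeholder values

records = [("Site B", "Bill Payment", gain1),            # first tree, 1st split
           ("Site A", "Amount of given credit", gain2),  # first tree, 2nd split
           ("Site A", "Age", gain3),                     # first tree, 3rd split
           ("Site B", "Bill Payment", gain4),            # second tree, 1st split
           ("Site A", "Gender", gain5)]                  # second tree, 2nd split
print(average_gains(records))
# {('Site B', 'Bill Payment'): 0.7, ('Site A', 'Amount of given credit'): 0.8,
#  ('Site A', 'Age'): 0.6, ('Site A', 'Gender'): 0.4}
# Education never appears on a split node, so it keeps the default score (0).
```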
The invention also provides a computer readable storage medium.
The computer readable storage medium of the present invention stores thereon a gradient-lifting tree-model building program which, when executed by a processor, implements the steps of the federally-trained gradient-lifting tree-model building method as described in any one of the embodiments above.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM), comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server or a network device, etc.) to perform the method according to the embodiments of the present invention.
While the embodiments of the present invention have been described above with reference to the drawings, the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many modifications may be made thereto by those of ordinary skill in the art without departing from the spirit of the present invention and the scope of the appended claims, which are to be accorded the full scope of the present invention as defined by the following description and drawings, or by any equivalent structures or equivalent flow changes, or by direct or indirect application to other relevant technical fields.

Claims (9)

1. A feature selection method based on federal training, characterized in that the feature selection method based on federal training comprises the steps of:
performing federal training on two aligned training samples by adopting an XGboost algorithm to construct a gradient lifting tree model, wherein the two aligned training samples are a first training sample and a second training sample respectively, the first training sample is provided by a first data party and stored in the first data party, the second training sample is provided by a second data party and stored in the second data party, the gradient lifting tree model comprises a plurality of regression trees, and one split node of each regression tree corresponds to one feature of the training sample;
counting average benefit values of split nodes corresponding to the same feature in the gradient lifting tree model, and taking the average benefit values as scores of the corresponding features;
performing feature sorting based on the scores of the features, and outputting the sorting result for feature selection, wherein if the training sample contains features that do not correspond to any split node, default scores are used for those features;
wherein the performing federal training on the two aligned training samples by adopting an XGboost algorithm to construct a gradient lifting tree model comprises:
on the second data side, acquiring a first-order gradient and a second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order gradients and second-order gradients and sending them, together with the sample IDs of the sample set, to the first data side, so that the first data side calculates, based on the encrypted first-order gradients and second-order gradients, the benefit values of the split nodes, in each splitting mode, of the training samples locally corresponding to the sample IDs;
if the current round of node splitting is not the first round of node splitting for constructing the regression tree, sending the sample IDs of the sample set to the first data side, so that the first data side calculates, using the first-order gradients and second-order gradients received for the first round of node splitting, the benefit values of the split nodes, in each splitting mode, of the training samples locally corresponding to the sample IDs;
receiving, by the second data party, the encrypted benefit values of the split nodes returned by the first data party, and decrypting them;
on the second data side, calculating, based on the first-order gradients and second-order gradients, the benefit values of the split nodes, in each splitting mode, of the training samples locally corresponding to the sample IDs;
determining, based on the benefit values of all the split nodes calculated by the two parties, the global optimal split node for the current round of node splitting;
splitting the sample set corresponding to the current node based on the global optimal split node of the current round, and generating new nodes to construct a regression tree of the gradient lifting tree model.
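By way of illustration only, one round of the node-splitting interaction recited in claim 1 can be sketched as below. All names are hypothetical; encrypt/decrypt are identity stubs standing in for an additively homomorphic cipher (with a real scheme, the first data party would return encrypted gradient aggregates and leave the division inside the gain formula to the second data party after decryption), and the regularisation constant is an assumed hyperparameter. The complexity penalty of the full XGboost gain is omitted.

    from typing import Dict, List, Tuple

    LAMBDA = 1.0  # assumed L2 regularisation on leaf weights

    def xgb_gain(gl: float, hl: float, gr: float, hr: float) -> float:
        """Standard XGboost split gain from aggregated first/second-order gradients."""
        def score(g, h):
            return g * g / (h + LAMBDA)
        return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr))

    def candidate_splits(features: Dict[str, Dict[int, float]],
                         g: Dict[int, float], h: Dict[int, float],
                         ids: List[int]) -> List[Tuple[float, str, float]]:
        """Benefit value of every (feature, threshold) splitting mode over `ids`."""
        out = []
        for name, values in features.items():
            for thr in sorted({values[i] for i in ids}):
                left = [i for i in ids if values[i] <= thr]
                right = [i for i in ids if values[i] > thr]
                if left and right:
                    gl, hl = sum(g[i] for i in left), sum(h[i] for i in left)
                    gr, hr = sum(g[i] for i in right), sum(h[i] for i in right)
                    out.append((xgb_gain(gl, hl, gr, hr), name, thr))
        return out

    encrypt = decrypt = lambda x: x  # identity stubs for a homomorphic cipher

    def split_round(ids, g, h, feats_first, feats_second):
        # Second data side: encrypt g and h, ship them with the sample IDs.
        enc_g = {i: encrypt(g[i]) for i in ids}
        enc_h = {i: encrypt(h[i]) for i in ids}
        # First data side: benefit values over its local features.
        enc_cands = candidate_splits(feats_first, enc_g, enc_h, ids)
        # Second data side: decrypt the returned values, add its own local
        # candidates, and pick the global optimal split node for this round.
        cands = [(decrypt(gain), f, t) for gain, f, t in enc_cands]
        cands += candidate_splits(feats_second, g, h, ids)
        return max(cands, key=lambda c: c[0])

Under this shape of the protocol, only the second data party (the label holder) ever sees plaintext gradients, which is the privacy property the claim is directed at.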
2. The feature selection method based on federal training according to claim 1, wherein the attributes of the first training sample comprise a sample ID and a portion of the sample features, and the attributes of the second training sample comprise a sample ID, another portion of the sample features, and a data label.
3. The feature selection method based on federal training according to claim 1, wherein before the step of acquiring, on the second data side, the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting, the method further comprises:
when performing node splitting, judging whether the current round of node splitting corresponds to constructing the first regression tree;
if the current round of node splitting corresponds to constructing the first regression tree, judging whether it is the first round of node splitting for constructing the first regression tree;
if the current round of node splitting is the first round of node splitting for constructing the first regression tree, initializing, on the second data side, the first-order gradient and the second-order gradient of each training sample in the corresponding sample set; if it is not the first round of node splitting for constructing the first regression tree, continuing to use the first-order gradients and second-order gradients used by the first round of node splitting;
if the current round of node splitting corresponds to constructing a non-first regression tree, judging whether it is the first round of node splitting for constructing the non-first regression tree;
if the current round of node splitting is the first round of node splitting for constructing the non-first regression tree, updating the first-order gradients and second-order gradients according to the result of the previous round of federal training; if it is not the first round of node splitting for constructing the non-first regression tree, continuing to use the first-order gradients and second-order gradients used by the first round of node splitting.
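A minimal sketch of the branching recited in claim 3, assuming logistic-loss gradients and 0-based round counters kept on the second data side; TrainingState and the helper names are illustrative, not taken from the patent.

    import math
    from dataclasses import dataclass, field
    from typing import List

    def init_gradients(labels):
        """Logistic-loss gradients around an initial raw score of 0 (p = 0.5)."""
        return [0.5 - y for y in labels], [0.25] * len(labels)

    def refresh_gradients(labels, raw_pred):
        """Recompute g, h from the model output after the previous round of
        federal training (all regression trees built so far)."""
        p = [1.0 / (1.0 + math.exp(-s)) for s in raw_pred]
        return [pi - yi for pi, yi in zip(p, labels)], [pi * (1 - pi) for pi in p]

    @dataclass
    class TrainingState:
        labels: List[int]
        raw_pred: List[float]
        tree_index: int = 0    # which regression tree is being built (0 = first)
        split_round: int = 0   # which round of node splitting within that tree
        g: List[float] = field(default_factory=list)
        h: List[float] = field(default_factory=list)

    def gradients_for_round(state: TrainingState):
        if state.split_round == 0:
            if state.tree_index == 0:
                # first round of splitting for the first regression tree
                state.g, state.h = init_gradients(state.labels)
            else:
                # first round of splitting for a non-first regression tree
                state.g, state.h = refresh_gradients(state.labels, state.raw_pred)
        # non-first rounds reuse the gradients of the first round of this tree
        return state.g, state.h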
4. The feature selection method based on federal training according to claim 1, wherein the method further comprises:
when generating new nodes to construct a regression tree of the gradient lifting tree model, judging, on the second data side, whether the depth of the regression tree of the current round reaches a preset depth threshold;
if the depth of the regression tree of the current round reaches the preset depth threshold, stopping node splitting to obtain one regression tree of the gradient lifting tree model; otherwise, continuing with the next round of node splitting.
5. The feature selection method based on federal training according to claim 4, wherein the method further comprises:
when node splitting is stopped, judging, on the second data side, whether the total number of regression trees obtained so far reaches a preset number threshold;
if the total number of regression trees reaches the preset number threshold, stopping the federal training; otherwise, continuing with the next round of federal training.
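The two stopping checks of claims 4 and 5 reduce to threshold comparisons performed on the second data side; in the sketch below, the preset depth threshold and preset number threshold are assumed values.

    MAX_DEPTH = 4   # preset depth threshold of claim 4 (assumed value)
    MAX_TREES = 10  # preset number threshold of claim 5 (assumed value)

    def tree_finished(depth: int) -> bool:
        """Stop node splitting once the current regression tree is deep enough."""
        return depth >= MAX_DEPTH

    def training_finished(num_trees: int) -> bool:
        """Stop the federal training once enough regression trees are built."""
        return num_trees >= MAX_TREES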
6. The feature selection method based on federal training according to any one of claims 1-5, further comprising:
on the second data side, recording the related information of the global optimal split node determined in each round of node splitting;
wherein the related information includes: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the benefit value.
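One possible record structure for the related information recited in claim 6; the field names and example values are illustrative, not taken from the patent.

    from dataclasses import dataclass

    @dataclass
    class SplitRecord:
        provider: str         # which data party supplied the corresponding sample data
        feature_code: str     # encoding of the corresponding sample feature
        benefit_value: float  # benefit value of the global optimal split node

    history: list = []
    history.append(SplitRecord("second data party", "f3", 0.42))

Averaging benefit_value over records sharing the same feature_code then proceeds exactly as in the scoring sketch after step 3.3.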
7. The feature selection method based on federal training according to claim 6, wherein the counting of average benefit values of split nodes corresponding to the same feature in the gradient lifting tree model comprises:
on the second data side, taking each global optimal split node as a split node of a regression tree in the gradient lifting tree model, and counting the average benefit value of the split nodes corresponding to the same feature code.
8. A feature selection device based on federal training, comprising a memory, a processor, and a feature selection program stored on the memory and executable on the processor, wherein the feature selection program, when executed by the processor, implements the steps of the feature selection method based on federal training according to any one of claims 1-7.
9. A computer readable storage medium having stored thereon a feature selection program which, when executed by a processor, implements the steps of the feature selection method based on federal training according to any one of claims 1-7.
CN201810918867.3A 2018-08-10 2018-08-10 Gradient lifting tree model construction method and device based on federal training and storage medium Active CN109034398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810918867.3A CN109034398B (en) 2018-08-10 2018-08-10 Gradient lifting tree model construction method and device based on federal training and storage medium

Publications (2)

Publication Number Publication Date
CN109034398A CN109034398A (en) 2018-12-18
CN109034398B true CN109034398B (en) 2023-09-12

Family

ID=64633061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810918867.3A Active CN109034398B (en) 2018-08-10 2018-08-10 Gradient lifting tree model construction method and device based on federal training and storage medium

Country Status (1)

Country Link
CN (1) CN109034398B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711556B (en) * 2018-12-24 2020-11-03 中国南方电网有限责任公司 Machine patrol data processing method and device, network-level server and provincial-level server
CN109492420B (en) 2018-12-28 2021-07-20 深圳前海微众银行股份有限公司 Model parameter training method, terminal, system and medium based on federal learning
CN109934179B (en) * 2019-03-18 2022-08-02 中南大学 Human body action recognition method based on automatic feature selection and integrated learning algorithm
CN110427969B (en) * 2019-07-01 2020-11-27 创新先进技术有限公司 Data processing method and device and electronic equipment
CN110297848B (en) * 2019-07-09 2024-02-23 深圳前海微众银行股份有限公司 Recommendation model training method, terminal and storage medium based on federal learning
CN110795603B (en) * 2019-10-29 2021-02-19 支付宝(杭州)信息技术有限公司 Prediction method and device based on tree model
CN110851786B (en) * 2019-11-14 2023-06-06 深圳前海微众银行股份有限公司 Inter-enterprise data interaction method, device, equipment and storage medium based on longitudinal federal learning
CN110990829B (en) * 2019-11-21 2021-09-28 支付宝(杭州)信息技术有限公司 Method, device and equipment for training GBDT model in trusted execution environment
CN111079939B (en) * 2019-11-28 2021-04-20 支付宝(杭州)信息技术有限公司 Machine learning model feature screening method and device based on data privacy protection
CN110941963A (en) * 2019-11-29 2020-03-31 福州大学 Text attribute viewpoint abstract generation method and system based on sentence emotion attributes
CN111178538B (en) * 2019-12-17 2023-08-15 杭州睿信数据科技有限公司 Federal learning method and device for vertical data
CN111178408B (en) * 2019-12-19 2023-06-20 中国科学院计算技术研究所 Health monitoring model construction method and system based on federal random forest learning
CN110968886B (en) * 2019-12-20 2022-12-02 支付宝(杭州)信息技术有限公司 Method and system for screening training samples of machine learning model
CN111368901A (en) * 2020-02-28 2020-07-03 深圳前海微众银行股份有限公司 Multi-party combined modeling method, device and medium based on federal learning
CN111340614B (en) * 2020-02-28 2021-05-18 深圳前海微众银行股份有限公司 Sample sampling method and device based on federal learning and readable storage medium
CN111507479B (en) * 2020-04-15 2021-08-10 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and computer-readable storage medium
CN111523686B (en) * 2020-04-23 2021-08-03 支付宝(杭州)信息技术有限公司 Method and system for model joint training
CN111291417B (en) * 2020-05-09 2020-08-28 支付宝(杭州)信息技术有限公司 Method and device for protecting data privacy of multi-party combined training object recommendation model
CN111738359B (en) * 2020-07-24 2020-11-27 支付宝(杭州)信息技术有限公司 Two-party decision tree training method and system
CN113435537B (en) * 2021-07-16 2022-08-26 同盾控股有限公司 Cross-feature federated learning method and prediction method based on Soft GBDT
CN113723477B (en) * 2021-08-16 2024-04-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN113722987B (en) * 2021-08-16 2023-11-03 京东科技控股股份有限公司 Training method and device of federal learning model, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089587A1 (en) * 2016-09-26 2018-03-29 Google Inc. Systems and Methods for Communication Efficient Distributed Mean Estimation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021984A (en) * 2016-11-01 2018-05-11 第四范式(北京)技术有限公司 Determine the method and system of the feature importance of machine learning sample
CN107704966A (en) * 2017-10-17 2018-02-16 华南理工大学 A kind of Energy Load forecasting system and method based on weather big data
CN107767183A (en) * 2017-10-31 2018-03-06 常州大学 Brand loyalty method of testing based on combination learning and profile point
CN107993139A (en) * 2017-11-15 2018-05-04 华融融通(北京)科技有限公司 A kind of anti-fake system of consumer finance based on dynamic regulation database and method
CN108257105A (en) * 2018-01-29 2018-07-06 南华大学 A kind of light stream estimation for video image and denoising combination learning depth network model
CN108375808A (en) * 2018-03-12 2018-08-07 南京恩瑞特实业有限公司 Dense fog forecasting procedures of the NRIET based on machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
H. Brendan McMahan et al., "Communication-efficient learning of deep networks from decentralized data", Artificial Intelligence and Statistics, Vol. 54, 2017, Sections 1-4 *


Similar Documents

Publication Publication Date Title
CN109034398B (en) Gradient lifting tree model construction method and device based on federal training and storage medium
CN109165683B (en) Sample prediction method, device and storage medium based on federal training
CN109299728B (en) Sample joint prediction method, system and medium based on construction of gradient tree model
JP7095140B2 (en) Multi-model training methods and equipment based on feature extraction, electronic devices and media
EP3001332A1 (en) Target user determination method, device and network server
CN110309923B (en) Transverse federal learning method, device, equipment and computer storage medium
US20200294111A1 (en) Determining target user group
US8572019B2 (en) Reducing the dissimilarity between a first multivariate data set and a second multivariate data set
CN108763314A (en) A kind of interest recommends method, apparatus, server and storage medium
Verbraken et al. Profit optimizing customer churn prediction with Bayesian network classifiers
US9881345B2 (en) Evaluating an impact of a user's content utilized in a social network
CN103678672A (en) Method for recommending information
US9418119B2 (en) Method and system to determine a category score of a social network member
CN109446171B (en) Data processing method and device
CN111506823A (en) Information recommendation method and device and computer equipment
Celbiş A machine learning approach to rural entrepreneurship
CN111951052A (en) Method and device for acquiring potential customers based on knowledge graph
CN108932646B (en) User tag verification method and device based on operator and electronic equipment
US11468521B2 (en) Social media account filtering method and apparatus
US20190180193A1 (en) Accurate and interpretable rules for user segmentation
CN113535991A (en) Multimedia resource recommendation method and device, electronic equipment and storage medium
CN111353554B (en) Method and device for predicting missing user service attributes
CN109451334B (en) User portrait generation processing method and device and electronic equipment
CN112817563A (en) Target attribute configuration information determination method, computer device, and storage medium
CN110489660B (en) User economic condition portrait method of social media public data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant